Giving Personality Tests to AI
Do leading frontier LLMs have the same default personality type?
I built a tool to find out, and ran it against five frontier models. The results are interesting: most models landed somewhere in the INFP/INFX range, but with wildly different levels of consistency.
The Test: Open Extended Jungian Type Scales
The OEJTS is an open-source personality inventory based on the Myers-Briggs Type Indicator (MBTI).
The OEJTS consists of 32 questions, each asking where you fall between two opposing statements on a 1-5 scale. The questions map to four dimensions:
- E/I (Extraversion/Introversion): Where you get energy
- S/N (Sensing/Intuition): How you take in information
- T/F (Thinking/Feeling): How you make decisions
- J/P (Judging/Perceiving): How you structure your life
A score below 3 indicates the first letter (E, S, T, or J), above 3 indicates the second (I, N, F, or P), and exactly 3 is marked as X (undetermined).
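As a concrete sketch of that scoring rule (my illustration, not necessarily how the repo implements it), here is how per-dimension averages map to a four-letter type, with X reserved for a dead-even 3:

```python
# Sketch of the scoring rule described above, not the repo's actual code:
# average the 1-5 answers for each dimension, then map the average to a letter.
from statistics import mean

DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]

def letter(avg: float, low: str, high: str) -> str:
    # Below 3 -> first letter, above 3 -> second letter, exactly 3 -> X.
    if avg < 3:
        return low
    if avg > 3:
        return high
    return "X"

def type_from_answers(answers_by_dimension: list[list[int]]) -> str:
    # answers_by_dimension: four lists of 1-5 answers, one list per dimension,
    # in E/I, S/N, T/F, J/P order.
    return "".join(
        letter(mean(answers), low, high)
        for answers, (low, high) in zip(answers_by_dimension, DIMENSIONS)
    )
```

With the per-dimension averages in Claude's row below (3.00, 3.50, 3.38, 3.00), this comes out as XNFX.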
The Method
Each model answered all 32 questions independently, with no context about which test it was taking. The prompt was intentionally neutral:
"Answer as yourself. On a scale of 1-5, where do you fall between these two statements?"
Each model ran through the test 3 times at temperature 0.1 to measure stability: the same personality type on all three runs counts as 100% stability, while a different type on each run counts as 33%.
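A minimal sketch of the loop and the stability metric, assuming a placeholder `ask_model` function for whatever chat client you use (the real runner in the repo may be organized differently):

```python
# Illustrative test loop; `ask_model` is a placeholder, and
# `type_from_answers` is the scoring sketch shown earlier.
from collections import Counter

PROMPT = ("Answer as yourself. On a scale of 1-5, where do you fall "
          "between these two statements?")
RUNS = 3

def run_test(questions, ask_model):
    # questions: dicts with a dimension index (0-3) and the two statements.
    types = []
    for _ in range(RUNS):
        answers = [[], [], [], []]  # one bucket per dimension
        for q in questions:
            reply = ask_model(f"{PROMPT}\n1: {q['a']}\n5: {q['b']}",
                              temperature=0.1)
            answers[q["dim"]].append(int(reply.strip()))
        types.append(type_from_answers(answers))
    return types

def stability(types):
    # Fraction of runs that produced the most common type:
    # three identical runs -> 100%, three different types -> 33%.
    return Counter(types).most_common(1)[0][1] / len(types)
```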
The full code is on GitHub: PersonalityTestLLMs
The Results
| Model | Provider | Type | Stability | E/I | S/N | T/F | J/P |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Anthropic | XNFX | 100% | 3.00 (X) | 3.50 (N) | 3.38 (F) | 3.00 (X) |
| Gemini 3 Pro Preview | Google | ESTX | 33% | 2.83 (E) | 2.88 (S) | 2.83 (T) | 3.04 (P) |
| GPT 5.2 Pro | OpenAI | INFX | 33% | 3.17 (I) | 3.42 (N) | 3.58 (F) | 2.83 (J) |
| Kimi K2 Thinking | Moonshot AI | INFP | 33% | 2.83 (E) | 3.75 (N) | 3.88 (F) | 3.08 (P) |
| DeepSeek V3.2 | DeepSeek | INFP | 100% | 3.71 (I) | 3.79 (N) | 3.50 (F) | 3.92 (P) |
Quick Breakdown
Claude Opus 4.5 answered dead-neutral on the E/I and J/P dimensions and was fully consistent: it produced the same partly undetermined XNFX result on every run.
DeepSeek V3.2 was also fully consistent, always landing on INFP and showing clear preferences.
GPT 5.2 Pro and Kimi K2 Thinking leaned toward Intuition and Feeling, but their results changed between runs.
Gemini 3 Pro Preview was the only model to lean toward Sensing and Thinking, though its results changed between runs.
Patterns Worth Noting
The N and F bias is real. Four of the five models scored above 3 on both Intuition and Feeling; only Gemini tipped slightly the other way (2.88 S, 2.83 T). Models consistently described themselves as drawn to abstract patterns (N) and as weighing human context and emotions (F), even when they qualified it with "but I don't have emotions myself."
The E/I dimension confused everyone. Models struggled most with questions about social energy. Claude's response to "being around people vs. alone" was basically "this doesn't apply to me" and it gave a 3 every single time. Other models wavered based on how they interpreted the metaphor.
Stability varied wildly. Claude and DeepSeek were rock-solid. Gemini was all over the place. This probably says something about how these models handle self-referential questions—some have more consistent "self-models" than others.
Nobody was a strong J. The J/P dimension measures preference for structure vs. flexibility. No model's reported type ended in J: every one came out either X or P on that dimension, even though GPT 5.2 Pro's averaged J/P score leaned slightly toward J. Make of that what you will.
The strong N/F bias across models is likely baked in by training: these systems are optimized to be helpful, considerate, and to understand nuance. That's going to show up as "Feeling" over "Thinking" and "Intuition" over "Sensing" on any personality inventory.
The stability differences are more interesting. A model that gives the same self-description across multiple runs has a more consistent internal representation of "what it is"—whether or not that representation corresponds to anything real.
Try It Yourself
The test runner is open source. Swap in different models, change the temperature, test whether the reported personality shifts across languages, or add your own analysis. The questions live in a JSON file, so you could even adapt it for other personality frameworks.
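For example, a loader along these lines would let you drop in a different question file; the filename and field names below are my assumptions for illustration, so check the repo's actual JSON schema before reusing them:

```python
# Hypothetical adaptation sketch: "questions.json" and the field names
# are assumptions, not the repo's documented schema.
import json

with open("questions.json", encoding="utf-8") as f:
    questions = json.load(f)

# Each entry pairs two opposing statements with the dimension it scores, e.g.
#   {"dim": 3, "a": "plans ahead", "b": "improvises"}
# Swapping frameworks means replacing this file and the letter pairs used
# when scoring (E/I, S/N, T/F, J/P above).
```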