Guozhen AI
Global AI field notes and model intelligence

AI Model Benchmark Hub

Understand model leaderboards without blindly chasing first place

This hub explains how major AI model evaluation sites differ. Arena reflects human preference, Artificial Analysis helps compare capability, speed, and price, Vals AI focuses on industry tasks, and HELM emphasizes transparency and reproducibility.

Guozhen AI Composite Ranking v0.1

Weighted composite ranking

This is Guozhen AI's original synthesis layer. It normalizes Arena multi-domain preference, Vals real-task evidence, Artificial Analysis production signals, and HELM-style transparency signals into a 0-100 weighted score.
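The page does not say which normalization it applies to each source. One common option is min-max scaling, sketched here in Python. The function name and the sample ratings are hypothetical and are not real leaderboard data.

```python
def min_max_normalize(raw_scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores linearly so the lowest maps to 0 and the highest to 100."""
    low = min(raw_scores.values())
    high = max(raw_scores.values())
    if high == low:
        # Every model ties: avoid dividing by zero and place all of them mid-scale.
        return {name: 50.0 for name in raw_scores}
    return {
        name: 100.0 * (score - low) / (high - low)
        for name, score in raw_scores.items()
    }


# Hypothetical raw ratings on an Elo-like scale (illustrative only).
raw = {"model-a": 1500.0, "model-b": 1450.0, "model-c": 1400.0}
normalized = min_max_normalize(raw)
print(normalized)
```

Min-max scaling always sends the weakest listed model to 0, so a leaderboard whose lowest published score sits well above 0 is presumably using a different mapping, such as a fixed reference range.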

Auto snapshot: 2026-06-01
Refreshes every 3 days; next refresh around 2026-06-05

The ranking combines public benchmark signals from LMArena Text, WebDev, Vision, and Document, then uses Vals, Artificial Analysis, and HELM-style methodology for editorial calibration. If an external source is temporarily unavailable, the page keeps a stable composite ranking without exposing fetch diagnostics to readers.

40%

Arena multi-domain preference

Combines Text, WebDev, Vision, Document, and related preference signals from real users.

25%

Vals and real tasks

Uses coding, terminal, industry, and agentic task evidence to avoid chat-only evaluation.

25%

Artificial Analysis

Adds production signals such as intelligence, speed, latency, and price.

10%

HELM and transparent evals

Rewards reproducibility, robustness, multi-metric reporting, and research transparency.
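As a rough illustration, the four weights above can be applied as a plain weighted sum. The sketch assumes each sub-score is already on a 0-100 scale. The dictionary keys and function name are hypothetical, and the sample values are an assumed reading of the first row of the ranking table (Arena 99, Tasks 94, Efficiency 90, Transparency 87).

```python
# Published weights: Arena 40%, Vals and real tasks 25%,
# Artificial Analysis 25%, HELM and transparent evals 10%.
WEIGHTS = {"arena": 0.40, "tasks": 0.25, "efficiency": 0.25, "transparency": 0.10}


def composite_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of 0-100 sub-scores using the weights above."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[key] * sub_scores[key] for key in WEIGHTS)


# Assumed reading of the top row's sub-scores (not confirmed by the page).
top_row = {"arena": 99, "tasks": 94, "efficiency": 90, "transparency": 87}
print(round(composite_score(top_row), 1))  # prints 94.3
```

The plain weighted sum gives 94.3, while the table lists 94.8 for that row. The gap may come from the editorial calibration the page mentions or from sub-scores that were rounded for display, so treat this as an approximation of the published method rather than a reproduction of it.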

Rank | Model | Composite | Arena | Tasks | Efficiency | Transparency | Best for
1
claude-opus-4-7-thinking
Anthropic
94.8
Arena 99, Tasks 94, Efficiency 90, Transparency 87
Complex reasoning, long documents, engineering agents, WebDev

Strong across Arena-style preference and real engineering tasks, making it the strongest composite choice in this snapshot.

2
claude-opus-4-6-thinking
Anthropic
93.6
Arena 98, Tasks 92, Efficiency 89, Transparency 87
Document understanding, deep writing, reasoning-heavy work

Very stable across Text, Vision, and Document preference signals, slightly behind the newer thinking model.

3
gemini-3.1-pro-preview
Google
91.7
Arena 91, Tasks 96, Efficiency 91, Transparency 84
Coding, long context, multimodal, search-augmented tasks

Strong Vals coding and long-context signals lift its composite score beyond pure chat preference ranking.

4
gpt-5.5-high
OpenAI
90.9
Arena 88, Tasks 95, Efficiency 96, Transparency 83
General intelligence, code repair, production API selection

Strong on SWE-style tasks and general intelligence signals, with favorable production trade-offs.

5
claude-opus-4-7
Anthropic
89.4
Arena 96, Tasks 88, Efficiency 86, Transparency 86
Writing, chat, documents, lighter agent workflows

Still very strong in Arena and WebDev, but slightly less reliable than the thinking variant for complex tasks.

6
claude-opus-4-6
Anthropic
88.8
Arena 95, Tasks 87, Efficiency 86, Transparency 86
Text creation, visual understanding, document analysis

A stable all-round model for high-quality content and complex material analysis.

7
gemini-3-pro
Google
88.3
Arena 90, Tasks 89, Efficiency 91, Transparency 84
Vision, multimodal, long context

Strong vision and multimodal performance keep it high among Google models.

8
gpt-5.4-high
OpenAI
84.1
Arena 87, Tasks 88, Efficiency 85, Transparency 82
Competitive coding, stable API use, general assistant tasks

Still strong in selected academic and coding tasks, but trails GPT-5.5 and newer Claude models overall.

9
qwen3.7-max-20260517
Alibaba
83.7
Arena 86, Tasks 83, Efficiency 86, Transparency 79
Chinese tasks, WebDev, cost-sensitive API use

Notable WebDev performance, with extra value for Chinese-language and cost-aware use cases.

10
gemini-3.5-flash
Google
82.6
Arena 84, Tasks 81, Efficiency 93, Transparency 80
Low latency, multimodal, high-throughput workloads

Not the strongest intelligence model, but speed and cost make it useful at production scale.

11
claude-sonnet-4-6
Anthropic
80.8
Arena 82, Tasks 81, Efficiency 82, Transparency 84
Daily writing, code explanation, cost-controlled tasks

Below the Opus tier, but balanced for quality and cost.

12
glm-5.1
Zhipu AI
79.2
Arena 82, Tasks 78, Efficiency 82, Transparency 75
Chinese Q&A, domestic ecosystem, evaluation for enterprise private deployment

Good WebDev signal, worth further testing for Chinese and domestic ecosystem scenarios.

13
kimi-k2.6
Moonshot AI
78.4
Arena 81, Tasks 77, Efficiency 82, Transparency 74
Chinese long documents, knowledge organization, cost-aware workflows

Interesting for Chinese long-document work, though cross-source coverage is less complete than top labs.

14
muse-spark
Meta
77.1
Arena 85, Tasks 73, Efficiency 78, Transparency 76
General chat and open-ecosystem tracking

Strong Text preference signal, but weaker cross-source task and production coverage lowers the composite rank.

15
deepseek-r1-202605
DeepSeek
76.4
Arena 78, Tasks 79, Efficiency 83, Transparency 72
Chinese reasoning, math, cost-sensitive API use

Good reasoning and cost signals, worth testing for Chinese technical Q&A and budget-sensitive work.

16
deepseek-v3.1
DeepSeek
75.8
Arena 77, Tasks 76, Efficiency 86, Transparency 72
General Chinese tasks, batch processing, tool use

Efficient and cost-aware, useful as a candidate for batch workflows.

17
llama-4-maverick
Meta
74.9
Arena 75, Tasks 74, Efficiency 78, Transparency 88
Open ecosystem, local deployment, research reproducibility

Strong openness and transparency, though top task capability trails frontier closed models.

18
qwen3.7-plus
Alibaba
74.2
Arena 76, Tasks 73, Efficiency 84, Transparency 76
Chinese apps, low-cost production, domestic ecosystem

Good Chinese ecosystem and cost profile, useful as an enterprise fallback.

19
grok-4
xAI
73.6
Arena 76, Tasks 72, Efficiency 77, Transparency 70
Fresh information, creative Q&A, social context

Interesting for freshness and creative Q&A, with less cross-source coverage than top labs.

20
mistral-large-2
Mistral
72.8
Arena 73, Tasks 72, Efficiency 80, Transparency 79
EU compliance, open ecosystem, multilingual tasks

Useful for multilingual and compliance-sensitive work, though not a top composite performer.

Trusted Sources

How to read major model benchmark sites

Arena / LMArena

Human preference

Uses anonymous pairwise voting from real users. It is useful for general chat, writing, and preference-driven quality, but a single score should not be treated as the best choice for every workflow.

General chat, Writing quality, Multimodal preference, Frontier tracking

Limitation: Preference data can be affected by sampling, traffic allocation, prompt mix, and model exposure.
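To show how pairwise votes can turn into a rating, here is a simplified online Elo update in Python. It is a teaching sketch, not LMArena's actual pipeline. The starting rating of 1000, the K-factor of 32, the model names, and the votes are all assumptions chosen for illustration.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    ratings: dict[str, float], winner: str, loser: str, k: float = 32.0
) -> None:
    """Apply one pairwise vote in place: the winner gains what the loser gives up."""
    expected = expected_win_probability(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical votes as (preferred model, other model) pairs.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
for winner, loser in votes:
    update_ratings(ratings, winner, loser)
print(ratings)
```

Because each update depends on which pairs are shown and in what order, the final ratings shift with sampling and traffic allocation, which is the limitation noted above.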

Artificial Analysis

Capability, speed, and cost

Tracks intelligence, throughput, latency, and pricing, making it useful for API selection, cost control, and production trade-off analysis.

API selection, Cost comparison, Latency and speed, General capability

Limitation: Composite scores cannot represent every private workflow; teams still need task-specific evaluation.
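For a task-specific check, price and speed figures can be turned into a per-request estimate. The Python sketch below assumes prices are quoted per million tokens and that response time is roughly time to first token plus generation time. The class, field names, and numbers are hypothetical and are not taken from Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiProfile:
    """Hypothetical pricing and speed figures for one model endpoint."""

    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens
    time_to_first_token_s: float  # seconds before the first token arrives
    output_tokens_per_s: float    # sustained generation throughput


def request_cost_usd(profile: ApiProfile, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request in USD."""
    total = (
        input_tokens * profile.input_price_per_mtok
        + output_tokens * profile.output_price_per_mtok
    )
    return total / 1_000_000


def response_time_s(profile: ApiProfile, output_tokens: int) -> float:
    """Estimated wall-clock time until the full response has streamed."""
    return profile.time_to_first_token_s + output_tokens / profile.output_tokens_per_s


# Illustrative numbers only.
profile = ApiProfile(
    input_price_per_mtok=2.0,
    output_price_per_mtok=8.0,
    time_to_first_token_s=0.5,
    output_tokens_per_s=100.0,
)
cost = request_cost_usd(profile, input_tokens=10_000, output_tokens=500)
latency = response_time_s(profile, output_tokens=500)
print(cost, latency)
```

Running the same estimate with your own typical prompt and response lengths shows whether a cheaper or faster model actually changes the bill for your workload.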

Vals AI

Industry task evaluation

Focuses on high-value industry tasks such as finance, law, healthcare, coding, and education, with attention to documents, long context, and agentic workflows.

Finance and law, Industry documents, Long context, Agent workflows

Limitation: Some datasets and judging details are private, so it is best used as an industry signal rather than a fully reproducible experiment.

Stanford HELM

Transparent reproducible evaluation

Emphasizes transparent scenarios, metrics, and reproducible evaluation, which helps research-minded readers inspect model capability and robustness.

Research reproducibility, Capability breakdowns, Evaluation methods, Multi-metric analysis

Limitation: Updates may be slower than commercial leaderboards, so the newest models can lag behind.

Guozhen AI Scorecard

A practical synthesis framework

30%

General intelligence

Compare reasoning, science, math, knowledge, and instruction following instead of trusting one top-ranked model.

25%

Real tasks

Prefer evidence from documents, codebases, tool use, multi-turn workflows, and long context over exam-only scores.

20%

Reliability

Check hallucination risk, format consistency, and whether the model stays coherent across long tasks.

15%

Cost and speed

For similar quality, compare input and output price, latency, throughput, and context window.

10%

Openness and control

Separate closed APIs, open weights, local deployment, compliance, and auditability.
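The reliability item in the scorecard above mentions format consistency. One way to measure it, assuming the model is asked to reply with a JSON object, is the share of responses that parse and contain the expected keys. The function name, required keys, and sample outputs below are hypothetical.

```python
import json


def format_consistency_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that parse as a JSON object with every required key."""
    if not outputs:
        raise ValueError("need at least one output")
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # Not valid JSON at all.
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            valid += 1
    return valid / len(outputs)


# Hypothetical model responses to the same structured-output prompt.
samples = [
    '{"answer": "42", "sources": []}',
    '{"answer": "yes"}',
    "Sure! Here is the JSON you asked for.",
    '{"answer": "no", "sources": ["doc-1"]}',
]
rate = format_consistency_rate(samples, {"answer", "sources"})
print(rate)  # prints 0.5
```

A rate like this covers only the format part of reliability. Hallucination risk and long-task coherence need separate checks.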

Model Selection

Choose models by real scenario

Writing, Q&A, and knowledge organization

Start with Arena-style preference data, then check Artificial Analysis for speed and cost.

Coding, debugging, and engineering agents

Use LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.

Finance, law, healthcare, and education

Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.

Research and model capability analysis

Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.

Local deployment, private data, and compliance

Compare open weights, licenses, deployment cost, context windows, and data retention policy.

Writing, Q&A, and knowledge organization

Start with Arena preference data, then add Artificial Analysis speed and cost signals.

Weights: Arena Text/Document preference 50%, general intelligence 20%, speed and cost 20%, knowledge organization 10%.

#1
claude-opus-4-7-thinking
Anthropic
96.2

Best overall for high-quality writing, long-answer structure, complex Q&A, and document summaries.

#2
claude-opus-4-6-thinking
Anthropic
95.1

Very stable in Text and Document signals, especially for long documents and deep writing.

#3
gemini-3.1-pro-preview
Google
91.8

Strong multimodal, long-context, and information organization ability.

#4
gpt-5.5-high
OpenAI
90.7

Good structured output, production API fit, and general Q&A performance.

#5
gemini-3.5-flash
Google
84.4

Not the highest quality, but useful for fast summarization, rewriting, and lightweight Q&A.

#6
claude-opus-4-7
Anthropic
88.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#7
gemini-3-pro
Google
87.6

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#8
gpt-5.4-high
OpenAI
86.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#9
claude-sonnet-4-6
Anthropic
85.7

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#10
qwen3.7-max-20260517
Alibaba
84.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#11
deepseek-r1-202605
DeepSeek
83.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#12
kimi-k2.6
Moonshot AI
82.7

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#13
glm-5.1
Zhipu AI
81.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#14
deepseek-v3.1
DeepSeek
81.3

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#15
qwen3.7-plus
Alibaba
80.6

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#16
muse-spark
Meta
79.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#17
llama-4-maverick
Meta
78.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#18
grok-4
xAI
78.1

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#19
mistral-large-2
Mistral
76.8

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

#20
command-r-plus-next
Cohere
75.9

Extended candidate for writing, Q&A, and knowledge organization; verify with your own prompts before production use.

Coding, debugging, and engineering agents

Prioritize LiveCodeBench, SWE-bench, Terminal-Bench, Vals coding tasks, and your own repository tests.

Weights: Vals/SWE real tasks 40%, WebDev/Arena engineering preference 25%, agent reliability 20%, speed and cost 15%.

#1
gemini-3.1-pro-preview
Google
96.0

Strong coding, long-context, and repository-level understanding signals.

#2
gpt-5.5-high
OpenAI
95.2

Strong SWE-style repair, tool use, and production API behavior.

#3
claude-opus-4-7-thinking
Anthropic
94.5

Excellent WebDev and reasoning signal for frontend refactors and architecture analysis.

#4
qwen3.7-max-20260517
Alibaba
87.6

Notable WebDev signal and worth testing for Chinese engineering workflows.

#5
claude-sonnet-4-6
Anthropic
84.9

Balanced for code explanation, local fixes, and lighter agent workflows.

#6
claude-opus-4-6-thinking
Anthropic
84.0

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#7
gpt-5.4-high
OpenAI
83.4

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#8
claude-opus-4-7
Anthropic
82.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#9
gemini-3-pro
Google
82.1

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#10
deepseek-r1-202605
DeepSeek
81.6

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#11
deepseek-v3.1
DeepSeek
80.7

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#12
qwen3.7-plus
Alibaba
79.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#13
glm-5.1
Zhipu AI
78.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#14
kimi-k2.6
Moonshot AI
77.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#15
llama-4-maverick
Meta
76.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#16
mistral-large-2
Mistral
75.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#17
muse-spark
Meta
74.9

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#18
grok-4
xAI
74.1

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#19
command-r-plus-next
Cohere
73.4

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

#20
yi-large-next
01.AI
72.8

Extended candidate for coding and agent workflows; validate on your own repository and test suite.

Finance, law, healthcare, and education

Prioritize industry-task benchmarks such as Vals, then add private internal evaluations.

Weights: Vals industry tasks 45%, long-document reasoning 25%, compliance control 15%, cost and speed 15%.

#1
claude-opus-4-7-thinking
Anthropic
95.0

Strong long-document reasoning and safer professional-answer style.

#2
gemini-3.1-pro-preview
Google
93.8

Strong long context and multimodal handling for reports and industry documents.

#3
gpt-5.5-high
OpenAI
92.9

Good tool ecosystem for knowledge bases, customer support, and internal workflow automation.

#4
claude-opus-4-6-thinking
Anthropic
91.5

Stable document reasoning for professional material review.

#5
kimi-k2.6
Moonshot AI
82.3

Worth testing for Chinese long-document and cost-sensitive industry workflows.

#6
claude-opus-4-7
Anthropic
89.9

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#7
gemini-3-pro
Google
88.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#8
gpt-5.4-high
OpenAI
87.8

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#9
qwen3.7-max-20260517
Alibaba
86.2

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#10
claude-sonnet-4-6
Anthropic
85.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#11
deepseek-r1-202605
DeepSeek
84.1

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#12
glm-5.1
Zhipu AI
83.3

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#13
deepseek-v3.1
DeepSeek
82.4

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#14
qwen3.7-plus
Alibaba
81.6

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#15
llama-4-maverick
Meta
80.5

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#16
mistral-large-2
Mistral
79.7

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#17
muse-spark
Meta
78.8

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#18
command-r-plus-next
Cohere
78.0

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#19
grok-4
xAI
77.1

Extended candidate for industry workflows; combine public signals with private internal evaluation.

#20
yi-large-next
01.AI
76.2

Extended candidate for industry workflows; combine public signals with private internal evaluation.

Research, papers, and model capability analysis

Use HELM, GPQA, MMLU-Pro, HLE, and methodology notes from benchmark authors.

Weights: transparent academic evaluation 35%, reasoning and knowledge 30%, reproducibility 20%, tools and retrieval 15%.

#1
claude-opus-4-7-thinking
Anthropic
94.2

Strong for complex reasoning, paper summaries, and long-form research analysis.

#2
gpt-5.5-high
OpenAI
93.4

Strong general knowledge, tool ecosystem, and structured analysis.

#3
gemini-3.1-pro-preview
Google
92.8

Strong long-context and multimodal analysis for papers, charts, and data materials.

#4
claude-opus-4-6-thinking
Anthropic
91.0

Stable reasoning and document comprehension for serious reading.

#5
gemini-3-pro
Google
87.1

Useful for visual and multimodal interpretation of figures and experiment materials.

#6
claude-opus-4-7
Anthropic
88.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#7
gpt-5.4-high
OpenAI
88.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#8
deepseek-r1-202605
DeepSeek
86.7

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#9
qwen3.7-max-20260517
Alibaba
85.6

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#10
claude-sonnet-4-6
Anthropic
84.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#11
llama-4-maverick
Meta
84.1

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#12
deepseek-v3.1
DeepSeek
83.3

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#13
glm-5.1
Zhipu AI
82.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#14
kimi-k2.6
Moonshot AI
81.5

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#15
qwen3.7-plus
Alibaba
80.6

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#16
mistral-large-2
Mistral
79.8

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#17
muse-spark
Meta
78.7

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#18
grok-4
xAI
77.9

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#19
command-r-plus-next
Cohere
77.0

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

#20
yi-large-next
01.AI
76.2

Extended candidate for research analysis; check transparent benchmark methodology and source reliability.

Local deployment, private data, and compliance

Compare open weights, licenses, deployment cost, context windows, and data retention policy separately.

Weights: openness and deployability 35%, data control 25%, Chinese usability 15%, cost efficiency 15%, capability 10%.

#1
qwen3.7-max / Qwen open ecosystem
Alibaba
89.0

Strong Chinese ecosystem, open community, and practical private-deployment route.

#2
glm-5.1 / GLM open ecosystem
Zhipu AI
86.4

Good Chinese capability and enterprise deployment fit.

#3
kimi-k2.6 / Moonshot ecosystem
Moonshot AI
83.2

Interesting for Chinese long documents and internal knowledge Q&A tests.

#4
muse-spark / Meta open ecosystem
Meta
81.5

Strong open ecosystem, though Chinese and industry coverage need more validation.

#5
gemini-3.5-flash
Google
78.8

Not a local-first model, but useful for low-cost high-throughput workloads after data sanitization.

#6
deepseek-r1 / DeepSeek open ecosystem
DeepSeek
77.9

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#7
deepseek-v3.1 / DeepSeek ecosystem
DeepSeek
77.2

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#8
mistral-large-2 / Mistral ecosystem
Mistral
76.5

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#9
qwen3.7-plus / Qwen open ecosystem
Alibaba
75.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#10
command-r-plus-next
Cohere
74.9

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#11
yi-large-next
01.AI
74.0

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#12
baichuan-4-next
Baichuan
73.2

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#13
internlm3-latest
Shanghai AI Lab
72.6

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#14
minimax-text-01
MiniMax
71.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#15
ernie-4.5
Baidu
71.1

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#16
gemini-3.5-flash
Google
70.5

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#17
gpt-5.5-high
OpenAI
69.4

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#18
claude-sonnet-4-6
Anthropic
68.8

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#19
claude-opus-4-7-thinking
Anthropic
68.1

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.

#20
gemini-3.1-pro-preview
Google
67.6

Extended candidate for local, private, and compliance-sensitive workflows; check licenses and deployment terms.
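The five scenario weight lines above can be kept as data and checked before use. The Python sketch below uses the percentages from the page. The dictionary keys, function names, and candidate sub-scores are hypothetical.

```python
# Percent weights per scenario, copied from the "Weights:" lines above.
SCENARIO_WEIGHTS = {
    "writing": {"arena_text_document": 50, "general_intelligence": 20,
                "speed_cost": 20, "knowledge_organization": 10},
    "coding": {"real_tasks": 40, "webdev_arena": 25,
               "agent_reliability": 20, "speed_cost": 15},
    "industry": {"vals_industry": 45, "long_document": 25,
                 "compliance": 15, "cost_speed": 15},
    "research": {"transparent_eval": 35, "reasoning_knowledge": 30,
                 "reproducibility": 20, "tools_retrieval": 15},
    "local": {"openness": 35, "data_control": 25, "chinese_usability": 15,
              "cost_efficiency": 15, "capability": 10},
}


def validate_profiles(profiles: dict[str, dict[str, int]]) -> None:
    """Raise if any scenario's weights do not add up to 100 percent."""
    for name, weights in profiles.items():
        total = sum(weights.values())
        if total != 100:
            raise ValueError(f"{name} weights sum to {total}, expected 100")


def scenario_score(scenario: str, sub_scores: dict[str, float]) -> float:
    """Weighted average of 0-100 sub-scores for one scenario."""
    weights = SCENARIO_WEIGHTS[scenario]
    return sum(weights[dim] * sub_scores[dim] for dim in weights) / 100


validate_profiles(SCENARIO_WEIGHTS)

# Hypothetical sub-scores for one candidate model in the coding scenario.
candidate = {"real_tasks": 90, "webdev_arena": 80, "agent_reliability": 70, "speed_cost": 60}
print(scenario_score("coding", candidate))  # prints 79.0
```

Keeping the weights in one structure makes it easy to replace them with weights that match your own priorities and see how the ordering of candidates changes.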

Editorial note

This page does not copy external leaderboards or claim that one model is always best. Guozhen AI combines public benchmark sources, methodology differences, and practical scenarios so readers can make better model decisions.
