
#1 on Every Major Benchmark

Evaluated on DRACO, DeepSearchQA, and DeepResearch Bench — PhD-level research tasks graded by domain experts.

DRACO: 78.6%
DeepSearchQA: 84.5%
DeepResearch Bench: 56.27
DRACO — Perplexity + Harvard

DRACO Benchmark

100 open-ended research questions across 10 domains, judged by Gemini-2.5-Pro against 3,934 weighted rubric criteria. Grep leads all four evaluation axes and wins 9 of 10 domains.

Overall DRACO scores (chart): Grep leads Perplexity DR (Opus 4.6), Claude Opus 4.6, Gemini Deep Research, and OpenAI Deep Research (o3), winning 9 of 10 domains.

Factual Accuracy: 75.4% (+7.5pp vs Perplexity)
Breadth & Depth: 80.3% (+7.2pp)
Presentation: 93.3% (+3.0pp)
Citation: 79.1% (+14.5pp)
DeepSearchQA — Google
Overall DeepSearchQA scores (chart): Grep vs Perplexity Deep Research, Moonshot K2.5, Anthropic Opus 4.5, and Parallel Ultra2x.

DeepSearchQA

896 multi-step research questions across 17 subject domains. Judge: Gemini 2.5 Flash. Grep achieves 84.5% FC with perfect scores in Linguistics, Biology, and Arts & Entertainment.

14 of 17 categories exceed 80% FC

DeepResearch Bench — RACE Framework

DeepResearch Bench

100 PhD-level research questions (50 Chinese, 50 English), judged by Gemini-2.5-Pro. A score above 50 means the system outperformed the human expert. Grep leads the field of 34 systems.

Overall RACE scores (chart): Grep leads Cellcog Max, nvidia-aiq, Cellcog, and CMCC-DeepInsight.

Insight: 58.98
Comprehensiveness: 56.79
Instruction Following: 53.49
Readability: 53.50

Methodology

Multi-Agent Architecture

Grep orchestrates specialised sub-agents for search, synthesis, verification, and citation, then merges their outputs into a single, coherent research report.
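As a rough illustration only, a pipeline of this shape can be sketched as below. The agent names, interfaces, and merge logic are assumptions for the sake of the sketch, not Grep's actual implementation.

```python
"""Illustrative multi-agent research pipeline (hypothetical sketch)."""
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    claim: str
    source: str
    verified: bool = False


def search_agent(question: str) -> list[Finding]:
    # Stand-in for web retrieval: emits stubbed findings with sources.
    return [
        Finding("finding A for " + question, "https://example.com/a"),
        Finding("finding B for " + question, "ftp://example.com/b"),
    ]


def verification_agent(findings: list[Finding]) -> list[Finding]:
    # Stubbed fact check: keep only findings whose source looks resolvable.
    return [replace(f, verified=True) for f in findings
            if f.source.startswith("https://")]


def synthesis_agent(findings: list[Finding]) -> str:
    # Merge verified claims into report prose.
    return " ".join(f.claim for f in findings)


def citation_agent(findings: list[Finding]) -> list[str]:
    # Collect one citation per verified source, preserving order.
    return [f.source for f in findings]


def orchestrate(question: str) -> dict:
    # The orchestrator chains the sub-agents and merges their
    # outputs into a single report object.
    verified = verification_agent(search_agent(question))
    return {
        "report": synthesis_agent(verified),
        "citations": citation_agent(verified),
    }


result = orchestrate("benchmark question")
print(result["citations"])  # only the verified https source survives
```

The point of the shape, rather than the stub logic, is that verification sits between retrieval and synthesis, so only checked findings reach the report and its citation list.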

Claude Opus 4.6 Backbone

All reasoning and synthesis steps are powered by Claude Opus 4.6, giving Grep best-in-class analytical depth, nuanced judgement, and instruction following.

Experience #1 Ranked Research

See why Grep outperforms OpenAI, Google, Perplexity, and every specialised research platform on PhD-level tasks.