#1 on Every Major Benchmark
Evaluated on DRACO, DeepSearchQA, and DeepResearch Bench: PhD-level research tasks scored against expert-designed rubrics.
DRACO Benchmark
100 open-ended research questions across 10 domains, judged by Gemini-2.5-Pro against 3,934 weighted rubric criteria. Grep leads all four evaluation axes and wins 9 of 10 domains.
Grep wins 9 of 10 domains
Factual Accuracy
Breadth & Depth
Presentation
Citation
DeepSearchQA
896 multi-step research questions across 17 subject domains, judged by Gemini 2.5 Flash. Grep achieves an 84.5% FC score, with perfect marks in Linguistics, Biology, and Arts & Entertainment.
14 of 17 categories exceed 80% FC
DeepResearch Bench
100 PhD-level research questions (50 Chinese, 50 English), judged by Gemini-2.5-Pro. A score above 50 means the system outperformed the human expert baseline. Grep leads the field of 34 systems.
Insight
Comprehensiveness
Instruction Following
Readability
Methodology
Multi-Agent Architecture
Grep orchestrates specialised sub-agents for search, synthesis, verification, and citation, then merges their outputs into a single, coherent research report.
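As a rough illustration of this orchestration pattern (not Grep's actual implementation, which is not public), the flow can be sketched as a pipeline of stub agents. All names, agent boundaries, and data shapes below are assumptions for the sketch:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Finding:
    """One candidate claim with its source (hypothetical schema)."""
    claim: str
    source: str
    verified: bool = False

def search_agent(question: str) -> List[Finding]:
    # Stub: a real search agent would query the web for the question.
    return [Finding("Example claim one.", "source-1"),
            Finding("Example claim two.", "source-2")]

def verification_agent(findings: List[Finding]) -> List[Finding]:
    # Stub: cross-check each claim against its source; here all pass.
    for f in findings:
        f.verified = True
    return [f for f in findings if f.verified]

def synthesis_agent(question: str, findings: List[Finding]) -> str:
    # Merge verified claims into a draft answer.
    body = " ".join(f.claim for f in findings)
    return f"Q: {question}\n{body}"

def citation_agent(report: str, findings: List[Finding]) -> str:
    # Append a numbered reference list for every cited source.
    refs = "\n".join(f"[{i + 1}] {f.source}" for i, f in enumerate(findings))
    return f"{report}\n\nReferences:\n{refs}"

def orchestrate(question: str) -> str:
    # The orchestrator wires the sub-agents into one report.
    findings = verification_agent(search_agent(question))
    return citation_agent(synthesis_agent(question, findings), findings)
```

The key design idea is that each sub-agent has a narrow contract, so the orchestrator can run, retry, or swap them independently before merging.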
Claude Opus 4.6 Backbone
All reasoning and synthesis steps are powered by Claude Opus 4.6, giving Grep best-in-class analytical depth, nuanced judgement, and instruction following.
Experience #1 Ranked Research
See why Grep outperforms OpenAI, Google, Perplexity, and every specialised research platform on PhD-level tasks.