
SRE-skills-bench
Can Language Models Resolve Real-world SRE Tasks?
Leaderboard
Overall performance ranking across all SRE tasks
| Rank | Model | Score | Cost/1M output tokens | Date |
|---|---|---|---|---|
| 1 | | 95.3 | $168.00 | 2025-12-12 |
| 2 | | 94.9 | $12.00 | 2025-11-24 |
| 3 | | 93.6 | $14.00 | 2025-12-11 |
| 4 | | 92.9 | $25.00 | 2025-11-24 |
| 5 | | 92.7 | $15.00 | 2025-11-24 |
| 6 | | 92.5 | $10.00 | 2025-12-10 |
| 7 | | 89.6 | $10.00 | 2025-11-24 |
| 8 | | 86.9 | $10.00 | 2025-12-03 |
| 9 | | 85.2 | $3.95 | 2025-11-24 |
| 10 | | 60.4 | $2.50 | 2025-12-05 |
Task Results
Performance breakdown across SRE task categories (scores shown as percentages).
| Model | Avg | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 95.3 | 92.8 | 95.5 | 93.5 | 97.9 | 97.1 | 97.0 | 94.7 | 97.7 | 97.7 | 89.5 |
| | 94.9 | 95.7 | 95.0 | 97.1 | 96.8 | 96.4 | 95.3 | 95.4 | 94.4 | 93.0 | 89.7 |
| | 93.6 | 89.9 | 93.7 | 93.1 | 97.9 | 95.6 | 93.2 | 94.5 | 94.4 | 95.0 | 89.0 |
| | 92.9 | 92.8 | 92.3 | 88.6 | 95.8 | 94.9 | 93.2 | 94.5 | 92.0 | 95.0 | 90.2 |
| | 92.7 | 88.4 | 95.5 | 93.5 | 98.9 | 89.1 | 94.0 | 93.6 | 93.2 | 91.9 | 89.0 |
| | 92.5 | 89.0 | 93.0 | 91.0 | 97.0 | 92.0 | 94.0 | 92.0 | 93.0 | 95.0 | 89.0 |
| | 89.6 | 82.6 | 89.6 | 88.6 | 92.6 | 91.2 | 90.2 | 90.4 | 88.9 | 92.6 | 89.2 |
| | 86.9 | 83.0 | 86.0 | 82.0 | 94.0 | 89.0 | 89.0 | 84.0 | 91.0 | 84.0 | 87.0 |
| | 85.2 | 76.8 | 85.5 | 86.5 | 88.4 | 89.1 | 87.2 | 84.0 | 82.1 | 86.0 | 86.2 |
| | 60.4 | 41.0 | 63.0 | 54.0 | 66.0 | 61.0 | 64.0 | 53.0 | 63.0 | 55.0 | 84.0 |
| | 49.2 | 34.0 | 52.5 | 46.1 | 47.4 | 48.9 | 47.4 | 45.2 | 47.5 | 42.6 | 80.0 |
News
Latest updates and announcements from the SRE-skills-bench project
Methodology
How we evaluate language models on Site Reliability Engineering tasks
Task Collection
We curate real-world SRE scenarios from production incidents, infrastructure challenges, and cloud platform configurations. Each task is designed to test specific SRE competencies including cloud security, networking, compute management, and identity access management.
Evaluation Framework
Models are evaluated in a multiple-choice question (MCQ) format to ensure consistent and reproducible results. Each question has been validated by experienced SRE practitioners to ensure accuracy and relevance to real-world scenarios. A minimal sketch of such a grading harness is shown below.
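The sketch below illustrates how an MCQ benchmark like this one can be graded; it is an assumption-laden example, not the actual SRE-skills-bench harness. The JSONL task format, the field names, and the `query_model` callable are all hypothetical.

```python
import json
import re

def grade_mcq(tasks_path, query_model):
    """Score a model on multiple-choice SRE questions.

    Each line of the (hypothetical) tasks file is a JSON object with a
    question, labeled options, and the correct option letter.
    """
    with open(tasks_path) as f:
        tasks = [json.loads(line) for line in f]

    correct = 0
    for task in tasks:
        # Build a prompt listing the options and asking for a single letter.
        prompt = task["question"] + "\n" + "\n".join(
            f"{label}. {text}" for label, text in task["options"].items()
        ) + "\nAnswer with a single letter."
        reply = query_model(prompt)  # hypothetical wrapper around an LLM API
        match = re.search(r"\b([A-D])\b", reply.strip().upper())
        predicted = match.group(1) if match else None
        if predicted == task["answer"]:
            correct += 1

    return correct / len(tasks) * 100  # accuracy as a percentage
```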
Scoring System
The overall score is calculated as a weighted average across all task categories, with each category weighted by its importance in typical SRE workflows. Costs reflect standard list API pricing per one million output tokens, matching the Cost/1M output tokens column in the leaderboard.
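As a concrete illustration of the weighted average, the sketch below combines per-category percentages into a single score; the category names and weights are hypothetical examples, not the benchmark's actual weighting.

```python
def weighted_score(category_scores, weights):
    """Combine per-category percentages into one overall score.

    Both dicts are keyed by category name; weights are normalized here,
    so they do not need to sum to 1.
    """
    total_weight = sum(weights[c] for c in category_scores)
    return sum(category_scores[c] * weights[c] for c in category_scores) / total_weight

# Hypothetical example: two categories weighted 2:1
scores = {"cloud_security": 92.8, "networking": 95.5}
weights = {"cloud_security": 2.0, "networking": 1.0}
print(round(weighted_score(scores, weights), 1))  # 93.7
```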
Quality Assurance
All benchmark questions undergo rigorous review by multiple SRE experts. We regularly update the benchmark to reflect the latest cloud platform features and security best practices, ensuring the evaluation remains relevant and challenging.
About Us

Rootly AI Labs
Advancing AI for Site Reliability Engineering
Rootly AI Labs is the research division of Rootly, dedicated to exploring how artificial intelligence can transform incident management and site reliability engineering. Our mission is to push the boundaries of what's possible when AI meets operational excellence.
We created SRE-skills-bench to provide the industry with a rigorous, transparent benchmark for evaluating how well language models understand and can assist with real-world SRE tasks. As AI becomes increasingly integrated into DevOps workflows, it's crucial to have standardized ways to measure and compare model capabilities.
Our team consists of experienced SRE practitioners, ML researchers, and software engineers who are passionate about building tools that make on-call less painful and incident response more effective. We believe that the future of SRE will be deeply augmented by AI, and we're committed to ensuring that augmentation is built on solid foundations.