
SRE-skills-bench
Can Language Models Resolve Real-world SRE Tasks?
Leaderboard V1 · MCQ
Overall performance ranking across all SRE multiple-choice tasks
| Rank | Model | Score | Cost/1M output tokens | Date |
|---|---|---|---|---|
| 1 | | 98.8 | $12.00 | 2026-02-19 |
| 2 | | 98.3 | $15.00 | 2026-03-06 |
| 3 | | 98.0 | $14.00 | 2026-03-13 |
| 4 | | 96.7 | $12.00 | 2026-02-17 |
| 5 | | 96.5 | $168.00 | 2026-02-17 |
| 6 | | 95.9 | $2.06 | 2026-02-17 |
| 7 | | 94.7 | $25.00 | 2026-02-17 |
| 8 | | 94.6 | $25.00 | 2026-02-17 |
| 9 | | 93.3 | $14.00 | 2026-02-17 |
| 10 | | 92.5 | $10.00 | 2025-12-10 |
Task Results
Performance breakdown across SRE task categories (scores shown as percentages).
| Model | Avg | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 98.8 | 100.0 | 99.3 | 100.0 | 100.0 | 98.9 | 99.3 | 99.3 | 100.0 | 99.5 | 92.0 |
| | 98.3 | 100.0 | 99.3 | 98.4 | 100.0 | 98.9 | 99.3 | 99.3 | 99.0 | 98.4 | 90.3 |
| | 98.0 | 100.0 | 99.3 | 98.4 | 100.0 | 96.6 | 98.6 | 100.0 | 99.0 | 98.4 | 90.0 |
| | 96.7 | 97.3 | 96.6 | 96.9 | 100.0 | 100.0 | 97.3 | 96.3 | 99.0 | 97.8 | 85.3 |
| | 96.5 | 94.6 | 96.6 | 94.5 | 100.0 | 96.6 | 98.6 | 99.3 | 98.1 | 97.3 | 89.5 |
| | 95.9 | 91.9 | 97.3 | 96.9 | 98.6 | 98.9 | 97.9 | 97.1 | 98.1 | 93.4 | 88.7 |
| | 94.7 | 91.9 | 99.3 | 92.2 | 97.1 | 96.6 | 97.3 | 95.6 | 96.1 | 94.0 | 87.0 |
| | 94.6 | 89.2 | 98.0 | 90.6 | 100.0 | 97.7 | 96.6 | 97.8 | 96.1 | 92.3 | 87.5 |
| | 93.3 | 81.1 | 96.6 | 91.4 | 97.1 | 95.5 | 93.2 | 97.1 | 96.1 | 95.1 | 89.8 |
| | 92.5 | 89.0 | 93.0 | 91.0 | 97.0 | 92.0 | 94.0 | 92.0 | 93.0 | 95.0 | 89.0 |
| | 90.4 | 75.7 | 94.6 | 85.2 | 97.1 | 94.3 | 94.5 | 92.6 | 92.2 | 90.2 | 88.0 |
| | 89.6 | 82.6 | 89.6 | 88.6 | 92.6 | 91.2 | 90.2 | 90.4 | 88.9 | 92.6 | 89.2 |
| | 89.4 | 79.7 | 90.0 | 86.9 | 96.8 | 91.2 | 91.0 | 90.9 | 88.9 | 91.1 | 87.5 |
| | 86.9 | 83.0 | 86.0 | 82.0 | 94.0 | 89.0 | 89.0 | 84.0 | 91.0 | 84.0 | 87.0 |
| | 86.5 | 75.7 | 89.9 | 78.9 | 97.1 | 89.8 | 89.7 | 83.8 | 87.4 | 86.9 | 85.3 |
| | 85.2 | 76.8 | 85.5 | 86.5 | 88.4 | 89.1 | 87.2 | 84.0 | 82.1 | 86.0 | 86.2 |
| | 83.6 | 64.9 | 92.6 | 79.7 | 95.7 | 86.4 | 88.4 | 91.2 | 87.4 | 86.9 | 77.0 |
| | 70.8 | 53.6 | 71.0 | 70.6 | 73.7 | 78.1 | 70.9 | 68.5 | 67.9 | 67.0 | 86.5 |
| | 68.3 | 68.1 | 64.3 | 68.6 | 70.5 | 73.7 | 68.0 | 76.7 | 61.7 | 55.4 | 75.8 |
| | 60.4 | 41.0 | 63.0 | 54.0 | 66.0 | 61.0 | 64.0 | 53.0 | 63.0 | 55.0 | 84.0 |
| | 49.2 | 34.0 | 52.5 | 46.1 | 47.4 | 48.9 | 47.4 | 45.2 | 47.5 | 42.6 | 80.0 |
News
Latest updates and announcements from the SRE-skills-bench project
Methodology
How we evaluate language models on Site Reliability Engineering tasks
Task Collection
We curate real-world SRE scenarios from production incidents, infrastructure challenges, and cloud platform configurations. Each task is designed to test specific SRE competencies including cloud security, networking, compute management, and identity access management across AWS, Azure, and GCP.
V1 — Multiple-Choice Questions
V1 evaluates models using multiple-choice questions derived from real SRE scenarios. Each question has been validated by experienced SRE practitioners. The overall score is a weighted average across task categories, providing a broad measure of SRE knowledge across cloud providers and disciplines.
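The weighted-average scoring described above can be sketched as follows. The category names and weights here are purely illustrative assumptions; the benchmark's actual categories and weighting scheme are not specified on this page.

```python
# Sketch of an overall score as a weighted average across task categories.
# Category names, scores, and weights below are made-up examples, not
# values from the leaderboard.
category_scores = {"cloud_security": 98.9, "networking": 99.3, "compute": 100.0}
weights = {"cloud_security": 0.4, "networking": 0.3, "compute": 0.3}

# Weighted average: sum of (score x weight), normalized by total weight.
overall = sum(category_scores[c] * weights[c] for c in category_scores) / sum(
    weights.values()
)
print(round(overall, 2))
```

With equal weights this reduces to a plain mean of the category columns, which is consistent with how the Avg column in the task-results table relates to the per-category scores.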
Quality Assurance
All benchmark tasks undergo rigorous review by multiple SRE experts. We regularly update V1 to reflect the latest cloud platform features and security best practices, ensuring the evaluation remains relevant and challenging. Costs are measured using standard API pricing per 1M output tokens.
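To make the cost metric concrete, here is a minimal sketch of how a dollar figure follows from a per-1M-output-token price; the token count is a hypothetical example, not a measured value from any evaluation run.

```python
# Cost of an evaluation run given a price per 1M output tokens, matching
# the unit used in the leaderboard's cost column. The token count is a
# made-up example.
price_per_million = 12.00   # USD per 1M output tokens
output_tokens = 250_000     # tokens generated during the hypothetical run

cost = output_tokens / 1_000_000 * price_per_million
print(f"${cost:.2f}")       # → $3.00
```

Note that this counts only output tokens; input (prompt) tokens are priced separately by most APIs and are not reflected in the leaderboard's cost column as described.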
About Us

Rootly AI Labs
Advancing AI for Site Reliability Engineering
Rootly AI Labs is the research division of Rootly, dedicated to exploring how artificial intelligence can transform incident management and site reliability engineering. Our mission is to push the boundaries of what's possible when AI meets operational excellence.
We created SRE-skills-bench to provide the industry with a rigorous, transparent benchmark for evaluating how well language models understand and can assist with real-world SRE tasks. As AI becomes increasingly integrated into DevOps workflows, it's crucial to have standardized ways to measure and compare model capabilities.
Our team consists of experienced SRE practitioners, ML researchers, and software engineers who are passionate about building tools that make on-call less painful and incident response more effective. We believe that the future of SRE will be deeply augmented by AI, and we're committed to ensuring that augmentation is built on solid foundations.