
SRE-skills-bench

Can Language Models Resolve Real-world SRE Tasks?

Leaderboard

Overall performance ranking across all SRE tasks


| Rank | Model | Score | Cost / 1M output tokens | Date |
|------|-------|-------|-------------------------|------|
| 1 | OpenAI gpt-5.2-pro | 95.3 | $168.00 | 2025-12-12 |
| 2 | Google Gemini-3-pro | 94.9 | $12.00 | 2025-11-24 |
| 3 | OpenAI gpt-5.2 | 93.6 | $14.00 | 2025-12-11 |
| 4 | Anthropic opus-4.5 | 92.9 | $25.00 | 2025-11-24 |
| 5 | Anthropic sonnet-4.5 | 92.7 | $15.00 | 2025-11-24 |
| 6 | OpenAI gpt-5.1-codex-max | 92.5 | $10.00 | 2025-12-10 |
| 7 | OpenAI GPT-5.1 | 89.6 | $10.00 | 2025-11-24 |
| 8 | OpenAI GPT-5 | 86.9 | $10.00 | 2025-12-03 |
| 9 | Alibaba qwen3-vl-235b | 85.2 | $3.95 | 2025-11-24 |
| 10 | AWS nova-pro-v1 | 60.4 | $2.50 | 2025-12-05 |

Task Results

Performance breakdown across SRE task categories (scores shown as percentages).


| Model | Avg | AWS S3 | AWS VPC | AWS IAM | Azure Network | Azure Compute | Azure Kubernetes | GCP Network | GCP Compute | GCP Storage | GMCQ |
|-------|-----|--------|---------|---------|---------------|---------------|------------------|-------------|-------------|-------------|------|
| OpenAI gpt-5.2-pro | 95.3 | 92.8 | 95.5 | 93.5 | 97.9 | 97.1 | 97.0 | 94.7 | 97.7 | 97.7 | 89.5 |
| Google Gemini-3-pro | 94.9 | 95.7 | 95.0 | 97.1 | 96.8 | 96.4 | 95.3 | 95.4 | 94.4 | 93.0 | 89.7 |
| OpenAI gpt-5.2 | 93.6 | 89.9 | 93.7 | 93.1 | 97.9 | 95.6 | 93.2 | 94.5 | 94.4 | 95.0 | 89.0 |
| Anthropic opus-4.5 | 92.9 | 92.8 | 92.3 | 88.6 | 95.8 | 94.9 | 93.2 | 94.5 | 92.0 | 95.0 | 90.2 |
| Anthropic sonnet-4.5 | 92.7 | 88.4 | 95.5 | 93.5 | 98.9 | 89.1 | 94.0 | 93.6 | 93.2 | 91.9 | 89.0 |
| OpenAI gpt-5.1-codex-max | 92.5 | 89.0 | 93.0 | 91.0 | 97.0 | 92.0 | 94.0 | 92.0 | 93.0 | 95.0 | 89.0 |
| OpenAI GPT-5.1 | 89.6 | 82.6 | 89.6 | 88.6 | 92.6 | 91.2 | 90.2 | 90.4 | 88.9 | 92.6 | 89.2 |
| OpenAI GPT-5 | 86.9 | 83.0 | 86.0 | 82.0 | 94.0 | 89.0 | 89.0 | 84.0 | 91.0 | 84.0 | 87.0 |
| Alibaba qwen3-vl-235b | 85.2 | 76.8 | 85.5 | 86.5 | 88.4 | 89.1 | 87.2 | 84.0 | 82.1 | 86.0 | 86.2 |
| AWS nova-pro-v1 | 60.4 | 41.0 | 63.0 | 54.0 | 66.0 | 61.0 | 64.0 | 53.0 | 63.0 | 55.0 | 84.0 |
| AWS nova-2-lite | 49.2 | 34.0 | 52.5 | 46.1 | 47.4 | 48.9 | 47.4 | 45.2 | 47.5 | 42.6 | 80.0 |

News

Latest updates and announcements from the SRE-skills-bench project

Methodology

How we evaluate language models on Site Reliability Engineering tasks

Task Collection

We curate real-world SRE scenarios from production incidents, infrastructure challenges, and cloud platform configurations. Each task is designed to test specific SRE competencies, including cloud security, networking, compute management, and identity and access management.
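
As a rough illustration, each task can be thought of as a small structured record tying a scenario to a provider, a category, and a validated answer. The field names and the sample values below are hypothetical, sketched only to show the kind of metadata a task carries; they are not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SRETask:
    """Hypothetical shape of one benchmark task (illustrative only)."""
    task_id: str        # made-up identifier, e.g. "aws-iam-0042"
    provider: str       # "AWS", "Azure", or "GCP"
    category: str       # e.g. "IAM", "Network", "Kubernetes"
    question: str       # scenario drawn from a real-world incident
    choices: list[str]  # multiple-choice options
    answer_index: int   # index of the expert-validated correct choice

# Fabricated example, for illustration of the structure only:
example = SRETask(
    task_id="aws-iam-0042",
    provider="AWS",
    category="IAM",
    question="An EC2 instance role suddenly loses access to an S3 bucket. "
             "Which change most likely caused it?",
    choices=["An explicit Deny was added to the bucket policy",
             "The instance was rebooted",
             "The AMI was deprecated",
             "CloudTrail logging was disabled"],
    answer_index=0,
)
```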

Evaluation Framework

Models are evaluated with a multiple-choice question (MCQ) format to ensure consistent and reproducible results. Each question has been validated by experienced SRE practitioners for accuracy and relevance to real-world scenarios.
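
A minimal sketch of what such an MCQ evaluation loop could look like is shown below. The `query_model` callable and the task tuple layout are assumptions made for this sketch, not the benchmark's actual harness.

```python
def score_mcq(tasks, query_model):
    """Return the percentage of MCQ tasks answered correctly.

    `tasks` is assumed to be an iterable of (question, choices, correct_letter)
    tuples, and `query_model` any callable that returns a single option letter
    such as "A"; both are illustrative assumptions.
    """
    correct = 0
    for question, choices, correct_letter in tasks:
        # Present the question with lettered options, one prompt per task.
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)
        )
        answer = query_model(prompt).strip().upper()
        if answer == correct_letter:
            correct += 1
    return 100.0 * correct / len(tasks)  # percentage, as in the leaderboard
```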

Scoring System

The overall score is calculated as a weighted average across all task categories, with each category weighted by its importance in typical SRE workflows. Costs reflect standard API pricing per 1 million output tokens.
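
Concretely, the weighted average can be computed as in the sketch below. The weights shown are placeholders (the benchmark's actual per-category weights are not listed in this section), and the sample scores are taken from the gpt-5.2-pro AWS columns purely to make the arithmetic concrete.

```python
def overall_score(category_scores, weights):
    """Weighted average of per-category scores (0-100 scale).

    Both arguments are dicts keyed by category name; the weights used in the
    example are illustrative, not the benchmark's own.
    """
    total_weight = sum(weights[c] for c in category_scores)
    return sum(category_scores[c] * weights[c] for c in category_scores) / total_weight

# Illustrative only: with equal weights this reduces to a simple mean.
scores = {"AWS S3": 92.8, "AWS VPC": 95.5, "AWS IAM": 93.5}
weights = {"AWS S3": 1.0, "AWS VPC": 1.0, "AWS IAM": 1.0}
print(round(overall_score(scores, weights), 1))  # 93.9
```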

Quality Assurance

All benchmark questions undergo rigorous review by multiple SRE experts. We regularly update the benchmark to reflect the latest cloud platform features and security best practices, ensuring the evaluation remains relevant and challenging.

About Us


Rootly AI Labs

Advancing AI for Site Reliability Engineering

Rootly AI Labs is the research division of Rootly, dedicated to exploring how artificial intelligence can transform incident management and site reliability engineering. Our mission is to push the boundaries of what's possible when AI meets operational excellence.

We created SRE-skills-bench to provide the industry with a rigorous, transparent benchmark for evaluating how well language models understand and can assist with real-world SRE tasks. As AI becomes increasingly integrated into DevOps workflows, it's crucial to have standardized ways to measure and compare model capabilities.

Our team consists of experienced SRE practitioners, ML researchers, and software engineers who are passionate about building tools that make on-call less painful and incident response more effective. We believe that the future of SRE will be deeply augmented by AI, and we're committed to ensuring that augmentation is built on solid foundations.