
SRE-skills-bench

Can Language Models Resolve Real-world SRE Tasks?

Leaderboard V1 · MCQ

Overall performance ranking across all SRE multiple-choice tasks


| Rank | Organization | Model | Score | Cost / 1M output tokens | Date |
|---|---|---|---|---|---|
| 1 | Google | Gemini-3.1-pro | 98.8 | $12.00 | 2026-02-19 |
| 2 | OpenAI | GPT-5.4 | 98.3 | $15.00 | 2026-03-06 |
| 3 | OpenAI | gpt-5.3-codex | 98.0 | $14.00 | 2026-03-13 |
| 4 | Google | Gemini-3-pro | 96.7 | $12.00 | 2026-02-17 |
| 5 | OpenAI | gpt-5.2-pro | 96.5 | $168.00 | 2026-02-17 |
| 6 | Moonshot | kimi-k2.5 | 95.9 | $2.06 | 2026-02-17 |
| 7 | Anthropic | opus-4.6 | 94.7 | $25.00 | 2026-02-17 |
| 8 | Anthropic | opus-4.5 | 94.6 | $25.00 | 2026-02-17 |
| 9 | OpenAI | gpt-5.2 | 93.3 | $14.00 | 2026-02-17 |
| 10 | OpenAI | gpt-5.1-codex-max | 92.5 | $10.00 | 2025-12-10 |

Task Results

Performance breakdown across SRE task categories (scores shown as percentages).


| Model | Avg | AWS S3 | AWS VPC | AWS IAM | Azure Network | Azure Compute | Azure Kubernetes | GCP Network | GCP Compute | GCP Storage | GMCQ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Google Gemini-3.1-pro | 98.8 | 100.0 | 99.3 | 100.0 | 100.0 | 98.9 | 99.3 | 99.3 | 100.0 | 99.5 | 92.0 |
| OpenAI GPT-5.4 | 98.3 | 100.0 | 99.3 | 98.4 | 100.0 | 98.9 | 99.3 | 99.3 | 99.0 | 98.4 | 90.3 |
| OpenAI gpt-5.3-codex | 98.0 | 100.0 | 99.3 | 98.4 | 100.0 | 96.6 | 98.6 | 100.0 | 99.0 | 98.4 | 90.0 |
| Google Gemini-3-pro | 96.7 | 97.3 | 96.6 | 96.9 | 100.0 | 100.0 | 97.3 | 96.3 | 99.0 | 97.8 | 85.3 |
| OpenAI gpt-5.2-pro | 96.5 | 94.6 | 96.6 | 94.5 | 100.0 | 96.6 | 98.6 | 99.3 | 98.1 | 97.3 | 89.5 |
| Moonshot kimi-k2.5 | 95.9 | 91.9 | 97.3 | 96.9 | 98.6 | 98.9 | 97.9 | 97.1 | 98.1 | 93.4 | 88.7 |
| Anthropic opus-4.6 | 94.7 | 91.9 | 99.3 | 92.2 | 97.1 | 96.6 | 97.3 | 95.6 | 96.1 | 94.0 | 87.0 |
| Anthropic opus-4.5 | 94.6 | 89.2 | 98.0 | 90.6 | 100.0 | 97.7 | 96.6 | 97.8 | 96.1 | 92.3 | 87.5 |
| OpenAI gpt-5.2 | 93.3 | 81.1 | 96.6 | 91.4 | 97.1 | 95.5 | 93.2 | 97.1 | 96.1 | 95.1 | 89.8 |
| OpenAI gpt-5.1-codex-max | 92.5 | 89.0 | 93.0 | 91.0 | 97.0 | 92.0 | 94.0 | 92.0 | 93.0 | 95.0 | 89.0 |
| Anthropic sonnet-4.6 | 90.4 | 75.7 | 94.6 | 85.2 | 97.1 | 94.3 | 94.5 | 92.6 | 92.2 | 90.2 | 88.0 |
| OpenAI GPT-5.1 | 89.6 | 82.6 | 89.6 | 88.6 | 92.6 | 91.2 | 90.2 | 90.4 | 88.9 | 92.6 | 89.2 |
| Anthropic sonnet-4.6 | 89.4 | 79.7 | 90.0 | 86.9 | 96.8 | 91.2 | 91.0 | 90.9 | 88.9 | 91.1 | 87.5 |
| OpenAI GPT-5 | 86.9 | 83.0 | 86.0 | 82.0 | 94.0 | 89.0 | 89.0 | 84.0 | 91.0 | 84.0 | 87.0 |
| Google Gemini-3.1-Flash-Lite | 86.5 | 75.7 | 89.9 | 78.9 | 97.1 | 89.8 | 89.7 | 83.8 | 87.4 | 86.9 | 85.3 |
| Alibaba qwen3-vl-235b | 85.2 | 76.8 | 85.5 | 86.5 | 88.4 | 89.1 | 87.2 | 84.0 | 82.1 | 86.0 | 86.2 |
| Anthropic sonnet-4.5 | 83.6 | 64.9 | 92.6 | 79.7 | 95.7 | 86.4 | 88.4 | 91.2 | 87.4 | 86.9 | 77.0 |
| MiniMax minimax-m2.5 | 70.8 | 53.6 | 71.0 | 70.6 | 73.7 | 78.1 | 70.9 | 68.5 | 67.9 | 67.0 | 86.5 |
| ZhipuAI glm-5 | 68.3 | 68.1 | 64.3 | 68.6 | 70.5 | 73.7 | 68.0 | 76.7 | 61.7 | 55.4 | 75.8 |
| AWS nova-pro-v1 | 60.4 | 41.0 | 63.0 | 54.0 | 66.0 | 61.0 | 64.0 | 53.0 | 63.0 | 55.0 | 84.0 |
| AWS nova-2-lite | 49.2 | 34.0 | 52.5 | 46.1 | 47.4 | 48.9 | 47.4 | 45.2 | 47.5 | 42.6 | 80.0 |

News

Latest updates and announcements from the SRE-skills-bench project

Methodology

How we evaluate language models on Site Reliability Engineering tasks

Task Collection

We curate real-world SRE scenarios from production incidents, infrastructure challenges, and cloud platform configurations. Each task is designed to test specific SRE competencies, including cloud security, networking, compute management, and identity and access management, across AWS, Azure, and GCP.

V1 — Multiple-Choice Questions

V1 evaluates models using multiple-choice questions derived from real SRE scenarios. Each question has been validated by experienced SRE practitioners. The overall score is a weighted average across task categories, providing a broad measure of SRE knowledge across cloud providers and disciplines.
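The weighted-average aggregation can be sketched as follows. This is a hypothetical illustration, not the benchmark's published scoring code: the category names and weights below are made up for demonstration, and the actual weights are not specified in this document.

```python
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores.

    `scores` maps category name -> accuracy (%); `weights` maps the same
    categories to their relative importance. Both dicts are assumptions here.
    """
    total_weight = sum(weights[cat] for cat in scores)
    return sum(scores[cat] * weights[cat] for cat in scores) / total_weight

# With equal weights this reduces to a plain mean of the category scores:
scores = {"AWS S3": 100.0, "AWS VPC": 99.3, "GMCQ": 92.0}
weights = {"AWS S3": 1.0, "AWS VPC": 1.0, "GMCQ": 1.0}
print(round(overall_score(scores, weights), 1))  # 97.1
```

Non-uniform weights would let the benchmark emphasize some categories (say, security tasks) over others without changing the per-category numbers reported in the table.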

Quality Assurance

All benchmark tasks undergo rigorous review by multiple SRE experts. We regularly update V1 to reflect the latest cloud platform features and security best practices, ensuring the evaluation remains relevant and challenging. Costs are measured using standard API pricing per 1M output tokens.
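The per-1M-output-token pricing translates to a run cost as a simple proportion; a minimal sketch (the token count in the example is invented for illustration):

```python
def run_cost(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of a run, given the API price per 1M output tokens."""
    return output_tokens / 1_000_000 * price_per_million

# e.g. a hypothetical 250k-output-token evaluation at the $12.00/1M rate
# listed for Gemini-3.1-pro:
print(run_cost(250_000, 12.00))  # 3.0
```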

About Us


Rootly AI Labs

Advancing AI for Site Reliability Engineering

Rootly AI Labs is the research division of Rootly, dedicated to exploring how artificial intelligence can transform incident management and site reliability engineering. Our mission is to push the boundaries of what's possible when AI meets operational excellence.

We created SRE-skills-bench to provide the industry with a rigorous, transparent benchmark for evaluating how well language models understand and can assist with real-world SRE tasks. As AI becomes increasingly integrated into DevOps workflows, it's crucial to have standardized ways to measure and compare model capabilities.

Our team consists of experienced SRE practitioners, ML researchers, and software engineers who are passionate about building tools that make on-call less painful and incident response more effective. We believe that the future of SRE will be deeply augmented by AI, and we're committed to ensuring that augmentation is built on solid foundations.