
SRE-skills-bench

Can Language Models Resolve Real-world SRE Tasks?

Leaderboard

Overall performance ranking across all SRE tasks


| Rank | Model | Score | Cost / 1M output tokens | Date |
|------|-------|-------|-------------------------|------|
| 1 | OpenAI gpt-5.2-pro | 95.3 | $168.00 | 2025-12-12 |
| 2 | Google Gemini-3-pro | 94.9 | $12.00 | 2025-11-24 |
| 3 | OpenAI gpt-5.2 | 93.6 | $14.00 | 2025-12-11 |
| 4 | Anthropic opus-4.5 | 92.9 | $25.00 | 2025-11-24 |
| 5 | Anthropic sonnet-4.5 | 92.7 | $15.00 | 2025-11-24 |
| 6 | OpenAI gpt-5.1-codex-max | 92.5 | $10.00 | 2025-12-10 |
| 7 | OpenAI GPT-5.1 | 89.6 | $10.00 | 2025-11-24 |
| 8 | OpenAI GPT-5 | 86.9 | $10.00 | 2025-12-03 |
| 9 | Alibaba qwen3-vl-235b | 85.2 | $3.95 | 2025-11-24 |
| 10 | AWS nova-pro-v1 | 60.4 | $2.50 | 2025-12-05 |

Task Results

Performance breakdown across SRE task categories (scores shown as percentages).


| Model | Avg | AWS S3 | AWS VPC | AWS IAM | Azure Network | Azure Compute | Azure Kubernetes | GCP Network | GCP Compute | GCP Storage | GMCQ |
|-------|-----|--------|---------|---------|---------------|---------------|------------------|-------------|-------------|-------------|------|
| OpenAI gpt-5.2-pro | 95.3 | 92.8 | 95.5 | 93.5 | 97.9 | 97.1 | 97.0 | 94.7 | 97.7 | 97.7 | 89.5 |
| Google Gemini-3-pro | 94.9 | 95.7 | 95.0 | 97.1 | 96.8 | 96.4 | 95.3 | 95.4 | 94.4 | 93.0 | 89.7 |
| OpenAI gpt-5.2 | 93.6 | 89.9 | 93.7 | 93.1 | 97.9 | 95.6 | 93.2 | 94.5 | 94.4 | 95.0 | 89.0 |
| Anthropic opus-4.5 | 92.9 | 92.8 | 92.3 | 88.6 | 95.8 | 94.9 | 93.2 | 94.5 | 92.0 | 95.0 | 90.2 |
| Anthropic sonnet-4.5 | 92.7 | 88.4 | 95.5 | 93.5 | 98.9 | 89.1 | 94.0 | 93.6 | 93.2 | 91.9 | 89.0 |
| OpenAI gpt-5.1-codex-max | 92.5 | 89.0 | 93.0 | 91.0 | 97.0 | 92.0 | 94.0 | 92.0 | 93.0 | 95.0 | 89.0 |
| OpenAI GPT-5.1 | 89.6 | 82.6 | 89.6 | 88.6 | 92.6 | 91.2 | 90.2 | 90.4 | 88.9 | 92.6 | 89.2 |
| OpenAI GPT-5 | 86.9 | 83.0 | 86.0 | 82.0 | 94.0 | 89.0 | 89.0 | 84.0 | 91.0 | 84.0 | 87.0 |
| Alibaba qwen3-vl-235b | 85.2 | 76.8 | 85.5 | 86.5 | 88.4 | 89.1 | 87.2 | 84.0 | 82.1 | 86.0 | 86.2 |
| AWS nova-pro-v1 | 60.4 | 41.0 | 63.0 | 54.0 | 66.0 | 61.0 | 64.0 | 53.0 | 63.0 | 55.0 | 84.0 |
| AWS nova-2-lite | 49.2 | 34.0 | 52.5 | 46.1 | 47.4 | 48.9 | 47.4 | 45.2 | 47.5 | 42.6 | 80.0 |

News

Latest updates and announcements from the SRE-skills-bench project

Methodology

How we evaluate language models on Site Reliability Engineering tasks

Task Collection

We curate real-world SRE scenarios from production incidents, infrastructure challenges, and cloud platform configurations. Each task is designed to test specific SRE competencies, including cloud security, networking, compute management, and identity and access management.
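
As a rough illustration, each task can be thought of as a small structured record tying a scenario to a provider, a category, and a validated answer. The field names and the sample values below are hypothetical, sketched only to show the kind of metadata a task carries; they are not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SRETask:
    """Hypothetical shape of one benchmark task (illustrative only)."""
    task_id: str        # made-up identifier, e.g. "aws-iam-0042"
    provider: str       # "AWS", "Azure", or "GCP"
    category: str       # e.g. "IAM", "Network", "Kubernetes"
    question: str       # scenario drawn from a real-world incident
    choices: list[str]  # multiple-choice options
    answer_index: int   # index of the expert-validated correct choice

# Fabricated example, for illustration of the structure only:
example = SRETask(
    task_id="aws-iam-0042",
    provider="AWS",
    category="IAM",
    question="An EC2 instance role suddenly loses access to an S3 bucket. "
             "Which change most likely caused it?",
    choices=["An explicit Deny was added to the bucket policy",
             "The instance was rebooted",
             "The AMI was deprecated",
             "CloudTrail logging was disabled"],
    answer_index=0,
)
```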

Evaluation Framework

Models are evaluated with a multiple-choice question (MCQ) format to ensure consistent and reproducible results. Each question has been validated by experienced SRE practitioners for accuracy and relevance to real-world scenarios.
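
A minimal sketch of what such an MCQ evaluation loop could look like is shown below. The `query_model` callable and the task tuple layout are assumptions made for this sketch, not the benchmark's actual harness.

```python
def score_mcq(tasks, query_model):
    """Return the percentage of MCQ tasks answered correctly.

    `tasks` is assumed to be an iterable of (question, choices, correct_letter)
    tuples, and `query_model` any callable that returns a single option letter
    such as "A"; both are illustrative assumptions.
    """
    correct = 0
    for question, choices, correct_letter in tasks:
        # Present the question with lettered options, one prompt per task.
        prompt = question + "\n" + "\n".join(
            f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)
        )
        answer = query_model(prompt).strip().upper()
        if answer == correct_letter:
            correct += 1
    return 100.0 * correct / len(tasks)  # percentage, as in the leaderboard
```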

Scoring System

The overall score is calculated as a weighted average across all task categories, with each category weighted by its importance in typical SRE workflows. Costs reflect standard API pricing per 1 million output tokens.
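
Concretely, the weighted average can be computed as in the sketch below. The weights shown are placeholders (the benchmark's actual per-category weights are not listed in this section), and the sample scores are taken from the gpt-5.2-pro AWS columns purely to make the arithmetic concrete.

```python
def overall_score(category_scores, weights):
    """Weighted average of per-category scores (0-100 scale).

    Both arguments are dicts keyed by category name; the weights used in the
    example are illustrative, not the benchmark's own.
    """
    total_weight = sum(weights[c] for c in category_scores)
    return sum(category_scores[c] * weights[c] for c in category_scores) / total_weight

# Illustrative only: with equal weights this reduces to a simple mean.
scores = {"AWS S3": 92.8, "AWS VPC": 95.5, "AWS IAM": 93.5}
weights = {"AWS S3": 1.0, "AWS VPC": 1.0, "AWS IAM": 1.0}
print(round(overall_score(scores, weights), 1))  # 93.9
```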

Quality Assurance

All benchmark questions undergo rigorous review by multiple SRE experts. We regularly update the benchmark to reflect the latest cloud platform features and security best practices, ensuring the evaluation remains relevant and challenging.

About Us


Rootly AI Labs

Advancing AI for Site Reliability Engineering

Rootly AI Labs is the research division of Rootly, dedicated to exploring how artificial intelligence can transform incident management and site reliability engineering. Our mission is to push the boundaries of what's possible when AI meets operational excellence.

We created SRE-skills-bench to provide the industry with a rigorous, transparent benchmark for evaluating how well language models understand and can assist with real-world SRE tasks. As AI becomes increasingly integrated into DevOps workflows, it's crucial to have standardized ways to measure and compare model capabilities.

Our team consists of experienced SRE practitioners, ML researchers, and software engineers who are passionate about building tools that make on-call less painful and incident response more effective. We believe that the future of SRE will be deeply augmented by AI, and we're committed to ensuring that augmentation is built on solid foundations.