Introduction

Viktor Peacock
Welcome to Scaling SRE. This is a roundtable discussion where everyone is encouraged to contribute, challenge ideas, and share their own experiences.

Agenda

  • What does "scaling SRE" actually mean in your organisation?
  • What are the core ingredients of an SRE function that can grow from a handful of experts to 300+ engineers?
  • Where should AI be applied in alerting, SLOs, incident response, root cause, and chaos engineering?
  • What tooling is essential to avoid linear growth of people with traffic?
  • How do you build a culture where every engineer owns reliability?
  • How do you justify the investment to senior leadership?
  • What metrics prove success when operating at high throughput and tight SLAs?

Areas We'll Explore

📈 Scale
🤖 AI
🔧 Tooling
🤝 Culture

My Background

2009
Education
Computing Science
BSc
  • The foundation of everything
2010-2018
8 years
E-Commerce, Betting & Satellite Communications
Senior / Lead / Manager
C#
  • Backend services development
  • UI: Angular, React, WPF (Windows Desktop apps)
  • Hate SQL with a passion
2019
Education
Business
MBA
  • Focus on business strategy and scaling
2018-2020
2 years
E-Commerce
Senior
  • Transition period
2020-2024
4 years
FinTech
Lead / Head Of / Principal
Golang · AWS/GCP/Azure · Terraform · Platform Engineering · SRE
2025-Present
Current
Healthcare
Principal DevOps
Terraform · Azure · SRE · Golang

Working Preference

🏠 Remote
🛠️ SRE
⚙️ Operations
🚀 Transformation
🤝 Lead by Example
🤖 AI Focused

Outside of Work

🐕
Dogs
Massive dog person. Love border terriers
🏎️
Racing
Cars and Formula 1 + GT racing
🎹
Piano
Music and piano
🤖
AI
Recently tinkering with AI. Absolutely love it

Context and Scale

Form3
UK fintech processing 70% of UK Faster Payments, with some US/EU presence

Key Numbers

70%
UK Faster Payments
300
Senior Engineers
8
Incident Response
3
Cloud Providers
1200
TPS
Peak throughput
99.99%
Availability
Service target

Environments & Clusters

Stack = end-to-end copy of platform. Each stack has 3 clusters (GCP/AWS/Azure). Each cluster = 12 nodes.
3
Clusters per Stack
GCP, AWS, Azure
12
Nodes per Cluster
36 nodes per stack
50-70
APIs per Cluster
Environment        Stacks  Clusters  Nodes  APIs
Production         4       12        144    200-280
UAT                3       9         108    150-210
Dev/Integration    6       18        216    300-420
TOTAL              13      39        468    650-910

Scaling SRE

Scaling SRE is the shift from doing operations to building capabilities. It is about creating the platforms and practices that empower hundreds of engineers to own reliability without scaling the SRE team proportionally.

Traditional

10-15% SRE Ratio

Modern

4% SRE Ratio

Pillars of Scale

🏗️ Platform
🔌 Self-Service
📜 Standards
🤝 Culture

SLOs

Strategic Impact

  • Reliability from customer's point of view
  • Actionable alerts
  • Error budget prioritisation
  • Sustainable on-call
  • Data driven prioritisation
👤
SRE Lead - SLI/SLOs and Observability
Key Skillset
Prometheus God
Deep interest in both SLOs and SLAs
Understand business flows and metrics
Understands market
Can model availability to calculate true nines the system is designed for
Typical Day
Builds SLO tooling
Collaborates with other SRE Leads on common problems
Coaches payment team for 3 weeks to build SLOs
Coaches product and sales on how tight SLOs affect costs
Helps legal draft meaningful SLAs
Troubleshoots missing metrics
Manages vendors around observability

Tooling

Who played Wolfenstein?

Wolfenstein 3D Difficulty Levels
The Golden Rule: Don't force "Death Incarnate" on everyone. Meet your engineers where they are with tiered templates.

Product SLO Module

Minimal config Terraform module for scheme-level reliability targets.

Product SLO Module
terraform
module "payment_latency_slo" {
  source      = "./modules/slo-product"
  query       = "p99(payment_processing_duration_ms) < 250"
  sensitivity = "high"
  when_active = "wake-me-up"
}
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Stack-Aware Service SLO

Standardized templates for cluster-level infrastructure monitoring.

Stack-Aware Service SLO
terraform
module "cluster_cpu_slo" {
  source      = "./modules/slo-service"
  query       = "avg(container_cpu_usage) by (cluster) > 0.7"
  sensitivity = "medium"
  when_active = "business-hours-only"
}
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Bad Alert
container_cpu_usage_seconds_total{pod=~"payment.*"} / container_cpu_limit > 0.7
Good Alert
Payments might be delayed due to CPU bottleneck on prod-tier1-bank-cluster
A clean, English description of the failure mode, business impact, and a clear set of actionable steps for the on-call engineer.
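As a sketch, the "good alert" above could be encoded in a Prometheus alerting rule whose annotations carry the business context. The expression, thresholds, and runbook URL are illustrative only, not the production rules.

```yaml
groups:
  - name: payments-slo
    rules:
      - alert: PaymentCPUBottleneck
        # Illustrative expression; real rules are more involved
        expr: |
          sum(rate(container_cpu_usage_seconds_total{pod=~"payment.*"}[5m]))
            / sum(kube_pod_container_resource_limits{resource="cpu", pod=~"payment.*"}) > 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payments might be delayed due to CPU bottleneck on {{ $labels.cluster }}"
          description: "Payment pods are above 70% of their CPU limit. Check recent deploys; scale the node pool if sustained."
          runbook_url: "https://runbooks.example.com/payments/cpu-bottleneck"
```

The alert fires on the raw metric, but the on-call engineer only ever reads the annotations.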

Gaps & Challenges

🔗

SLO Dependencies

Mapping complex microservice dependencies to a single SLO is incredibly tough.

🏆

Single Dashboard

Creating a single 'Executive' health dashboard remained elusive for years.

⚖️

Product/Service SLO Balance

Finding the right balance between Service-level and Product-level SLOs is a constant struggle.

🗺️

Complexity

SLOs get messy when you only control half the payment journey; the 'other side' is a black box.

Performance Assurance

Strategic Impact

  • Find scaling limits of the system
  • Find optimal performance/cost balance
  • Unlocks chaos engineering
  • Unlocks incident rehearsal
  • Unlocks ad-hoc testing
  • Plan capacity 6 months ahead
👤
SRE Lead - Performance & Reliability
Key Skillset
Performance Tuning Specialist
Expert in Go and Distributed Systems
Deep understanding of K8s resource management
Debugging god
Typical Day
Design and build load testing frameworks
Analyze performance bottlenecks in core APIs
Advise product teams on capacity planning
Check performance deviations against baseline
Assign investigations to other teams
Review 6 months capacity forecasts

Tooling

PaymentProfile CRD

Custom Kubernetes resource for defining complex, time-bound payment simulation scenarios.

PaymentProfile CRD
yaml
apiVersion: reliability.f3.io/v1alpha1
kind: PaymentProfile
metadata:
  name: fps-peak-friday-morning
spec:
  scheme: "fps"
  peak_tps: 1200
  ramp_up_duration: "15m"
  steady_state_duration: "2h"
  schedule:
    start_time: "2026-01-30T08:00:00Z"
    end_time: "2026-01-30T10:15:00Z"
  geo_distribution:
    primary: "eu-west-1"
    failover: "us-east-1"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Ad-hoc Load Trigger

Quick bash utility to trigger a specific load scenario via CLI.

Ad-hoc Load Trigger
bash
#!/bin/bash
# Trigger ad-hoc load test
f3-load trigger --scheme fps --tps 500 --duration 10m --env uat
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Payment Profile

Gaps & Challenges

🎭

Realistic Profiles

Designing simulators that accurately match production behavior is incredibly difficult.

⚔️

Stack Contention

Environment stacks are expensive and often shared across teams, leading to noisy and unreliable performance results.

Not a Priority

Teams focus on shipping features and APIs, often neglecting the investment needed for simulators and performance tooling.

Incident Rehearsals

Strategic Impact

  • Operational muscle memory
  • Validation of on-call playbooks
  • Reduced panic during real incidents
  • Prevents real incidents
  • Protects reputation
👤
SRE Lead - Incident Response
Key Skillset
Master of the 'GameDay' philosophy
Excellent communicator under pressure
Expert in incident management workflows
Data analysis
Terraform/IaC
Systems thinking
Coach by heart
Typical Day
Design incident scenarios
Facilitate organization-wide drills
Conduct post-rehearsal analysis and feedback sessions
Build tooling to silence alerts/stacks
Build tooling to analyse alerts/incidents/response
Automate incident workflows, escalation policies, etc
Automate postmortem data gathering
Follow-up on postmortem actions
Learn from industry postmortems

Tooling

Incident Leaderboard

Fetch a ranking of incident frequency and response quality for specific teams.

Incident Leaderboard
bash
#!/bin/bash
# Get leaderboard for current team
f3-incident leaderboard --team core-payments --timeframe 30d
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Stack Silence Utility

Quickly silence all alerts on a specific stack during a rehearsal or maintenance.

Stack Silence Utility
bash
#!/bin/bash
# Silence all alerts on a stack
f3-incident silence --stack prod-eu-west-1-alpha --duration 2h --reason "Scheduled Rehearsal"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Chaos Trigger

Orchestrate a specific failure scenario to test team response in real-time.

Chaos Trigger
bash
#!/bin/bash
# Trigger a chaos scenario
f3-chaos trigger --scenario partial-network-partition --target-stack uat-beta
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Weekly Rehearsal Window

Every Tuesday from 1:00 PM to 5:00 PM, we rehearse failure modes. We alternate between Major and Minor incidents every other week.

Prep & Calibration
13:00 - 14:00
Live Rehearsal
14:00 - 15:00
Learning & Debrief
15:00 - 16:00
Tidy up & Actions
16:00 - 17:00
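A minimal sketch of how the rehearsal window could be wired to the silence utility shown earlier. It builds the `f3-incident` command for review rather than executing it, so it can be dry-run before a rehearsal; the stack name is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper for the Tuesday rehearsal window (13:00-17:00).
STACK="uat-beta"
WINDOW_START="13:00"
WINDOW_END="17:00"

# Duration in whole hours, derived from the window bounds
DURATION_H=$(( ${WINDOW_END%%:*} - ${WINDOW_START%%:*} ))

# Build (but don't run) the silence command for human review
CMD="f3-incident silence --stack ${STACK} --duration ${DURATION_H}h --reason 'Weekly Rehearsal'"
echo "$CMD"
```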

Gaps & Challenges

👑

Leadership Buy-in

Senior engineering leaders MUST attend rehearsals to show they care about reliability culture.

⚔️

Stack Contention

Breaking a shared stack for 2-3 hours is expensive and can block other teams.

💰

The Price of Readiness

It is an expensive exercise, but still far cheaper than a messy, uncoordinated P1 incident.

Patience & Perseverance

It takes months to build the habit and see the real benefits. You cannot give up early.

📝

Manual Learnings

Data gathering and insight extraction remain largely manual; we need a better way to automate this.

Chaos Engineering

Strategic Impact

  • Prevents major system-wide incidents
  • Protects brand reputation through proven resilience
  • Builds institutional trust in system reliability
  • Enables a culture of continuous learning from failure
  • Unlocks ability to serve mission-critical Tier-1 customers
  • Empirically verifies Disaster Recovery (DR) plans
  • Discovers 'unknown unknowns' in complex architecture
👤
SRE Lead - Chaos & Resilience
Key Skillset
Polyglot (Programmer & Platform Engineer)
Systems Thinking
Experimentation and Learning Mindset
Understand End-to-End Flows
Understands both Platform and Services
Scaling Mindset (Build capabilities, don't just run tests)
Typical Day
Review chaos results from previous night
Build new chaos scenarios
Build tooling to enable scaling of assertions
Build dashboards for verifying behaviour during experiments
Write up tickets on identified system weaknesses
Support with scenario execution and troubleshooting
Work with InfoSec on aligning chaos with DR plans

Tooling

ChaosMesh: Network Partition

Custom ChaosMesh resource to simulate total regional connectivity loss for GCP nodes.

ChaosMesh: Network Partition
yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: gcp-connectivity-loss
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      cloud: "gcp"
  direction: both
  externalTargets:
    - "8.8.8.8" # Simulate external scheme connectivity
  duration: "5m"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

f3-chaos CLI

Orchestrate complex scenarios and verify assertions directly from your terminal or CI/CD.

f3-chaos CLI
bash
#!/bin/bash
# Trigger the GCP failover scenario
f3-chaos run --scenario gcp-failover-validation --wait-for-assertions

# Output: 
# [SUCCESS] GCP Connectivity lost
# [SUCCESS] Assertion: Product SLO Latency < 250ms
# [SUCCESS] Assertion: Service SLO Success > 99.99%
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Given
The platform is healthy and payments are flowing correctly
And
Payment simulators are generating a steady 300 TPS
When
GCP Cloud connectivity is completely lost
Then
Payments continue flowing through AWS and Azure
Owner: SRE Lead - Chaos
And
Product SLO (End-to-End Latency) remains < 250ms
Owner: Payments Team
And
Service SLO (Success Rate) remains > 99.99%
Owner: Infrastructure Team
And
All automated failover assertions pass successfully
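One way the Given/When/Then above could be expressed for an automated runner. The scenario file format and field names are hypothetical, simplified in the same spirit as the other examples; the fault references the NetworkChaos resource defined earlier.

```yaml
# Hypothetical f3-chaos scenario file; field names are illustrative
scenario: gcp-failover-validation
preconditions:
  platform_healthy: true
  simulator_tps: 300
fault:
  ref: gcp-connectivity-loss        # the ChaosMesh NetworkChaos resource
assertions:
  - name: payments-reroute
    owner: sre-lead-chaos
    check: "payments flow via AWS and Azure"
  - name: product-slo-latency
    owner: payments-team
    check: "p99 end-to-end latency < 250ms"
  - name: service-slo-success
    owner: infrastructure-team
    check: "success rate > 99.99%"
```

Attaching an owner to each assertion is what makes the failure notification in the example below routable to the right team.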

GCP Failover Validation

80% Pass Rate
Payments continue flowing through AWS and Azure
Owner: SRE Lead - Chaos
Service SLO (Success Rate) remains > 99.99%
Owner: Infrastructure Team
×
Product SLO (End-to-End Latency) remains < 250ms
Owner: Payments Team
🔔 Notifying Payments Team...

24-Hour Operational Cycle

Our environments are fully utilized 24/7, transitioning automatically between human development, automated performance baselining, and proactive chaos experiments.

Development & Delivery
07:00 - 19:00
Performance Testing
19:00 - 00:00
Chaos Experiments
00:00 - 07:00
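A sketch of how the hand-offs could be driven from cron, reusing the f3 CLIs shown earlier. Times are UTC; the stack name, scenario name, and exact parameters are hypothetical.

```
# Illustrative crontab for the 24-hour stack cycle (UTC)
0 19 * * *  f3-load trigger --scheme fps --tps 1200 --duration 5h --env uat    # performance baselining, 19:00-00:00
0 0  * * *  f3-chaos run --scenario nightly-rotation --wait-for-assertions     # chaos window, 00:00-07:00
0 7  * * *  f3-incident silence --stack uat-beta --duration 12h --reason "Dev" # quiet non-prod alerts for dev hours
```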

Gaps & Challenges

💰

The Price of Impact

Hugely expensive to do well. It's not just the build cost, but the downstream work generated for other teams.

⚔️

Shared Stack Contention

Shared environments break often. Branch overrides or unrelated changes constantly void experimental results.

The 7-Hour Limit

You can only run so much in a single night. Scheduling diverse scenarios becomes a complex orchestration challenge.

👹

The Final Boss (Prod)

Running chaos in production requires a level of maturity that is incredibly rare in the financial sector.

⚙️

Tooling Misalignment

We constantly struggled to align our chaos engineering tools with our troubleshooting and observability stacks.

⚖️

SLO Dependency

Your automated assertions are only ever as good as the underlying SLOs they query.

Head of SRE

Strategic Impact

  • Identifies critical gaps in engineering capabilities
  • Orchestrates the engineering-wide reliability roadmap
  • Sees synergies between capabilities to create impact combos
  • Secures executive buy-in and financial alignment for reliability
  • Champions reliability culture through talks and evangelism
  • Selects strategic vendors for the observability ecosystem
👤
Head of SRE
Key Skillset
Orchestrates the entire reliability umbrella and roadmap
Engineer at heart (Can dive into App code, IaC, and Metrics)
Big picture / systems thinking
Leads from the back by example and persuasion
FinOps
Collaborates across functions to understand unique challenges
Typical Day
Runs roadmap sessions for SLO, Chaos, Performance, and IM streams
Pairs with SRE Leads on specific scenarios or investigations
Facilitates workshops between leads to identify common problems
Contributes to the codebase (app, IaC, or tooling) 20% of the time
Manages vendor calls regarding renewals and missing capabilities
Prepares and runs demos/townhalls for the wider engineering org

Tooling

  • LucidChart boards with various examples, plans, roadmaps, designs, etc.
  • OKRs for executive reporting
  • GitHub milestones for workstream tracking
  • Slack / Zoom / Loom

Examples

Let's use the load test plugin to simulate realistic payment load during chaos tests, then assert on the SLOs we're building in the parallel stream

Let's report on incidents averted by demonstrating how many issues chaos engineering identified and fixed before they surfaced in production

Can we use SLOs and Jupyter notebooks to speed up the troubleshooting process?

Gaps & Challenges

📈

The Maturity Price Wall

Maturity levels 1-2 are relatively cheap. Level 3 is manageable, but Level 4+ costs grow exponentially.

🧩

Product Support Gap

Finding product people who truly understand the technical nuances of the SRE domain is incredibly difficult.

⚖️

The Generalist Burden

To do this job well, you must be exceptionally well-rounded, pivoting instantly between deep tech and business strategy.

📉

Fragile Funding

Reliability is often the first area to be slashed during budget cuts—it's hard to justify 'nothing happening'.

🏛️

Executive Influence

The role demands legitimate organizational 'clout' and a seat at the table to drive cross-functional impact.

🔭

The Lead Alignment Struggle

Getting specialized SRE leads to see the 'big picture' over their individual domains is a constant leadership challenge.

Participant Questions

How to improve the release process by leveraging AI to make releases lower risk so the team can release every day
👤
Antonio Picernio

Challenges

🧪 Lack of Tests
🐢 Slow CAB
📦 Large Releases
🤝 Trust Gap

Initial Thoughts


Fundamentally, if we cannot release every day with low risk, it means we are missing core capabilities like automated tests, pipelines, and SLOs. Or, we have bureaucracy like slow CAB processes where changes are queued, creating a large, risky blast radius. Finally, we might have a very large/complex app that simply cannot be shipped in small chunks. AI isn't a magic bullet for a broken foundation.

Solutions

Risk & Impact Scoring

Score changes using specific criteria to come up with a weighted risk/impact profile for every release.

Unlocks: Prioritized manual validation
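A minimal sketch of weighted scoring. The criteria, weights, and numbers are purely illustrative, not an established rubric; the point is that the output routes high-scoring releases to manual validation first.

```shell
#!/bin/bash
# Hypothetical weighted risk score for a release candidate
lines_changed=420      # diff size
touched_services=3     # blast radius
test_coverage=78       # percent on the changed packages
has_migration=1        # 1 if the release includes a DB migration

# Weights are illustrative; tune them against your own incident history
score=$(( lines_changed / 100 + touched_services * 2 + (100 - test_coverage) / 10 + has_migration * 5 ))
echo "risk score: ${score}"
```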

Automated PR Reviews

Leverage tools like CodeRabbit to perform deep architectural and logic reviews before a human ever sees the code.

Unlocks: Cleaner code and faster review cycles

Lower-Env SLOs

Report on SLOs in lower environments (Dev/UAT) to catch regressions before they hit production.

Unlocks: Shift-left reliability

Example

20+
Releases / Day
Form3 Release Cadence

At Form3, we released dozens of times per day and nobody knew it was happening. It was a complete non-event. This was all before AI. The basics—automation, tests, and monitoring—matter far more than any AI tooling.

How to shift to a modern SRE model at pace (without loss of velocity)
👤
Rob Dudley

Initial Thoughts


There is always a tradeoff. We cannot magically create time, but we can reprioritise to buy more time. I think it's a mixture of having an SRE Head/Lead who knows what good looks like, then shifting priorities to focus on important but non-urgent SRE activities. It's also about doing next to no Ops/toil and focusing on engineering capabilities.

Challenges

🔥 Firefighting
🛡️ Resistance
🛠️ Poor Tooling

Solutions

Buy More Time

A quarterly theme: 'What can we do this quarter to have a lot more time next quarter?'. This creates a virtuous cycle of efficiency.

Unlocks: Compounding engineering velocity

Estate as Code

Move everything to Terraform. No major changes unless they are in code. If it's not in code, it doesn't scale, so it won't be fast.

Unlocks: Infinite scalability of infrastructure

The Polyglot Bet

Hire polyglots who can move across the stack. This eliminates handovers between teams which is the primary killer of pace.

Unlocks: Zero-friction delivery

Lean Practices

Implement 'Quiet Afternoons' and 'Investment Wednesdays' where the entire team focuses solely on building tooling and automation.

Unlocks: Consistent innovation cadence

Outsource Side Quests

Pick vendors such that your side quest is their main quest. Don't build what you can buy from someone whose business depends on it.

Unlocks: Focus on your core business value

Examples

20d
Saved / Year
AKS Maintenance Automation

Our AKS maintenance used to consume about 20 days per year of manual effort. By investing in automation, this is now a zero-touch process, buying back nearly a month of engineering time annually.

95%
Faster
RBAC & Access Requests

Access requests used to take 10-20 minutes of manual work per request. We moved it all into Terraform and built Cursor commands that guide engineers through permission changes and raise PRs automatically. Now, we only spend seconds on the final review.
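As an illustrative sketch, in the spirit of the other simplified examples, an access request as code might look like the module below. The module name and fields are hypothetical; real requests would carry more fields and an approval chain.

```terraform
# Hypothetical module; illustrative only
module "access_payments_readonly" {
  source    = "./modules/rbac-request"
  principal = "jane.doe@example.com"
  role      = "Reader"
  scope     = "rg-payments-prod"
  expires   = "2026-03-01"
}
```

An AI-assisted editor command can template this block and raise the PR, leaving humans only the final review.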

How to integrate Agentic AI as part of the team to improve and increase throughput via smart automation
👤
Rob Dudley

Initial Thoughts


I haven't yet seen Agentic AI deployed in production within the SRE domain. We are currently still struggling to scale basic AI to wider engineering, so scaling agentic AI feels like the 'next level' in the AI adoption maturity model.

Challenges

🤝 Trust
♾️ Run-away Automation
🔒 Safety (RBAC)
💸 Cost at Scale

Solutions

Level 1: Parallel Operation

The AI runs in parallel with humans, suggesting what it would have done without taking any action.

Unlocks: Initial trust and baseline data

Level 2: Supervised Execution

The AI operates under direct human supervision—we sit and watch it execute every step.

Unlocks: Safe operational feedback loops

Level 3: Low-Risk Autonomy

The AI is given full autonomy but only within low-risk/low-impact domains.

Unlocks: Increased throughput for simple tasks

Level 4: Full Ownership

The AI has full ownership of its domain, escalating only when necessary.

Unlocks: True team scaling

Level 5: Self-Improvement (AGI)

The AI begins to self-improve its own logic and capabilities.

Unlocks: AGI / Paradigm shift

Example

FastBurn
Hypothetical
SLO Alert Scenario

A fast SLO burn rate alert fires. An Agent detects a release correlating with the issue. Based on the maturity model, it either reports the anomaly, suggests a rollback, offers to perform the rollback, or just executes it.
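The maturity-gating in that scenario could be sketched as a simple dispatch. The trust level, release version, and wording are all hypothetical; the structure is what matters: the same detection, with the action escalating as trust grows.

```shell
#!/bin/bash
# Hypothetical maturity-gated response to a fast SLO burn alert
LEVEL=2   # current trust level granted to this agent (1-4)

case "$LEVEL" in
  1) ACTION="report: release v1.42 correlates with the burn" ;;
  2) ACTION="suggest: rollback v1.42 (awaiting human approval)" ;;
  3) ACTION="offer: one-click rollback of v1.42" ;;
  4) ACTION="execute: rolling back v1.42" ;;
esac
echo "$ACTION"
```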

To what extent do you think AI agents should be used natively in Chaos engineering (bringing their own blend of chaos) vs using them to model scenarios and create helper tools and pipelines to model specific failure modes
👤
Alistair McLaurin

Initial Thoughts


AI should be your buddy through the chaos maturity model. My experience is that building tooling and interpreting results are the two most difficult hurdles—so I'd focus AI there first. High maturity requires more AI; low maturity plus AI will just create complexity you won't understand.

Challenges

🧐 Interpreting Results
🏗️ Tooling Burden
😱 Production Fear
😵‍💫 Misunderstood Failure

The Chaos Maturity Model

The Analysis Buddy

Level 1: AI helps you understand your system architecture and brainstorm scenarios on paper.

Unlocks: Rigorous experiment design

The Builder

Level 2: AI helps build the custom tooling, dashboards, and reporting capabilities for chaos.

Unlocks: Reduced engineering overhead

The Aggregator

Level 3: AI queries SLOs, logs, and metrics to bring disparate information into a cohesive view.

Unlocks: Faster result interpretation

The Inventor

Level 4-5: AI starts inventing new scenarios and novel faults based on system knowledge.

Unlocks: Discovery of 'unknown unknowns'

Example

3
Maturity Level
Form3 Chaos Practice

At Form3, we reached Level 3 without any AI. Today, I'd use AI to turn those technical logs into customer-friendly reports, explaining exactly how we validated their RTO/RPO contracts through controlled failures.

The oldest question in Enterprise IT: to what extent should SRE be a separate function and where should it embed the skills in core platform and service Dev/Ops teams
👤
Anonymous

Initial Thoughts


Capability, tooling, and expertise should be centralised. You need domain experts who build tooling for others to use. The rest of SRE—incident response, observability, scaling—should be handled by the whole engineering org. It's like building an airplane: you can't just hire a 'reliability team' after it's built to keep it in the sky. We are all responsible for reliability.

Challenges

Dev Reluctance
🧱 Knowledge Silos
Delayed Priority

Solutions

The SRE Guild

A formal guild where teams and individuals rotate in and out for fixed periods to solve their specific reliability problems.

Unlocks: Organisation-wide skill elevation

Embedded SRE Leads

SRE leads embedded into each domain who collaborate 2-4 times per week on common problems and cross-pollinate knowledge.

Unlocks: Consistent standards across domains

Greenfield Hiring

Hire top talent from the start who already understand that SRE is an engineering discipline, not a support function.

Unlocks: Self-sustaining reliability culture

Example

0%
Ownership
Of other teams' alerts

Our SRE leads build the SLO tooling and modules for the entire organization, but they are not responsible for using that tooling or responding to the resulting alerts. They build the 'car', but the domain teams drive it.

Closing Thoughts

We've only touched the surface of what it takes to scale SRE. I hope this gave you an insight into what I've built in the past, and just as importantly, where I struggled or failed.

I felt privileged to have had this opportunity—it was mega fun, but also incredibly expensive. We reached the level we did because of my amazing colleagues; we hired people with both deep technical skills and an incredible can-do, hungry-to-learn attitude.

I still have a soft spot for SRE. If you have interesting challenges and want to benefit from my experience, please reach out directly via LinkedIn or through Prism.
Wolfenstein Final Selection
PS. I'm going with 'Über'—because if it isn't challenging, it isn't SRE.
Viktor Peacock