Introduction

Viktor Peacock
Welcome to Scaling SRE. This is a roundtable discussion where everyone is encouraged to contribute, challenge ideas, and share their own experiences.

Agenda

  • What does "scaling SRE" actually mean in your organisation?
  • What are the core ingredients of an SRE function that can grow from a handful of experts to 300+ engineers?
  • Where should AI be applied in alerting, SLOs, incident response, root cause, and chaos engineering?
  • What tooling is essential to avoid linear growth of people with traffic?
  • How do you build a culture where every engineer owns reliability?
  • How do you justify the investment to senior leadership?
  • What metrics prove success when operating at high throughput and tight SLAs?

Areas We'll Explore

📈 Scale
🤖 AI
🔧 Tooling
🤝 Culture

My Background

2009
Education
Computing Science
BSc
  • The foundation of everything
2010-2018
8 years
E-Commerce, Betting & Satellite Communications
Senior / Lead / Manager
C#
  • Backend services development
  • UI: Angular, React, WPF (Windows Desktop apps)
  • Hate SQL with a passion
2019
Education
Business
MBA
  • Focus on business strategy and scaling
2018-2020
2 years
E-Commerce
Senior
  • Transition period
2020-2024
4 years
FinTech
Lead / Head Of / Principal
Golang · AWS/GCP/Azure · Terraform · Platform Engineering · SRE
2025-Present
Current
Healthcare
Principal DevOps
Terraform · Azure · SRE · Golang

Working Preference

🏠 Remote
🛠️ SRE
⚙️ Operations
🚀 Transformation
🤝 Lead by Example
🤖 AI Focused

Outside of Work

🐕
Dogs
Massive dog person. Love border terriers
🏎️
Racing
Cars and Formula 1 + GT racing
🎹
Piano
Music and piano
🤖
AI
Recently tinkering with AI. Absolutely love it

Context and Scale

Form3
UK fintech processing 70% of UK Faster Payments, with some US/EU presence

Key Numbers

70%
UK Faster Payments
300
Senior Engineers
8
Incident Response
3
Cloud Providers
1200
TPS
Peak throughput
99.99%
Availability
Service target

Environments & Clusters

Stack = end-to-end copy of platform. Each stack has 3 clusters (GCP/AWS/Azure). Each cluster = 12 nodes.
3
Clusters per Stack
GCP, AWS, Azure
12
Nodes per Cluster
36 nodes per stack
50-70
APIs per Cluster
Environment        Stacks  Clusters  Nodes  APIs
Production         4       12        144    200-280
UAT                3       9         108    150-210
Dev/Integration    6       18        216    300-420
TOTAL              13      39        468    650-910

Scaling SRE

Scaling SRE is the shift from doing operations to building capabilities. It is about creating the platforms and practices that empower hundreds of engineers to own reliability without scaling the SRE team proportionally.

Traditional

10-15% SRE Ratio

Modern

4% SRE Ratio

Pillars of Scale

🏗️ Platform
🔌 Self-Service
📜 Standards
🤝 Culture

SLOs

Strategic Impact

  • Reliability from customer's point of view
  • Actionable alerts
  • Error budget prioritisation
  • Sustainable on-call
  • Data driven prioritisation
👤
SRE Lead - SLI/SLOs and Observability
Key Skillset
Prometheus God
Deep interest in both SLOs and SLAs
Understand business flows and metrics
Understands market
Can model availability to calculate true nines the system is designed for
Typical Day
Builds SLO tooling
Collaborates with other SRE Leads on common problems
Coaches payment team for 3 weeks to build SLOs
Coaches product and sales on how tight SLOs affect costs
Helps legal draft meaningful SLAs
Troubleshoots missing metrics
Manages vendors around observability

Tooling

Who played Wolfenstein?

Wolfenstein 3D Difficulty Levels
The Golden Rule: Don't force "Death Incarnate" on everyone. Meet your engineers where they are with tiered templates.

Product SLO Module

Minimal config Terraform module for scheme-level reliability targets.

Product SLO Module
terraform
module "payment_latency_slo" {
  source      = "./modules/slo-product"
  query       = "p99(payment_processing_duration_ms) < 250"
  sensitivity = "high"
  when_active = "wake-me-up"
}
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Stack-Aware Service SLO

Standardized templates for cluster-level infrastructure monitoring.

Stack-Aware Service SLO
terraform
module "cluster_cpu_slo" {
  source      = "./modules/slo-service"
  query       = "avg(container_cpu_usage) by (cluster) > 0.7"
  sensitivity = "medium"
  when_active = "business-hours-only"
}
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Bad Alert
container_cpu_usage_seconds_total{pod=~"payment.*"} / container_cpu_limit > 0.7
Good Alert
Payments might be delayed due to CPU bottleneck on prod-tier1-bank-cluster
A clean, English description of the failure mode, business impact, and a clear set of actionable steps for the on-call engineer.
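As a sketch, the "good alert" above could be encoded in a Prometheus alerting rule whose annotations carry the business context. The expression, thresholds, and runbook URL are illustrative only, not the production rules.

```yaml
groups:
  - name: payments-slo
    rules:
      - alert: PaymentCPUBottleneck
        # Illustrative expression; real rules are more involved
        expr: |
          sum(rate(container_cpu_usage_seconds_total{pod=~"payment.*"}[5m]))
            / sum(kube_pod_container_resource_limits{resource="cpu", pod=~"payment.*"}) > 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Payments might be delayed due to CPU bottleneck on {{ $labels.cluster }}"
          description: "Payment pods are above 70% of their CPU limit. Check recent deploys; scale the node pool if sustained."
          runbook_url: "https://runbooks.example.com/payments/cpu-bottleneck"
```

The alert fires on the raw metric, but the on-call engineer only ever reads the annotations.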

Gaps & Challenges

🔗

SLO Dependencies

Mapping complex microservice dependencies to a single SLO is incredibly tough.

🏆

Single Dashboard

Creating a single 'Executive' health dashboard remained elusive for years.

⚖️

Product/Service SLO Balance

Finding the right balance between Service-level and Product-level SLOs is a constant struggle.

🗺️

Complexity

SLOs get messy when you only control half the payment journey; the 'other side' is a black box.

Performance Assurance

Strategic Impact

  • Find scaling limits of the system
  • Find optimal performance/cost balance
  • Unlocks chaos engineering
  • Unlocks incident rehearsal
  • Unlocks ad-hoc testing
  • Plan capacity 6 months ahead
👤
SRE Lead - Performance & Reliability
Key Skillset
Performance Tuning Specialist
Expert in Go and Distributed Systems
Deep understanding of K8s resource management
Debugging god
Typical Day
Design and build load testing frameworks
Analyze performance bottlenecks in core APIs
Advise product teams on capacity planning
Check performance deviations against baseline
Assign investigations to other teams
Review 6 months capacity forecasts

Tooling

PaymentProfile CRD

Custom Kubernetes resource for defining complex, time-bound payment simulation scenarios.

PaymentProfile CRD
yaml
apiVersion: reliability.f3.io/v1alpha1
kind: PaymentProfile
metadata:
  name: fps-peak-friday-morning
spec:
  scheme: "fps"
  peak_tps: 1200
  ramp_up_duration: "15m"
  steady_state_duration: "2h"
  schedule:
    start_time: "2026-01-30T08:00:00Z"
    end_time: "2026-01-30T10:15:00Z"
  geo_distribution:
    primary: "eu-west-1"
    failover: "us-east-1"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Ad-hoc Load Trigger

Quick bash utility to trigger a specific load scenario via CLI.

Ad-hoc Load Trigger
bash
#!/bin/bash
# Trigger ad-hoc load test
f3-load trigger --scheme fps --tps 500 --duration 10m --env uat
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Payment Profile

Gaps & Challenges

🎭

Realistic Profiles

Designing simulators that accurately match production behavior is incredibly difficult.

⚔️

Stack Contention

Environment stacks are expensive and often shared across teams, leading to noisy and unreliable performance results.

Not a Priority

Teams focus on shipping features and APIs, often neglecting the investment needed for simulators and performance tooling.

Incident Rehearsals

Strategic Impact

  • Operational muscle memory
  • Validation of on-call playbooks
  • Reduced panic during real incidents
  • Prevents real incidents
  • Protects reputation
👤
SRE Lead - Incident Response
Key Skillset
Master of the 'GameDay' philosophy
Excellent communicator under pressure
Expert in incident management workflows
Data analysis
Terraform/IaC
Systems thinking
Coach by heart
Typical Day
Design incident scenarios
Facilitate organization-wide drills
Conduct post-rehearsal analysis and feedback sessions
Build tooling to silence alerts/stacks
Build tooling to analyse alerts/incidents/response
Automate incident workflows, escalation policies, etc
Automate postmortem data gathering
Follow-up on postmortem actions
Learn from industry postmortems

Tooling

Incident Leaderboard

Fetch a ranking of incident frequency and response quality for specific teams.

Incident Leaderboard
bash
#!/bin/bash
# Get leaderboard for current team
f3-incident leaderboard --team core-payments --timeframe 30d
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Stack Silence Utility

Quickly silence all alerts on a specific stack during a rehearsal or maintenance.

Stack Silence Utility
bash
#!/bin/bash
# Silence all alerts on a stack
f3-incident silence --stack prod-eu-west-1-alpha --duration 2h --reason "Scheduled Rehearsal"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Chaos Trigger

Orchestrate a specific failure scenario to test team response in real-time.

Chaos Trigger
bash
#!/bin/bash
# Trigger a chaos scenario
f3-chaos trigger --scenario partial-network-partition --target-stack uat-beta
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Weekly Rehearsal Window

Every Tuesday from 1:00 PM to 5:00 PM, we rehearse failure modes. We alternate between Major and Minor incidents every other week.

Prep & Calibration
13:00 - 14:00
Live Rehearsal
14:00 - 15:00
Learning & Debrief
15:00 - 16:00
Tidy up & Actions
16:00 - 17:00
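A minimal sketch of how the rehearsal window could be wired to the silence utility shown earlier. It builds the `f3-incident` command for review rather than executing it, so it can be dry-run before a rehearsal; the stack name is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper for the Tuesday rehearsal window (13:00-17:00).
STACK="uat-beta"
WINDOW_START="13:00"
WINDOW_END="17:00"

# Duration in whole hours, derived from the window bounds
DURATION_H=$(( ${WINDOW_END%%:*} - ${WINDOW_START%%:*} ))

# Build (but don't run) the silence command for human review
CMD="f3-incident silence --stack ${STACK} --duration ${DURATION_H}h --reason 'Weekly Rehearsal'"
echo "$CMD"
```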

Gaps & Challenges

👑

Leadership Buy-in

Senior engineering leaders MUST attend rehearsals to show they care about reliability culture.

⚔️

Stack Contention

Breaking a shared stack for 2-3 hours is expensive and can block other teams.

💰

The Price of Readiness

It is an expensive exercise, but still far cheaper than a messy, uncoordinated P1 incident.

Patience & Perseverance

It takes months to build the habit and see the real benefits. You cannot give up early.

📝

Manual Learnings

Data gathering and insight extraction remain largely manual; we need a better way to automate this.

Chaos Engineering

Strategic Impact

  • Prevents major system-wide incidents
  • Protects brand reputation through proven resilience
  • Builds institutional trust in system reliability
  • Enables a culture of continuous learning from failure
  • Unlocks ability to serve mission-critical Tier-1 customers
  • Empirically verifies Disaster Recovery (DR) plans
  • Discovers 'unknown unknowns' in complex architecture
👤
SRE Lead - Chaos & Resilience
Key Skillset
Polyglot (Programmer & Platform Engineer)
Systems Thinking
Experimentation and Learning Mindset
Understand End-to-End Flows
Understands both Platform and Services
Scaling Mindset (Build capabilities, don't just run tests)
Typical Day
Review chaos results from previous night
Build new chaos scenarios
Build tooling to enable scaling of assertions
Build dashboards for verifying behaviour during experiments
Write up tickets on identified system weaknesses
Support with scenario execution and troubleshooting
Work with InfoSec on aligning chaos with DR plans

Tooling

ChaosMesh: Network Partition

Custom ChaosMesh resource to simulate total regional connectivity loss for GCP nodes.

ChaosMesh: Network Partition
yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: gcp-connectivity-loss
spec:
  action: partition
  mode: one
  selector:
    labelSelectors:
      cloud: "gcp"
  direction: both
  externalTargets:
    - "8.8.8.8" # Simulate external scheme connectivity
  duration: "5m"
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

f3-chaos CLI

Orchestrate complex scenarios and verify assertions directly from your terminal or CI/CD.

f3-chaos CLI
bash
#!/bin/bash
# Trigger the GCP failover scenario
f3-chaos run --scenario gcp-failover-validation --wait-for-assertions

# Output: 
# [SUCCESS] GCP Connectivity lost
# [SUCCESS] Assertion: Product SLO Latency < 250ms
# [SUCCESS] Assertion: Service SLO Success > 99.99%
* Illustrative purposes only. Key fields, complex queries, and parameters have been simplified for clarity.

Examples

Given
The platform is healthy and payments are flowing correctly
And
Payment simulators are generating a steady 300 TPS
When
GCP Cloud connectivity is completely lost
Then
Payments continue flowing through AWS and Azure
Owner: SRE Lead - Chaos
And
Product SLO (End-to-End Latency) remains < 250ms
Owner: Payments Team
And
Service SLO (Success Rate) remains > 99.99%
Owner: Infrastructure Team
And
All automated failover assertions pass successfully
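One way the Given/When/Then above could be expressed for an automated runner. The scenario file format and field names are hypothetical, simplified in the same spirit as the other examples; the fault references the NetworkChaos resource defined earlier.

```yaml
# Hypothetical f3-chaos scenario file; field names are illustrative
scenario: gcp-failover-validation
preconditions:
  platform_healthy: true
  simulator_tps: 300
fault:
  ref: gcp-connectivity-loss        # the ChaosMesh NetworkChaos resource
assertions:
  - name: payments-reroute
    owner: sre-lead-chaos
    check: "payments flow via AWS and Azure"
  - name: product-slo-latency
    owner: payments-team
    check: "p99 end-to-end latency < 250ms"
  - name: service-slo-success
    owner: infrastructure-team
    check: "success rate > 99.99%"
```

Attaching an owner to each assertion is what makes the failure notification in the example below routable to the right team.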

GCP Failover Validation

80% Pass Rate
Payments continue flowing through AWS and Azure
Owner: SRE Lead - Chaos
Service SLO (Success Rate) remains > 99.99%
Owner: Infrastructure Team
×
Product SLO (End-to-End Latency) remains < 250ms
Owner: Payments Team
🔔 Notifying Payments Team...

24-Hour Operational Cycle

Our environments are fully utilized 24/7, transitioning automatically between human development, automated performance baselining, and proactive chaos experiments.

Development & Delivery
07:00 - 19:00
Performance Testing
19:00 - 00:00
Chaos Experiments
00:00 - 07:00
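A sketch of how the hand-offs could be driven from cron, reusing the f3 CLIs shown earlier. Times are UTC; the stack name, scenario name, and exact parameters are hypothetical.

```
# Illustrative crontab for the 24-hour stack cycle (UTC)
0 19 * * *  f3-load trigger --scheme fps --tps 1200 --duration 5h --env uat    # performance baselining, 19:00-00:00
0 0  * * *  f3-chaos run --scenario nightly-rotation --wait-for-assertions     # chaos window, 00:00-07:00
0 7  * * *  f3-incident silence --stack uat-beta --duration 12h --reason "Dev" # quiet non-prod alerts for dev hours
```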

Gaps & Challenges

💰

The Price of Impact

Hugely expensive to do well. It's not just the build cost, but the downstream work generated for other teams.

⚔️

Shared Stack Contention

Shared environments break often. Branch overrides or unrelated changes constantly void experimental results.

The 7-Hour Limit

You can only run so much in a single night. Scheduling diverse scenarios becomes a complex orchestration challenge.

👹

The Final Boss (Prod)

Running chaos in production requires a level of maturity that is incredibly rare in the financial sector.

⚙️

Tooling Misalignment

We constantly struggled to align our chaos engineering tools with our troubleshooting and observability stacks.

⚖️

SLO Dependency

Your automated assertions are only ever as good as the underlying SLOs they query.

Head of SRE

Strategic Impact

  • Identifies critical gaps in engineering capabilities
  • Orchestrates the engineering-wide reliability roadmap
  • Sees synergies between capabilities to create impact combos
  • Secures executive buy-in and financial alignment for reliability
  • Champions reliability culture through talks and evangelism
  • Selects strategic vendors for the observability ecosystem
👤
Head of SRE
Key Skillset
Orchestrates the entire reliability umbrella and roadmap
Engineer at heart (Can dive into App code, IaC, and Metrics)
Big picture / systems thinking
Leads from the back by example and persuasion
FinOps
Collaborates across functions to understand unique challenges
Typical Day
Runs roadmap sessions for SLO, Chaos, Performance, and IM streams
Pairs with SRE Leads on specific scenarios or investigations
Facilitates workshops between leads to identify common problems
Contributes to the codebase (app, IaC, or tooling) 20% of the time
Manages vendor calls regarding renewals and missing capabilities
Prepares and runs demos/townhalls for the wider engineering org

Tooling

  • LucidChart boards with various examples, plans, roadmaps, designs, etc.
  • OKRs for executive reporting
  • GitHub milestones for workstream tracking
  • Slack / Zoom / Loom

Examples

Let's use the load test plugin to simulate realistic payment load during chaos tests, then assert on the SLOs we're building in the parallel stream

Let's report on incidents averted by demonstrating how many issues chaos engineering identified and fixed before they surfaced in production

Can we use SLOs and Jupyter notebooks to speed up the troubleshooting process?

Gaps & Challenges

📈

The Maturity Price Wall

Maturity levels 1-2 are relatively cheap. Level 3 is manageable, but Level 4+ costs grow exponentially.

🧩

Product Support Gap

Finding product people who truly understand the technical nuances of the SRE domain is incredibly difficult.

⚖️

The Generalist Burden

To do this job well, you must be exceptionally well-rounded, pivoting instantly between deep tech and business strategy.

📉

Fragile Funding

Reliability is often the first area to be slashed during budget cuts—it's hard to justify 'nothing happening'.

🏛️

Executive Influence

The role demands legitimate organizational 'clout' and a seat at the table to drive cross-functional impact.

🔭

The Lead Alignment Struggle

Getting specialized SRE leads to see the 'big picture' over their individual domains is a constant leadership challenge.

Participant Questions

How to improve the release process by leveraging AI to make releases lower risk so the team can release every day
👤
Antonio Picernio

Challenges

🧪 Lack of Tests
🐢 Slow CAB
📦 Large Releases
🤝 Trust Gap

Initial Thoughts


Fundamentally, if we cannot release every day with low risk, it means we are missing core capabilities like automated tests, pipelines, and SLOs. Or, we have bureaucracy like slow CAB processes where changes are queued, creating a large, risky blast radius. Finally, we might have a very large/complex app that simply cannot be shipped in small chunks. AI isn't a magic bullet for a broken foundation.

Solutions

Risk & Impact Scoring

Score changes using specific criteria to come up with a weighted risk/impact profile for every release.

Unlocks: Prioritized manual validation
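A minimal sketch of weighted scoring. The criteria, weights, and numbers are purely illustrative, not an established rubric; the point is that the output routes high-scoring releases to manual validation first.

```shell
#!/bin/bash
# Hypothetical weighted risk score for a release candidate
lines_changed=420      # diff size
touched_services=3     # blast radius
test_coverage=78       # percent on the changed packages
has_migration=1        # 1 if the release includes a DB migration

# Weights are illustrative; tune them against your own incident history
score=$(( lines_changed / 100 + touched_services * 2 + (100 - test_coverage) / 10 + has_migration * 5 ))
echo "risk score: ${score}"
```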

Automated PR Reviews

Leverage tools like CodeRabbit to perform deep architectural and logic reviews before a human ever sees the code.

Unlocks: Cleaner code and faster review cycles

Lower-Env SLOs

Report on SLOs in lower environments (Dev/UAT) to catch regressions before they hit production.

Unlocks: Shift-left reliability

Example

20+
Releases / Day
Form3 Release Cadence

At Form3, we released dozens of times per day and nobody knew it was happening. It was a complete non-event. This was all before AI. The basics—automation, tests, and monitoring—matter far more than any AI tooling.

How to shift to a modern SRE model at pace (without loss of velocity)
👤
Rob Dudley

Initial Thoughts


There is always a tradeoff. We cannot magically create time, but we can reprioritise to buy more time. I think it's a mixture of having an SRE Head/Lead who knows what good looks like, then shifting priorities to focus on important but non-urgent SRE activities. It's also about doing next to no Ops/toil and focusing on engineering capabilities.

Challenges

🔥 Firefighting
🛡️ Resistance
🛠️ Poor Tooling

Solutions

Buy More Time

A quarterly theme: 'What can we do this quarter to have a lot more time next quarter?'. This creates a virtuous cycle of efficiency.

Unlocks: Compounding engineering velocity

Estate as Code

Move everything to Terraform. No major changes unless they are in code. If it's not in code, it doesn't scale, so it won't be fast.

Unlocks: Infinite scalability of infrastructure

The Polyglot Bet

Hire polyglots who can move across the stack. This eliminates handovers between teams which is the primary killer of pace.

Unlocks: Zero-friction delivery

Lean Practices

Implement 'Quiet Afternoons' and 'Investment Wednesdays' where the entire team focuses solely on building tooling and automation.

Unlocks: Consistent innovation cadence

Outsource Side Quests

Pick vendors such that your side quest is their main quest. Don't build what you can buy from someone whose business depends on it.

Unlocks: Focus on your core business value

Examples

20d
Saved / Year
AKS Maintenance Automation

Our AKS maintenance used to consume about 20 days per year of manual effort. By investing in automation, this is now a zero-touch process, buying back nearly a month of engineering time annually.

95%
Faster
RBAC & Access Requests

Access requests used to take 10-20 minutes of manual work per request. We moved it all into Terraform and built Cursor commands that guide engineers through permission changes and raise PRs automatically. Now, we only spend seconds on the final review.
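As an illustrative sketch, in the spirit of the other simplified examples, an access request as code might look like the module below. The module name and fields are hypothetical; real requests would carry more fields and an approval chain.

```terraform
# Hypothetical module; illustrative only
module "access_payments_readonly" {
  source    = "./modules/rbac-request"
  principal = "jane.doe@example.com"
  role      = "Reader"
  scope     = "rg-payments-prod"
  expires   = "2026-03-01"
}
```

An AI-assisted editor command can template this block and raise the PR, leaving humans only the final review.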

How to integrate Agentic AI as part of the team to improve and increase throughput via smart automation
👤
Rob Dudley

Initial Thoughts


I haven't yet seen Agentic AI deployed in production within the SRE domain. We are currently still struggling to scale basic AI to wider engineering, so scaling agentic AI feels like the 'next level' in the AI adoption maturity model.

Challenges

🤝 Trust
♾️ Run-away Automation
🔒 Safety (RBAC)
💸 Cost at Scale

Solutions

Level 1: Parallel Operation

The AI runs in parallel with humans, suggesting what it would have done without taking any action.

Unlocks: Initial trust and baseline data

Level 2: Supervised Execution

The AI operates under direct human supervision—we sit and watch it execute every step.

Unlocks: Safe operational feedback loops

Level 3: Low-Risk Autonomy

The AI is given full autonomy but only within low-risk/low-impact domains.

Unlocks: Increased throughput for simple tasks

Level 4: Full Ownership

The AI has full ownership of its domain, escalating only when necessary.

Unlocks: True team scaling

Level 5: Self-Improvement (AGI)

The AI begins to self-improve its own logic and capabilities.

Unlocks: AGI / Paradigm shift

Example

FastBurn
Hypothetical
SLO Alert Scenario

A fast SLO burn rate alert fires. An Agent detects a release correlating with the issue. Based on the maturity model, it either reports the anomaly, suggests a rollback, offers to perform the rollback, or just executes it.
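The maturity-gating in that scenario could be sketched as a simple dispatch. The trust level, release version, and wording are all hypothetical; the structure is what matters: the same detection, with the action escalating as trust grows.

```shell
#!/bin/bash
# Hypothetical maturity-gated response to a fast SLO burn alert
LEVEL=2   # current trust level granted to this agent (1-4)

case "$LEVEL" in
  1) ACTION="report: release v1.42 correlates with the burn" ;;
  2) ACTION="suggest: rollback v1.42 (awaiting human approval)" ;;
  3) ACTION="offer: one-click rollback of v1.42" ;;
  4) ACTION="execute: rolling back v1.42" ;;
esac
echo "$ACTION"
```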

To what extent do you think AI agents should be used natively in Chaos engineering (bringing their own blend of chaos) vs using them to model scenarios and create helper tools and pipelines to model specific failure modes
👤
Alistair McLaurin

Initial Thoughts


AI should be your buddy through the chaos maturity model. My experience is that building tooling and interpreting results are the two most difficult hurdles—so I'd focus AI there first. High maturity requires more AI; low maturity plus AI will just create complexity you won't understand.

Challenges

🧐 Interpreting Results
🏗️ Tooling Burden
😱 Production Fear
😵‍💫 Misunderstood Failure

The Chaos Maturity Model

The Analysis Buddy

Level 1: AI helps you understand your system architecture and brainstorm scenarios on paper.

Unlocks: Rigorous experiment design

The Builder

Level 2: AI helps build the custom tooling, dashboards, and reporting capabilities for chaos.

Unlocks: Reduced engineering overhead

The Aggregator

Level 3: AI queries SLOs, logs, and metrics to bring disparate information into a cohesive view.

Unlocks: Faster result interpretation

The Inventor

Level 4-5: AI starts inventing new scenarios and novel faults based on system knowledge.

Unlocks: Discovery of 'unknown unknowns'

Example

3
Maturity Level
Form3 Chaos Practice

At Form3, we reached Level 3 without any AI. Today, I'd use AI to turn those technical logs into customer-friendly reports, explaining exactly how we validated their RTO/RPO contracts through controlled failures.

The oldest question in Enterprise IT: to what extent should SRE be a separate function and where should it embed the skills in core platform and service Dev/Ops teams
👤
Anonymous

Initial Thoughts


Capability, tooling, and expertise should be centralised. You need domain experts who build tooling for others to use. The rest of SRE—incident response, observability, scaling—should be handled by the whole engineering org. It's like building an airplane: you can't just hire a 'reliability team' after it's built to keep it in the sky. We are all responsible for reliability.

Challenges

Dev Reluctance
🧱 Knowledge Silos
Delayed Priority

Solutions

The SRE Guild

A formal guild where teams and individuals rotate in and out for fixed periods to solve their specific reliability problems.

Unlocks: Organisation-wide skill elevation

Embedded SRE Leads

SRE leads embedded into each domain who collaborate 2-4 times per week on common problems and cross-pollinate knowledge.

Unlocks: Consistent standards across domains

Greenfield Hiring

Hire top talent from the start who already understand that SRE is an engineering discipline, not a support function.

Unlocks: Self-sustaining reliability culture

Example

0%
Ownership
Of other teams' alerts

Our SRE leads build the SLO tooling and modules for the entire organization, but they are not responsible for using that tooling or responding to the resulting alerts. They build the 'car', but the domain teams drive it.

Closing Thoughts

We've only touched the surface of what it takes to scale SRE. I hope this gave you an insight into what I've built in the past, and just as importantly, where I struggled or failed.

I felt privileged to have had this opportunity—it was mega fun, but also incredibly expensive. We reached the level we did because of my amazing colleagues; we hired people with both deep technical skills and an incredible can-do, hungry-to-learn attitude.

I still have a soft spot for SRE. If you have interesting challenges and want to benefit from my experience, please reach out directly via LinkedIn or through Prism.
Wolfenstein Final Selection
PS. I'm going with 'Über'—because if it isn't challenging, it isn't SRE.
Viktor Peacock