
Senior Site Reliability Engineer
- Leicester / London
- Permanent
- Full-time
- Step 1: Introductory video call (around 45 minutes) with one of our team to get to know you, explain the role, and hear about your experience and goals
- Step 2: A 90-minute technical discussion with several members of the SRE team. You will work through scenario-based questions designed to help you highlight your knowledge and approach
- SLOs, SLIs, and Service Ownership: Help teams define and adopt meaningful SLIs and SLOs. Guide product teams in using observability data to make reliability measurable.
- Incident Response and Reliability Engineering: Lead on-call investigations when issues arise. Drive blameless post-incident reviews and recommend both mitigating actions that limit immediate losses and permanent technical fixes that prevent recurrence.
- Infrastructure and Automation: Use Pulumi, Terraform, CDK, etc. to model effective infrastructure in AWS and other PaaS and SaaS providers. Improve CI/CD pipelines and support safe deployment patterns, such as canary and blue-green.
- Engineering and Development: Build automation and reliability tooling using well-structured, testable code. Contribute to shared libraries, observability components and internal platforms.
- Mentoring and Team Growth: Support and coach other engineers. Lead technical discussions and share knowledge through pairing, planning, and documentation.
- Continuous Learning and Innovation: Stay ahead of emerging practices in observability, resilience, and platform engineering. Lead team proof-of-concepts and introduce new patterns or tools that improve our platform.
- Strategic Development: Contribute to prioritisation of the SRE roadmap. Help shape observability tooling, telemetry patterns, and platform-wide approaches to service ownership and reliability.
- Aligning to Business Goals: Use observability insights to support product and platform goals. Ensure SRE priorities align with Dunelm’s wider objectives for quality, performance, and customer experience.
- Solid experience with TypeScript or a similar strongly typed programming language.
- Proven ability to write idiomatic, pragmatic, and testable code, with strong, appropriate, automated testing.
- Knowledge and understanding of the OpenTelemetry tools, specification, and APIs.
- Excellent understanding of SRE principles, including embracing risk, service level objectives, eliminating toil, monitoring distributed systems, automation and release engineering
- AWS expertise, including Lambda, ECS/Fargate, EC2, EventBridge, SQS, S3, DynamoDB and general networking principles
- System administration knowledge – able to comfortably use a command line to navigate and troubleshoot a server or container running a Linux OS
- Knowledge and experience configuring and using telemetry back-ends, such as Datadog and the Grafana stack.
- Experience with infrastructure-as-code tools, such as Pulumi and Terraform
- Familiar with Kubernetes and how to deploy and monitor workloads running in k8s
- Skilled in CI/CD pipelines (GitLab or similar) and build/test/deploy automation
- Proven ability to lead incident response and post-incident review processes
- Strong problem-solving mindset and attention to detail
- Some experience in Rust or a similar compiled language, e.g. Go
- Experience instrumenting and running OpenTelemetry in production at scale. Knowledge of distributed tracing and trace sampling
- Experience reducing observability or cloud costs through architectural changes
- Exposure to Google Cloud Platform (GCP)
- Experience with Kubernetes observability, metrics exporters, or service mesh
- Familiarity with challenges in the retail sector is a bonus but not expected.
- Support and build trust with teammates, always assuming positive intent
- Communicate clearly and share knowledge to build shared understanding
- Stay curious, ask why, and always look to improve how things work
- Embrace change, adapt quickly, and take on a variety of challenges
- Drive innovation by looking for better ways forward and pushing for progress