Site Reliability Engineer (London)
kwiff
- London
- Permanent
- Full-time
- Act as the primary responder in the on-call rotation for production incidents.
- Lead incident response and coordination during platform emergencies.
- Own and maintain the incident management process, including documentation and communication protocols.
- Facilitate post-incident reviews (led by the on-call engineer) and ensure follow-up actions are tracked and implemented.
- Own the implementation and optimisation of monitoring tools across the platform.
- Design and implement monitoring for application performance, database health, cache clusters, API endpoints, and service latencies.
- Establish and track reliability metrics.
- Detect and alert on security risks through application monitoring.
- Ensure secure handling and rotation of monitoring credentials and tokens in DataDog and other tools.
- Track and manage external API versions and deprecation timelines, including documentation, alerting, and coordination with dev teams.
- Own the dependency update process across services: track outdated dependencies, prioritise updates, coordinate rollouts, and monitor for vulnerabilities.
- Maintain dashboards for dependency freshness and API version status.
- Contribute to security incident response when related to monitoring or integrations.
- Define and maintain SLOs for critical services.
- Create and manage a feature roadmap focused on stability and performance.
- Implement automated alerting with thresholds that minimise alert fatigue.
- Contribute to architectural discussions, prioritising scalability, reliability, and cost efficiency.
- Participate in release planning and deployment processes to enforce reliability standards.
- Review and validate production deployments from a reliability perspective.
- Collaborate with developers on robust deployment practices.
- Monitor post-deployment performance and stability.
- Support deployment automation improvements.
- Participate in daily standups and team meetings.
- Respond to requests in relevant Slack channels.
- Maintain clear documentation of SRE processes and procedures.
- Track work items in ClickUp for visibility and prioritisation.
- Work closely with development, operations, and product teams to embed reliability best practices.
- Proven experience in Site Reliability Engineering, DevOps, or related fields.
- Strong knowledge of monitoring/observability tools (DataDog preferred).
- Solid understanding of cloud infrastructure, APIs, and CI/CD pipelines.
- Experience with incident response, incident management, and post-mortem facilitation.
- Familiarity with security practices and dependency management.
- Ability to work cross-functionally and communicate clearly with technical and non-technical stakeholders.
- Experience with SLO framework development.
- Hands-on experience implementing DORA and SPACE metrics.
- Familiarity with HappierPlace or similar dev environment management tools.
- Prior exposure to cost-driven architectural decision-making.