Site Reliability Engineer (London)

kwiff

London
Permanent
Full-time

About kwiff:

kwiff isn’t gambling as you know it. We’re redefining the experience with a bold, player-first approach to sports betting and casino, powered by our proprietary tech platform, fully automated sportsbook, and standout UX across web and mobile. What truly sets us apart? Supercharging. Our signature feature allows players’ odds, cash outs and more to be 'supercharged' at random, creating surprise wins and a thrilling betting experience. ⚡

The role & responsibilities:

We’re looking for a Site Reliability Engineer to join our team. This role bridges development and operations, with a strong focus on automation, monitoring, incident management and reliability engineering. You’ll be pivotal in ensuring our platform runs smoothly, securely, and at scale, while collaborating closely with development and product teams.

Platform Reliability & Incident Management

Act as the primary responder in the on-call rotation for production incidents.
Lead incident response and coordination during platform emergencies.
Own and maintain the incident management process, including documentation and communication protocols.
Facilitate post-incident reviews and ensure follow-up actions are tracked and implemented (on-call engineer leads the review).

Monitoring & Observability

Own the implementation and optimisation of monitoring tools across the platform.
Design and implement monitoring for application performance, database health, cache clusters, API endpoints, and service latencies.
Establish and track reliability metrics.

Security & Dependency Management

Monitor and alert on security risks via application monitoring.
Ensure secure handling and rotation of monitoring credentials and tokens in DataDog and other tools.
Track and manage external API versions and deprecation timelines, including documentation, alerting, and coordination with dev teams.
Own the dependency update process across services: track outdated dependencies, prioritise updates, coordinate rollouts, and monitor for vulnerabilities.
Maintain dashboards for dependency freshness and API version status.
Contribute to security incident response when related to monitoring or integrations.

Performance & Reliability Engineering

Define and maintain SLOs for critical services.
Create and manage a stability and performance-focused feature roadmap.
Implement automated alerting with thresholds that minimise alert fatigue.
Contribute to architectural discussions, prioritising scalability, reliability, and cost efficiency.
Participate in release planning and deployment processes to enforce reliability standards.

Release Management

Review and validate production deployments from a reliability perspective.
Collaborate with developers on robust deployment practices.
Monitor post-deployment performance and stability.
Support deployment automation improvements.

Collaboration & Communication

Participate in daily standups and team meetings.
Respond to requests in relevant Slack channels.
Maintain clear documentation of SRE processes and procedures.
Track work items in ClickUp for visibility and prioritisation.
Work closely with development, operations, and product teams to embed reliability best practices.

Skills we’re looking for:

Proven experience in Site Reliability Engineering, DevOps, or related fields.
Strong knowledge of monitoring/observability tools (DataDog preferred).
Solid understanding of cloud infrastructure, APIs, and CI/CD pipelines.
Experience with incident response, incident management, and post-mortem facilitation.
Familiarity with security practices and dependency management.
Ability to work cross-functionally and communicate clearly with technical and non-technical stakeholders.

Nice to haves:

Experience with SLO framework development.
Hands-on experience implementing DORA and SPACE metrics.
Familiarity with HappierPlace or similar dev environment management tools.
Prior exposure to cost-driven architectural decision making.

What we can offer you:

At Kwiff, we believe in rewarding our team with an environment that’s both exciting and supportive. Here’s what you’ll enjoy as part of our team:

Private Healthcare – Comprehensive medical insurance through Vitality Health.
Life Insurance – Coverage through Yulife for added peace of mind.
Performance Bonuses – Quarterly bonuses based on team achievements.
Wellbeing Allowance – Spend on gym memberships or other wellness activities.
Lunch Budget – Enjoy a budget to spend on food and beverages when working from the office.
Sustainable Commuting – Cycle to Work schemes on offer.
Parental Support – Nursery schemes to reduce monthly fees.
Long Service Rewards – Exciting travel rewards for dedication after five years of service.
Learning Budget – Financial support for role-specific training to level up your skills.
Team Socials & Activities – Regular events, plus office perks like ping pong, darts, and PlayStation.

Why join us?

At kwiff, we don’t just follow trends. We create them. From unlimited betting options to surprise wins and slick user journeys, we’re building a product that players love. Join us and help shape the future of betting.

Kwiff is an equal opportunity employer. We value diversity and are committed to creating an inclusive environment for all employees.

We aim for equity at all three stages of the recruitment process. Please let us know if there’s anything we can do to make the process more accessible to you.
