
Manager, Technical Product Management (Reliability & Incident Response)
- London
- Permanent
- Full-time
- Lead the response to major incidents impacting our ecommerce platform
- Coordinate with technical teams across DevOps, AI Ops, distributed computing, and other areas to prevent future incidents by improving AI Ops correlation and advancing service restoration tools
- Handle communication with collaborators, including customers, business partners, and senior management, to provide regular updates and set expectations
- Develop and implement processes for incident management, including escalation procedures, activation of service restoration processes and tools, and validation of AI Ops correlation models
- Continuously review and improve incident management processes to ensure efficiency and effectiveness
- Collaborate with technical teams to identify areas for improvement and implement changes to prevent future incidents
- Conduct incident trend analysis to identify recurring issues and proactively address them
- Lead vendor relationships related to incident management tools and services
- Provide guidance and support to incident management team members and other technical staff
- Bachelor's degree in Computer Science, Information Systems, or related field
- Demonstrated success supporting reliability and uptime for cloud-based, distributed platforms
- Experience in incident management or related technical fields
- Strong knowledge of DevOps, AI Ops, distributed computing, and ecommerce platforms
- Experience with incident management tools, such as ServiceNow, PagerDuty, and VictorOps
- Excellent communication and collaboration skills, with the ability to manage partners at all levels of the organization
- Strong problem-solving and analytical skills, with the ability to lead teams in resolving complex technical issues
- Demonstrable ability to manage incidents, run post-mortems, and lead process improvement initiatives
- Experience in agile methodologies is preferred