
Lead Data Ops
- Watford, Hertfordshire
- Permanent
- Full-time
- Ensure the day-to-day smooth running of BI data pipelines, reporting systems, and cloud-based ETL/ELT workflows (AWS, Redshift, Glue, Airflow).
- Monitor system uptime and performance, proactively addressing issues to maintain SLA compliance.
- Set up monitoring dashboards and look after system availability and operations.
- Strong observability skills, including defining and continuously tuning thresholds and alerting mechanisms.
- Automation skills, with the ability to identify automation opportunities that reduce manual effort.
- Investigate, triage, and resolve L2 incidents, escalating L3 issues when necessary.
- Perform root cause analysis and implement fixes to improve system stability.
- Perform deployments to production environments, ensuring the change management process is followed.
- Approve or review changes published by Dev/Test teams on Production environments.
- Maintain SLAs for uptime, latency, and data freshness, using monitoring tools such as AWS CloudWatch (a brief sketch of this kind of alarm follows this list).
- Track and optimise system performance, reducing recurring issues.
- Oversee data validation, ensuring compliance with governance frameworks (Ataccama or similar).
- Reduce flagged data anomalies and drive improvements in data quality KPIs.
- Automate manual support processes (alerting, monitoring) to improve efficiency and reliability.
- Identify and implement workflow optimisations to reduce operational bottlenecks.
- Work closely with cross-functional teams (DevOps, Data Engineering, AI) to align on fixes and enhancements.
- Provide regular updates on incident resolution progress to key stakeholders and leadership.
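
As a rough illustration of the monitoring and alerting duties above, the sketch below shows one way an SLA-style alarm might be scripted with boto3 and CloudWatch. It is not part of the role description: the alarm name, Glue job name, and SNS topic ARN are hypothetical placeholders, and the exact Glue metric and dimensions depend on how job metrics are enabled in the account.

```python
# Minimal sketch (assumed setup): a CloudWatch alarm on failed tasks for a Glue job,
# notifying an SNS topic. Names, ARNs, and thresholds are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="bi-nightly-load-failures",                       # hypothetical alarm name
    Namespace="Glue",                                           # AWS Glue job metrics namespace
    MetricName="glue.driver.aggregate.numFailedTasks",
    Dimensions=[
        {"Name": "JobName", "Value": "bi_nightly_load"},        # hypothetical Glue job
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    Statistic="Sum",
    Period=300,                                                 # 5-minute evaluation windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",                  # alarm on any failed task
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:data-ops-alerts"],  # hypothetical SNS topic
)
```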
- 5+ years of experience (minimum 3 years in Production Support for AWS BI/Data Engineering).
- Hands-on experience with AWS BI stack (Redshift, Glue, Airflow, S3, Step Functions).
- Data Engineering & ETL: Redshift, Glue, Airflow, S3, Step Functions, Spark.
- Reporting & Visualisation: Power BI (preferred), Business Objects, Qlik.
- Data Warehousing: Data modelling, ETL processes, and reverse engineering.
- Database Expertise: Strong SQL scripting, database optimisation, and troubleshooting.
- Incident & Problem Management: Identify root causes, troubleshoot, and implement fixes for data pipeline and platform issues. Manage L2 escalations effectively, ensuring timely resolution and minimising business impact.
- Monitoring & Automation for Stability: Track data pipeline health, dependencies, and performance metrics using monitoring tools. Implement automated alerting and self-healing mechanisms to reduce manual intervention (an illustrative Airflow sketch follows these requirements).
- SLA Compliance & Reporting: Ensure adherence to uptime, latency, and data freshness SLAs. Proactively monitor for performance bottlenecks and report insights to key stakeholders.
- Change & Access Management: Oversee deployments, version control, and rollback strategies to ensure smooth transitions. Manage user roles, security controls, and production access to maintain compliance.
- ITIL Framework & Continuous Improvement: Apply ITIL best practices for incident, change, and problem management. Drive process automation and workflow optimisations to enhance operational efficiency.
- Collaboration & Stakeholder Engagement: Work closely with cross-functional teams, including DevOps, Data Engineering, and AI, to coordinate issue resolution and system enhancements. Provide clear updates on incident progress and operational improvements.
- Team Management: Hiring, retention, and mentoring of a motivated team, providing regular feedback and career growth opportunities.
- Management Reporting: Collaborating with Heads of Functions and key stakeholders to report on team performance, status updates, and operational insights.
- Vendor & Service Provider Management: Overseeing Managed Service Providers (MSPs), tracking efficiency, and ensuring smooth collaboration.
- Process & Performance Optimisation: Driving continuous improvement initiatives, optimising workflows, and ensuring best practices are followed.
- Communication & Leadership: Strong interpersonal skills, fostering a collaborative culture, and ensuring clear communication across teams.
- Experience with multiple programming languages, including Python, Shell Scripting, Spark, and Scala.
- Experience with Amazon Sagemaker.
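
As a small, hedged illustration of the Airflow, SLA, and automated-alerting skills listed above, the sketch below shows how a DAG in Airflow 2.x might declare a task-level SLA and a failure callback. The DAG id, schedule, and notify() helper are hypothetical and not taken from this role.

```python
# Minimal sketch (Airflow 2.x, assumed setup): a daily refresh DAG with a task-level SLA
# and an on-failure alert. The DAG id, schedule, and notify() helper are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify(message: str) -> None:
    # Placeholder for a real alerting integration (SNS, Slack, PagerDuty, ...).
    print(f"ALERT: {message}")


def on_failure(context):
    # Invoked by Airflow when a task instance fails.
    notify(f"Task {context['task_instance'].task_id} failed")


def refresh_reporting_tables():
    # Placeholder for the actual ETL/ELT work (e.g. a Redshift load via Glue).
    pass


with DAG(
    dag_id="bi_reporting_refresh",                  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",                           # hypothetical daily 06:00 refresh
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "on_failure_callback": on_failure,
        "sla": timedelta(hours=2),                  # data-freshness SLA applied to each task
    },
):
    PythonOperator(
        task_id="refresh_reporting_tables",
        python_callable=refresh_reporting_tables,
    )
```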