Description

Investec's Private Client proposition is evolving rapidly as we expand our Transactional Banking and Lending propositions. This growth presents an opportunity to build a modern, resilient, and highly observable technology estate that delivers a seamless and reliable client experience. The DevOps Engineer plays a critical role in enabling this by ensuring services are reliable, secure, and operable in production, while working with delivery teams to influence how new platforms are designed and delivered so that they meet these standards from day one. This is a role focused on reliability, automation, and production excellence. It sits within the PCT Tech Operations capability, which spans both the ongoing operation of live services and the enablement of future platforms to be resilient, observable, and supportable at launch.

Key Responsibilities

Service Reliability & Performance

Own reliability outcomes for production services, particularly Tier 1 and Tier 2 systems

Help define and implement Service Level Objectives (SLOs)

Work with teams to reduce Change Failure Rate and improve Mean Time to Recover (MTTR)

Identify and eliminate repeat failure modes through systemic fixes

Ensure services are designed for graceful degradation, redundancy, and recovery

Observability & Incident Response

Ensure services emit high-quality logs, metrics, and traces aligned to defined standards

Work with delivery teams to improve alert quality and lifecycle management, reducing noise and improving signal accuracy

Participate in and help lead Major Incident response and technical triage

Enable structured, blameless post-incident reviews and ensure actions are delivered

Improve detection and response through automation and proactive monitoring

Resilience & Operational Readiness

Verify that services meet defined operational readiness standards before production release

Support disaster recovery readiness, backup integrity, and resilience testing

Validate that systems are recoverable, observable, and supportable under failure conditions

Contribute to business continuity and cyber resilience exercises

Risk, Security & Compliance

Partner with engineering teams to reduce the vulnerability backlog and improve security posture

Ensure least privilege access, credential hygiene, and secure configurations

Support compliance with regulatory and operational frameworks (e.g. PSD2, IBS, GDPR)

Communication & Influence

Build strong relationships across Engineering, Architecture, Cyber, and Risk teams

Understand client journeys and business processes to identify reliability and operational improvements

Clearly communicate production risks, trade-offs, and technical constraints

Provide constructive feedback to improve operability

Share knowledge and context to support teams and colleagues

Impact & Delivery Expectations

Deliver measurable improvements in service reliability and operational resilience

Reduce repeat incidents through automation and systemic fixes

Improve MTTR and detection capabilities across services

Ensure services have defined SLOs, telemetry, and operational standards

Balance short-term remediation with long-term reliability engineering

Qualifications, Experience and Skills

Experience in cloud environments

Hands-on experience with Kubernetes and containerised systems

Experience working with CI/CD pipelines and deployment automation

Proficiency in scripting (Python, Bash, or PowerShell)

Experience with monitoring and observability tooling (e.g. Splunk, Dynatrace, Prometheus)

Understanding of Infrastructure as Code and configuration management

Experience operating in regulated or high-availability environments preferred

Strong understanding of version control systems (Git-based workflows)