Description
Investec's Private Client proposition is evolving rapidly as we expand our Transactional Banking and Lending propositions. This growth presents an opportunity to build a modern, resilient, and highly observable technology estate that delivers a seamless and reliable client experience. The DevOps Engineer plays a critical role in enabling this by ensuring services are reliable, secure, and operable in production, while working with delivery teams to influence how new platforms are designed and delivered so that they meet these standards from day one. This is a role focused on reliability, automation, and production excellence. It sits within the PCT Tech Operations capability, which spans both the ongoing operation of live services and the enablement of future platforms to be resilient, observable, and supportable at launch.
Key Responsibilities
Service Reliability & Performance
Own reliability outcomes for production services, particularly Tier 1 and Tier 2 systems
Help define and implement Service Level Objectives (SLOs)
Work with teams to reduce Change Failure Rate and improve Mean Time to Recover (MTTR)
Identify and eliminate repeat failure modes through systemic fixes
Ensure services are designed for graceful degradation, redundancy, and recovery
Observability & Incident Response
Ensure services emit high-quality logs, metrics, and traces aligned to defined standards
Work with delivery teams to improve alert quality and lifecycle management, reducing noise and improving signal accuracy
Participate in and help lead Major Incident response and technical triage
Enable structured, blameless post-incident reviews and ensure actions are delivered
Improve detection and response through automation and proactive monitoring
Resilience & Operational Readiness
Check that services meet defined operational readiness standards before production release
Support disaster recovery readiness, backup integrity, and resilience testing
Validate that systems are recoverable, observable, and supportable under failure conditions
Contribute to business continuity and cyber resilience exercises
Risk, Security & Compliance
Partner with engineering teams to reduce vulnerability backlog and improve security posture
Ensure least privilege access, credential hygiene, and secure configurations
Support compliance with regulatory and operational frameworks (e.g. PSD2, IBS, GDPR)
Communication & Influence
Build strong relationships across Engineering, Architecture, Cyber, and Risk teams
Understand client journeys and business processes to identify reliability and operational improvements
Clearly communicate production risks, trade-offs, and technical constraints
Provide constructive feedback to improve operability
Share knowledge and context to support teams and colleagues
Impact & Delivery Expectations
Deliver measurable improvements in service reliability and operational resilience
Reduce repeat incidents through automation and systemic fixes
Improve MTTR and detection capabilities across services
Ensure services have defined SLOs, telemetry, and operational standards
Balance short-term remediation with long-term reliability engineering
Qualifications, Experience and Skills
Experience in cloud environments
Hands-on experience with Kubernetes and containerised systems
Experience working with CI/CD pipelines and deployment automation
Proficiency in scripting (Python, Bash, or PowerShell)
Experience with monitoring and observability tooling (e.g. Splunk, Dynatrace, Prometheus)
Understanding of Infrastructure as Code and configuration management
Experience operating in regulated or high-availability environments preferred
Strong understanding of version control systems (Git-based workflows)