Senior Site Reliability Engineer (Azure)

Permanent - Leeds, London and Uppingham

As a Senior Site Reliability Engineer, you’ll be directly responsible for the uptime and resilience of client infrastructure and applications. You’ll work closely with our clients to ensure safe, accurate delivery of services. You’ll own the end-to-end availability and performance of key services, and build automation to prevent problems from recurring.

You’ll assist in the roll-out and deployment of new product features and installations, working closely with application and cloud architects to ensure that platforms are designed with scale and operability in mind. You’ll automate infrastructure management and alert-handling processes that are currently manual, and identify scalability bottlenecks and areas for performance improvement.

You’ll assist in the discussion and formulation of architectural strategies to maximise performance, stability and efficiency of the solution, and work with the Systems Engineering Manager and wider Systems Engineering Teams to continually assess and improve ways of working.

As a senior member of the team, you’ll help SREs in your team to grow and develop their careers through mentorship. You’ll also share your knowledge and experience by organising Lunch and Learn, Lightning Talk and Brown Bag sessions, where you’ll educate members of the wider business about working in an SRE model and help them onboard to it.

What You'll Bring

In order to flourish in this role, you’ll need the following:

  • Experience in building highly resilient systems in an Agile environment

  • Experience with running production systems, triaging and solving outages

  • Experience with using monitoring and observability tools like Datadog/Splunk

  • In-depth knowledge of Terraform

  • Deep understanding of large-scale system architecture

  • Working knowledge of UNIX/Linux internals

  • Proven success at adopting new methodologies driving an SRE culture

  • In-depth knowledge of Azure, particularly AKS, networking, App Services and Serverless resources

  • Hands-on source control experience, knowledge of Docker and container runtimes, and experience of writing build-release pipelines, preferably using Azure DevOps

  • Software engineering experience

  • Ability to listen effectively, communicate, challenge and influence team members, peers and your management

  • Excellent troubleshooting skills, with the ability to identify the root causes of complex technical problems

It would also be great (but not essential) if you have:

  • Had the chance to use and implement the Google SRE working model for Enterprise clients

  • Experience with Hybrid cloud solutions

  • Versatility in one or more scripting languages (Python, Perl, Bash, PowerShell)

  • Knowledge of test frameworks (Pester, Terratest, etc.)

  • Working knowledge of Windows Server