Description :

- Implements, tunes, and conducts ongoing administration of the data layer, application, and infrastructure, including proposing application system changes, better uses, and improvements.

- Improves the whole lifecycle of services from inception and design, through deployment, operation, and refinement.

- Supports services before they launch through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.

- Leads availability, performance monitoring, and capacity planning for critical services, builds automation to prevent problem recurrence, and builds automated responses for non-exceptional service conditions.

- Maintains services once they are live by measuring and monitoring availability, latency, and overall system health.

- Practices sustainable incident response.

- Scales systems sustainably through mechanisms like automation, and evolves systems by advocating for changes that improve reliability and velocity.

- Writes highly optimised and accurate code for LSEG products and solutions.

- Proactively continues to build and apply relevant domain knowledge that may relate to workflows, data pipelines, business policies, configurations, and constraints.

- Supports essential processes while ensuring high quality standards are met.

Qualifications & Experience :

- Degree in Computer Science, Software Engineering, Electronics/Electrical Engineering, or equivalent.

- Moderate experience in software development in one or more programming languages.

Job Description (Job advert content) :

We are looking for a Senior Site Reliability Engineer to join our team delivering and supporting critical applications running on Azure.

The ideal candidate will be an expert in Azure services, combine SRE and DevOps skills (including automation, monitoring, observability, CI/CD, and incident management), and have a deep understanding of end-to-end application workflows.

As a Senior Site Reliability Engineer, you will play a pivotal role in ensuring the reliability and performance of our applications throughout their lifecycle.

Role, Responsibilities and Key Accountabilities :

- Investigates and resolves complex incidents escalated to the team.

- Runs post-incident review sessions and implements fixes and improvements.

- Conducts service transition activities, including establishing metrics to track performance, setting up monitoring, updating runbooks, executing Game Day/OAT, and training the support team.

- Maintains services once they are live by measuring and monitoring availability, latency, and overall system health.

- Scales systems sustainably through mechanisms like automation and observability, evolving systems by advocating for changes that improve reliability and velocity.

- Maintains scalable and efficient CI/CD pipelines for application enhancements and fixes.

- Conducts regular capacity and FinOps reviews based on usage trends and growth projections.

- Develops disaster recovery (DR) plans and conducts regular DR testing to validate recovery procedures and identify areas for improvement.

- Ensures application compliance with regulatory and security requirements.

- Proactively continues to build and apply relevant domain knowledge that may relate to workflows, data pipelines, business policies, configurations, and constraints.

- Coordinates security principal access management and triages security issues.

Qualifications and Experience :

- Degree in Computer Science, Software Engineering, Electronics/Electrical Engineering, or equivalent.

- 5+ years of experience working as a site reliability engineer or DevOps engineer responsible for application availability and reliability, implementing automation, and optimizing system performance.

- Extensive hands-on experience with Azure services, preferably Microsoft Fabric and Purview.

- Familiarity with infrastructure-as-code tools such as Terraform and Azure Resource Manager.

- Scripting and automation skills using Python, PowerShell, or other languages.

- Strong knowledge of the ITIL framework and best practices for incident, change, configuration, and problem management.

- Good understanding of REST APIs.

- Excellent English communication skills.

- Must be able to work with stakeholders located globally.

- Excellent troubleshooting skills and ability to analyze complex issues.

We are looking for intellectually curious people who are passionate about the bigger picture of how the technology industry is evolving and ready to ask difficult questions and deal with complicated scenarios! If you are creative and a problem solver, this is the place to be, as we will be supporting you to fast-forward your career!