Site Reliability Engineer - Docker/Kubernetes

Posted on: 02/12/2025

Job Description

Role : Site Reliability Engineer (SRE)

Experience : 10 - 15 Years

Job Summary :


The Site Reliability Engineer (SRE) will play a critical role in ensuring the reliability, scalability, and performance of Citizens Bank's enterprise systems and cloud environments.

The ideal candidate brings deep technical expertise across multi-cloud platforms, automation, observability, and incident management, and drives reliability engineering practices and operational excellence in a complex financial services environment.

Key Responsibilities :


- Manage and support cloud-based solutions across AWS, Azure, GCP, and other IaaS/PaaS/SaaS/CDN environments.

- Design, implement, and maintain reliable, scalable, and secure infrastructure, ensuring high availability and performance.

- Collaborate with DevOps and security teams to implement DevSecOps workflows using Git, Jenkins, Docker, and Kubernetes (EKS/AKS).

- Automate infrastructure and configuration management using Terraform, Ansible, and scripting languages like Python, Bash, or PowerShell.

- Analyze traffic flows, system logs, and application events to troubleshoot issues and identify interdependencies across systems.

- Use monitoring and observability tools such as Datadog, Splunk, and CloudWatch for proactive system health management.

- Implement on-call support processes, develop and maintain runbook documentation, and work toward full automation of repetitive tasks.

- Collaborate with other SREs to build resilient systems and promote Site Reliability Engineering best practices across the enterprise.

- Handle critical application outages, perform root cause analysis, and drive incident resolution and preventive measures.

- Work within an Agile environment, partnering with cross-functional teams to continuously improve performance and reliability.

Technical Skills Required :


- Cloud Platforms : AWS, Azure, GCP

- DevOps/DevSecOps Tools : Jenkins, Git, Docker, Kubernetes (EKS, AKS)

- Infrastructure as Code (IaC) : Terraform, Ansible

- Monitoring & Logging : Datadog, Splunk, CloudWatch

- Scripting : Python, Bash, PowerShell

- Networking : TCP/IP, DNS, HTTP, Load Balancing, Routing

- OS Environments : Linux, Windows Server

- Familiarity with AMI builds, patching, and rehydration processes

Core Competencies :


- Strong analytical and troubleshooting skills

- Proven ability to drive incident response and post-incident reviews

- Excellent communication and stakeholder management

- Ability to collaborate in global, distributed teams

- Focus on automation, resilience, and continuous improvement

