Posted on: 09/02/2026
Description :
Key Responsibilities :
Reliability Engineering & Operations :
- Contribute to the availability and performance of large-scale, customer-facing systems through monitoring, alerting, and incident response.
- Assist in designing and implementing resiliency strategies, including health checks, failovers, circuit breakers, and retries.
- Participate in on-call rotations, help triage incidents, and assist in root cause analysis and post-incident reviews.
Automation & CI/CD Support :
- Develop scripts, tools, and automation to reduce manual toil and improve operational efficiency.
- Support infrastructure deployment and service rollout via CI/CD pipelines and Infrastructure-as-Code workflows (e.g., Terraform, Helm).
- Work with developers to improve service deployment, configuration management, and rollback strategies.
Observability & Metrics :
- Help build and maintain dashboards, alerts, and logs that provide visibility into system health and application behavior.
- Use tools such as Prometheus, Grafana, Splunk, or OpenTelemetry to monitor services and infrastructure.
- Analyze system performance data to guide optimizations and proactively detect issues.
Cross-Team Collaboration :
- Work with DevOps, SREs, and software engineers to ensure that services are built for reliability and observability.
- Contribute to documentation, runbooks, playbooks, and operational readiness reviews.
- Support development teams in designing systems that meet SLOs and operational standards.
Qualifications :
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 8+ years of experience in infrastructure, operations, DevOps, or SRE roles.
- Proficiency in scripting or programming languages such as Java, Python, Go, and Bash.
- Strong familiarity with Linux systems, container orchestration (Kubernetes), and cloud platforms (Azure preferred; GCP also relevant).
- Hands-on experience with monitoring and observability tools such as Grafana, Splunk, and OpenTelemetry.
- Expertise in Kubernetes and container orchestration, including Docker templates, Helm charts, and GitLab templates.
- Knowledge of authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
- Solid understanding of incident management tools such as ServiceNow.
Preferred Skills :
- Exposure to incident management frameworks, including alerting, escalation, and postmortem practices.
- Understanding of SRE principles : SLOs, SLIs (service-level indicators), and error budgets.
- Familiarity with tools like HAProxy, Envoy Proxy, Kafka, RabbitMQ, or other core infrastructure components.
- Experience with performance tuning of Kubernetes runtime components.
- Experience with CI/CD systems (e.g., GitLab CI/CD, Jenkins, Spinnaker).
Knowledge, Skills, and Abilities :
- Strong problem-solving and analytical skills for diagnosing issues in distributed systems.
- A growth mindset with a passion for learning observability, automation, and platform engineering best practices.
- Strong communication skills and the ability to collaborate across teams.
- Drive to improve system reliability, developer productivity, and customer experience.
Mandatory Skills :
- SQL or NoSQL
- Nagios, CloudWatch, Zabbix, Datadog, New Relic, Prometheus, Grafana, AppDynamics, Site24x7, Telemetry, or Splunk
- SRE (Site Reliability Engineering)
- CI/CD or DevOps
- Kubernetes (K8s)
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1611043