Senior Engineer - DevOps & Site Reliability

Interface Consultancy Services
8 - 12 Years
Hyderabad

Posted on: 09/02/2026

Job Description

Key Responsibilities :


Reliability Engineering & Operations :


- Contribute to the availability and performance of large-scale, customer-facing systems through monitoring, alerting, and incident response.


- Assist in designing and implementing resiliency strategies, including health checks, failovers, circuit breakers, and retries.


- Participate in on-call rotations, help triage incidents, and assist in root cause analysis and post-incident reviews.


Automation & CI/CD Support :


- Develop scripts, tools, and automation to reduce manual toil and improve operational efficiency.


- Support infrastructure deployment and service rollout via CI/CD pipelines and Infrastructure-as-Code workflows (e.g., Terraform, Helm).


- Work with developers to improve service deployment, configuration management, and rollback strategies.


Observability & Metrics :


- Help build and maintain dashboards, alerts, and logs that provide visibility into system health and application behavior.


- Use tools such as Prometheus, Grafana, Splunk, or OpenTelemetry to monitor services and infrastructure.


- Analyze system performance data to guide optimizations and proactively detect issues.


Cross-Team Collaboration :


- Work with DevOps engineers, SREs, and software engineers to ensure that services are built for reliability and observability.


- Contribute to documentation, runbooks, playbooks, and operational readiness reviews.


- Support development teams in designing systems that meet SLOs and operational standards.


Qualifications :


- Bachelor's degree in Computer Science, Engineering, or a related technical field.


- 8+ years of experience in infrastructure, operations, DevOps, or SRE roles.


- Proficiency in scripting or programming languages such as Java, Python, Go, and Bash.


- Strong familiarity with Linux systems, container orchestration (Kubernetes), and cloud platforms (Azure preferred; GCP also relevant).


- Hands-on experience with monitoring and observability tools such as Grafana, Splunk, and OpenTelemetry.


- Expertise in Kubernetes and container orchestration, including Docker templates, Helm charts, and GitLab templates.


- Knowledge of authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.


- Solid understanding of incident management tools such as ServiceNow.


Preferred Skills :


- Exposure to incident management frameworks, including alerting, escalation, and postmortem practices.


- Understanding of SRE principles: SLOs, SLIs (service-level indicators), and error budgets.


- Familiarity with tools like HAProxy, Envoy Proxy, Kafka, RabbitMQ, or other core infrastructure components.


- Experience with performance tuning of Kubernetes runtime components.


- Experience with CI/CD systems (e.g., GitLab CI/CD, Jenkins, Spinnaker).


Knowledge, Skills, and Abilities :


- Strong problem-solving and analytical skills for diagnosing issues in distributed systems.


- A growth mindset with a passion for learning observability, automation, and platform engineering best practices.


- Strong communication skills and the ability to collaborate across teams.


- Drive to improve system reliability, developer productivity, and customer experience.


Mandatory Skills :


- SQL or NoSQL


- Nagios or CloudWatch or Zabbix or Datadog or New Relic or Prometheus or Grafana or AppDynamics or Site24x7 or Telemetry or Splunk


- SRE or Site Reliability


- CI CD or CI/CD or CI-CD or CICD or DevOps


- Kubernetes or K8s

