Posted on: 03/12/2025
Description:
- Develop and maintain CI/CD pipelines, deployment automation, and infrastructure-as-code solutions.
- Implement robust monitoring, logging, alerting, and observability frameworks to ensure uptime and reliability.
- Lead incident response, troubleshoot production issues, perform root cause analysis, and drive post-mortem reviews.
- Optimize system performance, conduct capacity planning, and establish SLOs/SLIs for key services.
- Ensure security, compliance, and best practices across infrastructure and deployment workflows.
- Implement and maintain disaster recovery (DR) strategies, backups, and business continuity plans.
- Collaborate closely with software engineering teams to improve developer productivity and platform reliability.
- Automate operational tasks using Python, Bash, or other scripting languages.
- Continuously evaluate and integrate new tools, technologies, and best practices to improve DevOps/SRE maturity.
Skills & Qualifications:
- Expertise in CI/CD tools such as Jenkins, GitLab CI, GitHub Actions, or ArgoCD.
- Proficiency in IaC tools such as Terraform, CloudFormation, or Pulumi.
- Solid understanding of Docker, Kubernetes, and container orchestration.
- Experience with monitoring & observability tools (Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog, New Relic, etc.).
- Strong scripting abilities using Python, Bash, or equivalent.
- Knowledge of networking fundamentals, load balancing, and distributed systems.
- Experience implementing security best practices, secrets management, and compliance frameworks.
- Proven experience handling incident management, on-call rotations, and post-mortems.
- Excellent communication, collaboration, and problem-solving skills.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1584641