Posted on: 05/11/2025
- Design, build, and maintain multi-region infrastructure using Terraform and Atlantis.
- Continuously optimize system performance, scalability, and cost efficiency.
- Implement infrastructure automation and self-healing capabilities.
- Develop and maintain Datadog dashboards, SLOs, SLIs, and alerting mechanisms.
- Automate incident detection, recovery, and runbook execution.
- Implement monitoring for reliability, availability, and latency across distributed systems.
- Manage and enhance CI/CD pipelines with security and quality gates (GitLeaks, static code checks, GitOps).
- Ensure deployment consistency and eliminate manual infrastructure drift.
- Collaborate with development teams to improve deployment processes and accelerate release cycles.
- Enforce strong IAM (Identity and Access Management) practices and maintain compliance across systems.
- Automate checks and reporting for SOC 2, ISO 27001, and GDPR compliance.
- Implement policies and automation for least privilege access and secure network configurations.
- Coach engineers on infrastructure best practices, observability, and cloud reliability.
- Advocate for DevOps and reliability engineering culture across the organization.
- Partner with cross-functional teams to define infrastructure standards and long-term roadmap.
Requirements & Qualifications:
- 4-6 years of hands-on experience in Infrastructure, DevOps, or Platform Engineering roles.
- Strong expertise in AWS (ECS/Fargate, EKS) and GCP (GKE), plus Terraform and Atlantis for Infrastructure as Code (IaC).
- Experience managing multi-region, multi-cloud environments.
- Proficiency with CI/CD pipelines, GitOps, and infrastructure security automation.
- Deep understanding of observability tools such as Datadog, Last9, and CloudWatch.
- Strong debugging, troubleshooting, and performance optimization skills.
- Demonstrated experience in cost management, monitoring automation, and incident management.
- Excellent communication and documentation skills, with the ability to explain complex technical topics clearly.
Preferred Skills:
- Experience with Cloudflare, Linear, or similar DevOps tools.
- Familiarity with container orchestration (Docker, Kubernetes) and service mesh technologies.
- Knowledge of security scanning, compliance automation, and infrastructure observability design.
- Understanding of SRE (Site Reliability Engineering) principles and error budgets.
- Experience in mentoring or leading small infrastructure teams.
Soft Skills:
- Proactive and detail-oriented problem solver.
- Strong leadership and mentoring capabilities.
- Collaborative team player with a "build and automate everything" mindset.
- Passion for innovation, reliability, and continuous improvement.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1569893