Posted on: 12/02/2026
Description :
We are looking for a DevOps / Site Reliability Engineer (L5) to own and scale the production reliability of a large-scale, AI-first platform. You will be responsible for running mission-critical workloads on cloud infrastructure, hardening Kubernetes-based systems, and ensuring high availability, performance, and cost efficiency across platform and AI services.
This role is deeply hands-on and ownership-driven. You will be trusted to run day-2 production systems end-to-end, lead incident response, and continuously raise the reliability bar for AI and data-intensive workloads.
At Proximity, you won't just keep systems running; you'll shape how reliability, observability, and operational excellence are built into the platform from the ground up.
Responsibilities :
- Own day-2 production operations of a large-scale, AI-first platform running on cloud infrastructure
- Run, scale, and harden Kubernetes-based workloads integrated with a broad set of managed cloud services across data, messaging, AI, networking, and security
- Define, implement, and operate SLIs, SLOs, and error budgets across core platform and AI services
- Build and own observability end-to-end, including :
a. APM
b. Infrastructure monitoring
c. Logs, alerts, and operational dashboards
- Improve and maintain CI/CD pipelines and Terraform-driven infrastructure automation
- Operate and integrate AI platform services for LLM deployments and model lifecycle management
- Lead incident response, conduct blameless postmortems, and drive systemic reliability improvements
- Optimize cost, performance, and autoscaling for AI, ML, and data-intensive workloads
- Partner closely with backend, data, and ML engineers to ensure production readiness and operational best practices
What Matters (Non-Negotiable Alignment) :
- Infra owners, not operators: this role is for engineers who design, build, and own infrastructure, not those limited to ticket-based operations.
- Built and operated production-grade cloud infrastructure end-to-end
- Strong Kubernetes experience in real, high-traffic production environments
- AWS experience is mandatory, with GCP as a strong plus
- Experience operating AI / ML workloads in production, including GPU-based systems
- Strong ownership of CI/CD systems and Infrastructure as Code
- End-to-end observability ownership: monitoring, logging, alerting, dashboards
- Comfortable making infrastructure decisions under ambiguity
- Proven ability to collaborate deeply with ML and backend teams to take systems from design → production → scale
Requirements :
- 6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles.
- Strong, production-grade experience with cloud platforms: AWS required; GCP strongly preferred, especially Kubernetes and managed services
- Proven expertise running Kubernetes at scale in live production environments.
- Deep hands-on experience with New Relic in complex, distributed systems.
- Experience operating AI/ML or LLM-driven platforms in production environments.
- Solid background in Terraform, CI/CD systems, cloud networking, and security fundamentals.
- Strong understanding of reliability engineering principles, including capacity planning, failure modes, and resilience patterns.
- Comfortable owning production systems end-to-end with minimal supervision.
- Strong communication skills and the ability to operate calmly and effectively during incidents.
- Experience building internal platform tooling for developer productivity.
Desired Skills :
- Experience managing multi-cloud environments or cross-cloud integrations.
- Familiarity with cost optimization strategies for large-scale Kubernetes and AI workloads.
- Exposure to service meshes, advanced traffic management, or zero-trust security models.
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1612062