Description :

We are looking for a DevOps / Site Reliability Engineer (L5) to own and scale the production reliability of a large-scale, AI-first platform. You will be responsible for running mission-critical workloads on cloud infrastructure, hardening Kubernetes-based systems, and ensuring high availability, performance, and cost efficiency across platform and AI services.

This role is deeply hands-on and ownership-driven. You will be trusted to run day-2 production systems end-to-end, lead incident response, and continuously raise the reliability bar for AI and data-intensive workloads.

At Proximity, you won't just keep systems running; you'll shape how reliability, observability, and operational excellence are built into the platform from the ground up.

Responsibilities :

- Own day-2 production operations of a large-scale, AI-first platform running on cloud infrastructure

- Run, scale, and harden Kubernetes-based workloads integrated with a broad set of managed cloud services across data, messaging, AI, networking, and security

- Define, implement, and operate SLIs, SLOs, and error budgets across core platform and AI services

- Build and own observability end-to-end, including :

a. APM

b. Infrastructure monitoring

c. Logs, alerts, and operational dashboards

- Improve and maintain CI/CD pipelines and Terraform-driven infrastructure automation

- Operate and integrate AI platform services for LLM deployments and model lifecycle management

- Lead incident response, conduct blameless postmortems, and drive systemic reliability improvements

- Optimize cost, performance, and autoscaling for AI, ML, and data-intensive workloads

- Partner closely with backend, data, and ML engineers to ensure production readiness and operational best practices

What Matters (Non-Negotiable Alignment) :

- Infra owners, not operators: this role is for engineers who design, build, and own infrastructure, not those limited to ticket-based operations.

- Built and operated production-grade cloud infrastructure end-to-end

- Strong Kubernetes experience in real, high-traffic production environments

- AWS experience is mandatory, with GCP as a strong plus

- Experience operating AI/ML workloads in production, including GPU-based systems

- Strong ownership of CI/CD systems and Infrastructure as Code

- End-to-end observability ownership: monitoring, logging, alerting, and dashboards

- Comfortable making infrastructure decisions under ambiguity

- Proven ability to collaborate deeply with ML and backend teams to take systems from design → production → scale

Requirements :

- 6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles.

- Strong, production-grade experience with cloud platforms: AWS required; GCP strongly preferred, especially Kubernetes and managed services.

- Proven expertise running Kubernetes at scale in live production environments.

- Deep hands-on experience with New Relic in complex, distributed systems.

- Experience operating AI/ML or LLM-driven platforms in production environments.

- Solid background in Terraform, CI/CD systems, cloud networking, and security fundamentals.

- Strong understanding of reliability engineering principles, including capacity planning, failure modes, and resilience patterns.

- Comfortable owning production systems end-to-end with minimal supervision.

- Strong communication skills and the ability to operate calmly and effectively during incidents.

- Experience building internal platform tooling for developer productivity.

Desired Skills :

- Experience managing multi-cloud environments or cross-cloud integrations.

- Familiarity with cost optimization strategies for large-scale Kubernetes and AI workloads.

- Exposure to service meshes, advanced traffic management, or zero-trust security models.