
DevOps Architect - SaaS Product

Episeio Business Solutions
Mumbai
7 - 12 Years
Rating: 4.9 (7+ Reviews)

Posted on: 21/11/2025

Job Description

About the Role

We are seeking a DevOps Architect who will design, implement, and optimize the cloud-native infrastructure powering our SaaS platform.

This role is highly technical and requires deep expertise in distributed systems, microservices, CI/CD pipelines, system reliability, infra automation, and high-availability architectures.

You will work closely with backend engineers, SREs, and data/ML teams to ensure a stable, scalable, and secure production environment.

Key Responsibilities:

- Architect end-to-end DevOps pipelines, cloud environments, and automation frameworks.

- Build, maintain, and scale microservices infrastructure on AWS with Kubernetes, service mesh, and GitOps.

- Create CI/CD pipelines from scratch with support for auto-scaling and for blue/green and canary deployment strategies.

- Implement observability frameworks including metrics, logs, traces, and automated alerting.

- Solve complex infra/code/network performance bottlenecks and drive RCA for critical incidents.

- Orchestrate real-time data pipelines, ML pipelines, vector stores, and API infrastructure.

- Create infrastructure-as-code modules (Terraform/CloudFormation/CDK).

- Ensure a strong DevSecOps posture: identity, access control, encryption, and vulnerability scanning.

Core Technical Skills (Essential):

- Strong hands-on experience with Node.js, Python, or Go for backend development and tooling.

- Expert in Docker, Kubernetes (EKS preferred), GitOps (ArgoCD/Flux).

- Deep AWS expertise (EC2, S3, RDS, DynamoDB, EKS, Lambda, CloudWatch, IAM).

- CI/CD design with GitHub Actions, GitLab CI, Jenkins, or Argo Workflows.

- Strong debugging across infra (K8s), networks (VPC, NACL, SG), and services (APIs).

- Observability stack: Prometheus, Grafana, Loki/ELK, OpenTelemetry.

Desirable Technical Skills:

- Experience with Kafka/Kinesis, event-driven systems.

- AI/ML infra exposure: model deployments, embeddings, and vector databases such as Redis/Pinecone.

- Knowledge of RLHF, RL4LM, or similar ML lifecycle workflows.

- Experience with performance hardening and resilience testing (chaos engineering).

