Posted on: 22/08/2025
We have an exciting role in Hyderabad with an AI SaaS fintech product firm. Details are below.
SRE DevOps Lead Engineer (SaaS) || 8-12 Years || Hyderabad (Hybrid) || Quick Starter
Key Responsibilities:
- Architect, design, and deploy end-to-end infrastructure solutions for a multi-tenant microservices-based SaaS application with a focus on AI/ML model integration.
- Ensure system reliability, scalability, performance, and security, specifically enhancing AI/ML processing pipelines and workflows.
- Use Terraform to provision on-demand environments in the AWS cloud, optimized for AI/ML workloads.
- Diagnose, support, and resolve production issues and alerts, participating in a 24/7 on-call rotation to maintain seamless AI/ML service operations.
Scope of Work:
- Actively participate in the Scrum team, delivering test automation for sprint features and certifying new and regression features with automated test suites to ensure high-quality product increments.
- Integrate automated tests into the CI/CD pipeline and schedule them to run periodically in product development environments.
- Identify defects, collaborate with development engineers to resolve them, and verify the fixes.
- Remain available and responsive in keeping with startup culture, staying up to date with communications across channels and email threads.
- Keep the primary goal in focus: reducing customer-reported bugs to near zero.
Required Qualifications:
- 8+ years of experience in Site Reliability Engineering (SRE) and DevOps roles, with a track record of managing large-scale enterprise SaaS services in production, including 1+ year in AI/ML infrastructure.
- Skilled in Infrastructure as Code (IaC) using Terraform, and container technologies such as Docker and Kubernetes.
- Proficient in scripting and programming for automation (Python, Bash, etc.), with strong Linux OS and networking fundamentals relevant to AI/ML workloads.
- Experience in establishing monitoring systems to ensure high availability, performance, and security integrity, using tools like ELK Stack, CloudWatch, and others tailored for AI/ML monitoring.
- Hands-on experience managing microservices-based SaaS products, including enabling RESTful web services, integrating SSO (Okta, Auth0), and using cloud databases like EC2-RDS, MySQL, and Elasticsearch, especially in AI/ML deployments.
- Proficient in backup and disaster recovery strategies specific to AI/ML data resources like RDS and Elasticsearch.
- AWS Certified Solutions Architect is strongly preferred.
- Self-driven, proactive, and adaptable to thrive in an early-stage startup environment, with a keen interest in integrating AI/ML technologies into modern SaaS solutions.
- Strong preference for applicants with a stable career history (consistent employment). Notice period of 0-30 days only.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1533893