We are looking for a seasoned Staff MLOps Engineer to lead the design, implementation, and scaling of enterprise-grade machine learning platforms on AWS.

This role will focus on building reliable, secure, and cost-efficient MLOps systems that enable data scientists and engineers to deploy, monitor, and manage ML models in production.

As a Staff Engineer, you will provide technical leadership, define best practices, and drive cross-team alignment on ML platform architecture.

Duties & Responsibilities:

MLOps Platform & Architecture:

- Architect and own scalable MLOps platforms on AWS supporting model training, deployment, monitoring, and governance.

- Design and maintain end-to-end ML CI/CD pipelines, including data validation, model training, testing, approval, and deployment.

- Establish standards for model lifecycle management, experiment tracking, versioning, reproducibility, and rollback.

Model Deployment & Monitoring:

- Enable real-time, batch, and asynchronous model inference using AWS-native and container-based solutions.

- Implement monitoring for model performance, data drift, concept drift, and operational metrics.

- Ensure high availability, fault tolerance, and observability for production ML systems.

AWS Cloud & Infrastructure:

- Lead design and implementation using AWS services, including but not limited to:

1. Amazon SageMaker (training, hosting, pipelines, feature store).

2. EKS, ECS, EC2, and Lambda for model serving and orchestration.

3. S3, Glue, Athena, and Redshift for data storage and analytics.

4. CloudWatch and X-Ray for logging and monitoring.

- Implement Infrastructure as Code (IaC) using Terraform or AWS CloudFormation.

- Optimize ML workloads for cost, performance, and scalability, including GPU/spot instance strategies.

DevOps, Security & Compliance:

- Build and maintain CI/CD pipelines using tools such as GitHub Actions, GitLab CI, Jenkins, or AWS CodePipeline.

- Enforce security best practices (IAM, VPC, encryption, secrets management).

- Support compliance, auditability, and governance requirements for ML systems.

Technical Leadership & Collaboration:

- Serve as a Staff-level technical leader, influencing MLOps architecture across multiple teams.

- Mentor engineers and data scientists on production ML best practices.

- Partner with Data Science, Data Engineering, Platform, and Product teams to align ML solutions with business goals.

- Contribute to the long-term ML platform roadmap and strategy.

Skills Required:

- 11-13 years of overall experience, with 5+ years in MLOps, ML Platform, or ML Infrastructure roles.

- Strong experience deploying and operating machine learning models in production on AWS.

- Proficiency in Python and experience with ML frameworks such as TensorFlow, PyTorch, and scikit-learn.

- Deep hands-on experience with Docker and Kubernetes (EKS).

- Strong understanding of Amazon SageMaker and its ecosystem.

- Experience with CI/CD systems and Git-based workflows.

- Solid background in distributed systems, system design, and cloud architecture.

Preferred / Nice-to-Have Skills:

- Experience with SageMaker Feature Store, Pipelines, Model Registry, or MLflow.

- Exposure to LLMOps/GenAI on AWS (Bedrock, custom LLM deployment, vector databases such as OpenSearch or Pinecone).

- Experience with streaming and real-time pipelines (Kafka, Kinesis, Spark).

- Experience in regulated or high-scale environments (finance, healthcare, retail, etc.).

- AWS certifications (Solutions Architect, Machine Learning Specialty) are a plus.

Soft Skills:

- Strong ownership and decision-making ability at a Staff level.

- Excellent communication skills across engineering, data science, and leadership teams.

- Ability to balance short-term delivery with long-term platform vision.

- Passion for building reliable, scalable, and maintainable ML systems.

Qualifications Required:

- Bachelor's degree from a four-year college or university, or an equivalent combination of education and experience.

- 11-13 years of overall experience, with 5+ years in MLOps, ML Platform, or ML Infrastructure roles.

About Symplr:

- As a leader in healthcare operations solutions, we empower healthcare organizations to navigate the complexities of integrating critical business operations.

- Our customers are at the heart of everything we do, and they rely on our mission-critical systems to drive better operations and better outcomes.

- We are a remote-first company with employees working across the United States, India, and the Netherlands.

- Guided by our values, we focus on teamwork, championing our customers, being rooted in action and outcomes, overcoming challenges, and leading through equality and integrity.