Posted on: 31/07/2025
As a Sr. Machine Learning Ops Platform Engineer, you will be responsible for building automation and leading-edge architecture around Data and AI/ML engineering on the ResMed AI platform. Specifically, you will code and help architect a production-grade, scalable platform to be used by dozens of data scientists. You will help define and ensure best practices for coding and CI/CD within a team of excellent and engaged engineers. You will be given creative freedom and will work in a supportive team environment. This is a hands-on role that involves coding and regular interaction with business stakeholders.
Let's talk about responsibilities:
- Ensure AI platforms are reliable, scalable, and resilient by establishing foundational blueprints for upgrade and release strategies, implementing comprehensive logging, monitoring, and metrics, and automating critical system management tasks.
- Work on Generative AI development, including embeddings and fine-tuning of generative models.
- Build and maintain systems using DevOps, LLMOps, and AIOps practices, with tools and platforms such as Kubernetes, Docker, AWS, Python, and Terraform.
- Push the boundaries of what's possible with AI by thinking beyond current technology and stack constraints, and collaboratively delivering innovative yet practical solutions.
- Participate in and set up Proofs of Concept (POCs) to demonstrate proposed solutions.
- Enable team members through training, culture-building, and mentorship.
- Identify, design, and implement internal process improvements: automating manual processes, re-architecting infrastructure for greater scalability, and so on.
- Build infrastructure for the AWS platform, including Lambda, ECS, EC2, SNS/SQS, and Bedrock, as well as ML pipelines, data monitoring, alerting, and networking.
- Implement observability stacks such as Prometheus, Loki, VictoriaLogs, Grafana, and Datadog.
- Design, build, and support Data & ML model pipelines using the latest CI/CD and deployment technologies.
- Collaborate with stakeholders including Executive, Product, Data, and Design teams to resolve technical issues and support infrastructure needs.
- Participate in code reviews and process improvements.
Let's talk about qualifications and experience:
- 7+ years of experience in a complex, technical environment. Proven experience developing production-grade code in Python, SQL, and Pandas.
- Deep expertise in Kubernetes as a foundational platform for deploying, scaling, and managing AI/ML workloads in production.
- Experience with 3 or more of the following AWS tools: Lambda, EC2, EMR, S3, Glue, Athena, RDS, Networking, IAM, Batch Processing, SageMaker, Airflow.
- Proficiency in Terraform and common DevOps/DevSecOps tools and techniques such as Docker, GitHub, GitHub Actions, SonarQube, Checkmarx, and JFrog.
- Experience with Kubeflow and MLflow is a strong advantage.
- Skilled in creating and managing CI/CD pipelines and APIs tailored for AI/ML workloads.
- Experience with both relational (SQL) and NoSQL databases; Snowflake experience is a plus.
- Solid understanding of the OAuth 2.0 protocol for secure authorization.
- Hands-on experience implementing A/B testing for models and AI applications.
- Familiarity with AI governance, observability, and compliance practices across AI/ML workflows.
- Experience working with LLMs, AI agents, multi-agent coordination, and the Model Context Protocol (MCP); strong hands-on exposure to AI platform engineering.
- Exposure to LLM orchestration frameworks such as Flowise, Langflow, and LangGraph is a plus.
- Familiarity with AgentOps tools and methodologies is a strong advantage.
- Demonstrated ability to work with AI/ML teams and cross-functional groups in fast-paced, dynamic environments.
All listed duties, requirements, and responsibilities are considered essential functions of this position; however, business conditions may require reasonable accommodation for additional tasks and responsibilities.
Let's talk about what you can expect:
- A supportive environment that focuses on people development and best implementation.
- Opportunity to design, influence, and be innovative.
- Work with inclusive global teams that openly share new ideas. We want your ideas!
- Be supported both inside and outside of the work environment.
- The opportunity to build something meaningful and see a direct positive impact on people's lives! Dream big, iterate, and experiment to drive innovation!
Posted in: DevOps / SRE
Functional Area: ML / DL / AI Research
Job Code: 1522988