Posted on: 28/11/2025
Description :
Primary Focus : Infrastructure & Platform - building the scalable, reliable "operating system" that runs the agents.
Core Expertise : Distributed Systems, Kubernetes, Cloud, Scalability, APIs.
Key Deliverable : A robust, high-performance platform that can manage thousands of agents.
Analogy : Builds the highway and traffic control system.
We are looking for an experienced Senior AI Platform Engineer to join our team and lead the development of our core AI execution infrastructure. This role is central to our strategy, focusing on designing, building, and scaling a robust, high-performance platform to deploy and manage thousands of concurrent AI agents. You will be responsible for ensuring the platform provides reliability, observability, and cost-efficiency at scale.
What You Will Do :
- Platform Architecture : Design and implement the core components of the AI platform, including runtime environments, distributed scheduling, and resource management systems (e.g., CPU/GPU compute clusters, autoscaling).
- Scalability & Performance : Develop and optimise distributed systems and microservices to meet the high-throughput, low-latency requirements of agent execution at scale.
- Infrastructure Automation : Work closely with DevOps/SRE teams to automate deployment, scaling, and monitoring using technologies like Kubernetes, Terraform, and CI/CD pipelines.
- Agent Lifecycle : Implement APIs and services that manage the full lifecycle of an AI agent, from ingestion and registration to execution, monitoring, and versioning (a minimal illustrative sketch follows this list).
- Observability : Implement comprehensive logging, tracing, and metrics for all platform components to provide deep insights into agent behaviour and system health.
- Best Practices : Drive best practices for code quality, security, and operational excellence within the platform engineering team.
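For illustration only, here is a minimal Python sketch of the kind of agent lifecycle interface described in the list above. All names (AgentRegistry, AgentRecord, register/start/status) are hypothetical placeholders, not an existing platform API; a production service would back this with persistent storage and a scheduler such as Kubernetes.

```python
# Hypothetical sketch of an agent lifecycle service; names are illustrative only.
from dataclasses import dataclass
from enum import Enum
from uuid import uuid4


class AgentState(Enum):
    """Lifecycle stages named in the role description."""
    REGISTERED = "registered"
    RUNNING = "running"
    STOPPED = "stopped"


@dataclass
class AgentRecord:
    """Minimal registry entry: identity, version, and current state."""
    agent_id: str
    name: str
    version: str
    state: AgentState = AgentState.REGISTERED


class AgentRegistry:
    """In-memory stand-in for a platform lifecycle service."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, name: str, version: str) -> AgentRecord:
        # Ingestion/registration step: assign an ID and record the version.
        record = AgentRecord(agent_id=str(uuid4()), name=name, version=version)
        self._agents[record.agent_id] = record
        return record

    def start(self, agent_id: str) -> AgentRecord:
        # Execution step: a real platform would schedule a workload here
        # (e.g. a container on a compute cluster); this sketch only flips state.
        record = self._agents[agent_id]
        record.state = AgentState.RUNNING
        return record

    def status(self, agent_id: str) -> AgentState:
        # Monitoring step: expose current state for observability tooling.
        return self._agents[agent_id].state


if __name__ == "__main__":
    registry = AgentRegistry()
    agent = registry.register(name="summariser", version="1.0.0")
    registry.start(agent.agent_id)
    print(agent.agent_id, registry.status(agent.agent_id).value)
```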
Key Requirements :
- 10+ years of professional software engineering experience, with a significant focus on cloud infrastructure and platform development.
- Mandatory : Proven expertise in developing, deploying, and scaling distributed systems (e.g., Kafka or HPC distributed compute frameworks) in a production environment.
- Expert-level proficiency in at least one modern programming language (e.g., C#/.NET, Python, Rust).
- Deep practical experience with container orchestration (e.g., Kubernetes, Knative), cloud providers (e.g., Azure, AWS), and infrastructure as code (e.g., Terraform).
- Solid understanding of networking, security, and performance optimisation for data-intensive applications.
- Experience building high-throughput APIs (REST/gRPC) and developing platform service interfaces.
Bonus Points :
- Prior experience in MLOps, LLMOps, or building systems specifically for running machine learning models or AI agents.
- Familiarity with agentic frameworks, large language models (LLMs), agent protocols (MCP, A2A) and their unique deployment challenges.
- Experience with high-performance computing (HPC) environments or GPU virtualisation.
- Background at a leading AI software company or a software company delivering LLM frameworks.
Posted in: DevOps / SRE
Functional Area: ML / DL Engineering
Job Code: 1582059