Job Title : MLOps Engineer

Experience : 5+ Years

Location : Remote

Interview Mode : Fully Virtual

Job Type : Contract / Full-time

Job Description :

We are seeking an experienced MLOps Engineer to design, build, and manage scalable machine learning infrastructure and deployment pipelines for AI solutions. The ideal candidate will have hands-on experience in GPU cluster development, model lifecycle management, and deployment of large-scale neural networks. This role requires expertise in orchestrating end-to-end AI and large language model (LLM) applications in production environments.

Key Responsibilities :

ML Infrastructure & Cluster Management :

- Design, implement, and manage GPU clusters for training and inference of machine learning models.

- Optimize resource utilization and performance of large-scale ML workloads.

- Ensure high availability, reliability, and scalability of ML infrastructure.

Model Development & Deployment :

- Manage the entire model lifecycle, including development, training, validation, deployment, monitoring, and retraining.

- Deploy large-scale neural network models to production environments efficiently and reliably.

- Implement best practices for continuous integration and continuous deployment (CI/CD) in ML workflows.

Orchestration & AI Solutions :

- Architect and deploy end-to-end AI solutions, integrating ML pipelines with data engineering and application layers.

- Orchestrate LLM applications and ensure optimal performance for large-scale inference.

- Collaborate with data scientists, ML engineers, and software engineers to deploy models into production.

Monitoring & Optimization :

- Monitor model performance, resource utilization, and infrastructure health in production.

- Identify bottlenecks and implement optimization strategies for training and inference pipelines.

- Ensure reproducibility, scalability, and robustness of ML models in production.

Required Skills & Qualifications :

- 5+ years of experience in MLOps, AI infrastructure, or related roles.

- Hands-on experience with GPU cluster development and management.

- Strong understanding of model lifecycle management and end-to-end AI solutions.

- Experience deploying large-scale neural network models in production.

- Familiarity with orchestrating LLM applications.

- Proficiency with ML frameworks such as PyTorch, TensorFlow, or JAX.

- Strong programming and scripting skills (Python, Bash, or similar).

- Experience with CI/CD pipelines and ML workflow orchestration tools (e.g., Kubeflow, MLflow, Airflow).

Preferred Attributes :

- Experience with cloud-based ML infrastructure (AWS, GCP, Azure) and distributed training.

- Knowledge of containerization and orchestration (Docker, Kubernetes).

- Strong problem-solving, analytical, and debugging skills.

- Excellent communication skills and the ability to work effectively in a fully remote setup.