Posted on: 03/02/2026
Role Overview :
Lead the DevOps and infrastructure team as both a technical leader and hands-on individual contributor, managing the company's growing cloud and on-premises resources with exceptional reliability and performance. You'll be responsible for maintaining 99% uptime for our high-throughput AdTech platform while optimizing costs and building a world-class infrastructure team.
Key Responsibilities :
- Maintain 99% uptime and meet SLAs across all environments while reducing infrastructure costs by 20-30%
- Design and implement deployment architecture for high-throughput systems (25,000-30,000 QPS, sub-100ms latency)
- Manage multi-cloud infrastructure (AWS, DigitalOcean, GCP) using Infrastructure as Code
- Build CI/CD pipelines, monitoring systems, and automation for distributed microservices
- Troubleshoot production issues, including Kafka lag, RabbitMQ failures, and performance problems in Node.js, Python, and Java applications
- Lead incident response (including the on-call rotation) and post-mortems, and implement preventive measures
- Implement security best practices (OAuth, OIDC, SSO) and disaster recovery protocols
- Build and mentor a team of infrastructure engineers
Required Skills & Experience :
Experience : 5+ years in DevOps/Infrastructure roles, including 2+ years with high-throughput systems (10,000+ QPS)
Infrastructure & Cloud (MUST HAVE) :
- Strong production experience with Infrastructure as Code (Terraform, Terragrunt, Ansible)
- Production Kubernetes and Docker experience with complex microservices architectures
- Multi-cloud expertise : AWS (VPC, EC2, ECS, Fargate, S3, Glacier, RDS, Route 53, CloudFront, Lambda, API Gateway, CloudWatch), DigitalOcean, Azure, or GCP
- Advanced Linux system administration (RHEL, Ubuntu, Amazon Linux) and networking concepts
Data Systems (Added Advantage) :
- ClickHouse : Production operations, query optimization, data retention policies for billions of auction records
- Kafka : Consumer/producer optimization, lag management, performance tuning for high-volume message streams (millions of messages/day)
- RabbitMQ : Message routing, cluster management, troubleshooting connection failures in K8s environments
- MySQL : Database administration, replication, backup/recovery
- Elasticsearch : Bulk indexing optimization, cluster health management
Development & CI/CD :
- CI/CD tools : GitHub Actions, Jenkins, GitLab CI, or similar
- Programming : Python (required), Shell scripting (required); Rust or Go strongly preferred
- JVM troubleshooting : Profiling, GC tuning, memory leak detection, and an understanding of Java Spring Boot applications
- Microservices architectures and API design patterns
- Software development lifecycle and agile methodologies
Monitoring & Observability :
- Prometheus, Grafana, ELK stack (Elasticsearch, Logstash, Kibana, Filebeat)
- System performance troubleshooting under load (CPU bottlenecks, memory leaks, network latency)
- Incident response and production support with a systematic debugging approach
- Understanding of RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors)
Nice to Have (Strong Bonus) :
AdTech & Domain Knowledge :
- Experience with programmatic advertising and Real-Time Bidding (RTB) systems
- Understanding of ad auction mechanics and sub-100ms latency requirements
- Familiarity with ad fraud prevention and transparency measures
- Knowledge of supply-side platforms (SSP) and demand-side platforms (DSP)
Blockchain & Distributed Systems :
- Blockchain infrastructure and node operations (Sui ecosystem experience is a major bonus)
- Experience with decentralized storage systems (Walrus, IPFS, Arweave)
- Data pipeline integration between blockchain and distributed storage
- Understanding of consensus mechanisms and distributed ledger technology
Advanced Technical Skills :
- Rust or Go programming experience
- MLOps practices and tooling
- Security systems implementation (OAuth 2.0, OIDC, SSO with Okta/Auth0)
- Data lifecycle management and GDPR/privacy compliance awareness
- Experience with high-frequency trading or financial systems
- Experience in start-up or R&D environments with rapid iteration
- Relevant cloud certifications (AWS Certified DevOps Engineer Professional, CKA, CKAD)
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1609418