Posted on: 19/08/2025
L2 Observability/AIOps:
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.
SRE ensures internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement while keeping an ever-watchful eye on capacity and performance.
SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating work through automation.
As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.
As an SRE with an observability focus, you will:
- Explore the complex IT estates of our clients to understand their observability/AIOps opportunities and identify areas to improve.
- Collaborate to architect unified observability and AIOps strategies which employ leading AI technology.
- Implement enterprise observability/AIOps technology and processes.
- Amplify observability/AIOps outcomes by accelerating adoption across technology and business organizations.
Responsibilities include:
- Architect observability solutions that address gaps and reduce organizational MTTD and MTTR.
- Develop API-driven microservices that combine into large and complex platforms.
- Plan and execute highly parallel distributed object storage transformations and migrations.
- Maintain automated test suites using CI/CD tools.
- Participate in collaborative projects with small software engineering teams.
- Develop automation, processes, and tools designed to make our services simpler and more robust.
- Participate in troubleshooting, capacity planning and analysis, and performance analysis.
- Advise management on service onboarding strategies and execution.
What we are looking for:
- Entrepreneurs who seek challenging problems to solve.
- Creativity, initiative and acute attention to detail.
- Thirst for innovation and solving problems at lightning speed.
- Passion for automating everything repetitive.
- Obsession with software scalability and performance under high loads.
- Love for using and contributing to open-source software.
Please bring to the table:
- Experience in architecting complex IT solutions.
- Understanding of observability dimensions (metrics, logs, traces).
- Excellent communication and stakeholder management skills.
- Development experience and comfort working in multiple languages (Python, Java, Go, and Ruby a plus).
- Experience working in collaborative coding environments (peer review, continuous integration, etc.).
- 7+ years of application development.
- Experience working in distributed remote teams across multiple time zones.
- Experience in large-scale operations environments.
- 7+ years of experience with Linux/Unix development or systems administration.
- 3+ years of experience with networking systems and technologies.
- Deep understanding of network performance and security.
- Ability to identify tasks that require automation and to implement that automation.
- Experience with configuration management tools such as Puppet, Chef, or SaltStack.
- Hands-on operational experience in a high-volume or critical production service environment: distributed systems, capacity planning, continuous deployment.
- BA/BS in Computer Science preferred, or equivalent experience (advanced degrees preferred).
We have opportunities to work with and learn:
- Object storage: MinIO/S3/etc.
- Data collection: OpenTelemetry/Grafana Alloy/etc.
- Message bus: Kafka/NSQ/etc.
- Scaling databases: Druid/ClickHouse/Cassandra/etc.
- Relational database technologies at large scale: Timescale/Vitess/Postgres/etc.
- Scheduling and orchestration: Kubernetes/OpenShift/Docker.
- Cloud platforms: AWS/Azure.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1532064