L2 Observability/AIOps:

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.

SRE ensures that internally critical and externally visible systems have reliability and uptime appropriate to users' needs, along with a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.

SRE is a mindset and a set of engineering approaches focused on optimizing existing systems, building infrastructure, and eliminating manual work through automation.

As a Site Reliability Engineer with a focus on observability, you will build and operate next-generation observability platforms.

As an SRE with an observability focus, you will:

- Explore the complex IT estates of our clients to understand their observability/AIOps opportunities and identify areas for improvement.

- Collaborate to architect unified observability and AIOps strategies that employ leading AI technology.

- Implement enterprise observability/AIOps technology and processes.

- Amplify observability/AIOps outcomes by accelerating adoption across technology and business organizations.

Responsibilities include:

- Architecting observability solutions that close gaps and reduce organizational MTTD and MTTR.

- Developing API-driven microservices that combine into large and complex platforms.

- Planning and executing highly parallel distributed object storage transformations and migrations.

- Maintaining automated test suites using CI/CD tools.

- Participating in collaborative projects with small software engineering teams.

- Developing automation, processes, and tools designed to make our services simpler and more robust.

- Participating in troubleshooting, capacity planning and analysis, and performance analysis.

- Advising management on service onboarding strategies and execution.

What we are looking for:

- Entrepreneurs who seek challenging problems to solve.

- Creativity, initiative, and acute attention to detail.

- Thirst for innovation and solving problems at lightning speed.

- Passion for automating everything repetitive.

- Obsession with software scalability and performance under high loads.

- Love for using and contributing to open-source software.

Please bring to the table:

- Experience in architecting complex IT solutions.

- Understanding of observability dimensions (metrics, logs, traces).

- Excellent communication and stakeholder management skills.

- Development experience and comfort working in multiple languages (Python, Java, Go, and Ruby a plus).

- Experience working in collaborative coding environments (peer review, continuous integration, etc.).

- 7+ years of application development.

- Experience working in distributed remote teams across multiple time zones.

- Experience in large-scale operations environments.

- 7+ years of experience with Linux/Unix development or systems administration.

- 3+ years of experience with networking systems and technologies.

- Deep understanding of network performance and security.

- Ability to identify tasks that require automation and to implement that automation.

- Experience with configuration management tools such as Puppet, Chef, or SaltStack.

- Hands-on operational experience in a high-volume or critical production service environment, including distributed systems, capacity planning, and continuous deployment.

- BA/BS in Computer Science preferred, or equivalent experience (advanced degrees preferred).

We have opportunities to work with and learn:

- Object storage: MinIO/S3/etc.

- Data collection: OpenTelemetry/Grafana Alloy/etc.

- Message bus: Kafka/NSQ/etc.

- Scaling databases: Druid/ClickHouse/Cassandra/etc.

- Relational database technologies at large scale: Timescale/Vitess/Postgres/etc.

- Scheduling and orchestration: Kubernetes/OpenShift/Docker.

- Cloud platforms: AWS/Azure.