Observability Systems Management :

- Design, deploy, and maintain observability tools and platforms, including monitoring, logging, and tracing systems.

- Ensure optimal configuration and performance of observability tools such as Prometheus, Loki, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), Jaeger, and cloud provider (AWS/GCP/Azure) observability tools.

Monitoring and Alerting :

- Develop and manage dashboards using Kibana/Grafana and set up alerts with ElastAlert and Prometheus Alertmanager to monitor the health and performance of applications and infrastructure.

- Implement robust alerting mechanisms that detect anomalies, outages, and system performance issues and notify the responsible teams in real time.

Logging and Tracing :

- Implement centralized logging solutions to aggregate logs from various systems and applications.

- Develop and maintain distributed tracing solutions to provide end-to-end visibility into system transactions.

Performance Analysis and Optimization :

- Analyze system performance metrics to identify bottlenecks and performance degradation. An understanding of SLOs and SLIs is expected.

- Work with development and operations teams to remediate performance issues and optimize system performance.

Automation and Scripting :

- Create automation scripts to streamline observability tasks and processes.

- Develop self-healing mechanisms through automated incident response.

Collaboration and Communication :

- Work closely with development, operations, and SRE teams to align observability solutions with business and technical requirements.

- Provide guidance and training on observability tools and best practices to other team members.

Documentation and Reporting :

- Create and maintain detailed documentation for observability systems, processes, and procedures.

- Generate periodic reports and dashboards to provide insights into system performance and reliability.

Qualifications and Experience :

Education : Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degree preferred.

Experience :

- Minimum of 4 years of experience in IT infrastructure, with at least 3 years in an observability or monitoring role.

- Proven experience in observability engineering, including deploying and managing observability solutions.

- Experience with monitoring tools (e.g., Prometheus, Grafana), logging tools (e.g., ELK stack), and tracing tools (e.g., Jaeger, OpenTelemetry).

- Experience with cloud platforms such as AWS, Azure, or Google Cloud, and with databases such as MySQL.

Technical Skills :

- Strong understanding of observability concepts including metrics, logging, and tracing.

- Proficiency in scripting or programming languages such as Bash, Python, Perl, or Go.

- Familiarity with containerization (e.g., Docker), orchestration tools (e.g., Kubernetes), and CI/CD pipelines.

- Understanding of IP networking and monitoring of network devices (e.g., routers, firewalls).

- Experience with infrastructure as code tools (e.g., Terraform, Ansible).

Soft Skills :

- Excellent problem-solving and analytical skills.

- Strong communication and collaboration skills.

- Ability to work independently and in a team-oriented environment.

Preferred Qualifications :

- Experience with APM tools like New Relic, Datadog, or Dynatrace.

- Knowledge of service mesh technologies (e.g., Istio).

- Open-source contributions or relevant certifications in observability tools and methodologies.