- Design, deploy, and maintain observability tools and platforms, including monitoring, logging, and tracing systems.
- Ensure optimal configuration and performance of observability tools such as Prometheus, Loki, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), Jaeger, and cloud (AWS/GCP/Azure) observability tools.
Monitoring and Alerting :
- Develop and manage dashboards using Kibana/Grafana and set up alerts with ElastAlert and Prometheus Alertmanager to monitor the health and performance of applications and infrastructure.
- Implement robust alerting mechanisms that detect anomalies, outages, and system performance issues and notify the responsible teams in real time.
Logging and Tracing :
- Implement centralized logging solutions to aggregate logs from various systems and applications.
- Develop and maintain distributed tracing solutions to provide end-to-end visibility into system transactions.
Performance Analysis and Optimization :
- Analyze system performance metrics to identify bottlenecks and performance degradation, applying an understanding of SLOs and SLIs.
- Work with development and operations teams to remediate performance issues and optimize system performance.
Automation and Scripting :
- Create automation scripts to streamline observability tasks and processes.
- Develop self-healing mechanisms through automated incident response.
Collaboration and Communication :
- Work closely with development, operations, and SRE teams to align observability solutions with business and technical requirements.
- Provide guidance and training on observability tools and best practices to other team members.
Documentation and Reporting :
- Create and maintain detailed documentation for observability systems, processes, and procedures.
- Generate periodic reports and dashboards to provide insights into system performance and reliability.
Qualifications and Experience :
Education : Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degree preferred.
Experience :
- Minimum of 4 years of experience in IT infrastructure, with at least 3 years in an observability or monitoring role.
- Proven experience in observability engineering, including deploying and managing observability solutions.
- Experience with monitoring tools (e.g., Prometheus, Grafana), logging tools (e.g., ELK stack), and tracing tools (e.g., Jaeger, OpenTelemetry).
- Experience with cloud platforms such as AWS, Azure, or Google Cloud, and with databases such as MySQL.
Technical Skills :
- Strong understanding of observability concepts including metrics, logging, and tracing.
- Proficiency in scripting or programming languages such as Bash, Python, Perl, or Go.
- Familiarity with containerization (e.g., Docker), orchestration tools (e.g., Kubernetes), and CI/CD pipelines.
- Understanding of IP networking and monitoring of network devices (e.g., routers, firewalls).
- Experience with infrastructure as code tools (e.g., Terraform, Ansible).
Soft Skills :
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration skills.
- Ability to work independently and in a team-oriented environment.
Preferred Qualifications :
- Experience with APM tools like New Relic, Datadog, or Dynatrace.
- Knowledge of service mesh technologies (e.g., Istio).
- Open-source contributions or relevant certifications in observability tools and methodologies.