Observability Platform Engineering for Platform & Infrastructure Systems Support

Duration 5 days – 35 hrs

 

Overview

 

This course provides a comprehensive understanding of building and supporting observability platforms essential for maintaining system reliability, uptime, and performance in modern IT environments. Participants will explore the three pillars of observability (metrics, logs, and traces) while learning to deploy and operate popular open-source and enterprise-grade tools such as Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and the ELK Stack. The course emphasizes hands-on implementation, platform integration, dashboarding, alerting, and real-time troubleshooting.

 

Objectives

  • Understand the principles of observability and its role in infrastructure support.
  • Deploy and configure metrics collection systems (e.g., Prometheus, Node Exporter).
  • Implement logging pipelines using ELK (Elasticsearch, Logstash, Kibana) and Loki.
  • Set up distributed tracing using OpenTelemetry and Tempo/Jaeger.
  • Visualize infrastructure health using Grafana dashboards.
  • Configure alerts and incident response pipelines.
  • Integrate observability tools into cloud and Kubernetes environments.
  • Perform root cause analysis (RCA) and capacity planning using observability data.

 

Audience

 

  • Platform Engineers
  • Infrastructure Support Specialists
  • DevOps and SRE Teams
  • System and Cloud Administrators
  • Network Operations Center (NOC) Teams
  • Monitoring and Alerting Engineers

 

Pre-requisites

  • Basic Linux system administration
  • Understanding of infrastructure components (CPU, memory, disk, network)
  • Familiarity with containers, VMs, or cloud infrastructure
  • Basic experience with YAML and shell scripting (recommended)

Content

 

Module 1: Introduction to Observability

 

  • Observability vs. Monitoring
  • The Three Pillars: Metrics, Logs, Traces
  • Why Observability Matters in Platform Support
  • Tool Landscape Overview (Prometheus, Grafana, ELK, Loki, Tempo)

 

Module 2: Metrics Collection and Analysis

 

  • Introduction to Prometheus
  • Node Exporter and Application Exporters
  • Service Discovery, Pull vs. Push Models
  • Grafana Integration for Metrics
  • Hands-on: Deploying Prometheus + Grafana for Server Monitoring
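
To make the pull model concrete: in a pull setup, Prometheus periodically scrapes an HTTP endpoint that returns plain-text metric lines. The sketch below only renders such lines; it is not the official client library, and the function name `render_metrics` and the `(name, labels, value)` tuple shape are assumptions chosen for illustration.

```python
def _escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    text = str(value)
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(samples):
    """Render samples as text-format metric lines, one sample per line.

    samples: list of (name, labels_dict, value) tuples. This shape is a
    hypothetical convenience for the example, not a Prometheus API.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is deterministic.
            label_str = ",".join(
                f'{key}="{_escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics([
        ("node_load1", {}, 0.5),
        ("http_requests_total", {"method": "get", "code": "200"}, 3),
    ]), end="")
```

In practice an exporter or client library produces this output and serves it on a path such as `/metrics`; the sketch is only meant to show what a scrape actually reads.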

 

Module 3: Centralized Logging Systems

 

  • Architecture of ELK Stack and Loki
  • Ingesting Logs from Linux, Docker, Kubernetes
  • Structuring Logs for Query and Analysis
  • Visualizing Logs in Kibana or in Grafana (with Loki as the data source)
  • Hands-on: Deploying Loki or ELK Stack and Searching Logs

 

Module 4: Distributed Tracing and OpenTelemetry

 

  • Introduction to Tracing Concepts
  • Tracers, Spans, and Context Propagation
  • OpenTelemetry: Unified Collection Framework
  • Tempo vs. Jaeger for Trace Storage and Visualization
  • Hands-on: Tracing a Sample App and Viewing in Grafana Tempo
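
Context propagation can be illustrated with the W3C Trace Context `traceparent` header, which carries a trace ID and a parent span ID between services. The sketch below is a simplified, hand-rolled illustration under that assumption; in a real deployment the OpenTelemetry SDK handles this, and the helper names here (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and fresh span ID."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Split a traceparent header into its fields, or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span id")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
    }


def child_traceparent(incoming):
    """Build the header for a downstream call: same trace, new span ID."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

The key point is that every hop keeps the trace ID and replaces only the span ID, which is what lets a backend such as Tempo or Jaeger stitch spans from different services into one trace.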

 

Module 5: Dashboards, Alerting, and Notifications

 

  • Building Grafana Dashboards for Infra & App Health
  • Alertmanager and Notification Channels (Email, Slack, etc.)
  • Threshold-based and Behavior-based Alerting
  • Hands-on: Configuring Dashboards and Alert Rules
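
The difference between the two alerting styles above can be sketched as follows. A threshold-based check compares a value to a fixed limit. "Behavior-based" is interpreted here, as an assumption, as deviation from a recent baseline measured in standard deviations. The function names and the default of three standard deviations are illustrative choices, not settings from Alertmanager or Grafana.

```python
from statistics import mean, pstdev


def threshold_alert(value, limit):
    """Fire when the current value exceeds a fixed limit."""
    return value > limit


def deviation_alert(history, value, max_sigma=3.0):
    """Fire when the value is far from the recent baseline.

    history: recent observations forming the baseline.
    max_sigma: allowed distance from the mean, in standard deviations
               (3.0 is an arbitrary illustrative default).
    """
    if len(history) < 2:
        return False  # not enough data to define a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat baseline: any change counts as a deviation.
        return value != baseline
    return abs(value - baseline) / spread > max_sigma


if __name__ == "__main__":
    cpu_history = [50, 52, 48, 51, 49]
    # 70% CPU is below a 90% threshold, yet unusual for this host.
    print(threshold_alert(70, 90), deviation_alert(cpu_history, 70))
```

A value can stay under a static limit and still be abnormal for a given host, which is the case the second check is meant to catch.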

 

Module 6: Observability in Kubernetes and Cloud

 

  • Monitoring Pods and Nodes with Prometheus Operator
  • Logging and Tracing with Fluent Bit, Loki, and OpenTelemetry Collector
  • Observability with AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring (formerly Stackdriver)
  • Hands-on: Deploying Observability Stack in a K8s Cluster

 

Module 7: Root Cause Analysis and Performance Insights

 

  • Investigating Incidents with Metrics and Logs
  • Tracing User Requests Across Services
  • Using Observability Data for Capacity Planning
  • Hands-on: Performing RCA with a Simulated Outage
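
As a small illustration of capacity planning from observability data, the sketch below fits a straight line to recent usage samples and estimates how long until a resource is full. It assumes roughly linear growth, which real workloads often violate; the function name `hours_until_full` and the `(hour, used)` sample shape are hypothetical.

```python
def hours_until_full(samples, capacity):
    """Estimate hours from the latest sample until usage reaches capacity.

    samples: list of (hour, used) pairs, e.g. disk usage in GB per hour.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    latest = max(t for t, _ in samples)
    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - latest)


if __name__ == "__main__":
    disk_gb = [(0, 100), (1, 110), (2, 120), (3, 130)]
    print(hours_until_full(disk_gb, capacity=200))
```

Prometheus offers a comparable built-in, `predict_linear`, for use in queries and alert rules; the sketch only shows the underlying arithmetic.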

 

Module 8: Scaling and Maintaining Observability Platforms

 

  • Scaling Prometheus and Long-Term Storage Options
  • Centralized Logging Optimization (Indices, Retention)
  • Securing Observability Data (RBAC, HTTPS, Token Access)
  • Best Practices for Maintenance and Upgrades

 
