Observability Platform Engineering for Platform & Infrastructure Systems Support


Duration: 5 days (35 hours)

 

Overview

 

This course covers how to build and support the observability platforms that underpin system reliability, uptime, and performance in modern IT environments. Participants explore the three pillars of observability (metrics, logs, and traces) and learn to deploy and operate widely used open-source and enterprise-grade tools such as Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and the ELK Stack. The course emphasizes hands-on implementation, platform integration, dashboarding, alerting, and real-time troubleshooting.

 

Objectives

  • Understand the principles of observability and its role in infrastructure support.
  • Deploy and configure metrics collection systems (e.g., Prometheus, Node Exporter).
  • Implement logging pipelines using ELK (Elasticsearch, Logstash, Kibana) and Loki.
  • Set up distributed tracing using OpenTelemetry and Tempo/Jaeger.
  • Visualize infrastructure health using Grafana dashboards.
  • Configure alerts and incident response pipelines.
  • Integrate observability tools into cloud and Kubernetes environments.
  • Perform root cause analysis (RCA) and capacity planning using observability data.

 

Audience

 

  • Platform Engineers
  • Infrastructure Support Specialists
  • DevOps and SRE Teams
  • System and Cloud Administrators
  • Network Operations Center (NOC) Teams
  • Monitoring and Alerting Engineers

 

Pre-requisites

  • Basic Linux system administration
  • Understanding of infrastructure components (CPU, memory, disk, network)
  • Familiarity with containers, VMs, or cloud infrastructure
  • Basic experience with YAML and shell scripting (recommended)

Content

 

Module 1: Introduction to Observability

 

  • Observability vs. Monitoring
  • The Three Pillars: Metrics, Logs, Traces
  • Why Observability Matters in Platform Support
  • Tool Landscape Overview (Prometheus, Grafana, ELK, Loki, Tempo)

 

Module 2: Metrics Collection and Analysis

 

  • Introduction to Prometheus
  • Node Exporter and Application Exporters
  • Service Discovery, Pull vs. Push Models
  • Grafana Integration for Metrics
  • Hands-on: Deploying Prometheus + Grafana for Server Monitoring

 

Module 3: Centralized Logging Systems

 

  • Architecture of ELK Stack and Loki
  • Ingesting Logs from Linux, Docker, Kubernetes
  • Structuring Logs for Query and Analysis
  • Visualizing Logs in Kibana or in Grafana (with Loki)
  • Hands-on: Deploying Loki or ELK Stack and Searching Logs

 

Module 4: Distributed Tracing and OpenTelemetry

 

  • Introduction to Tracing Concepts
  • Tracers, Spans, and Context Propagation
  • OpenTelemetry: Unified Collection Framework
  • Tempo vs. Jaeger for Trace Storage and Visualization
  • Hands-on: Tracing a Sample App and Viewing in Grafana Tempo

 

Module 5: Dashboards, Alerting, and Notifications

 

  • Building Grafana Dashboards for Infra & App Health
  • Alertmanager and Notification Channels (Email, Slack, etc.)
  • Threshold-based and Behavior-based Alerting
  • Hands-on: Configuring Dashboards and Alert Rules

 

Module 6: Observability in Kubernetes and Cloud

 

  • Monitoring Pods and Nodes with Prometheus Operator
  • Logging and Tracing with Fluent Bit, Loki, and OpenTelemetry Collector
  • Observability with AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring (formerly Stackdriver)
  • Hands-on: Deploying Observability Stack in a K8s Cluster

 

Module 7: Root Cause Analysis and Performance Insights

 

  • Investigating Incidents with Metrics and Logs
  • Tracing User Requests Across Services
  • Using Observability Data for Capacity Planning
  • Hands-on: Performing RCA with a Simulated Outage

 

Module 8: Scaling and Maintaining Observability Platforms

 

  • Scaling Prometheus and Long-Term Storage Options
  • Centralized Logging Optimization (Indices, Retention)
  • Securing Observability Data (RBAC, HTTPS, Token Access)
  • Best Practices for Maintenance and Upgrades

 
