MUSA Monitoring Solution

Transparent monitoring solution for
multi-cluster service deployment systems

SESAR Lab logo

Why monitor a system?

Resource allocation

How much of each resource each service uses, and how much each user or service should have access to

Development & Debugging

  • Execution traces
  • Errors/warnings
  • Resource usage
  • Behavior analysis
  • Network traffic

Security & QoS Assurance

Enforce strong security and QoS guarantees on the infrastructure and the services running on it

“The best investment is in the tools of one’s own trade.”

Benjamin Franklin

Events and Time series

Events are tuples

$$\langle{}t, v\rangle{}$$ where $t$ is a timestamp and $v$ is a value of some kind

Time series are sets of events whose values are related, e.g., temperature over time or service logs

$$ \{\langle{}t_1,v_1\rangle{},\ldots,\langle{}t_n,v_n\rangle{}\} $$

Events are stored in time series databases, which are optimized for compressed storage and for data retrieval through a query language (DSL)
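The definitions above can be illustrated with a minimal in-memory sketch. It is not how a real time series database works internally: the `Event` and `TimeSeries` names, the `between` range query, and the use of seconds as the timestamp unit are assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right
from typing import Any, NamedTuple


class Event(NamedTuple):
    """An event <t, v>: a timestamp and a value of some kind."""
    t: float  # timestamp in seconds (assumed unit)
    v: Any


class TimeSeries:
    """Events with related values, kept sorted by timestamp."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        # Insert at the position that keeps the list ordered by timestamp.
        index = bisect_right([e.t for e in self._events], event.t)
        self._events.insert(index, event)

    def between(self, start: float, end: float) -> list[Event]:
        """Return the events with start <= t <= end."""
        timestamps = [e.t for e in self._events]
        low = bisect_left(timestamps, start)
        high = bisect_right(timestamps, end)
        return self._events[low:high]


# Temperature over time, with one event arriving out of order.
temperature = TimeSeries()
for t, v in [(0.0, 21.5), (10.0, 21.7), (5.0, 21.6), (15.0, 22.0)]:
    temperature.append(Event(t, v))

window = temperature.between(5.0, 10.0)
```

Here `window` holds the two events recorded between `t = 5` and `t = 10`, which is the kind of range query a time series DSL typically expresses.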

Logging

Logging records textual events in time series

Values are collected in batches and can be filtered by level or context

Values can be undersampled in production (e.g., only errors, up to a certain number per minute)
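Level filtering and per-minute limits can be sketched with Python's standard `logging` module. The `RateLimitFilter` and `ListHandler` classes, the logger name, and the limit of three records per minute are hypothetical; a real deployment would ship records to a log backend instead of a list.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Let through at most `max_per_minute` records per 60-second window."""

    def __init__(self, max_per_minute: int, clock=time.monotonic) -> None:
        super().__init__()
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        if now - self.window_start >= 60:
            # A new one-minute window begins: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.max_per_minute


class ListHandler(logging.Handler):
    """Keeps formatted records in memory, standing in for a log shipper."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


logger = logging.getLogger("demo.service")  # hypothetical logger name
logger.setLevel(logging.ERROR)              # production: errors only
logger.propagate = False
handler = ListHandler()
handler.addFilter(RateLimitFilter(max_per_minute=3))
logger.addHandler(handler)

logger.warning("dropped by the level filter")
for i in range(5):
    logger.error("failure %d", i)
```

Only the first three errors reach `handler.lines`: the warning is rejected by the level, and the last two errors exceed the per-minute limit.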

Common usages

  • Warning and error logging
  • Application debugging

Monitoring

Monitoring records data events in time series

Values are generally sampled at fixed intervals (5 s)

Common usages

  • Resource usage (CPU, RAM, threads, storage)
  • Application counters (replies, error count, connected users)
  • Performance metrics (replies per second, latency)
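Fixed-interval sampling of an application counter, and the derivation of a rate such as replies per second, can be sketched as follows. The `Counter` class and the `sample` helper are hypothetical, and the clock is simulated so that the example runs instantly; a real sampler would use `time.time` and `time.sleep`, which are the defaults here.

```python
import time
from typing import Callable


class Counter:
    """A monotonically increasing application counter (e.g., replies sent)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        self.value += amount


def sample(read: Callable[[], float], interval: float, count: int,
           clock: Callable[[], float] = time.time,
           sleep: Callable[[float], None] = time.sleep) -> list:
    """Collect `count` events (t, v) by calling `read` every `interval` seconds."""
    events = []
    for i in range(count):
        events.append((clock(), read()))
        if i < count - 1:
            sleep(interval)
    return events


replies = Counter()
now = [0.0]  # simulated clock, in seconds


def fake_sleep(seconds: float) -> None:
    now[0] += seconds
    replies.inc(2)  # pretend two replies were served during the interval


series = sample(lambda: replies.value, interval=5.0, count=3,
                clock=lambda: now[0], sleep=fake_sleep)

# Average rate over the sampled window: counter increase / elapsed time.
(t_first, v_first), (t_last, v_last) = series[0], series[-1]
replies_per_second = (v_last - v_first) / (t_last - t_first)
```

Storing the raw counter and deriving the rate at query time keeps the recorded series simple; the rate is then a property of the query window rather than of the sampler.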

Structured metrics

Metric values can be nested structures (e.g., JSON)

{
	"DCGM_FI_DRIVER_VERSION": "550.127.08",
	"Hostname": "melchior", "UUID": "GPU-b3ea62f8-2bd1-5416-4a5d-f4a65b2522e2",
	"container": "prediction", "device": "nvidia0",
	"gpu": "0", "instance": "10.42.0.155:9400",
	"job": "gpu-metrics", "kubernetes_node": "melchior",
	"modelName": "NVIDIA L40S", "namespace": "rainfall",
	"pci_bus_id": "00000000:01:00.0", "pod": "prediction-0"
}

Tracing

Tracing records execution traces

Values are execution contexts (variable states, function parameters)

Values are collected in batches and can be filtered by level or context

Common usages

  • Application debugging
  • Distributed request tracing

Distributed request tracing

Each request is associated with an ID. Traces carrying that ID can be tracked across multiple microservices.
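The idea can be sketched with three functions standing in for microservices. Everything here is hypothetical: the `X-Request-ID` header name, the service names, and the in-memory `trace_store` that plays the role of a tracing backend. A real setup would rely on a tracing library to generate and propagate the ID.

```python
import uuid

TRACE_HEADER = "X-Request-ID"  # hypothetical header carrying the request ID
trace_store = []               # stands in for a tracing backend


def record(service: str, headers: dict, message: str) -> None:
    """Store one trace entry tagged with the ID of the current request."""
    trace_store.append({
        "request_id": headers[TRACE_HEADER],
        "service": service,
        "message": message,
    })


def frontend(headers=None) -> str:
    headers = dict(headers or {})
    # The entry point assigns an ID unless the caller already supplied one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    record("frontend", headers, "request received")
    orders(headers)  # the same headers travel with the downstream call
    return headers[TRACE_HEADER]


def orders(headers: dict) -> None:
    record("orders", headers, "order looked up")
    billing(headers)


def billing(headers: dict) -> None:
    record("billing", headers, "invoice computed")


first_id = frontend()
second_id = frontend()

# Reconstruct the path of the first request from the shared store.
path = [t["service"] for t in trace_store if t["request_id"] == first_id]
```

Filtering the store by `first_id` recovers the services that the first request went through, in order, even though entries from both requests are interleaved in the same store.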

Extra data

Each record in a time series can carry metadata that gives additional context to the value

  • Application
  • Version
  • Host
  • Deployment identifier

Especially useful when considering complex and dynamic environments

Current Monitoring Infrastructure

D2 Scheme

Components

Prometheus

Scrapes metrics from monitoring-enabled services

Tempo

Monitoring-enabled services can push logs and traces

Loki

Collects monitoring data from Prometheus and Tempo and acts as a long-term storage database.

Promtail collects pod logs from Kubernetes into Loki

Grafana

Visualizes queries over metrics, logs, and traces

What is already being monitored

Kubernetes containers

  • Black box-style resource and logs monitoring
  • Network traffic metrics

Kubernetes infrastructure

  • Authentik proxied requests (OIDC, Outpost)
  • DNS (CoreDNS)
  • Deployments (Argo CD)
  • Docker registry (Harbor)
  • GitLab pipelines
  • Ingress Controller (Nginx)
  • Jupyter notebooks (JupyterHub)
  • Network Certificates (cert-manager)
  • Resources (CPU, RAM, GPUs)
  • S3 Storage (MinIO)
  • Block Storage (Longhorn)
  • Virtual Machines (KubeVirt)

How to enable monitoring on your application?

Metrics

  • Expose metrics in Prometheus format on an HTTP endpoint
  • Create a ServiceMonitor resource pointing to that endpoint
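As a rough illustration of the first step, the sketch below renders a counter in the Prometheus text exposition format and serves it on `/metrics` using only the Python standard library. The metric name, the labels, the `REPLIES` counter, and port 8000 are assumptions; in practice a Prometheus client library would usually handle this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label(value: str) -> str:
    """Escape backslash, double quote and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name: str, kind: str, help_text: str, samples: list) -> str:
    """Render (labels, value) samples as Prometheus exposition text."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {kind}"]
    for labels, value in samples:
        label_text = ",".join(
            f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
        )
        if label_text:
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


REPLIES = {"count": 0}  # hypothetical application counter


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metric(
            "demo_replies_total", "counter", "Replies sent by the application.",
            [({"app": "demo", "version": "1.0"}, REPLIES["count"])],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The ServiceMonitor from the second step would then select the Kubernetes Service that exposes this port, with `/metrics` as the scrape path.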

Logs and traces

  • Instrument the application with OpenTelemetry
  • Configure your application to point to the Tempo collection endpoint

What are we still missing?

  • Monitoring authentication
  • In-cluster communication encryption
  • OpenTelemetry operator for service discovery automation
  • Logs/Traces collection limits
  • Integration with Moon Cloud probes and dashboard