MUSA Monitoring Solution

Transparent monitoring solution for
multi-cluster service deployment systems

Why monitor a system?

Resource allocation

How much of each resource is used by each service, and how much should each user/service have access to

Development & Debugging

Execution traces
Errors/warnings
Resource usage
Behavior analysis
Network traffic

Security & QoS Assurance

Enforce strong security and QoS guarantees on the infrastructure and on the services running on it

“The best investment is in the tools of one’s own trade.”

Benjamin Franklin

Events and Time series

Events are tuples

$$\langle{}t, v\rangle{}$$ where $t$ is a timestamp and $v$ is a value of some kind

Time series are sets of events whose values are related, e.g., temperature over time or service logs

$$ \{\langle{}t_1,v_1\rangle{},\ldots,\langle{}t_n,v_n\rangle{}\} $$

Events are stored in time series databases, which are optimized for storage compression and for data retrieval through a query DSL
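The definitions above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how a real time series database stores data; the `TimeSeries` class, its method names, and the sample temperature values are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Event = Tuple[float, Any]  # (t, v): a timestamp and a value of some kind


@dataclass
class TimeSeries:
    """Hypothetical in-memory series: events kept sorted by timestamp."""
    name: str
    events: List[Event] = field(default_factory=list)

    def append(self, t: float, v: Any) -> None:
        # Insert at the position that keeps timestamps ordered.
        idx = bisect_right([e[0] for e in self.events], t)
        self.events.insert(idx, (t, v))

    def range(self, start: float, end: float) -> List[Event]:
        # Return the events with start <= t <= end.
        times = [e[0] for e in self.events]
        return self.events[bisect_left(times, start):bisect_right(times, end)]


# Made-up temperature readings, appended out of order on purpose.
temperature = TimeSeries("temperature")
for t, v in [(10.0, 21.5), (0.0, 21.0), (5.0, 21.2)]:
    temperature.append(t, v)
```

A range query such as `temperature.range(0.0, 5.0)` is the simplest form of the retrieval that a query DSL offers on top of such data.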

Logging

Logging records textual events in time series

Values are collected in batches and can be filtered by level or context

Values can be undersampled in production (only errors, up to a certain number per minute)

Common usages

Warning and error logging
Application debugging
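Level filtering and undersampling can be sketched with Python's standard `logging` module. The `RateLimitFilter` and `ListHandler` classes, the logger name, and the limit of two records per minute are assumptions chosen for illustration; a production setup would ship records to a log collector rather than keep them in a list.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Hypothetical filter: pass at most `max_per_minute` records per minute."""

    def __init__(self, max_per_minute: int, clock=time.monotonic):
        super().__init__()
        self.max_per_minute = max_per_minute
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        if now - self.window_start >= 60.0:
            # A new one-minute window begins: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.max_per_minute


class ListHandler(logging.Handler):
    """Collects messages in memory, standing in for a log shipper."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.setLevel(logging.ERROR)  # production profile: keep errors only
handler.addFilter(RateLimitFilter(max_per_minute=2))

logger = logging.getLogger("demo.service")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.warning("disk almost full")  # dropped: below ERROR
for i in range(5):
    logger.error("request %d failed", i)  # only the first two are kept
```

The level check runs before the filter, so discarded warnings do not consume the per-minute budget.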

Monitoring

Monitoring records data events in time series

Values are generally sampled at fixed intervals (e.g., every 5 s)

Common usages

Resource usage (CPU, RAM, threads, storage)
Application counters (replies, errors count, connected users)
Performance metrics (replies per second, latency)
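As a rough illustration of how a performance metric such as replies per second can be derived from a counter sampled at fixed intervals, consider the sketch below. The function names and the counter values are made up, and the clock is simulated so the result is deterministic.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def sample(read: Callable[[], float], start: float,
           interval: float, n: int) -> List[Sample]:
    """Read a metric n times, one sample every `interval` seconds (simulated clock)."""
    return [(start + i * interval, read()) for i in range(n)]


def rate(series: List[Sample]) -> List[Sample]:
    """Per-second rate between consecutive samples of an increasing counter."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(series, series[1:])
    ]


# Hypothetical cumulative reply counts observed every 5 s.
replies = iter([0, 50, 150, 175])
series = sample(lambda: next(replies), start=0.0, interval=5.0, n=4)
replies_per_second = rate(series)
```

Storing the raw counter and deriving the rate at query time keeps the stored series simple; the rate can then be recomputed over any window.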

Structured metrics

Metric values can be nested structures (e.g., JSON)

{
	"DCGM_FI_DRIVER_VERSION": "550.127.08",
	"Hostname": "melchior", "UUID": "GPU-b3ea62f8-2bd1-5416-4a5d-f4a65b2522e2",
	"container": "prediction", "device": "nvidia0",
	"gpu": "0", "instance": "10.42.0.155:9400",
	"job": "gpu-metrics", "kubernetes_node": "melchior",
	"modelName": "NVIDIA L40S", "namespace": "rainfall",
	"pci_bus_id": "00000000:01:00.0", "pod": "prediction-0"
}

Tracing

Tracing records execution traces

Values are execution contexts (variable states, function parameters)

Values are collected in batches and can be filtered by level or context

Common usages

Application debugging
Distributed request tracing

Distributed request tracing

Each request is associated with an ID
Traces with that ID can be tracked across multiple microservices
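The ID propagation described above can be sketched as follows. This is a simplified model and not the OpenTelemetry API: the `X-Request-ID` header name, the `handle` and `trace` helpers, and the three service names are assumptions for illustration only.

```python
import uuid
from typing import Dict, List, Tuple

TRACE_HEADER = "X-Request-ID"  # assumed header name for this sketch
spans: List[Tuple[str, str, str]] = []  # (request_id, service, operation)


def handle(service: str, operation: str, headers: Dict[str, str]) -> Dict[str, str]:
    """Record a span and return the headers to forward downstream."""
    # Reuse the incoming ID, or mint one at the edge of the system.
    request_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    spans.append((request_id, service, operation))
    return {**headers, TRACE_HEADER: request_id}


def trace(request_id: str) -> List[Tuple[str, str]]:
    """All (service, operation) pairs recorded for one request, in order."""
    return [(svc, op) for rid, svc, op in spans if rid == request_id]


# One request crossing three hypothetical services.
h = handle("gateway", "GET /forecast", {})
h = handle("prediction", "run_model", h)
h = handle("storage", "read_tiles", h)

# An unrelated request gets its own ID and stays out of the trace above.
handle("gateway", "GET /health", {})
```

Calling `trace(h[TRACE_HEADER])` returns the three spans of the first request in order, which is the view a tracing backend reconstructs from the shared ID.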

Extra data

Each record in a time series can carry several metadata fields that provide additional context to the value

Application
Version
Host
Deployment identifier
…

Especially useful when considering complex and dynamic environments

Current Monitoring Infrastructure

Components

Prometheus

Scrapes metrics from monitoring-enabled services

Tempo

Monitoring-enabled services can push logs and traces

Loki

Collects monitoring data from Prometheus and Tempo and acts as a long-term storage database.

Promtail collects pod logs from Kubernetes into Loki

Grafana

Visualizes queries over metrics, logs, and traces

What is already being monitored

Kubernetes containers

Black-box-style resource and log monitoring
Network traffic metrics

Kubernetes infrastructure

Authentik proxied requests (OIDC, Outpost)
DNS (CoreDNS)
Deployments (Argo CD)
Docker registry (Harbor)
GitLab pipelines
Ingress Controller (Nginx)
Jupyter notebooks (JupyterHub)
Network Certificates (cert-manager)
Resources (CPU, RAM, GPUs)
S3 Storage (MinIO)
Block Storage (Longhorn)
Virtual Machines (KubeVirt)

How to enable monitoring on your application?

Metrics

Expose metrics in Prometheus format on an HTTP endpoint
Create a ServiceMonitor resource pointing to that endpoint
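A minimal sketch of the first step, using only the Python standard library, might look like the following. The metric names, their values, the `/metrics` path, and port 8000 are assumptions for this example; in practice a client library such as `prometheus_client` usually takes care of the formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application counters; a real service would update these.
COUNTERS = {
    ("demo_replies_total", "Replies sent by the service."): 42,
    ("demo_errors_total", "Errors raised while replying."): 3,
}


def render_metrics(counters) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for (name, help_text), value in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the metrics on /metrics only (assumed path).
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` inside the container exposes the endpoint; the ServiceMonitor would then reference the Kubernetes Service port and the `/metrics` path so that Prometheus can scrape it.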

Logs and traces

Instrument the application with OpenTelemetry
Configure your application to point to the Tempo collection endpoint

What are we still missing?

Monitoring authentication
In-cluster communication encryption
OpenTelemetry operator for service discovery automation
Logs/Traces collection limits
Integration with Moon Cloud probes and dashboard