Transparent monitoring solution for
multi-cluster service deployment systems
How much of each resource is used by each service, and how much should each user or service have access to?
Enforce strong security and QoS guarantees on the infrastructure and on the services running on it
“The best investment is in the tools of one’s own trade.”
Benjamin Franklin
Events are tuples
$$\langle{}t, v\rangle{}$$ where $t$ is a timestamp and $v$ is a value of some kind
Time series are sets of events whose values are related, e.g., temperature over time or service logs
$$ \{\langle{}t_1,v_1\rangle{},\ldots,\langle{}t_n,v_n\rangle{}\} $$
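The definitions above can be sketched in a few lines of Python. This is only an illustration: the `Event` and `TimeSeries` names, the choice of float timestamps, and the temperature readings are assumptions, not part of any particular time series database.

```python
from bisect import insort
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True, order=True)
class Event:
    """An event <t, v>: a timestamp and a value."""
    t: float                       # timestamp (assumed: seconds)
    v: Any = field(compare=False)  # value, ignored when ordering events


class TimeSeries:
    """A set of related events, kept sorted by timestamp."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, t: float, v: Any) -> None:
        # insort keeps the list ordered even if events arrive out of order
        insort(self._events, Event(t, v))

    def between(self, start: float, end: float) -> list[Event]:
        # Simple range query over the closed interval [start, end]
        return [e for e in self._events if start <= e.t <= end]


# Illustrative temperature readings, inserted out of order
temperature = TimeSeries()
temperature.append(10.0, 21.5)
temperature.append(5.0, 21.0)
temperature.append(15.0, 22.1)
window = temperature.between(5.0, 10.0)
```

Here `window` holds the two events with timestamps 5.0 and 10.0, in time order. A real database would add compression and indexing on top of this basic shape.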
Events are stored in time series databases, which are optimized for compressed storage and for data retrieval through a query language (DSL)
Logging records textual events in time series
Values are collected in batches and can be filtered by level or context
Values can be undersampled in production (e.g., errors only, up to a certain number per minute)
Warning and error logging
Application debugging
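Level filtering and undersampling can be sketched with Python's standard `logging` module. The `RateLimitFilter` and `ListHandler` classes, the logger name, and the limit of two records per minute are hypothetical choices made for illustration.

```python
import logging
import time


class RateLimitFilter(logging.Filter):
    """Let at most `max_per_minute` records through per 60-second window."""

    def __init__(self, max_per_minute: int) -> None:
        super().__init__()
        self.max_per_minute = max_per_minute
        self.window_start = time.monotonic()
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60.0:
            # A new window starts: reset the counter
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count <= self.max_per_minute


class ListHandler(logging.Handler):
    """Collects formatted messages in memory (stands in for a log backend)."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("service.production")  # hypothetical logger name
logger.setLevel(logging.ERROR)   # production setting: errors only
logger.propagate = False         # keep the example's output self-contained
handler = ListHandler()
handler.addFilter(RateLimitFilter(max_per_minute=2))
logger.addHandler(handler)

logger.warning("dropped: below the ERROR level")
for i in range(5):
    logger.error("failure %d", i)
```

With these settings the warning is discarded by the level check, and only the first two errors of the current minute reach `handler.messages`.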
Monitoring records data events in time series
Values are generally sampled at fixed intervals (e.g., every 5 s)
Resource usage (CPU, RAM, threads, storage)
Application counters (replies, error count, connected users)
Performance metrics (replies per second, latency)
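Performance metrics such as replies per second are often derived from counters sampled at fixed intervals. The sketch below shows that derivation under simple assumptions: the `rates` helper is hypothetical, the counter never resets, and the sample values are invented.

```python
def rates(samples: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Per-second rate of change between consecutive counter samples.

    `samples` is a time-ordered list of (timestamp, counter value) pairs.
    Each result is stamped with the time of the later sample.
    """
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


# Illustrative reply counter sampled every 5 seconds
replies = [(0.0, 100), (5.0, 150), (10.0, 225)]
replies_per_second = rates(replies)
```

For these samples the counter grows by 50 and then by 75 over 5-second intervals, so `replies_per_second` is `[(5.0, 10.0), (10.0, 15.0)]`.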
Metric values can be nested structures (e.g., JSON)
{
"DCGM_FI_DRIVER_VERSION" = "550.127.08",
"Hostname" = "melchior", "UUID" = "GPU-b3ea62f8-2bd1-5416-4a5d-f4a65b2522e2",
"container" = "prediction", "device" = "nvidia0",
"gpu" = "0", "instance" = "10.42.0.155:9400",
"job" = "gpu-metrics", "kubernetes_node" = "melchior",
"modelName" = "NVIDIA L40S", "namespace" = "rainfall",
"pci_bus_id" = "00000000:01:00.0", "pod" = "prediction-0"
}
Tracing records execution traces
Values are execution contexts (variable states, function parameters)
Values are collected in batches and can be filtered by level or context
Application debugging
Distributed request tracing
Each request is associated with an ID
Traces with that ID can be tracked across multiple microservices
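The idea of following one request ID across microservices can be sketched as follows. The two services, the `X-Request-ID` header name, and the in-memory `spans` list are assumptions for illustration; a real deployment would use a tracing library and a trace backend instead.

```python
import uuid

# Collected trace records: (request ID, service, operation)
spans: list[tuple[str, str, str]] = []


def inventory_service(headers: dict[str, str], item: str) -> None:
    # Downstream service: reuse the ID received from the caller
    request_id = headers["X-Request-ID"]
    spans.append((request_id, "inventory", f"reserve {item}"))


def frontend_service(item: str) -> str:
    # Entry point: assign a fresh ID to the incoming request
    request_id = uuid.uuid4().hex
    spans.append((request_id, "frontend", f"order {item}"))
    # Propagate the ID with the outgoing call
    inventory_service({"X-Request-ID": request_id}, item)
    return request_id


first = frontend_service("book")
second = frontend_service("lamp")

# Reconstruct the path of the first request across both services
first_trace = [(service, op) for rid, service, op in spans if rid == first]
```

Filtering on `first` yields `[("frontend", "order book"), ("inventory", "reserve book")]`, even though records of the second request are stored in the same list.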
Each record in a time series can carry several metadata fields that provide additional context for the value
This is especially useful in complex and dynamic environments
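A common use of such metadata is selecting records by label. The sketch below assumes labels are flat string-to-string mappings; the `matches` helper, the second and third records, and all numeric values are invented for illustration.

```python
def matches(labels: dict[str, str], selector: dict[str, str]) -> bool:
    """True if every key/value pair in `selector` is present in `labels`."""
    return all(labels.get(key) == value for key, value in selector.items())


# (labels, value) pairs; values are made-up GPU utilization percentages
records = [
    ({"namespace": "rainfall", "pod": "prediction-0", "gpu": "0"}, 71.0),
    ({"namespace": "rainfall", "pod": "prediction-1", "gpu": "1"}, 64.0),
    ({"namespace": "billing", "pod": "api-0", "gpu": "0"}, 12.0),
]

# Keep only the values recorded for the "rainfall" namespace
rainfall = [value for labels, value in records
            if matches(labels, {"namespace": "rainfall"})]
```

Because the selector only names the labels of interest, the same query keeps working when pods are added, removed, or rescheduled, which is what makes labels useful in dynamic environments.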
Scrapes metrics from monitoring-enabled services
Monitoring-enabled services can push logs and traces
Collects monitoring data from Prometheus and Tempo and acts as a long-term storage database
Promtail collects pod logs from Kubernetes into Loki
Visualizer for metric, log, and trace queries