# Monitoring and Observability
Comprehensive monitoring setup for the Tailscale Gateway Operator, including metrics, logging, tracing, and alerting.
## Overview
The Tailscale Gateway Operator provides extensive observability features:
- **Metrics**: Prometheus-compatible metrics for performance monitoring
- **Logging**: Structured logging with configurable levels
- **Tracing**: OpenTelemetry-based distributed tracing
- **Health Checks**: Liveness and readiness probes
- **Alerting**: Pre-configured alerts for common issues
## Metrics Configuration

### Enable Metrics Collection

**metrics-config.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: monitoring-config
  namespace: tailscale-gateway-system
data:
  config.yaml: |
    metrics:
      enabled: true
      port: 8080
      path: "/metrics"

      # Metric collection settings
      collection:
        interval: "15s"
        timeout: "10s"

      # Custom metrics
      custom:
        enabled: true
        labels:
          cluster: "production"
          region: "us-east-1"

    # Health probes
    health:
      enabled: true
      port: 8081
      livenessPath: "/healthz"
      readinessPath: "/readyz"
```
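As a quick sanity check after applying this ConfigMap, you can port-forward to the operator and confirm the metrics endpoint is serving. This is a minimal sketch, assuming the operator Deployment is named `tailscale-gateway-operator` and exposes the metrics port configured above:

```bash
# Forward the metrics port (8080) from the operator pod
kubectl port-forward -n tailscale-gateway-system \
  deploy/tailscale-gateway-operator 8080:8080 &

# Confirm metrics are being served
curl -s http://localhost:8080/metrics | head -n 20
```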
### ServiceMonitor for Prometheus

**servicemonitor.yaml**

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system
  labels:
    app: tailscale-gateway-operator
spec:
  selector:
    matchLabels:
      app: tailscale-gateway-operator
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
```
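Once Prometheus has picked up the scrape target, it shows up in the targets API. The sketch below assumes a Prometheus instance reachable on localhost:9090, for example the standalone Deployment defined later on this page; adjust the port-forward target for an Operator-managed Prometheus:

```bash
# Forward the Prometheus web port (adjust the workload to your setup)
kubectl port-forward -n monitoring deploy/prometheus 9090:9090 &

# List active targets and their scrape health
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
```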
## Key Metrics

The operator exposes the following metrics (example queries against them are sketched after the lists below):

### Controller Metrics

- `controller_runtime_reconcile_total` - Total reconciliation attempts
- `controller_runtime_reconcile_time_seconds` - Time spent in reconciliation
- `controller_runtime_reconcile_errors_total` - Reconciliation errors

### Extension Server Metrics

- `grpc_server_requests_total` - Total gRPC requests
- `grpc_server_request_duration_seconds` - gRPC request duration
- `grpc_server_errors_total` - gRPC server errors

### Service Coordination Metrics

- `service_coordinator_services_total` - Total managed services
- `service_coordinator_registrations_total` - Service registrations
- `service_coordinator_cleanup_operations_total` - Cleanup operations

### Health Check Metrics

- `health_check_duration_seconds` - Health check duration
- `health_check_success_total` - Successful health checks
- `health_check_failure_total` - Failed health checks
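As a sketch of how these metrics can be queried ad hoc, the commands below run two PromQL expressions through the Prometheus HTTP API, reusing the Prometheus port-forward shown in the previous section:

```bash
# Reconciliation rate per controller over the last 5 minutes
curl -s --data-urlencode 'query=rate(controller_runtime_reconcile_total[5m])' \
  http://localhost:9090/api/v1/query | jq '.data.result'

# 95th percentile reconciliation latency
curl -s --data-urlencode \
  'query=histogram_quantile(0.95, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))' \
  http://localhost:9090/api/v1/query | jq '.data.result'
```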
## Prometheus Configuration

### Prometheus Deployment

**prometheus.yaml**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.45.0
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: storage
              mountPath: /prometheus
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--storage.tsdb.retention.time=15d'
            - '--web.enable-lifecycle'
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: storage
          emptyDir: {}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - "alert_rules.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    scrape_configs:
      # Tailscale Gateway Operator
      - job_name: 'tailscale-gateway-operator'
        static_configs:
          - targets: ['tailscale-gateway-operator.tailscale-gateway-system:8080']
        scrape_interval: 30s
        metrics_path: /metrics

      # Extension Server
      - job_name: 'tailscale-gateway-extension-server'
        static_configs:
          - targets: ['tailscale-gateway-extension-server.tailscale-gateway-system:8080']
        scrape_interval: 30s

      # Kubernetes components
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
```
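It can be worth validating the configuration with `promtool` before reloading Prometheus. The sketch below pulls the config and the alert rules it references back out of the cluster (the `alert-rules` ConfigMap is defined later on this page); `promtool check config` resolves `rule_files` relative to the config file, so both files need to sit in the same directory:

```bash
# Extract the Prometheus config and the alert rules it references
kubectl get configmap prometheus-config -n monitoring \
  -o jsonpath='{.data.prometheus\.yml}' > /tmp/prometheus.yml
kubectl get configmap alert-rules -n monitoring \
  -o jsonpath='{.data.alert_rules\.yml}' > /tmp/alert_rules.yml

# Validate the config (rule_files are checked as well)
promtool check config /tmp/prometheus.yml

# With --web.enable-lifecycle set, Prometheus can then be reloaded in place
curl -X POST http://localhost:9090/-/reload
```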
## Grafana Dashboards

### Operator Overview Dashboard

**operator-dashboard.json**

```json
{
  "dashboard": {
    "id": null,
    "title": "Tailscale Gateway Operator",
    "tags": ["tailscale", "gateway", "operator"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Reconciliation Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(controller_runtime_reconcile_total[5m])",
            "legendFormat": "{{controller}} - {{result}}"
          }
        ],
        "yAxes": [
          {
            "label": "Reconciliations/sec"
          }
        ]
      },
      {
        "id": 2,
        "title": "Reconciliation Duration",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "id": 3,
        "title": "Active Services",
        "type": "singlestat",
        "targets": [
          {
            "expr": "service_coordinator_services_total",
            "legendFormat": "Services"
          }
        ]
      },
      {
        "id": 4,
        "title": "Health Check Success Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(health_check_success_total[5m]) / (rate(health_check_success_total[5m]) + rate(health_check_failure_total[5m])) * 100",
            "legendFormat": "Success Rate %"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "30s"
  }
}
```
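The file above is already wrapped in the `dashboard` envelope that Grafana's import API expects, so it can be loaded directly over HTTP. A sketch, assuming a Grafana instance reachable on localhost:3000 with default admin credentials (adjust URL and auth for your environment):

```bash
# Import the dashboard through Grafana's HTTP API
curl -s -X POST \
  -H "Content-Type: application/json" \
  -u admin:admin \
  -d @operator-dashboard.json \
  http://localhost:3000/api/dashboards/db | jq .
```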
### Service Health Dashboard

**service-health-dashboard.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: service-health-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Tailscale Service Health",
        "panels": [
          {
            "title": "Service Availability",
            "type": "table",
            "targets": [
              {
                "expr": "up{job=~\"tailscale.*\"}",
                "format": "table",
                "instant": true
              }
            ],
            "transformations": [
              {
                "id": "organize",
                "options": {
                  "includeByName": {
                    "instance": true,
                    "job": true,
                    "Value": true
                  },
                  "renameByName": {
                    "Value": "Status"
                  }
                }
              }
            ]
          },
          {
            "title": "Response Time Distribution",
            "type": "heatmap",
            "targets": [
              {
                "expr": "increase(grpc_server_request_duration_seconds_bucket[1m])",
                "legendFormat": "{{le}}"
              }
            ]
          }
        ]
      }
    }
```
## Logging Configuration

### Structured Logging Setup

**logging-config.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logging-config
  namespace: tailscale-gateway-system
data:
  logging.yaml: |
    logging:
      # Global logging settings
      level: "info"    # debug, info, warn, error
      format: "json"   # json, console

      # Component-specific logging
      components:
        controller:
          level: "info"
          output: "stdout"
        extensionServer:
          level: "debug"
          output: "stdout"
        serviceCoordinator:
          level: "info"
          output: "stdout"

      # Log rotation
      rotation:
        enabled: true
        maxSize: "100MB"
        maxFiles: 10
        maxAge: "7d"
        compress: true

      # Structured fields
      fields:
        cluster: "production"
        component: "tailscale-gateway"
        version: "v1.0.0"

      # Sampling (for high-volume logs)
      sampling:
        enabled: true
        rate: 100   # Log every 100th message for debug level
        levels:
          debug: 100
          info: 1
          warn: 1
          error: 1
```
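If the operator only reads this configuration at startup (an assumption; check whether your operator version supports hot-reload), a restart is needed for changes such as a new log level to take effect:

```bash
# Apply the updated logging configuration
kubectl apply -f logging-config.yaml

# Restart the operator so it picks up the new ConfigMap contents
kubectl rollout restart deployment/tailscale-gateway-operator \
  -n tailscale-gateway-system
kubectl rollout status deployment/tailscale-gateway-operator \
  -n tailscale-gateway-system
```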
### Fluentd Configuration for Log Aggregation

**fluentd-config.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      @id tailscale_operator_logs
      path /var/log/containers/tailscale-gateway-operator-*.log
      pos_file /var/log/fluentd-tailscale-operator.log.pos
      tag kubernetes.tailscale.operator
      format json
      time_format %Y-%m-%dT%H:%M:%S.%NZ
    </source>

    <filter kubernetes.tailscale.**>
      @type kubernetes_metadata
      @id filter_tailscale_metadata
    </filter>

    <filter kubernetes.tailscale.**>
      @type parser
      key_name log
      reserve_data true
      <parse>
        @type json
        time_key ts
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </filter>

    <match kubernetes.tailscale.**>
      @type elasticsearch
      @id out_es_tailscale
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix tailscale-gateway
      <buffer>
        @type file
        path /var/log/fluentd-buffers/tailscale-gateway.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever true
        retry_max_interval 30s
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
```
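With `logstash_format` and the `tailscale-gateway` prefix, Fluentd writes daily indices named `tailscale-gateway-YYYY.MM.DD`. A quick way to confirm logs are arriving is to list those indices, a sketch assuming the Elasticsearch Service referenced in the config above:

```bash
# Forward the Elasticsearch HTTP port locally
kubectl port-forward -n logging svc/elasticsearch 9200:9200 &

# List the indices created by Fluentd for the operator logs
curl -s 'http://localhost:9200/_cat/indices/tailscale-gateway-*?v'
```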
## Distributed Tracing

### OpenTelemetry Configuration

**opentelemetry-config.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
  namespace: tailscale-gateway-system
data:
  config.yaml: |
    tracing:
      enabled: true

      # Sampling configuration
      sampling:
        rate: 0.1   # Sample 10% of traces

      # Exporter configuration
      exporters:
        jaeger:
          endpoint: "http://jaeger-collector.observability:14268/api/traces"
          timeout: "10s"
        otlp:
          endpoint: "http://otel-collector.observability:4317"
          compression: "gzip"
          timeout: "10s"

      # Resource attributes
      resource:
        attributes:
          service.name: "tailscale-gateway-operator"
          service.version: "v1.0.0"
          deployment.environment: "production"
          k8s.cluster.name: "main-cluster"
          k8s.namespace.name: "tailscale-gateway-system"

      # Instrumentation
      instrumentation:
        http:
          enabled: true
          captureHeaders: true
        grpc:
          enabled: true
        database:
          enabled: true
        redis:
          enabled: true
```
### Jaeger Deployment

**jaeger.yaml**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.47
          ports:
            - containerPort: 16686
              name: ui
            - containerPort: 14268
              name: collector
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
            - name: SPAN_STORAGE_TYPE
              value: "elasticsearch"
            - name: ES_SERVER_URLS
              value: "http://elasticsearch.observability:9200"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: observability
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
      targetPort: 16686
    - name: collector
      port: 14268
      targetPort: 14268
```
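Once traces are flowing, the Jaeger query API lists every service that has reported spans, which should include the operator. A quick sketch, assuming the Service defined above:

```bash
# Forward the Jaeger UI/query port
kubectl port-forward -n observability svc/jaeger 16686:16686 &

# Services that have reported spans should include the operator
curl -s http://localhost:16686/api/services | jq .

# The Jaeger UI itself is then available at http://localhost:16686
```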
## Alerting Rules

### Prometheus Alerting Rules

**alert-rules.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-rules
  namespace: monitoring
data:
  alert_rules.yml: |
    groups:
      - name: tailscale-gateway.rules
        rules:
          # Operator availability
          - alert: TailscaleGatewayOperatorDown
            expr: up{job="tailscale-gateway-operator"} == 0
            for: 5m
            labels:
              severity: critical
              component: operator
            annotations:
              summary: "Tailscale Gateway Operator is down"
              description: "The Tailscale Gateway Operator has been down for more than 5 minutes."

          # High reconciliation error rate
          - alert: HighReconciliationErrorRate
            expr: |
              (
                rate(controller_runtime_reconcile_errors_total[5m]) /
                rate(controller_runtime_reconcile_total[5m])
              ) > 0.1
            for: 10m
            labels:
              severity: warning
              component: controller
            annotations:
              summary: "High reconciliation error rate"
              description: "Reconciliation error rate is {{ $value | humanizePercentage }} for controller {{ $labels.controller }}"

          # Extension server connectivity
          - alert: ExtensionServerDown
            expr: up{job="tailscale-gateway-extension-server"} == 0
            for: 3m
            labels:
              severity: critical
              component: extension-server
            annotations:
              summary: "Extension server is down"
              description: "The Tailscale Gateway extension server is unreachable."

          # Service health check failures
          - alert: ServiceHealthCheckFailure
            expr: |
              (
                rate(health_check_failure_total[5m]) /
                (rate(health_check_success_total[5m]) + rate(health_check_failure_total[5m]))
              ) > 0.5
            for: 5m
            labels:
              severity: warning
              component: health-check
            annotations:
              summary: "High health check failure rate"
              description: "Health check failure rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"

          # Memory usage
          - alert: HighMemoryUsage
            expr: |
              (
                container_memory_working_set_bytes{container="manager", namespace="tailscale-gateway-system"} /
                container_spec_memory_limit_bytes{container="manager", namespace="tailscale-gateway-system"}
              ) > 0.8
            for: 10m
            labels:
              severity: warning
              component: operator
            annotations:
              summary: "High memory usage"
              description: "Memory usage is {{ $value | humanizePercentage }} of limit."

          # gRPC request errors
          - alert: HighGRPCErrorRate
            expr: |
              (
                rate(grpc_server_errors_total[5m]) /
                rate(grpc_server_requests_total[5m])
              ) > 0.05
            for: 5m
            labels:
              severity: warning
              component: extension-server
            annotations:
              summary: "High gRPC error rate"
              description: "gRPC error rate is {{ $value | humanizePercentage }}"
```
### Alertmanager Configuration

**alertmanager.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      smtp_smarthost: 'smtp.company.com:587'
      smtp_from: 'alerts@company.com'

    route:
      group_by: ['alertname', 'component']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
        - match:
            component: operator
          receiver: 'platform-team'
        - match:
            component: extension-server
          receiver: 'gateway-team'

    receivers:
      - name: 'default'
        email_configs:
          - to: 'team@company.com'
            headers:
              Subject: 'Tailscale Gateway Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              Alert: {{ .Annotations.summary }}
              Description: {{ .Annotations.description }}
              Labels: {{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }} {{ end }}
              {{ end }}

      - name: 'critical-alerts'
        email_configs:
          - to: 'oncall@company.com'
            headers:
              Subject: 'CRITICAL: Tailscale Gateway Alert'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/...'
            channel: '#alerts'
            title: 'Critical Alert'
            text: '{{ .CommonAnnotations.summary }}'

      - name: 'platform-team'
        email_configs:
          - to: 'platform-team@company.com'

      - name: 'gateway-team'
        email_configs:
          - to: 'gateway-team@company.com'
```

Note that Alertmanager's `email_configs` does not accept top-level `subject` or `body` fields; the subject is set via the `Subject` header and the plain-text body via `text`, as shown above.
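`amtool`, which ships with Alertmanager, can validate this configuration the same way `promtool` validates the Prometheus config:

```bash
# Extract the Alertmanager config from the ConfigMap and validate it
kubectl get configmap alertmanager-config -n monitoring \
  -o jsonpath='{.data.alertmanager\.yml}' > /tmp/alertmanager.yml
amtool check-config /tmp/alertmanager.yml
```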
## Health Checks

### Liveness and Readiness Probes

**health-probes.yaml**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system
spec:
  template:
    spec:
      containers:
        - name: manager
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Startup probe for slow starts
          startupProbe:
            httpGet:
              path: /healthz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30   # Allow up to 5 minutes for startup
```
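To see what the kubelet sees, you can hit the same endpoints manually and check probe-related events on the pod. A quick sketch (the label selector is assumed to match the `app: tailscale-gateway-operator` label used elsewhere on this page):

```bash
# Exercise the probe endpoints directly via port-forward
kubectl port-forward -n tailscale-gateway-system \
  deploy/tailscale-gateway-operator 8081:8081 &
curl -i http://localhost:8081/healthz
curl -i http://localhost:8081/readyz

# Probe failures show up as pod events
kubectl describe pod -n tailscale-gateway-system \
  -l app=tailscale-gateway-operator | grep -A 10 'Events:'
```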
### Custom Health Check Endpoints

**custom-health-checks.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: health-check-config
  namespace: tailscale-gateway-system
data:
  health.yaml: |
    healthChecks:
      # Custom health check implementations
      custom:
        - name: "tailscale-connectivity"
          endpoint: "/health/tailscale"
          timeout: "10s"
          interval: "30s"
        - name: "extension-server-grpc"
          endpoint: "/health/grpc"
          timeout: "5s"
          interval: "15s"
        - name: "service-registry"
          endpoint: "/health/registry"
          timeout: "5s"
          interval: "60s"

      # Dependency checks
      dependencies:
        - name: "kubernetes-api"
          type: "http"
          url: "https://kubernetes.default.svc.cluster.local/healthz"
          timeout: "5s"
        - name: "tailscale-api"
          type: "http"
          url: "https://api.tailscale.com/health"
          timeout: "10s"
```
## Performance Monitoring

### Resource Usage Dashboards

**resource-dashboard.yaml**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: resource-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Tailscale Gateway Resources",
        "panels": [
          {
            "title": "CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(container_cpu_usage_seconds_total{namespace=\"tailscale-gateway-system\"}[5m]) * 100",
                "legendFormat": "{{ container }}"
              }
            ]
          },
          {
            "title": "Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "container_memory_working_set_bytes{namespace=\"tailscale-gateway-system\"} / 1024 / 1024",
                "legendFormat": "{{ container }} MB"
              }
            ]
          },
          {
            "title": "Network I/O",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(container_network_receive_bytes_total{namespace=\"tailscale-gateway-system\"}[5m])",
                "legendFormat": "RX {{ container }}"
              },
              {
                "expr": "rate(container_network_transmit_bytes_total{namespace=\"tailscale-gateway-system\"}[5m])",
                "legendFormat": "TX {{ container }}"
              }
            ]
          }
        ]
      }
    }
```
## Log Analysis

### Common Log Queries

```bash
# View operator startup logs
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator | grep "Starting"

# Monitor reconciliation errors
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator | grep "ERROR.*reconcile"

# Check extension server gRPC logs
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-extension-server | grep "grpc"

# Monitor service coordination
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator | grep "ServiceCoordinator"

# Track health check failures
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator | grep "health.*failed"
```
### Kibana/Elasticsearch Queries

```json
{
  "query": {
    "bool": {
      "must": [
        {"match": {"kubernetes.namespace": "tailscale-gateway-system"}},
        {"match": {"component": "tailscale-gateway"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ],
      "should": [
        {"match": {"level": "ERROR"}},
        {"match": {"level": "WARN"}}
      ]
    }
  },
  "sort": [{"@timestamp": {"order": "desc"}}]
}
```
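The same query can be run directly against Elasticsearch without Kibana. A sketch, assuming the query above is saved as `error-query.json` (a hypothetical filename) and the Fluentd index pattern from the logging section:

```bash
# Forward the Elasticsearch HTTP port locally
kubectl port-forward -n logging svc/elasticsearch 9200:9200 &

# Run the saved query against the operator's log indices
curl -s -H 'Content-Type: application/json' \
  -d @error-query.json \
  'http://localhost:9200/tailscale-gateway-*/_search?pretty'
```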
## Troubleshooting Common Issues

### Debugging Performance Issues

1. **High CPU Usage**:

   ```bash
   # Check CPU metrics
   kubectl top pods -n tailscale-gateway-system

   # Analyze reconciliation rate
   curl http://tailscale-gateway-operator:8080/metrics | grep reconcile_total
   ```

2. **Memory Leaks**:

   ```bash
   # Monitor memory growth
   kubectl top pods -n tailscale-gateway-system --containers

   # Check for goroutine leaks
   curl http://tailscale-gateway-operator:8080/debug/pprof/goroutine
   ```

3. **Extension Server Issues**:

   ```bash
   # Check gRPC connection status
   grpcurl -plaintext tailscale-gateway-extension-server:5005 grpc.health.v1.Health/Check

   # Monitor request latency
   curl http://tailscale-gateway-extension-server:8080/metrics | grep grpc_duration
   ```
## Next Steps
- Production Deployment - Production best practices
- Troubleshooting Guide - Common issues and solutions
- Configuration Reference - Advanced configurations