Production Deployment Guide

Best practices and recommendations for deploying Tailscale Gateway Operator in production environments.

Production Readiness Checklist

✅ Security Requirements

OAuth Credentials: Secure storage in Kubernetes secrets
TLS Configuration: Proper certificate management
RBAC Setup: Least-privilege access controls
Network Policies: Restricted network access
Secret Management: External secret management integration
Image Security: Signed and scanned container images

✅ High Availability

Multiple Replicas: Operator and extension server redundancy
Leader Election: Enabled for controller high availability
Pod Disruption Budgets: Configured for maintenance resilience
Anti-Affinity Rules: Spread replicas across nodes/zones
Resource Limits: Proper CPU and memory constraints
Health Checks: Comprehensive liveness and readiness probes

✅ Monitoring & Observability

Metrics Collection: Prometheus integration
Log Aggregation: Centralized logging setup
Distributed Tracing: OpenTelemetry configuration
Alerting Rules: Critical alert definitions
Dashboards: Grafana dashboards for visibility
SLI/SLO Definition: Service level indicators and objectives

✅ Operational Excellence

Backup Strategy: Configuration and state backup
Disaster Recovery: Recovery procedures documented
Capacity Planning: Resource usage projections
Update Strategy: Rolling update procedures
Runbooks: Incident response procedures
Documentation: Complete operational documentation

Production Configuration

High Availability Deployment

production-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system
  labels:
    app: tailscale-gateway-operator
    version: v1.0.0
spec:
  replicas: 3  # Multiple replicas for HA
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: tailscale-gateway-operator
  template:
    metadata:
      labels:
        app: tailscale-gateway-operator
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      
      # Service account with RBAC
      serviceAccountName: tailscale-gateway-operator
      
      # Anti-affinity for HA
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - tailscale-gateway-operator
              topologyKey: kubernetes.io/hostname
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - tailscale-gateway-operator
              topologyKey: topology.kubernetes.io/zone
      
      containers:
      - name: manager
        image: ghcr.io/rajsinghtech/tailscale-gateway:v1.0.0
        imagePullPolicy: IfNotPresent
        
        # Security context
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        
        # Resource limits
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 1000m
            memory: 512Mi
        
        # Environment variables
        env:
        - name: TAILSCALE_CLIENT_ID
          valueFrom:
            secretKeyRef:
              name: tailscale-oauth
              key: client-id
        - name: TAILSCALE_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: tailscale-oauth
              key: client-secret
        - name: LEADER_ELECTION_ENABLED
          value: "true"
        - name: LEADER_ELECTION_ID
          value: "tailscale-gateway-operator"
        - name: METRICS_BIND_ADDRESS
          value: ":8080"
        - name: HEALTH_PROBE_BIND_ADDRESS
          value: ":8081"
        - name: LOG_LEVEL
          value: "info"
        - name: LOG_FORMAT
          value: "json"
        
        # Health probes
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
          failureThreshold: 3
          
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          
        startupProbe:
          httpGet:
            path: /healthz
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30
        
        # Volume mounts
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: config
          mountPath: /etc/tailscale-gateway
          readOnly: true
      
      volumes:
      - name: tmp
        emptyDir: {}
      - name: config
        configMap:
          name: tailscale-gateway-config
      
      # Node selector for dedicated nodes (optional)
      nodeSelector:
        node-role.kubernetes.io/worker: "true"
      
      # Tolerations for tainted nodes (optional)
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "tailscale-gateway"
        effect: "NoSchedule"

---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tailscale-gateway-operator-pdb
  namespace: tailscale-gateway-system
spec:
  minAvailable: 2  # Keep at least 2 replicas running
  selector:
    matchLabels:
      app: tailscale-gateway-operator

Extension Server Production Configuration

extension-server-production.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system
spec:
  replicas: 2  # HA for extension server
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: tailscale-gateway-extension-server
  template:
    metadata:
      labels:
        app: tailscale-gateway-extension-server
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      
      serviceAccountName: tailscale-gateway-extension-server
      
      # Anti-affinity
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - tailscale-gateway-extension-server
              topologyKey: kubernetes.io/hostname
      
      containers:
      - name: extension-server
        image: ghcr.io/rajsinghtech/tailscale-gateway-extension-server:v1.0.0
        
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        
        env:
        - name: GRPC_PORT
          value: "5005"
        - name: METRICS_PORT
          value: "8080"
        - name: LOG_LEVEL
          value: "info"
        - name: MAX_MESSAGE_SIZE
          value: "1000M"
        
        ports:
        - containerPort: 5005
          name: grpc
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        
        livenessProbe:
          grpc:
            port: 5005
          initialDelaySeconds: 15
          periodSeconds: 20
          
        readinessProbe:
          grpc:
            port: 5005
          initialDelaySeconds: 5
          periodSeconds: 10
        
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      
      volumes:
      - name: tmp
        emptyDir: {}

---
# Service for extension server
apiVersion: v1
kind: Service
metadata:
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system
  labels:
    app: tailscale-gateway-extension-server
spec:
  type: ClusterIP
  ports:
  - port: 5005
    targetPort: 5005
    protocol: TCP
    name: grpc
  - port: 8080
    targetPort: 8080
    protocol: TCP
    name: metrics
  selector:
    app: tailscale-gateway-extension-server

Security Hardening

RBAC Configuration

rbac-production.yaml
---
# Operator service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system
  labels:
    app: tailscale-gateway-operator

---
# Operator cluster role (minimal permissions)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tailscale-gateway-operator
rules:
# Core resources
- apiGroups: [""]
  resources: ["services", "endpoints", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]

# StatefulSets for tailscale proxies
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

# Gateway API resources
- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["gateways", "httproutes", "tcproutes", "udproutes", "tlsroutes", "grpcroutes"]
  verbs: ["get", "list", "watch"]

# Custom resources
- apiGroups: ["gateway.tailscale.com"]
  resources: ["tailscaleendpoints", "tailscalegateways", "tailscaleroutepolicies", "tailscaletailnets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["gateway.tailscale.com"]
  resources: ["tailscaleendpoints/status", "tailscalegateways/status", "tailscaleroutepolicies/status", "tailscaletailnets/status"]
  verbs: ["get", "update", "patch"]

# Leader election
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]

---
# Bind cluster role to service account
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tailscale-gateway-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tailscale-gateway-operator
subjects:
- kind: ServiceAccount
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system

---
# Extension server service account (even more restricted)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system

---
# Extension server role (namespace-scoped)
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: tailscale-gateway-system
  name: tailscale-gateway-extension-server
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["gateway.tailscale.com"]
  resources: ["tailscaleendpoints"]
  verbs: ["get", "list", "watch"]

---
# Bind role to extension server service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: tailscale-gateway-extension-server
subjects:
- kind: ServiceAccount
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system

Network Policies

network-policies.yaml
---
# Operator network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tailscale-gateway-operator
  namespace: tailscale-gateway-system
spec:
  podSelector:
    matchLabels:
      app: tailscale-gateway-operator
  policyTypes:
  - Ingress
  - Egress
  
  ingress:
  # Allow metrics scraping
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080
  
  # Allow health checks
  - from: []
    ports:
    - protocol: TCP
      port: 8081
  
  egress:
  # Kubernetes API server
  - to: []
    ports:
    - protocol: TCP
      port: 6443
  
  # Tailscale API
  - to: []
    ports:
    - protocol: TCP
      port: 443
    # Note: In production, consider restricting to specific IPs
  
  # DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53

---
# Extension server network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tailscale-gateway-extension-server
  namespace: tailscale-gateway-system
spec:
  podSelector:
    matchLabels:
      app: tailscale-gateway-extension-server
  policyTypes:
  - Ingress
  - Egress
  
  ingress:
  # Allow gRPC from Envoy Gateway
  - from:
    - namespaceSelector:
        matchLabels:
          name: envoy-gateway-system
    ports:
    - protocol: TCP
      port: 5005
  
  # Allow metrics scraping
  - from:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 8080
  
  egress:
  # Allow communication with operator
  - to:
    - podSelector:
        matchLabels:
          app: tailscale-gateway-operator
    ports:
    - protocol: TCP
      port: 8080  # For coordination
  
  # DNS
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53

Secret Management

external-secrets.yaml
# Using External Secrets Operator for production secret management
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: tailscale-gateway-system
spec:
  provider:
    vault:
      server: "https://vault.company.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "tailscale-gateway"
          serviceAccountRef:
            name: "tailscale-gateway-operator"

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: tailscale-oauth
  namespace: tailscale-gateway-system
spec:
  refreshInterval: 24h  # Refresh daily
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  
  target:
    name: tailscale-oauth
    creationPolicy: Owner
    template:
      type: Opaque
      data:
        client-id: "&#123;&#123; .clientId &#125;&#125;"
        client-secret: "&#123;&#123; .clientSecret &#125;&#125;"
  
  data:
  - secretKey: clientId
    remoteRef:
      key: tailscale/oauth
      property: client_id
  - secretKey: clientSecret
    remoteRef:
      key: tailscale/oauth
      property: client_secret

Backup and Disaster Recovery

Configuration Backup

backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tailscale-gateway-backup
  namespace: tailscale-gateway-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tailscale-gateway-backup
          containers:
          - name: backup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Create backup directory
              mkdir -p /backup/$(date +%Y%m%d)
              
              # Backup CRDs
              kubectl get crd -o yaml > /backup/$(date +%Y%m%d)/crds.yaml
              
              # Backup custom resources
              kubectl get tailscaleendpoints -A -o yaml > /backup/$(date +%Y%m%d)/endpoints.yaml
              kubectl get tailscalegateways -A -o yaml > /backup/$(date +%Y%m%d)/gateways.yaml
              kubectl get tailscaleroutepolicies -A -o yaml > /backup/$(date +%Y%m%d)/policies.yaml
              kubectl get tailscaletailnets -A -o yaml > /backup/$(date +%Y%m%d)/tailnets.yaml
              
              # Backup operator configuration
              kubectl get configmaps -n tailscale-gateway-system -o yaml > /backup/$(date +%Y%m%d)/configmaps.yaml
              
              # Upload to S3 (or your backup storage)
              aws s3 sync /backup/$(date +%Y%m%d) s3://company-backups/tailscale-gateway/$(date +%Y%m%d)/
              
              # Cleanup old local backups
              find /backup -type d -mtime +7 -exec rm -rf {} \;
            
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: backup-credentials
                  key: secret-access-key
            
            volumeMounts:
            - name: backup-storage
              mountPath: /backup
          
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
          
          restartPolicy: OnFailure

Disaster Recovery Procedures

disaster-recovery.sh
#!/bin/bash
# Disaster Recovery Script

set -e

BACKUP_DATE=${1:-$(date +%Y%m%d)}
NAMESPACE=${2:-tailscale-gateway-system}

echo "Starting disaster recovery for date: $BACKUP_DATE"

# 1. Restore CRDs first
echo "Restoring CRDs..."
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/crds.yaml

# 2. Wait for CRDs to be established
echo "Waiting for CRDs to be established..."
kubectl wait --for condition=established --timeout=60s crd/tailscaleendpoints.gateway.tailscale.com
kubectl wait --for condition=established --timeout=60s crd/tailscalegateways.gateway.tailscale.com

# 3. Restore namespace and RBAC
echo "Restoring namespace and RBAC..."
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f rbac-production.yaml

# 4. Restore secrets (manual step - fetch from vault)
echo "Restoring secrets..."
kubectl create secret generic tailscale-oauth \
  --from-literal=client-id="$TAILSCALE_CLIENT_ID" \
  --from-literal=client-secret="$TAILSCALE_CLIENT_SECRET" \
  -n $NAMESPACE

# 5. Restore configuration
echo "Restoring configuration..."
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/configmaps.yaml

# 6. Deploy operator
echo "Deploying operator..."
kubectl apply -f production-deployment.yaml

# 7. Wait for operator to be ready
echo "Waiting for operator to be ready..."
kubectl wait --for=condition=available --timeout=300s deployment/tailscale-gateway-operator -n $NAMESPACE

# 8. Restore custom resources
echo "Restoring custom resources..."
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/tailnets.yaml
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/endpoints.yaml
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/gateways.yaml
kubectl apply -f s3://company-backups/tailscale-gateway/$BACKUP_DATE/policies.yaml

# 9. Verify recovery
echo "Verifying recovery..."
kubectl get pods -n $NAMESPACE
kubectl get tailscalegateways -A
kubectl get tailscaleendpoints -A

echo "Disaster recovery completed successfully!"

Capacity Planning

Resource Estimation

resource-calculator.yaml
# Resource estimation guidelines for production

# Base operator requirements per 100 TailscaleEndpoints:
# CPU: 100m (baseline) + 10m per endpoint
# Memory: 128Mi (baseline) + 2Mi per endpoint

# Extension server requirements per 1000 requests/second:
# CPU: 50m + 50m per 1000 RPS
# Memory: 64Mi + 32Mi per 1000 RPS

# Example calculation for 500 endpoints, 5000 RPS:
operator:
  cpu: "600m"      # 100m + (500 * 10m/100) = 150m (round up for safety)
  memory: "256Mi"  # 128Mi + (500 * 2Mi/100) = 138Mi (round up)

extensionServer:
  cpu: "300m"      # 50m + (5000/1000 * 50m) = 300m
  memory: "224Mi"  # 64Mi + (5000/1000 * 32Mi) = 224Mi

# Recommended production scaling:
scaling:
  horizontal:
    # Operator: 1 replica per 1000 endpoints
    # Extension Server: 1 replica per 10000 RPS
  vertical:
    # Monitor CPU utilization - scale up if >70%
    # Monitor memory usage - scale up if >80%

Horizontal Pod Autoscaler

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tailscale-gateway-operator-hpa
  namespace: tailscale-gateway-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tailscale-gateway-operator
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: reconcile_queue_length
      target:
        type: AverageValue
        averageValue: "50"
  
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 900
      policies:
      - type: Percent
        value: 50
        periodSeconds: 300

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tailscale-gateway-extension-server-hpa
  namespace: tailscale-gateway-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tailscale-gateway-extension-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: grpc_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"

Update and Rollback Procedures

Rolling Update Strategy

update-strategy.yaml
# Production update procedure
apiVersion: v1
kind: ConfigMap
metadata:
  name: update-procedures
  namespace: tailscale-gateway-system
data:
  update.sh: |
    #!/bin/bash
    # Safe production update procedure
    
    set -e
    
    NEW_VERSION=$1
    if [ -z "$NEW_VERSION" ]; then
      echo "Usage: $0 <new-version>"
      exit 1
    fi
    
    echo "Starting update to version $NEW_VERSION"
    
    # 1. Pre-update backup
    echo "Creating pre-update backup..."
    kubectl create job backup-pre-update-$(date +%s) \
      --from=cronjob/tailscale-gateway-backup \
      -n tailscale-gateway-system
    
    # 2. Update operator with rollback capability
    echo "Updating operator..."
    kubectl patch deployment tailscale-gateway-operator \
      -n tailscale-gateway-system \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","image":"ghcr.io/rajsinghtech/tailscale-gateway:'$NEW_VERSION'"}]}}}}'
    
    # 3. Wait for rollout
    echo "Waiting for rollout to complete..."
    kubectl rollout status deployment/tailscale-gateway-operator \
      -n tailscale-gateway-system \
      --timeout=600s
    
    # 4. Health check
    echo "Performing health check..."
    sleep 30
    kubectl get pods -n tailscale-gateway-system -l app=tailscale-gateway-operator
    
    # 5. Functional verification
    echo "Running functional tests..."
    if ! ./verify-functionality.sh; then
      echo "Functional tests failed! Rolling back..."
      kubectl rollout undo deployment/tailscale-gateway-operator -n tailscale-gateway-system
      exit 1
    fi
    
    # 6. Update extension server
    echo "Updating extension server..."
    kubectl patch deployment tailscale-gateway-extension-server \
      -n tailscale-gateway-system \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"extension-server","image":"ghcr.io/rajsinghtech/tailscale-gateway-extension-server:'$NEW_VERSION'"}]}}}}'
    
    kubectl rollout status deployment/tailscale-gateway-extension-server \
      -n tailscale-gateway-system \
      --timeout=600s
    
    echo "Update completed successfully!"
  
  verify-functionality.sh: |
    #!/bin/bash
    # Functional verification script
    
    # Check operator health
    if ! curl -f http://tailscale-gateway-operator:8081/healthz; then
      echo "Operator health check failed"
      return 1
    fi
    
    # Check extension server health
    if ! grpcurl -plaintext tailscale-gateway-extension-server:5005 grpc.health.v1.Health/Check; then
      echo "Extension server health check failed"
      return 1
    fi
    
    # Check custom resources
    if ! kubectl get tailscalegateways --all-namespaces -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}' | grep -q "True"; then
      echo "TailscaleGateway not ready"
      return 1
    fi
    
    echo "All functional tests passed"
    return 0

Production Monitoring

SLI/SLO Definitions

sli-slo.yaml
# Service Level Indicators and Objectives
apiVersion: v1
kind: ConfigMap
metadata:
  name: sli-slo-definitions
  namespace: monitoring
data:
  sli-slo.yaml: |
    sli:
      # Availability SLI
      availability:
        name: "Operator Availability"
        description: "Percentage of time the operator is responding to health checks"
        query: "avg_over_time(up{job='tailscale-gateway-operator'}[5m])"
        
      # Reconciliation Success Rate SLI
      reconciliation_success_rate:
        name: "Reconciliation Success Rate"
        description: "Percentage of successful reconciliations"
        query: |
          (
            rate(controller_runtime_reconcile_total{result="success"}[5m]) / 
            rate(controller_runtime_reconcile_total[5m])
          ) * 100
      
      # Extension Server Response Time SLI
      response_time:
        name: "Extension Server Response Time"
        description: "95th percentile of gRPC response times"
        query: "histogram_quantile(0.95, rate(grpc_server_request_duration_seconds_bucket[5m]))"
      
      # Service Health Check Success Rate SLI
      service_health:
        name: "Service Health Check Success Rate"
        description: "Percentage of successful health checks"
        query: |
          (
            rate(health_check_success_total[5m]) / 
            (rate(health_check_success_total[5m]) + rate(health_check_failure_total[5m]))
          ) * 100
    
    slo:
      # 99.9% availability (43 minutes downtime per month)
      availability:
        target: 99.9
        window: "30d"
        
      # 99.5% reconciliation success rate
      reconciliation_success_rate:
        target: 99.5
        window: "7d"
        
      # 95th percentile response time < 100ms
      response_time:
        target: 0.1  # 100ms
        window: "1h"
        
      # 98% service health check success rate
      service_health:
        target: 98.0
        window: "24h"

Runbooks

Incident Response Runbook

runbook.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: incident-runbook
  namespace: tailscale-gateway-system
data:
  runbook.md: |
    # Tailscale Gateway Incident Response Runbook
    
    ## Common Incident Types
    
    ### 1. Operator Pod CrashLooping
    
    **Symptoms:**
    - Operator pods restarting frequently
    - Alert: `TailscaleGatewayOperatorDown`
    
    **Investigation:**
    ```bash
    # Check pod status
    kubectl get pods -n tailscale-gateway-system -l app=tailscale-gateway-operator
    
    # Check logs
    kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator --previous
    
    # Check events
    kubectl get events -n tailscale-gateway-system --sort-by='.lastTimestamp'

Common Causes & Solutions:

OOM Killed: Increase memory limits
OAuth Issues: Verify secret exists and is valid
RBAC Issues: Check service account permissions
Resource Conflicts: Check for competing operators

2. High Reconciliation Error Rate

Symptoms:

Alert: HighReconciliationErrorRate
Services not updating properly

Investigation:

# Check reconciliation metrics
curl http://tailscale-gateway-operator:8080/metrics | grep reconcile_errors

# Check specific controller logs
kubectl logs -n tailscale-gateway-system deployment/tailscale-gateway-operator | grep ERROR

Common Causes & Solutions:

Tailscale API Rate Limits: Implement backoff, check quota
Network Issues: Verify connectivity to api.tailscale.com
Invalid Configurations: Validate TailscaleEndpoints specs

3. Extension Server Connectivity Issues

Symptoms:

Alert: ExtensionServerDown
Routes not being injected

Investigation:

# Test gRPC connectivity
grpcurl -plaintext tailscale-gateway-extension-server:5005 grpc.health.v1.Health/Check

# Check Envoy Gateway configuration
kubectl get envoyproxy -A -o yaml

Solutions:

Restart extension server pods
Verify Envoy Gateway extension configuration
Check network policies

Emergency Procedures

Complete Operator Restart

# Scale down
kubectl scale deployment tailscale-gateway-operator --replicas=0 -n tailscale-gateway-system

# Wait for pods to terminate
kubectl wait --for=delete pod -l app=tailscale-gateway-operator -n tailscale-gateway-system --timeout=60s

# Scale back up
kubectl scale deployment tailscale-gateway-operator --replicas=3 -n tailscale-gateway-system

Rollback to Previous Version

# Check rollout history
kubectl rollout history deployment/tailscale-gateway-operator -n tailscale-gateway-system

# Rollback to previous version
kubectl rollout undo deployment/tailscale-gateway-operator -n tailscale-gateway-system

# Monitor rollback
kubectl rollout status deployment/tailscale-gateway-operator -n tailscale-gateway-system

Escalation Contacts

Primary On-Call: Platform Team (+1-xxx-xxx-xxxx)
Secondary On-Call: Infrastructure Team (+1-xxx-xxx-xxxx)
Manager: Engineering Manager (+1-xxx-xxx-xxxx)

## Performance Optimization

### Production Performance Tuning

```yaml title="performance-tuning.yaml"
apiVersion: v1
kind: ConfigMap
metadata:
  name: performance-config
  namespace: tailscale-gateway-system
data:
  config.yaml: |
    # High-performance production configuration
    operator:
      # Controller tuning
      maxConcurrentReconciles: 20
      syncPeriod: "30s"
      resyncPeriod: "10m"
      
      # Client configuration
      clientConfig:
        qps: 50
        burst: 100
        timeout: "30s"
        
      # Cache tuning
      cache:
        syncPeriod: "30s"
        resyncPeriod: "10m"
        
      # Memory optimization
      gc:
        targetPercentage: 80
        
    extensionServer:
      # gRPC performance
      grpcServer:
        maxReceiveMessageSize: 104857600  # 100MB
        maxSendMessageSize: 104857600     # 100MB
        keepaliveTime: "30s"
        keepaliveTimeout: "5s"
        maxConnectionIdle: "300s"
        maxConnectionAge: "30m"
        
      # Connection pooling
      connectionPool:
        maxIdleConns: 100
        maxIdleConnsPerHost: 10
        idleConnTimeout: "90s"
        
    serviceCoordinator:
      # Coordination optimization
      batchSize: 100
      batchTimeout: "1s"
      maxRetries: 3
      cleanupInterval: "5m"

Cost Optimization

Resource Right-Sizing

cost-optimization.yaml
# Cost-optimized configuration for different environments

# Development environment (lower costs)
development:
  operator:
    replicas: 1
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "200m"
        memory: "128Mi"
  
  extensionServer:
    replicas: 1
    resources:
      requests:
        cpu: "25m"
        memory: "32Mi"
      limits:
        cpu: "100m"
        memory: "64Mi"

# Staging environment (balanced)
staging:
  operator:
    replicas: 2
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
  
  extensionServer:
    replicas: 1
    resources:
      requests:
        cpu: "50m"
        memory: "64Mi"
      limits:
        cpu: "250m"
        memory: "128Mi"

# Production environment (high availability)
production:
  operator:
    replicas: 3
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "1000m"
        memory: "512Mi"
  
  extensionServer:
    replicas: 2
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"

Compliance and Auditing

Audit Logging Configuration

audit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: audit-config
  namespace: tailscale-gateway-system
data:
  audit.yaml: |
    audit:
      enabled: true
      
      # Audit events to log
      events:
        - "create"
        - "update"
        - "delete"
        - "patch"
      
      # Resources to audit
      resources:
        - "tailscaleendpoints"
        - "tailscalegateways"
        - "tailscaleroutepolicies"
        - "tailscaletailnets"
        - "secrets"
        - "configmaps"
      
      # Audit log configuration
      logging:
        format: "json"
        destination: "stdout"
        level: "info"
        
        # Additional metadata
        metadata:
          user: true
          timestamp: true
          resource: true
          action: true
          
      # Retention policy
      retention:
        days: 90
        
      # External audit systems
      external:
        enabled: true
        endpoints:
          - type: "webhook"
            url: "https://audit.company.com/webhook"
            headers:
              "Authorization": "Bearer ${AUDIT_TOKEN}"
          - type: "siem"
            url: "https://siem.company.com/api/events"

Next Steps

Monitoring Guide - Detailed monitoring setup
Troubleshooting - Common issues and solutions
Configuration Reference - Advanced configurations

Production Readiness Checklist​

✅ Security Requirements​

✅ High Availability​

✅ Monitoring & Observability​

✅ Operational Excellence​

Production Configuration​

High Availability Deployment​

Extension Server Production Configuration​

Security Hardening​

RBAC Configuration​

Network Policies​

Secret Management​

Backup and Disaster Recovery​

Configuration Backup​

Disaster Recovery Procedures​

Capacity Planning​

Resource Estimation​

Horizontal Pod Autoscaler​

Update and Rollback Procedures​

Rolling Update Strategy​

Production Monitoring​

SLI/SLO Definitions​

Runbooks​

Incident Response Runbook​

2. High Reconciliation Error Rate​

3. Extension Server Connectivity Issues​

Emergency Procedures​

Complete Operator Restart​

Rollback to Previous Version​

Escalation Contacts​

Cost Optimization​

Resource Right-Sizing​

Compliance and Auditing​

Audit Logging Configuration​

Next Steps​