Files
log/doc/heartbeat-monitoring.md

8.8 KiB

Heartbeat Monitoring

← Disk Management | ← Back to README | Performance →

Guide to using heartbeat messages for operational monitoring and system health tracking.

Table of Contents

Overview

Heartbeats are periodic log messages that provide operational statistics about the logger and system. They bypass normal log level filtering, ensuring visibility even when running at higher log levels.

Key Features

  • Always Visible: Heartbeats use special log levels that bypass filtering
  • Multi-Level Detail: Choose from process, disk, or system statistics
  • Production Monitoring: Track logger health without debug logs
  • Metrics Source: Parse heartbeats for monitoring dashboards

Heartbeat Levels

Level 0: Disabled (Default)

No heartbeat messages are generated.

logger.InitWithDefaults(
    "heartbeat_level=0",  // No heartbeats
)

Level 1: Process Statistics (PROC)

Basic logger operation metrics:

logger.InitWithDefaults(
    "heartbeat_level=1",
    "heartbeat_interval_s=300",  // Every 5 minutes
)

Output:

2024-01-15T10:30:00Z PROC type="proc" sequence=1 uptime_hours="24.50" processed_logs=1847293 dropped_logs=0

Fields:

  • sequence: Incrementing counter
  • uptime_hours: Logger uptime
  • processed_logs: Successfully written logs
  • dropped_logs: Logs lost due to buffer overflow

Level 2: Process + Disk Statistics (DISK)

Includes file and disk usage information:

logger.InitWithDefaults(
    "heartbeat_level=2",
    "heartbeat_interval_s=300",
)

Additional Output:

2024-01-15T10:30:00Z DISK type="disk" sequence=1 rotated_files=12 deleted_files=5 total_log_size_mb="487.32" log_file_count=8 current_file_size_mb="23.45" disk_status_ok=true disk_free_mb="5234.67"

Additional Fields:

  • rotated_files: Total file rotations
  • deleted_files: Files removed by cleanup
  • total_log_size_mb: Size of all log files
  • log_file_count: Number of log files
  • current_file_size_mb: Active file size
  • disk_status_ok: Disk health status
  • disk_free_mb: Available disk space

Level 3: Process + Disk + System Statistics (SYS)

Includes runtime and memory metrics:

logger.InitWithDefaults(
    "heartbeat_level=3",
    "heartbeat_interval_s=60",  // Every minute for detailed monitoring
)

Additional Output:

2024-01-15T10:30:00Z SYS type="sys" sequence=1 alloc_mb="45.23" sys_mb="128.45" num_gc=1523 num_goroutine=42

Additional Fields:

  • alloc_mb: Allocated memory
  • sys_mb: System memory reserved
  • num_gc: Garbage collection runs
  • num_goroutine: Active goroutines

Configuration

Basic Configuration

logger.InitWithDefaults(
    "heartbeat_level=2",         // Process + Disk stats
    "heartbeat_interval_s=300",  // Every 5 minutes
)

Interval Recommendations

Environment Level Interval Rationale
Development 3 30s Detailed debugging info
Staging 2 300s Balance detail vs noise
Production 1-2 300-600s Minimize overhead
High-Load 1 600s Reduce I/O impact

Dynamic Adjustment

// Start with basic monitoring
logger.InitWithDefaults(
    "heartbeat_level=1",
    "heartbeat_interval_s=600",
)

// During incident, increase detail
logger.InitWithDefaults(
    "heartbeat_level=3",
    "heartbeat_interval_s=60",
)

// After resolution, reduce back
logger.InitWithDefaults(
    "heartbeat_level=1",
    "heartbeat_interval_s=600",
)

Heartbeat Messages

JSON Format Example

With format=json, heartbeats are structured for easy parsing:

{
  "time": "2024-01-15T10:30:00.123456789Z",
  "level": "PROC",
  "fields": [
    "type", "proc",
    "sequence", 42,
    "uptime_hours", "24.50",
    "processed_logs", 1847293,
    "dropped_logs", 0
  ]
}

Text Format Example

With format=txt, heartbeats are human-readable:

2024-01-15T10:30:00.123456789Z PROC type="proc" sequence=42 uptime_hours="24.50" processed_logs=1847293 dropped_logs=0

Monitoring Integration

Prometheus Exporter

type LoggerMetrics struct {
    logger *log.Logger
    
    uptime          prometheus.Gauge
    processedTotal  prometheus.Counter
    droppedTotal    prometheus.Counter
    diskUsageMB     prometheus.Gauge
    diskFreeSpace   prometheus.Gauge
    fileCount       prometheus.Gauge
}

func (m *LoggerMetrics) ParseHeartbeat(line string) {
    if strings.Contains(line, "type=\"proc\"") {
        // Extract and update process metrics
        if match := regexp.MustCompile(`processed_logs=(\d+)`).FindStringSubmatch(line); match != nil {
            if val, err := strconv.ParseFloat(match[1], 64); err == nil {
                m.processedTotal.Set(val)
            }
        }
    }
    
    if strings.Contains(line, "type=\"disk\"") {
        // Extract and update disk metrics
        if match := regexp.MustCompile(`total_log_size_mb="([0-9.]+)"`).FindStringSubmatch(line); match != nil {
            if val, err := strconv.ParseFloat(match[1], 64); err == nil {
                m.diskUsageMB.Set(val)
            }
        }
    }
}

Grafana Dashboard

Create alerts based on heartbeat metrics:

# Dropped logs alert
- alert: HighLogDropRate
  expr: rate(logger_dropped_total[5m]) > 10
  annotations:
    summary: "High log drop rate detected"
    description: "Logger dropping {{ $value }} logs/sec"

# Disk space alert
- alert: LogDiskSpaceLow
  expr: logger_disk_free_mb < 1000
  annotations:
    summary: "Low log disk space"
    description: "Only {{ $value }}MB free on log disk"

# Logger health alert
- alert: LoggerUnhealthy
  expr: logger_disk_status_ok == 0
  annotations:
    summary: "Logger disk status unhealthy"

ELK Stack Integration

Logstash filter for parsing heartbeats:

filter {
  if [message] =~ /type="(proc|disk|sys)"/ {
    grok {
      match => {
        "message" => [
          '%{TIMESTAMP_ISO8601:timestamp} %{WORD:level} type="%{WORD:heartbeat_type}" sequence=%{NUMBER:sequence:int} uptime_hours="%{NUMBER:uptime_hours:float}" processed_logs=%{NUMBER:processed_logs:int} dropped_logs=%{NUMBER:dropped_logs:int}',
          '%{TIMESTAMP_ISO8601:timestamp} %{WORD:level} type="%{WORD:heartbeat_type}" sequence=%{NUMBER:sequence:int} rotated_files=%{NUMBER:rotated_files:int} deleted_files=%{NUMBER:deleted_files:int} total_log_size_mb="%{NUMBER:total_log_size_mb:float}"'
        ]
      }
    }
    
    mutate {
      add_tag => [ "heartbeat", "metrics" ]
    }
  }
}

Use Cases

1. Production Health Monitoring

// Production configuration
logger.InitWithDefaults(
    "level=4",                   // Warn and Error only
    "heartbeat_level=2",         // But still get disk stats
    "heartbeat_interval_s=300",  // Every 5 minutes
)

// Monitor for:
// - Dropped logs (buffer overflow)
// - Disk space issues
// - File rotation frequency
// - Logger uptime (crash detection)

2. Performance Tuning

// Detailed monitoring during load test
logger.InitWithDefaults(
    "heartbeat_level=3",        // All stats
    "heartbeat_interval_s=10",  // Frequent updates
)

// Track:
// - Memory usage trends
// - Goroutine leaks
// - GC frequency
// - Log throughput

3. Capacity Planning

// Long-term trending
logger.InitWithDefaults(
    "heartbeat_level=2",
    "heartbeat_interval_s=3600",  // Hourly
)

// Analyze:
// - Log growth rate
// - Rotation frequency
// - Disk usage trends
// - Seasonal patterns

4. Debugging Logger Issues

// When investigating logger problems
logger.InitWithDefaults(
    "level=-4",                 // Debug everything
    "heartbeat_level=3",        // All heartbeats
    "heartbeat_interval_s=5",   // Very frequent
    "enable_stdout=true",       // Console output
)

5. Alerting Script

#!/bin/bash
# Monitor heartbeats for issues

tail -f /var/log/myapp/*.log | while read line; do
    if [[ $line =~ type=\"proc\" ]]; then
        if [[ $line =~ dropped_logs=([0-9]+) ]] && [[ ${BASH_REMATCH[1]} -gt 0 ]]; then
            alert "Logs being dropped: ${BASH_REMATCH[1]}"
        fi
    fi
    
    if [[ $line =~ type=\"disk\" ]]; then
        if [[ $line =~ disk_status_ok=false ]]; then
            alert "Logger disk unhealthy!"
        fi
        
        if [[ $line =~ disk_free_mb=\"([0-9.]+)\" ]]; then
            free_mb=${BASH_REMATCH[1]}
            if (( $(echo "$free_mb < 500" | bc -l) )); then
                alert "Low disk space: ${free_mb}MB"
            fi
        fi
    fi
done

← Disk Management | ← Back to README | Performance →