lixen/logwisp

Lixen Wraith 5936f82970 v0.1.11 configurable logging added, minor refactoring, orgnized docs added

2025-07-10 01:17:06 -04:00

Monitoring & Status Guide

LogWisp exposes its runtime state through HTTP status endpoints, operational logs, and the metrics reported in the status response.

Status Endpoints

Stream Status

Each stream exposes its own status endpoint:

# Standalone mode
curl http://localhost:8080/status

# Router mode
curl http://localhost:8080/streamname/status

Example response:

{
  "service": "LogWisp",
  "version": "1.0.0",
  "server": {
    "type": "http",
    "port": 8080,
    "active_clients": 5,
    "buffer_size": 1000,
    "uptime_seconds": 3600,
    "mode": {
      "standalone": true,
      "router": false
    }
  },
  "monitor": {
    "active_watchers": 3,
    "total_entries": 152341,
    "dropped_entries": 12,
    "start_time": "2024-01-20T10:00:00Z",
    "last_entry_time": "2024-01-20T11:00:00Z"
  },
  "filters": {
    "filter_count": 2,
    "total_processed": 152341,
    "total_passed": 48234,
    "filters": [
      {
        "type": "include",
        "logic": "or",
        "pattern_count": 3,
        "total_processed": 152341,
        "total_matched": 48234,
        "total_dropped": 0
      }
    ]
  },
  "features": {
    "heartbeat": {
      "enabled": true,
      "interval": 30,
      "format": "comment"
    },
    "rate_limit": {
      "enabled": true,
      "total_requests": 8234,
      "blocked_requests": 89,
      "active_ips": 12,
      "total_connections": 5
    }
  }
}

Global Status (Router Mode)

In router mode, a global status endpoint provides aggregated information:

curl http://localhost:8080/status

Key Metrics

Monitor Metrics

Track file watching performance:

Metric	Description	Healthy Range
`active_watchers`	Number of files being watched	1-1000
`total_entries`	Total log entries processed	Increasing
`dropped_entries`	Entries dropped due to buffer full	< 1% of total
`entries_per_second`	Current processing rate	Varies
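
`entries_per_second` is not among the fields in the example response above. If your status response lacks it, one way to approximate it is from two `total_entries` samples. The following is a minimal sketch; the function name and the sampling interval are illustrative assumptions, not part of LogWisp:

```shell
# entries_per_second PREV_TOTAL CUR_TOTAL INTERVAL_SECONDS
# Hypothetical helper: prints the average processing rate between two
# total_entries samples, with one decimal place.
entries_per_second() {
    prev=$1; cur=$2; interval=$3
    # A total that went backwards suggests a restart; report 0.0 for
    # that window instead of a negative rate.
    if [ "$cur" -lt "$prev" ] || [ "$interval" -le 0 ]; then
        echo "0.0"
        return
    fi
    awk -v p="$prev" -v c="$cur" -v i="$interval" \
        'BEGIN { printf "%.1f\n", (c - p) / i }'
}

# Usage (assumes jq is installed and the standalone status URL):
#   a=$(curl -s http://localhost:8080/status | jq -r '.monitor.total_entries')
#   sleep 10
#   b=$(curl -s http://localhost:8080/status | jq -r '.monitor.total_entries')
#   entries_per_second "$a" "$b" 10
```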

Connection Metrics

Monitor client connections:

Metric	Description	Warning Signs
`active_clients`	Current SSE connections	Near limit
`tcp_connections`	Current TCP connections	Near limit
`total_connections`	All active connections	> 80% of max

Filter Metrics

Understand filtering effectiveness:

Metric	Description	Optimization
`total_processed`	Entries checked	-
`total_passed`	Entries that passed	Very low = too restrictive
`total_dropped`	Entries filtered out	Very high = review patterns
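
To judge whether a filter is too restrictive, it can help to turn the raw counters into a percentage per filter. The sketch below assumes the response layout from the example above (`.filters.filters[]` with `total_matched` and `total_processed`) and that `jq` is installed; the function name is hypothetical:

```shell
# filter_match_rates: reads a status response on stdin and prints one
# line per filter: type, logic, and the percentage of entries matched.
# Filters that have processed nothing are skipped to avoid dividing by zero.
filter_match_rates() {
    jq -r '.filters.filters[]?
           | "\(.type) \(.logic) \(.total_matched) \(.total_processed)"' |
    awk '$4 > 0 { printf "%s %s %.1f%%\n", $1, $2, $3 * 100 / $4 }'
}

# Usage:
#   curl -s http://localhost:8080/status | filter_match_rates
```

For an `include` filter the matched percentage is the share of entries that pass; for an exclude-style filter it would be the share that is dropped.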

Rate Limit Metrics

Track rate limiting impact:

Metric	Description	Action Needed
`blocked_requests`	Rejected requests	High = increase limits
`active_ips`	Unique clients	High = scale out
`blocked_percentage`	Rejection rate	> 10% = review
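
`blocked_percentage` does not appear in the example response, but it can be derived from `total_requests` and `blocked_requests` under `.features.rate_limit`. A minimal sketch, with hypothetical helper names, that applies the 10% guideline from the table:

```shell
# blocked_percentage TOTAL_REQUESTS BLOCKED_REQUESTS
# Prints the rejection rate with one decimal place (0.0 if no requests).
blocked_percentage() {
    awk -v t="$1" -v b="$2" 'BEGIN {
        if (t <= 0) { print "0.0"; exit }
        printf "%.1f\n", b * 100 / t
    }'
}

# needs_review PERCENT
# Succeeds (exit status 0) when the rate is above 10%.
needs_review() {
    awk -v p="$1" 'BEGIN { exit !(p > 10) }'
}

# Usage with the values from the example response:
#   pct=$(blocked_percentage 8234 89)
#   if needs_review "$pct"; then echo "Review rate limits: ${pct}%"; fi
```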

Operational Logging

Log Levels

Configure LogWisp's operational logging:

[logging]
output = "both"     # file and stderr
level = "info"      # info for production

Log levels and their use:

DEBUG: Detailed internal operations
INFO: Normal operations, connections
WARN: Recoverable issues
ERROR: Errors requiring attention

Important Log Messages

Startup Messages

LogWisp starting version=1.0.0 config_file=/etc/logwisp.toml
Stream registered with router stream=app
TCP endpoint configured transport=system port=9090
HTTP endpoints configured transport=app stream_url=http://localhost:8080/stream

Connection Events

HTTP client connected remote_addr=192.168.1.100:54231 active_clients=6
HTTP client disconnected remote_addr=192.168.1.100:54231 active_clients=5
TCP connection opened remote_addr=192.168.1.100:54232 active_connections=3

Error Conditions

Failed to open file for checking path=/var/log/app.log error=permission denied
Scanner error while reading file path=/var/log/huge.log error=token too long
Request rate limited ip=192.168.1.100
Connection limit exceeded ip=192.168.1.100 connections=5 limit=5

Performance Warnings

Dropped log entry - subscriber buffer full
Dropped entry for slow client remote_addr=192.168.1.100
Check interval too small: 5ms (min: 10ms)
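
When LogWisp logs to a file, counting these messages gives a quick picture of what has gone wrong recently. The sketch below matches the message texts listed above as fixed strings; the function name and the log path in the usage line are assumptions, so substitute the file your `[logging]` configuration writes to:

```shell
# summarize_log FILE
# Hypothetical helper: counts occurrences of the warning and error
# messages shown in this section and prints "message: count" lines.
summarize_log() {
    for pattern in \
        "Dropped log entry" \
        "Dropped entry for slow client" \
        "Request rate limited" \
        "Connection limit exceeded" \
        "Scanner error while reading file"
    do
        # grep -c prints 0 but exits non-zero when nothing matches,
        # so "|| true" keeps the function usable under "set -e".
        count=$(grep -c -F -- "$pattern" "$1" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Usage (the path is an assumption):
#   summarize_log /var/log/logwisp/logwisp.log
```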

Health Checks

Basic Health Check

Simple up/down check:

#!/bin/bash
# health_check.sh

STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/status)

if [ "$STATUS" -eq 200 ]; then
    echo "LogWisp is healthy"
    exit 0
else
    echo "LogWisp is unhealthy (status: $STATUS)"
    exit 1
fi

Advanced Health Check

Check specific conditions:

#!/bin/bash
# advanced_health_check.sh

# -f makes curl fail on HTTP errors; --max-time keeps the check from hanging
if ! RESPONSE=$(curl -sf --max-time 5 http://localhost:8080/status); then
    echo "CRITICAL: Status endpoint not responding"
    exit 1
fi

# Check if processing logs ("// 0" maps a missing field to 0)
TOTAL=$(echo "$RESPONSE" | jq -r '.monitor.total_entries // 0')
if [ "$TOTAL" -eq 0 ]; then
    echo "WARNING: No log entries processed"
    exit 1
fi

# Check dropped entries (integer percentage; TOTAL is non-zero here)
DROPPED=$(echo "$RESPONSE" | jq -r '.monitor.dropped_entries // 0')
DROP_PERCENT=$(( DROPPED * 100 / TOTAL ))

if [ "$DROP_PERCENT" -gt 5 ]; then
    echo "WARNING: High drop rate: ${DROP_PERCENT}%"
    exit 1
fi

# Check connections
CONNECTIONS=$(echo "$RESPONSE" | jq -r '.server.active_clients')
echo "OK: Processing logs, $CONNECTIONS active clients"
exit 0

Container Health Check

Docker/Kubernetes configuration:

# Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
    CMD curl -f http://localhost:8080/status || exit 1

# Kubernetes
livenessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /status
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

Monitoring Integration

Prometheus Metrics

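The following script polls the status endpoint once a minute and prints selected metrics to stdout in the Prometheus text exposition format: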

#!/bin/bash
# prometheus_exporter.sh

while true; do
    STATUS=$(curl -s http://localhost:8080/status)
    
    # Extract metrics
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients')
    ENTRIES=$(echo "$STATUS" | jq -r '.monitor.total_entries')
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries')
    
    # Output Prometheus format
    cat << EOF
# HELP logwisp_active_clients Number of active streaming clients
# TYPE logwisp_active_clients gauge
logwisp_active_clients $CLIENTS

# HELP logwisp_total_entries Total log entries processed
# TYPE logwisp_total_entries counter
logwisp_total_entries $ENTRIES

# HELP logwisp_dropped_entries Total log entries dropped
# TYPE logwisp_dropped_entries counter
logwisp_dropped_entries $DROPPED
EOF

    sleep 60
done
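
Prometheus scrapes metrics rather than reading a script's stdout, so the text has to be served somewhere. One common option is the node_exporter textfile collector, which reads `*.prom` files from the directory given by `--collector.textfile.directory`. The sketch below writes the metrics to such a file atomically and also includes `blocked_requests`, which the Grafana panels refer to. The function name and the output path are assumptions:

```shell
# write_prom_file OUTPUT_FILE CLIENTS ENTRIES DROPPED BLOCKED
# Hypothetical helper (not part of LogWisp). Writes to a temporary file
# next to the target, then renames it, so a reader never sees a
# partially written file.
write_prom_file() {
    out=$1
    tmp="$out.tmp.$$"
    cat > "$tmp" << EOF
# HELP logwisp_active_clients Number of active streaming clients
# TYPE logwisp_active_clients gauge
logwisp_active_clients $2
# HELP logwisp_total_entries Total log entries processed
# TYPE logwisp_total_entries counter
logwisp_total_entries $3
# HELP logwisp_dropped_entries Total log entries dropped
# TYPE logwisp_dropped_entries counter
logwisp_dropped_entries $4
# HELP logwisp_blocked_requests Total requests rejected by rate limiting
# TYPE logwisp_blocked_requests counter
logwisp_blocked_requests $5
EOF
    mv "$tmp" "$out"
}

# Usage (the directory is an assumption; values come from jq as above):
#   BLOCKED=$(echo "$STATUS" | jq -r '.features.rate_limit.blocked_requests')
#   write_prom_file /var/lib/node_exporter/textfile/logwisp.prom \
#       "$CLIENTS" "$ENTRIES" "$DROPPED" "$BLOCKED"
```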

Grafana Dashboard

Key panels for Grafana:

Active Connections
- Query: logwisp_active_clients
- Visualization: Graph
- Alert: > 80% of max
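Log Processing Rate
- Query: rate(logwisp_total_entries[5m]) * 60
- Visualization: Graph
- Alert: < 1 entry/min (rate() is per-second, so the query multiplies by 60)
Drop Rate
- Query: rate(logwisp_dropped_entries[5m]) / rate(logwisp_total_entries[5m])
- Visualization: Gauge
- Alert: > 5% (0.05 as a ratio)
Rate Limit Rejections
- Query: rate(logwisp_blocked_requests[5m]) * 60
- Visualization: Graph
- Alert: > 10/min
- Note: prometheus_exporter.sh as written emits only the three metrics shown in it; this panel needs logwisp_blocked_requests exported from .features.rate_limit.blocked_requests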

Datadog Integration

Send custom metrics:

#!/bin/bash
# datadog_metrics.sh

while true; do
    STATUS=$(curl -s http://localhost:8080/status)
    
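    # Send metrics to Datadog (DogStatsD over UDP port 8125).
    # total_entries and dropped_entries are cumulative totals, so they are
    # sent as gauges (|g); a StatsD counter (|c) expects per-interval increments.
    echo "$STATUS" | jq -r '
        "logwisp.connections:\(.server.active_clients)|g",
        "logwisp.entries:\(.monitor.total_entries)|g",
        "logwisp.dropped:\(.monitor.dropped_entries)|g"
    ' | while read -r metric; do
        echo "$metric" | nc -u -w1 localhost 8125
    done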
    
    sleep 60
done
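
StatsD counters (`|c`) are increments, whereas the status endpoint reports cumulative totals. If you want counter semantics in Datadog, one option is to send the difference between consecutive samples. A minimal sketch with a hypothetical helper name:

```shell
# counter_delta PREV CUR
# Prints the increase since the previous sample. If the total went
# backwards (for example after a LogWisp restart), the current value
# is treated as the increase since the restart.
counter_delta() {
    if [ "$2" -ge "$1" ]; then
        echo $(( $2 - $1 ))
    else
        echo "$2"
    fi
}

# Usage inside the polling loop (PREV_ENTRIES starts at the first sample):
#   CUR_ENTRIES=$(echo "$STATUS" | jq -r '.monitor.total_entries')
#   echo "logwisp.entries:$(counter_delta "$PREV_ENTRIES" "$CUR_ENTRIES")|c" \
#       | nc -u -w1 localhost 8125
#   PREV_ENTRIES=$CUR_ENTRIES
```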

Performance Monitoring

CPU Usage

Monitor CPU usage by component:

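# Check process CPU (pgrep -d, joins multiple PIDs with commas for top)
top -p "$(pgrep -d, logwisp)" -b -n 1

# Profile CPU usage (URL quoted so the shell does not treat ? as a glob)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"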

Common CPU consumers:

File watching (increase check_interval_ms so files are polled less often)
Regex filtering (simplify patterns)
JSON encoding (reduce clients)

Memory Usage

Track memory consumption:

# Check process memory ([l] keeps grep from matching its own process)
ps aux | grep '[l]ogwisp'

# Detailed memory stats (pgrep -o picks the oldest matching process)
grep -E "Vm(RSS|Size)" "/proc/$(pgrep -o logwisp)/status"

Memory optimization:

Reduce buffer sizes
Limit connections
Simplify filters

Network Bandwidth

Monitor streaming bandwidth:

# Network statistics
netstat -i
iftop -i eth0 -f "port 8080"

# Connection count (established connections on local port 8080 only)
ss -Htn state established '( sport = :8080 )' | wc -l

Alerting

Basic Alerts

Essential alerts to configure:

Alert	Condition	Severity
Service Down	Status endpoint fails	Critical
High Drop Rate	> 10% entries dropped	Warning
No Log Activity	0 entries/min for 5 min	Warning
Connection Limit	> 90% of max connections	Warning
Rate Limit High	> 20% requests blocked	Warning

Alert Script

Example monitoring script:

#!/bin/bash
# monitor_alerts.sh

check_alert() {
    local name=$1
    local condition=$2
    local message=$3
    
    if eval "$condition"; then
        echo "ALERT: $name - $message"
        # Send to alerting system
        # curl -X POST https://alerts.example.com/...
    fi
}

while true; do
    STATUS=$(curl -s http://localhost:8080/status)

    if [ -z "$STATUS" ]; then
        check_alert "SERVICE_DOWN" "true" "LogWisp not responding"
        sleep 60
        continue
    fi

    # Extract metrics ("// 0" maps missing fields to 0)
    DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries // 0')
    TOTAL=$(echo "$STATUS" | jq -r '.monitor.total_entries // 0')
    CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients // 0')

    # Check conditions. The drop-rate check is skipped until entries
    # exist; otherwise the division fails with a divide-by-zero error.
    if [ "$TOTAL" -gt 0 ]; then
        check_alert "HIGH_DROP_RATE" \
            "[ $((DROPPED * 100 / TOTAL)) -gt 10 ]" \
            "Drop rate above 10%"
    fi

    # Assumes a configured maximum of 100 connections; adjust to your limit
    check_alert "HIGH_CONNECTIONS" \
        "[ $CLIENTS -gt 90 ]" \
        "Near connection limit: $CLIENTS/100"

    sleep 60
done
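
The script above does not cover the "No Log Activity" alert from the table. One way to approach it is to keep the last few `total_entries` samples and alert when none of them differ. A minimal sketch; the function name is hypothetical, and sampling once a minute for five minutes is an assumption based on the table:

```shell
# no_activity SAMPLE...
# Succeeds (exit status 0) when at least two samples are given and all
# of them are identical, meaning no entries were processed in between.
no_activity() {
    [ "$#" -ge 2 ] || return 1
    first=$1
    for sample in "$@"; do
        [ "$sample" = "$first" ] || return 1
    done
    return 0
}

# Usage: pass the samples collected over the last five minutes.
#   if no_activity "$S1" "$S2" "$S3" "$S4" "$S5"; then
#       echo "ALERT: NO_LOG_ACTIVITY - total_entries unchanged for 5 min"
#   fi
```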

Troubleshooting with Monitoring

No Logs Appearing

Check monitor stats:

curl -s http://localhost:8080/status | jq '.monitor'

Look for:

active_watchers = 0 (no files found)
total_entries not increasing (files not updating)

High CPU Usage

Enable debug logging:

logwisp --log-level debug --log-output stderr

Watch for:

Frequent "checkFile" messages (increase check_interval_ms)
Many filter operations (optimize patterns)

Memory Growth

Monitor over time:

while true; do
    ps aux | grep logwisp | grep -v grep
    curl -s http://localhost:8080/status | jq '.server.active_clients'
    sleep 10
done

Connection Issues

Check connection stats:

# Current connections
curl -s http://localhost:8080/status | jq '.server'

# Rate limit stats
curl -s http://localhost:8080/status | jq '.features.rate_limit'

Best Practices

Regular Monitoring: Check status endpoints every 30-60 seconds
Set Alerts: Configure alerts for critical conditions
Log Rotation: Rotate LogWisp's own logs to prevent disk fill
Baseline Metrics: Establish normal ranges for your environment
Capacity Planning: Monitor trends for scaling decisions
Test Monitoring: Verify alerts work before issues occur
