v0.1.11 configurable logging added, minor refactoring, orgnized docs added

This commit is contained in:
2025-07-10 01:17:06 -04:00
parent bc4ce1d0ae
commit 5936f82970
40 changed files with 5745 additions and 1701 deletions

511
doc/monitoring.md Normal file
View File

@ -0,0 +1,511 @@
# Monitoring & Status Guide
LogWisp provides comprehensive monitoring capabilities through status endpoints, operational logs, and metrics.
## Status Endpoints
### Stream Status
Each stream exposes its own status endpoint:
```bash
# Standalone mode
curl http://localhost:8080/status
# Router mode
curl http://localhost:8080/streamname/status
```
Example response:
```json
{
"service": "LogWisp",
"version": "1.0.0",
"server": {
"type": "http",
"port": 8080,
"active_clients": 5,
"buffer_size": 1000,
"uptime_seconds": 3600,
"mode": {
"standalone": true,
"router": false
}
},
"monitor": {
"active_watchers": 3,
"total_entries": 152341,
"dropped_entries": 12,
"start_time": "2024-01-20T10:00:00Z",
"last_entry_time": "2024-01-20T11:00:00Z"
},
"filters": {
"filter_count": 2,
"total_processed": 152341,
"total_passed": 48234,
"filters": [
{
"type": "include",
"logic": "or",
"pattern_count": 3,
"total_processed": 152341,
"total_matched": 48234,
"total_dropped": 0
}
]
},
"features": {
"heartbeat": {
"enabled": true,
"interval": 30,
"format": "comment"
},
"rate_limit": {
"enabled": true,
"total_requests": 8234,
"blocked_requests": 89,
"active_ips": 12,
"total_connections": 5
}
}
}
```
### Global Status (Router Mode)
In router mode, a global status endpoint provides aggregated information:
```bash
curl http://localhost:8080/status
```
## Key Metrics
### Monitor Metrics
Track file watching performance:
| Metric | Description | Healthy Range |
|--------|-------------|---------------|
| `active_watchers` | Number of files being watched | 1-1000 |
| `total_entries` | Total log entries processed | Increasing |
| `dropped_entries` | Entries dropped due to buffer full | < 1% of total |
| `entries_per_second` | Current processing rate | Varies |
### Connection Metrics
Monitor client connections:
| Metric | Description | Warning Signs |
|--------|-------------|---------------|
| `active_clients` | Current SSE connections | Near limit |
| `tcp_connections` | Current TCP connections | Near limit |
| `total_connections` | All active connections | > 80% of max |
### Filter Metrics
Understand filtering effectiveness:
| Metric | Description | Optimization |
|--------|-------------|--------------|
| `total_processed` | Entries checked | - |
| `total_passed` | Entries that passed | Very low = too restrictive |
| `total_dropped` | Entries filtered out | Very high = review patterns |
### Rate Limit Metrics
Track rate limiting impact:
| Metric | Description | Action Needed |
|--------|-------------|---------------|
| `blocked_requests` | Rejected requests | High = increase limits |
| `active_ips` | Unique clients | High = scale out |
| `blocked_percentage` | Rejection rate | > 10% = review |
## Operational Logging
### Log Levels
Configure LogWisp's operational logging:
```toml
[logging]
output = "both" # file and stderr
level = "info" # info for production
```
Log levels and their use:
- **DEBUG**: Detailed internal operations
- **INFO**: Normal operations, connections
- **WARN**: Recoverable issues
- **ERROR**: Errors requiring attention
### Important Log Messages
#### Startup Messages
```
LogWisp starting version=1.0.0 config_file=/etc/logwisp.toml
Stream registered with router stream=app
TCP endpoint configured transport=system port=9090
HTTP endpoints configured transport=app stream_url=http://localhost:8080/stream
```
#### Connection Events
```
HTTP client connected remote_addr=192.168.1.100:54231 active_clients=6
HTTP client disconnected remote_addr=192.168.1.100:54231 active_clients=5
TCP connection opened remote_addr=192.168.1.100:54232 active_connections=3
```
#### Error Conditions
```
Failed to open file for checking path=/var/log/app.log error=permission denied
Scanner error while reading file path=/var/log/huge.log error=token too long
Request rate limited ip=192.168.1.100
Connection limit exceeded ip=192.168.1.100 connections=5 limit=5
```
#### Performance Warnings
```
Dropped log entry - subscriber buffer full
Dropped entry for slow client remote_addr=192.168.1.100
Check interval too small: 5ms (min: 10ms)
```
## Health Checks
### Basic Health Check
Simple up/down check:
```bash
#!/bin/bash
# health_check.sh
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/status)
if [ "$STATUS" -eq 200 ]; then
echo "LogWisp is healthy"
exit 0
else
echo "LogWisp is unhealthy (status: $STATUS)"
exit 1
fi
```
### Advanced Health Check
Check specific conditions:
```bash
#!/bin/bash
# advanced_health_check.sh
RESPONSE=$(curl -s http://localhost:8080/status)
# Check if processing logs
ENTRIES=$(echo "$RESPONSE" | jq -r '.monitor.total_entries')
if [ "$ENTRIES" -eq 0 ]; then
echo "WARNING: No log entries processed"
exit 1
fi
# Check dropped entries
DROPPED=$(echo "$RESPONSE" | jq -r '.monitor.dropped_entries')
TOTAL=$(echo "$RESPONSE" | jq -r '.monitor.total_entries')
DROP_PERCENT=$(( DROPPED * 100 / TOTAL ))
if [ "$DROP_PERCENT" -gt 5 ]; then
echo "WARNING: High drop rate: ${DROP_PERCENT}%"
exit 1
fi
# Check connections
CONNECTIONS=$(echo "$RESPONSE" | jq -r '.server.active_clients')
echo "OK: Processing logs, $CONNECTIONS active clients"
exit 0
```
### Container Health Check
Docker/Kubernetes configuration:
```dockerfile
# Dockerfile
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
CMD curl -f http://localhost:8080/status || exit 1
```
```yaml
# Kubernetes
livenessProbe:
httpGet:
path: /status
port: 8080
initialDelaySeconds: 10
periodSeconds: 30
readinessProbe:
httpGet:
path: /status
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
```
## Monitoring Integration
### Prometheus Metrics
Export metrics in Prometheus format:
```bash
#!/bin/bash
# prometheus_exporter.sh
while true; do
STATUS=$(curl -s http://localhost:8080/status)
# Extract metrics
CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients')
ENTRIES=$(echo "$STATUS" | jq -r '.monitor.total_entries')
DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries')
# Output Prometheus format
cat << EOF
# HELP logwisp_active_clients Number of active streaming clients
# TYPE logwisp_active_clients gauge
logwisp_active_clients $CLIENTS
# HELP logwisp_total_entries Total log entries processed
# TYPE logwisp_total_entries counter
logwisp_total_entries $ENTRIES
# HELP logwisp_dropped_entries Total log entries dropped
# TYPE logwisp_dropped_entries counter
logwisp_dropped_entries $DROPPED
EOF
sleep 60
done
```
### Grafana Dashboard
Key panels for Grafana:
1. **Active Connections**
- Query: `logwisp_active_clients`
- Visualization: Graph
- Alert: > 80% of max
2. **Log Processing Rate**
- Query: `rate(logwisp_total_entries[5m])`
- Visualization: Graph
- Alert: < 1 entry/min
3. **Drop Rate**
- Query: `rate(logwisp_dropped_entries[5m]) / rate(logwisp_total_entries[5m])`
- Visualization: Gauge
- Alert: > 5%
4. **Rate Limit Rejections**
- Query: `rate(logwisp_blocked_requests[5m])`
- Visualization: Graph
- Alert: > 10/min
### Datadog Integration
Send custom metrics:
```bash
#!/bin/bash
# datadog_metrics.sh
while true; do
STATUS=$(curl -s http://localhost:8080/status)
# Send metrics to Datadog
echo "$STATUS" | jq -r '
"logwisp.connections:\(.server.active_clients)|g",
"logwisp.entries:\(.monitor.total_entries)|c",
"logwisp.dropped:\(.monitor.dropped_entries)|c"
' | while read metric; do
echo "$metric" | nc -u -w1 localhost 8125
done
sleep 60
done
```
## Performance Monitoring
### CPU Usage
Monitor CPU usage by component:
```bash
# Check process CPU
top -p $(pgrep logwisp) -b -n 1
# Profile CPU usage
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
```
Common CPU consumers:
- File watching (reduce check_interval_ms)
- Regex filtering (simplify patterns)
- JSON encoding (reduce clients)
### Memory Usage
Track memory consumption:
```bash
# Check process memory
ps aux | grep logwisp
# Detailed memory stats
cat /proc/$(pgrep logwisp)/status | grep -E "Vm(RSS|Size)"
```
Memory optimization:
- Reduce buffer sizes
- Limit connections
- Simplify filters
### Network Bandwidth
Monitor streaming bandwidth:
```bash
# Network statistics
netstat -i
iftop -i eth0 -f "port 8080"
# Connection count
ss -tan | grep :8080 | wc -l
```
## Alerting
### Basic Alerts
Essential alerts to configure:
| Alert | Condition | Severity |
|-------|-----------|----------|
| Service Down | Status endpoint fails | Critical |
| High Drop Rate | > 10% entries dropped | Warning |
| No Log Activity | 0 entries/min for 5 min | Warning |
| Connection Limit | > 90% of max connections | Warning |
| Rate Limit High | > 20% requests blocked | Warning |
### Alert Script
Example monitoring script:
```bash
#!/bin/bash
# monitor_alerts.sh
check_alert() {
local name=$1
local condition=$2
local message=$3
if eval "$condition"; then
echo "ALERT: $name - $message"
# Send to alerting system
# curl -X POST https://alerts.example.com/...
fi
}
while true; do
STATUS=$(curl -s http://localhost:8080/status)
if [ -z "$STATUS" ]; then
check_alert "SERVICE_DOWN" "true" "LogWisp not responding"
sleep 60
continue
fi
# Extract metrics
DROPPED=$(echo "$STATUS" | jq -r '.monitor.dropped_entries')
TOTAL=$(echo "$STATUS" | jq -r '.monitor.total_entries')
CLIENTS=$(echo "$STATUS" | jq -r '.server.active_clients')
# Check conditions
check_alert "HIGH_DROP_RATE" \
"[ $((DROPPED * 100 / TOTAL)) -gt 10 ]" \
"Drop rate above 10%"
check_alert "HIGH_CONNECTIONS" \
"[ $CLIENTS -gt 90 ]" \
"Near connection limit: $CLIENTS/100"
sleep 60
done
```
## Troubleshooting with Monitoring
### No Logs Appearing
Check monitor stats:
```bash
curl -s http://localhost:8080/status | jq '.monitor'
```
Look for:
- `active_watchers` = 0 (no files found)
- `total_entries` not increasing (files not updating)
### High CPU Usage
Enable debug logging:
```bash
logwisp --log-level debug --log-output stderr
```
Watch for:
- Frequent "checkFile" messages (reduce check_interval)
- Many filter operations (optimize patterns)
### Memory Growth
Monitor over time:
```bash
while true; do
ps aux | grep logwisp | grep -v grep
curl -s http://localhost:8080/status | jq '.server.active_clients'
sleep 10
done
```
### Connection Issues
Check connection stats:
```bash
# Current connections
curl -s http://localhost:8080/status | jq '.server'
# Rate limit stats
curl -s http://localhost:8080/status | jq '.features.rate_limit'
```
## Best Practices
1. **Regular Monitoring**: Check status endpoints every 30-60 seconds
2. **Set Alerts**: Configure alerts for critical conditions
3. **Log Rotation**: Rotate LogWisp's own logs to prevent disk fill
4. **Baseline Metrics**: Establish normal ranges for your environment
5. **Capacity Planning**: Monitor trends for scaling decisions
6. **Test Monitoring**: Verify alerts work before issues occur
## See Also
- [Performance Tuning](performance.md) - Optimization guide
- [Troubleshooting](troubleshooting.md) - Common issues
- [Configuration Guide](configuration.md) - Monitoring configuration
- [Integration Examples](integrations.md) - Monitoring system integration