
Logging Practices That Actually Help Debug Production Issues

Published on February 15, 2026

How to log useful information without creating noise or exposing secrets.

Structured logging over plain text

Structured logs are machine-parseable. Use JSON format with consistent fields. This enables querying, filtering, and aggregation. Plain text logs are hard to parse programmatically. A log line like "User john logged in from IP 1.2.3.4" requires regex to extract fields. JSON with {"event":"login","user":"john","ip":"1.2.3.4"} is instantly queryable. Log aggregation tools understand structured data natively. You can filter by user, group by IP, count login events—all without custom parsing.
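As a concrete sketch in Python, a minimal JSON formatter for the standard logging module might look like the following. The convention of passing structured data via extra={"fields": {...}} is an assumption of this sketch, not a stdlib feature:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login", extra={"fields": {"event": "login", "user": "john", "ip": "1.2.3.4"}})
```

Every line this logger emits is valid JSON, so aggregation tools can filter on event, user, or ip without regex.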

Include context in every log entry. Request ID, user ID, timestamp, log level, and source location should be present. This lets you correlate logs across services and reconstruct user sessions. Context transforms logs from isolated events into connected narratives. When debugging, you follow request_id across services. When investigating user issues, you filter by user_id. Without context, logs are disconnected fragments. With context, they tell stories.
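One way to attach that context automatically, sketched in Python with contextvars and a logging filter (the variable and field names are illustrative):

```python
import contextvars
import logging

# Set once per request at the entry point; visible to all logging in that context.
request_id_var = contextvars.ContextVar("request_id", default="-")

class ContextFilter(logging.Filter):
    """Stamp every record with the current request ID before it is formatted."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("api")
logger.addFilter(ContextFilter())
# The formatter can now reference the injected field in every line.
fmt = logging.Formatter("%(levelname)s %(request_id)s %(message)s")
```

Set request_id_var once when a request arrives and every log line in that request carries the ID, with no changes at individual call sites.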

Use consistent field names. Do not log user_id in one place and userId elsewhere. Inconsistency makes querying difficult. Establish naming conventions and enforce them. Linters can check field names. Code review should catch inconsistencies. Consistency is not pedantic; it is functional. If half your logs use user_id and half use userId, queries return incomplete results, and it looks as if logs are missing when they are merely named inconsistently.

Log levels indicate severity. DEBUG for detailed debugging info, INFO for significant events, WARN for recoverable issues, ERROR for failures. Use levels consistently so filtering works. Production should log INFO and above. Development can log DEBUG. But level semantics must be consistent. If one service uses ERROR for minor issues and another reserves it for critical failures, filtering by ERROR is meaningless. Document what each level means in your organization.

Avoid logging objects directly. Log serializable data. Circular references break JSON serialization. Sensitive objects might leak secrets. Extract relevant fields explicitly. Instead of log.info(user), use log.info({user_id: user.id, user_role: user.role}). This is explicit about what gets logged. It prevents accidental secret logging. It ensures logs serialize correctly. Object structure might change, breaking serialization. Explicit field extraction is stable.
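In Python, explicit extraction might look like this sketch; the User shape and field names are illustrative:

```python
class User:
    """Illustrative domain object; password_hash must never reach logs."""
    def __init__(self, user_id, role, password_hash):
        self.id = user_id
        self.role = role
        self.password_hash = password_hash

def safe_user_fields(user):
    """Extract only the fields we deliberately log; everything else stays out."""
    return {"user_id": user.id, "user_role": user.role}

# logger.info("login", extra={"fields": safe_user_fields(user)})  # not logger.info(user)
```

If User later gains new attributes, they stay out of the logs until someone deliberately adds them to safe_user_fields.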

Add metadata, not narrative. Log actionable data: IDs, status codes, durations. Avoid prose like "Starting to process the user request". Machine-readable fields are more useful. Narrative logs are for humans reading logs line-by-line. Structured logs are for machines querying patterns. "Processing user request" tells you nothing. {event: "request_start", request_id: "abc123", endpoint: "/api/users", method: "GET"} tells you everything. You can count requests per endpoint, measure processing time, correlate with errors.

Log timestamps in ISO 8601 format with timezone. Use UTC for consistency. Timestamps in local time cause confusion in distributed systems. ISO 8601 is machine-parseable and human-readable. Include milliseconds for precision. Nanoseconds if you need high-resolution timing. Consistent timestamp format makes log correlation across services reliable.
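A one-liner for this in Python, using the stdlib datetime module:

```python
from datetime import datetime, timezone

def log_timestamp():
    """UTC timestamp in ISO 8601 with millisecond precision, e.g. 2026-02-15T09:30:00.123+00:00."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")
```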

Avoid log decorators that add noise. Some libraries add ASCII art, colors, or formatting. These are useless in production logs stored as JSON. They bloat log size and make parsing harder. Plain structured JSON is best. Terminal colors are fine for local development but should be disabled in production.

Schema validation for logs might seem excessive but catches bugs. Define log schema with required fields. Validate in tests. This ensures all logs have necessary context. Without validation, developers forget required fields. Logs become incomplete. Validation enforces standards automatically.
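A minimal version of that check, usable as a test helper (the required fields shown are an assumed example schema):

```python
REQUIRED_FIELDS = {"timestamp", "level", "event", "request_id"}

def missing_fields(entry):
    """Return the required fields absent from a log entry; an empty set means valid."""
    return REQUIRED_FIELDS - entry.keys()

# In a test suite: assert not missing_fields(captured_entry)
```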

What to log and what to skip

Log all external interactions. API calls, database queries, file operations, network requests. Include latency and status. This helps diagnose integration issues. External systems fail. Networks have latency. Logging every external call creates visibility. Log before the call (intent) and after (result). Include duration, status code, error messages. This data is invaluable when third-party services misbehave. You can prove the problem is external, not your code.
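A sketch of this intent/result pattern in Python; timed_call and its field names are illustrative, not a library API:

```python
import logging
import time

logger = logging.getLogger("external")

def timed_call(target, func, *args, **kwargs):
    """Log intent before an external call and its result (status, duration) after."""
    logger.info("call_start", extra={"fields": {"target": target}})
    start = time.monotonic()
    status = "error"
    try:
        result = func(*args, **kwargs)
        status = "ok"
        return result
    finally:
        # Runs on both success and failure, so duration and status are always logged.
        duration_ms = round((time.monotonic() - start) * 1000, 1)
        logger.info("call_end", extra={"fields": {
            "target": target, "status": status, "duration_ms": duration_ms,
        }})
```

The wrapper re-raises exceptions untouched, so callers keep their error handling while the call still gets logged.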

Log state transitions. When user status changes from pending to active, log it. State machines should log entry and exit from states. This creates audit trail. State transitions are business-significant events. They have compliance implications. Audit logs often require proof of state changes. Logging transitions provides this automatically. Include old state, new state, reason, and actor. This answers "who changed what, when, and why."
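A transition entry with those four pieces of information might be built like this (field names are an assumed convention):

```python
from datetime import datetime, timezone

def transition_entry(entity_id, old_state, new_state, reason, actor):
    """Structured record answering who changed what, when, and why."""
    return {
        "event": "state_transition",
        "entity_id": entity_id,
        "old_state": old_state,
        "new_state": new_state,
        "reason": reason,
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
    }
```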

Log security events. Failed login attempts, permission denials, suspicious activity. Security logs help detect and investigate attacks. Brute force attacks show up as spikes in failed logins. Privilege escalation attempts appear as permission denials. Anomalous behavior becomes visible in logs. Security information and event management (SIEM) systems ingest these logs for threat detection. Log all security-relevant events, even if they seem innocuous. Attackers probe for weaknesses. Logs reveal patterns.

Do not log sensitive data. Passwords, tokens, credit cards, PII should never appear in logs. Implement automatic redaction. Better yet, do not log fields that might be sensitive. Logs are stored long-term. They are accessed by many people. They might be sent to third-party services. Sensitive data in logs is a security incident waiting to happen. Redact by default. Allowlist safe fields rather than denylisting sensitive ones. This prevents accidental leaks when new fields are added.
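An allowlist redactor is a few lines in Python; the safe fields below are an assumed example set:

```python
SAFE_FIELDS = {"event", "user_id", "ip", "status", "duration_ms"}  # assumed allowlist

def redact(entry):
    """Keep only allowlisted fields; anything new or unknown is dropped by default."""
    return {k: v for k, v in entry.items() if k in SAFE_FIELDS}
```

Because the filter keeps rather than removes, a developer who adds a password field to an event cannot leak it by accident.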

Do not log every function call. This creates noise. Log at service boundaries and decision points. Internal function calls are usually noise unless debugging specific issues. Trace-level logging that logs every function entry/exit is only useful during active debugging. In production, it generates gigabytes of noise with no value. Users do not care that your code called internal helper functions. Log what matters: external interactions, business events, errors.

Do not log in tight loops. Logging thousands of entries per second overwhelms systems. If you must log in loops, sample or aggregate. Processing 1 million records should not generate 1 million log entries. Instead, log every 10,000 records or aggregate statistics. "Processed 1M records, 50 errors, avg time 5ms" is one log entry instead of 1 million. Sampling and aggregation preserve visibility while controlling volume.
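Periodic progress plus a final summary can be sketched like this in Python; process_all and its field names are illustrative:

```python
import logging

logger = logging.getLogger("batch")

def process_all(records, process, log_every=10_000):
    """Process records with periodic progress logs and one summary, not per-record lines."""
    errors = 0
    processed = 0
    for record in records:
        processed += 1
        try:
            process(record)
        except Exception:
            errors += 1
        if processed % log_every == 0:
            logger.info("progress", extra={"fields": {"processed": processed, "errors": errors}})
    logger.info("batch_done", extra={"fields": {"processed": processed, "errors": errors}})
    return errors
```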

Log errors with full context. When catching exceptions, log the error message, stack trace, and relevant context. Errors without context are hard to diagnose. What was the code trying to do when it failed? What inputs did it have? Which user was affected? Stack traces show where the error occurred, but not why. Context explains why. Include request parameters, user identifiers, and state that led to the error.
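A small helper can combine exception details with that context before logging; error_context and its field names are illustrative:

```python
def error_context(exc, **context):
    """Combine exception details with the request context for a structured ERROR log."""
    return {"error": type(exc).__name__, "error_message": str(exc), **context}

# Usage: logger.error("charge failed", exc_info=True,
#                     extra={"fields": error_context(exc, order_id=order_id, user_id=user_id)})
```

exc_info=True attaches the stack trace (the where); the context fields supply the why.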

Rate limit logs for errors that occur repeatedly. If the same error happens 10,000 times, logging each occurrence is wasteful. Log the first few, then rate limit. "Error X occurred 10,000 times in the last minute" is more useful than 10,000 identical log entries. Rate limiting prevents log storms that overwhelm storage and make other logs hard to find.
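A minimal per-error-key limiter, sketched in Python (the class and its defaults are illustrative):

```python
import time
from collections import defaultdict

class ErrorRateLimiter:
    """Allow the first `burst` occurrences of each error key per window, then suppress."""
    def __init__(self, burst=5, window=60.0):
        self.burst = burst
        self.window = window
        self.counts = defaultdict(int)
        self.window_start = defaultdict(float)

    def should_log(self, key, now=None):
        now = time.monotonic() if now is None else now
        if now - self.window_start[key] >= self.window:
            # New window: this is also the place to emit a "suppressed N times" summary.
            self.window_start[key] = now
            self.counts[key] = 0
        self.counts[key] += 1
        return self.counts[key] <= self.burst
```

Call should_log(error_key) before emitting; identical errors past the burst are counted but not written.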

Distinguish between expected and unexpected errors. User entering invalid input is expected. Handle gracefully, log at INFO or WARN level. Uncaught exception is unexpected. Log at ERROR level with full context. Expected errors are business logic. Unexpected errors are bugs. Mixing them obscures signal with noise.

Correlation and tracing

Generate request ID at entry point. Pass it through all layers. Log it in every entry. This lets you find all logs for a specific request. Request IDs are correlation tools. They tie together logs across services, layers, and time. Without request IDs, debugging distributed systems is guessing. With request IDs, you follow the exact path a request took through your system. Generate UUIDs for uniqueness. Pass in headers across service boundaries. Log in every log entry.

Distributed tracing spans multiple services. Use trace IDs and span IDs. Tools like Jaeger or Honeycomb provide visualization. This shows request flow across microservices. Distributed tracing is structured logging on steroids. Each operation gets a span. Spans nest to show call hierarchy. Trace ID connects all spans in a request. Visualizations show timing, dependencies, and bottlenecks. This is essential for debugging microservices. You see exactly where time is spent and which service is slow.

Include user ID in logs when applicable. This helps diagnose user-specific issues. But be careful with privacy regulations—log only when necessary. User-specific debugging requires user IDs. "This user is experiencing errors" requires filtering logs by user ID. But user IDs are PII in many jurisdictions. Balance utility with compliance. Hash user IDs if regulations require anonymization. Document retention for logs containing user IDs.

Session IDs group related requests. A user might make many requests in a session. Session ID connects them. This helps understand user behavior. Sessions span multiple requests. Session ID shows user journey. You can reconstruct what a user did: logged in, browsed products, added to cart, checked out. This narrative helps debug user-reported issues. "I could not check out" becomes "User session X experienced error at checkout step."

Parent-child relationships in spans show call hierarchy. If service A calls service B, logs should reflect this relationship. Distributed tracing tools visualize this automatically. Span relationships create tree structures. Root span is initial request. Child spans are downstream operations. This shows dependencies. If Service B is slow, you see it in the trace. Parent-child relationships enable root cause analysis in distributed systems.

Baggage propagates context across service boundaries. OpenTelemetry baggage carries key-value pairs through entire trace. This lets you tag requests with business context: customer tier, feature flags, experiment variants. Baggage appears in all logs automatically. This enables business-level analysis: "Premium users experience more errors than free users."

Sampling strategies balance cost with visibility. Tracing every request in full is expensive. Sample wisely: trace all errors, sample successful requests. Tail-based sampling defers the keep-or-drop decision until a request completes, so traces that ended in errors are retained even when they would not have been sampled up front. This gives visibility into problems while controlling costs.

Log management and retention

Centralize logs from all services. Use Elasticsearch, Splunk, or CloudWatch. Querying across services is essential for distributed systems. Local log files do not scale. Centralized logging aggregates logs from all sources. You query once, search everywhere. This is mandatory for microservices. Debugging requires seeing the big picture. Centralized logs provide it. Choose tools that handle your scale. Some are better for massive volume. Others excel at ad-hoc queries.

Set retention policies based on value. Hot logs for recent data, warm logs for older data, delete ancient logs. Storage is expensive. Most debugging uses recent logs. Recent logs are queried constantly. Old logs are rarely accessed. Tiered storage optimizes cost. Keep the last 7 days in hot storage for fast queries. Keep 30 days in warm storage for occasional access. Delete logs older than 90 days unless compliance requires longer retention. This balances cost with utility.

Index logs for fast querying. Full-text search on large log volumes is slow. Index key fields. This makes queries fast and keeps costs manageable. Indexing trades storage for speed. Index fields you query frequently: timestamp, request_id, user_id, log_level, service. Full-text indexing is expensive. Selective indexing is practical. Know your query patterns and index accordingly.

Alert on log patterns. Spike in ERROR logs indicates problems. Missing expected logs might mean service crashed. Anomaly detection catches issues proactively. Logs contain early warning signs. Configure alerts: ERROR rate exceeds threshold, specific error message appears, expected heartbeat log missing. Alerts turn logs from passive history into active monitoring. This shifts response from reactive to proactive.

Sampling reduces log volume. Not every request needs full logging. Sample 1% of requests at DEBUG level, log all at ERROR level. This balances visibility with cost. Sampling is necessary at scale. Logging every request in detail is prohibitively expensive. Sample intelligently: always log errors, sample successes. Use consistent hashing on request ID for deterministic sampling. This lets you trace sampled requests completely.
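Deterministic sampling via hashing the request ID can be sketched like this in Python:

```python
import hashlib

def sampled(request_id, rate=0.01):
    """Deterministic sampling: the same request ID always yields the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64
```

Because the decision depends only on the ID, every service that sees the same request makes the same choice, so sampled requests are traced end to end.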

Compression saves storage. Compress old logs. They are rarely accessed. Decompression latency is acceptable for historical analysis. Logs are text, highly compressible. Compress as you archive. This cuts storage costs significantly. Hot logs stay uncompressed for speed. Cold logs compress for cost. The tradeoff is worth it.

Monitoring and alerting on the log infrastructure itself is necessary. Log ingestion failures lose data. Log storage filling up causes outages. Alert on ingestion lag, storage utilization, and query performance. Monitoring the logging pipeline may seem recursive, but it is essential: if logging breaks silently, you lose visibility exactly when you need it most.

Log rotation and archival strategies prevent disk exhaustion. Rotate logs daily or when size thresholds are exceeded. Archive old logs to cheaper storage. Establish clear retention policies and automate enforcement. Manual log cleanup is unreliable and leads to disk-full emergencies.
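Python's stdlib automates size-based rotation; the limits below are illustrative, and a temporary file stands in for a real log path:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Rotate when the file reaches ~10 MB, keep 5 archives; path and limits are illustrative.
fd, logfile = tempfile.mkstemp(suffix=".log")
os.close(fd)
handler = RotatingFileHandler(logfile, maxBytes=10 * 1024 * 1024, backupCount=5)

logger = logging.getLogger("app")
logger.addHandler(handler)
```

TimedRotatingFileHandler from the same module rotates on a schedule (e.g. daily at midnight) instead of by size.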

logging
debugging
observability
production

Read more articles on the FlexKit blog