Production logging: from noise to signal

Most applications log prolifically and explain nothing. Gigabytes of output fill storage, drive six-figure observability bills, and when production breaks at three in the morning, the only entries that matter are the ones nobody thought to write.

Consider a production incident at three in the morning. An authentication service starts returning 500 errors to one in ten login requests. The logs contain hundreds of megabytes generated in the preceding hour, and somewhere in that volume is the explanation for why a service that has been stable for months is suddenly failing.

The entries read: INFO: Processing request. INFO: Request completed. DEBUG: Entering authHandler. DEBUG: Exiting authHandler. Thousands of lines, each faithfully recording that something happened without recording what. The logs know the service exists. They have no idea what it is doing.

This is not a contrived scenario. It is the default state of most applications in production. The logs were written during development to confirm code paths executed correctly, shipped unchanged, and are now useless for the only purpose they will ever serve: explaining a failure.

The gap between logging that confirms code runs and logging that explains why it broke is the difference between instrumentation and noise. Most applications produce the latter — prolifically, expensively, and to no diagnostic effect.

The irony of most logging strategies is that organisations spend enormous sums storing log data whilst extracting almost no diagnostic value from it. Honeycomb's 2025 summary of Gartner research notes that 36% of Gartner clients spend more than one million dollars annually on observability tooling, and that more than half of that spend goes to logs alone [1]. The data is being collected. It is rarely made useful.

The spending curve does not flatten just because a company has already invested heavily. Honeycomb noted in early 2026 that the same Gartner customer example had climbed from roughly $50,000 a year in 2009 to about $24 million, still growing 48% year over year [2]. That is not an observability strategy. That is a storage tax on unstructured noise.

The problem is not volume. Modern systems generate enormous quantities of log data by necessity. Meta's Scribe infrastructure processes over 2.5 terabytes of log data per second — an input rate that exceeds the output of CERN's Large Hadron Collider [3]. At that scale, the difference between useful logs and noise is the difference between operational clarity and an expensive storage bill that nobody can query.

Google's Site Reliability Engineering handbook captures the discipline concisely: logs are recorded 'for diagnostic or forensic purposes, with the expectation that no one reads logs unless something else prompts them to do so' [4]. That framing is crucial. If nobody reads logs until something breaks, then every log line must be written for the person reading it at three in the morning during an incident — not for the developer who wrote it during a quiet Tuesday afternoon.

Yet most logging decisions are made during development, when everything works. The developer adds console.log('payment processed') to confirm a code path runs correctly, then moves on. That line ships to production, multiplies by a thousand requests per minute, persists in storage for ninety days, and costs money every second it sits there. It answered a question during development — 'does this code path execute?' — that nobody will ever ask again. Meanwhile, the question that will actually be asked during the next incident — 'why did this specific payment for this specific user fail at this specific time?' — has no log line at all.

What deserves a log line

The most common logging mistake is not insufficient logging. It is indiscriminate logging. Teams instrument their applications with verbose output on the theory that more data is better — that if something goes wrong, the answer will be somewhere in the output. This produces logs that record everything and explain nothing.

Effective logging starts with a question: what will the person debugging this at 3am need to know? The answer, consistently, falls into four categories.

State transitions are the most valuable log events. When a user's account status changes from active to suspended, when a payment moves from pending to completed, when a feature flag toggles — these are the moments that change system behaviour and the first things an investigator will look for. A log entry that records User account 4821 transitioned from ACTIVE to SUSPENDED, reason: PAYMENT_FAILED, triggered by: billing-service tells you exactly what happened, why, and what caused it. An entry that records INFO: Updated user tells you nothing.
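
A state-transition event like the one above can be sketched as a small structured record. This is a minimal illustration, not a prescribed schema — the function name and field names are hypothetical, chosen to mirror the example in the text:

```python
import json
from typing import Any, Dict

def state_transition_event(user_id: int, from_state: str, to_state: str,
                           reason: str, triggered_by: str) -> Dict[str, Any]:
    """Build a structured state-transition event: what changed, why, and who caused it."""
    return {
        "event": "account_state_changed",
        "user_id": user_id,
        "from_state": from_state,
        "to_state": to_state,
        "reason": reason,
        "triggered_by": triggered_by,
    }

# Emit as one JSON line; any log pipeline can then filter on these fields.
line = json.dumps(state_transition_event(
    4821, "ACTIVE", "SUSPENDED", "PAYMENT_FAILED", "billing-service"))
```

Every answer an investigator will ask — what changed, from what, to what, why, and at whose request — is a named field rather than a sentence to be parsed.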

Decision points are the second category. Every time your application takes one path instead of another based on runtime data — rate limiting a request, selecting a feature flag variant, falling back to a secondary provider — that decision should be logged with the inputs that drove it. When a request is rejected by a rate limiter, the log should record the key, the current count, the limit, and the window. When a feature flag directs traffic, the log should record which variant and why. These are the entries that explain behaviour that looks wrong but is actually correct, or behaviour that looks correct but is actually wrong.
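
The rate-limiter case can be made concrete with a sketch that logs the decision alongside the inputs that drove it. The function and field names here are illustrative assumptions, not a real rate-limiter API:

```python
from typing import List, Dict, Any

def check_rate_limit(key: str, current_count: int, limit: int,
                     window_s: int, log: List[Dict[str, Any]]) -> bool:
    """Decide whether to admit a request, recording the inputs behind the decision."""
    allowed = current_count < limit
    log.append({
        "event": "rate_limit_decision",
        "key": key,
        "current_count": current_count,
        "limit": limit,
        "window_s": window_s,
        "allowed": allowed,
    })
    return allowed

events: List[Dict[str, Any]] = []
# Over the limit: the request is rejected, and the log entry says exactly why.
check_rate_limit("user:4821", 101, 100, 60, events)
```

When this rejection surfaces in an incident, the entry answers the follow-up questions — which key, how far over, which window — without anyone reproducing the state.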

Boundary crossings — calls to external APIs, database queries, message queue publications — constitute the third category. These are the points where your application's deterministic control ends and the unpredictable world begins. Network latency, timeouts, unexpected response formats, authentication failures — production incident postmortems consistently trace root causes back to boundary crossings where something unexpected happened. Logging the request, the response status, the latency, and any deviation from the expected response format transforms these from black boxes into transparent interactions. The pattern is consistent: log what you sent, log what you got back, log how long it took. When the external payment API returns a 200 status but changes a field name from transaction_id to txn_id, the developer who logged only the status code will spend forty minutes finding the problem. The developer who logged the response structure will find it in seconds.
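
The send/receive/latency pattern can be sketched as a thin wrapper around any outbound call. This is an assumed shape — the wrapper name, the fake gateway, and the field names are invented for illustration — but it shows how logging the response structure, not just the status, catches silent contract changes:

```python
import time
from typing import Any, Callable, Dict, List

def call_with_logging(name: str, request: Dict[str, Any],
                      send: Callable[[Dict[str, Any]], Dict[str, Any]],
                      log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap an outbound call: record what was sent, what came back, and how long it took."""
    start = time.monotonic()
    try:
        response = send(request)
        log.append({
            "event": "outbound_call",
            "target": name,
            "request": request,
            "status": response.get("status"),
            "response_keys": sorted(response.keys()),  # catches silent field renames
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
        })
        return response
    except Exception as exc:
        log.append({"event": "outbound_call_failed", "target": name,
                    "request": request, "error": type(exc).__name__,
                    "latency_ms": round((time.monotonic() - start) * 1000, 1)})
        raise

# A fake gateway that returns 200 but renames transaction_id to txn_id —
# the renamed field is visible immediately in response_keys.
fake_gateway = lambda req: {"status": 200, "txn_id": "t-991"}
events: List[Dict[str, Any]] = []
call_with_logging("payment-api", {"amount_cents": 12950}, fake_gateway, events)
```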

The fourth category is errors and exceptions — but with context. A stack trace without context is a puzzle with missing pieces. The bare exception NullPointerException at PaymentService.java:247 tells you where the code failed. It does not tell you which request triggered it, what data was being processed, which upstream service provided that data, or whether this is the first occurrence or the thousandth. An error log that includes the request ID, the user context, the operation being attempted, and the state of the relevant data transforms that puzzle into a diagnosis. The stack trace tells you what line threw the error. The context tells you why.
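
Attaching context to an exception can be as simple as building the error event from both the exception and the request state. A minimal sketch, with hypothetical field names:

```python
import traceback
from typing import Any, Dict

def error_event(exc: Exception, request_id: str, user_id: int,
                operation: str) -> Dict[str, Any]:
    """Attach request and user context to the exception, not just the stack trace."""
    return {
        "event": "unhandled_error",
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "request_id": request_id,
        "user_id": user_id,
        "operation": operation,
        "exception": traceback.format_exception_only(type(exc), exc)[-1].strip(),
    }

try:
    charge = None
    charge["amount"]  # simulate the failure described in the text
except TypeError as exc:
    evt = error_event(exc, request_id="req-7f3a2b", user_id=4821,
                      operation="capture_payment")
```

The stack trace still says which line failed; the surrounding fields say which request, which user, and which operation — the difference between a puzzle and a diagnosis.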

What does not deserve a log line is equally important. Happy-path confirmation — INFO: Processing request, INFO: Request completed successfully — consumes storage and creates noise without providing any diagnostic value during an incident. If you need to know that requests are completing successfully, that information belongs in metrics — a counter, a histogram — not in logs. Logging the successful cases creates a haystack. Logging the interesting cases creates a diagnostic record.

A common anti-pattern illustrates this: logging every database query at INFO level. In a service handling several hundred requests per second, each request touching the database three to five times, this produces tens of thousands of log entries per minute. The storage cost is substantial. The diagnostic value is zero — because when the service actually has a database problem, the relevant error is buried under millions of entries recording queries that worked perfectly. Reducing the logging to database errors, slow queries exceeding a configurable threshold, and connection pool events — acquisitions, exhaustion, timeouts — can drop log volume by an order of magnitude whilst making every remaining entry worth investigating.
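
The threshold-based approach above can be sketched in a few lines. The function name and thresholds are assumptions for illustration; the point is the shape of the filter — errors and slow queries get entries, fast successes get counted in metrics instead:

```python
from typing import Any, Dict, List, Optional

def maybe_log_query(duration_ms: float, sql: str, threshold_ms: float,
                    log: List[Dict[str, Any]], error: Optional[str] = None) -> None:
    """Log only errors and slow queries; fast, successful queries stay out of the logs."""
    if error is not None:
        log.append({"event": "db_query_failed", "sql": sql,
                    "error": error, "duration_ms": duration_ms})
    elif duration_ms >= threshold_ms:
        log.append({"event": "db_query_slow", "sql": sql,
                    "duration_ms": duration_ms, "threshold_ms": threshold_ms})
    # else: success under threshold -> increment a metric counter, not a log line

events: List[Dict[str, Any]] = []
maybe_log_query(3.2, "SELECT 1", threshold_ms=200, log=events)       # fast: no entry
maybe_log_query(850.0, "SELECT ...", threshold_ms=200, log=events)   # slow: logged
```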

Structure, correlation, and the end of grep

The difference between logs you can query and logs you can only grep determines how quickly you diagnose production issues. Unstructured log lines — free-text strings assembled with concatenation or template literals — are human-readable in isolation but machine-hostile at scale.

Consider this entry:

[2026-03-15 02:47:12] ERROR PaymentService - Failed to process payment for user 4821, amount $129.50, gateway timeout after 30s

Perfectly readable. Nearly impossible to aggregate, filter, or correlate programmatically. Extracting the user ID requires a regular expression. Extracting the amount requires a different regular expression. Filtering by error type requires string matching against an uncontrolled vocabulary that changes every time a developer writes a new error message. When this line appears alongside ten thousand others in a distributed system, finding patterns requires heroic grep-fu or luck.

The problem compounds over time. Six months after writing a log line, nobody remembers the exact string format. The developer who wrote Failed to process payment has left the team. The new developer searches for payment error and finds nothing. Different services format the same concepts differently — one logs amounts in dollars, another in cents, a third as strings with currency symbols. Timestamps appear in three formats across four services. The unstructured log becomes an archaeological artefact whose format must be reverse-engineered before it can be understood.

Structured logging replaces this with machine-parseable key-value data:

{
  "timestamp": "2026-03-15T02:47:12.847Z",
  "level": "error",
  "service": "payment-service",
  "event": "payment_failed",
  "user_id": 4821,
  "amount_cents": 12950,
  "currency": "USD",
  "gateway": "stripe",
  "error": "gateway_timeout",
  "timeout_ms": 30000,
  "request_id": "req-7f3a2b",
  "trace_id": "abc-123-def-456"
}

Every field is queryable. You can aggregate all payment failures by gateway in a single query. You can calculate timeout rates over time with a group-by on timestamp buckets. You can filter by user, by amount threshold, by error type — all without writing a single regular expression. You can build dashboards that update in real time, alerting rules that fire on specific event types, and trend analyses that detect anomalies before they become incidents. The same information exists in both formats, but the structured version transforms logs from a text file you can only read into a data store you can interrogate.
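
Lines like the one above can be produced with nothing more than the standard library. This is a minimal sketch using Python's logging module — production services would more likely reach for structlog or python-json-logger, and the `fields` convention here is an assumption of this example, not a stdlib feature:

```python
import json
import logging
from datetime import datetime, timezone

class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object so every field is queryable downstream."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured fields ride in via the `extra` argument on each log call.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonLineFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment_failed", extra={"fields": {
    "user_id": 4821, "amount_cents": 12950, "gateway": "stripe",
    "error": "gateway_timeout", "request_id": "req-7f3a2b",
}})
```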

The trace_id and request_id fields solve the distributed correlation problem. In a monolith, a stack trace tells you what happened. In a distributed system, a single user action might touch five services, each producing its own logs. Without a shared identifier propagating across service boundaries, correlating those logs requires matching timestamps and hoping for the best — a process that becomes unreliable under load and impossible when services process events asynchronously. Microsoft's engineering playbook treats correlation IDs as standard practice: identifiers created at system entry and propagated through service boundaries, typically in HTTP headers [5]. The implementation is straightforward — middleware generates or extracts the ID at the edge, passes it through the request context, and every log line includes it automatically. A single query against the correlation ID returns every log entry from every service involved in a specific request.
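
The extract-or-mint-then-propagate pattern can be sketched with Python's contextvars, which carry per-request state without threading it through every call. The header name, function names, and event names are assumptions for illustration:

```python
import contextvars
import uuid
from typing import Any, Dict, List, Optional

# The current request's correlation ID, visible to every log call in the same context.
request_id_var: contextvars.ContextVar[Optional[str]] = \
    contextvars.ContextVar("request_id", default=None)

def log_event(log: List[Dict[str, Any]], event: str) -> None:
    """Every entry picks up the correlation ID automatically from the context."""
    log.append({"event": event, "request_id": request_id_var.get()})

def handle_request(headers: Dict[str, str], log: List[Dict[str, Any]]) -> str:
    """At the edge: extract the inbound correlation ID, or mint one if absent."""
    rid = headers.get("X-Request-ID") or f"req-{uuid.uuid4().hex[:8]}"
    request_id_var.set(rid)
    log_event(log, "request_started")
    log_event(log, "payment_charged")   # deeper code logs without passing the ID around
    return rid

events: List[Dict[str, Any]] = []
rid = handle_request({"X-Request-ID": "req-7f3a2b"}, events)
```

A query for `request_id == "req-7f3a2b"` now returns every entry this request produced, in every function that logged, without any of them handling the ID explicitly.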

Stripe's engineering team formalised a pattern they call 'canonical log lines' — each API request emits one information-dense structured log line at the end of processing, containing the request's key characteristics [6]. Rather than scattering twenty log lines across a request lifecycle, one canonical line captures the complete picture. These lines are codified via protocol buffers, sent asynchronously to Kafka, accumulated in S3, and ingested into Presto and Redshift for long-term analytics. The same data powers Stripe's Developer Dashboard, providing user-facing API performance metrics without separate data pipelines [6]. One well-structured log line, covering diagnostic and analytical use cases that many teams otherwise split across multiple pipelines.

OpenTelemetry standardises context propagation through inject/extract propagators and W3C Trace Context headers, giving services shared trace and span identifiers across request boundaries [7]. Adoption has been substantial: Grafana Labs' 2024 observability survey found that 85% of respondents were investing in OpenTelemetry, and more than half had increased their usage over the previous year [7]. The days of correlating distributed logs by eyeballing timestamps and hoping the clocks were synchronised are ending — but only for teams that actually instrument their services with propagated context rather than treating each service's logs as an isolated text file.

Log levels as engineering contracts

RFC 5424, the Syslog standard, defines eight severity levels from Emergency through Debug [8]. Most modern logging frameworks derive their hierarchy from this standard, typically exposing a subset: DEBUG, INFO, WARN, ERROR, and FATAL. The problem is not the hierarchy. It is the absence of team-level agreement about what each level means.

Without explicit conventions, log levels become subjective. One developer's INFO is another's DEBUG. Error handling that should produce WARN entries produces ERROR entries, triggering unnecessary alerts. Debug logging left at INFO level in production creates the noise problem that drowns out genuine signals. The 3am engineer filtering for ERROR to find the actual problem instead finds two hundred entries about transient network retries that succeeded on the second attempt.

Log levels are engineering contracts, not personal preferences. ERROR means something failed and requires human investigation. WARN means something unexpected happened but the system handled it — a retry succeeded, a fallback activated, a threshold was approached. INFO means a significant business event occurred — a user registered, a payment processed, a deployment completed. DEBUG means information useful during development that should never be enabled in production without explicit, temporary configuration.

The distinction matters because log levels drive operational behaviour. ERROR entries trigger alerts that wake people up. WARN entries surface in dashboards during business hours. INFO entries populate audit trails and feed analytics. DEBUG entries, when accidentally enabled at production verbosity, generate the storage costs that drive observability bills into six figures. Every misclassified entry either triggers a false alarm or hides a real problem — both of which erode the trust that makes logs useful.

The consequences of treating levels as suggestions rather than contracts are predictable. When a service logs every successful transaction at ERROR because someone decided payments are important enough to always be visible, the alerting channel fills with hundreds of daily notifications for perfectly healthy operations. The on-call engineer stops reading them. When a genuine payment failure occurs — the kind that actually warrants an ERROR — it drowns in the noise. This is the boy-who-cried-wolf failure mode, and it is endemic in systems where log levels reflect developer anxiety rather than operational severity.

The security surface

Logging discipline is not only a diagnostic problem. It is a security problem — in both directions. Insufficient logging lets breaches go undetected for years. Excessive logging of the wrong data creates the breach.

In May 2018, Twitter disclosed that a bug had been writing user passwords in plaintext to an internal log before hashing completed, and asked users to change their password on Twitter and any other service where it had been reused [9]. Days earlier, GitHub had disclosed a similar bug in its password reset flow that recorded a small number of plaintext passwords in internal logs [9].

The following year, reporting on Facebook's internal investigation indicated that between 200 million and 600 million user passwords may have been stored in plaintext in internal systems, some dating back to 2012, with more than 20,000 employees potentially able to search them and roughly nine million internal queries made by around 2,000 engineers or developers [10]. In September 2024, Ireland's Data Protection Commission fined Meta 91 million euros over the incident [10].

These were not exotic attacks exploiting sophisticated vulnerabilities. They were logging mistakes — sensitive data flowing into log output because nobody had established rules about what must never be logged. The OWASP Logging Cheat Sheet specifies the exclusion list explicitly: session identifiers should be hashed rather than logged directly, and access tokens, passwords, database connection strings, encryption keys, payment card data, and bank account numbers should be removed, masked, sanitised, hashed, or encrypted before they reach logs [11]. Under EU case law, IP addresses can qualify as personal data, meaning their presence in logs can trigger the same processing and retention obligations as other personal data [11].
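
Exclusion rules work best when enforced in code, at the last point before an event reaches any sink. A minimal sketch of a recursive redaction pass — the deny-list here is a hypothetical starting point, and a real one should be derived from the OWASP exclusion list and reviewed by security:

```python
from typing import Any, Dict

# Hypothetical deny-list for illustration; derive the real one from OWASP guidance.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number",
                  "session_id", "secret", "connection_string"}

def redact(event: Dict[str, Any]) -> Dict[str, Any]:
    """Mask sensitive values (including in nested objects) before the event is logged."""
    clean: Dict[str, Any] = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean

evt = redact({"event": "login_attempt", "user_id": 4821,
              "password": "hunter2", "details": {"token": "eyJabc"}})
```

Placed in the logging pipeline itself — a formatter or filter that every logger shares — this makes 'passwords never reach the logs' a property of the system rather than a hope about every call site.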

OWASP's Top 10 ranks 'Security Logging and Monitoring Failures' at A09 — a position that moved up from tenth place in the 2017 edition [11]. The standard cites real-world examples including a breach that exposed 3.5 million children's health records and went undetected for over seven years, and a data compromise spanning a decade of passenger records. In both cases, the core failure was not the initial breach — it was the absence of logging that would have detected the intrusion when it happened rather than years later.

The retention question amplifies both problems. Keep logs too long and you accumulate PII exposure risk — every additional month of storage is another month where a breach could expose data that should have been deleted. Keep logs too short and you cannot investigate incidents that are discovered late. GDPR does not specify a fixed retention period, but it does require controllers to keep personal data no longer than necessary for the purpose for which it is processed [11]. If your logs contain nothing sensitive because you designed them that way, longer retention is a diagnostic advantage. If your logs accidentally contain passwords, tokens, or personally identifiable information because nobody established exclusion rules, every day of retention is liability compounding in a database nobody audits.

Log4Shell, disclosed in December 2021, demonstrated that logging infrastructure itself can become the attack surface. The vulnerability in Apache Log4j — a ubiquitous Java logging library — received a CVSS 3.1 base score of 10.0 in NVD, while Wiz and EY estimated that 93% of cloud environments were at risk at the time of disclosure [12]. The US Department of Homeland Security's Cyber Safety Review Board concluded that vulnerable instances of Log4j will remain in systems for perhaps a decade or longer [12]. The infrastructure meant to help you understand your system became the mechanism through which attackers compromised it. Logging libraries are not inert utilities. They parse input, process data, and interact with systems — and they deserve the same security scrutiny as any other dependency in the stack.

New Relic's 2024 Observability Forecast, surveying 1,700 technology professionals across sixteen countries, found that organisations with full-stack observability — unified telemetry including structured logs, metrics, and traces — spent 76% less time resolving outages than those without [13]. Not incrementally less. Seventy-six percent less. The difference between an hour-long incident and a fifteen-minute resolution. Between a weekend spent debugging and a weekend spent at home.

The gap is not tooling. Every major cloud provider offers log aggregation. Every language has structured logging libraries — Pino, Winston, and Bunyan for Node.js, slog for Go, Serilog for .NET, the logging module in Python. OpenTelemetry provides vendor-neutral instrumentation. The tooling has been a solved problem for years. The gap is discipline — the organisational decision to treat logging as an engineering practice rather than an afterthought that nobody owns and everybody inherits.

The discipline starts with four questions asked before code is written: what will the person debugging this need to know? What must never appear in a log line? What structure will make this queryable at three in the morning? And what correlation will connect this entry to every other entry in the same request chain?

These are not difficult questions. They are merely unpopular ones, because answering them requires thinking about failure before failure arrives — and most teams would rather ship the feature and deal with logging later. Later tends to arrive at 3am, when the person who wrote the code is asleep and the person debugging it is staring at INFO: Processing request wondering what, exactly, was being processed, and why it stopped.

The fix is not more logging. It is better logging. Fewer lines, more context. Structured data instead of concatenated strings. Correlation IDs instead of timestamp matching. Log levels that mean something instead of everything being marked important. Sensitive data excluded by policy, not by luck. The investment is measured in hours during development. The return is measured in minutes during incidents — and in breaches that get detected in days rather than years.

Your application will fail in production. When it does, every log line becomes a witness statement. The question is whether your witnesses observed anything worth reporting, or whether they spent the entire incident noting that, yes, the room existed.

Choose wisely. Your next 3am incident depends on it.


Footnotes

  1. Honeycomb. (2025). "Observability Costs: How Much Should I Spend On Observability?" Honeycomb blog. https://www.honeycomb.io/blog/how-much-should-i-spend-on-observability-pt1

  2. Honeycomb. (2026). "Why Does Observability Feel so Expensive? (Because it Is)." Honeycomb blog. https://www.honeycomb.io/blog/why-does-observability-feel-so-expensive-because-it-is

  3. Meta. (2019). "Scribe: Transporting petabytes per hour via a distributed, buffered queueing system." Engineering at Meta. https://engineering.fb.com/2019/10/07/data-infrastructure/scribe/

  4. Google SRE. (2017). "Introduction." Site Reliability Engineering. https://sre.google/sre-book/introduction/

  5. Microsoft. "Observability in Microservices." Engineering Fundamentals Playbook. https://microsoft.github.io/code-with-engineering-playbook/observability/microservices/

  6. Leach, B. (2019). "Fast and flexible observability with canonical log lines." Stripe Engineering Blog. https://stripe.com/blog/canonical-log-lines

  7. OpenTelemetry. "Propagators API." https://opentelemetry.io/docs/specs/otel/context/api-propagators/ See also: Grafana Labs. (2024). "Observability Survey Report 2024 - key findings." https://grafana.com/observability-survey/2024/

  8. Gerhards, R. (2009). "The Syslog Protocol." IETF RFC 5424. https://www.rfc-editor.org/rfc/rfc5424

  9. Twitter. (2018). "Keeping your account secure." https://blog.x.com/en_us/topics/company/2018/keeping-your-account-secure See also: BleepingComputer. (2018). "GitHub Accidentally Recorded Some Plaintext Passwords in Its Internal Logs." https://www.bleepingcomputer.com/news/security/github-accidentally-recorded-some-plaintext-passwords-in-its-internal-logs/

  10. Krebs, B. (2019). "Facebook Stored Hundreds of Millions of User Passwords in Plain Text for Years." Krebs on Security. https://krebsonsecurity.com/2019/03/facebook-stored-hundreds-of-millions-of-user-passwords-in-plain-text-for-years/ See also: Data Protection Commission. (2024). "Summary of Decision of the Data Protection Commission made pursuant to Section 111 of the Data Protection Act 2018." https://www.dataprotection.ie/sites/default/files/uploads/2024-12/Meta-Decision-Summary-IN-19-4-1-EN.pdf

  11. OWASP. (2021). "A09:2021 – Security Logging and Monitoring Failures." https://owasp.org/Top10/2021/A09_2021-Security_Logging_and_Monitoring_Failures/ See also: OWASP Cheat Sheet Series. "Logging Cheat Sheet." https://cheatsheetseries.owasp.org/cheatsheets/Logging_Cheat_Sheet.html and Court of Justice of the European Union. (2016). "Patrick Breyer v Bundesrepublik Deutschland (C-582/14)." https://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX%3A62014CJ0582

  12. NVD. "CVE-2021-44228." https://nvd.nist.gov/vuln/detail/CVE-2021-44228 See also: Luttwak, A., & Schindel, A. (2021). "Log4Shell 10 days later: Enterprises halfway through patching." Wiz Blog. https://www.wiz.io/blog/10-days-later-enterprises-halfway-through-patching-log4shell and Cyber Safety Review Board. (2022). "Review of the December 2021 Log4j Event." https://www.nextgov.com/media/csrb_report_on_log4j_-_july_11_2022_508_compliant.pdf

  13. New Relic. (2024). "2024 Observability Forecast Report." https://newrelic.com/sites/default/files/2024-10/new-relic-2024-observability-forecast-report.pdf

TL;DR

Gartner-linked observability cost data shows 36% of Gartner clients spend over one million dollars annually on observability, with more than half of that going to logs. The problem is not volume but discipline: logging the wrong things (happy-path confirmations, unstructured text, accidental PII) whilst missing the entries that matter (state transitions, decision points, boundary crossings with context). Structured logging with correlation IDs, exemplified by Stripe's canonical log lines pattern, turns logs from text files into queryable diagnostic records. The discipline is straightforward: decide what the 3am debugger needs before writing the code, not after the incident.
