Pipeline monitoring covers the metrics that predict failures before they breach SLAs, not just the metrics that confirm they already have.

Consumer lag monitoring: lag is tracked per consumer group and per partition, published to Prometheus, and visualised in Grafana. Lag alert thresholds are derived from the production rate rather than set to arbitrary fixed numbers: at 50,000 events/minute with a 5-minute lag tolerance, the alert fires at 250,000 events of lag. Alerts route to PagerDuty for pipeline-down incidents and to Slack for non-critical degradation.

Dead letter queue (DLQ): events that still fail processing after the configured retry count (typically 3, with exponential backoff) are written to a dedicated DLQ topic (e.g., payments.transaction.created.dlq) together with the original event, the failure reason, the retry count, and the timestamp. The DLQ is monitored separately, and a non-empty DLQ always pages the on-call engineer. DLQ events are reprocessable: after a bug fix, they are replayed from the DLQ through the corrected processor.

Backpressure handling: when a downstream system (e.g., a PostgreSQL destination) slows down under write pressure, the consumer pauses partition polling to prevent unbounded memory growth in the consumer process. Lag then accumulates on the topic, where Kafka's durable retention holds it, rather than in memory.

End-to-end latency measurement: synthetic probe events injected at configurable intervals measure the full pipeline latency from producer publish to downstream destination write. An alert fires when end-to-end latency exceeds the SLA (e.g., 30 seconds for an operational dashboard, 5 seconds for a fraud detection pipeline).

Operations runbook: delivered with the project, covering consumer lag investigation, consumer group reset, the DLQ reprocessing procedure, and broker failure recovery.
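
As a rough illustration of two of the mechanisms described above, the Python sketch below derives a lag alert threshold from the production rate and wraps an event handler in retry-with-exponential-backoff logic that yields a DLQ record once retries are exhausted. The names lag_alert_threshold, DlqRecord, and process_with_retries, the 0.5-second base delay, and the reading of "retry count 3" as one initial attempt plus three retries are all assumptions made for illustration, not the project's actual implementation. Publishing the record to the DLQ topic is left to the caller.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


def lag_alert_threshold(events_per_minute: int, tolerance_minutes: int) -> int:
    """Lag, in events, that corresponds to the tolerated delay at this production rate."""
    return events_per_minute * tolerance_minutes


@dataclass
class DlqRecord:
    """Hypothetical DLQ envelope carrying the four fields named in the text."""
    original_event: dict
    failure_reason: str
    retry_count: int
    failed_at: str  # ISO 8601 timestamp in UTC


def process_with_retries(
    event: dict,
    handler: Callable[[dict], None],
    max_retries: int = 3,          # assumed: retries on top of the initial attempt
    base_delay_s: float = 0.5,     # assumed base delay; doubles after each failure
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[DlqRecord]:
    """Return None on success, or a DlqRecord once all retries are exhausted."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            handler(event)
            return None
        except Exception as exc:  # broad on purpose: any failure counts toward retries
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s with the defaults
    return DlqRecord(
        original_event=event,
        failure_reason=repr(last_error),
        retry_count=max_retries,
        failed_at=datetime.now(timezone.utc).isoformat(),
    )
```

With the figures from the text, lag_alert_threshold(50_000, 5) gives 250,000 events. Passing sleep as a parameter is a design choice that keeps the backoff testable without real waiting; a caller would publish any returned DlqRecord to the DLQ topic and alert on it.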