• How long does it take your team to find the root cause when a report shows a number that doesn't look right?

  • Has a pipeline ever silently loaded wrong data -- no error, no alert -- and the problem was discovered weeks later in a board meeting?

Bad data in a dashboard doesn't just produce wrong numbers -- it produces wrong decisions that nobody traces back to the data.

Data quality problems compound silently. A source system changes a field definition, a pipeline drops null records, an ETL job fails halfway through and loads partial data. The report looks complete. The numbers are wrong. Decisions get made on bad information and the root cause stays buried until something important breaks.

We build data quality monitoring infrastructure that validates data as it moves through pipelines, alerts on anomalies before they reach reports, and gives the data team a clear view of what is trustworthy and what isn't: schema validation, row count monitoring, statistical anomaly detection, and data lineage for the data layer that supports business decisions.

  • dbt tests for null values, referential integrity, uniqueness, and custom business logic -- run on every pipeline execution

  • Row count and statistical anomaly detection that alerts when a pipeline delivers significantly more or fewer records than expected

  • Schema change detection that catches source system changes before bad data reaches the warehouse

  • Data lineage tracking from source to report so root cause investigation takes minutes, not days

RaftLabs builds data quality monitoring infrastructure for data engineering pipelines -- dbt-based testing, row count and statistical anomaly detection, schema change monitoring, freshness SLAs, and data lineage tracking. Most data quality projects deliver in 6 to 10 weeks at a fixed cost.

Vodafone
Aldi
Nike
Microsoft
Heineken
Cisco
Calorgas
Energia Rewards
GE
Bank of America
T-Mobile
Valero
Techstars
East Ventures

Data quality problems are expensive precisely because they are invisible until something important breaks. A pipeline that silently drops records when a source field is null delivers a report that looks complete -- the row count is close enough, the numbers are plausible, and nobody questions them until a decision made on that data produces a bad outcome. By then the root cause is buried in pipeline logs from three weeks ago, and finding it takes days.

The fix is data quality infrastructure that validates data as it moves through the pipeline -- not after the fact, and not only when someone notices a number looks wrong. Schema validation before the load, row count checks after the load, statistical anomaly detection on the delivered data, and freshness SLAs that alert before a stale table reaches a report. We build that infrastructure as a defined engagement scoped to your existing pipeline and warehouse setup.

What we build

dbt data testing

dbt schema tests for null values, uniqueness, accepted values, and referential integrity -- standard tests that cover the most common data quality failure modes. Custom dbt tests for business logic assertions specific to your domain: revenue cannot be negative, customer IDs must exist in the customer table, order dates cannot be in the future. Tests run automatically on every pipeline execution. When a test fails, the pipeline produces an error with the specific failing records, and the load is blocked until the issue is resolved -- bad data doesn't reach the mart layer.
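
For illustration, a custom assertion in dbt is just a SQL file in the tests/ directory that selects the rows violating the rule -- dbt fails the test if the query returns anything. A minimal sketch, assuming hypothetical orders and customers models:

    -- tests/assert_revenue_not_negative.sql
    -- dbt runs this on every execution; any returned row is a failure, and
    -- the rows themselves are the failing records the alert reports.
    select order_id, revenue
    from {{ ref('orders') }}
    where revenue < 0

    -- tests/assert_every_order_has_a_customer.sql
    -- Referential integrity as a singular test: orders pointing at customer
    -- IDs that don't exist in the customers model.
    select o.order_id, o.customer_id
    from {{ ref('orders') }} o
    left join {{ ref('customers') }} c on c.customer_id = o.customer_id
    where c.customer_id is null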

Row count and freshness monitoring

Expected row count ranges configured per table and per pipeline run based on historical delivery patterns. Alert when a pipeline delivers a record count that falls outside the expected range by more than a configured threshold -- catching partial loads, extraction failures, and filter bugs that don't produce a pipeline error. Freshness SLA monitoring that compares the last update timestamp of each table against the expected update frequency, alerting before a stale table is queried by analysts or dashboards that depend on it.
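
A sketch of the row count check, assuming a hypothetical load_audit table (table_name, loaded_at, row_count) that the pipeline writes after each run, and Postgres-style date arithmetic:

    -- Flag loads whose row count deviates more than 30% from the trailing
    -- 28-day average for the same table.
    with history as (
        select table_name, avg(row_count) as avg_rows
        from load_audit
        where loaded_at >= current_date - interval '28 days'
          and loaded_at <  current_date
        group by table_name
    ),
    latest as (
        select table_name, row_count
        from load_audit
        where loaded_at >= current_date
    )
    select l.table_name, l.row_count, h.avg_rows as expected_rows
    from latest l
    join history h using (table_name)
    where abs(l.row_count - h.avg_rows) > 0.30 * h.avg_rows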

Statistical anomaly detection

Baseline statistical profile for each monitored metric: mean, standard deviation, and seasonal patterns established from historical delivery data. Anomaly detection that flags delivered values falling outside the expected range -- separate alerting thresholds for minor deviations (warning) and major deviations (critical) so the data team can triage by severity. Weekly summary of anomaly patterns showing which tables and metrics trigger alerts most frequently and whether the trend is improving. Anomaly detection catches the failure modes that no specific test was written to catch.
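
A sketch of the baseline-and-deviation logic in SQL, assuming a hypothetical daily_metrics table (metric_name, metric_date, metric_value); a production version would also account for seasonality:

    -- Flag today's values that deviate from the historical baseline, with
    -- severity tiers: beyond 2 standard deviations is a warning, beyond 3
    -- is critical.
    with baseline as (
        select
            metric_name,
            avg(metric_value)    as mean_value,
            stddev(metric_value) as stddev_value
        from daily_metrics
        where metric_date < current_date
        group by metric_name
    ),
    scored as (
        select
            m.metric_name,
            m.metric_value,
            (m.metric_value - b.mean_value)
                / nullif(b.stddev_value, 0) as z_score
        from daily_metrics m
        join baseline b using (metric_name)
        where m.metric_date = current_date
    )
    select
        metric_name,
        metric_value,
        z_score,
        case when abs(z_score) > 3 then 'critical' else 'warning' end as severity
    from scored
    where abs(z_score) > 2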

Schema change detection

Automated comparison of the source schema against the expected schema on each pipeline run. Alert when a source system adds, removes, or renames a column before the pipeline attempts to load data with the unexpected structure -- catching schema changes before they cause a load failure or silently drop a field from downstream tables. Schema change log showing when each source changed, what specifically changed, and which downstream tables in the warehouse were affected. Schema drift detected early costs an hour to investigate; detected after a failed load costs a day.
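
One way to implement the comparison is a diff between the warehouse's information_schema and a stored snapshot of the expected schema -- a sketch, assuming a hypothetical expected_columns snapshot table and a source schema named raw:

    -- expected_columns (table_name, column_name, data_type) is a snapshot
    -- captured when the pipeline was last validated against the source.
    with live as (
        select table_name, column_name, data_type
        from information_schema.columns
        where table_schema = 'raw'   -- hypothetical source schema
    )
    select
        coalesce(e.table_name, l.table_name)   as table_name,
        coalesce(e.column_name, l.column_name) as column_name,
        case
            when e.column_name is null then 'added in source'
            when l.column_name is null then 'removed from source'
            else 'type changed'
        end as change
    from expected_columns e
    full outer join live l
        on  l.table_name  = e.table_name
        and l.column_name = e.column_name
    where e.column_name is null
       or l.column_name is null
       or e.data_type <> l.data_type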

Data lineage tracking

Column-level lineage from source tables through transformation models to final mart tables and reports. dbt-generated lineage documentation showing which source tables feed each staging model, which staging models feed each mart, and which marts feed each downstream report or dashboard. Impact analysis showing exactly which downstream consumers are affected when a specific source table changes -- so the data team knows the blast radius of a source change before it happens rather than after. Lineage makes root cause investigation a minutes-long task instead of a multi-day archaeology project.
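
The table-level lineage falls out of dbt itself: because every model declares its inputs through the source() and ref() macros, dbt can assemble the DAG and the generated docs without extra annotation. A hypothetical staging model:

    -- models/staging/stg_orders.sql
    -- Declaring inputs through source()/ref() is what lets dbt build the
    -- lineage graph: this model is recorded as downstream of the raw orders
    -- table, and any mart that selects from ref('stg_orders') is recorded
    -- as downstream of both.
    select
        order_id,
        customer_id,
        order_date,
        revenue
    from {{ source('shop', 'orders') }}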

Data quality dashboards and reporting

Centralised dashboard showing test pass/fail history, anomaly alert history, freshness status for all monitored tables, and pipeline run logs -- one view for the data team to assess data health across the entire pipeline layer. Weekly data quality report delivered to the data team and business stakeholders showing test pass rates, anomaly counts, and freshness SLA compliance. Trend view showing whether data quality is improving or degrading over time so the team can prioritise remediation work based on evidence rather than gut feel.
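
The trend view can be driven by a simple aggregate over stored test outcomes -- a sketch, assuming dbt run results are landed in a hypothetical test_results table (executed_at, table_name, status):

    -- Weekly test pass rate per table, worst tables first within each week.
    select
        date_trunc('week', executed_at) as week,
        table_name,
        count(*) as tests_run,
        100.0 * sum(case when status = 'pass' then 1 else 0 end)
              / count(*) as pass_rate_pct
    from test_results
    group by 1, 2
    order by 1 desc, 4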

Have a data quality problem?

Tell us your current pipeline setup, what broke last time bad data reached a report, and how long it took to find the root cause. We'll scope the monitoring infrastructure and give you a fixed cost.

Frequently asked questions

What are dbt tests and how do they work?

dbt tests are SQL assertions that run against your warehouse tables after each pipeline load. A not_null test runs a query that counts null values in a column -- if the count is above zero, the test fails. A unique test finds duplicate values. Custom tests run any SQL you write: 'revenue must be positive', 'order date cannot be in the future', 'every order must have a matching customer'. Tests run automatically at the end of each dbt run. When a test fails, the run produces an error with the failing records, and the data team is alerted before anyone queries the affected table.
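
Conceptually, the query dbt generates for a not_null test on orders.customer_id boils down to something like this (a simplified sketch, not dbt's exact compiled output; analytics.orders is a hypothetical table):

    -- A non-empty result means the test fails; the rows returned are the
    -- failing records the alert points at.
    select customer_id
    from analytics.orders
    where customer_id is null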

What is the difference between data testing and data monitoring?

Testing (dbt tests, schema validation) checks specific assertions about the data -- 'this column cannot be null', 'this value must exist in a reference table'. Monitoring (row count, anomaly detection, freshness) tracks statistical properties of the data over time and alerts when something deviates from the expected pattern without a pre-defined assertion. Testing catches known failure modes. Monitoring catches unknown failure modes -- the pipeline that delivers 40% fewer records than usual because a source system had an issue, without failing any specific test. A complete data quality system uses both.

How are freshness SLAs configured?

Freshness SLAs are configured by table: each table gets an expected update frequency (every 1 hour, every 24 hours, every Monday at 6am) and a tolerance window before alerting. dbt's built-in source freshness check compares the maximum timestamp in a configured column against the expected freshness. For tables without a reliable timestamp column, we build a pipeline metadata table that records the completion time of each pipeline run and monitor against that. Alerts go to Slack or PagerDuty depending on the severity of the affected table.
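
For the metadata-table approach, the staleness check is a single query -- a sketch, assuming a hypothetical pipeline_runs table (table_name, completed_at, expected_interval_minutes), Postgres-style interval arithmetic, and a 30-minute tolerance window:

    -- Flag tables whose last completed pipeline run is older than the
    -- expected update interval plus the tolerance window.
    select
        table_name,
        expected_interval_minutes,
        max(completed_at) as last_completed
    from pipeline_runs
    group by table_name, expected_interval_minutes
    having max(completed_at)
         < current_timestamp - (expected_interval_minutes + 30) * interval '1 minute'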

How much does data quality monitoring cost?

Adding dbt-based testing and freshness monitoring to an existing dbt project typically runs $8,000 to $20,000. A full data quality platform including anomaly detection, schema change monitoring, lineage tracking, and a quality dashboard typically runs $25,000 to $60,000. Fixed cost agreed before development starts.