Data Audits & Automated Quality Governance

What This Document Covers

Knowing that data quality matters is not the same as knowing how to measure it, enforce it, and sustain it. This is a practitioner guide covering:

How to actually conduct a data audit for accuracy and completeness
How to translate audit findings into measurable, enforced quality rules
How to automate those rules so governance runs continuously
How to connect automation to remediation so issues get fixed, not just flagged

The goal: data quality as an operational property of the infrastructure (enforced automatically, visible continuously, and owned by named people) rather than a one-off project that runs once and decays.

Part 1: How to Conduct a Data Audit

What a Data Quality Audit Actually Is

A data quality audit is a systematic review of data assets to determine whether they meet defined standards for accuracy, completeness, consistency, validity, and timeliness. It runs in two modes:

Internal continuous auditing — automated checks running on a schedule or triggered by pipeline events
External periodic auditing — independent review of quality practices and compliance posture, typically annually or before a major AI initiative

For AI readiness specifically, audits serve an additional purpose: fitness-for-purpose validation. The question is not just "is this data correct" but "is this data trustworthy enough to train a model or feed an agent that will act on it autonomously."

Step 1: Define Scope and Objectives

Before touching any data, define exactly what you are auditing and why.

Which systems, tables, and fields are in scope?
What use case does this data support? Reporting? ML training? Agent decision-making?
Which quality dimensions matter most for this use case?
What is the acceptable quality threshold — and what happens when it is not met?
Who is the data owner who will receive and act on findings?

Practical guidance: start with data flowing into your most critical AI systems or business decisions. Auditing everything at once produces a backlog no team can clear. Prioritize by business impact and AI risk.

Step 2: Data Profiling: the Diagnostic Layer

Data profiling is automated analysis of a dataset to understand its structure, content, and statistical properties before applying any rules. It answers: what does this data actually look like?

What profiling produces:

Row count — volume; detect unexpected drops or spikes
Null rate per column — completeness baseline per field
Distinct value count — cardinality; detect unexpected proliferation
Min / max / mean / stddev — distribution; detect outliers
Value frequency distribution — most common values; detect encoding issues
Data type confirmation — actual types vs. declared schema
Pattern analysis — formats present (date formats, phone formats)
Duplicate record rate — uniqueness baseline

What to look for immediately:

Null rates above 5% on fields declared as required
Sentinel values (-1, 0, 9999, "N/A", "Unknown") suggesting missing data encoded as values
Date fields with values in 1900 or 2099 — common default handling artifacts in legacy systems
Fields where 80%+ of values are a single value — may indicate a broken pipeline defaulting to a constant
Schema mismatches between the data dictionary and what the profiler finds
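
A minimal profiling sketch, using only the Python standard library, can make a few of these statistics and red flags concrete. It assumes the dataset is a list of dictionaries; the function names `profile_column` and `profile_flags` are illustrative and do not come from any profiling tool.

```python
from collections import Counter

def _is_null(value):
    # Treat None and the empty string as missing for profiling purposes.
    return value is None or value == ""

def profile_column(rows, column):
    """Return basic profile statistics for one column of a list-of-dicts dataset."""
    values = [row.get(column) for row in rows]
    total = len(values)
    non_null = [v for v in values if not _is_null(v)]
    counts = Counter(non_null)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "row_count": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct_count": len(counts),
        "top_value": top_value,
        "top_value_share": top_count / total if total else 0.0,
    }

def profile_flags(profile, required=False):
    """Apply two of the red-flag heuristics listed above to one column profile."""
    flags = []
    if required and profile["null_rate"] > 0.05:
        flags.append("null rate above 5% on required field")
    if profile["top_value_share"] >= 0.80:
        flags.append("single value accounts for 80%+ of records")
    return flags

# Hypothetical sample data.
rows = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "US"},
    {"customer_id": 3, "country": "US"},
    {"customer_id": 4, "country": "US"},
    {"customer_id": 5, "country": None},
]
country_profile = profile_column(rows, "country")
country_flags = profile_flags(country_profile, required=True)
print(country_profile)
print(country_flags)
```

In practice a warehouse-native profiler would compute these in SQL; the sketch only shows which numbers feed the red-flag list.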

Step 3: Auditing for Accuracy

Accuracy is the hardest dimension to audit at scale because it requires a reference to compare against. You need to know what the correct value should be to determine whether the value present is wrong.

Four accuracy audit techniques:

Technique 1: Source system comparison

Compare values in the audited dataset against the authoritative source system. Pull a statistically meaningful sample: at least 5% of records or 500 records, whichever is larger. If failures cluster in a time range or system segment, expand the sample in that area.
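
As a rough sketch of the sampling step, the snippet below reads the guidance as "the larger of 5% or 500 records, capped at the dataset size", which is an interpretation rather than something the guidance states outright. Both systems are modeled as plain dictionaries keyed by record ID, and `sample_size` and `audit_sample` are hypothetical helper names.

```python
import math
import random

def sample_size(total_records, percent=5, minimum=500):
    """Larger of percent-of-records and the minimum, capped at the dataset size."""
    by_percent = math.ceil(total_records * percent / 100)
    return min(total_records, max(by_percent, minimum))

def audit_sample(audited, source, seed=0):
    """Sample record keys and return (sample size, keys whose values disagree)."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    n = sample_size(len(audited))
    sampled_keys = rng.sample(sorted(audited), n)
    mismatches = [k for k in sampled_keys if audited[k] != source.get(k)]
    return n, mismatches

# Hypothetical data: 400 address records, one of which differs in the source system.
audited = {i: f"addr-{i}" for i in range(400)}
source = dict(audited)
source[7] = "addr-corrected"
n, mismatches = audit_sample(audited, source)
print(n, mismatches)
```

Expanding the sample where failures cluster would mean re-running `audit_sample` on the subset of keys in the affected time range or segment.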

Technique 2: Validation rules

Apply domain knowledge as testable rules:

Email addresses must match a valid regex pattern
Transaction amounts must be greater than zero
Order dates must be on or before ship dates
Country codes must exist in the ISO 3166-1 standard list
Customer status must be one of a defined controlled vocabulary

Rules applied in bulk across an entire dataset are the foundation of automated accuracy checking.
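
The five example rules can be written as small predicates and applied to every record. This is an illustrative sketch only: the country list is a tiny subset of ISO 3166-1 alpha-2 codes, the status vocabulary is invented, and the field names (`email`, `amount`, and so on) are assumptions.

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
ISO_COUNTRY_CODES = {"US", "GB", "DE", "FR", "IN"}      # illustrative subset only
CUSTOMER_STATUSES = {"active", "inactive", "suspended"}  # hypothetical vocabulary

# Each rule returns True when the record passes.
RULES = {
    "email_format": lambda r: bool(EMAIL_RE.match(r.get("email") or "")),
    "positive_amount": lambda r: r.get("amount") is not None and r["amount"] > 0,
    "order_before_ship": lambda r: r.get("ship_date") is None
    or r["order_date"] <= r["ship_date"],
    "country_code": lambda r: r.get("country") in ISO_COUNTRY_CODES,
    "status_vocabulary": lambda r: r.get("status") in CUSTOMER_STATUSES,
}

def failed_rules(record):
    """Return the names of all rules the record violates, in rule order."""
    return [name for name, rule in RULES.items() if not rule(record)]

good = {"email": "ana@example.com", "amount": 42.5, "order_date": date(2025, 3, 1),
        "ship_date": date(2025, 3, 3), "country": "DE", "status": "active"}
bad = {"email": "not-an-email", "amount": 0, "order_date": date(2025, 3, 5),
       "ship_date": date(2025, 3, 3), "country": "ZZ", "status": "vip"}

print(failed_rules(good))
print(failed_rules(bad))
```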

Technique 3: Cross-system consistency check

If the same entity exists in multiple systems, compare corresponding fields. A customer's address in the CRM should match their address in the billing system. Discrepancies identify either data entry errors or transformation bugs. For AI specifically, this detects fields that look accurate in isolation but are inconsistent across the join that forms a training dataset.

Technique 4: Statistical anomaly detection

For numerical data, flag values that deviate significantly from historical distributions. If average order value has been around $150 for 12 months and a segment suddenly shows $15,000, that is either a real business event or a data error. Z-score analysis and statistical process control automate this detection.
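
A z-score check can be sketched in a few lines. The threshold of 3 standard deviations is a common convention and an assumption here, not a value the guide prescribes; the monthly history is invented to mirror the order-value example.

```python
from statistics import mean, stdev

def z_score(value, history):
    """Number of sample standard deviations between value and the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma

def is_anomalous(value, history, threshold=3.0):
    return abs(z_score(value, history)) > threshold

# Twelve months of average order value hovering around 150 (hypothetical).
history = [148, 152, 150, 149, 151, 153, 147, 150, 152, 148, 151, 149]

print(is_anomalous(15000, history))  # far outside the historical band
print(is_anomalous(152, history))    # within normal variation
```

A flagged value still needs a human decision: the check cannot tell a real business event from a data error.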

Accuracy scoring:

Accuracy Score = (Records passing all accuracy rules / Total records audited) x 100

Target for AI training data: 95% or above. Below 90%: block from production AI use pending remediation.
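
The score and its gate can be expressed directly. The text does not say what happens between 90% and 95%, so the "review" outcome below is an assumption, as are the function names.

```python
def accuracy_score(passing_records, total_records):
    """Percentage of audited records that pass all accuracy rules."""
    if total_records == 0:
        raise ValueError("cannot score an empty audit sample")
    return 100 * passing_records / total_records

def ai_gate(score, target=95.0, block_below=90.0):
    """Map an accuracy score to a production-AI eligibility decision."""
    if score >= target:
        return "eligible"
    if score < block_below:
        return "blocked"
    return "review"  # assumed handling for the unspecified 90-95% band

score = accuracy_score(9620, 10000)
print(score, ai_gate(score))
```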

Step 4: Auditing for Completeness

Completeness is more nuanced than null rate. A field can be populated with a placeholder that stands in for missing data, which a null check will not catch.

Technique 1: Null rate analysis by field

For each field, calculate the percentage of records where the value is null or empty. Compare against acceptable thresholds:

Primary key: 0% nulls acceptable
Required business fields (email, account ID): under 1%
Optional enrichment fields: document the rate; context-dependent threshold

Technique 2: Sentinel value detection

Build a sentinel value register for your environment. Common sentinel values:

Numeric: -1, 0, 999, 9999, -999
String: "N/A", "Unknown", "None", "TBD", "NULL" as a string
Date: 1900-01-01, 2099-12-31, 1970-01-01 (Unix epoch default)

Treat sentinel values as nulls in completeness scoring. A field that is 98% complete by null count but 25% sentinel-value is actually 73% complete.
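
The arithmetic behind that example can be checked with a small sketch. It assumes the sentinel register is kept per field (values such as 0 are legitimate in many columns), and the `region` data is invented to reproduce the 98% / 25% / 73% figures.

```python
def completeness(values, sentinels):
    """Return completeness by null count and after treating sentinels as missing."""
    total = len(values)
    nulls = sum(1 for v in values if v is None or v == "")
    sentinel_hits = sum(
        1 for v in values if v is not None and v != "" and v in sentinels
    )
    return {
        "by_null_count": 100 * (total - nulls) / total,
        "effective": 100 * (total - nulls - sentinel_hits) / total,
    }

# Hypothetical per-field register entry for a "region" column.
region_sentinels = {"N/A", "Unknown", "TBD"}

# 100 records: 2 true nulls, 25 sentinel strings, 73 real values.
region_values = [None] * 2 + ["N/A"] * 25 + ["EMEA"] * 73

result = completeness(region_values, region_sentinels)
print(result)
```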

Technique 3: Conditional completeness

Completeness rules that apply when another condition is true:

If order status is "shipped" then ship date must not be null
If customer type is "enterprise" then contract value must not be null

Flat null rate analysis misses these. Conditional completeness requires SQL-based or code-based assertion logic.
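
In code, a conditional completeness rule is a pair: a condition that decides whether the rule applies, and a field that must then be populated. The sketch below encodes the two examples above; record layout and rule names are assumptions.

```python
# (rule name, condition that activates the rule, field that must then be populated)
CONDITIONAL_RULES = [
    ("shipped_requires_ship_date",
     lambda r: r.get("status") == "shipped", "ship_date"),
    ("enterprise_requires_contract_value",
     lambda r: r.get("customer_type") == "enterprise", "contract_value"),
]

def conditional_failures(records):
    """Return (record index, rule name) for every violated conditional rule."""
    failures = []
    for index, record in enumerate(records):
        for name, applies, field in CONDITIONAL_RULES:
            if applies(record) and record.get(field) is None:
                failures.append((index, name))
    return failures

records = [
    {"status": "shipped", "ship_date": None,
     "customer_type": "smb", "contract_value": None},
    {"status": "pending", "ship_date": None,
     "customer_type": "enterprise", "contract_value": 50000},
    {"status": "shipped", "ship_date": "2025-03-02",
     "customer_type": "enterprise", "contract_value": None},
]
failures = conditional_failures(records)
print(failures)
```

Note that the second record has a null `ship_date` and still passes, because the order is not shipped; a flat null-rate check would have counted it as incomplete.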

Technique 4: Cross-record completeness

For time-series data, check whether expected records exist. If you expect a transaction record for every active customer every month, and 3% of active customers have no record in March, that is a completeness failure at the record level — not the field level.
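
Record-level completeness reduces to a set difference between the records you expect and the records you have. A sketch, assuming transactions are available as `(customer_id, "YYYY-MM")` pairs and using invented customer IDs:

```python
def customers_missing_records(active_customers, transactions, period):
    """Return (customers with no record in the period, percentage missing)."""
    expected = set(active_customers)
    seen = {customer for customer, txn_period in transactions if txn_period == period}
    missing = sorted(expected - seen)
    rate = 100 * len(missing) / len(expected) if expected else 0.0
    return missing, rate

# Hypothetical: 100 active customers, 3 of whom have no March record.
active = [f"C{i:03d}" for i in range(100)]
transactions = [(c, "2025-02") for c in active] + [(c, "2025-03") for c in active[3:]]

missing, rate = customers_missing_records(active, transactions, "2025-03")
print(missing, rate)
```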

Completeness scoring:

Completeness Score = (Required field values populated with valid data /
                      (Number of required fields x Total records)) x 100

Target for AI training data: 98% or above for primary fields. Below 95%: flag for steward review.
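
One way to read the formula is as a ratio of valid cells to expected cells, where a cell is one required field on one record. The sketch below follows that reading; the sentinel set and sample records are hypothetical.

```python
SENTINELS = frozenset({"N/A", "Unknown", "TBD"})  # hypothetical register

def is_valid(value):
    """A cell counts as populated only if it is neither null nor a sentinel."""
    return value is not None and value != "" and value not in SENTINELS

def completeness_score(records, required_fields):
    """Valid required cells divided by (required fields x records), as a percentage."""
    expected_cells = len(required_fields) * len(records)
    if expected_cells == 0:
        raise ValueError("no required cells to score")
    valid_cells = sum(
        1 for record in records for field in required_fields
        if is_valid(record.get(field))
    )
    return 100 * valid_cells / expected_cells

records = [
    {"email": "a@example.com", "account_id": "A1"},
    {"email": "N/A", "account_id": "A2"},  # sentinel counts as missing
    {"email": "c@example.com", "account_id": "A3"},
    {"email": "d@example.com", "account_id": "A4"},
]
score = completeness_score(records, ["email", "account_id"])
print(score)  # 7 valid cells out of 8
```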

Step 5: Document Findings

For each finding, document:

Asset — which table, field, or pipeline
Dimension — which quality dimension failed
Severity — critical (blocks AI use), major (degrades AI quality), minor (cosmetic)
Scope — what percentage of records are affected
Root cause hypothesis — where in the pipeline did this likely originate
Owner — named data steward accountable for remediation
Recommended action — specific fix, not a general suggestion
AI impact — which models, features, or agents are downstream of this issue
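
Capturing findings in a fixed structure keeps them comparable across audits and makes them easy to load into a catalog or ticketing system later. The following is one possible shape, not a prescribed schema; all example values are invented.

```python
from dataclasses import dataclass, field, asdict
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # blocks AI use
    MAJOR = "major"        # degrades AI quality
    MINOR = "minor"        # cosmetic

@dataclass
class Finding:
    asset: str                  # table, field, or pipeline
    dimension: str              # quality dimension that failed
    severity: Severity
    scope_pct: float            # percentage of records affected
    root_cause_hypothesis: str
    owner: str                  # named data steward
    recommended_action: str
    ai_impact: list[str] = field(default_factory=list)  # downstream models, features, agents

    @property
    def blocks_ai_use(self) -> bool:
        return self.severity is Severity.CRITICAL

finding = Finding(
    asset="orders.ship_date",
    dimension="completeness",
    severity=Severity.CRITICAL,
    scope_pct=4.2,
    root_cause_hypothesis="Carrier feed stopped populating ship_date after a schema change",
    owner="Dana Example",
    recommended_action="Restore the carrier feed mapping and backfill the affected orders",
    ai_impact=["delivery-eta-model"],
)
print(asdict(finding)["asset"], finding.blocks_ai_use)
```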

Part 2: Automating Data Quality Governance

Why Manual Audits Do Not Scale

A manual audit conducted quarterly can leave close to 90 days of undetected data quality degradation between checks. At AI scale, where agents make decisions continuously based on pipelines that run 24/7, manual quality processes are not a governance strategy. They are compliance theater that provides the appearance of oversight without the substance.

Automated data quality governance shifts the manual work from running checks to designing checks and resolving issues — the two activities that require human judgment.

The Automation Architecture: Four Layers

Layer 1: PROFILING AND DISCOVERY (schema, statistics, lineage, catalog). Runs automatically on schedule. Produces continuously updated profiles for every monitored asset.

Layer 2: TESTING AND VALIDATION (rules as code, CI/CD gates, pipeline checks). Quality rules are version-controlled and enforced automatically. Failing rules block the pipeline rather than producing warnings that get ignored.

Layer 3: OBSERVABILITY AND ALERTING (drift detection, anomaly alerts, dashboards). ML-powered detection catches failure modes that static rules don't anticipate.

Layer 4: GOVERNANCE AND REMEDIATION (routing, ownership, SLAs, resolution tracking). Issues are automatically routed to named owners with context, the SLA clock is started, and resolution is tracked in governance reporting.

Most organizations have Layer 1. Production-grade AI governance requires all four.

Layer 2: Testing and Validation in Practice

The shift-left principle: quality checks run as early in the pipeline as possible. A check at ingestion catches an issue before it propagates to 47 downstream tables and 3 production models.

Rule categories:

Schema rules (dbt):

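models:
  - name: orders   # model name is an assumption
    columns:
      - {name: customer_id, tests: [not_null]}
      - {name: order_date, tests: [not_null]}
      - {name: email, tests: [not_null]}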

Range and validity rules (Great Expectations):

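# Assumes `validator` is an existing Great Expectations Validator (pre-1.0 API)
validator.expect_column_values_to_be_between(
    column="transaction_amount",
    min_value=0.01, max_value=1000000
)
# Raw string keeps "\." intact; anchors require the whole value to match
validator.expect_column_values_to_match_regex(
    column="email",
    regex=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
)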

Conditional completeness (Soda):

checks for orders:
  - missing_count(ship_date) = 0:
      filter: status = 'shipped'
  - missing_count(contract_value) = 0:
      filter: customer_type = 'enterprise'

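Cross-field consistency (dbt singular test):

-- A dbt singular test passes when the query returns zero rows,
-- so select the offending rows rather than a count.
SELECT *
FROM {{ ref('orders') }}
WHERE ship_date < order_date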

CI/CD integration: quality tests run before any code change is merged. A PR that introduces a schema change breaking a downstream quality rule is rejected automatically before it reaches production.
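
The blocking behavior comes down to the process exit code: most CI systems fail a job when a step exits non-zero. A minimal, tool-agnostic sketch follows; `run_gate` and the check names are hypothetical, and real checks would query the warehouse rather than return constants.

```python
def run_gate(checks):
    """Run named checks (zero-argument callables returning True on pass).

    Returns a process exit code: 0 if every check passed, 1 otherwise.
    """
    failures = [name for name, check in checks.items() if not check()]
    for name in failures:
        print(f"FAILED: {name}")
    return 1 if failures else 0

# Hypothetical checks standing in for real warehouse queries.
checks = {
    "orders.customer_id not null": lambda: True,
    "orders.ship_date >= order_date": lambda: False,
}
exit_code = run_gate(checks)
print("exit code:", exit_code)
```

In a CI job the script would end with `sys.exit(run_gate(checks))`, so a failing rule stops the merge instead of logging a warning.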

Layer 3: ML-Powered Observability

Static rules catch known failure modes. Observability catches unknown ones. ML-powered platforms learn the normal behavior of a data asset and alert when something changes unexpectedly — without requiring a human to have written a rule for every possible failure mode.

What observability monitors:

Volume — row count drops or spikes outside normal range
Freshness — tables not updated within their expected window
Schema drift — columns added, dropped, or type-changed without a code change
Distribution shift — field value distributions diverging from historical baseline
Null rate change — completeness degrading over time
Duplicate rate change — sudden increase in duplicate records
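
Two of these monitors, freshness and schema drift, need no statistics at all and can be sketched as plain comparisons. The table metadata below is invented; a real platform would read it from warehouse information schemas.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, expected_interval, now):
    """True when the table has not been updated within its expected window."""
    return now - last_updated > expected_interval

def schema_drift(expected, observed):
    """Compare two column-name -> type mappings and report the differences."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "dropped": sorted(set(expected) - set(observed)),
        "type_changed": sorted(c for c in shared if expected[c] != observed[c]),
    }

now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)  # 27 hours earlier
stale = is_stale(last_load, timedelta(hours=24), now)

expected = {"order_id": "int", "amount": "decimal", "ship_date": "date"}
observed = {"order_id": "int", "amount": "varchar", "channel": "varchar"}
drift = schema_drift(expected, observed)
print(stale, drift)
```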

How ML-powered detection works (Monte Carlo model):

Platform reads metadata and statistical signals from connected warehouses via read-only connectors
ML establishes behavioral baselines per asset per metric
Anomalies flagged when current values deviate beyond learned thresholds
Root cause analysis traces the anomaly upstream through lineage
Alert routed to the asset owner with downstream impact shown
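
Commercial platforms do not publish their models, so the following is only a simple statistical stand-in for the "learn a baseline, flag deviations" idea, not a description of how Monte Carlo works. It learns a separate row-count baseline per weekday, which also keeps routine weekend dips from alerting. The band width `k=3` and the floor on sigma are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_baselines(history):
    """history: iterable of (weekday, row_count), weekday 0=Monday .. 6=Sunday."""
    by_day = defaultdict(list)
    for weekday, count in history:
        by_day[weekday].append(count)
    # Keep only weekdays with at least two observations so stdev is defined.
    return {d: (mean(v), stdev(v)) for d, v in by_day.items() if len(v) >= 2}

def is_anomaly(weekday, count, baselines, k=3.0):
    """Flag counts more than k standard deviations from that weekday's mean."""
    if weekday not in baselines:
        return False  # no baseline yet, nothing to compare against
    mu, sigma = baselines[weekday]
    return abs(count - mu) > k * max(sigma, 1.0)  # floor avoids a zero-width band

# Hypothetical daily row counts: busy Mondays, quiet Saturdays.
history = [(0, 1000), (0, 1020), (0, 980), (5, 200), (5, 210), (5, 190)]
baselines = learn_baselines(history)

print(is_anomaly(5, 205, baselines))  # normal for a Saturday
print(is_anomaly(0, 205, baselines))  # a Saturday-sized load on a Monday
```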

Alert design principles:

Route to named owners, not generic Slack channels
Include downstream impact in the alert — which models, dashboards, agents are affected
Classify severity automatically
Suppress known false positives (expected volume drops on weekends)
Track mean time to detect (MTTD) and mean time to resolve (MTTR) as governance KPIs

Layer 4: The Automated Remediation Workflow

Detection without remediation is a dashboard that makes people feel bad. The automation loop is only closed when quality issues route to the right person, with context, and with a defined SLA.

Workflow:

Quality check fails or anomaly detected
Severity classified: critical, major, minor
Owner identified via catalog ownership metadata
Ticket created with: asset name, rule failed, % records affected, downstream AI and BI impact, root cause hypothesis
SLA clock starts: critical 4 hours, major 24 hours, minor 72 hours
Owner investigates and applies fix: pipeline bug, source system error, or rule calibration
Quality check re-run to confirm resolution
Resolution logged in catalog quality history
SLA compliance tracked in governance reporting
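
The ticket-creation and SLA steps of this workflow might look roughly like the sketch below. The SLA hours come from the workflow above; the `Ticket` shape, the `open_ticket` helper, and the ownership lookup are illustrative assumptions rather than any catalog's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SLA_HOURS = {"critical": 4, "major": 24, "minor": 72}

@dataclass
class Ticket:
    asset: str
    rule_failed: str
    pct_records_affected: float
    severity: str
    owner: str
    opened_at: datetime

    @property
    def due_at(self):
        """Deadline implied by the severity's SLA."""
        return self.opened_at + timedelta(hours=SLA_HOURS[self.severity])

def open_ticket(asset, rule_failed, pct_affected, severity, ownership, now):
    """Create a ticket routed to the owner recorded in catalog ownership metadata."""
    if severity not in SLA_HOURS:
        raise ValueError(f"unknown severity: {severity}")
    return Ticket(asset, rule_failed, pct_affected, severity, ownership[asset], now)

# Hypothetical catalog ownership metadata.
ownership = {"orders": "dana@example.com"}
opened = datetime(2025, 3, 10, 8, 0, tzinfo=timezone.utc)

ticket = open_ticket("orders", "ship_date not null when shipped", 4.2,
                     "critical", ownership, opened)
print(ticket.owner, ticket.due_at.isoformat())
```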

Escalation policy:

SLA breached: escalate to steward's manager and flag in governance dashboard
Critical issue affecting live AI: trigger inference pause for affected pipeline pending resolution
Recurring pattern: trigger root cause review, not just symptom fix
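
The escalation policy can be written as a pure function from issue state to a list of actions, which makes it easy to test before wiring it to paging or pipeline controls. The action names are placeholders, and treating two or more recurrences in 30 days as a "recurring pattern" is an assumption.

```python
from datetime import datetime, timezone

def escalation_actions(severity, due_at, now, affects_live_ai, recurrences_30d):
    """Return the escalation steps the policy calls for, in order."""
    actions = []
    if now > due_at:  # SLA breached
        actions += ["escalate_to_manager", "flag_in_governance_dashboard"]
    if severity == "critical" and affects_live_ai:
        actions.append("pause_inference")
    if recurrences_30d >= 2:  # assumed threshold for a recurring pattern
        actions.append("root_cause_review")
    return actions

due = datetime(2025, 3, 10, 12, 0, tzinfo=timezone.utc)
late = datetime(2025, 3, 10, 13, 0, tzinfo=timezone.utc)
early = datetime(2025, 3, 10, 9, 0, tzinfo=timezone.utc)

print(escalation_actions("critical", due, late, True, 0))
print(escalation_actions("minor", due, early, False, 3))
```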

Governance Metrics to Track

Quality score by asset: weighted composite across all dimensions — target 85% or above for AI-eligible assets
Accuracy rate: percentage of records passing all accuracy rules — target 95% or above
Completeness rate: percentage of required fields populated with valid values — target 98% or above
Rule coverage: percentage of AI-critical assets with automated checks — target 100%
MTTD: mean time from issue introduction to detection — target under 1 hour for critical assets
MTTR: mean time from detection to resolution — target under 4 hours critical, under 24 hours major
SLA compliance: percentage of issues resolved within SLA — target 95% or above
Recurring issue rate: percentage of issues that recur within 30 days — target under 10%
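
Several of these metrics fall out of the issue log directly. The sketch below assumes each issue records when it was introduced, detected, and resolved, plus its SLA in hours and a recurrence flag; the field names and the two sample issues are invented.

```python
from datetime import datetime
from statistics import mean

def governance_metrics(issues):
    """Compute MTTD, MTTR (hours), SLA compliance and recurring rate (percent)."""
    detect = [(i["detected_at"] - i["introduced_at"]).total_seconds() for i in issues]
    resolve = [(i["resolved_at"] - i["detected_at"]).total_seconds() for i in issues]
    within_sla = sum(1 for i, r in zip(issues, resolve) if r <= i["sla_hours"] * 3600)
    recurring = sum(1 for i in issues if i["recurred_within_30d"])
    return {
        "mttd_hours": mean(detect) / 3600,
        "mttr_hours": mean(resolve) / 3600,
        "sla_compliance_pct": 100 * within_sla / len(issues),
        "recurring_issue_pct": 100 * recurring / len(issues),
    }

issues = [
    {"introduced_at": datetime(2025, 3, 1, 0, 0), "detected_at": datetime(2025, 3, 1, 0, 30),
     "resolved_at": datetime(2025, 3, 1, 2, 30), "sla_hours": 4, "recurred_within_30d": False},
    {"introduced_at": datetime(2025, 3, 2, 0, 0), "detected_at": datetime(2025, 3, 2, 1, 0),
     "resolved_at": datetime(2025, 3, 3, 3, 0), "sla_hours": 24, "recurred_within_30d": True},
]
metrics = governance_metrics(issues)
print(metrics)
```

MTTD depends on knowing when an issue was introduced, which usually has to be reconstructed from lineage or pipeline logs after the fact.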

Part 3: Tooling Stack

No single tool covers all four layers. Most production environments use two to three working together. Here is what each tool actually is, what it does, and where to find it.

Layer 2: Testing and Validation Tools

dbt (data build tool)

What it is: an open-source transformation framework that lets data teams write SQL-based transformations as version-controlled code. dbt tests are assertions written alongside transformation models — they run automatically every time a transformation runs, validating that outputs meet expectations before downstream systems consume them.

Best for: teams already using dbt for transformation. Zero new infrastructure required.

Limitation: only covers dbt models; no monitoring of production data outside the transformation layer.

Website: https://www.getdbt.com

Open source core. dbt Cloud (managed) starts at $50/month per developer seat.

Great Expectations

What it is: an open-source Python library for defining, documenting, and validating data quality expectations as code. You write "expectations" — assertions about what your data should look like — and run them against any dataset at any pipeline stage. Expectations are version-controlled, shareable, and runnable in CI/CD pipelines.

Best for: engineering-heavy teams that need highly customizable, code-driven validation integrated into ingestion and CI/CD workflows.

Limitation: requires Python expertise to configure; steeper learning curve than YAML-based tools.

Website: https://greatexpectations.io

Fully open source. GX Cloud (managed UI and scheduling) is available with pricing on request.

Soda

What it is: a data quality platform that lets teams write quality checks in a simple YAML-based syntax (SodaCL) and run them against cloud warehouses. More accessible to non-engineers than Great Expectations while still supporting complex business logic. Includes a UI for collaborative monitoring and a CLI for pipeline integration.

Best for: teams that want a balance between technical flexibility and business-user accessibility. Strong for organizations where data stewards, not just engineers, need to own quality rules.

Website: https://www.soda.io

Open source core (Soda Core). Soda Cloud (managed) pricing on request.

Layer 3: Observability and Monitoring Tools

Monte Carlo

What it is: an enterprise data observability platform that uses machine learning to automatically detect anomalies in data freshness, volume, schema, and distribution — without requiring manual rule configuration for every scenario. It connects to your data warehouse and BI tools via read-only metadata connectors, learns normal behavior for each asset, and alerts when something deviates. Root cause analysis traces anomalies back through lineage to the likely source pipeline.

Best for: large enterprises with complex multi-system data environments where full manual rule coverage is impractical. Often treated as the reference point for production data observability at scale.

Limitation: enterprise pricing; best value at scale.

Website: https://www.montecarlodata.com

Enterprise pricing, contact for quote.

Metaplane

What it is: a lightweight data observability platform designed for teams using dbt, Snowflake, and Looker. Automatically detects anomalies in metrics, schema, and data volume, and alerts teams before bad data reaches dashboards or models. Faster to deploy than Monte Carlo with a more accessible price point.

Best for: mid-market teams on a modern dbt-centric stack who need production observability without enterprise-scale complexity.

Website: https://www.metaplane.dev

Starts at approximately $500/month.

Bigeye

What it is: a data observability platform combining rule-based checks with ML-powered anomaly detection. Deep native integration with Snowflake and BigQuery. Monitors table freshness, volume, distribution, and schema drift. Includes column-level health scores and automatic threshold tuning.

Best for: teams primarily on Snowflake or BigQuery who want a combination of defined rules and automatic ML detection without choosing between the two approaches.

Website: https://www.bigeye.com

Pricing on request.

Layers 1 and 4: Catalog and Governance Platforms

Atlan

What it is: an active metadata platform that unifies data catalog, data quality governance, lineage, and discovery in a single control plane. Atlan's Data Quality Studio lets teams define and automate quality rules directly in cloud warehouses, with trust signals and alerts embedded in everyday workflows. It integrates upstream tools like Monte Carlo, Soda, and Anomalo to provide a 360-degree quality view. For AI specifically, it serves as the context layer that agents can query programmatically to understand data assets before consuming them.

Best for: organizations that want to replace fragmented point tools with a single governed platform. Gartner Magic Quadrant Leader for Data and Analytics Governance Platforms (2026) and Metadata Management Solutions (2025).

Website: https://atlan.com

Enterprise pricing, contact for quote.

OvalEdge

What it is: an integrated data governance platform covering catalog, lineage, data quality rules, and business glossary management in one product. Applies business and technical rules to ensure data meets defined expectations. Continuously analyzes datasets to surface unexpected changes in structure or completeness. Recognized as a Niche Player in the 2026 Gartner Magic Quadrant for Data and Analytics Governance Platforms.

Best for: enterprises seeking a single governed platform for cataloging, lineage, and quality management at a lower price point than Atlan.

Website: https://www.ovaledge.com

Pricing on request.

Recommended Stack by Team Size

Smaller teams and earlier-stage programs:

Start with dbt tests plus Great Expectations for Layer 2 (both open source, zero cost to start)
Add Soda or Metaplane for Layer 3 observability
Use your existing catalog or a manual Notion-based tracker for Layer 4 until volume justifies a dedicated governance platform

Mid-market teams:

dbt tests plus Soda for Layer 2
Metaplane or Bigeye for Layer 3
OvalEdge for Layers 1 and 4

Enterprise:

dbt tests plus Great Expectations for Layer 2
Monte Carlo for Layer 3
Atlan for Layers 1 and 4

Part 4: Implementation Sequence

Weeks 1-2: Profile and baseline — run profiling on all AI-critical datasets, document null rates, distributions, sentinel values. Do not write rules yet.
Weeks 3-4: Define quality standards — for each asset, define acceptable thresholds per dimension. Get sign-off from data owners. Document in catalog.
Weeks 5-6: Write and deploy Layer 2 rules — start with highest-impact assets. Deploy in warning mode first to calibrate thresholds before switching to blocking.
Weeks 7-8: Switch to blocking mode and deploy observability — rules now block pipelines. Deploy observability tooling. Tune anomaly detection.
Weeks 9-10: Deploy governance workflows — connect alerts to ownership metadata, configure issue routing, stand up governance dashboard. Run a fire drill.
Ongoing: Expand coverage, calibrate false positive rates, review recurring issues monthly for root cause patterns.

Readiness Checklist

[ ] All AI-critical assets profiled — null rates, distributions, sentinel values documented
[ ] Quality thresholds defined and signed off by data owners
[ ] Accuracy rules written as code for all required fields
[ ] Completeness rules include sentinel value detection, not just null checks
[ ] Conditional completeness rules defined for context-dependent fields
[ ] All rules integrated into CI/CD pipeline — blocking, not advisory
[ ] Shift-left applied — checks at ingestion, not only at consumption
[ ] ML-powered observability deployed for production AI-critical assets
[ ] Alert routing connected to catalog ownership metadata
[ ] SLAs defined by severity
[ ] Governance dashboard live — quality scores, issue queue, SLA compliance visible
[ ] MTTD and MTTR tracked as operational KPIs
[ ] Quality scores tied to AI eligibility — assets below threshold blocked from training pipeline
[ ] Full detection-to-resolution loop tested end-to-end

Sources

Atlan — Data Quality Audit: How To Get It Right (May 2025)
Atlan — Automated Data Quality: Fix Bad Data and Get AI-Ready (May 2025)
Atlan — Best Data Quality Tools for 2026 (March 2026)
OvalEdge — Data Quality Assessment: A 2026 Guide to AI-Ready Trusted Data (May 2026)
OvalEdge — Which Data Quality Monitoring Tool Is Right for You (June 2026)
Improvado — Data Quality Audit: The Complete Guide (April 2026)
LatentView — What Is Data Audit? (January 2026)
Airbyte — 4 Best Tools to Automate Data Quality Checks in ETL Pipelines (September 2025)
DataKitchen — The 2026 Open-Source Data Quality and Data Observability Landscape (October 2025)
Lumenalta — Data Quality Checklist (January 2026)