Functions and classes for measuring data quality.
Example of auditing a CSV file:
import brewery.ds as ds
import brewery.dq as dq
# Open a data stream
src = ds.CSVDataSource("data.csv")
src.initialize()
# Prepare field statistics
stats = {}
fields = src.field_names
for field in fields:
stats[field] = dq.FieldStatistics(field)
record_count = 0
# Probe values
for row in src.rows():
for i, value in enumerate(row):
stats[fields[i]].probe(value)
record_count += 1
# Finalize statistics
for stat in stats.items():
finalize(record_count)
Auditing using brewery.ds.StreamAuditor:
# ... suppose we have initialized source stream as src
# Create autitor stream target and initialize field list
auditor = ds.StreamAuditor()
auditor.fields = src.fields
auditor.initialize()
# Perform audit for each row from source:
for row in src.rows():
auditor.append(row)
# Finalize results, close files, etc.
auditor.finalize()
# Get the field statistics
stats = auditor.field_statistics
See also