Data Quality API

For an introduction to data quality, see Data Quality.

class brewery.dq.FieldStatistics(key=None, distinct_threshold=10)

Data quality statistics for a dataset field

Attributes:
  • field: name of a field for which statistics are being collected
  • value_count: number of records in which the field exists. In a relational database table this is equal to the number of rows; in a document-based database, such as MongoDB, it is the number of documents that have the key present (whether null or not)
  • record_count: total count of records in the dataset. This should be set explicitly on finalisation; see FieldStatistics.finalize(). In a relational database this should be the same as value_count.
  • value_ratio: ratio of value count to record count, 1 for relational databases
  • null_count: number of records where field is null
  • null_value_ratio: ratio of records with nulls to total number of probed values = null_count / value_count
  • null_record_ratio: ratio of records with nulls to total number of records = null_count / record_count
  • empty_string_count: number of empty strings
  • storage_types: list of all encountered storage types (CSV, MongoDB, XLS might have different types within a field)
  • unique_storage_type: if there is only one storage type, then this is set to that type
  • distinct_values: list of collected distinct values
  • distinct_threshold: number of distinct values to collect. If the count of distinct values is greater than the threshold, collection is stopped and distinct_overflow is set. Set to 0 to collect all values. Default is 10.

dict()

Return dictionary representation of receiver.

finalize(record_count=None)

Compute final statistics.

Parameters:
  • record_count: final number of records in probed dataset.

    See FieldStatistics() for more information.
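
As a rough illustration of what finalisation involves, the sketch below derives the three ratio attributes from raw counters. It is a standalone approximation based only on the attribute descriptions above, not brewery's actual implementation: the helper name final_ratios, the fallback to value_count when no record_count is given, and the handling of empty datasets are all assumptions.

```python
def final_ratios(value_count, null_count, record_count=None):
    """Derive ratio statistics from raw counters (hypothetical helper)."""
    # Assumption: without an explicit record_count every record had the
    # field, as in a relational table.
    if record_count is None:
        record_count = value_count

    def ratio(part, whole):
        # Assumption: report 0 for an empty dataset instead of dividing by zero.
        return float(part) / whole if whole else 0

    return {
        "value_ratio": ratio(value_count, record_count),
        "null_value_ratio": ratio(null_count, value_count),
        "null_record_ratio": ratio(null_count, record_count),
    }


# Document store: 80 of 100 documents have the key, 20 of those are null.
stats = final_ratios(value_count=80, null_count=20, record_count=100)

# Relational table: every row has the column, so record_count is implied.
table_stats = final_ratios(value_count=10, null_count=2)
```

In the document-store case the two null ratios differ because they use different denominators; in the relational case they coincide.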

probe(value)

Probe the value:

  • increase found value count
  • identify storage type
  • probe for null and for empty string
  • probe distinct values: values are collected as long as their count stays within distinct_threshold. If there are more distinct values than distinct_threshold, the distinct_overflow flag is set and the list of distinct values will be empty
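
The distinct-value step above can be sketched as follows. This is a minimal stand-in written from the description, not the library's code: the class name DistinctCollector is hypothetical, and it assumes probed values are hashable so they can be kept in a set.

```python
class DistinctCollector:
    """Collect distinct values up to a threshold (hypothetical stand-in)."""

    def __init__(self, distinct_threshold=10):
        self.distinct_threshold = distinct_threshold
        self.distinct_values = set()
        self.distinct_overflow = False

    def probe(self, value):
        if self.distinct_overflow:
            return  # collection was already stopped
        self.distinct_values.add(value)
        # A threshold of 0 means "collect all values", so it never overflows.
        if (self.distinct_threshold
                and len(self.distinct_values) > self.distinct_threshold):
            self.distinct_values = set()
            self.distinct_overflow = True


collector = DistinctCollector(distinct_threshold=3)
for value in ["a", "b", "a", "c"]:
    collector.probe(value)
```

After this loop the collector holds three distinct values and has not overflowed; probing a fourth distinct value would set the flag and empty the collection.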

class brewery.dq.FieldTypeProbe(field)

Probe for guessing field data type

Attributes:
  • field: name of the field whose statistics are being presented
  • storage_types: found storage types
  • unique_storage_type: if there is only one storage type, then this is set to that type

unique_storage_type

Return storage type if there is only one. This should always return a type in relational databases, but does not have to in databases such as MongoDB.
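
The behaviour described here can be pictured with a small sketch. It makes two assumptions that the text does not confirm: Python type names stand in for storage types, and None is what comes back when no single type applies. The helper names are hypothetical.

```python
def storage_types_of(values):
    """Collect the set of type names seen in a field (stand-in for storage types)."""
    return {type(value).__name__ for value in values}


def unique_storage_type(storage_types):
    """Return the only storage type, or None if there are several or none."""
    if len(storage_types) == 1:
        return next(iter(storage_types))
    return None  # assumption: no single type to report


# A typed relational column holds one type; a document field may mix types.
column_type = unique_storage_type(storage_types_of([1, 2, 3]))
mixed_type = unique_storage_type(storage_types_of([1, "two", 3.0]))
```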
