For an introduction to data streams, see Data Streams.
Metadata - information about a field in a dataset or in a datastream.
Storage types:
- string - names, labels, short descriptions; mostly implemented as VARCHAR type in a database, or found as CSV file fields
- text - longer texts, long descriptions, articles
- integer - discrete values
- float
- boolean - binary value, mostly implemented as small integer
- date
Analytical types:
Type | Description
---|---
set | Values represent categories, such as colors or contract types. Fields of this type might be numbers that represent, for example, group numbers, but they have no mathematical interpretation. For example, addition of group numbers 1+2 has no meaning.
ordered_set | Similar to the set field type, but values can be arranged in a meaningful order.
discrete | Set of integers. Values can be ordered and one can perform arithmetic operations on them, such as: 1 contract + 2 contracts = 3 contracts.
flag | Special case of the set type where values can be one of two kinds, such as 1 or 0, ‘yes’ or ‘no’, ‘true’ or ‘false’.
range | Numerical value, such as a financial amount or temperature.
default | Analytical type is not explicitly set and the default type for the field's storage type is used. Refer to the table of default types.
typeless | Field has no analytical relevance.
Default analytical types:
- integer is discrete
- float is range
- unknown, string, text, date are typeless
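The defaults listed above can be pictured as a simple lookup. The following is a minimal sketch, not the library's implementation: the names `DEFAULT_ANALYTICAL_TYPES` and `resolve_analytical_type` are hypothetical, and the fallback to typeless for storage types without a listed default (such as boolean) is an assumption.

```python
# Hypothetical lookup table built from the defaults listed in this section.
DEFAULT_ANALYTICAL_TYPES = {
    "integer": "discrete",
    "float": "range",
    "unknown": "typeless",
    "string": "typeless",
    "text": "typeless",
    "date": "typeless",
}


def resolve_analytical_type(storage_type, analytical_type="default"):
    """Return the effective analytical type of a field.

    An explicitly set analytical type wins; "default" is resolved
    through the storage type.
    """
    if analytical_type != "default":
        return analytical_type
    # Assumption: storage types with no listed default are treated as typeless.
    return DEFAULT_ANALYTICAL_TYPES.get(storage_type, "typeless")
```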
Create a list of Field objects from a list of strings, dictionaries or tuples.
How fields are constructed:
For strings, and for tuples or dicts in which a value is not explicitly specified, the following rules apply:
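A minimal sketch of building fields from the three accepted input forms. `Field` and `make_fields` here are hypothetical stand-ins, not the library's classes, and both the tuple layout and the placeholder values used for unspecified types are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical stand-in for a field description."""
    name: str
    storage_type: str = "unknown"      # assumed placeholder for "not specified"
    analytical_type: str = "default"   # assumed placeholder for "not specified"


def make_fields(descriptions):
    """Build a list of Field objects from strings, tuples or dicts."""
    fields = []
    for desc in descriptions:
        if isinstance(desc, Field):
            fields.append(desc)
        elif isinstance(desc, str):
            # A bare string carries only the field name.
            fields.append(Field(desc))
        elif isinstance(desc, dict):
            # Dict keys are passed as keyword arguments.
            fields.append(Field(**desc))
        else:
            # Assumed tuple layout: (name, storage_type, analytical_type).
            fields.append(Field(*desc))
    return fields
```

For example, `make_fields(["id", ("amount", "float", "range"), {"name": "year", "storage_type": "integer"}])` yields three fields, with only the explicitly given types filled in.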
Add field to list of fields.
If field is not a Field object, then a new field is constructed as follows:
For strings, and for tuples or dicts in which a value is not explicitly specified, the following rules apply:
Return a shallow copy of the list.
Return a field with name name
Return a tuple with indexes of fields from fieldlist in a data row.
Return index of a field
Return a tuple with indexes of fields from fields in a data row. Fields should be a list of Field objects or strings.
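Field indexes are mainly useful for picking values out of a row tuple. The sketch below illustrates the idea with plain field names; `field_indexes` is a hypothetical helper, not a method of the library.

```python
def field_indexes(all_names, wanted):
    """Return a tuple with the positions of `wanted` names within `all_names`."""
    positions = {name: index for index, name in enumerate(all_names)}
    return tuple(positions[name] for name in wanted)


names = ["id", "name", "amount", "year"]
indexes = field_indexes(names, ["amount", "id"])

# Use the indexes to select values from a data row.
row = (10, "Contract A", 1200.0, 2011)
selected = tuple(row[i] for i in indexes)
```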
Return names of fields in the list.
Create a FieldList from a list of strings, dictionaries or tuples.
How fields are constructed:
For strings, and for tuples or dicts in which a value is not explicitly specified, the following rules apply:
Shared methods for data targets and data sources
Returns a list of field names. This is a shortcut for extracting the field.name attribute from the list of field objects returned by fields().
Information about fields: tuple of Field objects representing fields passed through the receiving stream - either read from data source (DataSource.rows()) or written to data target (DataTarget.append()).
Subclasses should implement the fields property getter. Implementing the fields setter is optional.
Implementing the fields setter is recommended for DataSource subclasses such as CSV files or typeless document-based databases, for example to explicitly specify field names for CSVs without headers, or to specify field analytical or storage types for further processing. A setter is also recommended for DataTarget subclasses that create datasets (new CSV file, non-existing tables).
Subclasses might put finalisation code here, for example:
Default implementation does nothing.
Delayed stream initialisation code. Subclasses might override this method to implement file or handle opening, connecting to a database, doing web authentication, ... By default this method does nothing.
The method does not take any arguments; it expects a pre-configured object.
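The lifecycle described above (configure the object, initialize, use, finalize) might look as follows in a subclass. This is a sketch under stated assumptions: `ListStream` and its `Field` tuple are hypothetical and do not derive from the library's base class.

```python
from collections import namedtuple

# Hypothetical, simplified field description.
Field = namedtuple("Field", ["name", "storage_type"])


class ListStream:
    """Hypothetical in-memory stream illustrating the described lifecycle."""

    def __init__(self, data):
        # Configuration only; nothing is opened here.
        self.data = data
        self._fields = ()
        self.opened = False

    @property
    def fields(self):
        return self._fields

    @fields.setter
    def fields(self, value):
        # Optional setter: lets the caller supply or adjust field metadata.
        self._fields = tuple(value)

    @property
    def field_names(self):
        return [field.name for field in self._fields]

    def initialize(self):
        # Delayed initialisation: a real stream would open a file or connect here.
        self.opened = True

    def finalize(self):
        # A real stream would close handles or connections here.
        self.opened = False
```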
Input data stream - for reading.
Read field descriptions from the data source. You should use this for datasets that do not provide metadata directly, such as CSV files, document-based databases or directories with structured files. Does nothing in relational databases, as fields are represented by table columns and table metadata can be obtained from the database easily.
Note that this method can be quite costly, as by default all records within the dataset are read and analysed.
After this method is executed, the stream's fields are set to the newly read field list and may be configured further (for example, to set more appropriate data types).
Returns: tuple with Field objects. Order of fields is datastore adapter specific.
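To show why reading fields can be costly, here is a rough sketch of inferring field names and storage types by scanning records. The function name mirrors the method described above, but the `limit` parameter, the type mapping and the first-seen field order are all assumptions of this sketch.

```python
def read_fields(records, limit=None):
    """Scan dict records and return a tuple of (name, storage_type) pairs.

    By default every record is read, which is what makes this expensive.
    """
    type_names = {bool: "boolean", int: "integer", float: "float", str: "string"}
    seen = {}
    for count, record in enumerate(records):
        if limit is not None and count >= limit:
            break
        for key, value in record.items():
            if value is None:
                # Remember the field, but a missing value says nothing about its type.
                seen.setdefault(key, None)
                continue
            storage = type_names.get(type(value), "unknown")
            previous = seen.get(key)
            if previous is None:
                seen[key] = storage
            elif previous != storage:
                # Conflicting types across records: give up on a specific type.
                seen[key] = "unknown"
    return tuple((name, storage or "unknown") for name, storage in seen.items())
```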
Return iterable object with dict objects. This is one of two methods for reading from data source. Subclasses should implement this method.
Return iterable object with tuples. This is one of two methods for reading from data source. Subclasses should implement this method.
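The two reading methods differ only in the shape of what they yield. A minimal sketch, assuming a hypothetical in-memory `TupleSource` in which records are derived from rows and field names:

```python
class TupleSource:
    """Hypothetical source that serves the same data as tuples or as dicts."""

    def __init__(self, field_names, data):
        self.field_names = list(field_names)
        self.data = data

    def rows(self):
        # One tuple per row; value position follows the field order.
        for row in self.data:
            yield tuple(row)

    def records(self):
        # One dict per row; keys are field names.
        for row in self.rows():
            yield dict(zip(self.field_names, row))
```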
Output data stream - for writing.
Append an object to the dataset. The object can be a tuple, array or dict. If a tuple or array is used, value positions should correspond to field positions in the field list. If a dict is used, the keys should be valid field names.
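The two accepted input shapes can be normalised to one internal form. The sketch below uses a hypothetical `ListTarget`; the validation it performs (rejecting unknown keys and rows of the wrong length) is an assumption, not documented behaviour.

```python
class ListTarget:
    """Hypothetical in-memory target that stores appended objects as tuples."""

    def __init__(self, field_names):
        self.field_names = list(field_names)
        self.rows = []

    def append(self, obj):
        if isinstance(obj, dict):
            # Keys should be valid field names (assumed to be enforced here).
            unknown = set(obj) - set(self.field_names)
            if unknown:
                raise KeyError("unknown fields: %s" % sorted(unknown))
            row = tuple(obj.get(name) for name in self.field_names)
        else:
            # Positions should correspond to the field list.
            if len(obj) != len(self.field_names):
                raise ValueError("row length does not match field list")
            row = tuple(obj)
        self.rows.append(row)
```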
Creates a CSV data source stream.
Note: avoid auto-detection when you are reading from remote URL stream.
Initialize CSV source stream:
- create CSV reader object
- read CSV headers if requested and initialize stream fields
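The two initialisation steps can be sketched with the standard `csv` module. `SimpleCSVSource` and its `read_header` parameter are hypothetical names chosen for this illustration and do not reproduce the library's constructor.

```python
import csv
import io


class SimpleCSVSource:
    """Hypothetical CSV source following the two steps listed above."""

    def __init__(self, handle, read_header=True, field_names=None):
        self.handle = handle
        self.read_header = read_header
        self.field_names = field_names
        self.reader = None

    def initialize(self):
        # Step 1: create the CSV reader object.
        self.reader = csv.reader(self.handle)
        # Step 2: read the header row if requested and use it as field names.
        if self.read_header:
            self.field_names = next(self.reader)

    def rows(self):
        for row in self.reader:
            yield tuple(row)


source = SimpleCSVSource(io.StringIO("id,name\n1,a\n2,b\n"))
source.initialize()
```

For a file without a header row, the caller would pass `read_header=False` and supply `field_names` explicitly.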
Creates an XLS spreadsheet data source stream.
Initialize XLS source stream:
Creates a MongoDB data source stream.
Initialize Mongo source stream:
Creates a Google Spreadsheet data source stream.
You should provide either spreadsheet_key or spreadsheet_name. If more than one spreadsheet with the given name is found, the first one in the list returned by Google is used.
For worksheet selection you should provide either worksheet_id or worksheet_name. If more than one worksheet with the given name is found, the first one in the list returned by Google is used. If neither worksheet_id nor worksheet_name is provided, the first worksheet in the workbook is used.
For details on query string syntax see the section on sq under http://code.google.com/apis/spreadsheets/reference.html#list_Parameters
Connect to Google Documents and authenticate.
Creates a YAML directory data source stream.
The data source reads files from a directory and treats each file as a single record. For example, the following directory contains 3 records:
data/
    contract_0.yml
    contract_1.yml
    contract_2.yml
Optionally, one can specify a field where the file name will be stored.
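The file-per-record idea can be sketched with the standard library alone. To avoid a third-party YAML parser, this illustration reads JSON files instead of YAML; `directory_records` and its parameters are hypothetical names, not the library's API.

```python
import json
import tempfile
from pathlib import Path


def directory_records(path, extension="json", filename_field=None):
    """Yield one dict per file in `path`, in file name order.

    JSON stands in for YAML here so the sketch needs no extra packages.
    """
    for file_path in sorted(Path(path).glob("*." + extension)):
        with file_path.open() as handle:
            record = json.load(handle)
        if filename_field:
            # Optionally store the source file name in the record.
            record[filename_field] = file_path.name
        yield record


# Usage: three files in a directory produce three records.
with tempfile.TemporaryDirectory() as directory:
    for number in range(3):
        file_path = Path(directory, "contract_%d.json" % number)
        file_path.write_text(json.dumps({"id": number}))
    contracts = list(directory_records(directory, filename_field="file"))
```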
Creates a directory data target with YAML files as records.
Creates a relational database data source stream.
Note: avoid auto-detection when you are reading from remote URL stream.
Initialize source stream:
Creates a relational database data target stream.
Note: avoid auto-detection when you are reading from remote URL stream.
Initialize target stream:
Target stream for auditing data values from a stream. For more information about probed value properties, refer to brewery.dq.FieldStatistics.
Probe row or record and update statistics.
Return field statistics as a dictionary: keys are field names, values are brewery.dq.FieldStatistics objects.
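A rough sketch of the auditing idea: each appended row or record updates per-field statistics, which are then returned keyed by field name. `SimpleStats` and `AuditTarget` are hypothetical, and the properties tracked here (value count, null count, distinct values) are assumptions rather than the contents of brewery.dq.FieldStatistics.

```python
class SimpleStats:
    """Hypothetical per-field statistics."""

    def __init__(self):
        self.count = 0
        self.null_count = 0
        self.distinct = set()

    def probe(self, value):
        self.count += 1
        if value is None:
            self.null_count += 1
        else:
            # Values are assumed to be hashable.
            self.distinct.add(value)


class AuditTarget:
    """Hypothetical target that probes values instead of storing them."""

    def __init__(self, field_names):
        self.field_names = list(field_names)
        self.stats = {name: SimpleStats() for name in self.field_names}

    def append(self, obj):
        # Accept a dict record or a positional row.
        if isinstance(obj, dict):
            values = [obj.get(name) for name in self.field_names]
        else:
            values = list(obj)
        for name, value in zip(self.field_names, values):
            self.stats[name].probe(value)

    @property
    def field_statistics(self):
        # Keys are field names, values are statistics objects.
        return self.stats
```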