Node Reference¶

Sources¶

CSV Source¶

Synopsis: Read data from a comma separated values (CSV) file.

Class: CSVSourceNode

Source node that reads a comma separated values file from a filesystem or a remote URL.

It is recommended to configure node fields before running. If you do not, fields are read from the file header when the read_header flag is set. In that case field storage types are set to string and the analytical type is set to typeless.

Attributes
attribute	description
resource	File name or URL containing comma separated values
fields	fields contained in the file
read_header	flag determining whether first line contains header or not
skip_rows	number of rows to be skipped
encoding	resource data encoding, by default no conversion is performed
delimiter	record delimiter character, default is comma ‘,’
quotechar	character used for quoting string values, default is double quote
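
The attributes above correspond closely to options of Python's standard csv module. The following is a minimal sketch of those semantics, not Brewery's implementation: the read_csv_rows helper and the sample data are hypothetical, and the order of skipping rows before reading the header is an assumption.

```python
import csv
import io

def read_csv_rows(handle, read_header=True, skip_rows=0,
                  delimiter=",", quotechar='"'):
    """Return (fields, rows) read from an open text handle."""
    reader = csv.reader(handle, delimiter=delimiter, quotechar=quotechar)
    # Assumption: rows are skipped before the header line is read.
    for _ in range(skip_rows):
        next(reader, None)
    # Without a header the field names are unknown (None in this sketch).
    fields = next(reader, None) if read_header else None
    rows = list(reader)
    return fields, rows

# Hypothetical input: one junk line, a header, and two data lines.
sample = io.StringIO(
    'generated by export tool\n'
    'name,amount\n'
    '"Smith, J.",100\n'
    'Jones,250\n'
)
fields, rows = read_csv_rows(sample, read_header=True, skip_rows=1)
print(fields)
print(rows)
```

Every value comes back as a string, which is consistent with the note that unconfigured fields get the string storage type.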

Record List Source¶

Synopsis: Provide list of dict objects as data source.

Class: RecordListSourceNode

Source node that feeds records (dictionary objects) from a list (or any other iterable) object.

Attributes
attribute	description
list	List of records represented as dictionaries.
fields	Fields in the list.

Row List Source¶

Synopsis: Provide list of lists or tuples as data source.

Class: RowListSourceNode

Source node that feeds rows (list/tuple of values) from a list (or any other iterable) object.

Attributes
attribute	description
list	List of rows represented as lists or tuples.
fields	Fields in the list.
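
Records and rows carry the same data in two shapes: a record is a dictionary keyed by field name, a row is a tuple ordered by the field list. A small sketch of the relationship, using hypothetical helper names and sample data:

```python
fields = ["id", "name"]
records = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

def records_to_rows(records, fields):
    """Order each record's values by the field list."""
    return [tuple(record[name] for name in fields) for record in records]

def rows_to_records(rows, fields):
    """Pair each row's values with the field names."""
    return [dict(zip(fields, row)) for row in rows]

rows = records_to_rows(records, fields)
print(rows)
print(rows_to_records(rows, fields) == records)
```

This is why both list sources take a fields attribute: rows carry no names of their own, and records carry no ordering.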

Data Stream Source¶

Synopsis: Generic data stream data source node.

Class: StreamSourceNode

Generic data stream source. Wraps a brewery.ds data source and feeds data to the output.

The source data stream should configure fields on initialize().

Note that this node is only for programmatically created processing streams. It is not usable in visual, web or other stream modelling tools.

Attributes
attribute	description
stream	Data stream object.

YAML Directory Source¶

Synopsis: Read data from a directory containing YAML files

Class: YamlDirectorySourceNode

Source node that reads data from a directory containing YAML files.

The data source reads files from a directory and treats each file as a single record. For example, the following directory contains 3 records:

data/
    contract_0.yml
    contract_1.yml
    contract_2.yml

Optionally one can specify a field where file name will be stored.

Attributes
attribute	description
path	Path to a directory
extension	file extension to look for, default is yml. If none is given, then all regular files in the directory are read.
filename_field	name of a new field that will contain file name

Record Operations¶

Aggregate Node¶

Synopsis: Aggregate values grouping by key fields.

Class: AggregateNode

Aggregate values of records grouped by key fields.

Attributes
attribute	description
keys	List of fields according to which records are grouped
record_count_field	Name of a field where record count will be stored. Default is record_count
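
The grouping and record count described by these attributes can be sketched in plain Python. This covers only the counting part and is not the node's implementation; count_by_keys and the sample records are hypothetical.

```python
from collections import Counter

def count_by_keys(records, keys, record_count_field="record_count"):
    """Group records by the key fields and count records per group."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    result = []
    for key_values, count in counts.items():
        row = dict(zip(keys, key_values))
        row[record_count_field] = count
        result.append(row)
    return result

records = [
    {"region": "north", "year": 2010, "amount": 10},
    {"region": "north", "year": 2010, "amount": 5},
    {"region": "south", "year": 2010, "amount": 7},
]
groups = count_by_keys(records, ["region", "year"])
print(groups)
```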

Append¶

Synopsis: Concatenate input streams.

Class: AppendNode

Sequentially append input streams. Concatenation order reflects input stream order. The input streams should have the same set of fields.
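
In plain Python terms, appending streams behaves like itertools.chain over the inputs. A sketch with hypothetical data, not the node's code:

```python
from itertools import chain

first = [{"id": 1}, {"id": 2}]
second = [{"id": 3}]

# Records of the first input come out before records of the second.
appended = list(chain(first, second))
print(appended)
```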

Data Audit¶

Synopsis: Perform basic data audit.

Class: AuditNode

The node checks the stream for empty strings, unfilled (null) values and the number of distinct values.

The audit node passes the following fields to the output:

field_name - name of a field from input

record_count - number of records

null_count - number of records with null value for the field

null_record_ratio - ratio of null count to number of records

empty_string_count - number of strings that are empty (for fields of type string)

distinct_count - number of distinct values (if less than distinct threshold). Set to None if there are more distinct values than distinct_threshold.

Attributes
attribute	description
distinct_threshold	number of distinct values to be tested. If there are more distinct values than the threshold, they are no longer collected and the resulting distinct_count is set to None
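
To make the output fields concrete, here is a minimal sketch that computes them for one field over a list of records. It is an illustration under assumptions, not the node's implementation: audit_field is a hypothetical helper, and leaving null values out of the distinct count is a choice made for this sketch.

```python
def audit_field(records, field_name, distinct_threshold=10):
    """Compute the audit statistics for a single field."""
    record_count = null_count = empty_string_count = 0
    distinct_values = set()  # becomes None once the threshold is exceeded
    for record in records:
        record_count += 1
        value = record.get(field_name)
        if value is None:
            null_count += 1
            continue  # assumption: nulls are not counted as distinct values
        if value == "":
            empty_string_count += 1
        if distinct_values is not None:
            distinct_values.add(value)
            if len(distinct_values) > distinct_threshold:
                distinct_values = None
    return {
        "field_name": field_name,
        "record_count": record_count,
        "null_count": null_count,
        "null_record_ratio": null_count / record_count if record_count else 0.0,
        "empty_string_count": empty_string_count,
        "distinct_count": None if distinct_values is None else len(distinct_values),
    }

records = [{"city": "Oslo"}, {"city": ""}, {"city": None}, {"city": "Oslo"}]
stats = audit_field(records, "city")
print(stats)
```

Lowering distinct_threshold below the actual number of distinct values makes distinct_count come back as None, as described above.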

Distinct Node¶

Synopsis: Pass only distinct records (discard duplicates) or pass only duplicates

Class: DistinctNode

Node will pass distinct records with given distinct fields.

If discard is False then first record with distinct keys is passed to the output. This is used to find all distinct key values.

If discard is True then the first record with distinct keys is discarded and all duplicate records with the same key values are passed to the output. This mode is used to find duplicate records. For example: there should be only one invoice per organisation per month. Set distinct_fields to organisation and month, and set discard to True. Running this node should give no records on output if there are no duplicates.

Attributes
attribute	description
distinct_fields	List of key fields that will be considered when comparing records
discard	Flag controlling which records are passed. If False, the first record for each distinct key is passed. If True, the first record is discarded and only duplicates are passed.
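
Both modes can be sketched with a set of already seen keys. This is an illustration of the described behaviour with a hypothetical distinct function and sample invoices, not the node's code.

```python
def distinct(records, distinct_fields, discard=False):
    """Yield first occurrences (discard=False) or duplicates (discard=True)."""
    seen = set()
    for record in records:
        key = tuple(record[name] for name in distinct_fields)
        first_time = key not in seen
        seen.add(key)
        # Pass first occurrences when discard is False, repeats when True.
        if first_time != discard:
            yield record

invoices = [
    {"organisation": "A", "month": 1, "number": 1},
    {"organisation": "A", "month": 1, "number": 2},
    {"organisation": "B", "month": 1, "number": 3},
]
keys = ["organisation", "month"]
unique = list(distinct(invoices, keys))
duplicates = list(distinct(invoices, keys, discard=True))
print([r["number"] for r in unique])
print([r["number"] for r in duplicates])
```

An empty duplicates list corresponds to the "no records on output" case in the invoice example.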

Merge Node¶

Synopsis: no description

Class: MergeNode

Merge two or more streams (join)

Sample Node¶

Synopsis: Pass data sample from input to output.

Class: SampleNode

Create a data sample from the input stream. There are several sampling possibilities:

fixed number of records
% of records, random (not yet implemented)
get each n-th record (not yet implemented)

Node can work in two modes: pass sample to the output or discard sample and pass the rest. The mode is controlled through the discard flag. When it is false, then sample is passed and rest is discarded. When it is true, then sample is discarded and rest is passed.

Attributes
attribute	description
sample_size	Size of the sample to be passed to the output
discard	flag whether the sample is discarded or included
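
The two modes for a fixed-size sample can be sketched as follows. The sketch assumes the sample is the first sample_size records of the stream, which the description does not state explicitly; take_sample is a hypothetical helper.

```python
from itertools import islice

def take_sample(records, sample_size, discard=False):
    """Return the sample, or the rest of the stream when discard is True."""
    iterator = iter(records)
    # Assumption: the sample is the first sample_size records.
    sample = list(islice(iterator, sample_size))
    # After islice, the iterator is positioned just past the sample.
    return list(iterator) if discard else sample

passed = take_sample(range(10), 3)
rest = take_sample(range(10), 3, discard=True)
print(passed)
print(rest)
```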

Select¶

Synopsis: Select records by a predicate function.

Class: SelectNode

Select records for which a predicate function evaluates as true.

Example: configure a node that will select records where amount field is greater than 100

def select_greater_than(value, threshold):
    return value > threshold

node.function = select_greater_than
node.fields = ["amount"]
node.kwargs = {"threshold": 100}

The discard flag controls the behaviour of the node: if set to True, the selection is inverted and records that the function evaluates as True are discarded. Default is False: selected records are passed to the output.

Attributes
attribute	description
function	Predicate function. Should be a callable object.
fields	List of field names to be passed to the function.
discard	flag whether the selection is discarded or included
kwargs	Keyword arguments passed to the predicate function
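
How the four attributes could fit together is sketched below: values of the listed fields are passed as positional arguments, kwargs as keyword arguments. This is an assumption about the calling convention based on the example above, not the node's source; the select function and sample records are hypothetical.

```python
def select_greater_than(value, threshold):
    return value > threshold

def select(records, function, fields, kwargs=None, discard=False):
    """Yield records selected by the predicate, or the others if discard."""
    kwargs = kwargs or {}
    for record in records:
        args = [record[name] for name in fields]
        selected = bool(function(*args, **kwargs))
        if selected != discard:
            yield record

records = [{"amount": 50}, {"amount": 150}, {"amount": 100}]
kept = list(select(records, select_greater_than, ["amount"], {"threshold": 100}))
dropped = list(select(records, select_greater_than, ["amount"],
                      {"threshold": 100}, discard=True))
print(kept)
print(dropped)
```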

Set Select¶

Synopsis: Select records where a field value is from a predefined set of values.

Class: SetSelectNode

Select records where field value is from predefined set of values.

Use case examples:

records from certain regions in region field
records where quality status field is low or medium

Attributes
attribute	description
field	Field to be tested.
value_set	set of values that will be used for record selection
discard	flag whether the selection is discarded or included
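
The selection amounts to a set membership test on one field. A minimal sketch with a hypothetical set_select helper and sample data:

```python
def set_select(records, field, value_set, discard=False):
    """Keep records whose field value is in value_set (or not, if discard)."""
    values = set(value_set)
    return [r for r in records if (r[field] in values) != discard]

records = [
    {"id": 1, "quality": "low"},
    {"id": 2, "quality": "high"},
    {"id": 3, "quality": "medium"},
]
selected = set_select(records, "quality", {"low", "medium"})
others = set_select(records, "quality", {"low", "medium"}, discard=True)
print([r["id"] for r in selected])
print([r["id"] for r in others])
```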

Field Operations¶

Binning¶

Synopsis: Derive a field based on binned values (histogram)

Class: BinningNode

Derive a bin/category field from a value.

Note: this node is not yet implemented

Binning modes:

fixed width (for example: by 100)
fixed number of fixed-width bins
n-tiles by count or by sum
record rank

Coalesce Value To Type Node¶

Synopsis: Coalesce Value to Type

Class: CoalesceValueToTypeNode

Coalesce values of selected fields, or fields of given type to match the type.

string, text
- Strip strings
- if non-string, then it is converted to a unicode string
- Change empty strings to empty (null) values
float, integer
- If value is of string type, perform string cleansing first and then convert them to respective numbers or to null on failure

Attributes
attribute	description
fields	List of fields to be cleansed. If none given then all fields of known storage type are cleansed
types	List of field types to be coalesced (if no fields given)
empty_values	dictionary of type -> value pairs to be set when field is considered empty (null) - not yet used
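
The rules listed above can be read as two small functions, one for string types and one for numeric types. The sketch below is one interpretation of those rules; coalesce_string and coalesce_number are hypothetical names, not part of the library.

```python
def coalesce_string(value):
    """Strip the value; convert non-strings to str; empty becomes None."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None

def coalesce_number(value, number_type):
    """Cleanse strings first, then convert; return None on failure."""
    if isinstance(value, str):
        value = coalesce_string(value)
    if value is None:
        return None
    try:
        return number_type(value)
    except (TypeError, ValueError):
        return None

print(coalesce_string("  contract "))
print(coalesce_string("   "))
print(coalesce_number(" 42 ", int))
print(coalesce_number("n/a", float))
```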

Field Map¶

Synopsis: Rename or drop fields from the stream.

Class: FieldMapNode

Node renames input fields or drops them from the stream.

Attributes
attribute	description
map_fields	Dictionary of input to output field name.
drop_fields	List of fields to be dropped from the stream.
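
Applied to a single record, renaming and dropping can be sketched as one dictionary comprehension. The helper name and data are hypothetical; the node itself operates on a stream and its field metadata.

```python
def map_record(record, map_fields=None, drop_fields=None):
    """Rename fields per map_fields and leave out fields in drop_fields."""
    map_fields = map_fields or {}
    drop_fields = set(drop_fields or [])
    return {
        map_fields.get(name, name): value
        for name, value in record.items()
        if name not in drop_fields
    }

record = {"nazov": "Projekt", "suma": 100, "note": "internal"}
mapped = map_record(record,
                    map_fields={"nazov": "name", "suma": "amount"},
                    drop_fields=["note"])
print(mapped)
```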

String Strip¶

Synopsis: Strip characters.

Class: StringStripNode

Strip spaces (or other specified characters) from string fields.

Attributes
attribute	description
fields	List of string fields to be stripped. If none specified, then all fields of storage type string are stripped
chars	Characters to be stripped. By default all white-space characters are stripped.
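
The chars attribute behaves like the argument of Python's str.strip, where None means all white-space characters. A sketch on a single record, with a hypothetical strip_fields helper:

```python
def strip_fields(record, fields=None, chars=None):
    """Strip string values of the given fields (all fields if None)."""
    result = dict(record)
    names = fields if fields is not None else list(record)
    for name in names:
        value = result[name]
        if isinstance(value, str):
            # chars=None strips white space, as str.strip does by default.
            result[name] = value.strip(chars)
    return result

record = {"name": "  Ann ", "code": "--x--", "count": 1}
print(strip_fields(record))
print(strip_fields(record, fields=["code"], chars="-"))
```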

Text Substitute¶

Synopsis: Substitute text in a field using regular expression.

Class: TextSubstituteNode

Substitute text in a field using regular expression.

Attributes
attribute	description
field	Field containing a string or text value where substitution will be applied
derived_field	Field where substitution result will be stored. If not set, then original field will be replaced with new value.
substitutions	List of substitutions: each substitution is a two-element tuple (pattern, replacement) where pattern is a regular expression that will be replaced using replacement
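
A list of (pattern, replacement) tuples applied in order can be sketched with re.sub. This illustrates the attribute semantics on one record and is not the node's code; the substitute helper and the phone number are hypothetical.

```python
import re

def substitute(record, field, substitutions, derived_field=None):
    """Apply each (pattern, replacement) in order to the field value."""
    value = record[field]
    for pattern, replacement in substitutions:
        value = re.sub(pattern, replacement, value)
    result = dict(record)
    # Without derived_field the original field is overwritten.
    result[derived_field or field] = value
    return result

record = {"phone": "+421 (2) 123-456"}
cleaned = substitute(record, "phone", [(r"[^0-9+]", "")],
                     derived_field="phone_clean")
print(cleaned)
```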

Value Threshold¶

Synopsis: Bin values based on a threshold.

Class: ValueThresholdNode

Create a field that will refer to a value bin based on threshold(s). Values of range type can be compared against one or two thresholds to get low/high or low/medium/high value bins.

Note: this node is not yet implemented

The result is stored in a separate field that will be constructed from source field name and prefix/suffix.

For example:

amount < 100 is low
100 <= amount <= 1000 is medium
amount > 1000 is high

Generated field will be amount_threshold and will contain one of three possible values: low, medium, high

Another possible use case might be for binning after data audit: we want to measure null record count and we set thresholds:

ratio < 5% is ok

5% <= ratio <= 15% is fair

ratio > 15% is bad

We set thresholds as (0.05, 0.15) and values to ("ok", "fair", "bad")

Attributes
attribute	description
thresholds	List of fields of range type and threshold tuples (field, low, high) or (field, low)
bin_names	Names of bins based on threshold. Default is low, medium, high
prefix	field prefix to be used, default is none.
suffix	field suffix to be used, default is ‘_bin’
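
Since the node is not yet implemented, the following is only a sketch of the comparison rules given in the examples above (value below the low threshold, between the thresholds inclusive, above the high threshold). The threshold_bin function is hypothetical.

```python
def threshold_bin(value, thresholds, bin_names=("low", "medium", "high")):
    """Return the bin name for value given one or two thresholds."""
    if value < thresholds[0]:
        return bin_names[0]
    if len(thresholds) > 1 and value <= thresholds[1]:
        return bin_names[1]
    # Above the last threshold (or at/above the only threshold).
    return bin_names[-1]

for amount in (50, 100, 1000, 1001):
    print(amount, threshold_bin(amount, (100, 1000)))

print(threshold_bin(0.10, (0.05, 0.15), ("ok", "fair", "bad")))
```

With a single threshold the same function yields only the first and last bin names, matching the low/high case.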

Targets¶

CSV Target¶

Synopsis: Write rows as comma separated values into a file

Class: CSVTargetNode

Node that writes rows into a comma separated values (CSV) file.

Attributes:

resource - target object, might be a filename or file-like object

write_headers - write field names as headers into output file

truncate - remove data from file before writing, default: True

Attributes
attribute	description
resource	Target object - file name or IO object.
write_headers	Flag determining whether to write field names as file headers.
truncate	If set to True, all data in the file are removed before writing. Default is True.
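
The effect of write_headers and truncate can be illustrated with the standard csv module. This is a sketch under the assumption that truncate corresponds to opening the file for writing rather than appending; write_csv is a hypothetical helper, not the node's code.

```python
import csv
import os
import tempfile

def write_csv(path, fields, rows, write_headers=True, truncate=True):
    """Write rows to path, optionally with a header line."""
    # Assumption: truncate=False appends to existing file content.
    mode = "w" if truncate else "a"
    with open(path, mode, newline="") as handle:
        writer = csv.writer(handle)
        if write_headers:
            writer.writerow(fields)
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_csv(path, ["id", "name"], [(1, "Ann")])
write_csv(path, ["id", "name"], [(2, "Bob")], write_headers=False, truncate=False)

with open(path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```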

Formatted Printer¶

Synopsis: Print input using a string formatter to an output IO stream

Class: FormattedPrinterNode

Target node that will print output based on format.

Refer to the python formatting guide:

http://docs.python.org/library/string.html

Example:

Suppose we have data with information about donations. We want to pretty-print two fields, project and requested_amount, in the form:

Hlavicka - makovicka                                            27550.0
Obecna kniznica - symbol moderneho vzdelavania                 132000.0
Vzdelavanie na europskej urovni                                 60000.0

Node for given format is created by:

node = FormattedPrinterNode(format = u"{project:<50.50} {requested_amount:>20}")

Following format can be used to print output from an audit node:

node.header = u"field                            nulls      empty   distinct\n" \
               "------------------------------------------------------------"
node.format = u"{field_name:<30.30} {null_record_ratio: >7.2%} "\
               "{empty_string_count:>10} {distinct_count:>10}"

Output will look similar to this:

field                            nulls      empty   distinct
------------------------------------------------------------
file                             0.00%          0         32
source_code                      0.00%          0          2
id                               9.96%          0        907
receiver_name                    9.10%          0       1950
project                          0.05%          0       3628
requested_amount                22.90%          0        924
received_amount                  4.98%          0        728
source_comment                  99.98%          0          2

Attributes
attribute	description
format	Format string to be used
output	IO object. If not set then sys.stdout will be used. If it is a string, then it is considered a filename.
delimiter	Record delimiter. By default it is new line character.
header	Header string - will be printed before printing first record
footer	Footer string - will be printed after all records are printed
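
The format strings above are ordinary Python str.format templates whose replacement fields are field names. The sketch below applies them to single records outside of any node; the record dictionaries are hypothetical sample data.

```python
record = {
    "project": "Vzdelavanie na europskej urovni",
    "requested_amount": 60000.0,
}
# Left-align and truncate project to 50 characters, right-align the amount.
line = "{project:<50.50} {requested_amount:>20}".format(**record)
print(line)

audit_record = {
    "field_name": "id",
    "null_record_ratio": 0.0996,
    "empty_string_count": 0,
    "distinct_count": 907,
}
audit_format = ("{field_name:<30.30} {null_record_ratio: >7.2%} "
                "{empty_string_count:>10} {distinct_count:>10}")
audit_line = audit_format.format(**audit_record)
print(audit_line)
```

Fixed widths in the format keep the columns aligned across records, which is what makes the header line in the audit example match the data lines.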

Record List Target¶

Synopsis: Store data as list of dictionaries (records)

Class: RecordListTargetNode

Target node that stores data from the input in a list of records (dictionary objects).

To get list of fields, ask for output_fields.

Attributes
attribute	description
records	Created list of records represented as dictionaries.

Row List Target¶

Synopsis: Store data as list of tuples

Class: RowListTargetNode

Target node that stores data from input in a list of rows (as tuples).

To get list of fields, ask for output_fields.

Attributes
attribute	description
rows	Created list of tuples.

Data Stream Target¶

Synopsis: Generic data stream data target node.

Class: StreamTargetNode

Generic data stream target. Wraps a brewery.ds data target and feeds data from the input to the target stream.

The data target should match stream fields.

Note that this node is only for programmatically created processing streams. It is not usable in visual, web or other stream modelling tools.

Attributes
attribute	description
stream	Data target object.
