Merge multiple CSV (or XLS) Files with common subset of columns into one CSV
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**Situation**: We have multiple CSV files, for example with grant listing, from various sources and from
various years. The files have couple common columns, such as grant receiver, grant amount, however they
might contain more additional information.

Files we have:

* grants_2008.csv contains `receiver`, `amount`, `date`
* grants_2009.csv contains `id`, `receiver`, `amount`, `contract_number`, `date`
* other_grants_2008.csv contains `receiver`, `subject`, `requested_amount`, `amount`, `date`


**Objective**: Create one CSV file by sequentially merging all input CSV files and using all columns.

Based on our source files we want output csv with fields:

* `receiver`
* `amount`
* `date`
* `id`
* `contract_number`
* `subject`
* `requested_amount`

In addition, we would like to know where the record comes from, therefore we add `file` which will
contain original filename.

Code
----

Import brewery and all other necessary packages:

.. code-block:: python

    import brewery.ds as ds
    import sys

Specify sources:

.. code-block:: python

    sources = [
        {"name": "grants_2008.csv", "fields": ["receiver", "amount", "date"]},
        {"name": "grants_2009.csv", "fields": ["id", "receiver", "amount", "contract_number", "date"]}.
        {"name": "other_grants_2008.csv", "fields": ["receiver", "subject", "requested_amount", "amount", 
                                                    "date"]}
    ]


It is highly recommended to explicitly name fileds contained within source files and do not rely on
source headers. In this case we also make sure, that the field (column) names are normalised. That means
that if in one file receiver is labeled just as "receiver" and in another it is "grant receiver" we will
get the same field.

You can store ``sources`` in an external file, for example as json or yaml and read it in your script.

Now collect all the fields:


.. code-block:: python

    # We want to store filename as origin of each row
    all_fields = ["file"]

    # Go through source definitions and collect the fields
    for source in sources:
        fields = source.get("fields")
        for field in fields:
            if field not in all_fields:
                all_fields.append(field)

Prepare the output stream into ``merged.csv`` and specify fields we have found in sources and want to
write into output:

.. code-block:: python

    out = ds.CSVDataTarget("merged.csv")
    out.fields = ds.fieldlist(all_fields)
    out.initialize()

Go through all sources and merge them:

.. code-block:: python

    for source in sources:
        path = source["path"]
        fields = source.get("fields")

        # Initialize data source: skip reading of headers - we are preparing them ourselves
        # use XLSDataSource for XLS files
        src = ds.CSVDataSource(path, read_header = False)
        src.fields = ds.fieldlist(fields)
        src.initialize()

        for record in src.records():

            # Add file reference into ouput - to know where the row comes from
            record["file"] = path
            out.append(record)

        # Close the source stream
        src.finalize()

Now you have a sparse CSV files which contains all rows from source CSV files in one ``merged.csv``.

You can have a directory with YAML files (one per record/row) as output instead of one CSV just by
changing data stream target. See :class:`brewery.ds.YamlDirectoryDataTarget` for more information.


.. code-block:: python

    out = ds.YamlDirectoryDataTarget("merged_grants")

Directory ``merged_grants`` must exist before running the script.

Or directly into a SQL database. The following will initialize SQL table target stream which will remove
all existing records from the table before inserting. Note that the table ``grants`` must exist in
database ``opendata`` and must contain columns with names equal to fields specified in ``all_fields``.
See :class:`brewery.ds.SQLDataTarget` for more information.

.. code-block:: python

    out = ds.SQLDataTarget(url = "postgres://localhost/opendata",
                           table = "grants",
                           truncate = True)


Refer to source streams and source targets in the API documentation for more information about
possibilities.


.. seealso:: 

    Module :mod:`brewery.ds`
        List of varous data sources and data targets.
    Function :func:`brewery.ds.fieldlist`
        All streams use list of :class:`brewery.ds.Field` objects for field metadata. This function will
        convert list of strings into list of instances of Field class.