.. module:: petl .. moduleauthor:: Alistair Miles petl - Extract, Transform and Load (Tables of Data) =================================================== :mod:`petl` is a Python package for extracting, transforming and loading tables of data. - Documentation: http://petl.readthedocs.org/ - Source Code: https://github.com/alimanfoo/petl - Download: http://pypi.python.org/pypi/petl - Mailing List: http://groups.google.com/group/python-etl For examples of :mod:`petl` in use, see the case studies below: * `Case Study 1 - Comparing Tables `_ Overview -------- The tables below gives an overview of the main functions in the :mod:`petl` module. See also the alphabetic :ref:`genindex` of all functions in the package. Functions for extracting tables from files and databases: +-----------------------+--------------------+----------------------+ | .. | Extract | .. | +=======================+====================+======================+ | :func:`fromcsv` | :func:`fromtsv` | :func:`frompickle` | +-----------------------+--------------------+----------------------+ | :func:`fromsqlite3` | :func:`fromdb` | :func:`fromtext` | +-----------------------+--------------------+----------------------+ | :func:`fromxml` | :func:`fromjson` | :func:`fromdicts` | +-----------------------+--------------------+----------------------+ Functions for transforming tables: +--------------------------------+-----------------------------+-------------------------------+ | .. | Transform | .. | +================================+=============================+===============================+ | :func:`rename` | :func:`setheader` | :func:`extendheader` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`pushheader` | :func:`skip` | :func:`skipcomments` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`rowslice` | :func:`head` | :func:`tail` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`cut` | :func:`cutout` | :func:`select` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectop` | :func:`selecteq` | :func:`selectne` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectlt` | :func:`selectle` | :func:`selectgt` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectge` | :func:`selectrangeopen` | :func:`selectrangeopenleft` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectrangeopenright` | :func:`selectrangeclosed` | :func:`selectin` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectnotin` | :func:`selectis` | :func:`selectisnot` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectisinstance` | :func:`selectre` | :func:`rowselect` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`recordselect` | :func:`rowlenselect` | :func:`fieldselect` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`facet` | :func:`rangefacet` | :func:`replace` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`replaceall` | :func:`convert` | :func:`convertall` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`fieldconvert` | :func:`convertnumbers` | :func:`sub` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`addfield` | :func:`capture` | :func:`split` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`unpack` | :func:`fieldmap` | :func:`rowmap` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`recordmap` | :func:`rowmapmany` | :func:`recordmapmany` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`cat` | :func:`duplicates` | :func:`conflicts` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`sort` | :func:`join` | :func:`leftjoin` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`rightjoin` | :func:`outerjoin` | :func:`crossjoin` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`antijoin` | :func:`complement` | :func:`diff` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`recordcomplement` | :func:`recorddiff` | :func:`intersection` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`aggregate` | :func:`rangeaggregate` | :func:`rangecounts` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`rowreduce` | :func:`recordreduce` | :func:`rangerowreduce` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`rangerecordreduce` | :func:`mergeduplicates` | :func:`mergesort` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`merge` | :func:`melt` | :func:`recast` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`transpose` | :func:`pivot` | :func:`hashjoin` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`hashleftjoin` | :func:`hashrightjoin` | :func:`hashantijoin` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`hashcomplement` | :func:`hashintersection` | :func:`flatten` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`unflatten` | :func:`annex` | :func:`unpackdict` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`unique` | :func:`fold` | :func:`addrownumbers` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`selectcontains` | :func:`search` | :func:`addcolumn` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`lookupjoin` | :func:`hashlookupjoin` | :func:`filldown` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`fillright` | :func:`fillleft` | :func:`multirangeaggregate` | +--------------------------------+-----------------------------+-------------------------------+ | :func:`unjoin` | :func:`distinct` | :func:`rowgroupmap` | +--------------------------------+-----------------------------+-------------------------------+ Functions for loading (writing) data to files and databases: +---------------------+-------------------------+------------------------+ | .. | Load | .. | +=====================+=========================+========================+ | :func:`tocsv` | :func:`appendcsv` | :func:`totsv` | +---------------------+-------------------------+------------------------+ | :func:`appendtsv` | :func:`topickle` | :func:`appendpickle` | +---------------------+-------------------------+------------------------+ | :func:`tosqlite3` | :func:`appendsqlite3` | :func:`todb` | +---------------------+-------------------------+------------------------+ | :func:`appenddb` | :func:`totext` | :func:`appendtext` | +---------------------+-------------------------+------------------------+ | :func:`tojson` | :func:`tojsonarrays` | .. | +---------------------+-------------------------+------------------------+ Other utility functions: +--------------------------------+--------------------------------+----------------------------+ | .. | Utilities | .. | +================================+================================+============================+ | :func:`header` | :func:`data` | :func:`dataslice` | +--------------------------------+--------------------------------+----------------------------+ | :func:`records` | :func:`fieldnames` | :func:`nrows` | +--------------------------------+--------------------------------+----------------------------+ | :func:`look` | :func:`lookall` | :func:`see` | +--------------------------------+--------------------------------+----------------------------+ | :func:`itervalues` | :func:`values` | :func:`valueset` | +--------------------------------+--------------------------------+----------------------------+ | :func:`valuecount` | :func:`valuecounts` | :func:`valuecounter` | +--------------------------------+--------------------------------+----------------------------+ | :func:`columns` | :func:`facetcolumns` | :func:`isunique` | +--------------------------------+--------------------------------+----------------------------+ | :func:`limits` | :func:`stats` | :func:`lenstats` | +--------------------------------+--------------------------------+----------------------------+ | :func:`stringpatterns` | :func:`stringpatterncounter` | :func:`rowlengths` | +--------------------------------+--------------------------------+----------------------------+ | :func:`typecounts` | :func:`typecounter` | :func:`typeset` | +--------------------------------+--------------------------------+----------------------------+ | :func:`parsecounts` | :func:`parsecounter` | :func:`dateparser` | +--------------------------------+--------------------------------+----------------------------+ | :func:`timeparser` | :func:`datetimeparser` | :func:`boolparser` | +--------------------------------+--------------------------------+----------------------------+ | :func:`parsenumber` | :func:`lookup` | :func:`lookupone` | +--------------------------------+--------------------------------+----------------------------+ | :func:`recordlookup` | :func:`recordlookupone` | :func:`statsum` | +--------------------------------+--------------------------------+----------------------------+ | :func:`adler32sum` | :func:`crc32sum` | :func:`expr` | +--------------------------------+--------------------------------+----------------------------+ | :func:`strjoin` | :func:`randomtable` | :func:`dummytable` | +--------------------------------+--------------------------------+----------------------------+ | :func:`diffheaders` | :func:`diffvalues` | :func:`heapqmergesorted` | +--------------------------------+--------------------------------+----------------------------+ | :func:`shortlistmergesorted` | :func:`progress` | :func:`clock` | +--------------------------------+--------------------------------+----------------------------+ | :func:`isordered` | :func:`rowgroupby` | :func:`nthword` | +--------------------------------+--------------------------------+----------------------------+ Introduction ------------ Installation ~~~~~~~~~~~~ This module is available from the `Python Package Index `_. On Linux distributions you should be able to do ``easy_install petl`` or ``pip install petl``. On Windows or Mac you can download manually, extract and run ``python setup.py install``. Dependencies and extensions ~~~~~~~~~~~~~~~~~~~~~~~~~~~ This package has been written with no dependencies other than the Python core modules, for ease of installation and maintenance. However, there are many third party packages which could usefuly be used with :mod:`petl`, e.g., providing access to data from Excel or other file types. Some extensions with these additional dependencies are provided by the `petlx `_ package, a companion package to :mod:`petl`. Conventions - row containers and row iterators ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This package defines the following convention for objects acting as containers of tabular data and supporting row-oriented iteration over the data. A *row container* (also referred to here informally as a *table*) is any object which satisfies the following: 1. implements the `__iter__` method 2. `__iter__` returns a *row iterator* (see below) 3. all row iterators returned by `__iter__` are independent, i.e., consuming items from one iterator will not affect any other iterators A *row iterator* is an iterator which satisfies the following: 4. each item returned by the iterator is either a list or a tuple 5. the first item returned by the iterator is a *header row* comprising a list or tuple of *fields* 6. each subsequent item returned by the iterator is a *data row* comprising a list or tuple of *data values* 7. a *field* is typically a string (`str` or `unicode`) but may be an object of any type as long as it implements `__str__`, is immutable (can be used as a dictionary key) and is pickleable 8. a *data value* is any pickleable object So, for example, the list of lists shown below is a row container:: >>> table = [['foo', 'bar'], ['a', 1], ['b', 2]] Note that, under this convention, an object returned by the :func:`csv.reader` function from the standard Python :mod:`csv` module is a row iterator and *not* a row container, because it can only be iterated over once, e.g.:: >>> from StringIO import StringIO >>> import csv >>> csvdata = """foo,bar ... a,1 ... b,2 ... """ >>> rowiterator = csv.reader(StringIO(csvdata)) >>> for row in rowiterator: ... print row ... ['foo', 'bar'] ['a', '1'] ['b', '2'] >>> for row in rowiterator: ... print row ... >>> # can only iterate once However, it is straightforward to define functions that support the above convention for row containers and provide access to data from CSV or other types of file or data source, see e.g. the :func:`fromcsv` function in this package. The main reason for requiring that row containers support independent row iterators (point 3) is that data from a table may need to be iterated over several times within the same program or interactive session. E.g., when using `petl` in an interactive session to build up a sequence of data transformation steps, the user might want to examine outputs from several intermediate steps, before all of the steps are defined and the transformation is executed in full. Note that this convention does not place any restrictions on the lengths of header and data rows. A table may return a header row and/or data rows of varying lengths. Transformation pipelines ~~~~~~~~~~~~~~~~~~~~~~~~ Note that this package makes extensive use of lazy evaluation and iterators. This means, generally, that a transformation will not actually be executed until data is requested. E.g., given the following data in a file at 'example1.csv' in the current working directory:: foo,bar,baz a,1,3.4 b,2,7.4 c,6,2.2 d,9,8.1 ...the following code does not actually read the file, nor does it load any of its contents into memory:: >>> from petl import * >>> table1 = fromcsv('example1.csv') Rather, `table1` is a row container object, which can be iterated over. Similarly, if one or more transformation functions are applied, e.g.::: >>> table2 = convert(table1, 'foo', 'upper') >>> table3 = convert(table2, 'bar', int) >>> table4 = convert(table3, 'baz', float) >>> table5 = addfield(table4, 'quux', expr('{bar} * {baz}')) ...no actual transformation work will be done, until data are requested from `table5` or any of the other row containers returned by the intermediate steps. So in effect, a 5 step transformation pipeline has been set up, and rows will pass through the pipeline on demand, as they are pulled from the end of the pipeline via iteration. A call to a function like :func:`look`, or any of the functions which write data to a file or database (e.g., :func:`tocsv`, :func:`totext`, :func:`tosqlite3`, :func:`todb`), will pull data through the pipeline and cause all of the transformation steps to be executed on the requested rows, e.g.:: >>> look(table5) +-------+-------+-------+--------------------+ | 'foo' | 'bar' | 'baz' | 'quux' | +=======+=======+=======+====================+ | 'A' | 1 | 3.4 | 3.4 | +-------+-------+-------+--------------------+ | 'B' | 2 | 7.4 | 14.8 | +-------+-------+-------+--------------------+ | 'C' | 6 | 2.2 | 13.200000000000001 | +-------+-------+-------+--------------------+ | 'D' | 9 | 8.1 | 72.89999999999999 | +-------+-------+-------+--------------------+ ...although note that :func:`look` will by default only request the first 10 rows, and so at most only 10 rows will be processed. Calling :func:`look` to inspect the first few rows of a table is often an efficient way to examine the output of a transformation pipeline, without having to execute the transformation over all of the input data. Caching / memoization ~~~~~~~~~~~~~~~~~~~~~ This package tries to make efficient use of memory by using iterators and lazy evaluation where possible. However, some transformations cannot be done without building data structures, either in memory or on disk. An example is the :func:`sort` function, which will either sort a table entirely in memory, or will sort the table in memory in chunks, writing chunks to disk and performing a final merge sort on the chunks. Which strategy is used will depend on the arguments passed into the :func:`sort` function when it is called. In either case, the sorting can take some time, and if the sorted data will be used more than once, it is obviously undesirable to throw away the sorted data and start again from scratch each time. It is better to cache (a.k.a., memoize) the sorted data, if possible, so it can be re-used. However, if a table upstream of the sort is mutable, there needs to be some way of discovering whether data have been changed since the last sort was performed, and hence whether the cached data are still fresh or not. There are also cases where, even though data are generated in a purely iterative fashion, some programs may still want to cache some or all of the data. E.g., where data are calculated dynamically and are relatively expensive to compute, or require scanning many rows, or where data are being retrieved via a network and there is latency. Again, there needs to be some way of finding out whether cached data are fresh or not. To support caching, it is recommended (but not required) that row container objects also implement the `cachetag` method. This method should return an integer which changes whenever the table's fields or data changes (i.e., a row iterator would yield a different sequence of items). All :mod:`petl` row container objects implement this method, where applicable. Note in particular that the functions :func:`fromcsv`, :func:`frompickle` and :func:`fromsqlite3` return tables that implement the `cachetag` method, and that the implementation of `cachetag` in these cases depends on a checksum function operating on the underlying data file. By default, :func:`statsum` is used as the checksum function, which is cheap to compute but crude because it relies only on file name, size and time of modification, and on some systems this will **not** reveal changes within the same second that preserve file size. If you need a finer level of granularity, use either :func:`adler32sum` or :func:`crc32sum` instead. These can be passed in as arguments to :func:`fromcsv`, :func:`frompickle` and :func:`fromsqlite3`, or can be set globally, e.g.:: >>> import petl.io >>> petl.io.defaultsumfun = petl.io.adler32sum Care must be taken to ensure correct implementation of the `cachetag` method where a table is generating data dynamically from another table or tables. In these cases, the state of the upstream tables must be considered when generating a `cachetag` value. One strategy is to construct `cachetag` values by hashing the internal configuration attributes of a table, along with the `cachetag` values from any tables immediately upstream. In some cases, it may be difficult to determine whether data have changed, e.g., where data are being retrieved from a database. In these cases it is suggested that table objects use a user-configurable time to live (TTL) to generate `cachetag` values. E.g., where a database table is updated once a day at most, a TTL of a few hours would enable data to be cached during an interactive session, which might improve usability for someone exploring the data or developing a transformation script. Extract - reading tables from files, databases and other sources ---------------------------------------------------------------- .. autofunction:: petl.fromcsv .. autofunction:: petl.fromtsv .. autofunction:: petl.frompickle .. autofunction:: petl.fromsqlite3 .. autofunction:: petl.fromdb .. autofunction:: petl.fromtext .. autofunction:: petl.fromxml .. autofunction:: petl.fromjson .. autofunction:: petl.fromdicts Transform - transforming tables ------------------------------- .. autofunction:: petl.rename .. autofunction:: petl.setheader .. autofunction:: petl.extendheader .. autofunction:: petl.pushheader .. autofunction:: petl.skip .. autofunction:: petl.skipcomments .. autofunction:: petl.rowslice .. autofunction:: petl.head .. autofunction:: petl.tail .. autofunction:: petl.cut .. autofunction:: petl.cutout .. autofunction:: petl.select .. autofunction:: petl.selectop .. autofunction:: petl.selecteq .. autofunction:: petl.selectne .. autofunction:: petl.selectlt .. autofunction:: petl.selectle .. autofunction:: petl.selectgt .. autofunction:: petl.selectge .. autofunction:: petl.selectrangeopen .. autofunction:: petl.selectrangeopenleft .. autofunction:: petl.selectrangeopenright .. autofunction:: petl.selectrangeclosed .. autofunction:: petl.selectcontains .. autofunction:: petl.selectin .. autofunction:: petl.selectnotin .. autofunction:: petl.selectis .. autofunction:: petl.selectisnot .. autofunction:: petl.selectisinstance .. autofunction:: petl.selectre .. autofunction:: petl.rowselect .. autofunction:: petl.recordselect .. autofunction:: petl.rowlenselect .. autofunction:: petl.fieldselect .. autofunction:: petl.facet .. autofunction:: petl.rangefacet .. autofunction:: petl.replace .. autofunction:: petl.replaceall .. autofunction:: petl.convert .. autofunction:: petl.convertall .. autofunction:: petl.fieldconvert .. autofunction:: petl.convertnumbers .. autofunction:: petl.search .. autofunction:: petl.sub .. autofunction:: petl.split .. autofunction:: petl.capture .. autofunction:: petl.unpack .. autofunction:: petl.unpackdict .. autofunction:: petl.fieldmap .. autofunction:: petl.rowmap .. autofunction:: petl.recordmap .. autofunction:: petl.rowmapmany .. autofunction:: petl.recordmapmany .. autofunction:: petl.cat .. autofunction:: petl.duplicates .. autofunction:: petl.unique .. autofunction:: petl.conflicts .. autofunction:: petl.sort .. autofunction:: petl.join .. autofunction:: petl.leftjoin .. autofunction:: petl.lookupjoin .. autofunction:: petl.rightjoin .. autofunction:: petl.outerjoin .. autofunction:: petl.crossjoin .. autofunction:: petl.antijoin .. autofunction:: petl.complement .. autofunction:: petl.diff .. autofunction:: petl.recordcomplement .. autofunction:: petl.recorddiff .. autofunction:: petl.intersection .. autofunction:: petl.aggregate .. autofunction:: petl.rangeaggregate .. autofunction:: petl.rangecounts .. autofunction:: petl.rowreduce .. autofunction:: petl.recordreduce .. autofunction:: petl.rangerowreduce .. autofunction:: petl.rangerecordreduce .. autofunction:: petl.mergeduplicates .. autofunction:: petl.mergesort .. autofunction:: petl.merge .. autofunction:: petl.melt .. autofunction:: petl.recast .. autofunction:: petl.transpose .. autofunction:: petl.pivot .. autofunction:: petl.hashjoin .. autofunction:: petl.hashleftjoin .. autofunction:: petl.hashlookupjoin .. autofunction:: petl.hashrightjoin .. autofunction:: petl.hashantijoin .. autofunction:: petl.hashcomplement .. autofunction:: petl.hashintersection .. autofunction:: petl.flatten .. autofunction:: petl.unflatten .. autofunction:: petl.annex .. autofunction:: petl.fold .. autofunction:: petl.addrownumbers .. autofunction:: petl.addcolumn .. autofunction:: petl.addfield .. autofunction:: petl.filldown .. autofunction:: petl.fillright .. autofunction:: petl.fillleft .. autofunction:: petl.multirangeaggregate .. autofunction:: petl.unjoin .. autofunction:: petl.distinct .. autofunction:: petl.rowgroupmap .. autofunction:: petl.groupcountdistinctvalues .. autofunction:: petl.groupselectfirst .. autofunction:: petl.groupselectmin .. autofunction:: petl.groupselectmax Load - writing tables to files and databases -------------------------------------------- .. autofunction:: petl.tocsv .. autofunction:: petl.appendcsv .. autofunction:: petl.totsv .. autofunction:: petl.appendtsv .. autofunction:: petl.topickle .. autofunction:: petl.appendpickle .. autofunction:: petl.tosqlite3 .. autofunction:: petl.appendsqlite3 .. autofunction:: petl.todb .. autofunction:: petl.appenddb .. autofunction:: petl.totext .. autofunction:: petl.appendtext .. autofunction:: petl.tojson .. autofunction:: petl.tojsonarrays Utility functions ----------------- .. autofunction:: petl.header .. autofunction:: petl.data .. autofunction:: petl.iterdata .. autofunction:: petl.dataslice .. autofunction:: petl.records .. autofunction:: petl.fieldnames .. autofunction:: petl.nrows .. autofunction:: petl.look .. autofunction:: petl.lookall .. autofunction:: petl.see .. autofunction:: petl.values .. autofunction:: petl.itervalues .. autofunction:: petl.valueset .. autofunction:: petl.valuecount .. autofunction:: petl.valuecounts .. autofunction:: petl.valuecounter .. autofunction:: petl.columns .. autofunction:: petl.facetcolumns .. autofunction:: petl.isunique .. autofunction:: petl.isordered .. autofunction:: petl.limits .. autofunction:: petl.stats .. autofunction:: petl.lenstats .. autofunction:: petl.stringpatterns .. autofunction:: petl.stringpatterncounter .. autofunction:: petl.rowlengths .. autofunction:: petl.typecounts .. autofunction:: petl.typecounter .. autofunction:: petl.typeset .. autofunction:: petl.parsecounts .. autofunction:: petl.parsecounter .. autofunction:: petl.dateparser .. autofunction:: petl.timeparser .. autofunction:: petl.datetimeparser .. autofunction:: petl.boolparser .. autofunction:: petl.parsenumber .. autofunction:: petl.lookup .. autofunction:: petl.lookupone .. autofunction:: petl.recordlookup .. autofunction:: petl.recordlookupone .. autofunction:: petl.statsum .. autofunction:: petl.adler32sum .. autofunction:: petl.crc32sum .. autofunction:: petl.expr .. autofunction:: petl.strjoin .. autofunction:: petl.randomtable .. autofunction:: petl.dummytable .. autofunction:: petl.diffheaders .. autofunction:: petl.diffvalues .. autofunction:: petl.heapqmergesorted .. autofunction:: petl.shortlistmergesorted .. autofunction:: petl.progress .. autofunction:: petl.clock .. autofunction:: petl.rowgroupby .. autofunction:: petl.nthword Further Reading --------------- .. toctree:: :maxdepth: 2 fluent interactive push case_study_1 related_work Indices and tables ------------------ * :ref:`genindex` * :ref:`modindex` * :ref:`search`