petl - Extract, Transform and Load (Tables of Data)¶

petl is a Python package for extracting, transforming and loading tables of data.

Documentation: http://petl.readthedocs.org/
Source Code: https://github.com/alimanfoo/petl
Download: http://pypi.python.org/pypi/petl
Mailing List: http://groups.google.com/group/python-etl

For examples of petl in use, see the case studies below:

Comparing Tables

Overview¶

The tables below give an overview of the main functions in the petl module.

See also the alphabetic Index of all functions in the package.

Introduction¶

Installation¶

This module is available from the Python Package Index. On Linux distributions you should be able to do easy_install petl or pip install petl. On other platforms you can download manually, extract and run python setup.py install.

Dependencies and extensions¶

This package has been written with no dependencies other than the Python core modules, for ease of installation and maintenance. However, there are many third party packages which could usefully be used with petl, e.g., providing access to data from Excel or other file types. Some extensions with these additional dependencies are provided by the petlx package, a companion package to petl.

Conventions - row containers and row iterators¶

This package defines the following convention for objects acting as containers of tabular data and supporting row-oriented iteration over the data.

A row container (also referred to here informally as a table) is any object which satisfies the following:

1. implements the __iter__ method
2. __iter__ returns a row iterator (see below)
3. all row iterators returned by __iter__ are independent, i.e., consuming items from one iterator will not affect any other iterators

A row iterator is an iterator which satisfies the following:

1. each item returned by the iterator is either a list or a tuple
2. the first item returned by the iterator is a header row comprising a list or tuple of fields
3. each subsequent item returned by the iterator is a data row comprising a list or tuple of data values
4. a field is typically a string (str or unicode) but may be an object of any type as long as it implements __str__ and is pickleable
5. a data value is any pickleable object that supports rich comparison operators

So, for example, the list of lists shown below is a row container:

>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]

Note that, under this convention, an object returned by the csv.reader() function from the standard Python csv module is a row iterator and not a row container, because it can only be iterated over once, e.g.:

>>> from StringIO import StringIO
>>> import csv
>>> csvdata = """foo,bar
... a,1
... b,2
... """
>>> rowiterator = csv.reader(StringIO(csvdata))
>>> for row in rowiterator:
...     print row
...
['foo', 'bar']
['a', '1']
['b', '2']
>>> for row in rowiterator:
...     print row
...
>>> # can only iterate once

However, it is straightforward to define functions that support the above convention for row containers and provide access to data from CSV or other types of file or data source; see, e.g., the fromcsv() function in this package.
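As a rough illustration of the convention (not petl's own implementation), the sketch below wraps a CSV file in a row container. The class name CSVView and the demonstration file are hypothetical, and the code uses only the standard library with Python 3 syntax. Each call to __iter__ re-opens the file, so every iterator is independent of the others.

```python
import csv
import os
import tempfile


class CSVView(object):
    """Hypothetical row container: each __iter__ call re-opens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # A generator method: every call produces a fresh, independent iterator.
        with open(self.path, newline='') as f:
            for row in csv.reader(f):
                yield row


# Write a small demonstration file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['foo', 'bar'], ['a', 1], ['b', 2]])

table = CSVView(path)
first_pass = list(table)
second_pass = list(table)  # iterating again starts from the header row
```

Unlike the bare csv.reader() object, the wrapper can be iterated over as many times as needed, at the cost of re-reading the file on each pass.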

The main reason for requiring that row containers support independent row iterators (point 3) is that data from a table may need to be iterated over several times within the same program or interactive session. E.g., when using petl in an interactive session to build up a sequence of data transformation steps, the user might want to examine outputs from several intermediate steps, before all of the steps are defined and the transformation is executed in full.

Note that this convention does not place any restrictions on the lengths of header and data rows. A table may return a header row and/or data rows of varying lengths.

Note also that many features of petl depend on sorting which will only work if the data values support rich comparison operators.

Transformation pipelines¶

This package makes extensive use of lazy evaluation and iterators. This means, generally, that a transformation will not actually be executed until data is requested.

E.g., given the following data in a file at ‘example1.csv’ in the current working directory:

foo,bar,baz
a,1,3.4
b,2,7.4
c,6,2.2
d,9,8.1

...the following code does not actually read the file, nor does it load any of its contents into memory:

>>> from petl import *
>>> table1 = fromcsv('example1.csv')

Rather, table1 is a row container object, which can be iterated over.

Similarly, if one or more transformation functions are applied, e.g.:

>>> table2 = convert(table1, 'foo', 'upper')
>>> table3 = convert(table2, 'bar', int)
>>> table4 = convert(table3, 'baz', float)
>>> table5 = addfield(table4, 'quux', expr('{bar} * {baz}'))

...no actual transformation work will be done, until data are requested from table5 or any of the other row containers returned by the intermediate steps.

In effect, a five-step transformation pipeline has been set up, and rows pass through it on demand as they are pulled from the end of the pipeline via iteration.
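The pull-on-demand behaviour can be imitated with plain Python generators. The sketch below is an assumption-laden illustration rather than petl's implementation: ConvertView is a hypothetical class, and the log list exists only to make the laziness observable (standard library only, Python 3 syntax).

```python
class ConvertView(object):
    """Hypothetical lazy view: applies func to one field when iterated."""

    def __init__(self, source, field, func, log):
        self.source = source
        self.field = field
        self.func = func
        self.log = log  # records each converted value so laziness is visible

    def __iter__(self):
        it = iter(self.source)
        header = next(it)
        yield header
        index = list(header).index(self.field)
        for row in it:
            row = list(row)
            self.log.append(row[index])
            row[index] = self.func(row[index])
            yield row


log = []
table1 = [['foo', 'bar'], ['a', '1'], ['b', '2'], ['c', '6']]
table2 = ConvertView(table1, 'foo', str.upper, log)
table3 = ConvertView(table2, 'bar', int, log)

calls_before = len(log)  # setting up the pipeline converts nothing

rows = iter(table3)
header = next(rows)      # the header passes through without conversions
first = next(rows)       # pulls exactly one data row through both steps
calls_after = len(log)
```

Only the rows actually requested are converted; the remaining rows of table1 are never touched unless iteration continues.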

A call to a function like look(), or any of the functions which write data to a file or database (e.g., tocsv(), totext(), tosqlite3(), todb()), will pull data through the pipeline and cause all of the transformation steps to be executed on the requested rows, e.g.:

>>> look(table5)
+-------+-------+-------+--------------------+
| 'foo' | 'bar' | 'baz' | 'quux'             |
+=======+=======+=======+====================+
| 'A'   | 1     | 3.4   | 3.4                |
+-------+-------+-------+--------------------+
| 'B'   | 2     | 7.4   | 14.8               |
+-------+-------+-------+--------------------+
| 'C'   | 6     | 2.2   | 13.200000000000001 |
+-------+-------+-------+--------------------+
| 'D'   | 9     | 8.1   | 72.89999999999999  |
+-------+-------+-------+--------------------+

...although note that look() will by default only request the first 10 rows, so at most 10 rows will be processed. Calling look() to inspect the first few rows of a table is often an efficient way to examine the output of a transformation pipeline without executing the transformation over all of the input data.

Caching¶

This package tries to make efficient use of memory by using iterators and lazy evaluation where possible. However, some transformations cannot be done without building data structures, either in memory or on disk.

An example is the sort() function, which will either sort a table entirely in memory, or will sort the table in memory in chunks, writing chunks to disk and performing a final merge sort on the chunks. (Which strategy is used will depend on the arguments passed into the sort() function when it is called.)
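The chunked strategy can be sketched with the standard library as follows. This is a minimal illustration of sorting in chunks and merging, not the code petl uses; the function names and the chunk_size parameter are hypothetical, and heapq.merge() with a key argument needs Python 3.5 or later.

```python
import heapq
import itertools
import pickle
import tempfile


def _read_chunk(f):
    # Yield the pickled rows back from a spilled chunk file.
    f.seek(0)
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return


def chunked_sort(table, index, chunk_size):
    """Yield the header, then the data rows sorted on column `index`."""
    it = iter(table)
    yield next(it)  # the header row passes through unsorted
    key = lambda row: row[index]
    chunk_files = []
    while True:
        # Sort at most chunk_size rows in memory at a time.
        chunk = sorted(itertools.islice(it, chunk_size), key=key)
        if not chunk:
            break
        f = tempfile.TemporaryFile()
        for row in chunk:
            pickle.dump(row, f)
        chunk_files.append(f)
    # Merge the sorted chunks lazily, reading one row per chunk at a time.
    for row in heapq.merge(*[_read_chunk(f) for f in chunk_files], key=key):
        yield row
    for f in chunk_files:
        f.close()


table = [['foo', 'bar'], ['c', 2], ['a', 1], ['b', 3], ['e', 0], ['d', 5]]
result = list(chunked_sort(table, 0, chunk_size=2))
```

Memory use is bounded by the chunk size rather than by the size of the table, which is the point of the second strategy.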

In either case, the sorting can take some time, and if the sorted data will be used more than once, it is obviously undesirable to throw away the sorted data and start again from scratch each time. It is better to cache the sorted data, if possible, so it can be re-used.

The sort() function and all functions which use sort() internally provide a cache keyword argument, which can be used to turn on or off the caching of sorted data.

There is also an explicit cache() function, which can be used to cache in memory up to a configurable number of rows from a table.
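To illustrate the idea of caching up to a fixed number of rows, here is a hypothetical wrapper written with the standard library only (Python 3 syntax). The names CachedView, CountingSource and limit are assumptions made for this sketch; the behaviour of petl's cache() function may differ in detail.

```python
class CachedView(object):
    """Hypothetical caching wrapper around a row container."""

    def __init__(self, source, limit=1000):
        self.source = source
        self.limit = limit   # maximum number of rows to hold in memory
        self._rows = None    # filled in once a complete pass fits the limit

    def __iter__(self):
        if self._rows is not None:
            return iter(self._rows)  # replay from memory
        return self._iter_and_cache()

    def _iter_and_cache(self):
        buffered = []
        for row in self.source:
            if buffered is not None:
                buffered.append(row)
                if len(buffered) > self.limit:
                    buffered = None  # too large: stop caching this pass
            yield row
        if buffered is not None:
            self._rows = buffered    # cache only after a complete pass


class CountingSource(object):
    """Row container that counts how often it is iterated over."""

    def __init__(self, rows):
        self.rows = rows
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.rows)


source = CountingSource([['foo', 'bar'], ['b', 2], ['a', 1]])
cached = CachedView(source, limit=10)
first = list(cached)
second = list(cached)  # served from memory; the source is not read again
```

If the table has more rows than the limit, the wrapper simply falls back to reading the source on every pass.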

Changed in version 0.16.

Use of the cachetag() method is now deprecated.

Extract - reading tables from files, databases and other sources¶

The following functions extract a table from a file-like source or database. For everything except fromdb() the source argument provides information about where to read the underlying data from. If the source argument is None or a string it is interpreted as follows:

None - read from stdin
string starting with ‘http://’, ‘https://’ or ‘ftp://’ - read from URL
string ending with ‘.gz’ or ‘.bgz’ - read from file via gzip decompression
string ending with ‘.bz2’ - read from file via bz2 decompression
any other string - read directly from file
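The rules above amount to a simple dispatch on the source value. The following sketch shows one way such a dispatch could be written with the standard library (Python 3 syntax); open_source is a hypothetical helper, not a petl function.

```python
import bz2
import gzip
import os
import sys
import tempfile
from urllib.request import urlopen


def open_source(source):
    """Return a binary file-like object for reading, chosen from `source`."""
    if source is None:
        return sys.stdin.buffer
    if source.startswith(('http://', 'https://', 'ftp://')):
        return urlopen(source)
    if source.endswith(('.gz', '.bgz')):
        return gzip.open(source, 'rb')
    if source.endswith('.bz2'):
        return bz2.open(source, 'rb')
    return open(source, 'rb')


# Demonstration with a gzip-compressed file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
with gzip.open(path, 'wb') as f:
    f.write(b'foo,bar\na,1\n')

with open_source(path) as f:
    data = f.read()  # decompressed transparently
```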

Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip file, a string or a subprocess, see the section on I/O helper classes below for more information.

petl.fromcsv(source=None, dialect=csv.excel, **kwargs)¶

Wrapper for the standard csv.reader() function. Returns a table providing access to the data in the given delimited file. E.g.:

>>> import csv
>>> # set up a CSV file to demonstrate with
... with open('test.csv', 'wb') as f:
...     writer = csv.writer(f)
...     writer.writerow(['foo', 'bar'])
...     writer.writerow(['a', 1])
...     writer.writerow(['b', 2])
...     writer.writerow(['c', 2])
...
>>> # now demonstrate the use of petl.fromcsv
... from petl import fromcsv, look
>>> testcsv = fromcsv('test.csv')
>>> look(testcsv)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

The source argument is the path of the delimited file; all other keyword arguments are passed to csv.reader(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that all data values are strings; any intended numeric values will need to be converted (see also convert()).

Supports transparent reading from URLs, .gz and .bz2 files.

petl.fromtsv(source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as fromcsv() but with different default dialect (tab delimited).

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.9.

petl.fromucsv(source=None, dialect=csv.excel, encoding='utf-8', **kwargs)¶

Returns a table containing unicode data extracted from a delimited file via the given encoding. Like fromcsv() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

petl.fromutsv(source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as fromucsv() but with different default dialect (tab delimited).

New in version 0.19.

petl.frompickle(source=None)¶

Returns a table providing access to the data pickled in the given file. The rows in the table should have been pickled to the file one at a time. E.g.:

>>> import pickle
>>> # set up a file to demonstrate with
... with open('test.dat', 'wb') as f:
...     pickle.dump(['foo', 'bar'], f)
...     pickle.dump(['a', 1], f)
...     pickle.dump(['b', 2], f)
...     pickle.dump(['c', 2.5], f)
...
>>> # now demonstrate the use of petl.frompickle
... from petl import frompickle, look
>>> testdat = frompickle('test.dat')
>>> look(testdat)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2.5   |
+-------+-------+

Supports transparent reading from URLs, .gz and .bz2 files.
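For readers unfamiliar with the "one row pickled at a time" layout, the sketch below reads such a file back with the standard library alone (Python 3 syntax). The helper iter_pickled_rows is hypothetical and only illustrates the file layout; it is not the petl implementation.

```python
import os
import pickle
import tempfile


def iter_pickled_rows(path):
    """Yield rows from a file to which they were pickled one at a time."""
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return  # no more pickled objects in the file


# Set up a demonstration file: one pickle.dump() call per row.
path = os.path.join(tempfile.mkdtemp(), 'test.dat')
with open(path, 'wb') as f:
    for row in [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2.5]]:
        pickle.dump(row, f)

rows = list(iter_pickled_rows(path))
```

Because each row is a separate pickle, rows can be read back one by one without loading the whole file into memory.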

petl.fromsqlite3(source, query, *args, **kwargs)¶

Provides access to data from an sqlite3 database file via a given query. E.g.:

>>> import sqlite3
>>> from petl import look, fromsqlite3    
>>> # set up a database to demonstrate with
>>> data = [['a', 1],
...         ['b', 2],
...         ['c', 2.0]]
>>> connection = sqlite3.connect('test.db')
>>> c = connection.cursor()
>>> c.execute('create table foobar (foo, bar)')
<sqlite3.Cursor object at 0x2240b90>
>>> for row in data:
...     c.execute('insert into foobar values (?, ?)', row)
... 
<sqlite3.Cursor object at 0x2240b90>
<sqlite3.Cursor object at 0x2240b90>
<sqlite3.Cursor object at 0x2240b90>
>>> connection.commit()
>>> c.close()
>>>
>>> # now demonstrate the petl.fromsqlite3 function
... foobar = fromsqlite3('test.db', 'select * from foobar')
>>> look(foobar)    
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| u'a'  | 1     |
+-------+-------+
| u'b'  | 2     |
+-------+-------+
| u'c'  | 2.0   |
+-------+-------+

Changed in version 0.10.2.

Either a database file name or a connection object can be given as the first argument.

petl.fromdb(dbo, query, *args, **kwargs)¶

Provides access to data from any DB-API 2.0 connection via a given query. E.g., using sqlite3:

>>> import sqlite3
>>> from petl import look, fromdb
>>> connection = sqlite3.connect('test.db')
>>> table = fromdb(connection, 'select * from foobar')
>>> look(table)

E.g., using psycopg2 (assuming you’ve installed it first):

>>> import psycopg2
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> table = fromdb(connection, 'select * from test')
>>> look(table)

E.g., using MySQLdb (assuming you’ve installed it first):

>>> import MySQLdb
>>> from petl import look, fromdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> table = fromdb(connection, 'select * from test')
>>> look(table)

Changed in version 0.10.2.

The first argument may also be a function that creates a cursor. E.g.:

>>> import psycopg2
>>> import psycopg2.extras  # DictCursor lives in the extras submodule
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> mkcursor = lambda: connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
>>> table = fromdb(mkcursor, 'select * from test')
>>> look(table)

N.B., each call to the function should return a new cursor.
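The reason for requiring a new cursor per call follows from the row container convention: every iteration over the table needs its own independent cursor. The sketch below illustrates this with sqlite3 from the standard library (Python 3 syntax); QueryView is a hypothetical class, not petl's implementation.

```python
import sqlite3


class QueryView(object):
    """Hypothetical row container over a DB-API 2.0 query."""

    def __init__(self, mkcursor, query):
        self.mkcursor = mkcursor  # zero-argument callable returning a new cursor
        self.query = query

    def __iter__(self):
        cursor = self.mkcursor()  # a fresh cursor for every pass
        cursor.execute(self.query)
        # DB-API 2.0: the first item of each description entry is the column name.
        yield tuple(d[0] for d in cursor.description)
        # Iterating over a cursor is an optional DB-API extension; sqlite3 supports it.
        for row in cursor:
            yield row
        cursor.close()


connection = sqlite3.connect(':memory:')
connection.execute('create table foobar (foo, bar)')
connection.executemany('insert into foobar values (?, ?)',
                       [('a', 1), ('b', 2), ('c', 2.0)])

table = QueryView(lambda: connection.cursor(), 'select * from foobar')
it1 = iter(table)
it2 = iter(table)
first_two = [next(it1), next(it1)]
header_again = next(it2)  # it2 is unaffected by consuming it1
```

If the function returned the same cursor each time, the two iterators would share state and interfere with one another.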

Changed in version 0.18.

Added support for server-side cursors.

Note that the default behaviour of most database servers and clients is for the entire result set for each query to be sent from the server to the client. If your query returns a large result set this can result in significant memory usage at the client. Some databases support server-side cursors which provide a means for client libraries to fetch result sets incrementally, reducing memory usage at the client.

To use a server-side cursor with a PostgreSQL database, e.g.:

>>> import psycopg2
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> table = fromdb(lambda: connection.cursor(name='arbitrary'), 'select * from test')
>>> look(table)

To use a server-side cursor with a MySQL database, e.g.:

>>> import MySQLdb
>>> from petl import look, fromdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> table = fromdb(lambda: connection.cursor(MySQLdb.cursors.SSCursor), 'select * from test')
>>> look(table)

For more information on server-side cursors see the following links:

http://initd.org/psycopg/docs/usage.html#server-side-cursors

http://mysql-python.sourceforge.net/MySQLdb.html#using-and-extending

petl.fromtext(source=None, header=['lines'], strip=None)¶

Construct a table from lines in the given text file. E.g.:

>>> # example data
... with open('test.txt', 'w') as f:
...     f.write('a\t1\n')
...     f.write('b\t2\n')
...     f.write('c\t3\n')
... 
>>> from petl import fromtext, look
>>> table1 = fromtext('test.txt')
>>> look(table1)
+--------------+
| 'lines'      |
+==============+
| 'a\t1'       |
+--------------+
| 'b\t2'       |
+--------------+
| 'c\t3'       |
+--------------+

The fromtext() function provides a starting point for custom handling of text files. E.g., using capture():

>>> from petl import capture
>>> table2 = capture(table1, 'lines', '(.*)\\t(.*)$', ['foo', 'bar'])
>>> look(table2)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '3'   |
+-------+-------+

Supports transparent reading from URLs, .gz and .bz2 files.

Changed in version 0.4.

The strip() function is called on each line, which by default will remove leading and trailing whitespace, including the end-of-line character - use the strip keyword argument to specify alternative characters to strip.

petl.fromutext(source=None, header=[u'lines'], encoding='utf-8', strip=None)¶

Construct a table from lines in the given text file via the given encoding. Like fromtext() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

petl.fromxml(source, *args, **kwargs)¶

Access data in an XML file. E.g.:

>>> from petl import fromxml, look
>>> data = """<table>
...     <tr>
...         <td>foo</td><td>bar</td>
...     </tr>
...     <tr>
...         <td>a</td><td>1</td>
...     </tr>
...     <tr>
...         <td>b</td><td>2</td>
...     </tr>
...     <tr>
...         <td>c</td><td>2</td>
...     </tr>
... </table>"""
>>> with open('example1.xml', 'w') as f:
...     f.write(data)
... 
>>> table1 = fromxml('example1.xml', 'tr', 'td')
>>> look(table1)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

If the data values are stored in an attribute, provide the attribute name as an extra positional argument, e.g.:

>>> data = """<table>
...     <tr>
...         <td v='foo'/><td v='bar'/>
...     </tr>
...     <tr>
...         <td v='a'/><td v='1'/>
...     </tr>
...     <tr>
...         <td v='b'/><td v='2'/>
...     </tr>
...     <tr>
...         <td v='c'/><td v='2'/>
...     </tr>
... </table>"""
>>> with open('example2.xml', 'w') as f:
...     f.write(data)
... 
>>> table2 = fromxml('example2.xml', 'tr', 'td', 'v')
>>> look(table2)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

Data values can also be extracted by providing a mapping of field names to element paths, e.g.:

>>> data = """<table>
...     <row>
...         <foo>a</foo><baz><bar v='1'/><bar v='3'/></baz>
...     </row>
...     <row>
...         <foo>b</foo><baz><bar v='2'/></baz>
...     </row>
...     <row>
...         <foo>c</foo><baz><bar v='2'/></baz>
...     </row>
... </table>"""
>>> with open('example3.xml', 'w') as f:
...     f.write(data)
... 
>>> table3 = fromxml('example3.xml', 'row', {'foo': 'foo', 'bar': ('baz/bar', 'v')})
>>> look(table3)
+-------+------------+
| 'foo' | 'bar'      |
+=======+============+
| 'a'   | ('1', '3') |
+-------+------------+
| 'b'   | '2'        |
+-------+------------+
| 'c'   | '2'        |
+-------+------------+

Note that the implementation is currently not streaming, i.e., the whole document is loaded into memory.
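To make the row and value matching more concrete, here is a small sketch using xml.etree.ElementTree from the standard library (Python 3 syntax). The function rows_from_xml and its parameter names are hypothetical; it mimics the two simplest calling styles shown above and, like the note says of petl, parses the whole document at once.

```python
import xml.etree.ElementTree as ET

data = """<table>
    <tr><td>foo</td><td>bar</td></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
</table>"""

data_with_attributes = """<table>
    <tr><td v='foo'/><td v='bar'/></tr>
    <tr><td v='a'/><td v='1'/></tr>
</table>"""


def rows_from_xml(text, row_match, value_match, attribute=None):
    """Yield one tuple per row element, taking text or an attribute value."""
    root = ET.fromstring(text)  # not streaming: the whole document is parsed
    for row_element in root.findall(row_match):
        values = row_element.findall(value_match)
        if attribute is None:
            yield tuple(v.text for v in values)
        else:
            yield tuple(v.get(attribute) for v in values)


rows = list(rows_from_xml(data, 'tr', 'td'))
rows_v = list(rows_from_xml(data_with_attributes, 'tr', 'td', 'v'))
```

As with fromcsv(), every extracted value is a string, so numeric fields still need converting afterwards.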

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.4.

Changed in version 0.6: If multiple elements match a given field, all values are reported as a tuple.

petl.fromjson(source, *args, **kwargs)¶

Extract data from a JSON file. The file must contain a JSON array as the top level object, and each member of the array will be treated as a row of data. E.g.:

>>> from petl import fromjson, look
>>> data = '[{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}, {"foo": "c", "bar": 2}]'
>>> with open('example1.json', 'w') as f:
...     f.write(data)
... 
>>> table1 = fromjson('example1.json')
>>> look(table1)
+--------+--------+
| u'foo' | u'bar' |
+========+========+
| u'a'   | 1      |
+--------+--------+
| u'b'   | 2      |
+--------+--------+
| u'c'   | 2      |
+--------+--------+

If your JSON file does not fit this structure, you will need to parse it via json.load() and select the array to treat as the data, see also fromdicts().
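For instance, when the array of records is nested inside a larger JSON object, it can be pulled out with the json module first. In the sketch below the keys "meta" and "results" are invented for illustration, and the table is built as a plain list of lists, which is itself a valid row container (standard library only, Python 3 syntax).

```python
import json

# A document whose top-level object is not an array of records.
text = ('{"meta": {"count": 2}, '
        '"results": [{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}]}')

document = json.loads(text)
records = document['results']  # select the array that holds the rows

# Build a header row followed by one data row per record.
header = ['foo', 'bar']
table = [header] + [[record.get(field) for field in header]
                    for record in records]
```

The records list selected here is also the kind of sequence of dicts that fromdicts() is described as accepting.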

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.5.

petl.fromdicts(dicts, header=None)¶

View a sequence of Python dicts as a table. E.g.:

>>> from petl import fromdicts, look
>>> dicts = [{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}, {"foo": "c", "bar": 2}]
>>> table = fromdicts(dicts)
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

Transform - transforming tables¶

petl.rename(table, *args)¶

Replace one or more fields in the table’s header row. E.g.:

>>> from petl import look, rename
>>> look(table1)
+-------+-------+
| 'sex' | 'age' |
+=======+=======+
| 'M'   | 12    |
+-------+-------+
| 'F'   | 34    |
+-------+-------+
| '-'   | 56    |
+-------+-------+

>>> # rename a single field
... table2 = rename(table1, 'sex', 'gender')
>>> look(table2)
+----------+-------+
| 'gender' | 'age' |
+==========+=======+
| 'M'      | 12    |
+----------+-------+
| 'F'      | 34    |
+----------+-------+
| '-'      | 56    |
+----------+-------+

>>> # rename multiple fields by passing a dictionary as the second argument
... table3 = rename(table1, {'sex': 'gender', 'age': 'age_years'})
>>> look(table3)
+----------+-------------+
| 'gender' | 'age_years' |
+==========+=============+
| 'M'      | 12          |
+----------+-------------+
| 'F'      | 34          |
+----------+-------------+
| '-'      | 56          |
+----------+-------------+

>>> # the returned table object also accepts field mappings via item assignment
... table4 = rename(table1)
>>> table4['sex'] = 'gender'
>>> table4['age'] = 'age_years'
>>> look(table4)
+----------+-------------+
| 'gender' | 'age_years' |
+==========+=============+
| 'M'      | 12          |
+----------+-------------+
| 'F'      | 34          |
+----------+-------------+
| '-'      | 56          |
+----------+-------------+

Changed in version 0.4.

Function signature changed to support the simple 2 argument form when renaming a single field.

Changed in version 0.23.

The field to rename can be specified as an index (i.e., integer representing field position).

petl.setheader(table, fields)¶

Replace the header row of the given table with the given fields. E.g.:

>>> from petl import setheader, look    
>>> look(table1)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+

>>> table2 = setheader(table1, ['foofoo', 'barbar'])
>>> look(table2)
+----------+----------+
| 'foofoo' | 'barbar' |
+==========+==========+
| 'a'      | 1        |
+----------+----------+
| 'b'      | 2        |
+----------+----------+

petl.extendheader(table, fields)¶

Extend the header row of the given table with additional fields. E.g.:

>>> from petl import extendheader, look
>>> look(table1)
+-------+---+-------+
| 'foo' |   |       |
+=======+===+=======+
| 'a'   | 1 | True  |
+-------+---+-------+
| 'b'   | 2 | False |
+-------+---+-------+

>>> table2 = extendheader(table1, ['bar', 'baz'])
>>> look(table2)
+-------+-------+-------+
| 'foo' | 'bar' | 'baz' |
+=======+=======+=======+
| 'a'   | 1     | True  |
+-------+-------+-------+
| 'b'   | 2     | False |
+-------+-------+-------+

Load - writing tables to files and databases¶

The following functions write data from a table to a file-like source or database. For functions that accept a source argument, if the source argument is None or a string it is interpreted as follows:

None - write to stdout
string ending with ‘.gz’ or ‘.bgz’ - write to file via gzip compression
string ending with ‘.bz2’ - write to file via bz2 compression
any other string - write directly to file

Some helper classes are also available for writing to other types of file-like sources, e.g., writing to a Zip file or string buffer, see the section on I/O helper classes below for more information.

petl.tocsv(table, source=None, dialect=csv.excel, **kwargs)¶

Write the table to a CSV file. E.g.:

>>> from petl import tocsv, look
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> tocsv(table, 'test.csv')
>>> # look what it did
... from petl import fromcsv
>>> look(fromcsv('test.csv'))
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

The source argument is the path of the delimited file; all other keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that if a file already exists at the given location, it will be overwritten.

Supports transparent writing to .gz and .bz2 files.

petl.appendcsv(table, source=None, dialect=csv.excel, **kwargs)¶

Append data rows to an existing CSV file. E.g.:

>>> # look at an existing CSV file
... from petl import look, fromcsv
>>> testcsv = fromcsv('test.csv')
>>> look(testcsv)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

>>> # append some data
... look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'd'   | 7     |
+-------+-------+
| 'e'   | 42    |
+-------+-------+
| 'f'   | 12    |
+-------+-------+

>>> from petl import appendcsv 
>>> appendcsv(table, 'test.csv')
>>> # look what it did
... look(testcsv)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+
| 'd'   | '7'   |
+-------+-------+
| 'e'   | '42'  |
+-------+-------+
| 'f'   | '12'  |
+-------+-------+

The source argument is the path of the delimited file; all other keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file. See also the cat() function.
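The consequence of this lack of checking is easy to see with the csv module alone. The sketch below is an illustration under stated assumptions, not petl's code: append_rows is a hypothetical helper that drops the header row and appends the data rows, using only the standard library (Python 3 syntax).

```python
import csv
import os
import tempfile

# An existing two-column CSV file.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['foo', 'bar'], ['a', 1]])


def append_rows(table, path):
    """Append the data rows of `table` to an existing CSV file."""
    it = iter(table)
    next(it)  # skip the header row; it is never compared with the file's header
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(it)


# The header and row length differ from the file, yet nothing complains.
append_rows([['x', 'y', 'z'], ['d', 7, True]], path)

with open(path, newline='') as f:
    result = list(csv.reader(f))
```

The file now holds a three-value row under a two-field header, so it is up to the caller to make sure the tables line up before appending.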

Supports transparent writing to .gz and .bz2 files.

petl.totsv(table, source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as tocsv() but with different default dialect (tab delimited).

Supports transparent writing to .gz and .bz2 files.

New in version 0.9.

petl.appendtsv(table, source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as appendcsv() but with different default dialect (tab delimited).

Supports transparent writing to .gz and .bz2 files.

New in version 0.9.

petl.toucsv(table, source=None, dialect=csv.excel, encoding='utf-8', **kwargs)¶

Write the table to a CSV file via the given encoding. Like tocsv() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

petl.appenducsv(table, source=None, dialect=csv.excel, encoding='utf-8', **kwargs)¶

Append the table to a CSV file via the given encoding. Like appendcsv() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

petl.toutsv(table, source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as toucsv() but with different default dialect (tab delimited).

New in version 0.19.

petl.appendutsv(table, source=None, dialect=csv.excel_tab, **kwargs)¶

Convenience function, as appenducsv() but with different default dialect (tab delimited).

New in version 0.19.

petl.topickle(table, source=None, protocol=-1)¶

Write the table to a pickle file. E.g.:

>>> from petl import topickle, look
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> topickle(table, 'test.dat')
>>> # look what it did
... from petl import frompickle
>>> look(frompickle('test.dat'))
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

Note that if a file already exists at the given location, it will be overwritten.

The pickle file format preserves type information, i.e., reading and writing is round-trippable.
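The difference from CSV can be checked directly with the standard library. The short sketch below (Python 3 syntax) sends the same rows through pickle and through the csv module; the variable names are illustrative only.

```python
import csv
import io
import pickle

rows = [['foo', 'bar'], ['a', 1], ['b', 2.5]]

# Round trip through pickle: each row comes back with its original types.
via_pickle = [pickle.loads(pickle.dumps(row, -1)) for row in rows]

# Round trip through CSV: every value comes back as a string.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
via_csv = list(csv.reader(io.StringIO(buffer.getvalue())))
```

Here via_pickle compares equal to the original rows, whereas via_csv contains '1' and '2.5' as strings that would need converting again.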

Supports transparent writing to .gz and .bz2 files.

petl.appendpickle(table, source=None, protocol=-1)¶

Append data to an existing pickle file. E.g.:

>>> from petl import look, frompickle
>>> # inspect an existing pickle file
... testdat = frompickle('test.dat')
>>> look(testdat)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> # append some data
... from petl import appendpickle
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'd'   | 7     |
+-------+-------+
| 'e'   | 42    |
+-------+-------+
| 'f'   | 12    |
+-------+-------+

>>> appendpickle(table, 'test.dat')
>>> # look what it did
... look(testdat)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+
| 'd'   | 7     |
+-------+-------+
| 'e'   | 42    |
+-------+-------+
| 'f'   | 12    |
+-------+-------+

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file. See also the cat() function.

Supports transparent writing to .gz and .bz2 files.

petl.tosqlite3(table, filename_or_connection, tablename, create=False, commit=True)¶

Load data into a table in an sqlite3 database. Note that if the database table exists, it will be truncated, i.e., all existing rows will be deleted prior to inserting the new data. E.g.:

>>> from petl import tosqlite3, look
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> # the existing table is truncated and reloaded; pass create=True if the table does not exist yet
... tosqlite3(table, 'test.db', 'foobar')
>>> # look what it did
... from petl import fromsqlite3
>>> look(fromsqlite3('test.db', 'select * from foobar'))
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| u'a'  | 1     |
+-------+-------+
| u'b'  | 2     |
+-------+-------+
| u'c'  | 2     |
+-------+-------+

If the table does not exist and create=True then a table will be created using the field names in the table header. However, note that no type specifications will be included in the table creation statement and so column type affinities may be inappropriate.
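What a create statement without type specifications means in practice can be seen with sqlite3 directly. In the sketch below (standard library only, Python 3 syntax) the table and column names follow the examples above, and the create statement is an assumption about the general shape of an untyped definition, not the exact statement petl issues.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
# No type names are given, as in a create statement built from field names only.
connection.execute('create table foobar (foo, bar)')
connection.executemany('insert into foobar values (?, ?)',
                       [('a', 1), ('b', '2'), ('c', 2.0)])

# typeof() reports the storage class actually used for each stored value.
types = connection.execute(
    'select typeof(bar) from foobar order by foo').fetchall()
```

Without a declared type the column applies no conversion, so the one column ends up holding an integer, a text value and a real side by side. If consistent column types matter, create the table yourself with explicit types before loading.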

Changed in version 0.10.2.

Either a database file name or a connection object can be given as the second argument.

Changed in version 0.21.

Default value for create argument changed to False.

petl.appendsqlite3(table, filename_or_connection, tablename, commit=True)¶

Load data into an existing table in an sqlite3 database. Note that the database table will be appended, i.e., the new data will be inserted into the table, and any existing rows will remain. E.g.:

>>> from petl import appendsqlite3, look
>>> look(moredata)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'd'   | 7     |
+-------+-------+
| 'e'   | 9     |
+-------+-------+
| 'f'   | 1     |
+-------+-------+

>>> appendsqlite3(moredata, 'test.db', 'foobar') 
>>> # look what it did
... from petl import look, fromsqlite3
>>> look(fromsqlite3('test.db', 'select * from foobar'))
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| u'a'  | 1     |
+-------+-------+
| u'b'  | 2     |
+-------+-------+
| u'c'  | 2     |
+-------+-------+
| u'd'  | 7     |
+-------+-------+
| u'e'  | 9     |
+-------+-------+
| u'f'  | 1     |
+-------+-------+

Changed in version 0.10.2.

Either a database file name or a connection object can be given as the second argument.

petl.todb(table, dbo, tablename, schema=None, commit=True)¶

Load data into an existing database table via a DB-API 2.0 connection or cursor. Note that the database table will be truncated, i.e., all existing rows will be deleted prior to inserting the new data. E.g.:

>>> from petl import look, todb
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

... using sqlite3:

>>> import sqlite3
>>> connection = sqlite3.connect('test.db')
>>> # assuming table "foobar" already exists in the database
... todb(table, connection, 'foobar')    

... using psycopg2:

>>> import psycopg2 
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> # assuming table "foobar" already exists in the database
... todb(table, connection, 'foobar')    

... using MySQLdb:

>>> import MySQLdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> # tell MySQL to use standard quote character
... connection.cursor().execute('SET SQL_MODE=ANSI_QUOTES')
>>> # load data, assuming table "foobar" already exists in the database
... todb(table, connection, 'foobar')    

N.B., for MySQL the statement SET SQL_MODE=ANSI_QUOTES is required to ensure MySQL uses SQL-92 standard quote characters.

Changed in version 0.10.2.

A cursor can also be provided instead of a connection, e.g.:

>>> import psycopg2 
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> cursor = connection.cursor()
>>> todb(table, cursor, 'foobar')    
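The truncating behaviour can be pictured as a DELETE followed by a batch of INSERT statements. The sketch below uses plain DB-API calls against an in-memory sqlite3 database; load_truncating is a hypothetical name, and the double-quoted table name is an assumption about the generated SQL (it would explain why MySQL needs ANSI_QUOTES, since MySQL otherwise treats double quotes as string delimiters).

```python
import sqlite3

def load_truncating(table, connection, tablename, commit=True):
    """Hypothetical stand-in: delete all existing rows, then insert the new ones."""
    it = iter(table)
    fields = next(it)  # header row
    cursor = connection.cursor()
    cursor.execute('DELETE FROM "%s"' % tablename)
    # sqlite3 uses '?' placeholders; psycopg2 and MySQLdb use '%s' instead.
    placeholders = ', '.join(['?'] * len(fields))
    cursor.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (tablename, placeholders), it)
    if commit:
        connection.commit()

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE foobar (foo, bar)')
connection.execute("INSERT INTO foobar VALUES ('old', 0)")
table = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]
load_truncating(table, connection, 'foobar')
remaining = connection.execute('SELECT * FROM foobar ORDER BY foo').fetchall()
print(remaining)  # [('a', 1), ('b', 2), ('c', 2)]
```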

petl.appenddb(table, dbo, tablename, schema=None, commit=True)¶

Load data into an existing database table via a DB-API 2.0 connection or cursor. Note that the database table will be appended to, i.e., the new data will be inserted into the table, and any existing rows will remain. E.g.:

>>> from petl import look, appenddb
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

... using sqlite3:

>>> import sqlite3
>>> connection = sqlite3.connect('test.db')
>>> # assuming table "foobar" already exists in the database
... appenddb(table, connection, 'foobar')    

... using psycopg2:

>>> import psycopg2 
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> # assuming table "foobar" already exists in the database
... appenddb(table, connection, 'foobar')    

... using MySQLdb:

>>> import MySQLdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> # tell MySQL to use standard quote character
... connection.cursor().execute('SET SQL_MODE=ANSI_QUOTES')
>>> # load data, appending rows to table "foobar" 
... appenddb(table, connection, 'foobar')    

N.B., for MySQL the statement SET SQL_MODE=ANSI_QUOTES is required to ensure MySQL uses SQL-92 standard quote characters.

Changed in version 0.10.2.

A cursor can also be provided instead of a connection, e.g.:

>>> import psycopg2 
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> cursor = connection.cursor()
>>> appenddb(table, cursor, 'foobar')    
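One simple way a function could accept either kind of object is to check for a cursor() method. This is only a sketch of the idea; cursor_for is a hypothetical helper, not part of petl.

```python
import sqlite3

def cursor_for(dbo):
    """Hypothetical helper: return a cursor for a connection or a cursor."""
    # DB-API connections provide cursor(); cursor objects are returned unchanged.
    if hasattr(dbo, 'cursor'):
        return dbo.cursor()
    return dbo

connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE foobar (foo, bar)')
# Both a connection and a cursor can be handed to the helper.
cursor_for(connection).execute("INSERT INTO foobar VALUES ('a', 1)")
cursor_for(connection.cursor()).execute("INSERT INTO foobar VALUES ('b', 2)")
count = connection.execute('SELECT count(*) FROM foobar').fetchone()[0]
print(count)  # 2
```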

petl.totext(table, source=None, template=None, prologue=None, epilogue=None)¶

Write the table to a text file. E.g.:

>>> from petl import totext, look    
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> prologue = """{| class="wikitable"
... |-
... ! foo
... ! bar
... """
>>> template = """|-
... | {foo}
... | {bar}
... """
>>> epilogue = "|}"
>>> totext(table, 'test.txt', template, prologue, epilogue)
>>> 
>>> # see what we did
... with open('test.txt') as f:
...     print f.read()
...     
{| class="wikitable"
|-
! foo
! bar
|-
| a
| 1
|-
| b
| 2
|-
| c
| 2
|}

The template will be used to format each row via str.format.
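The per-row formatting can be sketched in plain Python as follows, assuming each row is paired with the header fields and passed to str.format as keyword arguments. This illustrates the mechanism rather than reproducing petl's code.

```python
table = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]
template = "|-\n| {foo}\n| {bar}\n"

it = iter(table)
fields = next(it)  # header row supplies the names used in the template
# Pair each data row with the field names, then substitute into the template.
lines = [template.format(**dict(zip(fields, row))) for row in it]
print(''.join(lines))
```

Field names used in the template therefore need to be valid str.format keyword names.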

Supports transparent writing to .gz and .bz2 files.
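Transparent compression could be achieved by choosing how to open the file from its extension. The sketch below shows the idea with the standard gzip and bz2 modules; open_for_writing is a hypothetical helper and the selection rule is an assumption.

```python
import bz2
import gzip
import os
import tempfile

def open_for_writing(filename):
    """Hypothetical helper: pick a binary writer based on the file extension."""
    if filename.endswith('.gz'):
        return gzip.open(filename, 'wb')
    if filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'wb')
    return open(filename, 'wb')

path = os.path.join(tempfile.mkdtemp(), 'test.txt.gz')
with open_for_writing(path) as f:
    f.write(b'| a\n| 1\n')

# Reading back through gzip confirms the data was compressed on write.
with gzip.open(path, 'rb') as f:
    roundtrip = f.read()
print(roundtrip)
```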

petl.appendtext(table, source=None, template=None, prologue=None, epilogue=None)¶

Append the table to a text file.

New in version 0.19.

petl.toutext(table, source=None, encoding='utf-8', template=None, prologue=None, epilogue=None)¶

Write the table to a text file using the given encoding. Like totext() but accepts an additional encoding argument, which should be one of the encodings supported by Python. See also codecs.

New in version 0.19.
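To show what the encoding argument controls, here is a sketch that writes templated rows through io.open with an explicit encoding. It uses only the standard library and made-up example data; it is not petl's implementation.

```python
import io
import os
import tempfile

table = [[u'name', u'city'], [u'Zo\xeb', u'K\xf8benhavn']]
template = u'{name} lives in {city}\n'

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
it = iter(table)
fields = next(it)
# Text is encoded to bytes as it is written to the file.
with io.open(path, 'w', encoding='utf-8') as f:
    for row in it:
        f.write(template.format(**dict(zip(fields, row))))

# Read the raw bytes back to see the encoded form.
with io.open(path, 'rb') as f:
    raw = f.read()
print(repr(raw))
```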

petl.appendutext(table, source=None, encoding='utf-8', template=None, prologue=None, epilogue=None)¶

Append the table to a text file using the given encoding. Like appendtext() but accepts an additional encoding argument, which should be one of the encodings supported by Python. See also codecs.

New in version 0.19.

petl.tojson(table, source=None, prefix=None, suffix=None, *args, **kwargs)¶

Write a table in JSON format, with rows output as JSON objects. E.g.:

>>> from petl import tojson, look
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> tojson(table, 'example.json')
>>> # check what it did
... with open('example.json') as f:
...     print f.read()
... 
[{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}, {"foo": "c", "bar": 2}]

Note that this is currently not streaming: all data is loaded into memory before being written to the file.
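A non-streaming conversion of this kind can be sketched with the standard json module: every row is turned into a dictionary and collected in a list before anything is serialised. This is an illustration under that assumption, not petl's own code.

```python
import json

table = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]

it = iter(table)
fields = next(it)  # header row supplies the object keys
# The full list of records is built in memory before serialisation.
records = [dict(zip(fields, row)) for row in it]
text = json.dumps(records)
print(text)
```

For very large tables the memory needed grows with the number of rows, which is worth bearing in mind before exporting.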

Supports transparent writing to .gz and .bz2 files.

New in version 0.5.

petl.tojsonarrays(table, source=None, prefix=None, suffix=None, output_header=False, *args, **kwargs)¶

Write a table in JSON format, with rows output as JSON arrays. E.g.:

>>> from petl import tojsonarrays, look
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

>>> tojsonarrays(table, 'example.json')
>>> # check what it did
... with open('example.json') as f:
...     print f.read()
... 
[["a", 1], ["b", 2], ["c", 2]]

Note that this is currently not streaming: all data is loaded into memory before being written to the file.
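The array form can be sketched in the same way with the standard json module. The function rows_as_json_arrays is hypothetical, and treating output_header as "emit the header row as the first array" is an assumption made for this illustration.

```python
import json

def rows_as_json_arrays(table, output_header=False):
    """Hypothetical stand-in: serialise data rows as a JSON array of arrays."""
    it = iter(table)
    fields = next(it)
    rows = [list(row) for row in it]  # all rows are held in memory
    if output_header:
        # Assumption: the header is written as the first array.
        rows.insert(0, list(fields))
    return json.dumps(rows)

table = [['foo', 'bar'], ['a', 1], ['b', 2], ['c', 2]]
without_header = rows_as_json_arrays(table)
with_header = rows_as_json_arrays(table, output_header=True)
print(without_header)  # [["a", 1], ["b", 2], ["c", 2]]
print(with_header)     # [["foo", "bar"], ["a", 1], ["b", 2], ["c", 2]]
```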

Supports transparent writing to .gz and .bz2 files.

New in version 0.11.

Utility functions¶

petl.header(table)¶

Return the header row for the given table. E.g.:

>>> from petl import header
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> header(table)
['foo', 'bar']
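Under the row container convention, the header is simply the first item from a fresh iterator. The sketch below uses a hypothetical first_row function to show the idea; it is not petl's implementation.

```python
def first_row(table):
    """Hypothetical equivalent of header(): first item from a new iterator."""
    return next(iter(table))

table = [['foo', 'bar'], ['a', 1], ['b', 2]]
fields = first_row(table)
print(fields)  # ['foo', 'bar']

# Each call obtains its own iterator, so reading the header twice gives the
# same result and does not consume rows from other iterators over the table.
again = first_row(table)
```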

I/O helper classes¶

The following classes are helpers for the extract (from...()) and load (to...()) functions that use a file-like data source. An instance of any of these classes can be passed as the source argument to data extraction functions such as fromcsv(), with the exception of StdoutSource, which is write-only. An instance can likewise be passed as the source argument to data loading functions such as tocsv(), with the exception of StdinSource, URLSource and PopenSource, which are read-only. The behaviour of each source can usually be configured by passing arguments to the constructor; see the source code of the petl.io module for full details.

class petl.FileSource(filename, **kwargs)¶

class petl.GzipSource(filename, **kwargs)¶

class petl.BZ2Source(filename, **kwargs)¶

class petl.ZipSource(filename, membername, pwd=None, **kwargs)¶

class petl.StdinSource¶

class petl.StdoutSource¶

class petl.URLSource(*args, **kwargs)¶

class petl.StringSource(s=None)¶

class petl.PopenSource(*args, **kwargs)¶
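The general idea of a source object can be sketched as a small class that knows how to open its underlying data for reading or writing. Everything below is hypothetical: MemorySource, the open_ method name, and the read_lines and write_lines helpers are assumptions for illustration, and the actual interface is the one defined in the petl.io module.

```python
import io
from contextlib import contextmanager

class MemorySource(object):
    """Hypothetical source that keeps its data in memory (not a petl class)."""

    def __init__(self, data=b''):
        self.data = data

    @contextmanager
    def open_(self, mode='rb'):
        if mode.startswith('r'):
            yield io.BytesIO(self.data)
        else:
            buf = io.BytesIO()
            yield buf
            # Keep whatever was written once the caller is finished.
            self.data = buf.getvalue()

def write_lines(lines, source):
    """Hypothetical loader: write lines through any object with open_()."""
    with source.open_('wb') as f:
        for line in lines:
            f.write(line)

def read_lines(source):
    """Hypothetical extractor: read lines through any object with open_()."""
    with source.open_('rb') as f:
        return f.read().splitlines()

source = MemorySource()
write_lines([b'foo,bar\n', b'a,1\n'], source)
lines_read = read_lines(source)
print(lines_read)
```

The reading and writing functions depend only on the source's opening method, so the same functions could work with files, compressed files or in-memory buffers.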
