Extract - reading tables from files, databases and other sources

The following functions extract a table from a file-like source or database. For everything except fromdb() the source argument provides information about where to read the underlying data from. If the source argument is None or a string it is interpreted as follows:

  • None - read from stdin
  • string starting with ‘http://‘, ‘https://‘ or ‘ftp://‘ - read from URL
  • string ending with ‘.gz’ or ‘.bgz’ - read from file via gzip decompression
  • string ending with ‘.bz2’ - read from file via bz2 decompression
  • any other string - read directly from file

Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip file, a string or a subprocess, see the section on I/O helper classes for more information.

Delimited files

petl.fromcsv(source=None, dialect=<class csv.excel at 0x7fc97ffd2ef0>, **kwargs)

Wrapper for the standard csv.reader() function. Returns a table providing access to the data in the given delimited file. E.g.:

>>> import csv
>>> # set up a CSV file to demonstrate with
... with open('test.csv', 'wb') as f:
...     writer = csv.writer(f)
...     writer.writerow(['foo', 'bar'])
...     writer.writerow(['a', 1])
...     writer.writerow(['b', 2])
...     writer.writerow(['c', 2])
...
>>> # now demonstrate the use of petl.fromcsv
... from petl import fromcsv, look
>>> testcsv = fromcsv('test.csv')
>>> look(testcsv)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

The filename argument is the path of the delimited file, all other keyword arguments are passed to csv.reader(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that all data values are strings, and any intended numeric values will need to be converted, see also convert().

Supports transparent reading from URLs, .gz and .bz2 files.

petl.fromtsv(source=None, dialect=<class csv.excel_tab at 0x7fc97ffd2f58>, **kwargs)

Convenience function, as fromcsv() but with different default dialect (tab delimited).

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.9.

petl.fromucsv(source=None, dialect=<class csv.excel at 0x7fc97ffd2ef0>, encoding='utf-8', **kwargs)

Returns a table containing unicode data extracted from a delimited file via the given encoding. Like fromcsv() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

petl.fromutsv(source=None, dialect=<class csv.excel_tab at 0x7fc97ffd2f58>, **kwargs)

Convenience function, as fromucsv() but with different default dialect (tab delimited).

New in version 0.19.

Pickle files

petl.frompickle(source=None)

Returns a table providing access to the data pickled in the given file. The rows in the table should have been pickled to the file one at a time. E.g.:

>>> import pickle
>>> # set up a file to demonstrate with
... with open('test.dat', 'wb') as f:
...     pickle.dump(['foo', 'bar'], f)
...     pickle.dump(['a', 1], f)
...     pickle.dump(['b', 2], f)
...     pickle.dump(['c', 2.5], f)
...
>>> # now demonstrate the use of petl.frompickle
... from petl import frompickle, look
>>> testdat = frompickle('test.dat')
>>> look(testdat)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2.5   |
+-------+-------+

Supports transparent reading from URLs, .gz and .bz2 files.

Text files

petl.fromtext(source=None, header=['lines'], strip=None)

Construct a table from lines in the given text file. E.g.:

>>> # example data
... with open('test.txt', 'w') as f:
...     f.write('a\t1\n')
...     f.write('b\t2\n')
...     f.write('c\t3\n')
... 
>>> from petl import fromtext, look
>>> table1 = fromtext('test.txt')
>>> look(table1)
+--------------+
| 'lines'      |
+==============+
| 'a\t1'      |
+--------------+
| 'b\t2'      |
+--------------+
| 'c\t3'      |
+--------------+

The fromtext() function provides a starting point for custom handling of text files. E.g., using capture():

>>> from petl import capture
>>> table2 = capture(table1, 'lines', '(.*)\\t(.*)$', ['foo', 'bar'])
>>> look(table2)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '3'   |
+-------+-------+

Supports transparent reading from URLs, .gz and .bz2 files.

Changed in version 0.4.

The strip() function is called on each line, which by default will remove leading and trailing whitespace, including the end-of-line character - use the strip keyword argument to specify alternative characters to strip.

petl.fromutext(source=None, header=[u'lines'], encoding='utf-8', strip=None)

Construct a table from lines in the given text file via the given encoding. Like fromtext() but accepts an additional encoding argument which should be one of the Python supported encodings. See also codecs.

New in version 0.19.

XML files

petl.fromxml(source, *args, **kwargs)

Access data in an XML file. E.g.:

>>> from petl import fromxml, look
>>> data = """<table>
...     <tr>
...         <td>foo</td><td>bar</td>
...     </tr>
...     <tr>
...         <td>a</td><td>1</td>
...     </tr>
...     <tr>
...         <td>b</td><td>2</td>
...     </tr>
...     <tr>
...         <td>c</td><td>2</td>
...     </tr>
... </table>"""
>>> with open('example1.xml', 'w') as f:    
...     f.write(data)
...     f.close()
... 
>>> table1 = fromxml('example1.xml', 'tr', 'td')
>>> look(table1)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

If the data values are stored in an attribute, provide the attribute name as an extra positional argument, e.g.:

>>> data = """<table>
...     <tr>
...         <td v='foo'/><td v='bar'/>
...     </tr>
...     <tr>
...         <td v='a'/><td v='1'/>
...     </tr>
...     <tr>
...         <td v='b'/><td v='2'/>
...     </tr>
...     <tr>
...         <td v='c'/><td v='2'/>
...     </tr>
... </table>"""
>>> with open('example2.xml', 'w') as f:    
...     f.write(data)
...     f.close()
... 
>>> table2 = fromxml('example2.xml', 'tr', 'td', 'v')
>>> look(table2)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | '1'   |
+-------+-------+
| 'b'   | '2'   |
+-------+-------+
| 'c'   | '2'   |
+-------+-------+

Data values can also be extracted by providing a mapping of field names to element paths, e.g.:

>>> data = """<table>
...     <row>
...         <foo>a</foo><baz><bar v='1'/><bar v='3'/></baz>
...     </row>
...     <row>
...         <foo>b</foo><baz><bar v='2'/></baz>
...     </row>
...     <row>
...         <foo>c</foo><baz><bar v='2'/></baz>
...     </row>
... </table>"""
>>> with open('example3.xml', 'w') as f:    
...     f.write(data)
...     f.close()
... 
>>> table3 = fromxml('example3.xml', 'row', {'foo': 'foo', 'bar': ('baz/bar', 'v')})
>>> look(table3)
+-------+------------+
| 'foo' | 'bar'      |
+=======+============+
| 'a'   | ('1', '3') |
+-------+------------+
| 'b'   | '2'        |
+-------+------------+
| 'c'   | '2'        |
+-------+------------+

Note that the implementation is currently not streaming, i.e., the whole document is loaded into memory.

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.4.

Changed in version 0.6: If multiple elements match a given field, all values are reported as a tuple.

JSON files

petl.fromjson(source, *args, **kwargs)

Extract data from a JSON file. The file must contain a JSON array as the top level object, and each member of the array will be treated as a row of data. E.g.:

>>> from petl import fromjson, look
>>> data = '[{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}, {"foo": "c", "bar": 2}]'
>>> with open('example1.json', 'w') as f:
...     f.write(data)
... 
>>> table1 = fromjson('example1.json')
>>> look(table1)
+--------+--------+
| u'foo' | u'bar' |
+========+========+
| u'a'   | 1      |
+--------+--------+
| u'b'   | 2      |
+--------+--------+
| u'c'   | 2      |
+--------+--------+

If your JSON file does not fit this structure, you will need to parse it via json.load() and select the array to treat as the data, see also fromdicts().

Supports transparent reading from URLs, .gz and .bz2 files.

New in version 0.5.

petl.fromdicts(dicts, header=None)

View a sequence of Python dict as a table. E.g.:

>>> from petl import fromdicts, look
>>> dicts = [{"foo": "a", "bar": 1}, {"foo": "b", "bar": 2}, {"foo": "c", "bar": 2}]
>>> table = fromdicts(dicts)
>>> look(table)
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| 'a'   | 1     |
+-------+-------+
| 'b'   | 2     |
+-------+-------+
| 'c'   | 2     |
+-------+-------+

See also fromjson().

New in version 0.5.

Databases

petl.fromsqlite3(source, query, *args, **kwargs)

Provides access to data from an sqlite3 database file via a given query. E.g.:

>>> import sqlite3
>>> from petl import look, fromsqlite3    
>>> # set up a database to demonstrate with
>>> data = [['a', 1],
...         ['b', 2],
...         ['c', 2.0]]
>>> connection = sqlite3.connect('test.db')
>>> c = connection.cursor()
>>> c.execute('create table foobar (foo, bar)')
<sqlite3.Cursor object at 0x2240b90>
>>> for row in data:
...     c.execute('insert into foobar values (?, ?)', row)
... 
<sqlite3.Cursor object at 0x2240b90>
<sqlite3.Cursor object at 0x2240b90>
<sqlite3.Cursor object at 0x2240b90>
>>> connection.commit()
>>> c.close()
>>>
>>> # now demonstrate the petl.fromsqlite3 function
... foobar = fromsqlite3('test.db', 'select * from foobar')
>>> look(foobar)    
+-------+-------+
| 'foo' | 'bar' |
+=======+=======+
| u'a'  | 1     |
+-------+-------+
| u'b'  | 2     |
+-------+-------+
| u'c'  | 2.0   |
+-------+-------+

Changed in version 0.10.2.

Either a database file name or a connection object can be given as the first argument.

petl.fromdb(dbo, query, *args, **kwargs)

Provides access to data from any DB-API 2.0 connection via a given query. E.g., using sqlite3:

>>> import sqlite3
>>> from petl import look, fromdb
>>> connection = sqlite3.connect('test.db')
>>> table = fromdb(connection, 'select * from foobar')
>>> look(table)

E.g., using psycopg2 (assuming you’ve installed it first):

>>> import psycopg2
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> table = fromdb(connection, 'select * from test')
>>> look(table)

E.g., using MySQLdb (assuming you’ve installed it first):

>>> import MySQLdb
>>> from petl import look, fromdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> table = fromdb(connection, 'select * from test')
>>> look(table)

Changed in version 0.10.2.

The first argument may also be a function that creates a cursor. E.g.:

>>> import psycopg2
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> mkcursor = lambda: connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
>>> table = fromdb(mkcursor, 'select * from test')
>>> look(table)

N.B., each call to the function should return a new cursor.

Changed in version 0.18.

Added support for server-side cursors.

Note that the default behaviour of most database servers and clients is for the entire result set for each query to be sent from the server to the client. If your query returns a large result set this can result in significant memory usage at the client. Some databases support server-side cursors which provide a means for client libraries to fetch result sets incrementally, reducing memory usage at the client.

To use a server-side cursor with a PostgreSQL database, e.g.:

>>> import psycopg2
>>> from petl import look, fromdb
>>> connection = psycopg2.connect("dbname=test user=postgres")
>>> table = fromdb(lambda: connection.cursor(name='arbitrary'), 'select * from test')
>>> look(table)

To use a server-side cursor with a MySQL database, e.g.:

>>> import MySQLdb
>>> from petl import look, fromdb
>>> connection = MySQLdb.connect(passwd="moonpie", db="thangs")
>>> table = fromdb(lambda: connection.cursor(MySQLdb.cursors.SSCursor), 'select * from test')
>>> look(table)

For more information on server-side cursors see the following links: