Extract/Load - reading/writing tables from files, databases and other sources

Extract (read)

The “from…” functions extract a table from a file-like source or database. For everything except petl.io.db.fromdb() the source argument provides information about where to extract the underlying data from. If the source argument is None or a string it is interpreted as follows:

  • None - read from stdin
  • string starting with http://, https:// or ftp:// - read from URL
  • string ending with .gz or .bgz - read from file via gzip decompression
  • string ending with .bz2 - read from file via bz2 decompression
  • any other string - read directly from file
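
For example, the same extract function can read from a local file, a compressed file or a URL, with decompression or download handled automatically. A minimal sketch (the file names and URL are hypothetical):

>>> import petl as etl
>>> tbl1 = etl.fromcsv('example.csv')        # read directly from a local file
>>> tbl2 = etl.fromcsv('example.csv.gz')     # read via gzip decompression
>>> tbl3 = etl.fromcsv('http://example.com/example.csv')  # read from a URL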

Some helper classes are also available for reading from other types of file-like sources, e.g., reading data from a Zip file, a string or a subprocess; see the section on I/O helper classes below for more information.

Be aware that loading data from stdin breaks the table container convention, because data can usually only be read once. If you are sure that data will only be read once in your script or interactive session then this may not be a problem, however note that some petl functions do access the underlying data source more than once and so will not work as expected with data from stdin.

Load (write)

The “to…” functions load data from a table into a file-like source or database. For functions that accept a source argument, if the source argument is None or a string it is interpreted as follows:

  • None - write to stdout
  • string ending with .gz or .bgz - write to file via gzip compression
  • string ending with .bz2 - write to file via bz2 compression
  • any other string - write directly to file
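
For example, the same load function can write a plain or a compressed file depending only on the source string. A minimal sketch (the file names are hypothetical):

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.tocsv(table, 'example.csv')      # write directly to a local file
>>> etl.tocsv(table, 'example.csv.gz')   # write via gzip compression
>>> etl.tocsv(table, 'example.csv.bz2')  # write via bz2 compression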

Some helper classes are also available for writing to other types of file-like sources, e.g., writing to a Zip file or string buffer; see the section on I/O helper classes below for more information.

Python objects

petl.io.base.fromcolumns(cols, header=None, missing=None)[source]

View a sequence of columns as a table, e.g.:

>>> import petl as etl
>>> cols = [[0, 1, 2], ['a', 'b', 'c']]
>>> tbl = etl.fromcolumns(cols)
>>> tbl
+----+-----+
| f0 | f1  |
+====+=====+
|  0 | 'a' |
+----+-----+
|  1 | 'b' |
+----+-----+
|  2 | 'c' |
+----+-----+

If columns are not the same length, values will be padded to the length of the longest column with missing, which is None by default, e.g.:

>>> cols = [[0, 1, 2], ['a', 'b']]
>>> tbl = etl.fromcolumns(cols, missing='NA')
>>> tbl
+----+------+
| f0 | f1   |
+====+======+
|  0 | 'a'  |
+----+------+
|  1 | 'b'  |
+----+------+
|  2 | 'NA' |
+----+------+

See also petl.io.json.fromdicts().

New in version 1.1.0.

Delimited files

petl.io.csv.fromcsv(source=None, encoding=None, errors='strict', header=None, **csvargs)[source]

Extract a table from a delimited file. E.g.:

>>> import petl as etl
>>> import csv
>>> # set up a CSV file to demonstrate with
... table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> with open('example.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows(table1)
...
>>> # now demonstrate the use of fromcsv()
... table2 = etl.fromcsv('example.csv')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

The source argument is the path of the delimited file; all other keyword arguments are passed to csv.reader(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that all data values are strings, and any intended numeric values will need to be converted; see also petl.transform.conversions.convert().
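
For example, to read a pipe-delimited file, override the delimiter and convert the intended numeric values afterwards. A minimal sketch (the file name is hypothetical):

>>> import petl as etl
>>> table3 = etl.fromcsv('example.psv', delimiter='|')
>>> table3 = etl.convert(table3, 'bar', int)  # convert numeric values from strings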

petl.io.csv.tocsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)[source]

Write the table to a CSV file. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tocsv(table1, 'example.csv')
>>> # look what it did
... print(open('example.csv').read())
foo,bar
a,1
b,2
c,2

The source argument is the path of the delimited file, and the optional write_header argument specifies whether to include the field names in the delimited file. All other keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Note that if a file already exists at the given location, it will be overwritten.

petl.io.csv.appendcsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)[source]

Append data rows to an existing CSV file. As petl.io.csv.tocsv() but the file is opened in append mode and the table header is not written by default.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.
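
A minimal sketch of usage, continuing from the tocsv() example above:

>>> import petl as etl
>>> table2 = [['foo', 'bar'],
...           ['d', 7],
...           ['e', 9]]
>>> etl.appendcsv(table2, 'example.csv')  # header row is not written
>>> print(open('example.csv').read())
foo,bar
a,1
b,2
c,2
d,7
e,9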

petl.io.csv.teecsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)[source]

Returns a table that writes rows to a CSV file as they are iterated over.
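
This is useful for writing an intermediate copy of the data part-way through a longer pipeline, e.g. (a minimal sketch; file names are arbitrary):

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2]]
>>> tbl = etl.teecsv(table, 'copy.csv')   # rows will also be written to copy.csv
>>> etl.totsv(tbl, 'example.tsv')         # iterating here writes both files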

petl.io.csv.fromtsv(source=None, encoding=None, errors='strict', header=None, **csvargs)[source]

Convenience function, as petl.io.csv.fromcsv() but with different default dialect (tab delimited).

petl.io.csv.totsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)[source]

Convenience function, as petl.io.csv.tocsv() but with different default dialect (tab delimited).

petl.io.csv.appendtsv(table, source=None, encoding=None, errors='strict', write_header=False, **csvargs)[source]

Convenience function, as petl.io.csv.appendcsv() but with different default dialect (tab delimited).

petl.io.csv.teetsv(table, source=None, encoding=None, errors='strict', write_header=True, **csvargs)[source]

Convenience function, as petl.io.csv.teecsv() but with different default dialect (tab delimited).

Pickle files

petl.io.pickle.frompickle(source=None)[source]

Extract a table from data pickled in the given file. The rows in the table should have been pickled to the file one at a time. E.g.:

>>> import petl as etl
>>> import pickle
>>> # set up a file to demonstrate with
... with open('example.p', 'wb') as f:
...     pickle.dump(['foo', 'bar'], f)
...     pickle.dump(['a', 1], f)
...     pickle.dump(['b', 2], f)
...     pickle.dump(['c', 2.5], f)
...
>>> # now demonstrate the use of frompickle()
... table1 = etl.frompickle('example.p')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' | 2.5 |
+-----+-----+
petl.io.pickle.topickle(table, source=None, protocol=-1, write_header=True)[source]

Write the table to a pickle file. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.topickle(table1, 'example.p')
>>> # look what it did
... table2 = etl.frompickle('example.p')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+

Note that if a file already exists at the given location, it will be overwritten.

The pickle file format preserves type information, i.e., reading and writing is round-trippable for tables with non-string data values.

petl.io.pickle.appendpickle(table, source=None, protocol=-1, write_header=False)[source]

Append data to an existing pickle file. I.e., as petl.io.pickle.topickle() but the file is opened in append mode.

Note that no attempt is made to check that the fields or row lengths are consistent with the existing data; the data rows from the table are simply appended to the file.

petl.io.pickle.teepickle(table, source=None, protocol=-1, write_header=True)[source]

Return a table that writes rows to a pickle file as they are iterated over.

Text files

petl.io.text.fromtext(source=None, encoding=None, errors='strict', strip=None, header=('lines', ))[source]

Extract a table from lines in the given text file. E.g.:

>>> import petl as etl
>>> # setup example file
... text = 'a,1\nb,2\nc,2\n'
>>> with open('example.txt', 'w') as f:
...     f.write(text)
...
12
>>> table1 = etl.fromtext('example.txt')
>>> table1
+-------+
| lines |
+=======+
| 'a,1' |
+-------+
| 'b,2' |
+-------+
| 'c,2' |
+-------+

>>> # post-process, e.g., with capture()
... table2 = table1.capture('lines', '(.*),(.*)$', ['foo', 'bar'])
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

Note that the strip() function is called on each line, which by default will remove leading and trailing whitespace, including the end-of-line character. Use the strip keyword argument to specify alternative characters to strip. Set the strip argument to False to disable this behaviour and leave line endings in place.

petl.io.text.totext(table, source=None, encoding=None, errors='strict', template=None, prologue=None, epilogue=None)[source]

Write the table to a text file. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> prologue = '''{| class="wikitable"
... |-
... ! foo
... ! bar
... '''
>>> template = '''|-
... | {foo}
... | {bar}
... '''
>>> epilogue = '|}'
>>> etl.totext(table1, 'example.txt', template=template,
... prologue=prologue, epilogue=epilogue)
>>> # see what we did
... print(open('example.txt').read())
{| class="wikitable"
|-
! foo
! bar
|-
| a
| 1
|-
| b
| 2
|-
| c
| 2
|}

The template will be used to format each row via str.format.

petl.io.text.appendtext(table, source=None, encoding=None, errors='strict', template=None, prologue=None, epilogue=None)[source]

Append the table to a text file.

petl.io.text.teetext(table, source=None, encoding=None, errors='strict', template=None, prologue=None, epilogue=None)[source]

Return a table that writes rows to a text file as they are iterated over.

XML files

petl.io.xml.fromxml(source, *args, **kwargs)[source]

Extract data from an XML file. E.g.:

>>> import petl as etl
>>> # setup a file to demonstrate with
... d = '''<table>
...     <tr>
...         <td>foo</td><td>bar</td>
...     </tr>
...     <tr>
...         <td>a</td><td>1</td>
...     </tr>
...     <tr>
...         <td>b</td><td>2</td>
...     </tr>
...     <tr>
...         <td>c</td><td>2</td>
...     </tr>
... </table>'''
>>> with open('example1.xml', 'w') as f:
...     f.write(d)
...
212
>>> table1 = etl.fromxml('example1.xml', 'tr', 'td')
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

If the data values are stored in an attribute, provide the attribute name as an extra positional argument:

>>> d = '''<table>
...     <tr>
...         <td v='foo'/><td v='bar'/>
...     </tr>
...     <tr>
...         <td v='a'/><td v='1'/>
...     </tr>
...     <tr>
...         <td v='b'/><td v='2'/>
...     </tr>
...     <tr>
...         <td v='c'/><td v='2'/>
...     </tr>
... </table>'''
>>> with open('example2.xml', 'w') as f:
...     f.write(d)
...
220
>>> table2 = etl.fromxml('example2.xml', 'tr', 'td', 'v')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

Data values can also be extracted by providing a mapping of field names to element paths:

>>> d = '''<table>
...     <row>
...         <foo>a</foo><baz><bar v='1'/><bar v='3'/></baz>
...     </row>
...     <row>
...         <foo>b</foo><baz><bar v='2'/></baz>
...     </row>
...     <row>
...         <foo>c</foo><baz><bar v='2'/></baz>
...     </row>
... </table>'''
>>> with open('example3.xml', 'w') as f:
...     f.write(d)
...
223
>>> table3 = etl.fromxml('example3.xml', 'row',
...                      {'foo': 'foo', 'bar': ('baz/bar', 'v')})
>>> table3
+------------+-----+
| bar        | foo |
+============+=====+
| ('1', '3') | 'a' |
+------------+-----+
| '2'        | 'b' |
+------------+-----+
| '2'        | 'c' |
+------------+-----+

If lxml is installed, full XPath expressions can be used.

Note that the implementation is currently not streaming, i.e., the whole document is loaded into memory.

If multiple elements match a given field, all values are reported as a tuple.

If there is more than one element name used for row values, a tuple or list of paths can be provided, e.g., fromxml('example.html', './/tr', ('th', 'td')).

For writing to an XML file, see petl.io.text.totext().

HTML files

petl.io.html.tohtml(table, source=None, encoding=None, errors='strict', caption=None, vrepr=<class 'str'>, lineterminator='\n', index_header=False, tr_style=None, td_styles=None, truncate=None)[source]

Write the table as HTML to a file. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tohtml(table1, 'example.html', caption='example table')
>>> print(open('example.html').read())
<table class='petl'>
<caption>example table</caption>
<thead>
<tr>
<th>foo</th>
<th>bar</th>
</tr>
</thead>
<tbody>
<tr>
<td>a</td>
<td style='text-align: right'>1</td>
</tr>
<tr>
<td>b</td>
<td style='text-align: right'>2</td>
</tr>
<tr>
<td>c</td>
<td style='text-align: right'>2</td>
</tr>
</tbody>
</table>

The caption keyword argument is used to provide a table caption in the output HTML.

petl.io.html.teehtml(table, source=None, encoding=None, errors='strict', caption=None, vrepr=<class 'str'>, lineterminator='\n', index_header=False, tr_style=None, td_styles=None, truncate=None)[source]

Return a table that writes rows to a Unicode HTML file as they are iterated over.

JSON files

petl.io.json.fromjson(source, *args, **kwargs)[source]

Extract data from a JSON file. The file must contain a JSON array as the top level object, and each member of the array will be treated as a row of data. E.g.:

>>> import petl as etl
>>> data = '''
... [{"foo": "a", "bar": 1},
... {"foo": "b", "bar": 2},
... {"foo": "c", "bar": 2}]
... '''
>>> with open('example.json', 'w') as f:
...     f.write(data)
...
74
>>> table1 = etl.fromjson('example.json', header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+

If your JSON file does not fit this structure, you will need to parse it via json.load() and select the array to treat as the data, see also petl.io.json.fromdicts().
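
For example, if the array of records is nested under a key, parse the document with json.load() and pass the selected array to petl.io.json.fromdicts(). A minimal sketch (the 'rows' key is hypothetical):

>>> import petl as etl
>>> data = {'rows': [{'foo': 'a', 'bar': 1},
...                  {'foo': 'b', 'bar': 2}]}
>>> # in practice, data might come from json.load(open('example.json'))
... table = etl.fromdicts(data['rows'], header=['foo', 'bar'])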

Changed in version 1.1.0.

If no header is specified, fields will be discovered by sampling keys from the first sample objects in source. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.

petl.io.json.fromdicts(dicts, header=None, sample=1000, missing=None)[source]

View a sequence of Python dict as a table. E.g.:

>>> import petl as etl
>>> dicts = [{"foo": "a", "bar": 1},
...          {"foo": "b", "bar": 2},
...          {"foo": "c", "bar": 2}]
>>> table1 = etl.fromdicts(dicts, header=['foo', 'bar'])
>>> table1
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   2 |
+-----+-----+

If header is not specified, sample items from dicts will be inspected to discover dictionary keys. Note that the order in which dictionary keys are discovered may not be stable.

See also petl.io.json.fromjson().

Changed in version 1.1.0.

If no header is specified, fields will be discovered by sampling keys from the first sample dictionaries in dicts. The header will be constructed from keys in the order discovered. Note that this ordering may not be stable, and therefore it may be advisable to specify an explicit header or to use another function like petl.transform.headers.sortheader() on the resulting table to guarantee stability.

petl.io.json.tojson(table, source=None, prefix=None, suffix=None, *args, **kwargs)[source]

Write a table in JSON format, with rows output as JSON objects. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojson(table1, 'example.json', sort_keys=True)
>>> # check what it did
... print(open('example.json').read())
[{"bar": 1, "foo": "a"}, {"bar": 2, "foo": "b"}, {"bar": 2, "foo": "c"}]

Note that this is currently not streaming, all data is loaded into memory before being written to the file.

petl.io.json.tojsonarrays(table, source=None, prefix=None, suffix=None, output_header=False, *args, **kwargs)[source]

Write a table in JSON format, with rows output as JSON arrays. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 2]]
>>> etl.tojsonarrays(table1, 'example.json')
>>> # check what it did
... print(open('example.json').read())
[["a", 1], ["b", 2], ["c", 2]]

Note that this is currently not streaming, all data is loaded into memory before being written to the file.

Databases

Note

The automatic table creation feature of petl.io.db.todb() requires SQLAlchemy to be installed, e.g.:

$ pip install sqlalchemy
petl.io.db.fromdb(dbo, query, *args, **kwargs)[source]

Provides access to data from any DB-API 2.0 connection via a given query. E.g., using sqlite3:

>>> import petl as etl
>>> import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')

E.g., using psycopg2 (assuming you’ve installed it first):

>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')

E.g., using pymysql (assuming you’ve installed it first):

>>> import petl as etl
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> table = etl.fromdb(connection, 'SELECT * FROM example')

The dbo argument may also be a function that creates a cursor. N.B., each call to the function should return a new cursor. E.g.:

>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> mkcursor = lambda: connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
>>> table = etl.fromdb(mkcursor, 'SELECT * FROM example')

The parameter dbo may also be an SQLAlchemy engine, session or connection object.

The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

Note that the default behaviour of most database servers and clients is for the entire result set for each query to be sent from the server to the client. If your query returns a large result set this can result in significant memory usage at the client. Some databases support server-side cursors which provide a means for client libraries to fetch result sets incrementally, reducing memory usage at the client.

To use a server-side cursor with a PostgreSQL database, e.g.:

>>> import petl as etl
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> table = etl.fromdb(lambda: connection.cursor(name='arbitrary'),
...                    'SELECT * FROM example')


petl.io.db.todb(table, dbo, tablename, schema=None, commit=True, create=False, drop=False, constraints=True, metadata=None, dialect=None, sample=1000)[source]

Load data into an existing database table via a DB-API 2.0 connection or cursor. Note that the database table will be truncated, i.e., all existing rows will be deleted prior to inserting the new data. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2],
...          ['c', 2]]
>>> # using sqlite3
... import sqlite3
>>> connection = sqlite3.connect('example.db')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using psycopg2
>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> # assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')
>>> # using pymysql
>>> import pymysql
>>> connection = pymysql.connect(password='moonpie', database='thangs')
>>> # tell MySQL to use standard quote character
... connection.cursor().execute('SET SQL_MODE=ANSI_QUOTES')
>>> # load data, assuming table "foobar" already exists in the database
... etl.todb(table, connection, 'foobar')

N.B., for MySQL the statement SET SQL_MODE=ANSI_QUOTES is required to ensure MySQL uses SQL-92 standard quote characters.

A cursor can also be provided instead of a connection, e.g.:

>>> import psycopg2
>>> connection = psycopg2.connect('dbname=example user=postgres')
>>> cursor = connection.cursor()
>>> etl.todb(table, cursor, 'foobar')

The parameter dbo may also be an SQLAlchemy engine, session or connection object.

The parameter dbo may also be a string, in which case it is interpreted as the name of a file containing an sqlite3 database.

If create=True this function will attempt to automatically create a database table before loading the data. This functionality requires SQLAlchemy to be installed.
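
A minimal sketch of automatic table creation, assuming SQLAlchemy is installed (table and database names are arbitrary):

>>> import petl as etl
>>> import sqlite3
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2]]
>>> connection = sqlite3.connect('example.db')
>>> # drop any existing table "foobar", create it from a sample of the
... # data, then load the rows
... etl.todb(table, connection, 'foobar', create=True, drop=True)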

Keyword arguments:

table : table container
Table data to load
dbo : database object
DB-API 2.0 connection, callable returning a DB-API 2.0 cursor, or SQLAlchemy connection, engine or session
tablename : string
Name of the table in the database
schema : string
Name of the database schema to find the table in
commit : bool
If True commit the changes
create : bool
If True attempt to create the table before loading, inferring types from a sample of the data (requires SQLAlchemy)
drop : bool
If True attempt to drop the table before recreating (only relevant if create=True)
constraints : bool
If True use length and nullable constraints (only relevant if create=True)
metadata : sqlalchemy.MetaData
Custom table metadata (only relevant if create=True)
dialect : string
One of {‘access’, ‘sybase’, ‘sqlite’, ‘informix’, ‘firebird’, ‘mysql’, ‘oracle’, ‘maxdb’, ‘postgresql’, ‘mssql’} (only relevant if create=True)
sample : int
Number of rows to sample when inferring types etc. Set to 0 to use the whole table (only relevant if create=True)

Note

This function is in principle compatible with any DB-API 2.0 compliant database driver. However, at the time of writing some DB-API 2.0 implementations, including cx_Oracle and MySQL’s Connector/Python, are not compatible with this function, because they only accept a list argument to the cursor.executemany() function called internally by petl. This can be worked around by proxying the cursor objects, e.g.:

>>> import cx_Oracle
>>> connection = cx_Oracle.Connection(...)
>>> class CursorProxy(object):
...     def __init__(self, cursor):
...         self._cursor = cursor
...     def executemany(self, statement, parameters, **kwargs):
...         # convert parameters to a list
...         parameters = list(parameters)
...         # pass through to proxied cursor
...         return self._cursor.executemany(statement, parameters, **kwargs)
...     def __getattr__(self, item):
...         return getattr(self._cursor, item)
...
>>> def get_cursor():
...     return CursorProxy(connection.cursor())
...
>>> import petl as etl
>>> etl.todb(tbl, get_cursor, ...)

Note however that this does imply loading the entire table into memory as a list prior to inserting into the database.

petl.io.db.appenddb(table, dbo, tablename, schema=None, commit=True)[source]

Load data into an existing database table via a DB-API 2.0 connection or cursor. As petl.io.db.todb() except that the database table will be appended, i.e., the new data will be inserted into the table, and any existing rows will remain.
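
A minimal sketch of usage, e.g., with sqlite3 (assuming the table "foobar" already exists in the database):

>>> import petl as etl
>>> import sqlite3
>>> table = [['foo', 'bar'],
...          ['d', 7],
...          ['e', 9]]
>>> connection = sqlite3.connect('example.db')
>>> # existing rows are kept and the new rows are inserted after them
... etl.appenddb(table, connection, 'foobar')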

Excel .xls files (xlrd/xlwt)

Note

The following functions require xlrd and xlwt to be installed, e.g.:

$ pip install xlrd xlwt-future
petl.io.xls.fromxls(filename, sheet=None, use_view=True, **kwargs)[source]

Extract a table from a sheet in an Excel .xls file.

Sheet is identified by its name or index number.

N.B., the sheet name is case sensitive.

petl.io.xls.toxls(tbl, filename, sheet, encoding=None, style_compression=0, styles=None)[source]

Write a table to a new Excel .xls file.
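
A minimal sketch of writing and then reading back a .xls file (file and sheet names are arbitrary):

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> etl.toxls(table1, 'example.xls', 'Sheet1')
>>> table2 = etl.fromxls('example.xls', 'Sheet1')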

Excel .xlsx files (openpyxl)

Note

The following functions require openpyxl to be installed, e.g.:

$ pip install openpyxl
petl.io.xlsx.fromxlsx(filename, sheet=None, range_string=None, min_row=None, min_col=None, max_row=None, max_col=None, read_only=False, **kwargs)[source]

Extract a table from a sheet in an Excel .xlsx file.

N.B., the sheet name is case sensitive.

The sheet argument can be omitted, in which case the first sheet in the workbook is used by default.

The range_string argument can be used to provide a range string specifying a range of cells to extract.

The min_row, min_col, max_row and max_col arguments can be used to limit the range of cells to extract. They will be ignored if range_string is provided.

The read_only argument determines how openpyxl returns the loaded workbook. The default is False, which prevents some LibreOffice files from being truncated at 65536 rows. Setting it to True should be faster if the file is only being read and was created with Microsoft Excel.

Any other keyword arguments are passed through to openpyxl.load_workbook().
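
For example, to extract only part of a worksheet, the cell range can be limited either with the row/column arguments or with a range string. A minimal sketch (file and sheet names are hypothetical):

>>> import petl as etl
>>> tbl1 = etl.fromxlsx('example.xlsx', sheet='Sheet1',
...                     min_row=1, min_col=1, max_row=100, max_col=3)
>>> # equivalently, using a range string
... tbl2 = etl.fromxlsx('example.xlsx', sheet='Sheet1', range_string='A1:C100')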

petl.io.xlsx.toxlsx(tbl, filename, sheet=None, write_header=True, mode='replace')[source]

Write a table to a new Excel .xlsx file.

N.B., the sheet name is case sensitive.

The mode argument controls how the file and sheet are treated:

  • replace: the default. Replaces or adds the named sheet; if no sheet name is provided, all sheets are replaced (i.e., the entire file is overwritten).
  • overwrite: Always overwrites the file. This produces a file with a single sheet.
  • add: Adds a new sheet. Raises ValueError if a named sheet already exists.

The sheet argument can be omitted in all cases. The new sheet will then get a default name. If the file does not exist, it will be created, unless replace mode is used with a named sheet. In the latter case, the file must exist and be a valid .xlsx file.
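
A minimal sketch of the different modes (file and sheet names are arbitrary):

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> etl.toxlsx(table1, 'example.xlsx', mode='overwrite')      # create/overwrite the file with a single sheet
>>> etl.toxlsx(table1, 'example.xlsx', 'Extra', mode='add')   # add a new sheet named 'Extra'
>>> etl.toxlsx(table1, 'example.xlsx', 'Extra')               # replace the named sheet (default mode)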

petl.io.xlsx.appendxlsx(tbl, filename, sheet=None, write_header=False)[source]

Appends rows to an existing Excel .xlsx file.

Arrays (NumPy)

Note

The following functions require numpy to be installed, e.g.:

$ pip install numpy
petl.io.numpy.fromarray(a)[source]

Extract a table from a numpy structured array, e.g.:

>>> import petl as etl
>>> import numpy as np
>>> a = np.array([('apples', 1, 2.5),
...               ('oranges', 3, 4.4),
...               ('pears', 7, 0.1)],
...              dtype='U8, i4,f4')
>>> table = etl.fromarray(a)
>>> table
+-----------+----+-----+
| f0        | f1 | f2  |
+===========+====+=====+
| 'apples'  | 1  | 2.5 |
+-----------+----+-----+
| 'oranges' | 3  | 4.4 |
+-----------+----+-----+
| 'pears'   | 7  | 0.1 |
+-----------+----+-----+
petl.io.numpy.toarray(table, dtype=None, count=-1, sample=1000)[source]

Load data from the given table into a numpy structured array. E.g.:

>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> a = etl.toarray(table)
>>> a
array([('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)],
      dtype=(numpy.record, [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')]))
>>> # the dtype can be specified as a string
... a = etl.toarray(table, dtype='a4, i2, f4')
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i2'), ('baz', '<f4')])
>>> # the dtype can also be partially specified
... a = etl.toarray(table, dtype={'foo': 'a4'})
>>> a
array([(b'appl', 1, 2.5), (b'oran', 3, 4.4), (b'pear', 7, 0.1)],
      dtype=[('foo', 'S4'), ('bar', '<i8'), ('baz', '<f8')])

If the dtype is not completely specified, sample rows will be examined to infer an appropriate dtype.

petl.io.numpy.torecarray(*args, **kwargs)[source]

Convenient shorthand for toarray(*args, **kwargs).view(np.recarray).
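
A record array allows column access via attributes, e.g. (a minimal sketch, reusing the table from the toarray() examples above):

>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> ra = etl.torecarray(table)
>>> ra.bar
array([1, 3, 7])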

petl.io.numpy.valuestoarray(vals, dtype=None, count=-1, sample=1000)[source]

Load values from a table column into a numpy array, e.g.:

>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> table = etl.wrap(table)
>>> table.values('bar').array()
array([1, 3, 7])
>>> # specify dtype
... table.values('bar').array(dtype='i4')
array([1, 3, 7], dtype=int32)

DataFrames (pandas)

Note

The following functions require pandas to be installed, e.g.:

$ pip install pandas
petl.io.pandas.fromdataframe(df, include_index=False)[source]

Extract a table from a pandas DataFrame. E.g.:

>>> import petl as etl
>>> import pandas as pd
>>> records = [('apples', 1, 2.5), ('oranges', 3, 4.4), ('pears', 7, 0.1)]
>>> df = pd.DataFrame.from_records(records, columns=('foo', 'bar', 'baz'))
>>> table = etl.fromdataframe(df)
>>> table
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+
petl.io.pandas.todataframe(table, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)[source]

Load data from the given table into a pandas DataFrame. E.g.:

>>> import petl as etl
>>> table = [('foo', 'bar', 'baz'),
...          ('apples', 1, 2.5),
...          ('oranges', 3, 4.4),
...          ('pears', 7, .1)]
>>> df = etl.todataframe(table)
>>> df
       foo  bar  baz
0   apples    1  2.5
1  oranges    3  4.4
2    pears    7  0.1

HDF5 files (PyTables)

Note

The following functions require PyTables to be installed, e.g.:

$ # install HDF5
$ apt-get install libhdf5-7 libhdf5-dev
$ # install other prerequisites
$ pip install cython
$ pip install numpy
$ pip install numexpr
$ # install PyTables
$ pip install tables
petl.io.pytables.fromhdf5(source, where=None, name=None, condition=None, condvars=None, start=None, stop=None, step=None)[source]

Provides access to an HDF5 table. E.g.:

>>> import petl as etl
>>> import tables
>>>
>>> # set up a new hdf5 table to demonstrate with
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
>>> #
>>> def setup_hdf5_table():
...     import tables
...     h5file = tables.open_file('example.h5', mode='w',
...                               title='Example file')
...     h5file.create_group('/', 'testgroup', 'Test Group')
...     h5table = h5file.create_table('/testgroup', 'testtable', FooBar,
...                                   'Test Table')
...     # load some data into the table
...     table1 = (('foo', 'bar'),
...               (1, b'asdfgh'),
...               (2, b'qwerty'),
...               (3, b'zxcvbn'))
...     for row in table1[1:]:
...         for i, f in enumerate(table1[0]):
...             h5table.row[f] = row[i]
...         h5table.row.append()
...     h5file.flush()
...     h5file.close()
>>>
>>> setup_hdf5_table() 
>>>
>>> # now demonstrate use of fromhdf5
>>> table1 = etl.fromhdf5('example.h5', '/testgroup', 'testtable') 
>>> table1 
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+

>>> # alternatively just specify path to table node
... table1 = etl.fromhdf5('example.h5', '/testgroup/testtable') 
>>> # ...or use an existing tables.File object
... h5file = tables.open_file('example.h5') 
>>> table1 = etl.fromhdf5(h5file, '/testgroup/testtable') 
>>> # ...or use an existing tables.Table object
... h5tbl = h5file.get_node('/testgroup/testtable') 
>>> table1 = etl.fromhdf5(h5tbl) 
>>> # use a condition to filter data
... table2 = etl.fromhdf5(h5tbl, condition='foo < 3') 
>>> table2 
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+

>>> h5file.close() 
petl.io.pytables.fromhdf5sorted(source, where=None, name=None, sortby=None, checkCSI=False, start=None, stop=None, step=None)[source]

Provides access to an HDF5 table, sorted by an indexed column, e.g.:

>>> import petl as etl
>>> import tables
>>>
>>> # set up a new hdf5 table to demonstrate with
>>> class FooBar(tables.IsDescription):
...     foo = tables.Int32Col(pos=0)
...     bar = tables.StringCol(6, pos=2)
>>>
>>> def setup_hdf5_index():
...     import tables
...     h5file = tables.open_file('example.h5', mode='w',
...                               title='Example file')
...     h5file.create_group('/', 'testgroup', 'Test Group')
...     h5table = h5file.create_table('/testgroup', 'testtable', FooBar,
...                                   'Test Table')
...     # load some data into the table
...     table1 = (('foo', 'bar'),
...               (1, b'asdfgh'),
...               (2, b'qwerty'),
...               (3, b'zxcvbn'))
...     for row in table1[1:]:
...         for i, f in enumerate(table1[0]):
...             h5table.row[f] = row[i]
...         h5table.row.append()
...     h5table.cols.foo.create_csindex()  # CS index is required
...     h5file.flush()
...     h5file.close()
>>>
>>> setup_hdf5_index() 
>>>
>>> # access the data, sorted by the indexed column
... table2 = etl.fromhdf5sorted('example.h5', '/testgroup', 'testtable', sortby='foo') 
>>> table2 
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'zxcvbn' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'asdfgh' |
+-----+-----------+
petl.io.pytables.tohdf5(table, source, where=None, name=None, create=False, drop=False, description=None, title='', filters=None, expectedrows=10000, chunkshape=None, byteorder=None, createparents=False, sample=1000)[source]

Write to an HDF5 table. If create is False, assumes the table already exists, and attempts to truncate it before loading. If create is True, a new table will be created, and if drop is True, any existing table will be dropped first. If description is None, the description will be guessed. E.g.:

>>> import petl as etl
>>> table1 = (('foo', 'bar'),
...           (1, b'asdfgh'),
...           (2, b'qwerty'),
...           (3, b'zxcvbn'))
>>> etl.tohdf5(table1, 'example.h5', '/testgroup', 'testtable',
...            drop=True, create=True, createparents=True) 
>>> etl.fromhdf5('example.h5', '/testgroup', 'testtable') 
+-----+-----------+
| foo | bar       |
+=====+===========+
|   1 | b'asdfgh' |
+-----+-----------+
|   2 | b'qwerty' |
+-----+-----------+
|   3 | b'zxcvbn' |
+-----+-----------+
petl.io.pytables.appendhdf5(table, source, where=None, name=None)[source]

As petl.io.pytables.tohdf5() but don’t truncate the target table before loading.

Bcolz ctables

Note

The following functions require bcolz to be installed, e.g.:

$ pip install bcolz
petl.io.bcolz.frombcolz(source, expression=None, outcols=None, limit=None, skip=0)[source]

Extract a table from a bcolz ctable, e.g.:

>>> import petl as etl
>>>
>>> def example_from_bcolz():
...     import bcolz
...     cols = [
...         ['apples', 'oranges', 'pears'],
...         [1, 3, 7],
...         [2.5, 4.4, .1]
...     ]
...     names = ('foo', 'bar', 'baz')
...     ctbl = bcolz.ctable(cols, names=names)
...     return etl.frombcolz(ctbl)
>>>
>>> example_from_bcolz() 
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'apples'  |   1 | 2.5 |
+-----------+-----+-----+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+

If expression is provided it will be executed by bcolz and only matching rows returned, e.g.:

>>> tbl2 = etl.frombcolz(ctbl, expression='bar > 1') 
>>> tbl2 
+-----------+-----+-----+
| foo       | bar | baz |
+===========+=====+=====+
| 'oranges' |   3 | 4.4 |
+-----------+-----+-----+
| 'pears'   |   7 | 0.1 |
+-----------+-----+-----+

New in version 1.1.0.

petl.io.bcolz.tobcolz(table, dtype=None, sample=1000, **kwargs)[source]

Load data into a bcolz ctable, e.g.:

>>> import petl as etl
>>>
>>> def example_to_bcolz():
...     table = [('foo', 'bar', 'baz'),
...              ('apples', 1, 2.5),
...              ('oranges', 3, 4.4),
...              ('pears', 7, .1)]
...     return etl.tobcolz(table)
>>> 
>>> ctbl = example_to_bcolz() 
>>> ctbl 
ctable((3,), [('foo', '<U7'), ('bar', '<i8'), ('baz', '<f8')])
  nbytes: 132; cbytes: 1023.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[('apples', 1, 2.5) ('oranges', 3, 4.4) ('pears', 7, 0.1)]
>>> ctbl.names 
['foo', 'bar', 'baz']
>>> ctbl['foo'] 
carray((3,), <U7)
  nbytes := 84; cbytes := 511.98 KB; ratio: 0.00
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  chunklen := 18724; chunksize: 524272; blocksize: 0
['apples' 'oranges' 'pears']

Other keyword arguments are passed through to the ctable constructor.

New in version 1.1.0.

petl.io.bcolz.appendbcolz(table, obj, check_names=True)[source]

Append data into a bcolz ctable. The obj argument can be either an existing ctable or the name of a directory where an on-disk ctable is stored.

New in version 1.1.0.

Text indexes (Whoosh)

Note

The following functions require Whoosh to be installed, e.g.:

$ pip install whoosh
petl.io.whoosh.fromtextindex(index_or_dirname, indexname=None, docnum_field=None)[source]

Extract all documents from a Whoosh index. E.g.:

>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                 content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in(dirname, schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"First document", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b",
...                     content=u"The second one is even more interesting!")
>>> writer.commit()
>>> # extract documents as a table
... table = etl.fromtextindex(dirname)
>>> table
+------+-------------------+
| path | title             |
+======+===================+
| '/a' | 'First document'  |
+------+-------------------+
| '/b' | 'Second document' |
+------+-------------------+

Keyword arguments:

index_or_dirname
Either an instance of whoosh.index.Index or a string containing the directory path where the index is stored.
indexname
String containing the name of the index, if multiple indexes are stored in the same directory.
docnum_field
If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
petl.io.whoosh.searchtextindex(index_or_dirname, query, limit=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)[source]

Search a Whoosh index using a query. E.g.:

>>> import petl as etl
>>> import os
>>> # set up an index and load some documents via the Whoosh API
... from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True),
...                            content=TEXT)
>>> dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> index = create_in('example.whoosh', schema)
>>> writer = index.writer()
>>> writer.add_document(title=u"Oranges", path=u"/a",
...                     content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Apples", path=u"/b",
...                     content=u"The second document is even more "
...                             u"interesting!")
>>> writer.commit()
>>> # demonstrate the use of searchtextindex()
... table1 = etl.searchtextindex('example.whoosh', 'oranges')
>>> table1
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+

>>> table2 = etl.searchtextindex('example.whoosh', 'doc*')
>>> table2
+------+-----------+
| path | title     |
+======+===========+
| '/a' | 'Oranges' |
+------+-----------+
| '/b' | 'Apples'  |
+------+-----------+

Keyword arguments:

index_or_dirname
Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
query
Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
limit
Return at most limit results.
indexname
String containing the name of the index, if multiple indexes are stored in the same directory.
docnum_field
If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
score_field
If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
fieldboosts
An optional dictionary mapping field names to boosts.
search_kwargs
Any extra keyword arguments to be passed through to the Whoosh search() method.
petl.io.whoosh.searchtextindexpage(index_or_dirname, query, pagenum, pagelen=10, indexname=None, docnum_field=None, score_field=None, fieldboosts=None, search_kwargs=None)[source]

Search an index using a query, returning a result page.

Keyword arguments:

index_or_dirname
Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
query
Either a string or an instance of whoosh.query.Query. If a string, it will be parsed as a multi-field query, i.e., any terms not bound to a specific field will match any field.
pagenum
Number of the page to return (e.g., 1 = first page).
pagelen
Number of results per page.
indexname
String containing the name of the index, if multiple indexes are stored in the same directory.
docnum_field
If not None, an extra field will be added to the output table containing the internal document number stored in the index. The name of the field will be the value of this argument.
score_field
If not None, an extra field will be added to the output table containing the score of the result. The name of the field will be the value of this argument.
fieldboosts
An optional dictionary mapping field names to boosts.
search_kwargs
Any extra keyword arguments to be passed through to the Whoosh search() method.
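
A minimal sketch of usage, continuing from the searchtextindex() example above:

>>> import petl as etl
>>> # fetch the first page of results, ten per page
... table3 = etl.searchtextindexpage('example.whoosh', 'doc*', pagenum=1, pagelen=10)
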
petl.io.whoosh.totextindex(table, index_or_dirname, schema=None, indexname=None, merge=False, optimize=False)[source]

Load all rows from table into a Whoosh index. N.B., this will clear any existing data in the index before loading. E.g.:

>>> import petl as etl
>>> import datetime
>>> import os
>>> # here is the table we want to load into an index
... table = (('f0', 'f1', 'f2', 'f3', 'f4'),
...          ('AAA', 12, 4.3, True, datetime.datetime.now()),
...          ('BBB', 6, 3.4, False, datetime.datetime(1900, 1, 31)),
...          ('CCC', 42, 7.8, True, datetime.datetime(2100, 12, 25)))
>>> # define a schema for the index
... from whoosh.fields import *
>>> schema = Schema(f0=TEXT(stored=True),
...                 f1=NUMERIC(int, stored=True),
...                 f2=NUMERIC(float, stored=True),
...                 f3=BOOLEAN(stored=True),
...                 f4=DATETIME(stored=True))
>>> # load index
... dirname = 'example.whoosh'
>>> if not os.path.exists(dirname):
...     os.mkdir(dirname)
...
>>> etl.totextindex(table, dirname, schema=schema)

Keyword arguments:

table
A table container with the data to be loaded.
index_or_dirname
Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
schema
Index schema to use if creating the index.
indexname
String containing the name of the index, if multiple indexes are stored in the same directory.
merge
Merge small segments during commit?
optimize
Merge all segments together?
petl.io.whoosh.appendtextindex(table, index_or_dirname, indexname=None, merge=True, optimize=False)[source]

Load all rows from table into a Whoosh index, adding them to any existing data in the index.

Keyword arguments:

table
A table container with the data to be loaded.
index_or_dirname
Either an instance of whoosh.index.Index or a string containing the directory path where the index is to be stored.
indexname
String containing the name of the index, if multiple indexes are stored in the same directory.
merge
Merge small segments during commit?
optimize
Merge all segments together?

Avro files (fastavro)

Note

The following functions require fastavro to be installed, e.g.:

$ pip install fastavro
petl.io.avro.fromavro(source, limit=None, skips=0, **avro_args)[source]

Extract a table from the records of an Avro file.

The source argument (string, file-like object or fastavro.reader) can be either the path of the file, a file-like input stream or an instance of fastavro.reader.

The limit and skips arguments can be used to limit the range of rows to extract.

The sample argument (int, optional) defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed.

The fields read from the file can have scalar values like int, string, float, datetime, date and decimal, but can also have compound types like enum, array, map, union and record. The field types can also have recursive structures defined in complex schemas.

Logical types are also read and translated to the corresponding Python types: long timestamp-millis and long timestamp-micros to datetime.datetime, int date to datetime.date, bytes decimal and fixed decimal to Decimal, and int time-millis and long time-micros to datetime.time.

Example usage for reading files:

>>> # set up an Avro file to demonstrate with
...
>>> schema1 = {
...     'doc': 'Some people records.',
...     'name': 'People',
...     'namespace': 'test',
...     'type': 'record',
...     'fields': [
...         {'name': 'name', 'type': 'string'},
...         {'name': 'friends', 'type': 'int'},
...         {'name': 'age', 'type': 'int'},
...     ]
... }
...
>>> records1 = [
...     {'name': 'Bob', 'friends': 42, 'age': 33},
...     {'name': 'Jim', 'friends': 13, 'age': 69},
...     {'name': 'Joe', 'friends': 86, 'age': 17},
...     {'name': 'Ted', 'friends': 23, 'age': 51}
... ]
...
>>> import fastavro
>>> parsed_schema1 = fastavro.parse_schema(schema1)
>>> with open('example-file-to-read.avro', 'wb') as f1:
...     fastavro.writer(f1, parsed_schema1, records1)
...
>>> # now demonstrate the use of fromavro()
>>> import petl as etl
>>> tbl1 = etl.fromavro('example-file-to-read.avro')
>>> tbl1
+-------+---------+-----+
| name  | friends | age |
+=======+=========+=====+
| 'Bob' |      42 |  33 |
+-------+---------+-----+
| 'Jim' |      13 |  69 |
+-------+---------+-----+
| 'Joe' |      86 |  17 |
+-------+---------+-----+
| 'Ted' |      23 |  51 |
+-------+---------+-----+

New in version 1.4.0.

petl.io.avro.toavro(table, target, schema=None, sample=9, codec='deflate', compression_level=None, **avro_args)[source]

Write the table into a new Avro file according to the schema passed.

This method assumes that each column has values of the same type for all rows of the source table.

Apache Avro is a data serialization framework. It is used for data serialization (especially in the Hadoop ecosystem), for data exchange with databases (e.g., Redshift) and in RPC protocols (e.g., Kafka). It has libraries for many languages and is generally faster and safer than text formats like JSON, XML or CSV.

The target argument is the file path for creating the avro file. Note that if a file already exists at the given location, it will be overwritten.

The schema argument (dict) defines the rows field structure of the file. Check fastavro documentation and Avro schema reference for details.

The sample argument (int, optional) defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed.

The codec argument (string, optional) sets the compression codec used to shrink data in the file. It can be ‘null’, ‘deflate’ (the default), ‘bzip2’, or, if installed, ‘snappy’, ‘zstandard’, ‘lz4’ or ‘xz’.

The compression_level argument (int, optional) sets the level of compression to use with the specified codec (if the codec supports it)

Additionally, extra options can be passed via the **avro_args argument; they are forwarded directly to fastavro. Check the fastavro documentation for reference.

The avro file format preserves type information, i.e., reading and writing is round-trippable for tables with non-string data values. However the conversion from Python value types to avro fields is not perfect. Use the schema argument to define proper type to the conversion.

The following avro types are supported by the schema: null, boolean, string, int, long, float, double, bytes, fixed, enum, array, map, union, record, and recursive types defined in complex schemas.

Logical types are also supported and translated to the corresponding Python types: long timestamp-millis, long timestamp-micros, int date, bytes decimal, fixed decimal, string uuid, int time-millis, long time-micros.

Example usage for writing files:

>>> # set up a table and schema to write to an Avro file
>>> table2 = [['name', 'friends', 'age'],
...           ['Bob', 42, 33],
...           ['Jim', 13, 69],
...           ['Joe', 86, 17],
...           ['Ted', 23, 51]]
...
>>> schema2 = {
...     'doc': 'Some people records.',
...     'name': 'People',
...     'namespace': 'test',
...     'type': 'record',
...     'fields': [
...         {'name': 'name', 'type': 'string'},
...         {'name': 'friends', 'type': 'int'},
...         {'name': 'age', 'type': 'int'},
...     ]
... }
...
>>> # now demonstrate writing with toavro()
>>> import petl as etl
>>> etl.toavro(table2, 'example-file-to-write.avro', schema=schema2)
...
>>> # check what was saved above
>>> tbl2 = etl.fromavro('example-file-to-write.avro')
>>> tbl2
+-------+---------+-----+
| name  | friends | age |
+=======+=========+=====+
| 'Bob' |      42 |  33 |
+-------+---------+-----+
| 'Jim' |      13 |  69 |
+-------+---------+-----+
| 'Joe' |      86 |  17 |
+-------+---------+-----+
| 'Ted' |      23 |  51 |
+-------+---------+-----+

New in version 1.4.0.

petl.io.avro.appendavro(table, target, schema=None, sample=9, **avro_args)[source]

Append rows to an existing Avro file, or create a new one.

The target argument can be either an existing Avro file or the file path for creating a new one.

The schema argument is checked against the schema of the existing file. So it must be the same schema as used by toavro() or the schema of the existing file.

The sample argument (int, optional) defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed.

Additionally, extra options can be passed via the **avro_args argument; they are forwarded directly to fastavro. Check the fastavro documentation for reference.

See petl.io.avro.toavro() method for more information and examples.

New in version 1.4.0.
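
A minimal sketch, appending more rows to the file written in the toavro() example above:

>>> import petl as etl
>>> table3 = [['name', 'friends', 'age'],
...           ['Pam', 51, 29]]
>>> etl.appendavro(table3, 'example-file-to-write.avro')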

Avro schema for logical types

logical_schema = {
    'fields': [
        {
            'name': 'date',
            'type': {'type': 'int', 'logicalType': 'date'}
        },
        {
            'name': 'datetime',
            'type': {'type': 'long', 'logicalType': 'timestamp-millis'}
        },
        {
            'name': 'datetime2',
            'type': {'type': 'long', 'logicalType': 'timestamp-micros'}
        },
        {
            'name': 'uuid',
            'type': {'type': 'string', 'logicalType': 'uuid'}
        },
        {
            'name': 'time',
            'type': {'type': 'int', 'logicalType': 'time-millis'}
        },
        {
            'name': 'time2',
            'type': {'type': 'long', 'logicalType': 'time-micros'}
        },
        {
            'name': 'Decimal',
            'type': 
                {
                   'type': 'bytes', 'logicalType': 'decimal',
                   'precision': 15, 'scale': 6
                }
        },
        {
            'name': 'Decimal2',
            'type': 
                {
                   'type': 'fixed', 'size': 8,
                   'logicalType': 'decimal', 'precision': 15, 'scale': 3
                }
        }
    ],
    'namespace': 'namespace',
    'name': 'name',
    'type': 'record'
}
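
A schema like this can be passed to toavro() via the schema argument, with the table supplying matching Python values. A minimal sketch (using logical_schema as defined above; the values are illustrative):

import datetime
from decimal import Decimal

import petl as etl

logical_table = [
    ('date', 'datetime', 'datetime2', 'uuid', 'time', 'time2', 'Decimal', 'Decimal2'),
    (datetime.date(2020, 1, 31),
     datetime.datetime(2020, 1, 31, 12, 30, 0),
     datetime.datetime(2020, 1, 31, 12, 30, 0),
     '12345678-1234-5678-1234-567812345678',
     datetime.time(12, 30, 0),
     datetime.time(12, 30, 0),
     Decimal('123.456789'),
     Decimal('123.456')),
]
etl.toavro(logical_table, 'example-logical.avro', schema=logical_schema)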

Avro schema with nullable fields

schema0 = {
    'doc': 'Nullable records.',
    'name': 'anyone',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': ['null', 'string']},
        {'name': 'friends', 'type': ['null', 'int']},
        {'name': 'age', 'type': ['null', 'int']},
    ],
}

Avro schema with array values in fields

schema5 = {
    'name': 'palettes',
    'namespace': 'color',
    'type': 'record',
    'fields': [
        {'name': 'palette', 'type': 'string'},
        {'name': 'colors',
            'type': ['null', {'type': 'array', 'items': 'string'}]
         }
    ],
}

Example of recursive complex Avro schema

schema6 = {
    'fields': [
        {
            'name': 'array_string',
            'type': {'type': 'array', 'items': 'string'}
        },
        {
            'name': 'array_record',
            'type': {'type': 'array', 'items': {
                'type': 'record',
                'name': 'some_record',
                'fields': [
                    {
                        'name': 'f1',
                        'type': 'string'
                    },
                    {
                        'name': 'f2',
                        'type': {'type': 'bytes',
                                 'logicalType': 'decimal',
                                 'precision': 18,
                                 'scale': 6, }
                    }
                ]
            }
            }
        },
        {
            'name': 'nulable_date',
            'type': ['null', {'type': 'int',
                              'logicalType': 'date'}]
        },
        {
            'name': 'multi_union_time',
            'type': ['null', 'string', {'type': 'long',
                                        'logicalType': 'timestamp-micros'}]
        },
        {
            'name': 'array_bytes_decimal',
            'type': ['null', {'type': 'array',
                              'items': {'type': 'bytes',
                                        'logicalType': 'decimal',
                                        'precision': 18,
                                        'scale': 6, }
                              }]
        },
        {
            'name': 'array_fixed_decimal',
            'type': ['null', {'type': 'array',
                              'items': {'type': 'fixed',
                                        'name': 'FixedDecimal',
                                        'size': 8,
                                        'logicalType': 'decimal',
                                        'precision': 18,
                                        'scale': 6, }
                              }]
        },
    ],
    'namespace': 'namespace',
    'name': 'name',
    'type': 'record'
}

I/O helper classes

The following classes are helpers for extract (from...()) and load (to...()) functions that use a file-like data source.

An instance of any of the following classes can be used as the source argument to data extraction functions like petl.io.csv.fromcsv() etc., with the exception of petl.io.sources.StdoutSource which is write-only.

An instance of any of the following classes can also be used as the source argument to data loading functions like petl.io.csv.tocsv() etc., with the exception of petl.io.sources.StdinSource, petl.io.sources.URLSource and petl.io.sources.PopenSource which are read-only.

The behaviour of each source can usually be configured by passing arguments to the constructor, see the source code of the petl.io.sources module for full details.

class petl.io.sources.StdinSource[source]
class petl.io.sources.StdoutSource[source]
class petl.io.sources.MemorySource(s=None)[source]

Memory data source. E.g.:

>>> import petl as etl
>>> data = b'foo,bar\na,1\nb,2\nc,2\n'
>>> source = etl.MemorySource(data)
>>> tbl = etl.fromcsv(source)
>>> tbl
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

>>> sink = etl.MemorySource()
>>> tbl.tojson(sink)
>>> sink.getvalue()
b'[{"foo": "a", "bar": "1"}, {"foo": "b", "bar": "2"}, {"foo": "c", "bar": "2"}]'

Also supports appending.

class petl.io.sources.PopenSource(*args, **kwargs)[source]
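
For example, CSV data produced by an external command can be read via PopenSource. A minimal sketch, assuming the constructor arguments are passed through to subprocess.Popen (the command is hypothetical):

>>> import petl as etl
>>> from petl.io.sources import PopenSource
>>> # read CSV data written to stdout by an external command
... tbl = etl.fromcsv(PopenSource('cat example.csv', shell=True))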

Remote I/O helper classes

The following classes are helpers that allow the reading (from...()) and writing (to...()) functions to work transparently with remote filesystems as file-like sources.

There is no need to instantiate them; they are used by the mechanism described in Extract (read) and Load (write) above.

It is possible to read and write simply by prefixing the source path of the file with the protocol (e.g., s3://).

class petl.io.remotes.RemoteSource(url, **kwargs)[source]

Read or write directly from files in remote filesystems.

This source handles many filesystems that are selected based on the protocol passed in the url argument.

The url should be specified in the to...() and from...() functions. E.g.:

>>> import petl as etl
>>> 
>>> def example_s3():
...     url = 's3://mybucket/prefix/to/myfilename.csv'
...     data = b'foo,bar\na,1\nb,2\nc,2\n'
... 
...     etl.tocsv(data, url)
...     tbl = etl.fromcsv(url)
...     
>>> example_s3() 
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

This source uses fsspec to provide the data transfer with the remote filesystem. Check the Built-in Implementations for available remote implementations.

Some filesystems can use URL chaining for compound I/O.

Note

This source requires fsspec to be installed, e.g.:

$ pip install fsspec

Some remote filesystems require additional packages to be installed. Check the fsspec Known Implementations documentation to see which packages are needed, e.g.:

$ pip install s3fs     # AWS S3
$ pip install gcsfs    # Google Cloud Storage
$ pip install adlfs    # Azure Blob service
$ pip install paramiko # SFTP
$ pip install requests # HTTP, github

New in version 1.6.0.

class petl.io.remotes.SMBSource(url, **kwargs)[source]

Downloads or uploads to Windows and Samba network drives. E.g.:

>>> def example_smb():
...     import petl as etl
...     url = 'smb://user:password@server/share/folder/file.csv'
...     data = b'foo,bar\na,1\nb,2\nc,2\n'
...     etl.tocsv(data, url)
...     tbl = etl.fromcsv(url)
... 
>>> example_smb() 
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' | '1' |
+-----+-----+
| 'b' | '2' |
+-----+-----+
| 'c' | '2' |
+-----+-----+

The url argument (str) must be a URI with the format: smb://workgroup;user:password@server:port/share/folder/file.csv.

Note that you need to pass in a valid hostname or IP address for the host component of the URL. Do not use the Windows/NetBIOS machine name for the host component.

The first component of the path in the URL points to the name of the shared folder. Subsequent path components will point to the directory/folder/file.

Note

This source requires smbprotocol to be installed, e.g.:

$ pip install smbprotocol[kerberos]

New in version 1.5.0.

Deprecated I/O helper classes

The following helpers are deprecated and will be removed in a future version.

It’s functionality was replaced by helpers in Remote helpers.

class petl.io.sources.FileSource(filename, **kwargs)[source]
class petl.io.sources.GzipSource(filename, remote=False, **kwargs)[source]
class petl.io.sources.BZ2Source(filename, remote=False, **kwargs)[source]
class petl.io.sources.ZipSource(filename, membername, pwd=None, **kwargs)[source]
class petl.io.sources.URLSource(*args, **kwargs)[source]

Custom I/O helper classes

For creating custom helpers for remote I/O or compression use the following functions:

petl.io.sources.register_reader(protocol, handler_class)[source]

Register handler for automatic reading using a remote protocol.

Use of the handler is determined by matching the protocol with the scheme part of the url used in from...() functions (e.g., http://).

New in version 1.5.0.

petl.io.sources.register_writer(protocol, handler_class)[source]

Register handler for automatic writing using a remote protocol.

Use of the handler is determined by matching the protocol with the scheme part of the url used in to...() functions (e.g., smb://).

New in version 1.5.0.

petl.io.sources.get_reader(protocol)[source]

Retrieve the handler responsible for reading from a remote protocol.

New in version 1.6.0.

petl.io.sources.get_writer(protocol)[source]

Retrieve the handler responsible for writing to a remote protocol.

New in version 1.6.0.

See the source code of the classes in petl.io.sources module for more details.