Utility functions

Basic utilities

petl.util.base.header(table)[source]

Return the header row for the given table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.header(table)
('foo', 'bar')

Note that the header row will always be returned as a tuple, regardless of what the underlying data are.

petl.util.base.fieldnames(table)[source]

Return the string values of the header row. If the header row contains only strings, then this function is equivalent to header(), i.e.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.fieldnames(table)
('foo', 'bar')
>>> etl.header(table)
('foo', 'bar')

petl.util.base.data(table, *sliceargs)[source]

Return a container supporting iteration over data rows in a given table (i.e., without the header). E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.data(table)
>>> list(d)
[['a', 1], ['b', 2]]

Positional arguments can be used to slice the data rows. The sliceargs are passed to itertools.islice().
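The slice semantics are exactly those of itertools.islice(), which can be sketched in plain Python (the rows list here is a hypothetical stand-in for the data rows of a table):

```python
from itertools import islice

rows = [['a', 1], ['b', 2], ['c', 3]]

# data(table, 2) keeps the first two data rows, like islice(rows, 2)
print(list(islice(rows, 2)))     # [['a', 1], ['b', 2]]

# data(table, 1, 3) keeps data rows 1 and 2, like islice(rows, 1, 3)
print(list(islice(rows, 1, 3)))  # [['b', 2], ['c', 3]]
```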

petl.util.base.values(table, *field, **kwargs)[source]

Return a container supporting iteration over values in a given field or fields. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', True],
...           ['b'],
...           ['b', True],
...           ['c', False]]
>>> foo = etl.values(table1, 'foo')
>>> foo
foo: 'a', 'b', 'b', 'c'
>>> list(foo)
['a', 'b', 'b', 'c']
>>> bar = etl.values(table1, 'bar')
>>> bar
bar: True, None, True, False
>>> list(bar)
[True, None, True, False]
>>> # values from multiple fields
... table2 = [['foo', 'bar', 'baz'],
...           [1, 'a', True],
...           [2, 'bb', True],
...           [3, 'd', False]]
>>> foobaz = etl.values(table2, 'foo', 'baz')
>>> foobaz
('foo', 'baz'): (1, True), (2, True), (3, False)
>>> list(foobaz)
[(1, True), (2, True), (3, False)]

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes. Multiple fields can also be provided as positional arguments.

If rows are uneven, the value of the keyword argument missing is returned.

petl.util.base.dicts(table, *sliceargs, **kwargs)[source]

Return a container supporting iteration over rows as dicts. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.dicts(table)
>>> d
{'foo': 'a', 'bar': 1}
{'foo': 'b', 'bar': 2}
>>> list(d)
[{'foo': 'a', 'bar': 1}, {'foo': 'b', 'bar': 2}]

Short rows are padded with the value of the missing keyword argument.

petl.util.base.namedtuples(table, *sliceargs, **kwargs)[source]

View the table as a container of named tuples. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.namedtuples(table)
>>> d
row(foo='a', bar=1)
row(foo='b', bar=2)
>>> list(d)
[row(foo='a', bar=1), row(foo='b', bar=2)]

Short rows are padded with the value of the missing keyword argument.

The name keyword argument can be given to override the name of the named tuple class (defaults to ‘row’).

petl.util.base.records(table, *sliceargs, **kwargs)[source]

Return a container supporting iteration over rows as records, where a record is a hybrid object supporting all possible ways of accessing values. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> d = etl.records(table)
>>> d
('a', 1)
('b', 2)
>>> list(d)
[('a', 1), ('b', 2)]
>>> [r[0] for r in d]
['a', 'b']
>>> [r['foo'] for r in d]
['a', 'b']
>>> [r.foo for r in d]
['a', 'b']

Short rows are padded with the value of the missing keyword argument.

petl.util.base.expr(s)[source]

Construct a function operating on a table record.

The expression string is converted into a lambda function by prepending the string with 'lambda rec: ', then replacing anything enclosed in curly braces (e.g., "{foo}") with a lookup on the record (e.g., "rec['foo']"), then finally calling eval().

So, e.g., the expression string "{foo} * {bar}" is converted to the function lambda rec: rec['foo'] * rec['bar']
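That substitution step can be sketched in plain Python; make_expr below is a hypothetical stand-in for petl's implementation, assuming field names contain no closing brace:

```python
import re

def make_expr(s):
    # Replace each {field} with a lookup on the record, then build
    # the lambda source string and eval() it, as described above.
    body = re.sub(r"\{([^}]+)\}", r"rec['\1']", s)
    return eval('lambda rec: ' + body)

f = make_expr('{foo} * {bar}')
print(f({'foo': 2, 'bar': 3}))  # 6
```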

petl.util.base.rowgroupby(table, key, value=None)[source]

Convenient adapter for itertools.groupby(). E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 3, True],
...           ['b', 2]]
>>> # group entire rows
... for key, group in etl.rowgroupby(table1, 'foo'):
...     print(key, list(group))
...
a [('a', 1, True)]
b [('b', 3, True), ('b', 2)]
>>> # group specific values
... for key, group in etl.rowgroupby(table1, 'foo', 'bar'):
...     print(key, list(group))
...
a [1]
b [3, 2]

N.B., assumes the input table is already sorted by the given key.
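The equivalent stdlib idiom, sketched on plain tuples, shows why sortedness matters: itertools.groupby() only groups consecutive rows with equal keys.

```python
from itertools import groupby
from operator import itemgetter

rows = [('a', 1, True), ('b', 3, True), ('b', 2)]

# group by the first field; like rowgroupby(), this only works
# correctly if the rows are already sorted by the key
groups = {k: list(g) for k, g in groupby(rows, key=itemgetter(0))}
print(groups['b'])  # [('b', 3, True), ('b', 2)]
```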

petl.util.base.empty()[source]

Return an empty table. Can be useful when building up a table from a set of columns, e.g.:

>>> import petl as etl
>>> table = (
...     etl
...     .empty()
...     .addcolumn('foo', ['A', 'B'])
...     .addcolumn('bar', [1, 2])
... )
>>> table
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'B' |   2 |
+-----+-----+

Visualising tables

petl.util.vis.look(table, limit=0, vrepr=None, index_header=None, style=None, truncate=None, width=None)[source]

Format a portion of the table as text for inspection in an interactive session. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> etl.look(table1)
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+

>>> # alternative formatting styles
... etl.look(table1, style='simple')
===  ===
foo  bar
===  ===
'a'    1
'b'    2
===  ===

>>> etl.look(table1, style='minimal')
foo  bar
'a'    1
'b'    2

>>> # any irregularities in the length of header and/or data
... # rows will appear as blank cells
... table2 = [['foo', 'bar'],
...           ['a'],
...           ['b', 2, True]]
>>> etl.look(table2)
+-----+-----+------+
| foo | bar |      |
+=====+=====+======+
| 'a' |     |      |
+-----+-----+------+
| 'b' |   2 | True |
+-----+-----+------+

Three alternative presentation styles are available: ‘grid’, ‘simple’ and ‘minimal’, where ‘grid’ is the default. A different style can be specified using the style keyword argument. The default style can also be changed by setting petl.config.look_style.

petl.util.vis.lookall(table, **kwargs)[source]

Format the entire table as text for inspection in an interactive session.

N.B., this will load the entire table into memory.

See also petl.util.vis.look() and petl.util.vis.see().

petl.util.vis.see(table, limit=0, vrepr=None, index_header=None)[source]

Format a portion of a table as text in a column-oriented layout for inspection in an interactive session. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.see(table)
foo: 'a', 'b'
bar: 1, 2

Useful for tables with a larger number of fields.

petl.util.vis.display(table, limit=0, vrepr=None, index_header=None, caption=None, tr_style=None, td_styles=None, encoding=None, truncate=None, epilogue=None)[source]

Display a table inline within an IPython notebook.

petl.util.vis.displayall(table, **kwargs)[source]

Display all rows from a table inline within an IPython notebook (use with caution, big tables will kill your browser).

Lookup data structures

petl.util.lookups.lookup(table, key, value=None, dictionary=None)[source]

Load a dictionary with data from the given table. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> lkp = etl.lookup(table1, 'foo', 'bar')
>>> lkp['a']
[1]
>>> lkp['b']
[2, 3]
>>> # if no value argument is given, defaults to the whole
... # row (as a tuple)
... lkp = etl.lookup(table1, 'foo')
>>> lkp['a']
[('a', 1)]
>>> lkp['b']
[('b', 2), ('b', 3)]
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.lookup(table2, ('foo', 'bar'), 'baz')
>>> lkp[('a', 1)]
[True]
>>> lkp[('b', 2)]
[False]
>>> lkp[('b', 3)]
[True, False]
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.lookup(table1, 'foo', 'bar', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
[1]
>>> lkp['b']
[2, 3]

petl.util.lookups.lookupone(table, key, value=None, dictionary=None, strict=False)[source]

Load a dictionary with data from the given table, assuming there is at most one value for each key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> # if the specified key is not unique and strict=False (default),
... # the first value wins
... lkp = etl.lookupone(table1, 'foo', 'bar')
>>> lkp['a']
1
>>> lkp['b']
2
>>> # if the specified key is not unique and strict=True, will raise
... # DuplicateKeyError
... try:
...     lkp = etl.lookupone(table1, 'foo', strict=True)
... except etl.errors.DuplicateKeyError as e:
...     print(e)
...
duplicate key: 'b'
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.lookupone(table2, ('foo', 'bar'), 'baz')
>>> lkp[('a', 1)]
True
>>> lkp[('b', 2)]
False
>>> lkp[('b', 3)]
True
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.lookupone(table1, 'foo', 'bar', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
1
>>> lkp['b']
2

petl.util.lookups.dictlookup(table, key, dictionary=None)[source]

Load a dictionary with data from the given table, mapping to dicts. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> lkp = etl.dictlookup(table1, 'foo')
>>> lkp['a']
[{'foo': 'a', 'bar': 1}]
>>> lkp['b']
[{'foo': 'b', 'bar': 2}, {'foo': 'b', 'bar': 3}]
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.dictlookup(table2, ('foo', 'bar'))
>>> lkp[('a', 1)]
[{'foo': 'a', 'baz': True, 'bar': 1}]
>>> lkp[('b', 2)]
[{'foo': 'b', 'baz': False, 'bar': 2}]
>>> lkp[('b', 3)]
[{'foo': 'b', 'baz': True, 'bar': 3}, {'foo': 'b', 'baz': False, 'bar': 3}]
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.dictlookup(table1, 'foo', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
[{'foo': 'a', 'bar': 1}]
>>> lkp['b']
[{'foo': 'b', 'bar': 2}, {'foo': 'b', 'bar': 3}]

petl.util.lookups.dictlookupone(table, key, dictionary=None, strict=False)[source]

Load a dictionary with data from the given table, mapping to dicts, assuming there is at most one row for each key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['b', 3]]
>>> # if the specified key is not unique and strict=False (default),
... # the first value wins
... lkp = etl.dictlookupone(table1, 'foo')
>>> lkp['a']
{'foo': 'a', 'bar': 1}
>>> lkp['b']
{'foo': 'b', 'bar': 2}
>>> # if the specified key is not unique and strict=True, will raise
... # DuplicateKeyError
... try:
...     lkp = etl.dictlookupone(table1, 'foo', strict=True)
... except etl.errors.DuplicateKeyError as e:
...     print(e)
...
duplicate key: 'b'
>>> # compound keys are supported
... table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> lkp = etl.dictlookupone(table2, ('foo', 'bar'))
>>> lkp[('a', 1)]
{'foo': 'a', 'baz': True, 'bar': 1}
>>> lkp[('b', 2)]
{'foo': 'b', 'baz': False, 'bar': 2}
>>> lkp[('b', 3)]
{'foo': 'b', 'baz': True, 'bar': 3}
>>> # data can be loaded into an existing dictionary-like
... # object, including persistent dictionaries created via the
... # shelve module
... import shelve
>>> lkp = shelve.open('example.dat', flag='n')
>>> lkp = etl.dictlookupone(table1, 'foo', lkp)
>>> lkp.close()
>>> lkp = shelve.open('example.dat', flag='r')
>>> lkp['a']
{'foo': 'a', 'bar': 1}
>>> lkp['b']
{'foo': 'b', 'bar': 2}

petl.util.lookups.recordlookup(table, key, dictionary=None)[source]

Load a dictionary with data from the given table, mapping to record objects.

petl.util.lookups.recordlookupone(table, key, dictionary=None, strict=False)[source]

Load a dictionary with data from the given table, mapping to record objects, assuming there is at most one row for each key.

Parsing string/text values

petl.util.parsers.dateparser(fmt, strict=True)[source]

Return a function to parse strings as datetime.date objects using a given format. E.g.:

>>> from petl import dateparser
>>> isodate = dateparser('%Y-%m-%d')
>>> isodate('2002-12-25')
datetime.date(2002, 12, 25)
>>> try:
...     isodate('2002-02-30')
... except ValueError as e:
...     print(e)
...
day is out of range for month

If strict=False and an error occurs during parsing, the original value is returned as-is and no error is raised.

petl.util.parsers.timeparser(fmt, strict=True)[source]

Return a function to parse strings as datetime.time objects using a given format. E.g.:

>>> from petl import timeparser
>>> isotime = timeparser('%H:%M:%S')
>>> isotime('00:00:00')
datetime.time(0, 0)
>>> isotime('13:00:00')
datetime.time(13, 0)
>>> try:
...     isotime('12:00:99')
... except ValueError as e:
...     print(e)
...
unconverted data remains: 9
>>> try:
...     isotime('25:00:00')
... except ValueError as e:
...     print(e)
...
time data '25:00:00' does not match format '%H:%M:%S'

If strict=False and an error occurs during parsing, the original value is returned as-is and no error is raised.

petl.util.parsers.datetimeparser(fmt, strict=True)[source]

Return a function to parse strings as datetime.datetime objects using a given format. E.g.:

>>> from petl import datetimeparser
>>> isodatetime = datetimeparser('%Y-%m-%dT%H:%M:%S')
>>> isodatetime('2002-12-25T00:00:00')
datetime.datetime(2002, 12, 25, 0, 0)
>>> try:
...     isodatetime('2002-12-25T00:00:99')
... except ValueError as e:
...     print(e)
...
unconverted data remains: 9

If strict=False and an error occurs during parsing, the original value is returned as-is and no error is raised.

petl.util.parsers.boolparser(true_strings=('true', 't', 'yes', 'y', '1'), false_strings=('false', 'f', 'no', 'n', '0'), case_sensitive=False, strict=True)[source]

Return a function to parse strings as bool objects using a given set of string representations for True and False. E.g.:

>>> from petl import boolparser
>>> mybool = boolparser(true_strings=['yes', 'y'], false_strings=['no', 'n'])
>>> mybool('y')
True
>>> mybool('yes')
True
>>> mybool('Y')
True
>>> mybool('No')
False
>>> try:
...     mybool('foo')
... except ValueError as e:
...     print(e)
...
value is not one of recognised boolean strings: 'foo'
>>> try:
...     mybool('True')
... except ValueError as e:
...     print(e)
...
value is not one of recognised boolean strings: 'true'

If strict=False and an error occurs during parsing, the original value is returned as-is and no error is raised.

petl.util.parsers.numparser(strict=False)[source]

Return a function that will attempt to parse the value as a number, trying int(), float() and complex() in that order (long() is also tried under Python 2). If all fail, the value is returned as-is, unless strict=True, in which case the underlying exception is raised.
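The behaviour can be sketched in plain Python; parse_number below is a simplified stand-in, not petl's exact implementation:

```python
def parse_number(v, strict=False):
    # try progressively more general numeric types
    for cls in (int, float, complex):
        try:
            return cls(v)
        except (ValueError, TypeError) as e:
            err = e
    if strict:
        raise err       # re-raise the underlying exception
    return v            # otherwise fall back to the original value

print(parse_number('42'))    # 42
print(parse_number('1.5'))   # 1.5
print(parse_number('n/a'))   # n/a (returned as-is)
```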

Counting

petl.util.counting.nrows(table)[source]

Count the number of data rows in a table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2]]
>>> etl.nrows(table)
2

petl.util.counting.valuecount(table, field, value, missing=None)[source]

Count the number of occurrences of value under the given field. Returns the absolute count and relative frequency as a pair. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', 1],
...          ['b', 2],
...          ['b', 7]]
>>> etl.valuecount(table, 'foo', 'b')
(2, 0.6666666666666666)

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes.

petl.util.counting.valuecounter(table, *field, **kwargs)[source]

Find distinct values for the given field and count the number of occurrences. Returns a dict mapping values to counts. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['a', True],
...          ['b'],
...          ['b', True],
...          ['c', False]]
>>> etl.valuecounter(table, 'foo')
Counter({'b': 2, 'c': 1, 'a': 1})

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes.

petl.util.counting.valuecounts(table, *field, **kwargs)[source]

Find distinct values for the given field and count the number and relative frequency of occurrences. Returns a table mapping values to counts, with most common values first. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['a', True, 0.12],
...          ['a', True, 0.17],
...          ['b', False, 0.34],
...          ['b', False, 0.44],
...          ['b']]
>>> etl.valuecounts(table, 'foo')
+-----+-------+-----------+
| foo | count | frequency |
+=====+=======+===========+
| 'b' |     3 |       0.6 |
+-----+-------+-----------+
| 'a' |     2 |       0.4 |
+-----+-------+-----------+

>>> etl.valuecounts(table, 'foo', 'bar')
+-----+-------+-------+-----------+
| foo | bar   | count | frequency |
+=====+=======+=======+===========+
| 'b' | False |     2 |       0.4 |
+-----+-------+-------+-----------+
| 'a' | True  |     2 |       0.4 |
+-----+-------+-------+-----------+
| 'b' | None  |     1 |       0.2 |
+-----+-------+-------+-----------+

If rows are short, the value of the keyword argument missing is counted.

Multiple fields can be given as positional arguments. If multiple fields are given, these are treated as a compound key.

petl.util.counting.stringpatterncounter(table, field)[source]

Profile string patterns in the given field, returning a dict mapping patterns to counts.

petl.util.counting.stringpatterns(table, field)[source]

Profile string patterns in the given field, returning a table of patterns, counts and frequencies. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'],
...          ['Mr. Foo', '123-1254'],
...          ['Mrs. Bar', '234-1123'],
...          ['Mr. Spo', '123-1254'],
...          [u'Mr. Baz', u'321 1434'],
...          [u'Mrs. Baz', u'321 1434'],
...          ['Mr. Quux', '123-1254-XX']]
>>> etl.stringpatterns(table, 'foo')
+------------+-------+---------------------+
| pattern    | count | frequency           |
+============+=======+=====================+
| 'Aa. Aaa'  |     3 |                 0.5 |
+------------+-------+---------------------+
| 'Aaa. Aaa' |     2 |  0.3333333333333333 |
+------------+-------+---------------------+
| 'Aa. Aaaa' |     1 | 0.16666666666666666 |
+------------+-------+---------------------+

>>> etl.stringpatterns(table, 'bar')
+---------------+-------+---------------------+
| pattern       | count | frequency           |
+===============+=======+=====================+
| '999-9999'    |     3 |                 0.5 |
+---------------+-------+---------------------+
| '999 9999'    |     2 |  0.3333333333333333 |
+---------------+-------+---------------------+
| '999-9999-AA' |     1 | 0.16666666666666666 |
+---------------+-------+---------------------+
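Judging from the patterns in the examples above, the profiling substitutes 'A' for uppercase letters, 'a' for lowercase letters and '9' for digits, passing other characters through unchanged. A sketch of that mapping (char_pattern is a hypothetical helper, not petl's actual code):

```python
def char_pattern(s):
    # 'A' for uppercase, 'a' for lowercase, '9' for digits;
    # punctuation and spaces pass through unchanged
    def classify(c):
        if c.isupper():
            return 'A'
        if c.islower():
            return 'a'
        if c.isdigit():
            return '9'
        return c
    return ''.join(classify(c) for c in s)

print(char_pattern('Mr. Foo'))   # Aa. Aaa
print(char_pattern('123-1254'))  # 999-9999
```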

petl.util.counting.rowlengths(table)[source]

Report on row lengths found in the table. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', '2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', 'xyz', 9.0],
...          ['E', None],
...          ['F', 9]]
>>> etl.rowlengths(table)
+--------+-------+
| length | count |
+========+=======+
|      3 |     3 |
+--------+-------+
|      2 |     2 |
+--------+-------+
|      4 |     1 |
+--------+-------+

Useful for finding potential problems in data files.

petl.util.counting.typecounter(table, field)[source]

Count the number of values found for each Python type.

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> etl.typecounter(table, 'foo')
Counter({'str': 5})
>>> etl.typecounter(table, 'bar')
Counter({'str': 3, 'int': 2})
>>> etl.typecounter(table, 'baz')
Counter({'str': 2, 'int': 1, 'NoneType': 1, 'float': 1})

The field argument can be a field name or index (starting from zero).

petl.util.counting.typecounts(table, field)[source]

Count the number of values found for each Python type and return a table mapping class names to counts and frequencies. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          [b'A', 1, 2],
...          [b'B', '2', b'3.4'],
...          ['B', '3', '7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> etl.typecounts(table, 'foo')
+---------+-------+-----------+
| type    | count | frequency |
+=========+=======+===========+
| 'str'   |     3 |       0.6 |
+---------+-------+-----------+
| 'bytes' |     2 |       0.4 |
+---------+-------+-----------+

>>> etl.typecounts(table, 'bar')
+-------+-------+-----------+
| type  | count | frequency |
+=======+=======+===========+
| 'str' |     3 |       0.6 |
+-------+-------+-----------+
| 'int' |     2 |       0.4 |
+-------+-------+-----------+

>>> etl.typecounts(table, 'baz')
+------------+-------+-----------+
| type       | count | frequency |
+============+=======+===========+
| 'int'      |     1 |       0.2 |
+------------+-------+-----------+
| 'NoneType' |     1 |       0.2 |
+------------+-------+-----------+
| 'bytes'    |     1 |       0.2 |
+------------+-------+-----------+
| 'float'    |     1 |       0.2 |
+------------+-------+-----------+
| 'str'      |     1 |       0.2 |
+------------+-------+-----------+

The field argument can be a field name or index (starting from zero).

petl.util.counting.parsecounter(table, field, parsers=(('int', int), ('float', float)))[source]

Count the number of str or unicode values under the given field that can be parsed as ints or floats, or via custom parser functions. Return a pair of Counter objects, the first mapping parser names to the number of strings successfully parsed, the second mapping parser names to the number of errors. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 'aaa', 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', '3.7', 9.0],
...          ['E', 42]]
>>> counter, errors = etl.parsecounter(table, 'bar')
>>> counter
Counter({'float': 3, 'int': 2})
>>> errors
Counter({'int': 2, 'float': 1})

The field argument can be a field name or index (starting from zero).

petl.util.counting.parsecounts(table, field, parsers=(('int', int), ('float', float)))[source]

Count the number of str or unicode values that can be parsed as ints or floats, or via custom parser functions. Return a table mapping parser names to the number of values successfully parsed and the number of errors. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 'aaa', 2],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', '3.7', 9.0],
...          ['E', 42]]
>>> etl.parsecounts(table, 'bar')
+---------+-------+--------+
| type    | count | errors |
+=========+=======+========+
| 'float' |     3 |      1 |
+---------+-------+--------+
| 'int'   |     2 |      2 |
+---------+-------+--------+

The field argument can be a field name or index (starting from zero).

Timing

petl.util.timing.progress(table, batchsize=1000, prefix='', out=sys.stderr)[source]

Report progress on rows passing through. E.g.:

>>> import petl as etl
>>> table = etl.dummytable(100000)
>>> table.progress(10000).tocsv('example.csv')
10000 rows in 0.13s (78363 row/s); batch in 0.13s (78363 row/s)
20000 rows in 0.22s (91679 row/s); batch in 0.09s (110448 row/s)
30000 rows in 0.31s (96573 row/s); batch in 0.09s (108114 row/s)
40000 rows in 0.40s (99535 row/s); batch in 0.09s (109625 row/s)
50000 rows in 0.49s (101396 row/s); batch in 0.09s (109591 row/s)
60000 rows in 0.59s (102245 row/s); batch in 0.09s (106709 row/s)
70000 rows in 0.68s (103221 row/s); batch in 0.09s (109498 row/s)
80000 rows in 0.77s (103810 row/s); batch in 0.09s (108126 row/s)
90000 rows in 0.90s (99465 row/s); batch in 0.13s (74516 row/s)
100000 rows in 1.02s (98409 row/s); batch in 0.11s (89821 row/s)
100000 rows in 1.02s (98402 row/s); batches in 0.10 +/- 0.02s [0.09-0.13] (100481 +/- 13340 rows/s [74516-110448])

See also petl.util.timing.clock().

petl.util.timing.clock(table)[source]

Time how long is spent retrieving rows from the wrapped container. Enables diagnosis of which steps in a pipeline are taking the most time. E.g.:

>>> import petl as etl
>>> t1 = etl.dummytable(100000)
>>> c1 = etl.clock(t1)
>>> t2 = etl.convert(c1, 'foo', lambda v: v**2)
>>> c2 = etl.clock(t2)
>>> p = etl.progress(c2, 10000)
>>> etl.tocsv(p, 'example.csv')
10000 rows in 0.23s (44036 row/s); batch in 0.23s (44036 row/s)
20000 rows in 0.38s (52167 row/s); batch in 0.16s (63979 row/s)
30000 rows in 0.54s (55749 row/s); batch in 0.15s (64624 row/s)
40000 rows in 0.69s (57765 row/s); batch in 0.15s (64793 row/s)
50000 rows in 0.85s (59031 row/s); batch in 0.15s (64707 row/s)
60000 rows in 1.00s (59927 row/s); batch in 0.15s (64847 row/s)
70000 rows in 1.16s (60483 row/s); batch in 0.16s (64051 row/s)
80000 rows in 1.31s (61008 row/s); batch in 0.15s (64953 row/s)
90000 rows in 1.47s (61356 row/s); batch in 0.16s (64285 row/s)
100000 rows in 1.62s (61703 row/s); batch in 0.15s (65012 row/s)
100000 rows in 1.62s (61700 row/s); batches in 0.16 +/- 0.02s [0.15-0.23] (62528 +/- 6173 rows/s [44036-65012])
>>> # time consumed retrieving rows from t1
... c1.time
0.7243089999999492
>>> # time consumed retrieving rows from t2
... c2.time
1.1704209999999766
>>> # actual time consumed by the convert step
... c2.time - c1.time
0.4461120000000274

See also petl.util.timing.progress().

Statistics

petl.util.statistics.limits(table, field)[source]

Find minimum and maximum values under the given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2], ['b', 3]]
>>> minv, maxv = etl.limits(table, 'bar')
>>> minv
1
>>> maxv
3

The field argument can be a field name or index (starting from zero).

petl.util.statistics.stats(table, field)[source]

Calculate basic descriptive statistics on a given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, 2],
...          ['B', '2', '3.4'],
...          [u'B', u'3', u'7.8', True],
...          ['D', 'xyz', 9.0],
...          ['E', None]]
>>> etl.stats(table, 'bar')
stats(count=3, errors=2, sum=6.0, min=1.0, max=3.0, mean=2.0, pvariance=0.6666666666666666, pstdev=0.816496580927726)

The field argument can be a field name or index (starting from zero).

Materialising tables

petl.util.materialise.columns(table, missing=None)[source]

Construct a dict mapping field names to lists of values. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar'], ['a', 1], ['b', 2], ['b', 3]]
>>> cols = etl.columns(table)
>>> cols['foo']
['a', 'b', 'b']
>>> cols['bar']
[1, 2, 3]

See also petl.util.materialise.facetcolumns().

petl.util.materialise.facetcolumns(table, key, missing=None)[source]

Like petl.util.materialise.columns() but stratified by values of the given key field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['a', 1, True],
...          ['b', 2, True],
...          ['b', 3]]
>>> fc = etl.facetcolumns(table, 'foo')
>>> fc['a']
{'foo': ['a'], 'baz': [True], 'bar': [1]}
>>> fc['b']
{'foo': ['b', 'b'], 'baz': [True, None], 'bar': [2, 3]}

petl.util.materialise.listoflists(tbl)[source]
petl.util.materialise.listoftuples(tbl)[source]
petl.util.materialise.tupleoflists(tbl)[source]
petl.util.materialise.tupleoftuples(tbl)[source]
petl.util.materialise.cache(table, n=None)[source]

Wrap the table with a cache that stores up to n rows as they are initially requested via iteration (by default all rows are cached).
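The idea can be sketched with a simplified wrapper that caches all rows on the first pass (petl's version additionally supports caching only the first n rows); CachedRows is an illustrative class, not petl's implementation:

```python
class CachedRows:
    """Materialise rows on the first iteration so that later
    iterations replay the stored copy instead of recomputing
    upstream pipeline steps."""

    def __init__(self, source):
        self._source = source
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            self._cache = list(self._source)  # first pass: materialise
        return iter(self._cache)

expensive = (x * x for x in range(3))  # a one-shot iterable
cached = CachedRows(expensive)
print(list(cached))  # [0, 1, 4]
print(list(cached))  # [0, 1, 4] again, served from the cache
```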

Randomly generated tables

petl.util.random.randomtable(numflds=5, numrows=100, wait=0, seed=None)[source]

Construct a table with random numerical data. Use numflds and numrows to specify the number of fields and rows respectively. Set wait to a float greater than zero to simulate a delay on each row generation (number of seconds per row). E.g.:

>>> import petl as etl
>>> table = etl.randomtable(3, 100, seed=42)
>>> table
+----------------------+----------------------+---------------------+
| f0                   | f1                   | f2                  |
+======================+======================+=====================+
|   0.6394267984578837 | 0.025010755222666936 | 0.27502931836911926 |
+----------------------+----------------------+---------------------+
|  0.22321073814882275 |   0.7364712141640124 |  0.6766994874229113 |
+----------------------+----------------------+---------------------+
|   0.8921795677048454 |  0.08693883262941615 |  0.4219218196852704 |
+----------------------+----------------------+---------------------+
| 0.029797219438070344 |  0.21863797480360336 |  0.5053552881033624 |
+----------------------+----------------------+---------------------+
| 0.026535969683863625 |   0.1988376506866485 |  0.6498844377795232 |
+----------------------+----------------------+---------------------+
...

Note that the data are generated on the fly and are not stored in memory, so this function can be used to simulate very large tables.

petl.util.random.dummytable(numrows=100, fields=(('foo', partial(random.randint, 0, 100)), ('bar', partial(random.choice, ('apples', 'pears', 'bananas', 'oranges'))), ('baz', random.random)), wait=0, seed=None)[source]

Construct a table with dummy data. Use numrows to specify the number of rows. Set wait to a float greater than zero to simulate a delay on each row generation (number of seconds per row). E.g.:

>>> import petl as etl
>>> table1 = etl.dummytable(100, seed=42)
>>> table1
+-----+----------+----------------------+
| foo | bar      | baz                  |
+=====+==========+======================+
|  81 | 'apples' | 0.025010755222666936 |
+-----+----------+----------------------+
|  35 | 'pears'  |  0.22321073814882275 |
+-----+----------+----------------------+
|  94 | 'apples' |   0.6766994874229113 |
+-----+----------+----------------------+
|  69 | 'apples' |   0.5904925124490397 |
+-----+----------+----------------------+
|   4 | 'apples' |  0.09369523986159245 |
+-----+----------+----------------------+
...

>>> # customise fields
... import random
>>> from functools import partial
>>> fields = [('foo', random.random),
...           ('bar', partial(random.randint, 0, 500)),
...           ('baz', partial(random.choice,
...                           ['chocolate', 'strawberry', 'vanilla']))]
>>> table2 = etl.dummytable(100, fields=fields, seed=42)
>>> table2
+---------------------+-----+-------------+
| foo                 | bar | baz         |
+=====================+=====+=============+
|  0.6394267984578837 |  12 | 'vanilla'   |
+---------------------+-----+-------------+
| 0.27502931836911926 | 114 | 'chocolate' |
+---------------------+-----+-------------+
|  0.7364712141640124 | 346 | 'vanilla'   |
+---------------------+-----+-------------+
|  0.8921795677048454 |  44 | 'vanilla'   |
+---------------------+-----+-------------+
|  0.4219218196852704 |  15 | 'chocolate' |
+---------------------+-----+-------------+
...

Data generation functions can be specified via the fields keyword argument.

Note that the data are generated on the fly and are not stored in memory, so this function can be used to simulate very large tables.

Miscellaneous

petl.util.misc.typeset(table, field)[source]

Return a set containing all Python types found for values in the given field. E.g.:

>>> import petl as etl
>>> table = [['foo', 'bar', 'baz'],
...          ['A', 1, '2'],
...          ['B', u'2', '3.4'],
...          [u'B', u'3', '7.8', True],
...          ['D', u'xyz', 9.0],
...          ['E', 42]]
>>> sorted(etl.typeset(table, 'foo'))
['str']
>>> sorted(etl.typeset(table, 'bar'))
['int', 'str']
>>> sorted(etl.typeset(table, 'baz'))
['NoneType', 'float', 'str']

The field argument can be a field name or index (starting from zero).

petl.util.misc.diffheaders(t1, t2)[source]

Return the difference between the headers of the two tables as a pair of sets. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 1, .3]]
>>> table2 = [['baz', 'bar', 'quux'],
...           ['a', 1, .3]]
>>> add, sub = etl.diffheaders(table1, table2)
>>> add
{'quux'}
>>> sub
{'foo'}

petl.util.misc.diffvalues(t1, t2, f)[source]

Return the difference between the values under the given field in the two tables, e.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 3]]
>>> table2 = [['bar', 'foo'],
...           [1, 'a'],
...           [3, 'c']]
>>> add, sub = etl.diffvalues(table1, table2, 'foo')
>>> add
{'c'}
>>> sub
{'b'}

petl.util.misc.strjoin(s)[source]

Return a function to join sequences using s as the separator. Intended for use with petl.transform.conversions.convert().
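A minimal sketch of such a joiner (make_joiner is an illustrative equivalent, not petl's exact code):

```python
def make_joiner(sep):
    # join any sequence on sep, converting items to str first
    return lambda seq: sep.join(str(v) for v in seq)

join_dash = make_joiner('-')
print(join_dash(['2014', '01', '31']))  # 2014-01-31
print(join_dash([1, 2, 3]))             # 1-2-3
```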

petl.util.misc.nthword(n, sep=None)[source]

Construct a function to return the nth word in a string. E.g.:

>>> import petl as etl
>>> s = 'foo bar'
>>> f = etl.nthword(0)
>>> f(s)
'foo'
>>> g = etl.nthword(1)
>>> g(s)
'bar'

Intended for use with petl.transform.conversions.convert().

petl.util.misc.coalesce(*fields, **kwargs)[source]

Return a function which accepts a row and returns the first non-missing value from the specified fields. Intended for use with petl.transform.basics.addfield().
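The idea can be sketched as follows; this assumes rows support lookup by field name (as petl record objects do), and the missing and default keywords here are illustrative:

```python
def make_coalesce(*fields, missing=None, default=None):
    # return the first value that is not `missing`,
    # falling back to `default` if all values are missing
    def _coalesce(row):
        for f in fields:
            v = row[f]
            if v is not missing:
                return v
        return default
    return _coalesce

pick = make_coalesce('mobile', 'landline', default='unknown')
print(pick({'mobile': None, 'landline': '555-1234'}))  # 555-1234
print(pick({'mobile': None, 'landline': None}))        # unknown
```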