Usage - transforming rows and columns¶

Basic transformations¶

petl.transform.basics.head(table, n=5)[source]¶

Select the first n data rows. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 5],
...           ['d', 7],
...           ['f', 42],
...           ['f', 3],
...           ['h', 90]]
>>> table2 = etl.head(table1, 4)
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+
| 'c' |   5 |
+-----+-----+
| 'd' |   7 |
+-----+-----+

petl.transform.basics.tail(table, n=5)[source]¶

Select the last n data rows. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 5],
...           ['d', 7],
...           ['f', 42],
...           ['f', 3],
...           ['h', 90],
...           ['k', 12],
...           ['l', 77],
...           ['q', 2]]
>>> table2 = etl.tail(table1, 4)
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'h' |  90 |
+-----+-----+
| 'k' |  12 |
+-----+-----+
| 'l' |  77 |
+-----+-----+
| 'q' |   2 |
+-----+-----+

petl.transform.basics.rowslice(table, *sliceargs)[source]¶

Choose a subsequence of data rows. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2],
...           ['c', 5],
...           ['d', 7],
...           ['f', 42]]
>>> table2 = etl.rowslice(table1, 2)
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+

>>> table3 = etl.rowslice(table1, 1, 4)
>>> table3
+-----+-----+
| foo | bar |
+=====+=====+
| 'b' |   2 |
+-----+-----+
| 'c' |   5 |
+-----+-----+
| 'd' |   7 |
+-----+-----+

>>> table4 = etl.rowslice(table1, 0, 5, 2)
>>> table4
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'c' |   5 |
+-----+-----+
| 'f' |  42 |
+-----+-----+

Positional arguments are used to slice the data rows. The sliceargs are passed through to itertools.islice().

petl.transform.basics.cut(table, *args, **kwargs)[source]¶

Choose and/or re-order fields. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2.7],
...           ['B', 2, 3.4],
...           ['B', 3, 7.8],
...           ['D', 42, 9.0],
...           ['E', 12]]
>>> table2 = etl.cut(table1, 'foo', 'baz')
>>> table2
+-----+------+
| foo | baz  |
+=====+======+
| 'A' |  2.7 |
+-----+------+
| 'B' |  3.4 |
+-----+------+
| 'B' |  7.8 |
+-----+------+
| 'D' |  9.0 |
+-----+------+
| 'E' | None |
+-----+------+

>>> # fields can also be specified by index, starting from zero
... table3 = etl.cut(table1, 0, 2)
>>> table3
+-----+------+
| foo | baz  |
+=====+======+
| 'A' |  2.7 |
+-----+------+
| 'B' |  3.4 |
+-----+------+
| 'B' |  7.8 |
+-----+------+
| 'D' |  9.0 |
+-----+------+
| 'E' | None |
+-----+------+

>>> # field names and indices can be mixed
... table4 = etl.cut(table1, 'bar', 0)
>>> table4
+-----+-----+
| bar | foo |
+=====+=====+
|   1 | 'A' |
+-----+-----+
|   2 | 'B' |
+-----+-----+
|   3 | 'B' |
+-----+-----+
|  42 | 'D' |
+-----+-----+
|  12 | 'E' |
+-----+-----+

>>> # select a range of fields
... table5 = etl.cut(table1, *range(0, 2))
>>> table5
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'B' |   2 |
+-----+-----+
| 'B' |   3 |
+-----+-----+
| 'D' |  42 |
+-----+-----+
| 'E' |  12 |
+-----+-----+

Note that any short rows will be padded with None values (or whatever is provided via the missing keyword argument).

petl.transform.basics.cutout(table, *args, **kwargs)[source]¶

Remove fields. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2.7],
...           ['B', 2, 3.4],
...           ['B', 3, 7.8],
...           ['D', 42, 9.0],
...           ['E', 12]]
>>> table2 = etl.cutout(table1, 'bar')
>>> table2
+-----+------+
| foo | baz  |
+=====+======+
| 'A' |  2.7 |
+-----+------+
| 'B' |  3.4 |
+-----+------+
| 'B' |  7.8 |
+-----+------+
| 'D' |  9.0 |
+-----+------+
| 'E' | None |
+-----+------+

Header manipulations¶

petl.transform.headers.rename(table, *args, **kwargs)[source]¶

Replace one or more values in the table’s header row. E.g.:

>>> import petl as etl
>>> table1 = [['sex', 'age'],
...           ['m', 12],
...           ['f', 34],
...           ['-', 56]]
>>> # rename a single field
... table2 = etl.rename(table1, 'sex', 'gender')
>>> table2
+--------+-----+
| gender | age |
+========+=====+
| 'm'    |  12 |
+--------+-----+
| 'f'    |  34 |
+--------+-----+
| '-'    |  56 |
+--------+-----+

>>> # rename multiple fields by passing dictionary as second argument
... table3 = etl.rename(table1, {'sex': 'gender', 'age': 'age_years'})
>>> table3
+--------+-----------+
| gender | age_years |
+========+===========+
| 'm'    |        12 |
+--------+-----------+
| 'f'    |        34 |
+--------+-----------+
| '-'    |        56 |
+--------+-----------+

The field to rename can be specified as an index (i.e., integer representing field position).

If any nonexistent fields are specified, the default behaviour is to raise a FieldSelectionError. However, if strict keyword argument is False, any nonexistent fields specified will be silently ignored.

petl.transform.headers.setheader(table, header)[source]¶

Replace header row in the given table. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> table2 = etl.setheader(table1, ['foofoo', 'barbar'])
>>> table2
+--------+--------+
| foofoo | barbar |
+========+========+
| 'a'    |      1 |
+--------+--------+
| 'b'    |      2 |
+--------+--------+

petl.transform.headers.extendheader(table, fields)[source]¶

Extend header row in the given table. E.g.:

>>> import petl as etl
>>> table1 = [['foo'],
...           ['a', 1, True],
...           ['b', 2, False]]
>>> table2 = etl.extendheader(table1, ['bar', 'baz'])
>>> table2
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'a' |   1 | True  |
+-----+-----+-------+
| 'b' |   2 | False |
+-----+-----+-------+

petl.transform.headers.pushheader(table, header, *args)[source]¶

Push rows down and prepend a header row. E.g.:

>>> import petl as etl
>>> table1 = [['a', 1],
...           ['b', 2]]
>>> table2 = etl.pushheader(table1, ['foo', 'bar'])
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+

The header row can either be a list or positional arguments.

petl.transform.headers.prefixheader(table, prefix)[source]¶: Prefix all fields in the table header.

petl.transform.headers.suffixheader(table, suffix)[source]¶: Suffix all fields in the table header.

petl.transform.headers.sortheader(table, reverse=False, missing=None)[source]¶: Re-order columns so the header is sorted.

New in version 1.1.0.

petl.transform.headers.skip(table, n)[source]¶

Skip n rows, including the header row. E.g.:

>>> import petl as etl
>>> table1 = [['#aaa', 'bbb', 'ccc'],
...           ['#mmm'],
...           ['foo', 'bar'],
...           ['a', 1],
...           ['b', 2]]
>>> table2 = etl.skip(table1, 2)
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'a' |   1 |
+-----+-----+
| 'b' |   2 |
+-----+-----+

Converting values¶

petl.transform.conversions.convert(table, *args, **kwargs)[source]¶

Transform values under one or more fields via arbitrary functions, method invocations or dictionary translations. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', '2.4', 12],
...           ['B', '5.7', 34],
...           ['C', '1.2', 56]]
>>> # using a built-in function:
... table2 = etl.convert(table1, 'bar', float)
>>> table2
+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'A' | 2.4 |  12 |
+-----+-----+-----+
| 'B' | 5.7 |  34 |
+-----+-----+-----+
| 'C' | 1.2 |  56 |
+-----+-----+-----+

>>> # using a lambda function::
... table3 = etl.convert(table1, 'baz', lambda v: v*2)
>>> table3
+-----+-------+-----+
| foo | bar   | baz |
+=====+=======+=====+
| 'A' | '2.4' |  24 |
+-----+-------+-----+
| 'B' | '5.7' |  68 |
+-----+-------+-----+
| 'C' | '1.2' | 112 |
+-----+-------+-----+

>>> # a method of the data value can also be invoked by passing
... # the method name
... table4 = etl.convert(table1, 'foo', 'lower')
>>> table4
+-----+-------+-----+
| foo | bar   | baz |
+=====+=======+=====+
| 'a' | '2.4' |  12 |
+-----+-------+-----+
| 'b' | '5.7' |  34 |
+-----+-------+-----+
| 'c' | '1.2' |  56 |
+-----+-------+-----+

>>> # arguments to the method invocation can also be given
... table5 = etl.convert(table1, 'foo', 'replace', 'A', 'AA')
>>> table5
+------+-------+-----+
| foo  | bar   | baz |
+======+=======+=====+
| 'AA' | '2.4' |  12 |
+------+-------+-----+
| 'B'  | '5.7' |  34 |
+------+-------+-----+
| 'C'  | '1.2' |  56 |
+------+-------+-----+

>>> # values can also be translated via a dictionary
... table7 = etl.convert(table1, 'foo', {'A': 'Z', 'B': 'Y'})
>>> table7
+-----+-------+-----+
| foo | bar   | baz |
+=====+=======+=====+
| 'Z' | '2.4' |  12 |
+-----+-------+-----+
| 'Y' | '5.7' |  34 |
+-----+-------+-----+
| 'C' | '1.2' |  56 |
+-----+-------+-----+

>>> # the same conversion can be applied to multiple fields
... table8 = etl.convert(table1, ('foo', 'bar', 'baz'), str)
>>> table8
+-----+-------+------+
| foo | bar   | baz  |
+=====+=======+======+
| 'A' | '2.4' | '12' |
+-----+-------+------+
| 'B' | '5.7' | '34' |
+-----+-------+------+
| 'C' | '1.2' | '56' |
+-----+-------+------+

>>> # multiple conversions can be specified at the same time
... table9 = etl.convert(table1, {'foo': 'lower',
...                               'bar': float,
...                               'baz': lambda v: v * 2})
>>> table9
+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'a' | 2.4 |  24 |
+-----+-----+-----+
| 'b' | 5.7 |  68 |
+-----+-----+-----+
| 'c' | 1.2 | 112 |
+-----+-----+-----+

>>> # ...or alternatively via a list
... table10 = etl.convert(table1, ['lower', float, lambda v: v*2])
>>> table10
+-----+-----+-----+
| foo | bar | baz |
+=====+=====+=====+
| 'a' | 2.4 |  24 |
+-----+-----+-----+
| 'b' | 5.7 |  68 |
+-----+-----+-----+
| 'c' | 1.2 | 112 |
+-----+-----+-----+

>>> # conversion can be conditional
... table11 = etl.convert(table1, 'baz', lambda v: v * 2,
...                       where=lambda r: r.foo == 'B')
>>> table11
+-----+-------+-----+
| foo | bar   | baz |
+=====+=======+=====+
| 'A' | '2.4' |  12 |
+-----+-------+-----+
| 'B' | '5.7' |  68 |
+-----+-------+-----+
| 'C' | '1.2' |  56 |
+-----+-------+-----+

>>> # conversion can access other values from the same row
... table12 = etl.convert(table1, 'baz',
...                       lambda v, row: v * float(row.bar),
...                       pass_row=True)
>>> table12
+-----+-------+--------------------+
| foo | bar   | baz                |
+=====+=======+====================+
| 'A' | '2.4' | 28.799999999999997 |
+-----+-------+--------------------+
| 'B' | '5.7' |              193.8 |
+-----+-------+--------------------+
| 'C' | '1.2' |               67.2 |
+-----+-------+--------------------+
>>> # conversion can use a custom function
>>> def my_func(val, row):
...     return float(row.bar) + row.baz
... 
>>> table13 = etl.convert(table1, 'foo', my_func, pass_row=True)
>>> table13
+------+-------+-----+
| foo  | bar   | baz |
+======+=======+=====+
| 14.4 | '2.4' |  12 |
+------+-------+-----+
| 39.7 | '5.7' |  34 |
+------+-------+-----+
| 57.2 | '1.2' |  56 |
+------+-------+-----+

Note that either field names or indexes can be given.

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

The pass_row keyword argument can be given, which if True will mean that both the value and the containing row will be passed as arguments to the conversion function (so, i.e., the conversion function should accept two arguments).

When multiple fields are converted in a single call, the conversions are independent of each other. Each conversion sees the original row:

>>> # multiple conversions do not affect each other
... table13 = etl.convert(table1, {
...                           "foo": lambda foo, row: row.bar,
...                           "bar": lambda bar, row: row.foo,
...                       }, pass_row=True)
>>> table13
+-------+-----+-----+
| foo   | bar | baz |
+=======+=====+=====+
| '2.4' | 'A' |  12 |
+-------+-----+-----+
| '5.7' | 'B' |  34 |
+-------+-----+-----+
| '1.2' | 'C' |  56 |
+-------+-----+-----+

Also accepts failonerror and errorvalue keyword arguments, documented under petl.config.failonerror()

petl.transform.conversions.convertall(table, *args, **kwargs)[source]¶

Convenience function to convert all fields in the table using a common function or mapping. See also convert().

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.convertnumbers(table, strict=False, **kwargs)[source]¶

Convenience function to convert all field values to numbers where possible. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz', 'quux'],
...           ['1', '3.0', '9+3j', 'aaa'],
...           ['2', '1.3', '7+2j', None]]
>>> table2 = etl.convertnumbers(table1)
>>> table2
+-----+-----+--------+-------+
| foo | bar | baz    | quux  |
+=====+=====+========+=======+
|   1 | 3.0 | (9+3j) | 'aaa' |
+-----+-----+--------+-------+
|   2 | 1.3 | (7+2j) | None  |
+-----+-----+--------+-------+

petl.transform.conversions.replace(table, field, a, b, **kwargs)[source]¶

Convenience function to replace all occurrences of a with b under the given field. See also convert().

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.replaceall(table, a, b, **kwargs)[source]¶

Convenience function to replace all instances of a with b under all fields. See also convertall().

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.format(table, field, fmt, **kwargs)[source]¶

Convenience function to format all values in the given field using the fmt format string.

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.formatall(table, fmt, **kwargs)[source]¶

Convenience function to format all values in all fields using the fmt format string.

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.interpolate(table, field, fmt, **kwargs)[source]¶

Convenience function to interpolate all values in the given field using the fmt string.

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.interpolateall(table, fmt, **kwargs)[source]¶

Convenience function to interpolate all values in all fields using the fmt string.

The where keyword argument can be given with a callable or expression which is evaluated on each row and which should return True if the conversion should be applied on that row, else False.

petl.transform.conversions.update(table, field, value, **kwargs)[source]¶: Convenience function to convert a field to a fixed value. Accepts the where keyword argument. See also convert().

Selecting rows¶

petl.transform.selects.select(table, *args, **kwargs)[source]¶

Select rows meeting a condition. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 4, 9.3],
...           ['a', 2, 88.2],
...           ['b', 1, 23.3],
...           ['c', 8, 42.0],
...           ['d', 7, 100.9],
...           ['c', 2]]
>>> # the second positional argument can be a function accepting
... # a row
... table2 = etl.select(table1,
...                     lambda rec: rec.foo == 'a' and rec.baz > 88.1)
>>> table2
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'a' |   2 | 88.2 |
+-----+-----+------+

>>> # the second positional argument can also be an expression
... # string, which will be converted to a function using petl.expr()
... table3 = etl.select(table1, "{foo} == 'a' and {baz} > 88.1")
>>> table3
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'a' |   2 | 88.2 |
+-----+-----+------+

>>> # the condition can also be applied to a single field
... table4 = etl.select(table1, 'foo', lambda v: v == 'a')
>>> table4
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'a' |   4 |  9.3 |
+-----+-----+------+
| 'a' |   2 | 88.2 |
+-----+-----+------+

The complement of the selection can be returned (i.e., the query can be inverted) by providing complement=True as a keyword argument.

petl.transform.selects.selectop(table, field, value, op, complement=False)[source]¶: Select rows where the function op applied to the given field and the given value returns True.

petl.transform.selects.selecteq(table, field, value, complement=False)[source]¶: Select rows where the given field equals the given value.

petl.transform.selects.selectne(table, field, value, complement=False)[source]¶: Select rows where the given field does not equal the given value.

petl.transform.selects.selectlt(table, field, value, complement=False)[source]¶: Select rows where the given field is less than the given value.

petl.transform.selects.selectle(table, field, value, complement=False)[source]¶: Select rows where the given field is less than or equal to the given value.

petl.transform.selects.selectgt(table, field, value, complement=False)[source]¶: Select rows where the given field is greater than the given value.

petl.transform.selects.selectge(table, field, value, complement=False)[source]¶: Select rows where the given field is greater than or equal to the given value.

petl.transform.selects.selectrangeopen(table, field, minv, maxv, complement=False)[source]¶: Select rows where the given field is greater than or equal to minv and less than or equal to maxv.

petl.transform.selects.selectrangeopenleft(table, field, minv, maxv, complement=False)[source]¶: Select rows where the given field is greater than or equal to minv and less than maxv.

petl.transform.selects.selectrangeopenright(table, field, minv, maxv, complement=False)[source]¶: Select rows where the given field is greater than minv and less than or equal to maxv.

petl.transform.selects.selectrangeclosed(table, field, minv, maxv, complement=False)[source]¶: Select rows where the given field is greater than minv and less than maxv.

petl.transform.selects.selectcontains(table, field, value, complement=False)[source]¶: Select rows where the given field contains the given value.

petl.transform.selects.selectin(table, field, value, complement=False)[source]¶: Select rows where the given field is a member of the given value.

petl.transform.selects.selectnotin(table, field, value, complement=False)[source]¶: Select rows where the given field is not a member of the given value.

petl.transform.selects.selectis(table, field, value, complement=False)[source]¶: Select rows where the given field is the given value.

petl.transform.selects.selectisnot(table, field, value, complement=False)[source]¶: Select rows where the given field is not the given value.

petl.transform.selects.selectisinstance(table, field, value, complement=False)[source]¶: Select rows where the given field is an instance of the given type.

petl.transform.selects.selecttrue(table, field, complement=False)[source]¶: Select rows where the given field evaluates True.

petl.transform.selects.selectfalse(table, field, complement=False)[source]¶: Select rows where the given field evaluates False.

petl.transform.selects.selectnone(table, field, complement=False)[source]¶: Select rows where the given field is None.

petl.transform.selects.selectnotnone(table, field, complement=False)[source]¶: Select rows where the given field is not None.

petl.transform.selects.selectusingcontext(table, query)[source]¶

Select rows based on data in the current row and/or previous and next row. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['A', 1],
...           ['B', 4],
...           ['C', 5],
...           ['D', 9]]
>>> def query(prv, cur, nxt):
...     return ((prv is not None and (cur.bar - prv.bar) < 2)
...             or (nxt is not None and (nxt.bar - cur.bar) < 2))
...
>>> table2 = etl.selectusingcontext(table1, query)
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'B' |   4 |
+-----+-----+
| 'C' |   5 |
+-----+-----+

The query function should accept three rows and return a boolean value.

petl.transform.selects.rowlenselect(table, n, complement=False)[source]¶: Select rows of length n.

petl.transform.selects.facet(table, key)[source]¶

Return a dictionary mapping field values to tables. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 4, 9.3],
...           ['a', 2, 88.2],
...           ['b', 1, 23.3],
...           ['c', 8, 42.0],
...           ['d', 7, 100.9],
...           ['c', 2]]
>>> foo = etl.facet(table1, 'foo')
>>> sorted(foo.keys())
['a', 'b', 'c', 'd']
>>> foo['a']
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'a' |   4 |  9.3 |
+-----+-----+------+
| 'a' |   2 | 88.2 |
+-----+-----+------+

>>> foo['c']
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'c' |   8 | 42.0 |
+-----+-----+------+
| 'c' |   2 |      |
+-----+-----+------+

>>> # works with compound keys too
>>> table2 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 2, False],
...           ['b', 3, True],
...           ['b', 3, False]]
>>> foobar = etl.facet(table2, ('foo', 'bar'))

>>> sorted(foobar.keys())
[('a', 1), ('b', 2), ('b', 3)]

>>> foobar[('b', 3)]
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'b' |   3 | True  |
+-----+-----+-------+
| 'b' |   3 | False |
+-----+-----+-------+

petl.transform.selects.biselect(table, *args, **kwargs)[source]¶

Return two tables, the first containing selected rows, the second containing remaining rows. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 4, 9.3],
...           ['a', 2, 88.2],
...           ['b', 1, 23.3],
...           ['c', 8, 42.0],
...           ['d', 7, 100.9],
...           ['c', 2]]
>>> table2, table3 = etl.biselect(table1, lambda rec: rec.foo == 'a')
>>> table2
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'a' |   4 |  9.3 |
+-----+-----+------+
| 'a' |   2 | 88.2 |
+-----+-----+------+
>>> table3
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'b' |   1 |  23.3 |
+-----+-----+-------+
| 'c' |   8 |  42.0 |
+-----+-----+-------+
| 'd' |   7 | 100.9 |
+-----+-----+-------+
| 'c' |   2 |       |
+-----+-----+-------+

New in version 1.1.0.

Regular expressions¶

petl.transform.regex.search(table, *args, **kwargs)[source]¶

Perform a regular expression search, returning rows that match a given pattern, either anywhere in the row or within a specific field. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['orange', 12, 'oranges are nice fruit'],
...           ['mango', 42, 'I like them'],
...           ['banana', 74, 'lovely too'],
...           ['cucumber', 41, 'better than mango']]
>>> # search any field
... table2 = etl.search(table1, '.g.')
>>> table2
+------------+-----+--------------------------+
| foo        | bar | baz                      |
+============+=====+==========================+
| 'orange'   |  12 | 'oranges are nice fruit' |
+------------+-----+--------------------------+
| 'mango'    |  42 | 'I like them'            |
+------------+-----+--------------------------+
| 'cucumber' |  41 | 'better than mango'      |
+------------+-----+--------------------------+

>>> # search a specific field
... table3 = etl.search(table1, 'foo', '.g.')
>>> table3
+----------+-----+--------------------------+
| foo      | bar | baz                      |
+==========+=====+==========================+
| 'orange' |  12 | 'oranges are nice fruit' |
+----------+-----+--------------------------+
| 'mango'  |  42 | 'I like them'            |
+----------+-----+--------------------------+

The complement can be found via petl.transform.regex.searchcomplement().

petl.transform.regex.searchcomplement(table, *args, **kwargs)[source]¶

Perform a regular expression search, returning rows that do not match a given pattern, either anywhere in the row or within a specific field. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['orange', 12, 'oranges are nice fruit'],
...           ['mango', 42, 'I like them'],
...           ['banana', 74, 'lovely too'],
...           ['cucumber', 41, 'better than mango']]
>>> # search any field
... table2 = etl.searchcomplement(table1, '.g.')
>>> table2
+----------+-----+--------------+
| foo      | bar | baz          |
+==========+=====+==============+
| 'banana' |  74 | 'lovely too' |
+----------+-----+--------------+

>>> # search a specific field
... table3 = etl.searchcomplement(table1, 'foo', '.g.')
>>> table3
+------------+-----+---------------------+
| foo        | bar | baz                 |
+============+=====+=====================+
| 'banana'   |  74 | 'lovely too'        |
+------------+-----+---------------------+
| 'cucumber' |  41 | 'better than mango' |
+------------+-----+---------------------+

This returns the complement of petl.transform.regex.search().

petl.transform.regex.sub(table, field, pattern, repl, count=0, flags=0)[source]¶: Convenience function to convert values under the given field using a regular expression substitution. See also re.sub().

petl.transform.regex.split(table, field, pattern, newfields=None, include_original=False, maxsplit=0, flags=0)[source]¶

Add one or more new fields with values generated by splitting an existing value around occurrences of a regular expression. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'variable', 'value'],
...           ['1', 'parad1', '12'],
...           ['2', 'parad2', '15'],
...           ['3', 'tempd1', '18'],
...           ['4', 'tempd2', '19']]
>>> table2 = etl.split(table1, 'variable', 'd', ['variable', 'day'])
>>> table2
+-----+-------+----------+-----+
| id  | value | variable | day |
+=====+=======+==========+=====+
| '1' | '12'  | 'para'   | '1' |
+-----+-------+----------+-----+
| '2' | '15'  | 'para'   | '2' |
+-----+-------+----------+-----+
| '3' | '18'  | 'temp'   | '1' |
+-----+-------+----------+-----+
| '4' | '19'  | 'temp'   | '2' |
+-----+-------+----------+-----+

By default the field on which the split is performed is omitted. It can be included using the include_original argument.

petl.transform.regex.splitdown(table, field, pattern, maxsplit=0, flags=0)[source]¶

Split a field into multiple rows using a regular expression. E.g.:

>>> import petl as etl
>>> table1 = [['name', 'roles'],
...           ['Jane Doe', 'president,engineer,tailor,lawyer'],
...           ['John Doe', 'rocket scientist,optometrist,chef,knight,sailor']]
>>> table2 = etl.splitdown(table1, 'roles', ',')
>>> table2.lookall()
+------------+--------------------+
| name       | roles              |
+============+====================+
| 'Jane Doe' | 'president'        |
+------------+--------------------+
| 'Jane Doe' | 'engineer'         |
+------------+--------------------+
| 'Jane Doe' | 'tailor'           |
+------------+--------------------+
| 'Jane Doe' | 'lawyer'           |
+------------+--------------------+
| 'John Doe' | 'rocket scientist' |
+------------+--------------------+
| 'John Doe' | 'optometrist'      |
+------------+--------------------+
| 'John Doe' | 'chef'             |
+------------+--------------------+
| 'John Doe' | 'knight'           |
+------------+--------------------+
| 'John Doe' | 'sailor'           |
+------------+--------------------+

petl.transform.regex.capture(table, field, pattern, newfields=None, include_original=False, flags=0, fill=None)[source]¶

Add one or more new fields with values captured from an existing field searched via a regular expression. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'variable', 'value'],
...           ['1', 'A1', '12'],
...           ['2', 'A2', '15'],
...           ['3', 'B1', '18'],
...           ['4', 'C12', '19']]
>>> table2 = etl.capture(table1, 'variable', '([A-Z,a-z]+)([0-9]+)',
...                      ['treat', 'time'])
>>> table2
+-----+-------+-------+------+
| id  | value | treat | time |
+=====+=======+=======+======+
| '1' | '12'  | 'A'   | '1'  |
+-----+-------+-------+------+
| '2' | '15'  | 'A'   | '2'  |
+-----+-------+-------+------+
| '3' | '18'  | 'B'   | '1'  |
+-----+-------+-------+------+
| '4' | '19'  | 'C'   | '12' |
+-----+-------+-------+------+

>>> # using the include_original argument
... table3 = etl.capture(table1, 'variable', '([A-Z,a-z]+)([0-9]+)',
...                      ['treat', 'time'],
...                      include_original=True)
>>> table3
+-----+----------+-------+-------+------+
| id  | variable | value | treat | time |
+=====+==========+=======+=======+======+
| '1' | 'A1'     | '12'  | 'A'   | '1'  |
+-----+----------+-------+-------+------+
| '2' | 'A2'     | '15'  | 'A'   | '2'  |
+-----+----------+-------+-------+------+
| '3' | 'B1'     | '18'  | 'B'   | '1'  |
+-----+----------+-------+-------+------+
| '4' | 'C12'    | '19'  | 'C'   | '12' |
+-----+----------+-------+-------+------+

By default the field on which the capture is performed is omitted. It can be included using the include_original argument.

The fill parameter can be used to provide a list or tuple of values to use if the regular expression does not match. The fill parameter should contain as many values as there are capturing groups in the regular expression. If fill is None (default) then a petl.transform.TransformError will be raised on the first non-matching value.

Unpacking compound values¶

petl.transform.unpacks.unpack(table, field, newfields=None, include_original=False, missing=None)[source]¶

Unpack data values that are lists or tuples. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           [1, ['a', 'b']],
...           [2, ['c', 'd']],
...           [3, ['e', 'f']]]
>>> table2 = etl.unpack(table1, 'bar', ['baz', 'quux'])
>>> table2
+-----+-----+------+
| foo | baz | quux |
+=====+=====+======+
|   1 | 'a' | 'b'  |
+-----+-----+------+
|   2 | 'c' | 'd'  |
+-----+-----+------+
|   3 | 'e' | 'f'  |
+-----+-----+------+

This function will attempt to unpack exactly the number of values as given by the number of new fields specified. If there are more values than new fields, remaining values will not be unpacked. If there are less values than new fields, missing values will be added.

petl.transform.unpacks.unpackdict(table, field, keys=None, includeoriginal=False, samplesize=1000, missing=None)[source]¶

Unpack dictionary values into separate fields. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           [1, {'baz': 'a', 'quux': 'b'}],
...           [2, {'baz': 'c', 'quux': 'd'}],
...           [3, {'baz': 'e', 'quux': 'f'}]]
>>> table2 = etl.unpackdict(table1, 'bar')
>>> table2
+-----+-----+------+
| foo | baz | quux |
+=====+=====+======+
|   1 | 'a' | 'b'  |
+-----+-----+------+
|   2 | 'c' | 'd'  |
+-----+-----+------+
|   3 | 'e' | 'f'  |
+-----+-----+------+

Transforming rows¶

petl.transform.maps.fieldmap(table, mappings=None, failonerror=None, errorvalue=None)[source]¶

Transform a table, mapping fields arbitrarily between input and output. E.g.:

>>> import petl as etl
>>> from collections import OrderedDict
>>> table1 = [['id', 'sex', 'age', 'height', 'weight'],
...           [1, 'male', 16, 1.45, 62.0],
...           [2, 'female', 19, 1.34, 55.4],
...           [3, 'female', 17, 1.78, 74.4],
...           [4, 'male', 21, 1.33, 45.2],
...           [5, '-', 25, 1.65, 51.9]]
>>> mappings = OrderedDict()
>>> # rename a field
... mappings['subject_id'] = 'id'
>>> # translate a field
... mappings['gender'] = 'sex', {'male': 'M', 'female': 'F'}
>>> # apply a calculation to a field
... mappings['age_months'] = 'age', lambda v: v * 12
>>> # apply a calculation to a combination of fields
... mappings['bmi'] = lambda rec: rec['weight'] / rec['height']**2
>>> # transform and inspect the output
... table2 = etl.fieldmap(table1, mappings)
>>> table2
+------------+--------+------------+--------------------+
| subject_id | gender | age_months | bmi                |
+============+========+============+====================+
|          1 | 'M'    |        192 |  29.48870392390012 |
+------------+--------+------------+--------------------+
|          2 | 'F'    |        228 |   30.8531967030519 |
+------------+--------+------------+--------------------+
|          3 | 'F'    |        204 | 23.481883600555488 |
+------------+--------+------------+--------------------+
|          4 | 'M'    |        252 |  25.55260331279326 |
+------------+--------+------------+--------------------+
|          5 | '-'    |        300 |   19.0633608815427 |
+------------+--------+------------+--------------------+

Note also that the mapping value can be an expression string, which will be converted to a lambda function via petl.util.base.expr().

The failonerror and errorvalue keyword arguments are documented under petl.config.failonerror()

petl.transform.maps.rowmap(table, rowmapper, header, failonerror=None)[source]¶

Transform rows via an arbitrary function. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'sex', 'age', 'height', 'weight'],
...           [1, 'male', 16, 1.45, 62.0],
...           [2, 'female', 19, 1.34, 55.4],
...           [3, 'female', 17, 1.78, 74.4],
...           [4, 'male', 21, 1.33, 45.2],
...           [5, '-', 25, 1.65, 51.9]]
>>> def rowmapper(row):
...     transmf = {'male': 'M', 'female': 'F'}
...     return [row[0],
...             transmf[row['sex']] if row['sex'] in transmf else None,
...             row.age * 12,
...             row.height / row.weight ** 2]
...
>>> table2 = etl.rowmap(table1, rowmapper,
...                     header=['subject_id', 'gender', 'age_months',
...                             'bmi'])
>>> table2
+------------+--------+------------+-----------------------+
| subject_id | gender | age_months | bmi                   |
+============+========+============+=======================+
|          1 | 'M'    |        192 | 0.0003772112382934443 |
+------------+--------+------------+-----------------------+
|          2 | 'F'    |        228 | 0.0004366015456998006 |
+------------+--------+------------+-----------------------+
|          3 | 'F'    |        204 | 0.0003215689675106949 |
+------------+--------+------------+-----------------------+
|          4 | 'M'    |        252 | 0.0006509906805544679 |
+------------+--------+------------+-----------------------+
|          5 | None   |        300 | 0.0006125608384287258 |
+------------+--------+------------+-----------------------+

The rowmapper function should accept a single row and return a single row (list or tuple).

The failonerror keyword argument is documented under petl.config.failonerror()

petl.transform.maps.rowmapmany(table, rowgenerator, header, failonerror=None)[source]¶

Map each input row to any number of output rows via an arbitrary function. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'sex', 'age', 'height', 'weight'],
...           [1, 'male', 16, 1.45, 62.0],
...           [2, 'female', 19, 1.34, 55.4],
...           [3, '-', 17, 1.78, 74.4],
...           [4, 'male', 21, 1.33]]
>>> def rowgenerator(row):
...     transmf = {'male': 'M', 'female': 'F'}
...     yield [row[0], 'gender',
...            transmf[row['sex']] if row['sex'] in transmf else None]
...     yield [row[0], 'age_months', row.age * 12]
...     yield [row[0], 'bmi', row.height / row.weight ** 2]
...
>>> table2 = etl.rowmapmany(table1, rowgenerator,
...                         header=['subject_id', 'variable', 'value'])
>>> table2.lookall()
+------------+--------------+-----------------------+
| subject_id | variable     | value                 |
+============+==============+=======================+
|          1 | 'gender'     | 'M'                   |
+------------+--------------+-----------------------+
|          1 | 'age_months' |                   192 |
+------------+--------------+-----------------------+
|          1 | 'bmi'        | 0.0003772112382934443 |
+------------+--------------+-----------------------+
|          2 | 'gender'     | 'F'                   |
+------------+--------------+-----------------------+
|          2 | 'age_months' |                   228 |
+------------+--------------+-----------------------+
|          2 | 'bmi'        | 0.0004366015456998006 |
+------------+--------------+-----------------------+
|          3 | 'gender'     | None                  |
+------------+--------------+-----------------------+
|          3 | 'age_months' |                   204 |
+------------+--------------+-----------------------+
|          3 | 'bmi'        | 0.0003215689675106949 |
+------------+--------------+-----------------------+
|          4 | 'gender'     | 'M'                   |
+------------+--------------+-----------------------+
|          4 | 'age_months' |                   252 |
+------------+--------------+-----------------------+

The rowgenerator function should accept a single row and yield zero or more rows (lists or tuples).

The failonerror keyword argument is documented under petl.config.failonerror()

See also the petl.transform.reshape.melt() function.

petl.transform.maps.rowgroupmap(table, key, mapper, header=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶: Group rows under the given key then apply mapper to yield zero or more output rows for each input group of rows.

Sorting¶

petl.transform.sorts.sort(table, key=None, reverse=False, buffersize=None, tempdir=None, cache=True)[source]¶

Sort the table. Field names or indices (from zero) can be used to specify the key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['C', 2],
...           ['A', 9],
...           ['A', 6],
...           ['F', 1],
...           ['D', 10]]
>>> table2 = etl.sort(table1, 'foo')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   9 |
+-----+-----+
| 'A' |   6 |
+-----+-----+
| 'C' |   2 |
+-----+-----+
| 'D' |  10 |
+-----+-----+
| 'F' |   1 |
+-----+-----+

>>> # sorting by compound key is supported
... table3 = etl.sort(table1, key=['foo', 'bar'])
>>> table3
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   6 |
+-----+-----+
| 'A' |   9 |
+-----+-----+
| 'C' |   2 |
+-----+-----+
| 'D' |  10 |
+-----+-----+
| 'F' |   1 |
+-----+-----+

>>> # if no key is specified, the default is a lexical sort
... table4 = etl.sort(table1)
>>> table4
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   6 |
+-----+-----+
| 'A' |   9 |
+-----+-----+
| 'C' |   2 |
+-----+-----+
| 'D' |  10 |
+-----+-----+
| 'F' |   1 |
+-----+-----+

The buffersize argument should be an int or None.

If the number of rows in the table is less than buffersize, the table will be sorted in memory. Otherwise, the table is sorted in chunks of no more than buffersize rows, each chunk is written to a temporary file, and then a merge sort is performed on the temporary files.

If buffersize is None, the value of petl.config.sort_buffersize will be used. By default this is set to 100000 rows, but can be changed, e.g.:

>>> import petl.config
>>> petl.config.sort_buffersize = 500000

If petl.config.sort_buffersize is set to None, this forces all sorting to be done entirely in memory.

By default the results of the sort will be cached, and so a second pass over the sorted table will yield rows from the cache and will not repeat the sort operation. To turn off caching, set the cache argument to False.

petl.transform.sorts.mergesort(*tables, **kwargs)[source]¶

Combine multiple input tables into one sorted output table. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['A', 9],
...           ['C', 2],
...           ['D', 10],
...           ['A', 6],
...           ['F', 1]]
>>> table2 = [['foo', 'bar'],
...           ['B', 3],
...           ['D', 10],
...           ['A', 10],
...           ['F', 4]]
>>> table3 = etl.mergesort(table1, table2, key='foo')
>>> table3.lookall()
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   9 |
+-----+-----+
| 'A' |   6 |
+-----+-----+
| 'A' |  10 |
+-----+-----+
| 'B' |   3 |
+-----+-----+
| 'C' |   2 |
+-----+-----+
| 'D' |  10 |
+-----+-----+
| 'D' |  10 |
+-----+-----+
| 'F' |   1 |
+-----+-----+
| 'F' |   4 |
+-----+-----+

If the input tables are already sorted by the given key, give presorted=True as a keyword argument.

This function is equivalent to concatenating the input tables using cat() then sorting, however this function will typically be more efficient, especially if the input tables are presorted.

Keyword arguments:

key : string or tuple of strings, optional: Field name or tuple of fields to sort by (defaults to None lexical sort)
reverse : bool, optional: True if sort in reverse (descending) order (defaults to False)
presorted : bool, optional: True if inputs are already sorted by the given key (defaults to False)
missing : object: Value to fill with when input tables have different fields (defaults to None)
header : sequence of strings, optional: Specify a fixed header for the output table
buffersize : int, optional: Limit the number of rows in memory per input table when inputs are not presorted

petl.transform.sorts.issorted(table, key=None, reverse=False, strict=False)[source]¶

Return True if the table is ordered (i.e., sorted) by the given key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 1, True],
...           ['b', 3, True],
...           ['b', 2]]
>>> etl.issorted(table1, key='foo')
True
>>> etl.issorted(table1, key='bar')
False
>>> etl.issorted(table1, key='foo', strict=True)
False
>>> etl.issorted(table1, key='foo', reverse=True)
False

Joins¶

petl.transform.joins.join(left, right, key=None, lkey=None, rkey=None, presorted=False, buffersize=None, tempdir=None, cache=True, lprefix=None, rprefix=None)[source]¶

Perform an equi-join on the given tables. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red'],
...           [3, 'purple']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square'],
...           [4, 'ellipse']]
>>> table3 = etl.join(table1, table2, key='id')
>>> table3
+----+----------+----------+
| id | colour   | shape    |
+====+==========+==========+
|  1 | 'blue'   | 'circle' |
+----+----------+----------+
|  3 | 'purple' | 'square' |
+----+----------+----------+

>>> # if no key is given, a natural join is tried
... table4 = etl.join(table1, table2)
>>> table4
+----+----------+----------+
| id | colour   | shape    |
+====+==========+==========+
|  1 | 'blue'   | 'circle' |
+----+----------+----------+
|  3 | 'purple' | 'square' |
+----+----------+----------+

>>> # note behaviour if the key is not unique in either or both tables
... table5 = [['id', 'colour'],
...           [1, 'blue'],
...           [1, 'red'],
...           [2, 'purple']]
>>> table6 = [['id', 'shape'],
...           [1, 'circle'],
...           [1, 'square'],
...           [2, 'ellipse']]
>>> table7 = etl.join(table5, table6, key='id')
>>> table7
+----+----------+-----------+
| id | colour   | shape     |
+====+==========+===========+
|  1 | 'blue'   | 'circle'  |
+----+----------+-----------+
|  1 | 'blue'   | 'square'  |
+----+----------+-----------+
|  1 | 'red'    | 'circle'  |
+----+----------+-----------+
|  1 | 'red'    | 'square'  |
+----+----------+-----------+
|  2 | 'purple' | 'ellipse' |
+----+----------+-----------+

>>> # compound keys are supported
... table8 = [['id', 'time', 'height'],
...           [1, 1, 12.3],
...           [1, 2, 34.5],
...           [2, 1, 56.7]]
>>> table9 = [['id', 'time', 'weight'],
...           [1, 2, 4.5],
...           [2, 1, 6.7],
...           [2, 2, 8.9]]
>>> table10 = etl.join(table8, table9, key=['id', 'time'])
>>> table10
+----+------+--------+--------+
| id | time | height | weight |
+====+======+========+========+
|  1 |    2 |   34.5 |    4.5 |
+----+------+--------+--------+
|  2 |    1 |   56.7 |    6.7 |
+----+------+--------+--------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.joins.leftjoin(left, right, key=None, lkey=None, rkey=None, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True, lprefix=None, rprefix=None)[source]¶

Perform a left outer join on the given tables. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red'],
...           [3, 'purple']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square'],
...           [4, 'ellipse']]
>>> table3 = etl.leftjoin(table1, table2, key='id')
>>> table3
+----+----------+----------+
| id | colour   | shape    |
+====+==========+==========+
|  1 | 'blue'   | 'circle' |
+----+----------+----------+
|  2 | 'red'    | None     |
+----+----------+----------+
|  3 | 'purple' | 'square' |
+----+----------+----------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.joins.lookupjoin(left, right, key=None, lkey=None, rkey=None, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True, lprefix=None, rprefix=None)[source]¶

Perform a left join, but where the key is not unique in the right-hand table, arbitrarily choose the first row and ignore others. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'color', 'cost'],
...           [1, 'blue', 12],
...           [2, 'red', 8],
...           [3, 'purple', 4]]
>>> table2 = [['id', 'shape', 'size'],
...           [1, 'circle', 'big'],
...           [1, 'circle', 'small'],
...           [2, 'square', 'tiny'],
...           [2, 'square', 'big'],
...           [3, 'ellipse', 'small'],
...           [3, 'ellipse', 'tiny']]
>>> table3 = etl.lookupjoin(table1, table2, key='id')
>>> table3
+----+----------+------+-----------+---------+
| id | color    | cost | shape     | size    |
+====+==========+======+===========+=========+
|  1 | 'blue'   |   12 | 'circle'  | 'big'   |
+----+----------+------+-----------+---------+
|  2 | 'red'    |    8 | 'square'  | 'tiny'  |
+----+----------+------+-----------+---------+
|  3 | 'purple' |    4 | 'ellipse' | 'small' |
+----+----------+------+-----------+---------+

petl.transform.joins.rightjoin(left, right, key=None, lkey=None, rkey=None, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True, lprefix=None, rprefix=None)[source]¶

Perform a right outer join on the given tables. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red'],
...           [3, 'purple']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square'],
...           [4, 'ellipse']]
>>> table3 = etl.rightjoin(table1, table2, key='id')
>>> table3
+----+----------+-----------+
| id | colour   | shape     |
+====+==========+===========+
|  1 | 'blue'   | 'circle'  |
+----+----------+-----------+
|  3 | 'purple' | 'square'  |
+----+----------+-----------+
|  4 | None     | 'ellipse' |
+----+----------+-----------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.joins.outerjoin(left, right, key=None, lkey=None, rkey=None, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True, lprefix=None, rprefix=None)[source]¶

Perform a full outer join on the given tables. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red'],
...           [3, 'purple']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square'],
...           [4, 'ellipse']]
>>> table3 = etl.outerjoin(table1, table2, key='id')
>>> table3
+----+----------+-----------+
| id | colour   | shape     |
+====+==========+===========+
|  1 | 'blue'   | 'circle'  |
+----+----------+-----------+
|  2 | 'red'    | None      |
+----+----------+-----------+
|  3 | 'purple' | 'square'  |
+----+----------+-----------+
|  4 | None     | 'ellipse' |
+----+----------+-----------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.joins.crossjoin(*tables, **kwargs)[source]¶

Form the cartesian product of the given tables. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square']]
>>> table3 = etl.crossjoin(table1, table2)
>>> table3
+----+--------+----+----------+
| id | colour | id | shape    |
+====+========+====+==========+
|  1 | 'blue' |  1 | 'circle' |
+----+--------+----+----------+
|  1 | 'blue' |  3 | 'square' |
+----+--------+----+----------+
|  2 | 'red'  |  1 | 'circle' |
+----+--------+----+----------+
|  2 | 'red'  |  3 | 'square' |
+----+--------+----+----------+

If prefix is True then field names in the output table header will be prefixed by the index of the input table.

petl.transform.joins.antijoin(left, right, key=None, lkey=None, rkey=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Return rows from the left table where the key value does not occur in the right table. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [0, 'black'],
...           [1, 'blue'],
...           [2, 'red'],
...           [4, 'yellow'],
...           [5, 'white']]
>>> table2 = [['id', 'shape'],
...           [1, 'circle'],
...           [3, 'square']]
>>> table3 = etl.antijoin(table1, table2, key='id')
>>> table3
+----+----------+
| id | colour   |
+====+==========+
|  0 | 'black'  |
+----+----------+
|  2 | 'red'    |
+----+----------+
|  4 | 'yellow' |
+----+----------+
|  5 | 'white'  |
+----+----------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.joins.unjoin(table, value, key=None, autoincrement=(1, 1), presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Split a table into two tables by reversing an inner join. E.g.:

>>> import petl as etl
>>> # join key is present in the table
... table1 = (('foo', 'bar', 'baz'),
...           ('A', 1, 'apple'),
...           ('B', 1, 'apple'),
...           ('C', 2, 'orange'))
>>> table2, table3 = etl.unjoin(table1, 'baz', key='bar')
>>> table2
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'B' |   1 |
+-----+-----+
| 'C' |   2 |
+-----+-----+

>>> table3
+-----+----------+
| bar | baz      |
+=====+==========+
|   1 | 'apple'  |
+-----+----------+
|   2 | 'orange' |
+-----+----------+

>>> # an integer join key can also be reconstructed
... table4 = (('foo', 'bar'),
...           ('A', 'apple'),
...           ('B', 'apple'),
...           ('C', 'orange'))
>>> table5, table6 = etl.unjoin(table4, 'bar')
>>> table5
+-----+--------+
| foo | bar_id |
+=====+========+
| 'A' |      1 |
+-----+--------+
| 'B' |      1 |
+-----+--------+
| 'C' |      2 |
+-----+--------+

>>> table6
+----+----------+
| id | bar      |
+====+==========+
|  1 | 'apple'  |
+----+----------+
|  2 | 'orange' |
+----+----------+

The autoincrement parameter controls how an integer join key is reconstructed, and should be a tuple of (start, step).

petl.transform.hashjoins.hashjoin(left, right, key=None, lkey=None, rkey=None, cache=True, lprefix=None, rprefix=None)[source]¶

Alternative implementation of petl.transform.joins.join(), where the join is executed by constructing an in-memory lookup for the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

By default data from right hand table is cached to improve performance (only available when key is given).

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.hashjoins.hashleftjoin(left, right, key=None, lkey=None, rkey=None, missing=None, cache=True, lprefix=None, rprefix=None)[source]¶

Alternative implementation of petl.transform.joins.leftjoin(), where the join is executed by constructing an in-memory lookup for the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

By default data from right hand table is cached to improve performance (only available when key is given).

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.hashjoins.hashlookupjoin(left, right, key=None, lkey=None, rkey=None, missing=None, lprefix=None, rprefix=None)[source]¶

Alternative implementation of petl.transform.joins.lookupjoin(), where the join is executed by constructing an in-memory lookup for the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.hashjoins.hashrightjoin(left, right, key=None, lkey=None, rkey=None, missing=None, cache=True, lprefix=None, rprefix=None)[source]¶

Alternative implementation of petl.transform.joins.rightjoin(), where the join is executed by constructing an in-memory lookup for the left hand table, then iterating over rows from the right hand table.

May be faster and/or more resource efficient where the left table is small and the right table is large.

By default data from right hand table is cached to improve performance (only available when key is given).

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

petl.transform.hashjoins.hashantijoin(left, right, key=None, lkey=None, rkey=None)[source]¶

Alternative implementation of petl.transform.joins.antijoin(), where the join is executed by constructing an in-memory set for all keys found in the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

Left and right tables with different key fields can be handled via the lkey and rkey arguments.

Set operations¶

petl.transform.setops.complement(a, b, presorted=False, buffersize=None, tempdir=None, cache=True, strict=False)[source]¶

Return rows in a that are not in b. E.g.:

>>> import petl as etl
>>> a = [['foo', 'bar', 'baz'],
...      ['A', 1, True],
...      ['C', 7, False],
...      ['B', 2, False],
...      ['C', 9, True]]
>>> b = [['x', 'y', 'z'],
...      ['B', 2, False],
...      ['A', 9, False],
...      ['B', 3, True],
...      ['C', 9, True]]
>>> aminusb = etl.complement(a, b)
>>> aminusb
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'C' |   7 | False |
+-----+-----+-------+

>>> bminusa = etl.complement(b, a)
>>> bminusa
+-----+---+-------+
| x   | y | z     |
+=====+===+=======+
| 'A' | 9 | False |
+-----+---+-------+
| 'B' | 3 | True  |
+-----+---+-------+

Note that the field names of each table are ignored - rows are simply compared following a lexical sort. See also the petl.transform.setops.recordcomplement() function.

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Note that the default behaviour is not strictly set-like, because duplicate rows are counted separately, e.g.:

>>> a = [['foo', 'bar'],
...      ['A', 1],
...      ['B', 2],
...      ['B', 2],
...      ['C', 7]]
>>> b = [['foo', 'bar'],
...      ['B', 2]]
>>> aminusb = etl.complement(a, b)
>>> aminusb
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'B' |   2 |
+-----+-----+
| 'C' |   7 |
+-----+-----+

This behaviour can be changed with the strict keyword argument, e.g.:

>>> aminusb = etl.complement(a, b, strict=True)
>>> aminusb
+-----+-----+
| foo | bar |
+=====+=====+
| 'A' |   1 |
+-----+-----+
| 'C' |   7 |
+-----+-----+

Changed in version 1.1.0.

If strict is True then strict set-like behaviour is used, i.e., only rows in a not found in b are returned.

petl.transform.setops.diff(a, b, presorted=False, buffersize=None, tempdir=None, cache=True, strict=False)[source]¶

Find the difference between rows in two tables. Returns a pair of tables. E.g.:

>>> import petl as etl
>>> a = [['foo', 'bar', 'baz'],
...      ['A', 1, True],
...      ['C', 7, False],
...      ['B', 2, False],
...      ['C', 9, True]]
>>> b = [['x', 'y', 'z'],
...      ['B', 2, False],
...      ['A', 9, False],
...      ['B', 3, True],
...      ['C', 9, True]]
>>> added, subtracted = etl.diff(a, b)
>>> # rows in b not in a
... added
+-----+---+-------+
| x   | y | z     |
+=====+===+=======+
| 'A' | 9 | False |
+-----+---+-------+
| 'B' | 3 | True  |
+-----+---+-------+

>>> # rows in a not in b
... subtracted
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'C' |   7 | False |
+-----+-----+-------+

Convenient shorthand for (complement(b, a), complement(a, b)). See also petl.transform.setops.complement().

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Changed in version 1.1.0.

If strict is True then strict set-like behaviour is used.

petl.transform.setops.recordcomplement(a, b, buffersize=None, tempdir=None, cache=True, strict=False)[source]¶

Find records in a that are not in b. E.g.:

>>> import petl as etl
>>> a = [['foo', 'bar', 'baz'],
...      ['A', 1, True],
...      ['C', 7, False],
...      ['B', 2, False],
...      ['C', 9, True]]
>>> b = [['bar', 'foo', 'baz'],
...      [2, 'B', False],
...      [9, 'A', False],
...      [3, 'B', True],
...      [9, 'C', True]]
>>> aminusb = etl.recordcomplement(a, b)
>>> aminusb
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'C' |   7 | False |
+-----+-----+-------+

>>> bminusa = etl.recordcomplement(b, a)
>>> bminusa
+-----+-----+-------+
| bar | foo | baz   |
+=====+=====+=======+
|   3 | 'B' | True  |
+-----+-----+-------+
|   9 | 'A' | False |
+-----+-----+-------+

Note that both tables must have the same set of fields, but that the order of the fields does not matter. See also the petl.transform.setops.complement() function.

See also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.setops.recorddiff(a, b, buffersize=None, tempdir=None, cache=True, strict=False)[source]¶

Find the difference between records in two tables. E.g.:

>>> import petl as etl
>>> a = [['foo', 'bar', 'baz'],
...      ['A', 1, True],
...      ['C', 7, False],
...      ['B', 2, False],
...      ['C', 9, True]]
>>> b = [['bar', 'foo', 'baz'],
...      [2, 'B', False],
...      [9, 'A', False],
...      [3, 'B', True],
...      [9, 'C', True]]
>>> added, subtracted = etl.recorddiff(a, b)
>>> added
+-----+-----+-------+
| bar | foo | baz   |
+=====+=====+=======+
|   3 | 'B' | True  |
+-----+-----+-------+
|   9 | 'A' | False |
+-----+-----+-------+

>>> subtracted
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'C' |   7 | False |
+-----+-----+-------+

Convenient shorthand for (recordcomplement(b, a), recordcomplement(a, b)). See also petl.transform.setops.recordcomplement().

See also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

Changed in version 1.1.0.

If strict is True then strict set-like behaviour is used.

petl.transform.setops.intersection(a, b, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Return rows in a that are also in b. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, True],
...           ['C', 7, False],
...           ['B', 2, False],
...           ['C', 9, True]]
>>> table2 = [['x', 'y', 'z'],
...           ['B', 2, False],
...           ['A', 9, False],
...           ['B', 3, True],
...           ['C', 9, True]]
>>> table3 = etl.intersection(table1, table2)
>>> table3
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'B' |   2 | False |
+-----+-----+-------+
| 'C' |   9 | True  |
+-----+-----+-------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.setops.hashcomplement(a, b, strict=False)[source]¶

Alternative implementation of petl.transform.setops.complement(), where the complement is executed by constructing an in-memory set for all rows found in the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

Changed in version 1.1.0.

If strict is True then strict set-like behaviour is used, i.e., only rows in a not found in b are returned.

petl.transform.setops.hashintersection(a, b)[source]¶

Alternative implementation of petl.transform.setops.intersection(), where the intersection is executed by constructing an in-memory set for all rows found in the right hand table, then iterating over rows from the left hand table.

May be faster and/or more resource efficient where the right table is small and the left table is large.

Deduplicating rows¶

petl.transform.dedup.duplicates(table, key=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Select rows with duplicate values under a given key (or duplicate rows where no key is given). E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2.0],
...           ['B', 2, 3.4],
...           ['D', 6, 9.3],
...           ['B', 3, 7.8],
...           ['B', 2, 12.3],
...           ['E', None, 1.3],
...           ['D', 4, 14.5]]
>>> table2 = etl.duplicates(table1, 'foo')
>>> table2
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'B' |   2 |  3.4 |
+-----+-----+------+
| 'B' |   3 |  7.8 |
+-----+-----+------+
| 'B' |   2 | 12.3 |
+-----+-----+------+
| 'D' |   6 |  9.3 |
+-----+-----+------+
| 'D' |   4 | 14.5 |
+-----+-----+------+

>>> # compound keys are supported
... table3 = etl.duplicates(table1, key=['foo', 'bar'])
>>> table3
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'B' |   2 |  3.4 |
+-----+-----+------+
| 'B' |   2 | 12.3 |
+-----+-----+------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.dedup.unique(table, key=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Select rows with unique values under a given key (or unique rows if no key is given). E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2],
...           ['B', '2', '3.4'],
...           ['D', 'xyz', 9.0],
...           ['B', u'3', u'7.8'],
...           ['B', '2', 42],
...           ['E', None, None],
...           ['D', 4, 12.3],
...           ['F', 7, 2.3]]
>>> table2 = etl.unique(table1, 'foo')
>>> table2
+-----+------+------+
| foo | bar  | baz  |
+=====+======+======+
| 'A' |    1 |    2 |
+-----+------+------+
| 'E' | None | None |
+-----+------+------+
| 'F' |    7 |  2.3 |
+-----+------+------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.dedup.conflicts(table, key, missing=None, include=None, exclude=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Select rows with the same key value but differing in some other field. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2.7],
...           ['B', 2, None],
...           ['D', 3, 9.4],
...           ['B', None, 7.8],
...           ['E', None],
...           ['D', 3, 12.3],
...           ['A', 2, None]]
>>> table2 = etl.conflicts(table1, 'foo')
>>> table2
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
| 'A' |   1 |  2.7 |
+-----+-----+------+
| 'A' |   2 | None |
+-----+-----+------+
| 'D' |   3 |  9.4 |
+-----+-----+------+
| 'D' |   3 | 12.3 |
+-----+-----+------+

Missing values are not considered conflicts. By default, None is treated as the missing value, this can be changed via the missing keyword argument.

One or more fields can be ignored when determining conflicts by providing the exclude keyword argument. Alternatively, fields to use when determining conflicts can be specified explicitly with the include keyword argument. This provides a simple mechanism for analysing the source of conflicting rows from multiple tables, e.g.:

>>> table1 = [['foo', 'bar'], [1, 'a'], [2, 'b']]
>>> table2 = [['foo', 'bar'], [1, 'a'], [2, 'c']]
>>> table3 = etl.cat(etl.addfield(table1, 'source', 1),
...                  etl.addfield(table2, 'source', 2))
>>> table4 = etl.conflicts(table3, key='foo', exclude='source')
>>> table4
+-----+-----+--------+
| foo | bar | source |
+=====+=====+========+
|   2 | 'b' |      1 |
+-----+-----+--------+
|   2 | 'c' |      2 |
+-----+-----+--------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.dedup.distinct(table, key=None, count=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Return only distinct rows in the table.

If the count argument is not None, it will be used as the name for an additional field, and the values of the field will be the number of duplicate rows.

If the key keyword argument is passed, the comparison is done on the given key instead of the full row.

petl.transform.dedup.isunique(table, field)[source]¶

Return True if there are no duplicate values for the given field(s), otherwise False. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 1],
...           ['b'],
...           ['b', 2],
...           ['c', 3, True]]
>>> etl.isunique(table1, 'foo')
False
>>> etl.isunique(table1, 'bar')
True

The field argument can be a single field name or index (starting from zero) or a tuple of field names and/or indexes.

Reducing rows (aggregation)¶

petl.transform.reductions.aggregate(table, key, aggregation=None, value=None, presorted=False, buffersize=None, tempdir=None, cache=True, field='value')[source]¶

Apply aggregation functions. E.g.:

>>> import petl as etl
>>>
>>> table1 = [['foo', 'bar', 'baz'],
...           ['a', 3, True],
...           ['a', 7, False],
...           ['b', 2, True],
...           ['b', 2, False],
...           ['b', 9, False],
...           ['c', 4, True]]
>>> # aggregate whole rows
... table2 = etl.aggregate(table1, 'foo', len)
>>> table2
+-----+-------+
| foo | value |
+=====+=======+
| 'a' |     2 |
+-----+-------+
| 'b' |     3 |
+-----+-------+
| 'c' |     1 |
+-----+-------+

>>> # aggregate whole rows without a key
>>> etl.aggregate(table1, None, len)
+-------+
| value |
+=======+
|     6 |
+-------+

>>> # aggregate single field
... table3 = etl.aggregate(table1, 'foo', sum, 'bar')
>>> table3
+-----+-------+
| foo | value |
+=====+=======+
| 'a' |    10 |
+-----+-------+
| 'b' |    13 |
+-----+-------+
| 'c' |     4 |
+-----+-------+

>>> # aggregate single field without a key
>>> etl.aggregate(table1, None, sum, 'bar')
+-------+
| value |
+=======+
|    27 |
+-------+

>>> # alternative signature using keyword args
... table4 = etl.aggregate(table1, key=('foo', 'bar'),
...                        aggregation=list, value=('bar', 'baz'))
>>> table4
+-----+-----+-------------------------+
| foo | bar | value                   |
+=====+=====+=========================+
| 'a' |   3 | [(3, True)]             |
+-----+-----+-------------------------+
| 'a' |   7 | [(7, False)]            |
+-----+-----+-------------------------+
| 'b' |   2 | [(2, True), (2, False)] |
+-----+-----+-------------------------+
| 'b' |   9 | [(9, False)]            |
+-----+-----+-------------------------+
| 'c' |   4 | [(4, True)]             |
+-----+-----+-------------------------+

>>> # alternative signature using keyword args without a key
>>> etl.aggregate(table1, key=None,
...                 aggregation=list, value=('bar', 'baz'))
+-----------------------------------------------------------------------+
| value                                                                 |
+=======================================================================+
| [(3, True), (7, False), (2, True), (2, False), (9, False), (4, True)] |
+-----------------------------------------------------------------------+

>>> # aggregate multiple fields
... from collections import OrderedDict
>>> import petl as etl
>>>
>>> aggregation = OrderedDict()
>>> aggregation['count'] = len
>>> aggregation['minbar'] = 'bar', min
>>> aggregation['maxbar'] = 'bar', max
>>> aggregation['sumbar'] = 'bar', sum
>>> # default aggregation function is list
... aggregation['listbar'] = 'bar'
>>> aggregation['listbarbaz'] = ('bar', 'baz'), list
>>> aggregation['bars'] = 'bar', etl.strjoin(', ')
>>> table5 = etl.aggregate(table1, 'foo', aggregation)
>>> table5
+-----+-------+--------+--------+--------+-----------+-------------------------------------+-----------+
| foo | count | minbar | maxbar | sumbar | listbar   | listbarbaz                          | bars      |
+=====+=======+========+========+========+===========+=====================================+===========+
| 'a' |     2 |      3 |      7 |     10 | [3, 7]    | [(3, True), (7, False)]             | '3, 7'    |
+-----+-------+--------+--------+--------+-----------+-------------------------------------+-----------+
| 'b' |     3 |      2 |      9 |     13 | [2, 2, 9] | [(2, True), (2, False), (9, False)] | '2, 2, 9' |
+-----+-------+--------+--------+--------+-----------+-------------------------------------+-----------+
| 'c' |     1 |      4 |      4 |      4 | [4]       | [(4, True)]                         | '4'       |
+-----+-------+--------+--------+--------+-----------+-------------------------------------+-----------+

>>> # aggregate multiple fields without a key
>>> etl.aggregate(table1, None, aggregation)
+-------+--------+--------+--------+--------------------+-----------------------------------------------------------------------+--------------------+
| count | minbar | maxbar | sumbar | listbar            | listbarbaz                                                            | bars               |
+=======+========+========+========+====================+=======================================================================+====================+
|     6 |      2 |      9 |     27 | [3, 7, 2, 2, 9, 4] | [(3, True), (7, False), (2, True), (2, False), (9, False), (4, True)] | '3, 7, 2, 2, 9, 4' |
+-------+--------+--------+--------+--------------------+-----------------------------------------------------------------------+--------------------+

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

If key is None, sorting is not necessary.

petl.transform.reductions.rowreduce(table, key, reducer, header=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Group rows under the given key then apply reducer to produce a single output row for each input group of rows. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar'],
...           ['a', 3],
...           ['a', 7],
...           ['b', 2],
...           ['b', 1],
...           ['b', 9],
...           ['c', 4]]
>>> def sumbar(key, rows):
...     return [key, sum(row[1] for row in rows)]
...
>>> table2 = etl.rowreduce(table1, key='foo', reducer=sumbar,
...                        header=['foo', 'barsum'])
>>> table2
+-----+--------+
| foo | barsum |
+=====+========+
| 'a' |     10 |
+-----+--------+
| 'b' |     12 |
+-----+--------+
| 'c' |      4 |
+-----+--------+

N.B., this is not strictly a “reduce” in the sense of the standard Python reduce() function, i.e., the reducer function is not applied recursively to values within a group, rather it is applied once to each row group as a whole.

petl.transform.reductions.mergeduplicates(table, key, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Merge duplicate rows under the given key. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, 2.7],
...           ['B', 2, None],
...           ['D', 3, 9.4],
...           ['B', None, 7.8],
...           ['E', None, 42.],
...           ['D', 3, 12.3],
...           ['A', 2, None]]
>>> table2 = etl.mergeduplicates(table1, 'foo')
>>> table2
+-----+------------------+-----------------------+
| foo | bar              | baz                   |
+=====+==================+=======================+
| 'A' | Conflict({1, 2}) |                   2.7 |
+-----+------------------+-----------------------+
| 'B' |                2 |                   7.8 |
+-----+------------------+-----------------------+
| 'D' |                3 | Conflict({9.4, 12.3}) |
+-----+------------------+-----------------------+
| 'E' | None             |                  42.0 |
+-----+------------------+-----------------------+

Missing values are overridden by non-missing values. Conflicting values are reported as an instance of the Conflict class (sub-class of frozenset).

If presorted is True, it is assumed that the data are already sorted by the given key, and the buffersize, tempdir and cache arguments are ignored. Otherwise, the data are sorted, see also the discussion of the buffersize, tempdir and cache arguments under the petl.transform.sorts.sort() function.

petl.transform.reductions.merge(*tables, **kwargs)[source]¶

Convenience function to combine multiple tables (via petl.transform.sorts.mergesort()) then combine duplicate rows by merging under the given key (via petl.transform.reductions.mergeduplicates()). E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           [1, 'A', True],
...           [2, 'B', None],
...           [4, 'C', True]]
>>> table2 = [['bar', 'baz', 'quux'],
...           ['A', True, 42.0],
...           ['B', False, 79.3],
...           ['C', False, 12.4]]
>>> table3 = etl.merge(table1, table2, key='bar')
>>> table3
+-----+-----+-------------------------+------+
| bar | foo | baz                     | quux |
+=====+=====+=========================+======+
| 'A' |   1 | True                    | 42.0 |
+-----+-----+-------------------------+------+
| 'B' |   2 | False                   | 79.3 |
+-----+-----+-------------------------+------+
| 'C' |   4 | Conflict({False, True}) | 12.4 |
+-----+-----+-------------------------+------+

Keyword arguments are the same as for petl.transform.sorts.mergesort(), except key is required.

petl.transform.reductions.fold(table, key, f, value=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Reduce rows recursively via the Python standard reduce() function. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'count'],
...           [1, 3],
...           [1, 5],
...           [2, 4],
...           [2, 8]]
>>> import operator
>>> table2 = etl.fold(table1, 'id', operator.add, 'count',
...                   presorted=True)
>>> table2
+-----+-------+
| key | value |
+=====+=======+
|   1 |     8 |
+-----+-------+
|   2 |    12 |
+-----+-------+

petl.transform.reductions.groupcountdistinctvalues(table, key, value)[source]¶: Group by the key field then count the number of distinct values in the value field.

petl.transform.reductions.groupselectfirst(table, key, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Group by the key field then return the first row within each group. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, True],
...           ['C', 7, False],
...           ['B', 2, False],
...           ['C', 9, True]]
>>> table2 = etl.groupselectfirst(table1, key='foo')
>>> table2
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'B' |   2 | False |
+-----+-----+-------+
| 'C' |   7 | False |
+-----+-----+-------+

petl.transform.reductions.groupselectlast(table, key, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Group by the key field then return the last row within each group. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, True],
...           ['C', 7, False],
...           ['B', 2, False],
...           ['C', 9, True]]
>>> table2 = etl.groupselectlast(table1, key='foo')
>>> table2
+-----+-----+-------+
| foo | bar | baz   |
+=====+=====+=======+
| 'A' |   1 | True  |
+-----+-----+-------+
| 'B' |   2 | False |
+-----+-----+-------+
| 'C' |   9 | True  |
+-----+-----+-------+

New in version 1.1.0.

petl.transform.reductions.groupselectmin(table, key, value, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶: Group by the key field then return the row with the minimum of the value field within each group. N.B., will only return one row for each group, even if multiple rows have the same (minimum) value.

petl.transform.reductions.groupselectmax(table, key, value, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶: Group by the key field then return the row with the maximum of the value field within each group. N.B., will only return one row for each group, even if multiple rows have the same (maximum) value.

Reshaping tables¶

petl.transform.reshape.melt(table, key=None, variables=None, variablefield='variable', valuefield='value')[source]¶

Reshape a table, melting fields into data. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'gender', 'age'],
...           [1, 'F', 12],
...           [2, 'M', 17],
...           [3, 'M', 16]]
>>> table2 = etl.melt(table1, 'id')
>>> table2.lookall()
+----+----------+-------+
| id | variable | value |
+====+==========+=======+
|  1 | 'gender' | 'F'   |
+----+----------+-------+
|  1 | 'age'    |    12 |
+----+----------+-------+
|  2 | 'gender' | 'M'   |
+----+----------+-------+
|  2 | 'age'    |    17 |
+----+----------+-------+
|  3 | 'gender' | 'M'   |
+----+----------+-------+
|  3 | 'age'    |    16 |
+----+----------+-------+

>>> # compound keys are supported
... table3 = [['id', 'time', 'height', 'weight'],
...           [1, 11, 66.4, 12.2],
...           [2, 16, 53.2, 17.3],
...           [3, 12, 34.5, 9.4]]
>>> table4 = etl.melt(table3, key=['id', 'time'])
>>> table4.lookall()
+----+------+----------+-------+
| id | time | variable | value |
+====+======+==========+=======+
|  1 |   11 | 'height' |  66.4 |
+----+------+----------+-------+
|  1 |   11 | 'weight' |  12.2 |
+----+------+----------+-------+
|  2 |   16 | 'height' |  53.2 |
+----+------+----------+-------+
|  2 |   16 | 'weight' |  17.3 |
+----+------+----------+-------+
|  3 |   12 | 'height' |  34.5 |
+----+------+----------+-------+
|  3 |   12 | 'weight' |   9.4 |
+----+------+----------+-------+

>>> # a subset of variable fields can be selected
... table5 = etl.melt(table3, key=['id', 'time'],
...                   variables=['height'])
>>> table5.lookall()
+----+------+----------+-------+
| id | time | variable | value |
+====+======+==========+=======+
|  1 |   11 | 'height' |  66.4 |
+----+------+----------+-------+
|  2 |   16 | 'height' |  53.2 |
+----+------+----------+-------+
|  3 |   12 | 'height' |  34.5 |
+----+------+----------+-------+

petl.transform.reshape.recast(table, key=None, variablefield='variable', valuefield='value', samplesize=1000, reducers=None, missing=None)[source]¶

Recast molten data. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'variable', 'value'],
...           [3, 'age', 16],
...           [1, 'gender', 'F'],
...           [2, 'gender', 'M'],
...           [2, 'age', 17],
...           [1, 'age', 12],
...           [3, 'gender', 'M']]
>>> table2 = etl.recast(table1)
>>> table2
+----+-----+--------+
| id | age | gender |
+====+=====+========+
|  1 |  12 | 'F'    |
+----+-----+--------+
|  2 |  17 | 'M'    |
+----+-----+--------+
|  3 |  16 | 'M'    |
+----+-----+--------+

>>> # specifying variable and value fields
... table3 = [['id', 'vars', 'vals'],
...           [3, 'age', 16],
...           [1, 'gender', 'F'],
...           [2, 'gender', 'M'],
...           [2, 'age', 17],
...           [1, 'age', 12],
...           [3, 'gender', 'M']]
>>> table4 = etl.recast(table3, variablefield='vars', valuefield='vals')
>>> table4
+----+-----+--------+
| id | age | gender |
+====+=====+========+
|  1 |  12 | 'F'    |
+----+-----+--------+
|  2 |  17 | 'M'    |
+----+-----+--------+
|  3 |  16 | 'M'    |
+----+-----+--------+

>>> # if there are multiple values for each key/variable pair, and no
... # reducers function is provided, then all values will be listed
... table6 = [['id', 'time', 'variable', 'value'],
...           [1, 11, 'weight', 66.4],
...           [1, 14, 'weight', 55.2],
...           [2, 12, 'weight', 53.2],
...           [2, 16, 'weight', 43.3],
...           [3, 12, 'weight', 34.5],
...           [3, 17, 'weight', 49.4]]
>>> table7 = etl.recast(table6, key='id')
>>> table7
+----+--------------+
| id | weight       |
+====+==============+
|  1 | [66.4, 55.2] |
+----+--------------+
|  2 | [53.2, 43.3] |
+----+--------------+
|  3 | [34.5, 49.4] |
+----+--------------+

>>> # multiple values can be reduced via an aggregation function
... def mean(values):
...     return float(sum(values)) / len(values)
...
>>> table8 = etl.recast(table6, key='id', reducers={'weight': mean})
>>> table8
+----+--------------------+
| id | weight             |
+====+====================+
|  1 | 60.800000000000004 |
+----+--------------------+
|  2 |              48.25 |
+----+--------------------+
|  3 |              41.95 |
+----+--------------------+

>>> # missing values are padded with whatever is provided via the
... # missing keyword argument (None by default)
... table9 = [['id', 'variable', 'value'],
...           [1, 'gender', 'F'],
...           [2, 'age', 17],
...           [1, 'age', 12],
...           [3, 'gender', 'M']]
>>> table10 = etl.recast(table9, key='id')
>>> table10
+----+------+--------+
| id | age  | gender |
+====+======+========+
|  1 |   12 | 'F'    |
+----+------+--------+
|  2 |   17 | None   |
+----+------+--------+
|  3 | None | 'M'    |
+----+------+--------+

Note that the table is scanned once to discover variables, then a second time to reshape the data and recast variables as fields. How many rows are scanned in the first pass is determined by the samplesize argument.

petl.transform.reshape.transpose(table)[source]¶

Transpose rows into columns. E.g.:

>>> import petl as etl
>>> table1 = [['id', 'colour'],
...           [1, 'blue'],
...           [2, 'red'],
...           [3, 'purple'],
...           [5, 'yellow'],
...           [7, 'orange']]
>>> table2 = etl.transpose(table1)
>>> table2
+----------+--------+-------+----------+----------+----------+
| id       | 1      | 2     | 3        | 5        | 7        |
+==========+========+=======+==========+==========+==========+
| 'colour' | 'blue' | 'red' | 'purple' | 'yellow' | 'orange' |
+----------+--------+-------+----------+----------+----------+

petl.transform.reshape.pivot(table, f1, f2, f3, aggfun, missing=None, presorted=False, buffersize=None, tempdir=None, cache=True)[source]¶

Construct a pivot table. E.g.:

>>> import petl as etl
>>> table1 = [['region', 'gender', 'style', 'units'],
...           ['east', 'boy', 'tee', 12],
...           ['east', 'boy', 'golf', 14],
...           ['east', 'boy', 'fancy', 7],
...           ['east', 'girl', 'tee', 3],
...           ['east', 'girl', 'golf', 8],
...           ['east', 'girl', 'fancy', 18],
...           ['west', 'boy', 'tee', 12],
...           ['west', 'boy', 'golf', 15],
...           ['west', 'boy', 'fancy', 8],
...           ['west', 'girl', 'tee', 6],
...           ['west', 'girl', 'golf', 16],
...           ['west', 'girl', 'fancy', 1]]
>>> table2 = etl.pivot(table1, 'region', 'gender', 'units', sum)
>>> table2
+--------+-----+------+
| region | boy | girl |
+========+=====+======+
| 'east' |  33 |   29 |
+--------+-----+------+
| 'west' |  35 |   23 |
+--------+-----+------+

>>> table3 = etl.pivot(table1, 'region', 'style', 'units', sum)
>>> table3
+--------+-------+------+-----+
| region | fancy | golf | tee |
+========+=======+======+=====+
| 'east' |    25 |   22 |  15 |
+--------+-------+------+-----+
| 'west' |     9 |   31 |  18 |
+--------+-------+------+-----+

>>> table4 = etl.pivot(table1, 'gender', 'style', 'units', sum)
>>> table4
+--------+-------+------+-----+
| gender | fancy | golf | tee |
+========+=======+======+=====+
| 'boy'  |    15 |   29 |  24 |
+--------+-------+------+-----+
| 'girl' |    19 |   24 |   9 |
+--------+-------+------+-----+

petl.transform.reshape.flatten(table)[source]¶

Convert a table to a sequence of values in row-major order. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           ['A', 1, True],
...           ['C', 7, False],
...           ['B', 2, False],
...           ['C', 9, True]]
>>> list(etl.flatten(table1))
['A', 1, True, 'C', 7, False, 'B', 2, False, 'C', 9, True]

petl.transform.reshape.unflatten(*args, **kwargs)[source]¶

Convert a sequence of values in row-major order into a table. E.g.:

>>> import petl as etl
>>> a = ['A', 1, True, 'C', 7, False, 'B', 2, False, 'C', 9]
>>> table1 = etl.unflatten(a, 3)
>>> table1
+-----+----+-------+
| f0  | f1 | f2    |
+=====+====+=======+
| 'A' |  1 | True  |
+-----+----+-------+
| 'C' |  7 | False |
+-----+----+-------+
| 'B' |  2 | False |
+-----+----+-------+
| 'C' |  9 | None  |
+-----+----+-------+

>>> # a table and field name can also be provided as arguments
... table2 = [['lines'],
...           ['A'],
...           [1],
...           [True],
...           ['C'],
...           [7],
...           [False],
...           ['B'],
...           [2],
...           [False],
...           ['C'],
...           [9]]
>>> table3 = etl.unflatten(table2, 'lines', 3)
>>> table3
+-----+----+-------+
| f0  | f1 | f2    |
+=====+====+=======+
| 'A' |  1 | True  |
+-----+----+-------+
| 'C' |  7 | False |
+-----+----+-------+
| 'B' |  2 | False |
+-----+----+-------+
| 'C' |  9 | None  |
+-----+----+-------+

Filling missing values¶

petl.transform.fills.filldown(table, *fields, **kwargs)[source]¶

Replace missing values with non-missing values from the row above. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           [1, 'a', None],
...           [1, None, .23],
...           [1, 'b', None],
...           [2, None, None],
...           [2, None, .56],
...           [2, 'c', None],
...           [None, 'c', .72]]
>>> table2 = etl.filldown(table1)
>>> table2.lookall()
+-----+-----+------+
| foo | bar | baz  |
+=====+=====+======+
|   1 | 'a' | None |
+-----+-----+------+
|   1 | 'a' | 0.23 |
+-----+-----+------+
|   1 | 'b' | 0.23 |
+-----+-----+------+
|   2 | 'b' | 0.23 |
+-----+-----+------+
|   2 | 'b' | 0.56 |
+-----+-----+------+
|   2 | 'c' | 0.56 |
+-----+-----+------+
|   2 | 'c' | 0.72 |
+-----+-----+------+

>>> table3 = etl.filldown(table1, 'bar')
>>> table3.lookall()
+------+-----+------+
| foo  | bar | baz  |
+======+=====+======+
|    1 | 'a' | None |
+------+-----+------+
|    1 | 'a' | 0.23 |
+------+-----+------+
|    1 | 'b' | None |
+------+-----+------+
|    2 | 'b' | None |
+------+-----+------+
|    2 | 'b' | 0.56 |
+------+-----+------+
|    2 | 'c' | None |
+------+-----+------+
| None | 'c' | 0.72 |
+------+-----+------+

>>> table4 = etl.filldown(table1, 'bar', 'baz')
>>> table4.lookall()
+------+-----+------+
| foo  | bar | baz  |
+======+=====+======+
|    1 | 'a' | None |
+------+-----+------+
|    1 | 'a' | 0.23 |
+------+-----+------+
|    1 | 'b' | 0.23 |
+------+-----+------+
|    2 | 'b' | 0.23 |
+------+-----+------+
|    2 | 'b' | 0.56 |
+------+-----+------+
|    2 | 'c' | 0.56 |
+------+-----+------+
| None | 'c' | 0.72 |
+------+-----+------+

Use the missing keyword argument to control which value is treated as missing (None by default).

petl.transform.fills.fillright(table, missing=None)[source]¶

Replace missing values with preceding non-missing values. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           [1, 'a', None],
...           [1, None, .23],
...           [1, 'b', None],
...           [2, None, None],
...           [2, None, .56],
...           [2, 'c', None],
...           [None, 'c', .72]]
>>> table2 = etl.fillright(table1)
>>> table2.lookall()
+------+-----+------+
| foo  | bar | baz  |
+======+=====+======+
|    1 | 'a' | 'a'  |
+------+-----+------+
|    1 |   1 | 0.23 |
+------+-----+------+
|    1 | 'b' | 'b'  |
+------+-----+------+
|    2 |   2 |    2 |
+------+-----+------+
|    2 |   2 | 0.56 |
+------+-----+------+
|    2 | 'c' | 'c'  |
+------+-----+------+
| None | 'c' | 0.72 |
+------+-----+------+

Use the missing keyword argument to control which value is treated as missing (None by default).

petl.transform.fills.fillleft(table, missing=None)[source]¶

Replace missing values with following non-missing values. E.g.:

>>> import petl as etl
>>> table1 = [['foo', 'bar', 'baz'],
...           [1, 'a', None],
...           [1, None, .23],
...           [1, 'b', None],
...           [2, None, None],
...           [2, None, .56],
...           [2, 'c', None],
...           [None, 'c', .72]]
>>> table2 = etl.fillleft(table1)
>>> table2.lookall()
+-----+------+------+
| foo | bar  | baz  |
+=====+======+======+
|   1 | 'a'  | None |
+-----+------+------+
|   1 | 0.23 | 0.23 |
+-----+------+------+
|   1 | 'b'  | None |
+-----+------+------+
|   2 | None | None |
+-----+------+------+
|   2 | 0.56 | 0.56 |
+-----+------+------+
|   2 | 'c'  | None |
+-----+------+------+
| 'c' | 'c'  | 0.72 |
+-----+------+------+

Use the missing keyword argument to control which value is treated as missing (None by default).

Validation¶

petl.transform.validation.validate(table, constraints=None, header=None)[source]¶

Validate a table against a set of constraints and/or an expected header, e.g.:

>>> import petl as etl
>>> # define some validation constraints
... header = ('foo', 'bar', 'baz')
>>> constraints = [
...     dict(name='foo_int', field='foo', test=int),
...     dict(name='bar_date', field='bar', test=etl.dateparser('%Y-%m-%d')),
...     dict(name='baz_enum', field='baz', assertion=lambda v: v in ['Y', 'N']),
...     dict(name='not_none', assertion=lambda row: None not in row),
...     dict(name='qux_int', field='qux', test=int, optional=True),
... ]
>>> # now validate a table
... table = (('foo', 'bar', 'bazzz'),
...          (1, '2000-01-01', 'Y'),
...          ('x', '2010-10-10', 'N'),
...          (2, '2000/01/01', 'Y'),
...          (3, '2015-12-12', 'x'),
...          (4, None, 'N'),
...          ('y', '1999-99-99', 'z'),
...          (6, '2000-01-01'),
...          (7, '2001-02-02', 'N', True))
>>> problems = etl.validate(table, constraints=constraints, header=header)
>>> problems.lookall()
+--------------+-----+-------+--------------+------------------+
| name         | row | field | value        | error            |
+==============+=====+=======+==============+==================+
| '__header__' |   0 | None  | None         | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| 'foo_int'    |   2 | 'foo' | 'x'          | 'ValueError'     |
+--------------+-----+-------+--------------+------------------+
| 'bar_date'   |   3 | 'bar' | '2000/01/01' | 'ValueError'     |
+--------------+-----+-------+--------------+------------------+
| 'baz_enum'   |   4 | 'baz' | 'x'          | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| 'bar_date'   |   5 | 'bar' | None         | 'AttributeError' |
+--------------+-----+-------+--------------+------------------+
| 'not_none'   |   5 | None  | None         | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| 'foo_int'    |   6 | 'foo' | 'y'          | 'ValueError'     |
+--------------+-----+-------+--------------+------------------+
| 'bar_date'   |   6 | 'bar' | '1999-99-99' | 'ValueError'     |
+--------------+-----+-------+--------------+------------------+
| 'baz_enum'   |   6 | 'baz' | 'z'          | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| '__len__'    |   7 | None  |            2 | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| 'baz_enum'   |   7 | 'baz' | None         | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+
| '__len__'    |   8 | None  |            4 | 'AssertionError' |
+--------------+-----+-------+--------------+------------------+

Returns a table of validation problems.

Intervals (intervaltree)¶

Note

The following functions require the package intervaltree to be installed, e.g.:

$ pip install intervaltree

petl.transform.intervals.intervaljoin(left, right, lstart='start', lstop='stop', rstart='start', rstop='stop', lkey=None, rkey=None, include_stop=False, lprefix=None, rprefix=None)[source]¶

Join two tables by overlapping intervals. E.g.:

>>> import petl as etl
>>> left = [['begin', 'end', 'quux'],
...         [1, 2, 'a'],
...         [2, 4, 'b'],
...         [2, 5, 'c'],
...         [9, 14, 'd'],
...         [1, 1, 'e'],
...         [10, 10, 'f']]
>>> right = [['start', 'stop', 'value'],
...          [1, 4, 'foo'],
...          [3, 7, 'bar'],
...          [4, 9, 'baz']]
>>> table1 = etl.intervaljoin(left, right,
...                           lstart='begin', lstop='end',
...                           rstart='start', rstop='stop')
>>> table1.lookall()
+-------+-----+------+-------+------+-------+
| begin | end | quux | start | stop | value |
+=======+=====+======+=======+======+=======+
|     1 |   2 | 'a'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     4 |    9 | 'baz' |
+-------+-----+------+-------+------+-------+

>>> # include stop coordinate in intervals
... table2 = etl.intervaljoin(left, right,
...                           lstart='begin', lstop='end',
...                           rstart='start', rstop='stop',
...                           include_stop=True)
>>> table2.lookall()
+-------+-----+------+-------+------+-------+
| begin | end | quux | start | stop | value |
+=======+=====+======+=======+======+=======+
|     1 |   2 | 'a'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     4 |    9 | 'baz' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     4 |    9 | 'baz' |
+-------+-----+------+-------+------+-------+
|     9 |  14 | 'd'  |     4 |    9 | 'baz' |
+-------+-----+------+-------+------+-------+
|     1 |   1 | 'e'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

An additional key comparison can be made, e.g.:

>>> import petl as etl
>>> left = (('fruit', 'begin', 'end'),
...         ('apple', 1, 2),
...         ('apple', 2, 4),
...         ('apple', 2, 5),
...         ('orange', 2, 5),
...         ('orange', 9, 14),
...         ('orange', 19, 140),
...         ('apple', 1, 1))
>>> right = (('type', 'start', 'stop', 'value'),
...          ('apple', 1, 4, 'foo'),
...          ('apple', 3, 7, 'bar'),
...          ('orange', 4, 9, 'baz'))
>>> table3 = etl.intervaljoin(left, right,
...                           lstart='begin', lstop='end', lkey='fruit',
...                           rstart='start', rstop='stop', rkey='type')
>>> table3.lookall()
+----------+-------+-----+----------+-------+------+-------+
| fruit    | begin | end | type     | start | stop | value |
+==========+=======+=====+==========+=======+======+=======+
| 'apple'  |     1 |   2 | 'apple'  |     1 |    4 | 'foo' |
+----------+-------+-----+----------+-------+------+-------+
| 'apple'  |     2 |   4 | 'apple'  |     1 |    4 | 'foo' |
+----------+-------+-----+----------+-------+------+-------+
| 'apple'  |     2 |   4 | 'apple'  |     3 |    7 | 'bar' |
+----------+-------+-----+----------+-------+------+-------+
| 'apple'  |     2 |   5 | 'apple'  |     1 |    4 | 'foo' |
+----------+-------+-----+----------+-------+------+-------+
| 'apple'  |     2 |   5 | 'apple'  |     3 |    7 | 'bar' |
+----------+-------+-----+----------+-------+------+-------+
| 'orange' |     2 |   5 | 'orange' |     4 |    9 | 'baz' |
+----------+-------+-----+----------+-------+------+-------+

petl.transform.intervals.intervalleftjoin(left, right, lstart='start', lstop='stop', rstart='start', rstop='stop', lkey=None, rkey=None, include_stop=False, missing=None, lprefix=None, rprefix=None)[source]¶

Like petl.transform.intervals.intervaljoin() but rows from the left table without a match in the right table are also included. E.g.:

>>> import petl as etl
>>> left = [['begin', 'end', 'quux'],
...         [1, 2, 'a'],
...         [2, 4, 'b'],
...         [2, 5, 'c'],
...         [9, 14, 'd'],
...         [1, 1, 'e'],
...         [10, 10, 'f']]
>>> right = [['start', 'stop', 'value'],
...          [1, 4, 'foo'],
...          [3, 7, 'bar'],
...          [4, 9, 'baz']]
>>> table1 = etl.intervalleftjoin(left, right,
...                               lstart='begin', lstop='end',
...                               rstart='start', rstop='stop')
>>> table1.lookall()
+-------+-----+------+-------+------+-------+
| begin | end | quux | start | stop | value |
+=======+=====+======+=======+======+=======+
|     1 |   2 | 'a'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   4 | 'b'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     1 |    4 | 'foo' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     3 |    7 | 'bar' |
+-------+-----+------+-------+------+-------+
|     2 |   5 | 'c'  |     4 |    9 | 'baz' |
+-------+-----+------+-------+------+-------+
|     9 |  14 | 'd'  | None  | None | None  |
+-------+-----+------+-------+------+-------+
|     1 |   1 | 'e'  | None  | None | None  |
+-------+-----+------+-------+------+-------+
|    10 |  10 | 'f'  | None  | None | None  |
+-------+-----+------+-------+------+-------+

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

petl.transform.intervals.intervaljoinvalues(left, right, value, lstart='start', lstop='stop', rstart='start', rstop='stop', lkey=None, rkey=None, include_stop=False)[source]¶

Convenience function to join the left table with values from a specific field in the right hand table.

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

petl.transform.intervals.intervalantijoin(left, right, lstart='start', lstop='stop', rstart='start', rstop='stop', lkey=None, rkey=None, include_stop=False, missing=None)[source]¶

Return rows from the left table with no overlapping rows from the right table.

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

petl.transform.intervals.intervallookup(table, start='start', stop='stop', value=None, include_stop=False)[source]¶

Construct an interval lookup for the given table. E.g.:

>>> import petl as etl
>>> table = [['start', 'stop', 'value'],
...          [1, 4, 'foo'],
...          [3, 7, 'bar'],
...          [4, 9, 'baz']]
>>> lkp = etl.intervallookup(table, 'start', 'stop')
>>> lkp.search(0, 1)
[]
>>> lkp.search(1, 2)
[(1, 4, 'foo')]
>>> lkp.search(2, 4)
[(1, 4, 'foo'), (3, 7, 'bar')]
>>> lkp.search(2, 5)
[(1, 4, 'foo'), (3, 7, 'bar'), (4, 9, 'baz')]
>>> lkp.search(9, 14)
[]
>>> lkp.search(19, 140)
[]
>>> lkp.search(0)
[]
>>> lkp.search(1)
[(1, 4, 'foo')]
>>> lkp.search(2)
[(1, 4, 'foo')]
>>> lkp.search(4)
[(3, 7, 'bar'), (4, 9, 'baz')]
>>> lkp.search(5)
[(3, 7, 'bar'), (4, 9, 'baz')]

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

Some examples using the include_stop and value keyword arguments:

>>> import petl as etl
>>> table = [['start', 'stop', 'value'],
...          [1, 4, 'foo'],
...          [3, 7, 'bar'],
...          [4, 9, 'baz']]
>>> lkp = etl.intervallookup(table, 'start', 'stop', include_stop=True,
...                          value='value')
>>> lkp.search(0, 1)
['foo']
>>> lkp.search(1, 2)
['foo']
>>> lkp.search(2, 4)
['foo', 'bar', 'baz']
>>> lkp.search(2, 5)
['foo', 'bar', 'baz']
>>> lkp.search(9, 14)
['baz']
>>> lkp.search(19, 140)
[]
>>> lkp.search(0)
[]
>>> lkp.search(1)
['foo']
>>> lkp.search(2)
['foo']
>>> lkp.search(4)
['foo', 'bar', 'baz']
>>> lkp.search(5)
['bar', 'baz']

petl.transform.intervals.intervallookupone(table, start='start', stop='stop', value=None, include_stop=False, strict=True)[source]¶

Construct an interval lookup for the given table, returning at most one result for each query. E.g.:

>>> import petl as etl
>>> table = [['start', 'stop', 'value'],
...          [1, 4, 'foo'],
...          [3, 7, 'bar'],
...          [4, 9, 'baz']]
>>> lkp = etl.intervallookupone(table, 'start', 'stop', strict=False)
>>> lkp.search(0, 1)
>>> lkp.search(1, 2)
(1, 4, 'foo')
>>> lkp.search(2, 4)
(1, 4, 'foo')
>>> lkp.search(2, 5)
(1, 4, 'foo')
>>> lkp.search(9, 14)
>>> lkp.search(19, 140)
>>> lkp.search(0)
>>> lkp.search(1)
(1, 4, 'foo')
>>> lkp.search(2)
(1, 4, 'foo')
>>> lkp.search(4)
(3, 7, 'bar')
>>> lkp.search(5)
(3, 7, 'bar')

If strict=True, queries returning more than one result will raise a DuplicateKeyError. If strict=False and there is more than one result, the first result is returned.

Note start coordinates are included and stop coordinates are excluded from the interval. Use the include_stop keyword argument to include the upper bound of the interval when finding overlaps.

petl.transform.intervals.intervalrecordlookup(table, start='start', stop='stop', include_stop=False)[source]¶: As petl.transform.intervals.intervallookup() but return records instead of tuples.

petl.transform.intervals.intervalrecordlookupone(table, start='start', stop='stop', include_stop=False, strict=True)[source]¶: As petl.transform.intervals.intervallookupone() but return records instead of tuples.

petl.transform.intervals.facetintervallookup(table, key, start='start', stop='stop', value=None, include_stop=False)[source]¶

Construct a faceted interval lookup for the given table. E.g.:

>>> import petl as etl
>>> table = (('type', 'start', 'stop', 'value'),
...          ('apple', 1, 4, 'foo'),
...          ('apple', 3, 7, 'bar'),
...          ('orange', 4, 9, 'baz'))
>>> lkp = etl.facetintervallookup(table, key='type', start='start', stop='stop')
>>> lkp['apple'].search(1, 2)
[('apple', 1, 4, 'foo')]
>>> lkp['apple'].search(2, 4)
[('apple', 1, 4, 'foo'), ('apple', 3, 7, 'bar')]
>>> lkp['apple'].search(2, 5)
[('apple', 1, 4, 'foo'), ('apple', 3, 7, 'bar')]
>>> lkp['orange'].search(2, 5)
[('orange', 4, 9, 'baz')]
>>> lkp['orange'].search(9, 14)
[]
>>> lkp['orange'].search(19, 140)
[]
>>> lkp['apple'].search(1)
[('apple', 1, 4, 'foo')]
>>> lkp['apple'].search(2)
[('apple', 1, 4, 'foo')]
>>> lkp['apple'].search(4)
[('apple', 3, 7, 'bar')]
>>> lkp['apple'].search(5)
[('apple', 3, 7, 'bar')]
>>> lkp['orange'].search(5)
[('orange', 4, 9, 'baz')]

petl.transform.intervals.facetintervallookupone(table, key, start='start', stop='stop', value=None, include_stop=False, strict=True)[source]¶

Construct a faceted interval lookup for the given table, returning at most one result for each query.

If strict=True, queries returning more than one result will raise a DuplicateKeyError. If strict=False and there is more than one result, the first result is returned.

petl.transform.intervals.facetintervalrecordlookup(table, key, start='start', stop='stop', include_stop=False)[source]¶: As petl.transform.intervals.facetintervallookup() but return records.

petl.transform.intervals.facetintervalrecordlookupone(table, key, start, stop, include_stop=False, strict=True)[source]¶: As petl.transform.intervals.facetintervallookupone() but return records.

petl.transform.intervals.intervalsubtract(left, right, lstart='start', lstop='stop', rstart='start', rstop='stop', lkey=None, rkey=None, include_stop=False)[source]¶: Subtract intervals in the right hand table from intervals in the left hand table.

petl.transform.intervals.collapsedintervals(table, start='start', stop='stop', key=None)[source]¶

Utility function to collapse intervals in a table.

If no facet key is given, returns an iterator over (start, stop) tuples.

If facet key is given, returns an iterator over (key, start, stop) tuples.