HDF5 in Python

Giacomo Debidda

18/12/2017 @PyData Munich

Topics

  • HDF5
  • HDF5 tools
  • h5py
  • PyTables

HDF5 is...

  • Data Model
  • Library
  • Format

A filesystem in a file

/ root group (every HDF5 file has a root group)

/foo member of the root group called foo

/foo/bar member of the group foo called bar

Working with groups and group members is similar to working with directories and files in UNIX.
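The path analogy above can be tried out with h5py. This is only a minimal sketch: the file name `groups_demo.h5` and the temporary directory are assumptions made for illustration, while the group names `foo` and `bar` come from the slide.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

with h5py.File(path, "w") as f:
    # The File object doubles as the root group "/".
    foo = f.create_group("foo")      # creates /foo
    bar = foo.create_group("bar")    # creates /foo/bar

    root_name = f.name               # name of the root group
    bar_name = bar.name              # absolute path of the nested group
    members_of_root = list(f.keys()) # roughly `ls /`
    has_bar = "foo/bar" in f         # path lookup relative to the root group

print(root_name, bar_name, members_of_root, has_bar)
```

Groups behave like dictionaries keyed by member name, so `f["/foo/bar"]` would return the same group that `bar` refers to.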

HDF5 Data Model

  • Datasets (like files in a filesystem)
  • Groups (like directories in a filesystem)
  • Attributes (like metadata attached to a file or directory)
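The three building blocks map onto h5py objects fairly directly. The sketch below assumes a made-up layout (`experiment`, `temperature`, `units`, `operator` are illustrative names, not from the talk) and a temporary file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "model_demo.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("experiment")               # group, like a directory
    dset = grp.create_dataset(                       # dataset, like a file
        "temperature", data=np.arange(6.0).reshape(2, 3)
    )
    dset.attrs["units"] = "celsius"                  # attribute on a dataset
    grp.attrs["operator"] = "alice"                  # attribute on a group

with h5py.File(path, "r") as f:
    dset = f["experiment/temperature"]
    shape = dset.shape
    units = dset.attrs["units"]
    operator = f["experiment"].attrs["operator"]

print(shape, units, operator)
```

Because the attributes travel inside the file next to the data they describe, the file stays self-describing without a sidecar metadata file.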

The HDF5 File Format Specification specifies the bit-level organization of an HDF5 file on storage media.

Why use HDF5?

  • Portable
  • Self-describing
  • Can contain binary data (in many representations)
  • Allows direct access to parts of the file without first parsing the entire contents
  • Supports large/complex/heterogeneous data
  • File format tool kit (you can design your own file format and use HDF5 under the hood)
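The "direct access to parts of the file" point can be illustrated with h5py slicing. This is a sketch under assumptions: the dataset name `signal`, its size, and the temporary file are invented for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file with one large 1-D dataset.
path = os.path.join(tempfile.mkdtemp(), "partial_read_demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset("signal", data=np.arange(1_000_000, dtype="int64"))

with h5py.File(path, "r") as f:
    dset = f["signal"]                # a handle; the data is not loaded yet
    total_len = dset.shape[0]         # shape comes from metadata, not from reading the data
    window = dset[500_000:500_005]    # slicing reads the selected elements into a NumPy array

print(total_len, window.tolist())
```

Slicing the dataset object, rather than calling `dset[...]` on everything, is what keeps memory use proportional to the selection instead of the file size.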

Who uses HDF5?

HDF5 Tools

Reference

h5py

  • Thin, pythonic wrapper around HDF5

  • HDF5 errors are converted into Python exceptions

  • Written in Cython

  • Uses NumPy objects

PyTables

  • Higher-level abstraction

  • Does not aim to be a complete wrapper for the entire HDF5 C API

  • Can be faster than h5py, thanks to out-of-core querying

  • Allows indexing and complex queries

  • Built-in compression

  • Undo mode
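The PyTables points about compression, indexing and queries can be sketched as follows. The table layout (`Reading`, `sensor_id`, `value`), the file name and the compression settings are assumptions chosen for illustration. Undo mode is not shown here.

```python
import os
import tempfile

import tables

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "tables_demo.h5")


class Reading(tables.IsDescription):
    """Made-up row layout: one integer and one float column."""
    sensor_id = tables.Int32Col()
    value = tables.Float64Col()


# Built-in compression: zlib at level 5 (arbitrary choice for the sketch).
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w", filters=filters) as h5:
    table = h5.create_table("/", "readings", Reading, title="Sensor readings")

    row = table.row
    for i in range(1000):
        row["sensor_id"] = i % 10
        row["value"] = float(i)
        row.append()
    table.flush()

    # Index one column so queries on it can avoid a full scan.
    table.cols.sensor_id.create_index()

    # The condition string is evaluated by PyTables while iterating over
    # the table on disk, so the whole table need not be loaded into memory.
    hits = sorted(
        r["value"] for r in table.where("(sensor_id == 3) & (value < 50)")
    )

print(hits)
```

The query is written as a string so that PyTables can evaluate it itself rather than materializing every row as a Python object first.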

SciPy 2015

At SciPy 2015, developers from PyTables, h5py, the HDF Group and pandas decided to start a refactor: PyTables will depend on h5py for its bindings to HDF5.

Code, plz!
