Hoffman MM, Buske OJ, Noble WS. 2010. The Genomedata format for storing large-scale functional genomics data. Bioinformatics, 26(11):1458-1459; doi:10.1093/bioinformatics/btq164
Genomedata is a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data while retaining a small disk space footprint. We have also developed utilities to load data into this format. A reference implementation, written in Python with C components, is available here under the GNU General Public License.
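As a rough illustration of what random access to a track looks like from Python, here is a minimal sketch. It assumes the `Genome` class can be used as a context manager and that a chromosome can be sliced as `chromosome[start:end, trackname]`, in the style of the `chromosome[245:270, ["data1", "data3"]]` form shown in the release notes. The helper names `mean_signal` and `report`, and the treatment of missing values as NaN, are assumptions made for this example rather than part of the package.

```python
import math


def mean_signal(genome, chrom_name, start, end, trackname):
    """Mean of one track over [start, end), skipping NaN values.

    `genome` is anything that maps chromosome names to objects sliceable
    as chromosome[start:end, trackname]. Treating NaN as "no data" is an
    assumption of this sketch.
    """
    chromosome = genome[chrom_name]
    values = chromosome[start:end, trackname]
    present = [float(v) for v in values if not math.isnan(v)]
    # Return NaN when the region holds no data at all.
    return sum(present) / len(present) if present else float("nan")


def report(archive_path, chrom_name, start, end, trackname):
    """Open an archive and summarize one region (hypothetical helper)."""
    # Imported lazily so this file still loads where Genomedata is absent.
    from genomedata import Genome

    with Genome(archive_path) as genome:
        return mean_signal(genome, chrom_name, start, end, trackname)
```

A call such as `report("/path/to/archive", "chr1", 245, 270, "data1")` would then return the mean of track `data1` over that region. The archive path and track name here are placeholders.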
To install Genomedata, you must have HDF5 and Python 3.9 or later installed on your system (release 1.7.2 raised the minimum from Python 3.7). You can then install Genomedata on a Linux/Unix-based system* with the following command:
pip install genomedata
For more detailed instructions on how to install Genomedata, see the documentation linked below.
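To check that an installation worked, a short standard-library script along these lines may help. It is a sketch, not part of Genomedata: the function names are made up for this example, and the minimum Python version is taken from the 1.7.2 release note (">=3.9").

```python
import sys
from importlib import metadata

# Minimum Python version, taken from the 1.7.2 release note.
MIN_PYTHON = (3, 9)


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package="genomedata"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python new enough:", python_ok())
print("genomedata version:", installed_version() or "not installed")
```

If the second line reports "not installed", the `pip install` step either failed or ran against a different Python environment.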
* We have only tested this software on Linux and Mac systems. We would love to extend our support to other systems in the future, and we would gladly accept any contributions toward this end. Specifically, we have successfully installed Genomedata on the following platforms:
Genomedata is briefly described in the Bioinformatics application note cited and linked at the top of this page.
The application's documentation is available in three formats:
Reference assemblies can be downloaded from the National Center for Biotechnology Information (NCBI) FTP site, which hosts the current human reference genome assembly, hg38.
1.7.2:
* required Python is now >=3.9
* fixed consistency in array shape output when track indexing on bigWig files

1.7.1:
* fix array dimensionality consistency for summary statistics on bigWig files
* add debug representation for chromosomes for bigWig files

1.7.0:
* adapted existing python interface to open bigWig files

1.6.0:
* required Python is now >=3.7
* genomedata-load-data: changed to a python script with a c-extension

1.5.0:
* genomedata-load-data: fix bad error message when loading process fails
* genomedata-load-seq: add chromosome name mapping based on assembly reports

1.4.4:
* fixed pkg-config output encoding when finding HDF5 directories

1.4.3:
* fixed genomedata script entry points for Python 3

1.4.2:
* added compatibility for Python 3
* genomedata-load-seq: adjacent AGP entries are merged into a single supercontig
* Use pkg-config during setup to determine paths to HDF5 directories
* Removed forked-path dependency, added Path.py

1.4.1:
* genomedata-hard-mask: fix verbosity line not outputting to stderr
* genomedata-load-data: fix hdf5 group leak

1.4.0:
* genomedata-close-data: chunk metadata now truncates telomeres and trims large gaps between supercontigs
* genomedata-load-data: new option for masking data with --maskfile
* genomedata-hardmask: new command added to filter out track regions
* hardmask_data: new python interface to filter out track regions
* Genome: add ability to open archives for writing
* genomedata-load-seq: AGP are now correctly loaded regardless of filename and may be concatenated together
* genomedata-load-seq: fix assertion failure on argument parsing when loading fasta sequence (thanks to Kate Cook)
* genomedata-load: fix agp files not being recognized from this entry point
* docs: clarified that agp files cannot be combined
* docs: warned users that globs must be quoted to be parsed by genomedata-load

1.3.6:
* `sizes` command added to `genomedata-info` (Jay Hesselberth)
* Updated installation instructions for installing with PyTables 3.1.1
* toward python3 compatibility (Jay Hesselberth)
  - genomedata now requires python 2.7+
  - moved from `optparse` to `argparse` throughout
  - package-wide `__version__` lets modules report true version number
  - __future__ imports added to all modules and python3 `print()` functions

1.3.5:
* Removed platform specific builds from distribution

1.3.4:
* fixed bug related to updated PyTables
* compile works with HDF5 setups even when they were built --with-default-api-version=v16
* doc fixes
* fixed DeprecatingWarnings associated with PyTables 3.0
* updated dependency to PyTables >= 3.0

1.3.3:
* genomedata-query: new command that prints data from a Genomedata archive for your non-Python scripting needs (thanks to Max Libbrecht)
* genomedata-histogram: new command that prints histograms from a Genomedata archive (combination of a new module by Max Libbrecht and an old module by Michael Hoffman)
* genomedata-info: add "contigs" subcommand (thanks to Max Libbrecht)
* genomedata-info: friendlier error when unsupported command name used
* genomedata-load-data: friendlier errors when invalid BED3+1/bedGraph data supplied
* genomedata-load-seq: always makes chromosome and supercontig coordinates with unsigned 32-bit integers instead of system int
* genomedata-load-data: more detailed error message when initial file open fails
* genomedata-load-data: bugfix
* now compile with -Wextra
* doc fixes

1.3.2:
* API: now allow array of tracks. For example: chromosome[245:270, array([7, 5])]

1.3.1:
* API: now allow lists of tracks when directly accessing chromosome data, for example: chromosome[245:270, ["data1", "data3"]] or chromosome[245:270, [7, 5]]
* genomedata-load-seq: add --assembly option which supports AGP files, to allow avoid loading seq while still dealing with assembly gaps properly
* genomedata-load: now supports --assembly and --sizes options
* genomedata-load-assembly: alias for genomedata-load-seq. genomedata-load-seq will be deprecated in the future
* genomedata-load-data: now support DOS-style line endings ("\r\n")
* genomedata-load: print genomedata-load-data error code on failure
* genomedata-load-data: print more informative messages when ignoring data
* genomedata-load: all diagnostics messages to stderr
* genomedata-load: some diagnostics now include timestamp so we can see where performance bottlenecks are
* genomedata-load: more descriptive error messages
* genomedata-load-seq: print more descriptive error message when attempting to load sequence from a non-FASTA file
* genomedata-load: fixed issue 10: now compiles on gcc 4.6.2
* docs: add links to source code
* docs: genomedata-load: sequence "option" is mandatory. In a future version, we should change this to an argument to reflect this.
* test: add tests for DOS-style line-endings

1.3.0:
* genomedata supercontigs are no longer guaranteed to have seq data
* add --sizes option to genomedata-load-seq, to allow avoid loading seq
* Genome.add_track_continuous() has a significant performance improvement. This also means that genomedata-open-data will run much faster, as well as genomedata-load-data on fresh tracks
* fix bug where genomedata-load-seq didn't work
* fix bug where directory genomedata archive didn't work with only one chromosome

1.2.3:
* allow use with PyTables >=2.2
* new command: genomedata-info: "genomedata-info tracknames ARCHIVE" prints the tracknames for ARCHIVE
* Genome.format_version will now return 0 when files are missing a genomedata_format_version attribute
* Genome.__init__: future-proof to future versions of file format by throwing an error
* tests: add regression tests, lots of changes
* docs: add man pages

1.2.2:
* genomedata-load: will now support track filenames with "=" in the names
* genomedata-load: now supports UNIX glob wildcards as arguments to -s
* genomedata-load-data: allow other delimiters besides space for variableStep and fixedStep, allow wiggle_0 track specification
* genomedata-load-data, genomedata-load: remove unused --chunk-size option
* genomedata-close-data: fix bug where chunk_starts, chunk_ends not written for supercontigs with zero present data
* installation: move from path.py to forked-path
* docs: fixed small errors
* various: removed exclamation marks from error messages. It's not *that* exciting.
* some portability improvements
* tests: improve unit test interface

1.2.1:
* Fixed an installation bug where HDF5 installations later in LIBRARY_PATH might override those specified first, leading to linking errors during build.
There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of Genomedata.
There is also a genomedata-users mailing list for general discussion and questions about the use of the Genomedata system.
If you want to report a bug or request a feature, please do so using the Genomedata issue tracker.
For other support with Genomedata, or to provide feedback, please e-mail Michael. We are interested in all comments regarding the package and the ease of use of installation and documentation.