Compressive Metagenomics

Primary Researcher:

Bonnie Berger

The explosion in ‘omics data, particularly human genomic data but also metagenomic data from the human microbiome, has created a critical need for scalable methods in the research community. This growth in genomic and metagenomic data raises scalability concerns in the transmission, storage, and analysis of sequence data. In this project, we combine compressive approaches, from storage to mapping, to dramatically accelerate the metagenomic sequence analysis pipeline without the need to store or transmit large intermediate data files. The ultimate goal is to provide scalable microbiome sequence analysis pipelines that improve downstream analysis.

Outcome: Briefly, we extended CORA, a state-of-the-art read mapper that addresses a major NGS pipeline bottleneck, by adapting its mid-sized k-mer based read mapping to metagenomic read data, and we applied CORA’s k-mer based compression to unmapped reads. In addition, we capitalized on the low metric entropy and fractal dimension of metagenomic data to develop MICA, a compressively accelerated version of BLASTX that takes read data as input and searches protein sequence databases for matches. On real human gut microbiome data, our MICA framework compressively accelerated both BLASTX and DIAMOND with no further loss in accuracy. We thus enabled compressive algorithms for large-scale metagenomic search that scale sub-linearly with data set size, achieving orders-of-magnitude improvements in runtime.
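
The sub-linear scaling comes from the coarse-to-fine search strategy of compressive genomics: highly similar database sequences are collapsed under a representative, each query is first compared only against the representatives, and full search is run only inside the clusters that produce coarse hits. The sketch below illustrates this two-stage idea in Python; the greedy clustering heuristic, thresholds, and toy similarity function are illustrative assumptions that stand in for the real alignment machinery (BLASTX/DIAMOND) used by MICA, not MICA's actual implementation.

    # Minimal sketch of coarse-to-fine compressive search (illustrative only).
    from collections import defaultdict
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        """Toy similarity score in [0, 1]; real systems use sequence alignment."""
        return SequenceMatcher(None, a, b).ratio()

    def compress_database(sequences, link_threshold=0.8):
        """Greedily cluster sequences under representatives (the 'coarse' database)."""
        representatives = []            # coarse database of cluster representatives
        clusters = defaultdict(list)    # representative index -> member sequences
        for seq in sequences:
            for i, rep in enumerate(representatives):
                if similarity(seq, rep) >= link_threshold:
                    clusters[i].append(seq)   # real systems store a compact diff/link
                    break
            else:
                representatives.append(seq)   # no close representative: start a new cluster
                clusters[len(representatives) - 1].append(seq)
        return representatives, clusters

    def compressive_search(query, representatives, clusters,
                           coarse_threshold=0.6, fine_threshold=0.8):
        """Two-stage search: coarse pass over representatives, fine pass inside hit clusters."""
        hits = []
        for i, rep in enumerate(representatives):
            if similarity(query, rep) >= coarse_threshold:   # coarse filter
                for member in clusters[i]:                    # fine search only where needed
                    score = similarity(query, member)
                    if score >= fine_threshold:
                        hits.append((member, score))
        return sorted(hits, key=lambda h: -h[1])

Because redundant sequences collapse onto shared representatives, the cost of the coarse pass grows with the amount of non-redundant sequence rather than with the raw database size, which is what yields the sub-linear scaling described above.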