Leveraging Modern Data Stack in Box for Natural Product Genome Mining in Small-Scale and Private Strain Collection
Date:
This talk presents solutions that help small and private labs generate, manage, and explore genomic information from their strain collections in search of interesting BGCs. It focuses on the data management side of our published workflow.
Conference details here | Group picture with Pep and Anna from NPGM group |
I presented this poster at the “Data Science for Planetary and Human Health: The True Life Cycle of Multi-Omics Data” conference from the Novo Nordisk Foundation Science Cluster.
A heartfelt thank you to the committee for awarding me the poster prize for our work! 🎉
Abstract
The advent of third-generation sequencing technologies, especially those from Oxford Nanopore Technologies, enables individual researchers and small laboratories to affordably create and manage private microbial strain collections. This shift promises to accelerate natural product discovery by facilitating the mining of biosynthetic gene clusters (BGCs) from genomic sequences, an important step in unlocking novel pharmaceuticals, agrochemicals, and other industrially relevant compounds. As researchers embark on building and analyzing their own private collections, the challenge extends beyond managing large-scale public genomic datasets to providing solutions that cater to the analysis of smaller, more focused collections. Here, we present BGCFlow, a comprehensive genome mining workflow for the analysis of bacterial pangenomes. BGCFlow integrates a “modern data stack in a box,” leveraging tools such as dbt, DuckDB, and Metabase to offer a streamlined data engineering pipeline and an efficient platform for the exploration and management of private strain collections. Each tool is selected for its unique capabilities: dbt for transforming data with simplicity and reproducibility, DuckDB for its lightweight, in-process SQL database that facilitates fast analytical queries, and Metabase for its user-friendly interface that allows both data scientists and lab researchers to visualize and interact with their data. By doing so, we aim to bridge the gap between the potential of genome mining and the practicalities of conducting such research at a scale that is both manageable and accessible to a broader scientific community.
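To give a feel for what an "in-process SQL database" means for a small strain collection, here is a minimal sketch of an analytical query over a table of BGC regions. It is not taken from BGCFlow: the table name `bgc_regions`, its columns, and the sample rows are all hypothetical, and Python's built-in `sqlite3` is used as a stand-in for DuckDB so that the example runs without extra installs. With the `duckdb` package installed, the same SQL should work by swapping the connection for `duckdb.connect(":memory:")`.

```python
import sqlite3

# Hypothetical records: (genome_id, region, bgc_class).
# These values are illustrative only and do not come from any real dataset.
ROWS = [
    ("strain_A", "r1", "NRPS"),
    ("strain_A", "r2", "PKS"),
    ("strain_B", "r1", "NRPS"),
    ("strain_B", "r2", "terpene"),
    ("strain_C", "r1", "NRPS"),
]


def count_bgc_classes(rows):
    """Count regions and distinct genomes per BGC class, most common first."""
    # The database lives inside this Python process; no server is needed.
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE bgc_regions "
            "(genome_id TEXT, region TEXT, bgc_class TEXT)"
        )
        con.executemany("INSERT INTO bgc_regions VALUES (?, ?, ?)", rows)
        cur = con.execute(
            "SELECT bgc_class, "
            "       COUNT(*) AS n_regions, "
            "       COUNT(DISTINCT genome_id) AS n_genomes "
            "FROM bgc_regions "
            "GROUP BY bgc_class "
            "ORDER BY n_regions DESC, bgc_class"
        )
        return cur.fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for bgc_class, n_regions, n_genomes in count_bgc_classes(ROWS):
        print(bgc_class, n_regions, n_genomes)
```

In a stack like the one described, a summary query of this kind would typically be defined once as a transformation and then surfaced in a dashboard, so lab members can explore the collection without writing SQL themselves.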