Introduction

Overview

Teaching: 20 min
Exercises: 30 min

Questions

How do I start working in Galaxy?

What does my raw sequencing data looks like?

Do I have good enough quality and sequencing depth for my assembly?

Objectives

Getting familiar with Galaxy’s workflow and history

Understand the data generated and make a decision on how to process MinION data

Introduction

For the bioinformatics analysis of your genomes, we will be working in Galaxy.

There are two main online galaxy platforms; the global server: https://usegalaxy.org/ and the European server: https://usegalaxy.eu/. The global version has a slightly larger set of programs than the European version. However, the European version runs in general slightly faster, and therefore you are recommended to use the European server.

Overview of Genome Assembly Processes

Next-generation Sequencing (NGS) generates DNA reads in parallel, which are massive amount of ~~short~~ sequences of nucleotides. The length of these reads depends on the sequencing platforms as well as the methods used for extracting the DNA. Short-read NGS platform such as Illumina generate an average read length of 2 x 150 bp to 2 x 300 bp with high accuracy. In the past decade, long-read sequencing have become a reality, with the two leading platform are PacBio and Oxford Nanopore Technologies. Current PacBio HiFi can generate up to 25kb reads with high accuracy, while Nanopore can generate ultra long reads (>900 kb) but with less accuracy.

Long-read sequencing are crucial to solve assembly of complex genomic regions such as repeats, something that short-read could not achieve. A hybrid assembly (combining both the accuracy of short-reads and with long-reads to connect fragments) are used to generate a complete, high-quality genome. Nevertheless, the field is moving rapidly, opening a new avenues to look at microbial diversity at strain-level resolutions.

Because Nanopore sequencing results in so much longer reads, different bioinformatics tools need to be applied to process the data, as compared to the data generated by Illumina sequencing. Therefore, be careful to use bioinformatics programs, which are applicable for short or long reads depending on the dataset you have. In this course we are solely working with Nanopore sequencing.

More information on sequencing techniques:

Ryan Wick provides a nice “Guide to bacterial genome assembly” as a starting point to choose which approaches and techniques to assemble your genome. He also write an approach in combining ONT long reads with Illumina short reads to achieve perfect assemblies: Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing.

The general pipeline for processing nanopore reads into whole genomes are:

Trimming and filtering reads
Assembly of the reads into a genome
Polishing the assembly
Analysing the genome

We will be using similar pipeline to the one suggested by Ryan Wick with several measures to check quality in between each steps.

Importing Data into Galaxy

There are several ways we can input our data into Galaxy. We can upload datasets from our local machine, fetch from a public repository such as Zenodo, or reuse data from Galaxy history.

We will inform you how to fetch your raw sequencing data when it is ready. In the meantime, we can do an exercise using the data from previous year (2022), which are available from Zenodo. We will be using the strain with decent reads (barcode02).

Let’s login to Galaxy and import this data into our history:

Go to https://usegalaxy.eu/
Click Login or Register. Create an account and activate (if you hadn’t so)
Import the data by pressing the Upload Data button on the upper right corner and choose the Paste/Fetch data button at the bottom window.
Change the name from New file to barcode02.fastq.gz and paste the zenodo link: https://zenodo.org/record/6475902/files/barcode02.fastq.gz into the box.
Press start to download

Discussion 01

Each item in the Zenodo repository are the sample raw reads that has been demultiplexed (one strain each).

How many gigabytes of data are there?

How much data was generated for an average sample?

Why do we have uneven distribution of data for each samples?

What kind of information can you get from a FASTQ file? What are the differences with a FASTA format?

Solution

TBD

There are two main data formats which are relevant for your bioinformatics analysis. The first format is called FASTQ. It is usually used to store the raw reads (that is the nucleotide sequence string) together with their estimated quality(represented as ASCII bytes – not part of the curriculum). Data in fastq format can be identified by the endings: *.fq or *.fastq. The other format is called fasta. It is used to store any sort of DNA or amino acid sequence. Data in fasta format can be identified by the endings: *.fasta, *.fa, *.fna, *.fas or *.fsa. To save storage space data files are often compressed. Most often used is a Gzip compression, which results in an added *.gz at the end of the file name for example: example.fq.gz or example.fa.gz. When uploading files to galaxy, files get automatically unzipped, so this should not be an issue.

Importing and editing a published workflow

We have prepared a template workflow that you can edit and use in your genome assembly and annotation. The template workflow are split into different steps with several measures for checking the quality in between steps.

Let’s start by importing and exploring an initial QC workflow:

In a new tab, open this link: https://usegalaxy.eu/u/matinnu/w/01rawqc
On the top-right corner, click the + button (import workflow)
Click start using this workflow
Click on the name of the newly imported workflow (imported: 01_Raw_QC) and click Edit
Explore the workflow and make changes if necessary

Discussion 02

What is the input and final output of the workflow?

What are the purpose of the two main tools in the workflow?

Do you think Kraken2 is suited for long-read sequences?

Solution

TBD

When you are contempt with the workflow, click save. Next, we will run the analysis on barcode02.fastq (the smallest dataset).

Click the “play” button on the top-right corner (Run Workflow)
If the workflow does not show in detailed view, click Expand to full workflow form
Set Send results to a new history to yes.
Change the history name to “01_Raw_QC_2021_sample02”.
On the 1: input dataset, select 2: barcode02.fastq as the input.
Click Run Workflow on the top panel.

Working with Galaxy history

In Galaxy, each analysis will run on different history. You can have multiple history and switch between history depending on your needs.

We just run a simple QC to our raw reads, and it might take a while for the jobs to finish. Note that we have made some of the intermediate jobs hidden.

Discussion 03

When the job is done, let’s look at the output of the jobs in the history. Open the Nanoplot result by clicking the “eye” (view data) icon.

Find out the value of these parameters:

Total base

N50

Mean / Median Length

Mean / Median Quality

Your longest read

Note that this sample have 506.4 MB (Mega Bytes) of data and a 262.4 MBp (Mega Base Pairs) of reads. Which one is more appropriate to measure the sequencing yield?

What does N50 means? What is a good N50 value?

There are 66% of reads that are above Q10. What does this mean? What is a Q-score?

Do you think you have enough data for the assembly? What sequencing depth do you expect? How do you calculate a sequencing depth?

Solution

Example of Nanoplot result

Before the assembly a quality check is necessary. For that we can use Nanoplot with default settings. Nanoplot gives several output files, but you can get an easy overview with the HTML report. The most important numbers there the mean read quality and the read length N50. The read quality is denoted as a Phred score and is for nanopore reads typicall not very high. Anything above 7 should be fine. If you want to know more, it is possible to calculate back from the Phred score to the estimated probability of erroneous base pairs. The N50 is a measurement for the contiguity or the mean read/contig length. It is defined as the sequence length of the shortest contig at 50% of the total genome length. Low N50 (<1000bp) of the nanopore reads usually indicate sheared DNA, which can occur due to problems during the library prep or DNA extraction.

Let’s open the Krona Pie Chart to visualize the Kraken2 result.

What does Kraken2 says about the majority of the reads? What species are assigned?

Do you see potential contamination in your reads?

Would you trust Kraken2 result as the taxonomic placement of your assembly or do you want to try another approach? Which database are being used by Krona for the taxonomic assignment?

Solution

TBD

Concluding the QC Analysis

Based on the QC, we have good quality and sufficient amount of reads for the assembly. Nevertheless, we might improve our assembly by filtering out the shorter reads (and perhaps bad quality reads?), which we will do in the next steps.

Key Points

You can start working in Galaxy by getting familiar to importing data, using history, and manipulating workflows.

Tools such as Nanoplot and Kraken2 can give you information about the quantity (yield), quality (Q-score, N50, and read distribution), and possible contamination.

QC are valuable to decide how the assembly process should be approached. It can tell us whether more sequencing data need to be generated or a certain part of the data need to be filtered. Overall, the decision might be a case by case basis.

lesson home

(27255) Week 11 - Genome Assembly and Annotation

next episode

Introduction

Overview

Introduction

Overview of Genome Assembly Processes

Importing Data into Galaxy

Discussion 01

Solution

Importing and editing a published workflow

Discussion 02

Solution

Working with Galaxy history

Discussion 03

Solution

Solution

Concluding the QC Analysis

Key Points

lesson home

next episode