nf-core/genomeassembler   
 Assembly and scaffolding of haploid / unphased genomes from long ONT or PacBio HiFi reads
Version 1.0.1. The latest stable release is 1.1.0.
  Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Read preparation
- Assembly, choice between assemblers
- Polishing
- Scaffolding
- Annotation liftover
- Quality control
- Reporting
Output structure
Outputs are collected into the output directory by sample:
Output files
- <SampleName>/
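Because every sample gets its own top-level folder, the samples of a finished run can be listed by scanning the results directory. The following is a minimal sketch, assuming that report/ and pipeline_info/ (both described later in this document) are the only top-level folders that are not samples. The function name list_samples is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Assumption: these are the only top-level folders that are not samples.
NON_SAMPLE_DIRS = {"report", "pipeline_info"}


def list_samples(results_dir):
    """Return the sorted names of the per-sample folders in a results directory."""
    root = Path(results_dir)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and entry.name not in NON_SAMPLE_DIRS
    )
```

Calling list_samples("results") on a finished run would then return one name per sample folder; plain files at the top level are ignored.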
Within each sample, the files are structured as follows:
Read preparation
The outputs from all read preparation steps are emitted into <SampleName>/reads/.
ONT reads
If the basecalls are scattered across multiple files, collect can be used to collect those into a single file.
porechop is a tool that identifies and trims adapter sequences from ONT reads.
genomescope estimates genome size and ploidy from the k-mer spectrum computed by jellyfish.
Output files
- <SampleName>/
  - reads/
    - collect/: single fastq.gz files per sample
    - porechop/: output from porechop, fastq.gz
    - genomescope/: output from jellyfish and genomescope
      - jellyfish/
        - count/: output from jellyfish count
        - stats/: output from jellyfish stats
        - histo/: output from jellyfish histogram
        - dump/: output from jellyfish dump
      - genomescope/: genomescope plots
HiFi reads
lima trims adapters from PacBio HiFi reads.
Output files
- <SampleName>/
  - reads/
    - lima/: HiFi reads after adapter removal with lima.
      - fastq/: HiFi reads after adapter removal with lima, converted to fastq format.
Short reads
TrimGalore! can remove adapters from Illumina short reads. meryl calculates the k-mer spectrum of short reads.
Output files
- <SampleName>/
  - reads/
    - trimgalore/:
      - <SampleName>_val_1.fq.gz: Trimmed forward reads
      - <SampleName>_val_2.fq.gz: Trimmed reverse reads (if included)
      - <SampleName>_1.fastq.gz.trimming_report.txt: Trimming report forward
      - <SampleName>_2.fastq.gz.trimming_report.txt: Trimming report reverse (if included)
    - meryl/: output from meryl
      - count/: k-mer counts per file
      - unionsum/: union of k-mer counts per sample
Assembly
This folder contains the initial assemblies of the provided reads.
Depending on the assembly strategy chosen, different assemblers are used.
flye performs assembly of ONT reads.
hifiasm performs assembly of HiFi reads, or combinations of HiFi reads and ONT reads in --ul mode.
ragtag performs scaffolding and can be used to scaffold assemblies of ONT onto assemblies of HiFi reads.
Annotation gff3 and unmapped.txt files are only created if a reference for annotation liftover is provided and lift_annotations is enabled.
Output files
- <SampleName>/
  - assembly/
    - flye/: output from flye.
      - <SampleName>.assembly.fasta.gz: Assembly in gzipped fasta format
      - <SampleName>.assembly_graph.gfa.gz: Assembly graph in gzipped gfa format
      - <SampleName>.assembly_graph.gv.gz: Assembly graph in gzipped gv format
      - <SampleName>.assembly_info.txt: Information on the assembly
      - <SampleName>.flye.log: flye log file
      - <SampleName>.params.json: params used for running flye
    - hifiasm/: output from hifiasm.
      - <SampleName>.asm.bp.p_ctg.fa.gz: gzipped fasta file of the primary contigs
      - <SampleName>.asm.bp.p_ctg.gfa: primary contigs in gfa format
      - <SampleName>.asm.bp.p_utg.gfa: processed unitigs in gfa format
      - <SampleName>.asm.bp.r_utg.gfa: raw unitigs in gfa format
      - <SampleName>.stderr.log: Any output from hifiasm to stderr
      - gfa2_fasta/: hifiasm assembly in fasta format.
    - ragtag/: output from RagTag, only if 'flye_on_hifiasm' was used as the assembler. Contains one folder per sample.
      - <SampleName>_assembly_scaffold/
        - <SampleName>_assembly_scaffold.agp: Scaffolds in agp format
        - <SampleName>_assembly_scaffold.fasta: Scaffolds in fasta format
        - <SampleName>_assembly_scaffold.stats: Scaffolding statistics.
    - <SampleName>_assembly.gff3: annotation liftover
    - <SampleName>_assembly.unmapped.txt: annotations that could not be lifted over during annotation liftover
Polishing
Polishing can be used to correct errors in the assembly. This pipeline supports two polishing tools.
medaka polishes assemblies using the ONT reads that were used for assembly.
pilon polishes any type of assembly using short-reads.
Annotation gff3 and unmapped.txt files are only created if a reference for annotation liftover is provided and lift_annotations is enabled.
Output files
- <SampleName>/
  - polish/
    - pilon/: output from pilon
      - <SampleName>_pilon.fasta: Polished assembly
      - <SampleName>_pilon.gff3: annotation liftover
      - <SampleName>_pilon.unmapped.txt: annotations that could not be lifted over during annotation liftover
    - medaka/: output from medaka
      - <SampleName>_medaka.fa.gz: Polished assembly
      - <SampleName>_medaka.gff3: annotation liftover
      - <SampleName>_medaka.unmapped.txt: annotations that could not be lifted over during annotation liftover
Scaffolding
The (polished) assembly can be scaffolded using different tools.
links performs scaffolding of the assembly using long reads.
longstitch performs correction via Tigmint and scaffolding using long reads via ntLink and ARKS.
Annotation gff3 and unmapped.txt files are only created if a reference for annotation liftover is provided and lift_annotations is enabled.
Output files
- <SampleName>/
  - scaffold/
    - links/: output from links
      - <SampleName>_links.gv: scaffolding graph
      - <SampleName>_links.log: log file
      - <SampleName>_links.scaffolds: scaffold statistics
      - <SampleName>_links.scaffolds.fa: scaffold fasta
      - <SampleName>_links.gff3: annotation liftover
      - <SampleName>_links.unmapped.txt: annotations that could not be lifted over during annotation liftover
    - longstitch/: output from longstitch
      - <SampleName>_tigmint-ntLinks.arks.longstitch-scaffolds.fa: Scaffolds after scaffolding with Tigmint, ntLink, and ARKS. Annotations are based on this file.
      - <SampleName>_tigmint-ntLinks.longstitch-scaffolds.fa: Scaffolds after scaffolding with Tigmint and ntLink.
      - <SampleName>_longstitch.gff3: annotation liftover (onto *._tigmint-ntLinks.arks.*)
      - <SampleName>_longstitch.unmapped.txt: annotations that could not be lifted over during annotation liftover
    - ragtag/: output from RagTag
      - <SampleName>_ragtag_<Reference>/
        - <SampleName>_ragtag_<Reference>.agp: agp file, scaffolding results
        - <SampleName>_ragtag_<Reference>.fasta: Scaffold fasta file
        - <SampleName>_ragtag_<Reference>.stats: Scaffolding statistics
        - <SampleName>_ragtag.gff3: annotation liftover
        - <SampleName>_ragtag.unmapped.txt: annotations that could not be lifted over during annotation liftover
Quality control
All quality control files end up in QC. Below is the tree assuming that all steps of the pipeline were run:
- nanoq generates descriptive statistics of the nanopore reads.
For each step, three quality control tools can be run:
- QUAST provides assembly statistics (e.g. size, N50, etc.)
- BUSCO assesses genome quality based on the presence of lineage-specific single-copy orthologs
- merqury compares the genome k-mer spectrum to the short-read k-mer spectrum to assess the base accuracy of the assembly.
The files and folders in the different QC folders are named based on
<SampleName> and <stage>. SampleName is the sample name, and stage is one of: assembly, medaka, pilon, links, longstitch or ragtag.
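The naming scheme can be taken apart programmatically, for example when collecting QC files across samples. The sketch below is illustrative only: the function name split_qc_name is hypothetical, and the code assumes the base name is exactly <SampleName>_<stage> with one of the stages listed above.

```python
# Stage names as listed in this document.
STAGES = ("assembly", "medaka", "pilon", "links", "longstitch", "ragtag")


def split_qc_name(name):
    """Split a '<SampleName>_<stage>' base name into (sample, stage).

    The split is made at the last underscore, so sample names that
    themselves contain underscores are handled.
    """
    sample, sep, stage = name.rpartition("_")
    if not sep or not sample or stage not in STAGES:
        raise ValueError(f"not a <SampleName>_<stage> name: {name!r}")
    return sample, stage
```

Splitting at the last underscore is a design choice that keeps the sketch independent of how sample names are written; any name whose final part is not a known stage is rejected instead of being guessed.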
Folder contents
- <SampleName>/
  - QC/
    - BUSCO/: BUSCO reports
      - <SampleName>_<stage>-<BuscoLineage>-busco/: BUSCO output folder, please refer to the BUSCO documentation for details.
      - <SampleName>_<stage>-<BuscoLineage>-busco.batch_summary.txt: BUSCO batch summary output
      - short_summary.specific.<SampleName>_<stage>.{txt,json}: BUSCO short summaries in txt and json format
    - merqury/: merqury analysis of the assembly
      - <SampleName>_<stage>.<SampleName>.assembly.qv: QV of the assembly (per sequence)
      - <SampleName>_<stage>.<SampleName>.assembly.spectra-cn.fl.png: Copy number plot, filled
      - <SampleName>_<stage>.<SampleName>.assembly.spectra-cn.ln.png: Copy number plot, lines
      - <SampleName>_<stage>.<SampleName>.assembly.spectra-cn.st.png: Copy number plot, semi-transparent
      - <SampleName>_<stage>.<SampleName>.assembly.spectra-cn.hist: Copy number histogram file
      - <SampleName>_<stage>.completeness.stats: Assembly completeness statistics (overall)
      - <SampleName>_<stage>.qv: Assembly QV (overall)
      - <SampleName>_<stage>.spectra-asm.fl.png: Assembly k-mer spectrum, filled
      - <SampleName>_<stage>.spectra-asm.ln.png: Assembly k-mer spectrum, lines
      - <SampleName>_<stage>.spectra-asm.st.png: Assembly k-mer spectrum, semi-transparent
      - <SampleName>_<stage>.spectra-asm.hist: Assembly k-mer spectrum histogram file
      - <SampleName>_<stage>.dist_only.hist: Number of k-mers distinct to the assembly
      - <SampleName>_<stage>.assembly_only.bed: bp errors in assembly (bed)
      - <SampleName>_<stage>.assembly_only.wig: bp errors in assembly (wig)
      - <SampleName>_<stage>.unionsum.hist.ploidy: ploidy estimates from short reads
    - nanoq/: nanoq results
      - <SampleName>_report.json: nanoq report in json format
      - <SampleName>_stats.json: nanoq stats in json format
    - QUAST/: QUAST analysis
      - <SampleName>_<stage>/: QUAST results, cf. the QUAST docs
        - report.txt: summary table
        - report.tsv: tab-separated version, for parsing, or for spreadsheets (Google Docs, Excel, etc.)
        - report.tex: LaTeX version
        - report.pdf: PDF version, includes all tables and plots for some statistics
        - report.html: everything in an interactive HTML file
        - icarus.html: Icarus main menu with links to interactive viewers
        - contigs_reports/: [only if a reference genome is provided]
          - misassemblies_report: detailed report on misassemblies
          - unaligned_report: detailed report on unaligned and partially unaligned contigs
        - reads_stats/: [only if reads are provided]
          - reads_report: detailed report on mapped reads statistics
      - <SampleName>_<stage_report>.tsv: QUAST summary report
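Since report.tsv is described above as the version intended for parsing, the following sketch shows one way it could be read. It assumes the usual QUAST layout: a first row that names the assemblies, followed by one row per metric, with the metric name in the first column. The function name read_quast_report is hypothetical, and values are kept as strings because not all metrics are numeric.

```python
import csv


def read_quast_report(path):
    """Read a QUAST report.tsv into {assembly: {metric: value}}.

    Assumed layout: the first row names the assemblies, every further
    row holds one metric, with the metric name in the first column.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, metrics = rows[0], rows[1:]
    assemblies = header[1:]
    report = {name: {} for name in assemblies}
    for row in metrics:
        # Pair each value with the assembly named in the same column.
        for name, value in zip(assemblies, row[1:]):
            report[name][row[0]] = value
    return report
```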
 
 
 
Alignments
All alignments created are saved to the results directory.
Alignments are created for:
- pilon: short read alignment
- QUAST:
  - long reads against reference (if provided)
  - long reads against assemblies / polished assemblies / scaffolds
The files in the alignment folder have the following base name structure:
<SampleName>_<stage>. SampleName is the sample name, and stage is one of:
assembly, medaka, pilon, links, longstitch or ragtag.
Output files
- <SampleName>/
  - QC/
    - alignments/: alignments to assemblies
      - <SampleName>_<stage>.bam: Alignment
      - <SampleName>_<stage>.bai: BAM index file
      - <SampleName>_<stage>.stats: comprehensive statistics from alignment file
      - <SampleName>_<stage>.idxstats: alignment summary statistics
      - <SampleName>_<stage>.flagstat: number of alignments for each FLAG type
      - shortreads/: folder containing short read mapping for pilon
        - <SampleName>_shortreads.bam: Alignment
        - <SampleName>_shortreads.bai: BAM index file
        - <SampleName>_shortreads.stats: comprehensive statistics from alignment file
        - <SampleName>_shortreads.idxstats: alignment summary statistics
        - <SampleName>_shortreads.flagstat: number of alignments for each FLAG type
      - reference/: folder containing alignment of long reads to reference
        - <SampleName>_to_reference.bam: Alignment
        - <SampleName>_to_reference.bai: BAM index file
        - <SampleName>_to_reference.stats: comprehensive statistics from alignment file
        - <SampleName>_to_reference.idxstats: alignment summary statistics
        - <SampleName>_to_reference.flagstat: number of alignments for each FLAG type
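The per-stage alignment files follow directly from the base name structure described above. As a sketch, the expected paths for one sample and stage could be assembled as follows; the function name alignment_files is hypothetical, and the shortreads/ and reference/ subfolders are not covered.

```python
from pathlib import Path

# File extensions of the per-stage alignment outputs listed above.
ALIGNMENT_SUFFIXES = ("bam", "bai", "stats", "idxstats", "flagstat")


def alignment_files(results_dir, sample, stage):
    """Return the expected per-stage alignment paths for one sample."""
    base = Path(results_dir) / sample / "QC" / "alignments"
    return [base / f"{sample}_{stage}.{suffix}" for suffix in ALIGNMENT_SUFFIXES]
```

A check such as [p for p in alignment_files("results", "sampleA", "pilon") if not p.exists()] would then list any of these files that are absent, for instance because the stage was not run.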
 
 
 
 
Report
The pipeline collects the quality control outputs into an html report. Below is the tree assuming that all steps of the pipeline were run:
Output files
- report/:
  - busco_files/reports.tsv: Table containing aggregated BUSCO reports
  - quast_files/reports.tsv: Table containing aggregated QUAST reports
  - report.html: The report file
  - report_files/: Folder containing js and css. Required to properly display the .html file
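Because report.html depends on the report_files/ folder, the two need to stay together when the report is copied or shared. A minimal sketch of such a check is shown below; the function name report_is_complete is hypothetical, and the code only tests that both entries exist, not that their contents are valid.

```python
from pathlib import Path


def report_is_complete(report_dir):
    """Check that report.html and its report_files/ folder are both present."""
    root = Path(report_dir)
    return (root / "report.html").is_file() and (root / "report_files").is_dir()
```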
 
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
 
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.