In just 15 minutes plus questions, we focus on topics about using and developing nf-core
pipelines. These talks are recorded and made available at https://nf-co.re, helping to build
an archive of training material. Got an idea for a talk? Let us know on the #bytesize
Slack channel!
This week, Daniel Straub (@d4straub) will tell us all about the nf-core/ampliseq pipeline.
nf-core/ampliseq is a bioinformatics analysis pipeline for amplicon sequencing. It supports denoising of any amplicon and, currently, taxonomic assignment of 16S, ITS and 18S amplicons. Paired-end Illumina data as well as single-end Illumina, PacBio and IonTorrent data are supported. The default is the analysis of 16S rRNA gene amplicons sequenced paired-end with Illumina.
Video transcription
The content has been edited to make it reader-friendly
--single_end parameter, for PacBio and for IonTorrent technologies, you have to specify that. In special cases, such as Illumina paired-end ITS data for fungal community analyses, you should also set a parameter up front so that all the downstream analyses have sensible settings.
What is the output of those very basic analyses? In the results folder, the subfolder “dada2” contains a file that already holds most of the information you will need. For each sample, it lists the sequences that were found together with their quantification: it simply counts how many of the reads in your sample originate from that 16S sequence. There is also a handy overall summary of the read numbers for each sample, showing how many reads were processed and how many ended up in the table. If those numbers do not look as expected, this is a very good starting point for troubleshooting.
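As a rough illustration of the per-sample read summary described above, the following sketch sums the counts in each sample column of a small tab-separated table. The table layout, the column names (`ASV_ID`, `sample_A`, `sample_B`, `sequence`) and all values are assumptions invented for this example. They are not taken from the pipeline's actual output files.

```python
import csv
import io

# Hypothetical table: one row per sequence, one count column per sample.
# Column names and numbers are invented for illustration only.
ASV_TABLE_TSV = (
    "ASV_ID\tsample_A\tsample_B\tsequence\n"
    "ASV_1\t120\t30\tACGTACGT\n"
    "ASV_2\t0\t75\tTTGACCGA\n"
    "ASV_3\t15\t5\tGGCATTAC\n"
)


def reads_per_sample(tsv_text, non_count_columns=("ASV_ID", "sequence")):
    """Sum the counts in every sample column of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    totals = {}
    for row in reader:
        for column, value in row.items():
            if column in non_count_columns:
                continue  # skip identifier and sequence columns
            totals[column] = totals.get(column, 0) + int(value)
    return totals


totals = reads_per_sample(ASV_TABLE_TSV)
for sample, total in totals.items():
    print(f"{sample}\t{total}")
```

A sample whose total is far lower than the number of reads that went into the pipeline would be the kind of discrepancy worth investigating first.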
The next step of the pipeline is to classify the sequences that we have now produced, meaning that we give each one a taxonomic name. This is done by default with the DADA2 software, but there is also an alternative way to do it with QIIME2, which is also a very popular program in the field. By default, the pipeline uses DADA2 with the SILVA 138 taxonomy. A range of reference taxonomy databases can be used: the SILVA, RDP and GTDB databases, which are all for bacteria, for 16S rRNA gene amplicon analyses; the UNITE database for fungal ITS (internal transcribed spacer) analyses; and the PR2 database for eukaryotes with 18S rRNA gene amplicons.
You can also taxonomically classify any kind of sequences that you have produced by using --input and providing your FASTA file, which then needs to have a .fasta extension. After that comes taxonomic filtering: by default, only mitochondria and chloroplasts are filtered out of the subsequent tables, because those are typical off-targets of 16S rRNA gene amplicon sequencing and should be removed. The output of the taxonomic classification is again a table, a TSV, that lists all the sequences that were discovered and adds the taxonomic levels and their classifications with confidence values.
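The filtering idea described above can be sketched in a few lines: drop every classified sequence whose taxonomy mentions mitochondria or chloroplasts. This is only an illustration of the concept, not the pipeline's own implementation. The taxonomy strings, identifiers and confidence values below are hypothetical.

```python
# Hypothetical classified sequences: (id, taxonomy string, confidence).
# All values are invented for illustration only.
CLASSIFIED = [
    ("ASV_1", "Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales", 0.98),
    ("ASV_2", "Bacteria;Cyanobacteria;Cyanobacteriia;Chloroplast", 0.99),
    ("ASV_3", "Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Mitochondria", 0.95),
    ("ASV_4", "Bacteria;Firmicutes;Bacilli;Lactobacillales", 0.87),
]

# Off-target taxa to remove, matched case-insensitively.
EXCLUDED_TAXA = ("mitochondria", "chloroplast")


def remove_off_targets(rows, excluded=EXCLUDED_TAXA):
    """Keep rows whose taxonomy string mentions none of the excluded taxa."""
    kept = []
    for asv_id, taxonomy, confidence in rows:
        lowered = taxonomy.lower()
        if any(term in lowered for term in excluded):
            continue  # off-target sequence, drop it
        kept.append((asv_id, taxonomy, confidence))
    return kept


filtered = remove_off_targets(CLASSIFIED)
kept_ids = [asv_id for asv_id, _, _ in filtered]
print(kept_ids)
```

Simple substring matching is used here for brevity. A stricter version would compare whole taxonomic ranks so that unrelated names containing the same letters are not removed by accident.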
Finally, there is also downstream analysis, but it is only run when the optional parameter --metadata points to a tab-separated metadata sheet. Appropriate metadata columns are then chosen automatically and used for visualization in a bar plot, as you can see here. Those are interactive HTML files, so you can choose colors and different taxonomic levels. I chose a very high level here just to show you, with the test data set that we typically run with triplicates, that you can already see patterns in the different kinds of data.
Additionally, there is a differential abundance analysis with a program called ANCOM, which produces, for example, a volcano plot like this one, also interactive. This red dot here, which is identified as significantly differentially abundant between the treatments, would be a Burkholderiales bacterium.
Alpha and beta diversity indices are produced as well. Alpha diversity is a measure, for each sample, of how diverse the sequences in it are. You can see for this test data set, although it is a bit small, that the samples originating from groundwater have the lowest Shannon diversity index while the sediment samples have the highest. Statistical analyses for all of these comparisons are of course also provided in the output of the pipeline. Beta diversity plots show the differences between samples in PCoA plots. What you can see here is that the replicates for the different sample sources (groundwater, river water, sediment and soil in our example data set) cluster nicely and are well separated from the others.
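For readers unfamiliar with the Shannon diversity index mentioned above, here is a minimal sketch of the standard formula, H = -sum(p_i * ln(p_i)), where p_i is the fraction of reads belonging to sequence i. The natural logarithm is an assumption made for this example; some tools report the index with a base-2 logarithm instead. The counts are invented.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over the non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty sample has no diversity to measure
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log(p)
    return h


# Invented read counts for two samples with four sequences each.
even_sample = [25, 25, 25, 25]   # reads spread evenly over the sequences
skewed_sample = [97, 1, 1, 1]    # one sequence dominates

print(f"even:   {shannon_index(even_sample):.3f}")
print(f"skewed: {shannon_index(skewed_sample):.3f}")
```

The evenly distributed sample gives the higher value, which is why a community dominated by a few sequences scores lower than a community with many similarly abundant ones.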
Additionally, there are quality control figures such as alpha rarefaction curves, which show you whether your samples were sequenced with sufficient depth. As long as those curves flatten out, you know that enough sequencing data was produced for that sample. All those results can also be found on the nf-core website in /ampliseq/results, where you can have a look at what is produced and what it looks like. There is also extensive documentation on what each of those plots means and what you can use it for. Those plots, of course, cannot cover all kinds of complicated experimental setups, but they are very helpful for getting a first view of the data and are also a sort of quality control that what you are producing makes sense.
With this I want to finish, and I want to thank you for your attention. I also want to thank my colleagues at QBiC, who were involved in the development of the pipeline; nf-core, especially Daniel Lundin, who helped a lot; and the University of Tübingen Microbiology and Geomicrobiology groups, who produced a lot of data that I was then able to analyze with this pipeline, making it as good as possible.
(host) Thanks very much for that very clear and insightful talk, Daniel. Are there any questions? If anyone in the audience has questions, you can unmute yourself or add questions to the chat and I can read them out. Okay, I don’t see anything coming up. That was extremely clear, so I guess that’s why people don’t have questions, but if anything crops up, you know where to reach Daniel. You can message him and continue the discussion on the bytesize channel. Thanks again for joining. We will be back next week with another pipeline-focused talk.
(question) We have one comment here. Okay, there’s a question. This person thanks you for the presentation and asks what type of license is associated with this kind of tool, for example the ampliseq pipeline.
(answer) It is a CC-BY license [comment: ampliseq is actually under an MIT licence], as I think all the nf-core pipelines have. There are no restrictions whatsoever.
(host) I hope that answers your question. Yes, thank you.
(host) As I said, we will be back with another pipeline-focused bytesize talk next week, when Payam Emami will present the metaboigniter pipeline. Before that, I hope to see some of you in sunny Gathertown, starting tomorrow at our hackathon. Take care everyone, and thanks again for joining.