Analysis pipeline for CUT&RUN and CUT&TAG experiments that includes QC, support for spike-ins, IgG controls, peak calling and downstream analysis.
Version history
[3.2.2] - 2024-02-01
- Updated pipeline template to nf-core/tools.
- Added option to dump calculated scale factor values from the pipeline using
- Added options to group IGV tracks by samplesheet group or by file type using
- Fixed error that caused mismapping of IgG controls to their targets in certain samplesheet configurations.
[3.2.1] - 2023-10-25
- Updated pipeline template to nf-core/tools
- Fixed error that caused only one random sample to run for markduplicates with certain versions of Nextflow. Adding an explicit
on genome files has fixed this issue.
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| fastqc     | 0.11.9      | 0.12.1      |
| picard     | 3.0.0       | 3.1.0       |
| multiqc    | 1.15        | 1.17        |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
[3.2] - 2023-08-31
Major Changes
- [#189] - Duplicates arising from linear amplification can now be removed by setting
--remove_linear_duplicates true
is default. Linear amplification is used in the TIPseq protocol, in which genomic DNA is cut with Tn5 loaded with a T7 promoter sequence that is inserted into the cut DNA fragment. The T7 promoter sequence is then used for in vitro transcription to produce copies of the cut fragment. These copies are referred to as linear duplicates. Recent iterations of the CUT&Tag protocol, such as nano-CUT&Tag, have also been modified to include a linear amplification step. Credit to teemuronkko for this.
- [#208] - Updated the genome blacklist files to more accurate CUT&RUN-specific regions rather than the old ChIP-Seq ENCODE blacklist. This should improve mapping rates and reduce spurious peaks. Credit to Adrija K for this. [The CUT&RUN suspect list of problematic regions of the genome]
- Updated pipeline template to nf-core/tools.
- [#189] - Mitochondrial reads can be filtered before peak calling by setting
--remove_mitochondrial_reads true
is default. If using a custom reference genome, the user can specify the string used to denote the mitochondrial reads in the reference using the --mito_name
parameter.
- [#189] - The user can now specify explicitly if
mode of Bowtie2 should be used by setting --end_to_end
is default. In the end-to-end mode, all read characters are included when optimising an alignment. If the local mode is specified, Bowtie2 might exclude characters from one or both ends of the read to maximise alignment scores.
- [#189] - Added the name of the peak caller in the consensus peaks to make it clearer which peaks were used in the downstream reporting steps.
- [#196] - Extended documentation for most common alternative spike-in genomes, i.e. yeast and fruit fly. Credit to smoe for this.
- The Preseq module
was moved from local
- Updated all nf-core modules to latest versions.
- Standardised channel structure for the nf-core Bowtie2
module in the local align_bowtie2
subworkflows to prevent file errors.
- Fixed error caused by altered channel structure of the nf-core
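The linear duplicate removal described in the major changes of this release can be illustrated with a short Python sketch. It assumes that linear duplicates can be recognised as read pairs whose first mate shares the same chromosome, strand and 5' coordinate; that grouping key, the `Fragment` record and the function names are all hypothetical and are not taken from the pipeline's own implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical minimal record describing read 1 of an aligned pair."""
    chrom: str
    start: int   # leftmost aligned base of read 1 (0-based)
    end: int     # one past the rightmost aligned base of read 1
    strand: str  # "+" or "-" orientation of read 1


def five_prime(fragment: Fragment) -> int:
    """Return the 5' coordinate of read 1, which depends on its strand."""
    return fragment.start if fragment.strand == "+" else fragment.end - 1


def remove_linear_duplicates(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep the first fragment seen for each (chrom, strand, 5' position).

    Assumption: copies made by linear amplification share this key.
    """
    seen: Set[Tuple[str, str, int]] = set()
    unique: List[Fragment] = []
    for fragment in fragments:
        key = (fragment.chrom, fragment.strand, five_prime(fragment))
        if key not in seen:
            seen.add(key)
            unique.append(fragment)
    return unique


reads = [
    Fragment("chr1", 100, 150, "+"),
    Fragment("chr1", 100, 180, "+"),  # same 5' start: treated as a duplicate
    Fragment("chr1", 120, 150, "-"),
    Fragment("chr2", 100, 150, "+"),
]
deduplicated = remove_linear_duplicates(reads)
print(len(deduplicated))
```

In this toy input the second record is dropped because it shares its 5' start with the first, while the reverse-strand read and the read on the other chromosome are kept.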
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| bedtools   | 2.30.0      | 2.31.0      |
| multiqc    | 1.13        | 1.14        |
| samtools   | 1.16.1      | 1.17        |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
[3.1] - 2023-03-10
Major Changes
- IgG controls will now be analysed by the deeptools QC subworkflow giving greater visibility on the quality of control samples.
- Updated the MACS2 default parameters to better process PA-Tn5/PA-MNase based experiments. The new defaults use the q-value of
as the default cutoff in place of the p-value. The defaults have also been updated to keep duplicate reads in the peak finding process and to shift the model to better account for nucleosome positioning:
--nomodel --shift -75 --extsize 150 --keep-dup all
- Deeptools plotHeatmap will now run for all samples as well as for singles. This can be disabled using the parameter
--dt_calc_all_matrix false
- Bowtie2 default parameters have been updated to use the
option. After careful consideration and literature review, we have decided that overlapping mates can occur in CUT&RUN data and are still valid reads. This is also the agreed parameterisation in similar pipelines and on the 4D Nucleome portal.
- Updated pipeline template to nf-core/tools.
- Updated pipeline syntax to conform to new Nextflow version standards.
- Some locally defined subworkflows/modules have now been added to nf-core and re-imported as official modules/subworkflows.
- Fixed confusing config warnings that were being displayed on legitimate parameter configurations.
- Fixed deeptools correlation plots that were showing low levels of correlation even in test data by changing the plot to use Pearson correlation.
- Corrected the SEACR p-value parameter description.
- Fixed output of Picard mark/remove duplicate files so that the sorted, indexed BAMs for all files are always output to the results folder.
- Spike-in genome processes and checks no longer run when the normalisation mode is set to something other than
- Pipeline will now fail gracefully when single-end reads are detected.
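The effect of the MACS2 options `--nomodel --shift -75 --extsize 150` listed in the major changes above can be worked through with a little arithmetic. The sketch below assumes the documented MACS2 behaviour in which the 5' end of each read is first moved by the shift value (a negative value moves it upstream, against the read direction) and then extended by the extension size in the read direction. The function name and the half-open coordinate convention are illustrative choices, and MACS2's internal off-by-one handling may differ.

```python
from typing import Tuple


def macs2_style_window(five_prime: int, strand: str,
                       shift: int = -75, extsize: int = 150) -> Tuple[int, int]:
    """Return the half-open (start, end) interval contributed by one read.

    Assumes the 5' end is moved by `shift` bases along the read direction
    (negative values move it upstream) and then extended by `extsize`
    bases in the read direction.
    """
    if strand == "+":
        start = five_prime + shift
        return (start, start + extsize)
    if strand == "-":
        # On the reverse strand the read direction runs towards lower
        # coordinates, so an upstream shift increases the coordinate.
        end = five_prime - shift + 1  # +1 keeps the interval half-open
        return (end - extsize, end)
    raise ValueError(f"unknown strand: {strand!r}")


print(macs2_style_window(1000, "+"))  # (925, 1075)
print(macs2_style_window(1000, "-"))  # (926, 1076)
```

Under these assumptions each read contributes a 150 bp window roughly centred on its 5' end, which is the cut site, rather than a window that starts at the cut site and extends in one direction only.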
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| multiqc    | 1.13        | 1.14        |
| samtools   | 1.15.1      | 1.16.1      |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
[3.0] - 2022-10-27
Major Changes
- Major rework of the pipeline internal flow structure. Metadata from processes (such as read counts) was previously annotated to a channel dictionary that was passed through the pipeline where various reporting processes could use the data. This was interacting with quite a few bugs in the Nextflow pipeline resume feature, causing lots of processes to rerun unnecessarily on resume. Any metadata generated in the pipeline is now written to files and passed where necessary to consuming reporting processes. This has drastically lowered the number of processes that incorrectly rerun on resume.
- Re-organized the pipeline into clearer sections, breaking related processes into sub-workflows where possible. This is for better readability, but also to prepare the pipeline for the major upcoming nf-core feature of re-usable subworkflows. As part of this rework, the pipeline now has distinct sections for fragment-based QC and peak-based QC.
- All reporting has been moved into MultiQC where possible. All PDF-based charting has been removed. Other PDF reports such as heatmaps and upset plots are still generated.
- We have listened to user comments that there is no guide on how to interpret the results from the pipeline. In response, we have revamped the documentation in the
document to describe the reporting in much more depth, including good and bad examples of reporting output where possible.
- [#140] - IGV browser output has been reworked. We first fixed the performance issues with long load times by including the genome index in the session folder. IGV output now includes peaks from all peak callers used in the pipeline, not just the primary one. Users can now select whether the gene track exported with the IGV session contains gene symbols or gene names. Several visual changes have been made to improve the default appearance and order of tracks.
- Added PreSeq library complexity reporting.
- Added the full suite of fragment-based deepTools QC using the
module. We generate three reports from this fragment dataset: PCA, correlation and fingerprint plots. This has replaced our previous Python implementation of sample correlation calculation.
- All coverage tracks generated from reads now extend reads to full fragment length by default. We feel this creates more realistic coverage tracks for CUT&RUN and improves the accuracy of other fragment-based reports.
- Updated pipeline template to nf-core/tools.
- [#149] - Pipeline will now use a blacklist file if provided to create an include list for the genome.
- The FRiP score is now calculated based on extended read fragments and not just mapped reads.
- [#138] - Better sample sheet error reporting.
- Gene bed files will now be automatically created from the GTF file if not supplied.
- The default minimum q-score for read quality has been changed from 0 to 20.
- [#156] - SEACR has been better parameterized with dedicated config values for stringency and normalization. Credit to
for this.
- deepTools heatmap generation has been better parameterized with dedicated config values for the gene and peak region settings.
- Consensus peak count reporting has been added to MultiQC.
- Reviewed and updated CI tests for better code coverage.
- Updated all nf-core modules to latest versions.
- Fixed some bugs in the passing of MACS2 peak data through the pipeline in v2.0. MACS2 peaks will now be correctly used and reported on in the pipeline.
- [#135] - Removed many of the yellow warnings that were appearing in the pipeline to do with resource config options for processes that were not run.
- [#137] - Fixed the
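The change to the FRiP score listed above (counting extended fragments rather than mapped reads) can be illustrated with a small Python sketch. It assumes FRiP is the fraction of intervals that overlap at least one peak by one base or more; the interval representation, function names and example coordinates are hypothetical, and chromosome handling is omitted for brevity.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on a single chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def frip(intervals: List[Interval], peaks: List[Interval]) -> float:
    """Fraction of intervals that overlap at least one peak."""
    if not intervals:
        return 0.0
    in_peaks = sum(1 for iv in intervals if any(overlaps(iv, p) for p in peaks))
    return in_peaks / len(intervals)


peaks = [(100, 200)]
# Read 1 of each pair only, versus the full fragment spanned by both mates.
read_intervals = [(40, 90), (150, 190), (300, 350)]
fragment_intervals = [(40, 160), (150, 260), (300, 420)]

print(frip(read_intervals, peaks))
print(frip(fragment_intervals, peaks))
```

In this toy data the first pair only reaches the peak once it is extended to its full fragment length, so the fragment-based score is higher than the read-based one.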
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| multiqc    | 1.12        | 1.13        |
| picard     | 2.27.2      | 2.27.4      |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
[2.0] - 2022-06-08
Major Changes
- [#53] - Complete redesign of the samplesheet input system. Controls are no longer hard-coded to
and the process of assigning controls to samples has been simplified.
- [#93], [#73] - Additional sample normalisation options have been added. In addition to normalising using detected spike-in DNA, there are now several options for normalising against read depth instead, as well as skipping normalisation entirely.
- [#62] - Added MACS2 as an optional peak caller. Peak calling can now be altered using the
variable. Both peak callers can be run using --peakcaller SEACR,MACS2. The primary caller is the first item in the list and will be used downstream, while the secondary will be run and output to the results folder.
- [#101] -
ran consensus peak calling at both the group level and for all samples. This was causing performance issues for larger sample sets. There is now a new consensus_peak_mode parameter that defaults to group. Consensus will only be run on all samples if this is changed to all.
- Updated pipeline template to nf-core/tools.
- Upgraded pipeline to support the new nf-core module configuration system.
- More robust CI testing. Over 213 tests now run before any code is merged with the main code base.
- More control over which parts of the pipeline run. Explicit skipping has been implemented for every section of the pipeline.
- Added options for scaling control data before it is used to call peaks. This is especially useful when using read depth normalisation as this can sometimes result in few peaks being called due to high background levels.
- Added support for Bowtie2 large indexes.
- IGV auto-session builder now supports
file extensions.
- Bowtie2 alignment has been altered to run in
mode only if trimming is skipped. If trimming is activated, it will run in --local mode.
- [#88] - Many processes have been optimized for resource utilization. Users will especially notice that single-thread processes will now only request 1 core rather than 2.
- [#63] - Custom containers for Python reporting have now been condensed into a single container and added to Bioconda.
- [#76] - Standardized Python versions across reporting modules.
- [#120] - DeepTools compute matrix/heatmaps now only runs if there are peaks detected.
- [#99] - Large upset plots were causing process crashes. Upset plots will now fail gracefully if the number of samples in the consensus group is more than 10.
- [#95] - Fixed FRiP calculation performance issues and crashes.
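The difference between the group and all settings of consensus_peak_mode described above can be sketched in Python. This is a simplified illustration that assumes consensus peaks are formed by merging overlapping or touching peak intervals; the pipeline's actual consensus logic may apply further filtering. The sample names, group names and function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chrom, start, end), half-open


def merge_peaks(peaks: List[Peak]) -> List[Peak]:
    """Merge overlapping or touching peaks on the same chromosome."""
    merged: List[Peak] = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def consensus(peaks_by_sample: Dict[str, List[Peak]],
              groups: Dict[str, str],
              mode: str = "group") -> Dict[str, List[Peak]]:
    """Pool peaks per group (mode="group") or across every sample (mode="all")."""
    if mode not in ("group", "all"):
        raise ValueError(f"unsupported mode: {mode!r}")
    pooled: Dict[str, List[Peak]] = defaultdict(list)
    for sample, peaks in peaks_by_sample.items():
        key = groups[sample] if mode == "group" else "all"
        pooled[key].extend(peaks)
    return {key: merge_peaks(peaks) for key, peaks in pooled.items()}


peaks_by_sample = {
    "h3k27me3_R1": [("chr1", 100, 200)],
    "h3k27me3_R2": [("chr1", 150, 250)],
    "h3k4me3_R1": [("chr1", 240, 300)],
}
groups = {
    "h3k27me3_R1": "h3k27me3",
    "h3k27me3_R2": "h3k27me3",
    "h3k4me3_R1": "h3k4me3",
}

by_group = consensus(peaks_by_sample, groups, mode="group")
across_all = consensus(peaks_by_sample, groups, mode="all")
print(by_group)
print(across_all)
```

In group mode each group is merged independently, so the work grows with the size of each group; in all mode every sample is pooled into a single merge, which is the case reported above as slow for larger sample sets.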
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| samtools   | 1.14        | 1.15.1      |
| bowtie2    | 2.4.2       | 2.4.4       |
| picard     | 2.25.7      | 2.27.2      |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
[1.1] - 2022-01-20
Enhancements & fixes
- Updated pipeline template to nf-core/tools
- [#71] - Bumped Nextflow version
- Added pipeline diagram to [README]
- Upgraded all modules (local and nf-core) to support the new versioning system
- The module
was submitted to nf-core and moved from local
- Added support for GFF files in IGV session generation
- [#57, #66] - Upgraded version reporting in multiqc to support both software version by module and unique software versions. This improves detection of multi-version software usage in the pipeline
- [#54] - Fixed pipeline error where dots in sample ids inside the sample sheet were not correctly handled
- [#75] - Fixed error caused by empty peak files being passed to the
Python reporting modules
- [#83] - Fixed error in violin chart generation with cast to int64
Software dependencies
Note, since the pipeline is now using Nextflow DSL2, each process will be run with its own Biocontainer. This means that on occasion it is entirely possible for the pipeline to be using different versions of the same tool. However, the overall software dependency changes compared to the last release have been listed below for reference.
| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| samtools   | 1.13        | 1.14        |
NB: Dependency has been updated if both old and new version information is present.
NB: Dependency has been added if just the new version information is present.
NB: Dependency has been removed if version information isn't present.
The initial release of nf-core/cutandrun!
After months of hard work, we are proud to say the pipeline has reached a point of feature completeness and stability. I would especially like to thank all of our beta testers who helped stress-test the pipeline to breaking point!
nf-core/cutandrun was originally written by Chris Cheshire (@chris-cheshire) and Charlotte West (@charlotte-west) from the Luscombe Lab at The Francis Crick Institute, London, UK.
The pipeline structure and parts of the downstream analysis were adapted from the original CUT&Tag analysis protocol from the Henikoff Lab.