nf-core/proteinfamilies      
 Generation and updating of protein families
 metagenomics, protein-families, proteomics
   Version history
Added
- #124
  - Added new subworkflow MERGE_FAMILIES that can optionally merge similar (but not redundant) generated protein families. (by @vagkaratzas)
  - Added new functionality to the local module IDENTIFY_REDUNDANT_FAMS, which now also detects and outputs the identifiers of similar families that can optionally be merged downstream. These identifiers are written to “/remove_redundancy/<samplename>/similar_fam_ids.txt”, and the corresponding family pairwise similarity scores to “/remove_redundancy/<samplename>/similarities.csv”. (by @vagkaratzas)
  - Added new local module POOL_SIMILAR_COMPONENTS that generates family clusters from a family-similarity edgelist. (by @vagkaratzas)
  - Added new local module MERGE_SEEDS that merges seed alignments of similar families before restarting the family generation subworkflow. (by @vagkaratzas)
- #118
  - Added preprint citation to the repo. (by @vagkaratzas)
  - Added separate metro map files for dark and light browser modes. (by @vagkaratzas)
  - Added new local module EXTRACT_FAMILY_MEMBERS, which outputs a two-column TSV file containing the final family identifiers and their corresponding member sequence identifiers. The file is saved at “/family_reps/<samplename>/<samplename>.tsv”. (by @vagkaratzas)
- #117
  - Added SEQKIT_SEQ for optional sequence preprocessing in the quality check subworkflow. (by @vagkaratzas)
  - Added SEQKIT_REPLACE for optional sequence name parsing in the quality check subworkflow. (by @vagkaratzas)
  - Added SEQKIT_RMDUP for optional removal of duplicate names and sequences in the quality check subworkflow. (by @vagkaratzas)
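The POOL_SIMILAR_COMPONENTS entry above describes building family clusters from a family-similarity edgelist. One common way to do this is to take the connected components of the similarity graph. The following is a minimal sketch of that idea using union-find; it is not the module's actual code, and the function name, the input shape (pairs of family identifiers) and the example identifiers are assumptions made for illustration.

```python
from collections import defaultdict


def pool_similar_components(edges):
    """Group family identifiers into clusters (connected components).

    `edges` is an iterable of (family_a, family_b) pairs; each pair means the
    two families were reported as similar. (Hypothetical input shape.)
    """
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical edgelist: fam1~fam2 and fam2~fam3 are similar, as are fam4~fam5.
example_edges = [("fam1", "fam2"), ("fam2", "fam3"), ("fam4", "fam5")]
example_clusters = pool_similar_components(example_edges)
print(example_clusters)
```

Note that similarity is treated as transitive here: fam1 and fam3 end up in the same cluster even though no direct edge links them.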
Changed
- #128 - nf-core tools template update to 3.4.1.
- #124
  - Conditional workflow flags switched to their skip opposites: --trim_msa to --skip_msa_trimming, --recruit_sequences_with_models to --skip_additional_sequence_recruiting, --remove_family_redundancy to --skip_family_redundancy_removal, --remove_sequence_redundancy to --skip_sequence_redundancy_removal. (by @vagkaratzas)
- #118
  - Swapped the local CHECK_QUALITY subworkflow with the new nf-core one, FAA_SEQFU_SEQKIT. (by @vagkaratzas)
  - Based on protein family reproducibility benchmarks (i.e., computationally reproducing manually curated protein family resources), the cluster_seq_identity and cluster_coverage parameter default values have been updated to 0.3 and 0.5 (down from 0.5 and 0.9), respectively. (by @vagkaratzas)
- #117 - Swapped the local SEQKIT_STATS and SEQKIT_STATS_TO_MQC modules with the SEQFU_STATS one, which runs a bit faster and produces a MultiQC-ready output without the need for manual parsing. (by @vagkaratzas)
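Because the cluster_seq_identity and cluster_coverage defaults changed in this release, results may differ from earlier runs unless the previous values are set explicitly. The sketch below writes the old values (0.5 and 0.9, as stated above) to a JSON string that could be saved and passed to Nextflow through its `-params-file` option. The variable names and the file name mentioned afterwards are illustrative only.

```python
import json

# Defaults used before this release, taken from the changelog entry above.
previous_clustering_defaults = {
    "cluster_seq_identity": 0.5,
    "cluster_coverage": 0.9,
}

# Serialise to JSON; Nextflow can read pipeline parameters from such a file.
params_json = json.dumps(previous_clustering_defaults, indent=2)
print(params_json)
```

Saving this output as, for example, `params.json` and adding `-params-file params.json` to the usual `nextflow run nf-core/proteinfamilies` command should pin the clustering thresholds to their earlier values.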
Dependencies
| Tool | Previous version | New version | 
|---|---|---|
| seqfu | - | 1.20.3 | 
| multiqc | 1.30 | 1.31 | 
Deprecated
- #124 - Deprecated --trim_msa, --recruit_sequences_with_models, --remove_family_redundancy and --remove_sequence_redundancy. (by @vagkaratzas)
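For anyone carrying old parameter files forward, the rename can be applied mechanically. The helper below is a sketch, not part of the pipeline: the flag pairs come from the #124 entry in this changelog, while the function name and the assumption that each old flag held a boolean whose negation is the new skip value are illustrative.

```python
# Old flag -> new flag, as listed in the #124 changelog entry.
RENAMED_FLAGS = {
    "trim_msa": "skip_msa_trimming",
    "recruit_sequences_with_models": "skip_additional_sequence_recruiting",
    "remove_family_redundancy": "skip_family_redundancy_removal",
    "remove_sequence_redundancy": "skip_sequence_redundancy_removal",
}


def migrate_params(old_params):
    """Translate a dict of old-style parameters to the new skip flags."""
    new_params = {}
    for name, value in old_params.items():
        if name in RENAMED_FLAGS:
            # Assumption: the old flag was a boolean, so "do X" = True
            # becomes "skip X" = False, and vice versa.
            new_params[RENAMED_FLAGS[name]] = not value
        else:
            new_params[name] = value
    return new_params


# Hypothetical old parameter set.
migrated = migrate_params({"trim_msa": True, "remove_family_redundancy": False})
print(migrated)
```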
Special Thanks
To @jfy133, @erikrikarddaniel and @chrisAta for this version’s PR code reviews.
Fixed
- #112 - Fixed a bug in EXTRACT_FAMILY_REPS, where all sequences were pasted into the family representative one, and updated the relevant local nf-test. (by @vagkaratzas)
Changed
- #106 - Swapped the local EXECUTE_CLUSTERING subworkflow with the new nf-core MMSEQS_FASTA_CLUSTER one. (by @vagkaratzas)
Dependencies
| Tool | Previous version | New version | 
|---|---|---|
| multiqc | 1.29 | 1.30 | 
Changed
- #104 - Pulled params from local subworkflows into the main workflow.
- #103 - Parallelized execution for the EXTRACT_FAMILY_REPS local module and changed its input from full_msa to fasta.
- #100 - CAT_CAT module replaced with FIND_CONCATENATE to avoid large-scale “Argument list too long” errors.
- #98 - nf-core tools template update to 3.3.2.
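The “Argument list too long” error mentioned in #100 occurs when a command receives more file paths as arguments than the operating system allows. The general remedy is to discover the files inside the process that concatenates them instead of passing them on the command line. The sketch below shows that idea in Python; it does not reproduce the FIND_CONCATENATE module, and the function name, file names and glob pattern are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Stream every file matching `pattern` in `input_dir` into `output_path`.

    The file list lives in memory rather than on a command line, so its
    length is not bounded by the operating system's argument-size limit.
    """
    # Resolve the inputs before the output file is created.
    inputs = sorted(p for p in Path(input_dir).glob(pattern) if p.is_file())
    with open(output_path, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(inputs)


# Small demonstration with three hypothetical FASTA chunks.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"chunk_{i}.faa").write_text(f">seq{i}\nMK\n")
    merged_path = Path(tmp, "merged.fasta")
    n_merged = concatenate_files(tmp, "*.faa", merged_path)
    merged_text = merged_path.read_text()
```

The output file should use a name that the pattern does not match, otherwise a rerun would pick up the previous result as an input.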
Added
- #105 - CHECK_QUALITY subworkflow added at the start of the pipeline. It utilizes the seqkit/stats nf-core module to generate a MultiQC-ready report with statistics for the input amino acid sequences. The metro-map has been updated to reflect this change.
Added
- #93
  - Added nf-test and meta.yml file for local subworkflow GENERATE_FAMILIES.
  - Added nf-test and meta.yml file for local subworkflow REMOVE_REDUNDANCY.
  - Added nf-test and meta.yml file for local subworkflow UPDATE_FAMILIES.
- #88
  - Added nf-test and meta.yml file for local module BRANCH_HITS_FASTA.
  - Added nf-test and meta.yml file for local module FILTER_NON_REDUNDANT_FAMS.
  - Added nf-test and meta.yml file for local module IDENTIFY_REDUNDANT_FAMS.
  - Added nf-test and meta.yml file for local module EXTRACT_FAMILY_REPS.
  - Added the default pipeline end-to-end nf-test.
Changed
- #81 - nf-core tools template update to 3.3.1.
Fixed
- #80 - Fixed a bug where, due to a missing check for equal family sizes, non-redundant families were erroneously marked as redundant through transitive relationships and were removed.
Changed
- #77 - Default branch changed from master to main.
- #73 - Changed the fasta parsing library of the CHUNK_CLUSTERS local module from pyfastx back to the latest version of biopython, and parallelized its writing mechanism, achieving decreased execution time.
Dependencies
| Tool | Previous version | New version | 
|---|---|---|
| biopython | 1.84 | 1.85 | 
| pyfastx | 2.2.0 | - |
Removed
- #73 - Deprecated pyfastx module version of CHUNK_CLUSTERS, since it was struggling performance-wise with larger datasets.
Added
- #69 - Added the hhsuite/reformat nf-core module to reformat .sto alignments to .fas when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various save parameters.
- #58 - Added nf-test and meta.yml file for local module REMOVE_REDUNDANCY_SEQS (Hackathon 2025)
- #56 - Added nf-test and meta.yml file for local module FILTER_RECRUITED (Hackathon 2025)
- #55 - Added nf-test and meta.yml file for local module CHUNK_CLUSTERS (Hackathon 2025)
- #54 - Added nf-test for local subworkflow ALIGN_SEQUENCES (Hackathon 2025)
- #53 - Added nf-test for local subworkflow EXECUTE_CLUSTERING (Hackathon 2025)
- #51 - Added nf-test and meta.yml file for local module CALCULATE_CLUSTER_DISTRIBUTION (Hackathon 2025)
- #34 - Added the EXTRACT_UNIQUE_CLUSTER_REPS module, which calculates initial MMseqs clustering metadata for each sample, to print with MultiQC (Id, Cluster Size, Number of Clusters)
Fixed
- #69 - Fixed a bug where redundant family alignments were not published properly if the intra-family redundancy removal mechanism was switched off (#68)
- #65 - Fixed a bug in CHUNK_CLUSTERS, where the pipeline would crash if the module filtered out all clusters due to a high membership threshold (#64)
- #35 - Fixed a bug in remove_redundant_fams.py, where family sizes were compared as strings instead of integers when deciding which (larger) family to keep
- #33 - Fixed an always-true condition in the filter_non_redundant_hmms.py script by adding missing parentheses
- #29 - Fixed the hmmalign empty input crash error by preventing the FILTER_RECRUITED module from creating an empty output .fasta.gz file when there are no remaining sequences after filtering the hmmsearch results (#28)
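The #35 fix concerns a classic pitfall: numbers parsed from text compare character by character until they are converted. The snippet below illustrates the difference; it is a generic sketch and not the code from remove_redundant_fams.py. The function name, identifiers and tie-breaking rule are assumptions.

```python
def larger_family(id_a, size_a, id_b, size_b):
    """Return the identifier of the larger family.

    Sizes are converted to int first, because values parsed from text files
    are strings. Ties keep the first family (an arbitrary choice made here).
    """
    return id_a if int(size_a) >= int(size_b) else id_b


# As strings, "9" compares greater than "10" because "9" sorts after "1".
string_comparison = "9" > "10"
# As integers, the comparison matches the intended meaning.
integer_comparison = int("9") > int("10")
# With the conversion in place, the 10-member family is kept.
kept = larger_family("famA", "9", "famB", "10")
```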
Changed
- #69 - Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive. REMOVE_REDUNDANT_FAMS local module converted to IDENTIFY_REDUNDANT_FAMS to extract redundant family ids, which will then be used downstream. FILTER_NON_REDUNDANT_HMMS local module converted to FILTER_NON_REDUNDANT_FAMS and reused four times (HMM, seed MSA, full MSA, FASTA). Changed the output format of the EXTRACT_FAMILY_REPS and REMOVE_REDUNDANT_SEQS local modules from .fa to .faa. Metro map updated with new hhsuite/reformat module.
- #57 - Slight improvements of nextflow_schema.json (Hackathon 2025)
- #57 - Slight improvements of assets/schema_input.json (Hackathon 2025)
- #34 - Swapped the SeqIO python library with pyfastx for the CHUNK_CLUSTERS module, quartering its duration
- #32 - Updated ClipKIT 2.4.0 -> 2.4.1, which now also allows ends-only trimming, to completely replace the custom CLIP_ENDS module. Users can now also define its output format by setting the --clipkit_out_format parameter (default: clipkit)
Dependencies
| Tool | Previous version | New version | 
|---|---|---|
| ClipKIT | 2.4.0 | 2.4.1 | 
| pyfastx | - | 2.2.0 |
| hhsuite | - | 3.3.0 |
| multiqc | 1.27 | 1.28 | 
Deprecated
- #32 - Deprecated CLIP_ENDS module and --clipping_tool parameter. The only option now is ClipKIT, covering both previous modes, via setting --trim_ends_only
Initial release of nf-core/proteinfamilies, created with the nf-core template.
Added
- Amino acid sequence clustering (mmseqs)
- Multiple sequence alignment (famsa, mafft, clipkit)
- Hidden Markov Model generation (hmmer)
- Between families redundancy removal (hmmer)
- In-family sequence redundancy removal (mmseqs)
- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
- Family statistics presentation (multiqc)
By @vagkaratzas and @mberacochea.