nf-core/proteinfold
Protein 3D structure prediction pipeline
Introduction
nf-core/proteinfold is a bioinformatics best-practice analysis pipeline for Protein 3D structure prediction.
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!
On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.
Pipeline summary

| Mode | Protein | RNA | Small-molecule | PTM | Constraints | pLM | MSA server | Split MSA |
|---|---|---|---|---|---|---|---|---|
| AlphaFold2 | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| ESMFold | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| ColabFold | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ |
| RoseTTAFold2NA | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| RoseTTAFold-All-Atom | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| AlphaFold3 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| HelixFold3 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Boltz | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |

nf-core/proteinfold supports multiple tools for general molecular structure prediction. The methods have overlapping functionality, which can be used within the pipeline. All tools support predicting protein structure from an input amino acid sequence. The pipeline is composed of the following steps:
- Split input fasta file (Optional): The pipeline can split large batches of monomeric sequences (e.g. an entire genome) from a multi-entry fasta input using the --split_fasta flag.
- Prepare databases for chosen methods: The pipeline downloads any required reference data.
- Structure prediction:
  i. Combined: MSA Search + Model Inference: Structures are predicted from MSAs generated using built-in homolog search pipelines.
  ii. Split: AlphaFold2 MSA Search + Model Inference: The AlphaFold2 MSA generation pipeline is executed independently and then provided as input for AlphaFold2 structure prediction.
  iii. Split: ColabFold MSA Search + Model Inference: The ColabFold MSA generation pipeline is used to produce input MSAs which can be used by ColabFold and Boltz.
  iv. pLM: Protein Language Model: The ESMFold model is used to predict structures without generating an MSA.
- Generate Report: The pipeline produces an interactive HTML report to visualize structure prediction outputs.
- Comparison Report: The structures predicted by parallel modes are combined in an interactive HTML report.
- MultiQC: The overall QC statistics are summarized.
Usage
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First, prepare a samplesheet with your input data that looks as follows:

    id,fasta
    T1024,T1024.fasta
    T1026,T1026.fasta

Now, you can run the pipeline using:

    nextflow run nf-core/proteinfold \
        -profile <docker/singularity/.../institute> \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --mode <alphafold2/esmfold/colabfold/rosettafold2na/rosettafold-all-atom/alphafold3/boltz/helixfold3>

The pipeline takes care of downloading the databases and parameters required by each of the modes. In case you have already downloaded the required files, you can skip this step by providing the path to the databases using the --db parameter.

    nextflow run nf-core/proteinfold \
        -profile <docker/singularity/.../institute> \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --mode <MODE> \
        --db <DBDIR>

The reference data for most methods is extremely large and may exceed individual user disk allocations on shared HPC systems.
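Because the reference data may not fit in a personal allocation, it can help to check the free space on the target filesystem before the download starts. The sketch below is not part of the pipeline: has_free_space is a hypothetical helper name, and the 100 GiB threshold is a placeholder, since the space actually required depends on the chosen mode.

```shell
# Hypothetical helper (not part of nf-core/proteinfold): succeed only if the
# filesystem holding DIR has at least MIN_GB gibibytes available.
has_free_space() {
    dir="$1"
    min_gb="$2"
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $min_gb * 1024 * 1024 )) ]
}

# Example check; the 100 GiB threshold is a placeholder, not a documented requirement.
if has_free_space "${DBDIR:-.}" 100; then
    echo "enough space for the download"
else
    echo "not enough space; point --db at a larger filesystem"
fi
```

Set DBDIR to the directory you intend to pass to --db before running the check; without it the current directory is inspected.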
To run multiple methods simultaneously when their reference data is stored in different locations, the --db flag can be overridden for each specific mode (e.g. --alphafold2_db, --colabfold_db, --esmfold_db and --rosettafold_all_atom_db). Please refer to the usage documentation to check the directory structure you must provide for each database.
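As an illustration of the per-mode database flags, the sketch below prints (rather than runs) a launch command for two modes whose databases live in different places. It rests on assumptions: the comma-separated --mode value should be checked against the parameter documentation, multi_db_cmd is a hypothetical helper, and both paths are placeholders.

```shell
# Hypothetical helper: print a launch command that gives AlphaFold2 and
# ColabFold their own database directories via the per-mode flags.
# The comma-separated --mode value is an assumption; check the parameter docs.
multi_db_cmd() {
    af2_db="$1"    # placeholder AlphaFold2 database directory
    cf_db="$2"     # placeholder ColabFold database directory
    printf '%s\n' "nextflow run nf-core/proteinfold -profile docker --input samplesheet.csv --outdir results --mode alphafold2,colabfold --alphafold2_db $af2_db --colabfold_db $cf_db"
}

# Dry run: prints the command so it can be reviewed before launching.
multi_db_cmd /data/alphafold2 /scratch/colabfold
```

Printing the command first makes it easy to confirm that each mode points at the intended directory before committing compute time.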
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
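A parameters file might look like the sketch below, which writes a small YAML file and prints the matching launch command. The keys mirror the CLI flags shown earlier in this section; the file name params.yaml and every value are placeholders to replace with your own.

```shell
# Write the pipeline parameters to a YAML file. Keys mirror the CLI flags
# (--input, --outdir, --mode, --db); all values are placeholders.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
outdir: "results"
mode: "esmfold"
db: "/path/to/databases"
EOF

# Launch command, printed rather than executed so the snippet is safe to paste.
echo "nextflow run nf-core/proteinfold -profile docker -params-file params.yaml"
```

Keeping parameters in a file rather than on the command line makes a run easier to repeat and to share alongside the results.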
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
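For larger batches, the samplesheet format shown earlier in this section (an id,fasta header followed by one row per sequence file) can be generated rather than typed by hand. The following is a minimal sketch, assuming one sequence per .fasta file and that the file name without its extension is an acceptable id; make_samplesheet is a hypothetical helper, not a pipeline command.

```shell
# Hypothetical helper: write an "id,fasta" samplesheet with one row per
# *.fasta file in a directory, using the file name (minus extension) as the id.
make_samplesheet() {
    fasta_dir="$1"
    out_csv="$2"
    echo "id,fasta" > "$out_csv"
    for f in "$fasta_dir"/*.fasta; do
        [ -e "$f" ] || continue    # the glob matched nothing
        printf '%s,%s\n' "$(basename "$f" .fasta)" "$f" >> "$out_csv"
    done
}
```

For example, make_samplesheet ./fasta samplesheet.csv writes one row for each file under ./fasta. Paths are written exactly as the directory was given, so pass an absolute directory if you want absolute paths in the samplesheet.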
Pipeline output
To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
Adding new modes to the pipeline
For details on how to contribute new modes to the pipeline, please refer to the how-to guide on contributing new modes.
Credits
nf-core/proteinfold was originally written by Athanasios Baltzis (@athbaltzis), Jose Espinosa-Carrasco (@JoseEspinosa), Luisa Santus (@luisas) and Leila Mansouri (@l-mansouri) from The Comparative Bioinformatics Group at The Centre for Genomic Regulation, Spain under the umbrella of the BovReg project and Harshil Patel (@drpatelh) from Seqera Labs, Spain.
Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics (@interlinetx), Martin Steinegger (@martin-steinegger) and Raoul J.P. Bonnal (@rjpbonnal).
We would also like to thank the AWS Open Data Sponsorship Program for generously providing the resources necessary to host the data utilized in the testing, development, and deployment of nf-core/proteinfold.
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don’t hesitate to get in touch on the Slack #proteinfold channel (you can join with this invite).
Citations
If you use nf-core/proteinfold for your analysis, please cite it using the following doi: 10.5281/zenodo.7437038
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.