nf-core/proteinfold

Protein 3D structure prediction pipeline

alphafold2, colabfold, esmfold, protein-fold-prediction, protein-folding, protein-sequences, protein-structure

This is the development version of the pipeline.

Repository: https://github.com/nf-core/proteinfold

Introduction

nf-core/proteinfold is a bioinformatics best-practice analysis pipeline for protein 3D structure prediction.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, which simplifies installation and makes results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

Mode	Protein	RNA	Small-molecule	PTM	Constraints	pLM	MSA server	Split MSA
AlphaFold2	✅	❌	❌	❌	❌	❌	❌	✅
ESMFold	✅	❌	❌	❌	❌	✅	❌	❌
ColabFold	✅	❌	❌	❌	❌	❌	✅	✅
RoseTTAFold2NA	✅	✅	❌	❌	❌	❌	❌	❌
RoseTTAFold-All-Atom	✅	✅	✅	✅	❌	❌	❌	❌
AlphaFold3	✅	✅	✅	✅	❌	❌	❌	❌
HelixFold3	✅	✅	✅	✅	❌	❌	❌	❌
Boltz	✅	✅	✅	✅	✅	❌	✅	✅

nf-core/proteinfold supports multiple tools for general molecular structure prediction. The methods overlap in functionality, as summarized in the table above, and all of them can predict protein structure from an input amino acid sequence. The pipeline is composed of the following steps:

1. Split input FASTA file (optional): The pipeline can split large batches of monomeric sequences (e.g. an entire genome) from a multi-entry FASTA input using the --split_fasta flag.
2. Prepare databases for chosen methods: The pipeline downloads any required reference data.
3. Structure prediction:

i. Combined: MSA Search + Model Inference: Structures are predicted from MSAs generated using built-in homolog search pipelines.

ii. Split: AlphaFold2 MSA Search + Model Inference: The AlphaFold2 MSA generation pipeline is executed independently and then provided as input for AlphaFold2 structure prediction.

iii. Split: ColabFold MSA Search + Model Inference: The ColabFold MSA generation pipeline is used to produce input MSAs that can be used by ColabFold and Boltz.

iv. pLM: Protein Language Model: The ESMFold model is used to predict structures without generating an MSA.

4. Generate Report: The pipeline produces an interactive HTML report to visualize structure prediction outputs.
5. Comparison Report: The structures predicted by parallel modes are combined in an interactive HTML report.
6. MultiQC: The overall QC statistics are summarized.
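
To make the optional splitting step concrete, the sketch below separates a multi-entry FASTA file into one file per record using awk. It only illustrates the idea and is not the pipeline's own implementation; the function name split_fasta and the output naming scheme (first word of each header plus .fasta) are assumptions made for this example.

```shell
# Illustration only (not the pipeline's code): split a multi-entry FASTA
# into one file per record. Hypothetical helper name and naming scheme.
split_fasta() {
    in_fasta="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        /^>/ {
            # A new record starts: close the previous file, open the next one.
            if (out != "") close(out)
            # File name = first word of the header without the leading ">".
            out = dir "/" substr($1, 2) ".fasta"
        }
        # Write header and sequence lines to the current record file.
        out != "" { print > out }
    ' "$in_fasta"
}

# Example: split_fasta proteome.fasta split_sequences
```

Records that share the same identifier would overwrite each other in this sketch, and identifiers containing characters such as "/" would need sanitizing first. Within the pipeline, splitting is requested with the --split_fasta flag.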

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv

id,fasta
T1024,T1024.fasta
T1026,T1026.fasta
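
For many inputs, writing the samplesheet by hand gets tedious. The following is a minimal sketch, assuming one sequence per .fasta file in a single directory; the helper name make_samplesheet is hypothetical, and each id is simply the file name without its extension.

```shell
# Hypothetical helper: write a samplesheet with one row per *.fasta file.
make_samplesheet() {
    fasta_dir="$1"
    out_csv="$2"
    # Header expected by the pipeline, as shown in the example above.
    printf 'id,fasta\n' > "$out_csv"
    for f in "$fasta_dir"/*.fasta; do
        # Skip the unexpanded pattern when the directory has no FASTA files.
        [ -e "$f" ] || continue
        # id = file name without the .fasta extension
        printf '%s,%s\n' "$(basename "$f" .fasta)" "$f" >> "$out_csv"
    done
}

# Example: make_samplesheet ./sequences samplesheet.csv
```

The fasta column then holds each path as given on the command line, so pass an absolute directory if the pipeline will be launched from a different location.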

Now, you can run the pipeline using:

nextflow run nf-core/proteinfold \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --mode <alphafold2/esmfold/colabfold/rosettafold2na/rosettafold-all-atom/alphafold3/boltz/helixfold3>

The pipeline takes care of downloading the databases and parameters required by each of the modes. If you have already downloaded the required files, you can skip this step by providing the path to the databases using the --db parameter.

nextflow run nf-core/proteinfold \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR> \
   --mode <MODE> \
   --db <DBDIR>

Warning

The reference data for most methods is extremely large and may exceed individual user disk allocations on shared HPC systems.

To run multiple methods simultaneously when their reference data is stored at different locations, the --db flag can be overridden for each specific mode (e.g. --alphafold2_db, --colabfold_db, --esmfold_db and --rosettafold_all_atom_db). Please refer to the usage documentation to check the directory structure you must provide for each database.
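
As a rough illustration of per-mode database flags, the sketch below prints a launch command that points two modes at different locations. The paths, the docker profile and the results directory are placeholders, and passing several modes to --mode as a comma-separated list is an assumption here; check the parameter documentation for the exact syntax.

```shell
# Hypothetical database locations; replace with your own paths.
ALPHAFOLD2_DB=/data/shared/alphafold2
COLABFOLD_DB=/scratch/colabfold

# Print the command for review instead of launching it.
# Assumption: --mode takes a comma-separated list to run several modes at once.
print_launch_cmd() {
    echo nextflow run nf-core/proteinfold \
        -profile docker \
        --input samplesheet.csv \
        --outdir results \
        --mode alphafold2,colabfold \
        --alphafold2_db "$ALPHAFOLD2_DB" \
        --colabfold_db "$COLABFOLD_DB"
}

print_launch_cmd
```

Once the printed command looks right, it can be run directly in the shell.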

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
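
As a sketch of the -params-file route, the snippet below writes the parameters from the earlier command into a YAML file. The file name params.yaml and the values (results, esmfold) are placeholders chosen for illustration; the keys mirror the --input, --outdir and --mode options.

```shell
# Write pipeline parameters to a YAML file (placeholder values).
cat > params.yaml <<'EOF'
input: samplesheet.csv
outdir: results
mode: esmfold
EOF

# The file is then passed to Nextflow, for example:
# nextflow run nf-core/proteinfold -profile docker -params-file params.yaml
```

Keeping parameters in a file like this also leaves a record of exactly what a given run used.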

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Adding new modes to the pipeline

For details on how to contribute new modes to the pipeline, please refer to the guide on how to contribute new modes.

Credits

nf-core/proteinfold was originally written by Athanasios Baltzis (@athbaltzis), Jose Espinosa-Carrasco (@JoseEspinosa), Luisa Santus (@luisas) and Leila Mansouri (@l-mansouri) from The Comparative Bioinformatics Group at The Centre for Genomic Regulation, Spain under the umbrella of the BovReg project and Harshil Patel (@drpatelh) from Seqera Labs, Spain.

Many thanks to others who have helped out and contributed along the way too, including (but not limited to): Norman Goodacre and Waleed Osman from Interline Therapeutics (@interlinetx), Martin Steinegger (@martin-steinegger) and Raoul J.P. Bonnal (@rjpbonnal).

We would also like to thank the AWS Open Data Sponsorship Program for generously providing the resources necessary to host the data used in the testing, development, and deployment of nf-core/proteinfold.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don’t hesitate to get in touch on the Slack #proteinfold channel (you can join with this invite).

Citations

If you use nf-core/proteinfold for your analysis, please cite it using the following DOI: 10.5281/zenodo.7437038

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
