AI researchers unveil a spherical protein-folding model

ProtGPT2 generates sequences with properties ‘akin to their natural counterparts’

August 16, 2022



Academics from the University of Bayreuth in Germany have thrown their hats into the growing ring for protein prediction AI models by unveiling ProtGPT2 — in a bid to accelerate drug discovery and better understand diseases.

The unsupervised language model can generate protein sequences that follow principles similar to those found in naturally occurring proteins, according to a paper published in Nature Communications. Proteins generated by ProtGPT2 display natural amino acid propensities; that is, their amino acids occur at frequencies similar to those in natural proteins.
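To make "amino acid propensities" concrete, here is a minimal Python sketch that computes how often each of the 20 standard amino acids appears in a sequence and compares two composition profiles. The function names and the choice of total variation distance are illustrative assumptions, not the analysis used in the paper.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in `sequence`."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def propensity_distance(seq_a: str, seq_b: str) -> float:
    """Total variation distance between two composition profiles.

    0.0 means identical composition; 1.0 means no amino acids in common.
    This metric is a hypothetical choice for illustration only.
    """
    freq_a = amino_acid_frequencies(seq_a)
    freq_b = amino_acid_frequencies(seq_b)
    return 0.5 * sum(abs(freq_a[aa] - freq_b[aa]) for aa in AMINO_ACIDS)
```

In practice, a comparison like this would be run over large sets of generated and natural sequences rather than over a single pair.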

The authors’ findings show that 88% of ProtGPT2-generated proteins are globular (effectively, spherical proteins that have a wide range of functions inside a cell), in line with natural sequences. AlphaFold, by contrast, predicts the 3D structure of existing sequences rather than generating new proteins.

Language models suit protein research because protein sequences and human languages share various similarities. Protein sequences consist of a “chemically defined” alphabet, the authors explained. “These ‘letters’ arrange to form secondary structural elements (words), which assemble to form domains (sentences) that undertake a function (meaning).” Thus, advances in language models enable great strides in protein models as well.

ProtGPT2, a transformer-based pre-trained model, boasts 738 million parameters and generates sequences that “show predicted stabilities and dynamic properties akin to their natural counterparts,” the authors wrote.

Figure 1: Examples of proteins generated by ProtGPT2 (Image credit: Nature)

Globular proteins such as those generated by ProtGPT2 can act as enzymes, as stores of amino acids, or as messengers, such as hormones, that transmit signals to regulate biological processes.

“Since protein design has an enormous potential to solve problems in fields ranging from biomedical to environmental sciences, we believe that ProtGPT2 is a timely advance towards efficient high-throughput protein engineering and design,” the paper reads.

The model and datasets are available via HuggingFace and have already been downloaded over 4,300 times.
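For readers who want to try the model, the following is a hedged sketch of sampling sequences with the Hugging Face `transformers` library. The repository id `nferruz/ProtGPT2`, the `<|endoftext|>` prompt, and the sampling settings are assumptions not stated in this article and should be checked against the model card. `clean_generated_text` and `generate_sequences` are hypothetical helper names.

```python
from typing import List


def clean_generated_text(text: str) -> str:
    """Strip the end-of-text marker and all whitespace from model output."""
    return "".join(text.replace("<|endoftext|>", "").split())


def generate_sequences(n: int = 3, max_length: int = 100) -> List[str]:
    """Sample `n` sequences from ProtGPT2 (downloads the weights on first use)."""
    # Imported lazily so the helper above works without `transformers` installed.
    from transformers import pipeline

    # Assumption: this repository id hosts the model; verify on Hugging Face.
    generator = pipeline("text-generation", model="nferruz/ProtGPT2")
    outputs = generator(
        "<|endoftext|>",         # assumed start-of-sequence prompt
        max_length=max_length,   # length in tokens, not amino acids
        do_sample=True,          # sample instead of greedy decoding
        top_k=950,               # assumed sampling settings
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,          # assumed id of the end-of-text token
    )
    return [clean_generated_text(o["generated_text"]) for o in outputs]
```

Calling `generate_sequences(3)` would download the model weights on first use, so it needs network access and several gigabytes of disk space.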

Faster drug discovery

Scientists believe that the ability to predict a protein’s 3D structure could enable faster drug discovery thanks to a better understanding of how the body’s proteins relate to diseases.

Each cell in the human body contains billions of proteins that control vital functions. Those proteins are made of amino acids arranged in chains. The chains fold into a 3D shape based on the interactions between their amino acids, and that shape determines the tasks a protein performs in the body, such as carrying oxygen in the blood from the lungs to body tissues.


Other protein models

There are several competing AI-powered protein prediction models around.

Arguably the most famous is AlphaFold, developed by Google-owned DeepMind. The deep-learning neural network has 21 million parameters and was trained on more than 170,000 proteins from a public repository of protein sequences and structures.

The system uses an attention network, a deep learning technique in which an algorithm identifies parts of a larger problem and then pieces them together to obtain the overall solution. It can do this in minutes or hours, depending on the size of the protein.
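The core idea behind attention, weighting the parts of an input by relevance and then combining them, can be sketched in a few lines of Python. This is generic scaled dot-product attention for a single query, written with the standard library only. It is an illustration of the technique in general, not AlphaFold's actual architecture, and all names here are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Score each key by its similarity to the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is the weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

Keys that resemble the query receive larger weights, so their values dominate the output; this is the sense in which the network "pieces together" the relevant parts.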

In late July, DeepMind published predicted structures for 200 million proteins, which it claimed represent “nearly all cataloged proteins known to science.”

About the Author(s)

Ben Wodecki

Jr. Editor

Ben Wodecki is the Jr. Editor of AI Business, covering a wide range of AI content. Ben joined the team in March 2021 as assistant editor and was promoted to Jr. Editor. He has written for The New Statesman, Intellectual Property Magazine, and The Telegraph India, among others. He holds an MSc in Digital Journalism from Middlesex University.
