Release Time:2023/8/9 15:04:00

This three-year-old start-up has, for the first time, used a deep learning language model to generate new proteins that do not exist in nature, setting off a revolution in protein design.

The application of artificial intelligence has greatly accelerated the research of protein engineering.

Recently, a fledgling startup in Berkeley, California, took another stunning step forward.

Using ProGen, a ChatGPT-like deep learning language model for protein engineering, scientists have, for the first time, synthesized proteins that were designed by an AI model.

These proteins were not only very different from known ones, with the lowest sequence similarity being just 31.4%, but were also just as effective as natural proteins.

The work has now been published in the journal Nature Biotechnology.

The paper address: https://www.nature.com/articles/s41587-022-01618-2

The experiment also shows that natural language processing, although developed for reading and writing language texts, can also learn some basic principles of biology.

Rivaling a Nobel Prize-winning technique

In response, the researchers say the new technique could become more powerful than directed evolution, the Nobel Prize-winning protein design technique.

"It will breathe life into the 50-year-old field of protein engineering by accelerating the development of new proteins that can be used for almost everything from therapeutic agents to degradable plastics."

The company, Profluent, founded by the former head of AI research at Salesforce, has received $9 million in start-up funding to build an integrated wet lab and recruit machine learning scientists and biologists.

In the past, mining proteins from nature, or adapting them to a desired function, has been laborious. Profluent's goal is to make the process effortless.

And they did.

Profluent founder and CEO Ali Madani

Madani said in an interview that Profluent has designed several families of proteins. These proteins function like exemplar proteins and are therefore highly active enzymes.

The task was very difficult and was done in a zero-shot manner, which meant that there were no rounds of optimization, or even any data from the wet lab at all.

The resulting proteins are highly active, of a kind that would normally take hundreds of years to evolve.

ProGen: built on a language model

As a kind of deep neural network, a conditional language model can not only generate novel natural language text that is semantically and grammatically correct, but can also use input control tags to steer style, topic, and so on.

Similarly, the researchers developed ProGen, the subject of this story: a conditional protein language model with 1.2 billion parameters.

Specifically, ProGen is based on the Transformer architecture, models residue interactions through a self-attention mechanism, and can generate diverse artificial protein sequences across protein families based on input control tags.
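To make the idea of tag-conditioned generation concrete, here is a minimal sketch. It is not ProGen and not a Transformer: it swaps in a simple per-tag bigram (Markov chain) model, which only looks one residue back, purely to illustrate how a control tag can steer what an autoregressive sampler produces. The tag names (family_A, family_B) and the training strings are made up for illustration.

```python
import random

def train_bigram(sequences_by_tag):
    """Count residue-to-residue transitions separately for each control tag."""
    counts = {}
    for tag, seqs in sequences_by_tag.items():
        table = counts.setdefault(tag, {})
        for seq in seqs:
            prev = "^"                 # start-of-sequence marker
            for aa in seq + "$":       # "$" marks end of sequence
                row = table.setdefault(prev, {})
                row[aa] = row.get(aa, 0) + 1
                prev = aa
    return counts

def generate(counts, tag, max_len=50, seed=0):
    """Sample one sequence, residue by residue, conditioned on the tag."""
    rng = random.Random(seed)
    table = counts[tag]                # the tag selects which statistics to use
    prev, out = "^", []
    while len(out) < max_len:
        row = table[prev]
        aa = rng.choices(list(row), weights=list(row.values()))[0]
        if aa == "$":                  # model chose to end the sequence
            break
        out.append(aa)
        prev = aa
    return "".join(out)

# Hypothetical toy data: two made-up "families" with different residue usage.
toy_data = {
    "family_A": ["ACDKACDK", "ACEKACEK"],
    "family_B": ["MMWWMMWW", "MWMWMWMW"],
}
counts = train_bigram(toy_data)
sample_a = generate(counts, "family_A", seed=1)
sample_b = generate(counts, "family_B", seed=1)
print(sample_a, sample_b)
```

With the same random seed, changing only the tag changes which residues can appear in the output. A real conditional language model conditions on the tag and on the entire preceding sequence, not just the previous residue.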

Artificial proteins are generated by conditional language models

To create the model, the researchers fed it the amino acid sequences of 280 million different proteins and let it "digest" them over a period of several weeks.

They then fine-tuned the model with 56,000 sequences from five lysozyme families and information about these proteins.

ProGen's algorithm is similar to GPT-3.5, the model behind ChatGPT, in that it learns how amino acids are ordered in proteins and how they relate to protein structure and function.

Soon, the model generated a million sequences.

Based on how similar they were to natural protein sequences, and how natural the amino acid "syntax" and "semantics" were, the researchers chose 100 to test.

Of these, 66 produced chemical reactions similar to the natural proteins that destroy bacteria in egg whites and saliva.

In other words, these new AI-generated proteins can also kill bacteria.

The generated artificial proteins were diverse and well expressed in the experimental system

Going one step further, the researchers selected the five proteins with the strongest response and added them to a sample of E. coli.

Among them, two artificial enzymes were able to break down the bacterial cell wall.

A comparison with hen egg white lysozyme (HEWL) showed that their activity was equivalent to that of HEWL.

The researchers then imaged the enzymes with X-rays.

Although the amino acid sequences of the artificial enzymes differ by up to 30 percent from those of existing proteins and are only 18 percent identical to each other, their shapes are similar to those of natural proteins and their function is comparable.

Applicability of conditional language modeling to other protein systems

In addition, for a highly evolved natural protein, it may only take a small mutation to stop it working.

But in another round of screening, the researchers found that AI-generated enzymes showed comparable activity and similar structures even when only 31.4 percent of their sequence matched any known protein.
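Figures such as 31.4 percent refer to sequence identity. As a rough illustration of what that measures, the sketch below computes percent identity between two sequences that have already been aligned to the same length, with "-" marking gaps. The denominator convention used here (every column in which at least one sequence has a residue) is an assumption; published analyses use several different conventions, and this is not necessarily the one the paper used. The example strings are made up.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of compared alignment columns where both residues match.

    Assumes `a` and `b` are already aligned (equal length, '-' for gaps).
    Columns where both sequences have a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                   # gap in both: not a real column
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

def max_identity(candidate: str, references: list) -> float:
    """Identity to the closest reference; low values mean a novel sequence."""
    return max(percent_identity(candidate, ref) for ref in references)

print(percent_identity("ACDE", "ACDF"))          # 3 of 4 columns match
print(max_identity("ACDE", ["ACDF", "WWWW"]))    # closest reference wins
```

A candidate is considered far from nature when its identity to even the closest known sequence is low, which is why the maximum over all references is the relevant number.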

Protein design, entering a new era

As you can see, ProGen works in a similar way to ChatGPT.

ChatGPT has been able to take MBA and bar exams and write college essays after learning from massive amounts of data.

ProGen learned how to make new proteins by learning the grammar of how amino acids combine in 280 million existing proteins.

In an interview, Madani said, "Just as ChatGPT learns human languages like English, we are learning the language of biology and proteins."

"Artificially designed proteins perform much better than proteins inspired by evolutionary processes," said James Fraser, one of the paper's authors and a professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy.

"The language model is learning aspects of evolution, but it's different from the normal evolutionary process. We now have the ability to tweak the production of these properties to get specific effects. For example, to make an enzyme incredibly thermostable, or like an acidic environment, or not interact with other proteins."

Back in 2020, Salesforce Research developed ProGen. It is based on natural language processing techniques that were originally developed to generate English text.

From previous work, the researchers learned that AI systems can teach themselves grammar and the meaning of words, as well as other basic rules that make writing organized.

"When you train sequence-based models with a lot of data, they are very powerful at learning structures and rules," said Dr. Nikhil Naik, research director for artificial intelligence at Salesforce Research and senior author of the paper. "They learn which words can appear together and how to combine them."

"Now that we have demonstrated ProGen's ability to generate new proteins and published it publicly, everyone can build on our foundation."

As a protein, lysozyme is very small, with a maximum of about 300 amino acids.

But with 20 possible amino acids at each position, a chain of 300 residues has 20^300 possible combinations.

That's more than the number of all human beings, multiplied by the number of grains of sand on Earth, multiplied by the number of atoms in the universe.
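For a sense of scale, the number of possible sequences can be computed directly. The short sketch below assumes a fixed length of 300 residues and 20 choices per position, as in the text; the variable names are only illustrative.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
LENGTH = 300         # approximate maximum lysozyme length from the text

total = ALPHABET_SIZE ** LENGTH            # exact integer, Python handles big ints
digits = len(str(total))                   # number of decimal digits
exponent = LENGTH * math.log10(ALPHABET_SIZE)

print(digits)               # 391
print(round(exponent, 1))   # 390.3, i.e. roughly 2 x 10^390 sequences
```

In other words, 20^300 is a 391-digit number, on the order of 10^390.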

Given the nearly infinite possibilities, it is truly remarkable that ProGen was able to design an effective enzyme with such ease.

"The ability to generate functional proteins from scratch right out of the box shows that we are entering a new era in protein design," said Dr. Ali Madani, founder of Profluent Bio and former research scientist at Salesforce Research.

"This is a versatile new tool available to all protein engineers, and we look forward to seeing it applied to therapeutics."

In the meantime, researchers continue to improve ProGen, trying to break through more limitations and challenges.

One is that it's very data-dependent.

"We've explored how to improve the design of sequences by incorporating structure-based information," Naik says, "and we're also looking at how to improve the generation of models when you don't have much data on a particular protein family or domain."

It's worth noting that there are other startups experimenting with similar technologies, such as Cradle and Generate Biomedicines from biotech incubator Flagship Pioneering, though none of these studies have yet been peer-reviewed.
