June 30, 2023
At a Glance
- Researchers from Stanford unveiled HyenaDNA, a large language model pre-trained on human genomic sequences of up to one million tokens.
- The team behind it believes future iterations could be used to prompt tools like ChatGPT for questions about diseases.
AI researchers from Stanford, together with Turing Award winner Yoshua Bengio, have trained a large language model on human genome data to better predict DNA profiles.
Researchers created HyenaDNA by taking the Hyena large language model architecture and pre-training it on human reference genomic sequences of up to one million tokens. Prior models have typically used context lengths of 512 to 4,000 tokens, or less than 0.001% of the human genome.
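The genome-fraction figure is simple arithmetic. A rough sketch (assuming a human genome of about 3.1 billion base pairs; the exact size varies by assembly, and the one-token-per-nucleotide mapping is HyenaDNA's):

```python
# Rough arithmetic: what fraction of the human genome fits in one context window?
# The ~3.1 billion bp genome size is an approximation; it varies by assembly.
GENOME_BP = 3_100_000_000

def genome_fraction(context_tokens: int) -> float:
    """Fraction of the genome covered, assuming one token per nucleotide."""
    return context_tokens / GENOME_BP

print(f"4,000 tokens:     {genome_fraction(4_000):.6%} of the genome")
print(f"1,000,000 tokens: {genome_fraction(1_000_000):.4%} of the genome")
```

A 4,000-token window covers roughly 0.00013% of the genome, consistent with the "less than 0.001%" figure above, while a million-token window covers about 0.03%.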
Stanford’s researchers contend that most of the work on long-context models has focused on natural language and code, even though biology is “inherently made of ultralong sequences.”
The researchers hypothesize that future iterations of HyenaDNA could be prompted, ChatGPT-style, with an entire human genome and asked questions about a disease or to predict drug reactions.
Andrew Brosnan, principal analyst for AI applications in life sciences at Omdia, said the model is “another example of a smaller model tuned to a specific task performing better than a larger (Transformer) model and using less computation resource and less training time.”
He explained: “The Hyena framework achieves this by substituting the attention operation with a convolution and thus breaking the ‘quadratic barrier’ since the number of parameters doesn’t scale quadratically with increases in textual input.”
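The scaling argument can be illustrated with a simplified cost model (a sketch for intuition only, not Hyena's actual operation count): self-attention performs on the order of L² pairwise interactions over a sequence of length L, while a long convolution evaluated with an FFT costs on the order of L·log L.

```python
import math

def attention_ops(L: int) -> int:
    """Pairwise interactions in self-attention: every token attends to every token."""
    return L * L

def fft_conv_ops(L: int) -> float:
    """Long convolution via FFT scales as L * log2(L) (constant factors omitted)."""
    return L * math.log2(L)

for L in (4_000, 1_000_000):
    ratio = attention_ops(L) / fft_conv_ops(L)
    print(f"L={L:>9,}: attention needs ~{ratio:,.0f}x the ops of an FFT convolution")
```

At one million tokens the gap is roughly four orders of magnitude, which is the “quadratic barrier” Brosnan describes.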
LLaMA, Meta’s open source language model, has of late been the model of choice for many developers looking to build alternative large language models.
Stanford’s researchers and Bengio were among the team that designed Hyena, which explains why they chose it over LLaMA.
But there could be a more technical reason. Hyenas in the wild adhere to a strict hierarchy, in which status levels establish dominance among the animals. The Hyena LLM adheres to much the same concept – a hierarchy of data processing needs less time to complete a language task as it filters levels of context to a query.
How does it compare?
Genomic (DNA) sequences encode enormous amounts of information for gene regulation and protein synthesis. Previous AI genomics models also lost single-nucleotide resolution because their tokenizers aggregated DNA into larger units.
HyenaDNA, however, retains high-resolution context at each layer as the underlying model processes tokens, thanks to its hierarchical structure.
Brosnan noted the importance of high resolution in genomics, adding, “You lose resolution or fidelity with the aggregation. With gene expression a single character can make a difference.”
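The resolution point can be made concrete with a toy example (a hypothetical illustration, not the researchers' code): under k-mer tokenization, a single-nucleotide change alters an entire aggregated token, whereas character-level tokens isolate the change to exactly one position.

```python
def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Aggregate DNA into non-overlapping k-mers, as earlier genomics models often did."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def char_tokenize(seq: str) -> list[str]:
    """Single-nucleotide tokens: each base is its own token."""
    return list(seq)

ref = "ACGTACGTACGT"
mut = "ACGTACCTACGT"  # single-nucleotide change at position 6 (G -> C)

# With 4-mers, the whole middle token changes, blurring where the mutation is.
print(kmer_tokenize(ref))  # ['ACGT', 'ACGT', 'ACGT']
print(kmer_tokenize(mut))  # ['ACGT', 'ACCT', 'ACGT']

# With character tokens, exactly one token differs.
diff = [i for i, (a, b) in enumerate(zip(char_tokenize(ref), char_tokenize(mut))) if a != b]
print(diff)  # [6]
```

This is the sense in which aggregation loses fidelity: the k-mer view can only say a four-base block changed, while the single-character view pinpoints the mutated base.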
The researchers found that HyenaDNA reaches state-of-the-art performance levels on 12 of 17 benchmark datasets using a model with orders of magnitude fewer parameters and pretraining data.
On the GenomicBenchmarks, HyenaDNA was found to surpass all previous state-of-the-art models on all eight datasets, by +9 accuracy points on average.
The largest version of HyenaDNA was pre-trained on a single node of eight Nvidia A100s.