June 30, 2023
At a Glance
- Researchers from Stanford unveiled HyenaDNA, a large language model pre-trained on human genomic sequences of up to one million tokens.
- The team behind it believes future iterations could be used to prompt tools like ChatGPT for questions about diseases.
AI researchers from Stanford, together with Turing Award winner Yoshua Bengio, have trained a large language model on human genome data to better predict DNA profiles.
Researchers created HyenaDNA by taking the Hyena large language model architecture and pre-training it on human reference genomic sequences of up to one million tokens. Prior models have typically used context lengths of 512 to 4,000 tokens, or less than 0.001% of the human genome.
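The genome-fraction figure is simple arithmetic. A rough sketch (assuming a human genome of about 3.1 billion base pairs; the exact size varies by assembly, and the one-token-per-nucleotide mapping is HyenaDNA's):

```python
# Rough arithmetic: what fraction of the human genome fits in one context window?
# The ~3.1 billion bp genome size is an approximation; it varies by assembly.
GENOME_BP = 3_100_000_000

def genome_fraction(context_tokens: int) -> float:
    """Fraction of the genome covered, assuming one token per nucleotide."""
    return context_tokens / GENOME_BP

print(f"4,000 tokens:     {genome_fraction(4_000):.6%} of the genome")
print(f"1,000,000 tokens: {genome_fraction(1_000_000):.4%} of the genome")
```

A 4,000-token window covers roughly 0.00013% of the genome, consistent with the "less than 0.001%" figure above, while a million-token window covers about 0.03%.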
Stanford’s researchers contend that most of the work on long-context models has focused on natural language and code, even though biology is “inherently made of ultralong sequences.”
The researchers hypothesize that future iterations of HyenaDNA could be prompted, ChatGPT-style, with an entire human genome and asked questions about a disease or to predict drug reactions.
Andrew Brosnan, principal analyst for AI applications in life sciences at Omdia, said the model is “another example of a smaller model tuned to a specific task performing better than a larger (Transformer) model and using less computation resource and less training time.”
He explained: “The Hyena framework achieves this by substituting the attention operation with a convolution and thus breaking the ‘quadratic barrier’ since the number of parameters doesn’t scale quadratically with increases in textual input.”
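The scaling argument can be illustrated with a simplified cost model (a sketch for intuition only, not Hyena's actual operation count): self-attention performs on the order of L² pairwise interactions over a sequence of length L, while a long convolution evaluated with an FFT costs on the order of L·log L.

```python
import math

def attention_ops(L: int) -> int:
    """Pairwise interactions in self-attention: every token attends to every token."""
    return L * L

def fft_conv_ops(L: int) -> float:
    """Long convolution via FFT scales as L * log2(L) (constant factors omitted)."""
    return L * math.log2(L)

for L in (4_000, 1_000_000):
    ratio = attention_ops(L) / fft_conv_ops(L)
    print(f"L={L:>9,}: attention needs ~{ratio:,.0f}x the ops of an FFT convolution")
```

At one million tokens the gap is roughly four orders of magnitude, which is the “quadratic barrier” Brosnan describes.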
LLaMA, Meta’s open source language model, has of late been the model of choice for many developers looking to build alternative large language models.
Stanford’s researchers and Bengio were among the team that designed Hyena, which explains why they chose it over LLaMA.
But there could be a more technical reason. Hyenas in the wild adhere to a strict hierarchy, in which status levels establish dominance among the animals. The Hyena LLM adheres to much the same concept – a hierarchy of data processing needs less time to complete a language task as it filters levels of context to a query.
How does it compare?
Genomic (DNA) sequences encode enormous amounts of information for gene regulation and protein synthesis. Previous AI genomics models also lost single-nucleotide resolution because their tokenizers aggregated DNA into larger units.
HyenaDNA, however, retains high-resolution context at each layer as the underlying model processes tokens, thanks to its hierarchical structure.
Brosnan noted the importance of high resolution in genomics, adding, “You lose resolution or fidelity with the aggregation. With gene expression a single character can make a difference.”
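The resolution point can be made concrete with a toy example (a hypothetical illustration, not the researchers' code): under k-mer tokenization, a single-nucleotide change alters an entire aggregated token, whereas character-level tokens isolate the change to exactly one position.

```python
def kmer_tokenize(seq: str, k: int = 4) -> list[str]:
    """Aggregate DNA into non-overlapping k-mers, as earlier genomics models often did."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def char_tokenize(seq: str) -> list[str]:
    """Single-nucleotide tokens: each base is its own token."""
    return list(seq)

ref = "ACGTACGTACGT"
mut = "ACGTACCTACGT"  # single-nucleotide change at position 6 (G -> C)

# With 4-mers, the whole middle token changes, blurring where the mutation is.
print(kmer_tokenize(ref))  # ['ACGT', 'ACGT', 'ACGT']
print(kmer_tokenize(mut))  # ['ACGT', 'ACCT', 'ACGT']

# With character tokens, exactly one token differs.
diff = [i for i, (a, b) in enumerate(zip(char_tokenize(ref), char_tokenize(mut))) if a != b]
print(diff)  # [6]
```

This is the sense in which aggregation loses fidelity: the k-mer view can only say a four-base block changed, while the single-character view pinpoints the mutated base.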
The researchers found that HyenaDNA reaches state-of-the-art performance levels on 12 of 17 benchmark datasets using a model with orders of magnitude fewer parameters and pretraining data.
On the GenomicBenchmarks, HyenaDNA was found to surpass all previous state-of-the-art models on all eight datasets, by +9 accuracy points on average.
The largest version of HyenaDNA was pre-trained on a single node of eight Nvidia A100s.