MIT scientists improve chatbot performance in extended conversations by optimizing memory usage
The longer you converse with a chatbot, the worse its responses typically become. Now, a team of researchers from MIT has developed a solution that enables the likes of ChatGPT or Gemini to chat nonstop without their performance deteriorating.
Dubbed StreamingLLM, the framework makes a change to the underlying model’s key-value (KV) cache, which acts as the conversation’s memory.
Chatbots generate responses based on user inputs, storing those inputs in the KV cache. The system creates an attention map that plots each token and how it relates to the others. A KV cache can only hold a finite amount of information and will ditch older entries as it nears capacity.
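To make that eviction behavior concrete, here is a minimal Python sketch of a bounded cache that drops its oldest entries once it is full. The class name, methods, and capacity are illustrative assumptions, not the implementation used by any of these models.

```python
from collections import deque

# Illustrative only: a bounded KV cache that evicts its oldest entries
# when full, which is why the earliest tokens of a long conversation
# eventually disappear from the model's memory.
class BoundedKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()  # each entry: (token, key_vector, value_vector)

    def append(self, token, key, value):
        if len(self.entries) >= self.capacity:
            self.entries.popleft()  # evict the oldest token's key/value pair
        self.entries.append((token, key, value))

    def cached_tokens(self):
        return [token for token, _, _ in self.entries]

cache = BoundedKVCache(capacity=4)
for i, tok in enumerate(["Hi", "there", ",", "how", "are", "you"]):
    cache.append(tok, key=f"k{i}", value=f"v{i}")
print(cache.cached_tokens())  # [',', 'how', 'are', 'you'] - the opening tokens are gone
```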
MIT’s researchers propose a Sliding Cache, which removes less essential information while ensuring the cache retains key data points.
The resulting process allows a chatbot to go on conversing with a user without the performance dropping. The StreamingLLM paper states that the solution enabled models like Llama 2 and Falcon to perform stably even when a conversation went well beyond four million tokens in length.
The method also let models return responses more than 22 times faster than an alternative approach that avoids failure by constantly recomputing part of the earlier conversation.
“By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” Guangxuan Xiao, the lead author on the StreamingLLM paper told MIT News.
The researchers found that the first few tokens of a conversation are the most important. If these get shunted out when capacity is reached, the model fails in longer conversations; if they are kept in, performance holds up. They call this phenomenon the ‘attention sink.’
Retaining just the first four tokens was enough to prevent a chatbot using a Sliding Cache from seeing its performance deteriorate as a conversation continues. In fact, it led to optimal performance.
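As a rough sketch of that retention policy (not the authors’ code), the following Python function keeps the first few token positions as attention sinks plus a sliding window of the most recent ones, evicting everything in between. The sink count and window size here are assumptions for illustration only.

```python
# Illustrative retention policy: always keep the first num_sinks tokens
# (the attention sinks) plus a sliding window of the most recent tokens.
def retained_positions(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Return the token positions kept in the cache for a sequence of seq_len tokens."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                      # everything still fits
    sinks = list(range(num_sinks))                       # first tokens act as attention sinks
    recent = list(range(seq_len - window, seq_len))      # most recent tokens
    return sinks + recent

# Example: at 10,000 tokens the cache holds positions 0-3 and 8980-9999.
print(retained_positions(10_000)[:6])  # [0, 1, 2, 3, 8980, 8981]
```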
The researchers also discovered that adding a placeholder token as a dedicated attention sink during pre-training can further improve deployment.
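A hypothetical sketch of what that pre-training tweak could look like in practice: prepend a reserved placeholder token to every training sequence so the model learns to treat it as the sink. The token id and helper name below are made up for illustration and are not taken from the paper’s code.

```python
# Assumed id reserved for the dedicated placeholder sink token.
SINK_TOKEN_ID = 0

def add_sink_token(token_ids: list[int]) -> list[int]:
    """Prepend the dedicated sink token to one training sequence."""
    return [SINK_TOKEN_ID] + token_ids

print(add_sink_token([17, 42, 256]))  # [0, 17, 42, 256]
```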
Song Han, a member of the MIT-IBM Watson AI Lab and a distinguished scientist at Nvidia, told MIT News: “We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible — every other token can see it.”
“We found that we must always keep the attention sink in the cache to maintain the model dynamics.”
You can access StreamingLLM via Nvidia's large language model optimization library, TensorRT-LLM.