October 10, 2023
The high costs of large language models (LLMs) that power generative AI are a growing concern, but smaller models could be a solution.
“The rise of LLMs like GPT-4 has shown extraordinary leaps in performance, and with this advancement comes increased costs,” Adnan Masood, the chief AI architect of the technology company UST, said in an interview.
“The computational intensity of LLMs, due to their sheer size and computational needs due to billions of parameters, requires extensive power. This intense computation translates to larger energy consumption, increasing the operational cost and environmental impact,” he said. “With model sizes exceeding GPU memory limits, there is an ensuing demand for specialized hardware or complex model parallelism, further compounding infrastructure costs.”
Smaller language models can reduce costs and improve efficiency when they are fine-tuned, Masood said. He noted that there are techniques like distillation and quantization in LLMs to compress and optimize models. Distillation involves training a smaller model using the outputs of a larger one, and quantization reduces the precision of the model's numerical weights to make it smaller and faster.
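The two compression techniques Masood mentions can be made concrete. Below is a minimal sketch of post-training quantization using plain NumPy: weights are mapped from 32-bit floats to 8-bit integers plus a single scale factor, cutting weight memory roughly fourfold at a small cost in precision. The function names are illustrative, not from any real toolkit, and production systems use more sophisticated schemes (per-channel scales, calibration data).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map float weights onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

# Toy example: a small weight matrix stored in 8 bits instead of 32.
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error per weight is at most half the scale factor.
print(np.max(np.abs(w - w_hat)) <= scale / 2)
```

Distillation, the other technique Masood names, is complementary rather than an alternative: a small "student" model is trained to reproduce the output distribution of a large "teacher," and the resulting student can then also be quantized.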
Smaller models’ “reduced parameter count naturally demands less computational power, ensuring faster inferences and potentially shorter training durations,” he added. “This smaller footprint fits well within conventional GPU memory, eliminating the need for specialized, more expensive hardware setups. With the reduced computational and memory usage, energy consumption drops, directly trimming operational costs. Leveraging APIs for proofs-of-concept or prototyping in production workloads, the low per-token pricing proves beneficial during scaling. Yet, when applications experience rapid growth, relying solely on larger language models can lead to exponential cost increases.”
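The scaling concern in the quote above can be put in rough numbers. The per-token prices below are hypothetical placeholders, not any vendor's actual rates; the point is only that a per-token price gap that looks negligible at prototype volume dominates the bill at production volume.

```python
# Hypothetical per-token API prices (illustrative only, not real vendor pricing).
LARGE_MODEL_PRICE = 0.03 / 1000   # dollars per token
SMALL_MODEL_PRICE = 0.002 / 1000  # dollars per token

def monthly_cost(tokens_per_request: int, requests_per_month: int, price: float) -> float:
    """Total monthly spend for a fixed workload at a given per-token price."""
    return tokens_per_request * requests_per_month * price

# A prototype at 10,000 requests/month vs. the same app at 10 million.
for requests in (10_000, 10_000_000):
    large = monthly_cost(1_000, requests, LARGE_MODEL_PRICE)
    small = monthly_cost(1_000, requests, SMALL_MODEL_PRICE)
    print(f"{requests:>10,} req/mo: large=${large:,.0f}  small=${small:,.0f}")
```

At these illustrative rates the absolute gap grows from a few hundred dollars a month to hundreds of thousands as usage scales, which is the dynamic Masood is describing.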
Smaller language models could also slash cloud infrastructure costs, Matt Barrington, Americas emerging technology leader for EY, said in an interview. For instance, fine-tuning a domain-specific model on a cloud-based service requires fewer resources, reducing both training time and cost. Companies can then allocate the freed-up AI resources to other crucial areas closer to the end user.
“By utilizing compact language models in edge computing scenarios, enterprises minimize the dependency on costly cloud resources, leading to cost savings,” he added.
More efficient AI models are already being rolled out. Examples of smaller models include recent releases such as phi-1.5, which, despite its compact size, rivals the performance of larger models like GPT-4, Masood said. There are also domain-specific models like Med-PaLM 2, tailored for the health care and life sciences industries, and Sec-PaLM, meant for security applications.
“Models like Llama 2 70B, which is priced considerably lower than contemporaries such as Google's PaLM 2, are emerging as cost-effective solutions,” Masood added. Such prices are “a stark reduction from earlier models. Meta's 13-billion-parameter LLaMA even outperformed the larger GPT-3 in most benchmarks.”
Initiatives such as the BabyLM challenge at Johns Hopkins University aim to make small models as effective as LLMs. Amazon offers a marketplace for smaller models that can be customized with a company's data. Anyscale and MosaicML sell models such as the 70-billion-parameter Llama 2 at lower prices, highlighting the move toward cost-effective, capable models.
There is a pressing need to cut the costs of LLMs. One significant expense is the GPUs that are used for training the LLMs. Perhaps the most sought-after is Nvidia’s H100, which fetches $30,000 or more apiece, Muddu Sudhakar, CEO of Aisera, noted in an interview. There is a waitlist for such GPUs, with some VCs using them as bait to attract startups for funding.
Even if you get the GPUs, you need a business that generates enough revenue to cover their costs, Sudhakar said. A recent blog post from VC firm Sequoia notes a big monetization gap, which could be an issue for the generative AI market.
“Once you obtain the GPU, you will need data scientists, which are very tough to recruit. The comp packages are also substantial,” he added. “Finally, operationalizing LLMs is expensive in terms of processing interactions, managing and upgrading the models for prompt injections, security issues, hallucinations, etc.”
Masood predicted that fine-tuned LLMs would soon match the performance of larger models at a fraction of the cost. He said the open-source community has been addressing practical challenges with techniques like LongLoRA that show how context windows can be dramatically extended.
“If the trajectory is any indicator, the coming era might witness a synthesis of open-source models and smaller LLMs, forming the backbone of the next-generation language modeling ecosystem,” he added.
Sascha Brodsky is a freelance technology writer based in New York City. His work has been published in The Atlantic, The Guardian, The Los Angeles Times, Reuters, and many other outlets. He graduated from Columbia University's Graduate School of Journalism and its School of International and Public Affairs.