Enterprise LLM Tool, Strategy Misconceptions Can Prove Costly

Decision makers need to ensure they use the right LLM for their use case to avoid unnecessary costs

Brian Sathianathan, Chief technology officer and co-founder, Iterate.ai.

June 20, 2024


The large language model (LLM) ecosystem continues to grow at a remarkable pace, with new models – or fine-tunes of existing ones – arriving daily. But amid, and in many ways because of, this flurry of activity, there are also substantial misconceptions about the LLM tools and strategies available to enterprise teams. There’s significant risk, and expense, for those that don’t correctly match the right plan to their use cases.

Many of these misconceptions begin with the most high-profile LLM companies. OpenAI, Anthropic and others have large foundation models trained with billions of parameters. Those companies want to convince everyone to use this type of model. However, those models get tremendously expensive at scale. 

For example, each GPT-4 call is billed by the token, and a single English word works out to roughly one and a half tokens. It is reminiscent of the days when AT&T charged by the minute for a long-distance call. IT organizations within large companies tend to go with the largest vendors; that’s why “nobody ever got fired for buying IBM” is burned into the lexicon. But that approach can put organizations in hot water in the LLM world, where defaulting to the largest players can mean spending far more than necessary.
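
You can sanity-check that rule of thumb with OpenAI’s open-source tiktoken tokenizer. A minimal sketch (the sample sentence is illustrative, and the exact ratio varies with the text):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
text = "Enterprise teams should match each model to its use case."
tokens = enc.encode(text)

# English prose typically lands between roughly 1.2 and 1.5 tokens per word.
print(len(tokens) / len(text.split()))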

Controlling costs means using the right tool for the job at hand. Every decision-maker should consider their specific use cases very closely. While certain cases require a large vendor, companies can use a more appropriate LLM to realize far better efficiencies.


New LLMs, Large and Small

From a developer’s perspective, two areas of LLM tooling demand attention right now.

One is the emergence of small language models (SLMs), including Llama 2, Microsoft’s Phi-2, Google’s Gemma, TinyLlama, and many more. These SLMs are ideal for edge applications with limited compute power, for applications that run on personal devices, and even for embedded systems such as telecom carrier infrastructure.

The other area is very large language models – those north of a hundred billion or even a trillion parameters. These serve developer teams working on high-accuracy artificial general intelligence (AGI) experiences for extremely sophisticated applications.

While misconceptions abound as developers’ LLM options grow every day, adopting a clear strategy and an LLM congruent with the business use case is essential to controlling costs at scale. A recent analysis compared the potential cost of processing 14.2 million input tokens and 1.2 million output tokens across large and small LLMs. Using GPT-4, that workload would cost a company about $500. Using Llama 2, the same workload would cost just $5, and with Mistral AI just $2.67.
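
Reproducing the GPT-4 figure is simple arithmetic, assuming the widely published API rates of $30 per million input tokens and $60 per million output tokens (the smaller models’ figures reflect far cheaper hosted-inference rates):

input_tokens = 14_200_000
output_tokens = 1_200_000

# Assumed GPT-4 rates: $30 per 1M input tokens, $60 per 1M output tokens.
gpt4_cost = (input_tokens / 1e6) * 30 + (output_tokens / 1e6) * 60
print(f"GPT-4: ~${gpt4_cost:.0f}")  # ~$498, roughly the $500 cited above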


When selecting an LLM strategy, decision-makers should compare large models, smaller API-served models, and on-premise optimized models. They should also fully vet and understand utilization and costs for both prediction (typically CPU) and training (typically GPU). What is fascinating is that some open-source and smaller LLMs are catching up to the speed and capabilities of larger LLMs. That means they can handle most use cases cost-effectively and at much lower GPU usage.

Three Approaches to Building LLM Applications 

Enterprise developers building LLM applications take one of three paths.

The first covers scenarios where the LLM must be pruned, distilled, cleaned or otherwise optimized to achieve an application’s goals, such as an edge application that must fit within tight memory constraints.
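
As a minimal sketch of the pruning idea, here is PyTorch’s built-in pruning utility applied to a stand-in linear layer rather than a full LLM:

import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for one projection layer inside a transformer block.
layer = nn.Linear(4096, 4096)

# Zero out the 30% of weights with the smallest magnitude (L1 unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")  # bake the pruning mask into the weights

sparsity = (layer.weight == 0).float().mean().item()
print(f"Sparsity after pruning: {sparsity:.0%}")  # ~30%

Note that zeroed weights only become real memory and latency savings when paired with sparse-aware storage or kernels, which is why pruning is usually combined with the other optimization steps above.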

The second and more common approach involves fine-tuning an LLM to provide information for a custom domain. For example, an LLM serving a finance industry application will need fine-tuning to learn to read financial statements.
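
A common low-cost way to do such domain fine-tuning is a parameter-efficient method such as LoRA. A minimal sketch using the Hugging Face peft library (the model name and hyperparameters are illustrative, and the training loop over domain data, such as financial statements, is omitted):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any Hugging Face causal language model can serve as the base.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of the base model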

The third approach, now rapidly scaling in the LLM world, is called retrieval-augmented generation (RAG). With RAG, an organization feeds its private data into a vector database to enable an LLM to answer questions about a domain it isn’t otherwise familiar with.

For example, if an employee asks the application, “What’s my PTO policy?” the application searches the database with that query, passes the retrieved content to the LLM as context, and asks it the same question. The LLM then generates an answer grounded in the retrieved content. This approach is popular because it delivers effective results easily and cheaply, with no retraining or fine-tuning required.
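
A toy, self-contained sketch of that flow (real systems use learned embeddings and a vector database; here simple word-count vectors stand in for embeddings, and the final LLM call is omitted):

import math
import re
from collections import Counter

documents = [
    "Full-time employees accrue 20 days of paid time off (PTO) per year.",
    "Expense reports must be filed within 30 days of purchase.",
]

def embed(text):
    # Toy stand-in for a learned embedding: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

question = "What's my PTO policy?"
q_vec = embed(question)

# Retrieval step: find the stored document most similar to the question.
best_doc = max(documents, key=lambda d: cosine(q_vec, embed(d)))

# Generation step: hand the retrieved text to the LLM as grounding context.
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer using only the context."
print(prompt)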

The Right LLM Tools for the Job

Different LLM tools suit different application approaches. One common thread is that organizations use the Hugging Face API whether they’re working at the core of an LLM, fine-tuning an LLM or using a RAG approach. That popularity stems from Hugging Face abstracting many LLM capabilities into its toolkits.
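
For example, the Hugging Face transformers library reduces loading and querying a small open model to a few lines (the model ID here is illustrative; any hosted model can be swapped in):

from transformers import pipeline

# Download and load a small open model from the Hugging Face Hub.
generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

out = generator("The key cost drivers for enterprise LLM deployments are", max_new_tokens=40)
print(out[0]["generated_text"])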

PyTorch, the open-source framework from Meta (formerly Facebook) that lets users build and manipulate models, including LLMs, is another well-liked option. Most developers first dipping their toes in the water use LangChain, an open-source toolkit that lets users call LLMs. Tools now exist that allow developers to create RAG applications without writing a line of code, simply by uploading documents as information sources.

Choosing Between Private LLMs and Public Generative AI Services

AI technology decision-makers must ask several incisive questions to determine their LLM strategy for a given application. Does the use case require cloud or edge computing, or both? What response times, accuracy, and security controls are required? What intellectual property is involved? How customized is the use case? And finally, what should the return on investment look like?

Consider some example use cases. Say an organization wants to build a restaurant drive-through ordering application. The approach should pair a custom fine-tuned LLM with natural language understanding (NLU) capabilities. For this application, an SLM such as Phi-2, Llama 2 or Mistral is appropriate and will be cost-effective.

Next, consider a general search chat application that must accurately fulfill a broad array of tasks. That breadth calls for a major offering such as OpenAI’s ChatGPT or Anthropic’s Claude.

Now take an application enabling search on private documents. Protecting that IP means avoiding public LLM offerings. Licensing a suitable tool and customizing an LLM to fulfill particular business needs is the ideal approach.

Finally, consider an application for providing information in a custom domain, such as finance. Any LLM and any approach – customizing, fine-tuning or RAG – can succeed. However, using open-source tools of appropriate size will deliver crucial efficiency.

Scale is Coming. Is Your Strategy Ready?

As enterprises implement more and more AI-based applications powered by LLMs, costs can go through the roof if they are not careful. At the same time, outside of OpenAI and similar players, few LLMs are operating at hyperscale. That will change very soon. Organizations that match the right tools to their business use cases now will realize growing dividends later, as this industry quickly matures.

About the Author(s)

Brian Sathianathan

Chief technology officer and co-founder, Iterate.ai

Brian Sathianathan is the chief technology officer and co-founder at Iterate.ai. Previously, Sathianathan worked at Apple on various emerging technology projects that included the Mac operating system and the first iPhone.
