Building Applications#
TL;DR
When building LLM applications, most people focus too much on the model.
You should focus on evals.
This is the order of steps to take when building an application:
Detail your use case, especially inputs and outputs. These become your evaluations.
Pick a model. Test a couple of cases to see if the task is possible with the most capable (but also most expensive) model.
Modify your prompt/model if needed.
Get more data.
Repeat.
Add aspirational use cases to your eval set.
With the speed of development, use cases that may not work now may soon be possible.
In this guide, we show an application built step by step.
Both the decision-making and the time each decision took are shown.
How to get stuck#
These days, nearly anyone can quickly start building an LLM application. With the availability of modern LLMs that can process text/images/audio paired with strong reasoning capabilities, it’s easy to build a basic demo.
However, these demos always have rough edges: they don’t quite work as well as expected, or some edge cases clearly fail. What happens next is what I call “whack-a-mole”: the developer changes prompts ad hoc, fixing one case but breaking earlier ones. Often at this stage, multiple models are tried, and soon it becomes hard to remember which prompt worked with which model, and the application never progresses from there.
This doesn’t have to be you. Here are my thoughts after developing products such as AI Studio and NotebookLM as part of my day job.
The Correct Order#
Here’s a quick run-through of the steps. These apply to anything: text-to-text models, image generation models, or larger LLM systems.
Step 1: Build an evaluation suite#
Hamel and I agree: build your eval suite first. Here’s why:
They refine your thinking.
They define the uniqueness of your need.
You need them to pick a model.
Your eval set should include all inputs to the LLM, and at least a notion of the expected output. For a more specific breakdown of types of evals, read this section from the eval guide.
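As a concrete sketch (previewing the product-copy example later in this guide), an evalset can be as simple as a versioned file of input/expected-output pairs. The field names and file name below are illustrative, not a required schema:

```python
import json

# A minimal sketch of an evalset: each entry pairs the exact inputs the LLM
# will see with at least a notion of the expected output.
evalset = [
    {
        "inputs": {
            "name": "Plus Size Swimsuit Cover Up in Hippy Dip Tie Dye",
            "style": "Tunic",
            "type": "Swimsuit Cover Up",
        },
        "expected": "Bring back the season of summer love in this ...",
    },
    # ... aim for at least ~20 real input/output examples
]

# Store it somewhere versioned so every model and prompt change is scored
# against the same examples.
with open("evalset.jsonl", "w") as f:
    for example in evalset:
        f.write(json.dumps(example) + "\n")
```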
Step 2: Pick a model#
Most people focus way too quickly on the model, after all, it’s the most exciting thing. But by thinking through your evaluations first, you’ll be able to select models in a principled way. Now instead of guessing, you can run your suite and see what happens.
I typically do this in two stages:
Candidate model selection
Final model selection.
In candidate model selection, I take two or three of my evals and check the performance. The goal of this is to get a sense of what kind of model you need, particularly the size. If both Gemma 2B and GPT-4 are excelling on a task, then I know I have options. But if only Llama 70B or above is working, then I can ignore the smaller models. I typically run these “by hand” using a web ChatUI or on a command line with Ollama. The goal here is a gut check and not to spend too much time building infrastructure.
Once a set of candidates is selected (usually three), I then run the whole eval set across all of them. This is the step where some code is typically required to map your input datasets to the model, and then capture the outputs. For things like text models, this is quite straightforward these days. For more complicated inputs and outputs, like images or audio, some more rigging is needed.
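To make “run the whole eval set” concrete, here is a rough sketch of that rigging for text models. It assumes a locally running Ollama server and its /api/generate endpoint; the model names and file names are just examples:

```python
import json
import requests

CANDIDATES = ["gemma:2b", "llama3:70b"]  # example model names

def generate(model: str, prompt: str) -> str:
    # Assumes a local Ollama server; swap this out for whatever API you use.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

with open("evalset.jsonl") as f:
    evalset = [json.loads(line) for line in f]

# Capture every output so the candidates can be compared side by side,
# by hand or with an automated grader.
results = []
for model in CANDIDATES:
    for example in evalset:
        prompt = (
            "Write a 75-85 word product description for:\n"
            + "\n".join(f"{k}: {v}" for k, v in example["inputs"].items())
        )
        results.append(
            {"model": model, "inputs": example["inputs"], "output": generate(model, prompt)}
        )

with open("candidate_outputs.json", "w") as f:
    json.dump(results, f, indent=2)
```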
Step 3: Modify your system#
Most LLMs are great out of the box for many applications, but not perfect. This is the step where you start making changes to the system to improve the performance. I use the word system deliberately because these modifications could be to any part of the end-to-end system.
This includes:
Prompt changes
Tuning to update the model
Adding additional components in your application, such as classifiers, LLM calls, etc.
Sometimes these changes are simple and can be done cheaply in an afternoon, other times this can take thousands of dollars of investment. The quality gap and your need will inform the strategy here.
Step 4: Capture the data#
As you use your application, be sure to capture as much data as possible.
You’ll use this to:
Build more evals and try out new models
Modify your existing system more, such as further fine-tuning
Train classifier/reward models (Advanced usage)
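In practice, capturing the data can be as simple as appending every request/response pair to a log file that later feeds new evals or training sets. A minimal sketch (the file name and fields are placeholders):

```python
import json
import time

def log_interaction(inputs: dict, output: str, path: str = "interactions.jsonl") -> None:
    """Append one request/response pair to a JSONL log.

    These records later become new eval examples, finetuning data,
    or training data for classifier/reward models.
    """
    record = {"timestamp": time.time(), "inputs": inputs, "output": output}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Wherever the application calls the model:
# output = generate(model, prompt)
# log_interaction({"model": model, "prompt": prompt}, output)
```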
Step 5: Make more adjustments if needed#
Now with actual data and users, you’ll both:
Find new ways you can make your original use case better
Discover all the cases you missed and now need to handle
Iterative by nature#
In traditional applications, typically a specification is written down, and then the application is designed to perform that specification. For example, “Take this customer order and append a date”. The challenge with LLM applications is the probabilistic nature of the LLM itself. By design, we don’t know what we’ll get from an LLM, especially on the first try with a new prompt. This leads to an iterative style of development: trying something, seeing what happened, adjusting, and trying again. This makes it closer to a “data science” workflow than a “computer science” one.
End to End Example: Copy Automator#
This guide was motivated by a marketing friend who needed help; they needed to write ~200 product descriptions for a client that sells swimsuits, such as the one below.
We worked through the entire application-building process: evalset design, model and prompting iterations, and finally finetuning a small model, which turned out to be the right solution for this case.
Each step is laid out with the approximate time at the top so you can get a realistic sense of how much effort goes into each one.
Step 1: The Evalset (2 hours)#
She first asked what model I recommend, and I asked what the inputs and outputs to the system were. She provided a Google Sheet with the product characteristics and the descriptions:
| Shopify Display Name | Brand Group | Suit Type | Suit Style | Product Description |
| --- | --- | --- | --- | --- |
| Plus Size Swimsuit Cover Up in Hippy Dip Tie Dye | Hippy Dip Tie Dye | Swimsuit Cover Up | Tunic | Bring back the season of summer love in this Plus Size Swimsuit Cover Up in Hippy Dip Tie Dye. This flowy cover up creates a flattering and relaxed fit for summer days lounging by the pool or at the beach. Featuring a bold Hippy Dip Tie Dye print, this breezy tunic cover up drapes beautifully over your swimsuit, combining elegant sophistication and retro hippy vibes. Whether you’re headed to a seaside lunch with the girls or unwinding by the water, pair this plus size cover up with any of your swim outfits. |
| Plus Size One Piece Swimsuit in Hippy Dip Tie Dye | Hippy Dip Tie Dye | One Piece | Over The Shoulder \| Underwire | Our Plus Size One Piece Swimsuit in Hippy Dip Tie Dye will have your vibing with the summer this season. This plus size swimsuit is designed with a cascading front detail that adds texture and style, while the ruching at the sides creates a flattering fit for your figure. Hidden underwire creates comfortable and secure support while the bold tie die print brings a playful, eye-catching retro vibe to your swimwear. Whether you’re swaying to the music at the beach or lounging by the pool this summer, you’ll be feeling your best all season long. |
| Plus Size V Neck Tankini in Hippy Dip Tie Dye | Hippy Dip Tie Dye | Bikini Tops | Over The Shoulder \| Tankini | Flaunt your natural curves with our Plus Size V Neck Tankini in Hippy Dip Tie Dye. The tankini is designed with a stunning plunging neckline, highlighted with a gold ring accent, which creates a flattering cut for any figure. Ruching at the waistline enhances the silhouette with texture and style. Adjustable straps and supportive features make it comfortable and a perfect fit for all-day wear. Whether you’re swaying to the music at the beach or lounging by the pool this summer, you’ll be feeling your best all season long. |
The sheet contained 20 examples and became our evalset.
Attempt 1: Prompt and Model Zero Shot on Large Model (30 seconds)#
We first tried zero-shot prompting the default ChatGPT model, which at the time of writing is o1-Mini.
While we weren’t expecting a perfect response on the first try, this took 30 seconds and established a baseline.
Attempt 2: Instructable prompt and end-user opinion (1 minute)#
The next thing we tried was adding more instructions to the prompt.
write a 75-85 word product description for the details below. avoid using the following words: embrace, timeless, sophistication, flattering, elegant, bold, ensure, allure, elevate, boast.
Attached picture is how the product/print looks like. please use appropriate adjectives to describe the print.
Use the details details below: Product name: Swim Dress in Sizzling Summer Floral Print: Sizzling Summer Floral Swimsuit Type: One Piece Features: Removable Cups Strap Info: Adjustable Shoulder Straps
Please write a variation of this:
Create a variation of this description: Look fabulous and ready to hit the beach with these Zebra Stripe Splash Splash Bikini Bottoms, which features a high waist cut with stylish ruching at the sides for a comfortable, flattering fit. The bottoms also have an adjustable waistband that can transform your look from high waist to low waist, all while providing a full-coverage silhouette. You also love the wild Zebra Stripe Splash print which will bring a playful flair to your swimwear collection. Grab your favorite sunglasses and hit the beach in style this season!
This is what we got.
To me, it looked fine, but my friend rejected it and a couple of other samples. To her, some of the language felt too generic. This is why, when building LLM applications for someone else to use, you need an end user or domain expert on hand. They can often see things that you, as the application builder, cannot.
Attempt 3: Many Shot on Large Model (1 hour)#
The next step was to try few-shot prompting. Since we had an evalset, we split it into a prompt set and a holdout set, and I constructed a prompt generator. You can see the full prompt here. It’s quite long as it includes 16 input/output examples.
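The prompt generator itself is nothing fancy. Here is a sketch of the idea (not the exact code in the linked Colab), reusing the illustrative evalset file from earlier:

```python
import json

INSTRUCTION = (
    "Write a 75-85 word product description in the same style as the examples below."
)

def format_example(inputs: dict, description: str = "") -> str:
    block = (
        f"Name: {inputs['name']}\n"
        f"Style: {inputs['style']}\n"
        f"Type: {inputs['type']}\n"
    )
    if description:
        block += f"Description: {description}\n"
    return block

def build_many_shot_prompt(prompt_set: list, new_inputs: dict) -> str:
    # Concatenate the instruction, every prompt-set example, and the new item.
    shots = "\n".join(format_example(ex["inputs"], ex["expected"]) for ex in prompt_set)
    return (
        f"{INSTRUCTION}\n\n{shots}\n"
        f"Now write the description for:\n{format_example(new_inputs)}"
    )

with open("evalset.jsonl") as f:
    examples = [json.loads(line) for line in f]

prompt_set, holdout = examples[:16], examples[16:]  # 16 shots, the rest held out
prompt = build_many_shot_prompt(prompt_set, holdout[0]["inputs"])
```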
Unfortunately this did not work either. We briefly considered including more input/output examples, but started to realize that if every call required thousands of input tokens, the cost per run for all the copy my friend needed would be ~$5. Not too expensive for a couple of generations, but too expensive once we included all the iteration attempts.
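The back-of-the-envelope math that scared us off, with assumed rather than measured numbers:

```python
# Rough cost estimate; every number here is an assumption for illustration.
tokens_per_call = 16 * 300              # ~16 examples at roughly 300 tokens each
descriptions_needed = 200
price_per_million_input_tokens = 5.00   # assumed rate, check your provider

total_tokens = tokens_per_call * descriptions_needed  # ~960,000 tokens
cost_per_full_run = total_tokens / 1_000_000 * price_per_million_input_tokens
print(f"~${cost_per_full_run:.2f} per full run")       # ~$4.80
```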
Attempt 4: Gemma 2B Finetune (1 Hour)#
Given what we knew about cost and generic language, I decided to try finetuning, taking portions of the evalset and turning them into a training set.
I was hopeful about this approach because:
Finetuning updates the weights so the model actually learns to replicate my friend’s writing style
Cost of generation is trivial, especially for a small model
Training a small number of text to text examples is also trivial
I used a fork of my Gemma Bakery example, replacing the cake examples with the product copy from our evalset.
The finished colab is here. LoRA finetuning took about 5 minutes on the Colab free tier. The trained model now takes the following input and produces the following output:
<start_of_turn>user
Name: Plus Size Swimsuit Cover Up in Hippy Dip Tie Dye
Style: Tunic
Brand: Hippy Dip Tie Dye
Type: Swimsuit Cover Up
<end_of_turn>
<start_of_turn>model
Add a splash of color to your summer wardrobe with this Plus Size Swimsuit Cover Up in Hippy Dip Tie Dye. This groovy swimsuit cover up is designed with a flowy silhouette that drapes beautifully over your swimsuit, making it perfect for lounging by the pool or heading to a beachside bonfire. The tie dye print adds a vibrant flair to your look, while the lightweight fabric keeps you cool and comfortable all day long. Grab your favorite sunglasses and a cold drink for a summer day that’s ready for anything. <end_of_turn>
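To produce that format, the spreadsheet rows were turned into Gemma-style chat examples. A sketch of the conversion (the turn markers are Gemma’s real control tokens; the helper itself is illustrative, not the exact Colab code):

```python
def to_gemma_example(row: dict) -> str:
    """Format one spreadsheet row as a Gemma-style chat example.

    The <start_of_turn>/<end_of_turn> markers are Gemma's chat control tokens;
    the structured field layout is our own convention for this application.
    """
    user_turn = (
        f"Name: {row['Shopify Display Name']}\n"
        f"Style: {row['Suit Style']}\n"
        f"Brand: {row['Brand Group']}\n"
        f"Type: {row['Suit Type']}\n"
    )
    return (
        "<start_of_turn>user\n" + user_turn + "<end_of_turn>\n"
        "<start_of_turn>model\n" + row["Product Description"] + " <end_of_turn>"
    )
```

At inference time, everything up to and including the `<start_of_turn>model` line is fed in and the model completes the rest.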
My friend was much happier with this copy, noting it matched her writing style better and avoided AI phrases she’s seen over and over again.
In addition to that, finetuning the model had these benefits:
Quicker Generation#
This happens for two reasons. A smaller model is typically faster than a bigger model; in this case that’s likely a 100x reduction in computation cost just from the number of weights. The other is the elimination of instruction prompting and few-shot tokens, a second, order-of-magnitude-scale reduction in the number of tokens that need processing. The fine-tuned model recognizes the specific input format and “knows” to produce the right output format.
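As a rough illustration of where those savings come from (all numbers are assumptions, not measurements):

```python
# All numbers below are assumptions for illustration, not measurements.
large_model_params = 200e9        # assumed size of a large hosted model
gemma_params = 2e9
weight_speedup = large_model_params / gemma_params      # ~100x less compute per token

many_shot_prompt_tokens = 5_000   # instruction + 16 examples, roughly
finetuned_input_tokens = 50       # just the structured product fields
token_reduction = many_shot_prompt_tokens / finetuned_input_tokens  # one to two more orders of magnitude
```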
Static Model#
Cloud models are tweaked and altered quite regularly, because cloud vendors are always trying to make their systems better in a competitive ecosystem. This could include changes to the model, safety filters, or system prompts.
A local model will stay the same. For this use case we prefer stability over “more new stuff”.
Easier Retries and Simplified Application Logic#
Canned phrases could still appear, but with everything local it’s easier to handle and retry generations. If an unwanted phrase is detected, we can just truncate the output and try again. With a cloud API, even on a good internet connection, making requests, hoping we don’t exceed quota or hit timeouts, and handling all of those errors adds weight to a simple application. A local LLM, with full access to control tokens, means we can just cut the text off and let the model completion “do its thing”.
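A sketch of that retry logic; the banned-phrase list is an example and the generate function is a placeholder for a call into the local fine-tuned model:

```python
BANNED_PHRASES = ["timeless", "sophistication", "elevate"]  # example list, not the real one

def generate(prompt: str) -> str:
    # Placeholder for a call into the local fine-tuned model
    # (e.g. via a transformers pipeline, llama.cpp, or Ollama).
    raise NotImplementedError

def generate_clean(prompt: str, max_attempts: int = 3) -> str:
    """Retry locally until the output contains no banned phrase.

    With a local model there are no quotas or network timeouts to handle,
    so the retry loop stays simple. A fancier version could truncate the
    output just before the offending phrase and let the model complete
    from there instead of regenerating from scratch.
    """
    output = ""
    for _ in range(max_attempts):
        output = generate(prompt)
        if not any(phrase in output.lower() for phrase in BANNED_PHRASES):
            return output
    return output  # fall back to the last attempt
```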
References#
Hamel’s Eval Guide - Though relatively new, I’m betting this will become the classic guide on application evals
Building LLM Applications at Discord - A great 15 min video on end to end application building for one of the largest chat platforms
Evalset for Copy - The evalset used for the example above. You’ll want something similar for any application you’re building, at least 20 examples of inputs and outputs of what you’re trying to do.
Prompt Builder Colab - Shows how I read in the evalset, call o1-Mini, and build a few-shot prompt for preliminary application development
Finetuning Colab - Taking our evalset and repurposing parts of it to see if we can train a model to do what we need