Performance
AI programming brings a new set of performance considerations. The fundamental difference is that model calls (e.g. to GPT-4) are an order of magnitude slower than traditional API calls. Generating a few sentences can take a few seconds.
The key strategies are:
- Stream
- Minimize how long the output needs to be
- Avoid waterfalls / roundtrips
- Defer execution
- Use a faster model
Fortunately, AI.JSX helps you with all of these.
Aside from "stream", all these strategies trade off between performance and reliability of correctness. You'll have to find the tradeoff that works for your application. In general, we recommend starting with correctness, then trading off for performance while keeping correctness above an acceptable threshold. No one cares how fast your app is if the results are bad. 😄
Stream​
When you do a model call, you can either wait until it's fully complete before showing the user anything, or you can stream the output as it arrives. Streaming greatly improves perceived performance, and gives the user a chance to cancel the generation if it's heading down the wrong path.
By default, AI.JSX streams responses for you.
From UI​
// React component
<ResultContainer title={`AI comes up with a recipe for ${query}`}>
<AI.jsx>
{/* AI.JSX components */}
<ChatCompletion temperature={1}>
<UserMessage>Write me a poem about {query}</UserMessage>
</ChatCompletion>
</AI.jsx>
</ResultContainer>
This will call the chat model and automatically stream the results into the DOM.
From the API​
function MyComponent() {
return (
<ChatCompletion temperature={1}>
<UserMessage>Write me a poem about dogs</UserMessage>
</ChatCompletion>
);
}
/** Get the partial result as it's streamed */
function handleIntermediateResult() {}
const finalResult = await AI.createRenderContext().render(<MyComponent />, {
map: handleIntermediateResult,
});
For more detail, see Intermediate Results.
Minimize how long the output needs to be​
A model generation takes linearly more time as the output length increases, so the shorter your output can be, the faster it'll be. However, this inherently means the model is spending less time "thinking" about your result, which could degrade accuracy. You'll need to balance these tradeoffs.
For example, you may have a scenario where you instruct the model to give you output in a structured format, like JSON. If you can come up with a simpler JSON format that takes fewer characters, the model will produce it faster.
Additionally, if you know the limit of how long you want your response to be, you can set the max_tokens
param:
<ChatCompletion max_tokens={200}>
<UserMessage>Write a concise summary of an imaginary movie.</UserMessage>
</ChatCompletion>
This forces the model to end its response within 200 tokens. If you only want short responses, this both improves your correctness, and prevents the model from droning on by producing an unwanted long response.
Avoid waterfalls / roundtrips​
Just like with UI engineering, waterfalls can hurt your performance. Sometimes they're unavoidable, but be mindful when introducing them.
For instance:
function CharacterGenerator() {
return <ChatCompletion>
<UserMessage>Write me a profile for a fantasy character</UserMessage>
</ChatCompletion>
}
function ToJson({children}: {AI.Node}) {
return <ChatCompletion>
<UserMessage>Convert this to JSON: {children}</UserMessage>
</ChatCompletion>
}
function App() {
return <ToJson><CharacterGenerator /></ToJson>
}
This is a waterfall, because the CharacterGenerator
call needs to complete before the ToJson
call can start.
Instead, we could combine them into a single model call:
function CharacterGenerator() {
return (
<ChatCompletion>
<SystemMessage>Respond in JSON.</SystemMessage>
<UserMessage>Write me a profile for a fantasy character</UserMessage>
</ChatCompletion>
);
}
If you can get the model to do what you want in a single shot, that'll be more performant. However, asking the model to do more at once decreases reliability. It's more robust to split your workload into simple, discrete tasks for the model. So, there are tradeoffs to balance here. You want the task size to be as complicated as the model can reliably do in a single pass, but no more complicated.
A roundtrip is when your client needs to make a connection to your backend. Depending on the quality of the user's network connection, this can have a big negative impact on performance. As a result, many performance strategies involve minimizing roundtrips.
Any clientside app can have roundtrips (calling out to APIs, etc). With AI apps, we add a new type of roundtrip: calling out to a model provider (e.g. OpenAI).
The amount of roundtrips in your logic depends on how you structure your AI.JSX program. A program with many sequential calls out to a model will have more roundtrips than one that does a single shot. (Of course, unlike traditional API calls, model calls are so slow that the client/server latency is a less important contributor to the overall performance profile.)
Defer Execution​
AI.JSX's engine will maximally defer execution, giving you optimal parallelism. To take full advantage of this, avoid manual render
s within a component:
function StoryGenerator(props, { render }) {
const heroName = <CharacterName role="hero" />;
const villainName = <CharacterName role="villain" />;
return (
<ChatCompletion>
<UserMessage>
Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
</UserMessage>
</ChatCompletion>
);
}
If you run this example, you'll see that you have two different heroName
s generated, because each time you include the component in the JSX, it's a new instance. To get around this, you might reach for render
to burn the results into a string:
function StoryGenerator(props, { render }) {
// Anti-pattern! Do not do this.
const heroName = await render(<CharacterName role="hero" />);
const villainName = <CharacterName role="villain" />;
return (
<ChatCompletion>
<UserMessage>
Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
</UserMessage>
</ChatCompletion>
);
}
This hurts parallelism, because the StoryGenerator
component can't be returned until both those renders complete. (And, if you had the heroName
behind a conditional, the render might be wasted.)
Instead, use memo
:
import { memo } from 'ai-jsx/core/memoize';
function StoryGenerator(props, { render }) {
const heroName = await memo(<CharacterName role="hero" />);
const villainName = <CharacterName role="villain" />;
return (
<ChatCompletion>
<UserMessage>
Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
</UserMessage>
</ChatCompletion>
);
}
With this approach, you're still deferring execution optimally, but you also ensure that each instance of heroName
will resolve to the same generated result.
Use a faster model​
Different models have different performance profiles. GPT-4 is slower than GPT-3.5-Turbo, for instance. But, unfortunately, the slower models tend to be more correct. So you'll have to find the tradeoff that works for you.
OpenAI's recommendation is to start with the most powerful model, get your app working, then move to faster models if it's possible to do so without sacrificing correctness.
You may also want to consider different model providers; OpenAI vs. Anthropic may have different performance and uptime profiles. AI.JSX makes it easy to switch model providers for a part or whole of your app via context.