How we sped up our model inference 4x (or: How we learned to host AI models at production scale in-house)

11 Jan 2022


When we first began building solutions that used embeddings models under the hood, we went straight for the top of the line: OpenAI. Despite the indisputably great performance, we found that the frequent outages and variability in latency were not worth it, particularly since things kept breaking during critical demos.

Deciding that we needed to manage things ourselves, we proceeded to host our own models and use a self-hosted vector database. It turns out that when you're serving millions of API requests to resource-intensive AI models, there are huge payoffs to managing the entire stack yourself.

Figuring that if we needed it, someone else must too, we began providing access to our APIs to other startups working in the space. While the initial beta release did work, it did not handle traffic spikes elegantly and suffered from occasional outages. It was time for a rework of our deployment and inference stack. Two months and over 10 million inferences later, we got the latency down to around 70 ms, with only the occasional error. The real test came later, when a huge spike from a customer put significant strain on the new system, which it handled effortlessly.

"Over the last 2 months and 10 million+ inferences, we achieved an average latency of 70 ms for our embeddings models."

So, how’d we do it?

Let's look at a few of the things we did to speed up our inference.

Your API server matters a lot

The first version of our API server did a lot of things: authentication, logging, billing, load balancing, rate limiting and more. Naively, we thought this would scale up linearly with our usage. Unfortunately, that simply isn't true - databases have finite connection pools, loggers have rate limits of their own, and dynamic GPU pricing from our vendors means that billing isn't just simple multiplication anymore.

The first step is to offload as much of this work as you can. We used pub/sub to do almost everything mentioned above async. This meant that we could handle a significantly greater workload using the same resources, and didn’t have to wait for databases or other APIs to scale up as fast as the rest of our system did.
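As a rough sketch of the pattern (the names here are illustrative, not our actual stack, and an in-process queue stands in for a real pub/sub service like Google Pub/Sub or Kafka):

```python
import queue
import threading

# A thread + queue stands in for a real pub/sub service so the
# sketch is self-contained.
events: queue.Queue = queue.Queue()
processed = []

def worker():
    # Background consumer: handles billing/logging events off the hot path.
    while True:
        event = events.get()
        if event is None:  # sentinel: shut down
            break
        processed.append(event)  # stand-in for a DB write or logger call
        events.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

def handle_request(user: str, n_tokens: int) -> dict:
    # The request path only enqueues; it never waits on the database.
    events.put({"user": user, "tokens": n_tokens})
    return {"status": "ok"}

handle_request("acme", 512)
handle_request("acme", 128)
events.join()    # flush outstanding events before shutting down
events.put(None)
t.join()
```

The payoff is that the request handler's latency no longer depends on how quickly the logging database or billing API can absorb writes.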

Next up is to cache as much as possible. We realized that there were time savings to be had for folks who are either developing new features or running automated tests. The key finding here was that their inputs to the model were usually the same, albeit run hundreds or thousands of times over in a short period of time. Taking a leaf out of the dynamic programming handbook, you don't need to re-compute the same inputs every time - storage is cheap and plentiful, so once you see the same value come through frequently, it's worth storing and looking up the outputs. Due credit to this article, which helped tremendously. Remember, though, that caching makes your API server stateful, so set it up accordingly.
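A minimal sketch of the idea (the in-memory dict stands in for whatever shared store you'd actually use, the dummy `embed` for the real model call, and the "seen frequently" threshold is omitted for brevity):

```python
import hashlib

cache: dict[str, list[float]] = {}  # in production: a shared store like Redis
model_calls = 0

def embed(text: str) -> list[float]:
    # Stand-in for the expensive model forward pass.
    global model_calls
    model_calls += 1
    return [float(len(text))]  # dummy "embedding"

def embed_cached(text: str) -> list[float]:
    # Key on a hash of the input so keys are uniform and bounded in size.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed(text)
    return cache[key]

# A test suite re-running the same input thousands of times
# only hits the model once.
for _ in range(1000):
    vec = embed_cached("hello world")
```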

Handle Kubernetes with care

Kubernetes is a great piece of technology, but it is sometimes too smart for its own good and needs guardrails. Especially when autoscalers get involved, setting up your clusters, workloads and their YAML files can make a big difference in performance and reliability. Specifically:

Hardware choices, especially when hosting AI models, are incredibly important. The obvious decision is that of which GPU or CPU to pick. But especially as usage scales, a few more big questions come up. Mainly by experimenting with a bunch of different setups, we found that:

  1. Shared A100s performed far better than we initially thought. In other words, half of an A100 is usually better than an entire T4. The caveat is that you must ensure that your workloads can actually share a GPU comfortably without encroaching on each other (so no ALiBi on a shared GPU - subject for another post).
  2. GPU availability starts to become an issue. As you are probably aware, getting access to GPUs at all major cloud providers is a hassle. But it gets a lot more painful when the cluster region simply doesn’t have the necessary resources available, even if you have the quota for it. We’ve resorted to keeping a number of GPUs alive and empty to act as “overflow” - it isn’t ideal but this ended up being a lifesaver during spikes.

Re: the YAMLs, by using a combination of limits, affinities and anti-affinities, we were able to ensure that pods - especially new ones spun up during spikes - found nodes that were correctly set up for this purpose.
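For concreteness, here is a sketch of the kind of pod spec we mean - the labels, names and resource numbers are illustrative, not our actual config:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: embeddings-server
  labels:
    app: embeddings-server
spec:
  containers:
    - name: model
      image: our-registry/embeddings:latest
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
          memory: 16Gi
  affinity:
    nodeAffinity:
      # Only schedule onto nodes provisioned with the right GPU.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu-type
                operator: In
                values: ["a100"]
    podAntiAffinity:
      # Prefer spreading replicas so a spike doesn't land every pod
      # on the same node.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: embeddings-server
            topologyKey: kubernetes.io/hostname
```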

Diversify across regions and (if you can) cloud providers. We've banged our heads against walls many times when we needed 4x A100s in one particular region and were denied, so having clusters in several regions across different clouds really helps.

Optimize your containers

Optimizing a container is a broad task and really depends on the model, hardware and usage pattern you’re optimizing for, so there isn’t a straightforward answer. But a good starting point is to identify a base image that works really well for your use-case, and for AI models, the official Nvidia ones tend to be best.

At its core though, you should at least be consistent about how you containerize and deploy your models. We found success in standardizing our Dockerfiles and always using the same base Docker image, which makes our build times more predictable (separately, shoutout to Depot for helping us cut down our build times dramatically!).
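As an illustration of what that standardization looks like - the base image tag and file names below are examples, not our actual setup:

```dockerfile
# Base image pinned by tag so builds stay reproducible; match the CUDA
# version to your drivers and framework.
FROM nvcr.io/nvidia/pytorch:22.04-py3

WORKDIR /app

# Install dependencies in their own layer so Docker's build cache
# survives code-only changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "serve.py"]
```

Keeping every model's Dockerfile in this shape means the heavy base layers are shared and cached across all of our builds.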

Use bf16 when you can

In addition to hosting embeddings models, we started to fine-tune and deploy much larger LLMs - ones like Falcon, Dolly and Llama V2 - which meant we had to pick between fp16 and bf16 training and inference. In every case where we’ve had the chance, we’ve used bf16 for both fine-tuning and inference, and it has consistently done well. The caveat is that you usually need an A100 for this. Also, look into quantization using bitsandbytes - subject for another post.
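Part of why bf16 behaves so well: it keeps float32's 8 exponent bits and sacrifices mantissa bits instead, so it has the same dynamic range as fp32, whereas fp16 overflows past 65504. A small framework-free sketch, simulating bf16 by truncating float32 bits:

```python
import struct

def to_bf16(x: float) -> float:
    # bfloat16 is float32 with the low 16 mantissa bits dropped: same
    # exponent range, less precision.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

FP16_MAX = 65504.0  # largest finite float16 value

x = 1e20  # a magnitude that would overflow fp16 to infinity
print(x > FP16_MAX)  # fp16 can't represent this at all
print(to_bf16(x))    # still finite in bf16, just coarser
```

That extra headroom is why bf16 training tends to avoid the loss-scaling gymnastics that fp16 often requires.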


Much like any other tech stack, there's rarely a "right" answer to the question of how to build a truly scalable back-end. Inevitably, there's a fair amount of experimentation, failure and downtime as you learn. That said, once you figure it out, it's pretty cool to watch your system pass stress tests that seemed implausibly large just a couple of months ago.

Looking to train and host a model on either a managed cloud or your own? Just want advice or want to commiserate? Drop us a line at
