Scaling your LLM Infra Cost to Zero

Ashok Raja
3 min read · Jul 20, 2024


Who doesn’t like to keep their cloud bill as low as possible? And how many of us dream of not paying for idle resources when no one is using them? Especially when it comes to GPU resources, which are far more expensive.

AI is now the industry buzzword, or should I say The Magic Word. But it doesn’t come cheap, at least not yet.

What if?

I recently stumbled upon the GPU cloud service runpod.io. My very first reaction was simply: WOW, wonderful. There are many more such services mushrooming with this AI madness.

Disclaimer: Don’t get me wrong, this is just my perspective, and I am not promoting any of their services or offerings. This is purely my experience playing with their cloud.

How is it even possible?

Serverless !! Serverless !! Serverless !! Serverless !! Serverless !!

Yes, you can run the LLM as a serverless endpoint and pay for only what you use.

RunPod offers a serverless GPU option that autoscales based on the number of incoming requests or the request queue delay.
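The idea in a nutshell: your container runs a small worker that pulls jobs from RunPod’s queue, runs inference, and returns the result. Here is a minimal sketch of such a worker using the runpod Python SDK; the payload shape and field names here are my own assumptions for illustration, not the exact code from my repo.

```python
# Minimal sketch of a RunPod serverless worker.
# Assumes the `runpod` Python SDK is installed in the container (pip install runpod).
import runpod


def handler(job):
    # job["input"] carries whatever JSON payload the caller sends to the endpoint.
    prompt = job["input"].get("prompt", "")
    # ... run your model here and return a JSON-serializable result ...
    return {"echo": prompt}


# The worker only runs (and the GPU only bills) while there are jobs in the queue.
runpod.serverless.start({"handler": handler})
```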

How complicated is it? Do I need to learn a whole new set of tools and frameworks?

Thankfully, no. You just need these:

  1. Wrapper code written in Python
  2. Very basic bash shell scripting
  3. Docker to containerize your code
  4. Knowledge of how to run ollama models (if you have read this far, I am assuming you are already familiar with it)

You don’t even need to be a champion in these technologies; a very basic understanding and some hands-on experience is enough.
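To give you an idea of what the Python wrapper amounts to, here is a hedged sketch. It assumes an ollama server was already started inside the container (by a small bash entrypoint script) and is listening on its default port 11434; the model name is just a placeholder.

```python
# Sketch of the wrapper that forwards a prompt to a local ollama server.
# Assumes `ollama serve` was started by the container entrypoint on port 11434.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt: str, model: str = "llama3") -> str:
    # Non-streaming call to ollama's generate API; "llama3" is a placeholder model name.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```

Plug a function like this into the serverless handler shown earlier and you essentially have the whole worker.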

Quick Demo

I am not going to bore you with a code walkthrough here; you can find all the code required to build this project in my GitHub repo, or even the completed final Docker image on my Docker Hub page.

GitHub Code Repo : https://github.com/ashokrajar/runpod-ollama-serverless

Docker Hub Final Image Repo : https://hub.docker.com/repository/docker/ashokrajar/runpod-ollama-serverless/tags

The request above cost me only about $0.08. It can go even lower when multiple requests are made: with serverless there is a cold-start delay on the first request, but subsequent requests don’t pay that cold-start cost, so they work out even cheaper.
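If you deploy the image as a serverless endpoint and want to call it yourself, a request looks roughly like this. The endpoint ID, API key, and prompt below are placeholders you would fill in from your own RunPod console.

```python
# Hedged example of calling a deployed RunPod serverless endpoint synchronously.
import os

import requests

ENDPOINT_ID = "your-endpoint-id"          # placeholder: from the RunPod console
API_KEY = os.environ["RUNPOD_API_KEY"]    # placeholder: your RunPod API key

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Why is the sky blue?"}},
    timeout=600,
)
print(resp.json())
```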

If you have made up your mind to give RunPod.IO a try, kindly use my referral link https://runpod.io?ref=xq02lxa1 and help me write more such content.


Written by Ashok Raja

I’m a Computer Engineer by profession and a Traveler by heart. I love all things Computers, Travelling, Trekking and Biking.
