Ollama Models on Cloud Run
27 June, 2024
This blog is a read-along for the repository xprilion/ollama-cloud-run, which shows how to deploy various models behind the Ollama API on Cloud Run, running inference on CPU only on a serverless platform so you incur charges only while the service is in use.
Ollama is a framework that makes it easy for developers to prototype apps with open models. It comes with a REST API, and this repository provides Dockerfiles and deployment scripts for each model.
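The repository's Dockerfiles aren't reproduced here, but a common pattern for serving a single model with Ollama is to build on the official ollama/ollama image and pull the weights at build time so they ship inside the container. Below is a minimal sketch of that pattern, assuming gemma:2b and Cloud Run's default port of 8080; the real Dockerfiles live in each model's folder and may differ.

```dockerfile
# Hypothetical sketch; see the model folders in xprilion/ollama-cloud-run
# for the actual Dockerfiles.
FROM ollama/ollama

# Listen on the port Cloud Run routes traffic to, and keep model weights
# inside the image so nothing is downloaded at request time.
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_MODELS=/models

# Start the server in the background just long enough to pull the model,
# baking the weights into this image layer.
RUN ollama serve & sleep 5 && ollama pull gemma:2b

# Serve the Ollama REST API when the container starts.
ENTRYPOINT ["ollama", "serve"]
```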
Google Cloud Run is a fully managed compute platform that automatically scales your stateless containers. You can run code written in any language: all dependencies are packaged into a container image, and Cloud Run handles deployment and scaling for you.
Inspiration (and the gemma:2b code) comes from wietsevenema/samples.
Models
| Model | Version | Folder |
|---|---|---|
| codegemma | 2b | codegemma/2b |
| codegemma | 7b | codegemma/7b |
| gemma | 2b | gemma/2b |
| gemma | 7b | gemma/7b |
| gemma2 | 9b | gemma2/9b |
| llama3 | 8b | llama3/8b |
| llava | 7b | llava/7b |
| mistral | 7b | mistral/7b |
| phi3 | 3.8b | phi3/3.8b |
| qwen2 | 0.5b | qwen2/0.5b |
Usage
To build the container with a specific model included and deploy the Ollama API to a publicly accessible URL on Cloud Run, run the corresponding model's deploy script. For example, to deploy gemma:2b:

```bash
bash gemma/2b/deploy.sh
```
Respond to any prompts the command gives you. You might need to enable a few APIs and choose a region to deploy to.
Building the container takes roughly 3-20 minutes, depending on model size.
Once the command completes, it prints the public URL of the service.
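Under the hood, deploy.sh is essentially a wrapper around gcloud run deploy. A rough sketch of the kind of command it runs is below; the service name, region, and resource settings are placeholders, not the repository's actual values, so check the script itself before relying on them.

```bash
# Roughly what a deploy script for gemma:2b might run; all flag values here
# are assumptions -- the real settings are in the model's deploy.sh.
gcloud run deploy ollama-gemma-2b \
  --source . \
  --region us-central1 \
  --memory 16Gi \
  --cpu 8 \
  --timeout 600 \
  --allow-unauthenticated
```

Deploying with --source . hands the build off to Cloud Build, which is also why a cloud-run-source-deploy repository shows up in Artifact Registry during clean-up.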
Explore the API
Ask the deployed model a question:
```bash
curl <PUBLIC_URL>/api/generate -d \
'{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?"
}'
```
The first request to a new instance takes some extra time because the model has to be loaded into memory. By default, Ollama keeps the model in memory for 5 minutes.
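The generate endpoint also accepts a stream field, to return a single JSON object instead of a token-by-token stream, and a keep_alive field to control how long the model stays loaded. For example:

```bash
# Request a single non-streaming response and keep the model loaded for
# 30 minutes instead of the default 5.
curl <PUBLIC_URL>/api/generate -d \
'{
  "model": "gemma:2b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "keep_alive": "30m"
}'
```

Keep in mind that keep_alive only matters while a Cloud Run instance is alive; once the service scales to zero, the next request pays the model-loading cost again.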
For the full Ollama API, refer to the API docs.
Clean Up
To clean up after following this short tutorial, do the following (or use the gcloud commands sketched below):

1. In Artifact Registry, find the cloud-run-source-deploy repository and remove the container image used by the Cloud Run service you created.
2. In Cloud Run, delete the service you created.
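The same clean-up can be done from the command line; the service name, region, and project below are placeholders, so substitute the values from your deployment.

```bash
# Delete the Cloud Run service (placeholder name and region).
gcloud run services delete ollama-gemma-2b --region us-central1

# Delete the image that the source-based deploy pushed to Artifact Registry.
gcloud artifacts docker images delete \
  us-central1-docker.pkg.dev/PROJECT_ID/cloud-run-source-deploy/ollama-gemma-2b \
  --delete-tags
```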
Use this setup for research, exploration, and prototyping.