27 June, 2024

Ollama Models on Cloud Run

This post is a read-along for the repository xprilion/ollama-cloud-run, which shows how to deploy various models behind the Ollama API on Cloud Run. Inference runs on CPU only, on a serverless platform, so you are billed only when you use the models.

Ollama is a framework that makes it easy for developers to prototype apps with open models. It comes with a REST API, and this repository provides Dockerfiles and deployment scripts for each model.

Google Cloud Run is a fully managed compute platform that automatically scales your stateless containers. You can run code written in any language: all dependencies are packaged in a container image, and Cloud Run handles deploying and scaling it.

Inspiration (and gemma2b code) from wietsevenema/samples.


To build a container with a specific model included and deploy the Ollama API to a publicly accessible URL on Cloud Run, run the deploy script found in that model's directory. For example, to deploy gemma:2b:

bash gemma/2b/deploy.sh

Respond to any prompts the command gives you. You might need to enable a few APIs and choose a region to deploy to.

Building the container takes roughly 3-20 minutes, depending on model size.

When the deployment completes, the command prints the public URL of the service.
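If the deploy output has scrolled away, the URL can also be looked up afterwards. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The service name and region in the example are placeholders, since the names used by the repository's scripts are not shown here.

```shell
# Print the public URL of a Cloud Run service.
# Arguments: service name, region (both placeholders; use your own values).
service_url() {
  local service="$1" region="$2"
  gcloud run services describe "$service" \
    --region "$region" \
    --format 'value(status.url)'
}

# Example (hypothetical service name and region):
# PUBLIC_URL="$(service_url ollama-gemma-2b us-central1)"
```

Storing the result in a variable such as PUBLIC_URL makes it easy to reuse in later requests.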

Explore the API

Ask the deployed model a question:

curl <PUBLIC_URL>/api/generate -d '{
    "model": "gemma:2b",
    "prompt": "Why is the sky blue?"
}'
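By default, the generate endpoint streams its answer as a series of newline-delimited JSON objects. For scripting it is often easier to ask for a single JSON object by setting "stream" to false. The helper below is a rough sketch of that idea. The function names are made up for illustration, and the payload builder does not escape quotes or backslashes in the prompt.

```shell
# Build the JSON body for /api/generate.
# "stream": false asks Ollama for one JSON object instead of a stream of chunks.
# Caveat: the prompt is inserted as-is, so it must not contain quotes or backslashes.
build_payload() {
  local model="$1" prompt="$2"
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$model" "$prompt"
}

# Send the request to the deployed service and print the raw JSON response.
# Arguments: service URL, model name, prompt.
ask_model() {
  local url="$1" model="$2" prompt="$3"
  curl --silent --show-error "${url}/api/generate" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$model" "$prompt")"
}

# Example (replace the URL with the one printed by the deploy script):
# ask_model "https://your-service-url" "gemma:2b" "Why is the sky blue?"
```

The generated text is in the "response" field of the returned object. If jq happens to be installed, piping the output through jq -r '.response' prints just the answer.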

The first request to a new instance will take some extra setup time because the model is loaded into memory. Ollama keeps the model in memory for 5 minutes.
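The 5-minute window is Ollama's default and can be changed per request with the keep_alive field. A request with no prompt loads the model without generating text. The sketch below combines the two into a warm-up call. The function names and the 300-second client timeout are illustrative assumptions, and keep_alive only helps while the Cloud Run instance itself stays alive.

```shell
# Build a body with no prompt: Ollama loads the model but generates nothing.
# keep_alive sets how long the model stays in memory after the request (e.g. "15m").
build_warmup_payload() {
  local model="$1" keep_alive="$2"
  printf '{"model": "%s", "keep_alive": "%s"}' "$model" "$keep_alive"
}

# Warm up a freshly started instance.
# Arguments: service URL, model name, optional keep_alive (defaults to 5m).
# --max-time 300 is an arbitrary allowance for slow cold starts.
warm_up() {
  local url="$1" model="$2" keep_alive="${3:-5m}"
  curl --silent --show-error --max-time 300 "${url}/api/generate" \
    -d "$(build_warmup_payload "$model" "$keep_alive")"
}

# Example (replace the URL with the one printed by the deploy script):
# warm_up "https://your-service-url" "gemma:2b" "15m"
```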

For the full Ollama API, refer to the API docs.

Clean Up

To clean up after following this short tutorial, you can do the following:

  • In Artifact Registry, find the cloud-run-source-deploy repository and remove the container image used by the Cloud Run service you created.

  • In Cloud Run, delete the service you created.
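The same clean-up can be done from the command line. The sketch below only prints the gcloud commands so they can be reviewed before anything is deleted. It assumes the image path follows the pattern REGION-docker.pkg.dev/PROJECT/cloud-run-source-deploy/SERVICE, which should be checked against what Artifact Registry actually lists. The project, region, and service names are placeholders.

```shell
# Print (but do not run) the commands that remove the image and the service.
# Arguments: project ID, region, service name (all placeholders).
cleanup_commands() {
  local project="$1" region="$2" service="$3"
  # Assumed image location for source-based deploys; verify in Artifact Registry.
  local image="${region}-docker.pkg.dev/${project}/cloud-run-source-deploy/${service}"
  printf 'gcloud artifacts docker images delete %s --delete-tags\n' "$image"
  printf 'gcloud run services delete %s --region %s\n' "$service" "$region"
}

# Example (hypothetical names):
# cleanup_commands my-project us-central1 ollama-gemma-2b
```

Once the printed commands look right, they can be copied into the terminal and run one at a time.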

Use these deployments for research, exploration, and prototyping.

