Serverless LLM inference with Ollama
Introduction
Built on top of Ollama, this repo allows you to quickly and cheaply deploy any popular open-source LLM as a serverless API.
Ollama is one of the easiest ways nowadays to deploy LLMs. However, to integrate Ollama into business applications, users need to deploy it on a hosted VM which normally runs 24/7. This can be costly for businesses that only need occasional LLM invocations and/or are simply prototyping with LLMs. The serverless implementation described here offers a solution.
The API runs on AWS Lambda, making it scalable, budget-friendly, and simple to integrate into any of your existing business solutions.
Here’s the summary:
- This repo helps deploy Ollama as a serverless function on AWS Lambda
- It’s a simpler, potentially more cost-effective approach compared to traditional deployments, especially since it doesn’t rely on complex/costly GPU resources
- While this setup might limit speed and input sizes, it’s ideal for small-scale or custom LLM applications that need a quick start
This might be the first implementation of its kind for Ollama, and I’m quite excited about it. It could be a handy solution for anyone looking to deploy LLMs without much hassle or high costs.
If you want to dive right into using this serverless implementation, click here.
All code for this article can be found in my GitHub repo.
Table of contents
1. Getting started
2. Python script
3. Docker image
4. CloudFormation
5. Deployment scripts
1. Getting started
Prerequisites
1. Install Docker
2. Install and configure AWS CLI
3. Make Docker available to non-root users (Linux instructions)
Instructions:
To get started, run the following commands. You can replace ‘llama2’ with any other LLM mentioned here.
# Make Docker available for all users (MacOS)
sudo chown -R $(whoami):staff /Users/$(whoami)/.docker
# Clone the repository
git clone https://github.com/sebastianpdw/ollama_serverless.git
cd ollama_serverless
# Run the main setup script
./scripts/setup.sh llama2
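Once setup.sh finishes, you can check that the Lambda function was created. Below is a minimal verification sketch using boto3, assuming your AWS credentials are configured and the default function name OllamaServerless from the CloudFormation template is unchanged.
# Hypothetical check that the deployment succeeded (function name is the CloudFormation default)
import boto3

lambda_client = boto3.client("lambda")
response = lambda_client.get_function(FunctionName="OllamaServerless")
print(response["Configuration"]["State"])  # should eventually report "Active"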
Below, I describe the different components of the serverless implementation. This part is written for technical users who would like to understand the architecture.
If you only want to use the deployment, please refer to the information above.
2. Python script
The main component of our serverless LLM inference setup is the Python script, which serves as the bridge between AWS Lambda and the Ollama API. This script accepts the same arguments as the Ollama API and forwards them to the Ollama API running on localhost inside the AWS Lambda function (on Ollama’s default port, 11434).
import os

import requests

# Ollama's API listens on localhost inside the Lambda container (default port 11434)
BASE_API_URL = "http://localhost:11434"


def generate_text(input_text, format=None, options=None, system=None, template=None, context=None, raw=None):
    api_url = BASE_API_URL + "/api/generate"

    # Define the data payload for the API request
    data = {
        "model": os.environ["MODEL_NAME"],
        "prompt": input_text,
        "format": format,
        "options": options,
        "system": system,
        "template": template,
        "context": context,
        "raw": raw,
        "stream": False
    }

    # Make a POST request to the Ollama API
    print(f"Making API request to {api_url} with data {data}")
    response = requests.post(api_url, json=data)
    if response.status_code != 200:
        raise ValueError(f"API request failed with status code {response.status_code}: {response.reason}")

    response_dict = response.json()
    print(f"Response from API: {response_dict}")

    print(f"Generated text: {response_dict['response']}")
    return response_dict
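The Lambda handler itself is not shown above. A minimal sketch of how such a handler could wire the incoming event into generate_text might look like this; the handler name and event key names are assumptions, not necessarily what the repo uses.
# Hypothetical handler sketch -- the actual handler in the repository may differ
def handler(event, context):
    # Forward the Ollama-style parameters from the Lambda event to generate_text
    return generate_text(
        input_text=event["prompt"],
        format=event.get("format"),
        options=event.get("options"),
        system=event.get("system"),
        template=event.get("template"),
        context=event.get("context"),
        raw=event.get("raw"),
    )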
3. Docker image
The Docker image encapsulates all necessary dependencies and configurations, ensuring a consistent and isolated environment in our AWS Lambda. The Docker image consists of two stages.
Stage 1: Download and get model files
In this stage, the Docker image downloads the LLM model files by calling a startup.sh script, which pulls the requested model based on the $MODEL_NAME build argument.
FROM ollama/ollama as downloader
ARG MODEL_NAME
# Copy the startup script into the image
COPY startup.sh /startup.sh
# Make sure the script is executable
RUN chmod +x /startup.sh
# Run the initialization script to perform the pull operation
RUN /startup.sh $MODEL_NAME
Stage 2: Setup Ollama environment
For our Ollama environment to work, we copy the model files that we pulled in stage 1 into our final container.
FROM ollama/ollama
# Copy the model files
COPY --from=downloader /root/.ollama/ /root/.ollama/
# Copy the ollama binary
COPY --from=downloader /bin/ollama /bin/ollama
# Set variables for ollama
ENV HOME=/root
RUN chmod 777 /root
Then we install the AWS Lambda libraries and dependencies so that we can call our Python script. This part is based on the official AWS Lambda Python 3.12 image. See the main snippet below.
...
# Clone the specific git repository
RUN git clone https://github.com/aws/aws-lambda-base-images.git
WORKDIR ./aws-lambda-base-images/
RUN git checkout python3.12
# Set the working directory to the cloned directory
WORKDIR ./x86_64/
# Unzip all .xz and .tar files in the directory
RUN mkdir -p ./unzipped
RUN for file in *.xz; do \
[ -e "$file" ] && unxz -k "$file" && mv "${file%.*}" unzipped/ || echo "No .xz files found or error extracting $file"; \
done \
&& for file in ./unzipped/*.tar; do \
[ -e "$file" ] && tar -xvf "$file" -C unzipped/ || echo "No .tar files found or error extracting $file"; \
done \
&& rm -rf ./unzipped/*.tar
WORKDIR ./unzipped/
# Copy only relevant directories
RUN cp -r ./usr/local/bin/ /usr/local/bin/
RUN cp -r ./var/lang/ /var/lang/
RUN cp -r ./var/runtime/ /var/runtime/
# Copy the lambda-entrypoint.sh script
RUN mkdir /var/task/
RUN cp ./lambda-entrypoint.sh /var/task/lambda-entrypoint.sh
# Set environment variables and entrypoint for AWS Lambda
ENV LANG=en_US.UTF-8
ENV TZ=:/etc/localtime
ENV PATH=/var/lang/bin:/usr/local/bin:/usr/bin/:/bin:/opt/bin:$PATH
ENV LD_LIBRARY_PATH=/var/lang/lib:/lib64:/usr/lib64:/var/runtime:/var/runtime/lib:/var/task:/var/task/lib:/opt/lib
ENV LAMBDA_TASK_ROOT=/var/task
ENV LAMBDA_RUNTIME_DIR=/var/runtime
WORKDIR /var/task/
ENTRYPOINT ["./lambda-entrypoint.sh"]
...
4. CloudFormation
To make deployments reproducible and make it easy to update the Lambda function programmatically, we use an Infrastructure-as-Code (IaC) approach with CloudFormation. We only need one YAML file for this, which defines the Lambda function and its execution role. The most relevant code (see below) configures the AWS Lambda function with settings for ephemeral storage, memory size (RAM), environment variables (used in our Python script), and the image URI of our Docker image.
Resources:
  OllamaServerless:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: OllamaServerless
      EphemeralStorage:
        Size: 10240
      MemorySize: 10240
      Timeout: 120
      Role: !GetAtt LambdaExecutionRole.Arn
      PackageType: Image
      Environment:
        Variables:
          MODEL_NAME: !Ref ModelName
      Code:
        ImageUri: !Ref DockerImageUri
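The setup script (see section 5) automates the deployment of this template. Purely for illustration, deploying the same template with boto3 could look roughly like the sketch below; the stack name and placeholder image URI are assumptions.
# Hypothetical sketch of deploying the CloudFormation template with boto3.
# setup.sh already automates this step; the stack name and image URI below are assumptions.
import boto3

cf_client = boto3.client("cloudformation")
with open("cloudformation/iac.yaml") as f:
    template_body = f.read()

cf_client.create_stack(
    StackName="ollama-serverless",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "ModelName", "ParameterValue": "llama2"},
        {"ParameterKey": "DockerImageUri", "ParameterValue": "<your-ecr-image-uri>"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates an IAM execution role
)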
5. Deployment scripts
In the sections above we have defined the components required to deploy our AWS Lambda. To automate this process we have a setup.sh script that pushes our Docker image to an AWS ECR repository and deploys the AWS Lambda function as described in our CloudFormation YAML file.
This setup.sh script takes the MODEL_NAME as its first argument, for example:
./scripts/setup.sh llama2
This script does the following:
- Initializes variables such as the MODEL_NAME
- Runs the ./scripts/push_ecr.sh script, which builds the image defined in ./docker/Dockerfile and pushes it to an AWS ECR repository
- Deploys the AWS Lambda function as described in our ./cloudformation/iac.yaml file
After running this script, you will have an AWS Lambda function deployed in your AWS environment. This function can be called with the parameters below. Check out Ollama’s API documentation for more information.
- prompt (string): Input text for the model to process.
- format (string, optional): Desired format of the output (e.g. json)
- options (object, optional): Additional model parameters
- system (string, optional): System message
- template (string, optional): Full prompt or prompt template
- context (array of integers, optional): Context parameter returned from a previous request, which can be used to keep a short conversational memory
- raw (boolean, optional): Flag to determine if the input should be forwarded raw.
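For example, a test invocation from Python with boto3 might look like the sketch below, assuming the default function name OllamaServerless and that the handler returns the Ollama response dictionary as-is.
# Hypothetical invocation sketch -- payload keys follow the parameter list above
import json

import boto3

lambda_client = boto3.client("lambda")
payload = {
    "prompt": "Why is the sky blue?",
    # Optional keys such as "format", "options" or "system" can be added here
}

response = lambda_client.invoke(
    FunctionName="OllamaServerless",
    Payload=json.dumps(payload).encode("utf-8"),
)
result = json.loads(response["Payload"].read())
print(result["response"])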
Summary
In summary, this article has presented a new approach to deploying open-source Large Language Models (LLMs) using a serverless architecture on AWS Lambda, leveraging Ollama. This method offers a cost-effective, scalable option for businesses and developers who need occasional LLM usage and/or are in the prototyping phase.
Data Scientists and Machine Learning Engineers in the field are encouraged to experiment with this serverless implementation, available in my GitHub repository. Your feedback and contributions will be more than welcome!
Limitations
- Resource constraints on AWS Lambda: While AWS Lambda offers scalability and cost-effectiveness, it also comes with limits on computing resources such as memory and processing power, especially because AWS Lambda does not yet support GPUs. This may restrict the performance of larger, more resource-intensive LLMs.
- Latency issues: Due to the nature of serverless architecture, there can be latency issues, particularly with cold starts. This could affect response times, especially for applications requiring real-time inference.
- Limited customization of the underlying infrastructure: Users have limited control over the underlying infrastructure, which can be a hindrance for specific optimization or customization needs for the LLM deployment.
Please note that this project was developed as a hobby and is not intended for production use. Users should exercise caution and thoroughly test the system in a controlled environment before considering any form of deployment. This project is provided “as is” without any warranties, and the creators are not responsible for any issues that may arise from its use. Your discretion is advised.