Common mistakes in local LLM deployments — an Ollama example

Sebastian Panman de Wit
7 min read · Dec 23, 2024


Introduction

Local deployments of open-source Large Language Models (LLMs) have put AI technology within reach of anyone with a decent computer. While tools like Ollama have made the process much easier, there are several common mistakes that users run into.

Let’s explore the top 3 mistakes through the lens of Ollama, one of the most popular local LLM solutions. And keep reading until the end for a bonus mistake that affects not just local LLMs, but all AI interactions!

Table of contents

  1. Introduction
  2. Mistake 1: Using simplified model versions
  3. Mistake 2: Using limited context size
  4. Mistake 3: Using the command-line interface
  5. Bonus mistake
  6. Key takeaways

1. Using simplified model versions

Situation: When you first install AI tools like Ollama, they typically use “quantized” (simplified) versions of models to ensure they work on most computers. Quantization reduces the precision of the numbers used in the model’s calculations, similar to how your phone might compress photos to save space. While the original model might use more precise numbers (e.g. 16-bit floating points), the quantized default version uses simpler numbers (like 4-bit or 8-bit integers) to run more easily on basic hardware.

Problem: The default quantized settings, while meant to be helpful, might be unnecessarily limiting your model’s capabilities. Many users don’t realize their computer could actually handle a higher-precision, non-quantized version of the model. It’s like watching a video in 480p when your screen can display 4K: you get a working but reduced experience simply because the quantized version was selected automatically.

Solution: When your hardware allows, choose higher precision models. In Ollama, you can explicitly request FP16 models:

ollama pull llama3.1:8b-instruct-fp16

If a model runs too slowly or won’t run on your system, you can try different quantization levels. Each model in Ollama’s library has multiple versions, accessible through their tags. For example, visiting https://ollama.com/library/llama3.1/tags lists all available versions of the Llama 3.1 model, from highest to lowest precision. There you can see that the default Llama 3.1 8B model is a 4-bit quantized version of the original: the ‘latest’ 8B tag actually refers to the ‘8b-instruct-q4_K_M’ model.
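
If full FP16 is too heavy for your machine but the 4-bit default feels limiting, the tags page also lists intermediate quantizations. As an example (tag names can change, so check the tags page for what is currently available), an 8-bit version can be pulled like this:

ollama pull llama3.1:8b-instruct-q8_0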

Want to know more about these different quantization methods? Check out this Medium article by Maxime Labonne.

2. Using limited context size

Situation: LLMs maintain a conversation history (context) to provide coherent responses. This memory has a fixed size, typically measured in tokens (chunks of text). Think of it like having a notepad with limited pages. Once you fill it up, you need to erase something to write more.

Problem: Many local LLM implementations, including Ollama’s default setup, will quietly forget older parts of the conversation when reaching memory limits. Users often don’t realize this is happening until they notice the AI giving inconsistent or contradictory responses. Additionally, Ollama uses a default context size of 2048 tokens, while most current models support a much longer context window. This means you will hit the context limit earlier than necessary, leading to unnecessary loss of conversation history and potentially degraded responses.

Solution: To solve this, you can increase the default context size by running the command below (more info here). The maximum usable context size depends on your system’s resources and on the model’s own context limit. You can find a model’s maximum context size by opening its Model Card and clicking on ‘Model’ (e.g. https://ollama.com/library/llama3.1 > ‘Model’).

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-fp16",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
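
Note that the call above only raises the context size for that single API request. If you want a larger context for every session, one option is to bake the setting into a custom model via a Modelfile. A minimal sketch (the model name ‘llama3.1-8k’ is just an example name I’m using here):

# Modelfile
FROM llama3.1:8b-instruct-fp16
PARAMETER num_ctx 8192

# create and run the custom model
ollama create llama3.1-8k -f Modelfile
ollama run llama3.1-8k

Alternatively, inside an interactive ollama run session you can type /set parameter num_ctx 8192 to raise the limit for the current session only.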

Additionally, you can leverage more advanced implementations such as Retrieval Augmented Generation (RAG). RAG allows you to connect external data to your LLM by breaking documents into chunks, storing them in a vector database, and retrieving relevant pieces when needed. Instead of keeping entire documents in context, RAG fetches only the most relevant information for each query, effectively bypassing context window limitations. For more info on using RAG with Ollama, see the section below.
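
To make the idea concrete, below is a rough sketch of the two Ollama API calls a simple RAG pipeline revolves around: embedding your document chunks (and later the user’s question), then generating an answer with the retrieved chunks pasted into the prompt. The vector-database lookup in between is elided, nomic-embed-text is just one example embedding model you would need to pull first, and the angle-bracket placeholders are mine:

# 1. Embed a document chunk (repeat for every chunk and store the vectors)
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "<one chunk of your document>"
}'

# 2. At query time: embed the question the same way, look up the most similar
#    chunks in your vector store, and pass them along with the question
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-instruct-fp16",
  "prompt": "Answer using only this context:\n<retrieved chunks>\n\nQuestion: <user question>"
}'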

Lastly, the simplest way to avoid dropped context is to keep conversations short: reset your chat session before you reach the context limit.
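
In Ollama’s CLI, for instance, you can wipe the current conversation at any point with the built-in /clear command:

ollama run llama3.1:8b-instruct-fp16
>>> /clear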

Want to read more about RAG vs long-context chats? Check out this blog post by Cohere.

3. Using the command-line interface

Situation: Ollama’s default interface is a command-line tool accessed through ollama run. While this is functional, it provides a basic text-only interface without features that can enhance the interaction experience.

Problem: The CLI is functional, but it lacks convenient features like chat history management and easy model switching.

Solution: Consider using dedicated UI tools that integrate with Ollama. Some popular options are listed below (a quickstart sketch for Open WebUI follows the list):

  1. Enchanted (my personal favorite for Mac): a native macOS app with advanced features such as RAG integration, voice mode, and image support.
  2. Open WebUI: a comprehensive web interface with advanced features such as role-based access, web search, RAG integration, Function Calling, and image support.
  3. LM Studio: a cross-platform desktop tool with advanced features such as RAG integration and image support.
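
As a quickstart illustration for the second option, Open WebUI’s documentation describes a Docker one-liner roughly like the following (a sketch based on their docs at the time of writing; double-check the current command there, and note it assumes Ollama is already running on the same machine):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Once the container is up, the web interface is served at http://localhost:3000 and can be connected to your local Ollama instance.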

These interfaces can significantly improve your interaction with local LLMs while maintaining all the benefits of running models locally.

Want to explore more tools and applications for local LLMs? Check out this GitHub repository by janhq.

Bonus mistake: treating AI like an all-knowing system

Situation: LLMs are powerful pattern recognition systems trained on vast amounts of text data. In simpler terms: these AI models are like highly sophisticated autocomplete systems that have read millions of documents. They can recognize patterns and generate responses, but they don’t actually “know” things the way humans do. They have a fixed training cutoff date and no ability to access new information (unless you are using LLMs with advanced features such as Function Calling or web search).

Problem: Users often treat these models as if they were all-knowing beings connected to the internet, expecting them to know about recent events, specific products, or have access to real-time information.

Solution: Understand and respect model limitations. Below is a list of tasks LLMs can do well, followed by tasks they cannot do well.

What LLMs can do well (generally speaking):

  1. Process and analyze text based on their training. For example: “Analyze this text for sentiment” or “Summarize this article about quantum computing”
  2. Help with general knowledge questions (within their training period). For example: “What causes earthquakes?” or “How does photosynthesis work?”
  3. Assist with tasks like writing and programming. For example: “Help me write a Python function to sort a list” or “Draft an email to reschedule a meeting”

What (local) LLMs cannot do (by default):

  • Access current events or real-time information. For example: If you ask “Who won yesterday’s basketball game?” or “What’s the current stock price of Apple?”, the LLM can’t provide accurate answers unless you’ve specifically implemented internet access through advanced features such as Function Calling.
  • Read external databases or documents. For example: If you ask “What’s in my company’s latest financial report?” or “Analyze my customer feedback database”, the LLM can’t help unless you’ve implemented RAG and connected these data sources first.

Note: while the above mistake isn’t unique to local LLM deployments, it’s especially relevant in this context. Users who deploy local LLMs often experiment with advanced models without fully understanding their capabilities and limitations. Unlike cloud services that might have built-in guardrails and user-friendly interfaces, local deployments put you in direct control of the model. This makes it even more critical to understand what these systems can and cannot do.

Want to learn more about integrating advanced features into your local LLM deployments? Check out Open WebUI’s documentation on web-search (here) or RAG (here).

Key Takeaways

  1. Understand quantization trade-offs: Be aware that default quantized models may sacrifice quality for compatibility. Choose higher precision models when your hardware allows.
  2. Manage context window: Be conscious of context length limitations and explicitly set appropriate context sizes to prevent silent dropping of conversation history.
  3. Use appropriate interfaces: Don’t limit yourself to command-line interactions. Leverage dedicated UIs for better model management and interaction experience.
  4. Recognize (local) LLM limitations: Understand that LLMs are pattern recognition systems with fixed training data, not all-knowing systems with real-time information access.

And lastly, remember: do not blindly trust any (Gen) AI models. Always verify important information with reliable sources.
