LLM Arena: Llama3 vs. Gemma2 — A playful experiment
LLM arenas where models battle each other on high-school math problems are fun. Watching them play real-life everyday games, however, is even more fun!
In this blog, I describe my experiments pitting Meta’s latest LLM (Llama3) against one of Google’s newest LLMs (Gemma2) in the classic Animal Chain game. You can find all Python code on my GitHub.
Requirements
For these experiments, we use the following:
- A MacBook Pro with 36GB of RAM
- A recent version of Ollama (one that supports parallel requests)
- Python
Getting started
To get started, run the following commands:
# Clone the repository
git clone https://github.com/sebastianpdw/llm-word-game-arena
cd llm-word-game-arena
# Install requirements
pip install -r requirements.txt
# Download models
ollama pull llama3:8b-instruct-q8_0
ollama pull gemma2:9b-instruct-q8_0
# Run the main script
python main.py
Please note that these experiments use quantized (q8_0) versions of both models so that they fit in memory on a MacBook. If you want to run the full-precision models, you can use a GPU cloud platform such as Hyperstack to leverage the latest NVIDIA GPUs.
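Before kicking off the full run, it can help to confirm that both pulled models actually respond. Below is a minimal smoke test using the ollama Python client (installed via pip install ollama); this check is my own addition and not part of the repository’s scripts:
import ollama

# Quick sanity check: ask each pulled model for a one-word reply
for model in ("llama3:8b-instruct-q8_0", "gemma2:9b-instruct-q8_0"):
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": "Name one animal. Reply with only its name."}],
    )
    print(model, "->", response["message"]["content"].strip())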
Experiments
The main script simulates 100 games of the Animal Chain game. The goal is to name an animal that starts with the last letter of the previously mentioned animal, and players cannot name an animal that has already been mentioned.
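A referee for this game only needs two checks per turn: the first-letter rule and the no-repeats rule. The sketch below shows what such a validation step could look like; the function name and return format are illustrative, not necessarily what the repository uses:
# Hypothetical move check: two rules, checked in order
def is_valid_move(animal: str, previous: str | None, mentioned: set[str]) -> tuple[bool, str]:
    """Return (valid, reason) for a proposed move."""
    name = animal.strip().lower()
    if name in mentioned:
        return False, "repeated animal"
    if previous is not None and not name.startswith(previous.strip().lower()[-1]):
        return False, "wrong starting letter"
    return True, ""

# Example: repeating "Elephant" is invalid, as in the game transcript below
print(is_valid_move("Elephant", "Eagle", {"giraffe", "elephant", "eagle"}))  # (False, 'repeated animal')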
The following system prompt was used for both models:
You are playing a game of word-snake.
In this game, you need to respond with an animal name that starts
with the last letter of the previous animal mentioned.
You are not allowed to repeat any animal that has already been mentioned.
If the other player breaks a rule, respond with: 'Disqualified [reason].'
If you can't think of a valid animal name, respond with: 'I forfeit the game.'
Otherwise, respond only with the animal name. Do not include any other text.
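The game loop then simply alternates between the two models, sending each one the system prompt above together with the transcript so far. A simplified sketch of one exchange via the ollama Python client follows; the message layout and helper names are illustrative, and the repository’s main.py may structure this differently:
import ollama

SYSTEM_PROMPT = "You are playing a game of word-snake. ..."  # full prompt shown above
MODELS = {"A": "llama3:8b-instruct-q8_0", "B": "gemma2:9b-instruct-q8_0"}

def next_move(player: str, transcript: list[str]) -> str:
    """Ask the current player's model for its next animal, given the moves so far."""
    history = "\n".join(transcript) if transcript else "You start. Name the first animal."
    response = ollama.chat(
        model=MODELS[player],
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": history},
        ],
    )
    return response["message"]["content"].strip()

# One exchange: A opens, then B answers A's move
transcript = []
transcript.append(f"A: {next_move('A', transcript)}")
transcript.append(f"B: {next_move('B', transcript)}")
print("\n".join(transcript))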
An example game (an actual game played by the LLMs) is shown below. In this case, player B repeated an animal and player A won the game.
A: Giraffe
B: Elephant
A: Tiger
B: Rattlesnake
A: Echidna
B: Aardvark
A: Kangaroo
B: Ostrich
A: Hyena
B: Axolotl
A: Lemur
B: Rhino
A: Ocelot
B: Turtle
A: Eagle
B: Elephant
A: Disqualified repeated animal.
After running 100 experiments, the results were as follows:
- Meta’s Llama3-8B emerged as the best-performing model, winning about 60% of the games.
- These findings remained consistent even when the experiments were rerun with a different starting player.
- Both players forfeited in less than 1% of the games.
Disclaimer
This research is in no way academic or indicative of the actual performance of these models. Nevertheless, it was a fun project. In practice, neither LLM was always able to catch the other player breaking a rule.