Size of smallest open-source LLM matching GPT-3.5's performance in 2025?

Invalid contract

The criterion for matching GPT-3.5 is either ≥ 70% on MMLU (a 5-shot prompt is acceptable) or ≥ 35% on GPQA Diamond.

Resolves to the amount of memory the open-source LLM takes up when run on an ordinary GPU. Only models that aren't fine-tuned directly on the task count. Quantizations are allowed. Chain-of-thought prompting is allowed. Reasoning is allowed. For GPQA, giving examples is not allowed. For MMLU, a maximum of 5 examples is allowed. Something like a chatbot fine-tuned on math/coding reasoning problems would be acceptable. I hold discretion over what counts; ask me if you have any concerns.
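
To make the "maximum of 5 examples" rule concrete, here is a minimal sketch of what a 5-shot MMLU-style prompt could look like. This is illustrative only, not a required harness; the function names and data layout are my own, and in practice the solved examples would come from the subject's MMLU dev split.

```python
# Illustrative sketch of a 5-shot MMLU-style prompt (not a required harness).
# Each example is a dict with "question", "choices" (4 options), and "answer" (A-D).

def format_question(question: str, choices: list[str], answer: str | None = None) -> str:
    """Format one multiple-choice question in the usual MMLU A/B/C/D style."""
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append(f"Answer: {answer if answer is not None else ''}")
    return "\n".join(lines)

def build_five_shot_prompt(dev_examples: list[dict], test_question: dict) -> str:
    """Concatenate up to 5 solved dev examples, then the unsolved test question."""
    shots = [format_question(ex["question"], ex["choices"], ex["answer"])
             for ex in dev_examples[:5]]  # at most 5 examples, per the market rules
    shots.append(format_question(test_question["question"], test_question["choices"]))
    return "\n\n".join(shots)
```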

Global-MMLU and Global-MMLU Lite are considered acceptable substitutes for MMLU for the purposes of evaluation.

Absent any specific measurement, I will take the model size in GB as the "amount of memory the open-source LLM takes up when run on an ordinary GPU", but if you can show that it takes up less in memory, I'll use that. If needed, I might run it on my own GPU and measure the memory usage there.
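
For what it's worth, a rough sketch of how that GPU memory measurement could be done with PyTorch is below. The checkpoint name is a placeholder, and the exact number will vary with dtype, context length, and framework overhead, so treat it as indicative rather than the official measurement procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-open-model"  # placeholder checkpoint name

torch.cuda.reset_peak_memory_stats()

# Load the model onto the GPU in bf16, run a short generation,
# then read back the peak memory PyTorch allocated.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory allocated: {peak_gb:.1f} GB")
```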

Open-source is defined as "you can download the weights and run it on your GPU." For example, Llama models count as open-source.


https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

For reference, Gemma 3 12B, which gets 40% on GPQA and 69.5% on Global MMLU-Lite, takes up 24 GB in bf16, or 12.4 GB with SFP8 quantization. (Because the evals were likely done on the original weights, I will not use the quantized model's size; I'm sharing it only for reference.)

Gemma 3 4B, which gets 30% on GPQA and 54% on Global MMLU-Lite and so would not qualify for this market, has a size of 8 GB in bf16, or 4.4 GB with SFP8 quantization.
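
Those bf16 sizes line up with the usual back-of-the-envelope arithmetic (parameters × bytes per parameter). A tiny sketch of that calculation, ignoring activations, KV cache, and framework overhead:

```python
def weights_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the raw weights in GB (decimal), ignoring
    activations, KV cache, and framework overhead."""
    return num_params * bytes_per_param / 1e9

# bf16 uses 2 bytes per parameter
print(weights_size_gb(12e9, 2))  # Gemma 3 12B -> ~24 GB
print(weights_size_gb(4e9, 2))   # Gemma 3 4B  -> ~8 GB
```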