Google DeepMind Benchmark for LLM Reasoning
Learn how to train your own o1-preview equivalent for less than $450
What is today’s beat?
RELEASES
🧨 Google DeepMind’s FACTS benchmark
🧨 UC Berkeley’s NovaSky trains a reasoning model for under $450
BUILDER BYTES
⭐️ models, code, repos, and a Hugging Face blog
COMMUNITY
🤩 some talks to keep you entertained, and training to keep you sharp
Your FREE newsletter
share or subscribe
to show support
🎯 RELEASES 🎯
Bringing insights into the latest trends and breakthroughs in AI
Google DeepMind
FACTS Benchmark and dataset

FACTS benchmark leaders
Google DeepMind unveiled the FACTS benchmark and FACTS dataset in December to evaluate LLM factuality. On the new leaderboard you will notice minor shifts in performance standings, with Google’s models outperforming OpenAI’s. Measuring LLM reliability is tricky, but the dataset is new, regularly updated, and provides a comprehensive set of tasks for practitioners to evaluate against.
Introduction of FACTS Benchmark:
FACTS evaluates factuality in LLMs with over 50,000 questions across 12 domains, instead of the 6 common to Chatbot Arena.
Comparison with Previous Benchmarks:
FACTS builds on existing tools like TruthfulQA and MMLU, but expands domain coverage and uses human-verified sources for validation.
Key Metrics of FACTS:
FACTS evaluates models on factual accuracy (the percentage of correct answers) and source-citation reliability, with ground-truth answers supplied by human raters. Each example consists of (1) a system instruction, (2) a user request, and (3) a context document of up to 32k tokens; the answer is a long-form response scored by automated judge models.
FACTS addresses limitations in earlier benchmarks, which often lacked domain diversity and source verification. By doubling the number of categories, the benchmark gives more granular insight into a model’s performance and weaknesses.
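The three-part input format and judge step described above can be sketched in a few lines. Everything here (the function names, the toy keyword-overlap judge) is an illustrative assumption, not the official FACTS pipeline, which uses LLM judge models rather than string matching.

```python
# Sketch of a FACTS-style evaluation record: a system instruction, a user
# request, and a context document (up to 32k tokens in the real benchmark).
# The "judge" below is a toy stand-in for the automated judge models.

MAX_CONTEXT_TOKENS = 32_000  # context limit stated by the benchmark

def build_prompt(system_instruction: str, user_request: str, context_doc: str) -> list[dict]:
    """Assemble the three-part input as a chat-style message list."""
    return [
        {"role": "system", "content": system_instruction},
        {"role": "user", "content": f"{context_doc}\n\n{user_request}"},
    ]

def judge_grounded(answer: str, context_doc: str) -> bool:
    """Toy judge: an answer counts as grounded only if every sentence
    shares at least one word with the context document."""
    context_words = set(context_doc.lower().split())
    sentences = [s for s in answer.split(".") if s.strip()]
    return all(
        any(word in context_words for word in s.lower().split())
        for s in sentences
    )
```

In the real benchmark the long-form response is scored by ensembles of judge models, which is what makes grading 50,000+ open-ended questions tractable.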
NovaSky
Sky-T1-32B-Preview - train for under $450

Training Stats
NovaSky, from the UC Berkeley Sky Computing Lab, has open-sourced Sky-T1-32B-Preview, an efficient reasoning model designed for low-cost training and high adaptability. Known for their contributions to Hugging Face, they have released a reasoning model that performs on par with o1-preview on popular reasoning and coding benchmarks. And yes, their code is all open source.
Cost Efficiency Without Performance Loss:
Sky-T1-32B-Preview scores 82.4% on Math500 and 86.3% on LiveCodeBench-Easy.
Detailed Performance Metrics:
Sky-T1-32B-Preview outperforms Qwen2.5-32B-Instruct on several benchmarks, including a 26.6-percentage-point improvement on AIME2024 (43.3% vs. 16.7%) and a 16.0-percentage-point increase on LiveCodeBench-Medium (56.8% vs. 40.8%).
Model Size Matters:
The NovaSky team found that models with fewer than 32B parameters showed only small gains. Sky-T1-32B-Preview was fine-tuned from Qwen2.5-32B-Instruct on 17K training examples.
NovaSky’s Sky-T1 lets ML practitioners build their own o1-style reasoning model for less than $450, and the team provides the code and datasets to replicate it. This is possible thanks to Alibaba making the Qwen2.5-32B-Instruct model and its dataset public.
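To get a feel for what a $450 budget buys, here is a back-of-envelope calculation. The GPU hourly rate and node size below are assumptions for illustration, not figures from NovaSky.

```python
# Back-of-envelope check on the "$450" training budget.

BUDGET_USD = 450        # training budget reported by NovaSky
GPU_HOURLY_USD = 3.0    # assumed on-demand H100 price per GPU-hour
GPUS_PER_NODE = 8       # assumed single 8-GPU node

gpu_hours = BUDGET_USD / GPU_HOURLY_USD   # total GPU-hours the budget buys
node_hours = gpu_hours / GPUS_PER_NODE    # wall-clock hours on one node
print(gpu_hours, node_hours)              # 150.0 GPU-hours, 18.75 node-hours
```

Under these assumed rates, the budget covers roughly a day of single-node training, which is plausible for a supervised fine-tune of a 32B model on 17K examples.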
Read their blog post for more details
⚙️ BUILDER BYTES ⚙️
Informing builders of the latest technologies and how to use them
Trending
MODEL | PTA-1: Controlling Computers with Small Models
PTA-1 (Prompt-to-Automation) is a vision-language model for computer and phone automation, based on Florence-2. With only 270M parameters, it outperforms much larger models at GUI text and element localization, enabling low-latency computer automation with local execution.
Try it here for free.
PAPER | Test-time Computing: from System-1 Thinking to System-2 Thinking
Explores how enhancing computational efforts during inference can improve model performance, transitioning from intuitive (System-1) to deep reasoning (System-2) approaches.
BLOG | CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard
Hugging Face analyzes the CO₂ emissions of over 3,000 models from the Open LLM Leaderboard, highlighting the trade-off between model size, performance, and sustainability, and urging transparency for greener AI.
PAPER | Search-o1: Agentic Search-Enhanced Large Reasoning Models
Introduces Search-o1, a framework that augments large reasoning models (LRMs) with an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module.
🤩 COMMUNITY 🤩
Cultivating curiosity with the latest in professional development
TALKS
LEARNING
THANK YOU
Our Mission at AlphaWise
AlphaWise strives to cultivate a vibrant and informed community of AI enthusiasts, developers, and researchers. Our goal is to share valuable insights into AI, academic research, and the software that brings it to life. We focus on bringing you the most relevant content, from groundbreaking research and technical articles to expert opinions and curated community resources.
Looking to connect with us?
We actively seek to get involved in the community through events, talks, and activities. Email us at [email protected]