Tokens and Token Elasticity
What is today’s beat?
One of our readers, Johnathan Lightfoot, asked a question about a paper we posted on BlueSky.

the post

the comment from Johnathan
As a CTO who actively develops LLM and computer vision solutions, I find this question extremely intriguing. For me, it may open up LLM’s utility for smaller devices.
So today let’s look at the paper, and answer his question!
🎯 BACKSTORY 🎯
Before diving in, let's look at some relevant context first.
Karpathy Explains Tokens
Tokens are needed to encode words, or convert them, into numbers. If you want to learn about this from one of the best online teachers and an influential R&D leader from Tesla and OpenAI, Andrej Karpathy, then these resources will not disappoint.
Video (2 hours): Let’s Build a GPT Tokenizer
Notebook (2 minutes): Tiktoken_emoji
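If you want to poke at tokens yourself, here is a minimal sketch using the tiktoken library that the notebook above is built around; the encoding name and sample sentence are my own assumptions for illustration.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models (assumed here for illustration)
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens convert words into numbers."
token_ids = enc.encode(text)       # text -> list of integer token ids
print(token_ids)                   # the numbers the model actually sees
print(len(token_ids), "tokens")    # what the sentence costs in tokens
print(enc.decode(token_ids))       # ids -> back to the original text
```

Every token the model reads or writes costs compute (and money), which is why the rest of this issue is about spending fewer of them.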
Important Concepts
Static reasoning is the most common way today's models decide how many tokens to spend on reasoning; in other words, a fixed token budget is the current default strategy. Let's look at some common ways to get the job done (a short prompt sketch follows the list below).
Common Static Reasoning Processes
Chain-of-Thought (CoT): (this is today’s focus)
Approach: Breaks down problems into intermediate reasoning steps, ensuring logical flow.
Example Use Case: Solving math problems or multi-step reasoning tasks.
Issue: Inefficient for simpler tasks, as it generates unnecessary intermediate steps.
Zero-Shot Reasoning: (this is today’s focus)
Approach: Directly predicts the answer without intermediate steps or prior examples.
Example Use Case: Answering factual questions.
Issue: Lacks adaptability and struggles with complex, multi-step tasks.
Few-Shot Reasoning:
Approach: Provides examples within the input to guide reasoning.
Example Use Case: Tasks requiring understanding of a specific format or style.
Issue: Inefficient in token usage because the examples take up a significant part of the input.
ReAct (Reasoning + Acting):
Approach: Integrates reasoning with actions for tasks like retrieving external information.
Example Use Case: Retrieval-augmented generation tasks.
Issue: Token inefficiency arises from repeatedly generating reasoning and action steps for each query.
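To make those token trade-offs concrete, here is a rough sketch that builds a zero-shot, a few-shot, and a CoT prompt for the same made-up question and counts the input tokens each one costs (tiktoken and the cl100k_base encoding are assumptions for illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
question = "A shirt costs $20 and is discounted 25%. What is the final price?"

# Zero-shot: just the question; the answer is predicted directly.
zero_shot = f"Q: {question}\nA:"

# Chain-of-Thought: ask for intermediate reasoning steps.
cot = f"Q: {question}\nA: Let's think step by step."

# Few-shot: prepend worked examples to guide the format.
few_shot = (
    "Q: A book costs $10 and is discounted 10%. What is the final price?\n"
    "A: The discount is $1, so the final price is $9.\n"
    f"Q: {question}\nA:"
)

for name, prompt in [("zero-shot", zero_shot), ("CoT", cot), ("few-shot", few_shot)]:
    print(f"{name:>9}: {len(enc.encode(prompt))} prompt tokens")
```

Note that few-shot pays its overhead on the input side, while CoT pays most of its overhead on the output side, in the intermediate steps the model generates. That output-side cost is exactly what the paper below goes after.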
Key Issues
Static reasoning processes in AI systems come with several key issues.
Lack of adaptability prevents them from adjusting token usage to match task complexity, leading to inefficiencies in simpler tasks and excessive reasoning for complex ones.
Overhead in token consumption arises from methods like Few-Shot and Chain-of-Thought (CoT), which introduce unnecessary tokens for intermediate steps or guidance.
Task-specific limitations mean these processes are optimised for certain task types, with Zero-Shot reasoning struggling on complex, multi-step problems, and CoT often overproducing reasoning steps.
Finally, scalability challenges emerge as static approaches rely on fixed formats and reasoning paths, making them ill-suited for diverse and dynamic task requirements.
Basically, the R&D world has focused on making LLMs accurate (i.e. benchmark results) without giving much consideration to optimising their responses.
Models You Know
Let's look at what some major models use ...
OpenAI GPT-4:
Notes: Primarily uses static token budgets in methods like Few-Shot or Chain-of-Thought prompting. Token allocation remains consistent regardless of task complexity.
Issue: Inefficiency in simpler tasks due to fixed token usage, leading to unnecessary costs.
Google's PaLM:
Note: Similar to GPT-4, PaLM relies on Few-Shot prompting and CoT reasoning, consuming a static number of tokens tailored to specific prompt formats.
Issue: Limited scalability for dynamic task complexities; excessive token usage in simpler scenarios.
Anthropic's Claude:
Notes: Often employs Zero-Shot and Few-Shot techniques with limited dynamic adjustments.
Issue: Difficulty balancing efficiency and accuracy across diverse tasks.
This information is extremely hard to come by, and it is not something LLM providers are shouting out to the world. If you can find out what OpenAI o1 uses, I'm all ears! What you see here is a best effort from my personal searches and a web-scraping agent I created that spent 30 minutes scouring the internet!
⭐️ PAPER ⭐️
Overview
A brief snapshot of the paper

the paper
Key issues being addressed
We are looking at reasoning models and their static reasoning processes
Methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps
these intermediate steps also incur significant overhead in token usage and may devalue the output due to “extra useless information”
Observe the three images below. The first is a direct answer: short and wrong. The second implements chain of thought (process thinking) and is correct, but the intermediate steps are a bit lengthy. The third is much better and more efficient. Take note of the prompts:
- let’s think step by step
- let’s think step by step and use less than 50 tokens.

(a) Direct answering (15 output tokens).

(b) Vanilla CoT (258 output tokens).

(d) CoT with a reasonable budget (86 output tokens).
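If you want to reproduce this comparison yourself, a sketch like the one below sends the same question with the three prompt styles and reports the output token count. The OpenAI Python client, the gpt-4o-mini model name, and the question are assumptions for illustration, not what the authors used.

```python
# pip install openai   (assumes OPENAI_API_KEY is set in the environment)
from openai import OpenAI

client = OpenAI()
question = "A shirt costs $20 and is discounted 25%. What is the final price?"

prompts = {
    "direct":       question,
    "vanilla CoT":  f"{question}\nLet's think step by step.",
    "budgeted CoT": f"{question}\nLet's think step by step and use less than 50 tokens.",
}

for name, prompt in prompts.items():
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{name:>12}: {resp.usage.completion_tokens} output tokens")
```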
Observations
The authors observed a problem they call "Token Redundancy in LLM Reasoning" where LLMs produce unnecessarily lengthy reasoning processes.
The authors observe that specifying a reasonable token budget (not fixed, but dynamic) in prompts can significantly reduce token usage without compromising answer correctness.
To address this, they propose a framework that dynamically estimates optimal token budgets based on the complexity of the problem, aiming to balance efficiency and accuracy in LLM reasoning
During their search for an optimal token budget (the budget search process), they observe a "token elasticity" phenomenon as they approach the minimal budget. In the context of large language models (LLMs), token elasticity refers to what happens when strict token limits are imposed on the model's output (the current practice): a reasonable token budget effectively reduces token usage during reasoning, but setting the budget too low triggers token elasticity, and the actual token cost climbs back up. This is where the idea of TALE (Token-Budget-Aware LLM Reasoning) is born.
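Before we get to TALE's workflow, here is a rough way to see token elasticity for yourself: sweep the budget downward on one question and watch the actual output token count. This is a simplified illustration (the OpenAI client, model name, question, and prompts are my assumptions), not the paper's exact budget-search procedure.

```python
from openai import OpenAI

client = OpenAI()
question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

def budgeted_cot_tokens(budget: int) -> int:
    """Ask with a budget-constrained CoT prompt and return the output tokens actually used."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content":
                   f"{question}\nLet's think step by step and use less than {budget} tokens."}],
    )
    return resp.usage.completion_tokens

# Sweep the budget downward. Per the paper's observation, once the budget drops
# below some minimal value, the real token cost tends to climb back up (elasticity).
for budget in [400, 200, 100, 50, 25, 10]:
    print(f"budget {budget:>3} -> actual {budgeted_cot_tokens(budget)} output tokens")
```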

The workflow of TALE
Analogy
The key intuition behind estimated token budget is the human-like thinking paradigm. When presented with a mathematical question, although it may take humans a few minutes to calculate the answer, they can typically estimate the time or effort required to solve it after just briefly reviewing the question.
For example, when presented with a question from primary school arithmetic and another from college-level calculus, a human may not immediately provide the answers. Still, it is easy to infer which can be solved in seconds. This "initial estimate" of the complexity and effort changes how the problem is approached, and its response.
Remember your college days and those stressful tests with limited time and long-form answers? This is basically the same approach: think about the complexity and effort required before you solve the problem. Hey, let's see if a model can do that!
Experiment
Basically, they said “hey let's use a small LLM to estimate the budget required to answer the prompt.”
The target of TALE is to balance the LLM performance and extra redundant token costs.
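Here is a minimal sketch of that two-step idea under stated assumptions (the prompts, model names, and fallback value are mine, not the paper's exact implementation): a cheap model first guesses a token budget, then the main model answers under that budget.

```python
import re
from openai import OpenAI

client = OpenAI()

def estimate_budget(question: str) -> int:
    """Step 1: ask a small, cheap model how many output tokens the question needs."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative "small estimator" model
        messages=[{"role": "user", "content":
                   "Estimate how many output tokens are needed to reason through and answer "
                   f"the following question. Reply with a single integer.\n\n{question}"}],
    )
    match = re.search(r"\d+", resp.choices[0].message.content)
    return int(match.group()) if match else 200  # generous fallback if parsing fails

def answer_with_budget(question: str, budget: int) -> str:
    """Step 2: budget-aware CoT on the main model, using the estimated budget."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative "main" model
        messages=[{"role": "user", "content":
                   f"{question}\nLet's think step by step and use less than {budget} tokens."}],
    )
    return resp.choices[0].message.content

question = "A rectangle is 7 cm by 12 cm. What is its area?"
budget = estimate_budget(question)
print(f"estimated budget: {budget} tokens")
print(answer_with_budget(question, budget))
```

The extra estimator call stays cheap because its own output is a single number; the savings come from the main model no longer rambling through unnecessary intermediate steps.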
Results
TALE reduces token usage by 68.9% on average with less than a 5% accuracy loss
Outperforms Vanilla CoT in cost-effectiveness while generalising well across various LLMs.
🤩 REFLECTION 🤩
Reflection
Okay, so let’s finally get back to Johnathan’s inquiry.

Efficiency and the Runtime Environment
LLMs primarily run on these two types of systems:
cloud computing - right now
edge devices - a bit right now, more in the future, but it's super hard
I believe the biggest impact will be in smart autonomous devices in robotics.
Impacts on Cloud:
Resource Optimization: cloud inference uses massive amounts of energy, and tons of water for cooling. It is currently not sustainable.
Scalability: LLM usage can be load-balanced more easily. Especially important for fault-tolerant systems.
Latency: reduces the payload between the user and the cloud resource, and improves real-time speed.
Impacts on Edge Robotics and IoT:
Currently, it is super hard to run LLMs on edge devices, and it's giving me gray hair. I have worked with the Nvidia Jetson line of products for 5 years now. In March of last year, I had the opportunity to meet some OG creators, like Deepu Talla, at Nvidia headquarters in California. Keep an eye out for the Nvidia Thor, to be released in 2025 Q1.
Here is what I gather:
increased intelligence on edge devices (e.g. a mini computer on a robot that does the computational tasks)
reduced computational load, faster real-time response times, and lower battery usage compared to current LLM usage
Agentic AI
Agentic AI refers to systems that autonomously act, reason, and interact with their environments. Tooling is super important here. A solution to static reasoning and fixed token budgets can improve agentic AI by enhancing autonomy, adaptability, and overall effectiveness, and ultimately connect machines together. Keep an eye out for things like the Model Context Protocol - i.e. a way for one AI system to talk with another AI system. Here is what I gather:
Exploration and Decision-Making: supports diverse reasoning patterns.
Instead of committing to a fixed reasoning path (e.g., Chain-of-Thought), these systems can efficiently explore multiple approaches and adapt dynamically based on feedback.
Agents need to be real-time if they are making real-world decisions. Speed is important, whether it's latency from cloud to user or computational time on an edge device.
areas with low computational resources (robotics, IoT devices, drones, etc.) can become "more intelligent" where things like SWaP (size, weight, and power) are critical constraints.
THANK YOU
Our Mission at AlphaWise
AlphaWise strives to cultivate a vibrant and informed community of AI enthusiasts, developers, and researchers. Our goal is to share valuable insights into AI, academic research, and software that brings it to life. We focus on bringing you the most relevant content, from groundbreaking research and technical articles to expert opinions to curated community resources.
Looking to connect with us?
We actively seek to get involved in the community through events, talks, and activities. Email us at [email protected]