A team from startup Cactus has released Needle, a lightweight AI model designed to perform tool calling on everyday consumer devices like phones, watches, and glasses. With just 26 million parameters, the model achieves 6,000 tokens per second during the prefill phase and 1,200 tokens per second during decoding, making it practical for edge deployment without cloud dependency.
Breaking free from oversized models
The developers behind Needle argue that most modern large language models (LLMs) are unnecessarily large for tool calling, where the task reduces to selecting the correct tool and formatting its arguments as JSON output. They contend that reasoning-heavy models are overkill for this use case, since tool calling relies more on pattern matching than on deep contextual analysis.
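To make the setting concrete, a tool call of this kind typically boils down to emitting a small JSON object that names a tool and supplies its arguments, which the host application validates and dispatches. The sketch below is illustrative only: the tool names, the schema format, and the exact output shape are assumptions for this example, not Needle's actual interface.

```python
import json

# Hypothetical tool schemas supplied in the model's context (not Needle's real format).
tools = {
    "set_timer": {"duration_seconds": int},
    "send_message": {"recipient": str, "body": str},
}

# A tool call as a small model might emit it: tool name plus JSON arguments.
raw_output = '{"tool": "set_timer", "arguments": {"duration_seconds": 300}}'

call = json.loads(raw_output)
schema = tools[call["tool"]]

# Check the call against the declared schema before dispatching it.
assert set(call["arguments"]) == set(schema)
assert all(isinstance(v, schema[k]) for k, v in call["arguments"].items())
print(call["tool"], call["arguments"])
```

Because the output is constrained to this narrow structure, the argument that pattern matching suffices becomes easier to see: the model must map an utterance to one of a few known schemas rather than generate open-ended text.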
Their experiments led to a striking conclusion: feed-forward networks (FFNs), commonly used in transformer architectures, may be redundant for tasks where structured knowledge is provided in the input rather than memorized in model weights. This insight inspired the creation of a stripped-down architecture focused solely on attention mechanisms and gating, eliminating traditional FFN layers entirely.
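A block along those lines can be sketched as self-attention plus a learned gate on the residual path, with no FFN sublayer. This is a toy reconstruction from the description above, not Needle's published architecture; the weight names, dimensions, and gating choice are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_gated_block(x, wq, wk, wv, wg, wo):
    """One FFN-free block: scaled dot-product self-attention, then a sigmoid gate."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attn = scores @ v
    gate = 1.0 / (1.0 + np.exp(-(x @ wg)))  # sigmoid gate on the residual update
    return x + gate * (attn @ wo)           # residual connection, no FFN sublayer

rng = np.random.default_rng(0)
d = 8                                       # toy hidden size
x = rng.standard_normal((4, d))             # 4 tokens
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(5)]
y = attention_gated_block(x, *params)
print(y.shape)  # (4, 8)
```

The point of the sketch is the omission: all per-token knowledge that an FFN would normally store in its weights is expected to arrive in the input context instead, so the block only needs to route and mix information.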
Training a tool-specialized model
Needle was trained in two phases. First, it underwent pretraining on 200 billion tokens using 16 TPU v6e chips over 27 hours. Next, it was post-trained on 2 billion tokens of synthesized function-calling data, a process that took just 45 minutes. The synthetic dataset was generated with the help of Gemini and covers 15 tool categories, including timers, messaging, navigation, and smart home controls.
The team emphasizes that Needle’s efficiency gains extend beyond tool calling. They suggest that any task involving retrieval-augmented generation (RAG) or access to external structured knowledge can benefit from this approach, as the model no longer needs to store factual information in its weights.
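The same idea can be shown with a minimal RAG-style prompt: the structured knowledge is placed in the context window rather than memorized in weights. The document store and prompt template below are hypothetical illustrations, not anything shipped with Needle.

```python
# Tiny in-memory "retrieval" store (illustrative only).
documents = {
    "doc1": "The thermostat supports temperatures from 10 to 30 Celsius.",
}

def build_prompt(query, retrieved):
    # Structured knowledge goes into the context window, so the model
    # does not need to store these facts in its weights.
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt(
    "What is the maximum thermostat temperature?",
    [documents["doc1"]],
)
print(prompt)
```

Under this framing, the model's job shifts from recall to extraction, which is the regime where a small attention-only architecture is claimed to be sufficient.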
Performance against the competition
In benchmarks, Needle outperforms several larger models on single-shot function-calling tasks, including FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M. However, these competitors often offer broader conversational capabilities and greater capacity for complex reasoning. The Cactus team encourages developers to test Needle on their own tools using the provided playground and to fine-tune the model for specific use cases.
Built for edge devices
Needle is part of a broader initiative called Cactus, an inference engine designed from scratch for mobile devices, wearables, and custom hardware. The team previously discussed Cactus in a Hacker News post, highlighting its goal of enabling AI capabilities on resource-constrained platforms. All components of Needle, including its weights and codebase, are released under the MIT license, promoting open collaboration and accessibility.
With on-device processing becoming increasingly important for privacy and latency, Needle represents a compelling step toward efficient, specialized AI models that don’t sacrifice performance for size. Whether it sparks broader adoption of agentic experiences on consumer hardware remains to be seen, but its technical approach offers a fresh perspective on optimizing AI for real-world constraints.
AI summary
Cactus's Needle is a 26-million-parameter function-calling model that runs on smartphones and wearable devices, notable for its high speed and efficiency.
