Key Points
• Apple has collaborated with NVIDIA to integrate its Recurrent Drafter (ReDrafter) technique into NVIDIA TensorRT-LLM, a tool that accelerates large language models (LLMs) on NVIDIA GPUs.
• The integration yields a 2.7x speed-up in generated tokens per second for greedy decoding on a tens-of-billions-parameter production model, reducing both latency and power consumption.
• Developers using NVIDIA GPUs can now benefit from ReDrafter’s accelerated token generation in their production LLM applications, gaining faster inference and lower computational costs.
As a tech journalist, I’m excited to share the latest development in the world of large language models (LLMs). Apple recently published a blog post detailing its collaboration with NVIDIA to deliver faster text generation with LLMs.
ReDrafter: A Novel Approach to Speculative Decoding
Apple’s Recurrent Drafter (ReDrafter) technique, open-sourced earlier this year, is a speculative decoding method for generating text with LLMs that is significantly faster and achieves state-of-the-art performance. It uses a recurrent draft model and combines beam search with dynamic tree attention to efficiently propose and verify multiple candidate token sequences. While the research demonstrated strong results, Apple collaborated with NVIDIA to bring ReDrafter into production.
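To make the idea concrete, here is a minimal, self-contained sketch of the draft-then-verify loop that speculative decoding builds on. The toy draft_model and target_model_greedy functions below are stand-ins invented for illustration, not Apple’s ReDrafter implementation, and for simplicity each beam candidate is verified independently rather than sharing work across common prefixes the way dynamic tree attention does.

```python
# Conceptual sketch of draft-then-verify speculative decoding.
# All model functions are toy stand-ins to show the control flow,
# not ReDrafter itself.
import random

random.seed(0)
VOCAB = list(range(100))

def draft_model(prefix, beam_width=2, draft_len=4):
    """Toy drafter: propose several candidate continuations.
    ReDrafter uses an RNN draft model with beam search; here we just
    sample random beams to keep the example self-contained."""
    return [[random.choice(VOCAB) for _ in range(draft_len)]
            for _ in range(beam_width)]

def target_model_greedy(prefix):
    """Toy target LLM: a deterministic 'greedy' next token for a prefix."""
    return (sum(prefix) * 31 + 7) % len(VOCAB)

def verify(prefix, candidate):
    """Accept the longest prefix of `candidate` that matches what the
    target model would have produced greedily, token by token."""
    accepted, ctx = [], list(prefix)
    for tok in candidate:
        if tok != target_model_greedy(ctx):
            break
        accepted.append(tok)
        ctx.append(tok)
    # One "free" token: the target model's own next prediction.
    accepted.append(target_model_greedy(ctx))
    return accepted

def generate(prompt, num_tokens=32):
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # Keep the candidate whose verified prefix is longest, so a
        # single target-model pass can yield several output tokens.
        candidates = draft_model(out)
        best = max((verify(out, c) for c in candidates), key=len)
        out.extend(best)
    return out[:len(prompt) + num_tokens]

print(generate([1, 2, 3]))
```

With a trained draft model that predicts the target model well, most drafted tokens are accepted, which is where the speed-up comes from; the random drafter above exists only to make the loop runnable.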
Integration with NVIDIA TensorRT-LLM
To integrate ReDrafter into NVIDIA TensorRT-LLM, its inference acceleration framework for NVIDIA GPUs, NVIDIA added new operators or exposed existing ones. The integration lets ML developers using NVIDIA GPUs easily benefit from ReDrafter’s accelerated token generation in their production LLM applications with TensorRT-LLM.
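Because the draft/verify loop runs inside the engine, generation code can look much like ordinary TensorRT-LLM usage. The sketch below is a hypothetical example of running a ReDrafter-enabled engine through TensorRT-LLM’s Python ModelRunner; the paths are placeholders, and the ReDrafter-specific engine-build options are deliberately omitted since the exact flags should be taken from NVIDIA’s TensorRT-LLM documentation.

```python
# Hypothetical usage sketch: running a ReDrafter-enabled TensorRT-LLM
# engine. Paths are placeholders; the engine must already have been
# built with ReDrafter speculative decoding enabled (build flags not
# shown here -- see NVIDIA's TensorRT-LLM docs for the exact options).
from pathlib import Path

from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

ENGINE_DIR = Path("./engines/my-model-redrafter")  # placeholder
TOKENIZER_DIR = Path("./tokenizer")                 # placeholder

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_DIR)
runner = ModelRunner.from_dir(engine_dir=str(ENGINE_DIR))

prompt = "Summarize speculative decoding in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding; the speculative draft/verify work happens inside
# the engine, so the calling code is unchanged from normal generation.
outputs = runner.generate(
    batch_input_ids=[input_ids[0]],
    max_new_tokens=128,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```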
Benchmarking Results
In benchmarking a tens-of-billions-parameter production model on NVIDIA GPUs with the TensorRT-LLM inference acceleration framework and ReDrafter, Apple measured a 2.7x speed-up in generated tokens per second for greedy decoding. These results indicate the technique could significantly reduce the latency users experience, while also using fewer GPUs and consuming less power.
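As a back-of-the-envelope illustration of what that throughput gain means for users, the short calculation below assumes a hypothetical baseline throughput and response length; only the 2.7x factor comes from the reported benchmark.

```python
# Illustrative latency math. BASELINE_TOKS_PER_S and RESPONSE_TOKENS
# are made-up numbers; only the 2.7x factor is from Apple's benchmark.
BASELINE_TOKS_PER_S = 40.0   # assumed baseline throughput
SPEEDUP = 2.7                # reported ReDrafter speed-up
RESPONSE_TOKENS = 500        # assumed response length

base_latency = RESPONSE_TOKENS / BASELINE_TOKS_PER_S
fast_latency = RESPONSE_TOKENS / (BASELINE_TOKS_PER_S * SPEEDUP)
print(f"baseline: {base_latency:.1f}s, with ReDrafter: {fast_latency:.1f}s")
# -> baseline: 12.5s, with ReDrafter: 4.6s (per-token time drops to ~37%)
```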
Conclusion
Apple’s collaboration with NVIDIA demonstrates the potential for faster and more efficient text generation with LLMs. As LLMs increasingly power production applications, improving inference efficiency both reduces computational costs and lowers latency for users. With ReDrafter’s approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.
What’s Next?
To learn more about this work, check out Apple’s website and NVIDIA’s blog post. Developers looking to take advantage of this technology can start exploring the ReDrafter integration in NVIDIA TensorRT-LLM to accelerate their LLM applications. Given the results of this collaboration, we can expect to see even more innovative applications of LLMs in the future.