The Future of the Transformer Pt 2 with Trey Kollmer

Trey Kollmer and Nathan Labenz delve into AI research, discussing new techniques to reduce global compute and enhance LLM memory.


Trey Kollmer returns to discuss the latest AI research revelations with Nathan Labenz. They explore how new techniques will shave 10% off global compute needs, how analogical prompting beats few-shot prompting, and how compressive historical records can increase LLM memory and retention abilities. If you need an ERP platform, check out our sponsor NetSuite:

🎬 The show outline:
Think Before You Speak:
SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking:
Large Language Models as Analogical Reasoners:
Ring Attention:

(00:00:00) - Episode Preview
(00:01:11) - Paper: Think Before You Speak
(00:03:13) - Multimodal models for combining vision and language
(00:04:19) - Backspace Paper
(00:06:25) - Chain of thought prompting for step-by-step reasoning
(00:09:14) - Backspacing in language models to correct mistakes
(00:12:05) - Attention sinks for expanding context length
(00:12:41) - Paper: Large Language Models as Analogical Reasoners
(00:15:24) - Pause tokens for language models to "think"
(00:18:23) - Analogical prompting to recall relevant examples
(00:20:52) - Long context windows for language models
(00:23:20) - Markdown works best for OpenAI
(00:24:23) - Ring attention to break memory constraints
(00:26:15) - Paper: StreamingLLMs
(00:27:46) - Potential for superhuman performance with longer contexts
(00:31:01) - Dynamic context window adjustment at runtime
(00:33:53) - Retention and memory capabilities for transformers
(00:37:12) - Planning algorithms combined with memory and scale
(00:39:49) - Paper: Ring Attention
(00:42:35) - Executive assistant prompting and critique
(00:45:23) - Self-RAG for language models to find own examples
(00:48:02) - Timelines and predictions for future capabilities
(00:50:37) - Applications like analyzing long texts and scripts
(00:53:15) - Local versus global attention in transformers
(00:55:59) - Architectural changes versus just training adjustments
(00:58:41) - Pre-training strategies like random start points
(01:01:16) - Representing transformers for intuition versus efficiency

Producer: Vivian Meng
Executive Producers: Amelia Salyers and Erik Torenberg
Editor: Graham Bessellieu
