LLMs have a maximum prompt size for computational reasons: the amount of computation needed to generate text grows quickly as the prompt gets longer (self-attention cost grows roughly quadratically with the number of tokens).
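A rough back-of-the-envelope sketch (my numbers, not from any paper) of why compute blows up with prompt length: self-attention compares every token against every other token, so the score matrix alone has n^2 entries.

```python
# Rough sketch: why compute grows fast with prompt length.
# Self-attention compares every token with every other token, so the
# QK^T score matrix alone has n^2 entries for a prompt of n tokens.

def attention_score_flops(n_tokens: int, d_model: int = 768) -> int:
    """Approximate multiply-adds for the score matrix of one attention layer."""
    return n_tokens * n_tokens * d_model  # n^2 * d

for n in (1_024, 4_096, 8_192, 32_768):
    print(f"{n:>6} tokens -> {attention_score_flops(n):,} multiply-adds per layer")
# Doubling the prompt roughly quadruples this term, which is why context windows are capped.
```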

ChatGPT has an 8192-token context window: https://mobile.twitter.com/goodside/status/1598874674204618753?t=70_OKsoGYAx8MY38ydXMAA&s=19

DeepMind RETRO

OpenAI's context length of 4k tokens means it's hard to pass larger documents, like an entire Google Drive or codebase, into the context window. You either have to fine-tune on the documents or retrieve from them via embedding search.
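For reference, the usual embedding-search workaround looks something like the sketch below: chunk the docs, embed them, and paste the best-matching chunk into the prompt. The toy bag-of-words `embed()` here is just a stand-in for a real embedding model.

```python
# Minimal sketch of the usual workaround (not RETRO): chunk the documents,
# embed each chunk, and paste the top-scoring chunk into the prompt.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned embedder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["notes about retrieval augmented language models",
        "grocery list milk eggs bread",
        "retro retrieves chunks from a database of text"]

query = "how does retro retrieve text chunks"
top = max(docs, key=lambda d: cosine(embed(query), embed(d)))
prompt = f"Context: {top}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the retrieved chunk is simply concatenated into the prompt string
```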

This paper is a solution to that. Instead of doing a bunch of hacks involving embedding search, string manipulation, etc., RETRO lets you condition your output on large amounts of external docs, e.g. your Google Drive, in a differentiable manner.

RETRO's core idea: add an external database holding a collection of informative documents, and let the model pull relevant docs from this DB at inference time to improve its predictions.


External documents are retrieved from the DB at the level of contiguous token chunks instead of individual tokens, which reduces storage and computation requirements by a large linear factor.
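Quick arithmetic on that linear factor, assuming 64-token chunks (the chunk size RETRO uses) and an illustrative corpus size:

```python
# Back-of-the-envelope for the "large linear factor": indexing per 64-token
# chunk instead of per token shrinks the number of index entries by 64x.
tokens_in_db = 2_000_000_000_000          # illustrative retrieval corpus size
chunk_len = 64
entries_per_token_index = tokens_in_db
entries_per_chunk_index = tokens_in_db // chunk_len
print(entries_per_token_index // entries_per_chunk_index)  # -> 64, the linear reduction
```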

=> RETRO constructs a key-value database where values store raw chunks of text tokens and keys are frozen BERT embeddings. The BERT model is frozen so the embeddings never need to be re-computed.
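A minimal sketch of how I picture that key-value store; `frozen_encoder_embed` is a hypothetical stand-in for the frozen BERT encoder, not RETRO's actual code.

```python
# Sketch of the key-value store described above (my reading, not the paper's code):
# values are raw 64-token chunks, keys are embeddings from a *frozen* encoder,
# so keys never need to be recomputed as the language model trains.
import numpy as np

CHUNK_LEN = 64  # RETRO retrieves at the granularity of 64-token chunks

def frozen_encoder_embed(chunk_tokens: list[int]) -> np.ndarray:
    """Stand-in for a frozen BERT encoder; any fixed, never-updated embedder works here."""
    rng = np.random.default_rng(abs(hash(tuple(chunk_tokens))) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def build_db(token_stream: list[int]):
    chunks = [token_stream[i:i + CHUNK_LEN]
              for i in range(0, len(token_stream) - CHUNK_LEN + 1, CHUNK_LEN)]
    keys = np.stack([frozen_encoder_embed(c) for c in chunks])  # (num_chunks, dim)
    return keys, chunks  # keys are searched, chunks (values) are returned verbatim

def retrieve(query_chunk: list[int], keys: np.ndarray, chunks: list[list[int]], k: int = 2):
    q = frozen_encoder_embed(query_chunk)
    scores = keys @ q                       # cosine similarity (keys are unit-norm)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]         # nearest-neighbor chunks, as raw tokens
```

At real scale the brute-force dot product would be replaced by an approximate nearest-neighbor index, but the key/value split is the same.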

=> The training sequences are split into chunks, which are augmented with the nearest neighbors retrieved from the database. An encoder-decoder architecture integrates the retrieved chunks into the model's predictions.
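A shape-level sketch of that augmentation; the sizes besides the 64-token chunk length are illustrative, and the retrieval here is faked with random tokens.

```python
# Shape sketch of the training-time augmentation (my reading of the setup):
# each training sequence is split into chunks, and each chunk is paired with
# k retrieved neighbors that the encoder side of the model attends over.
import numpy as np

seq_len, chunk_len, k, neighbor_len = 2048, 64, 2, 128  # neighbor = chunk + its continuation
n_chunks = seq_len // chunk_len                         # 32 chunks per sequence

sequence = np.random.randint(0, 32000, size=(seq_len,))             # token ids
chunks = sequence.reshape(n_chunks, chunk_len)                       # (32, 64)

# Placeholder retrieval: in RETRO these come from the frozen-key database.
neighbors = np.random.randint(0, 32000, size=(n_chunks, k, neighbor_len))  # (32, 2, 128)

# The decoder reads `chunks` autoregressively; the encoder encodes `neighbors`,
# and chunked cross-attention lets each chunk's tokens attend to its own neighbors.
print(chunks.shape, neighbors.shape)
```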

(when many relevant docs can be crushed down to a fixed-size representation, less context length is required)


Retrieval-enhanced autoregressive token models