LLMs have a maximum prompt size for computational reasons: the amount of computation needed to generate text grows quickly as the prompt gets longer (self-attention cost grows roughly quadratically with the number of tokens).
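A rough back-of-the-envelope sketch (my numbers, not from any paper) of why compute blows up with prompt length: self-attention compares every token against every other token, so the score matrix alone has n^2 entries.

```python
# Rough sketch: why compute grows fast with prompt length.
# Self-attention compares every token with every other token, so the
# QK^T score matrix alone has n^2 entries for a prompt of n tokens.

def attention_score_flops(n_tokens: int, d_model: int = 768) -> int:
    """Approximate multiply-adds for the score matrix of one attention layer."""
    return n_tokens * n_tokens * d_model  # n^2 * d

for n in (1_024, 4_096, 8_192, 32_768):
    print(f"{n:>6} tokens -> {attention_score_flops(n):,} multiply-adds per layer")
# Doubling the prompt roughly quadruples this term, which is why context windows are capped.
```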

ChatGPT has an 8192-token context window: https://mobile.twitter.com/goodside/status/1598874674204618753?t=70_OKsoGYAx8MY38ydXMAA&s=19

DeepMind RETRO

OpenAI's context length of 4k tokens means it's hard to pass larger documents, like an entire Google Drive or codebase, into the context window. You either have to fine-tune on the documents or retrieve from them via embedding search.
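For reference, the usual embedding-search workaround looks something like the sketch below: chunk the docs, embed them, and paste the best-matching chunk into the prompt. The toy bag-of-words `embed()` here is just a stand-in for a real embedding model.

```python
# Minimal sketch of the usual workaround (not RETRO): chunk the documents,
# embed each chunk, and paste the top-scoring chunk into the prompt.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned embedder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["notes about retrieval augmented language models",
        "grocery list milk eggs bread",
        "retro retrieves chunks from a database of text"]

query = "how does retro retrieve text chunks"
top = max(docs, key=lambda d: cosine(embed(query), embed(d)))
prompt = f"Context: {top}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # the retrieved chunk is simply concatenated into the prompt string
```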

This paper is a solution to that. Instead of doing a bunch of hacks involving embedding search, string manipulation, etc., RETRO lets you condition your output on large amounts of external docs, e.g. your Google Drive, in a differentiable manner.

RETRO's core idea: add an external database holding a collection of informative documents, and let the model pull relevant docs from this DB at inference time to improve its predictions.


External documents are retrieved from the DB at the level of contiguous token chunks instead of individual tokens, which reduces storage and computation requirements by a large linear factor.
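Quick arithmetic on that linear factor, assuming 64-token chunks (the chunk size RETRO uses) and an illustrative corpus size:

```python
# Back-of-the-envelope for the "large linear factor": indexing per 64-token
# chunk instead of per token shrinks the number of index entries by 64x.
tokens_in_db = 2_000_000_000_000          # illustrative retrieval corpus size
chunk_len = 64
entries_per_token_index = tokens_in_db
entries_per_chunk_index = tokens_in_db // chunk_len
print(entries_per_token_index // entries_per_chunk_index)  # -> 64, the linear reduction
```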

=> RETRO constructs a key-value database where values store raw chunks of text tokens and keys are frozen BERT embeddings. The BERT model is frozen so the embeddings never need to be re-computed.
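A minimal sketch of how I picture that key-value store; `frozen_encoder_embed` is a hypothetical stand-in for the frozen BERT encoder, not RETRO's actual code.

```python
# Sketch of the key-value store described above (my reading, not the paper's code):
# values are raw 64-token chunks, keys are embeddings from a *frozen* encoder,
# so keys never need to be recomputed as the language model trains.
import numpy as np

CHUNK_LEN = 64  # RETRO retrieves at the granularity of 64-token chunks

def frozen_encoder_embed(chunk_tokens: list[int]) -> np.ndarray:
    """Stand-in for a frozen BERT encoder; any fixed, never-updated embedder works here."""
    rng = np.random.default_rng(abs(hash(tuple(chunk_tokens))) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def build_db(token_stream: list[int]):
    chunks = [token_stream[i:i + CHUNK_LEN]
              for i in range(0, len(token_stream) - CHUNK_LEN + 1, CHUNK_LEN)]
    keys = np.stack([frozen_encoder_embed(c) for c in chunks])  # (num_chunks, dim)
    return keys, chunks  # keys are searched, chunks (values) are returned verbatim

def retrieve(query_chunk: list[int], keys: np.ndarray, chunks: list[list[int]], k: int = 2):
    q = frozen_encoder_embed(query_chunk)
    scores = keys @ q                       # cosine similarity (keys are unit-norm)
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]         # nearest-neighbor chunks, as raw tokens
```

At real scale the brute-force dot product would be replaced by an approximate nearest-neighbor index, but the key/value split is the same.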

=> The training sequences are split into chunks, which are augmented with the nearest neighbors retrieved from the database. An encoder-decoder architecture integrates the retrieved chunks into the model's predictions.
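A shape-level sketch of that augmentation; the sizes besides the 64-token chunk length are illustrative, and the retrieval here is faked with random tokens.

```python
# Shape sketch of the training-time augmentation (my reading of the setup):
# each training sequence is split into chunks, and each chunk is paired with
# k retrieved neighbors that the encoder side of the model attends over.
import numpy as np

seq_len, chunk_len, k, neighbor_len = 2048, 64, 2, 128  # neighbor = chunk + its continuation
n_chunks = seq_len // chunk_len                         # 32 chunks per sequence

sequence = np.random.randint(0, 32000, size=(seq_len,))             # token ids
chunks = sequence.reshape(n_chunks, chunk_len)                       # (32, 64)

# Placeholder retrieval: in RETRO these come from the frozen-key database.
neighbors = np.random.randint(0, 32000, size=(n_chunks, k, neighbor_len))  # (32, 2, 128)

# The decoder reads `chunks` autoregressively; the encoder encodes `neighbors`,
# and chunked cross-attention lets each chunk's tokens attend to its own neighbors.
print(chunks.shape, neighbors.shape)
```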

(when many relevant docs can be crushed down to a fixed-size representation, less context length is required)


Retrieval-enhanced autoregressive token models