Neural language models use artificial neural networks to predict the next word in a sequence. They are trained on large datasets and can take contextual information into account when making predictions.

Simple looping RNN

Hidden state: The activations that are updated at each step of a recurrent neural network.

# simple RNN
from fastai.text.all import *   # brings in Module, nn, and F

class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: token embedding
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden: recurrent weights
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: per-token logits

    def forward(self, x):
        h = 0
        for i in range(3):                # one step per token in the sample
            h = h + self.i_h(x[:,i])      # add the embedding of token i
            h = F.relu(self.h_h(h))       # update the hidden state
        return self.h_o(h)                # logits for the next word
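A quick way to sanity-check the forward pass is to push a fake batch through it. A minimal sketch, with illustrative sizes (the real vocab_sz comes from the dataset):

import torch

model = LMModel2(vocab_sz=30, n_hidden=64)  # sizes here are made up
x = torch.randint(0, 30, (8, 3))            # batch of 8 samples, 3 token ids each
print(model(x).shape)                       # torch.Size([8, 30]): one logit per vocab item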

[Figure: the RNN unrolled over three steps; colors represent weight matrices]

learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(8, 1e-3)


In the above, the hidden state is being initialized to 0 on every forward pass. That means each new 3-word sample is treated in isolation: the model never sees the effect of the previous (or following) 3-word sample. We can only predict the word directly after the current sample, and the weights are updated as if each sample were a discrete problem, even though information carried over from earlier samples would help.

The obvious fix is to keep the hidden state between samples. However, doing that naively would create a neural network as deep as the entire number of tokens in the dataset, which would be slow and memory-intensive to backpropagate through. To solve this problem, we can use the detach method in PyTorch to remove all of the gradient history, keeping only the last three layers of gradients. Essentially, our weights continue to update based on activations carried over from previous samples, but the gradients are computed only for the 3 words in the current sample.

BPTT

class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: token embedding
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden: recurrent weights
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: per-token logits
        self.h = 0                                   # hidden state, kept across batches

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()   # keep the activations, drop their gradient history
        return out

    def reset(self): self.h = 0    # called at the start of each epoch
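Since the hidden state now persists across batches, it must be reset at the start of each epoch, which is what the reset method is for. In fastai this can be wired up through the ModelResetter callback; a training sketch (the epoch count and learning rate here are illustrative):

learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)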

Backpropagation through time (BPTT): the error is propagated back through the entire sequence of inputs to update the parameters of the model. The main idea is that in a recurrent neural network the current output is a function not only of the current input but also of the previous inputs and their corresponding outputs, so to update the parameters correctly we need to take the whole sequence of inputs and outputs into account.

However, in RNNs the gradients can explode or vanish when they are propagated through many time steps. Truncated Backpropagation Through Time (TBPTT) addresses this problem by truncating the number of time steps over which the gradients are calculated.

It breaks the input sequence into chunks, and runs backpropagation on each chunk, rather than the entire sequence. It effectively limits the number of steps of the sequence that the gradients are backpropagated through, hence the term "Truncated".
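To make the truncation concrete, here is a minimal pure-PyTorch sketch, with all sizes and names illustrative: it rolls the same kind of RNN over one long fake token stream in 3-token chunks, detaching the hidden state at each chunk boundary so gradients never flow further back than one chunk:

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_sz, n_hidden, chunk_len = 30, 64, 3      # illustrative sizes
emb = nn.Embedding(vocab_sz, n_hidden)
h_h = nn.Linear(n_hidden, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)
opt = torch.optim.SGD([*emb.parameters(), *h_h.parameters(), *h_o.parameters()], lr=1e-3)

tokens = torch.randint(0, vocab_sz, (1, 99))   # one long fake token stream
h = torch.zeros(1, n_hidden)                   # hidden state carried across chunks

for start in range(0, tokens.size(1) - chunk_len, chunk_len):
    chunk = tokens[:, start:start+chunk_len]
    target = tokens[:, start+chunk_len]        # the word right after the chunk
    for i in range(chunk_len):                 # unroll the RNN over the chunk
        h = F.relu(h_h(h + emb(chunk[:, i])))
    loss = F.cross_entropy(h_o(h), target)
    loss.backward()                            # gradients stop at the last detach
    opt.step(); opt.zero_grad()
    h = h.detach()                             # keep the activations, drop their history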

BPTT with 3-token chunks (BPT3C) is a related technique for training RNNs, similar to TBPTT but with a different way of handling the input sequences. The input is divided into chunks of 3 contiguous time steps, and the chunks are processed in a rolling-window fashion: the model is trained on one 3-step window at a time, and the window moves one time step forward after each training step. This allows the model to learn context-aware features from the words surrounding each position. It is similar to TBPTT in that it also truncates the number of steps that gradients are backpropagated through.

This is done by re-organizing the data so that the first batch contains the first sample of each stream, the second batch the second sample, and so on. We want our model to see a chunk of contiguous text of size 3*m on each line of the batch (since each sample is 3 tokens long).

m = len(seqs)//bs
m,bs,len(seqs)

OUT: (328, 64, 21031)

def group_chunks(ds, bs):
    m = len(ds) // bs    # number of batches per epoch
    new_ds = []
    # batch i holds samples i, i+m, i+2m, ..., i+(bs-1)*m, so each
    # batch position walks through its own contiguous stream of text
    for i in range(m): new_ds += L(ds[i + m*j] for j in range(bs))
    return new_ds

To explain, imagine you have a book with 21031 pages and 64 readers (bs = 64). The value of "m" would be 328, since 21031 // 64 = 328 (integer division; the leftover pages are dropped). In effect, the book is divided into 64 contiguous sections of 328 pages each, one section per reader.

The code then arranges the samples in a specific way for each training epoch. In the first batch, it selects samples 0, m, 2*m, ..., (bs-1)*m: the first page of each reader's section. In the second batch, it selects samples 1, m+1, 2*m+1, ..., (bs-1)*m+1: the second page of each section. This pattern continues for all batches, so that over an epoch each line of the batch sees a chunk of contiguous text of size 3*m, which here is 3*328 = 984 tokens.
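A toy run makes the interleaving easy to see; the numbers below are made up (12 samples and a batch size of 4, so m = 3):

toy = list(range(12))           # stand-ins for 12 samples
print(group_chunks(toy, bs=4))  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]

Batch 0 is [0, 3, 6, 9], batch 1 is [1, 4, 7, 10], and so on: each batch position steps through its own contiguous stream.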
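Finally, group_chunks gets wired into the DataLoaders with shuffling turned off and the last partial batch dropped, so the stream alignment is preserved across batches. A sketch following the fastbook recipe (the 80/20 cut point is the usual choice there):

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)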