“Monty Python and the Holy Grail” (sentence generated by the language model built in this post)

Sounds like something King Arthur would say.

In this post we’ll talk about a possible first approach to language models: N-grams! We will find a way to assign a degree of certainty (not really a probability) to sequences of words and, once we’re done, sample randomly from our distribution. If you feel curious about the math and code of an N-gram model, expand the sections below. If not, enjoy this picture from the movie “Monty Python and the Holy Grail”, and definitely check it out if you haven’t seen it.


The Math

A language model is a function that maps a sentence to a degree of certainty. Usually we normalize it to lie between zero and one, so it resembles a probability, though it isn’t one. An N-gram model is the simplest language model, so let’s start there. We also call a sequence of n words an n-gram:

\[\begin{align*} \vec{w}&=w_{1}^{n}=(w_{1},w_{2},\ldots,w_{n})\\ f(\vec{w})&=p \end{align*}\]

The standard way to compute the probability of a sequence of events is through the chain rule:

\[\begin{align*} f\left(w_{1}^{n}\right) &=f((w_{1},w_{2},\ldots,w_{n}))\\ &=f\left(w_{1}\right) f\left(w_{2} | w_{1}\right) f\left(w_{3} | w_{1}^{2}\right) \ldots f\left(w_{n} | w_{1}^{n-1}\right) \\ &=\prod_{k=1}^{n} f\left(w_{k} | w_{1}^{k-1}\right) \end{align*}\]
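To make this concrete, a three-word sentence like “the knight says” factorizes as:

\[\begin{align*} f(\text{the knight says}) = f(\text{the})\, f(\text{knight} \,|\, \text{the})\, f(\text{says} \,|\, \text{the knight}) \end{align*}\]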

How do we compute the degree of certainty of a word $w_{k}$ given a context $w_{1}^{k-1}$? We count how many times the phrase $w_{1}^{k}$ and the context $w_{1}^{k-1}$ appear in our corpus, and take the ratio:

\[\begin{align*} f\left(w_{k} | w_{1}^{k-1}\right)&= \frac{C(w_{1}^{k})}{C(w_{1}^{k-1})} \end{align*}\]

This is impractical: most long contexts appear rarely, if at all, so we would need an enormous amount of data for a good estimate.

We can also see here that we won’t be able to use standard probabilities, since for our count-based estimates we cannot guarantee the following equality:

\[\begin{align*} f(w_{k-1},w_{k})&=f(w_{k-1})f(w_{k}|w_{k-1})\\ &\neq f(w_{k})f(w_{k-1}|w_{k}) \end{align*}\]

So, we will relax the problem and come up with an estimate of $f(w_{k}|w_{1}^{k-1})$:

\[\begin{align*} f\left(w_{k} | w_{1}^{k-1}\right) \approx f\left(w_{k} | w_{k-1}\right) \end{align*}\]

Mathematicians call this approximation the Markov assumption. The 2-gram (bigram) model then becomes:

\[\begin{align*} f\left(w_{1}^{n}\right) &=\prod_{k=1}^{n} f\left(w_{k} | w_{k-1}\right) \end{align*}\]

where $w_{0}$ is understood to be a start-of-sentence marker.

How do we compute $f\left(w_{k} | w_{k-1}\right)$?

\[\begin{align*} f\left(w_{k} | w_{k-1}\right)&=\frac{C\left(w_{k-1} w_{k}\right)}{\sum_{w} C\left(w_{k-1} w\right)}\\ &=\frac{C(w_{k-1},w_{k})}{C(w_{k-1})} \end{align*}\]
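As a quick worked example, suppose our whole corpus is the two sentences “the knight says ni .” and “the knight runs away .”. Then “the” appears twice, both times followed by “knight”, while “knight” is followed once by “says” and once by “runs”:

\[\begin{align*} f(\text{knight} \,|\, \text{the}) &= \frac{C(\text{the knight})}{C(\text{the})} = \frac{2}{2} = 1\\ f(\text{says} \,|\, \text{knight}) &= \frac{C(\text{knight says})}{C(\text{knight})} = \frac{1}{2} \end{align*}\]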

Now is a good time to pause for a moment and implement this model.

Breaking Down the Implementation

Since we are going to represent the text in terms of N-sized sequences, we’ll split the tokenized text into overlapping N-sized tuples and compute their frequencies.

import numpy as np
from collections import Counter

def compute_ngrams(text, n):
    # Slide a window of size n over the token list and collect the
    # resulting n-grams as tuples; a text of exactly n tokens is its
    # own single n-gram.
    if len(text) != n:
        return [tuple(text[i:(i + n)]) for i in range(len(text) - n + 1)]
    return [tuple(text)]

def compute_ngram_counts(text, n):
    # Count the n-grams and their (n-1)-sized contexts.
    ngram_counts = Counter(compute_ngrams(text, n))
    context_counts = Counter(compute_ngrams(text, n - 1))
    return ngram_counts, context_counts
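As a quick sanity check, here is what these helpers return on a small hypothetical token list (the tokenization itself is up to you):

tokens = ['the', 'knight', 'says', 'ni', '.']
compute_ngrams(tokens, 2)
# [('the', 'knight'), ('knight', 'says'), ('says', 'ni'), ('ni', '.')]
ngram_counts, context_counts = compute_ngram_counts(tokens, 2)
ngram_counts[('the', 'knight')]   # 1
context_counts[('the',)]          # 1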

To compute the probability of a given N-gram, we divide its absolute frequency by the frequency of its context (the N-gram minus its last word).

def compute_ngram_probability(ngram, ngram_counts, context_counts):
    # Relative frequency: count of the full n-gram divided by the count
    # of its context (all but the last word).
    return ngram_counts[ngram] / context_counts[ngram[:-1]]
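Continuing the toy example above, the bigram ('the', 'knight') gets a probability of one, since 'knight' is the only word that ever follows 'the' in that tiny corpus:

compute_ngram_probability(('the', 'knight'), ngram_counts, context_counts)
# 1.0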

Now we only need a way to sample a sentence from our N-gram model. We will assume a seed word is given and let the model complete the rest of the sentence: we find the N-grams whose context is the seed word and sample the next word from them, in proportion to their counts.

def sample_word_from_context(context_word, ngram_counts):
    # Gather every bigram that starts with the context word.
    context_ngrams = [ngram for ngram in ngram_counts if ngram[0] == context_word]
    next_possible_words = [ngram[1] for ngram in context_ngrams]
    # np.random.choice expects probabilities, so normalize the raw counts.
    context_probs = np.array([ngram_counts[ngram] for ngram in context_ngrams], dtype='float')
    context_probs /= context_probs.sum()
    return np.random.choice(a=next_possible_words, size=1, p=context_probs)[0]
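On the toy counts from above the sampler is deterministic, again because 'the' has a single continuation; with a richer corpus the outcome would vary from call to call:

sample_word_from_context('the', ngram_counts)
# 'knight'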

Finally we take the sampled word as the next seed word and repeat the process.

def generate_random_sentence(seed_word, ngram_counts, stop_words=('.', '!', '?'), sentence_max_length=300):
    # Sample word after word, feeding each new word back in as the next
    # context, until we hit a stop word or the length cap.
    sentence = seed_word
    next_word = sample_word_from_context(context_word=seed_word, ngram_counts=ngram_counts)
    sentence += ' ' + next_word
    while next_word not in stop_words and len(sentence.split(' ')) < sentence_max_length:
        next_word = sample_word_from_context(context_word=next_word, ngram_counts=ngram_counts)
        sentence += ' ' + next_word
    print("Random Sentence:", sentence)
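Putting it all together on a slightly richer (and entirely made-up) toy corpus, one possible run looks like this:

tokens = ('the knight says ni . '
          'the knight runs away . '
          'the king says ni .').split()
ngram_counts, context_counts = compute_ngram_counts(tokens, 2)
generate_random_sentence('the', ngram_counts)
# Random Sentence: the king says ni .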

Don’t forget you can try out the code using this ‘ere Jupyter notebook.