"Monty Python and the Holy Grail"
(Sentence generated by the language model built in this post)

Sounds like something King Arthur would say.

In this post we’ll talk about a possible first approach to language models: N-grams! We will find a way to assign a degree of certainty (not quite a probability) to sequences of words, and once we’re done, sample randomly from our distribution. If you feel curious about the math and code of an N-gram model, expand the sections below. If not, enjoy this picture from the movie “Monty Python and the Holy Grail”, and definitely check it out if you haven’t seen it.

N-gram

The Math


A language model is a function that maps a sentence to a degree of certainty. Usually we normalize it to be between zero and one, so it resembles a probability, but it isn’t one. An N-gram model is the simplest language model, so let’s start there. We also call a sequence of n words an n-gram:

\[\begin{align*} \vec{w}&=w_{1}^{n}=(w_{1},w_{2},\ldots,w_{n})\\ f(\vec{w})&=p \end{align*}\]

The standard way to compute the probability of a sequence of events is through the chain rule:

\[\begin{align*} f\left(w_{1}^{n}\right) &=f((w_{1},w_{2},\ldots,w_{n}))\\ &=f\left(w_{1}\right) f\left(w_{2} | w_{1}\right) f\left(w_{3} | w_{1}^{2}\right) \ldots f\left(w_{n} | w_{1}^{n-1}\right) \\ &=\prod_{k=1}^{n} f\left(w_{k} | w_{1}^{k-1}\right) \end{align*}\]
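For example, with the (made-up) three-word sentence “bring me water”, the chain rule reads:

\[\begin{align*} f(\text{bring},\text{me},\text{water})&=f(\text{bring})\, f(\text{me} | \text{bring})\, f(\text{water} | \text{bring},\text{me}) \end{align*}\]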

How do we compute the degree of certainty of a word $w_{k}$ given a context $w_{1}^{k-1}$? We would have to count how many times the phrase $w_{1}^{k}$ and the context $w_{1}^{k-1}$ appear in our corpus:

\[\begin{align*} f\left(w_{k} | w_{1}^{k-1}\right)&= \frac{C(w_{1}^{k})}{C(w_{1}^{k-1})} \end{align*}\]
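Sticking with the made-up sentence above, the last factor of the chain rule would be estimated as:

\[\begin{align*} f(\text{water} | \text{bring},\text{me})&=\frac{C(\text{bring me water})}{C(\text{bring me})} \end{align*}\]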

This is impractical: long phrases rarely repeat, so we would need an enormous amount of data to get good estimates of these counts.

We can also see here why we cannot treat these values as standard probabilities: with our count-based estimates, the following equality does not hold:

\[\begin{align*} f(w_{k-1},w_{k})&=f(w_{k-1})f(w_{k}|w_{k-1})\\ &\neq f(w_{k})f(w_{k-1}|w_{k}) \end{align*}\]

So, we will relax the problem and come up with an estimate of $f(w_{k}|w_{1}^{k-1})$.

\[\begin{align*} f\left(w_{k} | w_{1}^{k-1}\right) \approx f\left(w_{k} | w_{k-1}\right) \end{align*}\]

Mathematicians call this approximation the Markov assumption. The 2-gram (bigram) model then becomes:

\[\begin{align*} f\left(w_{1}^{n}\right) &=\prod_{k=1}^{n} f\left(w_{k} | w_{k-1}\right) \end{align*}\]

How do we compute $f\left(w_{k} | w_{k-1}\right)$?

\[\begin{align*} f\left(w_{k} | w_{k-1}\right)&=\frac{C\left(w_{k-1} w_{k}\right)}{\sum_{w} C\left(w_{k-1} w\right)}\\ &=\frac{C(w_{k-1} w_{k})}{C(w_{k-1})} \end{align*}\]
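As a quick sanity check, suppose our entire (made-up) corpus were the eight tokens “I am Arthur . I am king .”. Then:

\[\begin{align*} f(\text{am} | \text{I})&=\frac{C(\text{I am})}{C(\text{I})}=\frac{2}{2}=1\\ f(\text{king} | \text{am})&=\frac{C(\text{am king})}{C(\text{am})}=\frac{1}{2} \end{align*}\]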

Now is a good time to pause for a moment and implement this model.

Breaking down the Implementation


Since we are going to represent the text in terms of N-sized sequences, we’ll split the text (a list of words) into N-sized tuples and compute their frequencies.

from collections import Counter

def compute_ngrams(text, n):
    # `text` is a list of words; return every contiguous n-sized window as a tuple
    return [tuple(text[i:i + n]) for i in range(len(text) - n + 1)]

def compute_ngram_counts(text, n):
    # Frequencies of the n-grams and of their (n-1)-sized contexts
    ngram_counts = Counter(compute_ngrams(text, n))
    context_counts = Counter(compute_ngrams(text, n - 1))
    return ngram_counts, context_counts

To compute the probability of a given N-gram we just divide its absolute frequency by the frequency of its context.

def compute_ngram_probability(ngram, ngram_counts, context_counts):
    # f(w_k | context) = C(n-gram) / C(context), where the context is the n-gram minus its last word
    return ngram_counts[ngram] / context_counts[ngram[:-1]]
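With the same made-up corpus as before (lowercased, and assuming the functions and import defined above), this gives for example:

words = ['i', 'am', 'arthur', '.', 'i', 'am', 'king', '.']
bigram_counts, context_counts = compute_ngram_counts(words, n=2)
# C('i am') = 2 and C('i') = 2, so the estimate is 1.0
print(compute_ngram_probability(('i', 'am'), bigram_counts, context_counts))
# C('am king') = 1 and C('am') = 2, so the estimate is 0.5
print(compute_ngram_probability(('am', 'king'), bigram_counts, context_counts))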

Now we only need a way to sample a sentence from our N-gram model (we’ll use bigrams, n = 2, from here on). We will assume a seed word is given and ask the model to complete the rest of the sentence: we find the bigrams whose context is the seed word and sample the next word from them according to their counts.

import numpy as np

def sample_word_from_context(context_word, ngram_counts):
    # Keep only the bigrams whose context (first word) matches the seed word
    context_ngrams = [ngram for ngram in ngram_counts if ngram[0] == context_word]
    next_possible_words = [ngram[1] for ngram in context_ngrams]
    # Turn the bigram counts into a probability distribution and sample the next word from it
    context_probs = np.array([ngram_counts[ngram] for ngram in context_ngrams], dtype='float')
    context_probs /= np.sum(context_probs)
    return np.random.choice(a=next_possible_words, size=1, p=context_probs)[0]
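To see the randomness at work, here is the same toy corpus again (just an illustration, not the corpus used in the post):

words = ['i', 'am', 'arthur', '.', 'i', 'am', 'king', '.']
bigram_counts, _ = compute_ngram_counts(words, n=2)
print(sample_word_from_context('i', bigram_counts))   # always 'am': both occurrences of 'i' are followed by 'am'
print(sample_word_from_context('am', bigram_counts))  # 'arthur' or 'king', each with probability 1/2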

Finally we take the sampled word as the next seed word and repeat the process.

def generate_random_sentence(seed_word, ngram_counts, stop_words=('.', '!', '?'), sentence_max_length=300):
    # Start from the seed word and repeatedly sample the next word from the bigram model
    sentence = seed_word
    next_word = sample_word_from_context(context_word=seed_word, ngram_counts=ngram_counts)
    sentence += ' ' + next_word
    # Stop when we sample an end-of-sentence token or reach the maximum length
    while next_word not in stop_words and len(sentence.split(' ')) < sentence_max_length:
        next_word = sample_word_from_context(context_word=next_word, ngram_counts=ngram_counts)
        sentence += ' ' + next_word
    print("Random Sentence:", sentence)
    return sentence
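Putting it all together on a slightly longer made-up corpus (again just for illustration; in practice you would feed in a real text split into words):

words = "i am arthur king of the britons . i am your king .".split(' ')
bigram_counts, context_counts = compute_ngram_counts(words, n=2)
generate_random_sentence(seed_word='i', ngram_counts=bigram_counts)
# e.g. Random Sentence: i am your king .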

Don’t forget you can try out the code using this ‘ere Jupyter notebook.

Sources