*Sentence generated by the language model built in this post*

Sounds like something King Arthur would say.

In this post we’ll talk about a possible first approach to language models: N-grams! We will find a way to assign a degree of certainty (not quite a probability) to sequences of words and, once we’re done, sample randomly from our distribution. If you feel curious about the math and code of an N-gram model, expand the sections below. If not, enjoy this picture from the movie “Monty Python and the Holy Grail”, and definitely check it out if you haven’t seen it.

## N-gram

*The Math*

A language model is a function that maps a sentence to a degree of certainty.
Usually we normalize it to lie between zero and one, so it resembles a probability, but it isn’t one.
An N-gram model is the simplest language model, so let’s start there.
We also call a sequence of n words an n-gram.

The standard way to compute a probability of a sequence of events is through the chain rule:

\[\begin{align*} f\left(w_{1}^{n}\right) &=f\left(\left(w_{1},w_{2},\ldots,w_{n}\right)\right)\\ &=f\left(w_{1}\right) f\left(w_{2} | w_{1}\right) f\left(w_{3} | w_{1}^{2}\right) \ldots f\left(w_{n} | w_{1}^{n-1}\right) \\ &=\prod_{k=1}^{n} f\left(w_{k} | w_{1}^{k-1}\right) \end{align*}\]How do we compute the degree of certainty of a word $w_{k}$ given a context $w_{1}^{k-1}$? We would have to count how many times the phrase $w_{1}^{k}$ and the context $w_{1}^{k-1}$ appear in our corpus.
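To make the chain rule concrete, here is the factorization for a three-word sentence:

\[\begin{align*} f(w_{1},w_{2},w_{3}) &= f(w_{1})\, f\left(w_{2} | w_{1}\right) f\left(w_{3} | w_{1}^{2}\right) \end{align*}\]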

\[\begin{align*} f\left(w_{k} | w_{1}^{k-1}\right)&= \frac{C(w_{1}^{k})}{C(w_{1}^{k-1})} \end{align*}\]This is impractical, since for a good estimate of long contexts we would need an enormous amount of data.

We can also see here that we won’t be able to use standard probabilities, since our count-based estimates need not satisfy the following equality:

\[\begin{align*} f(w_{k-1},w_{k})&=f(w_{k-1})f(w_{k}|w_{k-1})\\ &\neq f(w_{k})f(w_{k-1}|w_{k}) \end{align*}\]So we will relax the problem and come up with an estimate of $f(w_{k}|w_{1}^{k-1})$.

\[\begin{align*} f\left(w_{n} | w_{1}^{n-1}\right) \approx f\left(w_{n} | w_{n-1}\right) \end{align*}\]Mathematicians call this approximation the Markov assumption. The 2-gram (bigram) model then becomes:
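Under the Markov assumption, the three-word sentence from before factorizes using only one word of context:

\[\begin{align*} f(w_{1},w_{2},w_{3}) &\approx f(w_{1})\, f\left(w_{2} | w_{1}\right) f\left(w_{3} | w_{2}\right) \end{align*}\]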

\[\begin{align*} f\left(w_{1}^{n}\right) &=\prod_{k=1}^{n} f\left(w_{k} | w_{k-1}\right) \end{align*}\](with $w_{0}$ understood as a start-of-sentence marker). How do we compute $f\left(w_{k} | w_{k-1}\right)$?

\[\begin{align*} f\left(w_{k} | w_{k-1}\right)&=\frac{C\left(w_{k-1} w_{k}\right)}{\sum_{w} C\left(w_{k-1} w\right)}\\ &=\frac{C(w_{k-1},w_{k})}{C(w_{k-1})} \end{align*}\]Now is a good time to pause for a moment and implement this model.

*Breaking down the Implementation*

Since we are going to represent the text in terms of N-sized sequences, we’ll split the text into lists of N consecutive tokens and compute their frequencies.

```
from collections import Counter

def compute_ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive tokens) in text."""
    if len(text) == n:
        return [tuple(text)]
    return [tuple(text[i:(i + n)]) for i in range(len(text) - n + 1)]

def compute_ngram_counts(text, n):
    """Count every n-gram and every (n-1)-gram context in text."""
    ngram_counts = Counter(compute_ngrams(text, n))
    context_counts = Counter(compute_ngrams(text, n - 1))
    return ngram_counts, context_counts
```
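As a quick sanity check, the same counting can be done directly with `Counter` on a hypothetical toy corpus (the tokens below are made up for illustration):

```python
from collections import Counter

# toy corpus (hypothetical example)
tokens = "we are the knights who say ni ni ni".split()

# bigram counts and unigram (context) counts, as compute_ngram_counts would produce
ngram_counts = Counter(tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1))
context_counts = Counter((token,) for token in tokens)

print(ngram_counts[('ni', 'ni')])   # 2: "ni ni" appears twice
print(context_counts[('ni',)])      # 3: "ni" appears three times
```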

To compute the probability of a given N-gram we just divide its absolute frequency by the frequency of its context.

```
def compute_ngram_probability(ngram, ngram_counts, context_counts):
    """Relative frequency of the n-gram's last word given its context."""
    return ngram_counts[ngram] / context_counts[ngram[:-1]]
```
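For instance, on a tiny made-up corpus the ratio of counts gives the bigram’s relative frequency:

```python
from collections import Counter

# hypothetical toy corpus
tokens = "spam spam spam eggs spam".split()
ngram_counts = Counter(zip(tokens, tokens[1:]))          # bigram counts
context_counts = Counter((token,) for token in tokens)   # unigram (context) counts

# f(spam | spam) = C(spam, spam) / C(spam) = 2 / 4
probability = ngram_counts[('spam', 'spam')] / context_counts[('spam',)]
print(probability)  # 0.5
```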

Now we only need a way to sample a sentence from our N-gram model. We will assume a word is given for the model to complete the rest of the sentence. We will find the N-grams whose context is the seed word we chose and sample randomly from them.

```
import numpy as np

def sample_word_from_context(context_word, ngram_counts):
    """Sample the next word among bigrams that start with context_word."""
    context_ngrams = [ngram for ngram in ngram_counts if ngram[0] == context_word]
    next_possible_words = np.array([ngram[1] for ngram in context_ngrams])
    context_probs = np.array([ngram_counts[ngram] for ngram in context_ngrams], dtype='float')
    context_probs /= np.sum(context_probs)  # normalize counts into a distribution
    return np.random.choice(a=next_possible_words, size=1, p=context_probs)[0]
```
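A minimal sketch of that sampling step, using hypothetical bigram counts rather than a real corpus:

```python
import numpy as np

# hypothetical bigram counts; only the ones starting with 'ni' matter here
ngram_counts = {('ni', 'ni'): 2, ('ni', 'peng'): 1, ('say', 'ni'): 3}

context_ngrams = [ngram for ngram in ngram_counts if ngram[0] == 'ni']
words = np.array([ngram[1] for ngram in context_ngrams])
probs = np.array([ngram_counts[ngram] for ngram in context_ngrams], dtype='float')
probs /= probs.sum()  # counts [2, 1] become probabilities [2/3, 1/3]

next_word = np.random.choice(a=words, size=1, p=probs)[0]
print(next_word)  # 'ni' about two thirds of the time, 'peng' otherwise
```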

Finally we take the sampled word as the next seed word and repeat the process.

```
def generate_random_sentence(seed_word, ngram_counts, stop_words=('.', '!', '?'), sentence_max_length=300):
    sentence = seed_word
    next_word = sample_word_from_context(context_word=seed_word, ngram_counts=ngram_counts)
    sentence += ' ' + next_word
    while next_word not in stop_words and len(sentence.split(' ')) < sentence_max_length:
        next_word = sample_word_from_context(context_word=next_word, ngram_counts=ngram_counts)
        sentence += ' ' + next_word
    print("Random Sentence:", sentence)
```
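Putting it all together on a toy corpus, here is a self-contained sketch of the same generation loop (the corpus is made up, and the standard library’s `random.choices` stands in for the NumPy sampling above):

```python
import random
from collections import Counter

# hypothetical corpus: two short sentences terminated by '.'
tokens = "the cat sat . the dog sat .".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

def next_word(word):
    # candidate continuations of `word`, weighted by their bigram counts
    candidates = [(ngram[1], count) for ngram, count in bigram_counts.items() if ngram[0] == word]
    words, counts = zip(*candidates)
    return random.choices(words, weights=counts)[0]

word = 'the'
sentence = [word]
while word not in ('.', '!', '?') and len(sentence) < 20:
    word = next_word(word)
    sentence.append(word)
print(' '.join(sentence))  # e.g. "the cat sat ."
```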

Don’t forget you can try out the code using this ’ere Jupyter notebook.