There are two versions of the Pumping Lemma. One is for context free grammars and one is for regular languages. This post is about the latter. The Pumping Lemma describes a property that all natural languages share. While it cannot be used by itself to prove that any given language is regular, it can be used to prove, often using proof by contradiction, that a language is not regular. In this sense the Pumping Lemma provides a necessary condition for a language to be regular but not a sufficient one.

In my last post, “Kleene’s Theorem,” I provided some useful background information about strings, regular languages, regular expressions, and finite automata before introducing the eponymously named theorem that has become one of the cornerstones of artificial intelligence and more specifically, natural language processing (NLP). Kleene’s Theorem tells us that regular expressions and finite state automata are one and the same when it comes to describing regular languages. In the post I will provide a proof of this groundbreaking principle.

Stephen Cole Kleene was an American mathematician who’s groundbreaking work in the sub-field of logic known as recursion theory laid the groundwork for modern computing. While most computer programmers might not know his name or the significance of his work regarding computable functions, I am willing to bet that anyone who has ever dealt with regular expressions is intimately familiar with an indispensable operator that resulted directly from his work and even bears his name, the *****, or as it is formally known, the **Kleene star**.

While his contributions to computer science in general cannot be overstated, Kleene also authored a theorem that plays an important role in artificial intelligence, specifically the branch known as natural language processing, or NLP for short. Kleene’s Theorem relates regular languages, regular expressions, and finite state automata (FSAs). In short, he was able to prove that regular expressions and finite state automata were the same thing, just two different representations of any given regular language.

Continue reading »