As a computer programmer for more than a quarter of century, I don’t think I have ever thought much about strings. I knew the basics. In every language I’d worked with, strings were a data type unto themselves. Superficially they are a sequence of characters, but behind the scenes, computers store and manipulate them as arrays of one or more binary bytes. In programs, they can be stored in variables or constants, and often show up in source code as literals, ie., fixed, quoted values like “salary” or “bumfuzzle.” (That is my new favorite word, btw.) Outside of occasionally navigating the subtleties of encoding and decoding them, I never gave strings a second thought.

Even when I first dipped my toe into the waters of natural language processing, aka NLP (not to be confused with the quasi-scientific neuro linguistic programming which unfortunately shares the same acronym), I still really only worked with strings as whole entities, words or affixes, As I made my through familiarizing myself with existing NLP tools, I didn’t have to dive any deeper than that. It was only when I started programming my own tools from the ground up, did I learn about the very formal mathematics behind strings and their relationship to sets and set theory. This post will be an attempt to explain what I learned.

Strings are inextricably tied to sets. They are composed from elements of alphabets, which are just sets of symbols, and languages, which as we will see, are sets of strings. To understand strings requires a firm understanding of sets, their properties and the operations that can be performed on them. Unfortunately this means we need to take a detour into what be unfamiliar territory for some.

## Abstract Algebra: Magmas, Semigroups & Monoids Oh My!

It isn’t my intention to take this discussion all the way back to abstract algebra’s first principles, but I think it’s important (at least it was for me) to know some of the terminology since it pops up in textbooks and other documents about strings in both natural and formal languages.

At the heart of the study of abstract algebra is the idea of **algebraic structures**. An algebraic structure is simply a set with one more operations defined on its elements that satisfies certain axioms. An example of an operation is addition which when performed on two or more numbers obeys the associative and commutative laws. A set can be thought of as a degenerate algebraic structure that has no operations.

One very basic type of algebraic structure is the **magma**, or **groupoid**. A magma has a single closed, binary operation defined on it, but doesn’t need to satisfy an axiom other than closure. If the magma is represented *M* and its binary operator generically represented as •, then

for all *a*, *b* in *M*, *a* • *b* is also in *M*.

In mathematical notation, this is:

∀ *a*, *b* ∈ *M*: *a* • *b* ∈ *M*

In other words, •: *M x M → M. *This requirement is known as the **magma axiom** or **closure axiom**.

A step of above magmas is the **semigroup**. Like magmas, semigroups have a closed, binary operation (•: *S x S → S*), but in addition to closure, the operation also satisfies the associative property. So for semigroup *S*:

for all *a*, *b*, *c* in *S*, the equation (*a* • *b*) • *c* = *a* • (*b* • *c*) holds.

Again, closure also holds for this operation. A semigroup can be thought of as an *associative* magma.

Semigroups in turn give rise to **monoids**. A monoid is a semigroup that has an **identity element**. So in addition to satisfying closure and associativity as described above, there exists an element *e* in monoid *S* such that for every element *a* in *S* :

*e* • *a* = *a* • *e* = *a*.

The identity element of a monoid is unique.

Finally, before finally getting into strings, there is a variation of both semigroups and monoids to be covered. The **free monoid** on a set is the set of all possible finite sequences constructed from *zero* or more of the set’s elements. The unique set having zero members, denoted ε, is the identity element, and concatenation is the monoid operation. The free monoid of a set *A* is denoted *A**.

A **free semigroup** over a set is all possible finite sequences constructed from *one* or more of the set’s elements. Unlike free monoids, free semigroups have no identity element and the empty set ε does not exist. Like free monoids, concatenation is the semigroup operation. The free semigroup of a set *A* is denoted *A+*.

## Alphabets: The ABCs of Strings

Sometime around the age of 2 or 3, likely with the help of that earworm of a tune, you learned the alphabet. It turns out what you learned was *an* alphabet, not *the* alphabet. In mathematics, and more specifically formal language theory, the definition of an alphabet is entirely more generic than what the everyday user of a language might ever know.

Formally, an **alphabet** is a set of **symbols**. These symbols are most often letters, digits, or other characters. The basic alphabet that we know and love is the set of 26 letters {A, B, C, …, X, Y, Z}. This is overly simplified, of course, because that set does not contain digits, punctuation, or letters modified by umlauts, tildes, etc. Computers and digital circuits use a much simpler alphabet consisting of only two digits: {0, 1}. Alphabets can be finite like the English alphabet, or they can be infinite. The real numbers are an example of an infinite alphabet.

Alphabets are usually denoted with the upper case Greek symbol sigma, **∑**.

## Strings: The Main Event

Now that we know what alphabets really are, we can formally define strings. A **string** is any *finite* sequence of symbols drawn from a specific alphabet Σ. Using this definition, words like “dog” and “slangwhanger” (It’s real. Look it up.) are strings. Every word in this post is a string. The 8-bit binary representation of the decimal number 17, 00010001, is a string.

There is even a special string that contains no symbols at all, the **empty string**. Since by definition it contains no symbols by which to represent it, the empty string is usually denoted * ε*. If you made it through the discussion of free monoids above, you probably noticed this is the same symbol as the one that denotes a free monoid’s identity element. This is not a coincidence, but more on that later.

One property of strings that every programmer has likely encountered is length. A string’s **length** is simply the number of symbols it contains. The length of string *s* is denoted **| s|**. Based on it’s definition, it’s easy to see why the length must be a positive number, but in actuality it only has to be non-negative. The empty string by convention is said to have length zero.

Now let’s consider all of the strings that an alphabet is able to generate. The set of strings of length *n* over an alphabet Σ is denoted ∑^{n}. For example, if we have an alphabet Σ = {a, b} then:

∑^{2} = {aa, ab, ba, bb}

For any alphabet, ∑^{0} is the set containing the empty string:

∑^{0} = {ε}

We take a huge leap towards languages when we consider the free monoid of an alphabet. As a matter of fact, the free monoid of an alphabet, denoted **∑***, is so important it warrants its own name: the **Kleene Closure**. The Kleene Closure of an alphabet is the set of all strings of any length over an alphabet. Because it is a free monoid, the Kleene Closure also includes the empty string ε. The strings in the Kleene Closure are called the alphabet’s **words**.

While the lengths of the words in the Kleene Closure are all finite, the set itself is countably infinite since there is no upper bound on how long the words themselves can be.

Since it’s a free monoid, the Kleene Closure’s set operation is concatenation. While most people have an intuitive idea of what concatenation is, it is worthwhile to define it in a more formal sense. Concatenation is a binary operation and for any two strings, *s* and *t* in ∑*, it is defined as the sequence of the symbols in *s* followed by the sequence of the symbols in *t* and is denoted *st*.

For example, if our alphabet is the set of lowercase letters, ∑ = {a, b, …, z}, and *s* = after and *t* = math, then *st* = aftermath. If the words were concatenated in reverse order we’d have *ts* = mathafter.

Concatenation is associative but as this last example illustrated, it is not commutative. It is not commutative with one exception. When one of the strings is the empty string ε, concatenation is commutative:

ε*s* = *s*ε = *s*

## Grammars, Languages, and Everything Else

Armed with this knowledge about strings and sets of strings, we can begin to study languages and grammars, languages’ structural rules governing words, phrases and clauses. That will be for another post I am afraid. This one turned out much longer and denser than I expected it would when I began it.

But because I teased much earlier that we would see how languages are sets of words, I will end, then, with the relationship of sets of strings to languages.

Drumroll please …

A **formal language** is defined as any subset of the Kleene Closure ∑* over an alphabet ∑.

I hope that was worth the wait. There’s a lot of information packed in the terminology of that relatively short definition, so if you understood it, congratulations. You now know the string facts.