The concept of a **sigma-algebra** (a.k.a. a **σ-algebra**, **sigma-field** or **σ-field**) lies at the heart of measure theory, and therefore at the heart of axiomatic probability theory, and the related concept of a **filtration** is fundamental in the study of stochastic processes. However, many people studying probability have an uneasy relationship with these concepts. While the formal definitions themselves are easy to state and to interpret literally, it is less obvious why these particular definitions are used and what the concepts really represent.

The latter question is open to glib answers such as “a filtration represents the ‘information’ available at each of a range of instants in time”, but such explanations are hard to square with the mathematical definitions. Is this really the best way to express “information”? Isn’t it all a bit cumbersome? Isn’t there a simpler way?

The problem becomes particularly acute when you encounter things like the **Markov property**, whose formal definition relies on these basic concepts. Without a clear understanding of sigma-algebras and filtrations (along with related concepts such as **measurability**), the definition of the Markov property can look like complete gobbledygook, and even with an understanding of what the definition literally says you can be left wondering whether it was really necessary to use so much machinery to define something as intuitive and apparently simple as “memorylessness” in a stochastic process.

In this multi-part article I’m going to explain the place of sigma-algebras and filtrations in probability theory, and particularly in the theory of stochastic processes. The goal will be to make it clear not just what these things are and what they mean, but also why they *are* in fact the simplest and most elegant ways available to express the things they are intended to express. Where necessary I will do this by proposing “straw man” alternatives and showing why these simpler versions fail and how their failure leads us inevitably to the real constructs.

This first part will explain why we need to use a sigma-algebra in the definition of a probability space, which will turn out to be kind of “technical”: We’re essentially forced into it by the properties of the real numbers. In the later parts, I will move on to the use of sigma-algebras and filtrations in the context of conditional expectations and stochastic processes, where the usefulness of the concepts in their own right should become clear.

### Probability Spaces

In probability theory, the first place a sigma-algebra shows up is in the very definition of a **probability space**, which is a triple, $(\Omega, \mathcal{F}, P)$, consisting of:

- a sample space, $\Omega$, which is a set whose elements are called **outcomes**,
- a sigma-algebra, $\mathcal{F}$, which is a collection of subsets of $\Omega$ (the sets that are members of the sigma-algebra are known as **events**), and
- a probability measure, $P$, which is a function that maps events onto their probability values.

This is just an overview, of course – the full definition can be seen at Wikipedia and places various restrictions on each of the parts that make up the probability space, some of which we will encounter later. The description above is enough to see the role of the sigma-algebra, though, which is to act as the **domain** of the probability measure. It defines which sets of outcomes are considered to be “events” and therefore to have probability values.

### Probability Measure as a Set Function

The probability measure in the definition above is a **set function**, a function that maps from sets to real numbers. The first obvious question, then, is why should this be so? Why can’t we just assign probabilities to the individual outcomes and then work out the probability of any given event by adding up the probabilities of the outcomes it contains? The reason for this isn’t hard to see. It’s because although such an approach can work if $\Omega$ is finite (or even countably infinite), it fails for uncountably infinite sample spaces.

As an example, consider the uniform distribution on $[0,1]$. That is, imagine the “experiment” consists of picking a single number between 0 and 1, and that we want to have every number in that range be “equally likely”. You can immediately see here that simply assigning some equal probability to every outcome isn’t going to work. Picking any probability other than zero would result in any event that contains a sufficient number of outcomes having a probability greater than 1, which isn’t permitted, so it’s clear that the probability of any individual number must be zero. However, if each outcome has probability zero, how are you supposed to calculate the probability that the number is between, say, 0 and 0.5?

In this example, we really do need a measure function that maps whole sets of outcomes, rather than just individual outcomes, to their corresponding probabilities. That way, the fact that we have zero probabilities for all the individual outcomes doesn’t prevent us from having sensible, non-zero probabilities for events like the number being between 0 and 0.5.
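To make the contrast concrete, here is a minimal Python sketch. The names `point_prob`, `prob_finite` and `prob_uniform_interval` are made up for illustration, and the second function simply assumes that the probability of an interval is its length, which is the choice the article arrives at later.

```python
from fractions import Fraction

# Finite sample space: a fair six-sided die, one probability per outcome.
point_prob = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def prob_finite(event):
    """Probability of an event (a set of outcomes) by summing outcome probabilities."""
    return sum((point_prob[o] for o in event), Fraction(0))

def prob_uniform_interval(a, b):
    """Probability that a uniform pick from [0, 1] lands in [a, b].

    Assumption for this sketch: the probability is the interval's length.
    """
    if not (0 <= a <= b <= 1):
        raise ValueError("need 0 <= a <= b <= 1")
    return b - a

print(prob_finite({2, 4, 6}))                                  # 1/2
print(prob_uniform_interval(Fraction(0), Fraction(1, 2)))      # 1/2
print(prob_uniform_interval(Fraction(1, 3), Fraction(1, 3)))   # 0
```

Summing point probabilities works for the die, but for the uniform pick every single point gets probability 0, so the set function has to be given directly.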

The next obvious question is a bit trickier to answer. If the measure function maps from sets of outcomes, why don’t we just consider *any* set of outcomes to be an event and have a probability? The short answer to this one is that in important cases, such as the example above, it turns out we *can’t* define a measure that supplies probabilities for all sets and satisfies the basic properties we require of such a function. The longer answer will require looking at the measure more closely…

### Requirements for Probability Measures and Sigma-Algebras

The requirements for the probability measure in the probability space definition above are:

- For any event $A$, $P(A)$ must be between zero and one, inclusive.
- $P$ must assign probability 1 to the whole sample space: $P(\Omega) = 1$.
- $P$ must be **countably additive** (a.k.a. **sigma-additive**).

The second requirement just means that “something must happen”. The third means that for any finite or countably infinite collection of events that is *pairwise disjoint* (i.e. no two events overlap), the probability of *any* of the events happening must be the sum of the probabilities of the individual events.
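In symbols, and using $A_1, A_2, \dots$ as my own labels for a pairwise disjoint sequence of events (so $A_i \cap A_j = \emptyset$ whenever $i \neq j$), the third requirement reads roughly as follows:

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)$$

The finite case is covered by taking all but finitely many of the $A_i$ to be the empty set.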

Why the “or countably infinite” part? Well, we certainly want to have this property for countably infinite sets of events, as it allows us to apply the mathematics of series. We can construct a particular event of interest by forming the union of an infinite collection of disjoint events, and since the sum of the resulting series will always converge (this is guaranteed by the first two requirements and the “finite” part of the third) we can obtain the desired probability, which is very useful indeed. The reason we can’t be more ambitious and require some kind of “uncountable additivity” is that it’s not clear what adding up an uncountable collection of numbers would even mean.^{[1]}
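As a small illustration of the series idea, the sketch below splits the interval $(0,1]$ into the countably many disjoint pieces $(1/2^{n+1}, 1/2^n]$ and adds up their lengths. It assumes the uniform distribution from the earlier example, with the probability of each piece taken to be its length; the function names are hypothetical.

```python
from fractions import Fraction

def piece_length(n):
    """Length of the n-th piece (1/2**(n+1), 1/2**n], for n = 0, 1, 2, ..."""
    return Fraction(1, 2**n) - Fraction(1, 2**(n + 1))

def partial_sum(count):
    """Total probability of the first `count` pieces."""
    return sum((piece_length(n) for n in range(count)), Fraction(0))

for count in (1, 2, 5, 10):
    print(count, partial_sum(count))
# 1 1/2
# 2 3/4
# 5 31/32
# 10 1023/1024
```

The partial sums are $1 - 1/2^k$, which converge to 1, the probability of the union $(0,1]$.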

The requirements for the sigma-algebra itself are similarly motivated:

- The complement of any event in the sigma-algebra must also be in the sigma-algebra.
- The union of any finite or countably infinite collection of events in the sigma-algebra must also be in the sigma-algebra.

The first of these ensures that if there is a probability for some particular event happening, there is also a probability for it not happening. From the requirements on $P$, this probability will be one minus the probability of the event happening. The second requirement complements the countable additivity requirement on $P$, ensuring that the union of any finite or countable collection of events is itself an event.
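The “one minus” claim is a one-line consequence of additivity. Writing $A^c$ for the complement of an event $A$ (my notation), the sets $A$ and $A^c$ are disjoint and their union is the whole sample space, so:

$$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c) \quad\Longrightarrow\quad P(A^c) = 1 - P(A)$$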

Satisfying these two very simple closure requirements is all it takes for a collection of sets to be considered a sigma-algebra. Simple as they are, though, these properties have some interesting implications, as we’ll see.
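For a finite sample space the two closure properties can be checked mechanically. The following is only a sketch: `is_sigma_algebra` is a hypothetical helper, it treats an empty collection as not being a sigma-algebra (an assumption on my part), and it relies on the fact that for a finite collection, closure under pairwise unions is enough to give closure under all unions.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the two closure properties for a collection of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if not sets:
        return False  # assumption: an empty collection does not count
    # Closure under complement.
    if any(omega - s not in sets for s in sets):
        return False
    # Closure under union; pairwise unions suffice when the collection is finite.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))  # True
print(is_sigma_algebra(omega, [set(), {1}, {2}, omega]))        # False
```

The second collection fails because it contains $\{1\}$ but not its complement $\{2,3,4\}$.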

### Probabilities of Subsets of $[0,1]$

Returning to the example of the uniform distribution on $[0,1]$, it seems we need to come up with a function that can assign appropriate probability values to subsets of this line, and we need to decide exactly which subsets of the line this function will be defined for.

The starting point is the “easy” sets, which are intervals. It’s pretty obvious what the probability of the selected number lying in the interval $[0, 0.5]$ should be: 0.5. In general, for any interval $[a,b]$, where $0 \le a \le b \le 1$, the probability should clearly be its length: $b - a$.

From this observation about intervals, you can work out the probabilities of various other sets. $(0.5, 1]$ is the complement of $[0, 0.5]$, so it has probability $1 - 0.5 = 0.5$. As a more complex example, the set of rational numbers between zero and one is a countable set of individual points, each of which has probability zero. The probability of this set is therefore itself zero. The complement of this, the set of irrational numbers between zero and one, has probability 1. As you can see, you can figure out probabilities for quite a variety of sets just from this simple observation about intervals.
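Spelled out, the argument for the rationals goes as follows. The enumeration $q_1, q_2, \dots$ of the rationals in $[0,1]$ is my own labelling, and a single point $\{q\}$ is treated as the degenerate interval $[q,q]$ of length zero:

$$P(\mathbb{Q} \cap [0,1]) = \sum_{n=1}^{\infty} P(\{q_n\}) = \sum_{n=1}^{\infty} 0 = 0, \qquad P([0,1] \setminus \mathbb{Q}) = 1 - 0 = 1$$

The first equality uses countable additivity and the second uses the complement rule.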

How much further can we take this? Can we define probabilities for all subsets of $[0,1]$ just based on the probabilities of intervals? If not, what sets *can* we define probabilities for in this way?

Well, in order to be useful in a probability space as defined above, we’re going to need the domain of our measure function to be a sigma-algebra of some kind. It seems that our minimum requirement, then, is that there be *some* sigma-algebra that we can define all the probabilities for. Setting our sights low, let’s look at the “smallest” possible sigma-algebra that includes all intervals. Before we can do that, though, we’re going to have to define what that means.

### The Sigma-Algebra Generated by a Collection of Sets

There are two ways of defining what is meant by the “smallest” sigma-algebra containing a given collection of sets, which turn out to be equivalent. You define it from the “outside” or from the “inside”. We’ll look at the approach from the outside first.

It’s easy to see from the two closure properties above that if you have a collection of sigma-algebras, the intersection of the collection is itself another sigma-algebra. Taking the first property, if a given set is present in the intersection then it is present in all of the original sigma-algebras. Its complement is therefore also present in all of the original sigma-algebras and is therefore present in the intersection. The same approach shows that the second property also holds for the intersection.

This is true of *any* collection of sigma-algebras, not merely for finite or even countably infinite collections. Intersections of completely arbitrary collections of sigma-algebras result in sigma-algebras. This gives us our first way of defining the “smallest” sigma-algebra that contains a particular collection of sets, which is known as the sigma-algebra “generated by” the collection:

The sigma-algebra **generated by** a collection of sets is the sigma-algebra obtained by taking the intersection of all sigma-algebras that contain the collection.
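In symbols, using the common notation $\sigma(\mathcal{C})$ for the sigma-algebra generated by a collection $\mathcal{C}$ of subsets of $\Omega$ (the notation is my addition, not something defined earlier in the article):

$$\sigma(\mathcal{C}) = \bigcap \{\, \mathcal{G} : \mathcal{G} \text{ is a sigma-algebra on } \Omega \text{ and } \mathcal{C} \subseteq \mathcal{G} \,\}$$

The intersection is never taken over an empty family, because the collection of *all* subsets of $\Omega$ is always a sigma-algebra containing $\mathcal{C}$.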

This is the standard definition, and is perfectly simple to state. However, sometimes this sigma-algebra is also informally described in another way, as being the sigma-algebra you get by starting with the initial collection and “adding in” all the complements and unions required by the closure properties. This approach from the “inside” sounds more concrete, but as stated it isn’t very well defined. The idea can be made explicit, though. I won’t go into details, but I will briefly describe how the construction works, because it is nice to know that this idea of “adding in” the sets needed to make a sigma-algebra does indeed have some validity.

The first step is to take the initial collection and add to it all of the sets that can be obtained by either taking the complement of a set or taking the union of a finite or countable subcollection of sets. The resulting set doesn’t satisfy the closure properties yet (for example, take the irrationals mentioned above – to reach them you have to take a countable union and *then* take the complement), but it’s a step forward. Let’s do it again, then, adding all further complements and countable unions into the mix.

Unfortunately, it turns out that even if you keep doing this an “infinite number of times” (more precisely, if you take the union of all the collections that can be constructed by a finite sequence of such steps), you still don’t arrive at a collection that satisfies the two closure properties above. In order to complete the construction and obtain a valid sigma-algebra, you actually have to continue the construction to further “degrees of infinity”, using a technique known as **transfinite induction**. This is pretty fascinating and I may write an article specifically about it at some point, but for now I’ll just say that the result of this construction ends up being the same sigma-algebra defined above: it’s the sigma-algebra “generated by” the collection.
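On a *finite* sample space none of the transfinite machinery is needed: the “adding in” process stops after finitely many rounds. The sketch below illustrates only that finite case and is not the general construction; `generate_sigma_algebra` is a hypothetical helper, and it seeds the collection with the empty set, which I treat as the union of an empty subcollection.

```python
def generate_sigma_algebra(omega, collection):
    """Close a collection of subsets of a finite omega under complement and union.

    Returns the resulting collection and the number of rounds that added new sets.
    """
    omega = frozenset(omega)
    # Assumption: the empty set is included as the union of no sets at all.
    current = {frozenset(s) for s in collection} | {frozenset()}
    rounds = 0
    while True:
        complements = {omega - s for s in current}
        unions = {a | b for a in current for b in current}
        updated = current | complements | unions
        if updated == current:
            return current, rounds
        current = updated
        rounds += 1

sigma, rounds = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
print(len(sigma), rounds)          # 8 2
print(frozenset({3, 4}) in sigma)  # True
```

Even here one round is not enough: $\{3,4\}$ only appears in the second round, as the complement of the union $\{1,2\}$, much like the irrationals in the example above.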

### Lebesgue Measure

So, now we know what is meant by the “smallest” sigma-algebra containing a given collection of sets, let’s get back to the question of whether our observation about the lengths of intervals is enough to define a measure on a sigma-algebra.

The sigma-algebra generated by the collection of all intervals^{[2]} is called the **Borel sigma-algebra**, after Émile Borel, and the sets it contains are known as **Borel sets**. So the question is whether our observation about the probabilities of intervals is sufficient to determine probabilities for all Borel sets in $[0,1]$.

Fortunately, there’s a theorem that gives us a clear answer in the positive: Carathéodory’s extension theorem. I won’t attempt to prove it here, or even state it in detail, but what it essentially says is that if you have a set function that is defined on a collection of sets (as long as the collection obeys some restrictions, which the collection of intervals does satisfy), and if that function satisfies, on those sets, the properties expected of a measure, then you can define a measure on the sigma-algebra *generated* by that collection which agrees with the original function for the sets in the original collection. Furthermore, the theorem ensures that this measure is unique.

This means that our observation about intervals is sufficient to define a unique measure function, which can supply us with probabilities for all Borel sets. This measure is known as Lebesgue measure, after Henri Lebesgue.^{[3]}^{[4]}

### Measurable and Non-Measurable Sets

The next question is whether Lebesgue measure can be defined for a sigma-algebra larger than the minimum Borel sigma-algebra. It turns out that it can, and the largest sigma-algebra it can be defined for is known as the sigma-algebra of **measurable sets**.

I’m not going to go into the details of how the Lebesgue measure is extended to this larger sigma-algebra, since this is supposed to be an article about sigma-algebras themselves rather than about measures (Wikipedia has details). However, I do want to talk a bit about the sets *outside* this sigma-algebra, the sets that you can’t reasonably ascribe any particular probability to, because the existence of these troublesome **non-measurable sets** is what necessitates the appearance of the sigma-algebra in the definition of a probability space. If it weren’t for the non-measurable sets, the measure could simply have been defined as a function that assigns a probability value to *any* subset of $\Omega$.

To construct a non-measurable set, you can start by defining an **equivalence relation** on the set of real numbers, by considering any two real numbers to be “equivalent” if the difference between them is a rational number. The effect of this is to split the set of real numbers into an infinite number of **equivalence classes**, where the numbers within a given equivalence class all differ from each other by rational amounts. Each of these equivalence classes is in fact a “clone” of the set of rational numbers itself, but shifted along by some irrational quantity.
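Written out, with $\sim$ as my symbol for the relation and $[x]$ for the equivalence class of $x$:

$$x \sim y \iff x - y \in \mathbb{Q}, \qquad [x] = \{\, x + q : q \in \mathbb{Q} \,\} = x + \mathbb{Q}$$

The class of any rational number is $\mathbb{Q}$ itself, and every other class is $\mathbb{Q}$ shifted by an irrational amount.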

Having done this, you can then form a set by choosing exactly one number from each of these equivalence classes, constraining yourself each time to choosing a number in the range $[0,1)$ (being shifted “clones” of the rational numbers, all of the equivalence classes contain values in this range). The resulting set has an interesting property: for *any* given real number $x$, the set contains exactly one number, $v$, such that $x - v$ is rational. This is a direct consequence of the definition: the set of all numbers $y$ such that $x - y$ is rational is one of the equivalence classes described above, and we’ve chosen exactly one element from each such class. This set we have built is known as a **Vitali set**, after Giuseppe Vitali. We will refer to it as $V$.

Now you can use $V$ to form a collection of further sets, each of which is a “clone” of $V$ itself, but shifted along by some rational number taken from the range $[0,1)$. More precisely, let $V_q$ be the set $\{(v + q) \bmod 1 : v \in V\}$, where $q$ is a rational number in $[0,1)$. The “mod 1” ensures that $V_q$ is a subset of $[0,1)$ – any “overflow” created by shifting the set is wrapped back around to the beginning of the range.
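A Vitali set itself can’t be written down in code, but the “mod 1” wrap-around can be illustrated on something simpler. The sketch below shifts a half-open interval $[a,b)$ inside $[0,1)$ and shows that the wrapped pieces have the same total length as the original. The function names are hypothetical, and working with half-open intervals is a convenience I have assumed so that the wrap-around is clean.

```python
from fractions import Fraction

def shift_interval_mod1(a, b, q):
    """Shift the interval [a, b) inside [0, 1) by q, wrapping around at 1.

    Returns a list of disjoint half-open pieces as (start, end) pairs.
    """
    a, b, q = Fraction(a), Fraction(b), Fraction(q)
    if not (0 <= a <= b <= 1 and 0 <= q < 1):
        raise ValueError("need 0 <= a <= b <= 1 and 0 <= q < 1")
    lo, hi = a + q, b + q
    if hi <= 1:
        return [(lo, hi)]                  # no overflow at all
    if lo >= 1:
        return [(lo - 1, hi - 1)]          # the whole interval wraps around
    # The interval straddles 1: split it into the part below 1 and the wrapped part.
    return [(lo, Fraction(1)), (Fraction(0), hi - 1)]

def total_length(pieces):
    """Sum of the lengths of a list of (start, end) pieces."""
    return sum((end - start for start, end in pieces), Fraction(0))

pieces = shift_interval_mod1(Fraction(1, 2), Fraction(9, 10), Fraction(3, 10))
print(pieces)
print(total_length(pieces))  # 2/5
```

The interval $[1/2, 9/10)$ has length $2/5$; after shifting by $3/10$ it becomes $[4/5, 1)$ together with $[0, 1/5)$, which again have total length $2/5$.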

The shifted sets $V_q$ have the following properties:

- If one of the sets $V_q$ has a probability, it must be the same as the probability of $V$ itself. This is because Lebesgue measure is **translation invariant** – shifting a set along the real number line does not alter its measure. Also, the additivity property of the measure ensures that the “wrap-around” effect from the use of “mod 1” does not affect the probability.
- There is no overlap – the sets $V_q$ are pairwise disjoint. To see this, observe that the transformation from the original Vitali set $V$ to $V_q$, which involves adding the rational number $q$ to each element in $V$ and doing a “mod 1”, moves each element of $V$ to an element *within the same equivalence class*. So, if you take a value $x$ from one of the shifted sets $V_q$ and consider a different shifted set $V_r$, the only value in $V_r$ that could possibly coincide with $x$ is the one that comes from the same equivalence class as $x$, but the corresponding value in $V_r$ will not equal $x$ because it will be the same original number from $V$, but shifted along by a different number, $r$.
- Every value in $[0,1)$ is in one of the shifted sets, $V_q$. To see this, pick any number $x$ in $[0,1)$. By the nature of the original Vitali set $V$, there will be exactly one value $y$ in $V$ such that $x - y$ is rational. $x - y$ will have to be in the range $(-1,1)$, since both $x$ and $y$ are in $[0,1)$. If $x - y$ is in $[0,1)$, then $x$ will be in the shifted set $V_{x-y}$. If $x - y$ is in $(-1,0)$, then $x$ will be in the shifted set $V_{x-y+1}$, as a result of the “mod 1” wraparound.

Since every shifted set $V_q$ is a subset of $[0,1)$, the last property above implies that the union of all the sets $V_q$ is exactly the whole of $[0,1)$. So we have a collection of pairwise disjoint sets whose union is $[0,1)$, and this collection of sets is countable, since there is one shifted set for each rational number in $[0,1)$, and the rational numbers are countable. By the countable additivity property, then, the sum of the probabilities of all the sets $V_q$ should be equal to the probability of $[0,1)$, which is 1 (leaving out the single point 1 makes no difference, since it has probability zero). However, as mentioned above, all the sets $V_q$ have the *same* probability, the probability of $V$, and therefore it is impossible for the sum to be 1. If the probability of $V$ were zero, the sum would be zero, whereas if it were some non-zero value the sum would be infinite.

The conclusion, then, is that it is impossible to assign a probability value to the Vitali set without violating the definition of a valid measure. $V$ is non-measurable.
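The whole contradiction fits on one line. Here $I$ stands for the interval covered by the shifted sets, $V_q$ for the Vitali set shifted by the rational $q$, and the sums and union run over the countably many rational shifts; this is just a summary of the argument above under the assumption that $V$ had a probability $P(V)$:

$$1 = P(I) = P\Big(\bigcup_{q} V_q\Big) = \sum_{q} P(V_q) = \sum_{q} P(V) = \begin{cases} 0 & \text{if } P(V) = 0 \\ \infty & \text{if } P(V) > 0 \end{cases}$$

Neither case equals 1, so no value of $P(V)$ is possible.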

### Axiom of Choice and Solovay’s Model

The crucial step in setting up the Vitali set in the above proof involved selecting one representative element from each of the uncountably infinite collection of equivalence classes. This is an explicit invocation of the historically controversial Axiom of Choice. We didn’t actually construct an explicit example of a non-measurable set, as that would have required specifying exactly *which* element to pick from each of these sets. All the proof shows, really, is that non-measurable Vitali sets *exist.*

Is it possible to go a step further and construct a specific non-measurable set? It turns out the answer is “no”. A construction known as **Solovay’s model** shows that out of the axioms of Zermelo-Fraenkel set theory with choice (ZFC), the only one that prevents Lebesgue measure from being defined for all sets of real numbers is the axiom of choice. To obtain a non-measurable set, you have to appeal to this axiom. (Strictly speaking, Solovay’s construction itself relies on an extra assumption, the existence of an inaccessible cardinal.)

The reason I mention this is to make it clear that the class of measurable sets is very large indeed. To construct a non-measurable set you really have to go out of your way, and such sets are extremely unlikely to be encountered in applications of probability theory. Even the much smaller sigma-algebra of Borel sets is more than sufficient for most applications.

### Summary

So, we’ve now seen what a sigma-algebra is and encountered a couple of major examples: the sigma-algebras of Borel and measurable sets. We’ve also seen *why* the sigma-algebra concept is necessary in the definition of a probability space. Although we might like to define probabilities for all subsets of the sample space, even in simple cases like the uniform distribution on $[0,1]$, it turns out we cannot. The presence of the sigma-algebra in the definition makes it clear that the domain of the measure function has to be specified, and places a couple of closure requirements that this domain must satisfy.

We’ve also seen that non-measurable sets can’t be constructed explicitly, and are extremely unlikely to be encountered in applications. It might seem, then, that sigma-algebras are just a bit of a pain, a necessary concession to mathematical rigour that we’d rather have done without. In the next part, I’ll talk about conditional expectation, particularly in the context of stochastic processes, and explain the concept of a **filtration**. In this setting the power of the sigma-algebra concept in its own right should hopefully become clear.

## Footnotes

1. Actually, there are ways of defining uncountable sums of *non-negative* terms, but even if you do this (and the definition you use has the usual properties expected of summation) it turns out that in order to have a finite sum all but a countable number of the terms must be zero. To see this, let $S_n$ be the set of terms greater than $1/n$ for some positive integer $n$. $S_n$ must clearly contain a finite number of terms, as otherwise the overall sum would be infinite (for any reasonable definition, the overall sum must be at least as great as the sum of any subset of the terms). Since any number greater than zero is greater than $1/n$ for some $n$, the union of all the sets $S_n$ contains all the non-zero terms. However, this is the union of a countable collection of finite sets, and is therefore itself countable. (Incidentally, the order of terms in this argument doesn’t matter because any convergent series involved converge absolutely.)

2. In this article, I use the set of *closed* intervals (i.e. numbers between $a$ and $b$ *inclusive*) as a starting point. It doesn’t actually matter which intervals you start with: As long as you choose *all* intervals of a particular type (open, closed, half-open on one side or half-open on the other), you end up with the same generated sigma-algebra and the same (Lebesgue) measure.

3. The naming here seems to vary. This measure is distinct from Lebesgue measure defined on the larger sigma-algebra of *measurable sets*, defined in the next section, because the domain of a function is part of its definition. Some sources (e.g. the Wikipedia article on Borel measure) refer to the measure we just defined as “*The* Borel Measure” (see the wiki link for why “The” is italicised). In this article, I’m using the naming in *Probability & Measure* by Patrick Billingsley, which refers to both functions as “Lebesgue measure”.

4. I’m glossing over the differences between the case we have been looking at, which is restricted to the interval $[0,1]$, and what happens when you consider *all* real numbers. The “proper” definitions of Borel sets and Lebesgue measure are not restricted to $[0,1]$. However, my aim here is to talk about sigma-algebras in probability theory, not measure theory in general, and Lebesgue measure without this restriction isn’t a probability measure because the Lebesgue measure of the whole set of real numbers isn’t 1. Worse, whereas probability measures are always finite, Lebesgue measure without the restriction to $[0,1]$ is only **sigma-finite**. It doesn’t seem worth getting bogged down in such matters here.

This was really helpful. Hope you write the next part soon. Thank you very much.

Did you ever write the second part?

This piece was very informative and motivates the subject unlike most textbooks or online sources that I’ve looked at. I hope the second part will be posted sometime.

Great!

Where can I find Part 2, please?

Hi Yijia,

I’m afraid I still haven’t got round to writing part 2 yet. I don’t post frequently on this blog, but when I do I like to put some effort in. I promise that I will write it eventually, but I can’t say when it will be. My excuses are a small child, a full time job, a qualification I’m studying for outside work and another project that is currently taking up whatever time I have remaining after that!

Tom

Hey if you dont have time to write the second part, maybe you can leave the reader with some references (the more the better) of both parts so we can have a look. Thanks a lot and great article!