
CS761 Spring 2013 Advanced Machine Learning

Probability Background

See for example

It turns out the Borel σ-algebra can be defined alternatively as the smallest σ-algebra that contains all the closed subsets. Can you see why? It also follows that every singleton set {x}, where x ∈ R, is in the Borel σ-algebra.
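One way to see the singleton claim (a short derivation added here; it is not in the original excerpt): a σ-algebra is closed under complements and countable unions, hence under countable intersections, and

{x} = \bigcap_{n=1}^{\infty} (x − 1/n, x + 1/n),

a countable intersection of open intervals, so {x} is in the Borel σ-algebra.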

Such a triple (Ω, B, P) is called a probability space.

When Ω is finite, P({ω}) has the intuitive meaning: the chance that the outcome is ω. When Ω = R, P((a, b)) is the “probability mass” assigned to the interval (a, b).


P(X ∈ A) = P(X^{−1}(A)) = P({ω : X(ω) ∈ A})   (1)
P(X = x) = P(X^{−1}(x)) = P({ω : X(ω) = x})   (2)
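To make (1) concrete on a finite Ω, here is a minimal Python sketch (added for illustration; the die example and all names are invented, not from the notes):

    # Hypothetical finite probability space: a fair six-sided die.
    Omega = [1, 2, 3, 4, 5, 6]
    P = {omega: 1 / 6 for omega in Omega}   # P({omega}) for each outcome
    X = lambda omega: omega % 2             # a random variable on Omega (1 if odd, 0 if even)

    def prob_X_in(A):
        """P(X in A) = P({omega : X(omega) in A}): sum the mass of the preimage."""
        return sum(P[omega] for omega in Omega if X(omega) in A)

    print(prob_X_in({1}))   # P(X = 1) = P({1, 3, 5}) = 0.5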

Example 2 Let X(ω) = ω be a uniform random variable (to be defined later) with the sample space Ω = [0, 1].

A sequence of different random variables {Xn}_{n=1}^{∞} can be defined as follows, where 1{z} = 1 if z is true, and 0 otherwise:

X3(ω) = ω + 1{ω ∈ [1/2, 1]}   (5)
X4(ω) = ω + 1{ω ∈ [0, 1/3]}   (6)
X5(ω) = ω + 1{ω ∈ [1/3, 2/3]}   (7)
. . .   (8)

Given a random variable X, the cumulative distribution function (CDF) is the function FX : R → [0, 1]

FX(x) = P(X ≤ x) = P(X ∈ (−∞, x]) = P({ω : X(ω) ∈ (−∞, x]}).   (9)
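As a worked instance of (9) (added for illustration, using Example 2's uniform X on Ω = [0, 1]): for 0 ≤ x ≤ 1 the event {ω : X(ω) ≤ x} is exactly the interval [0, x], so

FX(x) = P([0, x]) = x,   with FX(x) = 0 for x < 0 and FX(x) = 1 for x > 1.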

2. \int_{−∞}^{∞} fX(x) dx = 1
3. for every a ≤ b, P(X ∈ [a, b]) = \int_{a}^{b} fX(x) dx

The function fX is called the probability density function (PDF) of X. CDF and PDF are related by

FX(x) = \int_{−∞}^{x} fX(t) dt.

P(X = x) = 0 for all x. Recall that P and fX are related by integration over an interval. We will often use p instead of fX to denote a PDF later in class.

Binomial. X ∼ Binomial(n, p) if

f(x) = \binom{n}{x} p^x (1 − p)^{n−x} for x = 0, 1, . . . , n, and f(x) = 0 otherwise.   (12)

If X1 ∼ Binomial(n1, p) and X2 ∼ Binomial(n2, p) are independent, then X1 + X2 ∼ Binomial(n1 + n2, p). Think of this as merging two coin flip experiments on the same coin.

The binomial distribution is used to model the test set error of a classifier. Assuming a classifier’s true error rate is p (with respect to the unknown underlying joint distribution – we will make this precise later in class), then on a test set of size n the number of misclassified items follows Binomial(n, p).
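A minimal simulation of this claim (an added sketch; the error rate p and test size n below are made-up values): each test item is misclassified independently with probability p, so the error count is one Binomial(n, p) draw.

    import random

    def test_errors(n, p):
        """Count misclassified items on a test set of size n when each item is
        independently wrong with probability p; this count is Binomial(n, p)."""
        return sum(random.random() < p for _ in range(n))

    random.seed(0)
    n, p = 1000, 0.1                        # hypothetical test size and true error rate
    draws = [test_errors(n, p) for _ in range(200)]
    print(sum(draws) / len(draws))          # close to the Binomial mean n * p = 100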

Poisson. X ∼ Poisson(λ) if f(x) = e^{−λ} λ^x / x! for x = 0, 1, 2, . . .. If X1 ∼ Poisson(λ1) and X2 ∼ Poisson(λ2) are independent, then X1 + X2 ∼ Poisson(λ1 + λ2).⋆ This is a distribution on unbounded counts with a probability mass function “hump” (mode – not the mean – at ⌈λ⌉ − 1). It can be used, for example, to model the length of a document.
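An empirical sanity check of the sum property (an added sketch; the rates λ1 = 2 and λ2 = 3 are arbitrary). Knuth's multiplication method is used here only because it needs no external libraries:

    import math
    import random

    def poisson(lam):
        """Sample from Poisson(lam) via Knuth's multiplication method."""
        L, k, prod = math.exp(-lam), 0, 1.0
        while True:
            prod *= random.random()
            if prod <= L:
                return k
            k += 1

    random.seed(1)
    lam1, lam2 = 2.0, 3.0                   # arbitrary rates for the check
    sums = [poisson(lam1) + poisson(lam2) for _ in range(10000)]
    direct = [poisson(lam1 + lam2) for _ in range(10000)]
    print(sum(sums) / 1e4, sum(direct) / 1e4)   # both means should be near lam1 + lam2 = 5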

Geometric. X ∼ Geom(p) if f(x) = p(1 − p)^{x−1} for x = 1, 2, . . .. This is another distribution on unbounded counts. Its probability mass function has no “hump” (the mode is at x = 1, since f is monotonically decreasing).

The square root of variance σ > 0 is called the standard deviation. If µ = 0, σ = 1, X has a standard normal distribution. In this case, X is usually written as Z. Some useful properties:
(Scaling) If X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1).
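A one-line check of the scaling property (added; it follows from the CDF definition (9)):

P(Z ≤ z) = P((X − µ)/σ ≤ z) = P(X ≤ µ + σz) = FX(µ + σz),

and differentiating in z gives σ fX(µ + σz), which is exactly the N(0, 1) density when fX is the N(µ, σ²) density.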

χ² distribution. If Z1, . . . , Zk are independent standard normal random variables, then Y = \sum_{i=1}^{k} Zi² has a χ² distribution with k degrees of freedom. If Xi ∼ N(µi, σi²) are independent, then \sum_{i=1}^{k} ((Xi − µi)/σi)² has a χ² distribution with k degrees of freedom. The PDF for the χ² distribution with k degrees of freedom is

f(y) = \frac{1}{2^{k/2} Γ(k/2)} y^{k/2−1} e^{−y/2},   y > 0.
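An empirical illustration of the first claim (an added sketch; k and the sample count are arbitrary choices): summing k squared standard normals should give mean k and variance 2k, the χ²_k moments.

    import random

    random.seed(2)
    k, m = 5, 20000                         # degrees of freedom and sample count
    ys = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(m)]
    mean = sum(ys) / m
    var = sum((y - mean) ** 2 for y in ys) / m
    print(mean, var)                        # should be near k = 5 and 2k = 10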

If we have two (groups of) variables that are jointly Gaussian:

\begin{pmatrix} x \\ y \end{pmatrix} ∼ N\left( \begin{pmatrix} µx \\ µy \end{pmatrix}, \begin{pmatrix} A & C \\ C^⊤ & B \end{pmatrix} \right)   (17)
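The fact that usually accompanies (17) is the Gaussian conditioning identity, stated here as a reminder since the derivation around (17) is not in this excerpt: conditioning one block on the other is again Gaussian,

x | y ∼ N( µx + C B^{−1}(y − µy),  A − C B^{−1} C^⊤ ),

with the analogous formula for y | x by symmetry.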

Gamma(1, β) is the same as Exp(β).


Student’s t. X ∼ tν if

f(x) = \frac{Γ((ν+1)/2)}{\sqrt{νπ}\, Γ(ν/2)} \left(1 + x²/ν\right)^{−(ν+1)/2}.   (21)

When ν → ∞, it becomes a standard Gaussian distribution.⋆ “Student” is the pen name of William Gosset, who worked for Guinness; the brewery did not allow its employees to publish, so he wrote under a pseudonym.

Xn converges to X in distribution, written Xn ⇝ X, if

\lim_{n→∞} FXn(t) = F(t)   (23)

at all t where F is continuous. Here, FXn is the CDF of Xn, and F is the CDF of X. We expect the next outcome in a sequence of random experiments to be better and better modeled by the probability distribution of X. In other words, the probability for Xn to be in a given interval is approximately equal to the probability for X to be in the same interval, as n grows.

Example 3 Let X1, . . . , Xn be iid continuous random variables. Then trivially Xn ⇝ X1. But note P(X1 = Xn) = 0.

Example 5 Let Xn ∼ uniform[0, 1/n]. Then Xn ⇝ δ0, the point mass at 0. This is often written as Xn ⇝ 0.

Interestingly, note F(0) = 1 but FXn(0) = 0 for all n, so limn→∞ FXn(0) ̸= F(0). This does not contradict the definition of convergence in distribution, because t = 0 is not a point at which F is continuous.
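A small numerical illustration of Example 5 (added; purely illustrative): for Xn ∼ uniform[0, 1/n] the CDF is FXn(t) = min(max(n·t, 0), 1), which approaches the step function of δ0 at every t ≠ 0, while FXn(0) = 0 stays away from F(0) = 1.

    def F_n(t, n):
        """CDF of uniform[0, 1/n]: 0 for t < 0, n * t on [0, 1/n], then 1."""
        return min(max(n * t, 0.0), 1.0)

    for n in (1, 10, 100, 1000):
        print(n, F_n(0.05, n), F_n(0.0, n))   # tends to 1 at t = 0.05; always 0 at t = 0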

Xn converges to X in probability, written Xn →P X, if for any ϵ > 0

\lim_{n→∞} P(|Xn − X| > ϵ) = 0.   (25)

Written out on the sample space, this says

\lim_{n→∞} P({ω : |Xn(ω) − X(ω)| > ϵ}) = 0.

That is, the fraction of outcomes ω on which Xn and X disagree must shrink to zero. When Xn(ω) and X(ω) do disagree, they can differ by a lot in value. More importantly, note that Xn(ω) need not converge to X(ω) pointwise for any ω. This will be the distinguishing property between convergence in probability and convergence almost surely (to be defined shortly).

Xn converges to X almost surely, written Xn →a.s. X, if

P({ω : \lim_{n→∞} Xn(ω) = X(ω)}) = 1.   (26)


Example 11 Xn →a.s. 0 (and hence Xn →P 0 and Xn ⇝ 0) does not imply convergence in expectation E(Xn) → 0. To see this, let Xn = n · 1{ω ∈ [0, 1/n]} with ω uniform on Ω = [0, 1]. Then Xn(ω) → 0 for every ω > 0, so Xn →a.s. 0, yet E(Xn) = n · P(ω ∈ [0, 1/n]) = 1 for all n.

Xn converges in rth mean, where r ≥ 1, written Xn →Lr X, if

\lim_{n→∞} E(|Xn − X|^r) = 0.

→Lr implies →Ls, if r > s ≥ 1. There is no general order between →a.s. and →Lr.
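For the last claim, a standard pair of witnesses (added here as a reminder; they are not part of this excerpt): Example 11 above converges almost surely but not in L¹, while the “typewriter” sequence, the indicators of intervals [j/2^m, (j+1)/2^m] marching repeatedly across [0, 1], converges to 0 in L¹ (its expected absolute value is 2^{−m} → 0) but converges pointwise at no ω.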

Theorem 2 (The Weak Law of Large Numbers). If X1, . . . , Xn are iid with finite mean µ = E(X1), then X̄n = \frac{1}{n} \sum_{i=1}^{n} Xi →P µ.
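A quick simulation of Theorem 2 (an added sketch; the uniform distribution and the checkpoints are arbitrary choices): the running sample mean of iid uniform[0, 1] draws should settle near µ = 0.5.

    import random

    random.seed(3)
    total, n_max = 0.0, 100000
    for n in range(1, n_max + 1):
        total += random.random()            # iid uniform[0, 1] draws, mu = 0.5
        if n in (10, 100, 1000, 10000, 100000):
            print(n, total / n)             # the sample mean approaches 0.5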
