Joint probability distributions are fundamental to understanding how multiple random variables behave together.

Joint Distribution

A joint probability distribution describes the probability of events involving multiple random variables simultaneously. Think of it as extending our understanding from single-variable probability to multi-variable scenarios where we can capture complex relationships and dependencies.

Joint distributions are high-dimensional PDFs (continuous variables) or PMFs (discrete variables).

Mathematical Formulation

For two discrete random variables $X$ and $Y$, the joint probability mass function (PMF) is:

$p_{X,Y}(x, y) = P(X = x, Y = y)$

For continuous variables, we have the joint probability density function (PDF) $f_{X,Y}(x, y)$, defined so that:

$P(a \le X \le b, \, c \le Y \le d) = \int_a^b \int_c^d f_{X,Y}(x, y) \, dy \, dx$

As we add more variables, the dimensionality grows naturally:

  • 1D: $p_X(x)$ or $f_X(x)$
  • 2D: $p_{X,Y}(x, y)$ or $f_{X,Y}(x, y)$
  • 3D: $p_{X,Y,Z}(x, y, z)$ or $f_{X,Y,Z}(x, y, z)$
  • nD: $p_{X_1, \dots, X_n}(x_1, \dots, x_n)$ or $f_{X_1, \dots, X_n}(x_1, \dots, x_n)$, where $n$ is the number of random variables (see the sketch below)
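
To make this concrete, here is a minimal sketch, assuming NumPy and a made-up 2×3 table of probabilities: a discrete joint PMF can be stored as an array with one axis per random variable, so each additional variable adds one array dimension.

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y) for X in {0, 1} and Y in {0, 1, 2},
# stored as a 2-D array: joint_xy[i, j] = P(X = i, Y = j). Numbers are made up.
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

# Adding a third variable Z simply adds another axis:
# joint_xyz[i, j, k] = P(X = i, Y = j, Z = k). Here: a uniform joint over 2*3*4 outcomes.
joint_xyz = np.full((2, 3, 4), 1.0 / 24)

print(joint_xy.shape)   # (2, 3)    -> two variables, 2-D array
print(joint_xyz.shape)  # (2, 3, 4) -> three variables, 3-D array
```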

Essential Properties

Every joint distribution must satisfy these fundamental properties of probability distributions:

  1. Non-negativity: $p_{X,Y}(x, y) \ge 0$ (or $f_{X,Y}(x, y) \ge 0$) for all $x, y$
  2. Normalization: $\sum_x \sum_y p_{X,Y}(x, y) = 1$ (discrete) or $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy = 1$ (continuous)

These properties ensure that joint distributions are valid probability measures.
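
Here is a minimal sketch of checking both properties on a discrete joint PMF, assuming NumPy and the same made-up 2×3 table as above:

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y): rows index x, columns index y.
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

# 1. Non-negativity: every entry must be >= 0.
assert np.all(joint_xy >= 0)

# 2. Normalization: all entries must sum to 1.
assert np.isclose(joint_xy.sum(), 1.0)

print("joint_xy is a valid joint PMF")
```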

Marginal Distributions

From a joint distribution, we can derive marginal distributions for individual variables by “summing out” or “integrating out” the other variables:

Discrete case: $p_X(x) = \sum_y p_{X,Y}(x, y)$

Continuous case: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$

The marginal distribution tells us about individual variables when we ignore the others.
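
In the discrete case, marginalization is just summing the joint array over the axes you want to remove. A minimal sketch, assuming NumPy and the same hypothetical 2×3 joint PMF:

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y): rows index x, columns index y.
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

# Marginal of X: "sum out" Y (sum across columns, axis=1).
p_x = joint_xy.sum(axis=1)  # [0.4, 0.6]

# Marginal of Y: "sum out" X (sum across rows, axis=0).
p_y = joint_xy.sum(axis=0)  # [0.4, 0.4, 0.2]

# Each marginal is itself a valid PMF.
assert np.isclose(p_x.sum(), 1.0) and np.isclose(p_y.sum(), 1.0)
print(p_x, p_y)
```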

Conditional Distributions

Conditional probability answers: “Given that $Y = y$ has occurred, what’s the probability that $X = x$ also occurs?” It’s like updating our beliefs based on new information. Formally, for $P(Y = y) > 0$:

$P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$

You might wonder: “What is the difference between this and the joint distribution $P(X, Y)$?” You can think of conditional probability as focusing on the “world” in which $Y = y$ has occurred, and asking: “Within that restricted world, what’s the likelihood that $X = x$ also occurs?”

Essential Properties

Like joint probability distributions, conditional distributions satisfy the same fundamental properties:

  1. Non-negativity: $P(X = x \mid Y = y) \ge 0$ for all $x$ and $y$
  2. Normalization: $\sum_x P(X = x \mid Y = y) = 1$ (discrete) or $\int_{-\infty}^{\infty} f_{X \mid Y}(x \mid y) \, dx = 1$ (continuous), i.e., the probabilities still sum to one within the “world” where $Y = y$ has happened (see the sketch below)
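
A minimal sketch of computing $P(X = x \mid Y = y)$ from a discrete joint PMF, assuming NumPy and the same made-up 2×3 table as before:

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y): rows index x, columns index y.
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

p_y = joint_xy.sum(axis=0)        # marginal P(Y = y)

# Conditional P(X = x | Y = y): divide each column by its column sum P(Y = y).
cond_x_given_y = joint_xy / p_y   # broadcasting divides column j by P(Y = j)

# Within each "world" Y = y, the conditional probabilities sum to 1.
assert np.allclose(cond_x_given_y.sum(axis=0), 1.0)
print(cond_x_given_y)
```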

The Chain Rule

The fundamental relationship connecting joint and conditional distributions is the chain rule:

$P(X, Y) = P(X \mid Y) \, P(Y) = P(Y \mid X) \, P(X)$

Basically, the chain rule decomposes a joint probability into a sequence of conditional probabilities. Each factor represents the probability of one variable given all the previous variables in the sequence.

For variables $X_1, X_2, \dots, X_n$:

$P(X_1, X_2, \dots, X_n) = P(X_1) \, P(X_2 \mid X_1) \, P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1})$
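
A minimal sketch of the two-variable case, assuming NumPy and the same hypothetical joint PMF, checking that multiplying the chain-rule factors $P(X) \, P(Y \mid X)$ recovers the joint:

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y): rows index x, columns index y.
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

p_x = joint_xy.sum(axis=1)                  # P(X = x)
cond_y_given_x = joint_xy / p_x[:, None]    # P(Y = y | X = x); each row sums to 1

# Chain rule: P(X, Y) = P(X) * P(Y | X).
reconstructed = p_x[:, None] * cond_y_given_x
assert np.allclose(reconstructed, joint_xy)
print("chain rule reconstructs the joint PMF")
```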

Independence

Two random variables $X$ and $Y$ are independent if and only if:

$P(X = x, Y = y) = P(X = x) \, P(Y = y) \quad \text{for all } x, y$

Independence means that knowing the value of one variable doesn’t change our beliefs about the other. The joint probability factors into the product of individual probabilities.
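
A quick numeric sketch of this factorization check, assuming NumPy and the same made-up table: compare the joint PMF to the outer product of its marginals.

```python
import numpy as np

# Hypothetical joint PMF p_{X,Y}(x, y).
joint_xy = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
])

p_x = joint_xy.sum(axis=1)
p_y = joint_xy.sum(axis=0)

# X and Y are independent iff the joint equals the product of the marginals.
print(np.allclose(joint_xy, np.outer(p_x, p_y)))  # False for this table

# A joint built directly as an outer product of marginals is independent by construction.
indep_joint = np.outer(p_x, p_y)
print(np.allclose(indep_joint, np.outer(indep_joint.sum(axis=1), indep_joint.sum(axis=0))))  # True
```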

Bayes’ Theorem

Rearranging the chain rule gives us Bayes’ theorem:

$P(Y \mid X) = \dfrac{P(X \mid Y) \, P(Y)}{P(X)}$

Bayes’ rule is a fundamental principle for updating beliefs based on new evidence. It tells us how to revise our initial beliefs when we observe new data. This is extremely important when we want to experiment and observe new data in an unknown world: it provides a principled framework for learning from experience and adapting our understanding as we gather more information. I will try to cover this aspect in future blog posts on Maximum Likelihood Estimation and Maximum A Posteriori estimation.

Components of Bayes’ Theorem

  • $P(Y \mid X)$: Posterior probability (our updated belief about $Y$ after observing $X$)
  • $P(X \mid Y)$: Likelihood (how likely is $X$ given $Y$?)
  • $P(Y)$: Prior probability (our initial belief about $Y$)
  • $P(X)$: Evidence (probability of observing $X$); see the worked example below
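
A small worked sketch with made-up numbers, where $Y$ is a hypothetical condition and $X$ is a positive test result; the prior, sensitivity, and false-positive rate are invented for illustration only:

```python
# Hypothetical numbers: Y = "has condition", X = "test is positive".
prior = 0.01            # P(Y): initial belief that the condition is present
likelihood = 0.95       # P(X | Y): probability of a positive test given the condition
false_positive = 0.05   # P(X | not Y): probability of a positive test without the condition

# Evidence P(X), via the law of total probability.
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior P(Y | X) = P(X | Y) * P(Y) / P(X).
posterior = likelihood * prior / evidence
print(f"P(Y | X) = {posterior:.3f}")  # roughly 0.161
```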

Key Takeaways

Fundamental Concepts

  • Joint distributions: high-dimensional PDFs (continuous variables) or PMFs (discrete variables).
  • Marginal distributions: can be derived by “summing out” other variables from joint distributions
  • Conditional distributions: describe how one variable behaves given knowledge of another
  • Independence means variables don’t influence each other: $P(X, Y) = P(X) \, P(Y)$
  • Chain rule: $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})$
  • Bayes’ theorem: $P(Y \mid X) = \dfrac{P(X \mid Y) \, P(Y)}{P(X)}$