Joint probability distributions are fundamental to understanding how multiple random variables behave together.


Joint Distribution

A joint probability distribution describes the probability of events involving multiple random variables simultaneously. Think of it as extending our understanding from single-variable probability to multi-variable scenarios where we can capture complex relationships and dependencies.

Joint distributions are high-dimensional PDFs (continuous variables) or PMFs (discrete variables).

Mathematical Formulation

For two discrete random variables $X$ and $Y$, the joint PMF is written as $p_{X,Y}(x, y) = P(X = x, Y = y)$: the probability that $X$ takes the value $x$ and $Y$ takes the value $y$ simultaneously.

For continuous variables, probabilities are computed by integrating the joint PDF:

$$P((X, Y) \in A) = \iint_A f_{X,Y}(x, y)\, dx\, dy$$

As we add more variables, the dimensionality grows naturally:

  • 1D: $p(x)$
  • 2D: $p(x, y)$
  • 3D: $p(x, y, z)$
  • $D$-dimensional: $p(x_1, x_2, \ldots, x_D)$ or $p(\mathbf{x})$ where $\mathbf{x} = (x_1, \ldots, x_D)$
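In code, a discrete joint PMF is just an array with one axis per variable. Here is a minimal sketch with made-up probabilities for two binary variables:

```python
import numpy as np

# Hypothetical joint PMF of two discrete variables X (rows) and Y (columns).
# Entry joint[i, j] = P(X = x_i, Y = y_j); the values are illustrative only.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

# The probability of a joint event such as (X = x_1, Y = y_0) is a single lookup.
p_event = joint[1, 0]  # 0.3
```

Adding a third variable would simply add a third axis, so the same indexing idea scales to higher dimensions.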

Essential Properties

Every joint distribution must satisfy these fundamental properties:

  1. Non-negativity: $p(x, y) \ge 0$ for all $x, y$
  2. Normalization: $\sum_x \sum_y p(x, y) = 1$ (discrete) or $\iint f(x, y)\, dx\, dy = 1$ (continuous)

These properties ensure that joint distributions are valid probability measures.
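Both properties are easy to check numerically. A minimal sketch, using the same kind of hypothetical 2×2 joint PMF as above:

```python
import numpy as np

# A small hypothetical joint PMF over X (rows) and Y (columns).
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

# Property 1: non-negativity — every entry is >= 0.
assert np.all(joint >= 0)

# Property 2: normalization — all entries sum to 1.
assert np.isclose(joint.sum(), 1.0)
```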


Conditional Distributions

Conditional probability answers: “Given that $Y = y$ has occurred, what’s the probability that $X = x$ also occurs?” It’s like updating our beliefs based on new information:

$$p(x \mid y) = \frac{p(x, y)}{p(y)}, \quad \text{provided } p(y) > 0$$

You might wonder: “What is the difference between this and the joint distribution $p(x, y)$?” You can think of conditional probability as focusing on the “world” in which $Y = y$ has occurred, and asking: “Within that restricted world, what’s the likelihood that $X = x$ also occurs?” And like any distribution, conditional distributions satisfy the same fundamental properties of non-negativity and normalization.

The Chain Rule

The fundamental relationship connecting joint and conditional distributions is the chain rule:

$$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$$

The chain rule decomposes a joint probability into a sequence of conditional probabilities. Each factor represents the probability of one variable given all the previous variables in the sequence:

For variables $x_1, x_2, \ldots, x_n$:

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$$
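The two-variable chain rule can be verified numerically: factor the joint into a marginal and a conditional, multiply them back, and recover the joint exactly. A sketch with the same hypothetical array:

```python
import numpy as np

joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

p_x = joint.sum(axis=1)             # marginal p(x), summing out Y
p_y_given_x = joint / p_x[:, None]  # conditional p(y | x), rows renormalized

# Chain rule: p(x, y) = p(x) * p(y | x) reconstructs the joint exactly.
reconstructed = p_x[:, None] * p_y_given_x
assert np.allclose(reconstructed, joint)
```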

Independence

Two random variables $X$ and $Y$ are independent if and only if:

$$p(x, y) = p(x)\, p(y) \quad \text{for all } x, y$$

Independence means that knowing the value of one variable doesn’t change our beliefs about the other. The joint probability factors into the product of individual probabilities.
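The factorization gives a direct numerical test for independence: compare the joint against the outer product of its marginals. A sketch with two hypothetical joints that share the same marginals:

```python
import numpy as np

# Under independence, the joint is exactly the outer product of the marginals.
p_x = np.array([0.3, 0.7])
p_y = np.array([0.4, 0.6])
joint_indep = np.outer(p_x, p_y)  # p(x, y) = p(x) * p(y)

# A dependent joint with the same marginals fails the factorization test.
joint_dep = np.array([
    [0.20, 0.10],
    [0.20, 0.50],
])

def is_independent(joint):
    # Rebuild the joint from its own marginals and compare.
    marg_x = joint.sum(axis=1)
    marg_y = joint.sum(axis=0)
    return np.allclose(joint, np.outer(marg_x, marg_y))

assert is_independent(joint_indep)
assert not is_independent(joint_dep)
```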


Marginal Distributions

From a joint distribution, we can derive marginal distributions for individual variables by “summing out” or “integrating out” the other variables:

Discrete case:

$$p(x) = \sum_y p(x, y)$$

Continuous case:

$$f(x) = \int f(x, y)\, dy$$

The marginal distribution tells us about individual variables when we ignore the others.
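In the array view, “summing out” a variable is literally a sum along that variable’s axis. A sketch with the same hypothetical joint PMF:

```python
import numpy as np

joint = np.array([
    [0.10, 0.20],
    [0.30, 0.40],
])

# "Summing out" Y (columns) gives the marginal of X, and vice versa.
p_x = joint.sum(axis=1)  # ≈ [0.3, 0.7]
p_y = joint.sum(axis=0)  # ≈ [0.4, 0.6]

# Each marginal is itself a valid distribution.
assert np.isclose(p_x.sum(), 1.0) and np.isclose(p_y.sum(), 1.0)
```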

Bayes’ Theorem

Rearranging the chain rule gives us Bayes’ theorem:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

Bayes’ rule is a fundamental principle for updating beliefs based on new evidence. It tells us how to revise our initial beliefs when we observe new data. This is extremely important when we want to experiment and observe new data in an unknown world - it provides a principled framework for learning from experience and adapting our understanding as we gather more information. I will try to cover this aspect in future blog posts on Maximum Likelihood Estimation and Maximum A Posteriori.

Components of Bayes’ Theorem:

  • $p(y \mid x)$: Posterior probability
  • $p(x \mid y)$: Likelihood (how likely is $x$ given $y$?)
  • $p(y)$: Prior probability (our initial belief about $y$)
  • $p(x)$: Evidence (probability of observing $x$)
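A classic illustration is updating the probability of a rare condition after a positive test. The numbers below are hypothetical, chosen only to show the mechanics:

```python
# Hypothetical numbers for a diagnostic-test example (all values assumed).
prior = 0.01       # p(disease): prior probability of having the disease
likelihood = 0.95  # p(positive | disease): test sensitivity
false_pos = 0.05   # p(positive | no disease): false-positive rate

# Evidence p(positive) via the law of total probability.
evidence = likelihood * prior + false_pos * (1 - prior)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays well below 1 because the prior is so small; this is exactly the belief-updating role of Bayes’ theorem described above.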

Key Takeaways

Fundamental Concepts

  • Joint distributions: high-dimensional PDFs (continuous variables) or PMFs (discrete variables).
  • Marginal distributions: can be derived by “summing out” other variables from joint distributions.
  • Conditional distributions: describe how one variable behaves given knowledge of another.

Independence means variables don’t influence each other:

$$p(x, y) = p(x)\, p(y)$$

Bayes’ theorem:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

Chain rule:

$$p(x_1, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_1, \ldots, x_{n-1})$$

Essential Properties

Every valid probability distribution must satisfy:

  1. Non-negativity: $p(x) \ge 0$ for all $x$
  2. Normalization: $\sum_x p(x) = 1$ (discrete) or $\int f(x)\, dx = 1$ (continuous)