Statistics for AI
Probability, distributions, hypothesis testing, Bayesian thinking, and information theory for ML practitioners.
📖 5 sections
⏰ 15 min read
✅ Quizzes included
01 Probability Foundations
Sample space
Set of all possible outcomes.
Event
Subset of sample space.
P(A)
Probability of event A, a number from 0 to 1. A certain event has P=1; an impossible event has P=0.
Independence
P(A,B)=P(A)*P(B). Knowing that A occurred does not change the probability of B.
Conditional
P(A|B)=P(A,B)/P(B). Probability of A given that B occurred (requires P(B)>0).
Bayes' Theorem
P(A|B)=P(B|A)*P(A)/P(B). Update beliefs with evidence.
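The definitions above can be checked by brute-force enumeration on a small sample space. The following is a minimal sketch, assuming two fair six-sided dice; the events and helper names (`prob`, `both`, `first_is_six`, and so on) are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely ordered outcomes of two fair dice
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event), where event is a predicate on a single outcome."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))

def both(e1, e2):
    """Joint event: e1 and e2 both hold."""
    return lambda outcome: e1(outcome) and e2(outcome)

def first_is_six(o):
    return o[0] == 6

def second_is_even(o):
    return o[1] % 2 == 0

def total_at_least_ten(o):
    return o[0] + o[1] >= 10

# Independence: P(A,B) == P(A)*P(B)
p_a = prob(first_is_six)
p_b = prob(second_is_even)
p_ab = prob(both(first_is_six, second_is_even))
print("independent:", p_ab == p_a * p_b)

# Conditional: P(C|A) = P(C,A)/P(A)
p_c = prob(total_at_least_ten)
p_c_given_a = prob(both(total_at_least_ten, first_is_six)) / p_a

# Bayes: P(A|C) = P(C|A)*P(A)/P(C), compared with the direct definition
p_a_given_c_bayes = p_c_given_a * p_a / p_c
p_a_given_c_direct = prob(both(first_is_six, total_at_least_ten)) / p_c
print("Bayes matches definition:", p_a_given_c_bayes == p_a_given_c_direct)
```

Using `Fraction` keeps the arithmetic exact, so the equalities can be tested with `==` rather than a floating-point tolerance.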
STATS Bayes example
# Medical test: 1% of the population has the disease (prior)
# Test: 99% true positive rate, 1% false positive rate
P_disease = 0.01
P_pos_given_disease = 0.99
P_pos_given_healthy = 0.01

# Law of total probability: P(pos) summed over diseased and healthy groups
P_pos = P_pos_given_disease*P_disease + P_pos_given_healthy*(1 - P_disease)
P_disease_given_pos = P_pos_given_disease*P_disease/P_pos
# Result: only ~50% chance of disease after a positive test!
02 Key Distributions
Distribution | PMF/PDF | Mean | Variance | ML use
Bernoulli(p) | p^x*(1-p)^(1-x), x in {0,1} | p | p(1-p) | Binary classification
Binomial(n,p) | C(n,k)*p^k*(1-p)^(n-k) | np | np(1-p) | Count successes in n trials
Gaussian N(mu,sigma^2) | (1/(sigma*sqrt(2*pi)))*exp(-((x-mu)/sigma)^2/2) | mu | sigma^2 | Continuous data, noise and error models
Poisson(lambda) | lambda^k*e^(-lambda)/k! | lambda | lambda | Event counts
Exponential(lambda) | lambda*e^(-lambda*x), x >= 0 | 1/lambda | 1/lambda^2 | Time between events
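The closed-form means and variances in the table can be compared against SciPy's frozen distributions. This is a small sketch, assuming SciPy is installed; the parameter values are arbitrary and chosen only for illustration.

```python
from scipy import stats

# Arbitrary illustrative parameters (not from the original text)
p, n, mu, sigma, lam = 0.3, 10, 2.0, 1.5, 4.0

# Each entry: (frozen SciPy distribution, table mean, table variance)
checks = {
    "Bernoulli": (stats.bernoulli(p), p, p * (1 - p)),
    "Binomial": (stats.binom(n, p), n * p, n * p * (1 - p)),
    "Gaussian": (stats.norm(loc=mu, scale=sigma), mu, sigma**2),
    "Poisson": (stats.poisson(lam), lam, lam),
    # SciPy parameterizes the exponential by scale = 1/lambda
    "Exponential": (stats.expon(scale=1 / lam), 1 / lam, 1 / lam**2),
}

for name, (dist, mean, var) in checks.items():
    print(
        f"{name:12s} mean={dist.mean():.4f} (table {mean:.4f})  "
        f"var={dist.var():.4f} (table {var:.4f})"
    )
```

Note the parameterization difference for the exponential: the table uses the rate lambda, while `scipy.stats.expon` takes `scale`, which is 1/lambda.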
03 Statistical Inference
STATS Confidence intervals & tests
# Confidence interval for the mean (known population sigma)
CI = x_bar ± z*(sigma/sqrt(n))
95% CI: z=1.96, 99% CI: z=2.576

# t-test (unknown population std; s is the sample std)
t = (x_bar - mu0) / (s/sqrt(n))
degrees of freedom = n-1

# p-value interpretation (at significance level alpha=0.05):
p < 0.05: reject H0 (statistically significant)
p >= 0.05: fail to reject H0 (this does not prove H0)

# Effect size (Cohen's d)
d = (mean1 - mean2) / pooled_std
Small: d=0.2, Medium: d=0.5, Large: d=0.8
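The formulas above translate fairly directly into NumPy and SciPy. The sketch below assumes two small arrays of accuracy scores, `a` and `b`, which are made-up numbers for illustration only; the null value `mu0` is also hypothetical. Because the population standard deviation is unknown here, the interval uses a t critical value rather than z.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracy scores from two model variants (made-up numbers)
a = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85])
b = np.array([0.77, 0.80, 0.76, 0.79, 0.78, 0.75, 0.80, 0.77])

n = len(a)
x_bar = a.mean()
s = a.std(ddof=1)  # sample standard deviation (divides by n-1)

# 95% confidence interval for the mean, t critical value with n-1 dof
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

# One-sample t-test against a hypothetical null mean mu0
mu0 = 0.80
t_manual = (x_bar - mu0) / (s / np.sqrt(n))
t_scipy, p_value = stats.ttest_1samp(a, popmean=mu0)

# Cohen's d using the pooled sample standard deviation
pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
    len(a) + len(b) - 2
)
d = (a.mean() - b.mean()) / np.sqrt(pooled_var)

print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"t (manual) = {t_manual:.4f}, t (scipy) = {t_scipy:.4f}, p = {p_value:.4f}")
print(f"Cohen's d = {d:.3f}")
```

The manual t statistic and the one returned by `scipy.stats.ttest_1samp` should agree, which is a quick way to confirm the formula.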
❓ Quiz
In ML, what does a p-value < 0.05 indicate?
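p < 0.05 means that, if H0 were true, results at least as extreme as those observed would occur less than 5% of the time. By convention this is treated as evidence against the null hypothesis. It is not the probability that H0 is true, and it says nothing about the size of the effect.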
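One way to build intuition for this interpretation is to simulate many experiments in which H0 is true by construction and count how often p < 0.05 occurs anyway. This is a rough sketch, assuming normally distributed data; the seed, sample size, and number of experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_experiments, n = 2000, 30
# Every experiment samples from N(0, 1), so H0: mu = 0 is true by construction
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))

# One-sample t-test per experiment (one row = one experiment)
_, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Fraction of experiments that would be declared "significant" at alpha = 0.05
false_positive_rate = np.mean(p_values < 0.05)
print(f"Fraction of true-H0 experiments with p < 0.05: {false_positive_rate:.3f}")
```

The fraction should land near 0.05, which is what the threshold means: roughly one in twenty true-null experiments is flagged as significant by chance alone.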
04 Information Theory
Entropy H(X)
-sum(p*log2(p)). Measures uncertainty/randomness.
High entropy
Uniform distribution. Maximum uncertainty.
Low entropy
Concentrated distribution. Predictable.
Cross-entropy
H(p,q)=-sum(p*log(q)). Loss function in classification.
KL divergence
D_KL(P||Q)=sum(p*log(p/q)). How much Q diverges from P. Not symmetric: D_KL(P||Q) generally differs from D_KL(Q||P).
Mutual information
I(X;Y)=H(Y)-H(Y|X). How much knowing X reduces uncertainty about Y.
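STATS Entropy calculation
# Binary entropy: p=0.5 vs p=0.9
import numpy as np

def entropy(p):
    # Binary entropy in bits; by convention 0*log2(0) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p*np.log2(p)-(1-p)*np.log2(1-p)

entropy(0.5)  # 1.0 (maximum)
entropy(0.9)  # 0.469 (less uncertain)
entropy(1.0)  # 0.0 (certain)

# Cross-entropy loss (classification) for a single example
y_true = np.array([0.0, 1.0, 0.0])  # one-hot label
y_pred = np.array([0.1, 0.7, 0.2])  # predicted class probabilities
loss = -np.sum(y_true * np.log(y_pred))  # equals -log(0.7)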
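The entropy, cross-entropy, and KL divergence definitions in this section are linked by the identity H(p,q) = H(p) + D_KL(p||q). The sketch below checks that identity numerically for discrete distributions; the function names and the example distributions `p` and `q` are illustrative assumptions, and it assumes q > 0 wherever p > 0.

```python
import numpy as np

def entropy_bits(p):
    """H(p) = -sum(p*log2(p)), treating 0*log2(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy_bits(p, q):
    """H(p,q) = -sum(p*log2(q)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_divergence_bits(p, q):
    """D_KL(p||q) = sum(p*log2(p/q)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# Made-up distributions over three classes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

lhs = cross_entropy_bits(p, q)
rhs = entropy_bits(p) + kl_divergence_bits(p, q)
print("H(p,q) == H(p) + KL(p||q):", np.isclose(lhs, rhs))

# KL divergence is not symmetric
print("KL(p||q) =", kl_divergence_bits(p, q))
print("KL(q||p) =", kl_divergence_bits(q, p))
```

Since H(p) does not depend on the model, minimizing cross-entropy with respect to q is equivalent to minimizing D_KL(p||q), which is one reason cross-entropy is a natural classification loss.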
05 Correlation & Regression
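STATS Correlation types
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

# Toy data for illustration only: y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Pearson r: linear correlation, r = cov(X,Y)/(std(X)*std(Y))
r = np.corrcoef(x, y)[0, 1]

# Spearman: rank correlation (captures monotonic, possibly non-linear, relationships)
r_s, p = spearmanr(x, y)

# Linear regression (OLS); scikit-learn expects a 2-D feature matrix
X_train, y_train = x[:4].reshape(-1, 1), y[:4]
X_test, y_test = x[4:].reshape(-1, 1), y[4:]
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)

# R-squared: proportion of variance explained (can be negative on held-out data)
model.score(X_test, y_test)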
Correlation does NOT imply causation. Always check for confounding variables before drawing conclusions.
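The difference between Pearson and Spearman correlation shows up clearly on data that is monotonic but not linear. The following is a minimal sketch, assuming SciPy is available; the cubic relationship is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# y increases with x, but along a cubic curve rather than a straight line
x = np.arange(1, 11, dtype=float)
y = x ** 3

r, _ = pearsonr(x, y)     # measures linear association
rho, _ = spearmanr(x, y)  # measures monotonic association via ranks

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Spearman's coefficient is 1 here because the ranks of `x` and `y` agree perfectly, while Pearson's r falls below 1 because the points do not lie on a straight line.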