Interactive notebooks accompanying the *Probabilistic Machine Learning* book by Kevin Murphy.
| Section | Notebook | Application | Key Formula |
|---|---|---|---|
| 3.2 - Multivariate Gaussian | Marginals and Conditionals (2D) | Predicting EGFR protein levels from mRNA gene expression | $p(x_1 \mid x_2)$ |
| | Marginals and Conditionals (5D) | Comprehensive real estate valuation with 5 correlated features | $p(\mathbf{x}_1 \mid \mathbf{x}_2)$ |
| | Missing Value Imputation | Patient health records with missing lab results | $p(\mathbf{x}_h \mid \mathbf{x}_v)$ |
| 3.3 - Linear Gaussian System | Bayes Rule for Gaussians | Blood pressure estimation with multiple clinical devices | $p(\mathbf{z} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{z})\,p(\mathbf{z})$ |
| | Bayes Rule with Non-Trivial W | Inferring physiological state from derived health metrics | $\mathbf{y} = W\mathbf{z} + \boldsymbol{\epsilon}$ |
| | Inferring an Unknown Scalar | Bayesian HER2 gene expression estimation from qPCR | $p(\mu \mid \mathbf{y})$ |
| | Inferring an Unknown Vector | Cytokine concentration estimation from noisy ELISA replicates | $p(\mathbf{z} \mid \mathbf{y})$ |
| | Sensor Fusion | Cell state estimation combining RNA-seq, flow cytometry, and ATAC-seq | $p(\mathbf{z} \mid \mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3)$ |
| 3.5 - Mixture Models | Gaussian Mixture Models | Cell type discovery in scRNA-seq brain tissue using GMM | $p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k)$ |
| 6.1 - Entropy | Entropy | Cell state uncertainty in whole cell modeling: discrete, binary, joint, conditional entropy and perplexity | $H(X) = -\sum_k p_k \log p_k$ |
| 6.2 - KL Divergence | KL Divergence | Comparing gene expression distributions across healthy and diseased tissue | $D_{\text{KL}}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}$ |
| 6.3 - Mutual Information | Mutual Information | Identifying informative biomarkers for cell state classification | $I(X;Y) = H(X) - H(X \mid Y)$ |
| 7.0 - Foundations | Scalar, Vector, and Matrix Products | Basic multiplication shapes and the dimension compatibility rule | $c = \mathbf{a}^T\mathbf{b},\ \mathbf{c} = A\mathbf{b},\ C = AB$ |
| | Inner Product (Dot Product) | Measuring alignment and similarity between vectors | $\mathbf{a}^T\mathbf{b} = \sum_i a_i b_i$ |
| | Outer Product | Creating matrices from vectors and building covariance | $\mathbf{a}\mathbf{b}^T$ |
| | Matrix-Vector Multiplication | Geometric transformations: rotation, scaling, shearing | $\mathbf{y} = A\mathbf{x}$ |
| | Matrix-Matrix Multiplication | Composing transformations and the associativity property | $(AB)C = A(BC)$ |
| | Quadratic Forms | Weighted distances and the Mahalanobis distance | $\mathbf{x}^T W \mathbf{x}$ |
| | The $A B A^\top$ Pattern | Transforming covariance through linear maps | $\Sigma_y = A\Sigma_x A^T$ |
| | Schur Complement | Block matrix operations and Gaussian conditioning | $M/D = A - BD^{-1}C$ |
| 7.3 - Matrix Inversion | Factor Model Covariance | Building gene expression covariance from transcription factor pathways | $\Sigma = WW^T + \Psi$ |
| | Low-Rank Covariance Update | Why adding $XX^T$ to a covariance matrix models new pathway exposures | $\Sigma' = \Sigma + XX^T$ |
| | Sherman-Morrison Formula | Rank-1 precision updates when discovering a single transcription factor | $(A + \mathbf{u}\mathbf{v}^T)^{-1}$ |
| | Matrix Inversion Lemma (Woodbury) | Efficient precision matrix updates for gene regulatory network inference | $(A + UCV)^{-1}$ |
| 7.4 - Eigenvalue Decomposition | Geometry of Quadratic Forms | Ellipsoidal level sets applied to protein binding affinity in drug discovery | $\mathbf{x}^T A\mathbf{x} = c$ |
| 7.5 - Singular Value Decomposition | SVD | Gene expression profiling: discovering latent biological programs with SVD | $A = USV^T$ |
| 7.6 - Matrix Decompositions | Cholesky Sampling from MVN | Clinical trial simulation with correlated patient biomarkers | $\Sigma = LL^T$ |
| 8.1 - The EM Algorithm | Expectation-Maximization (EM) | Medical diagnosis with latent disease types using Gaussian mixtures | $\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ |
| 8.2 - First-Order Methods | Gradient Descent | Drug dose-response curve fitting with gradient descent, line search, and momentum | $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \nabla \mathcal{L}(\boldsymbol{\theta}_t)$ |
| 8.3 - Second-Order Methods | Newton, BFGS, and Trust Regions | Enzyme kinetics parameter estimation with Hessian-based optimizers | $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{H}_t^{-1} \nabla \mathcal{L}(\boldsymbol{\theta}_t)$ |
| 8.4 - Stochastic Gradient Descent | SGD, Scheduling, and Adam | Predicting drug sensitivity from gene expression with adaptive optimizers | $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \mathbf{M}_t^{-1} \nabla \mathcal{L}(\boldsymbol{\theta}_t, z_t)$ |
| 9.2 - Gaussian Discriminant Analysis | Gaussian Discriminant Analysis | NSCLC cancer subtype classification from blood protein biomarkers | $p(y=c \mid \mathbf{x}) \propto \pi_c \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$ |
| 9.3 - Naive Bayes Classifiers | Naive Bayes Classifiers | Antimicrobial compound screening from molecular fingerprints | $p(\mathbf{x} \mid y=c) = \prod_d p(x_d \mid y=c, \theta_{dc})$ |
| 10.2 - Binary Logistic Regression | Binary Logistic Regression | Predicting tumor drug response from gene expression biomarkers | $p(y \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$ |
| 10.3 - Multinomial Logistic Regression | Multinomial Logistic Regression | Classifying NSCLC tumor subtypes from gene expression biomarkers | $p(y=c \mid \mathbf{x}) = \text{softmax}(W\mathbf{x})_c$ |
| 11.2 - Least Squares | Least Squares Linear Regression | Predicting protein abundance from mRNA expression in whole cell modeling | $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ |
| 11.3 - Ridge Regression | Ridge Regression | Predicting drug sensitivity from high-dimensional gene expression profiles | $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$ |
| 11.4 - Lasso Regression | Lasso Regression | Identifying antibiotic resistance biomarkers from bacterial gene expression | $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{X}\mathbf{w}-\mathbf{y}\|_2^2 + \lambda\|\mathbf{w}\|_1$ |
| 11.7 - Bayesian Linear Regression | Bayesian Linear Regression | Predicting protein abundance from transcriptomics with full posterior uncertainty | $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\hat{\mathbf{w}}, \hat{\boldsymbol{\Sigma}})$ |
| 13.1 - Backpropagation | Backpropagation for an MLP | Classifying bacterial antibiotic resistance from genomic features | $\boldsymbol{\delta}_2 = (U^\top \boldsymbol{\delta}_1) \odot H(\mathbf{z})$ |
| 16.2 - Learning Distance Metrics | Learning Distance Metrics | Drug compound similarity from molecular descriptors using LMNN, NCA, and deep metric learning | $d_M(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^\top M (\mathbf{x} - \mathbf{x}')}$ |
| 16.3 - Kernel Density Estimation | Kernel Density Estimation | Non-parametric profiling of single-cell flow cytometry, T-cell classification, and dose-response regression | $p(x \mid \mathcal{D}) = \frac{1}{N}\sum_n K_h(x - x_n)$ |
| 17.1 - Mercer Kernels | Mercer Kernels | Cell phenotype similarity from gene expression using RBF, Matern, ARD, and kernel combination | $\kappa(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$ |
| 17.2 - Gaussian Processes | Gaussian Processes | Predicting enzyme activity across temperature conditions with GP prior, posterior, and marginal likelihood | $p(\mathbf{f}_\ast \mid \mathcal{D}) = \mathcal{N}(\boldsymbol{\mu}_\ast,\ \mathbf{K}_{\ast,\ast} - \mathbf{K}_{X,\ast}^\top \mathbf{K}_\sigma^{-1} \mathbf{K}_{X,\ast})$ |
| 17.3 - Support Vector Machines | Support Vector Machines | Classifying bacterial cells as stressed vs. normal with hard/soft margin, kernel trick, and SVR | $f(\mathbf{x}) = \sum_{n \in \mathcal{S}} \alpha_n \tilde{y}_n \kappa(\mathbf{x}_n, \mathbf{x}) + \hat{w}_0$ |
| | Kernel Ridge Regression | Predicting enzyme activity from substrate concentration using the kernel trick | $f(\mathbf{x}) = \mathbf{k}^\top(\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{y}$ |
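Several of the formulas above can be sanity-checked in a few lines of NumPy. For the Gaussian conditioning rule $p(x_1 \mid x_2)$ from section 3.2, a minimal sketch (the mean, covariance, and observed value below are made up for illustration and come from no notebook):

```python
import numpy as np

# Conditioning a 2D Gaussian: p(x1 | x2 = v) is Gaussian with
#   mu_{1|2}  = mu1 + S12 / S22 * (v - mu2)
#   var_{1|2} = S11 - S12^2 / S22
mu = np.array([2.0, 1.0])            # illustrative means
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])       # illustrative covariance
v = 3.0                              # observed value of x2

mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (v - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
# mu_cond == 2.8, var_cond == 0.68: conditioning pulls the mean
# toward the observation and always shrinks the variance.
```

The same pair of formulas, written with block matrices, is what the 5D and imputation notebooks generalize.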
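The closed-form ridge estimator from section 11.3 is equally short to check on synthetic data (the design matrix, true weights, and noise level below are invented for the demo):

```python
import numpy as np

# Ridge regression closed form: w = (X^T X + lambda I)^{-1} X^T y,
# fit on a synthetic regression problem with known weights.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(50)

lam = 0.1
# Solve the regularized normal equations; np.linalg.solve is
# numerically preferable to forming the inverse explicitly.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
# w_hat recovers w_true, slightly shrunk toward zero by lambda
```

With $\lambda = 0$ this reduces to the least-squares formula listed under section 11.2.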
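Finally, the Sherman-Morrison identity listed under section 7.3 can be verified numerically; the matrix and vectors below are an arbitrary small example:

```python
import numpy as np

# Sherman-Morrison: (A + u v^T)^{-1}
#   = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u),
# valid whenever the denominator is nonzero.
A = np.diag([2.0, 3.0, 4.0])
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])

Ainv = np.linalg.inv(A)
denom = 1.0 + v @ Ainv @ u           # 1.25 here, so the update is valid
sm_inv = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / denom

direct = np.linalg.inv(A + np.outer(u, v))
# sm_inv matches the direct inverse, but given A^{-1} it costs
# only O(n^2) per rank-1 update versus O(n^3) for re-inversion.
```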