Overview: Introduction to Machine Learning
The Learnable Universe | Module 2 | COMP 536
“The purpose of computing is insight, not numbers.”
— Richard Hamming
Learning Objectives
By the end of this module, you will be able to:
- Articulate what machine learning is and how it differs from (and connects to) traditional scientific computing
- Identify when machine learning is appropriate for astrophysical problems versus when physics-based simulation is better
- Understand the fundamental learning problem: balancing model complexity with data limitations
- Connect machine learning to statistical inference (Module 5) and function approximation (Module 1)
- Distinguish between supervised, unsupervised, and reinforcement learning paradigms
- Recognize common pitfalls: overfitting, underfitting, and the bias-variance tradeoff
- Design validation strategies to assess model generalization
- Appreciate the philosophical shift from “physics first” to “data first” approaches in modern astrophysics
This module builds directly on Module 5: Inferential Thinking, extending Bayesian inference to function space:
The Statistical Foundation (Modules 1–4):
- Module 1: Probability, distributions, moments, sampling \(\to\) statistical description of uncertainty
- Module 2: Statistical mechanics, Boltzmann equation \(\to\) physics from statistics
- Module 3: Phase space, dynamics, N-body \(\to\) simulating the universe
- Module 4: Radiative transfer, Monte Carlo \(\to\) photons and computation
The Inferential Framework (Module 5) — THIS IS OUR FOUNDATION:
- Bayesian inference: \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) \(\to\) learning parameters from data
- MCMC: sampling complex posteriors \(\to\) computational inference
- Model comparison: evidence and Bayes factors \(\to\) choosing between models
- Key insight: Update beliefs using data (likelihood) and prior knowledge
The Machine Learning Extension (This Module):
- Overview (This): ML as Bayesian inference in function space
- Part 1: Neural Networks \(\to\) flexible function approximators and approximate inference
- Part 2: Uncertainty Quantification \(\to\) principled error bars on ML predictions
The Connection: Module 5 taught you \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\). Now we learn \(p(f \,\vert\, \mathcal{D})\).
Same Bayesian principles, infinite-dimensional space.
From Parameter Inference to Function Learning
The Bayesian Foundation (Module 5 Recap)
In Module 5: Inferential Thinking, you learned Bayesian inference for parameters:
The Parameter Inference Problem:
- Given data \(\mathcal{D}\) and model \(M\), what parameters \(\boldsymbol{\theta}\) are plausible?
- Posterior: \(p(\boldsymbol{\theta} \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})\)
- MCMC: Sample the posterior when no closed form exists
- Model comparison: Bayes factors, evidence
Example from Module 5: Fitting a line to data \[ y_i = \theta_0 + \theta_1 x_i + \epsilon_i \]
You inferred the parameters \(\boldsymbol{\theta} = (\theta_0, \theta_1)\) from noisy observations.
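A minimal NumPy sketch of that recap (the true slope, intercept, and noise level below are invented for illustration; the course's JAX snippets would use `jax.numpy` the same way). Least squares on the design matrix \([1, x]\) gives the maximum-likelihood \(\boldsymbol{\theta}\) under Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = theta0 + theta1 * x + noise
# (theta values chosen for illustration only)
theta_true = np.array([2.0, 0.5])
x = np.linspace(0.0, 10.0, 50)
y = theta_true[0] + theta_true[1] * x + rng.normal(0.0, 0.3, size=x.shape)

# Design matrix [1, x]; least squares = maximum likelihood under Gaussian noise
A = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_hat)  # close to (2.0, 0.5)
```

With a flat prior this is also the MAP estimate, i.e. the posterior mode you would locate with MCMC.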
The Machine Learning Extension
Machine learning is Bayesian inference in function space.
Instead of asking “what parameters \(\boldsymbol{\theta}\) are plausible?”, we ask:
“What functions \(f\) are plausible?”
Given data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}\), we want the posterior over functions: \[ p(f \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, f) \cdot p(f) \]
The mathematical structure is identical to Module 5 — only the space changed:
| Module 5: Parameter Space | Module 2: Function Space |
|---|---|
| Parameter \(\boldsymbol{\theta} \in \mathbb{R}^d\) | Function \(f: \mathbb{R}^d \to \mathbb{R}\) |
| Prior \(p(\boldsymbol{\theta})\) | Prior \(p(f)\) (architecture, regularization) |
| Likelihood \(p(\mathcal{D} \,\vert\, \boldsymbol{\theta})\) | Likelihood \(p(\mathcal{D} \,\vert\, f)\) |
| MCMC samples \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) | Learning algorithms find \(p(f \,\vert\, \mathcal{D})\) |
| Predictive: marginalize posterior | Predictive: marginalize posterior |
Module 5: Learn plausible parameters from data
Module 2: Learn plausible functions from data
Functions are infinite-dimensional parameters! ML extends your Bayesian intuition to function space.
What This Module Teaches You
Overview: What is machine learning? Why does astrophysics need it?
Part 1: Neural Networks
- Flexible function approximators
- Training = Finding maximum a posteriori (MAP) estimate
- Universal approximation and scalable inference
Part 2: Uncertainty Quantification
- Principled approaches to error bars on ML predictions
- Bayesian neural networks, ensembles, and calibration
- Connecting back to Module 5’s posterior reasoning
Throughout: Bayesian thinking guides ML design choices
Projects 1–4: Implement physics simulations
Final project JAX rebuild: Generate validated N-body simulation data
Final Project:
- Use simulations as training data
- Learn \(f:\) (initial conditions) \(\to\) (cluster evolution)
- Predict outcomes ~100–1000\(\times\) faster than physics simulation (depending on accuracy requirements)
This is physics-informed machine learning: Bayesian inference trained on physics.
Part 1: What Is Machine Learning?
The Traditional Scientific Method
You’ve spent this entire semester doing physics-based computational astrophysics:
- Write down equations: \(\frac{d\mathbf{v}_i}{dt} = \sum_{j \neq i} \frac{Gm_j(\mathbf{r}_j - \mathbf{r}_i)}{|\mathbf{r}_j - \mathbf{r}_i|^3}\)
- Solve numerically: Runge-Kutta, leapfrog, adaptive timesteps
- Analyze results: Measure observables, compare to theory
- Iterate: Refine physics, improve numerics, explore parameter space
This is model-driven science: we start with physical laws and derive predictions.
Strengths:
- Interpretable (we understand every step)
- Generalizable (physics is universal)
- Predictive (can extrapolate beyond training regime)
- Satisfying (we understand why things happen)
Limitations:
- Computationally expensive (Project 2: minutes per simulation)
- Doesn’t scale to complex systems (turbulence, galaxy formation)
- Requires known physics (what about dark matter?)
- Parameter space exploration is prohibitive (need 10,000+ simulations)
The Machine Learning Paradigm
Machine learning inverts the traditional approach:
Instead of: Physics equations \(\to\) Simulation \(\to\) Predictions
We do: Data \(\to\) Learning algorithm \(\to\) Predictions
Traditional science asks: “Given these physical laws, what will happen?”
Machine learning asks: “Given these observations, what patterns exist?”
This isn’t replacing physics — it’s complementing it. We use ML when:
- Physics is too expensive to simulate directly
- Physics is unknown or incomplete
- Data is abundant but complex
- We need fast predictions for exploration/optimization
A Concrete Example: Your Final Project
The Problem: Predict how a star cluster evolves from \(t=0\) to \(t=200\) Myr.
Physics-based approach (Project 2 plus the final project’s JAX rebuild):
Initial conditions → N-body equations → Integrate 200 Myr → r_core(t)
(Computationally expensive!)
Machine learning approach (Final Project):
Many N-body simulations → Learn patterns → Instant predictions
(Upfront cost, then fast!)
The key insight: We’re not replacing physics — we’re using N-body simulations to train a fast surrogate model.
This is physics-informed machine learning: combining physical knowledge with data-driven learning.
Part 2: The Learning Problem — Generalization from Data
What Does It Mean to “Learn”?
Suppose we have training data: \[ \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} \]
where:
- \(\mathbf{x}_i\) are inputs (e.g., initial cluster conditions)
- \(y_i\) are outputs (e.g., core radius at \(t=100\) Myr)
Goal: Find a function \(f\) such that for new, unseen inputs \(\mathbf{x}_*\): \[ f(\mathbf{x}_*) \approx y_* \]
This is the learning problem: generalize from observed data to unobserved cases.
Remember basis function expansions? \[ f(x) = \sum_{i=1}^{M} w_i \phi_i(x) \]
Machine learning is function approximation where:
- The basis functions \(\phi_i\) are learned from data (not chosen a priori)
- We care about generalization to new data (not just fitting training data)
- We quantify uncertainty in predictions
This connects directly to your work on Fourier series, Legendre polynomials, etc. in Module 1!
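As a reminder of how a fixed, a-priori basis works, here is a short sketch using Legendre polynomials (the basis size, target function, and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of an illustrative target function
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

# Basis chosen a priori: Legendre polynomials P_0 ... P_5
M = 6
Phi = np.polynomial.legendre.legvander(x, M - 1)  # design matrix, shape (100, 6)

# Weights w_i in f(x) = sum_i w_i * phi_i(x), found by least squares
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
f_hat = Phi @ w

rmse = np.sqrt(np.mean((f_hat - np.sin(np.pi * x)) ** 2))
print(f"RMSE against the noise-free target: {rmse:.3f}")
```

Machine learning replaces the fixed \(\phi_i\) with basis functions learned from the data itself.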
The Bayesian Perspective: Learning = Inference
Module 5: Bayesian inference for parameters \(\boldsymbol{\theta}\) \[ p(\boldsymbol{\theta} \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\mathcal{D})} \]
Machine learning: Bayesian inference for functions \(f\) \[ p(f \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, f) p(f)}{p(\mathcal{D})} \]
Same structure, infinite-dimensional space!
- Prior \(p(f)\): Encodes beliefs about plausible functions (smoothness, scales, etc.)
- Likelihood \(p(\mathcal{D} \,|\, f)\): How well does \(f\) explain the data?
- Posterior \(p(f \,|\, \mathcal{D})\): Updated beliefs after seeing data — this is what learning computes
Neural networks perform approximate inference — MAP estimation finds a single “best” function, and scales to millions of data points. Uncertainty quantification techniques (Part 2) then recover principled error bars on these predictions.
The Fundamental Challenge: Bias-Variance Tradeoff
Consider fitting a polynomial to data:
Underfitting (high bias): \[ f(x) = w_0 + w_1 x \quad \text{(linear model)} \]
- Too simple to capture true pattern
- Poor performance on both training and test data
- Bias: model assumptions are wrong
Overfitting (high variance): \[ f(x) = \sum_{i=0}^{100} w_i x^i \quad \text{(100th degree polynomial)} \]
- Fits training data perfectly (even noise!)
- Poor performance on test data
- Variance: model is too sensitive to training data
Just right: \[ f(x) = \sum_{i=0}^{5} w_i x^i \quad \text{(modest degree polynomial)} \]
- Captures true pattern without memorizing noise
- Good performance on both training and test data
| Model complexity | Training behavior | Test behavior | Diagnosis |
|---|---|---|---|
| Too simple | Cannot match the pattern | Poor generalization | High bias |
| Well matched | Captures signal without chasing noise | Best generalization | Balanced bias and variance |
| Too flexible | Can memorize the training set | Poor generalization | High variance |
The mathematical formulation:
Expected prediction error decomposes as: \[ \mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]
Key insight: There’s an optimal model complexity that balances bias and variance!
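A quick numerical illustration of the tradeoff (the target function, sample sizes, and noise level are synthetic choices for illustration): fitting polynomials of increasing degree and comparing training versus test error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data; target function and noise level chosen for illustration
def f_true(x):
    return np.sin(np.pi * x)

x_train = rng.uniform(-1.0, 1.0, 30)
y_train = f_true(x_train) + rng.normal(0.0, 0.2, size=30)
x_test = rng.uniform(-1.0, 1.0, 200)
y_test = f_true(x_test) + rng.normal(0.0, 0.2, size=200)

def fit_and_score(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in [1, 5, 15]:
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error only goes down with degree; test error is what reveals underfitting and overfitting.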
Module 5 connection: The bias-variance tradeoff relates to Bayesian model comparison via evidence \(p(\mathcal{D} \,|\, \mathcal{M})\). Simple models may underfit (high bias), complex models may overfit (high variance). The optimal model balances both.
Full Bayesian inference marginalizes over functions rather than selecting a single “best” function — this is the foundation for the uncertainty quantification techniques we explore in Part 2.
Training, Validation, and Test Sets
The gold standard: Split data into three sets:
- Training set (60–80%): Fit model parameters
- Validation set (10–20%): Tune hyperparameters (model complexity, regularization)
- Test set (10–20%): Final evaluation (never used during development!)
Why three sets?
- Training: Learns patterns
- Validation: Prevents overfitting to training data
- Test: Unbiased estimate of generalization performance
Never use the test set for:
- Choosing model architecture
- Tuning hyperparameters
- Deciding when to stop training
- Any decision whatsoever!
The test set is sacred — touch it only once at the very end.
If you tune your model based on test performance, you’re effectively training on the test set (indirectly). This gives you an overly optimistic estimate of how well your model will generalize.
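A minimal splitting helper along these lines (the fractions and toy data are illustrative; shuffling first avoids ordering artifacts in the data):

```python
import numpy as np

def train_val_test_split(X, y, fracs=(0.7, 0.15, 0.15), seed=0):
    """Shuffle, then split into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(fracs[0] * len(X))
    n_val = int(fracs[1] * len(X))
    i_train = idx[:n_train]
    i_val = idx[n_train:n_train + n_val]
    i_test = idx[n_train + n_val:]
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

# Toy data: 100 samples, 3 features
X = np.arange(300.0).reshape(100, 3)
y = np.arange(100.0)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```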
Cross-Validation: Making the Most of Limited Data
When data is scarce (e.g., 200 expensive N-body simulations), splitting into train/val/test is wasteful.
K-fold cross-validation:
- Divide data into \(K\) folds (typically \(K=5\) or \(K=10\))
- For each fold \(k\):
- Train on all folds except \(k\)
- Validate on fold \(k\)
- Average performance across all \(K\) folds
This uses all data for both training and validation (at different times).
    import jax.numpy as jnp

    def k_fold_cross_validation(X, y, model_fn, k=5):
        """K-fold cross-validation.

        Assumes the data has already been shuffled.

        Args:
            X: Input data (N, d)
            y: Outputs (N,)
            model_fn: Function that trains and evaluates a model,
                called as model_fn(X_train, y_train, X_val, y_val)
            k: Number of folds

        Returns:
            scores: Validation scores for each fold
        """
        N = len(X)
        fold_size = N // k
        scores = []
        for i in range(k):
            # Split into train and validation
            val_start = i * fold_size
            val_end = (i + 1) * fold_size
            # Validation fold
            X_val = X[val_start:val_end]
            y_val = y[val_start:val_end]
            # Training folds (everything else)
            X_train = jnp.concatenate([X[:val_start], X[val_end:]])
            y_train = jnp.concatenate([y[:val_start], y[val_end:]])
            # Train and evaluate
            score = model_fn(X_train, y_train, X_val, y_val)
            scores.append(score)
        return jnp.array(scores)

    # Usage (model_fn and the data X, y must be supplied)
    scores = k_fold_cross_validation(X, y, model_fn, k=5)
    mean_score = jnp.mean(scores)
    std_score = jnp.std(scores)
    print(f"CV Score: {mean_score:.3f} ± {std_score:.3f}")

Part 3: Types of Machine Learning
Supervised Learning: Learning from Labeled Examples
Setup: We have input-output pairs \(\{(\mathbf{x}_i, y_i)\}\)
Goal: Learn mapping \(f: \mathbf{x} \to y\)
Two subtypes:
- Regression: Predict continuous values
- Example: Initial conditions \(\to\) core radius
- Loss: Mean Squared Error (MSE)
- Classification: Predict discrete categories
- Example: Galaxy image \(\to\) [spiral, elliptical, irregular]
- Loss: Cross-entropy
Astrophysical applications:
- Photometric redshifts (colors \(\to\) distance)
- Supernova classification (light curve \(\to\) type Ia vs core-collapse)
- Exoplanet detection (light curve \(\to\) planet or not)
- Your project: Initial conditions \(\to\) cluster evolution
Supervised learning is maximum likelihood estimation!
If we assume Gaussian noise: \(y_i = f(\mathbf{x}_i) + \epsilon_i\), \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)
Then the likelihood is: \[ p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, f_{\boldsymbol{\theta}}(\mathbf{x}_i), \sigma^2) \]
Maximizing likelihood = minimizing negative log-likelihood: \[ -\log p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 + \text{const} \]
This is Mean Squared Error when \(\sigma^2\) is treated as a known constant!
Important caveat: This equivalence holds only if the noise variance \(\sigma^2\) is pre-specified or known. In Bayesian inference, if \(\sigma^2\) is unknown, it becomes a hyperparameter that must be either:
- Pre-specified from prior knowledge
- Estimated from data (complicating the likelihood)
- Marginalized over in a full Bayesian treatment
In practice, standard neural networks do not learn \(\sigma^2\) — they implicitly assume a fixed, predetermined noise level. If you want to learn \(\sigma^2\) from data, you’d need to either:
- Add it as an output parameter (heteroskedastic regression)
- Use a Bayesian approach that marginalizes over it
So training neural networks is maximum likelihood estimation under the assumption of known noise variance, connecting back to Bayesian inference from Module 5.
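A quick numerical check of this equivalence (a one-parameter model with known \(\sigma\); all values are illustrative): the MSE and the Gaussian negative log-likelihood pick out the same parameter.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data y = 2x + Gaussian noise with known sigma (values chosen for illustration)
sigma = 0.5
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, sigma, size=50)

def mse(theta):
    return np.mean((y - theta * x) ** 2)

def neg_log_likelihood(theta):
    """Gaussian negative log-likelihood with sigma fixed and known."""
    resid = y - theta * x
    return (np.sum(resid ** 2) / (2.0 * sigma ** 2)
            + len(y) * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Scan theta: the two objectives are minimized at the same parameter value,
# since the NLL is just MSE rescaled and shifted by constants
thetas = np.linspace(0.0, 4.0, 401)
theta_mse = thetas[np.argmin([mse(t) for t in thetas])]
theta_nll = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]
print(theta_mse, theta_nll)
```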
Unsupervised Learning: Finding Structure Without Labels
Setup: We only have inputs \(\{\mathbf{x}_i\}\) (no outputs!)
Goal: Discover hidden structure, patterns, or groupings
Common tasks:
- Clustering: Group similar objects
- K-means, hierarchical clustering
- Example: Group galaxies by properties (without pre-defined labels)
- Dimensionality reduction: Find low-dimensional representation
- PCA, t-SNE, autoencoders
- Example: Compress 1000-dimensional galaxy spectra to 10 principal components
- Density estimation: Learn probability distribution
- Example: What’s the distribution of stellar masses in clusters?
Astrophysical applications:
- Discovering new classes of objects (quasars, gamma-ray bursts)
- Anomaly detection (unusual supernovae, transients)
- Data compression (large survey data)
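For instance, PCA via the SVD can compress toy "spectra" that secretly live on a low-dimensional subspace (the dimensions and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "spectra": 200 samples in 50 dimensions that really live
# on a 3-dimensional subspace plus small noise (illustrative)
latent = rng.normal(size=(200, 3))
basis = rng.normal(size=(3, 50))
spectra = latent @ basis + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD of the mean-centered data
centered = spectra - spectra.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / np.sum(S**2)

print(np.sum(explained[:3]))  # ~1.0: three components capture nearly everything

# Compress: project onto the first 3 principal components
compressed = centered @ Vt[:3].T  # shape (200, 3)
```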
Supervised: “Here are 1000 galaxies labeled as spiral or elliptical. Learn to classify new galaxies.”
Unsupervised: “Here are 1000 unlabeled galaxies. Do they naturally group into categories?”
Unsupervised learning can discover categories we didn’t know existed!
Reinforcement Learning: Learning from Interaction
Setup: Agent interacts with environment, receives rewards
Goal: Learn policy (strategy) that maximizes cumulative reward
Components:
- State \(s\): Current situation
- Action \(a\): What agent can do
- Reward \(r\): Feedback signal
- Policy \(\pi(a|s)\): Strategy for choosing actions
Astrophysical applications (less common but emerging):
- Optimizing telescope scheduling
- Adaptive optics control
- Gravitational wave detector tuning
Not the focus of this course (supervised learning is most relevant for emulation).
Part 4: Why Machine Learning for Astrophysics?
The Data Deluge
Astronomy is drowning in data:
| Survey/Instrument | Data Volume | Time to Analyze Manually |
|---|---|---|
| Sloan Digital Sky Survey (SDSS) | 200 million objects | Centuries |
| Large Synoptic Survey Telescope (LSST) | 30 TB/night | Impossible |
| Square Kilometre Array (SKA) | 160 TB/second | Absolutely impossible |
| Gaia | 1 billion stars | Many lifetimes |
Traditional approach: Manually inspect each object, classify, measure properties
Machine learning approach: Train algorithms to do this automatically
The Complexity Challenge
Some astrophysical systems are too complex for analytic solutions:
Simple (we have equations):
- Two-body problem: Solved exactly (Kepler orbits)
- Linear perturbations: Analytic solutions exist
- Spherical symmetry: Reduces to 1D ODEs
Complex (need simulation):
- N-body problem (\(N > 2\)): No closed-form solution
- Turbulence: Highly nonlinear, chaotic
- Galaxy formation: Multi-scale, multi-physics
Very complex (even simulation is hard):
- Cosmological structure formation: Box size vs resolution tradeoff
- Stellar interiors with magnetic fields: MHD is expensive
- Radiative transfer in 3D: Photon transport is costly
Machine learning solution: Learn simplified models from expensive simulations
The Emulation Use Case: Your Final Project
This is the primary motivation for your final project:
The Problem:
- N-body simulation: 1 minute per run
- Want to explore 5D parameter space: Need ~10,000 runs
- Total time: ~1 week of continuous computing
- Then if we update physics? Start over!
The ML Solution:
- Run 500 simulations (8 hours)
- Train surrogate model (1 hour)
- Make 10,000 predictions (seconds)
- Total: <10 hours instead of 1 week
The benefit: ~100–1000\(\times\) speedup (depending on emulator type and accuracy) enables:
- Parameter space exploration
- Uncertainty quantification
- Optimization (find best-fit parameters)
- Real-time analysis
Physics simulation:
- Exact (within numerical precision)
- Interpretable (we understand every force)
- Generalizes (physics is universal)
- Slow (minutes to hours per run)
ML emulator:
- Fast (milliseconds per prediction)
- Smooth (learns continuous functions)
- Approximate (prediction error ~1–10%)
- Interpolates well (but extrapolates poorly)
- Black box (harder to interpret)
The goal isn’t to replace physics — it’s to accelerate exploration while staying honest about uncertainties.
When NOT to Use Machine Learning
ML is powerful but not always appropriate:
Don’t use ML when:
- Physics is cheap: If simulation takes seconds, no need for emulation
- Data is scarce: <100 training examples \(\to\) physics priors are better
- Interpretability is critical: Need to understand mechanism, not just predict
- Extrapolation is required: ML fails outside training distribution
- Uncertainty quantification is essential: Standard NNs don’t provide this (but see Part 2 for solutions!)
Use ML when:
- Physics is expensive: Emulation saves time
- Data is abundant: Enough examples to learn patterns
- Patterns are complex: Too nonlinear for simple models
- Speed matters: Need real-time or interactive predictions
- You validate carefully: Test generalization thoroughly
Part 5: The Spectrum of Models — From Physics to Data
A Taxonomy of Approaches
| Level | Model style | What carries the scientific burden | Course example |
|---|---|---|---|
| 1 | Pure physics | Equations, numerical methods, and validation | Project 2 N-body simulator |
| 2 | Simulation-trained emulator | Validated simulator plus held-out emulator tests | Final-project neural emulator |
| 3 | Constrained ML | Data fit plus physics-aware constraints | Optional regularization or conservation checks |
| 4 | Pure ML | Data coverage, model capacity, and validation | Standard neural-network regression |
| 5 | End-to-end learning | Representation learning from raw inputs | Galaxy morphology from pixels |
Level 1: Pure Physics (Project 2 and the JAX rebuild)
- Start from first principles (\(F = ma\))
- Solve equations numerically
- No learning from data
- Example: Your N-body code
Level 2: Simulation-Trained Emulation (Final project)
- Use the validated simulator to generate training examples
- Learn a fast map from physical inputs to simulation outputs
- Example: \((Q_0, a) \to\) cluster summary diagnostics
- Best when the emulator is checked against held-out simulations
Level 3: Constrained ML (Physics-Informed Neural Networks)
- Encode physics in loss functions
- Network is flexible but respects constraints
- Example: Energy conservation penalty
Level 4: Pure ML (Standard Neural Networks)
- Learn directly from data
- No explicit physics
- Example: Image recognition, time series prediction
Level 5: End-to-End Learning (Deep Learning on Raw Data)
- Learn representations + predictions jointly
- Example: Galaxy morphology from pixels
Most astrophysical applications sit at Levels 2–3:
- We have strong physical priors (conservation laws, symmetries)
- Data is expensive but available
- We want speed + interpretability
Pure data-driven ML (Levels 4–5) works when:
- Physics is unknown or extremely complex
- Massive datasets available (images, spectra)
- Task is pattern recognition rather than understanding
Example: Photometric Redshifts
The Problem: Estimate galaxy distance from broad-band colors (no spectroscopy)
Level 1 (Pure Physics):
- Model galaxy spectral energy distributions (SEDs)
- Compute expected colors at each redshift
- Find best-fit template
- Pro: Interpretable. Con: Templates may not match real galaxies
Level 2–3 (Physics-Informed ML):
- Train on galaxies with known redshifts (spectroscopy)
- Use physics-motivated features (colors, emission lines)
- Constrain predictions to be positive, ordered
- Pro: More accurate. Con: Needs training data
Level 4 (Pure ML):
- Neural network: colors \(\to\) redshift
- No physics assumptions
- Pro: Very accurate. Con: Black box, hard to diagnose failures
Current best practice: Hybrid approaches (Levels 2–3)
Part 6: The Learning Algorithm Landscape
Classical Machine Learning (Pre-Deep Learning)
Linear Models:
- Linear regression: \(y = \mathbf{w}^T \mathbf{x} + b\)
- Logistic regression: \(p(y=1) = \sigma(\mathbf{w}^T \mathbf{x} + b)\)
- Pro: Fast, interpretable. Con: Limited expressivity
Tree-Based Methods:
- Decision trees: Binary splits on features
- Random forests: Ensemble of trees
- Gradient boosting (XGBoost, LightGBM)
- Pro: Handles nonlinearity, feature importance. Con: Can overfit
Support Vector Machines (SVMs):
- Find maximum-margin hyperplane
- Kernel trick for nonlinearity
- Pro: Good for small data. Con: Doesn’t scale to large datasets
Deep Learning (Modern ML)
Neural Networks (Part 1 of this module):
- Multilayer perceptrons (MLPs)
- Convolutional networks (CNNs) for images
- Specialized networks for images, sequences, and structured data
- Pro: Universal approximators, scalable. Con: Data-hungry, black-box
Specialized Architectures:
- Graph neural networks: Data with graph structure
- Physics-informed neural networks: Encode PDEs
- Pro: Incorporate domain knowledge. Con: More complex to train
The Modern Paradigm:
- Deep learning dominates when data is abundant (millions of examples)
- Classical ML still competitive for small datasets (<10,000 examples)
- Your project: a small neural emulator for simulation summary statistics, with uncertainty quantification techniques layered on top
Part 7: Evaluation and Validation
Metrics for Regression
For continuous predictions \(\hat{y}_i\) vs true values \(y_i\):
Mean Squared Error (MSE): \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]
- Units: squared output units
- Penalizes large errors heavily
Root Mean Squared Error (RMSE): \[ \text{RMSE} = \sqrt{\text{MSE}} \]
- Units: same as output
- Easier to interpret than MSE
Mean Absolute Error (MAE): \[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \]
- More robust to outliers than MSE
\(R^2\) Score (Coefficient of Determination): \[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]
- \(R^2 = 1\): Perfect predictions
- \(R^2 = 0\): No better than predicting mean
- \(R^2 < 0\): Worse than predicting mean!
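These metrics are a few lines of NumPy (the toy values below are for illustration only):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics defined above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(regression_metrics(y_true, y_pred))

# Predicting the mean gives R^2 = 0 exactly
baseline = np.full_like(y_true, y_true.mean())
print(regression_metrics(y_true, baseline)["R2"])  # 0.0
```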
Classification metrics: Accuracy, precision, recall, F1 score, confusion matrix
Visualizing performance:
- Residual plots: Should show random scatter (no systematic bias)
- Prediction plots: Points should fall on diagonal
- Learning curves: Training and validation loss should converge together (a diverging validation loss signals overfitting)
Part 8: Regularization = Bayesian Priors
In Module 5, you learned that Bayesian inference combines prior beliefs \(p(\boldsymbol{\theta})\) with data via Bayes’ theorem.
Regularization = putting Module 5’s priors into practice! When we add regularization terms to a loss function, we’re encoding prior beliefs about what functions are plausible.
The Bayesian View: Regularization IS Prior Knowledge
In machine learning, we typically don’t compute the full posterior \(p(f \,|\, \mathcal{D})\). Instead, we find the MAP (Maximum A Posteriori) estimate: \[ f_{\text{MAP}} = \arg\max_f p(f \,|\, \mathcal{D}) = \arg\max_f \left[ p(\mathcal{D} \,|\, f) \cdot p(f) \right] \]
Taking the negative log: \[ f_{\text{MAP}} = \arg\min_f \left[ \underbrace{-\log p(\mathcal{D} \,|\, f)}_{\text{Data loss } \mathcal{L}_{\text{data}}} + \underbrace{-\log p(f)}_{\text{Prior penalty}} \right] \]
This is training with regularization!
\[ \boxed{\text{Training loss} = \text{Data loss} + \text{Regularization} = -\log p(\mathcal{D} \,|\, f) - \log p(f)} \]
Different regularization techniques = different Bayesian priors on functions
L2 Regularization = Gaussian Prior
Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2 \]
Bayesian interpretation: This is MAP estimation with a Gaussian prior on weights: \[ p(w_i) = \mathcal{N}(0, \sigma^2) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const} \]
where \(\lambda = 1/\sigma^2\) connects the regularization strength to prior variance.
What this prior says: “I believe weights should be small, with typical magnitude \(\sigma\). Large weights are possible but unlikely.”
Effect: Smooth functions (no wild oscillations)
Also known as: Ridge regression, weight decay
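A sketch of the closed-form ridge/MAP estimate (problem sizes and true weights are invented for illustration): increasing \(\lambda\), i.e. a tighter Gaussian prior, shrinks the weight vector.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative linear regression problem
X = rng.normal(size=(40, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=40)

def ridge_fit(X, y, lam):
    """MAP estimate under a Gaussian prior: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stronger prior (larger lam = smaller prior variance) shrinks the weights
for lam in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(f"lam={lam:6.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```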
L1 Regularization = Laplace Prior
Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i| \]
Bayesian interpretation: This is MAP estimation with a Laplace prior: \[ p(w_i) = \frac{1}{2b} \exp\left(-\frac{|w_i|}{b}\right) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{b} \sum_i |w_i| + \text{const} \]
What this prior says: “I believe most weights should be exactly zero. Only a few features matter.”
Effect: Sparse models — drives many weights to exactly zero
Use case: Automatic feature selection
Also known as: Lasso regression
Comparing Priors Visually
| Regularizer | Bayesian interpretation | Geometric intuition | Practical effect |
|---|---|---|---|
| L2 | Gaussian prior on weights | Round contours | Shrinks weights smoothly |
| L1 | Laplace prior on weights | Diamond-shaped contours | Encourages exact zeros |
Key insight: The sharp corners of the L1 diamond cause weights to hit zero exactly!
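One way to see the exact zeros is the soft-thresholding (proximal) operator that coordinate-descent Lasso solvers apply to each weight; a minimal sketch:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: the per-weight update behind Lasso solvers."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([2.0, -0.3, 0.05, -1.5, 0.2])
print(soft_threshold(w, 0.5))
# Weights smaller than lam are set exactly to zero; larger ones shrink by lam
```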
Early Stopping = Implicit Prior on Solution Trajectory
What it is: Stop training when validation loss stops improving
Bayesian interpretation:
- Training follows a path through weight space
- Early stopping = prior belief that “good solutions appear early in training”
- Equivalent to a complexity prior: simpler solutions (earlier in training) are preferred
Why this makes sense: Neural network training typically finds simpler patterns first, then overfits to noise later.
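A sketch of patience-based early stopping (the learning rate, patience, and the synthetic overparameterized problem are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative overparameterized fit: degree-12 polynomial, few points
x = rng.uniform(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=40)
Phi = np.vander(x, 13, increasing=True)
Phi_tr, y_tr = Phi[:25], y[:25]
Phi_val, y_val = Phi[25:], y[25:]

def train_with_early_stopping(lr=0.01, max_steps=20000, patience=200, min_delta=1e-6):
    """Gradient descent on training MSE; stop when validation MSE plateaus."""
    w = np.zeros(13)
    best_val, best_w, since_best = np.inf, w.copy(), 0
    for step in range(max_steps):
        grad = 2.0 * Phi_tr.T @ (Phi_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_loss = np.mean((Phi_val @ w - y_val) ** 2)
        if val_loss < best_val - min_delta:
            best_val, best_w, since_best = val_loss, w.copy(), 0
        else:
            since_best += 1
        if since_best >= patience:  # validation stopped improving
            break
    return best_w, best_val, step

w_best, val_best, stop_step = train_with_early_stopping()
print(f"stopped at step {stop_step}, best validation MSE {val_best:.3f}")
```

Returning the weights from the best validation step, not the last step, is what encodes the "simpler solutions appear earlier" prior.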
Dropout = Approximate Bayesian Model Averaging
What it is: During training, randomly drop neurons with probability \(p\)
Bayesian interpretation (Gal & Ghahramani, 2016): \[ \text{Dropout} \approx \text{Sampling from approximate posterior over networks} \]
At test time with dropout:
- Each forward pass = sample from an approximate posterior \(\hat{p}(f \,|\, \mathcal{D})\)
- Average predictions \(\approx\) approximate marginal prediction
- Variance of predictions \(\approx\) approximate epistemic uncertainty estimate
Important caveats: This connection is approximate, and the approximation quality depends on:
- Coverage underfitting: Dropout uncertainty often underestimates true epistemic uncertainty (empirical coverage < nominal — e.g., 68% coverage instead of predicted 95%)
- Missing structure: The approximate posterior doesn’t capture all correlations present in the true Bayesian posterior
- Hyperparameter sensitivity: Dropout rate \(p\) must be tuned carefully; no principled way to choose it from Bayes’ theorem
- No marginal likelihood: Dropout networks can’t compute marginal likelihood for model comparison
When to use for uncertainty:
- Quick uncertainty estimates for exploratory analysis
- Use with caution for scientific publications — verify calibration against held-out data
- Avoid for active learning where you need exact uncertainty for adaptive sampling
    # Dropout for uncertainty quantification
    import jax.numpy as jnp

    def predict_with_uncertainty(model, x, n_samples=100):
        """Use dropout at test time for uncertainty estimates."""
        predictions = []
        for _ in range(n_samples):
            # Keep dropout active at test time
            y_pred = model(x, training=True)
            predictions.append(y_pred)
        predictions = jnp.array(predictions)
        mean = jnp.mean(predictions, axis=0)
        std = jnp.std(predictions, axis=0)  # Epistemic uncertainty
        return mean, std

The Fundamental Connection
Three equivalent perspectives:
- Practical ML: Regularization prevents overfitting
- Add penalty terms to loss function
- Choose \(\lambda\) via cross-validation
- Bayesian inference (Module 5): Priors encode assumptions
- \(p(f)\) represents beliefs before seeing data
- MAP estimation finds most probable function
- Philosophy: Occam’s razor
- Prefer simpler explanations
- Complexity requires justification (better fit to data)
They’re all the same thing!
Regularization techniques:
- L2: Gaussian prior \(\to\) smooth functions
- L1: Laplace prior \(\to\) sparse functions (feature selection)
- Early stopping: Implicit complexity prior \(\to\) prevent overtraining
- Dropout: Approximate posterior sampling \(\to\) uncertainty quantification
When you add regularization, you’re imposing Bayesian priors. When you tune \(\lambda\), you’re choosing prior strength.
Part 9: The Philosophical Shift — From Understanding to Prediction
The Traditional Physics Mindset
As a physicist, you’ve been trained to ask:
- Why does this happen?
- What are the fundamental laws?
- Can we derive this from first principles?
This is the explanatory paradigm: science as understanding mechanism.
Example: Newton didn’t just predict planetary motion — he explained it via gravity.
The Machine Learning Mindset
Machine learning often asks:
- What will happen?
- How accurately can we predict?
- What patterns exist in the data?
This is the predictive paradigm: science as pattern recognition.
Example: A neural network might predict galaxy morphology accurately without understanding how galaxies form.
Are these paradigms in conflict? No! They’re complementary.
Physics gives us:
- Fundamental understanding
- Generalization beyond training data
- Confidence in extrapolation
Machine learning gives us:
- Speed (predictions in milliseconds)
- Ability to handle complexity
- Discovery of unexpected patterns
Best approach: Combine both!
- Use physics to constrain ML models
- Use ML to accelerate physics simulations
- Use ML to discover new physics (anomaly detection)
Interpretability vs Accuracy
The tradeoff: Linear models are interpretable; deep neural networks are powerful but opaque.
For astrophysics: Choose based on goal — interpretability for theory testing, prediction for observations.
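One model-agnostic way to probe an otherwise opaque predictor is permutation importance: shuffle one input feature at a time and measure how much the error rises. A minimal sketch (toy data and predictor, purely illustrative):

```python
# Sketch: permutation importance for an opaque model (toy example)
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Rise in MSE when each feature's values are shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
        scores.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(scores)

# Toy check: the target depends only on feature 0
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
imp = permutation_importance(lambda X: 2.0 * X[:, 0], X, y)
# imp[0] is large; imp[1] and imp[2] are ~0 (the model ignores them)
```

The same diagnostic applies unchanged to a trained neural network: pass its prediction function as `predict`.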
Scientific Integrity
Key principles: Hold out test set (use only once!), report all experiments (including failures), validate predictions physically, practice open science.
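The “use only once” rule is easiest to honor if the split is made up front and the test indices are never inspected until the end. A minimal sketch of a three-way split (sizes and seed are arbitrary):

```python
# Sketch: a three-way split so the test set is touched exactly once
import numpy as np

def train_val_test_split(n, frac_val=0.15, frac_test=0.15, seed=0):
    """Return disjoint index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test_idx = idx[:n_test]                 # use ONCE, for the final reported metric
    val_idx = idx[n_test:n_test + n_val]    # for model selection / early stopping
    train_idx = idx[n_test + n_val:]        # for fitting
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(500)
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == 500  # disjoint, complete
```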
This course emphasizes understanding before using:
Neural Networks (NNs): Build from scratch
- Implement MLPs, backprop, training loops
- Understand gradient flow, loss landscapes
Uncertainty Quantification: Build from principles
- Implement ensemble methods, calibration checks
- Understand when and why uncertainty estimates can fail
Why this matters:
- You’ll know when models fail (and why)
- You’ll choose architectures wisely
- You’ll debug effectively
- You’ll use libraries responsibly
Analogy: You wouldn’t use an N-body code without understanding Newton’s laws. Same principle applies to ML!
Part 10: Looking Ahead — The Module Structure
Part 1: Neural Networks
What you’ll learn:
- Universal approximation theorem
- Backpropagation and gradient descent
- Small multilayer perceptrons as scientific emulators
- Normalization, train/validation/test splits, and baseline comparisons
- Held-out evaluation before inference
What you’ll build:
- Feedforward networks (MLPs)
- A baseline MLP emulator
- A simple comparison model
- A training pipeline with validation and saved normalization statistics
What you’ll discover:
- Trade-offs: expressivity vs interpretability
- When different architectures are appropriate
Part 2: Uncertainty Quantification
What you’ll learn:
- Bayesian neural networks and approximate inference
- Ensemble methods for predictive uncertainty
- Calibration: are your error bars trustworthy?
- Connections back to Module 5’s posterior reasoning
What you’ll build:
- Ensemble-based uncertainty estimates
- Calibration diagnostics
- Prediction intervals with coverage guarantees
What you’ll discover:
- When uncertainty matters most (and when shortcuts suffice)
- How to combine fast neural network predictions with principled error bars
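As a preview of the ensemble idea you will build in Part 2, here is a toy sketch that substitutes cheap polynomial fits for neural networks: each member is fit on a bootstrap resample, and the spread across members serves as an epistemic uncertainty estimate.

```python
# Sketch: ensemble uncertainty via bootstrap-resampled fits (toy 1-D example)
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)

def fit_member(seed):
    """Fit one ensemble member on a bootstrap resample (cubic polynomial)."""
    r = np.random.default_rng(seed)
    i = r.integers(0, len(x), len(x))        # bootstrap indices
    return np.polyfit(x[i], y[i], deg=3)

coeffs = [fit_member(s) for s in range(20)]
x_query = np.linspace(0, 1, 100)
preds = np.array([np.polyval(c, x_query) for c in coeffs])

mean = preds.mean(axis=0)   # ensemble prediction
std = preds.std(axis=0)     # member disagreement = epistemic uncertainty estimate
```

With neural networks the recipe is identical: train several networks from different initializations (and optionally resampled data), then report the mean and spread of their predictions.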
The Final Project Arc
First pass: Build a trustworthy baseline
- Build: small MLP emulator for chosen summary statistics
- Train: learn from simulation outputs generated by your JAX-native simulator
- Validate: held-out metrics, predicted-vs-true plots, and comparison to a simple baseline
Second pass: Add uncertainty and inference
- Add: Ensemble methods, calibration checks
- Evaluate: Are predictions trustworthy? Where do they fail?
- Compare: Different architectures and UQ strategies
Deliverable: Comprehensive analysis answering:
- Does the emulator beat a simple baseline? (mean, linear model, or nearest-neighbor/interpolation-style baseline)
- How trustworthy are the predictions? (Calibrated uncertainty estimates)
- Where does the emulator fail? (edges of parameter space, sparse training regions, or poorly chosen summary statistics)
Conceptual Checkpoints
Before moving to Parts 1 and 2, reflect on these questions:
Generalization: You have 500 N-body simulations. How do you know your ML model will work on the 501st simulation you haven’t run yet?
Bias-Variance: A simple linear model has high bias but low variance. A 100-layer neural network has low bias but high variance. For 200 training samples, which would you choose? Why?
Cross-validation: Why do we need a separate validation set? Can’t we just look at training error?
Physics vs ML: Your physics-based N-body code gives exact answers (within numerical precision). Your ML emulator gives approximate answers (\(\sim 5\%\) error). Why would you ever use the emulator?
Interpretability: A neural network predicts accurately but doesn’t tell you why a particular input feature matters. How would you investigate feature importance? What tools from Module 5 (Bayesian inference) could help?
Overfitting: You train a model that achieves 99.9% accuracy on training data but only 60% accuracy on test data. What went wrong? How would you fix it?
Regularization: Explain why adding a penalty \(\lambda \sum_i w_i^2\) to the loss function is equivalent to putting a Gaussian prior on weights in Bayesian inference (Module 5 connection!).
Connection to Module 1: How is machine learning related to function approximation with basis functions? What makes ML different from choosing Fourier or Legendre basis functions?
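Several of these checkpoints (bias-variance, overfitting, validation) can be felt directly in a small experiment. A toy sketch, assuming nothing beyond NumPy, that fits polynomials of increasing degree to noisy data and compares train versus test error:

```python
# Sketch: underfitting vs overfitting with polynomial degree (toy data)
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=15)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=200)

def mse(deg):
    """Train/test mean squared error for a degree-`deg` polynomial fit."""
    c = np.polyfit(x_train, y_train, deg)
    tr = np.mean((np.polyval(c, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(c, x_test) - y_test) ** 2)
    return tr, te

results = {deg: mse(deg) for deg in (1, 3, 12)}
for deg, (tr, te) in results.items():
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
# Training error falls monotonically with degree, but the degree-12 fit
# memorizes the 15 noisy points: its test error dwarfs its train error.
```

Training error alone would have picked the degree-12 model; only held-out error reveals the overfit.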
Further Reading
Foundational Machine Learning
- James, Witten, Hastie, Tibshirani (2013): An Introduction to Statistical Learning
- Excellent introduction, light on math
- Free PDF: https://www.statlearning.com/
- Bishop (2006): Pattern Recognition and Machine Learning
- More mathematical, comprehensive
- Covers Bayesian perspective
- Murphy (2012): Machine Learning: A Probabilistic Perspective
- Connects ML to probabilistic inference
- Great for physics backgrounds
ML for Physical Sciences
- Mehta et al. (2019): “A high-bias, low-variance introduction to Machine Learning for physicists”
- Physics Reports review article
- Excellent overview for scientists
- https://arxiv.org/abs/1803.08823
- Cranmer et al. (2020): “The frontier of simulation-based inference”
- Connects ML to statistical inference
- Relevant for emulation
- Carleo et al. (2019): “Machine learning and the physical sciences”
- Reviews of Modern Physics
- Comprehensive survey
Astrophysics Applications
- Baron (2019): “Machine Learning in Astronomy: A Practical Overview”
- Survey of ML applications in astronomy
- Ntampaka et al. (2019): “The Role of Machine Learning in Astrophysics”
- Philosophical and practical perspectives
- Fluke & Jacobs (2020): “Surveying the reach and maturity of machine learning and artificial intelligence in astronomy”
- State of the field
Deep Learning
- Goodfellow, Bengio, Courville (2016): Deep Learning
- Comprehensive textbook
- Free online: https://www.deeplearningbook.org/
What’s Next
You now understand:
- What machine learning is (and isn’t)
- The fundamental learning problem (generalization)
- Types of ML (supervised, unsupervised, reinforcement)
- Why ML matters for astrophysics (speed, complexity, data)
- How to evaluate models (metrics, validation, regularization)
- The philosophical shift (explanation vs prediction)
In Part 1, you’ll learn:
- Neural Networks: Universal function approximators
- How to build and train a small MLP emulator
- Training, validation, and architecture design
In Part 2, you’ll learn:
- Uncertainty Quantification: Principled error bars on ML predictions
- Bayesian neural networks, ensembles, and calibration
- When and how to trust your model’s confidence
Together, these parts will give you:
- Deep understanding of modern ML methods
- Ability to choose appropriate tools for astrophysical problems
- Skills to build and validate emulators for expensive simulations
- Critical perspective on the role of ML in science
At the beginning of this course, you had:
- Theory: Equations of motion, statistical mechanics, radiative transfer
- Numerics: Integrators, Monte Carlo methods, discretization schemes
Now you’re adding:
- Statistics: Bayesian inference (Module 5)
- Machine Learning: Neural networks and uncertainty quantification (this module)
These aren’t separate domains — they’re deeply connected:
\[ \text{Physics} \xrightarrow{\text{simulations}} \text{Data} \xrightarrow{\text{ML}} \text{Fast predictions} \xrightarrow{\text{inference}} \text{Scientific insights} \]
You’re not just a programmer or a physicist or a statistician.
You’re a computational scientist who uses all these tools synergistically to understand the universe.
Welcome to the 21st century of astrophysics.