Overview: Introduction to Machine Learning

The Learnable Universe | Module 2 | COMP 536

Author

Anna Rosen

“The purpose of computing is insight, not numbers.”

— Richard Hamming

Learning Objectives

By the end of this module, you will be able to:

  1. Articulate what machine learning is and how it differs from (and connects to) traditional scientific computing
  2. Identify when machine learning is appropriate for astrophysical problems versus when physics-based simulation is better
  3. Understand the fundamental learning problem: balancing model complexity with data limitations
  4. Connect machine learning to statistical inference (Module 5) and function approximation (Module 1)
  5. Distinguish between supervised, unsupervised, and reinforcement learning paradigms
  6. Recognize common pitfalls: overfitting, underfitting, and the bias-variance tradeoff
  7. Design validation strategies to assess model generalization
  8. Appreciate the philosophical shift from “physics first” to “data first” approaches in modern astrophysics
Note: Where We Are in the Course

This module builds directly on Module 5: Inferential Thinking, extending Bayesian inference to function space:

The Statistical Foundation (Modules 1–4):

  • Module 1: Probability, distributions, moments, sampling \(\to\) statistical description of uncertainty
  • Module 2: Statistical mechanics, Boltzmann equation \(\to\) physics from statistics
  • Module 3: Phase space, dynamics, N-body \(\to\) simulating the universe
  • Module 4: Radiative transfer, Monte Carlo \(\to\) photons and computation

The Inferential Framework (Module 5) — THIS IS OUR FOUNDATION:

  • Bayesian inference: \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) \(\to\) learning parameters from data
  • MCMC: sampling complex posteriors \(\to\) computational inference
  • Model comparison: evidence and Bayes factors \(\to\) choosing between models
  • Key insight: Update beliefs using data (likelihood) and prior knowledge

The Machine Learning Extension (This Module):

  • Overview (this document): ML as Bayesian inference in function space
  • Part 1: Neural Networks \(\to\) flexible function approximators and approximate inference
  • Part 2: Uncertainty Quantification \(\to\) principled error bars on ML predictions

The Connection: Module 5 taught you \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\). Now we learn \(p(f \,\vert\, \mathcal{D})\).

Same Bayesian principles, infinite-dimensional space.


From Parameter Inference to Function Learning

The Bayesian Foundation (Module 5 Recap)

In Module 5: Inferential Thinking, you learned Bayesian inference for parameters:

The Parameter Inference Problem:

  • Given data \(\mathcal{D}\) and model \(M\), what parameters \(\boldsymbol{\theta}\) are plausible?
  • Posterior: \(p(\boldsymbol{\theta} \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})\)
  • MCMC: Sample the posterior when no closed form exists
  • Model comparison: Bayes factors, evidence

Example from Module 5: Fitting a line to data \[ y_i = \theta_0 + \theta_1 x_i + \epsilon_i \]

You inferred the parameters \(\boldsymbol{\theta} = (\theta_0, \theta_1)\) from noisy observations.

The Machine Learning Extension

Machine learning is Bayesian inference in function space.

Instead of asking “what parameters \(\boldsymbol{\theta}\) are plausible?”, we ask:

“What functions \(f\) are plausible?”

Given data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}\), we want the posterior over functions: \[ p(f \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, f) \cdot p(f) \]

The mathematical structure is identical to Module 5 — only the space changed:

| Module 5: Parameter Space | Module 2: Function Space |
| --- | --- |
| Parameter \(\boldsymbol{\theta} \in \mathbb{R}^d\) | Function \(f: \mathbb{R}^d \to \mathbb{R}\) |
| Prior \(p(\boldsymbol{\theta})\) | Prior \(p(f)\) (architecture, regularization) |
| Likelihood \(p(\mathcal{D} \,\vert\, \boldsymbol{\theta})\) | Likelihood \(p(\mathcal{D} \,\vert\, f)\) |
| MCMC samples \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) | Learning algorithms find \(p(f \,\vert\, \mathcal{D})\) |
| Predictive: marginalize posterior | Predictive: marginalize posterior |
Tip: The Key Insight

Module 5: Learn plausible parameters from data

Module 2: Learn plausible functions from data

Functions are infinite-dimensional parameters! ML extends your Bayesian intuition to function space.

What This Module Teaches You

Overview: What is machine learning? Why does astrophysics need it?

Part 1: Neural Networks

  • Flexible function approximators
  • Training = Finding maximum a posteriori (MAP) estimate
  • Universal approximation and scalable inference

Part 2: Uncertainty Quantification

  • Principled approaches to error bars on ML predictions
  • Bayesian neural networks, ensembles, and calibration
  • Connecting back to Module 5’s posterior reasoning

Throughout: Bayesian thinking guides ML design choices

Note: Connection to Your Projects

Projects 1–4: Implement physics simulations

Final project JAX rebuild: Generate validated N-body simulation data

Final Project:

  • Use simulations as training data
  • Learn \(f:\) (initial conditions) \(\to\) (cluster evolution)
  • Predict outcomes ~100–1000\(\times\) faster than physics simulation (depending on accuracy requirements)

This is physics-informed machine learning: Bayesian inference trained on physics.


Part 1: What Is Machine Learning?

The Traditional Scientific Method

You’ve spent this entire semester doing physics-based computational astrophysics:

  1. Write down equations: \(\frac{d\mathbf{v}_i}{dt} = \sum_{j \neq i} \frac{Gm_j(\mathbf{r}_j - \mathbf{r}_i)}{|\mathbf{r}_j - \mathbf{r}_i|^3}\)
  2. Solve numerically: Runge-Kutta, leapfrog, adaptive timesteps
  3. Analyze results: Measure observables, compare to theory
  4. Iterate: Refine physics, improve numerics, explore parameter space

This is model-driven science: we start with physical laws and derive predictions.

Strengths:

  • Interpretable (we understand every step)
  • Generalizable (physics is universal)
  • Predictive (can extrapolate beyond training regime)
  • Satisfying (we understand why things happen)

Limitations:

  • Computationally expensive (Project 2: minutes per simulation)
  • Doesn’t scale to complex systems (turbulence, galaxy formation)
  • Requires known physics (what about dark matter?)
  • Parameter space exploration is prohibitive (need 10,000+ simulations)

The Machine Learning Paradigm

Machine learning inverts the traditional approach:

Instead of: Physics equations \(\to\) Simulation \(\to\) Predictions

We do: Data \(\to\) Learning algorithm \(\to\) Predictions

TipThe Profound Shift

Traditional science asks: “Given these physical laws, what will happen?”

Machine learning asks: “Given these observations, what patterns exist?”

This isn’t replacing physics — it’s complementing it. We use ML when:

  1. Physics is too expensive to simulate directly
  2. Physics is unknown or incomplete
  3. Data is abundant but complex
  4. We need fast predictions for exploration/optimization

A Concrete Example: Your Final Project

The Problem: Predict how a star cluster evolves from \(t=0\) to \(t=200\) Myr.

Physics-based approach (Project 2 plus the final project’s JAX rebuild):

Initial conditions → N-body equations → Integrate 200 Myr → r_core(t)
                    (Computationally expensive!)

Machine learning approach (Final Project):

Many N-body simulations → Learn patterns → Instant predictions
(Upfront cost, then fast!)

The key insight: We’re not replacing physics — we’re using N-body simulations to train a fast surrogate model.

This is physics-informed machine learning: combining physical knowledge with data-driven learning.


Part 2: The Learning Problem — Generalization from Data

What Does It Mean to “Learn”?

Suppose we have training data: \[ \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} \]

where:

  • \(\mathbf{x}_i\) are inputs (e.g., initial cluster conditions)
  • \(y_i\) are outputs (e.g., core radius at \(t=100\) Myr)

Goal: Find a function \(f\) such that for new, unseen inputs \(\mathbf{x}_*\): \[ f(\mathbf{x}_*) \approx y_* \]

This is the learning problem: generalize from observed data to unobserved cases.

Note: Connection to Module 1: Function Approximation

Remember basis function expansions? \[ f(x) = \sum_{i=1}^{M} w_i \phi_i(x) \]

Machine learning is function approximation where:

  1. The basis functions \(\phi_i\) are learned from data (not chosen a priori)
  2. We care about generalization to new data (not just fitting training data)
  3. We quantify uncertainty in predictions

This connects directly to your work on Fourier series, Legendre polynomials, etc. in Module 1!
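As a concrete bridge from Module 1, here is a minimal fixed-basis least-squares fit. NumPy is used so the sketch is self-contained (jax.numpy exposes the same calls); the polynomial basis and the noisy sine target are illustrative choices, not course data:

```python
import numpy as np

def design_matrix(x, M):
    """Fixed polynomial basis: phi_i(x) = x**i for i = 0..M-1."""
    return np.stack([x**i for i in range(M)], axis=1)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)  # noisy target

Phi = design_matrix(x, M=6)                  # basis chosen a priori
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares weights
y_fit = Phi @ w
print(Phi.shape, np.mean((y_fit - y) ** 2))
```

Machine learning keeps this structure but replaces the hand-chosen \(\phi_i\) with features learned from the data.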

The Bayesian Perspective: Learning = Inference

Tip: Connection to Module 5: From Parameters to Functions

Module 5: Bayesian inference for parameters \(\boldsymbol{\theta}\) \[ p(\boldsymbol{\theta} \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\mathcal{D})} \]

Machine learning: Bayesian inference for functions \(f\) \[ p(f \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, f) p(f)}{p(\mathcal{D})} \]

Same structure, infinite-dimensional space!

  • Prior \(p(f)\): Encodes beliefs about plausible functions (smoothness, scales, etc.)
  • Likelihood \(p(\mathcal{D} \,|\, f)\): How well does \(f\) explain the data?
  • Posterior \(p(f \,|\, \mathcal{D})\): Updated beliefs after seeing data — this is what learning computes

Neural networks perform approximate inference — MAP estimation finds a single “best” function, and scales to millions of data points. Uncertainty quantification techniques (Part 2) then recover principled error bars on these predictions.

The Fundamental Challenge: Bias-Variance Tradeoff

Consider fitting a polynomial to data:

Underfitting (high bias): \[ f(x) = w_0 + w_1 x \quad \text{(linear model)} \]

  • Too simple to capture true pattern
  • Poor performance on both training and test data
  • Bias: model assumptions are wrong

Overfitting (high variance): \[ f(x) = \sum_{i=0}^{100} w_i x^i \quad \text{(100th degree polynomial)} \]

  • Fits training data perfectly (even noise!)
  • Poor performance on test data
  • Variance: model is too sensitive to training data

Just right: \[ f(x) = \sum_{i=0}^{5} w_i x^i \quad \text{(modest degree polynomial)} \]

  • Captures true pattern without memorizing noise
  • Good performance on both training and test data
Bias-variance tradeoff for supervised learning. The goal is not maximum flexibility; the goal is a model flexible enough to capture the signal and constrained enough to generalize.

| Model complexity | Training behavior | Test behavior | Diagnosis |
| --- | --- | --- | --- |
| Too simple | Cannot match the pattern | Poor generalization | High bias |
| Well matched | Captures signal without chasing noise | Best generalization | Balanced bias and variance |
| Too flexible | Can memorize the training set | Poor generalization | High variance |

The mathematical formulation:

Expected prediction error decomposes as: \[ \mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]

Key insight: There’s an optimal model complexity that balances bias and variance!
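The tradeoff is easy to reproduce numerically. This sketch (NumPy; the degrees, noise level, and sine target are illustrative assumptions) fits polynomials of increasing degree and compares training error against error on noise-free test points:

```python
import numpy as np

rng = np.random.default_rng(1)
x_tr = np.linspace(-1, 1, 30)
x_te = np.linspace(-1, 1, 101)
true_f = lambda x: np.sin(np.pi * x)                     # underlying signal
y_tr = true_f(x_tr) + 0.2 * rng.normal(size=x_tr.size)   # noisy training set
y_te = true_f(x_te)                                      # noise-free test targets

def fit_mse(deg):
    """Fit a degree-`deg` polynomial and return (train MSE, test MSE)."""
    w = np.polyfit(x_tr, y_tr, deg)
    train = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    test = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    return train, test

for deg in (1, 5, 20):
    tr, te = fit_mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Training error only falls as degree grows; test error is what exposes the high-variance regime.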

Note: Connection to Module 5: Model Comparison

Module 5 connection: The bias-variance tradeoff relates to Bayesian model comparison via evidence \(p(\mathcal{D} \,|\, \mathcal{M})\). Simple models may underfit (high bias), complex models may overfit (high variance). The optimal model balances both.

Full Bayesian inference marginalizes over functions rather than selecting a single “best” function — this is the foundation for the uncertainty quantification techniques we explore in Part 2.

Training, Validation, and Test Sets

The gold standard: Split data into three sets:

  1. Training set (60–80%): Fit model parameters
  2. Validation set (10–20%): Tune hyperparameters (model complexity, regularization)
  3. Test set (10–20%): Final evaluation (never used during development!)

Why three sets?

  • Training: Learns patterns
  • Validation: Prevents overfitting to training data
  • Test: Unbiased estimate of generalization performance
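A minimal splitting helper, assuming the 70/15/15 fractions purely for illustration (NumPy so the sketch is self-contained):

```python
import numpy as np

def three_way_split(X, y, seed=0, frac=(0.7, 0.15, 0.15)):
    """Shuffle once, then carve train / validation / test subsets."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_tr = int(frac[0] * len(X))
    n_val = int(frac[1] * len(X))
    tr = idx[:n_tr]
    val = idx[n_tr:n_tr + n_val]
    te = idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[te], y[te])

X = np.arange(100, dtype=float).reshape(100, 1)
y = np.arange(100, dtype=float)
train, val, test = three_way_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```

Shuffling before splitting matters: many simulation datasets are ordered by parameter, and an unshuffled split would put an entire corner of parameter space into the test set.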
Warning: Critical Mistake to Avoid

Never use the test set for:

  • Choosing model architecture
  • Tuning hyperparameters
  • Deciding when to stop training
  • Any decision whatsoever!

The test set is sacred — touch it only once at the very end.

If you tune your model based on test performance, you’re effectively training on the test set (indirectly). This gives you an overly optimistic estimate of how well your model will generalize.

Cross-Validation: Making the Most of Limited Data

When data is scarce (e.g., 200 expensive N-body simulations), splitting into train/val/test is wasteful.

K-fold cross-validation:

  1. Divide data into \(K\) folds (typically \(K=5\) or \(K=10\))
  2. For each fold \(k\):
    • Train on all folds except \(k\)
    • Validate on fold \(k\)
  3. Average performance across all \(K\) folds

This uses all data for both training and validation (at different times).

import jax.numpy as jnp

def k_fold_cross_validation(X, y, model_fn, k=5):
    """K-fold cross-validation

    Args:
        X: Input data (N, d)
        y: Outputs (N,)
        model_fn: Function that trains and evaluates model
        k: Number of folds

    Returns:
        scores: Validation scores for each fold
    """
    N = len(X)
    fold_size = N // k  # NOTE: if N % k != 0, the remainder never enters a validation fold
    scores = []

    for i in range(k):
        # Split into train and validation
        val_start = i * fold_size
        val_end = (i + 1) * fold_size

        # Validation fold
        X_val = X[val_start:val_end]
        y_val = y[val_start:val_end]

        # Training folds (everything else)
        X_train = jnp.concatenate([X[:val_start], X[val_end:]])
        y_train = jnp.concatenate([y[:val_start], y[val_end:]])

        # Train and evaluate
        score = model_fn(X_train, y_train, X_val, y_val)
        scores.append(score)

    return jnp.array(scores)

# Usage: model_fn(X_train, y_train, X_val, y_val) -> validation score
scores = k_fold_cross_validation(X, y, model_fn, k=5)
mean_score = jnp.mean(scores)
std_score = jnp.std(scores)
print(f"CV Score: {mean_score:.3f} ± {std_score:.3f}")

Part 3: Types of Machine Learning

Supervised Learning: Learning from Labeled Examples

Setup: We have input-output pairs \(\{(\mathbf{x}_i, y_i)\}\)

Goal: Learn mapping \(f: \mathbf{x} \to y\)

Two subtypes:

  1. Regression: Predict continuous values
    • Example: Initial conditions \(\to\) core radius
    • Loss: Mean Squared Error (MSE)
  2. Classification: Predict discrete categories
    • Example: Galaxy image \(\to\) [spiral, elliptical, irregular]
    • Loss: Cross-entropy

Astrophysical applications:

  • Photometric redshifts (colors \(\to\) distance)
  • Supernova classification (light curve \(\to\) type Ia vs core-collapse)
  • Exoplanet detection (light curve \(\to\) planet or not)
  • Your project: Initial conditions \(\to\) cluster evolution
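The cross-entropy loss named above can be sketched in a few lines (NumPy; the three-class galaxy labels and the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -np.log(softmax(logits)[label])

# Hypothetical 3-class logits for [spiral, elliptical, irregular]
confident_right = cross_entropy(np.array([4.0, 0.0, 0.0]), label=0)
confident_wrong = cross_entropy(np.array([0.0, 4.0, 0.0]), label=0)
print(confident_right, confident_wrong)  # low loss vs high loss
```

A confident correct prediction incurs a small loss; a confident wrong one is penalized heavily, which is exactly the behavior a classification loss should have.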
Note: Connection to Module 5: Maximum Likelihood

Supervised learning is maximum likelihood estimation!

If we assume Gaussian noise: \(y_i = f(\mathbf{x}_i) + \epsilon_i\), \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)

Then the likelihood is: \[ p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, f_{\boldsymbol{\theta}}(\mathbf{x}_i), \sigma^2) \]

Maximizing likelihood = minimizing negative log-likelihood: \[ -\log p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 + \text{const} \]

This is Mean Squared Error when \(\sigma^2\) is treated as a known constant!

Important caveat: This equivalence holds only if the noise variance \(\sigma^2\) is pre-specified or known. In Bayesian inference, if \(\sigma^2\) is unknown, it becomes a hyperparameter that must be either:

  • Pre-specified from prior knowledge
  • Estimated from data (complicating the likelihood)
  • Marginalized over in a full Bayesian treatment

In practice, standard neural networks do not learn \(\sigma^2\) — they implicitly assume a fixed, predetermined noise level. If you want to learn \(\sigma^2\) from data, you’d need to either:

  1. Add it as an output parameter (heteroskedastic regression)
  2. Use a Bayesian approach that marginalizes over it

So training neural networks is maximum likelihood estimation under the assumption of known noise variance, connecting back to Bayesian inference from Module 5.
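The likelihood/MSE identity above is easy to verify numerically. This sketch (NumPy; \(\sigma = 0.5\) is an arbitrary assumed-known noise level) checks that the Gaussian negative log-likelihood equals the scaled MSE plus a constant that does not depend on \(f\):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=10)    # observations y_i
f = rng.normal(size=10)    # model predictions f_theta(x_i)
sigma = 0.5                # assumed-known noise level

# Negative log-likelihood of independent Gaussians, summed term by term
nll = np.sum(0.5 * ((y - f) / sigma) ** 2
             + 0.5 * np.log(2 * np.pi * sigma**2))

# Scaled MSE plus the f-independent constant from the derivation above
mse_term = np.sum((y - f) ** 2) / (2 * sigma**2)
const = 0.5 * len(y) * np.log(2 * np.pi * sigma**2)
print(np.isclose(nll, mse_term + const))  # True
```

Because the constant is independent of \(f\), minimizing the NLL and minimizing the MSE pick out the same function.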

Unsupervised Learning: Finding Structure Without Labels

Setup: We only have inputs \(\{\mathbf{x}_i\}\) (no outputs!)

Goal: Discover hidden structure, patterns, or groupings

Common tasks:

  1. Clustering: Group similar objects
    • K-means, hierarchical clustering
    • Example: Group galaxies by properties (without pre-defined labels)
  2. Dimensionality reduction: Find low-dimensional representation
    • PCA, t-SNE, autoencoders
    • Example: Compress 1000-dimensional galaxy spectra to 10 principal components
  3. Density estimation: Learn probability distribution
    • Example: What’s the distribution of stellar masses in clusters?

Astrophysical applications:

  • Discovering new classes of objects (quasars, gamma-ray bursts)
  • Anomaly detection (unusual supernovae, transients)
  • Data compression (large survey data)
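PCA-style compression via the SVD can be sketched directly (NumPy; the synthetic rank-3 "spectra" are illustrative, not survey data):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical "spectra": 200 objects x 50 bins, rank-3 signal plus noise
basis = rng.normal(size=(3, 50))
spectra = rng.normal(size=(200, 3)) @ basis + 0.05 * rng.normal(size=(200, 50))

Xc = spectra - spectra.mean(axis=0)        # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruction_error(k):
    """Mean squared error after keeping the top-k principal components."""
    Xk = (U[:, :k] * S[:k]) @ Vt[:k]
    return np.mean((Xc - Xk) ** 2)

print(reconstruction_error(1), reconstruction_error(3))
```

Because the data has three dominant directions, three components recover the signal almost completely; this is the sense in which a 1000-dimensional spectrum can compress to a handful of numbers.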
Tip: Example: Unsupervised vs Supervised

Supervised: “Here are 1000 galaxies labeled as spiral or elliptical. Learn to classify new galaxies.”

Unsupervised: “Here are 1000 unlabeled galaxies. Do they naturally group into categories?”

Unsupervised learning can discover categories we didn’t know existed!

Reinforcement Learning: Learning from Interaction

Setup: Agent interacts with environment, receives rewards

Goal: Learn policy (strategy) that maximizes cumulative reward

Components:

  • State \(s\): Current situation
  • Action \(a\): What agent can do
  • Reward \(r\): Feedback signal
  • Policy \(\pi(a|s)\): Strategy for choosing actions

Astrophysical applications (less common but emerging):

  • Optimizing telescope scheduling
  • Adaptive optics control
  • Gravitational wave detector tuning

Not the focus of this course (supervised learning is most relevant for emulation).


Part 4: Why Machine Learning for Astrophysics?

The Data Deluge

Astronomy is drowning in data:

| Survey/Instrument | Data Volume | Time to Analyze Manually |
| --- | --- | --- |
| Sloan Digital Sky Survey (SDSS) | 200 million objects | Centuries |
| Large Synoptic Survey Telescope (LSST) | 30 TB/night | Impossible |
| Square Kilometer Array (SKA) | 160 TB/second | Absolutely impossible |
| Gaia | 1 billion stars | Many lifetimes |

Traditional approach: Manually inspect each object, classify, measure properties

Machine learning approach: Train algorithms to do this automatically

The Complexity Challenge

Some astrophysical systems are too complex for analytic solutions:

Simple (we have equations):

  • Two-body problem: Solved exactly (Kepler orbits)
  • Linear perturbations: Analytic solutions exist
  • Spherical symmetry: Reduces to 1D ODEs

Complex (need simulation):

  • N-body problem (\(N > 2\)): No closed-form solution
  • Turbulence: Highly nonlinear, chaotic
  • Galaxy formation: Multi-scale, multi-physics

Very complex (even simulation is hard):

  • Cosmological structure formation: Box size vs resolution tradeoff
  • Stellar interiors with magnetic fields: MHD is expensive
  • Radiative transfer in 3D: Photon transport is costly

Machine learning solution: Learn simplified models from expensive simulations

The Emulation Use Case: Your Final Project

This is the primary motivation for your final project:

The Problem:

  • N-body simulation: 1 minute per run
  • Want to explore 5D parameter space: Need ~10,000 runs
  • Total time: ~1 week of continuous computing
  • Then if we update physics? Start over!

The ML Solution:

  1. Run 500 simulations (8 hours)
  2. Train surrogate model (1 hour)
  3. Make 10,000 predictions (seconds)
  4. Total: <10 hours instead of 1 week

The benefit: ~100–1000\(\times\) speedup (depending on emulator type and accuracy) enables:

  • Parameter space exploration
  • Uncertainty quantification
  • Optimization (find best-fit parameters)
  • Real-time analysis
Important: The Tradeoff

Physics simulation:

  • Exact (within numerical precision)
  • Interpretable (we understand every force)
  • Generalizes (physics is universal)
  • Slow (minutes to hours per run)

ML emulator:

  • Fast (milliseconds per prediction)
  • Smooth (learns continuous functions)
  • Approximate (prediction error ~1–10%)
  • Interpolates (not extrapolates) well
  • Black box (harder to interpret)

The goal isn’t to replace physics — it’s to accelerate exploration while staying honest about uncertainties.

When NOT to Use Machine Learning

ML is powerful but not always appropriate:

Don’t use ML when:

  1. Physics is cheap: If simulation takes seconds, no need for emulation
  2. Data is scarce: <100 training examples \(\to\) physics priors are better
  3. Interpretability is critical: Need to understand mechanism, not just predict
  4. Extrapolation is required: ML fails outside training distribution
  5. Uncertainty quantification is essential: Standard NNs don’t provide this (but see Part 2 for solutions!)

Use ML when:

  1. Physics is expensive: Emulation saves time
  2. Data is abundant: Enough examples to learn patterns
  3. Patterns are complex: Too nonlinear for simple models
  4. Speed matters: Need real-time or interactive predictions
  5. You validate carefully: Test generalization thoroughly

Part 5: The Spectrum of Models — From Physics to Data

A Taxonomy of Approaches

Spectrum from physics-based models to data-driven models. The final project sits near Level 2: the simulator remains the scientific anchor, and the emulator is useful only after validation.

| Level | Model style | What carries the scientific burden | Course example |
| --- | --- | --- | --- |
| 1 | Pure physics | Equations, numerical methods, and validation | Project 2 N-body simulator |
| 2 | Simulation-trained emulator | Validated simulator plus held-out emulator tests | Final-project neural emulator |
| 3 | Constrained ML | Data fit plus physics-aware constraints | Optional regularization or conservation checks |
| 4 | Pure ML | Data coverage, model capacity, and validation | Standard neural-network regression |
| 5 | End-to-end learning | Representation learning from raw inputs | Galaxy morphology from pixels |

Level 1: Pure Physics (Project 2 and the JAX rebuild)

  • Start from first principles (\(F = ma\))
  • Solve equations numerically
  • No learning from data
  • Example: Your N-body code

Level 2: Simulation-Trained Emulation (Final project)

  • Use the validated simulator to generate training examples
  • Learn a fast map from physical inputs to simulation outputs
  • Example: \((Q_0, a) \to\) cluster summary diagnostics
  • Best when the emulator is checked against held-out simulations

Level 3: Constrained ML (Physics-Informed Neural Networks)

  • Encode physics in loss functions
  • Network is flexible but respects constraints
  • Example: Energy conservation penalty

Level 4: Pure ML (Standard Neural Networks)

  • Learn directly from data
  • No explicit physics
  • Example: Image recognition, time series prediction

Level 5: End-to-End Learning (Deep Learning on Raw Data)

  • Learn representations + predictions jointly
  • Example: Galaxy morphology from pixels
Tip: The Astrophysics Sweet Spot

Most astrophysical applications sit at Levels 2–3:

  • We have strong physical priors (conservation laws, symmetries)
  • Data is expensive but available
  • We want speed + interpretability

Pure data-driven ML (Levels 4–5) works when:

  • Physics is unknown or extremely complex
  • Massive datasets available (images, spectra)
  • Task is pattern recognition rather than understanding

Example: Photometric Redshifts

The Problem: Estimate galaxy distance from broad-band colors (no spectroscopy)

Level 1 (Pure Physics):

  • Model galaxy spectral energy distributions (SEDs)
  • Compute expected colors at each redshift
  • Find best-fit template
  • Pro: Interpretable. Con: Templates may not match real galaxies

Level 2–3 (Physics-Informed ML):

  • Train on galaxies with known redshifts (spectroscopy)
  • Use physics-motivated features (colors, emission lines)
  • Constrain predictions to be positive, ordered
  • Pro: More accurate. Con: Needs training data

Level 4 (Pure ML):

  • Neural network: colors \(\to\) redshift
  • No physics assumptions
  • Pro: Very accurate. Con: Black box, hard to diagnose failures

Current best practice: Hybrid approaches (Levels 2–3)


Part 6: The Learning Algorithm Landscape

Classical Machine Learning (Pre-Deep Learning)

Linear Models:

  • Linear regression: \(y = \mathbf{w}^T \mathbf{x} + b\)
  • Logistic regression: \(p(y=1) = \sigma(\mathbf{w}^T \mathbf{x} + b)\)
  • Pro: Fast, interpretable. Con: Limited expressivity

Tree-Based Methods:

  • Decision trees: Binary splits on features
  • Random forests: Ensemble of trees
  • Gradient boosting (XGBoost, LightGBM)
  • Pro: Handles nonlinearity, feature importance. Con: Can overfit

Support Vector Machines (SVMs):

  • Find maximum-margin hyperplane
  • Kernel trick for nonlinearity
  • Pro: Good for small data. Con: Doesn’t scale to large datasets

Deep Learning (Modern ML)

Neural Networks (Part 1 of this module):

  • Multilayer perceptrons (MLPs)
  • Convolutional networks (CNNs) for images
  • Specialized networks for images, sequences, and structured data
  • Pro: Universal approximators, scalable. Con: Data-hungry, black-box

Specialized Architectures:

  • Graph neural networks: Data with graph structure
  • Physics-informed neural networks: Encode PDEs
  • Pro: Incorporate domain knowledge. Con: More complex to train

The Modern Paradigm:

  • Deep learning dominates when data is abundant (millions of examples)
  • Classical ML still competitive for small datasets (<10,000 examples)
  • Your project: a small neural emulator for simulation summary statistics, with uncertainty quantification techniques layered on top

Part 7: Evaluation and Validation

Metrics for Regression

For continuous predictions \(\hat{y}_i\) vs true values \(y_i\):

Mean Squared Error (MSE): \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]

  • Units: squared output units
  • Penalizes large errors heavily

Root Mean Squared Error (RMSE): \[ \text{RMSE} = \sqrt{\text{MSE}} \]

  • Units: same as output
  • Easier to interpret than MSE

Mean Absolute Error (MAE): \[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \]

  • More robust to outliers than MSE

\(R^2\) Score (Coefficient of Determination): \[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

  • \(R^2 = 1\): Perfect predictions
  • \(R^2 = 0\): No better than predicting mean
  • \(R^2 < 0\): Worse than predicting mean!
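All four metrics fit in one helper (NumPy; a sketch, not a course-provided utility):

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MSE, RMSE, MAE, and R^2 for a set of predictions."""
    mse = np.mean((y - y_hat) ** 2)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse),
            "MAE": np.mean(np.abs(y - y_hat)), "R2": r2}

y = np.array([1.0, 2.0, 3.0, 4.0])
print(regression_metrics(y, y)["R2"])                     # 1.0 (perfect)
print(regression_metrics(y, np.full(4, y.mean()))["R2"])  # 0.0 (mean predictor)
```

The two calls confirm the boundary cases quoted above: perfect predictions give \(R^2 = 1\) and the mean predictor gives \(R^2 = 0\).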

Classification metrics: Accuracy, precision, recall, F1 score, confusion matrix

Visualizing performance:

  • Residual plots: Should show random scatter (no systematic bias)
  • Prediction plots: Points should fall on diagonal
  • Learning curves: Training and validation loss should track each other (a widening gap signals overfitting)

Part 8: Regularization = Bayesian Priors

Tip: Connection to Module 5: Priors in Action

In Module 5, you learned that Bayesian inference combines prior beliefs \(p(\boldsymbol{\theta})\) with data via Bayes’ theorem.

Regularization = putting Module 5’s priors into practice! When we add regularization terms to a loss function, we’re encoding prior beliefs about what functions are plausible.

The Bayesian View: Regularization IS Prior Knowledge

In machine learning, we typically don’t compute the full posterior \(p(f \,|\, \mathcal{D})\). Instead, we find the MAP (Maximum A Posteriori) estimate: \[ f_{\text{MAP}} = \arg\max_f p(f \,|\, \mathcal{D}) = \arg\max_f \left[ p(\mathcal{D} \,|\, f) \cdot p(f) \right] \]

Taking the negative log: \[ f_{\text{MAP}} = \arg\min_f \left[ \underbrace{-\log p(\mathcal{D} \,|\, f)}_{\text{Data loss } \mathcal{L}_{\text{data}}} + \underbrace{-\log p(f)}_{\text{Prior penalty}} \right] \]

This is training with regularization!

\[ \boxed{\text{Training loss} = \text{Data loss} + \text{Regularization} = -\log p(\mathcal{D} \,|\, f) - \log p(f)} \]

Different regularization techniques = different Bayesian priors on functions

L2 Regularization = Gaussian Prior

Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2 \]

Bayesian interpretation: This is MAP estimation with a Gaussian prior on weights: \[ p(w_i) = \mathcal{N}(0, \sigma^2) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const} \]

where \(\lambda = 1/\sigma^2\) connects the regularization strength to prior variance.

What this prior says: “I believe weights should be small, with typical magnitude \(\sigma\). Large weights are possible but unlikely.”

Effect: Smooth functions (no wild oscillations)

Also known as: Ridge regression, weight decay
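For linear models the Gaussian-prior MAP estimate has a closed form, which makes the shrinkage visible (NumPy; the design matrix and true weights here are synthetic, chosen only to illustrate the effect):

```python
import numpy as np

def ridge(X, y, lam):
    """MAP weights under a Gaussian prior: solve (X^T X + lam I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])    # synthetic ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

w_mle = ridge(X, y, lam=0.0)    # maximum likelihood (flat prior)
w_map = ridge(X, y, lam=10.0)   # Gaussian prior shrinks the weights
print(np.linalg.norm(w_map) < np.linalg.norm(w_mle))  # True
```

Increasing \(\lambda\) corresponds to a tighter prior (smaller \(\sigma\)), and the weight norm shrinks accordingly.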

L1 Regularization = Laplace Prior

Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i| \]

Bayesian interpretation: This is MAP estimation with a Laplace prior: \[ p(w_i) = \frac{1}{2b} \exp\left(-\frac{|w_i|}{b}\right) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{b} \sum_i |w_i| + \text{const} \]

What this prior says: “I believe most weights should be exactly zero. Only a few features matter.”

Effect: Sparse models — drives many weights to exactly zero

Use case: Automatic feature selection

Also known as: Lasso regression
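The exact zeros come from the soft-thresholding (proximal) operator associated with the L1 penalty; a minimal sketch (NumPy, with an illustrative weight vector):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal step for the L1 penalty: shrink magnitudes, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.05, -0.3, 1.2, -0.02, 0.8])
w_sparse = soft_threshold(w, lam=0.1)
print(w_sparse)  # entries below the threshold become exactly zero
```

Unlike L2 shrinkage, which only scales weights down, this operator sets small weights to exactly zero, which is why Lasso performs feature selection.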

Comparing Priors Visually

L2 and L1 regularization as priors on model weights. L2 says “small weights are more plausible”; L1 says “many weights may be unnecessary.”

| Regularizer | Bayesian interpretation | Geometric intuition | Practical effect |
| --- | --- | --- | --- |
| L2 | Gaussian prior on weights | Round contours | Shrinks weights smoothly |
| L1 | Laplace prior on weights | Diamond-shaped contours | Encourages exact zeros |

Key insight: The sharp corners of the L1 diamond cause weights to hit zero exactly!

Early Stopping = Implicit Prior on Solution Trajectory

What it is: Stop training when validation loss stops improving

Bayesian interpretation:

  • Training follows a path through weight space
  • Early stopping = prior belief that “good solutions appear early in training”
  • Equivalent to a complexity prior: simpler solutions (earlier in training) are preferred

Why this makes sense: Neural network training typically finds simpler patterns first, then overfits to noise later.
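A patience-based stopping rule can be sketched in a few lines (the validation-loss sequence is fabricated for illustration):

```python
def early_stopping_index(val_losses, patience=3):
    """Epoch of the best validation loss, stopping after `patience`
    consecutive epochs without improvement."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Validation loss falls, then rises as the network starts fitting noise
losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.40, 0.45, 0.52]
print(early_stopping_index(losses))  # 3 (the epoch with the lowest loss)
```

In practice you would also restore the weights saved at the best epoch, not just record its index.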

Dropout = Approximate Bayesian Model Averaging

What it is: During training, randomly drop neurons with probability \(p\)

Bayesian interpretation (Gal & Ghahramani, 2016): \[ \text{Dropout} \approx \text{Sampling from approximate posterior over networks} \]

At test time with dropout:

  • Each forward pass = sample from an approximate posterior \(\hat{p}(f \,|\, \mathcal{D})\)
  • Average predictions \(\approx\) approximate marginal prediction
  • Variance of predictions \(\approx\) approximate epistemic uncertainty estimate

Important caveats: This connection is approximate, and the approximation quality depends on:

  1. Coverage underfitting: Dropout uncertainty often underestimates true epistemic uncertainty (empirical coverage < nominal — e.g., 68% coverage instead of predicted 95%)
  2. Missing structure: The approximate posterior doesn’t capture all correlations present in the true Bayesian posterior
  3. Hyperparameter sensitivity: Dropout rate \(p\) must be tuned carefully; no principled way to choose it from Bayes’ theorem
  4. No marginal likelihood: Dropout networks can’t compute marginal likelihood for model comparison

When to use for uncertainty:

  • Quick uncertainty estimates for exploratory analysis
  • Use with caution for scientific publications — verify calibration against held-out data
  • Avoid for active learning where you need exact uncertainty for adaptive sampling
# Dropout for uncertainty quantification (MC dropout)
import jax.numpy as jnp

def predict_with_uncertainty(model, x, n_samples=100):
    """Keep dropout active at test time to sample an approximate posterior."""
    predictions = []
    for _ in range(n_samples):
        # training=True leaves the dropout masks on (Keras/Flax-style call)
        y_pred = model(x, training=True)
        predictions.append(y_pred)

    predictions = jnp.array(predictions)
    mean = jnp.mean(predictions, axis=0)  # approximate marginal prediction
    std = jnp.std(predictions, axis=0)    # approximate epistemic uncertainty

    return mean, std
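The "verify calibration against held-out data" advice amounts to comparing empirical interval coverage to the nominal Gaussian level. A self-contained sketch with synthetic, deliberately well-calibrated predictions (the helper name `empirical_coverage` is illustrative):

```python
import numpy as np

def empirical_coverage(y_true, mean, std, k=1.0):
    """Fraction of true values inside mean +/- k*std; for calibrated Gaussian
    errors this should match the nominal level (about 68% for k = 1)."""
    return float(np.mean(np.abs(y_true - mean) <= k * std))

# Synthetic held-out set whose errors really are Gaussian with the predicted size
rng = np.random.default_rng(0)
mean = rng.normal(size=10_000)
std = np.full(10_000, 0.5)
y_true = mean + std * rng.normal(size=10_000)

cov = empirical_coverage(y_true, mean, std)
# cov lands near 0.683; miscalibrated (e.g. dropout) uncertainties land lower
```

Running the same check on MC-dropout predictions is how you would detect the coverage underfitting listed in caveat 1.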

The Fundamental Connection

ImportantRegularization = Bayesian Priors = Occam’s Razor

Three equivalent perspectives:

  1. Practical ML: Regularization prevents overfitting
    • Add penalty terms to loss function
    • Choose \(\lambda\) via cross-validation
  2. Bayesian inference (Module 5): Priors encode assumptions
    • \(p(f)\) represents beliefs before seeing data
    • MAP estimation finds most probable function
  3. Philosophy: Occam’s razor
    • Prefer simpler explanations
    • Complexity requires justification (better fit to data)

They’re all the same thing!

Regularization techniques:

  • L2: Gaussian prior \(\to\) smooth functions
  • L1: Laplace prior \(\to\) sparse functions (feature selection)
  • Early stopping: Implicit complexity prior \(\to\) prevent overtraining
  • Dropout: Approximate posterior sampling \(\to\) uncertainty quantification

When you add regularization, you’re imposing Bayesian priors. When you tune \(\lambda\), you’re choosing prior strength.
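The equivalence between perspectives 1 and 2 can be written in one line. For a Gaussian prior \(p(\mathbf{w}) \propto \exp\left(-\|\mathbf{w}\|^2 / 2\sigma_w^2\right)\), maximizing the log-posterior is the same as minimizing a regularized loss:

\[ \hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathcal{D} \,\vert\, \mathbf{w}) + \log p(\mathbf{w}) \right] = \arg\min_{\mathbf{w}} \left[ -\log p(\mathcal{D} \,\vert\, \mathbf{w}) + \lambda \sum_i w_i^2 \right], \qquad \lambda = \frac{1}{2\sigma_w^2} \]

A stronger prior (smaller \(\sigma_w\)) is a larger \(\lambda\): tuning \(\lambda\) by cross-validation is empirically choosing the prior width.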


Part 9: The Philosophical Shift — From Understanding to Prediction

The Traditional Physics Mindset

As a physicist, you’ve been trained to ask:

  • Why does this happen?
  • What are the fundamental laws?
  • Can we derive this from first principles?

This is the explanatory paradigm: science as understanding mechanism.

Example: Newton didn’t just predict planetary motion — he explained it via gravity.

The Machine Learning Mindset

Machine learning often asks:

  • What will happen?
  • How accurately can we predict?
  • What patterns exist in the data?

This is the predictive paradigm: science as pattern recognition.

Example: A neural network might predict galaxy morphology accurately without understanding how galaxies form.

TipAre These in Conflict?

No! They’re complementary.

Physics gives us:

  • Fundamental understanding
  • Generalization beyond training data
  • Confidence in extrapolation

Machine learning gives us:

  • Speed (predictions in milliseconds)
  • Ability to handle complexity
  • Discovery of unexpected patterns

Best approach: Combine both!

  • Use physics to constrain ML models
  • Use ML to accelerate physics simulations
  • Use ML to discover new physics (anomaly detection)

Interpretability vs Accuracy

The tradeoff: Linear models are interpretable; deep neural networks are powerful but opaque.

For astrophysics: Choose based on goal — interpretability for theory testing, prediction for observations.

Scientific Integrity

Key principles: Hold out test set (use only once!), report all experiments (including failures), validate predictions physically, practice open science.
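As a concrete guard-rail for the first principle, a minimal sketch that fixes the split once so the test block is only ever touched for the final reported numbers (the helper name `three_way_split` is illustrative):

```python
import numpy as np

def three_way_split(n, rng, f_val=0.15, f_test=0.15):
    """Shuffle sample indices once, then carve off test and validation blocks.
    The test indices should be consumed exactly once, for the final report."""
    idx = rng.permutation(n)
    n_test, n_val = round(f_test * n), round(f_val * n)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

rng = np.random.default_rng(42)
train, val, test = three_way_split(500, rng)  # 350 / 75 / 75, disjoint
```

Fixing the random seed and saving the index arrays alongside your results makes the split reproducible, which is part of the open-science principle above.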

ImportantThe Glass-Box Philosophy in ML

This course emphasizes understanding before using:

Neural Networks (NNs): Build from scratch

  • Implement MLPs, backprop, training loops
  • Understand gradient flow, loss landscapes

Uncertainty Quantification: Build from principles

  • Implement ensemble methods, calibration checks
  • Understand when and why uncertainty estimates can fail

Why this matters:

  • You’ll know when models fail (and why)
  • You’ll choose architectures wisely
  • You’ll debug effectively
  • You’ll use libraries responsibly

Analogy: You wouldn’t use an N-body code without understanding Newton’s laws. Same principle applies to ML!


Part 10: Looking Ahead — The Module Structure

Part 1: Neural Networks

What you’ll learn:

  • Universal approximation theorem
  • Backpropagation and gradient descent
  • Small multilayer perceptrons as scientific emulators
  • Normalization, train/validation/test splits, and baseline comparisons
  • Held-out evaluation before inference

What you’ll build:

  • Feedforward networks (MLPs)
  • A baseline MLP emulator
  • A simple comparison model
  • A training pipeline with validation and saved normalization statistics

What you’ll discover:

  • Trade-offs: expressivity vs interpretability
  • When different architectures are appropriate

Part 2: Uncertainty Quantification

What you’ll learn:

  • Bayesian neural networks and approximate inference
  • Ensemble methods for predictive uncertainty
  • Calibration: are your error bars trustworthy?
  • Connections back to Module 5’s posterior reasoning

What you’ll build:

  • Ensemble-based uncertainty estimates
  • Calibration diagnostics
  • Prediction intervals with coverage guarantees

What you’ll discover:

  • When uncertainty matters most (and when shortcuts suffice)
  • How to combine fast neural network predictions with principled error bars

The Final Project Arc

First pass: Build a trustworthy baseline

  • Build: small MLP emulator for chosen summary statistics
  • Train: learn from simulation outputs generated by your JAX-native simulator
  • Validate: held-out metrics, predicted-vs-true plots, and comparison to a simple baseline

Second pass: Add uncertainty and inference

  • Add: Ensemble methods, calibration checks
  • Evaluate: Are predictions trustworthy? Where do they fail?
  • Compare: Different architectures and UQ strategies

Deliverable: Comprehensive analysis answering:

  • Does the emulator beat a simple baseline? (mean, linear model, or nearest-neighbor/interpolation-style baseline)
  • How trustworthy are the predictions? (Calibrated uncertainty estimates)
  • Where does the emulator fail? (edges of parameter space, sparse training regions, or poorly chosen summary statistics)

Conceptual Checkpoints

Before moving to Parts 1 and 2, reflect on these questions:

  1. Generalization: You have 500 N-body simulations. How do you know your ML model will work on the 501st simulation you haven’t run yet?

  2. Bias-Variance: A simple linear model has high bias but low variance. A 100-layer neural network has low bias but high variance. For 200 training samples, which would you choose? Why?

  3. Cross-validation: Why do we need a separate validation set? Can’t we just look at training error?

  4. Physics vs ML: Your physics-based N-body code gives exact answers (within numerical precision). Your ML emulator gives approximate answers (\(\sim\)5% error). Why would you ever use the emulator?

  5. Interpretability: A neural network predicts accurately but doesn’t tell you why a particular input feature matters. How would you investigate feature importance? What tools from Module 5 (Bayesian inference) could help?

  6. Overfitting: You train a model that achieves 99.9% accuracy on training data but only 60% accuracy on test data. What went wrong? How would you fix it?

  7. Regularization: Explain why adding a penalty \(\lambda \sum_i w_i^2\) to the loss function is equivalent to putting a Gaussian prior on weights in Bayesian inference (Module 5 connection!).

  8. Connection to Module 1: How is machine learning related to function approximation with basis functions? What makes ML different from choosing Fourier or Legendre basis functions?


Further Reading

Foundational Machine Learning

  1. James, Witten, Hastie, Tibshirani (2013): An Introduction to Statistical Learning
  2. Bishop (2006): Pattern Recognition and Machine Learning
    • More mathematical, comprehensive
    • Covers Bayesian perspective
  3. Murphy (2012): Machine Learning: A Probabilistic Perspective
    • Connects ML to probabilistic inference
    • Great for physics backgrounds

ML for Physical Sciences

  1. Mehta et al. (2019): “A high-bias, low-variance introduction to Machine Learning for physicists”
  2. Cranmer et al. (2020): “The frontier of simulation-based inference”
    • Connects ML to statistical inference
    • Relevant for emulation
  3. Carleo et al. (2019): “Machine learning and the physical sciences”
    • Reviews of Modern Physics
    • Comprehensive survey

Astrophysics Applications

  1. Baron (2019): “Machine Learning in Astronomy: A Practical Overview”
    • Survey of ML applications in astronomy
  2. Ntampaka et al. (2019): “The Role of Machine Learning in Astrophysics”
    • Philosophical and practical perspectives
  3. Fluke & Jacobs (2020): “Surveying the reach and maturity of machine learning and artificial intelligence in astronomy”
    • State of the field

Deep Learning

  1. Goodfellow, Bengio, Courville (2016): Deep Learning

What’s Next

You now understand:

  • What machine learning is (and isn’t)
  • The fundamental learning problem (generalization)
  • Types of ML (supervised, unsupervised, reinforcement)
  • Why ML matters for astrophysics (speed, complexity, data)
  • How to evaluate models (metrics, validation, regularization)
  • The philosophical shift (explanation vs prediction)

In Part 1, you’ll learn:

  • Neural Networks: Universal function approximators
  • How to build and train a small MLP emulator
  • Training, validation, and architecture design

In Part 2, you’ll learn:

  • Uncertainty Quantification: Principled error bars on ML predictions
  • Bayesian neural networks, ensembles, and calibration
  • When and how to trust your model’s confidence

Together, these parts will give you:

  • Deep understanding of modern ML methods
  • Ability to choose appropriate tools for astrophysical problems
  • Skills to build and validate emulators for expensive simulations
  • Critical perspective on the role of ML in science

ImportantFinal Thought: The Computational Astrophysicist’s Toolkit

At the beginning of this course, you had:

  • Theory: Equations of motion, statistical mechanics, radiative transfer
  • Numerics: Integrators, Monte Carlo methods, discretization schemes

Now you’re adding:

  • Statistics: Bayesian inference (Module 5)
  • Machine Learning: Neural networks and uncertainty quantification (this module)

These aren’t separate domains — they’re deeply connected:

\[ \text{Physics} \xrightarrow{\text{simulations}} \text{Data} \xrightarrow{\text{ML}} \text{Fast predictions} \xrightarrow{\text{inference}} \text{Scientific insights} \]

You’re not just a programmer or a physicist or a statistician.

You’re a computational scientist who uses all these tools synergistically to understand the universe.

Welcome to the 21st century of astrophysics.