Overview: Introduction to Machine Learning
The Learnable Universe | Module 2 | COMP 536
“The purpose of computing is insight, not numbers.”
— Richard Hamming
Learning Objectives
By the end of this module, you will be able to:
- Articulate what machine learning is and how it differs from (and connects to) traditional scientific computing
- Identify when machine learning is appropriate for astrophysical problems versus when physics-based simulation is better
- Understand the fundamental learning problem: balancing model complexity with data limitations
- Connect machine learning to statistical inference (Module 5) and function approximation (Module 1)
- Distinguish between supervised, unsupervised, and reinforcement learning paradigms
- Recognize common pitfalls: overfitting, underfitting, and the bias-variance tradeoff
- Design validation strategies to assess model generalization
- Appreciate the philosophical shift from “physics first” to “data first” approaches in modern astrophysics
This module builds directly on Module 5: Inferential Thinking, extending Bayesian inference to function space:
The Statistical Foundation (Modules 1–4):
- Module 1: Probability, distributions, moments, sampling \(\to\) statistical description of uncertainty
- Module 2: Statistical mechanics, Boltzmann equation \(\to\) physics from statistics
- Module 3: Phase space, dynamics, N-body \(\to\) simulating the universe
- Module 4: Radiative transfer, Monte Carlo \(\to\) photons and computation
The Inferential Framework (Module 5) — THIS IS OUR FOUNDATION:
- Bayesian inference: \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) \(\to\) learning parameters from data
- MCMC: sampling complex posteriors \(\to\) computational inference
- Model comparison: evidence and Bayes factors \(\to\) choosing between models
- Key insight: Update beliefs using data (likelihood) and prior knowledge
The Machine Learning Extension (This Module):
- Overview (This): ML as Bayesian inference in function space
- Part 1: Neural Networks \(\to\) flexible function approximators and approximate inference
- Part 2: Uncertainty Quantification \(\to\) principled error bars on ML predictions
The Connection: Module 5 taught you \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\). Now we learn \(p(f \,\vert\, \mathcal{D})\).
Same Bayesian principles, infinite-dimensional space.
From Parameter Inference to Function Learning
The Bayesian Foundation (Module 5 Recap)
In Module 5: Inferential Thinking, you learned Bayesian inference for parameters:
The Parameter Inference Problem:
- Given data \(\mathcal{D}\) and model \(M\), what parameters \(\boldsymbol{\theta}\) are plausible?
- Posterior: \(p(\boldsymbol{\theta} \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})\)
- MCMC: Sample the posterior when no closed form exists
- Model comparison: Bayes factors, evidence
Example from Module 5: Fitting a line to data \[ y_i = \theta_0 + \theta_1 x_i + \epsilon_i \]
You inferred the parameters \(\boldsymbol{\theta} = (\theta_0, \theta_1)\) from noisy observations.
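A minimal NumPy sketch of that recap (the true slope, intercept, and noise level below are invented for illustration; the course's JAX snippets would use `jax.numpy` the same way). Least squares on the design matrix \([1, x]\) gives the maximum-likelihood \(\boldsymbol{\theta}\) under Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = theta0 + theta1 * x + noise
# (theta values chosen for illustration only)
theta_true = np.array([2.0, 0.5])
x = np.linspace(0.0, 10.0, 50)
y = theta_true[0] + theta_true[1] * x + rng.normal(0.0, 0.3, size=x.shape)

# Design matrix [1, x]; least squares = maximum likelihood under Gaussian noise
A = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_hat)  # close to (2.0, 0.5)
```

With a flat prior this is also the MAP estimate, i.e. the posterior mode you would locate with MCMC.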
The Machine Learning Extension
Machine learning is Bayesian inference in function space.
Instead of asking “what parameters \(\boldsymbol{\theta}\) are plausible?”, we ask:
“What functions \(f\) are plausible?”
Given data \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}\), we want the posterior over functions: \[ p(f \,|\, \mathcal{D}) \propto p(\mathcal{D} \,|\, f) \cdot p(f) \]
The mathematical structure is identical to Module 5 — only the space changed:
| Module 5: Parameter Space | Module 2: Function Space |
|---|---|
| Parameter \(\boldsymbol{\theta} \in \mathbb{R}^d\) | Function \(f: \mathbb{R}^d \to \mathbb{R}\) |
| Prior \(p(\boldsymbol{\theta})\) | Prior \(p(f)\) (architecture, regularization) |
| Likelihood \(p(\mathcal{D} \,\vert\, \boldsymbol{\theta})\) | Likelihood \(p(\mathcal{D} \,\vert\, f)\) |
| MCMC samples \(p(\boldsymbol{\theta} \,\vert\, \mathcal{D})\) | Learning algorithms find \(p(f \,\vert\, \mathcal{D})\) |
| Predictive: marginalize posterior | Predictive: marginalize posterior |
Module 5: Learn plausible parameters from data
Module 2: Learn plausible functions from data
Functions are infinite-dimensional parameters! ML extends your Bayesian intuition to function space.
What This Module Teaches You
Overview: What is machine learning? Why does astrophysics need it?
Part 1: Neural Networks
- Flexible function approximators
- Training = Finding maximum a posteriori (MAP) estimate
- Universal approximation and scalable inference
Part 2: Uncertainty Quantification
- Principled approaches to error bars on ML predictions
- Bayesian neural networks, ensembles, and calibration
- Connecting back to Module 5’s posterior reasoning
Throughout: Bayesian thinking guides ML design choices
Projects 1–4: Implement physics simulations
Final project JAX rebuild: Generate validated N-body simulation data
Final Project:
- Use simulations as training data
- Learn \(f:\) (initial conditions) \(\to\) (cluster evolution)
- Predict outcomes ~100–1000\(\times\) faster than physics simulation (depending on accuracy requirements)
This is physics-informed machine learning: Bayesian inference trained on physics.
Part 1: What Is Machine Learning?
The Traditional Scientific Method
You’ve spent this entire semester doing physics-based computational astrophysics:
- Write down equations: \(\frac{d\mathbf{v}_i}{dt} = \sum_{j \neq i} \frac{Gm_j(\mathbf{r}_j - \mathbf{r}_i)}{|\mathbf{r}_j - \mathbf{r}_i|^3}\)
- Solve numerically: Runge-Kutta, leapfrog, adaptive timesteps
- Analyze results: Measure observables, compare to theory
- Iterate: Refine physics, improve numerics, explore parameter space
This is model-driven science: we start with physical laws and derive predictions.
Strengths:
- Interpretable (we understand every step)
- Generalizable (physics is universal)
- Predictive (can extrapolate beyond training regime)
- Satisfying (we understand why things happen)
Limitations:
- Computationally expensive (Project 2: minutes per simulation)
- Doesn’t scale to complex systems (turbulence, galaxy formation)
- Requires known physics (what about dark matter?)
- Parameter space exploration is prohibitive (need 10,000+ simulations)
The Machine Learning Paradigm
Machine learning inverts the traditional approach:
Instead of: Physics equations \(\to\) Simulation \(\to\) Predictions
We do: Data \(\to\) Learning algorithm \(\to\) Predictions
Traditional science asks: “Given these physical laws, what will happen?”
Machine learning asks: “Given these observations, what patterns exist?”
This isn’t replacing physics — it’s complementing it. We use ML when:
- Physics is too expensive to simulate directly
- Physics is unknown or incomplete
- Data is abundant but complex
- We need fast predictions for exploration/optimization
A Concrete Example: Your Final Project
The Problem: Predict how a star cluster evolves from \(t=0\) to \(t=200\) Myr.
Physics-based approach (Project 2 plus the final project’s JAX rebuild):
Initial conditions → N-body equations → Integrate 200 Myr → r_core(t)
(Computationally expensive!)
Machine learning approach (Final Project):
Many N-body simulations → Learn patterns → Instant predictions
(Upfront cost, then fast!)
The key insight: We’re not replacing physics — we’re using N-body simulations to train a fast surrogate model.
This is physics-informed machine learning: combining physical knowledge with data-driven learning.
Part 2: The Learning Problem — Generalization from Data
What Does It Mean to “Learn”?
Suppose we have training data: \[ \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} \]
where:
- \(\mathbf{x}_i\) are inputs (e.g., initial cluster conditions)
- \(y_i\) are outputs (e.g., core radius at \(t=100\) Myr)
Goal: Find a function \(f\) such that for new, unseen inputs \(\mathbf{x}_*\): \[ f(\mathbf{x}_*) \approx y_* \]
This is the learning problem: generalize from observed data to unobserved cases.
Remember basis function expansions? \[ f(x) = \sum_{i=1}^{M} w_i \phi_i(x) \]
Machine learning is function approximation where:
- The basis functions \(\phi_i\) are learned from data (not chosen a priori)
- We care about generalization to new data (not just fitting training data)
- We quantify uncertainty in predictions
This connects directly to your work on Fourier series, Legendre polynomials, etc. in Module 1!
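As a reminder of how a fixed, a-priori basis works, here is a short sketch using Legendre polynomials (the basis size, target function, and noise level are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of an illustrative target function
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

# Basis chosen a priori: Legendre polynomials P_0 ... P_5
M = 6
Phi = np.polynomial.legendre.legvander(x, M - 1)  # design matrix, shape (100, 6)

# Weights w_i in f(x) = sum_i w_i * phi_i(x), found by least squares
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
f_hat = Phi @ w

rmse = np.sqrt(np.mean((f_hat - np.sin(np.pi * x)) ** 2))
print(f"RMSE against the noise-free target: {rmse:.3f}")
```

Machine learning replaces the fixed \(\phi_i\) with basis functions learned from the data itself.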
The Bayesian Perspective: Learning = Inference
Module 5: Bayesian inference for parameters \(\boldsymbol{\theta}\) \[ p(\boldsymbol{\theta} \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\mathcal{D})} \]
Machine learning: Bayesian inference for functions \(f\) \[ p(f \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, f) p(f)}{p(\mathcal{D})} \]
Same structure, infinite-dimensional space!
- Prior \(p(f)\): Encodes beliefs about plausible functions (smoothness, scales, etc.)
- Likelihood \(p(\mathcal{D} \,|\, f)\): How well does \(f\) explain the data?
- Posterior \(p(f \,|\, \mathcal{D})\): Updated beliefs after seeing data — this is what learning computes
Neural networks perform approximate inference — MAP estimation finds a single “best” function, and scales to millions of data points. Uncertainty quantification techniques (Part 2) then recover principled error bars on these predictions.
The Fundamental Challenge: Bias-Variance Tradeoff
Consider fitting a polynomial to data:
Underfitting (high bias): \[ f(x) = w_0 + w_1 x \quad \text{(linear model)} \]
- Too simple to capture true pattern
- Poor performance on both training and test data
- Bias: model assumptions are wrong
Overfitting (high variance): \[ f(x) = \sum_{i=0}^{100} w_i x^i \quad \text{(100th degree polynomial)} \]
- Fits training data perfectly (even noise!)
- Poor performance on test data
- Variance: model is too sensitive to training data
Just right: \[ f(x) = \sum_{i=0}^{5} w_i x^i \quad \text{(modest degree polynomial)} \]
- Captures true pattern without memorizing noise
- Good performance on both training and test data
| Model complexity | Training behavior | Test behavior | Diagnosis |
|---|---|---|---|
| Too simple | Cannot match the pattern | Poor generalization | High bias |
| Well matched | Captures signal without chasing noise | Best generalization | Balanced bias and variance |
| Too flexible | Can memorize the training set | Poor generalization | High variance |
The mathematical formulation:
Expected prediction error decomposes as: \[ \mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}[\hat{f}(x)]^2}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to data}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]
Key insight: There’s an optimal model complexity that balances bias and variance!
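A quick numerical illustration of the tradeoff (the target function, sample sizes, and noise level are synthetic choices for illustration): fitting polynomials of increasing degree and comparing training versus test error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data; target function and noise level chosen for illustration
def f_true(x):
    return np.sin(np.pi * x)

x_train = rng.uniform(-1.0, 1.0, 30)
y_train = f_true(x_train) + rng.normal(0.0, 0.2, size=30)
x_test = rng.uniform(-1.0, 1.0, 200)
y_test = f_true(x_test) + rng.normal(0.0, 0.2, size=200)

def fit_and_score(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda xs, ys: np.mean((np.polyval(coeffs, xs) - ys) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

for degree in [1, 5, 15]:
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error only goes down with degree; test error is what reveals underfitting and overfitting.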
Module 5 connection: The bias-variance tradeoff relates to Bayesian model comparison via evidence \(p(\mathcal{D} \,|\, \mathcal{M})\). Simple models may underfit (high bias), complex models may overfit (high variance). The optimal model balances both.
Full Bayesian inference marginalizes over functions rather than selecting a single “best” function — this is the foundation for the uncertainty quantification techniques we explore in Part 2.
Training, Validation, and Test Sets
The gold standard: Split data into three sets:
- Training set (60–80%): Fit model parameters
- Validation set (10–20%): Tune hyperparameters (model complexity, regularization)
- Test set (10–20%): Final evaluation (never used during development!)
Why three sets?
- Training: Learns patterns
- Validation: Prevents overfitting to training data
- Test: Unbiased estimate of generalization performance
Never use the test set for:
- Choosing model architecture
- Tuning hyperparameters
- Deciding when to stop training
- Any decision whatsoever!
The test set is sacred — touch it only once at the very end.
If you tune your model based on test performance, you’re effectively training on the test set (indirectly). This gives you an overly optimistic estimate of how well your model will generalize.
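A minimal splitting helper along these lines (the fractions and toy data are illustrative; shuffling first avoids ordering artifacts in the data):

```python
import numpy as np

def train_val_test_split(X, y, fracs=(0.7, 0.15, 0.15), seed=0):
    """Shuffle, then split into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(fracs[0] * len(X))
    n_val = int(fracs[1] * len(X))
    i_train = idx[:n_train]
    i_val = idx[n_train:n_train + n_val]
    i_test = idx[n_train + n_val:]
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])

# Toy data: 100 samples, 3 features
X = np.arange(300.0).reshape(100, 3)
y = np.arange(100.0)
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # 70 15 15
```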
Cross-Validation: Making the Most of Limited Data
When data is scarce (e.g., 200 expensive N-body simulations), splitting into train/val/test is wasteful.
K-fold cross-validation:
- Divide data into \(K\) folds (typically \(K=5\) or \(K=10\))
- For each fold \(k\):
- Train on all folds except \(k\)
- Validate on fold \(k\)
- Average performance across all \(K\) folds
This uses all data for both training and validation (at different times).
    import jax.numpy as jnp

    def k_fold_cross_validation(X, y, model_fn, k=5):
        """K-fold cross-validation.

        Assumes the data has already been shuffled.

        Args:
            X: Input data (N, d)
            y: Outputs (N,)
            model_fn: Function that trains and evaluates a model,
                called as model_fn(X_train, y_train, X_val, y_val)
            k: Number of folds

        Returns:
            scores: Validation scores for each fold
        """
        N = len(X)
        fold_size = N // k
        scores = []
        for i in range(k):
            # Split into train and validation
            val_start = i * fold_size
            val_end = (i + 1) * fold_size
            # Validation fold
            X_val = X[val_start:val_end]
            y_val = y[val_start:val_end]
            # Training folds (everything else)
            X_train = jnp.concatenate([X[:val_start], X[val_end:]])
            y_train = jnp.concatenate([y[:val_start], y[val_end:]])
            # Train and evaluate
            score = model_fn(X_train, y_train, X_val, y_val)
            scores.append(score)
        return jnp.array(scores)

    # Usage (model_fn and the data X, y must be supplied)
    scores = k_fold_cross_validation(X, y, model_fn, k=5)
    mean_score = jnp.mean(scores)
    std_score = jnp.std(scores)
    print(f"CV Score: {mean_score:.3f} ± {std_score:.3f}")

Part 3: Types of Machine Learning
Supervised Learning: Learning from Labeled Examples
Setup: We have input-output pairs \(\{(\mathbf{x}_i, y_i)\}\)
Goal: Learn mapping \(f: \mathbf{x} \to y\)
Two subtypes:
- Regression: Predict continuous values
- Example: Initial conditions \(\to\) core radius
- Loss: Mean Squared Error (MSE)
- Classification: Predict discrete categories
- Example: Galaxy image \(\to\) [spiral, elliptical, irregular]
- Loss: Cross-entropy
Astrophysical applications:
- Photometric redshifts (colors \(\to\) distance)
- Supernova classification (light curve \(\to\) type Ia vs core-collapse)
- Exoplanet detection (light curve \(\to\) planet or not)
- Your project: Initial conditions \(\to\) cluster evolution
Supervised learning is maximum likelihood estimation!
If we assume Gaussian noise: \(y_i = f(\mathbf{x}_i) + \epsilon_i\), \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\)
Then the likelihood is: \[ p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(y_i \,|\, f_{\boldsymbol{\theta}}(\mathbf{x}_i), \sigma^2) \]
Maximizing likelihood = minimizing negative log-likelihood: \[ -\log p(\mathbf{y} \,|\, \mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - f_{\boldsymbol{\theta}}(\mathbf{x}_i))^2 + \text{const} \]
This is Mean Squared Error when \(\sigma^2\) is treated as a known constant!
Important caveat: This equivalence holds only if the noise variance \(\sigma^2\) is pre-specified or known. In Bayesian inference, if \(\sigma^2\) is unknown, it becomes a hyperparameter that must be either:
- Pre-specified from prior knowledge
- Estimated from data (complicating the likelihood)
- Marginalized over in a full Bayesian treatment
In practice, standard neural networks do not learn \(\sigma^2\) — they implicitly assume a fixed, predetermined noise level. If you want to learn \(\sigma^2\) from data, you’d need to either:
- Add it as an output parameter (heteroskedastic regression)
- Use a Bayesian approach that marginalizes over it
So training neural networks is maximum likelihood estimation under the assumption of known noise variance, connecting back to Bayesian inference from Module 5.
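A quick numerical check of this equivalence (a one-parameter model with known \(\sigma\); all values are illustrative): the MSE and the Gaussian negative log-likelihood pick out the same parameter.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data y = 2x + Gaussian noise with known sigma (values chosen for illustration)
sigma = 0.5
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, sigma, size=50)

def mse(theta):
    return np.mean((y - theta * x) ** 2)

def neg_log_likelihood(theta):
    """Gaussian negative log-likelihood with sigma fixed and known."""
    resid = y - theta * x
    return (np.sum(resid ** 2) / (2.0 * sigma ** 2)
            + len(y) * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Scan theta: the two objectives are minimized at the same parameter value,
# since the NLL is just MSE rescaled and shifted by constants
thetas = np.linspace(0.0, 4.0, 401)
theta_mse = thetas[np.argmin([mse(t) for t in thetas])]
theta_nll = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]
print(theta_mse, theta_nll)
```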
Unsupervised Learning: Finding Structure Without Labels
Setup: We only have inputs \(\{\mathbf{x}_i\}\) (no outputs!)
Goal: Discover hidden structure, patterns, or groupings
Common tasks:
- Clustering: Group similar objects
- K-means, hierarchical clustering
- Example: Group galaxies by properties (without pre-defined labels)
- Dimensionality reduction: Find low-dimensional representation
- PCA, t-SNE, autoencoders
- Example: Compress 1000-dimensional galaxy spectra to 10 principal components
- Density estimation: Learn probability distribution
- Example: What’s the distribution of stellar masses in clusters?
Astrophysical applications:
- Discovering new classes of objects (quasars, gamma-ray bursts)
- Anomaly detection (unusual supernovae, transients)
- Data compression (large survey data)
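For instance, PCA via the SVD can compress toy "spectra" that secretly live on a low-dimensional subspace (the dimensions and noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "spectra": 200 samples in 50 dimensions that really live
# on a 3-dimensional subspace plus small noise (illustrative)
latent = rng.normal(size=(200, 3))
basis = rng.normal(size=(3, 50))
spectra = latent @ basis + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD of the mean-centered data
centered = spectra - spectra.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained = S**2 / np.sum(S**2)

print(np.sum(explained[:3]))  # ~1.0: three components capture nearly everything

# Compress: project onto the first 3 principal components
compressed = centered @ Vt[:3].T  # shape (200, 3)
```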
Supervised: “Here are 1000 galaxies labeled as spiral or elliptical. Learn to classify new galaxies.”
Unsupervised: “Here are 1000 unlabeled galaxies. Do they naturally group into categories?”
Unsupervised learning can discover categories we didn’t know existed!
Reinforcement Learning: Learning from Interaction
Setup: Agent interacts with environment, receives rewards
Goal: Learn policy (strategy) that maximizes cumulative reward
Components:
- State \(s\): Current situation
- Action \(a\): What agent can do
- Reward \(r\): Feedback signal
- Policy \(\pi(a|s)\): Strategy for choosing actions
Astrophysical applications (less common but emerging):
- Optimizing telescope scheduling
- Adaptive optics control
- Gravitational wave detector tuning
Not the focus of this course (supervised learning is most relevant for emulation).
Part 4: Why Machine Learning for Astrophysics?
The Data Deluge
Astronomy is drowning in data:
| Survey/Instrument | Data Volume | Time to Analyze Manually |
|---|---|---|
| Sloan Digital Sky Survey (SDSS) | 200 million objects | Centuries |
| Large Synoptic Survey Telescope (LSST) | 30 TB/night | Impossible |
| Square Kilometre Array (SKA) | 160 TB/second | Absolutely impossible |
| Gaia | 1 billion stars | Many lifetimes |
Traditional approach: Manually inspect each object, classify, measure properties
Machine learning approach: Train algorithms to do this automatically
The Complexity Challenge
Some astrophysical systems are too complex for analytic solutions:
Simple (we have equations):
- Two-body problem: Solved exactly (Kepler orbits)
- Linear perturbations: Analytic solutions exist
- Spherical symmetry: Reduces to 1D ODEs
Complex (need simulation):
- N-body problem (\(N > 2\)): No closed-form solution
- Turbulence: Highly nonlinear, chaotic
- Galaxy formation: Multi-scale, multi-physics
Very complex (even simulation is hard):
- Cosmological structure formation: Box size vs resolution tradeoff
- Stellar interiors with magnetic fields: MHD is expensive
- Radiative transfer in 3D: Photon transport is costly
Machine learning solution: Learn simplified models from expensive simulations
The Emulation Use Case: Your Final Project
This is the primary motivation for your final project:
The Problem:
- N-body simulation: 1 minute per run
- Want to explore 5D parameter space: Need ~10,000 runs
- Total time: ~1 week of continuous computing
- Then if we update physics? Start over!
The ML Solution:
- Run 500 simulations (8 hours)
- Train surrogate model (1 hour)
- Make 10,000 predictions (seconds)
- Total: <10 hours instead of 1 week
The benefit: ~100–1000\(\times\) speedup (depending on emulator type and accuracy) enables:
- Parameter space exploration
- Uncertainty quantification
- Optimization (find best-fit parameters)
- Real-time analysis
Physics simulation:
- Exact (within numerical precision)
- Interpretable (we understand every force)
- Generalizes (physics is universal)
- Slow (minutes to hours per run)
ML emulator:
- Fast (milliseconds per prediction)
- Smooth (learns continuous functions)
- Approximate (prediction error ~1–10%)
- Interpolates well (but extrapolates poorly)
- Black box (harder to interpret)
The goal isn’t to replace physics — it’s to accelerate exploration while staying honest about uncertainties.
When NOT to Use Machine Learning
ML is powerful but not always appropriate:
Don’t use ML when:
- Physics is cheap: If simulation takes seconds, no need for emulation
- Data is scarce: <100 training examples \(\to\) physics priors are better
- Interpretability is critical: Need to understand mechanism, not just predict
- Extrapolation is required: ML fails outside training distribution
- Uncertainty quantification is essential: Standard NNs don’t provide this (but see Part 2 for solutions!)
Use ML when:
- Physics is expensive: Emulation saves time
- Data is abundant: Enough examples to learn patterns
- Patterns are complex: Too nonlinear for simple models
- Speed matters: Need real-time or interactive predictions
- You validate carefully: Test generalization thoroughly
Part 5: The Spectrum of Models — From Physics to Data
A Taxonomy of Approaches
| Level | Model style | What carries the scientific burden | Course example |
|---|---|---|---|
| 1 | Pure physics | Equations, numerical methods, and validation | Project 2 N-body simulator |
| 2 | Simulation-trained emulator | Validated simulator plus held-out emulator tests | Final-project neural emulator |
| 3 | Constrained ML | Data fit plus physics-aware constraints | Optional regularization or conservation checks |
| 4 | Pure ML | Data coverage, model capacity, and validation | Standard neural-network regression |
| 5 | End-to-end learning | Representation learning from raw inputs | Galaxy morphology from pixels |
Level 1: Pure Physics (Project 2 and the JAX rebuild)
- Start from first principles (\(F = ma\))
- Solve equations numerically
- No learning from data
- Example: Your N-body code
Level 2: Simulation-Trained Emulation (Final project)
- Use the validated simulator to generate training examples
- Learn a fast map from physical inputs to simulation outputs
- Example: \((Q_0, a) \to\) cluster summary diagnostics
- Best when the emulator is checked against held-out simulations
Level 3: Constrained ML (Physics-Informed Neural Networks)
- Encode physics in loss functions
- Network is flexible but respects constraints
- Example: Energy conservation penalty
Level 4: Pure ML (Standard Neural Networks)
- Learn directly from data
- No explicit physics
- Example: Image recognition, time series prediction
Level 5: End-to-End Learning (Deep Learning on Raw Data)
- Learn representations + predictions jointly
- Example: Galaxy morphology from pixels
Most astrophysical applications sit at Levels 2–3:
- We have strong physical priors (conservation laws, symmetries)
- Data is expensive but available
- We want speed + interpretability
Pure data-driven ML (Levels 4–5) works when:
- Physics is unknown or extremely complex
- Massive datasets available (images, spectra)
- Task is pattern recognition rather than understanding
Example: Photometric Redshifts
The Problem: Estimate galaxy distance from broad-band colors (no spectroscopy)
Level 1 (Pure Physics):
- Model galaxy spectral energy distributions (SEDs)
- Compute expected colors at each redshift
- Find best-fit template
- Pro: Interpretable. Con: Templates may not match real galaxies
Level 2–3 (Physics-Informed ML):
- Train on galaxies with known redshifts (spectroscopy)
- Use physics-motivated features (colors, emission lines)
- Constrain predictions to be positive, ordered
- Pro: More accurate. Con: Needs training data
Level 4 (Pure ML):
- Neural network: colors \(\to\) redshift
- No physics assumptions
- Pro: Very accurate. Con: Black box, hard to diagnose failures
Current best practice: Hybrid approaches (Levels 2–3)
Part 6: The Learning Algorithm Landscape
Classical Machine Learning (Pre-Deep Learning)
Linear Models:
- Linear regression: \(y = \mathbf{w}^T \mathbf{x} + b\)
- Logistic regression: \(p(y=1) = \sigma(\mathbf{w}^T \mathbf{x} + b)\)
- Pro: Fast, interpretable. Con: Limited expressivity
Tree-Based Methods:
- Decision trees: Binary splits on features
- Random forests: Ensemble of trees
- Gradient boosting (XGBoost, LightGBM)
- Pro: Handles nonlinearity, feature importance. Con: Can overfit
Support Vector Machines (SVMs):
- Find maximum-margin hyperplane
- Kernel trick for nonlinearity
- Pro: Good for small data. Con: Doesn’t scale to large datasets
Deep Learning (Modern ML)
Neural Networks (Part 1 of this module):
- Multilayer perceptrons (MLPs)
- Convolutional networks (CNNs) for images
- Specialized networks for images, sequences, and structured data
- Pro: Universal approximators, scalable. Con: Data-hungry, black-box
Specialized Architectures:
- Graph neural networks: Data with graph structure
- Physics-informed neural networks: Encode PDEs
- Pro: Incorporate domain knowledge. Con: More complex to train
The Modern Paradigm:
- Deep learning dominates when data is abundant (millions of examples)
- Classical ML still competitive for small datasets (<10,000 examples)
- Your project: a small neural emulator for simulation summary statistics, with uncertainty quantification techniques layered on top
Part 7: Evaluation and Validation
Metrics for Regression
For continuous predictions \(\hat{y}_i\) vs true values \(y_i\):
Mean Squared Error (MSE): \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]
- Units: squared output units
- Penalizes large errors heavily
Root Mean Squared Error (RMSE): \[ \text{RMSE} = \sqrt{\text{MSE}} \]
- Units: same as output
- Easier to interpret than MSE
Mean Absolute Error (MAE): \[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \]
- More robust to outliers than MSE
\(R^2\) Score (Coefficient of Determination): \[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]
- \(R^2 = 1\): Perfect predictions
- \(R^2 = 0\): No better than predicting mean
- \(R^2 < 0\): Worse than predicting mean!
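These metrics are a few lines of NumPy (the toy values below are for illustration only):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics defined above."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(regression_metrics(y_true, y_pred))

# Predicting the mean gives R^2 = 0 exactly
baseline = np.full_like(y_true, y_true.mean())
print(regression_metrics(y_true, baseline)["R2"])  # 0.0
```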
Classification metrics: Accuracy, precision, recall, F1 score, confusion matrix
Visualizing performance:
- Residual plots: Should show random scatter (no systematic bias)
- Prediction plots: Points should fall on diagonal
- Learning curves: Training and validation loss should converge together (a diverging validation loss signals overfitting)
Part 8: Regularization = Bayesian Priors
In Module 5, you learned that Bayesian inference combines prior beliefs \(p(\boldsymbol{\theta})\) with data via Bayes’ theorem.
Regularization = putting Module 5’s priors into practice! When we add regularization terms to a loss function, we’re encoding prior beliefs about what functions are plausible.
The Bayesian View: Regularization IS Prior Knowledge
In machine learning, we typically don’t compute the full posterior \(p(f \,|\, \mathcal{D})\). Instead, we find the MAP (Maximum A Posteriori) estimate: \[ f_{\text{MAP}} = \arg\max_f p(f \,|\, \mathcal{D}) = \arg\max_f \left[ p(\mathcal{D} \,|\, f) \cdot p(f) \right] \]
Taking the negative log: \[ f_{\text{MAP}} = \arg\min_f \left[ \underbrace{-\log p(\mathcal{D} \,|\, f)}_{\text{Data loss } \mathcal{L}_{\text{data}}} + \underbrace{-\log p(f)}_{\text{Prior penalty}} \right] \]
This is training with regularization!
\[ \boxed{\text{Training loss} = \text{Data loss} + \text{Regularization} = -\log p(\mathcal{D} \,|\, f) - \log p(f)} \]
Different regularization techniques = different Bayesian priors on functions
L2 Regularization = Gaussian Prior
Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \sum_i w_i^2 \]
Bayesian interpretation: This is MAP estimation with a Gaussian prior on weights: \[ p(w_i) = \mathcal{N}(0, \sigma^2) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const} \]
where \(\lambda = 1/\sigma^2\) connects the regularization strength to prior variance.
What this prior says: “I believe weights should be small, with typical magnitude \(\sigma\). Large weights are possible but unlikely.”
Effect: Smooth functions (no wild oscillations)
Also known as: Ridge regression, weight decay
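A sketch of the closed-form ridge/MAP estimate (problem sizes and true weights are invented for illustration): increasing \(\lambda\), i.e. a tighter Gaussian prior, shrinks the weight vector.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative linear regression problem
X = rng.normal(size=(40, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=40)

def ridge_fit(X, y, lam):
    """MAP estimate under a Gaussian prior: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Stronger prior (larger lam = smaller prior variance) shrinks the weights
for lam in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lam)
    print(f"lam={lam:6.1f}  ||w|| = {np.linalg.norm(w):.3f}")
```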
L1 Regularization = Laplace Prior
Regularization term: \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{data}} + \lambda \sum_i |w_i| \]
Bayesian interpretation: This is MAP estimation with a Laplace prior: \[ p(w_i) = \frac{1}{2b} \exp\left(-\frac{|w_i|}{b}\right) \quad \Rightarrow \quad -\log p(\mathbf{w}) = \frac{1}{b} \sum_i |w_i| + \text{const} \]
What this prior says: “I believe most weights should be exactly zero. Only a few features matter.”
Effect: Sparse models — drives many weights to exactly zero
Use case: Automatic feature selection
Also known as: Lasso regression
Comparing Priors Visually
| Regularizer | Bayesian interpretation | Geometric intuition | Practical effect |
|---|---|---|---|
| L2 | Gaussian prior on weights | Round contours | Shrinks weights smoothly |
| L1 | Laplace prior on weights | Diamond-shaped contours | Encourages exact zeros |
Key insight: The sharp corners of the L1 diamond cause weights to hit zero exactly!
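One way to see the exact zeros is the soft-thresholding (proximal) operator that coordinate-descent Lasso solvers apply to each weight; a minimal sketch:

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: the per-weight update behind Lasso solvers."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([2.0, -0.3, 0.05, -1.5, 0.2])
print(soft_threshold(w, 0.5))
# Weights smaller than lam are set exactly to zero; larger ones shrink by lam
```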
Early Stopping = Implicit Prior on Solution Trajectory
What it is: Stop training when validation loss stops improving
Bayesian interpretation:
- Training follows a path through weight space
- Early stopping = prior belief that “good solutions appear early in training”
- Equivalent to a complexity prior: simpler solutions (earlier in training) are preferred
Why this makes sense: Neural network training typically finds simpler patterns first, then overfits to noise later.
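A sketch of patience-based early stopping (the learning rate, patience, and the synthetic overparameterized problem are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative overparameterized fit: degree-12 polynomial, few points
x = rng.uniform(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=40)
Phi = np.vander(x, 13, increasing=True)
Phi_tr, y_tr = Phi[:25], y[:25]
Phi_val, y_val = Phi[25:], y[25:]

def train_with_early_stopping(lr=0.01, max_steps=20000, patience=200, min_delta=1e-6):
    """Gradient descent on training MSE; stop when validation MSE plateaus."""
    w = np.zeros(13)
    best_val, best_w, since_best = np.inf, w.copy(), 0
    for step in range(max_steps):
        grad = 2.0 * Phi_tr.T @ (Phi_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val_loss = np.mean((Phi_val @ w - y_val) ** 2)
        if val_loss < best_val - min_delta:
            best_val, best_w, since_best = val_loss, w.copy(), 0
        else:
            since_best += 1
        if since_best >= patience:  # validation stopped improving
            break
    return best_w, best_val, step

w_best, val_best, stop_step = train_with_early_stopping()
print(f"stopped at step {stop_step}, best validation MSE {val_best:.3f}")
```

Returning the weights from the best validation step, not the last step, is what encodes the "simpler solutions appear earlier" prior.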
Dropout = Approximate Bayesian Model Averaging
What it is: During training, randomly drop neurons with probability \(p\)
Bayesian interpretation (Gal & Ghahramani, 2016): \[ \text{Dropout} \approx \text{Sampling from approximate posterior over networks} \]
At test time with dropout:
- Each forward pass = sample from an approximate posterior \(\hat{p}(f \,|\, \mathcal{D})\)
- Average predictions \(\approx\) approximate marginal prediction
- Variance of predictions \(\approx\) approximate epistemic uncertainty estimate
Important caveats: This connection is approximate, and the approximation quality depends on:
- Coverage underfitting: Dropout uncertainty often underestimates true epistemic uncertainty (empirical coverage < nominal — e.g., 68% coverage instead of predicted 95%)
- Missing structure: The approximate posterior doesn’t capture all correlations present in the true Bayesian posterior
- Hyperparameter sensitivity: Dropout rate \(p\) must be tuned carefully; no principled way to choose it from Bayes’ theorem
- No marginal likelihood: Dropout networks can’t compute marginal likelihood for model comparison
When to use for uncertainty:
- Quick uncertainty estimates for exploratory analysis
- Use with caution for scientific publications — verify calibration against held-out data
- Avoid for active learning where you need exact uncertainty for adaptive sampling
    # Dropout for uncertainty quantification
    import jax.numpy as jnp

    def predict_with_uncertainty(model, x, n_samples=100):
        """Use dropout at test time for uncertainty estimates."""
        predictions = []
        for _ in range(n_samples):
            # Keep dropout active at test time
            y_pred = model(x, training=True)
            predictions.append(y_pred)
        predictions = jnp.array(predictions)
        mean = jnp.mean(predictions, axis=0)
        std = jnp.std(predictions, axis=0)  # Epistemic uncertainty
        return mean, std

The Fundamental Connection
Three equivalent perspectives:
- Practical ML: Regularization prevents overfitting
- Add penalty terms to loss function
- Choose \(\lambda\) via cross-validation
- Bayesian inference (Module 5): Priors encode assumptions
- \(p(f)\) represents beliefs before seeing data
- MAP estimation finds most probable function
- Philosophy: Occam’s razor
- Prefer simpler explanations
- Complexity requires justification (better fit to data)
They’re all the same thing!
Regularization techniques:
- L2: Gaussian prior \(\to\) smooth functions
- L1: Laplace prior \(\to\) sparse functions (feature selection)
- Early stopping: Implicit complexity prior \(\to\) prevent overtraining
- Dropout: Approximate posterior sampling \(\to\) uncertainty quantification
When you add regularization, you’re imposing Bayesian priors. When you tune \(\lambda\), you’re choosing prior strength.
Part 9: The Philosophical Shift — From Understanding to Prediction
The Traditional Physics Mindset
As a physicist, you’ve been trained to ask:
- Why does this happen?
- What are the fundamental laws?
- Can we derive this from first principles?
This is the explanatory paradigm: science as understanding mechanism.
Example: Newton didn’t just predict planetary motion — he explained it via gravity.
The Machine Learning Mindset
Machine learning often asks:
- What will happen?
- How accurately can we predict?
- What patterns exist in the data?
This is the predictive paradigm: science as pattern recognition.
Example: A neural network might predict galaxy morphology accurately without understanding how galaxies form.
Are these paradigms in conflict? No! They’re complementary.
Physics gives us:
- Fundamental understanding
- Generalization beyond training data
- Confidence in extrapolation
Machine learning gives us:
- Speed (predictions in milliseconds)
- Ability to handle complexity
- Discovery of unexpected patterns
Best approach: Combine both!
- Use physics to constrain ML models
- Use ML to accelerate physics simulations
- Use ML to discover new physics (anomaly detection)
Interpretability vs Accuracy
The tradeoff: Linear models are interpretable; deep neural networks are powerful but opaque.
For astrophysics: Choose based on goal — interpretability for theory testing, prediction for observations.
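One model-agnostic way to probe an otherwise opaque predictor is permutation importance: shuffle one input feature at a time and measure how much the error rises. A minimal sketch (toy data and predictor, purely illustrative):

```python
# Sketch: permutation importance for an opaque model (toy example)
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Rise in MSE when each feature's values are shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's information
        scores.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(scores)

# Toy check: the target depends only on feature 0
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
imp = permutation_importance(lambda X: 2.0 * X[:, 0], X, y)
# imp[0] is large; imp[1] and imp[2] are ~0 (the model ignores them)
```

The same diagnostic applies unchanged to a trained neural network: pass its prediction function as `predict`.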
Scientific Integrity
Key principles: Hold out test set (use only once!), report all experiments (including failures), validate predictions physically, practice open science.
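The “use only once” rule is easiest to honor if the split is made up front and the test indices are never inspected until the end. A minimal sketch of a three-way split (sizes and seed are arbitrary):

```python
# Sketch: a three-way split so the test set is touched exactly once
import numpy as np

def train_val_test_split(n, frac_val=0.15, frac_test=0.15, seed=0):
    """Return disjoint index arrays for train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test_idx = idx[:n_test]                 # use ONCE, for the final reported metric
    val_idx = idx[n_test:n_test + n_val]    # for model selection / early stopping
    train_idx = idx[n_test + n_val:]        # for fitting
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(500)
assert len(set(train_idx) | set(val_idx) | set(test_idx)) == 500  # disjoint, complete
```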
This course emphasizes understanding before using:
Neural Networks (NNs): Build from scratch
- Implement MLPs, backprop, training loops
- Understand gradient flow, loss landscapes
Uncertainty Quantification: Build from principles
- Implement ensemble methods, calibration checks
- Understand when and why uncertainty estimates can fail
Why this matters:
- You’ll know when models fail (and why)
- You’ll choose architectures wisely
- You’ll debug effectively
- You’ll use libraries responsibly
Analogy: You wouldn’t use an N-body code without understanding Newton’s laws. Same principle applies to ML!
Part 10: Looking Ahead — The Module Structure
Part 1: Neural Networks
What you’ll learn:
- Universal approximation theorem
- Backpropagation and gradient descent
- Small multilayer perceptrons as scientific emulators
- Normalization, train/validation/test splits, and baseline comparisons
- Held-out evaluation before inference
What you’ll build:
- Feedforward networks (MLPs)
- A baseline MLP emulator
- A simple comparison model
- A training pipeline with validation and saved normalization statistics
What you’ll discover:
- Trade-offs: expressivity vs interpretability
- When different architectures are appropriate
Part 2: Uncertainty Quantification
What you’ll learn:
- Bayesian neural networks and approximate inference
- Ensemble methods for predictive uncertainty
- Calibration: are your error bars trustworthy?
- Connections back to Module 5’s posterior reasoning
What you’ll build:
- Ensemble-based uncertainty estimates
- Calibration diagnostics
- Prediction intervals with coverage guarantees
What you’ll discover:
- When uncertainty matters most (and when shortcuts suffice)
- How to combine fast neural network predictions with principled error bars
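As a preview of the ensemble idea you will build in Part 2, here is a toy sketch that substitutes cheap polynomial fits for neural networks: each member is fit on a bootstrap resample, and the spread across members serves as an epistemic uncertainty estimate.

```python
# Sketch: ensemble uncertainty via bootstrap-resampled fits (toy 1-D example)
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)

def fit_member(seed):
    """Fit one ensemble member on a bootstrap resample (cubic polynomial)."""
    r = np.random.default_rng(seed)
    i = r.integers(0, len(x), len(x))        # bootstrap indices
    return np.polyfit(x[i], y[i], deg=3)

coeffs = [fit_member(s) for s in range(20)]
x_query = np.linspace(0, 1, 100)
preds = np.array([np.polyval(c, x_query) for c in coeffs])

mean = preds.mean(axis=0)   # ensemble prediction
std = preds.std(axis=0)     # member disagreement = epistemic uncertainty estimate
```

With neural networks the recipe is identical: train several networks from different initializations (and optionally resampled data), then report the mean and spread of their predictions.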
The Final Project Arc
First pass: Build a trustworthy baseline
- Build: small MLP emulator for chosen summary statistics
- Train: learn from simulation outputs generated by your JAX-native simulator
- Validate: held-out metrics, predicted-vs-true plots, and comparison to a simple baseline
Second pass: Add uncertainty and inference
- Add: Ensemble methods, calibration checks
- Evaluate: Are predictions trustworthy? Where do they fail?
- Compare: Different architectures and UQ strategies
Deliverable: Comprehensive analysis answering:
- Does the emulator beat a simple baseline? (mean, linear model, or nearest-neighbor/interpolation-style baseline)
- How trustworthy are the predictions? (Calibrated uncertainty estimates)
- Where does the emulator fail? (edges of parameter space, sparse training regions, or poorly chosen summary statistics)
Conceptual Checkpoints
Before moving to Parts 1 and 2, reflect on these questions:
Generalization: You have 500 N-body simulations. How do you know your ML model will work on the 501st simulation you haven’t run yet?
Bias-Variance: A simple linear model has high bias but low variance. A 100-layer neural network has low bias but high variance. For 200 training samples, which would you choose? Why?
Cross-validation: Why do we need a separate validation set? Can’t we just look at training error?
Physics vs ML: Your physics-based N-body code gives exact answers (within numerical precision). Your ML emulator gives approximate answers (\(\sim 5\%\) error). Why would you ever use the emulator?
Interpretability: A neural network predicts accurately but doesn’t tell you why a particular input feature matters. How would you investigate feature importance? What tools from Module 5 (Bayesian inference) could help?
Overfitting: You train a model that achieves 99.9% accuracy on training data but only 60% accuracy on test data. What went wrong? How would you fix it?
Regularization: Explain why adding a penalty \(\lambda \sum_i w_i^2\) to the loss function is equivalent to putting a Gaussian prior on weights in Bayesian inference (Module 5 connection!).
Connection to Module 1: How is machine learning related to function approximation with basis functions? What makes ML different from choosing Fourier or Legendre basis functions?
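Several of these checkpoints (bias-variance, overfitting, validation) can be felt directly in a small experiment. A toy sketch, assuming nothing beyond NumPy, that fits polynomials of increasing degree to noisy data and compares train versus test error:

```python
# Sketch: underfitting vs overfitting with polynomial degree (toy data)
import numpy as np

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 1, 15))
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=15)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=200)

def mse(deg):
    """Train/test mean squared error for a degree-`deg` polynomial fit."""
    c = np.polyfit(x_train, y_train, deg)
    tr = np.mean((np.polyval(c, x_train) - y_train) ** 2)
    te = np.mean((np.polyval(c, x_test) - y_test) ** 2)
    return tr, te

results = {deg: mse(deg) for deg in (1, 3, 12)}
for deg, (tr, te) in results.items():
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
# Training error falls monotonically with degree, but the degree-12 fit
# memorizes the 15 noisy points: its test error dwarfs its train error.
```

Training error alone would have picked the degree-12 model; only held-out error reveals the overfit.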
Further Reading
Foundational Machine Learning
- James, Witten, Hastie, Tibshirani (2013): An Introduction to Statistical Learning
- Excellent introduction, light on math
- Free PDF: https://www.statlearning.com/
- Bishop (2006): Pattern Recognition and Machine Learning
- More mathematical, comprehensive
- Covers Bayesian perspective
- Murphy (2012): Machine Learning: A Probabilistic Perspective
- Connects ML to probabilistic inference
- Great for physics backgrounds
ML for Physical Sciences
- Mehta et al. (2019): “A high-bias, low-variance introduction to Machine Learning for physicists”
- Physics Reports review article
- Excellent overview for scientists
- https://arxiv.org/abs/1803.08823
- Cranmer et al. (2020): “The frontier of simulation-based inference”
- Connects ML to statistical inference
- Relevant for emulation
- Carleo et al. (2019): “Machine learning and the physical sciences”
- Reviews of Modern Physics
- Comprehensive survey
Astrophysics Applications
- Baron (2019): “Machine Learning in Astronomy: A Practical Overview”
- Survey of ML applications in astronomy
- Ntampaka et al. (2019): “The Role of Machine Learning in Astrophysics”
- Philosophical and practical perspectives
- Fluke & Jacobs (2020): “Surveying the reach and maturity of machine learning and artificial intelligence in astronomy”
- State of the field
Deep Learning
- Goodfellow, Bengio, Courville (2016): Deep Learning
- Comprehensive textbook
- Free online: https://www.deeplearningbook.org/
What’s Next
You now understand:
- What machine learning is (and isn’t)
- The fundamental learning problem (generalization)
- Types of ML (supervised, unsupervised, reinforcement)
- Why ML matters for astrophysics (speed, complexity, data)
- How to evaluate models (metrics, validation, regularization)
- The philosophical shift (explanation vs prediction)
In Part 1, you’ll learn:
- Neural Networks: Universal function approximators
- How to build and train a small MLP emulator
- Training, validation, and architecture design
In Part 2, you’ll learn:
- Uncertainty Quantification: Principled error bars on ML predictions
- Bayesian neural networks, ensembles, and calibration
- When and how to trust your model’s confidence
Together, these parts will give you:
- Deep understanding of modern ML methods
- Ability to choose appropriate tools for astrophysical problems
- Skills to build and validate emulators for expensive simulations
- Critical perspective on the role of ML in science
At the beginning of this course, you had:
- Theory: Equations of motion, statistical mechanics, radiative transfer
- Numerics: Integrators, Monte Carlo methods, discretization schemes
Now you’re adding:
- Statistics: Bayesian inference (Module 5)
- Machine Learning: Neural networks and uncertainty quantification (this module)
These aren’t separate domains — they’re deeply connected:
\[ \text{Physics} \xrightarrow{\text{simulations}} \text{Data} \xrightarrow{\text{ML}} \text{Fast predictions} \xrightarrow{\text{inference}} \text{Scientific insights} \]
You’re not just a programmer or a physicist or a statistician.
You’re a computational scientist who uses all these tools synergistically to understand the universe.
Welcome to the 21st century of astrophysics.