Final Project: Expectations & Sample Repo Contract

COMP 536 | Final Project

Author

Dr. Anna Rosen

Published

April 22, 2026

How To Use This Page

This page complements the Final Project Assignment, the Technical Guide, the Final Project Launch Worksheet, and the Official Project Rubric & Grading Scheme.

It does two things:

  1. it makes the final-project expectations more concrete without turning them into a fake point checklist,
  2. it gives you a sample repo contract for a reproducible submission.

Use it as a public standard for what a strong final-project submission looks like. It is not a hidden second assignment.

Note: What this page does not add

This page does not introduce extra milestones, proposal requirements, progress-report checkpoints, or presentation baggage. The live public contract is still the one defined in the syllabus and the assignment page.

What A Strong Final Project Demonstrates

The final project is judged holistically, just like the other projects, but the evidence has a different shape because the workflow is longer and more open-ended.

A strong final project makes it easy for a reader to see four things:

  1. your JAX-native Leapfrog simulator, rebuilt from Project 2’s validated model, is producing meaningful data,
  2. your emulator is accurate enough to use as a scientific instrument,
  3. your inference step is built on evidence rather than hope,
  4. your repository is organized so the whole workflow can be rerun and inspected.

That means strong work is not only about model complexity. A smaller, well-validated project is more convincing than a larger one with weak evidence.

What Usually Counts As Baseline Final-Project Evidence

At minimum, a credible submission should usually include:

  • a reproducible code path from raw simulation inputs to final figures,
  • at least one clear Leapfrog simulation sanity check,
  • one comparison against a simple non-neural emulator baseline,
  • held-out emulator evaluation,
  • figures that show what the emulator gets right and where it struggles,
  • one inference result whose interpretation is explained in words, not only plotted,
  • a readable README.md that tells a grader how to run the project.

If one of those pieces is missing, the work may still show meaningful progress, but it becomes much harder to trust scientifically.
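To make the "held-out evaluation" and "simple non-neural baseline" bullets concrete, here is an illustrative sketch. Everything in it is a stand-in: the toy arrays replace your simulation dataset, and ordinary least squares is just one possible non-neural baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for simulation data: X holds input parameters, y a summary
# statistic. In your project these come from the generated dataset.
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=200)

# Held-out split: fit on the first 160 rows, evaluate on the last 40 only.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Simple non-neural baseline: least squares with an intercept column.
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
baseline_rmse = np.sqrt(np.mean((A_test @ coef - y_test) ** 2))
print(f"baseline held-out RMSE: {baseline_rmse:.4f}")
```

The emulator should then be scored on the same held-out rows, so the comparison against `baseline_rmse` is apples to apples.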

What Tends To Distinguish Good From Excellent Work

The jump from “working” to “excellent” usually comes from quality of evidence rather than from piling on extra features.

Projects tend to move upward when they:

  • validate the rebuilt JAX-native simulator and the emulator separately instead of blending the checks together,
  • show that the neural emulator improves on at least one simple baseline rather than assuming “neural network” automatically means “better”,
  • explain why the chosen summary statistics or observables are informative,
  • show edge behavior or uncertainty instead of only central-case accuracy,
  • make the inference assumptions explicit, especially the uncertainty model,
  • write the report as interpretation rather than as a chronological diary of coding steps.

Projects tend to lose trust when they:

  • present inference results before the emulator is shown to be reliable,
  • give only training loss and no held-out evaluation,
  • make the repo hard to run or hard to audit,
  • include polished figures with unclear scientific meaning,
  • rely on vague claims like “the model worked well” without quantitative evidence.

Sample Repo Contract

The exact layout of your final-project repo can vary, but it should be obvious where the major responsibilities live and how the pipeline is executed.

A strong sample structure looks like this:

final_project_surrogate/
├── run.py
├── pyproject.toml
├── README.md
├── src/
│   └── final_project_surrogate/
│       ├── __init__.py
│       ├── forces.py
│       ├── integrator.py
│       ├── simulation.py
│       ├── summary_stats.py
│       ├── emulator.py
│       ├── inference.py
│       ├── diagnostics.py
│       └── viz.py
├── scripts/
│   ├── generate_data.py
│   ├── train_emulator.py
│   └── run_inference.py
├── data/
│   ├── raw/
│   └── processed/
├── outputs/
│   ├── figures/
│   ├── models/
│   └── results/
├── tests/
│   ├── test_leapfrog.py
│   ├── test_summary_stats.py
│   ├── test_emulator.py
│   ├── test_inference.py
│   └── test_pipeline.py
├── report.pdf
└── growth_synthesis.pdf

You do not have to copy that literally. The point is that your repo should make the simulation, emulator, inference, outputs, and tests easy to identify.

In this course, the simulator portion should read as a JAX-native rebuild of Project 2’s scientific core, not as a mysterious external dependency and not as a totally unrelated fresh codebase.

Suggested run.py Contract

The final project is open-ended enough that different teams will name commands differently, but a thin command-line entrypoint is still the easiest way to make your work reproducible.

A good default is for these commands to work from the repo root:

python run.py --help
python run.py test
python run.py validate
python run.py make-figures

In addition, your repo should provide a documented way to:

  • run or validate a small Leapfrog simulation,
  • show the Project 2 to JAX rebuild path clearly in the repo organization or README,
  • generate or reload the training dataset,
  • train the emulator,
  • run the inference stage.

Those can live behind additional run.py subcommands or in clearly documented scripts. What matters is that a grader does not need to reverse-engineer your workflow.
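As one possible shape (not a required implementation), a thin `argparse`-based `run.py` dispatcher might look like the sketch below. The `cmd_*` bodies are placeholders; in a real repo they would call into your `src/` package.

```python
# run.py: minimal sketch of a thin command dispatcher.
import argparse

def cmd_test(args):
    # e.g. invoke pytest on tests/ via subprocess.run(["pytest", "tests/"])
    print("running test suite")

def cmd_validate(args):
    # e.g. fast Leapfrog, summary-statistic, and emulator sanity checks
    print("running fast scientific checks")

def cmd_make_figures(args):
    # e.g. regenerate outputs/figures/ non-interactively
    print("regenerating report figures")

def build_parser():
    parser = argparse.ArgumentParser(description="Final project entrypoint")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, fn in [("test", cmd_test),
                     ("validate", cmd_validate),
                     ("make-figures", cmd_make_figures)]:
        sub.add_parser(name).set_defaults(func=fn)
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    args.func(args)
```

Extra stages (data generation, training, inference) can be added as further subparsers in the same loop, which keeps `run.py` a dispatcher rather than a place where scientific logic accumulates.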

Why These Commands Matter

test should tell me whether your core code still behaves as expected.

validate should run the fast scientific checks that catch major mistakes before expensive reruns. For this project, that often includes Leapfrog sanity checks, summary-statistic sanity checks, emulator spot checks, and one small inference or likelihood sanity check.

make-figures should regenerate the main report figures non-interactively.
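A `validate`-style Leapfrog sanity check can be as small as a harmonic-oscillator energy test. The NumPy sketch below is illustrative, not course-provided code; a JAX-native version would follow the same logic.

```python
import numpy as np

def leapfrog_step(x, v, dt, acc):
    """One kick-drift-kick Leapfrog step for acceleration function acc."""
    v = v + 0.5 * dt * acc(x)
    x = x + dt * v
    v = v + 0.5 * dt * acc(x)
    return x, v

def harmonic_energy_drift(dt=0.01, n_steps=1000):
    """Integrate a unit harmonic oscillator; return relative energy drift."""
    acc = lambda x: -x                      # a = -x: known analytic system
    x, v = 1.0, 0.0
    e0 = 0.5 * v**2 + 0.5 * x**2
    for _ in range(n_steps):
        x, v = leapfrog_step(x, v, dt, acc)
    e1 = 0.5 * v**2 + 0.5 * x**2
    return abs(e1 - e0) / e0

drift = harmonic_energy_drift()
# Leapfrog is symplectic, so the energy error should stay bounded at O(dt^2)
# rather than growing; a drift near 1 would signal a broken integrator.
assert drift < 1e-3, f"energy drift too large: {drift:.2e}"
```

A check like this runs in well under a second, which is exactly what makes it suitable for `validate`: it catches a broken integrator before an expensive data-generation rerun.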

What Each Part Of The Repo Should Own

Use clear ownership boundaries. A good division of labor is:

  • run.py: thin command dispatcher only
  • src/.../forces.py: acceleration or force calculations
  • src/.../integrator.py: Leapfrog update step and time-stepping logic
  • src/.../simulation.py: N-body evolution driver and simulation-facing wrappers
  • src/.../summary_stats.py: physically meaningful outputs computed from simulation state
  • src/.../emulator.py: model definition, training, prediction, normalization helpers
  • src/.../inference.py: priors, likelihood, posterior sampling, recovery workflow
  • src/.../diagnostics.py: accuracy metrics, uncertainty summaries, convergence helpers
  • src/.../viz.py: figure generation utilities

The main engineering goal is to avoid hiding scientific logic in notebook cells or in one giant run.py.

Minimum Test Coverage

You do not need a giant test suite, but you do need useful tests.

At minimum, aim to cover:

  • one Leapfrog or force-calculation check on a simple known system,
  • one summary-statistic check on a simple known configuration,
  • one emulator-shape or normalization round-trip test,
  • one inference or likelihood sanity check,
  • one lightweight end-to-end smoke test that proves the major functions compose.

The purpose of tests here is not bureaucratic completeness. It is to protect the scientific claims you make later in the report.
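As one illustration, the normalization round-trip test can be only a few lines. The `normalize`/`denormalize` helpers below are hypothetical stand-ins for whatever actually lives in your `emulator.py`:

```python
import numpy as np

# Hypothetical normalization helpers, as they might appear in emulator.py.
def normalize(x, mean, std):
    return (x - mean) / std

def denormalize(z, mean, std):
    return z * std + mean

def test_normalization_round_trip():
    rng = np.random.default_rng(1)
    x = rng.normal(loc=3.0, scale=2.0, size=(50, 4))
    mean, std = x.mean(axis=0), x.std(axis=0)
    z = normalize(x, mean, std)
    # The original data should be recoverable to numerical precision, and
    # standardized columns should have ~zero mean and ~unit variance.
    assert np.allclose(denormalize(z, mean, std), x)
    assert np.allclose(z.mean(axis=0), 0.0, atol=1e-12)
    assert np.allclose(z.std(axis=0), 1.0, atol=1e-12)

test_normalization_round_trip()  # pytest would also collect this by name
```

A test like this protects a specific scientific claim: that the emulator's inputs and outputs are not silently corrupted by the scaling applied before training.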

README Expectations

Your README.md should answer five questions quickly:

  1. What scientific problem does this repo solve?
  2. How do I install dependencies?
  3. What commands do I run first?
  4. Where do the outputs and figures appear?
  5. What evidence should I look at to decide whether the pipeline worked?

Think of the README as a guide for a tired collaborator who wants to trust your work without guessing how it is organized.

Figures And Report Expectations

Your final report is only 5-7 pages, so each figure has to earn its space.

A strong final-project report usually includes:

  1. one figure showing training-data coverage or dataset design,
  2. one figure showing emulator fit quality,
  3. one figure showing uncertainty, edge behavior, or failure modes,
  4. one figure showing the inference result.

The report should then explain:

  • why those figures are enough to support your main conclusions,
  • what the validation evidence says,
  • what the main limitations still are.

Common Contract Violations

These are the most common ways a final project becomes harder to grade and harder to trust:

  • no clear entrypoint,
  • no evidence that the JAX-native Leapfrog simulator is a validated rebuild of Project 2’s scientific core,
  • no held-out evaluation,
  • output files committed without a way to reproduce them,
  • report claims that are stronger than the validation evidence,
  • simulation, emulator, and inference logic tangled together in one script or notebook,
  • figure captions that describe what is visible but not what it means scientifically.

Final Planning Checklist

Before you submit, ask:

  • Can another person rerun my main workflow without guessing?
  • Did I validate the rebuilt JAX-native Leapfrog simulator before using it to generate data?
  • Did I validate the emulator before using it for inference?
  • Do my figures support clear scientific claims?
  • Does my report explain what I learned, not only what I coded?
  • Is the repo structure easy to audit under deadline pressure?

If the answer to one of those is “not really,” that is probably the best place to spend your last improvement pass.