Membership testing: list (O(n)) vs set (O(1))

| Size    | List (µs) | Set (µs) | Speedup |
|---------|-----------|----------|---------|
| 100     | 0.58      | 0.03     | 21x     |
| 1,000   | 5.67      | 0.03     | 203x    |
| 10,000  | 57.05     | 0.03     | 2,028x  |
| 100,000 | 572.75    | 0.03     | 20,455x |

As n grows, list time grows proportionally (O(n)) while set time stays nearly constant (O(1)). For simulations with many membership checks, this can mean the difference between minutes and hours!
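The table above could come from a small benchmark like this one (a sketch: `membership_time` is a helper defined here, and absolute timings depend on your machine):

```python
import time

def membership_time(container, probe, repeats=200):
    """Average seconds for one membership test of `probe` in `container`."""
    start = time.perf_counter()
    for _ in range(repeats):
        _ = probe in container   # the operation being measured
    return (time.perf_counter() - start) / repeats

for size in (100, 1_000, 10_000):
    data_list = list(range(size))
    data_set = set(data_list)
    t_list = membership_time(data_list, -1)  # -1 is absent: forces a full scan of the list
    t_set = membership_time(data_set, -1)
    print(f"{size:>7,}  list: {t_list*1e6:8.2f} us   set: {t_set*1e6:.3f} us")
```

Probing for a value that is *not* present is the worst case for a list (it must scan every element) and still O(1) for a set.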
---
title: "Chapter 4: Data Structures - Organizing Scientific Data"
subtitle: "Python Fundamentals | COMP 536"
author: "Anna Rosen"
draft: false
format:
  html:
    toc: true
execute:
  echo: true
  warning: true
  error: false
  freeze: auto
---
Learning Objectives
By the end of this chapter, you will be able to:
If any boxes are unchecked, review the indicated chapters first.
Chapter Overview
Imagine you’re simulating the interactions between a million particles - whether they’re dark matter particles in a galaxy, molecules in a gas, or nodes in a network. Each timestep, you need to find which particles are close enough to interact strongly. With the wrong data structure, this neighbor search could take hours per timestep. With the right one - a spatial hash table or tree structure - it takes seconds. That’s the difference between a simulation finishing in a day versus running for months. This chapter teaches you to make these critical choices that determine whether your code scales to research problems or remains stuck with toy models.
This chapter transforms you from someone who stores data to someone who orchestrates it strategically for scientific computing. You’ll discover not just that dictionary lookups are fast, but why they’re fast - through hash functions that turn particle positions into array indices. You’ll understand when they might fail - like when hash collisions cluster your data. And you’ll learn how to verify performance yourself - because in computational science, measurement beats assumption every time.
These concepts directly prepare you for the numerical computing ahead. The memory layout discussions explain why NumPy arrays can be 10-100x more efficient than Python lists for vectorized operations. The immutability concepts prepare you for functional programming paradigms used in modern frameworks like JAX, where immutable operations enable automatic differentiation. The performance profiling skills will help you identify bottlenecks whether you’re solving differential equations or analyzing experimental data. By chapter’s end, you’ll think about data organization like a computational scientist, architecting for performance from the start.
In Chapter 3, you learned how algorithms flow through decisions and loops. In Chapter 4, you’ll learn how algorithms scale—because data structures determine whether your logic runs in seconds or in days.
Most courses teach data structures as a list of methods to memorize. This chapter teaches something more useful: how to choose a structure that makes your algorithm feasible.
If you internalize this chapter, you’ll start thinking like a computational scientist:
- You’ll design around access patterns (what you need to look up, update, or traverse).
- You’ll predict when code will scale to large problems — and when it will explode.
- You’ll avoid whole classes of bugs by using immutability and careful copying.
- You’ll measure performance instead of guessing (because in science, measurement beats assumption).
You do not need to memorize everything here. You do want to revisit this chapter all semester—especially when your code starts getting slow, memory-hungry, or mysteriously wrong.
Treat this chapter as a reference and toolbox, not something to read once and memorize.
A good workflow is:
- Identify your problem’s access pattern (search, lookup, append, membership, caching).
- Choose the simplest structure that matches that pattern.
- Measure performance with small benchmarks when in doubt.
- Revisit this chapter when your code hits performance or correctness walls.
If this material feels “advanced,” that’s normal. It clicks fastest when you apply it to real code.
4.1 What Is a Data Structure?
data structure A way of organizing data in memory to enable efficient access and modification
A data structure is fundamentally about organizing information to match your access patterns. Think about an N-body simulation where particles interact through forces - these particles could represent stars in a star cluster, molecules in a gas, or charges in a plasma. You could store particles in order of creation (like a list), organize them by spatial region for fast neighbor finding (like a dictionary of cells), or track unique particle IDs (like a set). Each choice profoundly affects your simulation’s performance - the difference between O(n²) all-pairs checks and O(n log n) tree-based algorithms.
Big-O notation Notation describing how an algorithm’s runtime scales with input size n (e.g., O(n²), O(n log n))
Quick Preview: Python’s Core Data Structures
| Structure | Mutable | Ordered | Duplicates | Use Case |
|---|---|---|---|---|
| list | Yes | Yes | Yes | Particle arrays, time series |
| tuple | No | Yes | Yes | Constants, coordinates, dict keys |
| dict | Yes | Yes* | Keys: No | Lookup tables, properties |
| set | Yes | No | No | Unique particles, membership |
*Dicts maintain insertion order (Python 3.7+ language guarantee)
Decision Framework: Choosing Your Data Structure
Before diving into each structure, here’s the mental model for choosing:
Need ordered sequence with duplicates?
├── Yes -> Need to modify it?
│ ├── Yes -> list
│ └── No (constants/config) -> tuple
└── No -> Need key-value mapping?
├── Yes -> dict (O(1) lookup by key)
└── No -> Need uniqueness/set math?
├── Yes -> set (O(1) membership)
└── No -> Probably list
Key invariants to remember:
| Structure | Mutable? | Keys/Elements Must Be… | When to Use |
|---|---|---|---|
list |
Yes | Anything | Ordered data, will modify |
tuple |
No | Anything | Constants, dict keys, records |
dict |
Yes | Hashable (immutable) | Fast lookup by key |
set |
Yes | Hashable (immutable) | Uniqueness, membership tests |
An object is hashable if it has a hash value that never changes during its lifetime. In practice:
- Hashable: `int`, `float`, `str`, `tuple` (if contents hashable), `frozenset`
- Not hashable: `list`, `dict`, `set` (because they’re mutable)
This is why you can use tuples as dict keys but not lists!
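A quick demonstration of that rule (the keys and values here are illustrative):

```python
# Tuples (with hashable contents) can be dict keys; lists cannot.
cell = {}
cell[(0, 1, 2)] = ["particle_0001"]      # OK: tuple key (e.g., a grid cell index)
print(cell[(0, 1, 2)])

try:
    cell[[0, 1, 2]] = ["particle_0002"]  # list key -> TypeError, caught immediately
except TypeError as err:
    print(f"Rejected: {err}")
```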
Think of the last piece of code you wrote (in this course or elsewhere). Where did it spend most of its time: searching, looping, looking up values, or building results? That answer usually tells you which data structure matters most.
Ariane 5 (1996) - $370M lost: A 64-bit float was cast to 16-bit int without bounds checking. Integer overflow crashed the rocket 39 seconds after launch.1
Mars Pathfinder (1997) - Near mission loss: Priority inversion in a task queue caused system resets on Mars. JPL debugged it from 50 million miles away.2
Cassini (2006) - Data loss risk: Wrong buffer structure caused data loss during communication transitions. Fixed with a circular buffer (deque-like structure).3
Each disaster traces to a data structure choice. Bounds checking, queue design, buffer management - these “basic” concepts have billion-dollar consequences.
Building Intuition: Measuring Speed Empirically
Ready to discover something that will transform how you think about organizing data in your simulations? We’re going to measure the dramatic performance difference between different data structures. This empirical approach - measure first, understand why, then optimize - is exactly how computational scientists approach performance optimization.
This performance difference isn’t just academic - it determines whether your galaxy simulation can evolve for billions of years or gets stuck after a few million. The secret behind this magic is the hash table, which you’re about to understand completely.
Understanding Big-O Notation
Now that you’ve witnessed this dramatic performance difference empirically, let’s understand the mathematical pattern behind it. Big-O notation describes how an algorithm’s runtime scales with input size - it’s the language computational scientists use to discuss whether an algorithm is feasible for large-scale simulations.
Big-O describes how runtime grows with input size—not the exact speed—so constants, memory layout, and access patterns still matter.
Complexity Scaling for Physics Simulations:
| Complexity | 1,000 particles | 1,000,000 particles | Example |
|---|---|---|---|
| O(1) | 1 op | 1 op | Hash table lookup |
| O(log n) | 10 ops | 20 ops | Tree traversal |
| O(n) | 1,000 ops | 1,000,000 ops | Direct summation |
| O(n log n) | 10,000 ops | 20,000,000 ops | Tree codes (GADGET) |
| O(n²) | 1,000,000 ops | 1,000,000,000,000 ops | All-pairs (impossible!) |
```{python}
#| label: fig-bigo
#| fig-cap: "Algorithm complexity scaling for particle simulations"
import matplotlib.pyplot as plt
import math

# Visualize how different algorithms scale for physics simulations
n = range(1, 101)
constant = [1 for _ in n]
logarithmic = [math.log2(x) if x > 0 else 0 for x in n]
linear = list(n)
nlogn = [x * math.log2(x) if x > 0 else 0 for x in n]
quadratic = [x**2 for x in n]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(n, constant, label='O(1) - Hash table lookup', linewidth=2)
ax.plot(n, logarithmic, label='O(log n) - Binary search', linewidth=2)
ax.plot(n, linear, label='O(n) - Linear scan', linewidth=2)
ax.plot(n, nlogn, label='O(n log n) - Tree codes', linewidth=2, linestyle='--')
ax.plot(n, quadratic, label='O(n^2) - All-pairs', linewidth=2)
ax.set_xlim(0, 100)
ax.set_ylim(0, 500)
ax.set_xlabel('Number of Particles (thousands)', fontsize=12)
ax.set_ylabel('Operations (arbitrary units)', fontsize=12)
ax.set_title('Algorithm Scaling: Why Data Structures Matter', fontsize=14)
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

For a million-particle simulation:

- O(1): 1 operation - spatial hash lookup
- O(log n): 20 operations - tree traversal
- O(n): 1,000,000 operations - direct force sum
- O(n log n): 2e+07 operations - tree codes
- O(n^2): 1,000,000,000,000 ops - all pairs (impossible!)
In astronomy: This enables galaxy simulations with billions of stars
In chemistry: This allows protein folding with millions of atoms
In climate: This permits global models with millions of grid cells
The table above reveals why algorithmic complexity matters for computational physics. The naive O(n²) approach that checks all particle pairs becomes impossible beyond a few thousand particles. But with smart data structures like trees (O(n log n)) or spatial hashing (O(n)), we can simulate entire galaxies, proteins, or climate systems!

This universal pattern appears throughout computational physics: trading memory for speed.

The Pattern:

- Use more memory -> organize data cleverly -> enable faster computation
- Use less memory -> simpler organization -> slower computation

Real Physics Examples:

- Particle mesh methods: Store density on a grid (memory) to avoid O(n²) particle interactions
- Neighbor lists: Store nearby particles (memory) vs O(n²) distance checks at every step
- Tree codes: Store spatial hierarchy (memory) for O(n log n) force calculation
- Lookup tables: Store pre-computed values vs recalculating expensive functions
- FFT methods: Store complex coefficients for O(n log n) vs O(n²) convolution

In astronomy, the GADGET cosmology code uses 200 bytes per particle (vs 24 for just position/velocity) to achieve O(n log n) scaling, enabling billion-particle dark matter simulations. This 8x memory cost enables 1000x speedup!
4.2 Lists: Python’s Workhorse Sequence
Lists are Python’s most versatile data structure - perfect for particle arrays, time series data, or any ordered collection. But here’s something crucial for scientific computing: Python lists don’t store your numbers directly! Understanding this explains why NumPy arrays are so much faster for numerical work.
How Lists Really Work in Memory
Let’s explore how Python actually stores your simulation data in memory. This knowledge will help you understand when to use Python lists versus when you need NumPy arrays.
List container: 88 bytes
Each coordinate: 24 bytes
sys.getsizeof() Measures Shallow Size Only
sys.getsizeof() returns the memory used by the container object itself, not including the objects it contains. For nested structures, use recursive measurement:
```python
# Shallow: just the list container
sys.getsizeof(position)  # ~80 bytes

# Deep: container + all contained objects
total = sys.getsizeof(position) + sum(sys.getsizeof(x) for x in position)
```

For accurate deep memory profiling, use `pympler.asizeof()` or write a recursive helper. This distinction is critical when optimizing memory for large simulations.
Total: 160 bytes for 3 floats!
That's 6.7x more than raw floats would need!
Why so much memory?
- List stores pointers, not values
- Each float is a full object with type info
- Python must track reference counts
This is why NumPy arrays (Chapter 7) store raw values contiguously!
Here’s what’s actually happening in memory when you store particle positions:
Python List of Positions: Objects in Memory:
┌─────────────────┐ ┌──────────────────┐
│ size: 3 │ │ float: 1.5e10 │
│ capacity: 4 │ ┌───>│ type info │
├─────────────────┤ │ │ ref count │
│ pointer ────────┼──────────┘ └──────────────────┘
│ pointer ────────┼──────────────>┌──────────────────┐
│ pointer ────────┼──────┐ │ float: 2.3e10 │
│ (unused) │ │ └──────────────────┘
└─────────────────┘ └────────>┌──────────────────┐
│ float: 0.8e10 │
└──────────────────┘
NumPy Array (preview):
┌────┬────┬────┐
│1.5 │2.3 │0.8 │ <- Raw values, no pointers!
└────┴────┴────┘
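The byte counts reported above can be measured directly with `sys.getsizeof()` (a sketch; exact numbers are CPython- and version-specific, so treat "88" and "24" as typical rather than guaranteed):

```python
import sys

position = [1.5e10, 2.3e10, 0.8e10]  # one particle's coordinates (cm)

container = sys.getsizeof(position)                  # the list object alone
contents = sum(sys.getsizeof(x) for x in position)   # the three float objects

print(f"List container: {container} bytes")
print(f"Each coordinate: {sys.getsizeof(position[0])} bytes")
print(f"Total: {container + contents} bytes for 3 floats")
```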
List Operations: Performance for Simulations
Different list operations have vastly different costs - critical knowledge when processing particle data or building spatial data structures.
Adding particle at END: 29.42 microseconds
Adding particle at START: 58.25 microseconds
Beginning is 2x slower!
Why? Inserting at the beginning shifts ALL particles in memory!
For particle systems, use collections.deque if you need
fast operations at both ends (e.g., boundary conditions).
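Timings like those above could be produced with `time.perf_counter()` (a sketch; absolute numbers and the exact ratio vary by machine and list size):

```python
import time

n = 20_000

buf = []
start = time.perf_counter()
for i in range(n):
    buf.append(i)        # amortized O(1): add at the end
append_time = time.perf_counter() - start

buf = []
start = time.perf_counter()
for i in range(n):
    buf.insert(0, i)     # O(n): shifts every existing element right
insert_time = time.perf_counter() - start

print(f"append at end  : {append_time*1e3:.2f} ms")
print(f"insert at front: {insert_time*1e3:.2f} ms")
```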
If your simulation needs fast operations at both ends (like particles entering/leaving through boundaries), use collections.deque:
```python
from collections import deque
particles = deque(maxlen=1000000)  # Efficient at both ends!
```

This is particularly useful in molecular dynamics where particles cross periodic boundaries, or in astronomical simulations tracking objects entering and leaving a field of view.
List Growth Strategy: Understanding Dynamic Arrays
Watch how Python manages memory as your particle system grows. This pattern appears in many languages and understanding it helps you write efficient simulation codes.
Watch Python's list growth strategy:
(Critical for understanding simulation performance)
Length -> Capacity (overallocation)
----------------------------------------
1 -> larger allocation
5 -> larger allocation
9 -> larger allocation
Watch list grow with overallocation:
Length -> Size change
------------------------------
Length 1 -> + 32 bytes
Length 5 -> + 32 bytes
Length 9 -> + 64 bytes
Length 17 -> + 64 bytes
Key insight: Python overallocates to make future
appends O(1) amortized. Exact growth factor is
implementation-dependent (~1.125x in CPython).
This growth pattern is why appending to lists is “amortized O(1)” - usually fast, occasionally slow when reallocation happens. For time-critical simulations, pre-allocate your arrays when particle count is known!
amortized Average cost over many operations, even if individual operations vary
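You can watch the overallocation happen yourself by noting when `sys.getsizeof()` jumps during appends (a sketch; the exact lengths and byte sizes at which CPython reallocates are implementation details):

```python
import sys

data = []
last_size = sys.getsizeof(data)
growth_points = []  # (length, new size) at each reallocation

for i in range(64):
    data.append(i)
    size = sys.getsizeof(data)
    if size != last_size:          # size changes only when the list reallocates
        growth_points.append((len(data), size))
        last_size = size

for length, size in growth_points:
    print(f"reallocated at length {length}: {size} bytes")
```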
4.3 Tuples: The Power of Immutability
immutable Objects whose state cannot be modified after creation
What if you need to guarantee that initial conditions or physical constants won’t change during your simulation? Enter tuples - Python’s immutable sequences. This isn’t a limitation - it’s protection against an entire category of bugs that plague scientific codes!
One of the most insidious bugs in computational physics happens when functions unexpectedly modify their inputs. Remember the defensive programming principles from Chapter 1? Here’s why they matter:
```python
# THE BUG THAT CORRUPTS YOUR SIMULATION:
def evolve_system(state, dt):
    # Trying to be clever, modifying in-place
    state[0] += state[1] * dt  # Modifies original!
    state[1] += calculate_force(state[0]) * dt
    return state

initial_state = [1.0, 0.0]  # Position, velocity
state_t1 = evolve_system(initial_state, 0.01)
print(initial_state)  # [1.01, ...] - Initial conditions corrupted!

# THE FIX: Use tuples to prevent modification
initial_state = (1.0, 0.0)  # Tuple!
# Now modification attempts raise errors immediately!
```

This connects directly to Chapter 1’s defensive programming: catch errors early, fail loudly!
Understanding Immutability in Physics Simulations
Let’s see how immutability protects your simulations and enables powerful optimizations:
With mutable list:
G was: 6.674e-08, now: 6.674e-05
Constants CORRUPTED!
With immutable tuple:
Caught bug immediately: 'tuple' object does not support item assignment
Lesson: Immutability turns silent corruption into immediate, loud errors!
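The demonstration above could be written like this (a sketch with illustrative constant values; G is in cgs units):

```python
# Mutable list of "constants": silently corruptible
constants_list = [6.674e-8, 2.998e10]   # G, c (cgs)
constants_list[0] *= 1000               # no error raised; constant corrupted!
print(f"Corrupted G: {constants_list[0]:.3e}")

# Immutable tuple of constants: modification fails loudly
constants_tuple = (6.674e-8, 2.998e10)
try:
    constants_tuple[0] *= 1000
except TypeError as err:
    print(f"Caught bug immediately: {err}")
```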
Tuples as Dictionary Keys: Caching Expensive Calculations
Here’s where immutability becomes a superpower for computational physics - tuples can be dictionary keys, enabling powerful memoization patterns for expensive calculations.
Never use raw floating-point values as cache keys! Tiny representation differences (e.g., 1.0000000001 vs 1.0) cause cache misses. Always use:

- Discrete indices: `(T_index, rho_index)`
- Rounded values: `(round(T, -3), round(rho, 6))`
- Grid bin IDs: `(T_bin, rho_bin, composition_id)`
EOS lookups during stellar evolution:
---------------------------------------------
First lookup: P = 8.88e+11 dyne/cm^2
Second lookup: P = 8.88e+11 dyne/cm^2 (cache hit!)
New point: P = 9.77e+11 dyne/cm^2
Total expensive calculations: 2
Cache hits avoid recalculation entirely!
This pattern is fundamental to stellar evolution codes like MESA, which use pre-computed opacity and equation of state tables with discrete grid indices for O(1) lookup.
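The cached-lookup pattern behind the output above can be sketched as follows. Note the `eos_pressure` function and its ideal-gas placeholder formula are illustrative, not MESA's actual API; the point is the tuple-keyed cache with rounded keys:

```python
cache = {}
calculations = 0  # count the expensive calls

def eos_pressure(T, rho):
    """Toy equation-of-state lookup with a tuple-keyed cache."""
    global calculations
    key = (round(T, -3), round(rho, 6))  # rounded values avoid float-key cache misses
    if key not in cache:
        calculations += 1
        cache[key] = 1.38e-16 * rho * T / 1.67e-24  # ideal-gas placeholder (cgs)
    return cache[key]

p1 = eos_pressure(1.0e7, 1.0e-3)   # computed
p2 = eos_pressure(1.0e7, 1.0e-3)   # cache hit: no recalculation
p3 = eos_pressure(2.0e7, 1.0e-3)   # new point: computed
print(f"Total expensive calculations: {calculations}")
```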
You’re storing 1 million particle positions. Calculate the memory difference between:

- List of lists: `[[x, y, z] for _ in range(1_000_000)]`
- List of tuples: `[(x, y, z) for _ in range(1_000_000)]`
- Single flat list: `[x1, y1, z1, x2, y2, z2, ...]`

Which is most memory efficient? Which is easiest to use?
Solution: Tuples save ~10% memory over lists. Single flat list is most memory efficient but harder to work with. This is why NumPy uses flat arrays internally with views for usability!
Stellar evolution codes like MESA use pre-computed opacity and equation of state tables rather than calculating these values from first principles at every timestep4. The choice of data structure for storing and accessing these tables matters: O(1) dictionary lookups are faster than O(n) list searches, especially when accessed repeatedly throughout a simulation.
4.4 The Mutable vs Immutable Distinction
Time for one of Python’s most important concepts for scientific computing! Understanding mutability is the key to avoiding mysterious bugs where your simulation state changes unexpectedly. This connects directly to Chapter 1’s defensive programming principles - catching errors early prevents corrupted results.
Python’s Reference Model in Physics Simulations
Python doesn’t store values in variables - it stores references to objects. This has profound implications for simulation codes:
With IMMUTABLE objects (safe for constants):
Initially: G=6.674e-08, backup=6.674e-08
Same object in memory? True
After change: G=6.674e-08, backup=6.674e-07
Original constant is safe!
==================================================
With MUTABLE objects (dangerous for state!):
Initially: same object? True
Original particles: [[999.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
'Backup': [[999.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
You just corrupted your simulation state!
Lesson: Use copy.deepcopy() for simulation state backups!
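The mutable-state trap above can be reproduced in a few lines (illustrative particle values):

```python
import copy

particles = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

alias = particles                   # NOT a copy: a second name for the same object
backup = copy.deepcopy(particles)   # a truly independent snapshot

particles[0][0] = 999.0             # "corrupt" the simulation state

print(f"alias sees the change: {alias[0][0]}")   # the 'backup-by-assignment' failed
print(f"deepcopy is safe     : {backup[0][0]}")  # original value preserved
```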
On August 1, 2012, Knight Capital Group lost $440 million in 45 minutes due to a software deployment error that created unintended feedback loops in their trading system.
According to SEC filings and news reports, Knight was updating their trading software to handle a new NYSE order type. The deployment went to 7 of their 8 servers correctly, but one server retained old test code from 2003. When the market opened, this server began processing live orders using the old “Power Peg” test functionality, which was designed to continuously buy and sell stocks for testing purposes.
The catastrophic interaction occurred because:
- Both old and new code read from the same order queue (shared data structure)
- The old code interpreted a flag differently than the new code
- No validation existed to detect incompatible code versions
- The shared state created a feedback loop of unintended trades
In 45 minutes, the malfunctioning system:
- Executed over 4 million unintended trades
- Accumulated $7 billion in unwanted positions
- Generated 10% of total US market volume that morning
- Lost $10 million per minute
This disaster illustrates critical data structure principles:
- Shared mutable state between incompatible systems is dangerous
- Queue structures without version validation can corrupt workflows
- Missing defensive checks on shared data structures cascade failures
- A single configuration dictionary misinterpretation can destroy a company
Knight Capital, once the largest US equity trader, was acquired within a week for a fraction of its previous value. The lesson: when multiple systems share data structures, defensive programming and validation aren’t optional—they’re essential.5
The Classic Mutable Default Argument Bug
This bug is so dangerous that Python linters specifically check for it. It’s particularly insidious in iterative simulations:
Run 1:
Energy history: [100, 95]
Run 2 (should be independent):
Energy history: [100, 95, 200]
Contains Run 1 data! Runs are coupled!
THE FIX: Use None sentinel pattern
Run 3:
Energy history: [95]
Run 4 (properly independent):
Energy history: [200]
Runs are independent!
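The coupled-runs behavior above comes from this pattern (a sketch with illustrative energy values):

```python
# THE BUG: the default list is created ONCE, at function definition time
def record_energy_buggy(energy, history=[]):
    history.append(energy)
    return history

run1 = record_energy_buggy(100)
run2 = record_energy_buggy(200)   # silently shares run1's list!
print(f"run2 contains run1's data: {run2}")

# THE FIX: None sentinel creates a fresh list on every call
def record_energy(energy, history=None):
    if history is None:
        history = []
    history.append(energy)
    return history

run3 = record_energy(100)
run4 = record_energy(200)         # properly independent
print(f"run4 is independent: {run4}")
```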
Shallow vs Deep Copies in Grid Simulations
This distinction is critical for grid-based simulations in computational fluid dynamics, stellar atmospheres, or any field-based calculation:
Simulating heat diffusion on a grid:
Initial grid: [20.0, 50.0, 20.0]
--- SHALLOW COPY (aliases inner arrays) ---
Original grid center: 35.0 degC
Next step grid center: 35.0 degC
Modified both grids! Simulation is corrupted!
--- DEEP COPY (independent arrays) ---
Original grid center: 50.0 degC
Next step grid center: 35.0 degC
Grids are independent! Simulation is correct!
Memory visualization:
Shallow: grid -> [ref1, ref2, ref3] -> same inner arrays
Deep: grid -> [ref1, ref2, ref3] -> independent arrays
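The shallow-vs-deep distinction above can be demonstrated on a tiny grid (illustrative temperatures in degC):

```python
import copy

grid = [[20.0, 20.0], [20.0, 50.0]]

shallow = copy.copy(grid)      # new outer list, but the SAME inner row objects
deep = copy.deepcopy(grid)     # fully independent rows

shallow[1][1] = 35.0           # writes through to the original grid!

print(f"original grid : {grid[1][1]}")  # corrupted via the shallow copy
print(f"deep copy     : {deep[1][1]}")  # unaffected
```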
For iterative grid updates (heat equation, fluid dynamics), deepcopy is too slow. Use the ping-pong buffer pattern instead:
# Allocate TWO grids once at the start
size = 100
grid_a = [[0.0] * size for _ in range(size)]
grid_b = [[0.0] * size for _ in range(size)]
# Point current/next to the grids
current = grid_a
next_step = grid_b
# Simulation loop
for timestep in range(1000):
# Compute next_step from current
for i in range(1, size-1):
for j in range(1, size-1):
# Heat diffusion stencil
next_step[i][j] = 0.25 * (
current[i+1][j] + current[i-1][j] +
current[i][j+1] + current[i][j-1]
)
# Swap references (O(1), no copying!)
current, next_step = next_step, current
# current now points to the latest stateThis swaps two pointers instead of copying millions of values - the canonical pattern in CFD, climate models, and any time-stepping simulation!
A student’s particle simulation is producing inconsistent results. Here’s their code:

```python
def update_particles(particles, forces=[]):
    """Update particle positions based on forces."""
    for i, particle in enumerate(particles):
        if i >= len(forces):
            forces.append(calculate_force(particle))
        particle['velocity'] += forces[i] * dt
        particle['position'] += particle['velocity'] * dt
    return particles, forces

# Simulation loop
particles = [{'position': [0,0,0], 'velocity': [1,0,0]} for _ in range(10)]
for timestep in range(100):
    particles, forces = update_particles(particles)
```

Find and fix THREE bugs related to mutable defaults and aliasing:

1. The `forces=[]` default is created once and shared between calls
2. The function modifies the original particles list (no defensive copy)
3. The forces list grows unbounded, carrying over between timesteps
Solution:

```python
import copy

def update_particles(particles, forces=None):
    """Update particle positions based on forces."""
    if forces is None:
        forces = []
    particles = copy.deepcopy(particles)  # Defensive copy
    forces_new = []  # Fresh forces each time
    for particle in particles:
        force = calculate_force(particle)
        forces_new.append(force)
        particle['velocity'] += force * dt
        particle['position'] += particle['velocity'] * dt
    return particles, forces_new
```

4.5 Dictionaries: Average-Case O(1) Lookup for Physics
Now we explore one of computer science’s most elegant inventions - the dictionary! In computational physics, dictionaries are perfect for particle properties, lookup tables, and caching expensive calculations. You’re about to understand exactly how they achieve near-instantaneous lookups regardless of size.
Organizing Simulation Data with Dictionaries
Let’s see how dictionaries transform particle management in an N-body simulation:
Accessing particle_0002:
Mass: 5.97e+27 g
Type: planet
Particles by type: {'star': ['particle_0001'], 'planet': ['particle_0002']}
In astronomy: Perfect for star/planet/dark matter separation
Understanding Hash Tables for Physics Applications
Let’s demystify how dictionaries achieve O(1) lookup - critical for lookup tables in equation of state calculations:
Dictionary and set operations are average-case O(1), not guaranteed O(1). In pathological cases with many hash collisions, lookups can degrade to O(n). This happens when:
- Many keys hash to the same bucket (adversarial inputs)
- Poor hash function design
- Hash table needs resizing
For scientific computing, Python’s built-in hash functions work well for typical data. The O(1) claim holds in practice for nearly all physics applications.
How hash tables accelerate physics lookups:
============================================================
Key -> Hash -> Bucket
------------------------------------------------------------
T=1e+06, rho=1e-03 -> 7060515190775938164 -> 64
T=2e+06, rho=2e-03 -> -3756423604195419920 -> 20
T=5e+06, rho=1e-02 -> -5145451484737341997 -> 97
H_ionization -> 3730082394818695734 -> 34
He_ionization -> -4861540764280204506 -> 6
opacity_table -> -3903449810662890793 -> 93
The O(1) lookup process:
1. Hash the key -> integer
2. Map to bucket index
3. Direct array access!
This is why equation of state tables are fast!
Python salts string hashes for security. Running this code multiple times will show different hash values and bucket assignments - that’s expected! The key insight is that the process (hash \(\to\) bucket \(\to\) O(1) access) remains constant, even if specific values change.
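A minimal sketch of the hash-to-bucket mapping above (the bucket count of 128 is illustrative; CPython actually masks the hash against a power-of-two table size rather than using `%`):

```python
n_buckets = 128  # illustrative table size

for key in ["H_ionization", "He_ionization", (1e6, 1e-3)]:
    h = hash(key)             # step 1: hash the key to an integer
    bucket = h % n_buckets    # step 2: map the hash to a bucket index
    print(f"{key!r} -> hash {h} -> bucket {bucket}")
# step 3: the dict jumps straight to that bucket - O(1) access
```

String hashes are salted per process, so the printed values differ between runs; numeric and tuple-of-number hashes are deterministic.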
Dictionary Performance for Particle Lookups
Time to see the dramatic performance difference for particle system queries:
```{python}
#| eval: false
# Stage 1: Create test data (10 lines)
import time
import random

# Seed for reproducibility - critical for scientific computing!
rng = random.Random(42)

# Create a large particle system
n_particles = 100_000  # Reduced for reasonable runtime
print(f"Creating system with {n_particles:,} particles...")

# List approach (what beginners might try)
particle_list = [(f"particle_{i:07d}", rng.uniform(1e27, 1e30))
                 for i in range(n_particles)]
```

```{python}
#| eval: false
# Stage 2: Create dictionary version (8 lines)
# Reset RNG for same masses
rng = random.Random(42)

# Dictionary approach (professional solution)
particle_dict = {f"particle_{i:07d}": rng.uniform(1e27, 1e30)
                 for i in range(n_particles)}

# Search for specific particle
target = "particle_0050000"
print(f"Finding {target} among {n_particles:,} particles...")
```

```{python}
#| eval: false
# Stage 3: Time the lookups (15 lines)
# Time list search - O(n)
start = time.perf_counter()
for pid, mass in particle_list:
    if pid == target:
        mass_list = mass
        break
list_time = time.perf_counter() - start

# Time dict lookup - O(1)
start = time.perf_counter()
mass_dict = particle_dict[target]
dict_time = time.perf_counter() - start

print(f"\nList search: {list_time*1000:.3f} ms")
print(f"Dict lookup: {dict_time*1000:.6f} ms")
print(f"Dictionary is {list_time/dict_time:,.0f}x faster!")
print("\nFor astronomical catalogs with millions of objects:")
print("This difference determines feasibility!")
```

Sample output (from 100,000 particles on typical hardware):

```
Finding particle_0050000 among 100,000 particles...
List search: 2.847 ms
Dict lookup: 0.000312 ms
Dictionary is 9,125x faster!
For astronomical catalogs with millions of objects:
This difference determines feasibility!
```
This caching pattern dramatically speeds up physics codes:
```python
from functools import lru_cache

@lru_cache(maxsize=10000)
def equation_of_state(temperature, density):
    # Expensive calculation cached automatically
    return complex_eos_calculation(temperature, density)
```

Real physics applications:

- Nuclear reaction rates: Cache temperature-dependent rates
- Opacity calculations: Cache (T, ρ, composition) lookups
- Molecular dynamics: Cache pairwise interaction potentials
- Climate models: Cache radiative transfer calculations
4.6 Sets: Mathematical Operations for Particle Systems
Sets are perfect for tracking unique particles, finding neighbors, and performing mathematical operations on particle groups. They provide O(1) membership testing plus elegant set operations - ideal for computational physics!
Set Operations in Particle Dynamics
Boundary particles needing communication: {'p004', 'p007', 'p003', 'p005'}
Interior particles (A): {'p002', 'p001'}
Total unique particles: 9
Particles in multiple domains: {'p004', 'p007', 'p003', 'p005'}
Sets make domain decomposition bookkeeping trivial!
Used in: MPI parallel codes, astronomical survey overlaps
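The domain-decomposition bookkeeping above can be sketched with set operators (the particle IDs and domain assignments here are illustrative, not the chapter's exact data):

```python
# Two spatial domains in a decomposed simulation
domain_a = {"p001", "p002", "p003", "p004", "p005"}
domain_b = {"p003", "p004", "p005", "p006", "p007", "p008", "p009"}

boundary = domain_a & domain_b       # in both: need MPI communication
interior_a = domain_a - domain_b     # only in A: no communication needed
all_particles = domain_a | domain_b  # union: every unique particle

print(f"Boundary particles: {sorted(boundary)}")
print(f"Interior of A: {sorted(interior_a)}")
print(f"Total unique particles: {len(all_particles)}")
```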
Regular sets are mutable (unhashable), so they can’t be dict keys. Use frozenset for an immutable, hashable set:

```python
# Track which particle combinations we've processed
processed = {}
pair = frozenset({'p001', 'p002'})  # Order doesn't matter!
processed[pair] = True

# These are the same pair:
frozenset({'p001', 'p002'}) == frozenset({'p002', 'p001'})  # True
```

Performance: Sets vs Lists for Neighbor Finding
Finding neighbors among 100,000 particles...
Checking 100 potential neighbors:
List approach: 0.08 ms
Set approach: 0.02763 ms
Set is 3x faster!
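A benchmark of the kind that produced the numbers above might look like this (a sketch; absolute times and the speedup ratio depend on your machine):

```python
import time

n = 100_000
neighbor_list = [f"particle_{i:06d}" for i in range(n)]
neighbor_set = set(neighbor_list)

# 100 candidates near the end of the list: worst case for linear scan
candidates = [f"particle_{i:06d}" for i in range(n - 100, n)]

start = time.perf_counter()
hits_list = sum(1 for c in candidates if c in neighbor_list)  # O(n) per check
list_time = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(1 for c in candidates if c in neighbor_set)    # O(1) per check
set_time = time.perf_counter() - start

print(f"List approach: {list_time*1e3:.2f} ms")
print(f"Set approach:  {set_time*1e3:.5f} ms")
```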
Why is checking particle membership O(n) with lists but O(1) with sets?
Think about the search process…
Answer: Lists must check each element sequentially until finding a match (or reaching the end) - this is O(n) linear time. Sets use hash tables: they compute a hash of the particle ID and jump directly to where it would be stored - this is O(1) constant time.
For a million-particle simulation, this is the difference between microseconds and seconds per lookup. When you’re checking thousands of particle interactions per timestep, this performance difference determines whether your simulation finishes in hours or weeks!
4.7 Memory and Performance for Scientific Computing
Now that we understand how different data structures behave algorithmically with sets, dictionaries, and lists, let’s examine their actual memory footprint and cache performance implications. Understanding memory layout is crucial for scientific computing because it explains why specialized numerical libraries are so much faster than pure Python and will help you write cache-efficient simulation codes.
Memory Usage in Particle Systems
Let’s examine memory costs for different data structures in a particle simulation context:
Storing 10,000 particle IDs:
List: 85,176 bytes
Set: 524,504 bytes
100 particles as list: 920 bytes
With spatial hash: ~4,688 bytes
Extra 3,768 bytes enables O(1) neighbor finding!
Memory for 1,000 particles (ID, x, y, z):
-------------------------------------------------------
List of lists : 8,856 bytes (container only!)
List of tuples : 8,856 bytes (container only!)
Dictionary : 36,952 bytes (container only!)
NumPy .nbytes : 32,000 bytes (actual data!)
NumPy getsizeof : 32,128 bytes (data + 128-byte header)
⚠️ getsizeof() is SHALLOW - it misses contained objects!
NumPy's .nbytes gives the actual data buffer size.
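You can reproduce the shallow-measurement pitfall in a few lines (exact sizes vary by Python version and platform):

```python
# sys.getsizeof is shallow: it reports the container itself, not the
# objects it references. Here we add in the contained tuples by hand
# (the floats inside each tuple are still not counted).
import sys

coords = [(1.0, 2.0, 3.0) for _ in range(1000)]

shallow = sys.getsizeof(coords)                          # list + pointer array only
deep = shallow + sum(sys.getsizeof(t) for t in coords)   # + each tuple object

print(f"Shallow size:     {shallow:,} bytes")
print(f"Shallow + tuples: {deep:,} bytes")
```

The gap between the two numbers is exactly the memory that a naive `getsizeof` call silently ignores.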
Cache Efficiency in Grid Computations
Modern CPUs are fast, but memory access is slow. Understanding cache efficiency is critical for grid-based simulations:
Simulating heat diffusion with finite differences:
(Common in CFD and atmospheric modeling)
Processing 500x500 grid:
Row-major (cache-friendly): 16.6 ms
Column-major (cache-hostile): 19.0 ms
Ratio: 1.1x
⚠️ Important caveat about Python nested lists:
Each row is a SEPARATE list object in memory.
Only the row's pointers are contiguous, not the data.
In NumPy with contiguous arrays, this effect is 10-100x!
That's why memory layout matters so much in CFD/climate codes.
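Here is a sketch of the two traversal orders over a nested-list grid; with pure-Python lists the gap is modest, as the caveat above explains, while contiguous NumPy arrays show a much larger one:

```python
# Row-major vs column-major traversal of a nested-list grid.
import time

n = 500
grid = [[float(i + j) for j in range(n)] for i in range(n)]

start = time.perf_counter()
total = 0.0
for i in range(n):          # row-major: walk each inner list in order
    for j in range(n):
        total += grid[i][j]
row_time = time.perf_counter() - start

start = time.perf_counter()
total = 0.0
for j in range(n):          # column-major: hop between inner lists each step
    for i in range(n):
        total += grid[i][j]
col_time = time.perf_counter() - start

print(f"Row-major:    {row_time * 1e3:.1f} ms")
print(f"Column-major: {col_time * 1e3:.1f} ms")
```

Both loops compute the same sum; only the memory access pattern differs.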
Never modify a collection while iterating - a common bug in particle removal:
# THE BUG - Skips particles!
particles = [p1, p2, p3, p4, p5]
for p in particles:
    if p.escaped_boundary():
        particles.remove(p)  # WRONG! Skips the next particle

# THE FIX 1: Iterate over a copy
for p in particles.copy():
    if p.escaped_boundary():
        particles.remove(p)

# THE FIX 2: List comprehension (best for physics)
particles = [p for p in particles if not p.escaped_boundary()]

# THE FIX 3: Build a removal list
to_remove = []
for p in particles:
    if p.escaped_boundary():
        to_remove.append(p)
for p in to_remove:
    particles.remove(p)

This bug has corrupted many molecular dynamics simulations!
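Here is a concrete, runnable demonstration of the skip bug, using plain integers so the effect is easy to see:

```python
# Removing from a list while iterating over it: every value should be
# removed here, yet half of them survive.
values = [1, 2, 3, 4, 5, 6]

for v in values:
    values.remove(v)   # BUG: remaining elements shift left, iterator skips one

print(values)          # [2, 4, 6] - every other element was skipped!

# The fix: iterate over a copy (or build a new list with a comprehension)
values = [1, 2, 3, 4, 5, 6]
for v in values.copy():
    values.remove(v)
print(values)          # [] - all removed, as intended
```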
4.8 Choosing the Right Structure for Physics
After exploring all options, how do you choose? Here’s your decision framework for computational physics:
For each physics scenario, what data structure would you choose?
- Tracking unique particle IDs in collision detection \(\to\) ?
- Time series of energy measurements \(\to\) ?
- Caching expensive equation of state calculations \(\to\) ?
- Physical constants that must not change \(\to\) ?
Think before reading…
Answers: 1. Set (unique particles, O(1) collision checks) 2. List (ordered by time, allows duplicates) 3. Dictionary (cache (T,\(\rho\)) \(\to\) pressure with O(1) lookup) 4. Tuple or namedtuple (immutable constants, prevents accidents)
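As a sketch of answer 3, here is a hand-rolled memoization cache keyed by (T, rho); the pressure formula is a stand-in ideal-gas expression in CGS units, not a full equation of state:

```python
# Memoizing an equation-of-state call with a dictionary.
eos_cache = {}

def pressure(T, rho, mu=1.0):
    """Cache pressure by (T, rho); tuples are hashable, so they work as keys."""
    key = (T, rho)
    if key not in eos_cache:                  # O(1) average-case lookup
        k_B, m_H = 1.380649e-16, 1.6726e-24   # Boltzmann const, H mass (CGS)
        eos_cache[key] = rho * k_B * T / (mu * m_H)
    return eos_cache[key]

p1 = pressure(1.0e4, 1.0e-20)    # computed and stored
p2 = pressure(1.0e4, 1.0e-20)    # served from the cache
print(p1 == p2, len(eos_cache))  # True 1
```

For production code, `functools.lru_cache` (covered in the module reference below) does this bookkeeping for you, including eviction.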
Performance Quick Reference for Physics
| Operation | List | Tuple | Dict | Set |
|---|---|---|---|---|
| Access by index | O(1) | O(1) | N/A | N/A |
| Search for particle | O(n) | O(n) | O(1)† | O(1)† |
| Add particle | O(1)‡ | N/A | O(1)‡ | O(1)‡ |
| Remove particle | O(n) | N/A | O(1)† | O(1)† |
| Memory (relative) | 1\(\times\) | 0.9\(\times\) | 3\(\times\) | 3\(\times\) |
| Preserves order | Yes | Yes | Yes* | No |
* Dicts maintain insertion order (Python 3.7+)
† Average-case; worst-case O(n) with hash collisions
‡ Amortized - occasionally O(n) during resize
Real Example: Combining Data Structures
Let’s see how different data structures work together in a simple particle tracking system:
Total active mass: 3.0e+30 g
Number of measurements: 3
Constants are protected: type(CONSTANTS) = tuple
Main Takeaways
You’ve just mastered concepts that will transform your computational physics from toy problems to research-grade simulations! The journey from understanding simple lists to architecting with dictionaries and sets represents a fundamental shift in how you approach scientific computing. You now see data structures not just as containers, but as carefully chosen tools where the right choice can mean the difference between simulations that finish in hours versus weeks.
The most profound insight from this chapter is that data structure choice often matters more than algorithm optimization. We saw how switching from a list to a set for particle lookups gave a speedup of more than 20,000\(\times\) - no amount of micro-optimization could achieve that! This is the secret that separates research codes from student projects: professionals spend more time architecting their data organization than writing physics equations. The best physics insight in the world is useless if your code can't handle realistic problem sizes.
The mutable versus immutable distinction that seemed abstract at first is actually critical for scientific computing. Every time you use a tuple for physical constants or configuration parameters, you’re preventing bugs that have literally crashed spacecraft and corrupted published results. When you properly use defensive copying for your simulation state, you’re protecting yourself from the aliasing bugs that have plagued computational physics for decades. Remember: in the Cassini example, a simple data structure choice nearly lost a billion-dollar mission.
The performance principles you’ve learned extend far beyond Python. The cache efficiency concepts explain why codes like LAMMPS and GADGET obsess over memory layout - it’s not premature optimization, it’s the difference between simulating thousands versus millions of atoms. The Big-O notation you’ve mastered is the universal language for discussing whether an algorithm scales to galaxy-sized problems. The hash table concept underlying dictionaries appears in every parallel communication library, every adaptive mesh refinement code, and every spatial indexing scheme you’ll encounter.
Most importantly, you now understand the connection between data structures and numerical methods. That O(\(\mathrm{n}^{2}\)) all-pairs particle interaction that’s impossible for large systems? It becomes O(n log n) with tree-based structures. The O(n) searching that makes timesteps crawl? It becomes O(1) with spatial hashing. These aren’t just computer science concepts - they’re the difference between simulating a few hundred particles and modeling entire galaxies with billions of stars.
Remember: every data structure makes trade-offs. Lists give you ordering but slow searches. Dictionaries trade memory for lightning-fast lookups. Sets provide uniqueness and mathematical operations but lose ordering. There’s no universally “best” structure - only the best structure for your specific physics problem. As you tackle research simulations, you’ll combine multiple structures: NumPy arrays for number crunching, dictionaries for particle properties, sets for collision detection, and spatial data structures for neighbor finding.
With this foundation, you’re ready to architect simulations that scale to astronomical proportions. The next chapter on functions and modules will show you how to organize this knowledge into reusable, testable components. After that, NumPy will supercharge your numerical operations using the memory layout principles you now understand. You’re no longer just writing code - you’re engineering solutions to the computational challenges of modern physics!
Once you can reason about control flow (Chapter 3) and data layout (this chapter), you’re ready to start abstracting your work: writing functions and modules that hide complexity while preserving correctness and performance. That’s the goal of Chapter 5.
Definitions
Aliasing: When two or more variables refer to the same object in memory, causing modifications through one to affect the others.
Amortized O(1): An operation that is usually O(1) but occasionally O(n), where the average over many operations remains O(1).
Big-O Notation: Mathematical notation describing how an algorithm’s runtime scales with input size, focusing on the dominant term.
Cache: Small, fast memory close to the CPU that stores recently accessed data for quick retrieval.
Data Structure: A way of organizing data in computer memory to enable efficient access and modification.
Deep Copy: Creating a completely independent copy of an object and all objects it contains, recursively.
Dictionary: A mutable mapping type storing key-value pairs with average-case O(1) lookup time using hash tables. Worst-case O(n) with hash collisions.
Hash Function: A function mapping data of arbitrary size to fixed-size values, enabling fast lookups.
Hash Table: The underlying implementation for dictionaries and sets, enabling average-case O(1) lookups. Worst-case O(n) when many keys collide.
Immutable: Objects whose state cannot be modified after creation (tuples, strings, numbers).
List: Python’s built-in mutable sequence type that stores references to objects in order.
Memoization: Caching technique storing function results to avoid recalculating for the same inputs.
Mutable: Objects whose state can be modified after creation (lists, dictionaries, sets).
Named Tuple: A tuple subclass allowing element access by name as well as index.
O(1) - Constant Time: An operation whose runtime doesn’t depend on input size.
O(n) - Linear Time: An operation whose runtime grows proportionally with input size.
O(\(\mathrm{n}^{2}\)) - Quadratic Time: An operation whose runtime grows with the square of input size.
O(log n) - Logarithmic Time: An operation whose runtime grows logarithmically with input size.
O(n log n): Common complexity for efficient sorting algorithms and tree-based operations.
Reference: A variable that points to an object in memory rather than containing the value directly.
Set: A mutable collection of unique, unordered elements with average-case O(1) membership testing.
Shallow Copy: Creating a new container with references to the same contained objects.
Spatial Hashing: Organizing particles by spatial region for O(1) neighbor finding.
Tuple: An immutable sequence type that cannot be changed after creation.
Key Takeaways
✓ Data structure choice can change performance by factors of 1,000,000\(\times\) or more for large-scale simulations
✓ Lists are versatile but have O(n) search - use for ordered particle arrays that you’ll vectorize with NumPy
✓ Dictionaries and sets provide O(1) lookup through hash tables - essential for particle lookups and caching
✓ Tuples prevent modification bugs - use for physical constants and configuration parameters
✓ The shallow vs deep copy distinction is critical for grid-based simulations
✓ Python stores references to objects, explaining why NumPy’s contiguous arrays are faster
✓ Memory layout affects cache performance by 2-10\(\times\) even in Python
✓ Every data structure trades something (memory, speed, flexibility) for something else
✓ Mutable default arguments are dangerous - always use the None sentinel pattern (Chapter 1 callback)
✓ Spatial data structures transform O(\(\mathrm{n}^{2}\)) physics problems into O(n) or O(n log n)
✓ Sets provide elegant mathematical operations for domain decomposition and particle tracking
✓ Dictionaries enable memoization that can speed up equation of state calculations 100\(\times\)
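As a concrete illustration of the spatial-hashing takeaway, here is a minimal sketch using a dict of sets; the cell size, particle IDs, and positions are assumptions for the example:

```python
# Spatial hashing: bin particles by integer cell coordinates so that
# neighbor candidates come from at most 9 cells, not all n particles.
from collections import defaultdict

cell_size = 1.0                       # assumed interaction cutoff radius
grid = defaultdict(set)               # (ix, iy) -> set of particle IDs
positions = {'p1': (0.2, 0.3), 'p2': (0.4, 0.9), 'p3': (5.0, 5.0)}

def cell_of(x, y):
    return (int(x // cell_size), int(y // cell_size))

for pid, (x, y) in positions.items():
    grid[cell_of(x, y)].add(pid)

def neighbor_candidates(pid):
    """Union of the particle's own cell and its 8 surrounding cells."""
    ix, iy = cell_of(*positions[pid])
    cands = set()
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            cands |= grid[(ix + dx, iy + dy)]
    cands.discard(pid)                # a particle is not its own neighbor
    return cands

print(neighbor_candidates('p1'))     # {'p2'} - p3 is far away, never examined
```

The distant particle `p3` is never even looked at, which is exactly how the O(n\(^2\)) all-pairs check collapses to near O(n).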
Python Module & Method Reference
This reference section catalogs all Python modules, functions, and methods introduced in this chapter. Keep this as a quick lookup guide for your physics simulations.
Standard Library Modules
time module - High-resolution timing for performance measurement
- import time
- time.perf_counter() - Returns float seconds with the highest available resolution
- Use for timing code segments: start = time.perf_counter()
- Always use for benchmarking (not time.time(), which can go backwards!)
sys module - System-specific parameters (expanded from Chapter 1)
- import sys
- sys.getsizeof(object) - Returns shallow memory size of object in bytes
- Measures container overhead only, not contained objects
- For deep size: sum recursively or use pympler.asizeof()
- Use for comparing container overhead, not total memory
copy module - Create object copies with control over depth
- import copy
- copy.copy(x) - Creates shallow copy (new container, same contents)
- copy.deepcopy(x) - Creates deep copy (recursively copies all contents)
- Critical for grid simulations to avoid aliasing bugs
- Use for simulation state backups
random module - Generate random numbers for Monte Carlo
- import random
- random.random() - Random float in [0.0, 1.0)
- random.uniform(a, b) - Random float in [a, b]
- random.choice(seq) - Random element from sequence
- random.sample(population, k) - k unique random elements
json module - Save/load structured data
- import json
- json.dump(obj, file, indent=2) - Write object to file with formatting
- json.load(file) - Read object from file
- json.dumps(obj) - Convert object to JSON string
- json.loads(string) - Parse JSON string to object
math module - Mathematical functions (review from Chapter 2)
- import math
- math.sqrt(x) - Square root (use for distances)
- math.log(x) - Natural logarithm
- math.log2(x) - Base-2 logarithm (for Big-O analysis)
Collections Module
collections module - Specialized container datatypes
from collections import OrderedDict, Counter, defaultdict, deque, namedtuple

OrderedDict - Dictionary that remembers insertion order
- .move_to_end(key, last=True) - Move key to end (or beginning if last=False)
- .popitem(last=True) - Remove and return (key, value) pair from end/beginning
- Use for LRU caches in physics calculations
Counter - Dictionary subclass for counting hashable objects
- Counter(iterable) - Create from iterable
- .most_common(n) - Returns n most common (element, count) pairs
- .update(iterable) - Add counts from iterable
- Perfect for histogram analysis of particle properties
defaultdict - Dictionary with default value factory
- defaultdict(list) - Missing keys default to empty list
- defaultdict(int) - Missing keys default to 0
- Eliminates KeyError in particle grouping
deque - Double-ended queue with O(1) operations at both ends
- deque(maxlen=n) - Fixed-size queue (old elements dropped)
- .append(x) / .appendleft(x) - Add to right/left end
- .pop() / .popleft() - Remove from right/left end
- Essential for boundary conditions in simulations
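A short runnable demonstration of Counter, defaultdict, and deque together (the particle types and regions are illustrative):

```python
from collections import Counter, defaultdict, deque

# Counter: histogram of particle types in one line
types = Counter(['gas', 'star', 'gas', 'dm', 'gas'])
print(types.most_common(1))    # [('gas', 3)]

# defaultdict: group particle IDs by region without KeyError checks
by_region = defaultdict(list)
for pid, region in [('p1', 'halo'), ('p2', 'disk'), ('p3', 'halo')]:
    by_region[region].append(pid)
print(by_region['halo'])       # ['p1', 'p3']

# deque: fixed-size history of recent timesteps (old entries drop off)
history = deque(maxlen=3)
for step in range(5):
    history.append(step)
print(list(history))           # [2, 3, 4]
```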
namedtuple - Tuple subclass with named fields

Particle = namedtuple('Particle', ['id', 'mass', 'x', 'y', 'z'])
p = Particle(1, 1.0e30, 0, 0, 0)
print(p.mass)  # Access by name

Functools Module
functools module - Higher-order functions and operations
from functools import lru_cache
- @lru_cache(maxsize=128) - Decorator for automatic memoization

@lru_cache(maxsize=1000)
def expensive_calculation(T, rho):
    return complex_physics_calculation(T, rho)

- Caches function results automatically
- maxsize=None for unlimited cache
- .cache_info() - Returns cache statistics
- .cache_clear() - Clear the cache
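A runnable sketch of lru_cache with cache statistics; the cached function here is a placeholder polynomial, not a real equation of state:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def pressure(T, rho):
    return rho * T + 0.5 * rho**2   # stand-in for an expensive EOS call

pressure(1.0e4, 1.0e-3)             # first call: computed (a cache miss)
pressure(1.0e4, 1.0e-3)             # second call: served from the cache
info = pressure.cache_info()
print(info.hits, info.misses)       # 1 1
```

`cache_info()` is a quick way to confirm your cache is actually being hit before trusting the speedup.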
Built-in Functions for Data Structures
Type Checking
- type(obj) - Returns object's type
- isinstance(obj, type) - Check if obj is an instance of type
- id(obj) - Returns unique identifier (memory address)
- obj1 is obj2 - Check if same object (identity)
Container Operations
- len(container) - Number of elements
- x in container - Membership test (O(1) for sets/dicts, O(n) for lists)
- sorted(iterable) - Returns new sorted list
- reversed(sequence) - Returns reverse iterator
- enumerate(iterable, start=0) - Returns (index, value) pairs
- zip(iter1, iter2, ...) - Parallel iteration
Copying
- list(iterable) - Create new list from iterable
- tuple(iterable) - Create new tuple from iterable
- set(iterable) - Create new set from iterable
- dict(pairs) - Create dictionary from (key, value) pairs
Data Structure Methods Summary
List Methods (Expanded)
- .append(x) - Add to end (amortized O(1))
- .extend(iterable) - Add all elements from iterable
- .insert(i, x) - Insert at position i (O(n))
- .remove(x) - Remove first x (O(n))
- .pop(i=-1) - Remove and return element at i
- .clear() - Remove all elements
- .index(x) - Find position of x (O(n))
- .count(x) - Count occurrences of x
- .sort() - Sort in-place
- .reverse() - Reverse in-place
- .copy() - Create shallow copy
Dictionary Methods (Expanded)
- .get(key, default=None) - Safe access with default
- .setdefault(key, default) - Get or set with default
- .pop(key, default) - Remove and return value
- .popitem() - Remove and return the last-inserted (key, value) pair
- .update(other) - Update with other dict/pairs
- .keys() - View of keys
- .values() - View of values
- .items() - View of (key, value) pairs
- .clear() - Remove all items
- .copy() - Shallow copy
Set Methods (Expanded)
- .add(x) - Add element
- .remove(x) - Remove element (raises KeyError if missing)
- .discard(x) - Remove element (no error if missing)
- .pop() - Remove and return arbitrary element
- .clear() - Remove all elements
- .union(other) or | - Elements in either set
- .intersection(other) or & - Elements in both sets
- .difference(other) or - - Elements in first but not second
- .symmetric_difference(other) or ^ - Elements in either but not both
- .update(other) or |= - Add elements from other
- .intersection_update(other) or &= - Keep only elements in both
- .difference_update(other) or -= - Remove elements in other
- .symmetric_difference_update(other) or ^= - Keep elements in exactly one set
- .issubset(other) - Check if all elements are in other
- .issuperset(other) - Check if contains all elements of other
- .isdisjoint(other) - Check if no elements in common
- .copy() - Create shallow copy
Quick Reference Tables
Data Structure Operations
| Operation | List | Tuple | Dict | Set |
|---|---|---|---|---|
| Create empty | [] | () | {} | set() |
| Create with items | [1,2,3] | (1,2,3) | {'a':1} | {1,2,3} |
| Add item | .append(x) | N/A | d[k]=v | .add(x) |
| Remove item | .remove(x) | N/A | del d[k] | .remove(x) |
| Check membership | x in list | x in tuple | k in dict | x in set |
| Get by index | list[i] | tuple[i] | N/A | N/A |
Performance Quick Reference
| Operation | List | Tuple | Dict | Set | Deque |
|---|---|---|---|---|---|
| Access by index | O(1) | O(1) | N/A | N/A | O(n) |
| Search for value | O(n) | O(n) | O(1)* | O(1) | O(n) |
| Add to end | O(1)† | N/A | O(1)† | O(1)† | O(1) |
| Add to beginning | O(n) | N/A | O(1)† | O(1)† | O(1) |
| Remove from end | O(1) | N/A | N/A | N/A | O(1) |
| Remove from beginning | O(n) | N/A | N/A | N/A | O(1) |
| Remove specific | O(n) | N/A | O(1) | O(1) | O(n) |
* Dict searches by key, not value
† Amortized - occasionally O(n) during resize
Common Methods for Physics
| Structure | Method | Physics Use Case |
|---|---|---|
| list | .append(x) | Add new particle |
| list | .extend(iter) | Merge particle lists |
| dict | .get(k, default) | Safe parameter lookup |
| functools | @lru_cache | Automatic memoization |
| set | .union(other) | Combine particle domains |
| set | .intersection(other) | Find boundary particles |
| deque | .appendleft() | Boundary conditions |
| OrderedDict | .move_to_end() | LRU cache management |
When to Use Each Structure
| Physics Task | Best Choice | Why |
|---|---|---|
| Particle positions | List \(\to\) NumPy array | Vectorizable operations |
| Particle lookup by ID | Dictionary | O(1) access |
| Active particles | Set | Fast membership, uniqueness |
| Physical constants | Tuple/NamedTuple | Immutable, safe |
| EOS cache | Dictionary/LRU cache | Fast lookup by (T,\(\rho\)) |
| Collision candidates | Set | No duplicates, set operations |
| Time series | List | Ordered, allows duplicates |
| Spatial grid | Dict of sets | O(1) cell access |
| Boundary particles | Deque | Fast operations at both ends |
| Particle counts | Counter | Automatic histogram creation |
Next Chapter Preview
With data structures mastered, Chapter 5 will explore functions and modules - how to organize code for reusability, testing, and collaboration. You’ll learn how Python’s function model, with first-class functions and closure support, enables powerful patterns like decorators and functional programming techniques. These concepts prepare you for building modular physics engines, creating reusable analysis pipelines, and understanding the functional programming paradigm used in modern frameworks like JAX where functions transform into automatically differentiable computational graphs. Get ready to transform your scripts into professional-quality scientific libraries!
Footnotes
Lions, J.L., et al. (1996). “Ariane 5 Flight 501 Failure Report.” ESA/CNES.↩︎
Jones, M.B. (1997). “What Really Happened on Mars.” Microsoft Research.↩︎
Anderson, Y.Z., et al. (2006). “Solving Cassini’s Data Glitch Problem.” NASA/JPL.↩︎
Paxton et al. (2011). “Modules for Experiments in Stellar Astrophysics (MESA).” ApJS, 192, 3.↩︎
“Knight Capital Says Trading Mishap Cost It $440 Million.” The New York Times, August 2, 2012. SEC Release No. 70694 (October 16, 2013).↩︎