---
title: "Chapter 4: Data Structures - Organizing Scientific Data"
subtitle: "Python Fundamentals | COMP 536"
author: "Anna Rosen"
draft: false
format:
  html:
    toc: true
execute:
  echo: true
  warning: true
  error: false
freeze: auto
---

Learning Objectives

By the end of this chapter, you will be able to:

Note✅ Before Starting This Chapter

If any boxes are unchecked, review the indicated chapters first.

Chapter Overview

Imagine you’re simulating the interactions between a million particles - whether they’re dark matter particles in a galaxy, molecules in a gas, or nodes in a network. Each timestep, you need to find which particles are close enough to interact strongly. With the wrong data structure, this neighbor search could take hours per timestep. With the right one - a spatial hash table or tree structure - it takes seconds. That’s the difference between a simulation finishing in a day versus running for months. This chapter teaches you to make these critical choices that determine whether your code scales to research problems or remains stuck with toy models.

This chapter transforms you from someone who stores data to someone who orchestrates it strategically for scientific computing. You’ll discover not just that dictionary lookups are fast, but why they’re fast - through hash functions that turn particle positions into array indices. You’ll understand when they might fail - like when hash collisions cluster your data. And you’ll learn how to verify performance yourself - because in computational science, measurement beats assumption every time.

These concepts directly prepare you for the numerical computing ahead. The memory layout discussions explain why NumPy arrays can be 10-100\(\times\) more efficient than Python lists for vectorized operations. The immutability concepts prepare you for functional programming paradigms used in modern frameworks like JAX, where immutable operations enable automatic differentiation. The performance profiling skills will help you identify bottlenecks whether you’re solving differential equations or analyzing experimental data. By chapter’s end, you’ll think about data organization like a computational scientist, architecting for performance from the start.

In Chapter 3, you learned how algorithms flow through decisions and loops. In Chapter 4, you’ll learn how algorithms scale—because data structures determine whether your logic runs in seconds or in days.

ImportantTL;DR — Why This Chapter Is a Turning Point

Most courses teach data structures as a list of methods to memorize. This chapter teaches something more useful: how to choose a structure that makes your algorithm feasible.

If you internalize this chapter, you’ll start thinking like a computational scientist:

  • You’ll design around access patterns (what you need to look up, update, or traverse).
  • You’ll predict when code will scale to large problems — and when it will explode.
  • You’ll avoid whole classes of bugs by using immutability and careful copying.
  • You’ll measure performance instead of guessing (because in science, measurement beats assumption).

You do not need to memorize everything here. You do want to revisit this chapter all semester—especially when your code starts getting slow, memory-hungry, or mysteriously wrong.

NoteHow to Use This Chapter

Treat this chapter as a reference and toolbox, not something to read once and memorize.

A good workflow is:

  1. Identify your problem’s access pattern (search, lookup, append, membership, caching).
  2. Choose the simplest structure that matches that pattern.
  3. Measure performance with small benchmarks when in doubt.
  4. Revisit this chapter when your code hits performance or correctness walls.

If this material feels “advanced,” that’s normal. It clicks fastest when you apply it to real code.

4.1 What Is a Data Structure?

data structure A way of organizing data in memory to enable efficient access and modification

A data structure is fundamentally about organizing information to match your access patterns. Think about an N-body simulation where particles interact through forces - these particles could represent stars in a star cluster, molecules in a gas, or charges in a plasma. You could store particles in order of creation (like a list), organize them by spatial region for fast neighbor finding (like a dictionary of cells), or track unique particle IDs (like a set). Each choice profoundly affects your simulation's performance - the difference between \(O(n^2)\) all-pairs checks and \(O(n \log n)\) tree-based algorithms.

Big-O notation Notation describing how an algorithm's runtime scales with input size n, e.g., \(O(n^2)\) or \(O(n \log n)\)

Quick Preview: Python’s Core Data Structures

| Structure | Mutable | Ordered | Duplicates | Use Case                          |
|-----------|---------|---------|------------|-----------------------------------|
| `list`    | Yes     | Yes     | Yes        | Particle arrays, time series      |
| `tuple`   | No      | Yes     | Yes        | Constants, coordinates, dict keys |
| `dict`    | Yes     | Yes*    | Keys: No   | Lookup tables, properties         |
| `set`     | Yes     | No      | No         | Unique particles, membership      |

*Dicts maintain insertion order (Python 3.7+ language guarantee)

Decision Framework: Choosing Your Data Structure

Before diving into each structure, here’s the mental model for choosing:

Need ordered sequence with duplicates?
├── Yes -> Need to modify it?
│         ├── Yes -> list
│         └── No (constants/config) -> tuple
└── No -> Need key-value mapping?
          ├── Yes -> dict (O(1) lookup by key)
          └── No -> Need uniqueness/set math?
                    ├── Yes -> set (O(1) membership)
                    └── No -> Probably list

Key invariants to remember:

| Structure | Mutable? | Keys/Elements Must Be… | When to Use                   |
|-----------|----------|------------------------|-------------------------------|
| `list`    | Yes      | Anything               | Ordered data, will modify     |
| `tuple`   | No       | Anything               | Constants, dict keys, records |
| `dict`    | Yes      | Hashable (immutable)   | Fast lookup by key            |
| `set`     | Yes      | Hashable (immutable)   | Uniqueness, membership tests  |
NoteWhat Does “Hashable” Mean?

An object is hashable if it has a hash value that never changes during its lifetime. In practice:

  • Hashable: int, float, str, tuple (if contents hashable), frozenset
  • Not hashable: list, dict, set (because they’re mutable)

This is why you can use tuples as dict keys but not lists!
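A quick way to see this rule in action (a minimal sketch; the grid-cell key is an illustrative example):

```python
# Tuples are hashable, so they work as dictionary keys
cell_count = {(0, 1, 2): 42}        # spatial grid cell -> particle count
print(cell_count[(0, 1, 2)])        # direct O(1) average-case lookup

# Lists are mutable, hence unhashable - using one as a key fails loudly
try:
    cell_count[[0, 1, 2]] = 42
except TypeError as e:
    print(f"TypeError: {e}")
```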

TipQuick Reflection (1 minute)

Think of the last piece of code you wrote (in this course or elsewhere). Where did it spend most of its time: searching, looping, looking up values, or building results? That answer usually tells you which data structure matters most.

Ariane 5 (1996) - $370M lost: A 64-bit float was cast to 16-bit int without bounds checking. Integer overflow crashed the rocket 39 seconds after launch.1

Mars Pathfinder (1997) - Near mission loss: Priority inversion in a task queue caused system resets on Mars. JPL debugged it from 50 million miles away.2

Cassini (2006) - Data loss risk: Wrong buffer structure caused data loss during communication transitions. Fixed with a circular buffer (deque-like structure).3

Each disaster traces to a data structure choice. Bounds checking, queue design, buffer management - these “basic” concepts have billion-dollar consequences.


Building Intuition: Measuring Speed Empirically

Ready to discover something that will transform how you think about organizing data in your simulations? We’re going to measure the dramatic performance difference between different data structures. This empirical approach - measure first, understand why, then optimize - is exactly how computational scientists approach performance optimization.
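The timing table below came from a benchmark along these lines (a minimal sketch; the exact microseconds vary by machine, so treat the printed numbers as illustrative):

```python
import timeit

def compare_membership(size):
    """Time `x in collection` for a list vs. a set of the same data."""
    data_list = list(range(size))
    data_set = set(data_list)
    target = size - 1  # worst case for the list: the last element
    t_list = timeit.timeit(lambda: target in data_list, number=1000)
    t_set = timeit.timeit(lambda: target in data_set, number=1000)
    return t_list, t_set

for size in (100, 1_000, 10_000):
    t_list, t_set = compare_membership(size)
    # total seconds for 1000 checks * 1e3 = microseconds per check
    print(f"n={size:>6}: list {t_list*1e3:8.3f} us, "
          f"set {t_set*1e3:8.3f} us, speedup {t_list/t_set:6.1f}x")
```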

Membership testing: list (O(n)) vs set (O(1))
=======================================================
      Size    List (µs)     Set (µs)    Speedup
-------------------------------------------------------
       100         0.58         0.03         21x
     1,000         5.67         0.03        203x
    10,000        57.05         0.03      2,028x
   100,000       572.75         0.03     20,455x

As n grows, list time grows proportionally (O(n))
while set time stays nearly constant (O(1)).

For simulations with many membership checks,
this can mean the difference between minutes and hours!

This performance difference isn’t just academic - it determines whether your galaxy simulation can evolve for billions of years or gets stuck after a few million. The secret behind this magic is the hash table, which you’re about to understand completely.

Understanding Big-O Notation

Now that you’ve witnessed this dramatic performance difference empirically, let’s understand the mathematical pattern behind it. Big-O notation describes how an algorithm’s runtime scales with input size - it’s the language computational scientists use to discuss whether an algorithm is feasible for large-scale simulations.

Big-O describes how runtime grows with input size—not the exact speed—so constants, memory layout, and access patterns still matter.

Complexity Scaling for Physics Simulations:

| Complexity      | 1,000 particles | 1,000,000 particles   | Example                 |
|-----------------|-----------------|-----------------------|-------------------------|
| \(O(1)\)        | 1 op            | 1 op                  | Hash table lookup       |
| \(O(\log n)\)   | 10 ops          | 20 ops                | Tree traversal          |
| \(O(n)\)        | 1,000 ops       | 1,000,000 ops         | Direct summation        |
| \(O(n \log n)\) | 10,000 ops      | 20,000,000 ops        | Tree codes (GADGET)     |
| \(O(n^2)\)      | 1,000,000 ops   | 1,000,000,000,000 ops | All-pairs (impossible!) |
```python
#| label: fig-bigo
#| fig-cap: "Algorithm complexity scaling for particle simulations"

import matplotlib.pyplot as plt
import math

# Visualize how different algorithms scale for physics simulations
n = range(1, 101)
constant = [1 for _ in n]
logarithmic = [math.log2(x) if x > 0 else 0 for x in n]
linear = list(n)
nlogn = [x * math.log2(x) if x > 0 else 0 for x in n]
quadratic = [x**2 for x in n]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(n, constant, label='O(1) - Hash table lookup', linewidth=2)
ax.plot(n, logarithmic, label='O(log n) - Binary search', linewidth=2)
ax.plot(n, linear, label='O(n) - Linear scan', linewidth=2)
ax.plot(n, nlogn, label='O(n log n) - Tree codes', linewidth=2, linestyle='--')
ax.plot(n, quadratic, label='O(n^2) - All-pairs', linewidth=2)

ax.set_xlim(0, 100)
ax.set_ylim(0, 500)
ax.set_xlabel('Number of Particles (thousands)', fontsize=12)
ax.set_ylabel('Operations (arbitrary units)', fontsize=12)
ax.set_title('Algorithm Scaling: Why Data Structures Matter', fontsize=14)
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
For a million-particle simulation:
  O(1):       1 operation           - spatial hash lookup
  O(log n):   20 operations         - tree traversal
  O(n):       1,000,000 operations  - direct force sum
  O(n log n): 20,000,000 operations - tree codes
  O(n^2):      1,000,000,000,000 ops - all pairs (impossible!)

In astronomy: This enables galaxy simulations with billions of stars
In chemistry: This allows protein folding with millions of atoms
In climate: This permits global models with millions of grid cells

The table above reveals why algorithmic complexity matters for computational physics. The naive \(O(n^2)\) approach that checks all particle pairs becomes impossible beyond a few thousand particles. But with smart data structures like trees (\(O(n \log n)\)) or spatial hashing (\(O(n)\)), we can simulate entire galaxies, proteins, or climate systems!

Important💡 Computational Thinking: The Time-Space Tradeoff in Physics Simulations

This universal pattern appears throughout computational physics: trading memory for speed.

The Pattern:

  • Use more memory \(\to\) organize data cleverly \(\to\) enable faster computation
  • Use less memory \(\to\) simpler organization \(\to\) slower computation

Real Physics Examples:

  • Particle mesh methods: Store density on a grid (memory) to avoid \(O(n^2)\) particle interactions
  • Neighbor lists: Store nearby particles (memory) vs \(O(n^2)\) distance checks at every step
  • Tree codes: Store spatial hierarchy (memory) for \(O(n \log n)\) force calculation
  • Lookup tables: Store pre-computed values vs recalculating expensive functions
  • FFT methods: Store complex coefficients for \(O(n \log n)\) vs \(O(n^2)\) convolution

In astronomy, the GADGET cosmology code uses 200 bytes per particle (vs 24 for just position/velocity) to achieve O(n log n) scaling, enabling billion-particle dark matter simulations. This 8\(\times\) memory cost enables 1000\(\times\) speedup!

4.2 Lists: Python’s Workhorse Sequence

Lists are Python’s most versatile data structure - perfect for particle arrays, time series data, or any ordered collection. But here’s something crucial for scientific computing: Python lists don’t store your numbers directly! Understanding this explains why NumPy arrays are so much faster for numerical work.

How Lists Really Work in Memory

Let’s explore how Python actually stores your simulation data in memory. This knowledge will help you understand when to use Python lists versus when you need NumPy arrays.
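The byte counts below can be measured with `sys.getsizeof` (a minimal sketch; exact sizes vary across Python versions and platforms, so treat the printed values as representative):

```python
import sys

# A single particle position stored as a Python list of floats
position = [1.5e10, 2.3e10, 0.8e10]

container = sys.getsizeof(position)                  # the list object itself
contents = sum(sys.getsizeof(x) for x in position)   # the float objects it points to

print(f"List container: {container} bytes")
print(f"Each coordinate: {sys.getsizeof(position[0])} bytes")
print(f"Total: {container + contents} bytes for 3 floats")
print(f"Raw float64 storage would need: {3 * 8} bytes")
```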

List container: 88 bytes
Each coordinate: 24 bytes
Important⚠️ sys.getsizeof() Measures Shallow Size Only

sys.getsizeof() returns the memory used by the container object itself, not including the objects it contains. For nested structures, use recursive measurement:

# Shallow: just the list container
sys.getsizeof(position)  # ~80 bytes

# Deep: container + all contained objects
total = sys.getsizeof(position) + sum(sys.getsizeof(x) for x in position)

For accurate deep memory profiling, use pympler.asizeof.asizeof() or write a recursive helper. This distinction is critical when optimizing memory for large simulations.

Total: 160 bytes for 3 floats!
That's 6.7x more than raw floats would need!

Why so much memory?
- List stores pointers, not values
- Each float is a full object with type info
- Python must track reference counts

This is why NumPy arrays (Chapter 7) store raw values contiguously!

Here’s what’s actually happening in memory when you store particle positions:

Python List of Positions:          Objects in Memory:
┌─────────────────┐               ┌──────────────────┐
│ size: 3         │               │ float: 1.5e10    │
│ capacity: 4     │          ┌───>│ type info        │
├─────────────────┤          │    │ ref count        │
│ pointer ────────┼──────────┘    └──────────────────┘
│ pointer ────────┼──────────────>┌──────────────────┐
│ pointer ────────┼──────┐        │ float: 2.3e10    │
│ (unused)        │    │          └──────────────────┘
└─────────────────┘    └────────>┌──────────────────┐
                                  │ float: 0.8e10    │
                                  └──────────────────┘

NumPy Array (preview):
┌────┬────┬────┐
│1.5 │2.3 │0.8 │  <- Raw values, no pointers!
└────┴────┴────┘

List Operations: Performance for Simulations

Different list operations have vastly different costs - critical knowledge when processing particle data or building spatial data structures.
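The timings below came from a comparison like this (a sketch; absolute numbers depend on your machine, but the ratio is what matters):

```python
import timeit

n = 10_000

def append_end():
    """Grow a list by appending at the end - amortized O(1) per append."""
    data = []
    for i in range(n):
        data.append(i)

def insert_front():
    """Grow a list by inserting at index 0 - O(n) element shift per insert."""
    data = []
    for i in range(n):
        data.insert(0, i)

t_end = timeit.timeit(append_end, number=5)
t_front = timeit.timeit(insert_front, number=5)
print(f"Adding at END:   {t_end*1e3:8.2f} ms")
print(f"Adding at START: {t_front*1e3:8.2f} ms ({t_front/t_end:.0f}x slower)")
```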

Adding particle at END:   29.42 microseconds
Adding particle at START: 58.25 microseconds

Beginning is 2x slower!

Why? Inserting at the beginning shifts ALL particles in memory!
For particle systems, use collections.deque if you need
fast operations at both ends (e.g., boundary conditions).
Tip💡 Performance Tip: When Both Ends Matter

If your simulation needs fast operations at both ends (like particles entering/leaving through boundaries), use collections.deque:

```python
from collections import deque
particles = deque(maxlen=1000000)  # Efficient at both ends!
```

This is particularly useful in molecular dynamics where particles cross periodic boundaries, or in astronomical simulations tracking objects entering and leaving a field of view.

List Growth Strategy: Understanding Dynamic Arrays

Watch how Python manages memory as your particle system grows. This pattern appears in many languages and understanding it helps you write efficient simulation codes.

Watch Python's list growth strategy:
(Critical for understanding simulation performance)
Length -> Capacity (overallocation)
----------------------------------------
   1 -> larger allocation
   5 -> larger allocation
   9 -> larger allocation

Watch list grow with overallocation:
Length -> Size change
------------------------------
Length    1 -> +  32 bytes
Length    5 -> +  32 bytes
Length    9 -> +  64 bytes
Length   17 -> +  64 bytes

Key insight: Python overallocates to make future
appends O(1) amortized. Exact growth factor is
implementation-dependent (~1.125x in CPython).

This growth pattern is why appending to lists is “amortized O(1)” - usually fast, occasionally slow when reallocation happens. For time-critical simulations, pre-allocate your arrays when particle count is known!
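The overallocation measurements shown above can be reproduced with a sketch like this (exact byte jumps are a CPython implementation detail and vary by version):

```python
import sys

data = []
last_size = sys.getsizeof(data)
print("Length -> Size change")
for i in range(20):
    data.append(i)
    size = sys.getsizeof(data)
    if size != last_size:
        # A reallocation happened: Python grabbed extra capacity
        # so the next several appends need no new memory
        print(f"Length {len(data):>3} -> +{size - last_size:>4} bytes")
        last_size = size
```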

amortized Average cost over many operations, even if individual operations vary

4.3 Tuples: The Power of Immutability

immutable Objects whose state cannot be modified after creation

What if you need to guarantee that initial conditions or physical constants won’t change during your simulation? Enter tuples - Python’s immutable sequences. This isn’t a limitation - it’s protection against an entire category of bugs that plague scientific codes!

Warning⚠️ Common Bug Alert: The Accidental Modification Disaster

One of the most insidious bugs in computational physics happens when functions unexpectedly modify their inputs. Remember the defensive programming principles from Chapter 1? Here’s why they matter:

```python
# THE BUG THAT CORRUPTS YOUR SIMULATION:
def evolve_system(state, dt):
    # Trying to be clever, modifying in-place
    state[0] += state[1] * dt  # Modifies the caller's list!
    state[1] += calculate_force(state[0]) * dt
    return state

initial_state = [1.0, 1.0]  # Position, velocity
state_t1 = evolve_system(initial_state, 0.01)
print(initial_state)  # [1.01, ...] - initial conditions corrupted!

# THE FIX: Use tuples to prevent modification
initial_state = (1.0, 1.0)  # Tuple!
# Now any modification attempt raises a TypeError immediately!
```
This connects directly to Chapter 1’s defensive programming: catch errors early, fail loudly!

Understanding Immutability in Physics Simulations

Let’s see how immutability protects your simulations and enables powerful optimizations:
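The trace below was produced by code along these lines (a minimal sketch; storing G in CGS units is the illustrative choice here):

```python
# Mutable list: a "constant" that gets corrupted silently
constants = [6.674e-8, 3.0e10]   # [G, c] in CGS units
constants[0] *= 1000              # oops - no error raised, G is now wrong
print(f"G was: 6.674e-08, now: {constants[0]:.3e}")

# Immutable tuple: the same bug fails loudly, right where it happens
CONSTANTS = (6.674e-8, 3.0e10)
try:
    CONSTANTS[0] = 6.674e-5
except TypeError as e:
    print(f"Caught bug immediately: {e}")
```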

With mutable list:
  G was: 6.674e-08, now: 6.674e-05
  Constants CORRUPTED!

With immutable tuple:
  Caught bug immediately: 'tuple' object does not support item assignment

Lesson: Immutability turns silent corruption into immediate, loud errors!

Tuples as Dictionary Keys: Caching Expensive Calculations

Here’s where immutability becomes a superpower for computational physics - tuples can be dictionary keys, enabling powerful memoization patterns for expensive calculations.

Warning⚠️ Caching with Floats is Dangerous

Never use raw floating-point values as cache keys! Tiny representation differences (e.g., 1.0000000001 vs 1.0) cause cache misses. Always use:

  • Discrete indices: (T_index, rho_index)
  • Rounded values: (round(T, -3), round(rho, 6))
  • Grid bin IDs: (T_bin, rho_bin, composition_id)
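The EOS lookup trace below comes from a caching pattern like this (a sketch: `expensive_eos` is a made-up stand-in for the real calculation, and the bin indices are illustrative):

```python
calls = 0

def expensive_eos(T_bin, rho_bin):
    """Stand-in for a costly equation-of-state calculation."""
    global calls
    calls += 1
    # Ideal-gas-like placeholder formula, NOT a real EOS
    return 1.38e-16 * (T_bin * 1e6) * (rho_bin * 1e-3) / 1.67e-24

cache = {}

def pressure(T_bin, rho_bin):
    key = (T_bin, rho_bin)   # tuple of discrete bin indices - a safe dict key
    if key not in cache:
        cache[key] = expensive_eos(T_bin, rho_bin)
    return cache[key]

p1 = pressure(3, 7)   # computed
p2 = pressure(3, 7)   # cache hit - no recomputation
p3 = pressure(4, 7)   # new point, computed
print(f"Total expensive calculations: {calls}")  # 2, not 3
```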
EOS lookups during stellar evolution:
---------------------------------------------
First lookup:  P = 8.88e+11 dyne/cm^2
Second lookup: P = 8.88e+11 dyne/cm^2 (cache hit!)
New point:     P = 9.77e+11 dyne/cm^2

Total expensive calculations: 2
Cache hits avoid recalculation entirely!

This pattern is fundamental to stellar evolution codes like MESA, which use pre-computed opacity and equation of state tables with discrete grid indices for O(1) lookup.

Tip🤔 Check Your Understanding

You’re storing 1 million particle positions. Calculate the memory difference between:

  1. List of lists: [[x, y, z] for _ in range(1_000_000)]
  2. List of tuples: [(x, y, z) for _ in range(1_000_000)]
  3. Single flat list: [x1, y1, z1, x2, y2, z2, ...]

Which is most memory efficient? Which is easiest to use?

Solution: Tuples save ~10% memory over lists. Single flat list is most memory efficient but harder to work with. This is why NumPy uses flat arrays internally with views for usability!

Stellar evolution codes like MESA use pre-computed opacity and equation of state tables rather than calculating these values from first principles at every timestep4. The choice of data structure for storing and accessing these tables matters: O(1) dictionary lookups are faster than O(n) list searches, especially when accessed repeatedly throughout a simulation.


4.4 The Mutable vs Immutable Distinction

Time for one of Python’s most important concepts for scientific computing! Understanding mutability is the key to avoiding mysterious bugs where your simulation state changes unexpectedly. This connects directly to Chapter 1’s defensive programming principles - catching errors early prevents corrupted results.

Python’s Reference Model in Physics Simulations

Python doesn’t store values in variables - it stores references to objects. This has profound implications for simulation codes:
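The trace below comes from an experiment along these lines (a sketch; the variable names mirror the printed output, though the exact values printed may differ from this code):

```python
import copy

# Immutable: rebinding creates a NEW object, so the backup is safe
G = 6.674e-8
backup = G
print("Same object in memory?", G is backup)   # True
G = G * 10                                     # new float; backup unchanged
print(backup)                                  # 6.674e-08

# Mutable: both names point at the SAME list of lists
particles = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
alias = particles               # NOT a copy - just another name
particles[0][0] = 999.0
print(alias[0][0])              # 999.0 - your "backup" changed too!

# The fix for nested simulation state: an independent deep copy
particles = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
backup = copy.deepcopy(particles)
particles[0][0] = 999.0
print(backup[0][0])             # 1.0 - truly independent
```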

With IMMUTABLE objects (safe for constants):
Initially: G=6.674e-08, backup=6.674e-08
Same object in memory? True
After change: G=6.674e-08, backup=6.674e-07
Original constant is safe!

==================================================
With MUTABLE objects (dangerous for state!):
Initially: same object? True
Original particles: [[999.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
'Backup': [[999.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
You just corrupted your simulation state!

Lesson: Use copy.deepcopy() for simulation state backups!

On August 1, 2012, Knight Capital Group lost $440 million in 45 minutes due to a software deployment error that created unintended feedback loops in their trading system.

According to SEC filings and news reports, Knight was updating their trading software to handle a new NYSE order type. The deployment went to 7 of their 8 servers correctly, but one server retained old test code from 2003. When the market opened, this server began processing live orders using the old “Power Peg” test functionality, which was designed to continuously buy and sell stocks for testing purposes.

The catastrophic interaction occurred because:

  • Both old and new code read from the same order queue (shared data structure)
  • The old code interpreted a flag differently than the new code
  • No validation existed to detect incompatible code versions
  • The shared state created a feedback loop of unintended trades

In 45 minutes, the malfunctioning system:

  • Executed over 4 million unintended trades
  • Accumulated $7 billion in unwanted positions
  • Generated 10% of total US market volume that morning
  • Lost $10 million per minute

This disaster illustrates critical data structure principles:

  • Shared mutable state between incompatible systems is dangerous
  • Queue structures without version validation can corrupt workflows
  • Missing defensive checks on shared data structures cascade failures
  • A single configuration dictionary misinterpretation can destroy a company

Knight Capital, once the largest US equity trader, was acquired within a week for a fraction of its previous value. The lesson: when multiple systems share data structures, defensive programming and validation aren’t optional—they’re essential.5


The Classic Mutable Default Argument Bug

This bug is so dangerous that Python linters specifically check for it. It’s particularly insidious in iterative simulations:
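The runs shown below come from code along these lines (a minimal sketch; `record_energy` and the energy values are illustrative):

```python
# BUGGY: the default list is created ONCE, at function definition time
def record_energy_buggy(energy, history=[]):
    history.append(energy)
    return history

run1 = record_energy_buggy(100)
run1 = record_energy_buggy(95)
run2 = record_energy_buggy(200)   # "independent" run...
print(run2)                       # [100, 95, 200] - coupled to run 1!

# THE FIX: the None sentinel pattern creates a fresh list per call
def record_energy(energy, history=None):
    if history is None:
        history = []
    history.append(energy)
    return history

run3 = record_energy(95)
run4 = record_energy(200)
print(run4)                       # [200] - runs are independent
```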

Run 1:
  Energy history: [100, 95]

Run 2 (should be independent):
  Energy history: [100, 95, 200]
  Contains Run 1 data! Runs are coupled!
THE FIX: Use None sentinel pattern

Run 3:
  Energy history: [95]

Run 4 (properly independent):
  Energy history: [200]
  Runs are independent!

Shallow vs Deep Copies in Grid Simulations

This distinction is critical for grid-based simulations in computational fluid dynamics, stellar atmospheres, or any field-based calculation:
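The heat-diffusion trace below can be reproduced with a sketch like this (the 3x3 grid and temperatures are illustrative):

```python
import copy

# A tiny 2D temperature grid (degC): hot center, cool edges
grid = [[20.0, 20.0, 20.0],
        [20.0, 50.0, 20.0],
        [20.0, 20.0, 20.0]]

# SHALLOW copy: new outer list, but it aliases the SAME inner row lists
shallow = list(grid)
shallow[1][1] = 35.0
print("After shallow-copy update, original center:", grid[1][1])  # 35.0 - corrupted!

# Reset, then DEEP copy: fully independent rows
grid[1][1] = 50.0
deep = copy.deepcopy(grid)
deep[1][1] = 35.0
print("After deep-copy update, original center:", grid[1][1])  # 50.0 - untouched
print("Next step grid center:", deep[1][1])                    # 35.0
```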

Simulating heat diffusion on a grid:
Initial grid: [20.0, 50.0, 20.0]

--- SHALLOW COPY (aliases inner arrays) ---
Original grid center: 35.0 degC
Next step grid center: 35.0 degC
Modified both grids! Simulation is corrupted!

--- DEEP COPY (independent arrays) ---
Original grid center: 50.0 degC
Next step grid center: 35.0 degC
Grids are independent! Simulation is correct!

Memory visualization:
Shallow: grid -> [ref1, ref2, ref3] -> same inner arrays
Deep:    grid -> [ref1, ref2, ref3] -> independent arrays
Tip💡 Professional Pattern: Ping-Pong Buffers

For iterative grid updates (heat equation, fluid dynamics), deepcopy is too slow. Use the ping-pong buffer pattern instead:

# Allocate TWO grids once at the start
size = 100
grid_a = [[0.0] * size for _ in range(size)]
grid_b = [[0.0] * size for _ in range(size)]

# Point current/next to the grids
current = grid_a
next_step = grid_b

# Simulation loop
for timestep in range(1000):
    # Compute next_step from current
    for i in range(1, size-1):
        for j in range(1, size-1):
            # Heat diffusion stencil
            next_step[i][j] = 0.25 * (
                current[i+1][j] + current[i-1][j] +
                current[i][j+1] + current[i][j-1]
            )

    # Swap references (O(1), no copying!)
    current, next_step = next_step, current

# current now points to the latest state

This swaps two pointers instead of copying millions of values - the canonical pattern in CFD, climate models, and any time-stepping simulation!

Note🐛 Debug This!

A student’s particle simulation is producing inconsistent results. Here’s their code:

```python
def update_particles(particles, forces=[]):
    """Update particle positions based on forces."""
    for i, particle in enumerate(particles):
        if i >= len(forces):
            forces.append(calculate_force(particle))
        particle['velocity'] += forces[i] * dt
        particle['position'] += particle['velocity'] * dt
    return particles, forces

# Simulation loop
particles = [{'position': [0,0,0], 'velocity': [1,0,0]} for _ in range(10)]
for timestep in range(100):
    particles, forces = update_particles(particles)
```

Find and fix THREE bugs related to mutable defaults and aliasing:

  1. The forces=[] default is created once and shared between calls
  2. The function modifies the original particles list (no defensive copy)
  3. The forces list grows unbounded, carrying over between timesteps

Solution:

```python
import copy

def update_particles(particles, forces=None):
    """Update particle positions based on forces."""
    if forces is None:
        forces = []
    particles = copy.deepcopy(particles)  # Defensive copy
    forces_new = []  # Fresh forces each timestep

    for particle in particles:
        force = calculate_force(particle)  # assumed to return [fx, fy, fz]
        forces_new.append(force)
        # Update component-wise: position/velocity are [x, y, z] lists
        for k in range(3):
            particle['velocity'][k] += force[k] * dt
            particle['position'][k] += particle['velocity'][k] * dt
    return particles, forces_new
```

4.5 Dictionaries: Average-Case O(1) Lookup for Physics

Now we explore one of computer science’s most elegant inventions - the dictionary! In computational physics, dictionaries are perfect for particle properties, lookup tables, and caching expensive calculations. You’re about to understand exactly how they achieve near-instantaneous lookups regardless of size.

Organizing Simulation Data with Dictionaries

Let’s see how dictionaries transform particle management in an N-body simulation:
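The lookups shown below can be organized with nested dictionaries (a sketch; the particle IDs and property fields are illustrative):

```python
# Each particle ID maps to a dictionary of physical properties
particles = {
    "particle_0001": {"mass": 1.99e33, "type": "star"},
    "particle_0002": {"mass": 5.97e27, "type": "planet"},
}

# O(1) average-case access by ID - no scanning required
p = particles["particle_0002"]
print(f"Mass: {p['mass']:.2e} g")
print(f"Type: {p['type']}")

# Group IDs by type for star/planet/dark-matter separation
by_type = {}
for pid, props in particles.items():
    by_type.setdefault(props["type"], []).append(pid)
print("Particles by type:", by_type)
```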

Accessing particle_0002:
  Mass: 5.97e+27 g
  Type: planet

Particles by type: {'star': ['particle_0001'], 'planet': ['particle_0002']}
In astronomy: Perfect for star/planet/dark matter separation

Understanding Hash Tables for Physics Applications

Let’s demystify how dictionaries achieve O(1) lookup - critical for lookup tables in equation of state calculations:

Warning⚠️ Average-Case vs Worst-Case Complexity

Dictionary and set operations are average-case O(1), not guaranteed O(1). In pathological cases with many hash collisions, lookups can degrade to O(n). This happens when:

  • Many keys hash to the same bucket (adversarial inputs)
  • Poor hash function design
  • Hash table needs resizing

For scientific computing, Python’s built-in hash functions work well for typical data. The O(1) claim holds in practice for nearly all physics applications.
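The hash-to-bucket trace below can be generated along these lines (a sketch; the bucket count of 128 is an illustrative assumption, and hash values differ between sessions because Python salts string hashes):

```python
keys = ["T=1e+06, rho=1e-03", "H_ionization", "opacity_table"]
n_buckets = 128   # a hash table maps hashes into a fixed number of slots

for key in keys:
    h = hash(key)              # step 1: hash the key -> integer
    bucket = h % n_buckets     # step 2: map to a bucket index
    print(f"{key:<22} -> {h:>20} -> bucket {bucket:>3}")
# step 3: direct array access at that bucket - O(1) on average
```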

How hash tables accelerate physics lookups:
============================================================
Key                  -> Hash        -> Bucket
------------------------------------------------------------
T=1e+06, rho=1e-03   -> 7060515190775938164 ->  64
T=2e+06, rho=2e-03   -> -3756423604195419920 ->  20
T=5e+06, rho=1e-02   -> -5145451484737341997 ->  97
H_ionization         -> 3730082394818695734 ->  34
He_ionization        -> -4861540764280204506 ->   6
opacity_table        -> -3903449810662890793 ->  93

The O(1) lookup process:
1. Hash the key -> integer
2. Map to bucket index
3. Direct array access!

This is why equation of state tables are fast!
NoteNote: Hash Values Change Between Python Sessions

Python salts string hashes for security. Running this code multiple times will show different hash values and bucket assignments - that’s expected! The key insight is that the process (hash \(\to\) bucket \(\to\) O(1) access) remains constant, even if specific values change.

Dictionary Performance for Particle Lookups

Time to see the dramatic performance difference for particle system queries:

```python
#| eval: false

# Stage 1: Create test data (10 lines)
import time
import random

# Seed for reproducibility - critical for scientific computing!
rng = random.Random(42)

# Create a large particle system
n_particles = 100_000  # Reduced for reasonable runtime
print(f"Creating system with {n_particles:,} particles...")

# List approach (what beginners might try)
particle_list = [(f"particle_{i:07d}", rng.uniform(1e27, 1e30))
                 for i in range(n_particles)]
```

```python
#| eval: false

# Stage 2: Create dictionary version (8 lines)
# Reset RNG for same masses
rng = random.Random(42)

# Dictionary approach (professional solution)
particle_dict = {f"particle_{i:07d}": rng.uniform(1e27, 1e30)
                 for i in range(n_particles)}

# Search for a specific particle
target = "particle_0050000"

print(f"Finding {target} among {n_particles:,} particles...")
```

```python
#| eval: false

# Stage 3: Time the lookups (15 lines)
# Time list search - O(n)
start = time.perf_counter()
for pid, mass in particle_list:
    if pid == target:
        mass_list = mass
        break
list_time = time.perf_counter() - start

# Time dict lookup - O(1)
start = time.perf_counter()
mass_dict = particle_dict[target]
dict_time = time.perf_counter() - start

print(f"\nList search: {list_time*1000:.3f} ms")
print(f"Dict lookup: {dict_time*1000:.6f} ms")
print(f"Dictionary is {list_time/dict_time:,.0f}x faster!")
print("\nFor astronomical catalogs with millions of objects:")
print("This difference determines feasibility!")
```

Sample output (from 100,000 particles on typical hardware):

Finding particle_0050000 among 100,000 particles...

List search: 2.847 ms
Dict lookup: 0.000312 ms
Dictionary is 9,125x faster!

For astronomical catalogs with millions of objects:
This difference determines feasibility!
Important💡 Computational Thinking: Memoization in Physics Calculations

This caching pattern dramatically speeds up physics codes:

```python
from functools import lru_cache

@lru_cache(maxsize=10000)
def equation_of_state(temperature, density):
    # Expensive calculation cached automatically
    return complex_eos_calculation(temperature, density)
```

Real physics applications:

  • Nuclear reaction rates: Cache temperature-dependent rates
  • Opacity calculations: Cache (T, \(\rho\), composition) lookups
  • Molecular dynamics: Cache pairwise interaction potentials
  • Climate models: Cache radiative transfer calculations

4.6 Sets: Mathematical Operations for Particle Systems

Sets are perfect for tracking unique particles, finding neighbors, and performing mathematical operations on particle groups. They provide O(1) membership testing plus elegant set operations - ideal for computational physics!

Set Operations in Particle Dynamics
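The domain-decomposition bookkeeping shown below can be done with set operators (a sketch; the particle IDs and domain assignments are illustrative):

```python
# Particle IDs owned by two domains in a domain decomposition
domain_a = {"p001", "p002", "p003", "p004", "p005"}
domain_b = {"p003", "p004", "p005", "p006", "p007", "p008", "p009"}

shared = domain_a & domain_b         # intersection: particles in both domains
interior_a = domain_a - domain_b     # difference: particles only A owns
all_particles = domain_a | domain_b  # union: every unique particle

print("Particles in multiple domains:", sorted(shared))
print("Interior particles (A):", sorted(interior_a))
print("Total unique particles:", len(all_particles))
```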

Boundary particles needing communication: {'p004', 'p007', 'p003', 'p005'}
Interior particles (A): {'p002', 'p001'}
Total unique particles: 9
Particles in multiple domains: {'p004', 'p007', 'p003', 'p005'}

Sets make domain decomposition bookkeeping trivial!
Used in: MPI parallel codes, astronomical survey overlaps
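The results above can be reproduced with a short sketch along these lines (the particle IDs and the two-domain split are illustrative assumptions):

```python
# Two spatial domains from a toy decomposition (IDs are illustrative)
domain_a = {'p001', 'p002', 'p003', 'p004', 'p005', 'p007'}
domain_b = {'p003', 'p004', 'p005', 'p006', 'p007', 'p008', 'p009'}

boundary = domain_a & domain_b        # particles needing communication
interior_a = domain_a - domain_b      # particles fully inside domain A
all_particles = domain_a | domain_b   # union across both domains

print("Boundary particles needing communication:", boundary)
print("Interior particles (A):", interior_a)
print("Total unique particles:", len(all_particles))
```

The `&`, `-`, and `|` operators map directly onto the mathematical set operations, which is why the bookkeeping stays one line per question.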
TipNeed a Set as a Dictionary Key?

Regular sets are mutable (unhashable), so they can’t be dict keys. Use frozenset for an immutable, hashable set:

# Track which particle combinations we've processed
processed = {}
pair = frozenset({'p001', 'p002'})  # Order doesn't matter!
processed[pair] = True

# These are the same pair:
frozenset({'p001', 'p002'}) == frozenset({'p002', 'p001'})  # True

Performance: Sets vs Lists for Neighbor Finding

Finding neighbors among 100,000 particles...

Checking 100 potential neighbors:
List approach: 0.08 ms
Set approach:  0.02763 ms
Set is 3x faster!
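A minimal sketch of the kind of timing harness behind those numbers (population size and candidate count are illustrative; absolute times depend on hardware):

```python
import random
import time

# Build a large particle population
ids = [f"particle_{i:06d}" for i in range(100_000)]
id_list = ids
id_set = set(ids)

candidates = random.sample(ids, 100)   # 100 potential neighbors to check

start = time.perf_counter()
hits_list = [c in id_list for c in candidates]   # O(n) scan per check
list_time = time.perf_counter() - start

start = time.perf_counter()
hits_set = [c in id_set for c in candidates]     # O(1) hash probe per check
set_time = time.perf_counter() - start

print(f"List approach: {list_time*1000:.2f} ms")
print(f"Set approach:  {set_time*1000:.5f} ms")
```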
Tip✅ Check Your Understanding

Why is checking particle membership O(n) with lists but O(1) with sets?

Think about the search process…

Answer: Lists must check each element sequentially until finding a match (or reaching the end) - this is O(n) linear time. Sets use hash tables: they compute a hash of the particle ID and jump directly to where it would be stored - this is O(1) constant time.

For a million-particle simulation, this is the difference between microseconds and seconds per lookup. When you’re checking thousands of particle interactions per timestep, this performance difference determines whether your simulation finishes in hours or weeks!

4.7 Memory and Performance for Scientific Computing

Now that we understand how lists, dictionaries, and sets behave algorithmically, let’s examine their actual memory footprint and cache-performance implications. Understanding memory layout is crucial for scientific computing: it explains why specialized numerical libraries are so much faster than pure Python, and it will help you write cache-efficient simulation codes.

Memory Usage in Particle Systems

Let’s examine memory costs for different data structures in a particle simulation context:

Storing 10,000 particle IDs:
  List: 85,176 bytes
  Set:  524,504 bytes
100 particles as list: 920 bytes
With spatial hash: ~4,688 bytes
Extra 3,768 bytes enables O(1) neighbor finding!
Memory for 1,000 particles (ID, x, y, z):
-------------------------------------------------------
List of lists     :    8,856 bytes (container only!)
List of tuples    :    8,856 bytes (container only!)
Dictionary        :   36,952 bytes (container only!)
NumPy .nbytes     :   32,000 bytes (actual data!)
NumPy getsizeof   :   32,128 bytes (header only!)

⚠️  getsizeof() is SHALLOW - it misses contained objects!
   NumPy's .nbytes gives the actual data buffer size.
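A stdlib-only sketch of these measurements follows; exact byte counts vary by Python version and platform, so treat the printed numbers as indicative:

```python
import sys

ids = [f"particle_{i:05d}" for i in range(10_000)]

as_list = list(ids)
as_set = set(ids)
as_dict = {pid: 0.0 for pid in ids}

# getsizeof is SHALLOW: it measures only the container's own bookkeeping,
# not the strings and floats the container references
print(f"List: {sys.getsizeof(as_list):,} bytes")
print(f"Set:  {sys.getsizeof(as_set):,} bytes")
print(f"Dict: {sys.getsizeof(as_dict):,} bytes")

# A rough deep size adds each contained object once
deep_list = sys.getsizeof(as_list) + sum(sys.getsizeof(s) for s in ids)
print(f"List + string contents: {deep_list:,} bytes")
```

For NumPy arrays, skip `getsizeof` entirely and read `arr.nbytes`, which reports the actual data buffer.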

Cache Efficiency in Grid Computations

Modern CPUs are fast, but memory access is slow. Understanding cache efficiency is critical for grid-based simulations:

Simulating heat diffusion with finite differences:
(Common in CFD and atmospheric modeling)
Processing 500x500 grid:
Row-major (cache-friendly):   16.6 ms
Column-major (cache-hostile): 19.0 ms
Ratio: 1.1x

⚠️  Important caveat about Python nested lists:
Each row is a SEPARATE list object in memory.
Only the row's pointers are contiguous, not the data.

In NumPy with contiguous arrays, this effect is 10-100x!
That's why memory layout matters so much in CFD/climate codes.
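A sketch of the traversal comparison on nested lists (the grid size matches the output above; timings will differ on your hardware):

```python
import time

N = 500
grid = [[1.0] * N for _ in range(N)]   # each row is a separate list object

# Row-major: walk each row's contiguous pointer array in order
start = time.perf_counter()
total_row = 0.0
for i in range(N):
    for j in range(N):
        total_row += grid[i][j]
row_time = time.perf_counter() - start

# Column-major: jump to a different row object on every access
start = time.perf_counter()
total_col = 0.0
for j in range(N):
    for i in range(N):
        total_col += grid[i][j]
col_time = time.perf_counter() - start

print(f"Row-major:    {row_time*1000:.1f} ms")
print(f"Column-major: {col_time*1000:.1f} ms")
```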
Warning⚠️ Common Bug Alert: The Iteration Modification Trap

Never modify a collection while iterating - a common bug in particle removal:

# THE BUG - Skips particles!
particles = [p1, p2, p3, p4, p5]
for p in particles:
    if p.escaped_boundary():
        particles.remove(p)  # WRONG! Skips next particle

# THE FIX 1: Iterate over copy
for p in particles.copy():
    if p.escaped_boundary():
        particles.remove(p)

# THE FIX 2: List comprehension (best for physics)
particles = [p for p in particles if not p.escaped_boundary()]

# THE FIX 3: Build removal list
to_remove = []
for p in particles:
    if p.escaped_boundary():
        to_remove.append(p)
for p in to_remove:
    particles.remove(p)

This bug has corrupted many molecular dynamics simulations!

4.8 Choosing the Right Structure for Physics

After exploring all options, how do you choose? Here’s your decision framework for computational physics:

Tip✅ Check Your Understanding

For each physics scenario, what data structure would you choose?

  1. Tracking unique particle IDs in collision detection \(\to\) ?
  2. Time series of energy measurements \(\to\) ?
  3. Caching expensive equation of state calculations \(\to\) ?
  4. Physical constants that must not change \(\to\) ?

Think before reading…

Answers:

  1. Set (unique particles, O(1) collision checks)
  2. List (ordered by time, allows duplicates)
  3. Dictionary (cache (T,\(\rho\)) \(\to\) pressure with O(1) lookup)
  4. Tuple or namedtuple (immutable constants, prevents accidents)

Performance Quick Reference for Physics

| Operation           | List | Tuple | Dict   | Set  |
|---------------------|------|-------|--------|------|
| Access by index     | O(1) | O(1)  | N/A    | N/A  |
| Search for particle | O(n) | O(n)  | O(1)†‡ | O(1) |
| Add particle        | O(1) | N/A   | O(1)   | O(1) |
| Remove particle     | O(n) | N/A   | O(1)   | O(1) |
| Memory (relative)   | 1\(\times\) | 0.9\(\times\) | 3\(\times\) | 3\(\times\) |
| Spatial ordering    | Yes  | Yes   | Yes*   | No   |

\* Dicts maintain insertion order (Python 3.7+)
† Average-case; worst-case O(n) with hash collisions
‡ Amortized - occasionally O(n) during resize

Real Example: Combining Data Structures

Let’s see how different data structures work together in a simple particle tracking system:

Total active mass: 3.0e+30 g
Number of measurements: 3
Constants are protected: type(CONSTANTS) = tuple
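A sketch that reproduces output like the above; the particle IDs, masses, and constant values are illustrative:

```python
# Dictionary: O(1) lookup of particle properties by ID
particles = {
    'p001': {'mass': 1.0e30, 'x': 0.0},
    'p002': {'mass': 1.0e30, 'x': 1.0},
    'p003': {'mass': 1.0e30, 'x': 2.0},
    'p004': {'mass': 5.0e29, 'x': 3.0},   # inactive - excluded from totals
}

# Set: which particles are still active (O(1) membership tests)
active = {'p001', 'p002', 'p003'}

# List: ordered time series of measurements
energy_history = [1.2e40, 1.1e40, 1.3e40]

# Tuple: constants that must never change mid-run (G, c in CGS)
CONSTANTS = (6.674e-8, 2.998e10)

total_active_mass = sum(particles[pid]['mass'] for pid in active)
print(f"Total active mass: {total_active_mass:.1e} g")
print(f"Number of measurements: {len(energy_history)}")
print(f"Constants are protected: type(CONSTANTS) = {type(CONSTANTS).__name__}")
```

Each structure does the one job it is best at; none of them could comfortably replace the others.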

Main Takeaways

You’ve just mastered concepts that will transform your computational physics from toy problems to research-grade simulations! The journey from understanding simple lists to architecting with dictionaries and sets represents a fundamental shift in how you approach scientific computing. You now see data structures not just as containers, but as carefully chosen tools where the right choice can mean the difference between simulations that finish in hours versus weeks.

The most profound insight from this chapter is that data structure choice often matters more than algorithm optimization. We saw how switching from a list to a dictionary for particle lookups gave us a speedup of nearly 10,000\(\times\) - no amount of tweaking the list-search code could achieve that! This is the secret that separates research codes from student projects: professionals spend more time architecting their data organization than writing physics equations. The best physics insight in the world is useless if your code can’t handle realistic problem sizes.

The mutable versus immutable distinction that seemed abstract at first is actually critical for scientific computing. Every time you use a tuple for physical constants or configuration parameters, you’re preventing bugs that have literally crashed spacecraft and corrupted published results. When you properly use defensive copying for your simulation state, you’re protecting yourself from the aliasing bugs that have plagued computational physics for decades. Remember: in the Cassini example, a simple data structure choice nearly lost a billion-dollar mission.

The performance principles you’ve learned extend far beyond Python. The cache efficiency concepts explain why codes like LAMMPS and GADGET obsess over memory layout - it’s not premature optimization, it’s the difference between simulating thousands versus millions of atoms. The Big-O notation you’ve mastered is the universal language for discussing whether an algorithm scales to galaxy-sized problems. The hash table concept underlying dictionaries appears in every parallel communication library, every adaptive mesh refinement code, and every spatial indexing scheme you’ll encounter.

Most importantly, you now understand the connection between data structures and numerical methods. That O(\(\mathrm{n}^{2}\)) all-pairs particle interaction that’s impossible for large systems? It becomes O(n log n) with tree-based structures. The O(n) searching that makes timesteps crawl? It becomes O(1) with spatial hashing. These aren’t just computer science concepts - they’re the difference between simulating a few hundred particles and modeling entire galaxies with billions of stars.

Remember: every data structure makes trade-offs. Lists give you ordering but slow searches. Dictionaries trade memory for lightning-fast lookups. Sets provide uniqueness and mathematical operations but lose ordering. There’s no universally “best” structure - only the best structure for your specific physics problem. As you tackle research simulations, you’ll combine multiple structures: NumPy arrays for number crunching, dictionaries for particle properties, sets for collision detection, and spatial data structures for neighbor finding.

With this foundation, you’re ready to architect simulations that scale to astronomical proportions. The next chapter on functions and modules will show you how to organize this knowledge into reusable, testable components. After that, NumPy will supercharge your numerical operations using the memory layout principles you now understand. You’re no longer just writing code - you’re engineering solutions to the computational challenges of modern physics!

Once you can reason about control flow (Chapter 3) and data layout (this chapter), you’re ready to start abstracting your work: writing functions and modules that hide complexity while preserving correctness and performance. That’s the goal of Chapter 5.

Definitions

Aliasing: When two or more variables refer to the same object in memory, causing modifications through one to affect the others.

Amortized O(1): An operation that is usually O(1) but occasionally O(n), where the average over many operations remains O(1).

Big-O Notation: Mathematical notation describing how an algorithm’s runtime scales with input size, focusing on the dominant term.

Cache: Small, fast memory close to the CPU that stores recently accessed data for quick retrieval.

Data Structure: A way of organizing data in computer memory to enable efficient access and modification.

Deep Copy: Creating a completely independent copy of an object and all objects it contains, recursively.

Dictionary: A mutable mapping type storing key-value pairs with average-case O(1) lookup time using hash tables. Worst-case O(n) with hash collisions.

Hash Function: A function mapping data of arbitrary size to fixed-size values, enabling fast lookups.

Hash Table: The underlying implementation for dictionaries and sets, enabling average-case O(1) lookups. Worst-case O(n) when many keys collide.

Immutable: Objects whose state cannot be modified after creation (tuples, strings, numbers).

List: Python’s built-in mutable sequence type that stores references to objects in order.

Memoization: Caching technique storing function results to avoid recalculating for the same inputs.

Mutable: Objects whose state can be modified after creation (lists, dictionaries, sets).

Named Tuple: A tuple subclass allowing element access by name as well as index.

O(1) - Constant Time: An operation whose runtime doesn’t depend on input size.

O(n) - Linear Time: An operation whose runtime grows proportionally with input size.

O(\(\mathrm{n}^{2}\)) - Quadratic Time: An operation whose runtime grows with the square of input size.

O(log n) - Logarithmic Time: An operation whose runtime grows logarithmically with input size.

O(n log n): Common complexity for efficient sorting algorithms and tree-based operations.

Reference: A variable that points to an object in memory rather than containing the value directly.

Set: A mutable collection of unique, unordered elements with average-case O(1) membership testing.

Shallow Copy: Creating a new container with references to the same contained objects.

Spatial Hashing: Organizing particles by spatial region for O(1) neighbor finding.

Tuple: An immutable sequence type that cannot be changed after creation.

Key Takeaways

Data structure choice can change performance by factors of 1,000,000\(\times\) or more for large-scale simulations

Lists are versatile but have O(n) search - use for ordered particle arrays that you’ll vectorize with NumPy

Dictionaries and sets provide O(1) lookup through hash tables - essential for particle lookups and caching

Tuples prevent modification bugs - use for physical constants and configuration parameters

✓ The shallow vs deep copy distinction is critical for grid-based simulations

✓ Python stores references to objects, explaining why NumPy’s contiguous arrays are faster

✓ Memory layout affects cache performance by 2-10\(\times\) even in Python

✓ Every data structure trades something (memory, speed, flexibility) for something else

Mutable default arguments are dangerous - always use the None sentinel pattern (Chapter 1 callback)

✓ Spatial data structures transform O(\(\mathrm{n}^{2}\)) physics problems into O(n) or O(n log n)

Sets provide elegant mathematical operations for domain decomposition and particle tracking

Dictionaries enable memoization that can speed up equation of state calculations 100\(\times\)

Python Module & Method Reference

This reference section catalogs all Python modules, functions, and methods introduced in this chapter. Keep this as a quick lookup guide for your physics simulations.

Standard Library Modules

time module - High-resolution timing for performance measurement

import time
  • time.perf_counter() - Returns float seconds with highest available resolution
    • Use for timing code segments: start = time.perf_counter()
    • Always use for benchmarking (not time.time() which can go backwards!)

sys module - System-specific parameters (expanded from Chapter 1)

import sys
  • sys.getsizeof(object) - Returns shallow memory size of object in bytes
    • Measures container overhead only, not contained objects
    • For deep size: sum over contained objects recursively, or use pympler.asizeof.asizeof()
    • Use for comparing container overhead, not total memory

copy module - Create object copies with control over depth

import copy
  • copy.copy(x) - Creates shallow copy (new container, same contents)
  • copy.deepcopy(x) - Creates deep copy (recursively copies all contents)
    • Critical for grid simulations to avoid aliasing bugs
    • Use for simulation state backups

random module - Generate random numbers for Monte Carlo

import random
  • random.random() - Random float in [0.0, 1.0)
  • random.uniform(a, b) - Random float in [a, b]
  • random.choice(seq) - Random element from sequence
  • random.sample(population, k) - k unique random elements

json module - Save/load structured data

import json
  • json.dump(obj, file, indent=2) - Write object to file with formatting
  • json.load(file) - Read object from file
  • json.dumps(obj) - Convert object to JSON string
  • json.loads(string) - Parse JSON string to object

math module - Mathematical functions (review from Chapter 2)

import math
  • math.sqrt(x) - Square root (use for distances)
  • math.log(x) - Natural logarithm
  • math.log2(x) - Base-2 logarithm (for Big-O analysis)

Collections Module

collections module - Specialized container datatypes

from collections import OrderedDict, Counter, defaultdict, deque, namedtuple

OrderedDict - Dictionary that remembers insertion order

  • .move_to_end(key, last=True) - Move key to end (or beginning if last=False)
  • .popitem(last=True) - Remove and return (key, value) pair from end/beginning
  • Use for LRU caches in physics calculations

Counter - Dictionary subclass for counting hashable objects

  • Counter(iterable) - Create from iterable
  • .most_common(n) - Returns n most common (element, count) pairs
  • .update(iterable) - Add counts from iterable
  • Perfect for histogram analysis of particle properties

defaultdict - Dictionary with default value factory

  • defaultdict(list) - Missing keys default to empty list
  • defaultdict(int) - Missing keys default to 0
  • Eliminates KeyError in particle grouping

deque - Double-ended queue with O(1) operations at both ends

  • deque(maxlen=n) - Fixed-size queue (old elements dropped)
  • .append(x) / .appendleft(x) - Add to right/left end
  • .pop() / .popleft() - Remove from right/left end
  • Essential for boundary conditions in simulations
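For example, a fixed-size deque makes a rolling window of recent measurements trivial (the sample values are illustrative):

```python
from collections import deque

# Rolling window of the last 5 energy samples; old values drop automatically
window = deque(maxlen=5)
for step in range(8):
    window.append(float(step))   # appending past maxlen evicts the left end

print(list(window))              # the five most recent values survive
running_mean = sum(window) / len(window)
```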

namedtuple - Tuple subclass with named fields

Particle = namedtuple('Particle', ['id', 'mass', 'x', 'y', 'z'])
p = Particle(1, 1.0e30, 0, 0, 0)
print(p.mass)  # Access by name

Functools Module

functools module - Higher-order functions and operations

from functools import lru_cache

@lru_cache(maxsize=128) - Decorator for automatic memoization

@lru_cache(maxsize=1000)
def expensive_calculation(T, rho):
    return complex_physics_calculation(T, rho)
  • Caches function results automatically
  • maxsize=None for unlimited cache
  • .cache_info() - Returns cache statistics
  • .cache_clear() - Clear the cache

Built-in Functions for Data Structures

Type Checking

  • type(obj) - Returns object’s type
  • isinstance(obj, type) - Check if obj is instance of type
  • id(obj) - Returns unique identifier (memory address)
  • obj1 is obj2 - Check if same object (identity)

Container Operations

  • len(container) - Number of elements
  • x in container - Membership test (O(1) for sets/dicts, O(n) for lists)
  • sorted(iterable) - Returns new sorted list
  • reversed(sequence) - Returns reverse iterator
  • enumerate(iterable, start=0) - Returns (index, value) pairs
  • zip(iter1, iter2, ...) - Parallel iteration

Copying

  • list(iterable) - Create new list from iterable
  • tuple(iterable) - Create new tuple from iterable
  • set(iterable) - Create new set from iterable
  • dict(pairs) - Create dictionary from (key, value) pairs

Data Structure Methods Summary

List Methods (Expanded)

  • .append(x) - Add to end (amortized O(1))
  • .extend(iterable) - Add all elements from iterable
  • .insert(i, x) - Insert at position i (O(n))
  • .remove(x) - Remove first x (O(n))
  • .pop(i=-1) - Remove and return element at i
  • .clear() - Remove all elements
  • .index(x) - Find position of x (O(n))
  • .count(x) - Count occurrences of x
  • .sort() - Sort in-place
  • .reverse() - Reverse in-place
  • .copy() - Create shallow copy

Dictionary Methods (Expanded)

  • .get(key, default=None) - Safe access with default
  • .setdefault(key, default) - Get or set with default
  • .pop(key, default) - Remove and return value
  • .popitem() - Remove and return arbitrary (key, value)
  • .update(other) - Update with other dict/pairs
  • .keys() - View of keys
  • .values() - View of values
  • .items() - View of (key, value) pairs
  • .clear() - Remove all items
  • .copy() - Shallow copy

Set Methods (Expanded)

  • .add(x) - Add element
  • .remove(x) - Remove element (raises KeyError if missing)
  • .discard(x) - Remove element (no error if missing)
  • .pop() - Remove and return arbitrary element
  • .clear() - Remove all elements
  • .union(other) or | - Elements in either set
  • .intersection(other) or & - Elements in both sets
  • .difference(other) or - - Elements in first but not second
  • .symmetric_difference(other) or ^ - Elements in either but not both
  • .update(other) or |= - Add elements from other
  • .intersection_update(other) or &= - Keep only elements in both
  • .difference_update(other) or -= - Remove elements in other
  • .symmetric_difference_update(other) or ^= - Keep elements in either but not both
  • .issubset(other) - Check if all elements in other
  • .issuperset(other) - Check if contains all elements of other
  • .isdisjoint(other) - Check if no elements in common
  • .copy() - Create shallow copy

Quick Reference Tables

Data Structure Operations

| Operation         | List       | Tuple      | Dict     | Set       |
|-------------------|------------|------------|----------|-----------|
| Create empty      | []         | ()         | {}       | set()     |
| Create with items | [1,2,3]    | (1,2,3)    | {'a':1}  | {1,2,3}   |
| Add item          | .append(x) | N/A        | d[k]=v   | .add(x)   |
| Remove item       | .remove(x) | N/A        | del d[k] | .remove(x)|
| Check membership  | x in list  | x in tuple | k in dict| x in set  |
| Get by index      | list[i]    | tuple[i]   | N/A      | N/A       |

Performance Quick Reference

| Operation             | List  | Tuple | Dict  | Set   | Deque |
|-----------------------|-------|-------|-------|-------|-------|
| Access by index       | O(1)  | O(1)  | N/A   | N/A   | O(n)  |
| Search for value      | O(n)  | O(n)  | O(1)* | O(1)  | O(n)  |
| Add to end            | O(1)† | N/A   | O(1)† | O(1)† | O(1)  |
| Add to beginning      | O(n)  | N/A   | O(1)† | O(1)† | O(1)  |
| Remove from end       | O(1)  | N/A   | N/A   | N/A   | O(1)  |
| Remove from beginning | O(n)  | N/A   | N/A   | N/A   | O(1)  |
| Remove specific       | O(n)  | N/A   | O(1)  | O(1)  | O(n)  |

\* Dict searches by key, not value
† Amortized - occasionally O(n) during resize

Common Methods for Physics

| Structure   | Method               | Physics Use Case         |
|-------------|----------------------|--------------------------|
| list        | .append(x)           | Add new particle         |
| list        | .extend(iter)        | Merge particle lists     |
| dict        | .get(k, default)     | Safe parameter lookup    |
| dict        | @lru_cache           | Automatic memoization    |
| set         | .union(other)        | Combine particle domains |
| set         | .intersection(other) | Find boundary particles  |
| deque       | .appendleft()        | Boundary conditions      |
| OrderedDict | .move_to_end()       | LRU cache management     |

When to Use Each Structure

| Physics Task          | Best Choice              | Why                           |
|-----------------------|--------------------------|-------------------------------|
| Particle positions    | List \(\to\) NumPy array | Vectorizable operations       |
| Particle lookup by ID | Dictionary               | O(1) access                   |
| Active particles      | Set                      | Fast membership, uniqueness   |
| Physical constants    | Tuple/NamedTuple         | Immutable, safe               |
| EOS cache             | Dictionary/LRU cache     | Fast lookup by (T,\(\rho\))   |
| Collision candidates  | Set                      | No duplicates, set operations |
| Time series           | List                     | Ordered, allows duplicates    |
| Spatial grid          | Dict of sets             | O(1) cell access              |
| Boundary particles    | Deque                    | Fast operations at both ends  |
| Particle counts       | Counter                  | Automatic histogram creation  |
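As a closing sketch, here is the “dict of sets” spatial grid from the table in miniature; the cell size, positions, and particle IDs are illustrative:

```python
from collections import defaultdict

CELL = 1.0   # cell size, in the same units as positions

def cell_of(x, y):
    """Map a position to its integer grid-cell key."""
    return (int(x // CELL), int(y // CELL))

# Dict of sets: O(1) access to the particles in any cell
grid = defaultdict(set)
positions = {'p001': (0.2, 0.3), 'p002': (0.7, 0.1), 'p003': (1.4, 0.2)}
for pid, (x, y) in positions.items():
    grid[cell_of(x, y)].add(pid)

# Neighbor candidates: only the 3x3 block of cells around a particle,
# instead of an O(n) scan over every particle in the simulation
cx, cy = cell_of(*positions['p001'])
candidates = set()
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        # .get avoids inserting empty sets into the defaultdict
        candidates |= grid.get((cx + dx, cy + dy), set())
candidates.discard('p001')
print(candidates)
```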

Next Chapter Preview

With data structures mastered, Chapter 5 will explore functions and modules - how to organize code for reusability, testing, and collaboration. You’ll learn how Python’s function model, with first-class functions and closure support, enables powerful patterns like decorators and functional programming techniques. These concepts prepare you for building modular physics engines, creating reusable analysis pipelines, and understanding the functional programming paradigm used in modern frameworks like JAX where functions transform into automatically differentiable computational graphs. Get ready to transform your scripts into professional-quality scientific libraries!

Footnotes

  1. Lions, J.L., et al. (1996). “Ariane 5 Flight 501 Failure Report.” ESA/CNES.

  2. Jones, M.B. (1997). “What Really Happened on Mars.” Microsoft Research.

  3. Anderson, Y.Z., et al. (2006). “Solving Cassini’s Data Glitch Problem.” NASA/JPL.

  4. Paxton et al. (2011). “Modules for Experiments in Stellar Astrophysics (MESA).” ApJS, 192, 3.

  5. “Knight Capital Says Trading Mishap Cost It $440 Million.” The New York Times, August 2, 2012. SEC Release No. 70694 (October 16, 2013).