Lattice Documentation

GPU-accelerated wind tunnel simulation using Lattice Boltzmann Methods. Runs on consumer GPUs, outputs drag and lift coefficients, and includes an ML surrogate for instant predictions.

Quick Start

Requires OpenGL 4.3+ and a GPU with compute shader support. Any discrete NVIDIA or AMD card from the last 8 years will work.

Build from source

terminal
# Dependencies (Ubuntu/Debian)
sudo apt install build-essential cmake pkg-config \
  libsdl2-dev libsdl2-ttf-dev mesa-common-dev \
  libgl1-mesa-dev libegl1-mesa-dev

# Build
cmake -B build -S simulation -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run interactively
cd simulation
../build/3d_fluid_simulation_car --wind=1.0

Headless render

terminal
mkdir -p frames
../build/3d_fluid_simulation_car \
  --wind=2.0 --duration=10 \
  --model=assets/3d-files/ahmed_25deg_m.obj \
  --output=frames --grid=256x128x128

Web frontend

terminal
cd website
npm ci && npm run dev
# Open http://localhost:3000

How It Works

Each frame runs five GPU compute passes in sequence (a dispatch sketch follows the list):

  1. Collision -- relax distribution functions toward equilibrium (BGK or regularized operator, with optional Smagorinsky SGS turbulence model)
  2. Streaming -- propagate distributions to neighboring cells
  3. Force computation -- accumulate drag/lift via momentum exchange at solid boundaries
  4. Particle advection -- move tracer particles through the velocity field
  5. Rendering -- draw particles as points or streamline trails
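
The sequence above maps to four compute dispatches plus a draw. Below is a minimal sketch of the per-frame driver; the program handle names and the particle work-group size of 256 are illustrative assumptions rather than the actual main.c code, and any GL 4.3 loader can stand in for GLEW.

sketch (C)
#include <GL/glew.h>

/* Compute programs built at startup from the shaders listed under Architecture. */
extern GLuint collide_prog, stream_prog, force_prog, particle_prog;

static void lbm_step(GLuint nx, GLuint ny, GLuint nz, GLuint num_particles)
{
    /* lbm_collide.comp uses local_size 8x8x8, so round the grid up to full groups */
    GLuint gx = (nx + 7) / 8, gy = (ny + 7) / 8, gz = (nz + 7) / 8;

    glUseProgram(collide_prog);                      /* 1. collision */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  /* writes visible to next pass */

    glUseProgram(stream_prog);                       /* 2. streaming */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glUseProgram(force_prog);                        /* 3. momentum-exchange force */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glUseProgram(particle_prog);                     /* 4. particle advection */
    glDispatchCompute((num_particles + 255) / 256, 1, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT |
                    GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

    /* 5. rendering follows as a regular draw call over the particle buffers */
}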

The LBM solver uses a D3Q19 lattice (19 velocity directions in 3D). Solid geometry is voxelized from OBJ meshes using ray casting. Boundary conditions are Zou-He velocity inlet, Zou-He pressure outlet, and Bouzidi interpolated bounce-back on solid walls. The Bouzidi scheme stores the fractional distance from each fluid cell to the actual mesh surface along every lattice link, then applies quadratic or linear interpolation during streaming. This gives second-order wall accuracy without refining the grid.
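
For the linear variant of the Bouzidi rule, the reflected distribution is a convex combination of post-collision values. The sketch below is the textbook formula written out in C; the actual implementation lives in lbm_stream.comp, and the quadratic variant adds one more upstream node.

sketch (C)
/* Linear Bouzidi bounce-back for one cut link.
 * q          fractional distance from the fluid node to the wall (0 < q < 1)
 * f_post     post-collision f_i at the boundary fluid node (pointing into the wall)
 * f_post_up  post-collision f_i at the upstream neighbour x - e_i
 * f_opp      post-collision opposite-direction distribution at the boundary node
 * Returns the incoming opposite-direction distribution after streaming. */
static float bouzidi_linear(float q, float f_post, float f_post_up, float f_opp)
{
    if (q < 0.5f)
        return 2.0f * q * f_post + (1.0f - 2.0f * q) * f_post_up;
    return f_post / (2.0f * q) + (1.0f - 1.0f / (2.0f * q)) * f_opp;
}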

Architecture

lattice/
  simulation/
    src/main.c           # Render loop, CLI, GL setup
    src/lbm.c            # LBM grid creation, force readback
    src/ml_predict.c     # ML inference forward pass (pure C)
    shaders/
      lbm_collide.comp   # BGK + Smagorinsky + BCs
      lbm_stream.comp    # Pull-based streaming
      lbm_force.comp     # Momentum exchange drag
      particle.comp      # Particle advection + collision
      trail_update.comp  # Streamline trail history
    lib/                 # Headers
    assets/              # OBJ models, fonts
  website/
    src/app/page.tsx     # Main simulator UI
    src/app/api/         # Next.js API routes
    src/components/      # React components
  ml/
    train.cpp            # Training driver
    data_gen.py          # Modal sweep for training data
    framework/           # C++ autodiff, layers, optimizer

CLI Reference

Flag                       Default        Description
-w, --wind=SPEED           1.0            Wind speed 0-5 (maps to Re)
-d, --duration=SECS        0              Headless render duration (0 = interactive)
-o, --output=PATH                         Output directory for PPM frames
-m, --model=PATH           car-model.obj  Path to OBJ mesh file
-a, --angle=DEG                           Ahmed body slant angle (25 or 35)
-g, --grid=XxYxZ           128x64x64      LBM grid resolution
-r, --reynolds=RE          0              Target Reynolds number (0 = derive from wind)
-s, --scale=S              0.05           Model scale factor
-S, --smagorinsky=CS       0.1            Smagorinsky constant (0-0.5)
-v, --viz=MODE             1              Visualization mode (0-7)
-c, --collision=MODE       2              Collision: 0=off, 1=AABB, 2=mesh, 3=voxel

Wind speed controls Reynolds number when --reynolds is not set: Re = 200 * windSpeed, capped by the grid's minimum stable tau (0.52). Larger grids support higher Re. Voxel collision (mode 3) requires LBM and will fall back to AABB if LBM is disabled.
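
A sketch of this mapping follows; the cap uses the Re_max relation given under "Grid size and Reynolds number" below, and the function and argument names are illustrative rather than the actual CLI code.

sketch (C)
/* Reynolds number implied by --wind, capped by the grid's stability limit.
 * u_lattice and body_cells are the lattice-unit inflow speed and body length. */
static float effective_reynolds(float wind_speed, float u_lattice, float body_cells)
{
    float re_requested = 200.0f * wind_speed;          /* Re = 200 * windSpeed */
    float nu_min = (0.52f - 0.5f) / 3.0f;              /* viscosity at tau = 0.52 */
    float re_max = u_lattice * body_cells / nu_min;    /* grid-limited ceiling */
    return re_requested < re_max ? re_requested : re_max;
}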

API Reference

The web frontend talks to a Modal GPU worker via a REST endpoint.

POST /api/render

request
{
  "wind_speed": 1.0,
  "duration": 10,
  "model": "ahmed25",
  "viz_mode": 1,
  "collision_mode": 2,
  "reynolds": 0
}
response
{
  "status": "complete",
  "video_url": "https://...",
  "cd_value": 0.295,
  "cl_value": -0.012,
  "cd_series": [0.31, 0.30, 0.295],
  "effective_re": 480,
  "grid_size": "256x128x128"
}

The worker runs on an NVIDIA T4 GPU via Modal. Cold starts take 30-60 seconds; warm starts respond in ~15 seconds for a 10-second simulation.

Visualization Modes

Mode  Key  Description
0     3    Depth -- distance-based coloring
1     4    Velocity magnitude -- jet colormap (blue to red)
2     5    Velocity direction -- RGB = normalized XYZ
3     6    Particle lifetime -- viridis colormap with fade
4     7    Turbulence -- cool-warm diverging map
5     8    Flow progress -- rainbow by x-position
6     9    Vorticity -- purple (laminar) to orange (turbulent)
7     V    Streamlines -- particle trails as line strips, 32-point history

Press V to cycle through modes in interactive mode. Streamline mode (7) stores a 32-frame position trail per particle and renders as GL_LINE_STRIP with velocity-based coloring and alpha fade.

Physics Details

Governing Equations

Fluid flow is governed by the Navier-Stokes equations for incompressible flow:

∂u/∂t + (u · ∇)u = −(1/ρ)∇p + ν∇²u
∇ · u = 0

where u is the velocity field, p is pressure, ρ is density, and ν is kinematic viscosity. The Reynolds number Re = UL/ν characterizes the flow regime. For automotive aerodynamics at highway speeds, Re reaches 10⁶ or higher, indicating fully turbulent flow.

Lattice Boltzmann Method

Rather than solving Navier-Stokes directly, the LBM evolves particle distribution functions on a D3Q19 lattice (3 dimensions, 19 velocities): 1 rest, 6 face-connected, and 12 edge-connected neighbors. The BGK collision operator relaxes distributions toward equilibrium:

f_i(x + e_i Δt, t + Δt) = f_i(x, t) − (1/τ)[f_i(x, t) − f_i^eq(x, t)]

The relaxation time τ controls viscosity: ν = c_s²(τ − ½)Δt. Lower τ means higher Re but less numerical stability. The minimum stable value is τ = 0.52.
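
Going the other way, a target Reynolds number fixes ν and hence τ. A small sketch, assuming u and l are the lattice-unit inflow speed and body length:

sketch (C)
/* Relaxation time for a target Reynolds number (lattice units, dt = 1):
 * nu = u*l/Re, and from nu = (tau - 1/2)/3 we get tau = 3*nu + 1/2. */
static float tau_from_reynolds(float re, float u, float l)
{
    float nu  = u * l / re;
    float tau = 3.0f * nu + 0.5f;
    return tau < 0.52f ? 0.52f : tau;   /* clamp to the stability floor */
}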

Equilibrium Distribution

The Maxwell-Boltzmann equilibrium distribution is:

f_i^eq = w_i ρ [1 + (e_i·u)/c_s² + (e_i·u)²/(2c_s⁴) − |u|²/(2c_s²)]

with weights w_0 = 1/3, w_1–6 = 1/18, w_7–18 = 1/36 and speed of sound c_s² = 1/3.
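
Spelled out in C, the same expression reads as follows; the equilibrium() helper called by the collision kernel below is assumed to be equivalent, with e and w holding the D3Q19 velocity set and weights.

sketch (C)
extern const float e[19][3];   /* D3Q19 lattice velocities */
extern const float w[19];      /* lattice weights: 1/3, 1/18, 1/36 */

static float equilibrium(int i, float rho, const float u[3])
{
    float eu = e[i][0]*u[0] + e[i][1]*u[1] + e[i][2]*u[2];   /* e_i . u */
    float uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];            /* |u|^2   */
    /* 1/c_s^2 = 3, 1/(2 c_s^4) = 4.5, 1/(2 c_s^2) = 1.5 */
    return w[i] * rho * (1.0f + 3.0f*eu + 4.5f*eu*eu - 1.5f*uu);
}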

Collision Kernel

lbm_collide.comp (simplified)
#version 430 core
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

uniform float tau;

void main() {
    uvec3 pos = gl_GlobalInvocationID;
    uint idx = pos.x + pos.y*NX + pos.z*NX*NY;

    // Compute density and velocity
    float rho = 0.0;
    vec3 u = vec3(0.0);
    for (int i = 0; i < 19; i++) {
        float fi = f[i][idx];
        rho += fi;
        u += fi * e[i];
    }
    u /= rho;

    // BGK collision
    float omega = 1.0 / tau;
    for (int i = 0; i < 19; i++) {
        float feq = equilibrium(i, rho, u);
        f[i][idx] -= omega * (f[i][idx] - feq);
    }
}

Smagorinsky SGS Model

For turbulent flows, the Smagorinsky subgrid-scale model computes a local eddy viscosity from the non-equilibrium stress tensor:

τ_eff = ½(τ + √(τ² + 18 C_s² |Π| / ρ))

where Cs is the Smagorinsky constant (default 0.1, tunable via --smagorinsky) and |Π| is the Frobenius norm of the non-equilibrium stress. This increases τ locally in turbulent regions, preventing instability without damping the entire domain.
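
As a sketch, the per-cell update implied by this formula looks like the function below; the variable names are placeholders, and the real code sits in lbm_collide.comp.

sketch (C)
#include <math.h>

/* Effective relaxation time from the Smagorinsky model.
 * tau      base relaxation time
 * cs_smag  Smagorinsky constant (--smagorinsky)
 * pi_norm  |Pi|, Frobenius norm of the non-equilibrium stress
 * rho      local density */
static float tau_effective(float tau, float cs_smag, float pi_norm, float rho)
{
    return 0.5f * (tau + sqrtf(tau * tau + 18.0f * cs_smag * cs_smag * pi_norm / rho));
}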

Drag Computation

Surface forces use the Mei-Luo-Shyy momentum exchange method, which pairs with the Bouzidi boundary to account for the actual wall position rather than assuming it sits on the grid midpoint:

F = Σ_i e_i [f_i*(x_f) + f_ī(x_f)]

where f_i* is the post-collision distribution heading toward the wall and f_ī is the Bouzidi-reflected distribution coming back. The drag and lift coefficients are then:

Cd = F_x / (½ ρ U² A)      Cl = F_y / (½ ρ U² A)
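
The final conversion from accumulated force to coefficients is a one-liner; the sketch below uses illustrative names, and the force readback itself happens in lbm.c.

sketch (C)
/* fx, fy   accumulated surface force components (lattice units)
 * rho, u   reference density and inflow speed
 * area     projected frontal area A of the body, in cells^2 */
static void force_to_coefficients(float fx, float fy, float rho, float u,
                                  float area, float *cd, float *cl)
{
    float q = 0.5f * rho * u * u * area;   /* dynamic pressure times area */
    *cd = fx / q;
    *cl = fy / q;
}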

ML Surrogate Model

A small MLP predicts Cd and Cl in under a millisecond, giving instant estimates before the LBM has time to converge. The entire training framework is written from scratch in C++ with reverse-mode autodiff, so there are no Python ML dependencies at inference time. The same model runs in the C simulation binary, in the browser via TypeScript, and in the evaluation pipeline.

Architecture

The network takes three z-score normalized inputs (wind speed, Reynolds number, model ID) and outputs two values (Cd, Cl). SwiGLU activations replace standard ReLU for smoother gradient flow.

Linear(3, 256) -> SwiGLU(256, 512) -> Linear(256, 128) -> SwiGLU(128, 256) -> Linear(128, 2)

525,698 parameters total
  fc1:  3 x 256  + bias     =   1,024
  act1: SwiGLU(256, 512)    = 393,216  (gate + up + down projections)
  fc2:  256 x 128 + bias    =  32,896
  act2: SwiGLU(128, 256)    =  98,304
  fc3:  128 x 2 + bias      =     258
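
A sketch of the two building blocks of the forward pass is below; the struct layout and row-major weight storage are assumptions, and the real framework/ml_predict.c may organize this differently. Chaining fc1 -> act1 -> fc2 -> act2 -> fc3 with these two primitives reproduces the architecture above.

sketch (C)
#include <math.h>

typedef struct { const float *w, *b; int in, out; } Linear;   /* y = W x + b, row-major W */
typedef struct { Linear gate, up, down; } SwiGLU;              /* down(swish(gate(x)) * up(x)) */

static void linear_forward(const Linear *l, const float *x, float *y)
{
    for (int o = 0; o < l->out; o++) {
        float acc = l->b ? l->b[o] : 0.0f;
        for (int i = 0; i < l->in; i++)
            acc += l->w[o * l->in + i] * x[i];
        y[o] = acc;
    }
}

static void swiglu_forward(const SwiGLU *s, const float *x, float *y,
                           float *tmp_gate, float *tmp_up)
{
    linear_forward(&s->gate, x, tmp_gate);
    linear_forward(&s->up,   x, tmp_up);
    for (int i = 0; i < s->gate.out; i++) {
        float g = tmp_gate[i];
        tmp_gate[i] = g / (1.0f + expf(-g)) * tmp_up[i];   /* swish(gate) * up */
    }
    linear_forward(&s->down, tmp_gate, y);
}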

Training

Training data comes from the LBM simulation itself. A parameter sweep over wind speeds, Reynolds numbers, and model geometries generates (input, Cd, Cl) pairs. The data generator (data_gen.py) submits render jobs to Modal, validates convergence, and writes a binary dataset.

Optimizer:    AdamW (lr=1e-3, weight_decay=1e-4, clip_norm=1.0)
Loss:         MSE on [Cd, Cl]
Batch size:   64
Epochs:       500 (early stopping, patience=50)
Train/val:    80/20 split
Data format:  binary -- 16-byte header + N records of 5 float32s
              [wind_speed, reynolds, model_id, cd, cl]
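
Each record is five float32s in the order listed above; a minimal reader might look like this. The 16-byte header is skipped wholesale here because its exact fields are not documented on this page.

sketch (C)
#include <stdio.h>

typedef struct {
    float wind_speed, reynolds, model_id, cd, cl;   /* one training record */
} TrainRecord;

static long read_records(const char *path, TrainRecord *out, long max_records)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    fseek(fp, 16, SEEK_SET);    /* skip the 16-byte header */
    long n = (long)fread(out, sizeof(TrainRecord), max_records, fp);
    fclose(fp);
    return n;
}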

Weight Format

Weights are stored in a custom binary format (LTWS) shared across all three inference targets. The normalizer file stores 3 means and 3 standard deviations as raw float32s.

model.bin:      LTWS header (magic + version + count) + 12 parameter tensors
model_norm.bin: 6 x float32 (mean[3], std[3])
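
The normalizer file is simple enough to show in full; loading it and z-scoring the three inputs might look like this (function names are illustrative).

sketch (C)
#include <stdio.h>

/* model_norm.bin: mean[3] followed by std[3], raw float32. */
static int load_normalizer(const char *path, float mean[3], float stddev[3])
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return 0;
    int ok = fread(mean,   sizeof(float), 3, fp) == 3 &&
             fread(stddev, sizeof(float), 3, fp) == 3;
    fclose(fp);
    return ok;
}

/* z-score the (wind speed, Reynolds, model id) input vector */
static void normalize_inputs(const float raw[3], const float mean[3],
                             const float stddev[3], float out[3])
{
    for (int i = 0; i < 3; i++)
        out[i] = (raw[i] - mean[i]) / stddev[i];
}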

Inference

The C simulation loads the weights at startup and prints an instant Cd/Cl estimate before the LBM loop begins. The browser loads the same files from /models/ and runs a pure TypeScript forward pass with no dependencies. Both paths take under 1ms.

terminal
# Generate training data
cd ml && python data_gen.py --config sweep_config.yaml --endpoint $MODAL_URL

# Train the model
cmake -B build && cmake --build build && ./build/train dataset/training_data.bin

# Evaluate
python evaluate.py --weights model.bin --norm model_norm.bin --data dataset/training_data.bin

Validation

The solver uses a sphere drag test to validate the force computation pipeline (momentum exchange, Bouzidi boundary, Cd formula). At Re=100 with 16 cells across the sphere diameter, the measured Cd falls within the expected range from Clift, Grace & Weber (1978).

Sphere reference test

Re    Grid        Cd (sim)    Cd (ref)   Status
100   128×64×64   1.1 – 1.3   1.09       Pass (30% tol)

Why Cd differs from wind tunnel data

Published Ahmed body data (Ahmed 1984, Lienhart 2003) is measured at Re > 500,000 where the flow is fully turbulent. The LBM solver on typical grids (128-256 cells) operates at Re = 50-200 where the flow is laminar. At laminar Re the boundary layer is thicker, separation occurs earlier, and Cd is 3-10x higher than at turbulent Re. This is physically correct behavior, not a bug.

The Smagorinsky SGS turbulence model adds local eddy viscosity that partially compensates, but matching high-Re experimental data requires grids with hundreds of cells across the body -- beyond what consumer GPUs can handle in real time.

Grid size and Reynolds number

The achievable Reynolds number is limited by the grid resolution. The relaxation parameter τ must stay above 0.52 for numerical stability. This relationship determines the maximum Re for each grid:

Re_max = U · L_body / ν_min        ν_min = (τ_min − 0.5) / 3

Grid          Body cells   Re (max)   Cd range   GPU VRAM
128×64×64     ~38          ~115       2 – 5      86 MB
128×96×96     ~38          ~115       2 – 5      240 MB
256×128×128   ~77          ~230       1 – 3      960 MB
512×256×256   ~153         ~460       0.5 – 2    7.6 GB

The Cd values above are for the Ahmed body at 25° slant. The published wind tunnel Cd of 0.29 is measured at Re = 768,000 (Lienhart 2003). Reaching that Re would need 2,000 cells across the body.

Try it yourself

Use the calculator below to see what Reynolds number and memory footprint your grid configuration produces. The example readout corresponds to a 4.2M-cell grid:

Grid Resolution Calculator

  Total cells:             4.2M
  Body length (lattice):   76.8 cells
  Re (max, τ=0.52):        576
  Re (stable, τ=0.65):     77
  GPU memory:              992 MB
  Largest SSBO buffer:     304 MB (needs EGL/native GPU)
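
The same quantities can be estimated offline. The sketch below uses the Re_max formula and the ~200 bytes/cell figure from this page, so its output will not match the calculator exactly, and the body_fraction argument is an illustrative stand-in for the model's voxelized length along x.

sketch (C)
typedef struct {
    long  cells;
    float body_cells, re_max, mem_mb, ssbo_mb;
} GridInfo;

static GridInfo grid_info(int nx, int ny, int nz, float body_fraction, float u)
{
    GridInfo g;
    g.cells      = (long)nx * ny * nz;
    g.body_cells = body_fraction * nx;                            /* body length along x */
    g.re_max     = u * g.body_cells / ((0.52f - 0.5f) / 3.0f);
    g.mem_mb     = g.cells * 200.0f / (1024.0f * 1024.0f);        /* ~200 B per cell */
    g.ssbo_mb    = g.cells * 19.0f * 4.0f / (1024.0f * 1024.0f);  /* one 19-float f buffer */
    return g;
}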

Performance

Performance depends on grid size and GPU. The LBM kernel is memory-bandwidth-bound: each cell reads and writes 19 float32 distributions per step. Typical throughput on modern GPUs:

GPU                 Grid          Steps/sec   MLUPS
A10G (Modal)        128×96×96     ~400        ~470
A100 40GB (Modal)   256×128×128   ~100        ~420
RTX 3070            128×64×64     ~800        ~420
RTX 4090            256×128×128   ~200        ~840

MLUPS = Million Lattice Updates Per Second. Memory usage per cell is approximately 200 bytes (19 distributions × 2 buffers + velocity + solid + Bouzidi q). A 256×128×128 grid needs about 960 MB of GPU buffer space.

Convergence time

The drag coefficient needs 3-5 flow-throughs to converge. A flow-through is the time for a fluid element to traverse the domain: N_x / U_lattice steps. For a 256-cell domain at U = 0.05, one flow-through takes 5,120 steps. The simulation reports an exponential moving average (EMA) of Cd that stabilizes after approximately 50 samples.

Contributing

Lattice is open source under the MIT license. Contributions are welcome -- see CONTRIBUTING.md for the full style guide, testing, and PR process.

Code style

The simulation is C11, formatted with clang-format (LLVM style, 80 char limit). The website is TypeScript/React with Tailwind. Tests run in CI via GitHub Actions.

Commit sign-off

Every commit must carry a sign-off line certifying the Developer Certificate of Origin. Use git commit -s to add it automatically:

bash
git commit -s -m "lbm: fix bounce-back at concave corners"

# Produces:
# lbm: fix bounce-back at concave corners
#
# Signed-off-by: Your Name <your@email.com>

Commits without a valid sign-off will be rejected by CI. Make sure your Git name and email match your GitHub account.

Running tests

terminal
# Run tests
cd simulation && xvfb-run ../build/test_lbm  # LBM unit tests
xvfb-run bash test/test_integration.sh       # Integration test
cd website && npm test                        # Frontend tests
cd ml/framework && ./build/test_ml            # ML framework tests

References

  1. Ahmed, Ramm, Faltin. "Some salient features of the time-averaged ground vehicle wake." SAE 840300, 1984.
  2. Chen, Doolen. "Lattice Boltzmann method for fluid flows." Annu. Rev. Fluid Mech., 1998.
  3. Krüger et al. The Lattice Boltzmann Method. Springer, 2017.
  4. Ladd. "Numerical simulations of particulate suspensions via a discretized Boltzmann equation." J. Fluid Mech., 1994.