Lattice Documentation

GPU-accelerated wind tunnel simulation using Lattice Boltzmann Methods. Runs on consumer GPUs, outputs drag and lift coefficients, and includes an ML surrogate for instant predictions.

Quick Start

Requires OpenGL 4.3+ and a GPU with compute shader support. Any discrete NVIDIA or AMD card from the last 8 years will work.

Build from source

terminal
# Dependencies (Ubuntu/Debian)
sudo apt install build-essential cmake pkg-config \
  libsdl2-dev libsdl2-ttf-dev mesa-common-dev \
  libgl1-mesa-dev libegl1-mesa-dev

# Build
cmake -B build -S simulation -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run interactively
cd simulation
../build/3d_fluid_simulation_car --wind=1.0

Headless render

terminal
mkdir -p frames
../build/3d_fluid_simulation_car \
  --wind=2.0 --duration=10 \
  --model=assets/3d-files/ahmed_25deg_m.obj \
  --output=frames --grid=256x128x128

Web frontend

terminal
cd website
npm ci && npm run dev
# Open http://localhost:3000

How It Works

Each frame runs five GPU compute passes in sequence (a dispatch sketch follows the list):

  1. Collision -- relax distribution functions toward equilibrium (BGK or regularized operator, with optional Smagorinsky SGS turbulence model)
  2. Streaming -- propagate distributions to neighboring cells
  3. Force computation -- accumulate drag/lift via momentum exchange at solid boundaries
  4. Particle advection -- move tracer particles through the velocity field
  5. Rendering -- draw particles as points or streamline trails
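
The sequence above maps to four compute dispatches plus a draw. Below is a minimal sketch of the per-frame driver; the program handle names and the particle work-group size of 256 are illustrative assumptions rather than the actual main.c code, and any GL 4.3 loader can stand in for GLEW.

sketch (C)
#include <GL/glew.h>

/* Compute programs built at startup from the shaders listed under Architecture. */
extern GLuint collide_prog, stream_prog, force_prog, particle_prog;

static void lbm_step(GLuint nx, GLuint ny, GLuint nz, GLuint num_particles)
{
    /* lbm_collide.comp uses local_size 8x8x8, so round the grid up to full groups */
    GLuint gx = (nx + 7) / 8, gy = (ny + 7) / 8, gz = (nz + 7) / 8;

    glUseProgram(collide_prog);                      /* 1. collision */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  /* writes visible to next pass */

    glUseProgram(stream_prog);                       /* 2. streaming */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glUseProgram(force_prog);                        /* 3. momentum-exchange force */
    glDispatchCompute(gx, gy, gz);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    glUseProgram(particle_prog);                     /* 4. particle advection */
    glDispatchCompute((num_particles + 255) / 256, 1, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT |
                    GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);

    /* 5. rendering follows as a regular draw call over the particle buffers */
}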

The LBM solver uses a D3Q19 lattice (19 velocity directions in 3D). Solid geometry is voxelized from OBJ meshes using ray casting. Boundary conditions are Zou-He velocity inlet, Zou-He pressure outlet, and Bouzidi interpolated bounce-back on solid walls. The Bouzidi scheme stores the fractional distance from each fluid cell to the actual mesh surface along every lattice link, then applies quadratic or linear interpolation during streaming. This gives second-order wall accuracy without refining the grid.
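
For the linear variant of the Bouzidi rule, the reflected distribution is a convex combination of post-collision values. The sketch below is the textbook formula written out in C; the actual implementation lives in lbm_stream.comp, and the quadratic variant adds one more upstream node.

sketch (C)
/* Linear Bouzidi bounce-back for one cut link.
 * q          fractional distance from the fluid node to the wall (0 < q < 1)
 * f_post     post-collision f_i at the boundary fluid node (pointing into the wall)
 * f_post_up  post-collision f_i at the upstream neighbour x - e_i
 * f_opp      post-collision opposite-direction distribution at the boundary node
 * Returns the incoming opposite-direction distribution after streaming. */
static float bouzidi_linear(float q, float f_post, float f_post_up, float f_opp)
{
    if (q < 0.5f)
        return 2.0f * q * f_post + (1.0f - 2.0f * q) * f_post_up;
    return f_post / (2.0f * q) + (1.0f - 1.0f / (2.0f * q)) * f_opp;
}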

Architecture

lattice/
  simulation/
    src/main.c           # Render loop, CLI, GL setup
    src/lbm.c            # LBM grid creation, force readback
    src/ml_predict.c     # ML inference forward pass (pure C)
    shaders/
      lbm_collide.comp   # BGK + Smagorinsky + BCs
      lbm_stream.comp    # Pull-based streaming
      lbm_force.comp     # Momentum exchange drag
      particle.comp      # Particle advection + collision
      trail_update.comp  # Streamline trail history
    lib/                 # Headers
    assets/              # OBJ models, fonts
  website/
    src/app/page.tsx     # Main simulator UI
    src/app/api/         # Next.js API routes
    src/components/      # React components
  ml/
    train.cpp            # Training driver
    data_gen.py          # Modal sweep for training data
    framework/           # C++ autodiff, layers, optimizer

CLI Reference

Flag                       Default        Description
-w, --wind=SPEED           1.0            Wind speed 0-5 (maps to Re)
-d, --duration=SECS        0              Headless render duration (0 = interactive)
-o, --output=PATH                         Output directory for PPM frames
-m, --model=PATH           car-model.obj  Path to OBJ mesh file
-a, --angle=DEG                           Ahmed body slant angle (25 or 35)
-g, --grid=XxYxZ           128x64x64      LBM grid resolution
-r, --reynolds=RE          0              Target Reynolds number (0 = derive from wind)
-s, --scale=S              0.05           Model scale factor
-S, --smagorinsky=CS       0.1            Smagorinsky constant (0-0.5)
-v, --viz=MODE             1              Visualization mode (0-7)
-c, --collision=MODE       2              Collision: 0=off, 1=AABB, 2=mesh, 3=voxel

Wind speed controls Reynolds number when --reynolds is not set: Re = 200 * windSpeed, capped by the grid's minimum stable tau (0.52). Larger grids support higher Re. Voxel collision (mode 3) requires LBM and will fall back to AABB if LBM is disabled.
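
A sketch of this mapping follows; the cap uses the Re_max relation given under "Grid size and Reynolds number" below, and the function and argument names are illustrative rather than the actual CLI code.

sketch (C)
/* Reynolds number implied by --wind, capped by the grid's stability limit.
 * u_lattice and body_cells are the lattice-unit inflow speed and body length. */
static float effective_reynolds(float wind_speed, float u_lattice, float body_cells)
{
    float re_requested = 200.0f * wind_speed;          /* Re = 200 * windSpeed */
    float nu_min = (0.52f - 0.5f) / 3.0f;              /* viscosity at tau = 0.52 */
    float re_max = u_lattice * body_cells / nu_min;    /* grid-limited ceiling */
    return re_requested < re_max ? re_requested : re_max;
}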

API Reference

The web frontend talks to a Modal GPU worker via a REST endpoint.

POST /api/render

request
{
  "wind_speed": 1.0,
  "duration": 10,
  "model": "ahmed25",
  "viz_mode": 1,
  "collision_mode": 2,
  "reynolds": 0
}
response
{
  "status": "complete",
  "video_url": "https://...",
  "cd_value": 0.295,
  "cl_value": -0.012,
  "cd_series": [0.31, 0.30, 0.295],
  "effective_re": 480,
  "grid_size": "256x128x128"
}

The worker runs on an NVIDIA T4 GPU via Modal. Cold starts take 30-60 seconds; warm starts respond in ~15 seconds for a 10-second simulation.

Visualization Modes

Mode  Key  Description
0     3    Depth -- distance-based coloring
1     4    Velocity magnitude -- jet colormap (blue to red)
2     5    Velocity direction -- RGB = normalized XYZ
3     6    Particle lifetime -- viridis colormap with fade
4     7    Turbulence -- cool-warm diverging map
5     8    Flow progress -- rainbow by x-position
6     9    Vorticity -- purple (laminar) to orange (turbulent)
7     V    Streamlines -- particle trails as line strips, 32-point history

Press V to cycle through modes in interactive mode. Streamline mode (7) stores a 32-frame position trail per particle and renders as GL_LINE_STRIP with velocity-based coloring and alpha fade.

Physics Details

Governing Equations

Fluid flow is governed by the Navier-Stokes equations for incompressible flow:

∂u/∂t + (u · ∇)u = −(1/ρ)∇p + ν∇²u
∇ · u = 0

where u is the velocity field, p is pressure, ρ is density, and ν is kinematic viscosity. The Reynolds number Re = UL/ν characterizes the flow regime. For automotive aerodynamics at highway speeds, Re reaches 10⁶ or higher, indicating fully turbulent flow.

Lattice Boltzmann Method

Rather than solving Navier-Stokes directly, the LBM evolves particle distribution functions on a D3Q19 lattice (3 dimensions, 19 velocities): 1 rest, 6 face-connected, and 12 edge-connected neighbors. The BGK collision operator relaxes distributions toward equilibrium:

f_i(x + e_i Δt, t + Δt) = f_i(x, t) − (1/τ)[f_i(x, t) − f_i^eq(x, t)]

The relaxation time τ controls viscosity: ν = c_s²(τ − ½)Δt. Lower τ means higher Re but less numerical stability. The minimum stable value is τ = 0.52.
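
Going the other way, a target Reynolds number fixes ν and hence τ. A small sketch, assuming u and l are the lattice-unit inflow speed and body length:

sketch (C)
/* Relaxation time for a target Reynolds number (lattice units, dt = 1):
 * nu = u*l/Re, and from nu = (tau - 1/2)/3 we get tau = 3*nu + 1/2. */
static float tau_from_reynolds(float re, float u, float l)
{
    float nu  = u * l / re;
    float tau = 3.0f * nu + 0.5f;
    return tau < 0.52f ? 0.52f : tau;   /* clamp to the stability floor */
}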

Equilibrium Distribution

The Maxwell-Boltzmann equilibrium distribution is:

f_i^eq = w_i ρ [1 + (e_i·u)/c_s² + (e_i·u)²/(2c_s⁴) − |u|²/(2c_s²)]

with weights w_0 = 1/3, w_1–6 = 1/18, w_7–18 = 1/36 and speed of sound c_s² = 1/3.
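
Spelled out in C, the same expression reads as follows; the equilibrium() helper called by the collision kernel below is assumed to be equivalent, with e and w holding the D3Q19 velocity set and weights.

sketch (C)
extern const float e[19][3];   /* D3Q19 lattice velocities */
extern const float w[19];      /* lattice weights: 1/3, 1/18, 1/36 */

static float equilibrium(int i, float rho, const float u[3])
{
    float eu = e[i][0]*u[0] + e[i][1]*u[1] + e[i][2]*u[2];   /* e_i . u */
    float uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];            /* |u|^2   */
    /* 1/c_s^2 = 3, 1/(2 c_s^4) = 4.5, 1/(2 c_s^2) = 1.5 */
    return w[i] * rho * (1.0f + 3.0f*eu + 4.5f*eu*eu - 1.5f*uu);
}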

Collision Kernel

lbm_collide.comp (simplified)
#version 430 core
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

uniform float tau;

void main() {
    uvec3 pos = gl_GlobalInvocationID;
    uint idx = pos.x + pos.y*NX + pos.z*NX*NY;

    // Compute density and velocity
    float rho = 0.0;
    vec3 u = vec3(0.0);
    for (int i = 0; i < 19; i++) {
        float fi = f[i][idx];
        rho += fi;
        u += fi * e[i];
    }
    u /= rho;

    // BGK collision
    float omega = 1.0 / tau;
    for (int i = 0; i < 19; i++) {
        float feq = equilibrium(i, rho, u);
        f[i][idx] -= omega * (f[i][idx] - feq);
    }
}

Smagorinsky SGS Model

For turbulent flows, the Smagorinsky subgrid-scale model computes a local eddy viscosity from the non-equilibrium stress tensor:

τ_eff = ½(τ + √(τ² + 18 C_s² |Π| / ρ))

where Cs is the Smagorinsky constant (default 0.1, tunable via --smagorinsky) and |Π| is the Frobenius norm of the non-equilibrium stress. This increases τ locally in turbulent regions, preventing instability without damping the entire domain.
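
As a sketch, the per-cell update implied by this formula looks like the function below; the variable names are placeholders, and the real code sits in lbm_collide.comp.

sketch (C)
#include <math.h>

/* Effective relaxation time from the Smagorinsky model.
 * tau      base relaxation time
 * cs_smag  Smagorinsky constant (--smagorinsky)
 * pi_norm  |Pi|, Frobenius norm of the non-equilibrium stress
 * rho      local density */
static float tau_effective(float tau, float cs_smag, float pi_norm, float rho)
{
    return 0.5f * (tau + sqrtf(tau * tau + 18.0f * cs_smag * cs_smag * pi_norm / rho));
}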

Drag Computation

Surface forces use the Mei-Luo-Shyy momentum exchange method, which pairs with the Bouzidi boundary to account for the actual wall position rather than assuming it sits on the grid midpoint:

F = Σ_i e_i [f_i*(x_f) + f_ī(x_f)]

where f_i* is the post-collision distribution heading toward the wall and f_ī is the Bouzidi-reflected distribution coming back. The drag and lift coefficients are then:

Cd = F_x / (½ ρ U² A)      Cl = F_y / (½ ρ U² A)
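
The final conversion from accumulated force to coefficients is a one-liner; the sketch below uses illustrative names, and the force readback itself happens in lbm.c.

sketch (C)
/* fx, fy   accumulated surface force components (lattice units)
 * rho, u   reference density and inflow speed
 * area     projected frontal area A of the body, in cells^2 */
static void force_to_coefficients(float fx, float fy, float rho, float u,
                                  float area, float *cd, float *cl)
{
    float q = 0.5f * rho * u * u * area;   /* dynamic pressure times area */
    *cd = fx / q;
    *cl = fy / q;
}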

ML Surrogate Model

A small MLP predicts Cd and Cl in under a millisecond, giving instant estimates before the LBM has time to converge. The entire training framework is written from scratch in C++ with reverse-mode autodiff, so there are no Python ML dependencies at inference time. The same model runs in the C simulation binary, in the browser via TypeScript, and in the evaluation pipeline.

Architecture

The network takes three z-score normalized inputs (wind speed, Reynolds number, model ID) and outputs two values (Cd, Cl). SwiGLU activations replace standard ReLU for smoother gradient flow.

Linear(3, 256) -> SwiGLU(256, 512) -> Linear(256, 128) -> SwiGLU(128, 256) -> Linear(128, 2)

525,698 parameters total
  fc1:  3 x 256  + bias     =   1,024
  act1: SwiGLU(256, 512)    = 393,216  (gate + up + down projections)
  fc2:  256 x 128 + bias    =  32,896
  act2: SwiGLU(128, 256)    =  98,304
  fc3:  128 x 2 + bias      =     258
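
A sketch of the two building blocks of the forward pass is below; the struct layout and row-major weight storage are assumptions, and the real framework/ml_predict.c may organize this differently. Chaining fc1 -> act1 -> fc2 -> act2 -> fc3 with these two primitives reproduces the architecture above.

sketch (C)
#include <math.h>

typedef struct { const float *w, *b; int in, out; } Linear;   /* y = W x + b, row-major W */
typedef struct { Linear gate, up, down; } SwiGLU;              /* down(swish(gate(x)) * up(x)) */

static void linear_forward(const Linear *l, const float *x, float *y)
{
    for (int o = 0; o < l->out; o++) {
        float acc = l->b ? l->b[o] : 0.0f;
        for (int i = 0; i < l->in; i++)
            acc += l->w[o * l->in + i] * x[i];
        y[o] = acc;
    }
}

static void swiglu_forward(const SwiGLU *s, const float *x, float *y,
                           float *tmp_gate, float *tmp_up)
{
    linear_forward(&s->gate, x, tmp_gate);
    linear_forward(&s->up,   x, tmp_up);
    for (int i = 0; i < s->gate.out; i++) {
        float g = tmp_gate[i];
        tmp_gate[i] = g / (1.0f + expf(-g)) * tmp_up[i];   /* swish(gate) * up */
    }
    linear_forward(&s->down, tmp_gate, y);
}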

Training

Training data comes from the LBM simulation itself. A parameter sweep over wind speeds, Reynolds numbers, and model geometries generates (input, Cd, Cl) pairs. The data generator (data_gen.py) submits render jobs to Modal, validates convergence, and writes a binary dataset.

Optimizer:    AdamW (lr=1e-3, weight_decay=1e-4, clip_norm=1.0)
Loss:         MSE on [Cd, Cl]
Batch size:   64
Epochs:       500 (early stopping, patience=50)
Train/val:    80/20 split
Data format:  binary -- 16-byte header + N records of 5 float32s
              [wind_speed, reynolds, model_id, cd, cl]
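
Each record is five float32s in the order listed above; a minimal reader might look like this. The 16-byte header is skipped wholesale here because its exact fields are not documented on this page.

sketch (C)
#include <stdio.h>

typedef struct {
    float wind_speed, reynolds, model_id, cd, cl;   /* one training record */
} TrainRecord;

static long read_records(const char *path, TrainRecord *out, long max_records)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    fseek(fp, 16, SEEK_SET);    /* skip the 16-byte header */
    long n = (long)fread(out, sizeof(TrainRecord), max_records, fp);
    fclose(fp);
    return n;
}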

Weight Format

Weights are stored in a custom binary format (LTWS) shared across all three inference targets. The normalizer file stores 3 means and 3 standard deviations as raw float32s.

model.bin:      LTWS header (magic + version + count) + 12 parameter tensors
model_norm.bin: 6 x float32 (mean[3], std[3])
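
The normalizer file is simple enough to show in full; loading it and z-scoring the three inputs might look like this (function names are illustrative).

sketch (C)
#include <stdio.h>

/* model_norm.bin: mean[3] followed by std[3], raw float32. */
static int load_normalizer(const char *path, float mean[3], float stddev[3])
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return 0;
    int ok = fread(mean,   sizeof(float), 3, fp) == 3 &&
             fread(stddev, sizeof(float), 3, fp) == 3;
    fclose(fp);
    return ok;
}

/* z-score the (wind speed, Reynolds, model id) input vector */
static void normalize_inputs(const float raw[3], const float mean[3],
                             const float stddev[3], float out[3])
{
    for (int i = 0; i < 3; i++)
        out[i] = (raw[i] - mean[i]) / stddev[i];
}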

Inference

The C simulation loads the weights at startup and prints an instant Cd/Cl estimate before the LBM loop begins. The browser loads the same files from /models/ and runs a pure TypeScript forward pass with no dependencies. Both paths take under 1ms.

terminal
# Generate training data
cd ml && python data_gen.py --config sweep_config.yaml --endpoint $MODAL_URL

# Train the model
cmake -B build && cmake --build build && ./build/train dataset/training_data.bin

# Evaluate
python evaluate.py --weights model.bin --norm model_norm.bin --data dataset/training_data.bin

Validation

The solver uses a sphere drag test to validate the force computation pipeline (momentum exchange, Bouzidi boundary, Cd formula). At Re=100 with 16 cells across the sphere diameter, the measured Cd falls within the expected range from Clift, Grace & Weber (1978).

Sphere reference test

Re    Grid        Cd (sim)    Cd (ref)   Status
100   128×64×64   1.1 – 1.3   1.09       Pass (30% tol)

Why Cd differs from wind tunnel data

Published Ahmed body data (Ahmed 1984, Lienhart 2003) is measured at Re > 500,000 where the flow is fully turbulent. The LBM solver on typical grids (128-256 cells) operates at Re = 50-200 where the flow is laminar. At laminar Re the boundary layer is thicker, separation occurs earlier, and Cd is 3-10x higher than at turbulent Re. This is physically correct behavior, not a bug.

The Smagorinsky SGS turbulence model adds local eddy viscosity that partially compensates, but matching high-Re experimental data requires grids with hundreds of cells across the body -- beyond what consumer GPUs can handle in real time.

Grid size and Reynolds number

The achievable Reynolds number is limited by the grid resolution. The relaxation parameter τ must stay above 0.52 for numerical stability. This relationship determines the maximum Re for each grid:

Re_max = U · L_body / ν_min        ν_min = (τ_min − 0.5) / 3

Grid          Body cells   Re (max)   Cd range   GPU VRAM
128×64×64     ~38          ~115       2 – 5      86 MB
128×96×96     ~38          ~115       2 – 5      240 MB
256×128×128   ~77          ~230       1 – 3      960 MB
512×256×256   ~153         ~460       0.5 – 2    7.6 GB

The Cd values above are for the Ahmed body at 25° slant. The published wind tunnel Cd of 0.29 is measured at Re = 768,000 (Lienhart 2003). Reaching that Re would need 2,000 cells across the body.

Try it yourself

Use the calculator below to see what Reynolds number and memory footprint your grid configuration produces. The example readout corresponds to a 4.2M-cell grid:

Grid Resolution Calculator

  Total cells:             4.2M
  Body length (lattice):   76.8 cells
  Re (max, τ=0.52):        576
  Re (stable, τ=0.65):     77
  GPU memory:              992 MB
  Largest SSBO buffer:     304 MB (needs EGL/native GPU)
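
The same quantities can be estimated offline. The sketch below uses the Re_max formula and the ~200 bytes/cell figure from this page, so its output will not match the calculator exactly, and the body_fraction argument is an illustrative stand-in for the model's voxelized length along x.

sketch (C)
typedef struct {
    long  cells;
    float body_cells, re_max, mem_mb, ssbo_mb;
} GridInfo;

static GridInfo grid_info(int nx, int ny, int nz, float body_fraction, float u)
{
    GridInfo g;
    g.cells      = (long)nx * ny * nz;
    g.body_cells = body_fraction * nx;                            /* body length along x */
    g.re_max     = u * g.body_cells / ((0.52f - 0.5f) / 3.0f);
    g.mem_mb     = g.cells * 200.0f / (1024.0f * 1024.0f);        /* ~200 B per cell */
    g.ssbo_mb    = g.cells * 19.0f * 4.0f / (1024.0f * 1024.0f);  /* one 19-float f buffer */
    return g;
}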

Performance

Performance depends on grid size and GPU. The LBM kernel is memory-bandwidth-bound: each cell reads and writes 19 float32 distributions per step. Typical throughput on modern GPUs:

GPU                 Grid          Steps/sec   MLUPS
A10G (Modal)        128×96×96     ~400        ~470
A100 40GB (Modal)   256×128×128   ~100        ~420
RTX 3070            128×64×64     ~800        ~420
RTX 4090            256×128×128   ~200        ~840

MLUPS = Million Lattice Updates Per Second. Memory usage per cell is approximately 200 bytes (19 distributions × 2 buffers + velocity + solid + Bouzidi q). A 256×128×128 grid needs about 960 MB of GPU buffer space.

Convergence time

The drag coefficient needs 3-5 flow-throughs to converge. A flow-through is the time for a fluid element to traverse the domain: N_x / U_lattice steps. For a 256-cell domain at U = 0.05, one flow-through takes 5,120 steps. The simulation reports an exponential moving average (EMA) of Cd that stabilizes after approximately 50 samples.

Contributing

Lattice is open source under the MIT license. Contributions are welcome -- see CONTRIBUTING.md for the full style guide, testing, and PR process.

Code style

The simulation is C11, formatted with clang-format (LLVM style, 80 char limit). The website is TypeScript/React with Tailwind. Tests run in CI via GitHub Actions.

Commit sign-off

Every commit must carry a sign-off line certifying the Developer Certificate of Origin. Use git commit -s to add it automatically:

bash
git commit -s -m "lbm: fix bounce-back at concave corners"

# Produces:
# lbm: fix bounce-back at concave corners
#
# Signed-off-by: Your Name <your@email.com>

Commits without a valid sign-off will be rejected by CI. Make sure your Git name and email match your GitHub account.

Running tests

terminal
# Run tests
cd simulation && xvfb-run ../build/test_lbm  # LBM unit tests
xvfb-run bash test/test_integration.sh       # Integration test
cd website && npm test                        # Frontend tests
cd ml/framework && ./build/test_ml            # ML framework tests

References

  1. Ahmed, Ramm, Faltin. "Some salient features of the time-averaged ground vehicle wake." SAE 840300, 1984.
  2. Chen, Doolen. "Lattice Boltzmann method for fluid flows." Annu. Rev. Fluid Mech., 1998.
  3. Krüger et al. The Lattice Boltzmann Method. Springer, 2017.
  4. Ladd. "Numerical simulations of particulate suspensions via a discretized Boltzmann equation." J. Fluid Mech., 1994.