GRAVITATIONAL WAVE ASTRONOMY BENCHMARK SUITE

Can LLM Agents Match Months of PhD Work?

An Eight-Task Benchmark for Gravitational Wave Astronomy

gwBenchmarks is a benchmark suite for large language model agents tackling long-horizon, terminal-based astrophysical research problems. Focused on gravitational waves from binary compact object mergers, it evaluates models using fully automated, physics-driven, and quantitatively verifiable metrics on state-of-the-art tasks — providing a direct measure of science readiness.

Investigators: Tousif Islam (KITP, UC Santa Barbara) · Digvijay Wadekar (UT Austin) · Zihan Zhou (Princeton)
Contact: tousifislam24@gmail.com · tousifislam@ucsb.edu
CLAUDE: Opus 4.7 · Opus 4.6 · Sonnet 4.6 · Haiku 4.5
GPT: GPT-5.5 High · GPT-5.4 Mini · GPT-5.3 Codex High · GPT-5.2
GEMINI: Gemini 3.1 Pro · Gemini 3 Flash
OTHERS: Kimi K2.6 · DeepSeek V4 Pro

Why gwBenchmarks

Typical LLM benchmarks evaluate performance on toy problems or tasks where success is measured by subjective, hard-to-verify metrics. Gravitational wave astronomy provides an unusually clean alternative: the ground truth is known in many cases from exact (but computationally expensive) numerical solutions to Einstein's equations, so every benchmark score is quantitatively verifiable and reproducible. The field also sits at the intersection of high-performance computing, differential equations, and statistical inference — precisely the kind of long-horizon scientific research where we most need to understand what LLM agents can and cannot do. gwBenchmarks turns this into a concrete test: give an agent training data from numerical relativity simulations, and measure how close its autonomously-built surrogate model comes to the physics.

Leaderboard

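Best model per agent per benchmark. Values are shown as the median with upper and lower offsets (median, +hi, −lo), where the offsets span the 90% credible interval of the per-sample metric distribution. Haiku results for the dynamics, ringdown, validity, and analytic benchmarks are excluded (fabricated results).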
  • Simulation Error — the irreducible floor set by the ground-truth data itself (NR convergence, EOB truncation, etc.).
  • Physics Req. — the accuracy threshold needed for scientific applications (detection, parameter estimation).
  • Green — median meets the physics requirement for that benchmark.
  • Red — median does not yet meet the physics requirement.
  • Bold — best (lowest median) among all agents for that benchmark.
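A minimal sketch of how a "median, +hi, −lo" entry could be computed from per-sample errors. It assumes the 90% interval is the 5th to 95th percentile range of the per-sample metric; the page does not state how the interval is constructed, and the function name `median_with_interval` is hypothetical.

```python
import numpy as np

def median_with_interval(errors, level=0.90):
    """Return (median, +hi, -lo) for a set of per-sample errors.

    Assumption: the interval is central, i.e. it runs from the
    (1 - level)/2 quantile to the (1 + level)/2 quantile.
    """
    errors = np.asarray(errors, dtype=float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, median, upper = np.percentile(errors, [tail, 50.0, 100.0 - tail])
    return median, upper - median, median - lower

# Illustrative data only: the integers 1..101, not benchmark results.
sample_errors = np.arange(1.0, 102.0)
median, plus, minus = median_with_interval(sample_errors)
```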
# | Model | Waveform Bench ↓ | Remnant Bench ↓ | Dynamics Bench ↓ | Ringdown Bench ↓ | Validity Bench ↓ | Analytic Bench ↓ | Template Bank Bench ↑ | New Physics Bench ↓

Performance Profile

  • Haiku 4.5 (claude-haiku-4-5-20251001): Fastest · Most cost-efficient
  • Sonnet 4.6 (claude-sonnet-4-6): Balanced capability & speed
  • Opus 4.6 (claude-opus-4-6): High capability
  • Opus 4.7 (claude-opus-4-7): Most capable · Frontier
  • GPT-5.5 High (gpt-5.5): Codex · reasoning high
  • GPT-5.4 Mini (gpt-5.4-mini): Codex · compact model
  • GPT-5.3 Codex High (gpt-5.3-codex): Codex · reasoning high
  • GPT-5.2 (gpt-5.2): Codex baseline
  • Gemini 3.1 Pro (gemini-3.1-pro-preview): Google · Pro tier
  • Gemini 3 Flash (gemini-3-flash-preview): Google · Fast tier
  • Kimi K2.6 (kimi-k2.6): Moonshot AI
  • DeepSeek V4 Pro (deepseek-v4-pro-max): DeepSeek · Pro tier

Benchmark Results

Error Distributions

Violin plots show the full distribution of per-sample errors (log scale) for each agent's best model. Wider regions indicate more samples at that error level. The white dot marks the median; the thick bar spans the 25th to 75th percentiles (p25–p75). The purple violin (shown for the waveform, remnant, and analytic benchmarks) gives the NR simulation error floor from resolution convergence.

About the Benchmarks

Waveform Bench

input: binary parameters (masses, spins) → output: gravitational wave strain h(t)

When two black holes spiral together, they emit gravitational waves — ripples in spacetime detected by observatories like LIGO. The waveform shape depends on seven parameters (mass ratio and six spin components). Solving Einstein's equations numerically for a single configuration takes thousands of CPU-hours, so the field relies on fast surrogate models trained on a catalogue of numerical relativity (NR) simulations. This benchmark asks the agent to build such a surrogate from 250 training waveforms spanning the spinning, precessing parameter space.

Remnant Bench

input: binary parameters → output: kick velocity |v_k|

After merger, the newly formed black hole recoils (receives a "kick") due to asymmetric gravitational wave emission. Kick velocities can reach thousands of km/s — fast enough to eject a black hole from its host galaxy. Predicting kick magnitude from pre-merger parameters is a high-dimensional regression task with complex spin–orbit coupling effects. The agent must learn this mapping from 300 NR simulations covering mass ratios 1–20 and arbitrary spin orientations.

Dynamics Bench

input: orbital parameters → output: PN frequency parameter x(t)

Before merger, the two objects slowly inspiral over thousands of orbits. Their orbital dynamics — the time evolution of frequency, eccentricity, and spin precession — are described by post-Newtonian (PN) theory, a perturbative expansion of general relativity valid at large separations. This benchmark focuses on eccentric, spinning binaries where the dynamics are especially complex. The agent must model how the orbital frequency parameter x(t) evolves over the full inspiral.
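For orientation, the sketch below evaluates x(t) in the simplest textbook limit: a quasi-circular, non-spinning binary at leading (quadrupole) order, where x = (MΩ)^(2/3) in units G = c = 1 and dx/dt = (64/5)(ν/M)x⁵ with symmetric mass ratio ν. This is only an illustration of the quantity being modelled; it is not the eccentric, spinning dynamics the benchmark targets, and the function name and example values are assumptions.

```python
import numpy as np

def x_leading_order(t, x0, nu):
    """Closed-form solution of dx/dt = (64/5) * nu * x**5.

    t  : time in units of the total mass M (G = c = 1)
    x0 : value of x at t = 0
    nu : symmetric mass ratio, q / (1 + q)**2
    """
    t = np.asarray(t, dtype=float)
    return x0 * (1.0 - 256.0 * nu * x0**4 * t / 5.0) ** (-0.25)

# Illustrative values: equal masses (nu = 0.25), starting at x0 = 0.05.
nu, x0 = 0.25, 0.05
t_coalescence = 5.0 / (256.0 * nu * x0**4)   # time at which x formally diverges
t = np.linspace(0.0, 0.9 * t_coalescence, 2001)
x = x_leading_order(t, x0, nu)
```

The monotonic growth of x(t) visible here is the reason the benchmark metric uses a relative rather than absolute error.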

Ringdown Bench

input: remnant spin → output: quasi-normal mode frequencies

After merger, the remnant black hole "rings down" like a struck bell, emitting gravitational waves at characteristic quasi-normal mode (QNM) frequencies. Each mode has a complex frequency: the real part sets the oscillation rate, the imaginary part the damping time. These frequencies depend only on the remnant's mass and spin (the "no-hair theorem"). The agent must learn the mapping from spin to QNM frequencies, which is known to high precision from black hole perturbation theory.
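As a rough illustration of how a complex QNM frequency maps to observable quantities, the sketch below converts a dimensionless Mω into an oscillation frequency in Hz and a damping time in seconds. It assumes the convention in which the imaginary part is negative for a decaying mode, uses the standard non-spinning (ℓ=2, m=2) fundamental value Mω ≈ 0.37367 − 0.08896i, and picks a 60 M☉ remnant purely as an example; the function name is hypothetical.

```python
import math

# G * M_sun / c^3 in seconds (one solar mass expressed as a time).
MSUN_SECONDS = 4.925491e-6

def qnm_to_physical(m_omega, mass_msun):
    """Convert dimensionless M*omega to (frequency in Hz, damping time in s)."""
    mass_seconds = mass_msun * MSUN_SECONDS
    frequency_hz = m_omega.real / (2.0 * math.pi * mass_seconds)
    damping_time_s = mass_seconds / abs(m_omega.imag)
    return frequency_hz, damping_time_s

# Non-spinning remnant, (l=2, m=2) fundamental mode, example mass of 60 solar masses.
f_hz, tau_s = qnm_to_physical(0.37367 - 0.08896j, 60.0)
```

For these inputs the mode rings at roughly 200 Hz and decays over a few milliseconds, which is why ringdown signals from stellar-mass remnants sit in the ground-based detector band.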

Validity Bench

input: binary parameters → output: predicted surrogate mismatch

Existing surrogate models (like NRHybSur3dq8) are trained on a finite region of parameter space and degrade when extrapolated beyond it. Knowing where a model breaks down is critical for gravitational wave data analysis — using an inaccurate template can bias parameter estimates. This benchmark asks the agent to predict the mismatch (error) of an existing surrogate as a function of the input parameters, effectively learning the model's own validity boundary.

Analytic Bench

input: mass ratio q → output: closed-form h(t)

The simplest case: non-spinning black holes parameterised by a single variable, the mass ratio q. With only 20 training waveforms spanning q = 1–20, this benchmark tests whether the agent can discover a compact, closed-form analytic expression for the gravitational wave strain — the kind of symbolic model a physicist would write by hand. It rewards interpretability and extrapolation, not just interpolation accuracy.

Metrics

Waveform & Analytic Bench

output: Re(h₂₂), Im(h₂₂)
Mean frequency-domain mismatch over total masses [40, 80, 120, 160, 200] M☉ using aLIGO ZeroDetHighPower PSD, f_low=15 Hz, f_high=990 Hz, maximised over time and phase.
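The core of this metric can be sketched as follows, under simplifying assumptions: the inputs are real-valued strains on a common uniform time grid, the PSD defaults to a flat placeholder rather than the aLIGO ZeroDetHighPower curve, time shifts are maximised only over integer sample offsets, and the average over total masses is left out. The function name `mismatch` and the test signals are hypothetical.

```python
import numpy as np

def mismatch(h1, h2, dt, psd=None, f_low=15.0, f_high=990.0):
    """1 - match between two real strains, maximised over time and phase.

    psd: optional callable f -> S_n(f); a flat PSD is assumed if omitted.
    """
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    n = len(h1)
    freqs = np.fft.rfftfreq(n, dt)
    df = freqs[1] - freqs[0]
    h1f = np.fft.rfft(h1) * dt
    h2f = np.fft.rfft(h2) * dt

    # Inverse-PSD weight, restricted to the band [f_low, f_high].
    band = (freqs >= f_low) & (freqs <= f_high)
    weight = np.zeros_like(freqs)
    weight[band] = 1.0 if psd is None else 1.0 / psd(freqs[band])

    norm1 = 4.0 * df * np.sum(np.abs(h1f) ** 2 * weight)
    norm2 = 4.0 * df * np.sum(np.abs(h2f) ** 2 * weight)

    # One inverse FFT evaluates the complex overlap at every time shift;
    # taking its modulus maximises over a constant phase offset.
    integrand = np.zeros(n, dtype=complex)
    integrand[: len(freqs)] = h1f * np.conj(h2f) * weight
    overlap = 4.0 * df * n * np.fft.ifft(integrand)

    match = np.max(np.abs(overlap)) / np.sqrt(norm1 * norm2)
    return 1.0 - match

# Synthetic signals for illustration only (not benchmark data).
dt = 1.0 / 4096.0
t = np.arange(4096) * dt
envelope = np.exp(-((t - 0.5) / 0.05) ** 2)
h_ref = envelope * np.sin(2 * np.pi * 100.0 * t)
h_shifted = np.roll(envelope * np.cos(2 * np.pi * 100.0 * t), 37)  # time- and phase-shifted copy
h_other = envelope * np.sin(2 * np.pi * 300.0 * t)                 # different frequency content

mm_same = mismatch(h_ref, h_shifted, dt)
mm_diff = mismatch(h_ref, h_other, dt)
```

A time- and phase-shifted copy of the same signal gives a mismatch near zero, while a signal with different frequency content gives a mismatch near one.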

Remnant Bench

output: kick velocity |v_k|
Normalised RMSE: NRMSE(v_k) = RMSE / range(v_k*). Dimensionless and robust to the scale of kick velocities across the parameter space.
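A minimal sketch of this normalisation, assuming the range is taken as max minus min of the true kick velocities in the evaluated set; the function name and the numbers are illustrative.

```python
import numpy as np

def nrmse(v_pred, v_true):
    """RMSE divided by the range (max - min) of the true values."""
    v_pred = np.asarray(v_pred, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    rmse = np.sqrt(np.mean((v_pred - v_true) ** 2))
    return rmse / (np.max(v_true) - np.min(v_true))

# Illustrative numbers only: RMSE = 1 over a range of 10.
score = nrmse([1.0, 9.0], [0.0, 10.0])
```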

Dynamics Bench

output: x(t)
Pointwise RMS relative error: √(mean((x̂−x*)²/x*²)). Weights all time steps equally in fractional terms, avoiding late-time bias from the monotonically growing x(t).
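The formula above translates directly into a few lines of NumPy. This is a sketch that assumes predicted and true x(t) are sampled on the same time grid; the function name is hypothetical.

```python
import numpy as np

def rms_relative_error(x_pred, x_true):
    """sqrt(mean(((x_pred - x_true) / x_true)**2)) over all time steps."""
    x_pred = np.asarray(x_pred, dtype=float)
    x_true = np.asarray(x_true, dtype=float)
    return np.sqrt(np.mean(((x_pred - x_true) / x_true) ** 2))

# Illustrative numbers only: a uniform 10% overestimate at every step.
score = rms_relative_error([1.1, 2.2, 4.4], [1.0, 2.0, 4.0])
```

Because each step is divided by its own true value, a uniform fractional offset scores the same whether x is small (early inspiral) or large (late inspiral).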

Ringdown Bench

output: Re(ω), Im(ω)
Mean of the relative errors in the real and imaginary parts: (|Δω_R / ω_R*| + |Δω_I / ω_I*|) / 2. Evaluated on the (ℓ=2, m=2) fundamental mode.
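Read as the average of the relative error in the real part and the relative error in the imaginary part of ω, this metric can be sketched as below. That reading, the function name, and the example numbers are assumptions.

```python
def qnm_relative_error(omega_pred, omega_true):
    """Average of |dRe/Re*| and |dIm/Im*| for one complex QNM frequency."""
    omega_pred = complex(omega_pred)
    omega_true = complex(omega_true)
    err_real = abs((omega_pred.real - omega_true.real) / omega_true.real)
    err_imag = abs((omega_pred.imag - omega_true.imag) / omega_true.imag)
    return 0.5 * (err_real + err_imag)

# Illustrative numbers only: both parts are off by 10%.
score = qnm_relative_error(0.55 - 0.09j, 0.5 - 0.1j)
```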

Validity Bench

output: predicted mismatch M̂
RMSE in log₁₀ space: RMSE(log₁₀ M̂, log₁₀ M*). Appropriate because mismatches span orders of magnitude (~1e-7 to ~1e-1).
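A short sketch of the log-space RMSE, assuming all mismatches are strictly positive so the logarithm is defined; the function name and the numbers are illustrative.

```python
import numpy as np

def log10_rmse(m_pred, m_true):
    """RMSE between log10 of predicted and true mismatches."""
    log_pred = np.log10(np.asarray(m_pred, dtype=float))
    log_true = np.log10(np.asarray(m_true, dtype=float))
    return np.sqrt(np.mean((log_pred - log_true) ** 2))

# Illustrative numbers only: each prediction is off by one order of magnitude.
score = log10_rmse([1e-2, 1e-6], [1e-3, 1e-5])
```

In this space an error of 1.0 means the prediction is off by a factor of ten on average, regardless of whether the true mismatch is near 1e-7 or 1e-1.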

General rules

no scoring penalty
All metrics are raw accuracy losses — no runtime penalty. Agents must try ≥ 20 modelling approaches across 4 categories. PySR and gplearn are mandatory.

About Us

Tousif Islam

Lead Investigator
Kavli Postdoctoral Scholar, KITP · UC Santa Barbara · Incoming Weinberg Fellow at UT Austin

Gravitational wave astronomer working at the intersection of numerical relativity, reduced-order surrogate modelling, and machine learning. Develops high-fidelity GW models for binary black hole mergers and applies them to source characterization, population inference, and tests of gravity. Also interested in connecting NR simulations to multimessenger astrophysics and cosmology.

Digvijay Wadekar

Investigator
Assistant Professor, UT Austin · Previously JHU & IAS

Astrophysicist with broad interests spanning gravitational-wave physics, cosmology, and astroparticle physics. Develops physics-informed, interpretable ML models for compact object mergers — from detection and population inference to progenitor modelling. Recently exploring agentic LLMs for physics applications. Earlier work included cosmological pipelines, CMB-SZ gas physics, and dark matter phenomenology.

Zihan Zhou

Investigator
Graduate Student, Princeton University

Theoretical physicist whose work spans effective field theory, gravitational wave physics, quantum dynamics, and AI for physics. Earlier research focused on compact object dynamics using EFT, scattering amplitudes, and renormalization group methods. Current interests include agent frameworks for astrophysical inference and discovery in fundamental physics.

Gravitational Wave Terminology

Key terms used throughout the paper and this benchmark suite, for readers unfamiliar with gravitational wave modelling.

General Relativity (GR)

Einstein’s theory of gravity, which describes spacetime as a dynamical geometric entity and predicts the existence of gravitational waves emitted by accelerating masses.

Beyond-GR Theories

Extensions or alternatives to general relativity that modify the underlying theory of gravity, often leading to deviations in gravitational wave signals that can be tested observationally.

Binary Black Hole (BBH)

A system of two black holes orbiting each other and emitting gravitational radiation as they inspiral and merge.

Waveform

The gravitational wave signal h(t) emitted by a source, typically represented as a complex time series encoding amplitude and phase.

Numerical Relativity (NR)

A computational approach that solves the Einstein equations directly to simulate spacetime dynamics during compact object mergers. Highly accurate but computationally expensive.

Post-Newtonian (PN) Approximation

An analytic expansion valid during the early inspiral phase, where gravitational fields are weak and velocities are small compared to the speed of light.

Effective-One-Body (EOB) Models

Semi-analytic models that map the two-body problem to an effective single-body system, combining analytic approximations with calibration to numerical simulations.

Ringdown

The final phase of a merger in which the remnant black hole emits damped oscillations characterized by quasi-normal modes.

Quasi-Normal Modes (QNM)

Characteristic oscillation frequencies of a perturbed black hole, determined by its mass and spin.

Gravitational Wave Detection

The process of identifying gravitational wave signals in noisy detector data, typically using matched filtering with waveform templates.

Parameter Estimation

The inference of source properties (e.g., masses, spins, distance) from observed gravitational wave signals, often performed using Bayesian methods.

Mismatch

A measure of disagreement between two waveforms, typically defined using a noise-weighted inner product in the frequency domain.

Recoil (Kick) Velocity

The velocity imparted to the remnant black hole due to asymmetric emission of gravitational radiation during merger.