Can LLM Agents Match Months of PhD Work?
gwBenchmarks is a benchmark suite for large language model agents tackling long-horizon, terminal-based astrophysical research problems. Focused on gravitational waves from binary compact object mergers, it evaluates models using fully automated, physics-driven, and quantitatively verifiable metrics on state-of-the-art tasks — providing a direct measure of science readiness.
Contact: tousifislam24@gmail.com · tousifislam@ucsb.edu
Why gwBenchmarks
MOTIVATION
Leaderboard
ALL BENCHMARKS · LOWER IS BETTER
- Simulation Error: the irreducible floor set by the ground-truth data itself (NR convergence, EOB truncation, etc.).
- Physics Req.: the accuracy threshold needed for scientific applications (detection, parameter estimation).
- Green: median meets the physics requirement for that benchmark.
- Red: median does not yet meet the physics requirement.
- Bold: best (lowest median) among all agents for that benchmark.
| # | Model | Waveform Bench ↓ | Remnant Bench ↓ | Dynamics Bench ↓ | Ringdown Bench ↓ | Validity Bench ↓ | Analytic Bench ↓ | Template Bank Bench ↑ | New Physics Bench ↓ |
|---|---|---|---|---|---|---|---|---|---|
Performance Profile
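NORMALISED · CLOSER TO CENTRE = BETTER
[Figure: normalised performance profile of each agent across the benchmarks. Legend descriptors as extracted (model names not recovered): Fastest · Most cost-efficient; Balanced capability & speed; High capability; Most capable · Frontier; Codex · reasoning high; Codex · compact model; Codex · reasoning high; Codex baseline; Google · Pro tier; Google · Fast tier; Moonshot AI; DeepSeek · Pro tier.]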
Benchmark Results
8 TASKS
Error Distributions
PER-SAMPLE · LOG SCALE
About the Benchmarks
BACKGROUND FOR NON-SPECIALISTS
Waveform Bench
When two black holes spiral together, they emit gravitational waves — ripples in spacetime detected by observatories like LIGO. The waveform shape depends on seven parameters (mass ratio and six spin components). Solving Einstein's equations numerically for a single configuration takes thousands of CPU-hours, so the field relies on fast surrogate models trained on a catalogue of numerical relativity (NR) simulations. This benchmark asks the agent to build such a surrogate from 250 training waveforms spanning the spinning, precessing parameter space.
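The Metrics section scores this benchmark with a PSD-weighted comparison maximised over time and phase, which is the usual form of a waveform mismatch. The sketch below shows one way such a quantity can be computed. It assumes real-valued, equally sampled time series and uses a flat PSD as a stand-in for the aLIGO ZeroDetHighPower curve; the function name `mismatch`, the sampling rate `fs`, and the flat-PSD default are illustrative assumptions, not the benchmark's evaluation code.

```python
import numpy as np

def mismatch(h1, h2, fs, psd=None, f_low=15.0, f_high=990.0):
    """Mismatch between two real time series, maximised over time and phase.

    psd: one-sided noise PSD on the rfft frequency grid.
    ASSUMPTION: defaults to a flat PSD, not the aLIGO ZeroDetHighPower curve.
    """
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    n = len(h1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    H1 = np.fft.rfft(h1)
    H2 = np.fft.rfft(h2)
    if psd is None:
        psd = np.ones_like(freqs)

    # Inverse-PSD weighting, restricted to the band [f_low, f_high].
    band = (freqs >= f_low) & (freqs <= f_high)
    weight = np.where(band, 1.0 / psd, 0.0)

    # Noise-weighted norms (constant prefactors cancel in the ratio below).
    norm1 = np.sum(np.abs(H1) ** 2 * weight)
    norm2 = np.sum(np.abs(H2) ** 2 * weight)

    # One-sided cross-spectrum; its inverse FFT is a complex correlation
    # whose index scans circular time shifts and whose modulus maximises
    # over a constant phase offset.
    one_sided = np.zeros(n, dtype=complex)
    one_sided[: len(freqs)] = H1 * np.conj(H2) * weight
    corr = np.fft.ifft(one_sided) * n

    match = np.max(np.abs(corr)) / np.sqrt(norm1 * norm2)
    return 1.0 - match

# Usage: a sine-Gaussian burst compared with a time-shifted copy of itself.
fs = 4096.0
t = np.arange(4096) / fs
burst = np.exp(-((t - 0.5) / 0.05) ** 2) * np.sin(2 * np.pi * 100.0 * t)
print(mismatch(burst, np.roll(burst, 300), fs))
```

Taking the modulus of the one-sided correlation handles the phase maximisation without an explicit search, and the inverse FFT evaluates every circular time shift at once. A production evaluation would also need the actual detector PSD and a treatment of multi-mode, complex strain, both of which are outside this sketch.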
Remnant Bench
After merger, the newly formed black hole recoils (receives a "kick") due to asymmetric gravitational wave emission. Kick velocities can reach thousands of km/s — fast enough to eject a black hole from its host galaxy. Predicting kick magnitude from pre-merger parameters is a high-dimensional regression task with complex spin–orbit coupling effects. The agent must learn this mapping from 300 NR simulations covering mass ratios 1–20 and arbitrary spin orientations.
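The Metrics section defines the score for this benchmark as NRMSE(v_k) = RMSE / range(v_k*). A minimal sketch of that formula follows; the function name `nrmse` and the sample kick values are hypothetical and chosen only for illustration.

```python
import numpy as np

def nrmse(pred, true):
    """RMSE of the predictions divided by the range of the true values."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    return rmse / (true.max() - true.min())

# Usage with made-up kick magnitudes in km/s (not benchmark data).
true_kicks = [0.0, 100.0, 200.0]
pred_kicks = [10.0, 90.0, 210.0]
print(nrmse(pred_kicks, true_kicks))
```

Dividing by the range of the true kicks makes the score dimensionless, so a data set dominated by small kicks and one containing very large kicks can be compared on the same footing.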
Dynamics Bench
Before merger, the two objects slowly inspiral over thousands of orbits. Their orbital dynamics — the time evolution of frequency, eccentricity, and spin precession — are described by post-Newtonian (PN) theory, a perturbative expansion of general relativity valid at large separations. This benchmark focuses on eccentric, spinning binaries where the dynamics are especially complex. The agent must model how the orbital frequency parameter x(t) evolves over the full inspiral.
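The Metrics section scores this benchmark with the relative RMS error √(mean((x̂−x*)²/x*²)). The sketch below implements that expression directly; the function name `relative_rms_error` and the sample values are hypothetical.

```python
import numpy as np

def relative_rms_error(x_pred, x_true):
    """Root mean square of the fractional error (x_pred - x_true) / x_true."""
    x_pred = np.asarray(x_pred, dtype=float)
    x_true = np.asarray(x_true, dtype=float)
    return np.sqrt(np.mean(((x_pred - x_true) / x_true) ** 2))

# Usage with made-up samples of x(t) (not benchmark data).
x_true = [0.10, 0.20]
x_pred = [0.11, 0.18]
print(relative_rms_error(x_pred, x_true))
```

Because each residual is divided by the true value at the same time step, the larger late-time values of x(t) do not dominate the average.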
Ringdown Bench
After merger, the remnant black hole "rings down" like a struck bell, emitting gravitational waves at characteristic quasi-normal mode (QNM) frequencies. Each mode has a complex frequency: the real part sets the oscillation rate, the imaginary part the damping time. These frequencies depend only on the remnant's mass and spin (the "no-hair theorem"). The agent must learn the mapping from spin to QNM frequencies, which is known to high precision from black hole perturbation theory.
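The Metrics section scores this benchmark with (|ΔωR/ωR*| + |ΔωI/ωI*|) / 2. A minimal sketch follows, assuming each QNM frequency is stored as a single complex number whose real part is ωR and whose imaginary part is ωI; the function name `qnm_error` and that storage convention are assumptions made for illustration.

```python
import numpy as np

def qnm_error(omega_pred, omega_true):
    """Mean of the fractional errors in the real and imaginary parts."""
    omega_pred = np.asarray(omega_pred, dtype=complex)
    omega_true = np.asarray(omega_true, dtype=complex)
    err_real = np.abs((omega_pred.real - omega_true.real) / omega_true.real)
    err_imag = np.abs((omega_pred.imag - omega_true.imag) / omega_true.imag)
    return 0.5 * (err_real + err_imag)

# Usage: an approximate non-spinning (2,2) fundamental mode in units of 1/M,
# and a prediction off by 1% in the real part and 2% in the imaginary part.
omega_true = 0.3737 - 0.0890j
omega_pred = complex(omega_true.real * 1.01, omega_true.imag * 0.98)
print(qnm_error(omega_pred, omega_true))
```

The absolute values make the result independent of the sign convention chosen for the damping term, and the function works elementwise on arrays of frequencies evaluated at many spins.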
Validity Bench
Existing surrogate models (like NRHybSur3dq8) are trained on a finite region of parameter space and degrade when extrapolated beyond it. Knowing where a model breaks down is critical for gravitational wave data analysis — using an inaccurate template can bias parameter estimates. This benchmark asks the agent to predict the mismatch (error) of an existing surrogate as a function of the input parameters, effectively learning the model's own validity boundary.
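The Metrics section scores this benchmark with RMSE(log₁₀ M̂, log₁₀ M*), where M is the mismatch. The sketch below implements that expression; the function name `log_mismatch_rmse` and the sample values are hypothetical.

```python
import numpy as np

def log_mismatch_rmse(m_pred, m_true):
    """RMSE between log10 of predicted and true mismatches (all must be > 0)."""
    log_pred = np.log10(np.asarray(m_pred, dtype=float))
    log_true = np.log10(np.asarray(m_true, dtype=float))
    return np.sqrt(np.mean((log_pred - log_true) ** 2))

# Usage with made-up mismatches spanning six orders of magnitude.
m_true = [1e-7, 1e-1]
m_pred = [1e-6, 1e-2]
print(log_mismatch_rmse(m_pred, m_true))
```

Working in log space means a prediction that is off by a factor of ten contributes the same error whether the true mismatch is 1e-7 or 1e-1, which a linear RMSE would not do.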
Analytic Bench
The simplest case: non-spinning black holes parameterised by a single variable, the mass ratio q. With only 20 training waveforms spanning q = 1–20, this benchmark tests whether the agent can discover a compact, closed-form analytic expression for the gravitational wave strain — the kind of symbolic model a physicist would write by hand. It rewards interpretability and extrapolation, not just interpolation accuracy.
Metrics
PHYSICS-MOTIVATED
Waveform & Analytic Bench
Mismatch evaluated with the aLIGO ZeroDetHighPower PSD, f_low = 15 Hz, f_high = 990 Hz, maximised over time and phase.
Remnant Bench
NRMSE(v_k) = RMSE / range(v_k*). Dimensionless and robust to the scale of kick velocities across the parameter space.
Dynamics Bench
√(mean((x̂−x*)²/x*²)). Weights all time steps equally in fractional terms, avoiding late-time bias from the monotonically growing x(t).
Ringdown Bench
(|ΔωR/ωR*| + |ΔωI/ωI*|) / 2. Evaluated on the (ℓ=2, m=2) fundamental mode.
Validity Bench
RMSE(log₁₀ M̂, log₁₀ M*). Appropriate because mismatches span orders of magnitude (~1e-7 to ~1e-1).
General rules
About Us
THE TEAM
Tousif Islam
Gravitational wave astronomer working at the intersection of numerical relativity, reduced-order surrogate modelling, and machine learning. Develops high-fidelity GW models for binary black hole mergers and applies them to source characterization, population inference, and tests of gravity. Also interested in connecting NR simulations to multimessenger astrophysics and cosmology.

Digvijay Wadekar
Astrophysicist with broad interests spanning gravitational-wave physics, cosmology, and astroparticle physics. Develops physics-informed, interpretable ML models for compact object mergers — from detection and population inference to progenitor modelling. Recently exploring agentic LLMs for physics applications. Earlier work included cosmological pipelines, CMB-SZ gas physics, and dark matter phenomenology.

Zihan Zhou
Theoretical physicist whose work spans effective field theory, gravitational wave physics, quantum dynamics, and AI for physics. Earlier research focused on compact object dynamics using EFT, scattering amplitudes, and renormalization group methods. Current interests include agent frameworks for astrophysical inference and discovery in fundamental physics.
Gravitational Wave Terminology
GLOSSARY
Key terms used throughout the paper and this benchmark suite, for readers unfamiliar with gravitational wave modelling.
General Relativity (GR)
Einstein’s theory of gravity, which describes spacetime as a dynamical geometric entity and predicts the existence of gravitational waves emitted by accelerating masses.
Beyond-GR Theories
Extensions or alternatives to general relativity that modify the underlying theory of gravity, often leading to deviations in gravitational wave signals that can be tested observationally.
Binary Black Hole (BBH)
A system of two black holes orbiting each other and emitting gravitational radiation as they inspiral and merge.
Waveform
The gravitational wave signal h(t) emitted by a source, typically represented as a complex time series encoding amplitude and phase.
Numerical Relativity (NR)
A computational approach that solves the Einstein equations directly to simulate spacetime dynamics during compact object mergers. Highly accurate but computationally expensive.
Post-Newtonian (PN) Approximation
An analytic expansion valid during the early inspiral phase, where gravitational fields are weak and velocities are small compared to the speed of light.
Effective-One-Body (EOB) Models
Semi-analytic models that map the two-body problem to an effective single-body system, combining analytic approximations with calibration to numerical simulations.
Ringdown
The final phase of a merger in which the remnant black hole emits damped oscillations characterized by quasi-normal modes.
Quasi-Normal Modes (QNM)
Characteristic oscillation frequencies of a perturbed black hole, determined by its mass and spin.
Gravitational Wave Detection
The process of identifying gravitational wave signals in noisy detector data, typically using matched filtering with waveform templates.
Parameter Estimation
The inference of source properties (e.g., masses, spins, distance) from observed gravitational wave signals, often performed using Bayesian methods.
Mismatch
A measure of disagreement between two waveforms, typically defined using a noise-weighted inner product in the frequency domain.
Recoil (Kick) Velocity
The velocity imparted to the remnant black hole due to asymmetric emission of gravitational radiation during merger.