GlobalTap: Globally Diverse Beat Tracking Benchmark | Cornell Computational Cognition Lab

Existing beat-tracking benchmarks draw from a narrow range of musical cultures — predominantly Western — and are difficult to scale due to reliance on expert annotation. GlobalTap addresses both limitations with a 200-excerpt popular-music benchmark spanning 59 countries, developed using a fully scalable online tapping pipeline and data from 1,628 participants recruited across 10 country panels.

Rather than asking expert annotators to manually place beat times, we collect large pools of tapping responses from online participants using REPP, a high-precision browser-based tool for sensorimotor synchronization experiments. Pooled tap times are smoothed into a density over perceived beat locations using Gaussian kernel-density estimation (σ = 80 ms), and a simple optimization step distills this density into a regularized consensus beat annotation. The resulting crowd-derived annotations are then validated by independent listeners rating click-track overlays.

Figure 1. GlobalTap pipeline overview and benchmark geography. (A) Online tapping responses are pooled across listeners, smoothed by Gaussian KDE, and converted into a near-isochronous consensus beat annotation, validated by listeners on click-track overlays. (B) The 200 benchmark excerpts are drawn from 59 source countries (red); tappers were recruited across 10 country panels (blue, N = 746 tappers in total).

Figure 2. From audio to validated beat grid (Ballroom example, 12–18 s window). (i) Audio spectrogram. (ii) Per-participant tap raster (N = 40 tappers). (iii) Pooled tap density (Gaussian KDE, σ = 80 ms) with extracted consensus peaks. (iv) Optimized near-isochronous consensus grid (solid blue) overlaid on the audio envelope (gray) and compared with the expert reference annotation (dashed green).

Validation

A separate listener-rating experiment evaluated the crowd pipeline against 200 existing expert-annotated excerpts. Across all paired excerpts, crowd-derived and expert-reference annotations received statistically indistinguishable ratings (Δ = −0.01, p = .231). On MIREX — where both conditions used the same aggregation pipeline but different tap sources (online crowd vs. lab participants) — ratings were again identical, confirming that online recruitment produces annotations of equivalent quality to those from lab-recruited participants.

Crowd annotations occasionally diverged from expert references in systematic, musically interpretable ways: phase offsets (the crowd often anchored to a perceived acoustic event rather than the theoretically correct beat), tactus-level alternatives (listeners preferring a half- or double-time pulse), and bimodal tapping revealing co-existing metrical interpretations. These structured disagreements expose genuine perceptual ambiguity that single-reference annotations cannot capture.

Case studies

Candombe — phase offset. The crowd anchors to the displaced perceived beat, approximately 85 ms later, corresponding to a salient acoustic event: an emphasized onset by the chico drum. The expert annotation places the beat in accordance with the dance movement, one sixteenth note earlier.

GTZAN Disco — simple alignment. KDE peaks track the optimizer grid throughout the entire window; crowd and reference clicks are perceptually indistinguishable.

What we do

Collect large-scale online tapping data via REPP and PsyNet, bypassing browser timer jitter for millisecond-precise tap timing.
Aggregate pooled taps into consensus beat annotations using Gaussian KDE and a near-isochronous grid optimizer.
Compute intrinsic quality measures (beat-sequence CV and inter-subject tapping coherence, ISTC) that flag perceptually difficult excerpts without requiring an external reference.
Validate crowd annotations against expert references via a listener-preference experiment with 130 raters across 240 excerpts.
Release the full benchmark with de-identified tap data, consensus annotations, quality measures, and a reproducible pipeline.

Why it matters

The main obstacle to culturally broader beat-tracking evaluation is not a lack of music, but a lack of scalable annotation methods. Expert annotations are expensive, require specialist musical knowledge, and do not easily transfer across cultural traditions. GlobalTap demonstrates that online crowd tapping can close this gap, producing perceptually valid annotations at scale and across diverse musical styles. By releasing raw tap times alongside consensus beats, the benchmark also makes perceptual ambiguity inspectable — contributing to cognitive research on beat perception as well as to computational modeling and music information retrieval.