One Test Many Tongues

A scalable, corpus-derived vocabulary test for measuring language proficiency across thousands of languages.

Language shapes cognition, social interaction, and cultural experience, but measuring linguistic background at global scale remains difficult. Traditional language proficiency tests are typically hand-crafted by experts, available for only a small number of languages, and hard to deploy efficiently in large online studies.

This project develops WikiVocab, an automated pipeline for deriving brief vocabulary tests from text corpora. The method creates matched real-word and pseudoword items, validates them across languages, and makes it possible to estimate first- and second-language proficiency quickly in large, multilingual participant pools.

[Figure: Automated WikiVocab pipeline for creating multilingual vocabulary tests]
Automated WikiVocab pipeline. Text is collected and cleaned, rare real words are selected, pseudowords are generated from character transition statistics, and real/fake word pairs are matched on low-level statistics. Character-based languages such as Chinese are converted into letter-based representations such as Pinyin before pseudoword generation and then converted back. The validation interface presents words as images to prevent copy-pasting.
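The pseudoword step described above can be illustrated with a small character-level Markov chain. This is a minimal sketch, not the project's actual implementation: the function names (`build_transitions`, `generate_pseudoword`), the context length (`order=2`), and the padding symbols are all assumptions made for illustration. It shows only the core idea of learning which characters tend to follow which, then sampling new strings that are not real words.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical padding symbols, assumed absent from the corpus


def build_transitions(words, order=2):
    """Record, for each context of `order` characters, every character that follows it."""
    table = defaultdict(list)
    for word in words:
        padded = START * order + word + END
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            table[context].append(padded[i + order])
    return table


def generate_pseudoword(table, real_words, order=2, rng=None,
                        max_len=20, max_tries=1000):
    """Sample a string from the transition table that is not itself a real word.

    Returns None if no suitable candidate is found within `max_tries`.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        context, chars = START * order, []
        while len(chars) < max_len:
            # Repeated entries in the list make frequent transitions more likely.
            nxt = rng.choice(table[context])
            if nxt == END:
                break
            chars.append(nxt)
            context = (context + nxt)[-order:]
        else:
            continue  # reached max_len without an end symbol; try again
        candidate = "".join(chars)
        if candidate and candidate not in real_words:
            return candidate
    return None


# Toy word list, invented for this example only.
toy_words = ["banana", "bandana", "cabana", "canada", "panama"]
toy_table = build_transitions(toy_words)
toy_pseudoword = generate_pseudoword(toy_table, set(toy_words), rng=random.Random(0))
```

Under this sketch, a script such as Chinese would first be romanized (for example to Pinyin) so that the same character-level statistics apply, as the caption describes.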

The resulting test is designed to be short enough for online experiments while still distinguishing native speakers, second-language speakers, and nonspeakers. In validation studies, WikiVocab shows high test-retest reliability and strong alignment with existing language tests and self-reported proficiency, while remaining scalable across many more languages than expert-built instruments.

[Figure: Validation of WikiVocab against LexTALE and self-report across recruiters and languages]
Validation across languages and recruitment platforms. Participants from multiple countries completed WikiVocab, LexTALE, and self-report measures in their native language and a randomly selected foreign language. Results show strong test-retest reliability, correlations across recruiters, and clear separation among native language, same subfamily, same family, and different family comparisons.
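One common way to quantify test-retest reliability is the correlation between scores from two sessions of the same test. The sketch below assumes a Pearson correlation, which the text above does not specify; the function name `pearson_r` and the example scores are hypothetical and do not come from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Made-up accuracy scores for five participants in two sessions.
session_1 = [0.55, 0.62, 0.71, 0.88, 0.95]
session_2 = [0.58, 0.60, 0.75, 0.85, 0.97]
reliability = pearson_r(session_1, session_2)
```

A value near 1 would indicate that participants keep roughly the same rank order across sessions.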

The project also maps the global reach of online recruitment platforms. By comparing Prolific and Cint, the work characterizes which languages and countries can be sampled at scale, and how participant demographics vary across recruitment sources.

[Figure: Global reach and demographics of online recruitment platforms for language testing]
Global reach on Prolific and Cint. The study maps countries and languages with sufficient active participants, marking countries recruited in each experiment, and compares demographic differences in education and age across recruitment platforms.

Extending the method to 34 languages across 34 countries reveals structure in global language proficiency. Accuracy is highest for native languages and decreases with linguistic distance, while model comparisons show that self-reported proficiency, linguistic family, lexical distance, education, and writing-system differences all predict performance.

[Figure: Extended WikiVocab testing across 34 languages and 34 countries]
Extended testing across 34 languages. WikiVocab accuracy separates native speakers from speakers of related and unrelated languages, and the cross-language matrix reveals structure organized by lexical distance and language family. Replications on Prolific confirm the weaker diagonal (native-language) entries observed for selected languages, and model comparisons quantify the predictors of test performance.

What we do

  • Build automated vocabulary tests from large text corpora, making language proficiency measurement scalable beyond expert-built tests.
  • Generate matched real-word and pseudoword pairs while controlling low-level lexical statistics and removing near-typos.
  • Validate online language tests across recruitment platforms, languages, and participant backgrounds.
  • Map the linguistic reach of online participant pools and provide infrastructure for globally distributed cognitive science.
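The near-typo filter mentioned in the list above can be sketched with Levenshtein edit distance: a pseudoword that is only one edit away from a real word looks like a misspelling and is dropped. This is an illustrative assumption rather than the project's documented method; the names `edit_distance` and `drop_near_typos` and the threshold `min_distance=2` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def drop_near_typos(pseudowords, real_words, min_distance=2):
    """Keep only pseudowords at least `min_distance` edits from every real word."""
    return [p for p in pseudowords
            if all(edit_distance(p, w) >= min_distance for w in real_words)]


# Invented example: "hause" is one substitution away from "house", so it is removed.
kept = drop_near_typos(["hause", "blorp"], ["house", "mouse"])
```

Comparing every pseudoword against every real word is quadratic in the vocabulary size, so a full-scale pipeline would likely need indexing or length-based pruning; that is outside the scope of this sketch.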

Why it matters

Many large-scale online experiments rely on participants’ language background, but self-report alone is often too coarse for cross-cultural research. WikiVocab provides a fast, objective, and scalable way to measure language proficiency across diverse populations. This supports more rigorous global studies of cognition, language, culture, and human behavior, while helping reduce the field’s overreliance on English-language samples.

[Figure: Bible-derived vocabulary tests and comparison with broader language skill measures]
Additional validation with Bible-derived vocabulary tests. Automatically generated tests can be built for 1,939 languages, and performance patterns distinguish native language, same subfamily, same family, and different family comparisons. WikiVocab also correlates with broader linguistic skills including listening, reading, vocabulary, and writing.

(van Rijn et al., 2026)

Related Publications

2026

  1. Pol van Rijn, Yue Sun, Harin Lee, and 5 more authors
    Proceedings of the National Academy of Sciences, 2026