A pipeline that curates language data for every tongue

The global tapestry of languages is vast and unevenly lit. The online world—where most data for training large language systems is born—hums with English, Chinese, and a handful of other high-resource tongues. Around 7,000 other languages struggle to find a voice in the digital era, their words scattered across isolated communities, local texts, and scarce web corners. In response, a team from Hugging Face and EPFL (École polytechnique fédérale de Lausanne) led by Guilherme Penedo built a bold, scalable answer to a stubborn question: can we tailor the way we collect and clean data so that language models learn well in every tongue, not just the ones that dominate the internet? The result is FineWeb2, a pre-training pipeline and dataset designed to adapt automatically to any language. The work behind it, a collaboration between a university and an industrial research lab, is one of the most broadly multilingual data-curation efforts to date.

At the heart of FineWeb2 is a simple, almost human-sized idea: don’t apply one size of sieve to all languages. Different languages have different scripts, rhythms, and domain quirks. The pipeline automatically tunes its language identification, deduplication, and filtering steps to each language’s particularities, then uses a smart upsampling strategy to favor higher-quality content without over-emphasizing duplicates. The project’s first author, Guilherme Penedo, along with collaborators from Hugging Face, EPFL, and allied teams, shows that a data-processing workflow can scale to thousands of languages while still delivering measurable gains in model quality. This is not just a technical tweak; it’s a philosophical shift in how we approach multilingual AI data.

Adaptive pipelines unlock thousands of languages

The classic approach to multilingual data crawls treated every language with the same recipe. That fixed pipeline might filter out the right kinds of text for English but remove valuable content in Swahili, Thai, or Telugu, or worse, mislabel a document’s language and ruin downstream learning. The FineWeb2 approach flips the script. It builds a pipeline that uses language-specific statistics to guide every crucial decision—from which documents to keep to how aggressively to prune duplicates. The target is not merely more data, but better data: text that a model can actually learn from, in a form that matches how that language is written and used.

The backbone of the system is a chain of well-chosen processing steps: language identification (LID) to label documents by language, deduplication to collapse near-duplicates, filtering to prune low-quality text, and a clever rehydration step that upsamples certain documents based on how widely they were duplicated. None of these steps is fixed globally; each is tuned per language, using a data-driven approach that seeks robust signals rather than hand-waved heuristics. The researchers lean on GlotLID, a state-of-the-art language identifier, to anchor language labeling across thousands of languages and scripts. They also rely on the tokenizer from the Gemma model family to map a language’s words into tokens efficiently, which matters because a good tokenizer can dramatically affect how well a model learns from a sentence. And they don’t stop at labeling and trimming; they calibrate how aggressive each filter should be by looking at real statistics from each language’s corpora, including Wikipedia, Common Crawl, and language-specific sources.
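To make the shape of that chain concrete, here is a minimal Python sketch of one per-language decision. It is an illustration under stated assumptions, not the project's actual code: the per-language settings, the helper names, and the fastText-style predict interface for the LID model (GlotLID is distributed in that format) are assumptions made for the sake of the example.

```python
# Minimal sketch of a per-language curation step; illustrative, not FineWeb2's actual code.
# Assumes a fastText-style LID model and hypothetical per-language settings.
from dataclasses import dataclass

@dataclass
class LangConfig:
    lid_threshold: float  # minimum LID confidence to keep a document in this language
    min_words: int        # an example of a language-tuned filtering knob

# Hypothetical values; the real pipeline derives its settings from per-language statistics.
CONFIGS = {
    "swh_Latn": LangConfig(lid_threshold=0.65, min_words=20),
    "tur_Latn": LangConfig(lid_threshold=0.70, min_words=15),
}

def label_language(lid_model, text: str):
    """Return (language label, confidence) from a fastText-style LID model."""
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__"), float(probs[0])

def keep_document(lid_model, text: str) -> bool:
    """Keep a document only if its LID confidence and simple quality signals clear
    the thresholds chosen for its language, rather than one global cutoff."""
    lang, conf = label_language(lid_model, text)
    cfg = CONFIGS.get(lang)
    if cfg is None or conf < cfg.lid_threshold:
        return False
    return len(text.split()) >= cfg.min_words  # crude proxy; real filters are richer
```

The point of the sketch is the lookup: every threshold is indexed by language, so handling a new language means adding a new row of statistics, not rewriting the pipeline.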

One of the most striking ideas is rehydration, or duplication-aware upsampling. Instead of simply removing duplicates, FineWeb2 records how many documents sit in each duplicate cluster and uses that metadata to decide which content to upsample. The result is a dataset that preserves diversity while still prioritizing higher-quality material. In practice, this means the pipeline can scale to thousands of languages without letting the common traps of web data—noise, repetition, and quality gaps—drag the models down. The upsampling approach is not a blunt trick; it’s a principled way to balance quantity and quality across a world of linguistic diversity.
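A rough sketch of that bookkeeping in Python: assume each surviving document carries the size of the duplicate cluster it represented, recorded during deduplication. The cap and the simple repeat-count scheme below are illustrative assumptions, not the published recipe.

```python
def rehydrate(documents, max_repeats: int = 3):
    """Duplication-aware upsampling sketch: a document that stood in for a large
    duplicate cluster is emitted more often (up to a cap), so widely replicated
    content gets extra weight without reintroducing the raw duplicates.

    `documents` is an iterable of dicts with "text" and "cluster_size" keys,
    the metadata recorded during deduplication in this sketch."""
    for doc in documents:
        repeats = min(doc.get("cluster_size", 1), max_repeats)
        for _ in range(repeats):
            yield doc["text"]

# A page that had five near-duplicates is emitted three times (the cap);
# a one-off page is emitted once.
sample = [{"text": "widely mirrored article", "cluster_size": 6},
          {"text": "one-off blog post", "cluster_size": 1}]
print(list(rehydrate(sample)))
```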

What makes this approach compelling is its data-driven ethos. The pipeline isn’t merely applying a fixed English-centric standard elsewhere; it tailors the filtering and deduplication to each language’s distinctive signals. This unlocks better performance in languages that have historically lagged behind in multilingual datasets, simply because a one-size-fits-all pipeline was never designed to handle them.

FineWeb2: A dataset that spans thousands of tongues

The second major pillar of the paper is the dataset itself. FineWeb2 is built from 96 Common Crawl snapshots, spanning 2013 through 2024, and grows into a massive, 20-terabyte multilingual corpus comprising 5 billion documents across 1,868 language-script pairs. The scale is impressive, but what matters more is the curated quality and linguistic breadth. The developers are careful to document that, even at this scale, the data for many low-resource languages remains dramatically biased toward Bible and Wikipedia sources. The authors don’t pretend this isn’t a problem; they highlight it as a real constraint and use it as a call to action for the field: data diversity and licensing matter just as much as sheer volume. The dataset is released with permissive licensing (ODC-By) and accompanied by the pipeline, training, and evaluation code, inviting the broader community to iterate, improve, and expand on these foundations.

The team makes a point of giving attention to the practicalities of working with thousands of languages. They adopt ISO language codes to avoid mislabeling across scripts, acknowledge the difficulties of segmentation in languages like Chinese or Thai, and assemble a large set of tokenizers trained for many scripts. They even design a language-label confidence framework that adapts per language, rather than forcing a global threshold that would misfit many languages. It’s not just about getting more languages into a model; it’s about getting better, linguistically authentic data into those languages.
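One way to picture that per-language confidence idea is a sketch like the following, which assumes a small trusted reference corpus exists for each language (its Wikipedia, say) and picks the cutoff as a low percentile of the identifier's confidence on that trusted text. The percentile and the helper function are illustrative, not the paper's exact procedure.

```python
def pick_lid_threshold(reference_confidences, percentile: float = 10.0) -> float:
    """Choose a language-specific LID confidence cutoff as a low percentile of the
    scores the identifier assigns to trusted text in that language. Languages the
    identifier is naturally less sure about get a gentler threshold, instead of
    one global cutoff that would misfit them."""
    scores = sorted(reference_confidences)
    index = max(0, int(len(scores) * percentile / 100.0) - 1)
    return scores[index]

# A language the identifier separates cleanly vs. one it is less certain about.
print(pick_lid_threshold([0.95, 0.97, 0.92, 0.99, 0.96]))  # high cutoff (0.92)
print(pick_lid_threshold([0.55, 0.70, 0.62, 0.58, 0.66]))  # lower cutoff (0.55)
```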

In their evaluation, the researchers focus on nine canary languages—Arabic, Chinese, French, Hindi, Russian, Swahili, Telugu, Thai, and Turkish—to perform a rigorous, language-by-language ablation study. They test how each component—LID, deduplication, filtering, and rehydration—shapes downstream learning, controlling for architecture and training scale. The results show that each processing step yields measurable uplift, with rehydration providing a notable lift even when deduplication has a mixed impact across languages.

Beyond the nine canaries, FineWeb2 scales to unseen languages and covers over 1,000 languages in total, demonstrating that a well-designed, language-aware data pipeline can generalize beyond the languages it was tuned on. The authors compare FineWeb2 against a spectrum of multilingual datasets and language-specific datasets. On 11 of 14 languages tested, FineWeb2-trained models outperform those trained on prior multilingual datasets, underscoring the promise of adaptive, data-driven curation for true linguistic inclusivity. They acknowledge that hand-curated, language-specific datasets can still come out ahead in some cases, but the overall gains and generalization are striking.

One caveat the authors stress: not all low-resource languages will suddenly bloom from better pipelines alone. When data sources are Bible- or Wikipedia-heavy, the amount of truly diverse, domain-rich material can still be scarce. The distribution charts in FineWeb2 show that a large share of languages rely on a narrow slice of sources, which can skew learning differently than if more varied data were available. Still, the team argues that providing the research community with the pipeline and data—and transparently reporting these biases—gives others a clear path to improvement.

Why this matters for the future of language technology

None of this is merely an academic exercise in data wrangling. The implications ripple through the entire field of language technology. If we want a future where language models serve people in their own languages, the data must reflect that diversity, not just the whim of global web traffic. FineWeb2’s adaptive pipeline makes this possible at scale, muting the long-standing bias toward high-resource languages and creating an express lane for thousands of languages to participate in the training story. In that sense, the work reads like a map of the linguistic world with more rooms, more doors, and more routes to learn.

Another striking takeaway is the emphasis on evaluation signals that are meaningful across languages. The authors developed a novel selection process for “early-signal” tasks—benchmarks that yield informative, monotonic, and robust improvements in the very early stages of training. That matters because multilingual pre-training is expensive; knowing which tasks reliably reflect linguistic learning helps researchers prune wasted effort and accelerate progress in languages that have historically been neglected. The careful cross-language evaluation framework helps ensure that gains are not just a quirk of English or a few busy languages but are truly transferable.
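To give a flavor of what such a selection criterion can look like, here is a small sketch that scores a benchmark by how monotonically it improves across early checkpoints and by how large that improvement is relative to run-to-run noise. The statistics and any cutoff you would apply to them are illustrative assumptions, not the paper's exact procedure.

```python
from scipy.stats import spearmanr

def early_signal_score(steps, scores, noise_std: float):
    """Score a benchmark for 'early signal': monotonic improvement across early
    checkpoints (Spearman rank correlation with training steps) and a
    signal-to-noise ratio comparing total improvement to run-to-run noise."""
    monotonicity, _ = spearmanr(steps, scores)
    snr = (scores[-1] - scores[0]) / noise_std if noise_std > 0 else float("inf")
    return monotonicity, snr

# Accuracy of a hypothetical benchmark at a few early checkpoints.
steps = [1_000, 2_000, 4_000, 8_000, 16_000]
accuracy = [0.26, 0.28, 0.31, 0.35, 0.40]
mono, snr = early_signal_score(steps, accuracy, noise_std=0.01)
print(f"monotonicity={mono:.2f}, snr={snr:.1f}")  # keep tasks that score high on both
```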

The FineWeb2 project also opens a practical pathway for the broader AI community: open datasets, open pipelines, and a transparent account of what data sources exist, where they come from, and how they were filtered. The authors release both the preliminary pre-filtering data and the final filtered version, plus the code, to invite methodological experimentation. This is a rare gift in a field that often treats data curation as a black box. It also nudges the ecosystem toward responsible openness—one where researchers can audit, reproduce, and build upon each other’s work to improve language coverage, quality, and fairness.

The collaboration behind FineWeb2—rooted in EPFL and Hugging Face, with contributions from experts across tokenization, language identification, data filtering, and evaluation—embodies a modern R&D model: diverse teams, shared tooling, and a willingness to publish the playbook so others can rewrite parts of it for their own languages, regions, and communities. Guilherme Penedo, the lead author, and colleagues from these institutions have given the field a concrete blueprint for scaling multilingual data curation without surrendering linguistic nuance in the process.

Bottom line: FineWeb2 isn’t just a dataset; it’s a philosophy of data work for language AI. It reframes what it means to train models that understand and generate text in hundreds or thousands of languages. If the internet remains the primary loom for data, then the loom must be adaptable enough to weave every tongue—carefully, transparently, and at scale. This work points toward a future where language technology does not sideline the world’s linguistic variety but actively invites it into the center of AI progress.

In short, the FineWeb2 approach is a credible, practical scaffold for a more inclusive digital future. It is a reminder that the biggest leaps in AI might come not from bigger models alone, but from smarter data—curated with a conscience for language diversity, backed by institutions that value openness, and guided by researchers who see data as a bridge between people and the information they seek.

In the end, the paper’s authors anchor their work in real-world implications: a scalable, language-aware pre-training pipeline that can handle thousands of languages, and a companion dataset that mirrors the world’s linguistic breadth. It’s a bold step toward AI that speaks with many voices, not just the loudest one in the room.

Credits: The study was conducted through a collaboration between Hugging Face and EPFL, with Guilherme Penedo as the lead author and a diverse team including Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro von Werra, and Thomas Wolf. The FineWeb2 pipeline, dataset, and accompanying code are all released to encourage the global NLP community to iterate, improve, and expand the reach of multilingual AI.