
JRC-ACQUIS Multilingual Parallel Corpus, Version 3.0

This is the download page for Version 3.0 of the aligned multilingual parallel corpus JRC-ACQUIS. The dataset contains resources for the following languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish.

News: Version 3.0 of the corpus includes documents from 2005 and 2006, which almost triples the total size compared to Version 2.2. Bulgarian has been added as the 22nd language. The corpus contains 463,792 texts and a total of over one billion words.

The pairwise alignment for all 231 language pairs is now available, using two alternative alignment tools: Vanilla and HunAlign.
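
The figure of 231 follows from the 22 languages: each unordered pair of distinct languages is counted once, giving 22 × 21 / 2 = 231. A minimal Python sketch of that count, using the language names as listed on this page; the function name `language_pairs` is hypothetical and is not part of the corpus distribution:

```python
from itertools import combinations

# Language names exactly as listed on this page (22 in Version 3.0).
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "German", "Greek", "English",
    "Spanish", "Estonian", "Finnish", "French", "Hungarian", "Italian",
    "Lithuanian", "Latvian", "Maltese", "Dutch", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages (hypothetical helper)."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages ->", len(pairs), "pairs")
```

Each pair appears once regardless of direction, so (Bulgarian, Czech) is included but (Czech, Bulgarian) is not listed separately.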

See the history of changes in the news.

Note: Some corrections have been made to the Bulgarian corpus. The online version was modified on 13/07/2007. Please replace your copy of the Bulgarian corpus if you downloaded it before that date.

Information about the corpus and the alignment

  1. Documentation

Download:

  1. AC Corpus - version 3.0 (by language)
  2. AC aligned corpus using Vanilla aligner
  3. AC aligned corpus using HunAlign

By downloading these resources, you agree to the usage conditions.

Previous version:

  1. JRC-ACQUIS Multilingual Parallel Corpus, Version 2.2

This multilingual parallel corpus has been compiled by the Language Technology team of the European Commission's Joint Research Centre (JRC).



Page last updated 2008-05-30, LT Group - JRC
