Internet Access to Corpora: The AC/DC project

AC/DC stands for Acesso a corpora/Disponibilização de corpora ("access to and availability of corpora"). The project is one of the activities of Linguateca, previously the Computational Processing of Portuguese project.

The corpora can be accessed at http://acdc.linguateca.pt/acesso/. This service was launched on 23 September 1999.

Main goals of the AC/DC project

Rationale within Linguateca

One of the goals of Linguateca is to improve significantly the conditions for NLP of Portuguese. With the AC/DC (sub)project we expect to contribute to the first two aims.


Internet access to corpora

Corpora: collections of texts (oral transcribed, spoken, written) encoded in a format easily searchable for purposes of linguistic investigation and NLP

Internet access: Web interface to a corpus workbench, in the present case the IMS Corpus Workbench.

Why should one provide Internet access to corpora?

  1. Because corpus creation and dissemination are both time-consuming processes.
  2. Because the WWW provides a user-friendly and well-known interface that can help reduce inequalities based on geographical distribution (in terms of computer access and/or resources). Furthermore, it allows for distributed and collaborative projects involving different sites (networks).

The process

What is involved in providing Internet access to corpora?
  1. Physically obtain the corpus
  2. Get authorization to make it available
  3. Encode the corpus in the appropriate corpus workbench (in our case, the IMS Corpus Workbench)
  4. Add information to the corpus (ranging from sentence and paragraph separation to parsing it with PALAVRAS, Eckhard Bick's constraint grammar parser for Portuguese based at the VISL project)
  5. Create a Web interface
  6. Document all the steps so that the interface can be reliably used for serious investigation (version tracking, linguistic options taken, problems known)
  7. Announce it widely
  8. Monitor its use and improve the service accordingly

In our case, the raw corpus (the textual data) came in formats ranging from simple text to SGML, while the information added ranged from paragraph marking to full parsing of the text. See our paper (Santos and Bick) at the LREC 2000 conference for a more detailed description of the process of annotating the corpora.
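
As a rough illustration of steps 3 and 4, the sketch below turns plain text into the one-token-per-line ("verticalized") layout that the IMS Corpus Workbench reads, with `<p>` and `<s>` tags marking paragraphs and sentences. The function name `verticalize` is hypothetical, and the sentence splitting and tokenization are deliberately naive regular expressions. They are assumptions made for the example, not the rules actually applied to the AC/DC corpora.

```python
import re

def verticalize(text):
    """Return text as one token per line, wrapped in <p> and <s> markup.

    Assumptions (not the AC/DC rules): paragraphs are separated by blank
    lines, sentences end at ., ! or ? followed by whitespace, and a token
    is either a run of word characters or a single punctuation mark.
    """
    lines = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        lines.append("<p>")
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            tokens = re.findall(r"\w+|[^\w\s]", sentence)
            if not tokens:
                continue
            lines.append("<s>")
            lines.extend(tokens)
            lines.append("</s>")
        lines.append("</p>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(verticalize("O gato dorme. A casa é grande.\n\nFim."))
```

Annotation produced later (for example by a parser) would typically be added as further tab-separated columns on each token line.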

Kinds of queries

We give here a short overview of the kinds of things that can be done with (linguistically annotated) corpora. Actual examples of how to do (some of) these things can be found on the Exemplos, Examples and Anotacao pages.
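
To give a flavour of what such queries look like, here is a minimal sketch that assembles query strings in the CQP syntax of the IMS Corpus Workbench, where each token is described by attribute constraints in square brackets. The helper names `token` and `sequence` are hypothetical, and the attribute names `lema` and `pos` and their values are assumptions for illustration. The attributes actually available depend on how each corpus was annotated.

```python
def token(**constraints):
    """Build one CQP token expression, e.g. [word="de" & pos="PRP"].

    With no constraints the result is [], which matches any token.
    Values are inserted as-is, so they must not contain double quotes.
    """
    if not constraints:
        return "[]"
    parts = ['{}="{}"'.format(attr, value) for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"

def sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens) + ";"

if __name__ == "__main__":
    # A form of the (assumed) lemma "casa" followed by an adjective.
    print(sequence(token(lema="casa"), token(pos="ADJ")))
```

The resulting string could then be pasted into the query box of the Web interface or sent to CQP directly.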

Corpora currently offered and their rough characterization

For each corpus, we present its size (in units, words and sentences) and a short description. We present the information about the simple and the annotated version of each corpus separately, given that there are some differences between the two versions.

(Values computed on the 13th January 2004)

Corpus            Size (units)  Size (words)  Size (sentences)  Short description
natura            7.257.175     6.257.950     225.673           Newspaper text of PÚBLICO, Portugal, 1991-1994, 2 paragraphs a day
natpanot          7.321.642     6.268.817     225.734
enpcpub           89.864        72.244        4.369             Translated fiction from five novels in English, from the ENPC
enpcanot          90.574        72.392        4.369
minho             2.083.761     1.738.475     53.040            Newspaper text of local periodical, Diário do Minho, full articles before proofreading
minhanot          2.107.826     1.747.274     53.185
eci-ebr           891.687       722.012       45.530            Brazilian text: fiction, non-fiction, from the Borba-Ramsey corpus
ebranot           898.542       723.007       44.689
eci-ee            30.157        26.515        780               Call of ESPRIT program in European Portuguese, from the ECI
eeanot            31.127        27.140        780
saocarlos         41.372.943    32.091.996    1.955.166         Brazilian text, mainly from newspapers, but also didactic material and business letters
scanot            41.948.319    32.385.765    1.952.829
frasespp          19.340        16.225        594               Sentence corpus in European Portuguese
fppanot           19.542        16.208        594
frasespb          22.486        19.155        651               Sentence corpus in Brazilian Portuguese
fpbanot           22.730        19.165        651
cetempublicoprmi  1.198.015     997.695       38.151            Newspaper text from PÚBLICO, Portugal, 1991-1998, extracts of two paragraphs in a random order
cpprmianot        1.202.938     995.851       38.251
ancib             811.739       650.045       25.798            Brazilian e-mail corpus - traffic in the ANCIB list (libraries and information science in Brazil)
ancibanot         828.475       660.045       25.596
diaclav           7.441.109     6.488.273     228.856           Newspaper text of local periodicals, Diário de Coimbra, Diário de Leiria, Diário de Aveiro, Viseu Diário, Portugal, 1999-2000
diaclavanot       7.529.495     6.549.823     210.741
avante            7.607.651     6.488.201     204.686           Newspaper text, political party weekly newspaper, Avante, Portugal, 1997-2002
avantanot         7.685.242     6.512.510     204.833
amostra           124.655       98.444        4.925             Selection of texts from the NILC corpus, in Brazilian Portuguese, including texts from the didactic, journalistic and literary styles
amostranot        124.836       98.505        4.965
classlppe         1.872.381     1.307.334     74.174            Literary text (prose, drama and poetry) of Portuguese "classical" 16th to 19th century writers
Total raw         70.822.776    56.974.564    2.862.393         All raw corpora except for CETEMPúblico
Total tagged      69.811.288    56.076.502    2.767.217
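
The figures in the table use a dot as the thousands separator, so they need converting before any arithmetic. As a small worked example, the sketch below parses two of the values listed for natura and derives the average number of words per sentence. This measure is not reported on this page and is computed here only for illustration. The function names are hypothetical.

```python
def parse_count(text):
    """Convert a count written with dots as thousands separators to an int."""
    return int(text.replace(".", ""))

def words_per_sentence(words, sentences):
    """Average sentence length, from the word and sentence counts of a corpus."""
    return parse_count(words) / parse_count(sentences)

if __name__ == "__main__":
    # Values for the raw natura corpus, copied from the table.
    average = words_per_sentence("6.257.950", "225.673")
    print("natura: {:.1f} words per sentence".format(average))
```

The same two functions can be applied to any other row to compare raw and annotated versions.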

We provide more extensive documentation and information, in Portuguese, about the raw corpora, the annotated corpora and the actual processing and encoding of the several kinds of information present in the corpora (tokenization, sentence separation and annotation).

Related projects

We are currently engaged in the following related projects:



Last updated: 13 January 2004.
We would like to receive your feedback. Please send us your questions, comments and suggestions.