The Floresta Sintá(c)tica project



Floresta Sintá(c)tica ("syntactic forest") is a publicly available treebank for Portuguese, created as a collaboration between the VISL project (http://visl.sdu.dk) and Linguateca (http://www.linguateca.pt), formerly the Computational Processing of Portuguese project.

The Floresta is based on human revision of the output of the PALAVRAS parser, developed by Eckhard Bick for his PhD (1994-2000) at the University of Århus (Denmark). The parser is available on the Web at the VISL project site (http://visl.sdu.dk). More information about the parser can be found in: Bick, Eckhard. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus University Press, 2000.

The textual material of the Floresta comes from the CETEMPúblico and CETENFolha corpora (more precisely, their first one million words).

Documentation

Please see the LREC 2002 paper for a general description of the project. All information available in English so far is listed below:

Team

Project leaders: Diana Santos and Eckhard Bick.

Linguistic revision
Susana Afonso (November 2000 to the present)
Raquel Marchi (November 2000 to September 2001; Jan 2003 to the present)
Anabela Barreiro Colasuonno (May-December 2002)

Tool development
Renato Haber (November 2000 to September 2001)
Luís Sarmento (November-December 2002)
Rui Vilela (August 2004 to the present)

Results

The Floresta Sintá(c)tica project has so far produced:

Each tree of our treebank corresponds to three different objects:

  1. CG representation in text format
  2. Phrase tree in text format
  3. Phrase tree in graphical format

Objects 2 and 3 contain exactly the same information and differ only in presentation, while object 1 contains neither constituents nor attachment information (only dependency). We have some example sentences to illustrate the three objects.

Access

Bosque

One can download the phrase trees that constitute the Bosque:

They can also be individually inspected in graphical format at the VISL site, Portuguese zone, choosing, under "Non-automatic parse", "Floresta sintá(c)tica treebank", http://visl.sdu.dk/visl/pt/floresta.html?S=cetemcorpus#top.

Alternatively, at the VISL site, Portuguese zone, choose, under "Non-automatic parse", "Pre-analysed Portuguese sentences", then "Newspaper corpus treebank (Floresta)", http://visl.sdu.dk/visl/pt/treebank.html, and click on the graphical tree icon preceding each sentence.

The Bosque is also available in the Penn Treebank and TIGER formats, in XML, through the work of the Braga node of Linguateca; see the Floresta page at Braga.
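
For readers unfamiliar with Penn Treebank style bracketing, the following is a minimal sketch of how a bracketed phrase tree can be read into a nested structure. It assumes the plain parenthesised notation `(LABEL child child ...)`; the sample sentence and its labels are invented for illustration and are not taken from the Bosque, whose converted files may use different labels or additional markup.

```python
from typing import List, Tuple, Union

# A tree is (label, children); a child is either a subtree or a word (str).
Tree = Tuple[str, List[Union["Tree", str]]]


def parse_brackets(text: str) -> Tree:
    """Parse one Penn Treebank style bracketed tree into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())   # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    tree = parse()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree: Tree) -> List[str]:
    """Return the words of the tree in left-to-right order."""
    words: List[str] = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example; labels are illustrative, not the Bosque tag set.
SAMPLE_TREE = "(S (NP (PRON Ela)) (VP (V dorme)))"
tree = parse_brackets(SAMPLE_TREE)
print(tree[0], leaves(tree))
```

The sketch does no error recovery: unbalanced brackets raise an exception rather than being repaired.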

Bosque 7.3 was used for the CoNLL-X shared task on multilingual dependency parsing. We are grateful to Sabine Buchholz for processing the Bosque and making it available for the CoNLL-X exercise. The data provided here were prepared by her and her team; we simply make them available as is.
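
As an illustration of how data in the CoNLL-X format can be read, here is a minimal sketch. It relies only on the general layout of the shared-task format: one token per line, ten tab-separated columns, and a blank line between sentences. The function name `read_conllx` and the sample tokens and tags are invented for illustration; they are not taken from the Bosque files.

```python
from typing import Dict, List

# The ten columns of the CoNLL-X shared task data format, in order.
CONLLX_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conllx(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError("expected 10 columns, got %d" % len(fields))
        current.append(dict(zip(CONLLX_COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sentence; tags are illustrative, not from Bosque.
SAMPLE = (
    "1\tEla\tela\tpron\tpron-pers\tF|3S|NOM\t2\tSUBJ\t_\t_\n"
    "2\tdorme\tdormir\tv\tv-fin\tPR|3S|IND\t0\tSTA\t_\t_\n"
)

sentences = read_conllx(SAMPLE)
for token in sentences[0]:
    # HEAD is the ID of the governing token; 0 marks the root.
    print(token["id"], token["form"], "->", token["head"], token["deprel"])
```

All values are kept as strings so that the underscore placeholder for empty fields is preserved unchanged.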

Finally, they can be queried through

Floresta Virgem

Floresta Virgem can also be queried through Águia, a tool for searching the Floresta treebank, as well as obtained as two single files:
Last update: 26 September 2006.
Comments and suggestions about the Floresta treebank