REBECA - A Bilingual Lexical-Conceptual Database of the Vehicle Domain
- Category
PhD Degree
Supervisor: Bento Carlos Dias-da-Silva
- Goal
Owing to factors such as perceptual salience and semiotic relevance,
languages differ in their inventories of lexicalized concepts (i.e.,
concepts expressed by lexical units). These lexical-conceptual
divergences hinder the computational treatment of natural languages
in tasks such as machine translation and cross-language information
retrieval.
Consequently, the construction of bilingual and multilingual lexical
databases, in which the lexical units of different languages are
aligned by their underlying concepts, has become an important
research topic in Natural Language Processing (NLP).
For Brazilian Portuguese (BP), in particular, such resources are
urgently needed. Against this background, the project aims to
investigate the lexicalization patterns of BP and to develop a
lexical-conceptual resource for the automatic processing of written
BP.
Assuming a compromise between NLP and Linguistics, this work follows
a three-domain methodology, under which research activities are
divided into the linguistic, linguistic-computational, and
computational domains.
- Description and Results
In the linguistic domain, a set of lexicalized concepts of North
American English (AmE) was extracted from Princeton WordNet (WN.Pr)
and selected through manual analysis of structured resources (lexical
databases and standard dictionaries) and unstructured resources
(textual corpora). This set covers a large number of lexicalized
concepts of the "vehicle domain". For those concepts, the
corresponding lexical and phrasal expressions in BP were manually
compiled from bilingual dictionaries, with the help of standard
monolingual dictionaries, thesauri, and textual corpora.
In the linguistic-computational domain, the previously identified
lexicalized concepts of AmE and BP were aligned by means of a
semantically structured interlingua (or ontology). The interlingua
is composed of the same set of concepts extracted from WN.Pr, and its
structure relies on MultiNet, a knowledge representation formalism
that provides the means for describing the semantics of natural
language expressions.
The alignment was carried out in the Protégé-OWL editor, one of the
most popular tools for creating and editing ontologies. The result is
a bilingual lexical-conceptual database named REBECA, in which part
of the BP lexicon is strictly aligned with part of WN.Pr.
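
The idea of aligning lexical units of two languages through shared
concepts can be illustrated with a minimal Python sketch. It is not
the REBECA implementation, which was built as an ontology in
Protégé-OWL. The class and method names, the concept identifiers, and
the sample entries are all assumptions made for illustration; in
particular, PLACEHOLDER_CONCEPT is an invented entry and not a claim
about an actual lexical gap in BP.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """One interlingua node with the lexical units that express it."""
    concept_id: str  # hypothetical identifier, not a WN.Pr synset ID
    gloss: str
    # Maps a language code ("en", "pt") to its lexical units.
    lexical_units: Dict[str, List[str]] = field(default_factory=dict)


class BilingualLexicon:
    """Toy concept-indexed store; illustrative only."""

    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add_concept(self, concept_id: str, gloss: str) -> None:
        self.concepts[concept_id] = Concept(concept_id, gloss)

    def add_unit(self, concept_id: str, language: str, unit: str) -> None:
        units = self.concepts[concept_id].lexical_units
        units.setdefault(language, []).append(unit)

    def translate(self, unit: str, source: str, target: str) -> List[str]:
        """Return target-language units sharing a concept with `unit`."""
        results: List[str] = []
        for concept in self.concepts.values():
            if unit in concept.lexical_units.get(source, []):
                results.extend(concept.lexical_units.get(target, []))
        return results

    def gaps(self, language: str) -> List[str]:
        """Return IDs of concepts with no lexical unit in `language`."""
        return [
            c.concept_id
            for c in self.concepts.values()
            if not c.lexical_units.get(language)
        ]


# Illustrative entries only; they are not taken from REBECA.
db = BilingualLexicon()
db.add_concept("CAR", "a motor vehicle with four wheels")
db.add_unit("CAR", "en", "car")
db.add_unit("CAR", "en", "automobile")
db.add_unit("CAR", "pt", "carro")
db.add_unit("CAR", "pt", "automóvel")

# Invented concept with an English unit only, to show gap detection.
db.add_concept("PLACEHOLDER_CONCEPT", "invented entry for illustration")
db.add_unit("PLACEHOLDER_CONCEPT", "en", "placeholder-word")
```

Under these assumptions, db.translate("car", "en", "pt") looks up
equivalents through the shared concept rather than through a direct
word-to-word table, and db.gaps("pt") lists concepts that have no BP
expression recorded, which is how lexical-conceptual divergences
would surface in such a structure.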
- Download
- Current Status
- Support
CNPq
- Contact