CINTIL DependencyBank PREMIUM is a corpus of Portuguese utterances manually annotated with grammatical dependency relations, together with part-of-speech, inflection and lemma information. It is being developed and maintained at the University of Lisbon. The current version is composed of 7,600 sentences (204,526 tokens) taken from Portuguese newspaper articles.
The approach we follow is to build on top of an existing resource by adding a new annotation layer. We take the existing CINTIL corpus (Barreto et al., 2006), a 1-million-token corpus already annotated with manually verified information on part-of-speech, morphology and named entities, and add syntactic function tags by automatically analysing it with a state-of-the-art dependency parser (LX-DepParser). This tentative automatic annotation is then manually corrected. The manual correction is done by two annotators under a double-blind scheme, which is followed by adjudication by a third annotator. This process is supported by a general-purpose annotation tool, WebAnno (https://code.google.com/p/webanno/).
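As an illustration only, the comparison and adjudication steps described above can be sketched as follows. This is not how WebAnno implements them. The representation of an annotation as a list of (head position, relation label) pairs, the function names `find_disagreements` and `adjudicate`, the example sentence and the relation labels are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

# One token's annotation: (position of the head token, dependency relation label).
# Head position 0 marks the root. This layout is an assumption of the sketch.
Arc = Tuple[int, str]


def find_disagreements(annotator_a: List[Arc], annotator_b: List[Arc]) -> List[int]:
    """Return the 1-based token positions where the two annotations differ."""
    if len(annotator_a) != len(annotator_b):
        raise ValueError("both annotations must cover the same tokens")
    return [
        position
        for position, (arc_a, arc_b) in enumerate(zip(annotator_a, annotator_b), start=1)
        if arc_a != arc_b
    ]


def adjudicate(
    annotator_a: List[Arc],
    annotator_b: List[Arc],
    decisions: Dict[int, Arc],
) -> List[Arc]:
    """Build the final annotation.

    Tokens on which both annotators agree keep the shared arc; every disputed
    token takes the arc chosen by the third annotator in `decisions`.
    """
    disputed = set(find_disagreements(annotator_a, annotator_b))
    missing = sorted(disputed.difference(decisions))
    if missing:
        raise ValueError(f"no adjudication for token positions {missing}")
    return [
        decisions[position] if position in disputed else arc
        for position, arc in enumerate(annotator_a, start=1)
    ]


# Hypothetical five-token sentence "O João leu o livro" with made-up labels.
first = [(2, "det"), (3, "subj"), (0, "root"), (5, "det"), (3, "obj")]
second = [(2, "det"), (3, "subj"), (0, "root"), (5, "det"), (3, "mod")]

disputed_positions = find_disagreements(first, second)
final = adjudicate(first, second, {5: (3, "obj")})
```

In this example the two annotators differ only on the fifth token, so only that token needs a decision from the adjudicator; all other arcs are carried over unchanged.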
The main motivation for creating this resource was to provide a corpus with a large variety of annotated phenomena that can be used to train statistical dependency parsers for applications that deal with unrestricted text. In addition, it enables linguistic studies that need to search the corpus for specific dependency structures.
This work was partly funded by the Portuguese Foundation for Science and Technology through the Portuguese project DP4LT (PTDC/EEI-SII/1940/2012) and by the European Commission through project QTLeap (EC/FP7/610516).