CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese texts annotated with deep grammatical information. This document refers to version 1.4 of the corpus, from January 2016, which adds over 15,400 annotated sentences to the previous version from September 2015.
The current version is composed of 32,497 sentences (319,040 tokens). Most are taken from two different sources and domains: news (31,304 sentences; 311,510 tokens) and novels (399 sentences; 2,547 tokens). The remaining 794 sentences (4,983 tokens) are used for regression testing of the computational grammar that supported the annotation of the corpus (see Section 4.6 of the documentation).
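As a quick consistency check, the sub-corpus figures above can be summed and compared with the stated totals. The following is only an illustrative sketch: the dictionary keys and variable names are hypothetical labels and are not identifiers used in the corpus distribution.

```python
# Sentence and token counts per sub-corpus, copied from the figures above.
# The keys are illustrative labels, not names used in the corpus itself.
composition = {
    "news": (31_304, 311_510),
    "novels": (399, 2_547),
    "regression_tests": (794, 4_983),
}

# Sum sentences and tokens across the three parts.
total_sentences = sum(sentences for sentences, _ in composition.values())
total_tokens = sum(tokens for _, tokens in composition.values())

print(total_sentences, total_tokens)  # 32497 319040

# Report each part's share of sentences and its average sentence length.
for name, (sentences, tokens) in composition.items():
    share = 100 * sentences / total_sentences
    avg_len = tokens / sentences
    print(f"{name}: {share:.1f}% of sentences, {avg_len:.1f} tokens per sentence")
```

The three parts add up exactly to the reported 32,497 sentences and 319,040 tokens.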
CINTIL-DeepBank includes several levels of information for each sentence: the derivation tree produced during parsing, the syntactic constituency tree, different renderings of MRS-based representations of its meaning (Copestake et al., 2005), and its fully-fledged grammatical representation in AVM format. These are the result of a semi-automatic annotation process in which automatic analysis by the grammar is followed by double-blind annotation and then by adjudication (see Branco and Costa, 2008, for a full description of the process).
The main motivation behind the creation of this resource was to build a high-quality data set with rich grammatical information that could support the development of a large set of high-level language resources and processing tools for Portuguese.
The development of this resource started under the project SemanticShare - Resources and Tools for Semantic Processing (see http://nlx.di.fc.ul.pt/projects.html), whose main goal was to produce a corpus of Portuguese with deep linguistic annotation and manually verified grammatical representations. It was continued in the project METANET4U - Enhancing the Linguistic Infrastructure of Europe and in the project QTLeap - Quality Translation by Deep Language Engineering Approaches.