The CINTIL-DeepBank (Branco et al., 2010) is a corpus of Portuguese sentences with deep grammatical annotation. It comprises 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens) and novels (399 sentences; 3,082 tokens) (see section 3.2). These totals also include 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus (cf. section 4.6).
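The figures above can be cross-checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from the description, while the dictionary keys and variable names are assumptions made for the example and do not correspond to any file or field in the corpus distribution.

```python
# Sentence and token counts per subset, copied from the description above.
# The keys ("news", "novels", "regression_tests") are illustrative labels only.
subsets = {
    "news": (8861, 101430),
    "novels": (399, 3082),
    "regression_tests": (779, 5654),
}

# Sum the per-subset counts to recover the corpus totals.
total_sentences = sum(sentences for sentences, _ in subsets.values())
total_tokens = sum(tokens for _, tokens in subsets.values())

# Report each subset with its share of the total number of sentences.
for name, (sentences, tokens) in subsets.items():
    share = 100 * sentences / total_sentences
    print(f"{name}: {sentences} sentences ({share:.1f}%), {tokens} tokens")

print(f"total: {total_sentences} sentences, {total_tokens} tokens")
```

The three subsets add up to the stated totals of 10,039 sentences and 110,166 tokens, which shows that the regression-testing sentences are counted within the overall corpus size.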
For each sentence, the CINTIL-DeepBank provides MRS (Minimal Recursion Semantics) and AVM (attribute-value matrix) representations, a derivation tree, and a syntactic tree with grammatical and semantic labels. These annotations are the result of a semi-automatic analysis with double-blind annotation followed by adjudication (see Branco and Costa, 2008, for a full description of the process). The resulting dataset contains one information level: semantic relations.
The main motivation behind the creation of this resource was to build a high-quality data set with syntactic information that could support the development of a wide range of automatic resources and tools for the natural language processing (NLP) of Portuguese.
The development of this resource started under the project SemanticShare – Resources and Tools for Semantic Processing (http://nlx.di.fc.ul.pt/projects.html), whose main goal was to produce a corpus of Portuguese with deep linguistic annotation and manually verified grammatical representations.