This tool was built to handle Portuguese-specific issues in syntactic categorization. It assigns a single morpho-syntactic tag, from the tagset below, to every token. The tag is attached to the token with a slash (/) as separator:
um exemplo → um/IA exemplo/CN
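The token/TAG format above is easy to read back into pairs. The following is a minimal sketch, assuming whitespace-separated output and splitting on the last slash so that a token which itself contains a slash is kept whole; the function name `parse_tagged` is hypothetical and not part of the tool.

```python
def parse_tagged(line):
    """Split tagger output such as 'um/IA exemplo/CN' into (token, tag) pairs.

    Assumption: items are whitespace-separated and the tag follows the
    LAST slash of each item.
    """
    pairs = []
    for item in line.split():
        token, sep, tag = item.rpartition("/")
        if not sep or not token or not tag:
            raise ValueError(f"not in token/TAG form: {item!r}")
        pairs.append((token, tag))
    return pairs


print(parse_tagged("um/IA exemplo/CN"))
```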
Each token in a multi-token expression receives the tag of that expression, prefixed by "L" and followed by the token's position within the expression:
de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
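The positional tags make it possible to reassemble a multi-token expression from the tagged output. The sketch below is one way to do it, under the assumption that such tags always match "L", then uppercase letters, then a number; the name `group_expressions` and the regular expression are illustrative and should be checked against the actual tagset.

```python
import re

# Assumption: multi-token tags look like "L" + TAG + position, e.g. LCJ1.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def group_expressions(pairs):
    """Merge consecutive tokens tagged L<TAG>1, L<TAG>2, ... into one unit."""
    groups = []  # each entry: [tokens, tag, last position or None]
    for token, tag in pairs:
        match = MWE_TAG.match(tag)
        if match is None:
            groups.append([[token], tag, None])
            continue
        base, pos = match.group(1), int(match.group(2))
        prev = groups[-1] if groups else None
        # Continue the previous expression only if tag and position line up.
        if pos > 1 and prev and prev[1] == base and prev[2] == pos - 1:
            prev[0].append(token)
            prev[2] = pos
        else:
            groups.append([[token], base, pos])
    return [(" ".join(tokens), tag) for tokens, tag, _ in groups]


tagged = [("de", "LCJ1"), ("maneira", "LCJ2"), ("a", "LCJ3"), ("que", "LCJ4")]
print(group_expressions(tagged))
```

Requiring the trailing number keeps ordinary tags that merely begin with "L" from being treated as parts of an expression.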
This tagger was developed with the TnT software over a small (260 Ktoken), accurately hand-tagged corpus. It reached an accuracy of 96.87%, obtained by training on 90% of the 260 Ktokens and evaluating on the held-out 10%, repeating this over 10 different test runs and averaging the results.
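The evaluation protocol can be sketched as follows. This is only an illustration: it assumes the 10 runs use random sentence-level 90/10 splits (the text does not say how the splits were drawn), and it substitutes a trivial most-frequent-tag baseline for TnT, which is not called here. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Stand-in for the real tagger: remember each token's most frequent tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unknown tokens
    lexicon = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    return lexicon, default_tag


def accuracy(model, sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    lexicon, default_tag = model
    correct = total = 0
    for sentence in sentences:
        for token, gold in sentence:
            correct += lexicon.get(token, default_tag) == gold
            total += 1
    return correct / total


def averaged_holdout(sentences, runs=10, held_out=0.10, seed=0):
    """Train on 90%, test on the held-out 10%, repeat, and average."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        cut = max(1, round(len(shuffled) * held_out))
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(accuracy(train_baseline(train), test))
    return sum(scores) / len(scores)
```

Here `sentences` is a list of sentences, each a list of (token, tag) pairs from the hand-tagged corpus. Splitting by sentence rather than by token keeps each test sentence entirely unseen during training.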
LX-Tokenizer was developed and is maintained at the University of Lisbon by the NLX-Natural Language and Speech Group of the Department of Informatics.