Uplug (see Tiedemann, 2003a) is a collection of tools and scripts for processing text corpora, for automatic alignment, and for term extraction from parallel corpora.
Several tools have been integrated into Uplug. Pre-processing tools include a sentence splitter, a general tokenizer, and wrappers around external part-of-speech taggers and shallow parsers. The following external tools are included in the standard package: the Grok system for English (tagging and chunking) and the morphological analyzer ChaSen for Japanese. Translated documents can be sentence-aligned using the length-based approach by Gale & Church, hunalign, or GMA by Melamed and others. Words and phrases can be aligned using the clue alignment approach (see Tiedemann, 2003b) and GIZA++ (a toolbox for training statistical alignment models for SMT). Other tools can easily be integrated, for example the TreeTagger for English, French, Italian, and German, and the TnT tagger for English, German, and Swedish.
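To give a rough idea of what length-based sentence alignment does, the following is a minimal Python sketch. It is not Uplug's code and not the Gale & Church algorithm itself: their method scores alignments with a probabilistic model of length ratios, whereas this sketch swaps in a plain absolute difference of character lengths plus fixed penalties. The function name `align_by_length` and the parameters `skip_penalty` and `merge_penalty` are hypothetical and chosen only for illustration.

```python
def align_by_length(src, tgt, skip_penalty=10, merge_penalty=3):
    """Align two lists of sentences using character lengths only.

    Simplified illustration: the cost of a bead is the absolute length
    difference plus a fixed penalty (an assumption of this sketch, not
    the probabilistic cost used by Gale & Church).
    """
    # Allowed bead shapes: (source sentences consumed, target sentences consumed).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    # Dynamic programming over prefixes of both documents.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                if di == 0 or dj == 0:
                    penalty = skip_penalty      # sentence left unaligned
                elif di == 1 and dj == 1:
                    penalty = 0                 # plain one-to-one link
                else:
                    penalty = merge_penalty     # two-to-one or one-to-two
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Follow the back pointers from the end to recover the best path.
    alignment = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    alignment.reverse()
    return alignment


if __name__ == "__main__":
    english = ["This is a short text.", "It has two sentences."]
    german = ["Dies ist ein kurzer Text.", "Er hat zwei Sätze."]
    for src_group, tgt_group in align_by_length(english, german):
        print(src_group, "<->", tgt_group)
```

Each returned pair holds the source sentences and the target sentences that were linked together; an empty list on one side marks a sentence that was left unaligned.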
Uplug has been developed within the PLUG project (see Tiedemann, 2002). It also includes web-based interfaces for interactive sentence and word alignment (see Tiedemann, 2006).