
SYSTRAN's MT ARCHITECTURE
Overview
The general framework that SYSTRAN uses in all its MT systems has proven
powerful and effective. Over its long history, many improvements have been
made to the original design, resulting in a highly modular architecture.
Reusing existing modules, and applying similar methods consistently across
different languages where applicable, allows quick and efficient
development of a functional prototype system for any new language pair.
SYSTRAN's architecture is also very flexible and allows the introduction
of innovative methods. In fact, with every new language added to the SYSTRAN
inventory, some new techniques have been tried in response to the challenges
of that language. Such innovations are often found later to be applicable
to other language-pair systems as well.
Methodology
SYSTRAN's methodology is a sentence-by-sentence approach, concentrating
first on individual words and their dictionary data, then on the parse
of the sentence unit, followed by the translation of the parsed sentence.
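The three phases described above can be sketched as a small pipeline. This is only an illustrative toy, not SYSTRAN's implementation: the dictionary contents, the function names, and the naive period-based sentence splitting are all assumptions made for the example.

```python
# Hypothetical toy dictionary: source word -> part of speech and target word.
TOY_DICTIONARY = {
    "the": {"pos": "DET", "target": "le"},
    "cat": {"pos": "NOUN", "target": "chat"},
    "sleeps": {"pos": "VERB", "target": "dort"},
}


def split_sentences(text):
    """Naive sentence boundary detection (assumption: '.' ends a sentence)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def lookup_words(sentence):
    """Phase 1: attach dictionary data to each individual word."""
    words = []
    for token in sentence.split():
        # Words not found in the dictionary are passed through unchanged.
        entry = TOY_DICTIONARY.get(
            token.lower(), {"pos": "UNKNOWN", "target": token}
        )
        words.append({"source": token, **entry})
    return words


def parse_sentence(words):
    """Phase 2: build a (very small) parse of the sentence unit."""
    predicate = next(
        (i for i, w in enumerate(words) if w["pos"] == "VERB"), None
    )
    return {"words": words, "predicate": predicate}


def translate_parse(parse):
    """Phase 3: produce the translation from the parsed sentence."""
    return " ".join(w["target"] for w in parse["words"])


def translate_text(text):
    """Run the three phases sentence by sentence."""
    return [
        translate_parse(parse_sentence(lookup_words(s)))
        for s in split_sentences(text)
    ]
```

Each sentence is processed independently, so a word missing from the dictionary (such as "dog" in a text like "The cat sleeps. The dog sleeps.") affects only its own sentence and is carried through untranslated.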
Modularity
The SYSTRAN architecture comprises three major groups: Dictionary, System
Software, and Linguistic Software. Each of these consists of a large number
of modules that work together to form a fully automatic MT (Machine
Translation) system.
Dictionary
SYSTRAN traditionally employs three distinct but interconnected types
of dictionaries for the MT systems of all languages.
- Stem Dictionary. The basic dictionary is a single-word "Stem Dictionary".
Words are entered in a basic form with codes to indicate inflectional patterns,
part of speech, syntactic behavior, semantic properties, and target language
meanings, together with the codes needed for target word generation. Homographic
forms with part-of-speech ambiguity are entered separately for each part of speech,
cross-referenced to the basic entries, and indexed by type of part-of-speech
ambiguity. The source-language portion of the Stem Dictionary is
complemented by transfer and target information for each word in several
target languages.
- Expression Dictionary. This is the dictionary of multiple-word expressions.
These expressions include co-occurrence-based and rule-based expressions,
and may range from simple noun phrases to expressions containing translation
rules based on the syntactic or semantic link between individual words
or entire classes of words. Words in the Expression Dictionary are given
in their "basic" form, and indexing to the Stem Dictionary allows
execution of the rule for all inflected forms or alternate spellings
recognized in the Stem Dictionary.
- Customer Specific Dictionary (CSD). A PC/Windows-based CSD allows the
user to enter terms (words and a set of pre-defined types of expressions)
that are not found in the main dictionaries. The user may also globally
or conditionally change meanings found in the main dictionaries. The CSD
is designed for the individual or industrial user with limited needs.
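How the three dictionary types interconnect can be illustrated with a minimal sketch. The entry layout, field names, and sample words below are hypothetical and greatly simplified (for instance, homograph handling is omitted); the sketch only shows expressions stored in basic form being matched against inflected input through the stem index, and a customer entry overriding a main-dictionary meaning.

```python
# Hypothetical Stem Dictionary: basic form -> codes and target meanings.
STEM_DICTIONARY = {
    "take": {
        "pos": "VERB",
        "forms": {"take", "takes", "took", "taken", "taking"},
        "target": {"fr": "prendre"},
    },
    "place": {
        "pos": "NOUN",
        "forms": {"place", "places"},
        "target": {"fr": "lieu"},
    },
}

# Hypothetical Expression Dictionary: keyed by the basic forms of its words.
EXPRESSION_DICTIONARY = {("take", "place"): {"fr": "avoir lieu"}}

# Hypothetical Customer Specific Dictionary: user overrides by basic form.
CUSTOMER_DICTIONARY = {"place": {"fr": "emplacement"}}

# Index from every inflected form back to its basic form.
FORM_TO_STEM = {
    form: stem
    for stem, entry in STEM_DICTIONARY.items()
    for form in entry["forms"]
}


def to_stems(tokens):
    """Map each token to its basic form (unknown tokens are kept as-is)."""
    return tuple(FORM_TO_STEM.get(t.lower(), t.lower()) for t in tokens)


def find_expression(tokens, lang):
    """Match an expression regardless of how its words are inflected."""
    entry = EXPRESSION_DICTIONARY.get(to_stems(tokens))
    return entry.get(lang) if entry else None


def word_meaning(token, lang):
    """Single-word meaning; a customer entry takes precedence if present."""
    stem = FORM_TO_STEM.get(token.lower())
    if stem is None:
        return None  # not-found word
    override = CUSTOMER_DICTIONARY.get(stem, {})
    if lang in override:
        return override[lang]
    return STEM_DICTIONARY[stem]["target"][lang]
```

Because the expression is stored once under ("take", "place"), inputs such as "took place" or "takes place" reach the same entry without the expression being listed separately for each inflected form.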
System Software
A body of systems software, consistent across the various SYSTRAN language
pairs, handles formatting, character conversion, user interface, sentence
and word boundary determination, dictionary and morphology lookup, and
not-found word treatment. It controls the flow of linguistic modules and
creates final formatted output. Also supported are a variety of tools for
dictionary preparation, quality assurance, corpus manipulation, and parsing
diagnostics.
Linguistic Software
- Parser. The most challenging aspect of any MT system is the parser,
the module that analyzes each sentence and attempts to build up representations
of the source sentences. SYSTRAN parses with a battery of procedural modules
which resolve, step by step, various relationships and assign structure
within the sentence. The SYSTRAN parser is deterministic in nature, so
each module makes firm decisions and passes the results on to the next
module. The advantage is that every sentence, even an incomplete or malformed
one, will be parsed and therefore translated. The disadvantage of such
determinism is that incorrect decisions may be passed on and compounded
from module to module. SYSTRAN mitigates this with several mechanisms
that flag uncertain decisions. The final step in this checking process
is a Filter program that identifies the major parse errors.
- Target Language Translation Modules. After a parse of the input sentence
has been constructed, algorithms for the construction of a translation
are invoked. Translation information, on both the word and expression levels,
is derived during the dictionary lookup and parsing phases of the translation,
for use by two distinct modules, Transfer and Synthesis. The Transfer component
performs situation-specific restructuring, depending on the degree of difference
between source and target languages. It is the only module, besides the
dictionary, which relates to both source and target language, and it is
rather small when the two languages are closely related.
- Synthesis Module. Following transfer, the Synthesis module generates the
target strings that correspond to the information provided by all previous modules.
Synthesis is a source-language-independent module. The Synthesis modules
contain sophisticated algorithms for creating specialized target language
constructs, such as negation, questions, verbs with complete morphology,
placement of adverbs, and articles.
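The deterministic parsing strategy described in the Parser item above, a chain of modules that each make a firm decision, flag uncertain ones, and hand the result on to a final filter, might be sketched as follows. The module names, the toy lexicon, and the notion of which flags count as "major" are all assumptions for illustration, not SYSTRAN's actual modules.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy lexicon.
KNOWN = {"the": "DET", "dogs": "NOUN", "fast": "ADV"}
AMBIGUOUS = {"run": ("NOUN", "VERB")}  # homographs with possible readings
MAJOR_FLAGS = {"no-predicate"}         # assumption: what the filter reports


@dataclass
class ParseState:
    tokens: list
    tags: dict = field(default_factory=dict)
    predicate: Optional[int] = None
    flags: list = field(default_factory=list)


def assign_parts_of_speech(state):
    """Make a firm part-of-speech decision for every token."""
    for i, tok in enumerate(state.tokens):
        if tok in AMBIGUOUS:
            # Decide anyway (toy rule: take the last reading), but flag it.
            state.tags[i] = AMBIGUOUS[tok][-1]
            state.flags.append((i, "pos-ambiguity"))
        elif tok in KNOWN:
            state.tags[i] = KNOWN[tok]
        else:
            state.tags[i] = "NOUN"  # default for not-found words
            state.flags.append((i, "not-found"))
    return state


def find_predicate(state):
    """Pick the first verb as predicate; flag the sentence if none exists."""
    verbs = [i for i in sorted(state.tags) if state.tags[i] == "VERB"]
    if verbs:
        state.predicate = verbs[0]
    else:
        state.flags.append((None, "no-predicate"))
    return state


MODULES = [assign_parts_of_speech, find_predicate]


def parse(tokens):
    """Run every module in order; a result is always produced."""
    state = ParseState(tokens=list(tokens))
    for module in MODULES:
        state = module(state)
    return state


def parse_filter(state):
    """Final check: report only the flags considered major parse errors."""
    return [reason for _, reason in state.flags if reason in MAJOR_FLAGS]
```

Even an incomplete input such as ["the", "dogs"] yields a parse state rather than a failure; the missing predicate is recorded as a flag that the filter can report, which mirrors the trade-off between robustness and error propagation described above.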
Development of Additional Languages
Development of new language-pair translation capability between languages
for which SYSTRAN already has source and target modules is the easiest
to accomplish. Only a new Transfer module and the Transfer/Target dictionaries
need to be created.
Development of additional target language capability for each source
system is possible and quite economical because SYSTRAN systems are set
up as "Multi-target" systems. Adding another target language
requires only the development of a new Transfer module and a new
Synthesis module, as well as building up the Transfer/Target dictionaries.
Development of additional source language capability for each target
system is more difficult if a completely new parser has to be created.
However, if the new source language is closely related to one of the existing
SYSTRAN source languages, development of a new parser can take advantage
of common rules within a language family via the use of existing "Trunk
Parsers" (such as Romance Trunk, Slavic Trunk, ...).
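The three development scenarios above amount to asking which modules already exist for a requested language pair. A minimal sketch of that bookkeeping follows; the function name, the language codes, and the sample inventory are hypothetical and do not reflect SYSTRAN's actual language inventory. The assumption that a new source language also needs a new Transfer module and dictionaries is an inference from the text, not something it states.

```python
def modules_to_build(source, target, parsers, synthesizers, transfers,
                     trunk_of=None):
    """List the modules missing for a source -> target language pair.

    parsers:      set of source languages that already have a parser
    synthesizers: set of target languages that already have Synthesis
    transfers:    set of (source, target) pairs that already have Transfer
    trunk_of:     optional map from a language to a reusable trunk parser
    """
    trunk_of = trunk_of or {}
    needed = []
    if source not in parsers:
        trunk = trunk_of.get(source)
        suffix = f" (from {trunk})" if trunk else ""
        needed.append(f"parser:{source}{suffix}")
    if target not in synthesizers:
        needed.append(f"synthesis:{target}")
    if (source, target) not in transfers:
        needed.append(f"transfer:{source}->{target}")
        needed.append(f"dictionaries:{source}->{target}")
    return needed


# Hypothetical inventory, for illustration only.
PARSERS = {"en", "ru"}
SYNTHESIZERS = {"en", "fr", "de"}
TRANSFERS = {("en", "fr"), ("ru", "en")}
TRUNKS = {"pl": "Slavic Trunk"}
```

With this inventory, a pair whose source and target modules both exist needs only the Transfer module and dictionaries, a new target language adds a Synthesis module, and a new source language adds a parser, optionally derived from a trunk parser.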