%Abstract Page

\hbox{\ }

\figpreamble
% \renewcommand{\baselinestretch}{1} % USE IN FINAL VERSION
\small \normalsize

{\LARGE \begin{center}Abstract\end{center}} % this is a bit hackish
%\addcontentsline{toc}{chapter}{Abstract}
%\vspace{20pt}

\vspace{3em}

\hspace{-.15in}
\begin{tabular}{ll}
Title of dissertation:    & {\sc Machine Translation by Pattern Matching}\\
& \\
&                          {Adam Lopez, Doctor of Philosophy, 2008}\\
& \\
Dissertation directed by: & {Professor Philip Resnik} \\
&  				{Department of Linguistics and} \\
&  				{Institute for Advanced Computer Studies} \\
\end{tabular}
\figpostamble

\vspace{3em}

%\renewcommand{\baselinestretch}{2}
\large \normalsize
The best systems for machine translation of natural language
are based on statistical models learned from data.  A conventional
representation of a statistical translation model requires substantial
offline computation and must be held in main memory.
Therefore, the principal bottlenecks to the amount of data we can exploit
and the complexity of models we can use are available memory and CPU time,
and the current state of the art already pushes these limits.  With data size
and model complexity continually increasing, a scalable solution to this
problem is central to future improvement.

\citet{Callison-Burch:2005:acl} and \citet{Zhang:2005:eamt} proposed a solution
that we call {\em translation by pattern matching}, which we bring to fruition
in this dissertation.  The training data itself serves as a proxy to the
model; rules and parameters are computed on demand.  This approach achieves our
desiderata of minimal offline computation and compact representation, but
depends on fast pattern matching algorithms on text.  These authors demonstrated
its application to a common model based on the translation of contiguous
substrings, but left some open problems.  Among these is a question: can
this approach match the performance of conventional methods despite
unavoidable differences that it induces in the model?  We show how to answer
this question affirmatively.

The main open problem we address is much harder.  Many translation models
are based on the translation of discontiguous substrings.  The best pattern
matching algorithm for these models is much too slow, taking several minutes
per sentence.  We develop new algorithms that reduce empirical computation
time by two orders of magnitude for these models, making translation by
pattern matching widely applicable.  We use these algorithms to build a
model that is two orders of magnitude larger than the current state of the
art and substantially outperforms a strong competitor in Chinese-English
translation.  We show that a conventional representation of this model would
be impractical.  Our experiments shed light on some interesting properties
of the underlying model.  The dissertation also includes the most
comprehensive contemporary survey of statistical machine translation.