%Thesis Main Page used with thesis.sty (Dorothea Brosius, August 2006) based on the
%University of Maryland, College Park "Thesis and Dissertation Style Guide,"
%2005-2006, Fall 2005 Edition
\documentclass[11pt]{thesis-2} % USE IN FINAL VERSION
%\documentclass[11pt]{nice-thesis}
\usepackage{times}
\usepackage[svgnames]{xcolor}
\usepackage[sort,round]{natbib}
\usepackage{lscape}
\usepackage{indentfirst}
\usepackage{latexsym}
\usepackage{multirow}
\usepackage{amsmath}
\usepackage[pointedenum]{paralist}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage[noend]{algpseudocode}
% Chinese stuff (borrowed from HLT 2005 paper)
\usepackage{CJK}
\usepackage{ucs}
\usepackage[utf8x]{inputenc}
\usepackage{mathptm}
\usepackage{afterpage}
\usepackage{tikz}
\usepackage{sparklines}
\usepackage{hyperref}
\usepackage{lopez}
\usepackage{extraction-fig-macros}
\hypersetup{
pdftitle={Machine Translation by Pattern Matching}, % title
pdfauthor={Adam Lopez}, % author
pdfsubject={computer science}, % subject of the document
pdfdisplaydoctitle=true,
colorlinks=true, % false: boxed links; true: colored links
linkcolor=MidnightBlue, % color of internal links
citecolor=MidnightBlue, % color of links to bibliography
urlcolor=DarkGreen % color of external links
}
\renewcommand{\baselinestretch}{1} % 2 IN OFFICIAL VERSION
\setlength{\textwidth}{5.9in}
\setlength{\textheight}{9in}
\setlength{\topmargin}{-.50in}
\setlength{\oddsidemargin}{.55in}
\setlength{\parindent}{.4in}
\pagestyle{empty}
\newcommand{\figpreamble}{\renewcommand{\baselinestretch}{1}}
\newcommand{\figpostamble}{\renewcommand{\baselinestretch}{1}} % 2 IN OFFICIAL VERSION
%\newcommand{\figfontsize}[1]{{\small #1}} % USE IN OFFICIAL VERSION
\newcommand{\figfontsize}[1]{#1}
\begin{document}
{\Large \begin{center}Adam Lopez: Machine Translation by Pattern Matching\end{center}} % this is a bit hackish
%\addcontentsline{toc}{chapter}{Abstract}
%\vspace{20pt}
%\renewcommand{\baselinestretch}{2}
\large \normalsize
The most promising systems for machine translation of natural language
are based on statistical models learned from data. Conventional
use of a statistical translation model requires substantial
offline computation and an in-memory representation of the complete model.
The principal bottlenecks to the amount of data we can exploit
and the complexity of the models we can use are therefore available memory and CPU time,
and the current state of the art already pushes these limits. With data size
and model complexity continually increasing, a scalable solution to this
problem is central to future improvement.
{\em Translation by pattern matching} realizes this solution by removing
the bottlenecks required to work with very large models. The key idea is
that the complete model is never actually computed or stored---rules and parameters
are computed only as needed. In this design, the aligned training data resides in
memory, and when translations of a source sentence are needed, the relevant rules are found in
the data, extracted, and scored. For this to be practical, the on-demand computation must be
very fast. Several researchers independently prototyped such a solution for common
phrase-based models, which translate contiguous substrings. Their approach depends on the
ability to quickly locate substrings in the training data using efficient
pattern matching algorithms. However, this prior work leaves several problems open.
Among them is a basic question: can this approach match the performance of conventional
methods despite the unavoidable differences it induces in the model? We answer
this question affirmatively.
The main open problem we address is much harder. Newer state-of-the-art translation models
are based on the translation of {\em discontiguous substrings}---that is, substrings
containing variable-length gaps. State-of-the-art pattern
matching algorithms for these models are much too slow, taking several minutes
per sentence. We introduce new algorithms for this problem that reduce empirical
computation time by two orders of magnitude, making translation by
pattern matching widely applicable and usable with the best models.
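To see why discontiguous substrings are so much harder, consider a naive matcher (our own illustrative sketch, with hypothetical names, not the thesis algorithms): a two-part pattern with a gap cannot be found by a single binary search, and enumerating pairings of subpattern occurrences is quadratic in their counts, which is ruinous for frequent words.

```python
# Illustrative sketch of the naive cost of matching a discontiguous
# pattern "left X right", where X is a variable-length gap.

def occurrences(tokens, sub):
    # Naive scan for all start positions of a contiguous subpattern.
    n = len(sub)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sub]

def match_with_gap(tokens, left, right, max_gap):
    # Pair every occurrence of `left` with every occurrence of `right`
    # separated by a nonempty gap of at most `max_gap` tokens. The cost is
    # quadratic in occurrence counts, which is what efficient algorithms
    # for this problem must avoid.
    return [(i, j)
            for i in occurrences(tokens, left)
            for j in occurrences(tokens, right)
            if 0 < j - (i + len(left)) <= max_gap]

text = "ne veux pas ne le sais pas".split()
print(match_with_gap(text, ["ne"], ["pas"], 3))  # [(0, 2), (3, 6)]
```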
We use our new algorithms to build a
model that is two orders of magnitude larger than the current state of the
art and substantially outperforms a strong competitor in Chinese-English
translation. We show that a conventional representation of this model would
be impractical, requiring over two months of CPU time to compute and nearly
a terabyte of storage. In contrast, the data structures required by our
algorithms require less than an hour to compute and require little additional
overhead during decoding. Although our method scales with the number of
processors, it is feasible even with modest computational resources, and thus
places large-scale research on statistical machine translation within the
reach of many.
We use this system to shed light on interesting properties
of the underlying hierarchical phrase-based model. In particular, we conduct
experiments on a large number of model variants, pinpointing the effect of various
parameters of the heuristic model induction on overall performance. These
experiments point the way to further model improvements.
Additionally, the thesis includes a self-contained tutorial survey
on the rapidly expanding field of statistical machine translation. It currently
represents the most comprehensive contemporary survey of this important field,
and was recently published as an article in {\em ACM Computing Surveys}.
Other portions of the thesis appeared at the {\em Conference on
Empirical Methods in Natural Language Processing (EMNLP)} and the {\em International
Conference on Computational Linguistics (Coling)}.
\end{document}