Browsing Department of International Language Studies and Computational Linguistics (ISV) by Title
Now showing items 16-31 of 31
-
Buch-Kromann, Matthias; Haulrich, Martin (Frederiksberg, 2010)
Abstract: We propose a novel machine learning technique that can be used to estimate probability distributions for categorical random variables that are equipped with a natural set of classification hierarchies, such as words equipped with word class hierarchies, wordnet hierarchies, and suffix and affix hierarchies. We evaluate the estimator on bigram language modelling with a hierarchy based on word suffixes, using English, Danish, and Finnish data from the Europarl corpus with training sets of up to 1–1.5 million words. The results show that the proposed estimator outperforms modified Kneser-Ney smoothing in terms of perplexity on unseen data. This suggests that important information is hidden in the classification hierarchies that we routinely use in computational linguistics, but that we are unable to utilize this information fully because our current statistical techniques are either based on simple counting models or designed for sample spaces with a distance metric, rather than sample spaces with a non-metric topology given by a classification hierarchy. Keywords: machine learning; categorical variables; classification hierarchies; language modelling; statistical estimation URI: http://hdl.handle.net/10398/8221 Files in this item: 1
2010-wp-buch-kromann-haulrich.pdf (216.6Kb) -
The Case of the Pre-nominal Genitive in English
Anker Jensen, Per (Frederiksberg, 2010)
URI: http://hdl.handle.net/10398/8237 Files in this item: 1
gengram_Dokumentation final+code.pdf (341.5Kb) -
Hardt, Daniel; Elming, Jakob (Frederiksberg, 2010)
Abstract: A method is presented for incremental retraining of an SMT system, in which a local phrase table is created and incrementally updated as a file is translated and post-edited. It is shown that translation data from within the same file has higher value than other domain-specific data. In two technical domains, within-file data increases BLEU score by several full points. Furthermore, a strong recency effect is documented; nearby data within the file has greater value than more distant data. It is also shown that the value of translation data is strongly correlated with a metric defined over new occurrences of ngrams. Finally, it is argued that the incremental re-training prototype could serve as the basis for a practical system which could be interactively updated in real time in a post-editing setting. Based on the results here, such an interactive system has the potential to dramatically improve translation quality. URI: http://hdl.handle.net/10398/8272 Files in this item: 1
Hardt_Elming.pdf (201.1Kb) -
Understanding Romance and Germanic Compounding in a Lexico-typological Perspective
Müller, Henrik Høeg (Frederiksberg, 2010)
Abstract: The title of my talk is “Informational balance. Understanding Romance and Germanic Compounding in a lexico-typological perspective”. What I basically mean by informational balance is that semantic content is distributed systematically differently between nouns and verbs in the Romance and Germanic languages, and that this distribution is complementary. I shall explain that in detail in a minute, but first I shall introduce you to the problem, which I believe can be explained on the basis of this idea about “informational balance”. URI: http://hdl.handle.net/10398/8281 Files in this item: 1
Full Paper Berlin (sep 2010).pdf (110.5Kb) -
Juel Henrichsen, Peter (2011)
Abstract: Modern hearing aids use a variety of advanced digital signal processing methods in order to improve speech intelligibility. These methods are based on knowledge about the acoustics outside the ear as well as psychoacoustics. We present a novel observation based on the fact that acoustic prominence is not equal to information prominence for time intervals at the syllabic and sub-syllabic levels. The idea is that speech elements with a high degree of information can be robustly identified based on basic acoustic properties. We evaluated the correlation of (information rich) content words in the DanPASS corpus with fundamental frequency (F0) and spectral tilt across four frequency bands. Our results show a correlation of certain band-level differences and the presence of content words. Similarly, but to a lesser extent, a correlation between F0 and the presence of content words was found. The principle described here has the potential to improve the “information-to-noise” ratio in hearing aids. In addition, this concept may also be applicable in automatic speech recognition systems. URI: http://hdl.handle.net/10398/8411 Files in this item: 1
Peter_Juel_Henrichsen_ISAAR2011.pdf (296.9Kb) -
Carl, Michael; Kay, Martin; Jensen, Kristian T. H. (Preprint, 2010)
Abstract: This paper investigates properties of translation processes, as observed in the translation behaviour of student and professional translators. The translation process can be divided into a gisting, drafting and post-editing phase. We find that student translators have longer gisting phases whereas professional translators have longer post-editing phases. Long-distance revisions, which would typically be expected during post-editing, occur to the same extent during drafting as during post-editing. Further, both groups of translators seem to face the same translation problems. We suggest how those findings might be taken into account in the design of computer assisted translation tools. URI: http://hdl.handle.net/10398/8046 Files in this item: 1
LonDistRevision.pdf (651.7Kb) -
Low Resources Machine Translation
Carl, Michael; Maite, Melero; Badia, Toni; Vandeghinste, Vincent; Dirix, Peter; Schuurman, Ineke; Markantonatou, Stella; Sofianopoulos, Sokratis; Vassiliou, Marina; Yannoutsou, Olga (2008)
Abstract: METIS-II was an EU-FET MT project running from October 2004 to September 2007, which aimed at translating free text input without resorting to parallel corpora. The idea was to use ‘basic’ linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project had four partners, translating from their ‘home’ languages Greek, Dutch, German, and Spanish into English. The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained. It also gives examples of how METIS-II has continued beyond its lifetime and the original scope of the project. On the basis of the results and experiences obtained, we believe that the approach is promising and offers the potential for development in various directions. URI: http://hdl.handle.net/10398/8037 Files in this item: 1
METIS-II.pdf (503.5Kb) -
Christiansen, Thomas U.; Juel Henrichsen, Peter (Aalborg, 2011)
Abstract: Nonsense syllable speech materials are often used when investigating speech perception in quiet and under adverse conditions. The main advantage of using nonsense syllables over words and sentences is that the acoustic as well as the linguistic context is minimal. This paper presents three anechoic recordings of 13 male and 13 female native talkers of Danish each speaking 65 nonsense syllables repeated three times with the neutral intonation contour for Danish (in total 15210 syllables). The authors compared and ranked groups of three recordings. These three recordings had the same talker and identical phonetic content. The syllables were ranked according to their general “appropriateness” and consistency, i.e., prototypical production of the consonant-vowel (CV) with respect to applicability in speech perceptual studies. The results were compared to results of an automatic method based on acoustic measures. The two novel ideas are 1) to devise an automated method for evaluating “appropriateness” of CVs and 2) to develop a Danish CV-material annotated with an objective measure of “appropriateness” for each recorded CV. The latter would potentially render more CVs appropriate for perceptual studies. Moreover, objective evaluation would make it possible to examine any perceptual effects of variability in CV production (for example how susceptible different renderings of CVs by the same talker are to background noise). To the knowledge of the authors, no such material has yet been published for any language. URI: http://hdl.handle.net/10398/8412 Files in this item: 1
Peter_Juel_Henrichsen_2.pdf (427.2Kb) -
Carl, Michael; Lykke Jakobsen, Arnt; Jensen, Kristian T. H. (2009)
Abstract: One of the aims of the Eye-to-IT project (FP6 IST 517590) is to integrate keyboard logging and eye-tracking data to study and anticipate the behaviour of human translators. This so-called User-Activity Data (UAD) would make it possible to empirically ground cognitive models and to validate hypotheses of human processing concepts in the data. In order to thoroughly ground a cognitive model of the user in empirical observation, two conditions must be met as a minimum. First, all UAD must be fully synchronised so that the data relate to a common construct. Second, the data must be represented in a queryable form so that large volumes of data can be analysed electronically. Two programs have evolved in the Eye-to-IT project: TRANSLOG is designed to register and replay keyboard logging data, while GWM is a tool to record and replay eye-movement data. This paper reports on an attempt to synchronise and integrate the representations of both software components so that sequences of keyboard and eye-movement data can be retrieved and their interaction studied. The outcome of this effort would be the possibility to correlate eye- and keyboard activities of translators (the user model) with properties of the source and target texts and thus to uncover dependencies in the UAD. URI: http://hdl.handle.net/10398/8041 Files in this item: 1
NLPCS09.pdf (481.2Kb) -
Gylling, Morten; Korzen, Iørn (Agay, 2011)
Abstract: This paper examines some typological differences in the discourse structure of Italian and Danish. The results of the study indicate that there are significant differences in information packing in the two languages, especially in their use of deverbalisation. Italian sentences tend to include a larger number of Elementary Discourse Units (EDUs), especially propositions, than Danish. A higher percentage of these is rhetorically backgrounded by means of non-finite and nominalised predicates. Danish text structure, on the other hand, is more informationally linear and characterised by a higher number of finite verbs and topic shifts. The study also suggests that a more fine-grained classification of non-finite and nominalised EDUs is needed for a complete in-depth analysis of discourse constraints in different language families. URI: http://hdl.handle.net/10398/8415 Files in this item: 1
Gylling_Korzen.pdf (124.8Kb) -
Based on the Lexicalisation Patterns of Support Verbs in Danish and French
Hein, Birgitte (Frederiksberg, 2003)
Abstract: Every translator working between a Germanic language such as Danish and a Romance language such as French knows that certain linguistic constructions often cause problems. One of these constructions consists of a support verb and an object which together form a semantic unit. Since this construction occurs frequently, especially in legal and administrative texts, it can be of both practical and theoretical value to gain a clearer picture of how the constructions are idiomatically built and used in the two languages. The study seeks to place itself in a context that concerns both translation and linguistic description, out of a wish that a comparative description should give a translator knowledge that he can use in his practical work. Most people who have used computer-assisted translation will agree that qualified human translation is still necessary if one is to obtain an idiomatically correct and usable result. Admittedly, computer-assisted “raw translations” are possible today. Sometimes these translations can serve, for example, to give an internet user a quick impression of the content of a web page in a language that he does not master.... URI: http://hdl.handle.net/10398/8623 Files in this item: 1
Birgitte_Hein.pdf (776.8Kb) -
Carl, Michael (2008)
Abstract: The paper introduces a new research strategy for the investigation of human translation behavior. While conventional cognitive research methods make use of think aloud protocols (TAP), we introduce and investigate User-Activity Data (UAD). UAD consists of the translator’s recorded keystroke and eye-movement behavior, which makes it possible to replay a translation session and to register the subjects’ comments on their own behavior during a retrospective interview. UAD has the advantage of being objective and reproducible, and, in contrast to TAP, does not interfere with the translation process. The paper gives the background of this technique and an example of an English-to-Danish translation. Our goal is to elaborate and investigate cognitively grounded basic translation concepts which are materialized and traceable in the UAD and which, in a later stage, will provide the basis for appropriate and targeted help for the translator at a given moment. URI: http://hdl.handle.net/10398/8044 Files in this item: 1
UAD-3.pdf (408.4Kb) -
Elming, Jakob (Frederiksberg, 2008)
Abstract: Reordering has been an important topic in statistical machine translation (SMT) for as long as SMT has been around. State-of-the-art SMT systems such as Pharaoh (Koehn, 2004a) still employ a simplistic model of the reordering process to do non-local reordering. This model penalizes any reordering no matter the words. The reordering is only selected if it leads to a translation that looks like a much better sentence than the alternative. Recent developments have, however, seen improvements in translation quality following from syntax-based reordering. One such development is the pre-translation approach that adjusts the source sentence to resemble target language word order prior to translation. This is done based on rules that are either manually created or automatically learned from word aligned parallel corpora. We introduce a novel approach to syntactic reordering. This approach provides better exploitation of the information in the reordering rules and eliminates problematic biases of previous approaches. Although the approach is examined within a pre-translation reordering framework, it easily extends to other frameworks. Our approach significantly outperforms a state-of-the-art phrase-based SMT system and previous approaches to pre-translation reordering, including (Li et al., 2007; Zhang et al., 2007b; Crego & Mariño, 2007). This is consistent both for a very close language pair, English-Danish, and a very distant language pair, English-Arabic. We also propose automatic reordering rule learning based on a rich set of linguistic information. As opposed to most previous approaches that extract a large set of rules, our approach produces a small set of predominantly general rules. These provide a good reflection of the main reordering issues of a given language pair. We examine several parameters that may influence the quality of the rules learned. Finally, we provide a new approach for improving automatic word alignment. This word alignment is used in the above task of automatically learning reordering rules. Our approach learns from hand aligned data how to combine several automatic word alignments into one superior word alignment. The automatic word alignments are created from the same data that has been preprocessed with different tokenization schemes. This utilizes the different strengths that different tokenization schemes exhibit in word alignment. We achieve a 38% error reduction for the automatic word alignment. URI: http://hdl.handle.net/10398/7922 Files in this item: 1
jakob_elming.pdf (1.033Mb) -
A Program for Recording User Activity Data for Empirical Reading and Writing Research
Carl, Michael (Frederiksberg, 2012)
Abstract: This paper presents a novel implementation of Translog-II. Translog-II is a Windows-oriented program to record and study reading and writing processes on a computer. In our research, it is an instrument to acquire objective, digital data of human translation processes. Like its predecessors, Translog 2000 and Translog 2006, Translog-II consists of two main components: Translog-II Supervisor and Translog-II User, which are used to create a project file, to run a text production experiment (a user reads, writes or translates a text) and to replay the session. Translog produces a log file which contains all user activity data of the reading, writing, or translation session, and which can be evaluated by external tools. While there is a large body of translation process research based on Translog, this paper gives an overview of the Translog-II functions and its data visualization options. URI: http://hdl.handle.net/10398/8435 Files in this item: 1
Michael_Carl_2012.pdf (824.8Kb) -
Quantifying alignment units with keystroke data
Carl, Michael (2009)
Abstract: The paper discusses a method to triangulate process and product data. We suggest converting Translog data into a relational format which contains both process and product data. We outline how this representation allows us to retrieve and correlate the various dimensions of the data more easily. The concept of Alignment Unit (AU) is introduced and contrasted with that of Translation Unit (TU). While AUs refer to translation equivalences in the source and target texts of the product data, TUs refer to cognitive entities that can be observed in the process data. With an (almost) exhaustive fragmentation of the source and target texts into AUs, we are able to distribute and allocate the entire set of keystroke data to appropriate AUs. Using the properties of the keystroke data, AUs are quantified in a novel way which enables us to visualise and investigate the structure of translation production on a fine-grained scale. URI: http://hdl.handle.net/10398/8040 Files in this item: 1
keystrokes.pdf (940.0Kb) -
Korzen, Iørn; Gylling, Morten (Hamburg, 2011)
Abstract: This paper argues that translators can greatly benefit from contrastive studies of discourse structure. Cross-linguistic studies of Italian and Danish point to significant typological differences in information packaging in the two languages, especially in their use of deverbalisation. Italian sentences tend to include a larger number of Elementary Discourse Units (EDUs), especially propositions, than Danish. A higher percentage of these is rhetorically backgrounded by means of non-finite and nominalised predicates. Danish text structure, on the other hand, is more informationally linear and characterised by a higher number of finite verbs and topic shifts. These typological differences are transferred into three simple translation rules concerning 1) the number of EDUs, 2) the rhetorical structure, and 3) the textualisation of rhetorical satellites. URI: http://hdl.handle.net/10398/8416 Files in this item: 1
Korzen_Gylling.pdf (513.0Kb)