Browsing Department of International Language Studies and Computational Linguistics (ISV) by Title
-
A white paper
Buch-Kromann, Matthias (København, 2007)
Abstract: In this white paper, we review the theoretical evidence about the computational efficiency of dependency parsing and machine translation without the widely used but linguistically questionable assumptions of projectivity and edge-factoring. On the basis of the heuristic local optimality parser proposed by Buch-Kromann (2006), we propose a common architecture for monolingual parsing, parallel parsing, and translation that does not make these assumptions. Finally, we describe the elementary repair operations in the model, and argue that the model is potentially interesting as a model of human translation. URI: http://hdl.handle.net/10398/6846 Files in this item: 1
2007-1.pdf (355.9Kb) -
Buch-Kromann, Matthias (Frederiksberg, 2010)
Abstract: DTAG is a versatile annotation tool that supports manual and semi-automatic annotation of a wide range of linguistic phenomena, including the annotation of syntax, discourse, coreference, morphology, and word alignments. It includes commands for editing general labeled graphs and graph alignments, comparing annotations, managing annotation tasks, and interfacing with a revision control system. Its visualization component can display graphs and alignments for entire texts in a compact format, with a highly flexible and configurable formatting scheme. It also provides a powerful search-replace mechanism with queries based on full first-order logic, which can be used to search for linguistic constructions and automatically apply graph transformations to collections of annotated graphs. URI: http://hdl.handle.net/10398/8222 Files in this item: 1
2010-wp-dtag (2).pdf (137.0Kb) -
Uneson, Marcus; Juel Henrichsen, Peter (Jachranka, 2011)
-
Carl, Michael; Doherty, Stephen; O’Brien, Sharon (Preprint, 2010)
Abstract: Eye tracking has been used successfully for some time as a technique for measuring cognitive load in reading, psycholinguistics, writing, language acquisition, etc. Its application as a technique for automatically measuring the reading ease of MT output has not yet, to our knowledge, been tested. We report here on a preliminary study testing the use and validity of an eye tracking methodology as a means of semi-automatically or automatically evaluating machine translation output. Fifty French machine-translated sentences, 25 rated as excellent and 25 rated as poor in an earlier human evaluation, were selected. Ten native speakers of French were instructed to read the MT sentences for comprehensibility. Their eye gaze data were recorded non-invasively using a Tobii 1750 eye tracker. The average gaze time and fixation count were found to be higher for the “bad” sentences, while average fixation duration and pupil dilations were not found to be substantially different between output rated as good or bad. BLEU scores were also compared with the eye gaze data and were found to correlate well with gaze time and fixation count, and to a lesser extent with pupil dilation and fixation duration. We conclude that the eye tracking data, in particular gaze time and fixation count, correlate reasonably well with human evaluation of MT output but fixation duration and pupil dilation may be less reliable indicators of reading difficulty for MT output. We also conclude that eye tracking has promise as an automatic MT evaluation technique. URI: http://hdl.handle.net/10398/8045 Files in this item: 1
SubmissionforMT_dohertyobriencarl.pdf (226.2Kb) -
Carl, Michael (2008)
Abstract: One of the aims of the Eye-to-IT project is to investigate the possibility of using eye-tracking devices for detecting situations of targeted help for human translators. A prerequisite for automated assistance in human translation is the understanding and modelling of reading behaviour, the ability to follow human eye movements and to map gaze sample points (the output of eye-tracking devices) onto the words and symbols fixated. Within the Eye-to-IT project we currently use a so-called “Gaze-to-Word Mapping” (GWM) device (Špakov 2008) that first computes possible fixations from sequences of gaze sample coordinates and then maps the fixations onto the words which are likely to be fixated. This paper suggests an alternative framework of a probabilistic gaze mapping model for reading, in which fixations on textual objects are directly computed from the gaze sample points. The framework integrates various knowledge sources with the aim of computing the most likely fixations on words and symbols on the basis of the available data. URI: http://hdl.handle.net/10398/8043 Files in this item: 1
CLS.pdf (186.2Kb) -
En valensgrammatisk undersøgelse [A valency-grammar study]
Skot-Hansen, Annemette (Frederiksberg, 2009)
Abstract: The purpose of this dissertation is to map the valency structure of a subset of the French adverbs. More specifically, the dissertation seeks to answer the following questions: What valency structure follows from the lexical content of the adverbs investigated? What is the nature of the semantic relation established? What is the status of the valents relative to the adverb and relative to other valents? The empirical object of investigation is focused on adverbs derived from adjectives which take prepositional phrases headed by the preposition à as their complement. In addition, the delimitation chosen for this dissertation is a class of adverbs which share the feature that they carry the suffix -ment, which developed from the Latin noun mens, meaning “spirit/thought/mood/tenor”. It is argued that the fusion of an adjective and mens establishes the general meaning [in an adjective spirit/thought/mood/tenor], i.e. the adverb retains the general quality denoted by the adjective, but the meaning targets the verb situation (at clause level) or the quality (at phrasal level) which saturates the argument of the adverb. Following tradition, the analysis adopted here takes the verb situation to be realised by the predicate, and the quality to be realised by an adjective phrase, which may be realised by a past participle or, in rare cases, by another adverb. Since the valent is required by the lexical content of the adverb, it is assumed, following Herslund and Sørensen, that the valent is a fundamental valent. Another important feature of the adverbs which are analysed in this dissertation is that they establish a relation between two entities. This means that in addition to its fundamental valent, the adverb takes a further valent which it links with the fundamental valent. This second valent is referred to as the second valent of the adverb. The two valents are analysed as two relata in a relation.
Unlike the fundamental valent, the second valent is always at phrasal level. When the adverb functions at clausal level, the second valent is realised as the prepositional object of the preposition phrase headed by à. This realisation is, however, not possible when the adverb functions at phrasal level. It is argued that this is a consequence of the fact that it is impossible to insert other constituents between the adverb and the adjective, adverb or participle which is modified by the adverb. The result is that where the second valent is realised, the adverb moves from preposition to postposition relative to its fundamental valent. In the data investigated the second valent denotes very different entities such as situations denoted by verbs and qualities, but also objects and abstract entities. The individual adverbs which are investigated here each determine their valency. In general there are different sources that allow us to uncover the core meaning of a word. The sources chosen in this dissertation are: the semantic roles assigned by the adverbs, their symmetry, elements of shared semantics or partial synonymy, their morphology and etymological roots. In order to bring together these different sources, the dissertation postulates a denotation design for each adverb. The etymology of the adverbs has been particularly helpful in determining the relation and valency they establish. In addition to adverb and adjective suffixes, the majority of the adverbs investigated have a preposition in their synchronic morphological make-up which denotes a relation between two entities: some adverbs contain both a preposition and a morpheme from another word class, e.g. comparativement and subséquemment, while others contain only a preposition, e.g. antérieurement and postérieurement.
A very small subset does not contain a preposition, but only a single adverb morpheme which denotes the relation in question, so, for instance, the adjectives par and similis, which have formed pareillement and semblablement, denote a relation between two relata. From an etymological perspective, a few adverbs, such as latéralement, do not denote a relation – so it is only through the formal realisation of the preposition phrase that the relation is established. The dissertation maps the etymological and morphological structure of the adverb and the range of functions that the adverb and its valents can have at clausal and phrasal level. The function of the adverb is relevant to the extent that the function affects its semantics and its valency structure. The effect of function is seen in some adverbs when they operate on clausal or on phrasal level and in other adverbs when they modify entire clauses or just the verb. URI: http://hdl.handle.net/10398/7944 Files in this item: 1
Annemette_Skot-Hansen.pdf (2.974Mb) -
Nistrup Madsen, Bodil; Odgaard, Anna Elisabeth (Frederiksberg, 2010)
Abstract: In order to develop a harmonised and efficient IT system, such as a database, it is important to be familiar with the underlying concept model (concept systems) for the relevant domain which the IT system should be designed to accommodate, as this forms the necessary firm foundation for designing the conceptual data model. Although there is no one-to-one correlation between concept and characteristic features in the concept model and classes and attributes in the conceptual data model, there are many similarities between concept modelling and conceptual data modelling, and by closely examining the relationship between the two models, we have strived to construct an algorithm for creating conceptual data models in Unified Modelling Language (UML) on the basis of concept models that adhere to the traditional principles and methods of terminology work. URI: http://hdl.handle.net/10398/8284 Files in this item: 1
bnm-aeo-TKE-2010-NEW.pdf (166.2Kb) -
A Study of CWA Raters' Decision-Making Behaviours
Lindhardsen, Vivian (Frederiksberg, 2009)
Abstract: The present study maps the decision-making behaviors of experienced raters in a well-established Communal Writing Assessment (CWA) context, tracing their behaviors all the way from the independent rating sessions, where the initial images and judgments are formed, to the communal rating sessions, where the final scores are assigned on the basis of collaboration between two raters. Results from think-aloud protocols, recorded discussions, retrospective reports and reported scores from 20 raters rating 15 ESL essays show that when moving from the independent ratings to the communal ratings, there is little, if any, increase in rater agreement levels and the raters' attention to the textual features corresponding to the official criteria becomes more evenly distributed. However, rather than consulting the scale descriptors directly in resolving insecurities about score assignment, the raters seemed to rely heavily on each other's expertise, thereby reducing the importance of the scale and emphasizing the value of the community of raters. In validating their scores in the communal rating discussions the raters appeared to be critically and equally engaged in the discussions, and through deliberating and refining their assessments the raters believed that CWA practices produce more accurate scores than independent ratings and lead to professional development. These interpretations support a hermeneutic rather than a psychometric approach to establishing the validity of the present CWA practices. URI: http://hdl.handle.net/10398/7743 Files in this item: 1
Vivian_Lindhardsen.pdf (8.523Mb) -
Buch-Kromann, Matthias; Haulrich, Martin (Frederiksberg, 2010)
Abstract: We propose a novel machine learning technique that can be used to estimate probability distributions for categorical random variables that are equipped with a natural set of classification hierarchies, such as words equipped with word class hierarchies, wordnet hierarchies, and suffix and affix hierarchies. We evaluate the estimator on bigram language modelling with a hierarchy based on word suffixes, using English, Danish, and Finnish data from the Europarl corpus with training sets of up to 1–1.5 million words. The results show that the proposed estimator outperforms modified Kneser-Ney smoothing in terms of perplexity on unseen data. This suggests that important information is hidden in the classification hierarchies that we routinely use in computational linguistics, but that we are unable to utilize this information fully because our current statistical techniques are either based on simple counting models or designed for sample spaces with a distance metric, rather than sample spaces with a non-metric topology given by a classification hierarchy. Keywords: machine learning; categorical variables; classification hierarchies; language modelling; statistical estimation URI: http://hdl.handle.net/10398/8221 Files in this item: 1
2010-wp-buch-kromann-haulrich.pdf (216.6Kb) -
The Case of the Pre-nominal Genitive in English
Anker Jensen, Per (Frederiksberg, 2010)
URI: http://hdl.handle.net/10398/8237 Files in this item: 1
gengram_Dokumentation final+code.pdf (341.5Kb) -
Hardt, Daniel; Elming, Jakob (Frederiksberg, 2010)
Abstract: A method is presented for incremental retraining of an SMT system, in which a local phrase table is created and incrementally updated as a file is translated and post-edited. It is shown that translation data from within the same file has higher value than other domain-specific data. In two technical domains, within-file data increases BLEU score by several full points. Furthermore, a strong recency effect is documented; nearby data within the file has greater value than more distant data. It is also shown that the value of translation data is strongly correlated with a metric defined over new occurrences of ngrams. Finally, it is argued that the incremental re-training prototype could serve as the basis for a practical system which could be interactively updated in real time in a post-editing setting. Based on the results here, such an interactive system has the potential to dramatically improve translation quality. URI: http://hdl.handle.net/10398/8272 Files in this item: 1
Hardt_Elming.pdf (201.1Kb) -
Understanding Romance and Germanic Compounding in a Lexico-typological Perspective
Müller, Henrik Høeg (Frederiksberg, 2010)
Abstract: The title of my talk is “Informational balance. Understanding Romance and Germanic Compounding in a lexico-typological perspective”. What I basically mean by informational balance is that semantic content is distributed systematically differently between nouns and verbs in the Romance and Germanic languages, and that this distribution is complementary. I shall explain that in detail in a minute, but first I shall introduce you to the problem, which I believe can be explained on the basis of this idea about “informational balance”. URI: http://hdl.handle.net/10398/8281 Files in this item: 1
Full Paper Berlin (sep 2010).pdf (110.5Kb) -
Juel Henrichsen, Peter (2011)
Abstract: Modern hearing aids use a variety of advanced digital signal processing methods in order to improve speech intelligibility. These methods are based on knowledge about the acoustics outside the ear as well as psychoacoustics. We present a novel observation based on the fact that acoustic prominence is not equal to information prominence for time intervals at the syllabic and sub-syllabic levels. The idea is that speech elements with a high degree of information can be robustly identified based on basic acoustic properties. We evaluated the correlation of (information rich) content words in the DanPASS corpus with fundamental frequency (F0) and spectral tilt across four frequency bands. Our results show a correlation of certain band-level differences and the presence of content words. Similarly, but to a lesser extent, a correlation between F0 and the presence of content words was found. The principle described here has the potential to improve the “information-to-noise” ratio in hearing aids. In addition, this concept may also be applicable in automatic speech recognition systems. URI: http://hdl.handle.net/10398/8411 Files in this item: 1
Peter_Juel_Henrichsen_ISAAR2011.pdf (296.9Kb) -
Carl, Michael; Kay, Martin; Jensen, Kristian T. H. (Preprint, 2010)
Abstract: This paper investigates properties of translation processes, as observed in the translation behaviour of student and professional translators. The translation process can be divided into a gisting, drafting and post-editing phase. We find that student translators have longer gisting phases whereas professional translators have longer post-editing phases. Long-distance revisions, which would typically be expected during post-editing, occur to the same extent during drafting as during post-editing. Further, both groups of translators seem to face the same translation problems. We suggest how those findings might be taken into account in the design of computer assisted translation tools. URI: http://hdl.handle.net/10398/8046 Files in this item: 1
LonDistRevision.pdf (651.7Kb) -
Low Resources Machine Translation
Carl, Michael; Maite, Melero; Badia, Toni; Vandeghinste, Vincent; Dirix, Peter; Schuurman, Ineke; Markantonatou, Stella; Sofianopoulos, Sokratis; Vassiliou, Marina; Yannoutsou, Olga (2008)
Abstract: METIS-II was an EU-FET MT project running from October 2004 to September 2007, which aimed at translating free text input without resorting to parallel corpora. The idea was to use ‘basic’ linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project had four partners, translating from their ‘home’ languages Greek, Dutch, German, and Spanish into English. The paper outlines the basic ideas of the project, their implementation, the resources used, and the results obtained. It also gives examples of how METIS-II has continued beyond its lifetime and the original scope of the project. On the basis of the results and experiences obtained, we believe that the approach is promising and offers the potential for development in various directions. URI: http://hdl.handle.net/10398/8037 Files in this item: 1
METIS-II.pdf (503.5Kb) -
Christiansen, Thomas U.; Juel Henrichsen, Peter (Aalborg, 2011)
Abstract: Nonsense syllable speech materials are often used when investigating speech perception in quiet and under adverse conditions. The main advantage of using nonsense syllables over words and sentences is that the acoustic as well as the linguistic context is minimal. This paper presents three anechoic recordings of 13 male and 13 female native talkers of Danish each speaking 65 nonsense syllables repeated three times with the neutral intonation contour for Danish (in total 15210 syllables). The authors compared and ranked groups of three recordings. These three recordings had the same talker and had identical phonetic content. The syllables were ranked according to the general “appropriateness” and consistency, i.e., prototypical production of the consonant-vowel (CV) with respect to applicability in speech perceptual studies. The results were compared to results of an automatic method based on acoustic measures. The two novel ideas are 1) to devise an automated method for evaluating “appropriateness” of CVs and 2) to develop a Danish CV-material annotated with an objective measure of “appropriateness” for each recorded CV. The latter would potentially render more CVs appropriate for perceptual studies. Moreover, objective evaluation would make it possible to examine any perceptual effects of variability in CV production (for example how susceptible different renderings of CVs by the same talker are to background noise). To the knowledge of the authors, no such material has yet been published for any language. URI: http://hdl.handle.net/10398/8412 Files in this item: 1
Peter_Juel_Henrichsen_2.pdf (427.2Kb) -
Carl, Michael; Lykke Jakobsen, Arnt; Jensen, Kristian T. H. (2009)
Abstract: One of the aims of the Eye-to-IT project (FP6 IST 517590) is to integrate keyboard logging and eye-tracking data to study and anticipate the behaviour of human translators. This so-called User-Activity Data (UAD) would make it possible to empirically ground cognitive models and to validate hypotheses of human processing concepts in the data. In order to thoroughly ground a cognitive model of the user in empirical observation, two conditions must be met as a minimum. First, all UAD must be fully synchronised so that the data relate to a common construct. Secondly, the data must be represented in a queryable form so that large volumes of data can be analysed electronically. Two programs have evolved in the Eye-to-IT project: TRANSLOG is designed to register and replay keyboard logging data, while GWM is a tool to record and replay eye-movement data. This paper reports on an attempt to synchronise and integrate the representations of both software components so that sequences of keyboard and eye-movement data can be retrieved and their interaction studied. The outcome of this effort would be the possibility to correlate eye and keyboard activities of translators (the user model) with properties of the source and target texts and thus to uncover dependencies in the UAD. URI: http://hdl.handle.net/10398/8041 Files in this item: 1
NLPCS09.pdf (481.2Kb) -
Gylling, Morten; Korzen, Iørn (Agay, 2011)
Abstract: This paper examines some typological differences in the discourse structure of Italian and Danish. The results of the study indicate that there are significant differences in information packing in the two languages, especially in their use of deverbalisation. Italian sentences tend to include a larger number of Elementary Discourse Units (EDUs), especially propositions, than Danish. A higher percentage of these is rhetorically backgrounded by means of non-finite and nominalised predicates. Danish text structure, on the other hand, is more informationally linear and characterised by a higher number of finite verbs and topic shifts. The study also suggests that a more fine-grained classification of non-finite and nominalised EDUs is needed for a complete in-depth analysis of discourse constraints in different language families. URI: http://hdl.handle.net/10398/8415 Files in this item: 1
Gylling_Korzen.pdf (124.8Kb) -
Med udgangspunkt i støtteverbers leksikaliseringsmønstre i dansk og fransk [Based on the lexicalisation patterns of support verbs in Danish and French]
Hein, Birgitte (Frederiksberg, 2003)
Abstract: Any translator working between a Germanic language such as Danish and a Romance language such as French knows that it is often particular linguistic constructions that cause problems. One of these constructions consists of a support verb and an object that together form a semantic unit. Since this construction occurs frequently, especially in legal and administrative texts, it can be of both practical and theoretical value to obtain a clearer picture of how the constructions are idiomatically built and used in the two languages. The study seeks to place itself in a context that concerns both translation and linguistic description, out of a wish that a comparative description should give a translator knowledge that he can use in his practical work. Most people who have used computer-assisted translation must agree that qualified human translation is still necessary if one is to obtain an idiomatically correct and usable result. Admittedly, computer-assisted “raw translations” are possible today. Sometimes these translations can serve, for example, to give an Internet user a quick impression of the content of a web page in a language that he does not master.... URI: http://hdl.handle.net/10398/8623 Files in this item: 1
Birgitte_Hein.pdf (776.8Kb) -
Carl, Michael (2008)
Abstract: The paper introduces a new research strategy for the investigation of human translation behavior. While conventional cognitive research methods make use of think aloud protocols (TAP), we introduce and investigate User-Activity Data (UAD). UAD consists of the translator’s recorded keystroke and eye-movement behavior, which makes it possible to replay a translation session and to register the subjects’ comments on their own behavior during a retrospective interview. UAD has the advantage of being objective and reproducible, and, in contrast to TAP, does not interfere with the translation process. The paper gives the background of this technique and an example of an English-to-Danish translation. Our goal is to elaborate and investigate cognitively grounded basic translation concepts which are materialized and traceable in the UAD and which, at a later stage, will provide the basis for appropriate and targeted help for the translator at a given moment. URI: http://hdl.handle.net/10398/8044 Files in this item: 1
UAD-3.pdf (408.4Kb)