Browsing Department of International Language Studies and Computational Linguistics (ISV) by Title
Self-formed Groups versus Automatically-formed Groups
Razmerita, Liana; Brun, Armelle (Frederiksberg, 2011)
Abstract: Group work has been adopted as an important tool to support collaborative work in order to enhance learning processes. There is a wealth of literature related to group performance and the impact of group composition on group and individual performance. However, very few studies address the issue of how to automatically form groups. This article proposes a methodology that could be used by professors to form groups automatically, taking into account different criteria as well as the students’ profile. This methodology is based on a pilot study that analyzes group composition of self-formed student groups. URI: http://hdl.handle.net/10398/8553 Files in this item: 1
Razmerita_2011.pdf (323.9Kb) -
Nistrup Madsen, Bodil; Erdman Thomsen, Hanne; Halskov, Jakob; Lassen, Tine (Frederiksberg, 2010)
Abstract: In our paper we present a project, the aim of which is to develop innovative and advanced methods for dynamic and automatic extraction of knowledge about concepts from texts and for automatic construction of ontologies. The project builds on and further develops the results of the CAOS project - Computer-Aided Ontology Structuring - which was carried out at Copenhagen Business School in the period 1998-2007. Terminological ontologies differ from other types of ontologies by comprising feature specifications and subdivision criteria. We have formalised subdivision criteria that have been used for many years in terminology work, by introducing dimensions and dimension specifications. In the CAOS prototype, facilities for semiautomatic checking of inconsistencies were developed. URI: http://hdl.handle.net/10398/8283 Files in this item: 1
TKE-2010-HET_BNM_JH_TL.pdf (370.7Kb) -
Juel Henrichsen, Peter (2009)
Abstract: This working paper presents the CBS text-to-speech tool colloquially known as the TtT (Tekst-til-Tale). The tool is intended for training of university-level students, especially linguists training for a degree in speech technology, and visiting foreign students wanting to improve their spoken Danish. The TtT is operated through a simple www-based user interface. Using the TtT requires basic skills in formal grammar-writing, but no knowledge of other aspects of artificial voice development such as phonetic-acoustic quantification, prosodic modelling, and signal generation. The paper includes a user manual. URI: http://hdl.handle.net/10398/7763 Files in this item: 1
2009-1.pdf (363.0Kb) -
Towards a Group Formation Methodology
Razmerita, Liana; Brun, Armelle (Noordwijkerhout, 2011)
Abstract: Group work has been adopted as an important tool to support collaborative work in order to enhance learning processes. There is a wealth of literature related to group performance and the impact of group composition on group and individual performance. However, very few studies address the issue of how to automatically form groups. This article proposes a methodology that could be used by professors to form groups automatically, taking into account different criteria as well as the students’ profile. This methodology is based on a pilot study that analyzes group composition of self-formed student groups. The pilot study findings suggest that students tend to form homogeneous groups in terms of level of knowledge. Furthermore, students report that working on common topics of interest was a decisive factor in forming the groups. URI: http://hdl.handle.net/10398/8335 Files in this item: 1
RazmeritaCSEDU2011.pdf (126.5Kb) -
Christiansen, Thomas U. (Frederiksberg, 2010)
Abstract: Nonsense syllable speech materials are often used when investigating speech perception in quiet and under adverse conditions. The main advantage of using nonsense syllables over words and sentences is that the acoustic as well as linguistic context is minimal. This paper describes the considerations involved in producing three anechoic recordings of 14 male and 14 female native talkers of Danish each speaking 65 nonsense syllables repeated three times with falling F0 (total of 16380 syllables). URI: http://hdl.handle.net/10398/8218 Files in this item: 1
-
Haulrich, Martin (Frederiksberg, 2012)
Abstract: Parallel treebanks have received increasing attention in the past few years, primarily due to their potential use in statistical machine translation. Creating parallel treebanks manually is a time-consuming and expensive task and for this reason there is considerable interest in creating treebanks automatically. This task can be solved using standard tools such as parsers and aligners. However, because parallel treebanks are based on parallel corpora, we are in a special situation where the same meaning is represented in two different ways. This thesis is about how we can exploit this information to create better parallel treebanks than we can by using standard tools.... URI: http://hdl.handle.net/10398/8385 Files in this item: 1
Martin_Haulrich.pdf (1.932Mb) -
A white paper
Buch-Kromann, Matthias (København, 2007)
Abstract: In this white paper, we review the theoretical evidence about the computational efficiency of dependency parsing and machine translation without the widely used, but linguistically questionable assumptions about projectivity and edge-factoring. On the basis of the heuristic local optimality parser proposed by (Buch-Kromann, 2006), we propose a common architecture for monolingual parsing, parallel parsing, and translation that does not make these assumptions. Finally, we describe the elementary repair operations in the model, and argue that the model is potentially interesting as a model of human translation. URI: http://hdl.handle.net/10398/6846 Files in this item: 1
2007-1.pdf (355.9Kb) -
Buch-Kromann, Matthias (Frederiksberg, 2010)
Abstract: DTAG is a versatile annotation tool that supports manual and semi-automatic annotation of a wide range of linguistic phenomena, including the annotation of syntax, discourse, coreference, morphology, and word alignments. It includes commands for editing general labeled graphs and graph alignments, comparing annotations, managing annotation tasks, and interfacing with a revision control system. Its visualization component can display graphs and alignments for entire texts in a compact format, with a highly flexible and configurable formatting scheme. It also provides a powerful search-replace mechanism with queries based on full first-order logic, which can be used to search for linguistic constructions and automatically apply graph transformations to collections of annotated graphs. URI: http://hdl.handle.net/10398/8222 Files in this item: 1
2010-wp-dtag (2).pdf (137.0Kb) -
Uneson, Marcus; Juel Henrichsen, Peter (Jachranka, 2011)
-
Carl, Michael; Doherty, Stephen; O’Brien, Sharon (Preprint, 2010)
Abstract: Eye tracking has been used successfully as a technique for measuring cognitive load in reading, psycholinguistics, writing, language acquisition etc for some time now. Its application as a technique for automatically measuring the reading ease of MT output has not yet, to our knowledge, been tested. We report here on a preliminary study testing the use and validity of an eye tracking methodology as a means of semi- and/or automatically evaluating machine translation output. 50 French machine translated sentences, 25 rated as excellent and 25 rated as poor in an earlier human evaluation, were selected. 10 native speakers of French were instructed to read the MT sentences for comprehensibility. Their eye gaze data were recorded non-invasively using a Tobii 1750 eye tracker. The average gaze time and fixation count were found to be higher for the “bad” sentences, while average fixation duration and pupil dilations were not found to be substantially different between output rated as good or bad. Comparisons between BLEU scores and eye gaze data were also made and found to correlate well with gaze time and fixation count, and to a lesser extent with pupil dilation and fixation duration. We conclude that the eye tracking data, in particular gaze time and fixation count, correlate reasonably well with human evaluation of MT output but fixation duration and pupil dilation may be less reliable indicators of reading difficulty for MT output. We also conclude that eye tracking has promise as an automatic MT Evaluation technique. URI: http://hdl.handle.net/10398/8045 Files in this item: 1
SubmissionforMT_dohertyobriencarl.pdf (226.2Kb) -
Carl, Michael (2008)
Abstract: One of the aims of the Eye-to-IT project is to investigate the possibility of using eye-tracking devices for detecting situations of targeted help for human translators. A prerequisite for automated assistance in human translation is the understanding and modelling of reading behaviour, the ability to follow human eye movements and to map gaze sample points (the output of eye-tracking devices) onto words and symbols fixated. Within the Eye-to-IT project we currently use a so-called “Gaze-to-Word Mapping” (GWM) device (Špakov 2008) that first computes possible fixations from sequences of gaze sample coordinates and then maps the fixations on the words which are likely to be fixated. This paper suggests an alternative framework of a probabilistic gaze mapping model for reading, in which fixations on textual objects are directly computed from the gaze sample points. The framework integrates various knowledge sources with the aim to compute the most likely fixations on words and symbols on the basis of the available data. URI: http://hdl.handle.net/10398/8043 Files in this item: 1
CLS.pdf (186.2Kb) -
En valensgrammatisk undersøgelse
Skot-Hansen, Annemette (Frederiksberg, 2009)
Abstract: The purpose of this dissertation is to map the valency structure of a subset of the French adverbs. More specifically, the dissertation seeks to answer the following questions: What valency structure follows from the lexical content of the adverbs investigated? What is the nature of the semantic relation established? What is the status of the valents relative to the adverb and relative to other valents? The empirical object of investigation is focused on adverbs derived from adjectives which take prepositional phrases headed by the preposition à as their complement. In addition, the delimitation chosen for this dissertation is a class of adverbs which share the feature that they carry the suffix -ment, which developed from the Latin noun mens, meaning “spirit/thought/mood/tenor”. It is argued that the fusion of an adverb and mens establishes the general meaning [in an adjective spirit/thought/mood/tenor], i.e. the adverb retains the general quality denoted by the adjective, but the meaning targets the verb situation (at clause level) or the quality (at phrasal level) which saturates the argument of the adverb. Following tradition, the analysis adopted here, takes the verb situation to be realised by the predicate, and the quality to be realised by an adjective phrase, which may be realised by a past participle or, in rare cases, by another adverb. Since the valent is required by the lexical content of the adverb, it is assumed, following Herslund and Sørensen, that the valent is a fundamental valent. Another important feature of the adverbs which are analysed in this dissertation is that they establish a relation between two entities. This means that in addition to its fundamental valent, the adverb takes a further valent which it links with the fundamental valent. This second valent is referred to as the second valent of the adverb. The two valents are analysed as two relata in a relation. 
Unlike the fundamental valent, the second valent is always at phrasal level. When the adverb functions at clausal level, the second valent is realised as the prepositional object of the preposition phrase headed by à. This realisation is, however, not possible when the adverb functions at phrasal level. It is argued that this is a consequence of the fact that it is impossible to insert other constituents between the adverb and the adjective, adverb or participle which is modified by the adverb. The result is that where the second valent is realised, the adverb moves from preposition to postposition relative to its fundamental valent. In the data investigated the second valent denotes very different entities such as situations denoted by verbs and qualities, but also objects and abstract entities. The individual adverbs which are investigated here each determine their valency. In general there are different sources that allow us to uncover the core meaning of a word. The sources chosen in this dissertation are: the semantic roles assigned by the adverbs, their symmetry, elements of shared semantics or partial synonymy, their morphology and etymological roots. In order to bring together these different sources, the dissertation postulates a denotation design for each adverb. The etymology of the adverbs has been particularly helpful in determining the relation and valency they establish. In addition to adverb and adjective suffixes, the majority of the adverbs investigated have a preposition in their synchronic morphological make-up which denotes a relation between two entities: some adverbs contain both a preposition and a morpheme from another word class, e.g. comparativement and subséquemment, while others contain only a preposition, e.g. antérieurement and postérieurement. 
A very small subset does not contain a preposition, but only a single adverb morpheme which denotes the relation in question, so, for instance, the adjectives par and similis, which have formed pareillement and semblablement, denote a relation between two relata. From an etymological perspective, a few adverbs, such as latéralement, do not denote a relation – so it is only through the formal realisation of the preposition phrase that the relation is established. The dissertation maps the etymological and morphological structure of the adverb and the range of functions that the adverb and its valents can have at clausal and phrasal level. The function of the adverb is relevant to the extent that the function affects its semantics and its valency structure. The effect of function is seen in some adverbs when they operate on clausal or on phrasal level and in other adverbs when they modify entire clauses or just the verb. URI: http://hdl.handle.net/10398/7944 Files in this item: 1
Annemette_Skot-Hansen.pdf (2.974Mb) -
Nistrup Madsen, Bodil; Odgaard, Anna Elisabeth (Frederiksberg, 2010)
Abstract: In order to develop a harmonised and efficient IT system, such as a database, it is important to be familiar with the underlying concept model (concept systems) for the relevant domain which the IT system should be designed to accommodate, as this forms the necessary firm foundation for designing the conceptual data model. Although there is no one-to-one correlation between concept and characteristic features in the concept model and classes and attributes in the conceptual data model, there are many similarities between concept modelling and conceptual data modelling, and by closely examining the relationship between the two models, we have strived to construct an algorithm for creating conceptual data models in Unified Modelling Language (UML) on the basis of concept models that adhere to the traditional principles and methods of terminology work. URI: http://hdl.handle.net/10398/8284 Files in this item: 1
bnm-aeo-TKE-2010-NEW.pdf (166.2Kb) -
A Study of CWA Raters' Decision-Making Behaviours
Lindhardsen, Vivian (Frederiksberg, 2009)
Abstract: The present study maps the decision-making behaviors of experienced raters in a well-established Communal Writing Assessment (CWA) context, tracing their behaviors all the way from the independent rating sessions, where the initial images and judgments are formed, to the communal rating sessions, where the final scores are assigned on the basis of collaboration between two raters. Results from think-aloud protocols, recorded discussions, retrospective reports and reported scores from 20 raters rating 15 ESL essays show that when moving from the independent ratings to the communal ratings, there is little, if any, increase in rater agreement levels and the raters' attention to the textual features corresponding to the official criteria becomes more evenly distributed. However, rather than consulting the scale descriptors directly in resolving insecurities about score assignment, the raters seemed to rely heavily on each other's expertise, thereby reducing the importance of the scale and emphasizing the value of the community of raters. In validating their scores in the communal rating discussions the raters appeared to be critically and equally engaged in the discussions, and through deliberating and refining their assessments the raters believed that CWA practices produce more accurate scores than independent ratings and lead to professional development. These interpretations support a hermeneutic rather than a psychometric approach to establishing the validity of the present CWA practices. URI: http://hdl.handle.net/10398/7743 Files in this item: 1
Vivian_Lindhardsen.pdf (8.523Mb) -
Buch-Kromann, Matthias; Haulrich, Martin (Frederiksberg, 2010)
Abstract: We propose a novel machine learning technique that can be used to estimate probability distributions for categorical random variables that are equipped with a natural set of classification hierarchies, such as words equipped with word class hierarchies, wordnet hierarchies, and suffix and affix hierarchies. We evaluate the estimator on bigram language modelling with a hierarchy based on word suffixes, using English, Danish, and Finnish data from the Europarl corpus with training sets of up to 1–1.5 million words. The results show that the proposed estimator outperforms modified Kneser-Ney smoothing in terms of perplexity on unseen data. This suggests that important information is hidden in the classification hierarchies that we routinely use in computational linguistics, but that we are unable to utilize this information fully because our current statistical techniques are either based on simple counting models or designed for sample spaces with a distance metric, rather than sample spaces with a non-metric topology given by a classification hierarchy. Keywords: machine learning; categorical variables; classification hierarchies; language modelling; statistical estimation URI: http://hdl.handle.net/10398/8221 Files in this item: 1
2010-wp-buch-kromann-haulrich.pdf (216.6Kb) -
The Case of the Pre-nominal Genitive in English
Anker Jensen, Per (Frederiksberg, 2010)
URI: http://hdl.handle.net/10398/8237 Files in this item: 1
gengram_Dokumentation final+code.pdf (341.5Kb) -
Hardt, Daniel; Elming, Jakob (Frederiksberg, 2010)
Abstract: A method is presented for incremental retraining of an SMT system, in which a local phrase table is created and incrementally updated as a file is translated and post-edited. It is shown that translation data from within the same file has higher value than other domain-specific data. In two technical domains, within-file data increases BLEU score by several full points. Furthermore, a strong recency effect is documented; nearby data within the file has greater value than more distant data. It is also shown that the value of translation data is strongly correlated with a metric defined over new occurrences of n-grams. Finally, it is argued that the incremental retraining prototype could serve as the basis for a practical system which could be interactively updated in real time in a post-editing setting. Based on the results here, such an interactive system has the potential to dramatically improve translation quality. URI: http://hdl.handle.net/10398/8272 Files in this item: 1
Hardt_Elming.pdf (201.1Kb) -
Understanding Romance and Germanic Compounding in a Lexico-typological Perspective
Müller, Henrik Høeg (Frederiksberg, 2010)
Abstract: The title of my talk is “Informational balance. Understanding Romance and Germanic Compounding in a lexico-typological perspective”. What I basically mean by informational balance is that semantic content is distributed systematically differently between nouns and verbs in the Romance and Germanic languages, and that this distribution is complementary. I shall explain that in detail in a minute, but first I shall introduce you to the problem, which I believe can be explained on the basis of this idea about “informational balance”. URI: http://hdl.handle.net/10398/8281 Files in this item: 1
Full Paper Berlin (sep 2010).pdf (110.5Kb) -
Juel Henrichsen, Peter (2011)
Abstract: Modern hearing aids use a variety of advanced digital signal processing methods in order to improve speech intelligibility. These methods are based on knowledge about the acoustics outside the ear as well as psychoacoustics. We present a novel observation based on the fact that acoustic prominence is not equal to information prominence for time intervals at the syllabic and sub-syllabic levels. The idea is that speech elements with a high degree of information can be robustly identified based on basic acoustic properties. We evaluated the correlation of (information rich) content words in the DanPASS corpus with fundamental frequency (F0) and spectral tilt across four frequency bands. Our results show a correlation of certain band-level differences and the presence of content words. Similarly, but to a lesser extent, a correlation between F0 and the presence of content words was found. The principle described here has the potential to improve the “information-to-noise” ratio in hearing aids. In addition, this concept may also be applicable in automatic speech recognition systems. URI: http://hdl.handle.net/10398/8411 Files in this item: 1
Peter_Juel_Henrichsen_ISAAR2011.pdf (296.9Kb) -
Carl, Michael; Kay, Martin; Jensen, Kristian T. H. (Preprint, 2010)
Abstract: This paper investigates properties of translation processes, as observed in the translation behaviour of student and professional translators. The translation process can be divided into a gisting, drafting and post-editing phase. We find that student translators have longer gisting phases whereas professional translators have longer post-editing phases. Long-distance revisions, which would typically be expected during post-editing, occur to the same extent during drafting as during post-editing. Further, both groups of translators seem to face the same translation problems. We suggest how those findings might be taken into account in the design of computer assisted translation tools. URI: http://hdl.handle.net/10398/8046 Files in this item: 1
LonDistRevision.pdf (651.7Kb)