Open Access. Powered by Scholars. Published by Universities.®

Computational Linguistics Commons

255 Full-Text Articles 454 Authors 37,428 Downloads 40 Institutions

All Articles in Computational Linguistics

255 full-text articles. Page 1 of 12.

Phonologically Informed Edit Distance Algorithms For Word Alignment With Low-Resource Languages, Richard T. McCoy, Robert Frank 2019 Johns Hopkins University

Robert Frank

We present three methods for weighting edit distance algorithms based on linguistic information. These methods base their penalties on (i) phonological features, (ii) distributional character embeddings, or (iii) differences between cognate words. We also introduce a novel method for evaluating edit distance through the task of low-resource word alignment by using edit-distance neighbors in a high-resource pivot language to inform alignments from the low-resource language. At this task, the cognate-based scheme outperforms our other methods and the Levenshtein edit distance baseline, showing that NLP applications can benefit from information about cross-linguistic phonological patterns.
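To make the idea of a feature-weighted edit distance concrete, here is a minimal sketch. It is not the authors' implementation: the tiny `FEATURES` table, the `sub_cost` function, and the uniform insertion/deletion cost are all assumptions chosen for illustration. Substitutions between segments that share more features are penalized less than in plain Levenshtein distance.

```python
# Hypothetical toy feature table: (voiced, nasal, labial).
# An illustrative assumption, not the feature inventory used in the paper.
FEATURES = {
    "p": (0, 0, 1),
    "b": (1, 0, 1),
    "m": (1, 1, 1),
    "t": (0, 0, 0),
    "d": (1, 0, 0),
    "n": (1, 1, 0),
}


def sub_cost(a, b):
    """Substitution cost: fraction of features on which two segments differ."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        return 1.0  # segments without features fall back to the Levenshtein cost
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def weighted_edit_distance(s, t, indel=1.0):
    """Dynamic-programming edit distance with feature-weighted substitutions."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * indel]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + indel,               # delete a
                curr[j - 1] + indel,           # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]
```

Under these assumed weights, "pat" is closer to "bat" (one feature differs) than to "mat" (two features differ), whereas unweighted Levenshtein distance would treat both pairs identically.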


Jabberwocky Parsing: Dependency Parsing With Lexical Noise, Jungo Kasai, Robert Frank 2019 University of Washington

Robert Frank

Parsing models have long benefited from the use of lexical information, and indeed current state-of-the-art neural network models for dependency parsing achieve substantial improvements by drawing on distributed representations of lexical information. At the same time, humans can easily parse sentences with unknown or even novel words, as in Lewis Carroll’s poem Jabberwocky. In this paper, we carry out jabberwocky parsing experiments, exploring how robust a state-of-the-art neural network parser is to the absence of lexical information. We find that current parsing models, at least under usual training regimens, are in fact overly dependent on lexical information, and ...


Size Matters: The Impact Of Training Size In Taxonomically-Enriched Word Embeddings, Alfredo Maldonado, Filip Klubicka, John D. Kelleher 2019 Trinity College Dublin, Ireland

Articles

Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ‘bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g., those trained on a random walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching ...


Demographic Factors As Domains For Adaptation In Linguistic Preprocessing, Sara Morini 2019 The Graduate Center, City University of New York

All Dissertations, Theses, and Capstone Projects

Classic natural language processing resources such as the Penn Treebank (Marcus et al. 1993) have long been used both as evaluation data for many linguistic tasks and as training data for a variety of off-the-shelf language processing tools. Recent work has highlighted a gender imbalance in the authors of this text data (Garimella et al. 2019) and hypothesized that tools created with such resources will privilege users from particular demographic groups (Hovy and Søgaard 2015). Domain adaptation is typically employed as a strategy in machine learning to adjust models trained and evaluated with data from different genres. However, the present ...


Do It Like A Syntactician: Using Binary Gramaticality Judgements To Train Sentence Encoders And Assess Their Sensitivity To Syntactic Structure, Pablo Gonzalez Martinez 2019 The Graduate Center, City University of New York

All Dissertations, Theses, and Capstone Projects

The binary nature of grammaticality judgments and their use to access the structure of syntax are a staple of modern linguistics. However, computational models of natural language rarely make use of grammaticality in their training or application. Furthermore, developments in modern neural NLP have produced a myriad of methods that push the baselines in many complex tasks, but those methods are typically not evaluated from a linguistic perspective. In this dissertation I use grammaticality judgements with artificially generated ungrammatical sentences to assess the performance of several neural encoders and propose them as a suitable training target to make models learn ...


Beneath The Surface Of Talking About Physicians: A Statistical Model Of Language For Patient Experience Comments, Taylor Turpen, Lea Matthews MD, Senem Guney PhD, CPXP 2019 NarrativeDx

Patient Experience Journal

This study applies natural language processing (NLP) techniques to patient experience comments. Our goal was to examine the language describing care experiences with two groups of physicians: those with scores in the top 100 and those with scores in the bottom 100 among all physicians (n=498) who received scores from patient satisfaction surveys. Our analysis showed a statistically significant difference in the language used to describe care experiences with these two distinct groups of physicians. This analysis illustrates how to apply NLP techniques in categorizing and building a statistical model for language use in order to identify meaningful language ...


The Design And Implementation Of AIDA: Ancient Inscription Database And Analytics System, M Parvez Rashid 2019 University of Nebraska - Lincoln

Computer Science and Engineering: Theses, Dissertations, and Student Research

AIDA, the Ancient Inscription Database and Analytics system, can be used to translate and analyze the ancient Minoan language. The AIDA system currently stores three types of ancient Minoan inscriptions: Linear A, Cretan Hieroglyph and Phaistos Disk inscriptions. In addition, AIDA provides candidate syllabic values and translations of Minoan words and inscriptions into English. The AIDA system allows users to change these candidate phonetic assignments to the Linear A, Cretan Hieroglyph and Phaistos symbols. Hence the AIDA system provides scholars not only with a convenient online resource to browse Minoan inscriptions but also with an analysis tool to explore ...


The Perception Of Mandarin Tones In "Bubble" Noise By Native And L2 Listeners, Mengxuan Zhao 2019 The Graduate Center, City University of New York

All Dissertations, Theses, and Capstone Projects

Previous studies have revealed the complexity of Mandarin tones. For example, similarities in the pitch contours of tones 2 and 3 and tones 3 and 4 cause confusion for listeners. The realization of a tone's contour is highly dependent on its context, especially the preceding pitch. This is known as the coarticulation effect. Researchers have demonstrated the robustness of tone perception by both native and non-native listeners, even with incomplete acoustic information or in noisy environments. However, non-native listeners were observed to behave differently from native listeners in their use of contextual information. For example, the disagreement between the ...


Analyzing Prosody With Legendre Polynomial Coefficients, Rachel Rakov 2019 The Graduate Center, City University of New York

All Dissertations, Theses, and Capstone Projects

This investigation demonstrates the effectiveness of Legendre polynomial coefficients representing prosodic contours within the context of two different tasks: nativeness classification and sarcasm detection. By making use of accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute significantly to the body of research focused on analyzing prosody in linguistics as well as modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we answer prosodic questions about differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin. We also learn more about prosodic qualities of sarcastic speech ...


Corpus Of Usage Examples: What Is It Good For?, Timofey Arkhangelskiy 2019 Universität Hamburg, Alexander von Humboldt Foundation

Proceedings of the Workshop on Computational Methods for Endangered Languages

Lexicography and corpus studies of grammar have a long history of fruitful interaction. For the most part, however, this has been a one-way relationship. Lexicographers have extensively used corpora to identify previously undetected word senses or find natural usage examples; using lexicographic materials when conducting data-driven investigations of grammar, on the other hand, is hardly commonplace. In this paper, I present a Beserman Udmurt corpus made out of "artificial" dictionary examples. I argue that, although such a corpus cannot be used for certain kinds of corpus-based research, it is nevertheless a very useful tool for writing a reference grammar ...


Developing Without Developers: Choosing Labor-Saving Tools For Language Documentation Apps, Luke D. Gessler 2019 Georgetown University

Proceedings of the Workshop on Computational Methods for Endangered Languages

Application software has the potential to greatly reduce the amount of human labor needed in common language documentation tasks. But despite great advances in the maturity of tools available for apps, language documentation apps have not attained their full potential, and language documentation projects are forgoing apps in favor of less specialized tools like paper and spreadsheets. We argue that this is due to the scarcity of software development labor in language documentation, and that a careful choice of software development tools could make up for this labor shortage by increasing developer productivity. We demonstrate the benefits of strategic tool ...


Applying Support Vector Machines To POS Tagging Of The Ainu Language, Karol Nowakowski, Michal Ptaszynski, Fumito Masui, Yoshio Momouchi 2019 Kitami Institute of Technology

Proceedings of the Workshop on Computational Methods for Endangered Languages

No abstract provided.


OCR Evaluation Tools For The 21st Century, Eddie A. Santos 2019 National Research Council Canada, University of Alberta

Proceedings of the Workshop on Computational Methods for Endangered Languages

We introduce ocreval, a port of the ISRI OCR Evaluation Tools, now with Unicode support. We describe how we upgraded the ISRI OCR Evaluation Tools to support modern text processing tasks. ocreval produces character-level and word-level accuracy reports and handles all characters representable in the UTF-8 character encoding scheme. In addition, we have implemented the Unicode default word boundary specification in order to support word-level accuracy reports for a broad range of writing systems. We argue that character-level and word-level accuracy reports produce confusion matrices that are useful for tasks beyond OCR evaluation, including tasks supporting the study and computational ...
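As a rough illustration of what a character-level accuracy report with confusion counts can look like, the sketch below compares a ground-truth string with OCR output. It does not use or reproduce ocreval: the function name `char_report` is hypothetical, the accuracy formula (characters minus errors, divided by characters) is an assumed convention, and the alignment comes from Python's standard `difflib`, which is not guaranteed to yield a minimal edit script.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_report(truth, ocr):
    """Return (accuracy, confusions) for OCR output against ground truth.

    Hypothetical helper for illustration only; not part of ocreval.
    Python strings are sequences of Unicode code points, so non-ASCII
    text is compared character by character rather than byte by byte.
    """
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    errors = 0
    confusions = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Count each mismatched span by its longer side.
        errors += max(i2 - i1, j2 - j1)
        # Record (expected text, produced text); "" marks an insertion or deletion.
        confusions[(truth[i1:i2], ocr[j1:j2])] += 1
    accuracy = (len(truth) - errors) / len(truth) if truth else 0.0
    return accuracy, confusions


accuracy, confusions = char_report("hello world", "hel1o world")
```

Here the only mismatch is the letter "l" read as the digit "1", so `confusions` holds a single entry for that pair. Tallying such pairs over a whole corpus gives the kind of confusion matrix the abstract refers to.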


Building A Common Voice Corpus For Laiholh (Hakha Chin), Kelly Berkson, Samson Lotven, Peng Hlei Thang, Thomas Thawngza, Zai Sung, James C. Wamsley, Francis Tyers, Kenneth Van Bik, Sandra Kübler, Donald Williamson, Matthew Anderson 2019 Indiana University

Proceedings of the Workshop on Computational Methods for Endangered Languages

No abstract provided.


Bootstrapping A Neural Morphological Analyzer For St. Lawrence Island Yupik From A Finite-State Transducer, Lane Schwartz, Emily Chen, Benjamin Hunt, Sylvia LR Schreiner 2019 University of Illinois at Urbana-Champaign

Proceedings of the Workshop on Computational Methods for Endangered Languages

No abstract provided.


Future Directions In Technological Support For Language Documentation, Daan van Esch, Ben Foley, Nay San 2019 Google

Proceedings of the Workshop on Computational Methods for Endangered Languages

To reduce the annotation burden placed on linguistic fieldworkers, freeing up time for deeper linguistic analysis and descriptive work, the language documentation community has been working with machine learning researchers to investigate what assistive role technology can play, with promising early results. This paper describes a number of potential follow-up technical projects that we believe would be worthwhile and straightforward to do. We provide examples of the annotation tasks for computer scientists; descriptions of the technological challenges involved and the estimated level of complexity; and pointers to relevant literature. We hope providing a clear overview of what the needs are ...


Handling Cross-Cutting Properties In Automatic Inference Of Lexical Classes: A Case Study Of Chintang, Olga Zamaraeva, Kristen Howell, Emily M. Bender 2019 University of Washington

Proceedings of the Workshop on Computational Methods for Endangered Languages

In the context of the ongoing AGGREGATION project concerned with inferring grammars from interlinear glossed text, we explore the integration of morphological patterns extracted from IGT data with inferred syntactic properties in the context of creating implemented linguistic grammars. We present a case study of Chintang, in which we put emphasis on evaluating the accuracy of these predictions by using them to generate a grammar and parse running text. Our coverage over the corpus is low because the lexicon produced by our system only includes intransitive and transitive verbs and nouns, but it outperforms an expert-built, oracle grammar of similar ...


Towards A General-Purpose Linguistic Annotation Backend, Graham Neubig, Patrick Littell, Chian-Yu Chen, Jean Lee, Zirui Li, Yu-Hsiang Lin, Yuyan Zhang 2019 Carnegie Mellon University

Proceedings of the Workshop on Computational Methods for Endangered Languages

No abstract provided.


Bootstrapping A Neural Morphological Generator From Morphological Analyzer Output For Inuktitut, Jeffrey Micher 2019 US Army Research Laboratory

Proceedings of the Workshop on Computational Methods for Endangered Languages

No abstract provided.


Finding Sami Cognates With A Character-Based NMT Approach, Mika Hämäläinen, Jack Rueter 2019 University of Helsinki

Proceedings of the Workshop on Computational Methods for Endangered Languages

We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem, on the one hand, by training the model with North Sami cognates with other Uralic languages and, on the other, by generating more synthetic training data with an SMT model. The cognates found using our method are made publicly available in the Online Dictionary of Uralic Languages.

