Computational Linguistics of Slavic Languages:

A Hands-on Introduction

 

3 credit hours, general studies CS designation

 

Instructor

 

Danko Sipka, Ph.D.

Professor of Slavic Languages

e-mail: danko.sipka@asu.edu

Web: http://www.public.asu.edu/~dsipka

Office hours: by e-mail (Danko.Sipka@asu.edu) at any time; by telephone (480 637 8427) and in the office (LL 419B) according to my schedule.

 

Course Web Page

 

http://www.asusilc.net/asucomp

 

Schedule & Location

 

Spring semester 2010, TuTh, 3:00-4:15 pm, CC 223

 

Prerequisites

 

The only prerequisite is basic user-level familiarity with Microsoft Word and Microsoft Internet Explorer. The course is not limited to Slavists: Slavic languages are used only as examples, and participants from other language and literature fields are welcome.

 

Objectives

 

This general studies CS designation course has two principal objectives: a) to develop the basic skills required of computational linguists, thus enabling participants to be competitive in the language industries, and b) to lay the foundations for possible further development in more advanced fields of the language industries.

 

Description

 

This course focuses on the following fields of computational linguistics: a) computer-assisted language learning, or CALL (Web pages, Unicode and other standards, interactive on-line exercises), b) digitizing (scanning, optical character recognition, SGML formatting), c) textual corpora (concordances, frequency analysis, reversed lexical lists, etc.), d) statistical analysis (descriptive and elementary inferential statistics), e) data mining in the Web (search engines, Web resources, Web crawlers), f) computational lexicography (lexical databases, morphological taggers and parsers), and g) programming in Perl (branching, looping, arrays, file manipulation, regular expressions).

 

Students are assigned several practical tasks within each field. While emphasizing the hands-on component, this course also familiarizes its participants with the basic concepts of the language industries.

The course topics are detailed below in the sequence in which they will be presented.

 

Overview

Course mechanics; Structural characteristics of Slavic languages and their relevance to computational linguistics; Language industries: possibilities and limitations; Operating systems (Windows, Mac OS, Linux) and non-English writing systems.

HW # 1 (due end of Week 1): Part 1: Get your ASURITE ID if you do not have one, create a Web space at www.public.asu.edu, and e-mail your personal space address to me. Part 2: Set up your computer so that you can create a text in a non-Western-European script. E-mail me a sample of such text.

 

Web pages and code pages

HTML coding; Representing various code pages with special emphasis on Unicode; Formatting; Tables; Links; Multimedia content and other bells and whistles; Using ready-made software to design Web pages.

HW # 2 (due end of Week 2): Create a Web page in your personal space which contains non-English characters, tables, images (e.g., your mug shot), sounds (e.g., you cursing in an exotic language), and links. E-mail the link to your page to me.
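To illustrate why code pages matter for this topic, here is a minimal sketch in Python (chosen only for illustration; it is not part of the course materials). It shows how one Cyrillic letter is stored under Unicode (UTF-8) and under two legacy code pages, and what a browser displays when it assumes the wrong code page. The choice of letter and of code pages is an assumption made for the example.

```python
# One Cyrillic letter, several byte representations.
letter = "Ж"  # CYRILLIC CAPITAL LETTER ZHE

# The Unicode code point is the same no matter how the text is stored.
code_point = f"U+{ord(letter):04X}"

# The bytes on disk depend on the encoding (code page) used to save the page.
utf8_bytes = letter.encode("utf-8")       # two bytes in UTF-8
cp1251_bytes = letter.encode("cp1251")    # one byte in Windows-1251
iso_bytes = letter.encode("iso8859_5")    # one byte in ISO 8859-5

# Reading Windows-1251 bytes as if they were Latin-1 yields the wrong
# character, which is what a page with a wrong charset declaration shows.
misread = cp1251_bytes.decode("latin-1")

print(code_point, utf8_bytes, cp1251_bytes, iso_bytes, misread)
```

A Web page avoids the misreading in the last step by declaring its encoding, for example UTF-8, in its header.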

 

Interactive on-line exercises

CALL; HTML forms; JavaScript; Exercise formats (multiple choice, fill in the blank, etc.); Blackboard & Co.; More bells and whistles with JavaScript.

HW # 3 (due end of Week 4): Part 1: Create two Web pages, each with ten exercise tasks and each with a different exercise format. Part 2: Design a Web page with dynamic multimedia content using tables and the OnMouseOver/OnMouseOut functions. E-mail the links to all these pages to me.

 

Digitizing printed resources

Scanning; OCR packages with non-Western character support (FineReader, Recognita, Cuneiform, etc.); SGML/XML formatting; TEI standard.

HW # 4 (due end of Week 6): Scan one page from a non-Western script dictionary and use an OCR software package to convert it into text. Encode the first ten entries in SGML using the TEI standard. E-mail both the plain text and the SGML-coded text to me.

 

Working with textual corpora

Representativeness; Practical applications of language corpora; Concordances; Frequency lists; Reverse lists; Content analysis.

HW # 5 (due end of Week 8): Create two corpora from two different sections (e.g., Sports and World) of a newspaper of your choice. Download at least 500 KB of text from each section. Create a concordance and a frequency list for each corpus. Analyze the data and write a report stating your findings. Attach the top 50 ranks from each frequency list to your report and e-mail it to me.
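As a rough illustration of the three corpus tools named above (frequency list, concordance, reverse list), here is a minimal sketch in Python. It is not the software used in the course; the function names, the sample sentence, and the simple `\w+` tokenization are assumptions made for the example, and real corpora need more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w matches Unicode letters, so Cyrillic works as is.
    """
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, top=50):
    """Return the `top` most frequent tokens with their counts."""
    return Counter(tokens).most_common(top)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def reverse_list(tokens):
    """Return unique word forms sorted by their reversed spelling."""
    return sorted(set(tokens), key=lambda w: w[::-1])


# Hypothetical sample text standing in for a downloaded newspaper section.
sample = "Мама мыла раму. Папа мыл машину, а мама мыла окно."
tokens = tokenize(sample)

print(frequency_list(tokens, top=2))
print(concordance(tokens, "мама", width=2))
print(reverse_list(tokens))
```

Sorting by reversed spelling groups words with the same endings together, which is why reverse lists are useful for studying suffixes in inflected languages.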

 

Statistical analysis

Data gathering and coding; Working with statistical packages; Descriptive statistics (frequency, mean, standard deviation, etc.); Inferential statistics (correlation coefficient); Graphic presentation of the results.

HW # 6 (due end of Week 9): Write a research proposal that uses the correlation coefficient as a tool to test a hypothesis related to your field of specialization. E-mail it to me.
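For orientation, the correlation coefficient mentioned above can be computed by hand as in the following minimal Python sketch. It assumes the Pearson product-moment coefficient is meant (the syllabus does not say which coefficient), and the data are invented purely for the example; in the course itself a statistical package does this work.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (dev_x * dev_y)


# Invented data: hours of study per week and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 75, 80]
r = pearson_r(hours, scores)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates none; whether the relationship is statistically significant is a separate question that the proposal should address.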

 

Data mining in the Web

Web search engines; Web resources for Slavic language and area studies; General linguistic resources; Web crawlers.

HW # 7 (due end of Week 10): Create a Web page devoted to a specific problem with a short description and at least thirty relevant external links. Send the link to your Web page to me.

 

Computational lexicography and grammar

Natural language processing (NLP) of Slavic languages; Lexical databases; Sorting; Exporting and importing; Morphological parsers; Morphological generators.

HW # 8 (due end of Week 12): Create a 50-entry bilingual dictionary (non-Western language X - English) with equivalents, POS tags, usage labels, and examples with their translations, in the form of a Microsoft Access database. E-mail both the database and the English – non-Western language X index as a text file exported from the database.

 

Programming in Perl

Branching; Looping; Arrays; File manipulation; Regular expressions; Perl and CGI scripts.

HW # 9 (due end of Week 13): Write a Perl script which converts any input string by replacing all voiceless consonants with voiced ones and vice versa. E-mail the script to me.
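The logic behind this task can be sketched as a character-for-character translation table. The sketch below is in Python rather than the Perl the assignment requires, and it assumes Russian Cyrillic input with six voiced/voiceless pairs; the pair inventory is an assumption for illustration and differs from language to language. Consonants without a partner are left unchanged.

```python
# Assumed voiced/voiceless pairs for Russian, listed in matching order.
VOICED = "бвгджз"
VOICELESS = "пфктшс"

# Map every letter to its partner in both directions and in both cases.
source = VOICED + VOICELESS
target = VOICELESS + VOICED
SWAP_TABLE = str.maketrans(source + source.upper(), target + target.upper())


def swap_voicing(text):
    """Replace voiced consonants with voiceless ones and vice versa."""
    return text.translate(SWAP_TABLE)


print(swap_voicing("год"))  # → кот
print(swap_voicing("Зуб"))  # → Суп
```

Because the table is built once and applied in a single pass, a letter that has already been swapped is never swapped back, which is the usual pitfall of chaining several separate search-and-replace operations.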

HW # 10 (due end of Week 14): Write a Perl script which extracts suffixes entered by the user from any given textual corpus. E-mail the script to me.

Final Project (due end of Week 16): Export your dictionary from HW # 8 into a text file. Write an HTML form and Perl code, to be used as a CGI script, which will query your database by headword, equivalent, POS tag, and usage label. E-mail the dictionary text file, the HTML form, and the Perl CGI script to me.
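One possible reading of this task is: given a list of suffixes, find all word forms in the corpus that end in each of them. The following Python sketch illustrates that reading only; it is not the required Perl solution, and the function name, the sample sentence, and the rule that a word must be longer than the suffix are assumptions.

```python
import re


def words_with_suffixes(text, suffixes):
    """Map each suffix to the sorted unique word forms that end in it."""
    words = set(re.findall(r"\w+", text.lower()))
    return {
        suffix: sorted(
            w for w in words
            # The word must end in the suffix and have some stem before it.
            if w.endswith(suffix) and len(w) > len(suffix)
        )
        for suffix in suffixes
    }


# Hypothetical sample text standing in for a corpus file.
corpus = "Учитель и писатель читали книги; читатель слушал."
hits = words_with_suffixes(corpus, ["тель", "ли"])
print(hits)
```

In an interactive version, the suffix list would come from user input and the corpus text from a file opened with an explicit encoding such as UTF-8.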
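The query step of the final project can be pictured as filtering the rows of the exported text file by whichever form fields the user filled in. The sketch below shows only that filtering logic, in Python rather than Perl and without the CGI layer. The tab-delimited layout, the column names (`headword`, `equivalent`, `pos`, `usage`), and the three sample entries are all hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited export with a header row.
EXPORT = (
    "headword\tequivalent\tpos\tusage\n"
    "книга\tbook\tnoun\tneutral\n"
    "читать\tto read\tverb\tneutral\n"
    "башка\thead\tnoun\tcolloquial\n"
)


def load_entries(handle):
    """Read the export into a list of dictionaries, one per entry."""
    return list(csv.DictReader(handle, delimiter="\t"))


def query(entries, **criteria):
    """Return entries matching every non-empty criterion exactly."""
    active = {field: value for field, value in criteria.items() if value}
    return [
        entry for entry in entries
        if all(entry.get(field) == value for field, value in active.items())
    ]


entries = load_entries(io.StringIO(EXPORT))
nouns = query(entries, pos="noun")
colloquial_nouns = query(entries, pos="noun", usage="colloquial")
print([e["headword"] for e in nouns])
print([e["headword"] for e in colloquial_nouns])
```

Ignoring empty criteria mirrors how a search form behaves when the user leaves some fields blank.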

 

Course Materials

 

All course materials, including tutorials for further exploration of each of the covered topics, will be available on-line from the course Web page.

 

Grading Policy

 

Student performance is evaluated through weekly projects and the final project (see Description above). All projects must be submitted in order to pass the course. Students who submit all projects on time and in the correct form will receive an A. The grade is lowered by one level for every two late or incomplete projects: one late or incomplete project is still an A, two or three is a B, four or five is a C, six or seven is a D, and eight or more is an E. The final project must be submitted on time. Failure to complete the final project counts as two incomplete projects.

 

Schedule

 

Weeks 1 and 2: Overview, Web pages and code pages

Weeks 3 and 4: Interactive on-line exercises

Weeks 5 and 6: Digitizing printed resources

Weeks 7 and 8: Working with textual corpora

Weeks 9 and 10: Statistical analysis, Data mining in the Web

Weeks 11 and 12: Computational lexicography and grammar

Weeks 13-16: Programming in Perl