Computational Linguistics of Slavic Languages: A Hands-on Introduction
3 credit hours, general studies CS designation
Instructor
Danko Sipka, Ph.D.
Professor of Slavic Languages
e-mail: danko.sipka@asu.edu
Web: http://www.public.asu.edu/~dsipka
Course Web Page
http://www.asusilc.net/asucomp
Schedule & Location
Spring semester 2010, TuTh, 3:00-4:15 pm, CC 223
Prerequisites
The only prerequisite is basic user-level familiarity with Microsoft Word and Microsoft Explorer. The course is not limited to Slavists: Slavic languages serve only as examples, and participants from other language and literature fields are welcome.
Objectives
This general studies CS designation course has two principal objectives: a) to develop the basic skills required of computational linguists, thus enabling participants to be competitive in the language industries, and b) to lay the foundations for further development in more advanced fields of the language industries.
Description
This course focuses on the following fields of computational linguistics: a) computer-assisted language learning, or CALL (Web pages, Unicode and other standards, interactive on-line exercises), b) digitizing (scanning, optical character recognition, SGML formatting), c) textual corpora (concordances, frequency analysis, reversed lexical lists, etc.), d) statistical analysis (descriptive and elementary inferential statistics), e) data mining in the Web (search engines, Web resources, Web crawlers), f) computational lexicography (lexical databases, morphological taggers and parsers), g) programming in Perl (branching, looping, arrays, file manipulation, regular expressions).
Students are assigned several practical tasks within each field. While emphasizing a hands-on component, this course will also familiarize its participants with the basic concepts in the fields of language industries.
Following are details on the course topics and the sequence in which they will be presented.
Overview
Course mechanics; Structural characteristics of Slavic languages and their relevance to computational linguistics; Language industries: possibilities and limitations; Operating systems (Windows, Mac OS, Linux) and non-English writing systems.
HW # 1 (due end of Week 1): Part 1: Get your ASURITE ID if you do not have one, create a Web space at www.public.asu.edu, and e-mail your personal space address to me. Part 2: Set up your computer so that you can create a text in a non-Western-European script. E-mail me such a text.
Web pages and code pages
HTML coding; Representing various code pages with special emphasis on Unicode; Formatting; Tables; Links; Multimedia content and other bells and whistles; Using ready-made software to design Web pages.
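As a brief illustration of why code pages matter, the sketch below encodes one Cyrillic letter under three encodings and shows what happens when bytes are read with the wrong code page. The sample letter, the choice of code pages, and the function name are assumptions made for this example only.

```python
def encode_in_code_pages(text, codecs=("cp1251", "iso8859_5", "utf-8")):
    """Return a mapping from codec name to the bytes that encode `text`."""
    return {codec: text.encode(codec) for codec in codecs}


letter = "ж"  # Cyrillic small letter zhe, U+0436 (sample chosen for illustration)
encodings = encode_in_code_pages(letter)

for codec, raw in encodings.items():
    # The same letter is stored as different byte values in each code page.
    print(f"{codec:10} {raw.hex()}")

# Reading Windows-1251 bytes as if they were ISO 8859-5 yields a different letter.
misread = encodings["cp1251"].decode("iso8859_5")
print("misread as:", misread)
```

Unicode avoids this ambiguity by giving every character a single code point, which is why a Web page should declare its encoding explicitly.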
HW # 2 (due end of Week 2): Create a Web page in your personal space which contains non-English characters, tables, images (e.g., your mug shot), sounds (e.g., you cursing in an exotic language), and links. E-mail the link to your page to me.
Interactive on-line exercises
CALL; HTML forms; JavaScript; Exercise formats (multiple choice, fill in the blank, etc.); Blackboard & Co.; More bells and whistles with JavaScript.
HW # 3 (due end of Week 4): Part 1: Create two Web pages, each with ten exercise tasks and each with a different exercise format. Part 2: Design a Web page with dynamic multimedia content using tables and the onMouseOver/onMouseOut event handlers. E-mail the links to all these pages to me.
Digitizing printed resources
Scanning; OCR packages with non-Western character support (FineReader, Recognita, Cuneiform, etc.); SGML/XML formatting; TEI standard.
HW # 4 (due end of Week 6): Scan one page from a non-Western script dictionary and use an OCR software package to convert it into text. Encode the first ten entries in SGML using the TEI standard. E-mail both the plain text and the SGML-coded text to me.
Working with textual corpora
Representativeness; Practical applications in language corpora; Concordances; Frequency lists; Reverse lists; Content analysis.
HW # 5 (due end of Week 8): Create two corpora from two different newspaper sections (e.g., Sports and World) from a newspaper of your choice. Download at least 500K of text from each section. Create a concordance and frequency list for each corpus. Analyze the data and compose a report stating your findings. Attach the top 50 ranks from the frequency list to your report and e-mail it to me.
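To make the two outputs named above concrete, here is a minimal sketch of a frequency list and a keyword-in-context concordance. The function names, the context width, and the one-line sample text are assumptions for illustration; a real corpus would be read from the downloaded files, and dedicated concordance software may tokenize differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    \\w+ matches Unicode letters and digits, so non-Latin scripts work too.
    """
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens, top=50):
    """Return the `top` most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(top)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}")
    return lines


# Tiny made-up sample standing in for a newspaper corpus.
sample = "The team won the match. The match was long, and the team was tired."
tokens = tokenize(sample)

for word, count in frequency_list(tokens, top=5):
    print(f"{count:3} {word}")

for line in concordance(tokens, "match", width=2):
    print(line)
```

Comparing the top ranks of two such lists (for example, Sports against World) is one simple way to start the analysis the report asks for.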
Statistical analysis
Data gathering and coding; Working with statistical packages; Descriptive statistics (frequency, mean, standard deviation, etc.); Inferential statistics (correlation coefficient); Graphic presentation of the results.
HW # 6 (due end of Week 9): Write a research proposal that uses the correlation coefficient as a tool to test a hypothesis related to your field of specialization. E-mail it to me.
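For orientation, the sketch below computes the Pearson correlation coefficient, one common form of correlation coefficient, directly from its definition. Choosing Pearson's r, the function name, and the sample figures (invented word lengths and frequencies) are assumptions for illustration; a statistical package would normally do this calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / sqrt(spread_x * spread_y)


# Invented data: word length in letters and occurrences per 10,000 words.
word_lengths = [2, 3, 5, 7, 9]
frequencies = [310, 180, 60, 25, 8]
print(round(pearson_r(word_lengths, frequencies), 3))
```

A value near -1 or +1 indicates a strong linear relationship, while a value near 0 indicates little or none.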
Data mining in the Web
Web search engines; Web resources for Slavic language and area studies; General linguistic resources; Web crawlers.
HW # 7 (due end of Week 10): Create a Web page devoted to a specific problem with a short description and at least thirty relevant external links. Send the link to your Web page to me.
Computational lexicography and grammar
Natural language processing (NLP) of Slavic languages; Lexical databases; Sorting; Exporting and importing; Morphological parsers; Morphological generators.
HW # 8 (due end of Week 12): Create a 50-entry bilingual dictionary (non-Western language X - English) with equivalents, POS tags, usage labels, examples and their translations in the form of a Microsoft Access database. E-mail both the database and the English – non-Western language X index as a text file exported from the database.
Programming in Perl
Branching; Looping; Arrays; File manipulation; Regular expressions; Perl and CGI scripts.
HW # 9 (due end of Week 13): Write a Perl script which converts any input string by replacing all voiceless consonants with their voiced counterparts and vice versa. E-mail the script to me.
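The idea behind this conversion can be sketched as a character translation table. The example is written in Python purely to illustrate the logic (the assignment itself calls for Perl), and the list of consonant pairs is a simplified assumption: it covers only single-letter Latin-script pairs and ignores digraphs and consonants without a counterpart.

```python
# Simplified, assumed inventory of (voiceless, voiced) pairs in Latin script.
PAIRS = [("p", "b"), ("t", "d"), ("k", "g"), ("s", "z"), ("f", "v"), ("š", "ž")]


def build_table(pairs):
    """Build a two-way translation table covering lower and upper case."""
    mapping = {}
    for voiceless, voiced in pairs:
        for source, target in ((voiceless, voiced), (voiced, voiceless)):
            mapping[source] = target
            mapping[source.upper()] = target.upper()
    return str.maketrans(mapping)


VOICING_TABLE = build_table(PAIRS)


def swap_voicing(text):
    """Replace each paired consonant with its voiced or voiceless counterpart."""
    return text.translate(VOICING_TABLE)


print(swap_voicing("grad"))
print(swap_voicing("Zub"))
```

Because the table is symmetric, applying the function twice returns the original string, which makes a convenient self-check.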
HW # 10 (due end of Week 14): Write a Perl script which extracts suffixes entered by the user from any given textual corpus. E-mail the script to me.
Final Project (due end of Week 16): Export your dictionary from HW # 8 into a text file. Write an HTML form and Perl code to be used as a CGI script which will query your database by headword, equivalent, POS tag, and usage label. E-mail the dictionary text file, the HTML form, and the Perl CGI script to me.
Course Materials
All course materials, including tutorials for further exploration of each of the covered topics, will be available on-line from the course Web page.
Grading Policy
Student performance is evaluated through weekly projects and the final project (see Description above). All projects must be submitted in order to pass the course. Students who submit all projects on time and in the correct form will receive an A. The grade is lowered by one level for every two late or incomplete projects. Thus, one late or incomplete project still earns an A, two or three earn a B, four or five a C, six or seven a D, and eight or more an E. The final project must be submitted on time. Failure to complete the final project counts as two incomplete projects.
Schedule
Weeks 1 and 2: Overview, Web pages and code pages
Weeks 3 and 4: Interactive on-line exercises
Weeks 5 and 6: Digitizing printed resources
Weeks 7 and 8: Working with textual corpora
Weeks 9 and 10: Statistical analysis, Data mining in the Web
Weeks 11 and 12: Computational lexicography and grammar
Weeks 13-16: Programming in Perl