OCR, Text Encoding and SGML/XML


Summary

Scanning; OCR packages with non-Western character support (FineReader, Recognita, Cuneiform, etc.); SGML/XML formatting; TEI standard.


Homework

HW #4 (due end of Week 6): Scan one page from a dictionary in a non-Western script and use an OCR software package to convert it into text. Encode the first ten entries in SGML using the TEI standard. E-mail both the plain text and the SGML-coded text to me.


Ample OCR (Optical Character Recognition) software is available on the Internet, and leading companies offer modules for Slavic languages. Take a look at the following pieces of software:


You can download trial versions from the last two links and use them to practice scanning and OCR.
Take a look at an overview of available introductions and tutorials to
XML is a subset of SGML which is particularly convenient for exchange over the Internet.

Take a look at Anić's dictionary below as an example of SGML coding.

XML file - an example:

Click here to see an XML document

Here is what is behind it. You need the following two files:

Save this as serbepic.xml

<?xml version="1.0" standalone="no"?>
<!DOCTYPE POEM SYSTEM "serbepic.dtd">
<POEM>
    <TITLE>Marko Kraljevic i Vila Ravijojla</TITLE>
    <AUTHOR><FIRSTNAME>Unknown</FIRSTNAME>
    <LASTNAME>Unknown</LASTNAME></AUTHOR>
        <LINE N="1">
          <FOUR>Vino pije</FOUR><SIX>Kraljevicu Marko</SIX></LINE> 
        <LINE N="2">
          <FOUR>A u Skadru,</FOUR><SIX>gradu bijelome</SIX></LINE> 
</POEM>
Save this as serbepic.dtd (Document Type Definition)
<!ELEMENT POEM     	(TITLE, AUTHOR, LINE*)>
<!ELEMENT TITLE   	(#PCDATA)>
<!ELEMENT AUTHOR	(FIRSTNAME, LASTNAME)>
<!ELEMENT FIRSTNAME  (#PCDATA)>
<!ELEMENT LASTNAME   (#PCDATA)>
<!ELEMENT LINE   	(FOUR, SIX)>
<!ELEMENT FOUR    	(#PCDATA)>
<!ELEMENT SIX    	(#PCDATA)>
<!ATTLIST  LINE  N  CDATA  #REQUIRED>
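To see how the DTD constrains the document, here is a minimal Python sketch that parses the poem and checks part of the same structure by hand. It is only an illustration: Python's standard xml.etree.ElementTree module reads the DOCTYPE line but does not validate against the DTD, so the function check_poem below is a hypothetical, hand-written stand-in for a few of the DTD rules. The poem is inlined so that the example runs without any files, and its DOCTYPE names POEM, the root element.

```python
import xml.etree.ElementTree as ET

# The poem from serbepic.xml, inlined so the sketch is self-contained.
POEM_XML = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE POEM SYSTEM "serbepic.dtd">
<POEM>
    <TITLE>Marko Kraljevic i Vila Ravijojla</TITLE>
    <AUTHOR><FIRSTNAME>Unknown</FIRSTNAME>
    <LASTNAME>Unknown</LASTNAME></AUTHOR>
        <LINE N="1">
          <FOUR>Vino pije</FOUR><SIX>Kraljevicu Marko</SIX></LINE>
        <LINE N="2">
          <FOUR>A u Skadru,</FOUR><SIX>gradu bijelome</SIX></LINE>
</POEM>"""


def check_poem(root):
    """Hypothetical partial check of the rules in serbepic.dtd.

    Returns a list of problem descriptions (empty if none were found).
    """
    problems = []
    if root.tag != "POEM":
        problems.append("root element must be POEM")
    children = [child.tag for child in root]
    # <!ELEMENT POEM (TITLE, AUTHOR, LINE*)>
    if children[:2] != ["TITLE", "AUTHOR"] or any(t != "LINE" for t in children[2:]):
        problems.append("POEM must contain TITLE, AUTHOR, then LINE elements")
    for line in root.iter("LINE"):
        # <!ELEMENT LINE (FOUR, SIX)>
        if [c.tag for c in line] != ["FOUR", "SIX"]:
            problems.append("LINE must contain FOUR followed by SIX")
        # <!ATTLIST LINE N CDATA #REQUIRED>
        if "N" not in line.attrib:
            problems.append("LINE is missing the required attribute N")
    return problems


root = ET.fromstring(POEM_XML)
problems = check_poem(root)
verses = [
    (line.get("N"), line.findtext("FOUR"), line.findtext("SIX"))
    for line in root.iter("LINE")
]
for number, four, six in verses:
    print(number, four, "|", six)
```

For real validation against serbepic.dtd you would need a validating parser; the sketch only shows what the content models and the #REQUIRED attribute demand.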

Also, click here to see a formatted XML document. For that one, you need:

serbepic2.xml

<?xml version="1.0" encoding="windows-1250" standalone="no"?>
<?xml-stylesheet type="text/css" href="serbepic.css"?>
<!DOCTYPE POEM SYSTEM "serbepic.dtd">
<POEM>
    <TITLE>Marko Kraljević i Vila Ravijojla</TITLE>
    <AUTHOR><FIRSTNAME>Unknown</FIRSTNAME> <LASTNAME>Unknown</LASTNAME></AUTHOR>
        <LINE N="1">
          <FOUR>Vino pije</FOUR> <SIX>Kraljeviću Marko</SIX></LINE>
        <LINE N="2">
          <FOUR>A u Skadru,</FOUR> <SIX>gradu bijelome</SIX></LINE> 
</POEM>
serbepic.dtd
<!ELEMENT POEM     	(TITLE, AUTHOR, LINE*)>
<!ELEMENT TITLE   	(#PCDATA)>
<!ELEMENT AUTHOR	(FIRSTNAME, LASTNAME)>
<!ELEMENT FIRSTNAME  (#PCDATA)>
<!ELEMENT LASTNAME   (#PCDATA)>
<!ELEMENT LINE   	(FOUR, SIX)>
<!ELEMENT FOUR    	(#PCDATA)>
<!ELEMENT SIX    	(#PCDATA)>
<!ATTLIST  LINE  N  CDATA  #REQUIRED>
and serbepic.css
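POEM
{
/* XML elements are inline by default; width only applies to block boxes */
display: block;
background-color: gainsboro;
width: 100%;
}
LINE
{
display: block;
margin-bottom: 3pt;
margin-left: 0;
}
FOUR
{
/* non-zero lengths need a unit */
margin-left: 3pt;
color: blue;
}
/* the second rule repeated the FOUR selector; SIX is assumed to be intended */
SIX
{
margin-left: 3pt;
}
AUTHOR
{
display: block;
color: white;
margin-bottom: 10pt;
margin-left: 0;
}
TITLE
{
color: red;
font-size: 20pt;
}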
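The encoding="windows-1250" declaration in serbepic2.xml matters because the same letter is stored as different bytes in different encodings. The following Python sketch is an illustration only: it uses the name Kraljević from the poem, Python's codec name cp1250 for windows-1250, and ISO 8859-3 as an arbitrarily chosen "wrong" encoding (an assumption made for the demonstration, not something the files above use).

```python
name = "Kraljević"

# The same text stored in two encodings.
as_cp1250 = name.encode("cp1250")  # cp1250 is Python's name for windows-1250
as_utf8 = name.encode("utf-8")

print(as_cp1250)  # one byte for the final letter
print(as_utf8)    # two bytes for the final letter

# Decoding with the encoding that was used to write the bytes restores the text.
assert as_cp1250.decode("cp1250") == name

# Decoding the same bytes with a different single-byte encoding raises no error
# but silently produces a different final letter.
garbled = as_cp1250.decode("iso8859_3")
print(garbled)
```

This is why the file saved on disk and the encoding named in the XML declaration have to agree.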


Extensive information about the TEI (Text Encoding Initiative) standard is available at the TEI Home Page.
Take a look at the first page of Anić's (1998) Croatian dictionary:

Printed text | TEI SGML coded text
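As a rough idea of what a TEI-coded dictionary entry involves, here is a small Python sketch that builds one entry as markup. The element names (entry, form, orth, gramGrp, pos, sense, def) are the ones I recall from the TEI dictionary tag set; check them against the TEI Guidelines before relying on them. The function make_entry and the values HEADWORD, n, and DEFINITION TEXT are hypothetical placeholders, not material from Anić's dictionary.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, part_of_speech, definition):
    """Build a minimal TEI-style dictionary entry (hypothetical helper)."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword           # headword as printed
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = part_of_speech      # part of speech
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition         # definition text
    return entry


# Placeholder values only; substitute what your OCR output contains.
entry = make_entry("HEADWORD", "n", "DEFINITION TEXT")
markup = ET.tostring(entry, encoding="unicode")
print(markup)
```

Real entries usually carry more information (pronunciation, inflection, examples), so treat this as a starting skeleton rather than a complete encoding.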

Regular expressions can be used even in Microsoft Word. Press Ctrl-H, choose "More", and then check "Use wildcards". For example, if you have the following sequence

first second

and you use the following:

Find what: (<*>) (<*>)
Replace with: <second>\2</second> <first>\1</first>

you will get:

<second>second</second> <first>first</first>
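The same swap can be sketched outside Word, for example with Python's re module. This is only a rough equivalent: \w+ stands in for Word's <*> (a whole word), and the numbered groups \1 and \2 work the same way as in the Word replacement above.

```python
import re

text = "first second"

# Two words separated by a space, each captured in its own group.
pattern = re.compile(r"(\w+) (\w+)")

# Wrap each word in a tag and swap their order using the group numbers.
swapped = pattern.sub(r"<second>\2</second> <first>\1</first>", text)
print(swapped)
```

Run on the sequence above, this prints the same tagged and reordered result that Word produces.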

More about regular expressions in various languages here


Example

Last year's project