"To us all towns are one, all men our kin. |
Home | Trans State Nation | Tamil Eelam | Beyond Tamil Nation | Comments |
Home > Tamil Digital Renaissance > Tamilnet'99 > Dr.K.Kalyanasundaram
K.Kalyanasundaram and Muthu Nedumaran
Singapore Internet Working Group for Tools and Standards for Tamil Computing
Introduction | TSCII Initiative | Existing Tamil Standards | Biggest challenge today | Design Goals of TSCII initiative | Details of TSCII encoding | TSCII encoding and desktop publishing | Two factors guide the selection of glyphs | TSCII encoding and information exchange | TSCII encoding and data base applications | Concluding Remarks | A dedicated Web site for TSCII.
Introduction

Dravidian languages such as Tamil use non-Roman characters. Historically, writing Tamil text in transliterated form using Roman characters was the common practice, particularly amongst Western scholars. Dedicated software (text editors and word processors) capable of rendering Tamil script using built-in Tamil fonts made its debut in the eighties and soon became popular, particularly in the Malaysia-Singapore region. This commercial software was rather expensive, so its usage was restricted largely to publishing houses.
Free, self-standing font faces became available on the Internet in the early nineties, and this resulted in a major explosion in the number of people who can handle Tamil materials directly on their personal computers. Though exact numbers are not yet available, it is likely that today at least a quarter of a million people can handle Tamil materials in Tamil script on computers. This number is likely to double within another year.
The last decade also saw the phenomenal growth of the Internet and its establishment as one of the major modes of information interchange. Enormous amounts of material are being produced on computers running different operating systems and different software. Information is also being exchanged through different exchange protocols (SMTP, NNTP, POP, HTTP, ...). The facile flow of information across different computers and protocols makes some standardized font encoding scheme mandatory.
A font encoding scheme is an explicitly written convention for the handling of a language's script(s) by computer systems. An encoding scheme can be glyph-based, where one uses a series of graphic characters (glyphs) stored at specific locations of a font to generate the script. All fonts (font faces) then use this same standardized layout.

An encoding scheme can also be character-based, where one simply defines the basic elements required to compose the entire alphabet and leaves the details of rendering the script to software. Unfortunately, there has not been any single encoding scheme universally accepted for Tamil and used by everyone. This has resulted in a mushrooming of hundreds of Tamil font faces (7- and 8-bit, monolingual and bilingual) on the Internet and a near-impossible situation for the exchange of information between individuals. This paper proposes a glyph-based encoding, TSCII, as a possible candidate for fonts used in Tamil computing.
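To make the distinction concrete, here is a minimal sketch in Python; the code points and glyph names are hypothetical, chosen only to illustrate the two approaches, and are not taken from any actual standard.

    # Character-based: store abstract characters; a rendering engine
    # combines them into visible shapes at display time.
    char_based = ["TAMIL_LETTER_KA", "TAMIL_VOWEL_SIGN_I"]  # "ki" as two logical units

    # Glyph-based: store the final visible shapes directly; any font laid
    # out to the same convention displays the text with no rendering engine.
    GLYPH_TABLE = {0xB8: "ka", 0xA8: "ki"}   # hypothetical slot numbers
    glyph_based = bytes([0xA8])              # one byte selects one visible shape

The glyph-based file works with ordinary 8-bit software because the font alone carries the shapes, whereas the character-based file needs script-aware software to render correctly.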
TSCII Initiative

Two major developments were responsible for the present initiative. Almost three years ago, worldwide discussions started through two email discussion lists, "[email protected]" and "[email protected]". The participants, from different backgrounds (software developers, typographers, linguists, academic scholars, users), are all interested in seeing an encoding standard defined soon. The earlier conference in this series, TamilNet'97, held in Singapore (organized by the National University of Singapore), recognized the urgent need to define a standard and welcomed initiatives in this regard. The two mailing lists were soon merged into one. The present proposal, TSCII, is the outcome of the discussions of an Internet Working Group for the Development of Tools and Standards for Tamil Computing.
Existing Tamil Standards

During this decade, nearly all major representatives of the hardware and software industry joined together and launched a global encoding initiative called UNICODE. Unicode is an ambitious character-based encoding scheme to handle all world languages, with specific segments allocated for each language, including Tamil. While Unicode is nearly fully implemented for English and European languages, it is not yet ready for Indic languages.
Before Unicode, in the eighties, the Government of India proposed an Indian standard called ISCII to handle all Indian languages under a single character-based encoding scheme. CDAC, Pune developed software based on this ISCII standard, and it is used in state and federal government establishments within India. For Indic languages, Unicode has adopted the character-based encoding scheme of ISCII. In spite of this parentage of Unicode in ISCII (as far as Indic languages are concerned), the implementation of Unicode is quite different from that of ISCII. Neither the software nor the text files/databases of one scheme are directly applicable to the other.
Due to the difficulties in implementing ISCII in the widely used information exchange protocols of the Internet, CDAC recently proposed a "secondary layer standard", ISFOC. ISFOC is a glyph-based encoding standard with no direct inter-convertibility with the parent ISCII. The complexity of file transfers between these glyph- and character-based encoding standards takes away the cross-language advantages of the standard. So the situation is far from satisfactory. The complexity of these two schemes does not allow the usage of any of the thousands of shrink-wrap software packages that are available for the English language; the ISCII scheme can be implemented only through dedicated hardware and/or software. Fortunately, Tamil is a comparatively simple language, adequately handled by a glyph-based encoding scheme.
Biggest challenge today

The goal today is not to introduce one more font encoding to the arena. The biggest challenge facing us today is the unification of the hundreds of font encodings that are in use. Any standard font encoding scheme proposed today must allow facile migration of legacy documents, texts and fonts. Through its elaborate design goals, the TSCII initiative attempts to address all these pressing requirements. The initiative recognizes that soon (possibly within a decade) Unicode will firmly establish itself as the world standard for multilingual computing needs. Hence the initiative proposes a glyph-based encoding scheme as an "interim" option until Unicode is firmly established in all the computer operating systems and software in the market.
Design Goals of TSCII initiative
The following are some of the key Design Goals of the proposed TSCII initiative:
- the encoding SCHEME MUST BE IMPLEMENTABLE ON ALL COMMONLY USED COMPUTER PLATFORMS (Unix, Windows, Mac and others). New generations of more powerful computers and associated software are being released every couple of years.
- The encoding standard should be such that it can be used on all computers released during the last decade (backward compatibility).
- the encoding SCHEME MUST BE OPEN. There will be no need to get permission from anyone to implement the encoding standard in hardware and software, and there will be no copyright restrictions of any kind. Nearly all of the international standards for information interchange are OPEN standards.
- the encoding SCHEME WILL BE AN 8-BIT BILINGUAL ONE (ROMAN/TAMIL). The widely used lower ASCII set (Roman characters and punctuation marks) takes its standard location (slots 0-127); Tamil glyphs occupy the upper ASCII berth (slots 128-255). This is based on the recognition that information exchange via widely used methods such as email and the Web is best assured if key information (tags) on the nature/content of the file is indicated using the usual lower ASCII set.
- the encoding SCHEME PROPOSED WILL BE A GLYPH-BASED ONE WITH A UNIQUE COLLECTION OF GLYPHS to generate the entire alphabet. The encoding scheme should be such that there are no ambiguities in the interpretation of the resulting text (by search and sort engines, for example), and no redundancy or repetition of the old style of writing some alphabets (lai, Lai, Naa, naa and Raa). The encoding will allow text input as per the current practice of Tamils worldwide, without enforcing any language reforms. In view of the near absence of usage of Tamil numerals, their inclusion can be viewed as an exception. Two main reasons for this exception are:
- the need for the encoding scheme to handle very ancient manuscripts (Etext archives) where these numerals were in use;
- to allow Tamil to join the mainstream, where many of the classical languages have their own numerals; Unicode recognizes this key fact by providing slots for numerals in many language segments. The scheme will not include any idiosyncratic, personal, novel, rarely exchanged or private-use characters.
- the encoding SCHEME WILL BE UNIVERSAL IN SCOPE. While keeping grantha characters and numerals as part of the glyph choices, the encoding scheme is designed such that the basic glyphs are collected in the slots shared with the widely used Latin-1 (8859-x) standards. This will ensure that the "pure Tamil" message gets through even in the poorest local implementation scenarios. Supplementary characters such as granthas, numerals and rarely used alphabets (such as nju, ngu, njU, ngU) are to be placed in rows 8 and 9. By leaving a couple of encoding slot positions vacant, the scheme will allow software developers to use them as "escape routes" to bring in additional special characters (such as the old-style Lai/lai/Naa/naa/Raa) if necessary.
- the encoding SCHEME HAS TO PROVIDE COMPATIBILITY with the Unicode (and hence ISCII) scheme. The proposed inclusion of grantha characters and numerals is relevant in this context. Compatibility with Unicode will ensure easy migration of documents to and fro during and after the transition period.
- Efforts will be made to provide appropriate software that will ALLOW SMOOTH MIGRATION FROM THE HUNDREDS OF CURRENT FONT ENCODING SCHEMES. For conversion of legacy documents, "converters" will be provided that allow inter-conversion of files between widely used encoding schemes and the proposed TSCII; a minimal sketch of such a converter follows this list. The converters will also allow any user to quickly generate converter plug-ins for any custom encoding. For text/data input, several keyboard editors will be provided, allowing input as per widely used keyboard layouts and existing font encodings. This way anyone can continue to do input in nearly the same manner as before, but with encoding-conformant fonts. The transition to TSCII will thus be smooth and rapid.
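As an illustration of the converter idea, here is a minimal Python sketch. The legacy-to-TSCII table below is hypothetical; real converters are table-driven over the full glyph set and must also handle multi-glyph sequences and reordering.

    # Hypothetical one-byte-to-one-byte mapping from a legacy font encoding
    # to TSCII; a real table would cover the entire upper-ASCII glyph set.
    LEGACY_TO_TSCII = {
        0x8A: 0xB8,   # hypothetical legacy slot for a consonant -> TSCII slot
        0x8B: 0xB9,   # hypothetical legacy slot for another glyph
    }

    def convert_to_tscii(data: bytes) -> bytes:
        """Map legacy-encoded bytes to TSCII; slots 0-127 pass through as-is."""
        out = bytearray()
        for b in data:
            if b < 0x80:
                out.append(b)  # the Roman half is shared by 8-bit bilingual schemes
            else:
                out.append(LEGACY_TO_TSCII.get(b, ord("?")))  # unmapped glyph -> "?"
        return bytes(out)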
Details of TSCII encoding

The figure below presents, in the form of a compact table, the glyph choices and their slot allocations (code positions) of the proposed 8-bit bilingual encoding. < tscii.gif >
It can be noted that, in addition to the characters of the Tamil alphabet, the encoding scheme includes the following: single and double curly quotes and the copyright sign, at their code positions in the widely used ANSI/Latin-1 scheme. Having these curly quotes allows ready usage of the shrink-wrap software (for word processing, graphics, etc.) that is available for English and European languages. The copyright sign (at its ANSI slot #169) is increasingly used in many Internet-based documents, particularly in Web pages; its presence avoids unnecessary switching to other Roman font faces. Two slots (254, 255) have been left vacant as a "private use area" for software developers.
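A quick check in Python (illustrative only) shows why keeping the lower half at its ASCII/Latin-1 meaning matters: markup, email headers and the copyright sign survive even when the Tamil half of a file is misread.

    # The copyright sign sits at slot 169 in Latin-1/ANSI, the position TSCII keeps.
    assert bytes([169]).decode("latin-1") == "\u00a9"   # the copyright sign

    # Every byte below 128 means the same thing under ASCII, Latin-1 and TSCII,
    # so tags and headers written in the lower half are readable everywhere.
    lower_half = bytes(range(128))
    assert lower_half.decode("ascii") == lower_half.decode("latin-1")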
TSCII encoding and desktop publishing
Undoubtedly, the major use of Tamil fonts is in word-processing applications. Desktop publishing is becoming the widely used mode for printing, even in professional publishing houses. During our two-year-long discussions, it was repeatedly pointed out that the glyph choices should be such that the high-quality printing required by professional publishing houses is met adequately. Many of the Tamil alphabets are graphically complex forms.
It was pointed out that excessive use of kerning (as is the case with 7-bit fonts) renders delivery of high-quality glyphs rather difficult. Since the number of slots available in the upper-ASCII segment (#128-255) is far fewer than required to allocate one slot for each of the 240+ Tamil alphabets, choices have to be made on which of the Tamil alphabets are to be included in native form and which are to be generated using modifiers (several keystrokes in sequence).
The choice of glyphs will determine the quality of the output, and the slot allocations will determine trouble-free performance across different computer platforms. It may also be pointed out that the quality of any font face (the outline definition of the glyphs) will largely determine the quality of the output. Most of the freely distributed Tamil fonts on the Internet perform poorly in this respect. Whatever encoding scheme the Tamilnadu Government adopts as the standard, the world of Tamil computing will benefit enormously if the Government distributes, free of charge, at least a couple of "high quality" font faces through the Internet.
Two factors guide the selection of glyphs
One is the FREQUENCY OF OCCURRENCE OF THE ALPHABETS and the other is the STRUCTURAL COMPLEXITY OF THE TAMIL ALPHABET, so that the glyphs can be generated nicely in on-screen display and in print, with or without the aid of kerning and other basic font-handling techniques already available for over a decade on all computer platforms. It is not a good idea to go for an encoding scheme where 80% of the chosen glyphs account for less than 30% of the actual text. Assuming that the glyphs in the font face are of exceptionally good quality, the more alphabets in native form, the better the quality of the Tamil text. A good balance has to be made between the frequency of occurrence and the structural complexity of some of the alphabets.

Fortunately, several of the Tamil alphabets are written as a composite of two or three basic components (referred to as "modifiers"), e.g. the aakara, ekara, Ekara, okara, Okara, aikara and aukara varisai alphabets. Along with the basic consonants (meis), it suffices to have a select collection of modifiers (aakara, ekara, Ekara, aikara, aukara) in the encoding scheme to generate all these compound (uyirmei) characters; there is no need to have these as unique glyphs in the encoding scheme. Similar logic can be applied to the grantha series as well: it suffices to include the special ukara and uukara modifiers and use the Tamil modifier glyphs for the rest. After an in-depth analysis of various options, it was decided to invoke modifiers for the "ikara" and "iikara" varisai alphabets; the rest of the series are generated directly. A sketch of this composition follows.
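The following minimal Python sketch illustrates how a compound (uyirmei) character is stored as a base-consonant glyph plus a modifier glyph; all slot numbers here are hypothetical stand-ins, not the actual TSCII positions.

    # Hypothetical slot numbers for illustration only.
    MEI_K  = 0xB8   # base consonant glyph "k"
    MOD_AA = 0xA1   # "aa" modifier, drawn after the base glyph
    MOD_I  = 0xA4   # "i" modifier, kerned onto the right end of the base glyph

    def uyirmei(base: int, modifier: int) -> bytes:
        """Emit the two-glyph sequence the font renders as one compound character."""
        return bytes([base, modifier])

    kaa = uyirmei(MEI_K, MOD_AA)   # "kaa": native base glyph + trailing stroke
    ki  = uyirmei(MEI_K, MOD_I)    # "ki": one of the ~13% of cases needing kerning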
There have been several analyses of the frequency of occurrence of Tamil alphabets, and they were used earlier in the determination of the keyboard layout. With the choice of glyphs discussed above, we have NEARLY 87.07% OF THE TAMIL CHARACTERS ACCOMMODATED IN NATIVE FORM in the encoding scheme:

- meis with puLLis: 28.85%
- basic meis (akaram eRRiya meis): 23.50%
- ukara varisai: 11.88%
- entire uyirs: 7.00%
- aakara varisai (with stand-alone "aa" modifier): 6.39%
- aikara varisai (with stand-alone "ai" modifier): 4.41%
- eekara varisai (with stand-alone "ee" modifier): 1.88%
- ekara varisai (with stand-alone "e" modifier): 1.44%
- ti and tii: 1.06%
- uukara varisai: 0.62%
- aukara varisai (with e, au modifiers): 0.04%

It means that nearly 87% of the Tamil characters are rendered as native ones without any kerning. Their quality will depend purely on the quality of the font face design.
Even among the ca. 13% generated via kerning (mainly the ikara and iikara varisai), the majority can be generated quite satisfactorily using kerning procedures. Kerning is a routine font-handling technique now available on all of the common computer platforms/OS. As a right-end modifier, the ikara and iikara varisai uyirmeis can be rendered fairly precisely on all platforms.
So it is likely that, using the proposed glyph encoding scheme, over 98% of the Tamil characters can be rendered easily on screen and in print without any loss of quality. Techniques such as pair-wise kerning can handle even the residual cases adequately. Professional publishing houses with more stringent requirements on glyph display invariably use more sophisticated printing equipment and high-end computer systems. Advanced font-handling techniques such as glyph substitution (GSUB) through (or without) OpenType fonts are already implemented at the OS level. Hence it should not be a problem in these cases to use dedicated software where single forms of these alphabets are stored elsewhere and brought in wherever they are needed.
TSCII encoding and information exchange
A second major area of application of the font encoding is information exchange through email and the Web. We will discuss each of these in turn. With the emergence of 16-bit Unicode as the encoding for multilingualism, nearly all of the widely used computer operating systems can now correctly handle information at this 16-bit level. Computers released in this decade (the target coverage of the proposed TSCII) can all handle 8-bit encoded messages.
EMAIL: Nearly all email software (including that used in shell-account access, such as PINE) is 8-bit compliant. A routinely used mechanism is MIME (with its quoted-printable and base64 transfer encodings). MIME was designed to allow email exchanges in all of the world's languages without worrying much about the details of the font encoding used. MIME simply sends the code positions of the characters (A0, B2, EA, etc.), and the user's client software renders the information using the local choice of font face and associated font encoding.
Using MIME it is possible to exchange information across different computer platforms very reliably. During the testing of the TSCII encoding, we successfully used most of the commonly used email software on all three computer platforms. We already have a handful of email discussion lists where TSCII-based Tamil exchanges take place routinely, and the participants use many different software packages running on Mac, Unix or Windows computers.
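As a small illustration of the quoted-printable mechanism in Python (the three byte values below are arbitrary stand-ins for TSCII slots, not actual assignments):

    import quopri

    # Raw 8-bit TSCII bytes as they would sit in a message body.
    tscii_bytes = bytes([0xA0, 0xB2, 0xEA])

    # Quoted-printable turns each 8-bit byte into "=XX", pure 7-bit ASCII,
    # so the message survives 7-bit mail gateways: b'=A0=B2=EA'.
    wire_form = quopri.encodestring(tscii_bytes)

    # The receiving client restores the original byte values and renders them
    # with whatever TSCII-conformant font face the user has installed.
    assert quopri.decodestring(wire_form) == tscii_bytes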
WEB: Current versions of both of the dominant Web browsers, Internet Explorer and Netscape, are Unicode-intelligent. Using the "user-defined" setting for the font encoding, we have shown that it is possible to present formatted Tamil texts (TSCII-based) in the form of Web pages. Users can read the Tamil materials locally using their preferred Web browser and a font face of their personal choice. We have also successfully demonstrated the exchange of formatted Tamil text materials via the Portable Document Format (PDF) on all three Mac/Windows/Unix platforms. The PDF format is increasingly becoming the preferred mode of distribution of formatted materials (e.g. catalogs and annual reports of business establishments).
TSCII encoding and data base applications
Another major area of concern is database applications. Any encoding scheme should allow facile searching and sorting of stored Tamil information. The information searched could be a Tamil text viewed in a word-processing application or a large database in a business or governmental organization. Database handling can be considered in three stages: storing, sorting and searching. The database could store the data directly as plain 8-bit text as per the TSCII encoding. Sorting of the 8-bit TSCII data can be done through an intermediate layer where glyphs are substituted by de-coupled characters, after which any of the standard sorting algorithms can be used; a demo software sorting "varisai" characters this way has been shown.
With the strong likelihood of Unicode taking its place as the international encoding standard, an alternate possibility would be to store the data in Unicode. An on-the-fly converter associated with the application can convert TSCII data -> Unicode for saving in the database, and render the data back to the application via another Unicode -> TSCII conversion. The intermediate layer need be visible only to the application developers. Short of an encoding scheme that lists the entire 240+ alphabets in the required sorting sequence, the usage of an intermediate layer in any glyph-based scheme is inevitable.
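Here is a minimal Python sketch of the intermediate-layer idea, assuming a hypothetical glyph-to-character table; a real implementation must cover the full TSCII glyph set and undo the visual reordering of the i/ii modifiers.

    # Hypothetical table: each TSCII glyph byte maps to a tuple of logical
    # character ranks in Tamil alphabetical order.
    TSCII_TO_LOGICAL = {
        0xB8: (20,),     # hypothetical base consonant -> one logical character
        0xA1: (20, 1),   # hypothetical compound glyph -> consonant + vowel
    }

    def sort_key(data: bytes) -> tuple:
        """De-couple glyphs into logical characters and sort on those ranks."""
        key = []
        for b in data:
            key.extend(TSCII_TO_LOGICAL.get(b, (b,)))  # unmapped bytes sort by value
        return tuple(key)

    records = [bytes([0xA1]), bytes([0xB8])]
    records.sort(key=sort_key)   # now in Tamil alphabetical order, not byte order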
Unicode has already released some standardized sorting options as technical reports, and software developers are already working on software based on this option. Double-byte sorting has also been proposed as an option. Clearly there will be more than one way to do the searching and sorting.
Concluding Remarks

The proposed glyph-based encoding is the outcome of nearly three years of discussions in a public forum, accompanied by extensive field testing by a group of Internet-linked volunteers. It has been shown to be a very viable "interim" option for Tamil computing, and it has the support of a broad spectrum of the Internet Tamil community. In the short span of two months since the present encoding scheme was adopted in its final form by the Internet Working Group, many commercial Tamil software developers have produced several TSCII-based tools and software packages and agreed to distribute them FREE: Tamil font faces and keyboard editors for use on Windows and Mac platforms, text converters to go between TSCII and popular Tamil font encodings and vice versa, email software that allows exchanges directly in Tamil, on-the-fly converters of Web pages, etc.
A dedicated Web site for TSCII

A dedicated Web site, http://www.tamil.net/tscii, has been set up to provide all the necessary technical assistance for quick implementation of the standard and to serve as "the site" where anyone can download the above types of TSCII-based tools. Hence we strongly believe that the proposed standard is a very viable one, guaranteed to deliver what it promises. We sincerely hope that the Tamilnadu Government will give a fair hearing to this proposal and possibly adopt it for Tamil computing as soon as possible.