Automatic Encoding Detection And Unicode Conversion Engine Computer Science Essay

In computers, characters are represented using numbers. Initially the encoding schemes were designed to support the English alphabet, which has a limited number of symbols. Later the requirement for a worldwide character encoding scheme to support multi lingual computing was identified. The solution was to come up with a 16 encoding scheme to represent a character so that it can support up to large character set. The current Unicode version contains 107,000 characters covering 90 scripts. In the current context operating systems such as Windows 7, UNIX based operating systems applications such as word processors and data exchange technologies do support this standard enabling internationalization in the IT industry. Even though this standard has been the de facto standard, still there can be seen certain applications using proprietary encoding schemes to represent the data. As an example, famous Sinhala news sites still do not adapt Unicode standard based fonts to represent the content. This causes issues such as the requirement of downloading proprietary fonts, browser dependencies making the efforts of Unicode standard in vain. In addition to the web site content itself there are collections of information included in documents such as PDFs in non Unicode fonts making it difficult to search through search engines unless the search term is entered in that particular font encoding.

This has given the requirement of automatically detecting the encoding and transforming into the Unicode encoding in the corresponding language, so that it avoids the problems mentioned. In case of web sites, a browser plug-in implementation to support the automatic non-Unicode to Unicode conversion would eliminate the requirement of downloading legacy fonts, which uses proprietary character encodings. Although some web sites provide the source font information, there are certain web applications, which do not give this information, making the auto detection process more difficult. Hence it is required to detect the encoding first, before it has been fed to the transformation process. This has given the rise to a research area of auto detecting the language encoding for a given text based on language characteristics.

This problem will be addressed based on a statistical language encoding detection mechanism. The technique would be demonstrated with the support for all the Sinhala Non Unicode encodings. The implementation for the demonstration will make sure that it is an extendible solution for other languages making it support for any given language based on a future requirement.

Since the beginning of the computer age, many encoding schemes have been created to represent various writing scripts/characters for computerized data. With the advent of globalization and the development of the Internet, information exchanges crossing both language and regional boundaries are becoming ever more important. However, the existence of multiple coding schemes presents a significant barrier. The Unicode has provided a universal coding scheme, but it has not so far replaced existing regional coding schemes for a variety of reasons. Thus, today’s global software applications are required to handle multiple encodings in addition to supporting Unicode.

In computers, characters are encoded as numbers. A typeface is the scheme of letterforms and the font is the computer file or program which physically embodies the typeface. Legacy fonts use different encoding systems for assigning the numbers for characters. This leads to the fact that two legacy font encodings defining different numbers for the same character. This may lead to conflicts with how the characters are encoded in different systems and will require maintaining multiple encoding fonts. The requirement of having a standard to unique character identification was satisfied with the introduction of Unicode. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.


Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world’s writing systems. The latest Unicode has more than 107,000 characters covering 90 scripts, which consists of a set of code charts. The Unicode Consortium co-ordinates Unicode’s development and the goal is to eventually replace existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes. This standard is being supported in many recent technologies including Programming Languages and modern operating systems. All W3C recommendations have used Unicode as their document character set since HTML 4.0. Web browsers have supported Unicode, especially UTF-8, for many years [4], [5].

Sinhala Legacy Font Conversion Requirement for Web Content

Sinhala language usage in computer technology has been present since 1980s but the lack of standards in character representation system resulted in proprietary fonts. Sinhala was added to Unicode in 1998 with the intention of overcoming the limitations in proprietary character encodings. Dinamina, DinaminaUniWeb, Iskoola Pota, KandyUnicode, KaputaUnicode, Malithi Web, Potha are some Sinhala Unicode fonts which were developed so that the numbers assigned with the characters are the same. Still some major news sites which display Sinhala character contents have not adapted the Unicode standards. The Legacy Fonts encoding schemes are used instead causing the conflicts in content representation. In order to minimize the problems, font families were created where the shape of characters only differs but the encoding remains the same. FM Font Family, DL Font Family are some examples where a font family concept is used as a grouping of Sinhala fonts with similar encodings [1], [2].

Adaptation of non Unicode encodings causes a lot of compatibility issues when viewed in different browsers and operating systems. Operating systems such as Windows Vista, Windows7 come with Sinhala Unicode support and do not require external fonts to be installed to read Sinhalese script. Variations of GNU/Linux distributions such as Dabian or Ubuntu also provide Sinhala Unicode support. Enabling non Unicode applications especially web contents with the support for Unicode fonts will allow the users to view contents without installing the legacy fonts.

Non Unicode PDF Documents

In addition to the contents in the web, there exists a whole lot of government documents which are in PDF format but their contents are encoded with legacy fonts. Those documents would not be searchable through search engines by entering the search terms in Unicode. In order to overcome the problem it is important to convert such documents in to a Unicode font so that they are searchable and its data can be used by other applications consistently, irrespective of the font. As another part of the project this problem would be addressed through a converter tool, which creates the Unicode version of existing PDF document which are currently in legacy font.

The Problem

Sections 1.3, 1.4 describe two domains in which the Non Unicode to Unicode conversion is required. The conversion involves identification of non-Unicode contents and replacing it with the corresponding Unicode contents. The content replacement requires a Mapping engine, which would do the proper segmentation of the input text and map it with the corresponding Unicode code. The mapping engine can perform the mapping task only if it knows what is the source text encoding. In general, the encoding is specified along with the content so that the mapping engine could feed it directly. However, in certain cases the encoding is not specified along with the content. Hence detecting the encoding through an encoding the detection engine provides a research area, especially with the non-Unicode content. In addition to that, incorporating the detection engine along with a conversion engine would be another part of the problem, to solve the application areas in 1.3, 1.4.

Project Scope

The system will be initially targeted for Sinhala fonts used by local sites. Later the same mechanism will be extended to support other languages and scripts (Tamil, Devanagaree).

Deliverables and outcomes

Web Service/Plug-in to Local Language web site Font Conversion which automatically converts website contents from legacy fonts to Unicode.

PDF document conversion tool to convert legacy fonts to Unicode

In both implementations, the language encoding detection would use the proposed encoding detection mechanism. It can be considered as the core for the implementations in addition to the translation engine which performs the Non Unicode to Unicode mapping.

Literature Review

Character Encodings

Character Encoding Schemes

Encoding refers to the process of representing information in some form. Human language is an encoding system by which information is represented in terms of sequences of lexical units, and those in terms of sound or gesture sequences. Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up some writing system.

A character encoding is an algorithm for presenting characters in digital form as sequences of octets. There are hundreds of encodings, and many of them have different names. There is a standardized procedure for registering an encoding. A primary name is assigned to an encoding, and possibly some alias names. For example, ASCII, US-ASCII, ANSI_X3.4-1986, and ISO646-US are different names for an encoding. There are also many unregistered encodings and names that are used widely. The character encoding names are not case sensitive and hence “ASCII” and “Ascii” are equivalent [25].

Figure 2.1 Character encoding Example

Single Octet Encodings

When character repertoire that contains at most 256 characters, assigning a number in the range 0255 to each character and use an octet with that value to represent that character is the most simplest and obvious way. Such encodings, called single-octet or 8-bit encodings, are widely used and will remain important [22].

Multi-Octet Encodings

In multi octet encodings more than one octet is used to represent a single character. A simple two-octet encoding is sufficient for a character repertoire that contains at most 65,536 characters. Two octet schemes are uneconomical if the text mostly consists of characters that could be presented in a single-octet encoding. On the other hand, the objective of supporting Universal character set is not achievable with just 65,536 unique codes. Thus, encodings that use a variable number of octets per character are more common. The most widely used among such encodings is UTF-8 (UTF stands for Unicode Transformation Format), which uses one to four octets per character.

Principles of Unicode Standard

Unicode has used as the universal encoding standard to encode characters in all living languages. To the end, is follows a set of fundamental principles. The Unicode standard is simple and consistent. It does not depend on states or modes for encoding special characters.

The Unicode standard incorporates the character sets of many existing standards: For example, it includes Latin-I, character set as its first 256 characters. It includes repertoire of characters from numerous other corporate, national and international standards as well.

In modern businesses needs handle characters from a wide variety of languages at the same time. With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. The data corruption problems do not occur since Unicode has a single definition for each character. Since it handles the characters for all the world markets in a uniform way, it avoids the complexities of different character code architectures. All of the modern operating systems, from PCs to mainframes, support Unicode now, or are actively developing support for it. The same is true of databases, as well.There are 10 design principles associated with Unicode.


The Unicode is designed to be Universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange. Unicode needs to encompass a variety of essentially different collections of characters and writing systems. For example, it cannot postulate that all text is written left to right, or that all letters have uppercase and lowercase forms, or that text can be divided into words separated by spaces or other whitespace.


Software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text. But with Unicode efficiency there exist certain tradeoffs made specially with the storage requirements needing four octets for each character. Certain representation forms such as UTF-8 format requiring linear processing of the data stream in order to identify characters. Unicode contains a large amount of characters and features that have been included only for compatibility with other standards. This may require preprocessing that deals with compatibility characters and with different Unicode representations of the same character (e.g., letter é as a single character or as two characters).

Characters, not glyphs

Unicode assigns code points to characters as abstractions, not to visual appearances. A character in Unicode represents an abstract concept rather than the manifestation as a particular form or glyph. As shown in Figure 2.2, the glyphs of many fonts that render the Latin character A all correspond to the same abstract character “a.”

Figure 2.2: Abstract Latin Letter “a” and Style Variants

Another example is the Arabic presentation form. An Arabic character may be written in up to four different shapes. Figure 2.3 shows an Arabic character written in its isolated form, and at the beginning, in the middle, and at the end of a word. According to the design principle of encoding abstract characters, these presentation variants are all represented by one Unicode character.

Figure 2.3: Arabic character with four representations

The relationship between characters and glyphs is rather simple for languages like English: mostly each character is presented by one glyph, taken from a font that has been chosen. For other languages, the relationship can be much more complex routinely combining several characters into one glyph.


Characters have well-defined meanings. When the Unicode standard refers to semantics, it often means the properties of characters, such spacing, combinability, and directionality, rather than what the character really means.

Plain text

Unicode deals with plain texti.e., strings of characters without formatting or structuring information (except for things like line breaks).

Logical order

The default representation of Unicode data uses logical order of data, as opposed to approaches that handle writing direction by changing the order of characters.


The principle of uniqueness was also applied to decide that certain characters should not be encoded separately. Unicode encodes duplicates of a character as a single code point, if they belong to the same script but different languages. For example, the letter ü denoting a particular vowel in German is treated as the same as the letter ü in Spanish.

The Unicode standard uses Han unification to consolidate Chinese, Korean, and Japanese ideographs. Han unification is the process of assigning the same code point to characters historically perceived as being the same character but represented as unique in more than one East Asian ideographic character standard. These results in a group of ideographs shared by several cultures and significantly reduces the number of code points needed to encode them. The Unicode Consortium chose to represent shared ideographs only once because the goal of the Unicode standard was to encode characters independent of the languages that use them. Unicode makes no distinctions based on pronunciation or meaning; higher-level operating systems and applications must take that responsibility. Through Han unification, Unicode assigned about 21,000 code points to ideographic characters instead of the 120,000 that would be required if the Asian languages were treated separately. It is true that the same character might look slightly different in Chinese than in Japanese, but that difference in appearance is a font issue, not a “uniqueness” issue.

Figure 2.4: Han Unification example

The Unicode standard allows for character composition in creating marked characters. It encodes each character and diacritic or vowel mark separately, and allows the characters to be combined to create a marked character. It provides single codes for marked characters when necessary to comply with preexisting character standard.

Dynamic composition

Characters with diacritic marks can be composed dynamically, using characters designated as combining marks.

Equivalent sequences

Unicode has a large number of characters that are precomposed forms, such as é. They have decompositions that are declared as equivalent to the precomposed form. An application may still treat the precomposed form and the decomposition differently, since as strings of encoded characters, they are distinct.


Character data can be accurately converted between Unicode and other character standards and specifications.

South Asian Scripts

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign [17].

Another designation is preferred in some languages. As an example in Hindi, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature.

The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati languages, though it was also the source for scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha.

The South Indian scripts are also derived from Brahmi and, therefore, share many structural characteristics. These scripts were first used to write Pali and Sanskrit but were later adapted for use in writing non-Indo-European languages including Dravidian family of southern India and Sri Lanka.

Sinhala Language

Characteristics of Sinhala

The Sinhala script, also known as Sinhalese, is used to write the Sinhala language, by the majority language of Sri Lanka. It is also used to write the Pali and Sanskrit languages. The script is a descendant of Brahmi and resembles the scripts of South India in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently [23].

Figure 2.5: Example for prenasalized stop in Sinhala

In addition, Sinhala has separate distinct signs for both a short and a long low front vowel sounding similar to the initial vowel of the English word “apple,” usually represented in IPA as U+00E6 æ latin small letter ae (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88.

Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern established for the other Indic scripts (for example, Devanagari). It does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA sinhala sign al-lakuna) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures.

Sinhala Writing System

The Sinhala writing system can be called an abugida, as each consonant has an inherent vowel (/a/), which can be changed with the different vowel signs. Thus, for example, the basic form of the letter k is ක “ka”. For “ki”, a small arch is placed over the ක: කි. This replaces the inherent /a/ by /i/. It is also possible to have no vowel following a consonant. In order to produce such a pure consonant, a special marker, the hal kirÄ«ma has to be added: ක් . This marker suppresses the inherent vowel.

Figure 2.6: Character associative Symbols in Sinhala

Historical Symbols. Neither U+0DF4 sinhala punctuation kunddaliya nor the Sinhala numerals are in general use today, having been replaced by Western-style punctuation and Western digits. The kunddaliya was formerly used as a full stop or period. It is included for scholarly use. The Sinhala numerals are not presently encoded.

Sinhala and Unicode

In 1997, Sri Lanka submitted a proposal for the Sinhala character code at the Unicode working group meeting in Crete, Greece. This proposal competed with proposals from UK, Ireland and the USA. The Sri Lankan draft was finally accepted with slight modifications. This was ratified at the 1998 meeting of the working group held at Seattle, USA and the Sinhala Code Chart was included in Unicode Version 3.0 [2].

It has been suggested by the Unicode consortium that ZWJ and ZWNJ should be introduced in Orthographic languages like Sinhala to achieve the following:

1. ZWJ joins two or more consonants to form a single unit (conjunct consonants).

2. ZWJ can also alter shape of preceding consonants (cursiveness of the consonant).

3. ZWNJ can be used to disjoin a single ligature into two or more units.

Encoding auto Detection

Browser and auto-detection

In designing auto detection algorithms to auto detect encodings in web pages it needs to depend on the following assumptions on input data [24].

Input text is composed of words/sentences readable to readers of a particular language.

Input text is from typical web pages on the Internet which is not an ancient dead language.

The input text may contain extraneous noises which have no relation to its encoding, e.g. HTML tags, non-native words (e.g. English words in Chinese documents), space and other format/control characters.

Methods of auto detection

The paper[24] discusses about 3 different methods for detecting the encoding of text data.

Coding Scheme Method

In any of the multi-byte encoding coding schemes, not all possible code points are used. If an illegal byte or byte sequence (i.e. unused code point) is encountered when verifying a certain encoding, it is possible to immediately conclude that this is not the right guess. Efficient algorithm to detecting character set using coding scheme through a parallel state machine is discussed in the paper [24].

For each coding scheme, a state machine is implemented to verify a byte sequence for this particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. In a typical example, one state machine will eventually provide a positive answer and all others will provide a negative answer.

Character Distribution Method

In any given language, some characters are used more often than other characters. This fact can be used to devise a data model for each language script. This is particularly useful for languages with a large number of characters such as Chinese, Japanese and Korean. The tests were carried out with the data for simplified Chinese encoded in GB2312, traditional Chinese encoded in Big, Japanese and Korean. It was observed that a rather small set of coding points covers a significant percentage of characters used.

Parameter called Distribution Ration was defined and used for the purpose separating the two encodings.

Distribution Ratio = the Number of occurrences of the 512 most frequently used characters divided by the Number of occurrences of the rest of the characters.

. Two-Char Sequence Distribution Method

In languages that only use a small number of characters, we need to go further than counting the occurrences of each single character. Combination of characters reveals more language-characteristic information. 2-Char Sequence as 2 characters appearing immediately one after another in input text, and the order is significant in this case. Just as not all characters are used equally frequently in a language, 2-Char Sequence distribution also turns out to be extremely language/encoding dependent.

Current Approaches to Solve Encoding Problems

Siyabas Script

The SiyabasScript is as an attempt to develop a browser plugin, which solves the problem using legacy font in Sinhala news sites [6]. It is an extension to Mozilla Firefox and Google Chrome web browsers. This solution was specifically designed for a limited number of target web sites, which were having the specific fonts. The solution had the limitation of having to reengineer the plug-in, if a new version of the browser is released. The solution was not global since that id did not have the ability to support a new site which is using a Sinhala legacy font. In order to overcome that, the proposed solution will identify the font and encodings based on the content but not on site. There is a chance that the solution might not work if the site decided to adapt another legacy font, as it cannot detect the encoding scheme changes. There is a significant delay in the conversion process. The user would notice the display of the content with characters which are in legacy font before they get converted to the Unicode. This performance delay can be also identified as an area to improve in the solution. The conversion process does not provide the exact conversion specially when the characters need to be combined in Unicode. පශ= වෛද්‍ය ෙජ්.ඒ.පී. විෙජ්කුමාර මහතා, ගෘෘප් කපිතාන් ඇන්ඩෘෘ විෙජ්සුරිය, කි්‍රකà¶à·Š , ඕස්à¶à·šà·Šâ€à¶»à¶½à·’යානු , අනුශ+රයන් වන ශී්‍ර ,කී්‍රඩාංගණයේදී ,ද`ග පන්දු යවන්නන් can be mentioned as the examples of words of such conversion issues.

The plug-in supports the Sinhala Unicode conversion for the sites, and But the other websites mentioned in the paper does not get properly converted to Sinhala with Firefox version 3.5.17.

Aksharamukha Asian Script Converter

Aksharamukha is a South: South-East-Asian script convertor tool. It supports transliteration between Brahmi derived Asian scripts. It also has the functionality to transliterate web pages from Indic Scripts to other scripts. The Convertor scrapes the HTML page, then transliterates the Indic Scripts and displays the HTML. There are certain issues in the tool when it comes to alignment with the original web page. Misalignments and missing images, unconverted hyperlinks are some of them.

Figure 2.7: Aksharamukha Asian Script Converter

Corpus-based Sinhala Lexicon

The Lexicon of a language is its vocabulary including higher order constructs such as words and expressions. In order to detect the encoding of a given text this can be used as a supporting tool. Corpus based Sinhala lexicon has nearly 35000 entries based on a corpus consisting of 10 million words from diverse genres such as technical writing, creative writing and news reportage [7], [9]. The text distribution across genres is given in table 1.

Table 2.1: Distribution of Words across Genres [7]


Number of words

Percentage of words

Creative Writing



Technical Writing



News Reportage



N-gram-based language, script, and encoding scheme-detection

N-Gram refers to N character sequences and is used as a well-established technique used in classifying language of text documents. The method detects language, script, and encoding schemes using a target text document encoded by computer by checking how many byte sequences of the target match the byte sequences that can appear in the texts belonging to a language, script, and encoding scheme. N-grams are extracted from a string, or a document, by a sliding window that shifts one character at a time.

Sinhala Enabled Mobile Browser for J2ME Phones

Mobile phone usage is rapidly increasing throughout the world as well as in Sri Lanka. It has become the most ubiquitous communication device. Accessing internet through the mobile phone has become a common activity of people especially for messaging and news items. In J2ME enabled phones Sinhala Unicode support yet to be developed. They do not allow installation of fonts outside. Hence those devices will not be able to display Unicode contents, especially on the web, until Unicode is supported by the platform. Integrating the Unicode viewing support will provide a good opportunity to carry the technology to remote areas if it can be presented in the native language. If this is facilitated, in addition to the urban crowd, people from rural areas will be able to subscribe to a daily newspaper with their mobile. One major advantage of such an application is that it will provide a phone model independent solution which supports any Java enabled phone.

Cillion is a Mini browser software which shows Unicode contents in J2ME phones. This software is an application developed with the fonts integrated wh

Place your order
(550 words)

Approximate price: $22

Calculate the price of your order

550 words
We'll send you the first draft for approval by September 11, 2018 at 10:52 AM
Total price:
The price is based on these factors:
Academic level
Number of pages
Basic features
  • Free title page and bibliography
  • Unlimited revisions
  • Plagiarism-free guarantee
  • Money-back guarantee
  • 24/7 support
On-demand options
  • Writer’s samples
  • Part-by-part delivery
  • Overnight delivery
  • Copies of used sources
  • Expert Proofreading
Paper format
  • 275 words per page
  • 12 pt Arial/Times New Roman
  • Double line spacing
  • Any citation style (APA, MLA, Chicago/Turabian, Harvard)

Our Guarantees

Money-back Guarantee

You have to be 100% sure of the quality of your product to give a money-back guarantee. This describes us perfectly. Make sure that this guarantee is totally transparent.

Read more

Zero-plagiarism Guarantee

Each paper is composed from scratch, according to your instructions. It is then checked by our plagiarism-detection software. There is no gap where plagiarism could squeeze in.

Read more

Free-revision Policy

Thanks to our free revisions, there is no way for you to be unsatisfied. We will work on your paper until you are completely happy with the result.

Read more

Privacy Policy

Your email is safe, as we store it according to international data protection rules. Your bank details are secure, as we use only reliable payment systems.

Read more

Fair-cooperation Guarantee

By sending us your money, you buy the service we provide. Check out our terms and conditions if you prefer business talks to be laid out in official language.

Read more