An Analysis of Asian Language Web Pages

This paper gives an overview and an evaluation of Web pages of Asian languages on the Web, in particular of those languages that have not been focused on so far. The authors have collected over 100 million Asian Web pages downloaded from 42 Asian country domains, identified the languages based on Ngram statistics and analyzed their language properties. Primarily the number of pages written in each language measures the presence of a language. The survey reveals that the digital language divide exists at a serious level in the region. The state of multilingualism and the dominating presence of cross-border languages, English in particular, are analyzed. The paper sheds light on script and encoding issues of Asian language texts on the Web. In order to promote language resource collection and sharing, authors have a vision of creating an observation-collection instrument for Asian language resources on the Web. The results of the survey show the feasibility of this vision, and provide us with a better idea of the steps needed to realize that vision.


INTRODUCTION
Since the early days of Web development, various attempts have been made to grasp the language distribution of the Web.An estimate of language distribution in terms of Internet users' languages has been regularly reported by a marketing research group [1].Estimates of the distribution of the Web documents are compiled by various groups, each with a different scope and focus.The work of Alis Technologies and the Internet Society [2] is among the earliest.Network and Development Foundation (FUNREDES) compiles a regular report focused on the Romance language group [3], and Online Computer Library Center's (OCLC) Web Characterization Project [4] covers large number of European languages.Most of these surveys have An Analysis of Asian Language Web Pages S. T. Nandasara 1* , Shigeaki Kodama 2 , Chew Yew Choong 3 , Rizza Caminero 4 , Ahmed Tarcan 5 , Hammam Riza 6 , Robin Lee Nagano 7 , Yoshiki Mikami 8   1 University of Colombo School of Computing, Colombo, Sri Lanka 2, 3, 4, 8 Nagaoka University of Technology, Nagaoka, Niigata, Japan 5 Dicle University, Diyarbakir, 21280, Turkey 6 IPTEKnet, BPPT, Indonesia 7 Miskolc University, Miskolc, Hungary stn@ucsc.cmb.ac.lk, kodamas@kjs.nagaokaut.ac.jp, yewchoong@hotmail.com,rhyze1018@yahoo.com,tarcan@dicle.edu.tr,hammam@iptek.net.id,nagano.robin@chello.hu,mikami@kjs.nagaokaut.ac.jpRevised: 24 September 2008; Accepted: 24 July 2008 evolved along with the multilingual search engines like Inktomi, Yahoo, Google, Alltheweb, etc.The languagespecific search capability of the search engines has provided a means of surveying for researchers.Although these surveys have given us fairly good pictures about European language presence on the Web, far less attention has been paid to Asian languages, among them "less computerized languages" in particular.
This ignorance may arise partly from the technical difficulties of language identification of Asian languages and partly from "commercial value" of Asian languages that has been low.With the exceptions of Chinese, Japanese, Korean, Thai, Malay, Turkish, Arabic, and Hebrew, nothing is known about the extent of the presence of Asian languages on the Web.We felt a strong need to implement an independent survey instrument to observe the activity level of those languages.The UNESCO report, presented to the Tunis phase of the World Summit on the Information Society, "Measuring Language Diversity on the Internet" [5], shares exactly the same concerns as we do.
In response to this, the Language Observatory (LO) project was launched in 2003 under the sponsorship of the Japan Science and Technology Agency (JST) and has been implemented in collaboration with several international partners who have common interests with us [6].After a few years of development work, the LO team has trained our own language identification engine to cover more than three hundred languages of the world, and has acquired the capability to collect terabyte size Web documents from the Internet.The paper is based on the preliminary survey results of this project.
In addition, we have begun a sister project, the Asian Language Resource Network project by the sponsorship of Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) from 2005.We find a synergy between those two projects: the observation instrument for Asian languages can work as a language resource collection instrument as well.We have a vision of integrating the two projects as an observation-collection instrument for Asian language resources on the Web.

OBJECTIVES
The objectives of this paper are firstly to give an overview for Asian languages on the Web, in particular for those languages that have been ignored up to now.Through this study, we have tried to spotlight the presence of Asian languages.Here the presence of a language is measured primarily by the number of pages written in each language and is supplemented by additional indicators like pages-per-capita to give an indication of the relative intensity of Web authorship.In terms of language coverage, we discovered 55 Asian languages.Chinese, Japanese and Korean are excluded from the analysis because the presence of these languages can be relatively easily measured by using existing commercial search engines.
Secondly, the paper tries to describe the state of multilingualism in Asian country domains.The state of multilingualism can be defined at various levels, from a personal or document level to a societal level.In this study, we show a multiple language presence in each country domain.To give an overview of cross-border languages is a part of these efforts.
Thirdly, the paper tries to shed light on script and encoding issues of Asian languages.The paper tries to answer questions like; to what extent is UCS/ Unicode employed for Asian languages?What scripts are actually used to represent a specific language?To what extent are locally developed encodings used?Most European languages are written in only one script, Latin, Cyrillic or Greek.Some Asian languages, however, are written in a variety of scripts.This is most notable in the central Asian region, where the same language can be presented in Cyrillic, Arabic and Latin.In addition, each script is presented in various encoding schemes.While UCS/Unicode is expected to play a pivotal role in the promotion of multilingual document processing on the Web, its actual implementation on the Web seems still very much limited.Instead, various local legacy encoding schemes are employed.In order to promote language resource collection and development, due attention should be paid to script and encoding variety issues, particularly in this case, where it leads to a chaotic situation of encoding as observed in Asian language documents on the Web.
And lastly, the section three discusses the data the section three discusses the data collection using UbiCrawler, language identification process including creating training data sets and analytical methodologies.The Asian language presence on the Web is discussed in "Asian Language Presence on the Web" section.The state of multilingualism and the presence of cross-border languages are discussed in "Multilingualism in the Asian Web" section and script and encoding issues are discussed in " Script and Encoding Issues" section .Finally discussions on future research areas and conclusion are given in "Descussion" and "Conclusion" sections respectively.

Web Pages Collected
We use a Web crawler that works by downloading Web pages from the Internet.While downloading, it traces links within pages and recursively crawls to gather those newly discovered pages.The collection of downloaded Web pages is then passed to the language identification engine and the language properties of the pages are identified.[7].
The Further, Web sites and their contents change Web sites and their contents change over time.Most search engines have accumulated their databases to have longer (in time) coverage.This means that in the database, there might be many obsolete Web sites and pages.Because the pages downloaded during one short period of time in our study accurately reflect the "current" status of Web sites.
Lastly, while search engines generally cache all types of files, we only crawl for html and text files, both static and dynamic.Although there are many documents available in PDF format, we excluded PDF files because of technical difficulties in handling PDF for Language Identification Module (LIM).

Language Identification Process
The Language Identification Module (LIM) developed for the Language Observatory Project (LOP) [10] can simultaneously detect the triplet of Language, Script and Encoding (LSE) scheme (LSE is used below for this triplet) for each document.The identification is based on the n-gram statistics of documents.A natural language model, which assumes that the probability of the next word depends on the previous few words is generally known as an N-gram model, and a series of N characters (or N bytes) can be referred to as an "N -gram" as well.The advantages of the n-gram approach are that it does not require a special dictionary or word frequency list for each language, and it can detect encoding scheme.
LIM consists of two components.First, the training component accumulates sets of shift-codons from the training data.The term "shift-codon" is derived from the genetic term "codon", a sequence of three nucleotides.Shift-codons are, as the naming implies, three byte strings extracted from the first position, the second position, (n-2)-th position of a training data (n is the length of the training data).The set of shiftcodons thus created are stored with the LSE tags into the reference database.The source of training data is translations of the Universal Declaration of Human Rights (UDHR) provided by the United Nation's Office of Higher Commissioner for Human Rights.
The second component, the identification component, produces shift-codons of the target data and then compares them with all sets of shift-codons stored in the reference database.After comparison, the component calculates the matching ratio of the shift-codons of the target text to those of the training text (the number of matched codons of the target document divided by the total number of codons).Then the component returns the LSE that shows the highest matching ratio as a result.The component returns "Below Threshold" when the highest matching ratio is below a given threshold, and returns "No Match" when no single codon of the target document matches with those of stored reference data.The component returns "Short" or "Empty" when the byte length of the target document is not enough to be identified or no content is found on the target document after removing HTML tags.
There are two data sets used in the language identifier (LI).First, it is the data set that we use to train the LI; we called it the Training corpus (TC).The second data set is the Validation corpus (VC), which contains 500 multilingual Web pages that manually checked by users to confirm their actual language, script and encoding (LSE).The purpose of this corpus is to ensure the accuracy of the LI.Since we already know the correct LSE of the VC, every time we made changes to the LI, we can perform an experiment against the VC and find out how the changes affect the accuracy rate.
The language identification engine LI has been trained in more than 200 languages of the world (345 in terms of LSEs) at the time of this survey.Among them, 62 languages are spoken in Asia and total of 98 different encodings for Asian language scripts have been trained.Missing Asian languages from the UDHR listing are Zhuang, Yi, Hmong (including its various dialects), Shan, Karen, Oriya, Divehi, D�ongkha (Bhutanese), etc.
Languages selected here are official or nationally recognized languages in respective Asian countries.Training data sets are based on the Universal Declaration of Human Rights document, which has been converted into each language and into commonly used encoding schemes including UTF-8.Table 1 is the complete list of the Asian languages targeted in this survey, classified by language family.Additional information for the languages is also listed, vi�: the script(s) for the language and the encodings we trained LIM over.

Introduction to Asian Languages
We can list several language families on the Asian continent: Austroasiatic, Austronesian, Dravidian, Indo-Iranian, Mongolian, Semitic, Sino-Tibetan, Thai-Kadai, Turkic, and Tungus.Some of these language families are not firmly established and could be regrouped into the larger language groups or could be divided into smaller sub-groups.For example, the Turkic, Mongolian, and Tungus language families can be regrouped into larger language family Altaic, and the Indo-Iranian language family can be divided into the Indo-Aryan, Iranian, and Kafiri.There are some isolated languages around the Asian continent, e.g.Korean, Japanese, Ainu, and Burushaski.Some European languages -English, Russian, French, and Portuguese -are also used in the region as official languages, and from the mixture of an indigenous language and an introduced language, pidgins or creoles have emerged.
Among those language families, Sino-Tibetan has the largest number of speakers, estimated at 1.2 billion.Next comes Indo-Iranian, with at least 700 million speakers in India, and more than 200 million people in Pakistan, Bangladesh, Iran and other South and MiddleEast Asian countries.Malay in the Austronesian language family has around 250 million speakers in Indonesia, Malaysia, Brunei, Singapore, the southern Philippines, and Thailand.Tamil, a Dravidian family, has about 200 million speakers in India.Semitic includes a language of many speakers, that is, Arabic, the number of which is estimated to be about 200 million.Other language families have a relatively small number of speakers.Among the isolated languages, Japanese has the largest number of speakers with about 125 million and Korean follows with about 75 million.When we describe the Asian languages, we cannot avoid mentioning the diversity of scripts they use.Contrasted with Western Europe, the diversity is outstanding.In Southeast and South Asian countries, many scripts that come from the Brahmi script are used, and in the East and Near East Asian countries, Han�i script and some other indigenous scripts are used.Latin, Arabic and Cyrillic script are also used with some additional letters and diacritical marks.

Web Presence by Country
In Figure 1, the colouring of map is based on the number of Web pages per 1000 population, as this is the reflection of the degree of presence of a country on the Table 1: List of Language/Script/Encoding [1] trained, grouped by language family

[Austronesian] [Indo-Iranian] [Dravidian]
Web.This shows that Israel is the highest (3757 pages per 1000 population) in the rank and Singapore and Cyprus follow, respectively.The population data was obtained from the CIA World Factbook (estimates as of July 2006).
Figure 1 shows that Ka�akhstan and A�erbaijan respectively have the highest Web page size per 1000 population among Central Asian countries.Figure 1 also shows that Cambodia, Afghanistan, Pakistan, India, Syria, Yemen, Bangladesh, and the last, Myanmar, have the least number of pages presence on the Web (between 5 (4.54%) to 0 (0.35%) pages per 1000 population).It is worth noting that Myanmar, the neighboring country to Thailand, has the least (0.35%) among all the Asian countries.
The presence on the Web of each Asian country is given at ccTLD level in Table 2.In Table 2, ranking is based on the percentage of Web presence against the total Web pages in the region.This shows that Israel (28.88%),Thailand (11.72%) and Turkey (10.61%) with a higher number of language presence on the internet at ccTLD level.Table 2 was tabulated using the Number of Web Pages collected by the crawler engine.

Web Presence by Language
Fourth column of Table 3 shows the total number of   3 shows that Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sudanese, Hindi, Dari, Uzbek and Mongolian have a relatively high presence on the Web.The highest number is for Hebrew, and the second highest for Thai.The fifth column gives the number of pages per 1000 speakers of each language.
An almost identical ranking is observed in both the number of pages and the pages per population.
A high degree of "divide" in terms of usage level of languages can be observed among the Asian languages.The number of Hebrew pages per 1000 speakers is 28 times higher than that of the Malay language (ranked tenth in Table 3), 300 times higher than Pushtu (ranked 20th), and 3,000 times higher than Gujarati (ranked 50th).The speakers' population of languages is said to follow Zipf's Law -the n-th ranked language speaker is one n-th of the population of the top ranked language.But if we measure the size of a language by the number of pages written in the respective language, the relative size of the 1st, 10th, 20th and 50th ranked language in Table 3 becomes a series of 1, 0.036, 0.0035, 0.0001.Our observation suggests that the number of Web pages written in each language follows a progressive power law curve.The situation evidenced here can be well described as a "Digital Language Divide".

Multilingualism by Country Domain
The most recent version of Ethnologue [11] lists close to seven thousand languages around the world.More than 2600 of them are spoken in the Asian region.This indicates that a large scale linguistic diversity is observable in Asia.Among the 2600s', only around 51 languages are recognized by Asian governments as official or national language(s) of the country and other languages have been recognized as languages for home use.
Through the survey, a rich diversity of written pages was found in the country with the richest diversity of languages in the region, i.e.Indonesia.It is interesting to note that there are a significantly larger number of pages in Javanese compared to either Indonesian or Malay.The major language found in Indonesia, Malaysia, Brunei, Singapore, Southern Thailand and Phillipines can be categorized into a single root Malay language spoken in different dialects.This surprising result shows two things: Javanese has a dominating Web presence in Indonesia.The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia's local language diversity on the Internet (see Table 3).  [1 5,701,765 (5.3%) Empty pages 273,187 (0.3%)No matching pages 9,386 (0.0%) Duplicated pages [2]  35 1,756 0.00 [1] The threshold is set at 20% in this survey; [2] Almost one-third of the pages were found to be an exact copy of another page.We excluded duplicate pages from the language identification process.

Cross-Border Languages and their Dominance
Another aspect of the multilingualism in the region is the overwhelming presence of cross-border languages on the Web.Here we define two categories of languages.The first category is "local languages", which are officially recogni�ed language(s) and home speakers' languages of the state.The second category is "crossborder languages", such as English, French, Russian and Arabic, which are used as a language of communication among the peoples of different nations.
Arabic can be categorized in two ways.In the Middle East region, Arabic is recogni�ed as an official language in many countries, but it also functions as an important cross-border language.So we treat Arabic in two ways depending on the context of analysis; if it is an official language, it is counted as a local language in Figure 2, and if not, then as an 'other cross-border language' [12].Figure 2 shows the relative share of these categories of language in each country domain.
Countries are grouped by sub-region.We found that each sub-region shows clear characteristics in terms of the weight and the choice of cross-border languages.
In West Asia, two cross-border languages, English and Arabic, dominate the Web.Almost 99% of Web pages are written in these two cross-border languages, except in Cyprus, Iran and Israel.Local languages show a majority in several countries, such as in Israel (62.0% in Hebrew), Turkey (50.7% in Turkish) and Iran (50.6% in Farsi, Dari, Pashtu and Balochi).If we treat Arabic as a local language, the share of local languages becomes more than half in most countries in the region.A quite unique case is Cyprus, where Greek (36.6%) plays a key role.
In South Asia, the dominance of English is outstanding.Relatively high share of local languages is found only in Nepal (22.4% in Hindi or Nepali), India (21.7% in various Indian languages), the Maldives (8.2% in Divehi) and Sri Lanka (5.1% in Sinhala).
In the Central Asia, there are two cross-border languages, English and Russian.As will be discussed in section " The Waro Alphabets in Central Asia", Russian dominates in Ka�akhstan (88.8% in Russian), Kyrgy�stan (86.3% in Russian), and U�bekistan (70.4% in Russian), while English dominates in Turkmenistan (94.4% in English).Tajikistan (45.9% in English and 44.6% in Russian) has an equal balance of the two languages.Local languages show a substantially higher share (35.2%Mongolian) only in Mongolia.
In South East Asia, the situation is rather different.The share of local languages is far higher than in other sub-regions.Among them, local languages have a major share in Vietnam (69.8% in Vietnamese), Thailand (64.0% in Thai) and Indonesia (58.7% in various local languages including Javanese, Indonesia, Sundanese, Balinese, etc.).English dominance is also observed frequently in this sub-region.

Script Diversity in Asia
Asia is especially rich in scripts.The five basic scripts: Ideographic, Brahmi, Latin, Arabic and Cyrillic grew up in the region, each largely separated by mountains, ocean or deserts.In East Asia, the influence of Chinese ideographic script (han�i) is remarkable.In South Asia, in and around the Indian Subcontinent and in the continental part of Southeast Asia, scripts originating from Brahmiscript are influential.The islands of Southeast Asia and Australasia have mostly adopted Latin scripts (some islands in the region still use Brahmi-originating scripts such as the Balinese script, or aksara Bali).In Central Asia, historically languages were written in the Arabic script under the influence of the Ottoman Empire but later transformed into Cyrillic.Lastly in the western part of Asia, Arabic script is widely used not only by Arabic speakers but also by non-Arabic speakers.

The War of Alphabets in Central Asia
As the Turkic language border extends from Europe to China, covering 12 million square kilometres, the languages are written in several scripts.In Turkey the Latin alphabet has been used since 1928.In Central Asian republics, the Cyrillic script has been in use from about the same time.In some areas of Afghanistan Arabic script is used.It is said that Turkic languages such as Uyghur and Ka�akh are written even in Chinese script.Now the region is in a transition period.As an interesting example of script diversity, let us discuss this sub-region.
Since 1990 Turkey has invited thousands of students from new republics in the central Asia to Turkish universities by offering scholarships.The Turkish Education Minister Köksal Toptan, while attending a conference of the education ministers of Turkic republics and communities in Bishkek in September 1993, said, "the most important factor which will secure our unity and develop our language is a common alphabet" [13].Azerbaijan, Turkmenistan, Uzbekistan, Tatarstan, and Gagauz have an agreement to complete the transition into Latin script before 2010.
But in reality, in symposiums and meetings between Turkic republics in Central Asia Russian is nearly the sole tool of communication between the Turkic peoples in the central Asia."The local languages are used exclusively in indigenous film-making, scholarly publication, and in local trade and commerce" [13] Ka�akhstan and Kyrgy�stan have a significant Russian population.This fact increases again the influence of Russian language and the Cyrillic script as well.China and Iran are the other important actors in this sub-region: Kyrgy�stan and Ka�akhstan share borders with China, and Iran has an important influence on A�erbaijan and Turkmenistan.
Although our survey results can provide only a limited picture of this situation, they do make it clear that the choice of script used to write local languages seems influenced substantially by the script of the dominating language in the country (Table 4(a), 4(b)).

The Existence of Multiple Encodings
Indian language Web sites heavily rely on unique encodings or proprietary extensions of existing standard encodings [14].One survey had found 24 such local encodings for Hindi alone, and 15 for Tamil, 14 for Marathi, 10 for Malayalam, and so on.The total number of these local encodings reaches well over 50 [15].The existence of multiple local encodings is not specific only to Indian languages, but is widely seen in other languages which use non-Latin scripts or Latin script with significant extensions and/or additional diacritics.Vietnamese is a typical example of the latter.
To resolve this problem for scripts of the languages around the world, the International Organization for Standardi�ation (ISO) and the International Electrotechnical Commission (IEC) made efforts to develop a single comprehensive universal character set.The first version of "The Universal Multiple Octet Coded Character Set (UCS)" was published in 1991.Later the of UTF-8 (96.4%).It is followed by Mongolian (95.5%),Hindi and other Devanagari-based languages (78.4%), and Sinhala (44.5%).Hebrew (12.3%),Thai (2.7%), Burmese (0.7%), and Turkish (0.5%) are relatively or extremely low.These estimates of UTF-8 penetration should be considered as overestimated, because many local encodings are still missing from our training data.

DISCUSSION
In this section, we will discuss issues from the viewpoint of how to reali�e the vision of an integrated observationcollection instrument for Asian language resources from the Web.First we will discuss the availability and quality of language resources, and then we will focus on our agenda in the technical domains, how to deal with plural scripts and encodings and how to create efficient and workable solutions for collecting language resources.

Overall Assessment
When measured by the number of pages or by pagesper-capita, most of the Asian languages are far less represented on the Web than European languages.This is not a surprising result, but their presence is even more limited than expected.Hindi and Bengali, for example, with almost four hundred million speakers between them -larger than the total population of the European Union -have only one hundred thousand or so pages on the Web.The degree of difference between them and European language representation is in the order of tens of thousands or hundreds of thousands."The digital language divide" does definitely exist at a worrisome level.
When measured by volume of text, a one million document set contains roughly 5 gigabytes of text, assuming 5000 bytes as the average page size.But only ten Asian languages have above this amount of language resources with the remainder being far smaller.
When we evaluate the quality of documents as language resources, such factors as the variety of content category, language quality, and variety of style of documents should be evaluated.At this moment, we cannot tell much about these points.But at least one point can be mentioned here.It is likely that content category, quality and style of languages are biased, at least in "smaller" languages.The bias might stem from the specialization of usage in a multilingual environment.Multilingualism is the norm in most parts of Asia.In a multilingual environment, there is often specialization in discourse situations.For example, English for the occupational domain is an official language for public or educational domains and other local languages for personal domains.When such specialization is apparent, the language contents on the Web also may show specialization depending upon the domains of the language's specialization.The outstanding dominance of cross-border languages in many country domains suggests that the specialization domains left for the local languages might be relatively limited.
work of ISO / IEC and that of the Unicode Consortium became integrated and synchronized.The most recent version of the Unicode Standard (The Unicode Consortium, 2005) assigns a unique identifier to each of 97,720 characters (including 70,207 ideographic characters defined by national and industry standards of China, Japan, Korea, Taiwan, Vietnam and Singapore).But it was expected by the Unicode Consortium, that encoding based on UCS/Unicode, whose most commonly used form is UTF-8 (UTF-Unicode Transformation Format), would be used in parallel with the abovementioned local encodings.
Taking this plurality into account, we have tried to collect training data encoded in these local proprietary codes and in UTF-8.As shown in Table 5, we have trained our language identification engine LIM by using 9 encodings for Hindi, 4 encodings for Tamil and 2 encodings for Telugu.Also we have included 3 encodings for Vietnamese and 2 encodings for Sinhala.This is still not sufficient to match the plurality in the real world, but we believe that this is the first ever attempt to identify actual usage of local encodings in the Web space.
As a result, we found that the use of UTF-8 in the Asian region is extremely low.Table 5 shows to what extent UTF-8 is used in selected languages (for other languages, we do not have sufficient training data prepared in different encodings).The table shows that Vietnamese is found to be the highest in the penetration

How to cope with the Growing Web
The current survey does not cover Web pages placed under generic domains like com, org or net.Many local language news sites, blog pages and chat-rooms are hosted in generic domains, whose size is almost ten times larger than the entire country code domains.Therefore, considering the growing speed of the Web, the question of how to implement an efficient crawler becomes a key issue in our vision.The current study consumes almost 652 gigabytes of disk storage and consumes 50 to 80 Mbps bandwidth for almost one week.A simple calculation tells us that 65 terabytes of disc storage and 100 weeks would be needed to collect the entire Web (10 billion pages is assumed here).It seems impossible for most non-commercial entities.In this context, several studies and attempts have been made in the field of language-focused crawling [14][16].One of the assumptions behind the design of this approach is that pages written in a specific language may have a high likelihood of being linked to pages in the same language.We need to verify this assumption.A graph analysis to reveal the structure of sub-graphs of Web pages written in the same language should be tackled.
In the same context, a distributed crawling approach coupled with proximity-based allocation of tasks has been explored [17].An advantage of this approach is the possibility of combining free-resources from any possible participant, and proximity-based allocation of tasks can improve the speed of crawling by reducing response time from an assigned server to a target server.The Language Observatory is offering a server to an experiment to test this approach, designed by Thai Computational Linguistics Laboratory (TCL).

CONCLUSION
A detection technique for natural languages and their encoding schemes can also be used as an online language, script, and encoding scheme identifier and to develop tools such as multilingual search engines.It will be difficult to install the shift-codon trained data into a Web browser due to the large amounts of shiftcodon required.However, online detection service and crawling for specific language groups could be implemented with limited knowledge, since the server manages the knowledge.
The survey presented, in spite of its limitations, is probably the first comprehensive survey of Asian languages on the Web.The results revealed the existence of a worrisome level of digital language divide and the dominance of cross-border languages in the Asian domains.Through the survey, an estimate of the size of language resources on the Web is given.Also the extent of plurality in scripts and encodings of Asian language documents is indicated.It may be premature to confirm the feasibility of a "Web as Corpus" scenario for Asian languages in a conclusive manner.
Finally, the survey has identified points to be aware of and has given directions that can benefit anybody who tries to create a language resource collection.

Figure 1 :
Figure 1: Presence of Web Pages by Country in the Asian Region Table 2: Percentage of Web Pages on the Internet at ccTLD Level

Figure 2 :
Figure 2: Cross-border languages presence in Asian countries grouped by region GCC stands for the Gulf Coopeation Coucil, which consists of Bahrain, Kuwait, �man, �atar, Saudi Arabia and UAE Kuwait, �man, �atar, Saudi Arabia and UAE Kuwait, �man, �atar, Saudi Arabia and UAE

Table 2 :
Percentage of Web Pages on the Internet at ccTLD Level

ccTLD % of Web Pages ccTLD % of Web Pages ccTLD % of Web Pages il
Web pages identified by the survey.The data shown in the third column of the table is the speaker population of that language with statistics taken from the UDHR Web site.In principle, all Asian languages listed in first column in Table3are considered as local languages.The ranking is based on the number of pages.Table

Table 3 :
Number of Web Pages Collected from Asian ccTLDs, by Language Kodama, Chew Yew Choong, Razza Caminero 18 Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami 2008The International Journal on Advances in ICT for Emerging Regions 01 (01)

Table 4 :
Number of Pages in Domains of Central Asian Republics (a) English, Russian, and Arabic pages by country "n/a" means pages are not yet found, but it does not mean nonexistence of pages.

Table 5 :
The Penetration of UTF-8 Encoding in [1]Local proprietary encodings are shown in this table by Local proprietary encodings are shown in this table by ocal proprietary encodings are shown in this table by names of font (families).21 An Analysis of Asian Language Web Pages The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008