Today I made a brief presentation introducing what Unicode is and how Tamil is encoded in UNICODE (தமிழில் ஒருங்குறி, "Unicode in Tamil").
You can download the presentation from here.
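As a small illustration of the topic (not taken from the presentation itself), here is a minimal Python sketch that lists the Unicode code points behind the word "Tamil" written in Tamil script. The sample word and the helper name `describe` are my own choices for illustration.

```python
import unicodedata

# "Tamil" written in Tamil script (தமிழ்), spelled out with escapes
WORD = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Every character of the word sits in the Tamil block (U+0B80 to U+0BFF)
for code_point, name in describe(WORD):
    print(code_point, name)

# The same five code points take a different number of bytes per encoding
print(len(WORD.encode("utf-8")), "bytes in UTF-8")
print(len(WORD.encode("utf-16-le")), "bytes in UTF-16")
```

Note that the word looks like three letters on screen but is stored as five code points, because the vowel sign and the virama are separate characters.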
Today there was a conference at Hotel Lalit, New Delhi, titled “WWW: Technology, Standards and Internationalization Conference”, along with the inauguration of the W3C India office at TDIL, Government of India.
Ms.Swaran Latha of TDIL & Director of W3C India Office
- Character sets and codes for all 22 official Indian Languages and 11 Scripts are now in UNICODE
- Efforts are under way on PLS 1.0 (Pronunciation Lexicon Specification), starting with Hindi. TDIL will soon start work on other Indian languages
- Issues specific to Indic languages in CSS3 style sheets, like line breaks, drop caps and others, need to be discussed and handled
- In terms of CSS3, the Japanese have done some excellent work along these lines (refer: http://www.w3.org/TR/jlreq/), as have Arabic and other bi-directional languages (refer: http://www.w3.org/TR/html-bidi/). Indic languages have no such reference documents, and this was cited as one main reason for the rendering issues Indic languages face in CSS3 or HTML. Work needs to be done on preparing reference documents for drop caps, underline, indentation, bullets and so on. Indic languages also have issues with vertical layouts, as they need to be displayed syllable by syllable rather than character by character. Only with these documents in place can we talk to browser vendors about enabling 100% support for Indian languages
- In terms of XML Normalization, IDs need to be in non-Latin characters. A lot of work needs to be done here too
- Work needs to be done in Character Model (http://www.w3.org/TR/charmod/), Speech Synthesis Markup Language (SSML) and TimeZone (http://www.w3.org/TR/timezone/) to update them with Indic Languages and India specific information
- TDIL has prepared and submitted CLDR (Common Locale Data Repository) data for six Indian languages to all international standards bodies
- We need to think about IDN issues relating to Indic languages, especially since many languages share a common script. For example, once Hindi goes live in IDN, Marathi & Bodo may not have many options left for IDN names. Refer RFC 5646 for language tagging; UNICODE to Punycode conversion becomes necessary
- TDIL has worked on a Mobile Initiation Plan for Hindi, Bangla, Marathi & Tamil. A 7-bit UNICODE-based encoding scheme for many Indian languages was approved in 3GPP for mobile SMS usage in Sep ‘09. If the standards (UTF-8/UTF-16) are agreed between telcos, mobile browser developers & OEMs for Indic languages, then huge results can be seen within one year in having Indic languages on the mobile web (a related presentation can be seen here)
- Issues that TDIL is working on include orthography, pronunciation, one script serving many languages, very few linguistic experts knowing IT, collation/sort order (working with State Government), and the lack of parallel corpora between English and Indic languages
- Only 5% of people in India can read & write English
- A presentation made by Ms.Swaran Latha on the same topic in a NASSCOM event can be seen here
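To make the UNICODE to Punycode point above concrete, here is a minimal sketch in Python using the standard library's `punycode` codec. The helper names `to_ascii_label` and `to_unicode_label` are hypothetical, and the sketch assumes a single label only; real IDN processing also normalizes and validates labels, which is skipped here.

```python
ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded IDN label


def to_ascii_label(label):
    """Encode one Unicode domain label as an ASCII (Punycode) label."""
    if label.isascii():
        return label  # plain ASCII labels are left untouched
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(label):
    """Decode an ASCII label back to Unicode if it carries the prefix."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


# A sample label in Devanagari script (भारत), spelled out with escapes
devanagari = "\u092D\u093E\u0930\u0924"
encoded = to_ascii_label(devanagari)

print(encoded)                                  # ASCII-only form of the label
print(to_unicode_label(encoded) == devanagari)  # round trip check
```

The encoded label depends only on the code points, not on the language, which is why languages sharing one script (Hindi, Marathi and Bodo all use Devanagari) end up competing for the same names.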
Dr.Jeffrey Jaffe, CEO, W3C Foundation
- Explained how W3C standards are all fully open. W3C doesn’t prevent proprietary innovations, but once an innovation gets into the fabric of the web, W3C tries to standardize it. The vendor then has to make it royalty free, including the patents for it
- Sir Tim Berners-Lee is working with the UK government to publish all their data in the most open, rich semantic standards. I wish similar efforts get under way in India soon
- About 2 Billion People are using Internet today
Dr. S. Ramadorai, Vice Chairman, TCS India
- Inclusion of all 22 Indian languages is in the industry’s vested interest, as it gives service providers a larger consumer base.
- The case is no longer just about interoperability & translation between Indian languages; it is now about world languages to Indian languages & back
Dr. Raghuram Krishnapuram, Senior Manager, IBM India Research Lab
Mr. Rajendra S. Pawar, Vice Chairman, NASSCOM
- Every physicist writes his last book on philosophy
- My children’s generation is more comfortable sharing personal information in Social Networks. Society has come a full cycle to become a better society by sharing
- The industrial revolution did well in distributing wealth. An unintended consequence was that humans were seen as an intrusion in the man-machine equation. The information revolution has brought man back to the centre, a full cycle.
- How industrial world’s scarcity mentality has given way to abundance mentality in the Information world
- In India we bow to people who give away their wealth, but look up to people who have lot of wealth
- The main question now is who will give the soul to the web and maintain it. That’s why all of us should be members in W3C & contribute
Shri R. Chandrashekhar, Secretary, DIT, MCIT
- No one knows better than India the perils of a monolingual Internet. The real world is very different, with multiple languages in use
- For India & developing countries Bits & Bites (meaning food) are both important
Shri Sachin Pilot, Hon’ble Minister of State for Communications & IT
- Vision of IT for 500 million Indians by 2022, IT Dept has to train about 10 million
Internationalization aspects in W3C
- Richard Ishida, W3C Internationalization head, talked about I18N aspects in W3C, including how to author HTML & CSS with I18N considerations. Reference: www.w3c.org/International is the best single place to go for information relating to I18N from W3C; Twitter: @webi18n.
- Dr. A. Kumaran, Microsoft Research India, talked on I18N & name search. He says “what songs are to a Bollywood movie, names are to social networking”. There are 60 million web pages which have misspelled “Barack Obama”. There are 1500 known misspellings of Britney Spears. These name search issues compound across languages. A multilingual name search can be done by generating a hash code for a name in each language and then comparing the hash codes. This approach is much better than transliteration. In one trial, for example, the accuracy for Tamil jumped from 0.29 to 0.69 with this method. To begin with you need 10,000 parallel names in two languages
- Manish Bhargava, Google Inc., USA, heads Indic efforts at Google, which works on 40 languages that map to about 99% of the world population. He talked about “web for all”. Hindi has 290 million users, Bengali 215M users, Telugu 75M users, Tamil 77M Internet users. India has 50M Internet users. PC penetration in India is 2%. The introduction of the Google transliteration keyboard feature in Google search resulted in about a 7% increase in the volume of searches in Indic languages within just 2 weeks. The Indic languages user base on the web is heterogeneous, with NRIs – 20 Million, Developed Users in India – 300 Million and Emerging users in India – 600 Million. Only 50% of the world will be on the Internet by 2030. Lack of online content in Indic languages is a major constraint, and Google picked up English content, translated it to Hindi and put it back in Wikipedia.
- Pravin Satpute, Red Hat India, talked on how there are four rendering engines in Linux, including Pango. HarfBuzz is a project to unify these
Web Access through Mobile and Handheld devices
- Dr. Phil Archer, W3C Mobile Lead, talked on the Geolocation API and Device APIs, which let a mobile web page take advantage of device capabilities like the camera. He recommends seeing the Mobile Web Application Best Practices document
- Prof. Devendra Jalihal of IIT Madras talked on a scheme for an efficient Hindi mobile keypad. When inputting Indic languages, the multi-tap issue (more taps required) crops up, and dictionary-based typing is not smooth either. The finding of the research was that “Spreading vowels in Indian Languages to separate buttons in a mobile phone keyboard is good for efficiency in typing”
- Vivekananda Pai of Reverie Technology says Indic languages need 100% perfect rendering support. Devices give English users a good experience, and need to do the same for Indic languages. S.K. Mohanty, CDAC, is one of the pioneers of typography for Indian languages. The TV remote will become the second most popular device in India once 3G becomes popular
Some other data points:
- To support WAP a device needs a 50 MHz CPU and 0.5 MB of RAM. To support Basic Web a device needs a 110 MHz CPU and 1 MB of RAM. To support WebKit a device needs a 400 MHz CPU and 5 MB of RAM
- Mobile users in India number 471 million. Of these, devices capable of mobile Internet: 127 M, used once: 12 M, active users: 4 M, as per the TRAI Sep ‘09 report
- In India a user sends about 30 SMS / Month – very low compared to other Asian countries especially South East Asia
- Each of the top 4 regional newspapers in India exceeds the top English newspaper in India by 3 times in terms of readership
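To put the mobile Internet numbers above in proportion, here is a small Python sketch that expresses each stage as a share of all mobile users. The figures are the ones quoted above; the stage labels and the function name `shares_of_total` are my own.

```python
# Figures in millions, as quoted above from the TRAI Sep '09 report
FUNNEL = [
    ("mobile users", 471),
    ("internet-capable devices", 127),
    ("used mobile internet once", 12),
    ("active mobile internet users", 4),
]


def shares_of_total(funnel):
    """Return each stage as a percentage of the first stage."""
    total = funnel[0][1]
    return {label: 100 * count / total for label, count in funnel}


for label, pct in shares_of_total(FUNNEL).items():
    print(f"{label}: {pct:.1f}% of all mobile users")
```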
I am a great admirer of Former President Dr.A.P.J. Abdul Kalam and have been inspired a lot by his book “Wings of Fire”. He is one of the few leaders in modern India who is a true role model to young Indians. Previously I had attended one of his talks on the topic of “Bridging the two Indias” in 2006 when he was President of India; security was obviously high and you couldn’t go near him then, but it was still a privilege for me to listen to him in person. So a few weeks ago, when I planned my New Delhi trip for this week, I was lucky to get an appointment with Dr.Kalamji at his Rajaji Marg residence for this evening. I was thrilled beyond words.
This evening I went to his house at the fixed time (7 PM), got to spend a few minutes with him, talk to him and take some photographs. I presented him with a few titles from LIFCO including LIFCO’s Tamil Peragarathi and he gave me an autographed copy of his book titled “The Family and the Nation”. I am currently on cloud nine writing these words.
My earlier posts on Dr.Kalamji are “Birthday Wishes to Hon’ble President”, “The President of India website” and “Government of India to offer Free Broad-band by 2009”