Finding the true character length of a Unicode string

When working on Indic-language web pages, I often need the true character length of a string. The .NET String.Length property, and its equivalents in other major programming languages, report length in terms of storage: the number of code units the string occupies. They know nothing about the rules of the language, so the result is wrong from the point of view of its grammar.

For example, if the string is Tamil ‘வி’ or ‘கொ’, or Hindi ‘मा’, the returned length is 2. This is incorrect: according to the grammar, each of these is a single character, because the vowel sign attaches to the consonant before it.
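The mismatch is easy to reproduce. The short sketch below is in Python rather than .NET, and it assumes each sample is stored as a consonant followed by a single vowel-sign code point; the strings are written as escapes so that assumption is visible. Python's len() counts code points, which for these characters gives the same number that String.Length reports.

```python
# Each sample is one consonant plus one vowel sign (assumed encoding).
samples = {
    "Tamil vi": "\u0bb5\u0bbf",   # வ + vowel sign i
    "Tamil ko": "\u0b95\u0bca",   # க + vowel sign o
    "Hindi maa": "\u092e\u093e",  # म + vowel sign aa
}

def code_point_lengths(strings):
    """Return the built-in length of every sample string."""
    return {name: len(text) for name, text in strings.items()}

lengths = code_point_lengths(samples)
for name, n in lengths.items():
    print(name, n)
```

Every sample reports a length of 2, although a reader of the language sees one character.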

To solve this problem, I have written a sample Windows Forms (.NET Framework 1.0) application. It uses the .NET Framework's System.Char.GetUnicodeCategory() method, which classifies every character as UppercaseLetter, LowercaseLetter, OtherLetter (a letter with no upper or lower case, which is where the base letters of Indic scripts fall), Control, and so on. Check it out. An online version, which you can try without downloading anything, is available along with my article at Bhashaindia.com.
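The post does not show how the application combines these categories, so the following is only one plausible reading of the idea, not the application's actual code. It is written in Python, where unicodedata.category() plays the role of System.Char.GetUnicodeCategory(), and it assumes that a "character" is any code point that is not a combining mark (categories Mn, Mc and Me). The function name letter_count is made up for this illustration.

```python
import unicodedata

def letter_count(text):
    """Count code points that are not combining marks.

    Assumption: vowel signs and the virama (categories Mn, Mc, Me)
    belong to the preceding letter and are not counted on their own.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

print(letter_count("\u0bb5\u0bbf"))  # Tamil vi
print(letter_count("\u0b95\u0bca"))  # Tamil ko
print(letter_count("\u092e\u093e"))  # Hindi maa
```

Under that assumption each of the three samples counts as one character, and plain English text is unaffected because its letters carry no combining marks.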

Known bug: Deepak Gulati of Microsoft India found a bug in this solution. The Tamil (Grantha) character ‘ஸ்ரீ’ should be counted as 1, but this code counts it as 2. The reason is that ‘ஸ்ரீ’ is technically composed of two separate Unicode characters, ‘ஸ்’ and ‘ரீ’, which the font renders as a single glyph. I am trying to figure out a way to solve this. If you can think of one, post it below in the comments. My good friend Sri. Muthu Nedumaran has suggested a platform-neutral solution that works for Tamil. I am working on it and will post it here shortly.
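The bug can be reproduced with a category-based counter. The sketch below is in Python and assumes the common encoding of ‘ஸ்ரீ’ as four code points (ஸ, virama, ர, vowel sign ii). It also shows a deliberately simple workaround that treats a hand-maintained list of known ligatures as single characters. This is not Muthu Nedumaran's solution, which the post does not describe; KNOWN_LIGATURES and both function names are hypothetical.

```python
import unicodedata

def letter_count(text):
    """Count code points that are not combining marks (Mn, Mc, Me)."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

# Assumed encoding of the Tamil "sri" ligature: ஸ + virama + ர + vowel sign ii.
SRI = "\u0bb8\u0bcd\u0bb0\u0bc0"

# Hypothetical, hand-maintained list of sequences drawn as one glyph.
KNOWN_LIGATURES = (SRI,)

def letter_count_with_ligatures(text):
    """Count each known ligature once, then count the rest by category."""
    count = 0
    for ligature in KNOWN_LIGATURES:
        count += text.count(ligature)
        text = text.replace(ligature, "")
    return count + letter_count(text)

print(letter_count(SRI))                 # category-only count
print(letter_count_with_ligatures(SRI))  # count with the special case
```

The category-only count is 2 because the string contains two base letters, while the special-cased count is 1. A lookup list like this only covers the sequences someone has thought to add, so it is a stopgap rather than a general fix.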

Download Application (5.89 KB)
Download Source (18.44 KB)
