Read a Tamil magazine in English – automatic translation

Optical character recognition (OCR) is the technique of converting typed, printed, or handwritten text into computer-encoded text (plain text). Machine translation is the conversion of text from one language to another.

Today an article was published in a Tamil magazine, and a friend asked me to translate it from Tamil into English. Instead of typing it all out myself, I wanted to use machine translation. For that, I first needed to get the Tamil text out of the magazine page and then translate it using Google or Bing Translate.

There are two methods to get this done.

Method 1 – Using Google Docs

These are the steps I followed:

  1. Scan the article.
  2. The page layout was in two columns, so I cropped each column into its own single-column file (Google Docs gets confused by two columns and combines the text across them into one long line).
  3. I joined all the individual files into one long single-column file. I converted it to a PDF file, but it can be a long JPG file too.
  4. Upload the PDF file to Google Drive.
  5. Right-click the file and choose “Open with Google Docs”.
  6. You will see a new file opened, with the Tamil text. Copy the entire text.
  7. Go to translate.google.com, paste the text and then convert to English.
  8. Copy the English text and paste it into a new file.
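
Steps 2 and 3 above (cut the page into columns, then join them into one long column) can be sketched in code. This is only an illustration, not what I actually ran: it treats the scanned page as a plain grid of pixel values, and the names `stack_columns`, `split_x` and `background` are made up for this example. With an imaging library the same idea would be a crop followed by a paste.

```python
from typing import List, Sequence

Pixel = int  # assumption: one grayscale value per pixel, 255 = white paper


def stack_columns(page: Sequence[Sequence[Pixel]], split_x: int,
                  background: Pixel = 255) -> List[List[Pixel]]:
    """Cut a two-column page at x = split_x and put the right column below the left."""
    if not page:
        return []
    width = len(page[0])
    if not 0 < split_x < width:
        raise ValueError("split_x must fall inside the page")

    # Crop every pixel row into its left and right part.
    left = [list(row[:split_x]) for row in page]
    right = [list(row[split_x:]) for row in page]

    # The two columns may differ in width; pad the narrower one with background.
    out_width = max(split_x, width - split_x)

    def pad(row: List[Pixel]) -> List[Pixel]:
        return row + [background] * (out_width - len(row))

    # Left column first, then the right column underneath it.
    return [pad(row) for row in left] + [pad(row) for row in right]


if __name__ == "__main__":
    tiny_page = [[1, 2, 3, 4],
                 [5, 6, 7, 8]]
    for row in stack_columns(tiny_page, split_x=2):
        print(row)
```

The reading order is what matters here: everything in the left column has to come before anything in the right column, which is exactly what the OCR step needs.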

Steps 1, 2, 3 – Convert the two-column image to a long single column
Steps 4, 5 – Upload the file to Google Drive, then open with Google Docs
Steps 6, 7, 8 – Copy the Tamil text, translate using Google Translate and then get the English text
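
For a long article, the copied Tamil text may be too big to paste into the translator in one go. Here is a minimal sketch of splitting it into smaller pieces at paragraph boundaries before pasting. The function name `split_for_pasting` is made up for this example, and the default limit of 5000 characters is an assumption; check the actual limit of the translator you use.

```python
from typing import List


def split_for_pasting(text: str, max_chars: int = 5000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters each."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n"):
        # A single paragraph longer than the limit is cut hard.
        # Note: a hard cut counts code points, so it can separate a Tamil
        # letter from its vowel sign; paragraph breaks are the safer split.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        candidate = paragraph if not current else current + "\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow; start a new chunk with it.
            chunks.append(current)
            current = paragraph

    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "first paragraph\nsecond paragraph\nthird paragraph"
    for number, chunk in enumerate(split_for_pasting(sample, max_chars=35), start=1):
        print(number, repr(chunk))
```

Each chunk can then be pasted and translated separately, and the English pieces joined back in the same order.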

Method 2 – Using Google Translate app

A simpler method, if I just want to read the Tamil text in English, is to point the camera in the Google Translate app at the Tamil text and press Translate!

Update 27th March 2021:

This is not directly related to the above, but recently there have been breakthroughs in the technology that powers language translation, with the aim of making it available for low-resource (which tend to be lesser-known) languages around the world. This article from the BBC talks about the US intelligence agency IARPA funding research to develop a system that can find, translate and summarise information from any low-resource language, whether it is in text or speech. It is an interesting approach that doesn’t require the enormous parallel corpus that current machine translation models need.

Currently, Google Translate covers only 100 of the 4000 written languages from around the world. If these new techniques work, one day we may see something like the #StarTrek Universal #Translator or the Hitchhiker’s Guide #Babelfish come true!