A few days ago my friend Mr Mani Manivannan shared on his Facebook page a link to a free PDF ebook of the "தமிழ் - தமிழ் அகரமுதலி" (Tamil-Tamil Dictionary), published in 1985 by the Tamil Nadu Text Book and Educational Services Corporation, Chennai. It is a useful book, and I downloaded it right away.

The Tamil Virtual Academy has scanned the book well. The one shortcoming is that the ebook is not searchable in Tamil, because OCR support for Tamil has become widely available only in recent months. So I ran the downloaded ebook through the free tool Tesseract and have shared the OCRed result as a new file. How I did it is described below.

I tried to fix it. I started by downloading the ebook (a PDF file) from Archive.org and then performed OCR using the open-source tool Tesseract, a wonderful tool that I had talked about at the Tamil Internet Conference 2019. Tesseract reads each page as a picture and performs character recognition, that is, it converts the text it finds in the image into a string of Unicode text. The accuracy of the OCR is not 100%, but it is certainly usable. Tesseract then attaches the Unicode text to the image as an invisible layer, with markers indicating where on the page each piece of text was read from, a technique known as embedded text in a PDF. Finally, it writes the image and the associated Unicode text for every page into a new output PDF file.

From the PDF file (on the left) I am able to copy ‘n’ paste the text in Tamil to Notepad (on the right) in Unicode format seamlessly – The accuracy is not 100% but certainly usable
Searching in the ebook in Tamil works well
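
Since everything here depends on Tesseract having the Tamil language data installed, it may be worth checking for it before starting a run that takes hours. The following is only a sketch in Python, not part of the original batch files: it parses the text printed by `tesseract --list-langs` and reports which of the wanted languages are absent. The function names `parse_list_langs` and `missing_languages` are made up for this illustration, and the parsing assumes the listing has one header line ending in a colon followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def missing_languages(output: str, wanted=("tam", "eng")) -> list:
    """Return the wanted language codes that are absent from the listing."""
    installed = parse_list_langs(output)
    return [code for code in wanted if code not in installed]


if __name__ == "__main__":
    exe = shutil.which("tesseract")
    if exe is None:
        print("tesseract was not found on the PATH")
    else:
        result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
        # Check both streams in case the listing is written to stderr.
        missing = missing_languages(result.stdout + "\n" + result.stderr)
        print("Missing language data:", ", ".join(missing) if missing else "none")
```

If `tam` shows up as missing, the Tamil traineddata file would need to be added to the Tesseract installation before the OCR step can work.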

Below are the three batch files I wrote for Windows 10 to carry out the necessary steps.

Step 1: Tesseract accepts only image files such as PNG, JPG, GIF or TIFF as input, so we need to extract every page of the PDF file as an individual image. I use the Ghostscript utility for this; the download link is in the comments of the batch file. The batch file creates image files with the fixed name pattern __inputpages-NNNNN.png, where NNNNN is the page number. You could make the batch file more generic by accepting the output file name as a command-line parameter.

@echo off
rem Usage: PDF2OCT_STEP1.bat inputfile.pdf
rem About: Generates a PNG image for each page in a PDF
rem 1. Ghostscript should be installed and its folder be in the PATH
rem 2. Ghostscript: https://www.ghostscript.com/download/gsdnld.html
rem 3. Creates temp files called __inputpages*.png

if "%~1"=="" (
    echo usage: PDF2OCT_STEP1.bat inputfile.pdf
    exit /b 1
)

rem %%05d is the batch-escaped form of %05d, which Ghostscript replaces with a zero-padded page number
gswin64.exe -dNOPAUSE -dBATCH -r600 -sDEVICE=png16m -sOutputFile="__inputpages-%%05d.png" "%~1"

The dictionary ebook referenced above was about 1000 pages and this step took about 40 minutes on my Core i7, 16 GB, SSD laptop running Windows 10.
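
The generic variant mentioned in Step 1 could look roughly like the Python sketch below, which builds the same Ghostscript command but takes the output prefix and resolution as parameters. The function name `build_render_command` and its parameter names are assumptions made for this illustration; the Ghostscript switches are the ones used in the batch file above. The sketch only prints the command rather than running it.

```python
def build_render_command(pdf_path, prefix="__inputpages", dpi=600, gs_exe="gswin64.exe"):
    """Build the Ghostscript argument list that renders each PDF page to a PNG."""
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        f"-r{dpi}",
        "-sDEVICE=png16m",
        # A single % is enough here: the %% in the batch file is only cmd.exe escaping.
        f"-sOutputFile={prefix}-%05d.png",
        pdf_path,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(build_render_command("input.pdf", prefix="dictionary"))
```

To actually run it, the returned list can be handed to `subprocess.run(command, check=True)`. Passing the arguments as a list also avoids quoting problems with file names that contain spaces.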

Step 2: Generate a text file listing all the __inputpages*.png files in the folder, then run the Tesseract engine against that list so that every image is processed. The command-line parameters to tesseract.exe specify the languages (I have given Tamil and English, in that order) and that the output should be in PDF format. This second batch file requires the input image files to be named __inputpages*.png, the same as the output from Step 1.

@echo off
rem Usage: PDFOCR_STEP2.bat outputfile
rem About: Does OCR for Tamil on each of the images and then combines them into one PDF
rem 1. Tesseract.exe has to be installed and be in the PATH: https://github.com/tesseract-ocr/tesseract
rem 2. NO input file is to be specified and NO .pdf in the output file name
rem 3. Reads the __inputpages*.png files from Step 1 and creates a temp file called __inputfiles.txt

if "%~1"=="" (
    echo usage: PDFOCR_STEP2.bat outputfile
    exit /b 1
)

rem /on sorts the listing by name so the pages stay in order
dir /b /on __inputpages*.png >__inputfiles.txt
tesseract -l tam+eng __inputfiles.txt "%~1" pdf

For the 1000+ pages of the dictionary ebook referenced above, this step took about 4-7 hours (my machine went into sleep mode a few times in between) on my Core i7, 16 GB, SSD laptop running Windows 10.
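
The two commands in Step 2 can also be sketched in Python, which makes the page ordering explicit: because the page numbers are zero-padded, sorting the file names alphabetically keeps them in page order. The helper names `write_page_list` and `build_ocr_command` are hypothetical; the Tesseract arguments mirror the batch file above. The demo at the end uses empty placeholder files in a temporary folder and only prints the command.

```python
import tempfile
from pathlib import Path


def write_page_list(folder, prefix="__inputpages", list_name="__inputfiles.txt"):
    """Write the page image names, sorted by name, one per line; return the list file path."""
    folder = Path(folder)
    pages = sorted(p.name for p in folder.glob(f"{prefix}*.png"))
    list_path = folder / list_name
    list_path.write_text("".join(name + "\n" for name in pages), encoding="utf-8")
    return list_path


def build_ocr_command(list_file, output_base, languages=("tam", "eng")):
    """Build the tesseract argument list; output_base is given without the .pdf extension."""
    return ["tesseract", "-l", "+".join(languages), str(list_file), str(output_base), "pdf"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create empty placeholder pages, deliberately out of order.
        for number in (2, 10, 1):
            (Path(tmp) / f"__inputpages-{number:05d}.png").touch()
        list_path = write_page_list(tmp)
        print(list_path.read_text(encoding="utf-8"))
        print(build_ocr_command(list_path.name, "dictionary_ocr"))
```

The list file holds bare file names, so Tesseract would need to be started from the folder that contains the images, for example with `subprocess.run(command, cwd=folder, check=True)`.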

Step 3: The last step is to use Ghostscript to compress the OCRed PDF into a new PDF file that is suitable for reading as an ebook.

@echo off
rem Usage: PDF2OCT_STEP3.bat Inputfile.pdf OutputFile.pdf
rem About: Compresses a PDF to be an ebook
rem 1. Ghostscript should be installed and be in the PATH

if "%~2"=="" (
    echo usage: PDF2OCT_STEP3.bat Inputfile.pdf OutputFile.pdf
    exit /b 1
)

gswin64.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%~2" "%~1"
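
The /ebook value is one of several quality presets that Ghostscript's pdfwrite device accepts through -dPDFSETTINGS. The Python sketch below builds the same command with the preset as a parameter, so that a smaller (/screen) or higher-quality (/printer) output can be tried. The function name `build_compress_command` is made up for this illustration, and the check that input and output differ is an added precaution, not something the batch file does.

```python
import os

# Presets accepted by Ghostscript's pdfwrite device.
PDF_PRESETS = ("/screen", "/ebook", "/printer", "/prepress", "/default")


def build_compress_command(input_pdf, output_pdf, preset="/ebook", gs_exe="gswin64.exe"):
    """Build the Ghostscript argument list that rewrites a PDF with a quality preset."""
    if preset not in PDF_PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PDF_PRESETS}")
    same_file = os.path.normcase(os.path.abspath(input_pdf)) == os.path.normcase(
        os.path.abspath(output_pdf)
    )
    if same_file:
        # Writing onto the file that is still being read would damage it.
        raise ValueError("input and output must be different files")
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(build_compress_command("dictionary_ocr.pdf", "dictionary_ebook.pdf"))
```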
