A few days ago my friend Mr Mani Manivannan posted on his Facebook page that the "Tamil-Tamil Agaramuthali" (தமிழ்-தமிழ் அகரமுதலி), a Tamil-Tamil dictionary published in 1985 by the Tamil Nadu Text Book and Educational Services Corporation, Chennai, is available for free as a PDF ebook, and he shared the download link. It is a useful book, so I downloaded it right away.

The Tamil Virtual Academy has scanned the book well. The one shortcoming is that the ebook cannot be searched in Tamil, because Tamil support has reached the easily available optical character recognition (OCR) tools only in recent months. So I ran OCR on the downloaded ebook with a free tool called Tesseract and have shared the result as a new upload. How I did it is described below.

I tried to fix it. I started by downloading the ebook (a PDF file) from Archive.org, then performed OCR using the open-source tool Tesseract, which I had talked about at the Tamil Internet Conference 2019. Tesseract reads each page as a picture and performs character recognition: it converts the text it finds in the image into a string of Unicode text. The accuracy of the OCR is not 100%, but it is certainly usable. Tesseract then places that Unicode text as an invisible layer over the page image, positioned where each piece of text was read from, a technique known as embedded text in a PDF. Finally, it writes the image and the associated Unicode text for every page into a new output PDF file.

From the PDF file (on the left) I am able to copy and paste the Tamil text into Notepad (on the right) as Unicode text. The accuracy is not 100% but certainly usable.

Searching in the ebook in Tamil works well

Below are the three batch files that I wrote to run on Windows 10; together they perform the necessary steps.

Step 1: Tesseract accepts only image files, such as PNG, JPG or TIFF, as input. So we need to extract every page of the PDF file as an individual image. I am using the Ghostscript utility to do this; its download URL is in the comments of the batch file. The batch file creates the image files with the fixed name pattern __inputpages-NNNNN.png (matched by __inputpages*.png); you may make the batch file generic by accepting the output filename as a command-line parameter.

@echo off
echo usage: PDF2OCT_STEP1.bat inputfile.pdf
rem About: Generates an image for each page in a PDF
rem 1.Ghostscript should be installed and its folder be in the path
rem 2.Ghostscript: https://www.ghostscript.com/download/gsdnld.html
rem 3.Creates temp files called __inputpages-NNNNN.png (one per page)
rem 4.The percent sign in %%05d is doubled so that the batch file passes %05d to Ghostscript

gswin64.exe -dNOPAUSE -dBATCH -r600 -sDEVICE=png16m -sOutputFile="__inputpages-%%05d.png" "%~1"

The dictionary ebook referenced above has about 1000 pages, and this step took about 40 minutes on my laptop (Core i7, 16 GB RAM, SSD) running Windows 10.
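Step 1 mentions that the fixed output name could become a parameter. Here is a minimal Python sketch of that idea; it is an illustration rather than the batch file used above. It builds the same Ghostscript command line with a configurable filename prefix and resolution. The names `build_gs_render_command` and `render_pages`, and the parameters `prefix`, `dpi` and `gs_exe`, are hypothetical and chosen only for this example; the Ghostscript switches are the ones from the batch file.

```python
import subprocess


def build_gs_render_command(pdf_path, prefix="__inputpages", dpi=600,
                            gs_exe="gswin64.exe"):
    """Return the Ghostscript argument list that renders each page to a PNG.

    The arguments are passed as a list (no shell), so the page counter is
    written as a single %05d instead of the doubled %%05d a batch file needs.
    """
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        f"-r{dpi}",                         # resolution in dots per inch
        "-sDEVICE=png16m",                  # 24-bit colour PNG output
        f"-sOutputFile={prefix}-%05d.png",  # one numbered file per page
        pdf_path,
    ]


def render_pages(pdf_path, prefix="__inputpages", dpi=600):
    """Run Ghostscript; assumes gswin64.exe is installed and on the PATH."""
    subprocess.run(build_gs_render_command(pdf_path, prefix, dpi), check=True)
```

Calling `render_pages("dictionary.pdf", prefix="dict")` would produce dict-00001.png, dict-00002.png and so on. If you change the prefix here, Step 2 has to look for the same prefix.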

Step 2: Generate a text file listing all the __inputpages*.png files in the folder, then run the Tesseract engine against that list so that every image is processed. The command-line parameters to tesseract.exe specify the languages (I have given Tamil and English, in that order) and that the output should be in PDF format. This second batch file requires the input image files to match __inputpages*.png (the names produced by Step 1).

@echo off
echo usage: PDFOCR_STEP2.bat outputfile
rem About: Does OCR for Tamil for each of the images and then combines them into one PDF
rem 1.Tesseract.exe has to be installed and be in the path: https://github.com/tesseract-ocr/tesseract
rem 2.NO input file to be specified and NO .pdf in the output file name
rem 3.Reads the __inputpages*.png files from Step 1 and creates a temp file called __inputfiles.txt

rem /on sorts the list by name so that the pages stay in order
dir /b /on __inputpages*.png >__inputfiles.txt
tesseract -l tam+eng __inputfiles.txt "%~1" pdf

For the 1000+ pages of the dictionary ebook referenced above, this step took about 4-7 hours on my laptop (Core i7, 16 GB RAM, SSD) running Windows 10; the machine had gone into sleep mode a few times in between.
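The two commands of Step 2 can also be sketched in Python, which makes the page ordering explicit. This is an illustration under assumptions, not a replacement for the batch file: the function names `select_page_images`, `build_tesseract_command` and `run_ocr` are hypothetical, and the sketch assumes the zero-padded names from Step 1, for which sorting by name is the same as sorting by page number.

```python
import fnmatch
import os
import subprocess


def select_page_images(names, pattern="__inputpages*.png"):
    """Keep only the page images and return them sorted by name."""
    return sorted(n for n in names if fnmatch.fnmatch(n, pattern))


def build_tesseract_command(list_file, output_base, languages=("tam", "eng")):
    """Return the Tesseract argument list used in Step 2.

    list_file is a text file with one image name per line; output_base is
    the output name without the .pdf extension.
    """
    return ["tesseract", "-l", "+".join(languages), list_file, output_base, "pdf"]


def run_ocr(folder, output_base, list_file="__inputfiles.txt"):
    """Write the list file inside folder and run Tesseract from that folder.

    Assumes tesseract is on the PATH and the Tamil language data is installed.
    A relative output_base is resolved against folder.
    """
    pages = select_page_images(os.listdir(folder))
    if not pages:
        raise FileNotFoundError("no __inputpages*.png files found in " + folder)
    with open(os.path.join(folder, list_file), "w", encoding="utf-8") as f:
        f.write("\n".join(pages) + "\n")
    subprocess.run(build_tesseract_command(list_file, output_base),
                   cwd=folder, check=True)
```

Because the list holds bare file names, Tesseract is started with the image folder as its working directory so that it can find them.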

Step 3: The last step is to use Ghostscript to compress the OCRed PDF into a new, smaller PDF file that is suitable for reading as an ebook.

@echo off
echo usage: PDF2OCT_STEP3.bat Inputfile.pdf OutputFile.pdf
rem About: Compresses a PDF to be an ebook
rem 1.Ghostscript should be installed and be in the path
rem 2.Inputfile.pdf and OutputFile.pdf must be different files

gswin64.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%~2" "%~1"
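The compression level in Step 3 comes from the -dPDFSETTINGS switch. The Python sketch below builds the same command with the preset as a parameter, so that a different size and quality trade-off can be tried. It is a minimal illustration, assuming the standard Ghostscript preset names (screen, ebook, printer, prepress, default); the function name `build_gs_compress_command` and its parameters are hypothetical.

```python
import os

# Preset names accepted by -dPDFSETTINGS, listed roughly from the
# smallest files to the largest; "default" is Ghostscript's general setting.
PRESETS = ("screen", "ebook", "printer", "prepress", "default")


def build_gs_compress_command(input_pdf, output_pdf, preset="ebook",
                              gs_exe="gswin64.exe"):
    """Return the Ghostscript argument list that rewrites a PDF with a preset."""
    if preset not in PRESETS:
        raise ValueError("unknown preset: " + preset)
    # Writing onto the input file would destroy it while it is being read.
    if os.path.abspath(input_pdf) == os.path.abspath(output_pdf):
        raise ValueError("input and output must be different files")
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)`. It is worth checking afterwards that Tamil search still works in the compressed file.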

Scanning

If you are scanning a book yourself (say, an old one), you will probably scan two facing pages at once to finish sooner. While that saves scanning time, the result is inconvenient to read on-screen as a PDF and to run through OCR (as described above with Tesseract). It is better to split the scans into individual pages as in the original book, and to remove the noise and the brown haze from all the pages. There is an awesome tool for this called Scan Tailor; it is open-source and free. The tool requires a bit of learning: you can start with the quick start video or the step-by-step instructions in its wiki. Remember that the tool accepts image files as input and produces image files as output.

The other day I was trying to scan (to preserve) an 80-year-old book of over 150 pages. I got all the pages scanned, but each scan held two facing pages and looked unimpressive. I used Scan Tailor to split the pages (from two facing pages to single pages), straighten them, despeckle them (remove stray marks and the brown paper background outside the text and image areas), and then output clean, sharp-looking pages.

Above is a sample: on top is the scanned page I started with, and at the bottom are the two output pages after using Scan Tailor.
