A few days ago, my friend Mani Manivannan posted on his Facebook page that the book "தமிழ்- தமிழ் அகரமுதலி" (a Tamil-Tamil dictionary), published in 1985 by the Tamil Nadu Textbook Corporation, Chennai, is available for free as an ebook, and he shared the link. It is a useful book, so I downloaded it right away.
The Tamil Virtual Academy has scanned and copied this book well. There is just one shortcoming: the ebook cannot be searched in Tamil. That is because Tamil support has only recently arrived in the readily available optical character recognition (OCR) tools. So I ran OCR on the downloaded ebook with a free tool called Tesseract and have published the result as a new upload. I explain below how I did it.
Recently, my friend Mr Mani Manivannan shared a link to a free PDF ebook of a Tamil-Tamil dictionary published in 1985 by the Tamil Nadu Text Book and Educational Services Corporation. Unfortunately, the file was not searchable in Tamil, because OCR tools that can recognize Tamil text have become widely available only in recent months.
I tried to fix it. I downloaded the ebook (a PDF file) from Archive.org and then ran OCR on it with the open-source tool Tesseract, a wonderful tool that I had talked about at the Tamil Internet Conference 2019. Tesseract reads each page as a picture and performs character recognition; that is, it converts the text it finds in the image into a string of Unicode text. The OCR accuracy is not 100%, but it is certainly usable. Tesseract then attaches the Unicode text to the image invisibly, with a marker indicating the location it was read from, a technique known as embedding text in a PDF. Finally, it writes the image and the associated Unicode text for every page into a new output PDF file.
Below are the three batch files I wrote to run on Windows 10; together they perform the necessary steps.
Step 1: Tesseract accepts only image files, such as PNG, JPG, GIF or TIFF, as input. So we need to extract every page of the PDF file as an individual image. I use the Ghostscript utility to do this; the download URL is given in the comments of the batch file. The batch file creates image files with the fixed name pattern __inputpages*.png; you may make the batch file generic by accepting the output filename as a command-line parameter.
@echo off
echo usage: PDF2OCT_STEP1.bat inputfile.pdf
rem About: Generates an image for each page in a PDF
rem 1.Ghostscript should be installed and the folder be in the path
rem 2.Ghostscript: https://www.ghostscript.com/download/gsdnld.html
rem 3.Creates temp files called __inputpages*.png
rem %%05d is the batch-file escape for %05d, the zero-padded page number
gswin64.exe -dNOPAUSE -dBATCH -r600 -sDEVICE=png16m -sOutputFile="__inputpages-%%05d.png" "%~1"
The dictionary ebook referenced above has about 1000 pages, and this step took about 40 minutes on my Core i7 laptop (16 GB RAM, SSD) running Windows 10.
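Step 1 suggests making the output name a parameter. Below is a minimal Python sketch of that idea, not the batch file I actually ran: it only builds the same Ghostscript command with a configurable prefix. The function name and the `prefix`, `dpi` and `gs_exe` parameters are assumptions made for illustration.

```python
import subprocess


def ghostscript_png_command(input_pdf, prefix="__inputpages", dpi=600,
                            gs_exe="gswin64.exe"):
    """Build the Ghostscript argument list that renders each PDF page to a PNG.

    The function name and its parameters are hypothetical; the Ghostscript
    flags are the ones used in the Step 1 batch file.
    """
    # Ghostscript expands %05d to the zero-padded page number.
    # The %% doubling is only needed inside a batch file, not here.
    return [
        gs_exe,
        "-dNOPAUSE",
        "-dBATCH",
        f"-r{dpi}",
        "-sDEVICE=png16m",
        f"-sOutputFile={prefix}-%05d.png",
        input_pdf,
    ]


def extract_pages(input_pdf, prefix="__inputpages"):
    """Run Ghostscript; it must be installed and on the PATH."""
    subprocess.run(ghostscript_png_command(input_pdf, prefix), check=True)
```

Calling `extract_pages("dictionary.pdf", "dict")` would then write dict-00001.png, dict-00002.png and so on. If the prefix is changed, the file pattern used in Step 2 has to be changed to match.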
Step 2: Generate a text file listing all the __inputpages*.png files in the folder, then run the Tesseract engine against that list. The command-line parameters to tesseract.exe specify the languages (I have given Tamil and English, in that order) and that the output should be in PDF format. This second batch file requires the input image files to be named __inputpages*.png (the same as the output from Step 1).
@echo off
echo usage: PDFOCR_STEP2.bat outputfile
rem About: Does OCR for Tamil for each of the images and then combines them into one PDF
rem 1.Tesseract.exe has to be installed and be in the path: https://github.com/tesseract-ocr/tesseract
rem 2.NO input file to be specified and NO .pdf in the output file name
rem 3.Reads the files called __inputpages*.png and creates a temp file called __inputfiles.txt
dir /b __inputpages*.png >__inputfiles.txt
tesseract -l tam+eng __inputfiles.txt %1 pdf
For the 1000+ pages of the dictionary ebook referenced above, this step took about 4-7 hours (my machine went into sleep mode a few times in between) on my Core i7 laptop (16 GB RAM, SSD) running Windows 10.
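The list-then-OCR logic of Step 2 can also be sketched in Python. This is an illustration under assumptions, not the script I ran: the function names are hypothetical, and the sketch sorts the file names explicitly instead of relying on the order of the directory listing. The Tesseract arguments are the same as in the batch file.

```python
import fnmatch
import os


def select_pages(names, pattern="__inputpages*.png"):
    """Return the names matching the pattern, sorted so pages stay in order."""
    # The zero-padded numbers from Step 1 make a plain sort equal to page order.
    return sorted(n for n in names if fnmatch.fnmatch(n, pattern))


def write_page_list(folder=".", list_file="__inputfiles.txt"):
    """Write one image name per line, like `dir /b` does in the batch file."""
    pages = select_pages(os.listdir(folder))
    with open(os.path.join(folder, list_file), "w", encoding="utf-8") as f:
        f.writelines(page + "\n" for page in pages)
    return pages


def tesseract_command(output_base, list_file="__inputfiles.txt",
                      langs="tam+eng"):
    """Build the argument list for the Tesseract call used in Step 2."""
    # Tesseract appends ".pdf" itself, so output_base has no extension.
    return ["tesseract", "-l", langs, list_file, output_base, "pdf"]
```

The names in the list file are relative, so Tesseract has to be started from the folder that holds the images, for example with `subprocess.run(tesseract_command("dictionary"), check=True, cwd=folder)`.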
Step 3: The last step is to use Ghostscript to compress the PDF file into a new PDF file that is suitable for reading as an ebook.
@echo off
echo usage: PDF2OCT_STEP3.bat Inputfile.pdf OutputFile.pdf
rem About: Compresses a PDF to be an ebook
rem 1.Ghostscript should be installed and be in the path
gswin64.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile="%~2" "%~1"
If you are scanning an old book, say, you will probably scan two facing pages at once to finish quicker. While that saves scanning time, the result is inconvenient to read on-screen as a PDF and to run through OCR (as described above with Tesseract). It is better to split the scans into individual pages, as in the original book, and to remove the noise and the brown haze from all the pages. There is an awesome tool for this called Scan Tailor; it is open-source and free. The tool requires a bit of learning: you can start with the quick start video or the step-by-step instructions in its wiki. Remember that the tool accepts image files as input and produces image files as output.
The other day I was trying to scan (to preserve) an 80-year-old book of over 150 pages. I got all the pages scanned, but each scan held two facing pages and looked unimpressive. I used Scan Tailor to split the pages (from two facing pages into single pages), straighten them, despeckle them (remove the areas that are neither text nor image, like the brown paper background), and then output clean, sharp-looking pages.
Above is a sample: at the top is the scan of two facing pages that I started with, and at the bottom are the two output pages after using Scan Tailor.
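To make the page-splitting idea concrete, here is a deliberately naive Python sketch. It is not how Scan Tailor works; it simply cuts a two-page scan down the middle, optionally leaving out a strip around the binding. The function name and the `gutter` parameter are assumptions made for illustration.

```python
def split_facing_pages(width, height, gutter=0):
    """Return crop boxes (left, upper, right, lower) for the two pages.

    A naive midline split: it assumes the binding runs exactly down the
    centre of the scan. `gutter` is the total width in pixels to discard
    around the binding, half from each page.
    """
    middle = width // 2
    half_gutter = gutter // 2
    left_box = (0, 0, middle - half_gutter, height)
    right_box = (middle + half_gutter, 0, width, height)
    return left_box, right_box
```

The boxes use the (left, upper, right, lower) order that image libraries such as Pillow expect for cropping. Real scans are rarely centred or straight, which is why a dedicated tool like Scan Tailor is the better choice for a whole book.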