Those who have been following my posts will have spotted that occasionally I discuss something less technical. If that sort of thing bores you – look away now.
Everyone I’ve ever met who’s been in IT for any length of time – whether it’s as a technician, a sysadmin or a helpdesk operator – knows that this is a fast-moving industry and sometimes businesses get left behind.
Whether that’s the server that for some reason is still running Exchange 5.5, the PC with an IBM logo on the front that’s still running Windows 2000 or the sudden, urgent need to restore a backup from some obscure tape format that we thought had died out circa 2001.
And we get to pick up the pieces.
There’s a simple reason for this: as a profession, we’re fantastically good at spending money. We can easily spend half an hour on Dell’s website and our employer walks away £thousands lighter.
However, we’re fantastically bad at explaining why we’re spending the money or what benefit it’ll bring. Few of us buy a new car when the old one still meets our needs and it’s still economical to maintain, yet we provide equipment that’s more-or-less maintenance free and expect our employers to replace it while it still meets their needs just fine.
Upshot? We get to explain that yes, you can still buy Exchange. But no, you can’t easily upgrade the fifteen-year-old server in the corner to the latest version.
Solution? Explain what you want in terms the business will understand: it should either make money, save money or reduce risk. If you can’t think of at least one good reason based on one of these three, you probably shouldn’t be recommending the solution in the first place.
Most servers ship with at least two network ports, often more.
And yet so often we plug one of them into a switch and ignore the other one. We’ve paid for an expensive server with very capable networking, and now we’re going to ignore half of its capabilities. Meanwhile, we’re asking it to do more and more. Sooner or later, that gigabit network port is the bottleneck.
Why not use both network ports simultaneously?
There are various ways to set this up. Some require special configuration of managed switches; some don’t. For this blog post, I’m going to concentrate on methods that don’t require special switch configuration because they’re a little bit easier and they’re somewhat less fragile – you don’t risk your network collapsing in a big heap just because someone plugged a network port into the wrong socket. (We also get the added bonus that if one of the two network ports in our server fails, it’ll still work, albeit more slowly. But I can’t remember the last time I saw a network port fail…)
These instructions are written purely for Debian Squeeze. You may need minor tweaks to use them in Ubuntu; you’ll almost certainly need significant changes to use them in other distributions.
First, install the ifenslave package:
apt-get install ifenslave
Configuration is just a few lines in /etc/network/interfaces:
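The sketch below is one way of doing it rather than the only way – the addresses are placeholders for your own, I’ve assumed the two ports are eth0 and eth1, and balance-alb mode spreads traffic over both ports without needing any switch-side configuration:

# A sketch, not a drop-in config - adjust addresses and interface names
# (assumed here to be eth0 and eth1) for your own network.
auto bond0
iface bond0 inet static
        address 192.168.0.10
        netmask 255.255.255.0
        gateway 192.168.0.1
        bond-slaves eth0 eth1
        # balance-alb uses both ports with no switch-side setup
        bond-mode balance-alb
        bond-miimon 100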
All that’s left to do now is write a script that will:
Detect when a new file’s been uploaded.
Turn it into a searchable PDF with OCR.
Put the finished PDF in a suitable directory so we can easily browse for it later.
This is actually pretty easy. inotifywait(1) will tell us whenever a file’s been closed, so we can use that as our trigger to OCR the document.
Our script is therefore in two parts:
Part 1: will watch the /home/incoming directory for any files that are closed after writing.
Part 2: will be called by the script in part 1 every time that happens.
Part 1
This script lives in /home/scripts and is called watch-dir.
#!/bin/bash

# Watch /home/incoming and hand each newly-written file to process-image.
INCOMING="/home/incoming"

# Directory this script lives in (so we can find process-image next to it).
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

inotifywait -m --format '%:e %f' -e CLOSE_WRITE "${INCOMING}" 2>/dev/null | while read LINE
do
    # inotifywait gives us "EVENT filename"; strip the event to get the filename.
    FILE="${INCOMING}"/`echo ${LINE} | cut -d" " -f2-`
    "${DIR}"/process-image "${FILE}" &
done
Part 2
This script lives in /home/scripts and is called process-image.
#!/bin/bash

# Dead easy - at least in theory!
# Take a single argument - filename of the file to process.
# Do all the necessary processing to make it a
# searchable PDF.

OUTFILE="`basename "${1}"`"
TEMPFILE="`mktemp`"

if [ -s "${1}" ]
then
    # We use the first part of the filename as a classification.
    CLASSIFICATION=`echo ${OUTFILE} | cut -f1 -d"-"`
    OUTDIR="/home/http/documents/${CLASSIFICATION}/`date +%Y`/`date +%Y-%m`/`date +%Y-%m-%d`"
    if [ ! -d "${OUTDIR}" ]
    then
        mkdir -p "${OUTDIR}" || exit 1
    fi
    # We have to move our file to a temporary location right away because
    # otherwise pdfsandwich uses the file's own location for
    # temporary storage. Well and good - but the file's location is
    # subject to an inotify that will call this script!
    mv "${1}" "${TEMPFILE}" || exit 1
    # Have we a colour or a mono image? Probably quicker to find out
    # and process accordingly rather than treat everything as RGB.
    # We assume the first page is representative of everything.
    COLOURDEPTH=`convert "${TEMPFILE}[0]" -verbose -identify /dev/null 2>/dev/null | grep "Depth:" | awk -F'[/-]' '{print $2}'`
    if [ "${COLOURDEPTH}" -gt 1 ]
    then
        SANDWICHOPTS="-rgb"
    fi
    pdfsandwich ${SANDWICHOPTS} -o "${OUTDIR}/${OUTFILE}" "${TEMPFILE}" > /dev/null 2>&1
    rm "${TEMPFILE}"
fi
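Before moving on, make sure the tools these scripts lean on are installed (if you haven’t already dealt with them in an earlier part) and that both scripts are executable:

# inotifywait comes from inotify-tools, convert from imagemagick
apt-get install inotify-tools imagemagick
chmod +x /home/scripts/watch-dir /home/scripts/process-image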
There’s just one thing missing: pdfsandwich. This is actually something I found elsewhere on the web. It hasn’t made it into any of the major distro repositories as far as I can tell, but it’s easy enough to compile and install yourself. Find it here.
Run /home/scripts/watch-dir every time we boot – the easiest way to do this is to include a line in /etc/rc.local that calls it:
/home/scripts/watch-dir &
Get it started now (unless you were planning on rebooting):
nohup /home/scripts/watch-dir &
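You can give the pipeline a quick test by hand before scanning anything – the filename below is just an example that follows the Classification-whatever naming convention process-image expects:

# Drop any PDF into the watched directory under a classification name...
cp sample-scan.pdf /home/incoming/Correspondence-test.pdf
# ...and the OCR'd copy should appear under today's date shortly afterwards.
ls /home/http/documents/Correspondence/`date +%Y`/`date +%Y-%m`/`date +%Y-%m-%d`/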
Now you should be able to scan in documents; they’ll be automatically OCR’d and made available on the internal website you set up in part 3.
Further enhancements are left to the reader; suggestions include:
Automatically notifying sphider-plus to reindex when a document is added. (You’ll need a newer version of sphider-plus to do this. Unfortunately there is a cost associated with this, but it’s pretty cheap. Get it from here).
There is a bug in pdfsandwich (actually, I think the bug is probably in tesseract or hocr2pdf, both of which are called by pdfsandwich): under certain circumstances that I haven’t been able to pin down, you’ll find that in the finished PDF one page of a multi-page document shows only the OCR’d layer, not the original scan. Track down this bug, fix it and notify the maintainer of the appropriate package so that the upstream package can also be fixed.
This isn’t terribly good for bulk scanning – if you want to scan in 50 one-page documents, you have to scan them individually, otherwise they’ll be treated as a single 50-page document. Edit the script so we can somehow tell it that certain documents should be split into their constituent pages, and store the resulting PDFs accordingly.
Like all OCR-based solutions, this won’t give you a perfect representation of the source text in the finished PDF. But I’m quite sure the accuracy can be improved, very likely without having to make significant changes to how this operates. Carry out some experiments to figure out optimum settings for accuracy and edit the scripts accordingly.
There are two components to this part: configuring somewhere for the files to be uploaded to, and setting up your MFD to upload to it. Most modern MFDs will upload to a CIFS share, which is what we’re going to use here. First things first, we need to install Samba:
apt-get install samba
Now we need to set up Samba. We’ll have user-level security (it’ll be much easier to lock things down if we want to increase security at a later date, and besides, share-level security went out with the Ark) and a single share called incoming. We also need a user for the MFD to log into Samba with; we’ll call this user “scanner”. We’ll also have a group called “scanner” so we can be a little more flexible over who can access this share should we wish.
Edit /etc/samba/smb.conf as follows:
......
# "security = user" is always a good idea. This will require a Unix account
# in this server for every user accessing the server. See
# /usr/share/doc/samba-doc/htmldocs/Samba3-HOWTO/ServerType.html
# in the samba-doc package for details.
security = user
......
[incoming]
path = /home/incoming
guest ok = no
browseable = no
read only = no
valid users = @scanner
Now, we need a new user for the MFD. Samba requires that users also have corresponding Unix accounts, so first we create a Unix account, then we set their Samba password. We also need to ensure the permissions on /home/incoming are correct – the following commands deal with this:
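Something along these lines does the job – the adduser flags are just one reasonable choice, and 770 on the directory assumes only root and the scanner group ever need to touch it:

# Create the group and the user the MFD will log in as, then give the
# account a Samba password.
addgroup scanner
adduser --ingroup scanner --disabled-login --gecos "MFD scanner" scanner
smbpasswd -a scanner
# Let members of the scanner group write to the share.
chown root:scanner /home/incoming
chmod 770 /home/incoming
# Pick up the changes we made to smb.conf earlier.
/etc/init.d/samba reload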
Make sure you choose a password that is not only secure, but possible to type in on your MFD! Check this works by connecting to the following folder in Windows:
\\(hostname)\incoming
You’ll need to use the username/password for the scanner user you set up.
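If you’d rather sanity-check from the Linux box itself first, smbclient (in the smbclient package) does the same job:

apt-get install smbclient
smbclient //localhost/incoming -U scanner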
For the final part of this, you need to set up your MFD to scan to this directory.
I’ve chosen an Oki MB451 multifunction unit for a number of reasons:
It’s cheap.
It has a double-sided document feeder for scanning. More and more documents are being sent double-sided; it seems like a step back to have a document feeder that can’t deal with this.
It supports scanning directly to email and to a CIFS share without requiring extra software on the PC. (This is important: a few years ago a lot of manufacturers claimed their products could do this, but it wasn’t apparent until you’d taken the unit out of the box that none of it worked without additional software on your PC. Certain large photocopier-type units still have this restriction, though sometimes you can buy an optional bolt-on to overcome it. I prefer to avoid extra bolt-ons because they’re usually extortionately priced and often difficult to source.)
It has a nice big display. These units can be a pig to set up at the best of times; a large display often goes some way to alleviate this problem.
You can set up lots of profiles – preconfigured shortcuts that say “everything scanned under this profile should be stored under this name in this share accessed with this username and password; files should have this format”. Unfortunately you can’t nail a profile to say “everything scanned under this profile is double-sided” but you can’t have everything!
The printer supports Postscript, which means it’ll be pretty much guaranteed to work under any OS I can throw at it for a long time to come.
I won’t go into detail regarding MFD configuration – there are simply too many models on the market and they all vary. It’s enough to say that I’ve set up a profile called “Correspondence” and pointed it at \\(hostname)\incoming.
With the profile I’ve set up, scanned documents will be stored under \\(hostname)\incoming\Correspondence-#####.pdf.
Test this all works by scanning a document and making sure it appears in the /home/incoming directory on your Linux box.
There’s only one thing left to do – tie all this together so incoming documents are automatically OCR’d, made available via Apache and indexed so they’re searchable in Sphider….
This is Part 4: Indexing the storage. Indexing in this context is the process of making the storage searchable – so we can have a simple text box we type search terms into and get results. We’re not talking about the Apache index we set up in Part 3.
There are all sorts of free projects that do this, but alas most of them just provide a library for you to integrate into your own programming project. We don’t want a library; we want a pre-cooked system including a database, indexing and an interface we can punch words into and get search results back. The product I’ve found is called Sphider Plus, a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online – as it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because it’s cheap enough and the newer version lists some interesting features.
Download and unzip sphider-plus:
cd /home/http/search
unzip /path/to/sphider-plus.zip
Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:
mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit
In settings/database.php, change the following lines as appropriate:
$database="sphider-plus";
$mysql_user = "sphider-plus";
$mysql_password = "(PASSWORD YOU USED ABOVE)";
$mysql_host = "localhost";
$mysql_table_prefix = "";
We want to be able to index PDFs, so edit the line in settings/conf.php:
//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';
Edit /home/http/search/converter/pdftotext as follows:
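A thin wrapper that hands everything straight to the system pdftotext (from poppler-utils) is enough here – something like this, adjusting the path if yours lives elsewhere, and remembering to chmod +x the file afterwards:

#!/bin/sh
# Pass all arguments straight through to the real pdftotext.
exec /usr/bin/pdftotext "$@"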
Point your web browser at http://(host name).local/search and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface, add http://(host name).local as the site to index and set it spidering. Assuming everything works, you should now be able to search for PDFs containing text at http://(host name).local/search/search.php. Look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues, and configure Sphider Plus to your liking.
You now have a search interface! But it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:
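In essence it’s just a small form that hands the query off to Sphider’s search page – something along these lines (the field names follow stock Sphider, so double-check them against your copy of search.php if searches come back empty):

<form method="get" action="/search/search.php">
  <!-- search terms go in the "query" field; "search" triggers the search -->
  <input type="text" name="query" size="30" />
  <input type="hidden" name="search" value="1" />
  <input type="submit" value="Search" />
</form>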
If you point your browser at http://(hostname).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this. However, right now all we have is an index of a single PDF we put up for testing. We need some means of scanning pages and uploading them to this server – the subject of our next post….