Document Storage Project

Part 1: What we’re doing

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon

We now have:

A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
Directories for our scripts, our static HTML, the document repository, incoming files for OCR’ing and scripts to carry out the OCR work.
Most of the software we’re going to need. There’s one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

To configure Apache to act as our file browser.
To integrate search functionality.
To sort out the scripts that are going to OCR incoming files.

James Cort

James Cort is Managing Director of Bediwin Information Services, providing IT management and integration services in the South West of England.

Document Storage: Part 2

Document Storage Project

Related Posts

James Cort