Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  # "svn export" rather than "svn co": no hidden .svn directories are created, so rmdir succeeds
  svn export http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* .
  rmdir index-style

The stylesheet really ought to be referenced with a specific path so it can always be found. Edit /home/http/assets/header.html and change the stylesheet reference so that its href points under /assets/, the URL at which Apache will serve /home/http/assets.

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.
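
A cautious way to do the restart is to have Apache check the edited file first, so that a typo doesn’t take the site down. This is only a minimal sketch: it assumes the stock Debian apache2ctl and service commands and a root shell, and the function name is made up for illustration.

```shell
# Hypothetical helper: only restart Apache if the configuration parses cleanly.
# Assumes Debian's apache2ctl and service commands, run as root.
check_and_restart_apache() {
    apache2ctl configtest && service apache2 restart
}
```

Call it as check_and_restart_apache. If configtest reports a syntax error, the restart is skipped and you can fix the file before trying again.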

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see a styled directory listing with your file in it.

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But those are topics for a future post…

Document Storage: Part 2

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; reserve 15-20 GB for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
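
If you’d rather not create these by hand, something along these lines should do it in one go. It’s a sketch rather than the definitive method: the function name is my own invention, and the base directory is passed in as an argument so you can try it somewhere harmless first.

```shell
# Hypothetical helper: create the directory layout listed above under a base directory.
make_storage_dirs() {
    base="$1"
    # -p creates missing parents (e.g. http) and is harmless if a directory already exists
    mkdir -p "$base/http/assets" "$base/http/search" "$base/http/documents" \
             "$base/scripts" "$base/incoming"
}
```

For the real thing, run make_storage_dirs /home as root.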

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our static HTML, the search interface, the document repository, incoming files awaiting OCR and the scripts that carry out the OCR work.
  3. Most of the software we’re going to need. There are one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.

Project: Document Storage Made Simple

I’ve recently identified a need to set up a means of indexing and browsing documents. This is a great little project to demonstrate how we can tie lots of tools together with Linux, so I’m going to write up what I’m doing as I go along.

The needs are pretty simple, but I haven’t found anything out of the box that suits. In particular, I’ve got the following needs:

  • This is only for a couple of people, so it doesn’t need the complexity associated with a full-blown document management system.
  • It does need to index PDFs – including PDFs that have been generated from scanned-in pages rather than from an office suite.
  • Scanned PDFs may not have had any sort of OCR process applied to them – but that doesn’t mean I don’t want to be able to search for them!
  • It needs to be able to do this with minimal interaction – as a rule of thumb, if it’s even conceivably possible to automate part of the process, that part of the process must be automated. I can think of better things to do with my time than click “Next…”
  • Must be able to interact with the system via a web browser.
  • Anything running on the public Internet is out – most of the information I’m scanning in has no business being anywhere near the public Internet.
  • Must be dead easy to back up. Anything that involves databases, Tomcat etc. is probably far too complicated.
  • Budget: About £250+VAT for an MFD that has a duplexing scanner unit. Other than that: £0. Most multifunction devices come with software that will OCR scanned files and index them, but further investigation suggests it usually fails the web-based and the “minimal interaction” requirements. I have a spare computer sitting around that I can use, but I can’t justify spending a fortune on software. That may change in the future, but it’s what we’ve got now.

So, here’s the question: Can I do it? Read on….

The power of “Why?”

I’m going to steer away from the very technical “how-to” type things I’ve written in the past and instead give a little bit of job advice to anyone who finds themselves in a technical role for the first time.

Sooner or later, we all have to deal with technical support-type questions.

It’s very tempting in these cases to take everything you’re told at face value and ask simple yes/no questions for more detail. On the face of it, this makes some sense – they can be easy to understand, quick to answer and get you to the root cause very quickly.

I would argue that they’re terrible questions. Yes, sometimes you get useful answers, but as often as not you get:

  • Answers that are downright wrong. Maybe the customer misunderstood the question, maybe they didn’t understand it at all but were afraid to admit ignorance. 
  • Answers that aren’t wrong, but aren’t terribly helpful. Example: “No, I haven’t seen any error messages” (but considering my computer hasn’t actually got as far as logging me in that shouldn’t be terribly surprising).
  • Answers that draw you into an argument. Example: “I’ve already told you what the problem is, now are you going to fix it?!”

Instead, try “Why?”. “Why do you think you’ve got a virus?” “Why are you having trouble with the website?” It forces your customer to elaborate and drastically reduces the risk of confrontation.

 

Centralising logs for fun and profit

It’s one of those things that usually gets pushed to the back burner because it seems like too much work for too little gain: setting up a central syslog server which all your other systems can report back to.

This is a shame, because there are lots of benefits to having such a server:

  • You can analyse what’s going on in your network from a single, central location – saving you from having to log into a variety of devices for troubleshooting.
  • Improved security – if you have a security breach, the offender has to break into the logging server as well if they’re to cover their tracks properly. (I wouldn’t recommend re-purposing an existing server for precisely this reason – you want your syslog server to be as secure as possible, which means it needs to be running as few services as possible).
  • You only need to remember one set of tools to manage logs from a range of devices. Most routers will happily send logs back to a remote syslog server; there are also third-party products you can install on Windows.

It’s trivially easy to set this up in any reasonably modern Linux distribution. Once again, I’m going to use Debian for this example.

Out of the box, Debian uses rsyslog and stores the configuration file in /etc/rsyslog.conf. Fortunately, the default configuration only needs minor changes to two lines as shown in this excerpt:

# provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514

Uncomment the lines beginning $ModLoad and $UDPServerRun by removing the # symbols:

# provides UDP syslog reception
$ModLoad imudp
$UDPServerRun 514
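
If you have several servers to set up, or just don’t fancy opening an editor, the same change can be scripted. Treat this as a sketch: the function name is hypothetical, it relies on the in-place (-i) option of GNU sed as shipped with Debian, and it assumes the two lines appear exactly as in the excerpt above.

```shell
# Hypothetical helper: uncomment the two UDP reception lines in an rsyslog config file.
# Relies on GNU sed's -i (edit in place) option.
enable_udp_reception() {
    # \$ matches a literal dollar sign; \1 puts the line back without its leading #
    sed -i \
        -e 's/^#\(\$ModLoad imudp\)/\1/' \
        -e 's/^#\(\$UDPServerRun 514\)/\1/' \
        "$1"
}
```

On the logging server that would be enable_udp_reception /etc/rsyslog.conf, run as root. It’s worth taking a copy of the file first.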

Restart rsyslogd (service rsyslog restart) and… that’s it. Done.

Well, that’s not quite it. A remote syslog server isn’t much good unless you have equipment sending logs to it. On any other Debian servers you may have, this is just a matter of adding a line to /etc/rsyslog.conf:

*.* @192.168.42.39:514

(substitute your own logging server’s IP address or hostname for 192.168.42.39).
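
Here’s one way the line could be added from a script without ending up with duplicates if it gets run twice. It’s a sketch under assumptions: the function name is invented for illustration, and the port is fixed at 514 to match the example. The single @ means the logs are forwarded over UDP, which is what the imudp listener on the server expects.

```shell
# Hypothetical helper: add a forwarding rule to an rsyslog config file, once only.
# A single @ forwards over UDP, matching the imudp listener on the logging server.
add_log_forwarder() {
    conf="$1"
    rule="*.* @$2:514"
    # -x matches whole lines, -F treats the rule as plain text rather than a regex
    grep -qxF -- "$rule" "$conf" || printf '%s\n' "$rule" >> "$conf"
}
```

For example: add_log_forwarder /etc/rsyslog.conf 192.168.42.39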

Restart rsyslog on the server that will be sending logs to your remote syslog server. Now when you check your remote syslog server, you should find logs appearing from both itself and anything else that’s configured to send logs to it.

Advanced Tweaks

Once you’ve got this done, there are all sorts of things you can add. You can separate logfiles according to the host that generated them, you can have new logfiles created every day with an appropriate filename… or you can just stick with the basic configuration which will put everything in the same set of log files and just use grep to separate the interesting information.
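
To give a flavour of the per-host, per-day idea, here’s a sketch that writes a small rsyslog snippet using the same legacy directive style as the excerpts above. The /var/log/remote location, the PerHostDaily template name and the function name are all assumptions for illustration, not anything Debian sets up for you.

```shell
# Hypothetical helper: write an rsyslog snippet that files messages per host, per day.
# The /var/log/remote path and the PerHostDaily name are assumptions for illustration.
write_per_host_rules() {
    # The quoted 'EOF' stops the shell from expanding $template, $YEAR and friends
    cat > "$1" <<'EOF'
$template PerHostDaily,"/var/log/remote/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log"
*.* ?PerHostDaily
EOF
}
```

You might call it as write_per_host_rules /etc/rsyslog.d/remote.conf and then restart rsyslog. As written, messages should still be handled by the existing rules as well, so they will also turn up in the usual log files.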

Whatever you do, keep a sharp eye on disk space on your logfile server. Logs can grow very large very quickly, and a syslog server with a full disk won’t log anything at all.
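
A quick way to keep that eye on disk space is a small check you can run by hand or from cron. Again, this is a sketch: the function name is made up, and it assumes the filesystem’s device name contains no spaces (otherwise the columns shift).

```shell
# Hypothetical helper: print the percentage of space used on the filesystem holding a path.
# df -P prints one POSIX-format line per filesystem; column 5 is the capacity, e.g. "42%".
disk_used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, [ "$(disk_used_percent /var/log)" -gt 90 ] && echo "log disk nearly full" would warn you once the partition passes 90%. The threshold is arbitrary; pick whatever gives you enough time to react.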
