Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* .
  # the checkout leaves a hidden .svn directory behind, so a plain rmdir would fail
  rm -rf index-style

The stylesheet really ought to be referenced with a specific path so it can always be found. Edit /home/http/assets/header.html and change the stylesheet reference thus:

    

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.
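
On a Debian box like this one, it is worth checking that the edited file still parses before restarting, so a typo doesn't take the web server down. The following is a minimal sketch, assuming the stock apache2ctl and service commands are installed and that you are running as root; restart_apache is just a hypothetical helper name.

```shell
# Restart Apache only if the edited configuration passes a syntax check.
# restart_apache is a hypothetical helper name; run it as root.
restart_apache() {
    apache2ctl configtest && service apache2 restart
}

# Usage: restart_apache
```

If the syntax check fails, the restart is skipped and the error from apache2ctl should point at the offending line.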

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see something a bit like this:

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But they’re a topic for a future post…


Document Storage: Part 2

Document Storage Project

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
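
The layout above can be created in one go. This is a minimal sketch, assuming a POSIX shell and root privileges; make_layout is a hypothetical helper name, and the base directory is a parameter so the function can be tried somewhere harmless first.

```shell
# Create the directory layout under a base directory (default: /home).
# make_layout is a hypothetical helper name, not part of any package.
make_layout() {
    base="${1:-/home}"
    mkdir -p \
        "$base/http/assets" \
        "$base/http/search" \
        "$base/http/documents" \
        "$base/scripts" \
        "$base/incoming"
}

# Usage (as root): make_layout /home
```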

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our scripts, our static HTML, the document repository, incoming files for OCR’ing and scripts to carry out the OCR work.
  3. Most of the software we’re going to need. There are one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.


Project: Document Storage Made Simple

I’ve recently identified a need to set up a means of indexing and browsing documents. This is a great little project to demonstrate how we can tie lots of tools together with Linux, so I’m going to write up what I’m doing as I go along.

The needs are pretty simple, but I haven’t found anything out of the box that suits. In particular, I’ve got the following needs:

  • This is only for a couple of people, so it doesn’t need the complexity associated with a full-blown document management system.
  • It does need to index PDFs – including PDFs that have been generated from scanned in pages rather than from an office suite.
  • Scanned PDFs may not have had any sort of OCR process applied to them – but that doesn’t mean I don’t want to be able to search for them!
  • It needs to be able to do this with minimal interaction – as a rule of thumb, if it’s even conceivably possible to automate part of the process, that part of the process must be automated. I can think of better things to do with my time than click “Next…”
  • Must be able to interact with the system via a web browser.
  • Anything running on the public Internet is out – most of the information I’m scanning in has no business being anywhere near the public Internet.
  • Must be dead easy to back up. Anything that involves databases, Tomcat etc. is probably far too complicated.
  • Budget: About £250+VAT for an MFD that has a duplexing scanner unit. Other than that: £0. Most multifunction devices come with software that will OCR scanned files and index them, but further investigation suggests it usually fails the web-based and the “minimal interaction” requirements. I have a spare computer sitting around that I can use, but I can’t justify spending a fortune on software. That may change in the future, but it’s what we’ve got now.

So, here’s the question: Can I do it? Read on….

The power of “Why?”

I’m going to steer away from the very technical “how-to” type things I’ve written in the past and instead give a little bit of job advice to anyone who finds themselves in a technical role for the first time.

Sooner or later, we all have to deal with technical support-type questions.

It’s very tempting in these cases to take everything you’re told at face value and ask simple yes/no questions for more detail. On the face of it, this makes some sense – they can be easy to understand, quick to answer and get you to the root cause very quickly.

I would argue that they’re terrible questions. Yes, sometimes you get useful answers, but as often as not you get:

  • Answers that are downright wrong. Maybe the customer misunderstood the question, maybe they didn’t understand it at all but were afraid to admit ignorance. 
  • Answers that aren’t wrong, but aren’t terribly helpful.  Example: “No, I haven’t seen any error messages” (but considering my computer hasn’t actually got as far as logging me in that shouldn’t be terribly surprising).
  • Answers that draw you into an argument. Example: “I’ve already told you what the problem is, now are you going to fix it?!”

Instead, try “Why?”. “Why do you think you’ve got a virus?” “Why are you having trouble with the website?” It forces your customer to elaborate and drastically reduces the risk of confrontation.

 


Sending Test Emails with Telnet

I’d like to talk quickly about a great and underutilized method for troubleshooting email flow problems.  Today I had to rebuild an Exchange Hub Transport server after a slight catastrophe from last week in which the VM the Hub lived on was completely unrecoverable.  That is another story but it brings up the need for using a great tool that is often skipped over, and that is sending test email via telnet.

The reason I say that this method is underutilized is because, well, who uses telnet these days? What’s great about it is that you can test different aspects of mail flow and essentially pinpoint where issues are occurring. In my case I was having trouble relaying email from an internal account to outside mail servers. So let’s jump into how to use this tool. It’s easy, but I feel like not enough people know about it, so here we go.

First, since I was testing from inside, I needed to connect to the server by its local name.

telnet hubserver.psa.local 25

Easy enough, we are using telnet to connect to the hub server, hubserver.psa.local on port 25 (SMTP).  Once we get in we run a simple,

ehlo

That gives us back a little bit of information, basically telling us that this is an email server and some of its capabilities.  Next, we will need to run through the following set of commands to send out the test email.  It is important that these commands are entered in exactly, with no backspaces, otherwise it will break the command and you will get an error message spit back out from your telnet session.

MAIL FROM: [email protected]
RCPT TO: [email protected]
DATA
SUBJECT:

message content.

.
QUIT

  • MAIL FROM: tells the mail server who the message is being sent from.
  • [email protected] is the internal mail sender I was using.
  • RCPT TO: tells the mail server the address the message is being sent to.
  • [email protected] is the address we are sending to. It can be any of your internet-based mail addresses (Google, Yahoo, etc.).
  • DATA signals the start of the message content (headers first, then the body).
  • SUBJECT: is optional, but it is probably a good idea to include a subject so the message doesn’t get blocked or sent to spam. Hit Enter twice after this to drop into the message content.
  • message content is whatever you want to include in your message. Follow your message by hitting Enter.
  • “.” (read: dot) on a line by itself tells the mail server that the message is complete and can be sent. It is basically the end-of-message marker for emails.
  • QUIT closes the session with the mail server.
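
The sequence above can also be written down as a small shell function, which is handy for keeping the commands in the right order. This is only a sketch: smtp_dialogue is a hypothetical helper, every address in the usage line is a placeholder, and it uses the MAIL FROM:<address> form from the SMTP specification rather than the looser form typed by hand above. It only prints the dialogue. Piping the output straight into a mail server (with nc, say) would send everything without waiting for the server's replies, which some servers may reject, so treat that as an experiment rather than a recipe.

```shell
# Print an SMTP test dialogue, one command per line, with CRLF line endings.
# smtp_dialogue is a hypothetical helper; every argument is a placeholder.
#   $1 = name to announce in EHLO   $2 = sender address
#   $3 = recipient address          $4 = subject   $5 = one-line body
smtp_dialogue() {
    printf 'EHLO %s\r\n' "$1"
    printf 'MAIL FROM:<%s>\r\n' "$2"
    printf 'RCPT TO:<%s>\r\n' "$3"
    printf 'DATA\r\n'
    printf 'Subject: %s\r\n' "$4"
    printf '\r\n'                 # blank line separates headers from body
    printf '%s\r\n' "$5"
    printf '.\r\n'                # a lone dot ends the message
    printf 'QUIT\r\n'
}

# Usage: smtp_dialogue client.example.test sender@example.test \
#            recipient@example.test "Test message" "message content."
```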

It is important that the previous set of commands is run the way that they look.  This whole string of commands should look something similar to the following inside of your shell when things are all said and done, assuming everything is working properly.

In my case, I was unable to get an address accepted for the RCPT TO: command. The fix, along with a few other steps in rebuilding the hub, was to grant anonymous send permission on the Exchange side of things. After that, mail began flowing through the newly rebuilt Hub Transport server perfectly.

That should be it. I highly suggest going through the process of sending out a few test emails to get this method stuck in your brain, in case you ever have to do any mail flow troubleshooting later on down the road. Good luck!
