I’ve recently identified a need to set up a means of indexing and browsing documents. This is a great little project to demonstrate how we can tie lots of tools together with Linux, so I’m going to write up what I’m doing as I go along.
The requirements are pretty simple, but I haven’t found anything out of the box that suits. In particular:
- This is only for a couple of people, so it doesn’t need the complexity associated with a full-blown document management system.
- It does need to index PDFs – including PDFs that have been generated from scanned-in pages rather than from an office suite.
- Scanned PDFs may not have had any sort of OCR process applied to them – but that doesn’t mean I don’t want to be able to search for them!
- It needs to be able to do this with minimal interaction – as a rule of thumb, if it’s even conceivably possible to automate part of the process, that part of the process must be automated. I can think of better things to do with my time than click “Next…”
- Must be able to interact with the system via a web browser.
- Anything running on the public Internet is out – most of the information I’m scanning in has no business being anywhere near the public Internet.
- Must be dead easy to back up. Anything that involves databases, Tomcat etc. is probably far too complicated.
- Budget: About £250+VAT for an MFD (multifunction device) that has a duplexing scanner unit. Other than that: £0. Most multifunction devices come with software that will OCR scanned files and index them, but further investigation suggests it usually fails the web-based and the “minimal interaction” requirements. I have a spare computer sitting around that I can use, but I can’t justify spending a fortune on software. That may change in the future, but it’s what we’ve got now.
So, here’s the question: Can I do it? Read on…
Let’s keep this simple. We’ll have a directory containing all our documents, and we’ll have some sort of web-driven browsing interface to look through that directory. We’ll want some way to search within that directory – scanned files don’t tend to contain very good metadata and I don’t want to have to babysit the scanning process to resolve this. An indexing and search facility that will search PDFs would be perfect here.
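To make the "indexing and search" idea a bit more concrete, here is a minimal sketch in Python of what such a facility does under the hood. It is purely illustrative and not the tool this series ends up using: it assumes the text of each document has already been extracted somehow, and the function names (`tokenize`, `build_index`, `search`) and the example file paths are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(text_by_path):
    """Build an inverted index: each word maps to the set of documents containing it.

    text_by_path is a dict of {document path: extracted text}. How the text
    gets extracted from a PDF is deliberately left out of this sketch.
    """
    index = defaultdict(set)
    for path, text in text_by_path.items():
        for word in tokenize(text):
            index[word].add(path)
    return index


def search(index, query):
    """Return the documents that contain every word in the query, sorted by path."""
    words = tokenize(query)
    if not words:
        return []
    # index.get() avoids creating empty entries for words we have never seen.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical documents, standing in for text pulled out of scanned PDFs.
    docs = {
        "bills/gas-january.pdf": "Gas bill for January",
        "bank/statement-january.pdf": "Bank statement for January",
    }
    index = build_index(docs)
    print(search(index, "january bill"))
```

A real indexer would also handle stemming, ranking and incremental updates, which is exactly why an existing tool is preferable to rolling our own.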
We’re also going to need some way to populate our document directory. Remember incoming files may not be OCR’d, so we need to do something about that.
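One rough way to spot files that still need OCR is to try extracting text from each PDF and see whether anything meaningful comes back. The sketch below assumes the `pdftotext` command from poppler-utils is installed; the helper names and the 20-character threshold are arbitrary choices for illustration, not a tested heuristic.

```python
import subprocess
from pathlib import Path


def extract_text_with_pdftotext(pdf_path):
    """Return the text layer of a PDF using pdftotext (assumed to be installed).

    Passing "-" as the output file makes pdftotext write to standard output.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def needs_ocr(pdf_path, extract_text=extract_text_with_pdftotext, min_chars=20):
    """Guess whether a PDF lacks a usable text layer.

    A scanned PDF with no OCR applied typically yields little or no text,
    so we count the non-whitespace characters and compare against an
    arbitrary threshold (min_chars is an assumption, tune to taste).
    """
    text = extract_text(pdf_path)
    return len("".join(text.split())) < min_chars


def find_pdfs_needing_ocr(directory, extract_text=extract_text_with_pdftotext):
    """Walk a directory tree and list the PDFs that appear to need OCR."""
    # Note: "*.pdf" is matched case-sensitively on Linux, so "SCAN.PDF" is skipped.
    return sorted(
        path
        for path in Path(directory).rglob("*.pdf")
        if needs_ocr(path, extract_text)
    )
```

The text extractor is passed in as a parameter so that a different tool could be swapped in later without touching the rest of the logic.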
In Part 2, we’ll start putting together a box that can do all this.