I’ve recently identified a need to set up a means of indexing and browsing documents. This is a great little project to demonstrate how we can tie lots of tools together with Linux, so I’m going to write up what I’m doing as I go along.
The requirements are pretty simple, but I haven’t found anything out of the box that suits. In particular:
- This is only for a couple of people, so it doesn’t need the complexity associated with a full-blown document management system.
- It does need to index PDFs – including PDFs that have been generated from scanned-in pages rather than from an office suite.
- Scanned PDFs may not have had any sort of OCR process applied to them – but that doesn’t mean I don’t want to be able to search for them!
- It needs to be able to do this with minimal interaction – as a rule of thumb, if it’s even conceivably possible to automate part of the process, that part of the process must be automated. I can think of better things to do with my time than click “Next…”
- I must be able to interact with the system via a web browser.
- Anything running on the public Internet is out – most of the information I’m scanning in has no business being anywhere near the public Internet.
- Must be dead easy to back up. Anything that involves databases, Tomcat etc. is probably far too complicated.
- Budget: About £250+VAT for an MFD (multifunction device) that has a duplexing scanner unit. Other than that: £0. Most multifunction devices come with software that will OCR scanned files and index them, but further investigation suggests it usually fails the web-based and the “minimal interaction” requirements. I have a spare computer sitting around that I can use, but I can’t justify spending a fortune on software. That may change in the future, but it’s what we’ve got now.
So, here’s the question: Can I do it? Read on…