Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* . 
  rm -rf index-style    # rmdir would fail: the checkout's hidden .svn directory is still in there

The stylesheet really ought to be referenced with a specific path so it can always be found. Edit /home/http/assets/header.html and change the stylesheet reference thus:

    

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.
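
Before restarting, it’s worth letting Apache check the edited file so a typo doesn’t take the site down. Here’s a minimal sketch, assuming the stock Debian apache2ctl and service commands and a root shell; reload_apache is just a name I’ve made up for the wrapper:

```shell
# reload_apache is a hypothetical helper name, not part of Apache or Debian.
# It only restarts Apache if the configuration passes a syntax check.
reload_apache() {
    if apache2ctl configtest; then
        service apache2 restart
    else
        echo "Config test failed; Apache has not been restarted." >&2
        return 1
    fi
}
```

Paste it into a root shell and run reload_apache. If the syntax check complains, fix the file and try again.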

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see the file in a neatly styled directory listing.

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But those are topics for a future post…


Document Storage: Part 2

Document Storage Project

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
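
One way to create that layout in a single step is sketched below. make_layout is a hypothetical helper name, and it takes the base directory as an argument so you can try it against a scratch directory before pointing it at /home:

```shell
# make_layout is a made-up helper; the directory names come from the list above.
make_layout() {
    base="${1:?usage: make_layout BASE_DIR}"
    for dir in http/assets http/search http/documents scripts incoming; do
        # -p creates parent directories (e.g. http) and ignores existing ones
        mkdir -p "$base/$dir" || return 1
    done
}
```

For the real thing, run make_layout /home as root.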

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon
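
Once apt-get finishes, a quick sanity check is to confirm that a few of the key commands have landed on the PATH. This is only a sketch: missing_tools is a made-up helper, and the command names in the usage line below are the binaries I’d expect those packages to provide.

```shell
# missing_tools is a hypothetical helper: it prints every command name
# passed to it that cannot be found on the PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running missing_tools tesseract convert gs inotifywait svn should print nothing if everything installed cleanly; anything it does print needs another look.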

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our static HTML, the document repository, incoming files for OCR’ing and the scripts to carry out the OCR work.
  3. Most of the software we’re going to need. There are one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.


Introduction to Irssi

If you have been following this blog, you will know that I wrote a post a while back describing my preferred settings in tmux, and just recently wrote a post about getting set up with bitlbee.  Today we will be adding another piece to what I will be calling my ultimate command line theme by introducing another useful command line communication tool called Irssi.  Here are the back posts to these if you missed them.

Now that we have that out of the way let’s talk about Irssi.  Irssi is a console based IRC client that has been around for quite a while now.  There is somewhat of a holy war over which console based IRC client is the best.  There are a number of hardcore Irssi users around who tout it as the best, and the same is true of Weechat fans.  Before going any further I will say that there is definitely a certain amount of leg work to get Irssi up and running with the full set of customizations and features. That said, I believe the extra work is worth every minute of time and effort if you are looking for a fully featured, rich IRC experience.

I want to present both of these clients (Irssi and Weechat) to readers and let each person decide for themselves which is the best, because saying one is better than the other wouldn’t be a fair comparison; it is really like comparing apples and oranges.  With that said, in a future post I will be going over the basics of using Weechat, the other touted console based IRC client.

Bitlbee in Irssi

Add an alias for Bitlbee.

/network add bitlbee
/server add -auto -network bitlbee localhost
/connect bitlbee

Register server account to tie to Irssi.

register
/oper
<desired password>

Automatically join and identify when Irssi starts.

/channel add -auto -botcmd '/say identify\; /oper' &bitlbee bitlbee

Add in your Gtalk account.

account add jabber <your Gmail address>
/oper
<gmail password>

Set up correct port and ssl stuff for gtalk.

account jabber server talk.google.com:5223:ssl

Getting used to Irssi

Here we will assume that you have created and set up a user with irc.freenode.net.  Once that step has been completed you should be able to follow these instructions without any issues.

/SERVER ADD -auto -network freenode irc.freenode.net

You may have to shut down and restart Irssi at this point for it to recognize the network name “freenode” in the next step.

/CHANNEL ADD -auto ##/r/sysadmin freenode

Adding advanced_windowlist.pl

First we need to download the script and put it into the appropriate place.  If you haven’t created your Irssi scripts directory and your autorun directory go ahead and make them quickly.

mkdir ~/.irssi/scripts
mkdir ~/.irssi/scripts/autorun

Change directories to your scripts directory and download the script.

cd ~/.irssi/scripts
wget 'http://anti.teamidiot.de/static/nei/*/Code/Irssi/adv_windowlist.pl'

Let’s quickly set it to be executable.

chmod +x adv_windowlist.pl

Now we need to symlink this script and then run it in Irssi. To symlink it run the following:

cd ~/.irssi/scripts/autorun
ln -s ../adv_windowlist.pl

And finally to load it into Irssi.

/run adv_windowlist.pl
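
Since the symlink step is the same for every script in this post, it can be wrapped in a small shell helper. This is just a sketch: enable_irssi_script and the IRSSI_DIR variable are names I’ve made up, and it assumes the script has already been downloaded into the scripts directory.

```shell
# enable_irssi_script is a hypothetical helper, not something Irssi ships.
# IRSSI_DIR is an assumed override; it defaults to ~/.irssi.
enable_irssi_script() {
    script="$1"
    irssi_dir="${IRSSI_DIR:-$HOME/.irssi}"
    mkdir -p "$irssi_dir/scripts/autorun"
    if [ ! -f "$irssi_dir/scripts/$script" ]; then
        echo "No such script: $irssi_dir/scripts/$script" >&2
        return 1
    fi
    # Relative link, the same result as running "ln -s ../script.pl" inside autorun/
    ln -sf "../$script" "$irssi_dir/scripts/autorun/$script"
}
```

For example, enable_irssi_script adv_windowlist.pl sets the script up to load automatically the next time Irssi starts.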

That should be it. This can come in handy when you have any more than a handful of windows and can’t keep your conversations straight. If we take a look at our Irssi session we can see that there is a name associated with each window number now.

This looks pretty good, but there are some cool features in this script that we are going to leverage to make it look even better.  In your Irssi session run the following commands to customize your display even further:

/statusbar window remove act
/set awl_display_key $Q%K|$N%n $H$C$S
/set awl_display_key_active $Q%K|$N%n $H%U$C%n$S
/set awl_display_nokey [$N]$H$C$S
/set awl_block 1

OK, this is looking better. We now have our current conversation underlined, our windows named and numbered with decent formatting and have set windows with activity to update and change colors. There are more options if you look at the script itself but this is a pretty good start.

Setting up hilight.pl

This script will add in the ability to check messages that contain your nick. This is a good way to catch up on messages that arrived while you were away or that you didn’t get a chance to respond to.  First we need to make sure that the new window will split correctly.

/SET autostick_split_windows ON
/hilight <nickname>

Now we add in and configure our new notification window.

/window new split
/window name hilight
/window size 4

Set up nm.pl

The description from its creator is “right aligned nicks depending on longest nick”.  This script will help with the readability and organization of your different chats.  I’m not sure if it requires nickcolor.pl but I have it in my scripts folder and symlinked to my autorun folder just in case.  Just load nm.pl like you do for all other scripts and it will start doing its thing.

/run nm.pl

Setting up themes

Themes are definitely not a necessity but can help to make things cleaner and easier to read.  So far I have played around with the xchat and fear2 themes but will come back and update this post if I happen to find a better theme.  The good thing is that themes are really easy to set up and use.  To load a specific theme, just copy it into your ~/.irssi directory and turn it on in Irssi.

wget -P ~/.irssi http://irssi.org/themefiles/xchat.theme
/set theme xchat

That’s all I have on Irssi for now.  If there is one complaint that I have about Irssi it is that the nicklist.pl script doesn’t play nicely in tmux (however it should be fine in screen).  It is a manual process and is a pain in the ass to get set up so I have chosen not to cover it in this post.  It is possible I know, but for me, it just wasn’t worth the trouble.  If you know of an easy way to get this working inside of tmux let me know.

Resources:
http://irssi.org/beginner/
http://quadpoint.org/articles/irssi/#channel_statusbar_using_advanced_windowlist
http://www.antonfagerberg.com/archive/my-perfect-irssi-setup
http://quadpoint.org/articles/irssisplit/


Introduction to command line IM and Bitlbee

Bitlbee is a way to bridge IM and IRC together, essentially allowing you to connect to your IM networks through an IRC interface.  One great feature of Bitlbee is that it supports a large number of different protocols (including Gtalk, Yahoo!, Facebook and Twitter), which happen to be nearly all the major platforms I’m concerned with, excluding Microsoft Lync.  The main reason I want to discuss Bitlbee now, ahead of time, is because I will be doing a series of posts that specifically tie Bitlbee in with a few of the more popular IRC clients.

As you will see, there are slight differences in how Bitlbee behaves inside each of the IRC clients I have been trying out, but I will leave these details out for now to make things easier to follow.  Today’s post will be geared more towards general use of Bitlbee, so I will be going over things like how to get around and its basic usage.

As usual, I will be running Ubuntu, so these instructions are specific to Debian based distros.  Outside of installation, I imagine the usage will be very similar in other distributions because most of the commands and configuration happen inside Bitlbee.

Getting used to Bitlbee

Let’s start off by getting Bitlbee installed.

sudo aptitude install bitlbee

Now let’s go ahead and add in our gtalk (jabber) account.

account add jabber <your Gmail address>

Set up correct port and ssl for gtalk.

account 1 server talk.google.com:5223:ssl

Optional – turn on oauth (Still having some issues with this one).

account gtalk set oauth true
oauth = 'true'

Log in to Gtalk.

account jabber on

Start a chat in a new window.

/msg NickName Hello!

Getting a listing of various IM accounts.

account list
account list online
account list all

Managing your various IM contacts is pretty self explanatory.  Here 0 is the gtalk account we added earlier, <contact's address> is the person we are adding to our account and nickname is how they will show up in our contact list.

add 0 <contact's address> nickname
remove nickname

If you have anything else to add I would love to hear it.  I’m still playing around with the oauth stuff, so I will update this post later with a fix.

Resources:
http://510x.se/notes/posts/Install_and_setup_BitlBee/
http://static.quadpoint.org/bitlbee-user-guide.html
http://wiki.bitlbee.org/Commands


Project: Document Storage Made Simple

I’ve recently identified a need to set up a means of indexing and browsing documents. This is a great little project to demonstrate how we can tie lots of tools together with Linux, so I’m going to write up what I’m doing as I go along.

The needs are pretty simple, but I haven’t found anything out of the box that suits. In particular, I’ve got the following needs:

  • This is only for a couple of people, so it doesn’t need the complexity associated with a full-blown document management system.
  • It does need to index PDFs – including PDFs that have been generated from scanned in pages rather than from an office suite.
  • Scanned PDFs may not have had any sort of OCR process applied to them – but that doesn’t mean I don’t want to be able to search for them!
  • It needs to be able to do this with minimal interaction – as a rule of thumb, if it’s even conceivably possible to automate part of the process, that part of the process must be automated. I can think of better things to do with my time than click “Next…”
  • Must be able to interact with the system via a web browser.
  • Anything running on the public Internet is out – most of the information I’m scanning in has no business being anywhere near the public Internet.
  • Must be dead easy to back up. Anything that involves databases, Tomcat etc. is probably far too complicated.
  • Budget: About £250+VAT for an MFD that has a duplexing scanner unit. Other than that: £0. Most multifunction devices come with software that will OCR scanned files and index them, but further investigation suggests it usually fails the web-based and the “minimal interaction” requirements. I have a spare computer sitting around that I can use, but I can’t justify spending a fortune on software. That may change in the future, but it’s what we’ve got now.

So, here’s the question: Can I do it? Read on….