Document Storage: Part 4

Document Storage Project

This is Part 4: Indexing the storage. Indexing in this context is the process of making the storage searchable, so we can have a simple text box to type search terms into and get results back. We’re not talking about the Apache index we set up in Part 3.

There are all sorts of free projects that do this, but alas most of them just provide a library that you can integrate into your own programming project. We don’t want a library; we want a pre-cooked system including database, indexing and an interface we can punch words into and get search results back.

The product I’ve found is called Sphider Plus. This is a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online. As it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because there are some interesting features listed in the newer version and it’s cheap enough.

Download and unzip sphider-plus:

cd /home/http/search
unzip /path/to/sphider-plus.zip

Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:

mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit
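
If you'd rather not type the statements by hand, here's a rough sketch that writes them to a file with a randomly generated password. The file name sphider-setup.sql and the password recipe are my own assumptions, not anything Sphider Plus requires. Note the quoting: names containing a hyphen have to be quoted in MySQL.

```shell
#!/bin/sh
# Sketch only: writes the setup SQL to a file instead of running it.
# OUT and the password recipe are assumptions made for illustration.
OUT="${OUT:-./sphider-setup.sql}"

# 18 random bytes, base64-encoded, with characters that are awkward
# inside a quoted SQL string stripped out.
DB_PASS=$(head -c 18 /dev/urandom | base64 | tr -d '/+=\n')

# Unquoted heredoc: $DB_PASS is expanded, \` produces a literal backtick.
cat > "$OUT" <<EOF
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '$DB_PASS';
CREATE DATABASE \`sphider-plus\`;
GRANT ALL ON \`sphider-plus\`.* TO 'sphider-plus'@'localhost';
EOF

chmod 600 "$OUT"   # the file contains the password
echo "Wrote $OUT"
```

You could then feed the file to MySQL with mysql -u (username) -p < sphider-setup.sql, and copy the generated password into settings/database.php in the next step.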

In the settings/database.php, change the following lines as appropriate:

	$database="sphider-plus";
	$mysql_user = "sphider-plus";
	$mysql_password = "(PASSWORD YOU USED ABOVE)"; 
	$mysql_host = "localhost";
	$mysql_table_prefix = "";

We want to be able to index PDFs, so edit the line in settings/conf.php:

//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';

Edit /home/http/search/converter/pdftotext as follows:

#!/bin/sh
/home/http/search/converter/pdftotext.script "$1" -
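
One thing to watch: the wrapper receives the PDF's path as its first argument, and a path with a space in it gets split in two unless that argument is quoted. A small sketch of the difference, using a made-up count_args helper and file name rather than the real converter:

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for the converter: it just reports
# how many arguments it received.
count_args() {
    echo "$#"
}

# Hypothetical file name containing a space.
pdf="/tmp/scanned invoice.pdf"

unquoted=$(count_args $pdf -)     # word-splitting: the path becomes two arguments
quoted=$(count_args "$pdf" -)     # the path stays a single argument

echo "unquoted: $unquoted arguments"
echo "quoted:   $quoted arguments"
```

The unquoted call sees three arguments and the quoted one sees two, which is why writing "$1" in the wrapper is the safer form.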

Set up permissions appropriately:

find /home/http/search -type d -print0 | xargs -0 chmod 700
find /home/http/search -type f -print0 | xargs -0 chmod 600
chown -R www-data /home/http/search
chmod 700 /home/http/search/converter/pdftotext
chmod 700 /home/http/search/converter/pdftotext.script
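
To double-check the result, find can list anything that doesn't have the mode you expect. This is a sketch that runs against a scratch directory by default so it's safe to try; set ROOT=/home/http/search to use the real tree. ROOT and the sample files are just for illustration, and I'm assuming the two wrapper scripts live under converter/ as configured in conf.php.

```shell
#!/bin/sh
# ROOT is an assumption: by default a throwaway tree with sample files is used.
if [ -z "$ROOT" ]; then
    ROOT=$(mktemp -d)
    mkdir -p "$ROOT/converter"
    touch "$ROOT/index.php" \
          "$ROOT/converter/pdftotext" \
          "$ROOT/converter/pdftotext.script"
fi

# Directories get 700, files 600; -print0 / -0 copes with odd file names.
find "$ROOT" -type d -print0 | xargs -0 chmod 700
find "$ROOT" -type f -print0 | xargs -0 chmod 600

# The two converter scripts need to stay executable.
chmod 700 "$ROOT/converter/pdftotext" "$ROOT/converter/pdftotext.script"

# Anything listed here has an unexpected mode (both should come back empty).
bad_dirs=$(find "$ROOT" -type d ! -perm 700)
unexpected=$(find "$ROOT" -type f ! -perm 600 ! -name 'pdftotext*')
echo "directories not 700: ${bad_dirs:-none}"
echo "files not 600 (other than the converter scripts): ${unexpected:-none}"
```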

Point your web browser at http://(host name).local/search and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface and tell it to start indexing: http://(host name).local

Assuming everything works, you should now be able to search for PDFs containing text at http://(host name).local/search/search.php. Look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues. Configure Sphider Plus to your liking.

You now have a search interface! But it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:

          <div id="commandbar">
                <a href="/" id="home">home</a>
                                <a href="../" id="parent">up</a>
                <a href="#" id="refresh">refresh</a>
                <form action="/search/search.php" method="get" id="search">
                        <input type="text" name="query" id="query" size="40" value="" columns="2" autocomplete="off" delay="500">
                        <input type="hidden" name="search" value="1">
                        <input type="Submit" value="Search">
                        Show   
                        <select name='results'>
                                <option >5</option>
                                <option >10</option>
                                <option selected>20</option>
                                <option >30</option>
                                <option >50</option>
                        </select>
                        results per page
                </form>
                </div>
                        <div id="files">
                        <h2>

You’ll also want to change the stylesheet, /home/http/assets/style.css. Find the parts that relate to #commandbar and change them as follows:

/* command = commandbar button, is active, is greyed out */ 
#commandbar a, #commandbar form { 
background-position: left center; 
background-repeat: no-repeat; 
font-size: 82.5%; 
margin-left: 0.4em; 
margin-right: 0px; 
margin-top: 0px; 
margin-bottom: 0px; 
padding-left: 22px; 
padding-right: 1.0em; 
padding-top: 0.4em; 
padding-bottom: 0.6em; 
line-height: 1.5em; 
color: #555555; 
border-right: 2px dotted #D5E0E0; 
}

/* common commands */ 
#commandbar #parent { 
background-image: url('/assets/icons/parent.gif'); 
}

#commandbar input,#search,#querySuggestList { 
display: inline; 
}

#commandbar #refresh { 
background-image: url('/assets/icons/refresh.gif'); 
}

If you point your browser at http://(hostname).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this.

However, right now all we have is an index of a single PDF we put up for testing. We need some means of scanning pages and uploading them to this server, which is the subject of our next post.


Introduction to Weechat

Continuing our little mini series, I wanted to introduce a great alternative to Irssi, called Weechat. In this post we will go over how to get it set up and configured as closely as possible to the workflow we had in Irssi. This way you can evaluate both of these IRC clients for yourself and decide which suits your needs best. The previous posts sort of build off of each other, but if you just want to compare Weechat and Irssi, take a look at Introduction to Irssi.

I’m not going to lie, the more I use Weechat the more it grows on me. It is clean, easy to use and has some great functionality built in that isn’t offered out of the box in Irssi. Small things for the most part, such as a native nicklist, colored nicks and some slick formatting, to name a few. It definitely looks and feels nice right away, without the need for customization. One convincing argument some have mentioned for switching over to Weechat is that nearly everything is scriptable in a variety of different languages, as well as very customizable. Another nicety is the very slick script manager that gives you the ability to install and manage different scripts and plugins without any hassle, which I will discuss later in this post.

Another cool thing I learned is that the writer/creator of the program hangs out in the #weechat IRC channel on freenode pretty much all the time and will answer questions for you!  How cool is that?  I had some issues getting the buffers.pl script set up the way I wanted and (@FlashCode) was there to immediately guide me in the right direction.

Bitlbee in Weechat

Just a quick word here.  I noticed that Bitlbee behaves slightly differently in Weechat so if you are used to Irssi then you should read about how to get things working.  Nothing too major here, just a few peculiarities that I thought were important to include.  I have listed below some of the most common things to go over in Bitlbee paired with Weechat.

Connect to bitlbee.

/connect localhost

Automatically connect to Bitlbee when Weechat starts.

/server add bitlbee localhost -autoconnect

Add your gtalk account.

account add jabber <your gtalk address>
acc 0 set password <password>
acc 0 on

Change nick to more readable form (restart for this to take effect).

acc 0 set nick_source full_name

Connect to gtalk on start (should work but still need to fix this).

/set irc.server.bitlbee.command "/msg &bitlbee identify <password>"

I couldn’t get oauth working, I will come back and update this later.

Getting used to Weechat

Now that we have all of that fun stuff out of the way we are finally ready for the meat of this post.  Let’s go ahead and install and fire up Weechat.

sudo aptitude install weechat
weechat-curses

Here is how to set some of your defaults in Weechat. Most of these are pretty intuitive.

/server add freenode irc.freenode.net
/set aspell.check.enabled on
/set irc.server.freenode.nicks "username,username_"
/set irc.server.freenode.username "username"
/set irc.server.freenode.realname "first and last"
/set irc.server.freenode.autoconnect on

Identify nick on server after connecting.

/set irc.server.freenode.command "/msg nickserv identify <password>"

Autojoin favorite channels after connecting to your IRC server.

/set irc.server.freenode.autojoin "##/r/sysadmin,#channel2"

Connect to your newly created IRC alias.

/connect freenode

As you can see, out of the box Weechat offers some very nice features. The only thing that I found annoying/frustrating in Weechat was that there was no way to manage your different windows. Remember, in Irssi this was done with the addition of the adv_windowlist.pl script. In Weechat things are a tad different, but there is a script to manage these windows (they are referred to as buffers in Weechat).

Installing the Weeget script manager (This has been deprecated)

Update: instead of using the “weeget.py” method, use the “/script” command. It is baked into Weechat now, so installing plugins is super easy. For a list of all available plugins visit the Weechat script page.

The original weeget instructions follow for reference.

The first step is to get a nice little script that will allow us to fetch buffers.pl and others. It is called weeget and should be installed first. I HIGHLY suggest getting this up and working; it will save you pain and misery down the road, trust me.

cd ~/.weechat/python
wget http://www.weechat.org/files/scripts/weeget.py

We need to restart Weechat for this script to get picked up and take effect.

Install buffers.pl

Once that is done, we will install our new window manager script with the following command inside Weechat.

/script install buffers.pl

This will give you a nice pretty list of all your open windows and conversations on the left side of Weechat.

That’s okay, but we can make this better! Here is how to stick the bar to either the top or the bottom of your console and fill it in columns.

/set weechat.bar.buffers.position top (or bottom)
/set weechat.bar.buffers.filling_top_bottom columns_vertical

Install iset

Another slick script to install is the iset.pl script. This makes changing settings much easier, as it adds a lightweight interface and a description for all the different options and settings. Installation is easy with the script manager.

/script install iset.pl

Install buffer_autoclose

There is one last script that users may like, called buffer_autoclose.py. This will close any inactive buffers that happen to remain open and don’t need to be, and it begins working automatically once it is installed. This cleans up the buffer list and just helps to improve the look and feel. It is super easy to install with the script manager.

/script install buffer_autoclose.py

And here is what our final product looks like.  Slightly different than irssi and I haven’t really gotten into custom themes or any of that jazz but it has really begun to shape up.

If you have any tips or tricks on how to improve this environment let me know.  I am new to Weechat and am still discovering all of its nuances and will be coming back to this post periodically to update things that I have found to be worthy of adding.

Resources:
http://leigh.cudd.li/article/Bitlbee_and_Weechat_Mini-Tutorial
http://kmacphail.blogspot.com/2011/09/using-weechat-with-freenode-basics.html
http://www.weechat.org/files/doc/devel/weechat_user.en.html#screen_layout


Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* . 
  rmdir index-style

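The stylesheet really ought to be referenced with an absolute path so it can always be found, whichever directory is being listed. Edit /home/http/assets/header.html and change the stylesheet reference to point at /assets/style.css, along these lines:

    <link rel="stylesheet" type="text/css" href="/assets/style.css" />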

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see something a bit like this:

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But they’re a topic for a future post…


Configure SNMP in Debian

This post is pretty straightforward, but I want to mention there is a trick you have to use in Debian to get everything working correctly after you have all your SNMP packages installed. I didn’t realize this when I was setting this up the other day and it tripped me up for a while.

So to start things off, we need SNMP and SNMPD on our systems.

sudo aptitude install snmp snmpd

We also need to update our SNMP settings to reflect the read only SNMP community string that we want to use.  The default is public but it has been criticized for being susceptible to security breaches so you should probably keep that in mind when setting up SNMP in your environment.

At the very minimum your snmpd.conf file should look something like the following:

rocommunity mysnmpstring

Once you have updated this you need to unbind snmpd from localhost so that it can be read by others on the network. This is what tripped me up initially on my Debian box. I do not believe it is an issue in Ubuntu, but if it is then you should be able to use these instructions as well. To fix this problem you need to edit the /etc/default/snmpd file and chop off the 127.0.0.1 from the SNMPDOPTS section. When it is fixed it should look like this:

SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid'
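
If you'd rather script the change than edit by hand, something like the following sed one-liner should do it. This is only a sketch that works on a sample copy by default; the "before" line is what I'd expect a stock Debian file to contain, so check yours before pointing CONF at /etc/default/snmpd.

```shell
#!/bin/sh
# CONF is an assumption: by default a sample file is created in a scratch location.
if [ -z "$CONF" ]; then
    CONF=$(mktemp)
    # Assumed stock line, with the loopback address at the end.
    echo "SNMPDOPTS='-Lsd -Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid 127.0.0.1'" > "$CONF"
fi

# Keep a backup (.bak), then strip a trailing " 127.0.0.1" from the SNMPDOPTS line.
sed -i.bak "/^SNMPDOPTS=/ s/ 127\.0\.0\.1'\$/'/" "$CONF"

# Show the resulting line.
grep '^SNMPDOPTS=' "$CONF"
```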

Now you just need to restart the SNMP service:

service snmpd restart

You can check your handiwork when you are done, to make sure everything is working correctly, by running this command from either the local host or another machine with SNMP installed on it. Use the community string you set in snmpd.conf.

snmpwalk -v1 -c mysnmpstring HOSTNAME/IP

Hopefully this will save time for somebody in the future. It certainly tripped me up.


Document Storage: Part 2

Document Storage Project

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
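
The whole layout can be created in one go with mkdir -p. A minimal sketch: BASE is a variable I've introduced so the commands can be tried in a scratch directory first; on the real box you'd run it as root with BASE=/home.

```shell
#!/bin/sh
# BASE is an assumption: set BASE=/home on the real box. It defaults to a
# scratch directory here so the commands can be tried without root.
BASE="${BASE:-$(mktemp -d)}"

# -p creates parent directories as needed and doesn't complain if they exist.
mkdir -p "$BASE/http/assets" \
         "$BASE/http/search" \
         "$BASE/http/documents" \
         "$BASE/scripts" \
         "$BASE/incoming"

# Show what was created.
ls -R "$BASE"
```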

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon
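
Once that finishes, a quick way to confirm the main tools landed on the PATH is to loop over their command names. This is just a sketch: the have helper is my own, and the command names are my mapping from the packages above (for example, imagemagick provides convert and ghostscript provides gs), so adjust the list if yours differ.

```shell
#!/bin/sh
# have: succeeds if the named command can be found on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Assumed command names for some of the packages installed above.
missing=""
for cmd in tesseract gawk unzip zip svn inotifywait convert gs mysql; do
    have "$cmd" || missing="$missing $cmd"
done

if [ -n "$missing" ]; then
    echo "Not found:$missing"
else
    echo "All commands found"
fi
```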

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our scripts, our static HTML, the document repository, incoming files for OCR’ing and scripts to carry out the OCR work.
  3. Most of the software we’re going to need. There’s one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.
