Document Storage: Part 4

Document Storage Project

This is Part 4: Indexing the storage. Indexing in this context is the process of making the storage searchable, so we can type search terms into a simple text box and get results. We’re not talking about the Apache index we set up in Part 3.

There are all sorts of free projects that do this, but alas most of them just provide a library that you integrate into your own programming project. We don’t want a library; we want a pre-cooked system including database, indexing and an interface we can punch words into and get search results back.

The product I’ve found is called Sphider Plus. This is a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online. As it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because it lists some interesting features and it’s cheap enough.

Download and unzip sphider-plus:

cd /home/http/search
unzip /path/to/sphider-plus.zip

Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:

mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit
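If you would rather not type the statements by hand, here is a minimal sketch that prints them so they can be piped into the mysql client. The helper name sphider_sql is made up for this example, and it assumes the database and user are both called sphider-plus as above. The hyphen in the name is why the database is wrapped in backticks and the user in single quotes.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that creates the Sphider Plus
# database and user. Takes the new user's password as its only argument.
sphider_sql() {
    db_pass=$1
    # Escape backslashes and single quotes so the password is a valid
    # SQL string literal.
    escaped=$(printf '%s' "$db_pass" | sed -e 's/\\/\\\\/g' -e "s/'/''/g")
    cat <<EOF
CREATE DATABASE IF NOT EXISTS \`sphider-plus\`;
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '$escaped';
GRANT ALL ON \`sphider-plus\`.* TO 'sphider-plus'@'localhost';
EOF
}

# Usage (mysql prompts for your own password as before):
#   sphider_sql 'a secret password' | mysql -u (username) -p
```

Printing the SQL first also makes it easy to eyeball the statements before running them.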

In the settings/database.php, change the following lines as appropriate:

	$database="sphider-plus";
	$mysql_user = "sphider-plus";
	$mysql_password = "(PASSWORD YOU USED ABOVE)"; 
	$mysql_host = "localhost";
	$mysql_table_prefix = "";

We want to be able to index PDFs, so edit the line in settings/conf.php:

//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';

Edit /home/http/search/converter/pdftotext as follows:

#!/bin/sh
/home/http/search/converter/pdftotext.script "$1" -
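Once the permissions below are in place, it is worth checking that the wrapper really produces text before blaming Sphider for an empty index. This is a rough sketch: pdf_word_count is a hypothetical helper, and it assumes the converter takes the PDF path as its first argument and writes plain text to standard output, as the wrapper above does.

```shell
#!/bin/sh
# Hypothetical helper: count the words a converter extracts from one PDF.
# A result of 0 suggests a broken wrapper or a PDF with no text layer.
pdf_word_count() {
    converter=$1
    pdf=$2
    # Discard error output; only the extracted text is counted.
    "$converter" "$pdf" 2>/dev/null | wc -w | tr -d ' '
}

# Usage:
#   pdf_word_count /home/http/search/converter/pdftotext /path/to/test.pdf
```

Run it as the user the web server runs as if you can, since that is the account Sphider will use to call the converter.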

Set up permissions appropriately:

find /home/http/search -type d -print0 | xargs -0 chmod 700
find /home/http/search -type f -print0 | xargs -0 chmod 600
chown -R www-data /home/http/search
chmod 700 /home/http/search/converter/pdftotext
chmod 700 /home/http/search/converter/pdftotext.script
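Permissions tend to drift when you unzip an update, so it can help to wrap these steps in something re-runnable. The sketch below is one way to do that; lock_down is a made-up name, and it assumes the two converter files live in a converter/ subdirectory as configured in settings/conf.php. It leaves out the chown, which needs root.

```shell
#!/bin/sh
# Hypothetical helper: directories 700, files 600, then restore the
# execute bit on the two converter files if they exist.
lock_down() {
    dir=$1
    find "$dir" -type d -exec chmod 700 {} +
    find "$dir" -type f -exec chmod 600 {} +
    for exe in pdftotext pdftotext.script; do
        if [ -f "$dir/converter/$exe" ]; then
            chmod 700 "$dir/converter/$exe"
        fi
    done
}

# Usage:
#   lock_down /home/http/search
#   chown -R www-data /home/http/search   # as root
```

Using find with the directory as its first argument (rather than -name) is what makes it walk the whole tree.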

Point your web browser at: http://(host name).local/search and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface and tell it to start indexing: http://(host name).local

Assuming everything works, you should now be able to search for PDFs containing text at: http://(host name).local/search/search.php

Look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues. Configure Sphider Plus to your liking.

You now have a search interface! But it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:

          <div id="commandbar">
                <a href="/" id="home">home</a>
                                <a href="../" id="parent">up</a>
                <a href="#" id="refresh">refresh</a>
                <form action="/search/search.php" method="get" id="search">
                        <input type="text" name="query" id="query" size="40" value="" columns="2" autocomplete="off" delay="500">
                        <input type="hidden" name="search" value="1">
                        <input type="Submit" value="Search">
                        Show   
                        <select name='results'>
                                <option >5</option>
                                <option >10</option>
                                <option selected>20</option>
                                <option >30</option>
                                <option >50</option>
                        </select>
                        results per page
                </form>
                </div>
                        <div id="files">
                        <h2>
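The form is a plain GET request, so submitting it just loads search.php with three parameters: query, search and results. Here is a small sketch that builds the same URL from the command line, which is handy for testing without a browser. The function name search_url is made up, and the encoding is deliberately minimal.

```shell
#!/bin/sh
# Hypothetical helper: build the URL the search form above would request.
# Arguments: host, search terms, results per page (defaults to 20).
search_url() {
    host=$1
    query=$2
    results=${3:-20}
    # Minimal encoding: spaces become '+'. Other reserved characters
    # (&, #, % and so on) are NOT handled here.
    encoded=$(printf '%s' "$query" | tr ' ' '+')
    printf 'http://%s/search/search.php?query=%s&search=1&results=%s\n' \
        "$host" "$encoded" "$results"
}

# Usage:
#   curl -s "$(search_url myhost.local 'annual report' 10)"
```

If the URL returns results with curl but the form does not, the problem is in header.html rather than in Sphider.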

You’ll also want to change the stylesheet, /home/http/assets/style.css. Find the parts that relate to #commandbar and change them as follows:

/* command = commandbar button, is active, is greyed out */ 
#commandbar a,form { 
background-position: left center; 
background-repeat: no-repeat; 
font-size: 82.5%; 
margin-left: 0.4em; 
margin-right: 0px; 
margin-top: 0px; 
margin-bottom: 0px; 
padding-left: 22px; 
padding-right: 1.0em; 
padding-top: 0.4em; 
padding-bottom: 0.6em; 
line-height: 1.5em; 
color: #555555; 
border-right: 2px dotted #D5E0E0; 
}

/* common commands */ 
#commandbar #parent { 
background-image: url('/assets/icons/parent.gif'); 
}

#commandbar input,#search,#querySuggestList { 
display: inline; 
}

#commandbar #refresh { 
background-image: url('/assets/icons/refresh.gif'); 
}

If you point your browser at http://(host name).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this.

However, right now all we have is an index to a single PDF we put up for testing. We need some means of scanning pages and uploading them to this server, which is the subject of our next post.


Introduction to Weechat

Continuing our little mini series, I wanted to introduce a great alternative to Irssi called Weechat.  In this post we will go over how to get it set up and configured as closely as possible to the workflow we had in Irssi.  This way you can evaluate both IRC clients for yourself and decide which suits your needs best.  The previous posts build on each other, but if you just want to compare Weechat and Irssi, take a look at Introduction to Irssi.

I’m not going to lie, the more I use Weechat the more it grows on me.  It is clean, easy to use and has some great functionality built in that Irssi doesn’t offer out of the box.  Small things for the most part, such as a native nicklist, colored nicks and some slick formatting, to name a few.  It definitely looks and feels nice right away, without the need for customization.  One convincing argument some have mentioned for switching to Weechat is that nearly everything is customizable and can be scripted in a variety of languages.  Another nicety is the very slick script manager, which lets you install and manage different scripts and plugins without any hassle.  I will discuss it later in this post.

Another cool thing I learned is that the writer/creator of the program hangs out in the #weechat IRC channel on freenode pretty much all the time and will answer questions for you!  How cool is that?  I had some issues getting the buffers.pl script set up the way I wanted and (@FlashCode) was there to immediately guide me in the right direction.

Bitlbee in Weechat

Just a quick word here.  I noticed that Bitlbee behaves slightly differently in Weechat so if you are used to Irssi then you should read about how to get things working.  Nothing too major here, just a few peculiarities that I thought were important to include.  I have listed below some of the most common things to go over in Bitlbee paired with Weechat.

Connect to bitlbee.

/connect localhost

Automatically connect to Bitlbee when Weechat starts.

/server add bitlbee localhost -autoconnect

Add your gtalk account.

account add jabber <gtalk-address>
acc 0 set password <password>
acc 0 on

Change nick to more readable form (restart for this to take effect).

acc 0 set nick_source full_name

Connect to gtalk on start (should work but still need to fix this).

/set irc.server.bitlbee.command "/msg &bitlbee identify <password>"

I couldn’t get oauth working.  I will come back and update this later.

Getting used to Weechat

Now that we have all of that fun stuff out of the way we are finally ready for the meat of this post.  Let’s go ahead and install and fire up Weechat.

sudo aptitude install weechat
weechat-curses

Here is how to set some of your defaults in Weechat.  Most of these are pretty intuitive.

/server add freenode irc.freenode.net
/set aspell.check.enabled on
/set irc.server.freenode.nicks "username,username_"
/set irc.server.freenode.username "username"
/set irc.server.freenode.realname "first and last"
/set irc.server.freenode.autoconnect on

Identify nick on server after connecting.

/set irc.server.freenode.command "/msg nickserv identify <password>"

Autojoin favorite channels after connecting to your IRC server.

/set irc.server.freenode.autojoin "##/r/sysadmin,#channel2"

Connect to your newly created IRC alias.

/connect freenode
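If you end up repeating these commands on more than one machine, you may be able to pass them to Weechat at startup instead of typing them. This is only a sketch: it assumes your Weechat build supports the -r (--run-command) option with semicolon-separated commands, so check weechat-curses --help first. The function name weechat_startup_cmds is made up.

```shell
#!/bin/sh
# Hypothetical helper: join Weechat commands with ';' so they can be
# passed as a single argument. A command that itself contains ';'
# would be split, so keep those out of the list.
weechat_startup_cmds() {
    joined=""
    for cmd in "$@"; do
        if [ -z "$joined" ]; then
            joined=$cmd
        else
            joined="$joined;$cmd"
        fi
    done
    printf '%s\n' "$joined"
}

# Usage (assumes -r is available in your version):
#   weechat-curses -r "$(weechat_startup_cmds \
#       '/server add freenode irc.freenode.net' \
#       '/set irc.server.freenode.autoconnect on' \
#       '/connect freenode')"
```

Anything set with /set is saved in Weechat's configuration, so this is really only needed for the first run.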

As you can see, out of the box Weechat offers some very nice features.  The only thing that I found annoying/frustrating in Weechat was that there was no way to manage your different windows.  Remember, in Irssi this was done with the addition of the adv_windowlist.pl script.  In Weechat things are a tad different, but there is a script to manage these windows (they are referred to as buffers in Weechat).

Installing the Weeget script manager (This has been deprecated)

Instead of using the “weeget.py” method, use the built-in “/script” command.  It is baked into Weechat now, so installing plugins is super easy.

For a list of all available plugins visit the Weechat script page.

The original weeget steps are kept below for reference.  Weeget is a nice little script that allows us to fetch other scripts, and it had to be installed first.  If your Weechat does not have “/script”, I HIGHLY suggest getting weeget up and working; it will save you pain and misery down the road, trust me.

cd ~/.weechat/python
wget http://www.weechat.org/files/scripts/weeget.py

We need to restart Weechat for this script to get picked up and take effect.

Install buffers.pl

Once that is done, we will install our new window manager script with the following command inside Weechat.

/script install buffers.pl

This will give you a nice pretty list of all your open windows and conversations on the left side of Weechat.

That’s okay but we can make this better!  Here is how to stick the bar to the top of your console (use bottom instead of top to put it at the bottom) and fill it with columns.

/set weechat.bar.buffers.position top
/set weechat.bar.buffers.filling_top_bottom columns_vertical

Install iset

Another slick script to install is the iset.pl script.  This makes changing settings much easier, as it adds a lightweight interface and a description for all the different options and settings.  Installation is easy with the script manager.

/script install iset.pl

Install buffer_autoclose

There is one last script that users may like, called buffer_autoclose.py.  It closes inactive buffers that no longer need to stay open and begins working automatically once it is installed.  This cleans up the buffer list and helps improve the look and feel.  It is super easy to install with the script manager.

/script install buffer_autoclose.py
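To confirm from outside Weechat which of these scripts actually landed on disk, something like the sketch below can help. It assumes Perl scripts are stored in ~/.weechat/perl and Python scripts in ~/.weechat/python (the latter matches the weeget step earlier, the former is an assumption), and missing_weechat_scripts is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print the names of scripts that are not present
# under the given Weechat directory. Prints nothing if all are found.
missing_weechat_scripts() {
    base=$1
    shift
    for script in "$@"; do
        # Pick the subdirectory from the file extension.
        case $script in
            *.pl) sub=perl ;;
            *.py) sub=python ;;
            *)    sub= ;;
        esac
        if [ -z "$sub" ] || [ ! -e "$base/$sub/$script" ]; then
            printf '%s\n' "$script"
        fi
    done
}

# Usage:
#   missing_weechat_scripts ~/.weechat buffers.pl iset.pl buffer_autoclose.py
```

Names with an extension other than .pl or .py are always reported as missing, since this sketch does not know where to look for them.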

And that is our final product.  It looks slightly different from Irssi, and I haven’t really gotten into custom themes or any of that jazz, but it has really begun to shape up.

If you have any tips or tricks on how to improve this environment let me know.  I am new to Weechat and am still discovering all of its nuances and will be coming back to this post periodically to update things that I have found to be worthy of adding.

Resources:
http://leigh.cudd.li/article/Bitlbee_and_Weechat_Mini-Tutorial
http://kmacphail.blogspot.com/2011/09/using-weechat-with-freenode-basics.html
http://www.weechat.org/files/doc/devel/weechat_user.en.html#screen_layout
