Document Storage: Part 4

Document Storage Project

This is Part 4: Indexing the storage. Indexing in this context is the process of making the storage searchable – so we can just have a simple text box we type search terms into and get results. We’re not talking about the Apache index we set up in Part 3.

There are all sorts of free projects that do this, but alas most of them just provide you with a library that you can integrate into your own programming project. We don’t want a library; we want a pre-cooked system including database, indexing and an interface we can punch words into and get search results back.

The product I’ve found is called Sphider Plus. This is a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online. As it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because there are some interesting features listed in the newer version and it’s cheap enough.

Download and unzip sphider-plus:

cd /home/http/search
unzip /path/to/sphider-plus.zip

Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:

mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit

In the settings/database.php, change the following lines as appropriate:

	$database="sphider-plus";
	$mysql_user = "sphider-plus";
	$mysql_password = "(PASSWORD YOU USED ABOVE)"; 
	$mysql_host = "localhost";
	$mysql_table_prefix = "";

We want to be able to index PDFs, so edit the line in settings/conf.php:

//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';

Edit /home/http/search/converter/pdftotext as follows:

#!/bin/sh
/home/http/search/converter/pdftotext.script "$1" -

Set up permissions appropriately:

find /home/http/search -type d -exec chmod 700 {} +
find /home/http/search -type f -exec chmod 600 {} +
chown -R www-data /home/http/search
chmod 700 /home/http/search/converter/pdftotext
chmod 700 /home/http/search/converter/pdftotext.script
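
As a quick sanity check after tightening the permissions, something like the following sketch lists anything that is still accessible to group or other. The function name loose_perms is made up for illustration, and the -perm /077 test assumes GNU find, which is what Debian ships.

```shell
#!/bin/sh
# Hypothetical helper, not part of Sphider Plus: print every path under the
# given directory that still grants any permission to group or other.
# Assumes GNU find; "-perm /077" matches if ANY of those permission bits is set.
loose_perms() {
    find "$1" -perm /077 -print
}

# Example (run as root): no output means the tree is locked down.
#   loose_perms /home/http/search
```

If this prints nothing, every file and directory under the search tree is restricted to its owner.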

Point your web browser at http://(hostname).local/search and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface and tell it to start indexing: http://(hostname).local

Assuming everything works, you should now be able to search for PDFs containing text at http://(hostname).local/search/search.php – look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues. Configure Sphider Plus to your liking.

You now have a search interface! But it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:

          <div id="commandbar">
                <a href="/" id="home">home</a>
                <a href="../" id="parent">up</a>
                <a href="#" id="refresh">refresh</a>
                <form action="/search/search.php" method="get" id="search">
                        <input type="text" name="query" id="query" size="40" value="" columns="2" autocomplete="off" delay="500">
                        <input type="hidden" name="search" value="1">
                        <input type="Submit" value="Search">
                        Show   
                        <select name='results'>
                                <option >5</option>
                                <option >10</option>
                                <option selected>20</option>
                                <option >30</option>
                                <option >50</option>
                        </select>
                        results per page
                </form>
                </div>
                        <div id="files">
                        <h2>
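
The form in this header sends a plain GET request to /search/search.php with three fields: query, search and results. Assuming those are all the parameters the search page needs (this is a sketch, not something checked against every Sphider Plus version), the same search can be run from a shell, which is handy for testing. The function name search_url and the host name docserver are placeholders, and curl isn't in the package list from Part 2, so it would need installing separately.

```shell
#!/bin/sh
# Hypothetical helper: build the URL that the search form in header.html submits.
#   $1 = host name (without the .local suffix)
#   $2 = search terms; only spaces are encoded here, which is enough for simple words
#   $3 = results per page (defaults to 20, the pre-selected option in the form)
search_url() {
    terms=$(printf '%s' "$2" | tr ' ' '+')
    printf 'http://%s.local/search/search.php?query=%s&search=1&results=%s\n' \
        "$1" "$terms" "${3:-20}"
}

# Example: fetch the results page for "gas bill" from a machine called docserver.
#   curl -s "$(search_url docserver 'gas bill')"
```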

You’ll also want to change the stylesheet, /home/http/assets/style.css. Find the parts that relate to #commandbar and change them as follows:

/* command = commandbar button, is active, is greyed out */ 
#commandbar a, #commandbar form { 
background-position: left center; 
background-repeat: no-repeat; 
font-size: 82.5%; 
margin-left: 0.4em; 
margin-right: 0px; 
margin-top: 0px; 
margin-bottom: 0px; 
padding-left: 22px; 
padding-right: 1.0em; 
padding-top: 0.4em; 
padding-bottom: 0.6em; 
line-height: 1.5em; 
color: #555555; 
border-right: 2px dotted #D5E0E0; 
}

/* common commands */ 
#commandbar #parent { 
background-image: url('/assets/icons/parent.gif'); 
}

#commandbar input,#search,#querySuggestList { 
display: inline; 
}

#commandbar #refresh { 
background-image: url('/assets/icons/refresh.gif'); 
}

If you point your browser at http://(hostname).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this. However, right now all we have is an index of the single PDF we put up for testing. We need some means of scanning pages and uploading them to this server – the subject of our next post…


Document Storage: Part 3


This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* . 
  rm -rf index-style    # the checkout's hidden .svn directory is still in there, so a plain rmdir would fail

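The stylesheet really ought to be referenced with an absolute path so it can always be found, whichever directory is being browsed. Edit /home/http/assets/header.html and change the stylesheet reference thus:

    <link rel="stylesheet" type="text/css" href="/assets/style.css" />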

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.
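
One cautious way to do the restart is to have Apache check the edited configuration first, so a typo doesn't take the web server down. This is only a sketch, assuming the stock Debian apache2ctl and service commands; the function name restart_apache is made up.

```shell
#!/bin/sh
# Hypothetical wrapper: only restart Apache if the configuration parses cleanly.
# "apache2ctl configtest" exits non-zero when it finds a syntax error.
restart_apache() {
    if apache2ctl configtest; then
        service apache2 restart
    else
        echo "Configuration check failed; Apache was NOT restarted." >&2
        return 1
    fi
}

# Example (run as root):
#   restart_apache
```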

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see a styled directory listing with your document in it.

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But they’re a topic for a future post…


Document Storage: Part 2


This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
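
If you'd rather not create these by hand, a minimal sketch along these lines does it in one go. The function name make_layout is made up, and the base directory is a parameter so you can try it somewhere harmless first.

```shell
#!/bin/sh
# Hypothetical helper: create the directory layout listed above under a base
# directory (defaults to /home). mkdir -p also creates the parent "http"
# directory and does not complain if something already exists.
make_layout() {
    base="${1:-/home}"
    for d in http/assets http/search http/documents scripts incoming; do
        mkdir -p "$base/$d" || return 1
    done
}

# Example (run as root):
#   make_layout /home
```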

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our scripts, our static HTML, the document repository, incoming files for OCR’ing and scripts to carry out the OCR work.
  3. Most of the software we’re going to need. There’s one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.


Restore an Exchange Mailbox Database using Data Protector

Forgive the boring title for this post, but I do think that this is a really important topic, and one that I had to deal with recently at work.  Somehow one of our Exchange mailbox databases became corrupted and one of our users lost a ton of email, which I’m almost 100% sure was related to the outage catastrophe we experienced 1.5 weeks ago.  This event made me thank the Flying Spaghetti Monster that I was getting good backups from our (sometimes shaky) backup solution, Data Protector.  Anyway, for this topic I will just assume that you are getting backups from some backup solution; which one isn’t all that important, because the majority of this post covers the procedure within Exchange itself, so you can take bits and pieces and apply them where you need to.

Before I go any further, it is always worth mentioning: make sure you are getting good backups!

Ok, now that we have that out of the way I will show you the basic restore procedure within the Data Protector environment.  Select the Restore option from the drop-down list -> MS Exchange 2010 Server.

Then select the source to restore (whichever database needs to be restored).  Within Data Protector, specify the restore options that you would like.

These are the options I used most recently.

  • Restore method: Restores files to temporary location
  • Backup version: Whichever data you decide you need to roll back to
  • Restore chain: Restore only this backup
  • Target client: Select the mailbox server that you want to restore to
  • Restore into location: This can be any location, just make sure there is enough disk space.
  • Select Restore database file only

Once you have chosen your restore options, click the restore button to begin the restore procedure.

Once the database has been restored with Data Protector

Now for the fun stuff.  This is the part that I’m guessing most will probably be concerned with, but I didn’t want to leave out my Data Protector peeps.  Open up an Exchange Management Shell on the mailbox server that you restored your database to.  Technically it can be from any server as long as you connect to the correct mailbox server I guess.  Anyway, rename your restored database to something like “recoverydb.edb”.  Change directories into the restore folder, then check the status of the newly restored database with the following command:

eseutil /mh recoverydb.edb

The output includes a State line, which will read either Clean Shutdown or Dirty Shutdown.

If it shows Clean Shutdown you can skip ahead to creating the recovery database.  If it shows Dirty Shutdown then, since we didn’t bring any log files down with us in this restore, we will need to run a hard repair on this database using the following command:

eseutil /p recoverydb.edb

After running the repair you should get a clean shutdown state if you check again (eseutil /mh).

Now we need to create a recovery database for Exchange to recover this data from.

New-MailboxDatabase -Recovery -Name "recoverydb" -Server Mailbox1 -EdbFilePath "M:\recovery\recoverydb.edb" -LogFolderPath "M:\recovery" -Verbose

It is important that when you create your recovery database it matches the renamed .edb file.  So since I renamed my recovery database to recoverydb.edb, I used recoverydb in the Powershell command.  If you want to check to make sure this step was done properly, use the following command to verify that the database is roughly the size you are expecting it to be:

Get-MailboxDatabase -status | select Servername,Name,DatabaseSize

After everything looks good we mount our database.

Mount-Database recoverydb

Just to verify that the database has stuff in it and we can find the person we’re looking for, we will take a quick look at the database contents, as shown below.

Get-MailboxStatistics -Database recoverydb

It looks like there are users there so all we need to do now is dump their emails into a temporary/recovery account in Exchange with the following command:

Restore-Mailbox -RecoveryMailbox "user_to_recover" -Identity "temporary_account" -RecoveryDatabase recoverydb -TargetFolder "RecoveredItems"

  • -RecoveryMailbox is the user mailbox that we are pulling data from, the source mailbox
  • -Identity is the user mailbox that we are putting data into, the destination mailbox
  • -RecoveryDatabase is our newly created recoverydb
  • -TargetFolder is a folder that will be created in the target mailbox to house the recovered items
  • -Verbose (optional) prints debugging information if there is a problem anywhere in the process

The wording and syntax of this command is a little bit tricky.  Just remember that -RecoveryMailbox signifies the backup location and -Identity signifies the restore location.  After this process completes (it could take a while depending on the mailbox size) you should be able to log in to the temporary account and take a look at the newly created “RecoveredItems” folder, into which the contents of the user mailbox we are restoring have been copied.

Once this is done, just right click the target mailbox (temporary_account), click Manage Full Access Permission and give the restore mailbox (user_to_recover) full permissions through the Exchange Console so you can copy over messages, etc. in Outlook.

This step can be done any number of different ways, but I chose this method because I was most concerned with doing it the safest way.  You could, for example, copy the contents directly into the user mailbox if you wanted.  Another option would be to export the contents of the temporary user out into a .pst file (the file path has to be a UNC path to a network share), with something like the following:

New-MailboxExportRequest -Mailbox temporary_account -FilePath \recovery.pst

That should be it.  After you are done and the emails have all been recovered, be sure to dismount the recovery database and delete the files to free space back up on your mailbox server.

Resources:

http://blogs.perficient.com/microsoft/2011/02/working-with-exchange-2010-recovery-databases-2/
http://www.mikepfeiffer.net/2011/07/restoring-mailbox-data-from-a-recovery-database
http://pmirmand.files.wordpress.com/2011/08/restore-database-on-exchange-2010.docx


Sending Test Emails with Telnet

I’d like to talk quickly about a great and underutilized method for troubleshooting email flow problems.  Today I had to rebuild an Exchange Hub Transport server after a slight catastrophe from last week in which the VM the Hub lived on was completely unrecoverable.  That is another story but it brings up the need for using a great tool that is often skipped over, and that is sending test email via telnet.

The reason I say that this method is underutilized is because, well, who uses telnet these days?  What’s great about using this is that you can test different aspects and essentially pinpoint where mail flow issues are occurring.  In my case I was having trouble relaying email from an internal account to outside mail servers.  So let’s jump into how to use this tool.  It’s easy, but I feel like not enough people know about it, so here we go.

First, since I was testing from inside, I need to connect to the local server name.

telnet hubserver.psa.local 25

Easy enough, we are using telnet to connect to the hub server, hubserver.psa.local on port 25 (SMTP).  Once we get in we run a simple,

ehlo

That gives us back a little bit of information, basically telling us that this is an email server and some of its capabilities.  Next, we will need to run through the following set of commands to send out the test email.  It is important that these commands are entered in exactly, with no backspaces, otherwise it will break the command and you will get an error message spit back out from your telnet session.

MAIL FROM: (internal sender address)
RCPT TO: (external recipient address)
DATA
SUBJECT:

message content.

.
QUIT

  • MAIL FROM: This is telling the mail server who this message is being sent from.  The address after it is the internal mail sender I was using.
  • RCPT TO: Tells the mail server the email address that is being sent to.  It can be any of your internet based mail addresses (google, yahoo, etc.).
  • DATA signifies the start of the message itself: the subject line first, then the body.
  • SUBJECT: This line is optional, but it is probably a good idea to include a subject so the message doesn’t get blocked or sent to spam.  Hit enter twice after this to drop into the message content.
  • message content is whatever you want to include in your message.  Follow your message by hitting enter.
  • “.” (read dot) on a line by itself tells the mail server that the message is finished and can be sent.  It is the end-of-message marker for SMTP.
  • QUIT leaves the telnet session from the mail server.

It is important that the previous set of commands is run exactly the way that they look.  Assuming everything is working properly, the mail server will acknowledge each command in turn as you enter it.

In my case, I was unable to enter an address for the RCPT TO: command.  The fix, along with a few other steps in rebuilding the hub, was to grant anonymous send permission on the Exchange side of things.  After that, mail began flowing through the newly rebuilt Hub Transport server perfectly.
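
Once the manual test works, the same command sequence can be scripted so it is quick to repeat.  The following is only a sketch: the function name smtp_dialogue and the example.com/example.org addresses are placeholders, and it assumes a netcat (nc) binary is available.  Different netcat versions handle the end of input differently, so treat the pipeline in the comment as a starting point rather than a guaranteed recipe.

```shell
#!/bin/sh
# Hypothetical helper: print the SMTP commands for one test message, each line
# ending in CRLF as SMTP expects.
#   $1 = sender, $2 = recipient, $3 = subject, $4 = one-line message body,
#   $5 = name to announce in EHLO (defaults to localhost)
smtp_dialogue() {
    printf 'EHLO %s\r\n' "${5:-localhost}"
    printf 'MAIL FROM:<%s>\r\n' "$1"
    printf 'RCPT TO:<%s>\r\n' "$2"
    printf 'DATA\r\n'
    printf 'Subject: %s\r\n\r\n' "$3"
    printf '%s\r\n' "$4"
    printf '.\r\n'
    printf 'QUIT\r\n'
}

# Example: feed the commands to the server one per second, so it has time to
# answer each one before the next arrives.
#   smtp_dialogue sender@example.com recipient@example.org 'telnet test' 'message content.' |
#       while IFS= read -r line; do printf '%s\n' "$line"; sleep 1; done |
#       nc hubserver.psa.local 25
```

Typing the commands by hand in telnet is still the better way to see exactly which step the server objects to; the script is just a convenience for repeat tests.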

That should be it.  I highly suggest going through the process of sending out a few test emails to get this method stuck in your brain for later on down the road, in case you ever have to do any mail flow troubleshooting.  Good luck!
