Document Storage: Part 5

Document Storage Project

This is Part 5: Uploading Scanned Images.

There are two components to this part: configuring somewhere for the files to be uploaded to, and setting up your MFD to upload to it. Most modern MFDs will upload to a CIFS share, which is what we’re going to use here. First things first, we need to install Samba:

apt-get install samba

Now we need to set up Samba. We’ll have user-level security (it’ll be much easier to lock things down if we want to increase security at a later date, and besides share-level security went out with the Ark) and a single share called incoming. We also need a user for the MFD to log into Samba with; we’ll call this user “scanner”. We’ll also have a group called “scanner” so we can be a little more flexible over who can access this share should we wish.

Edit /etc/samba/smb.conf as follows:

......

# "security = user" is always a good idea. This will require a Unix account
# in this server for every user accessing the server. See
# /usr/share/doc/samba-doc/htmldocs/Samba3-HOWTO/ServerType.html
# in the samba-doc package for details.
   security = user

......

[incoming]
        path = /home/incoming
        guest ok = no
        browseable = no
        read only = no
        valid users = @scanner
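
Before going any further it may be worth letting Samba check the edited file for you. The sketch below wraps Samba's testparm utility, which parses smb.conf and reports problems; check_smb_conf is a made-up helper name, and the default path is simply the one used above. It is written for bash.

```shell
# Hypothetical helper: ask Samba's testparm to parse the given smb.conf.
# -s stops testparm from waiting for Enter before dumping the configuration;
# the dump itself is discarded, so only warnings and errors are shown.
check_smb_conf() {
    local conf=${1:-/etc/samba/smb.conf}
    if testparm -s "$conf" >/dev/null; then
        echo "$conf parses cleanly"
    else
        echo "$conf has errors; fix them before restarting Samba" >&2
        return 1
    fi
}
```

Run it as check_smb_conf with no arguments to check /etc/samba/smb.conf.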

Now, we need a new user for the MFD. Samba requires that users also have corresponding Unix accounts, so first we create a Unix account, then we set their Samba password. We also need to ensure the permissions on /home/incoming are correct – the following commands deal with this:

  useradd scanner
  smbpasswd -a scanner
  chgrp scanner /home/incoming
  chmod g+rwx /home/incoming
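
To double-check the result, something like the following could be used. It is only a sketch: check_share_dir is a made-up helper name, and it assumes bash and GNU stat (as shipped with Debian).

```shell
# Hypothetical helper (not part of Samba): succeed only if DIR is group-owned
# by GROUP and that group has read, write and execute permission on it.
check_share_dir() {
    local dir=$1 group=$2
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }

    local owner_group mode group_bits
    owner_group=$(stat -c '%G' "$dir")   # name of the group owning DIR
    mode=$(stat -c '%a' "$dir")          # octal mode, e.g. 775
    group_bits=$(( (0$mode >> 3) & 7 ))  # middle digit: the group's rwx bits

    if [ "$owner_group" != "$group" ]; then
        echo "$dir belongs to group $owner_group, expected $group" >&2
        return 1
    fi
    if [ "$group_bits" -ne 7 ]; then
        echo "$dir has mode $mode; the group lacks rwx" >&2
        return 1
    fi
    echo "$dir looks OK for group $group"
}
```

For the setup described here the call would be check_share_dir /home/incoming scanner.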

Make sure you choose a password that is not only secure, but possible to type in on your MFD! Check this works by connecting to the following folder in Windows:

\\(hostname)\incoming

You’ll need to use the username/password for the scanner user you set up.

For the final part of this, you need to set up your MFD to scan to this directory.

I’ve chosen an Oki MB451 multifunction unit for a number of reasons:

  • It’s cheap.
  • It has a double-sided document feeder for scanning. More and more documents are being sent double-sided; it seems like a step back to have a document feeder that can’t deal with this.
  • It supports scanning directly to email and CIFS share without requiring extra software on the PC. (This is important; certainly a few years ago a lot of manufacturers claimed their products could do this but it wasn’t apparent until after you’d taken it out of the box that their product didn’t do any of it without additional software on your PC. Certain large photocopier-type units still have this restriction, though sometimes you can buy an optional bolt-on to overcome it. I prefer avoiding the need for extra bolt-ons because they’re usually extortionately priced and often difficult to source).
  • It has a nice big display. These units can be a pig to set up at the best of times; a large display often goes some way to alleviate this problem.
  • You can set up lots of profiles – preconfigured shortcuts that say “everything scanned under this profile should be stored under this name in this share accessed with this username and password; files should have this format”. Unfortunately you can’t nail a profile to say “everything scanned under this profile is double-sided” but you can’t have everything!
  • The printer supports Postscript, which means it’ll be pretty much guaranteed to work under any OS I can throw at it for a long time to come.

I won’t go into detail regarding MFD configuration – there are simply too many on the market and they all vary. It’s enough to explain that I’ve set up a profile called “Correspondence” and I’ve pointed it at \\(hostname)\incoming.

With the profile I’ve set up, scanned documents will be stored under \\(hostname)\incoming\Correspondence-#####.pdf.

Test this all works by scanning a document and making sure it appears in the /home/incoming directory on your Linux box.
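
If you would rather check from a shell than eyeball the directory, a small helper along these lines may do. It is a sketch only: recent_scans is a made-up name, the "Correspondence-" prefix is simply the profile name used in this post, and it assumes bash and GNU find.

```shell
# Hypothetical helper: list scans that arrived in DIR within the last MINUTES
# minutes. Adjust the filename pattern to whatever your own MFD profile
# produces.
recent_scans() {
    local dir=${1:-/home/incoming} minutes=${2:-5}
    # -mmin -N matches files modified less than N minutes ago
    find "$dir" -maxdepth 1 -type f -name 'Correspondence-*.pdf' -mmin "-$minutes"
}
```

Scan a page, then run recent_scans; the new PDF should be listed. Pass a directory and a number of minutes to look elsewhere or further back.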

There’s only one thing left to do – tie all this together so incoming documents are automatically OCR’d so they’re indexable in Sphider, and made available via Apache….


Becoming a better sysadmin

I typically don’t focus on philosophical topics or the more abstract subjects, but recently I have been reading  up on the topic of self improvement and wanted to take some time today to lay out and develop some of the key concepts and ideas that I have found to be helpful so far.  Hopefully some of these ideas can be used to help you improve as well in the world of system administration and other future career endeavors.

So this post is going to be more of a work in progress than anything else, since I really just wanted to get some of this stuff written down in order to clear it out of my head.  There are literally books that have been written on self improvement and learning strategies so my goal with this isn’t to get every single detail; I just want to hit the high points and their application to system administration.  Here’s what I have so far. Feel free to let me know what I’m missing or throw in anything else that might be particularly useful on this subject.

Explicit vs Tacit knowledge

Explicit knowledge can be defined as that gained from books or listening to a lecture.  Basically some form of reading or auditory resource.

Tacit knowledge can be defined as that gained from experience, action and practice.

I’d like to start off by making a distinction between different types of knowledge.  I believe that the practice of system administration relies heavily on both types and just one type of knowledge is not enough to be great in this field.  They work hand in hand.  So for example, reading a ton of books, while useful in its own right, will not be nearly as effective as reading books and then applying the knowledge gained through hands-on experience.  Likewise, if somebody never bothers to pick up a book and relies entirely on hands-on experience they will not be as knowledgeable as someone who incorporates both types of knowledge.  That said, I do feel that much more can be learned from hands-on experience in the field of system administration than from books alone.

Types of learning

There has been a good deal of research done on this subject but for the purposes of this post I would like to boil this all down to what are considered the three primary or main styles of learning.  The reason I want to focus on these is that they seem to work hand in hand with explicit and tacit knowledge and can be described a bit more easily.  Each one of these different styles represents a different sort of idiom to the learning experience.  So here they are:

  • Visual – Learning by watching or reading.
  • Auditory – Learning by listening.
  • Kinesthetic – Learning from experience, hands on.

I would argue that employing a good variety of learning and study methods would be the most appropriate way to develop your skills as a sysadmin.  But even in my own experiences with learning styles I have realized that I tend to favor a kinesthetic learning approach, and I’m sure others have their own preferences as well.  Instead of saying that one is better than another, I would suggest employing all of these types.  Take a look at yourself and figure out how you learn best, decide which methods are the most and least helpful, and then decide how to make these styles work to your advantage.  For example, I feel that I am a weak reader.  While I know that reading is important I tend to spend the least amount of time doing just reading if at all possible.  Having a piece of reading material as a reference or as an introduction is great.  If I don’t quite understand things from reading, the next step I like to take is internalizing things by listening or watching.  Finally, once I get a good enough idea about a topic I like to quickly put things into my own experiences.  There is some quote about how experience sticks but I am too lazy to look it up.  Suffice it to say, I tend to remember things much more concretely when I am able to experience them for myself.

Again, this is just in my own experience and everybody is different.  I just wanted to give a specific example of one way to utilize different styles of learning.  There are many other possibilities and this just happens to be the way I prefer to learn things.

Learning strategies

Now that we have that out of the way, I want to highlight some of the major tactics that I use when attempting to learn a new subject.  I definitely use some of these more than others but the point is that you should attempt to utilize as many as you can for your own benefit.  Here are some different strategies I came up with that help me greatly when I encounter new and difficult-to-understand information.  Many of these work together or in tandem so they may be described more than once.

The Feynman technique – This is as close to the end all be all that there is when it comes to learning.  Everybody is probably familiar with this one, but I am guessing they are not familiar with the name.  The technique is to explain or go through a topic as if you were teaching it to somebody else who was just learning about it for the first time.  This basically forces you to know what you’re talking about.  If you get stuck when trying to explain a particular concept or idea, make a note of what you are struggling with and research and relearn the material until you can confidently explain it.  You should be able to explain the subject simply; if your explanations are wordy or convoluted you probably don’t understand it as well as you think.

Reading – I usually like to get an introduction to a topic by reading up on (and bookmarking) what information I feel to be the most informed, whether it be official documentation, RFCs, books, magazines, respected blogs and authors, etc.  As I mentioned before, I would consider myself a weak reader (something that I definitely need to improve on!) so I also like to take very brief notes when something I read seems like it would be useful so I can try it out for myself.

Watching/Listening to others – After getting a good idea from reading about a subject I always like to reinforce this by watching demonstrations or videos, or listening to podcasts, lectures or anything else that will give me a better idea of how to do something.  A long drive, for example, is a great time to put on a podcast.  It kills time as well as improves knowledge at the cost of nothing.  Very efficient!  The same goes for videos and demonstrations; the only thing holding you back is motivation.

Try things for yourself – Sometimes this can be the most difficult approach but it can also be the most rewarding; there is nothing better than learning things the hard way.  Try things out for yourself in a lab or anywhere that you can practice the concepts that you are attempting to learn and understand.

Take notes – This is important for your own understanding of how things work in a way that you can internalize.  I will take notes on simple things like commands I won’t remember, related topics and concepts, or even just quickly jot down keywords to Google later on.  This goes hand in hand with the reading technique described above; just jotting down very simple, brief notes can be really useful.

Communicate with others – There are plenty of resources out there for getting help and for communicating and discussing what you learn with others.  I would suggest looking at /r/sysadmin as a starting point.  IRC channels are another great place to ask questions and get help; there are channels for pretty much any subject you can think of out there.  There are good sysadmin related channels at irc.freenode.net. If you don’t already utilize IRC I highly suggest you take a look.

Come back later – Give your brain some time to start digesting some of the information and to take a step back and put the pieces together to begin creating a bigger picture.  I can’t count how many times I have been working on learning a new concept or subject and felt overwhelmed and stuck until I took a break, did something completely different or thought about something else entirely and came back to the subject later on with a fresh perspective.   Sometimes these difficult subjects just take time to fully understand so taking breaks and clearing your head can be very useful.

Sleep on it – Have you ever heard of the term before?  This may sound crazy but sometimes if there is a particular problem that I can’t solve I will often times think about it before I go to sleep.  I find that by blocking out all outside interference and noise I can much more easily think about it, come up with fresh perspectives and ideas and often times will wake up with an answer the next morning.  I think meditation is comparable to this but I know nothing about meditation (I hope to at some point!) so I have to use this method for the time being.

Break stuff – One of the best ways to incorporate a number of these techniques is to intentionally break stuff in your own setups.  Triple check to be sure that you aren’t breaking anything important first and then go ahead and give it a try.  By forcing yourself to fix things that are broken you develop a much deeper and more intimate relationship with the way things work, why they work the way that they do and how things get broken to begin with.  The great thing about using this method is that it is almost always useful for something in the future, whether it be the troubleshooting skills, the Googling skills or the specific knowledge in the particular area that needed to be fixed.

Practice, practice, practice – The more I read about becoming better at something the more I am convinced that you have to practice like an absolute maniac.  I think for system administration this can partially come from practical job experience but it also comes from dedicated study and lab time.  The hands on component is where most of your practice will come from and becoming better doesn’t just happen, it takes cultivation and time, just like with any other skill.  Stick with it and never stop learning and improving on your skills through practice and experience.


Document Storage: Part 4

Document Storage Project

This is Part 4: Indexing the storage. Indexing in this context is the process of making the storage searchable – so we can just have a simple text box we type search terms in and get results. We’re not talking about the Apache index we set up in Part 3. There are all sorts of free projects that do this, but alas most of them just provide you with a library that you can integrate into your own programming project. We don’t want a library, we want a pre-cooked system including database, indexing and an interface we can punch words into and get search results back. The product I’ve found is called Sphider Plus. This is a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online – as it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because there are some interesting features listed in the newer version and it’s cheap enough. Download and unzip sphider-plus:

cd /home/http/search
unzip /path/to/sphider-plus.zip

Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:

mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit

In settings/database.php, change the following lines as appropriate:

	$database="sphider-plus";
	$mysql_user = "sphider-plus";
	$mysql_password = "(PASSWORD YOU USED ABOVE)"; 
	$mysql_host = "localhost";
	$mysql_table_prefix = "";

We want to be able to index PDFs, so edit the line in settings/conf.php:

//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';

Edit /home/http/search/converter/pdftotext as follows:

#!/bin/sh
/home/http/search/converter/pdftotext.script "$1" -

Set up permissions appropriately:

find /home/http/search -type d -print0 | xargs -0 chmod 700
find /home/http/search -type f -print0 | xargs -0 chmod 600
chown -R www-data /home/http/search
chmod 700 /home/http/search/converter/pdftotext
chmod 700 /home/http/search/converter/pdftotext.script
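
To confirm the permissions ended up as intended, a check along the following lines might help. This is a sketch rather than anything Sphider provides: list_unexpected_modes is a made-up name, and it assumes bash and GNU find.

```shell
# Hypothetical helper: print anything under ROOT that does not have the modes
# set above, i.e. directories not at 700 and regular files at neither 600 nor
# 700 (700 is allowed so the executable converter scripts are not reported).
list_unexpected_modes() {
    local root=$1
    find "$root" \( -type d ! -perm 700 \) -o \
                 \( -type f ! -perm 600 ! -perm 700 \)
}
```

Running list_unexpected_modes /home/http/search should print nothing if everything is locked down as described.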

Point your web browser at:

http://(hostname).local/search

and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface and tell it to start indexing:

http://(hostname).local

Assuming everything works, you should now be able to search for PDFs containing text at:

http://(hostname).local/search/search.php

Look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues. Configure Sphider Plus to your liking.

You now have a search interface! But it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:

          <div id="commandbar">
                <a href="/" id="home">home</a>
                                <a href="../" id="parent">up</a>
                <a href="#" id="refresh">refresh</a>
                <form action="/search/search.php" method="get" id="search">
                        <input type="text" name="query" id="query" size="40" value="" columns="2" autocomplete="off" delay="500">
                        <input type="hidden" name="search" value="1">
                        <input type="Submit" value="Search">
                        Show   
                        <select name='results'>
                                <option >5</option>
                                <option >10</option>
                                <option selected>20</option>
                                <option >30</option>
                                <option >50</option>
                        </select>
                        results per page
                </form>
                </div>
                        <div id="files">
                        <h2>

You’ll also want to change the stylesheet, /home/http/assets/style.css. Find the parts that relate to #commandbar and change them as follows:

/* command = commandbar button, is active, is greyed out */ 
#commandbar a, #commandbar form { 
background-position: left center; 
background-repeat: no-repeat; 
font-size: 82.5%; 
margin-left: 0.4em; 
margin-right: 0px; 
margin-top: 0px; 
margin-bottom: 0px; 
padding-left: 22px; 
padding-right: 1.0em; 
padding-top: 0.4em; 
padding-bottom: 0.6em; 
line-height: 1.5em; 
color: #555555; 
border-right: 2px dotted #D5E0E0; 
}

/* common commands */ 
#commandbar #parent { 
background-image: url('/assets/icons/parent.gif'); 
}

#commandbar input,#search,#querySuggestList { 
display: inline; 
}

#commandbar #refresh { 
background-image: url('/assets/icons/refresh.gif'); 
}

If you point your browser at http://(hostname).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this. However, right now all we have is an index to a single PDF we put up for testing. We need some means of scanning pages and uploading them to this server – the subject of our next post….


Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* . 
  rmdir index-style

The stylesheet really ought to be referenced with a specific path so it can always be found. Edit /home/http/assets/header.html and change the stylesheet reference thus:

    <link rel="stylesheet" type="text/css" href="/assets/style.css" />

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents/>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets/>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search/>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.
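
One cautious way to do that is to check the configuration first and only restart if the check passes. The sketch below assumes Debian's apache2ctl and service commands and needs to be run as root; restart_apache_if_valid is a made-up name.

```shell
# Hypothetical helper: run Apache's syntax check and restart only if it
# passes, so a typo in the site file does not take the web server down.
restart_apache_if_valid() {
    if apache2ctl configtest; then
        service apache2 restart
    else
        echo "Apache configuration has errors; not restarting" >&2
        return 1
    fi
}
```

If the syntax check fails, Apache keeps running with its old configuration while you fix the file.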

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see a styled directory listing containing your document.

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But they’re a topic for a future post…


Document Storage: Part 2

Document Storage Project

This is Part 2: Setting up our base system.

I’m assuming that readers are already reasonably familiar with Linux and can generally find their way around OK. If I didn’t assume that, this set of instructions would probably wind up becoming a book!

I’m keeping it simple here by installing this on a spare PC I have hanging around. Things would be a little more complicated if this was on a shared host or a virtual server in a datacentre, but that’s beyond the scope of this project.

Install a base Debian Wheezy system. At the time of writing this is the “Testing” branch, which I wouldn’t ever deploy to a client. But this project is for me personally so I’m rather less bothered. You don’t need any extra software, so untick as much as you can.

Give the bulk of the disk space over to /home; keep 15-20 GB left over for /var.

In /home, create the following directories:

http
http/assets
http/search
http/documents
scripts
incoming
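
The layout above can be created in one go with something like the following. It is a minimal sketch: make_layout is a made-up name, and the base directory defaults to /home as in this post.

```shell
# Hypothetical helper: create the directory layout listed above under BASE
# (default: /home). mkdir -p creates missing parents and does nothing for
# directories that already exist.
make_layout() {
    local base=${1:-/home}
    mkdir -p "$base/http/assets" \
             "$base/http/search" \
             "$base/http/documents" \
             "$base/scripts" \
             "$base/incoming"
}
```

Run make_layout as root to create everything under /home, or pass another directory to try it out somewhere harmless first.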

Run the following command to install the software we’ll need:

apt-get install tesseract-ocr bzip2 make ocaml gawk apache2 unzip php5 zip php5-gd mysql-server php5-mysql subversion inotify-tools imagemagick ghostscript exactimage openssh-server avahi-daemon

We now have:

  1. A Linux box running Apache – and we shouldn’t even need DNS if we’re on the same subnet. Check it works by typing http://(hostname).local into your web browser.
  2. Directories for our static HTML, the search interface, the document repository, incoming files awaiting OCR and the scripts that will carry out the OCR work.
  3. Most of the software we’re going to need. There are one or two things missing, but they’re so trivial that it’s hardly worth losing any sleep over them.

We still need:

  1. To configure Apache to act as our file browser.
  2. To integrate search functionality.
  3. To sort out the scripts that are going to OCR incoming files.
