Document Storage: Part 5

Document Storage Project

This is Part 5: Uploading Scanned Images.

There are two components to this part: configuring somewhere for the files to be uploaded to, and setting up your MFD to upload to it. Most modern MFDs will upload to a CIFS share, which is what we’re going to use here. First things first, we need to install Samba:

apt-get install samba

Now we need to set up Samba. We’ll have user-level security (it’ll be much easier to lock things down if we want to increase security at a later date, and besides share-level security went out with the Ark) and a single share called incoming. We also need a user for the MFD to log into Samba with; we’ll call this user “scanner”. We’ll also have a group called “scanner” so we can be a little more flexible over who can access this share should we wish.

Edit /etc/samba/smb.conf as follows:

......

# "security = user" is always a good idea. This will require a Unix account
# in this server for every user accessing the server. See
# /usr/share/doc/samba-doc/htmldocs/Samba3-HOWTO/ServerType.html
# in the samba-doc package for details.
   security = user

......

[incoming]
        path = /home/incoming
        guest ok = no
        browseable = no
        read only = no
        valid users = @scanner

Now, we need a new user for the MFD. Samba requires that users also have corresponding Unix accounts, so first we create a Unix account, then we set their Samba password. We also need to ensure the permissions on /home/incoming are correct – the following commands deal with this:

  useradd scanner
  smbpasswd -a scanner
  mkdir -p /home/incoming
  chgrp scanner /home/incoming
  chmod g+rwx /home/incoming

Make sure you choose a password that is not only secure, but possible to type in on your MFD! Check this works by connecting to the following folder in Windows:

\\(hostname)\incoming

You’ll need to use the username/password for the scanner user you set up.
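
One way to follow the advice above about a password that is both secure and typeable on an MFD panel is to generate one from a restricted alphabet. The sketch below is only an illustration: the helper name mfd_password and the choice of alphabet (lowercase letters and digits, minus the easily confused l, o, 0 and 1) are assumptions made here, not something any particular MFD requires.

```shell
#!/bin/sh
# mfd_password [LENGTH]
# Prints a random password of LENGTH characters (default 16) drawn from
# lowercase letters and digits, skipping look-alikes (l, o, 0, 1).
# Hypothetical helper name, used for illustration only.
mfd_password() {
    length=${1:-16}
    # tr -dc deletes every byte that is NOT in the allowed set.
    LC_ALL=C tr -dc 'a-km-np-z2-9' < /dev/urandom | head -c "$length"
    echo
}

mfd_password 16
```

The alphabet has 32 symbols, so each character carries 5 bits and a 16-character password gives about 80 bits of randomness. Type the result in when smbpasswd prompts for the scanner user’s password.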

For the final part of this, you need to set up your MFD to scan to this directory.

I’ve chosen an Oki MB451 multifunction unit for a number of reasons:

  • It’s cheap.
  • It has a double-sided document feeder for scanning. More and more documents are being sent double-sided; it seems like a step back to have a document feeder that can’t deal with this.
  • It supports scanning directly to email and CIFS share without requiring extra software on the PC. (This is important; certainly a few years ago a lot of manufacturers claimed their products could do this but it wasn’t apparent until after you’d taken it out of the box that their product didn’t do any of it without additional software on your PC. Certain large photocopier-type units still have this restriction, though sometimes you can buy an optional bolt-on to overcome it. I prefer avoiding the need for extra bolt-ons because they’re usually extortionately priced and often difficult to source).
  • It has a nice big display. These units can be a pig to set up at the best of times; a large display often goes some way to alleviate this problem.
  • You can set up lots of profiles – preconfigured shortcuts that say “everything scanned under this profile should be stored under this name in this share accessed with this username and password; files should have this format”. Unfortunately you can’t nail a profile to say “everything scanned under this profile is double-sided” but you can’t have everything!
  • The printer supports Postscript, which means it’ll be pretty much guaranteed to work under any OS I can throw at it for a long time to come.

I won’t go into detail regarding MFD configuration – there are simply too many on the market and they all vary. It’s enough to explain that I’ve set up a profile called “Correspondence” and I’ve pointed it at \\(hostname)\incoming.

With the profile I’ve set up, scanned documents will be stored under \\(hostname)\incoming\Correspondence-#####.pdf.

Test this all works by scanning a document and making sure it appears in the /home/incoming directory on your Linux box.
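
To check that a scan arrived without browsing the directory by hand, something like the following could be used. It’s a minimal sketch: latest_scan is a made-up helper name, and it assumes the MFD names its files Correspondence-#####.pdf as with the profile described above.

```shell
#!/bin/sh
# latest_scan DIR [PREFIX]
# Prints the most recently modified PDF in DIR whose name starts with PREFIX
# (default "Correspondence-"); returns non-zero if there is none.
# Hypothetical helper; assumes filenames contain no newlines.
latest_scan() {
    dir=$1
    prefix=${2:-Correspondence-}
    # ls -t sorts newest first; the error for "no match" is discarded.
    newest=$(ls -t "$dir"/"$prefix"*.pdf 2>/dev/null | head -n 1)
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# On the server from this series:
#   latest_scan /home/incoming
```

Run it just after scanning a page; if it prints nothing, the MFD’s share path or credentials are the first things to check.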

There’s only one thing left to do – tie all this together so incoming documents are automatically OCR’d so they’re indexable in Sphider, and made available via Apache….

About the Author: James Cort

James Cort is Managing Director of Bediwin Information Services, providing IT management and integration services in the South West of England.

Becoming a better sysadmin

I typically don’t focus on philosophical topics or the more abstract subjects, but recently I have been reading  up on the topic of self improvement and wanted to take some time today to lay out and develop some of the key concepts and ideas that I have found to be helpful so far.  Hopefully some of these ideas can be used to help you improve as well in the world of system administration and other future career endeavors.

So this post is going to be more of a work in progress than anything else, since I really just wanted to get some of this stuff written down in order to clear it out of my head.  There are literally books that have been written on self improvement and learning strategies, so my goal isn’t to cover every single detail; I just want to hit the high points and how they apply to system administration.  Here’s what I have so far.  Feel free to let me know what I’m missing or throw in anything else that might be particularly useful on this subject.

Explicit vs Tacit knowledge

Explicit knowledge can be defined as that gained from books or listening to a lecture.  Basically some form of reading or auditory resource.

Tacit knowledge can be defined as that gained from experience, action and practice.

I’d like to start off by making a distinction between different types of knowledge.  I believe that the practice of system administration relies heavily on both types, and just one type is not enough to be great in this field.  They work hand in hand.  So for example, reading a ton of books, while useful in its own right, will not be nearly as effective as reading books and then applying the knowledge gained through hands-on experience.  Likewise, somebody who never bothers to pick up a book and relies entirely on hands-on experience will not be as knowledgeable as someone who incorporates both types of knowledge.  That said, I do feel that much more can be learned from hands-on experience in the field of system administration than from books alone.

Types of learning

There has been a good deal of research done on this subject but for the purposes of this post I would like to boil this all down to what are considered the three primary or main styles of learning.  The reason I want to focus on these is that they seem to work hand in hand with explicit and tacit knowledge and can be described a bit more easily.  Each one of these different styles represents a different sort of idiom to the learning experience.  So here they are:

  • Visual – Learning by watching or reading.
  • Auditory – Learning by listening.
  • Kinesthetic – Learning from experience, hands on.

I would argue that employing a good variety of learning and study methods would be the most appropriate way to develop your skills as a sysadmin.  But even in my own experiences with learning styles I have realized that I tend to favor a kinesthetic learning approach, and I’m sure others have their own preferences as well.  Instead of saying that one is better than another, I would suggest employing all of these types.  Take a look at yourself, figure out how you learn best, decide which methods are the most and least helpful, and then work out how to make these styles work to your advantage.  For example, I feel that I am a weak reader.  While I know that reading is important, I tend to spend as little time as possible on reading alone.  Having a piece of reading material as a reference or as an introduction is great.  If I don’t quite understand things from reading, the next step I like to take is internalizing things by listening or watching.  Finally, once I get a good enough idea about a topic I like to quickly put things into practice for myself.  There is some quote about how experience sticks but I am too lazy to look it up.  Suffice it to say, I tend to remember things much more concretely when I am able to experience them for myself.

Again, this is just in my own experience and everybody is different.  I just wanted to give a specific example of one way to utilize different styles of learning.  There are many other possibilities and this just happens to be the way I prefer to learn things.

Learning strategies

Now that we have that out of the way, I want to highlight some of the major tactics that I use when attempting to learn a new subject.  I definitely use some of these more than others but the point is that you should attempt to utilize as many as you can for your own benefit.  Here are some different strategies I came up with that help me greatly when I encounter new and difficult to understand information.  Many of these work together or in tandem so they may be described more than once.

The Feynman technique – This is as close to the end all be all that there is when it comes to learning.  Everybody is probably familiar with this one, but I am guessing they are not familiar with the name.  The technique is to explain or go through a topic as if you were teaching it to somebody who is learning about it for the first time.  This basically forces you to know what you’re talking about.  If you get stuck when trying to explain a particular concept or idea, make a note of what you are struggling with and research and relearn the material until you can confidently explain it.  You should be able to explain the subject simply; if your explanations are wordy or convoluted you probably don’t understand it as well as you think.

Reading – I usually like to get an introduction to a topic by reading up on (and bookmarking) the sources I feel to be the most informed, whether official documentation, RFCs, books, magazines, respected blogs and authors, etc.  As I mentioned before, I would consider myself a weak reader (something that I definitely need to improve on!) so I also like to take very brief notes when something I read seems like it would be useful, so I can try it out for myself.

Watching/Listening to others – After getting a good idea from reading about a subject I always like to reinforce it by watching demonstrations and videos or listening to podcasts, lectures or anything else that will give me a better idea of how to do something.  A long drive, for example, is a great time to put on a podcast.  It kills time as well as improves knowledge at the cost of nothing.  Very efficient!  The same goes for videos and demonstrations; the only thing holding you back is motivation.

Try things for yourself – Sometimes this can be the most difficult approach but it can definitely also be the most rewarding; there is nothing better than learning things the hard way.  Try things out for yourself in a lab or anywhere that you can practice the concepts that you are attempting to learn and understand.

Take notes – This is important for your own understanding of how things work in a way that you can internalize.  I will take notes on simple things like commands I won’t remember, related topics and concepts, or even just quickly jot down keywords to Google later on.  This goes hand in hand with the reading technique described above; just jotting down very simple, brief notes can be really useful.

Communicate with others – There are plenty of resources out there for getting help and for communicating and discussing what you learn with others.  I would suggest looking at /r/sysadmin as a starting point.  IRC channels are another great place to ask questions and get help; there are channels for pretty much any subject you can think of out there.  There are good sysadmin related channels at irc.freenode.net.  If you don’t already utilize IRC I highly suggest you take a look.

Come back later – Give your brain some time to start digesting some of the information and to take a step back and put the pieces together to begin creating a bigger picture.  I can’t count how many times I have been working on learning a new concept or subject and felt overwhelmed and stuck until I took a break, did something completely different or thought about something else entirely and came back to the subject later on with a fresh perspective.   Sometimes these difficult subjects just take time to fully understand so taking breaks and clearing your head can be very useful.

Sleep on it – Have you ever heard of the term before?  This may sound crazy, but if there is a particular problem that I can’t solve I will often think about it before I go to sleep.  I find that by blocking out all outside interference and noise I can much more easily think about it and come up with fresh perspectives and ideas, and I will often wake up with an answer the next morning.  I think meditation is comparable to this but I know nothing about meditation (I hope to at some point!) so I have to use this method for the time being.

Break stuff – One of the best ways to incorporate a number of these techniques is to intentionally break stuff in your own setups.  Triple check to be sure that you aren’t breaking anything important first and then go ahead and give it a try.  By forcing yourself to fix things that are broken you develop a much deeper and more intimate relationship with the way things work, why they work the way that they do and how things get broken to begin with.  The great thing about using this method is that it is almost always useful for something in the future, whether it be the troubleshooting skills, the Googling skills or the specific knowledge in the particular area that needed to be fixed.

Practice, practice, practice – The more I read about becoming better at something the more I am convinced that you have to practice like an absolute maniac.  I think for system administration this can partially come from practical job experience but it also comes from dedicated study and lab time.  The hands on component is where most of your practice will come from and becoming better doesn’t just happen, it takes cultivation and time, just like with any other skill.  Stick with it and never stop learning and improving on your skills through practice and experience.

About the Author: Josh Reichardt

Josh is the creator of this blog, a system administrator and a contributor to other technology communities such as /r/sysadmin and Ops School. You can also find him on Twitter and Facebook.

Document Storage: Part 4

Document Storage Project

This is Part 4: Indexing the storage.

Indexing in this context is the process of making the storage searchable, so we can just have a simple text box we type search terms in and get results. We’re not talking about the Apache index we set up in Part 3.

There are all sorts of free projects that do this, but alas most of them just provide you with a library that you can integrate into your own programming project. We don’t want a library; we want a pre-cooked system including database, indexing and an interface we can punch words into and get search results back.

The product I’ve found is called Sphider Plus. This is a paid-for fork of a project that’s more-or-less died on the vine. I’m using an older version that I found online. As it’s GPL’d there’s no legal risk to this, but I may well pay for a newer version at some point because there are some interesting features listed in the newer version and it’s cheap enough.

Download and unzip sphider-plus:

cd /home/http/search
unzip /path/to/sphider-plus.zip

Now to configure things. First off we need to log into MySQL and create a suitable database. Enter your password when prompted:

mysql -u (username) -p
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '(put a secret password in here)';
CREATE DATABASE `sphider-plus`;
GRANT ALL ON `sphider-plus`.* TO 'sphider-plus'@'localhost';
exit
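
If you’d rather script this step than type it at the mysql prompt, the statements can be generated and piped in. The following is a sketch under assumptions: the helper name sphider_sql is invented here, and the backticks and quotes are used because MySQL won’t accept a hyphenated name such as sphider-plus as a bare identifier.

```shell
#!/bin/sh
# sphider_sql PASSWORD
# Prints the SQL needed to create the sphider-plus user and database.
# Hypothetical helper; the names match the ones used in this post.
sphider_sql() {
    # Escape backslashes and double any single quotes so the password
    # stays inside its SQL string literal.
    escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e "s/'/''/g")
    cat <<EOF
CREATE USER 'sphider-plus'@'localhost' IDENTIFIED BY '$escaped';
CREATE DATABASE \`sphider-plus\`;
GRANT ALL ON \`sphider-plus\`.* TO 'sphider-plus'@'localhost';
EOF
}

# Typical use (mysql prompts for the admin password):
#   sphider_sql 'your secret password' | mysql -u (username) -p
```

Printing the statements first, without the pipe, is a cheap way to check them by eye before they touch the database.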

In the settings/database.php, change the following lines as appropriate:

	$database="sphider-plus";
	$mysql_user = "sphider-plus";
	$mysql_password = "(PASSWORD YOU USED ABOVE)"; 
	$mysql_host = "localhost";
	$mysql_table_prefix = "";

We want to be able to index PDFs, so edit the line in settings/conf.php:

//Path to PDF converter
$pdftotext_path = '/home/http/search/converter/pdftotext';

Edit /home/http/search/converter/pdftotext as follows:

#!/bin/sh
/home/http/search/converter/pdftotext.script "$1" -
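
Whether the wrapper quotes $1 matters as soon as a filename contains a space: unquoted, the shell splits it on whitespace, so a document called “my scan.pdf” would reach the converter as two arguments. The sketch below demonstrates this with a throwaway stub standing in for pdftotext.script; the stub, the wrapper file names and the sample filename are all made up for illustration.

```shell
#!/bin/sh
# Demonstrates why a wrapper should pass "$1" (quoted) to the converter.
workdir=$(mktemp -d)

# Stub converter (hypothetical): just reports how many arguments it received.
cat > "$workdir/converter.stub" <<'EOF'
#!/bin/sh
echo "$#"
EOF

# Wrapper that passes $1 unquoted.
cat > "$workdir/wrapper.unquoted" <<EOF
#!/bin/sh
sh "$workdir/converter.stub" \$1 -
EOF

# Wrapper that passes "$1" quoted.
cat > "$workdir/wrapper.quoted" <<EOF
#!/bin/sh
sh "$workdir/converter.stub" "\$1" -
EOF

unquoted_count=$(sh "$workdir/wrapper.unquoted" "my scan.pdf")
quoted_count=$(sh "$workdir/wrapper.quoted" "my scan.pdf")

echo "unquoted: $unquoted_count arguments"   # filename split in two, plus "-"
echo "quoted:   $quoted_count arguments"     # filename intact, plus "-"

rm -rf "$workdir"
```

Scanned files from the MFD profile are unlikely to contain spaces, but anything else dropped into the documents folder might.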

Set up permissions appropriately:

find /home/http/search -type d -exec chmod 700 {} +
find /home/http/search -type f -exec chmod 600 {} +
chown -R www-data /home/http/search
chmod 700 /home/http/search/converter/pdftotext
chmod 700 /home/http/search/converter/pdftotext.script
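
The directory/file permission scheme can also be applied and then verified in one go. This is a sketch only: lock_down is a hypothetical helper, and it leaves out the chown (which needs root) and the final step of re-marking the converter scripts as executable.

```shell
#!/bin/sh
# lock_down TREE
# Sets directories to 700 and regular files to 600 under TREE, then lists
# anything that still has different permissions (no output means success).
# Hypothetical helper name, for illustration only.
lock_down() {
    tree=$1
    find "$tree" -type d -exec chmod 700 {} +
    find "$tree" -type f -exec chmod 600 {} +
    # Report leftovers: directories that are not 700, files that are not 600.
    find "$tree" \( -type d ! -perm 700 \) -o \( -type f ! -perm 600 \)
}

# On the server from this series:
#   lock_down /home/http/search
```

Passing the path as find’s starting point, rather than as a -name pattern, is what makes the commands walk the whole tree.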

Point your web browser at:

http://(host name).local/search

and follow the setup prompts to configure Sphider’s database. Once you’ve done that, log into the admin interface and tell it to start indexing:

http://(host name).local

Assuming everything works, you should now be able to search for PDFs containing text at:

http://(host name).local/search/search.php

Look under Statistics/Spidering Logs in Sphider’s administrative interface if you need to troubleshoot any issues. Configure Sphider Plus to your liking.

You now have a search interface! But…. it’d be nice if we could search from the file browsing interface. Fortunately, we can. Make the following changes to /home/http/assets/header.html:

          <div id="commandbar">
                <a href="/" id="home">home</a>
                                <a href="../" id="parent">up</a>
                <a href="#" id="refresh">refresh</a>
                <form action="/search/search.php" method="get" id="search">
                        <input type="text" name="query" id="query" size="40" value="" columns="2" autocomplete="off" delay="500">
                        <input type="hidden" name="search" value="1">
                        <input type="Submit" value="Search">
                        Show   
                        <select name='results'>
                                <option >5</option>
                                <option >10</option>
                                <option selected>20</option>
                                <option >30</option>
                                <option >50</option>
                        </select>
                        results per page
                </form>
                </div>
                        <div id="files">
                        <h2>
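
The form above simply issues a GET request to /search/search.php with three parameters: query, search=1 and results. That means a search can also be linked to or bookmarked directly. The sketch below builds such a URL; search_url is a hypothetical helper and myserver.local is a placeholder host name, and only spaces are encoded, so it is not a general-purpose URL encoder.

```shell
#!/bin/sh
# search_url HOST QUERY [RESULTS]
# Prints the URL the search form would request. Only spaces are encoded
# (as "+"); other special characters would need proper URL encoding.
# Hypothetical helper name, for illustration only.
search_url() {
    host=$1
    query=$(printf '%s' "$2" | tr ' ' '+')
    results=${3:-20}
    printf 'http://%s/search/search.php?query=%s&search=1&results=%s\n' \
        "$host" "$query" "$results"
}

search_url myserver.local "gas bill" 10
# prints: http://myserver.local/search/search.php?query=gas+bill&search=1&results=10
```

The default of 20 results mirrors the option marked selected in the form.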

You’ll also want to change the stylesheet, /home/http/assets/style.css. Find the parts that relate to #commandbar and change them as follows:

/* command = commandbar button, is active, is greyed out */ 
#commandbar a,form { 
background-position: left center; 
background-repeat: no-repeat; 
font-size: 82.5%; 
margin-left: 0.4em; 
margin-right: 0px; 
margin-top: 0px; 
margin-bottom: 0px; 
padding-left: 22px; 
padding-right: 1.0em; 
padding-top: 0.4em; 
padding-bottom: 0.6em; 
line-height: 1.5em; 
color: #555555; 
border-right: 2px dotted #D5E0E0; 
}

/* common commands */ 
#commandbar #parent { 
background-image: url('/assets/icons/parent.gif'); 
}

#commandbar input,#search,#querySuggestList { 
display: inline; 
}

#commandbar #refresh { 
background-image: url('/assets/icons/refresh.gif'); 
}

If you point your browser at http://(hostname).local, you should now see a text box you can use for searching. Try it out! You should be able to search for text within PDFs with this. However, right now all we have is an index to a single PDF we put up for testing. We need some means of scanning pages and uploading them to this server – the subject of our next post….


Introduction to Weechat

Continuing our little mini series I wanted to introduce a great alternative to Irssi, called Weechat.  In this post we will go over how to get it set up and configured as closely to the way we had our workflow set up in Irssi.  This way you can evaluate both of these IRC clients for yourself and make the determination of which will suit you best for your needs.  If you haven’t taken a look at the previous posts they sort of build off of each other but if you want to compare Weechat and Irssi just take a look at Introduction to Irssi.

I’m not going to lie, the more I use Weechat the more it grows on me.  It is clean, easy to use and has some great functionality built in to it that isn’t offered out of the box in Irssi.  Small things for the most part, such as a native nicklist, colored nicks and some slick formatting to name a few.  It definitely looks and feels nice right away, without the need for customization.  One convincing argument some have mentioned for switching over to Weechat is that nearly everything is scriptable in a variety of different languages as well as very customizable.  Another nicety is the very slick script manager that gives you the ability to install and manage different scripts and plugins without any hassle, which I will discuss later in this post.

Another cool thing I learned is that the writer/creator of the program hangs out in the #weechat IRC channel on freenode pretty much all the time and will answer questions for you!  How cool is that?  I had some issues getting the buffers.pl script set up the way I wanted and (@FlashCode) was there to immediately guide me in the right direction.

Bitlbee in Weechat

Just a quick word here.  I noticed that Bitlbee behaves slightly differently in Weechat so if you are used to Irssi then you should read about how to get things working.  Nothing too major here, just a few peculiarities that I thought were important to include.  I have listed below some of the most common things to go over in Bitlbee paired with Weechat.

Connect to bitlbee.

/connect localhost

Automatically connect to Bitlbee when Weechat starts.

/server add bitlbee localhost -autoconnect

Add your gtalk account.

account add jabber accountname@gmail.com
acc 0 set password <password>
acc 0 on

Change nick to more readable form (restart for this to take effect).

acc 0 set nick_source full_name

Connect to gtalk on start (should work but still need to fix this).

/set irc.server.bitlbee.command "/msg &bitlbee identify <password>"

I couldn’t get oauth working, I will come back and update this later.

Getting used to Weechat

Now that we have all of that fun stuff out of the way we are finally ready for the meat of this post.  Let’s go ahead and install and fire up Weechat.

sudo aptitude install weechat
weechat-curses

Here is how to set some of your defaults in Weechat; most of these are pretty intuitive.

/server add freenode irc.freenode.net
/set aspell.check.enabled on
/set irc.server.freenode.nicks "username, username_"
/set irc.server.freenode.username "username"
/set irc.server.freenode.realname "first and last"
/set irc.server.freenode.autoconnect on

Identify nick on server after connecting.

/set irc.server.freenode.command "/msg nickserv identify <password>"

Autojoin favorite channels after connecting to your IRC server.

/set irc.server.freenode.autojoin "##/r/sysadmin,#channel2"

Connect to your newly created IRC alias.

/connect freenode

As you can see, out of the box Weechat offers some very nice features.  The only thing that I found annoying/frustrating in Weechat was that there was no way to manage your different windows.  Remember, in Irssi this was done with the addition of the adv_windowlist.pl script.  In Weechat things are a tad different but there is a script to manage these windows (they are referred to as buffers in Weechat).

Installing the Weeget script manager (This has been deprecated)

Instead of using the “weeget.py” method, use the “/script” command.  It is baked in to Weechat now, so installing plugins is super easy.

For a list of all available plugins visit the Weechat script page.

On older versions of Weechat without “/script”, the first step is to get a nice little script that will allow us to install this script and others.  It is called weeget and should be installed first.  I HIGHLY suggest getting this up and working; it will save you pain and misery down the road, trust me.

cd ~/.weechat/python
wget http://www.weechat.org/files/scripts/weeget.py

We need to restart Weechat for this script to get picked up and take effect.

Install buffers.pl

Once that is done, we will install our new window manager script with the following command inside Weechat.

/script install buffers.pl

This will give you a nice pretty list of all your open windows and conversations on the left side of Weechat.

That’s okay but we can make this better!  Here is how to stick the bar to either the top or the bottom of your console and to fill it with columns.

/set weechat.bar.buffers.position top (or bottom)
/set weechat.bar.buffers.filling_top_bottom columns_vertical

Install iset

Another slick script to install is the iset.pl script.  This makes changing settings much easier as it adds a lightweight interface and description to all the different options and settings.  Installation is easy with the script manager.

/script install iset.pl

There is one last script that users may like, called buffer_autoclose.py.  This will close any inactive buffers that remain open but don’t need to be, and it begins working automatically once it is installed.  This cleans up the buffer list and helps to improve the look and feel.  It is super easy to install with the script manager.

Install buffer_autoclose

/script install buffer_autoclose.py

And here is what our final product looks like.  Slightly different than irssi and I haven’t really gotten into custom themes or any of that jazz but it has really begun to shape up.

If you have any tips or tricks on how to improve this environment let me know.  I am new to Weechat and am still discovering all of its nuances and will be coming back to this post periodically to update things that I have found to be worthy of adding.

Resources:
http://leigh.cudd.li/article/Bitlbee_and_Weechat_Mini-Tutorial
http://kmacphail.blogspot.com/2011/09/using-weechat-with-freenode-basics.html
http://www.weechat.org/files/doc/devel/weechat_user.en.html#screen_layout


Document Storage: Part 3

Document Storage Project

This is Part 3: Configuring Apache.

We’re only looking for a fairly simple interface to browse through documents. Apache already gives us that – you just need to enable a feature called “Indexes”. But the default indexing is pretty ugly; it’d be nice to make it look a little prettier and maybe add scope for expanding on functionality.

Initially I was going to design my own style, but it turns out someone’s already done that and he’s done a better job than I could ever hope to. So I took the style setup from Recursive Design and tweaked it slightly to fit in with what we’re doing here.

  cd /home/http/assets
  svn co http://recursive-design.com/svn/misc/apache/index-style
  mv index-style/* . 
  rmdir index-style

The stylesheet really ought to be referenced with a specific path so it can always be found. Edit /home/http/assets/header.html and change the stylesheet reference thus:

    <link rel="stylesheet" type="text/css" href="/assets/style.css" />

Next up, we need to configure Apache. Ignore the instructions on the Recursive Design blog; things are slightly different here. Edit /etc/apache2/sites-available/default as follows:


<VirtualHost *:80>
        ServerAdmin (YOUR EMAIL ADDRESS HERE)

        DocumentRoot /home/http/documents
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        <Directory /home/http/documents/>
                AllowOverride None
                Options Indexes
                DirectoryIndex index.html index.php
                IndexOptions FancyIndexing
                IndexOptions VersionSort
                IndexOptions HTMLTable
                IndexOptions FoldersFirst
                IndexOptions IconsAreLinks
                IndexOptions IgnoreCase
                IndexOptions SuppressDescription
                IndexOptions SuppressHTMLPreamble
                IndexOptions XHTML
                IndexOptions IconWidth=16
                IndexOptions IconHeight=16
                IndexOptions NameWidth=*
                IndexOrderDefault Descending Name
                HeaderName /assets/header.html
                ReadmeName /assets/footer.html
                Order allow,deny
                Allow from all
        </Directory>

        Alias /assets /home/http/assets
        Alias /search /home/http/search

        <Directory /home/http/assets>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        <Directory /home/http/search>
                AllowOverride None
                Order allow,deny
                Allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/

 ..... LEAVE THE REST OF THE FILE ALONE ......

Restart Apache.

Put something – ideally a PDF that was NOT generated from a scan but instead contains searchable text – into /home/http/documents and browse to http://(hostname).local from a separate PC on the network. If all goes according to plan, you should see something a bit like this:

There’s a lot more to do: we still need something that can index this little lot (so we can just punch in search terms) and we need some easy way to get documents onto the server. But they’re a topic for a future post…
