Using Vim as a word processor

Recently I was asked to share some of my content on Ops School, a very cool site that bills itself as “a comprehensive program that will help you learn to be an operations engineer”.  It is essentially an online guide covering topics geared towards a successful career in IT.  If you haven’t checked the site out already, I highly suggest you go take a look!  Like right now.  Even better if you have something to contribute!  Either join the mailing list or get going by joining the community over on GitHub.  Contributing to this project is a fantastic way to get your name on an open source project, and it is also a great learning experience if that type of thing interests you.  At least it has been for me so far.

Anyway, the project has a set of guidelines and styles posted on their site for authors to adhere to.  Thus far I have found Vim to be the best word processor for following these styles and for submitting writing to the project.  It is also a good way to force myself to use Vim, because I don’t get much practice with it otherwise.

I have taken bits and pieces from various other vimrcs I’ve found and fit them to my own scenario, which I suggest you do as well.  The following section is a great starting point for adding word processor functionality to your vimrc.

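func! WordProcessorMode()
  " t: auto-wrap text at textwidth; 1: don't break a line after a one-letter word
  setlocal formatoptions=t1
  setlocal textwidth=80
  " Move by display line rather than by file line
  map j gj
  map k gk
  setlocal smartindent
  setlocal spell spelllang=en_us
  setlocal noexpandtab
endfunc
com! WP call WordProcessorMode()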

One gotcha that I encountered with this setup initially was that lines didn’t automatically re-balance if I went back to a previous paragraph and made a change that caused a line to spill over the 80 character wrap limit.  To realign a paragraph, select the text that has come out of line and type “gq” to balance out the paragraph again (or type “gqap” in normal mode to reformat the paragraph under the cursor).

If you have questions, let me know.  And if you have any other tricks or tips that you use to enhance your Vim word processing experience, feel free to share them!

Protip March: Quickly viewing logs with Powershell

Wow, it feels like it’s been forever since I have posted.  I have been crazy busy with work and am only now getting caught up enough to poke my head above the water and breathe again.  We had a massive overhaul of our data center in mid February (among other things) and I am finally getting all the loose ends tied up from that project, including our brand-spanking new test environment, which I am super excited about and will post about in the not so distant future.

Here is proof of some of our efforts just in case you don’t believe me 🙂




Anyway, getting back on track, I just discovered a slick way in Powershell to mimic the functionality of tail and tail -f in the Linux world.  If you have ever used tail then you know it is a great tool for monitoring log files or quickly looking at the end of a piece of code for example.

With the trick I’m about to show you, the same can essentially be done in Windows.  However, there are a few caveats.  For one, the syntax is a little bit different (if you want to change this just set up an alias).  The Powershell equivalent relies on the Get-Content cmdlet with the -Tail and -Wait flags to accomplish this task (the -Tail parameter needs Powershell 3.0 or later).

In the following example, Powershell shows the last 30 lines of the uploadpic.ps1 file and, because of the -Wait flag, keeps printing new lines as they are appended to the file.

Get-Content -path .\uploadpic.ps1 -Tail 30 -Wait

If you don’t care about viewing the file live, you can remove the -Wait flag and Powershell will simply grab the last N lines, where N is 30 in our example.  That number can obviously be changed depending on your needs.  Easy enough for what I need it for.

Get-Content -path .\uploadpic.ps1 -Tail 30
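
For comparison, this is roughly what the same two operations look like with tail on the Linux side.  It is only a minimal sketch: the last_lines helper name is made up for illustration, and the default of 30 simply mirrors the example above.

```shell
# last_lines FILE [N]: print the last N lines of FILE (default 30),
# the counterpart of: Get-Content -Path FILE -Tail N
last_lines() {
    tail -n "${2:-30}" "$1"
}

# To keep following the file as it grows (the counterpart of -Wait):
#   tail -n 30 -f uploadpic.ps1
```

Wrapping tail in a function is not required, of course; it just shows where the line count goes.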

As I mentioned, I will be going into a little more detail about some of the lessons I learned from our data center rebuild that I feel are good to know and be aware of.  Stand by for new content as I get back up to speed on the writing side of things.

Protip January: Get your external IP from the command line

Ever need to grab your IP quick but don’t want to get out of the command line or stop whatever you’re working on?  Or how about if you have SSH’d into a number of different servers and you simply want to know where you are at currently?  This little trick enables you to quickly determine your public IP address without leaving the command line.

I’ll admit, I didn’t come up with this one, but I liked it so much that I decided to write a quick post about it.  There is a great website where users can post all their slick one liners, which is where I found this one.  If you haven’t been there before I highly recommend it; there is some really good stuff over there.

This one is simple yet quite useful, which is what I’m all about.  The command uses curl, so if you don’t have that bad boy installed yet you’ll need to go get that quick (Debian based distros).

sudo aptitude install curl

Once that is installed simply run the following:


And bam!  Emeril style.  Let that go out and do its thing and you will quickly have your external IP address.  I like this method a lot more than having to jump out of the shell and open up a browser then going to a website to get this information.  It might not save that much time but to me just knowing how to do this is useful and knowledge is power.  Or something.
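
As a stand-in illustration of the general idea, here is a minimal sketch: ask a "what is my IP" web service with curl and sanity-check the answer.  The service URL (ifconfig.me) and both helper names are assumptions made for this example, not necessarily what the one-liner from this post used.

```shell
# Assumption: any service that returns your address as plain text will do.
IP_SERVICE_URL="${IP_SERVICE_URL:-https://ifconfig.me}"

# is_ipv4 STRING: succeed if STRING looks like a dotted-quad IPv4 address.
# (Only the shape is checked, not that each number is 0-255.)
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# external_ip: fetch the public address; -s silences curl's progress meter.
# Fails if the request fails or the reply is not IPv4-shaped (e.g. IPv6).
external_ip() {
    ip="$(curl -s --max-time 10 "$IP_SERVICE_URL")" || return 1
    is_ipv4 "$ip" && printf '%s\n' "$ip"
}
```

Run external_ip on whichever box you have SSH'd into and it should print that machine's public address.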

Document storage: Part 6

Document Storage Project

This is Part 6: Tying it all together.

All that’s left to do now is write a script that will:

  • Detect when a new file’s been uploaded.
  • Turn it into a searchable PDF with OCR.
  • Put the finished PDF in a suitable directory so we can easily browse for it later.

This is actually pretty easy. inotifywait(1) will tell us whenever a file’s been closed; we can use that as our trigger to OCR the document.

Our script is therefore in two parts:

Part 1: will watch the /home/incoming directory for any files that are closed.
Part 2: will be called by the script in part 1 every time a file is closed.

Part 1

This script lives in /home/scripts and is called watch-dir.

#!/bin/bash
# Directory to watch: the upload location from part 5.
INCOMING="/home/incoming"
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

inotifywait -m --format '%:e %f' -e CLOSE_WRITE "${INCOMING}"  2>/dev/null | while read -r LINE
do
        FILE="${INCOMING}/${LINE#* }"
        "${DIR}"/process-image "${FILE}" &
done

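To make the parsing step concrete: with the format '%:e %f', each line from inotifywait should be the event names joined by colons, then a space, then the filename.  Below is a small sketch of splitting such a line; the sample line and the two function names are made up for illustration.

```shell
# event_of LINE: everything before the first space (the event names).
event_of() { printf '%s\n' "${1%% *}"; }

# file_of LINE: everything after the first space, so a filename that
# itself contains spaces survives intact.
file_of() { printf '%s\n' "${1#* }"; }

line='CLOSE_WRITE:CLOSE Correspondence 00001.pdf'
event_of "$line"   # prints: CLOSE_WRITE:CLOSE
file_of "$line"    # prints: Correspondence 00001.pdf
```
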
Part 2

This script lives in /home/scripts and is called process-image.


#!/bin/bash

# Dead easy - at least in theory!
# Take a single argument - filename of the file to process.
# Do all the necessary processing to make it a
# searchable PDF.

OUTFILE="`basename "${1}"`"
SANDWICHOPTS=""

if [ -s "${1}" ]
then
	# We use the first part of the filename as a classification.
	CLASSIFICATION=`echo ${OUTFILE} | cut -f1 -d"-"`
	OUTDIR="/home/http/documents/${CLASSIFICATION}/`date +%Y`/`date +%Y-%m`/`date +%Y-%m-%d`"

	if [ ! -d "${OUTDIR}" ]
	then
		mkdir -p "${OUTDIR}" || exit 1
	fi

	# We have to move our file to a temporary location right away because
	# otherwise pdfsandwich uses the file's own location for
	# temporary storage. Well and good - but the file's location is
	# subject to an inotify that will call this script!
	# Assumption: the original TEMPFILE definition is not shown; any unique
	# path outside /home/incoming that keeps the file extension should do.
	TEMPFILE="`mktemp --tmpdir --suffix=".${OUTFILE##*.}" scan-XXXXXX`" || exit 1
	mv "${1}" "${TEMPFILE}" || exit 1

	# Have we a colour or a mono image? Probably quicker to find out
	# and process accordingly rather than treat everything as RGB.
	# We assume the first page is representative of everything
	COLOURDEPTH=`convert "${TEMPFILE}[0]" -verbose -identify /dev/null 2>/dev/null | grep "Depth:" | awk -F'[/-]' '{print $2}'`
	if [ "${COLOURDEPTH}" -gt 1 ]
	then
		# Assumption: the body of this branch is missing from the listing;
		# -rgb asks pdfsandwich to process the pages in colour.
		SANDWICHOPTS="-rgb"
	fi
	pdfsandwich ${SANDWICHOPTS} -o "${OUTDIR}/${OUTFILE}" "${TEMPFILE}" > /dev/null 2>&1
	rm "${TEMPFILE}"
fi
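
The directory layout is easier to see in isolation.  The sketch below pulls the classification and date-based path logic into a function so it can be tried without scanning anything; the function name and the optional date argument are additions for illustration only.

```shell
# outdir_for FILENAME [DATE]: print the directory a scan would be filed under.
# DATE is YYYY-MM-DD and defaults to today.
outdir_for() {
    name="$(basename "$1")"
    day="${2:-$(date +%Y-%m-%d)}"
    classification="${name%%-*}"     # text before the first "-"
    year="${day%%-*}"                # YYYY
    month="${day%-*}"                # YYYY-MM
    printf '/home/http/documents/%s/%s/%s/%s\n' \
        "$classification" "$year" "$month" "$day"
}

outdir_for /home/incoming/Correspondence-00042.pdf 2013-03-09
# prints: /home/http/documents/Correspondence/2013/2013-03/2013-03-09
```

A file whose name contains no "-" is classified under its whole filename, which matches what cut does with a missing delimiter.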

There’s just one thing missing: pdfsandwich. This is actually something I found elsewhere on the web. It hasn’t made it into any of the major distro repositories as far as I can tell, but it’s easy enough to compile and install yourself. Find it here.

Run /home/scripts/watch-dir every time we boot – the easiest way to do this is to include a line in /etc/rc.local that calls it:

/home/scripts/watch-dir &

Get it started now (unless you were planning on rebooting):

nohup /home/scripts/watch-dir &

Now you should be able to scan in documents; they’ll be automatically OCR’d and made available on the internal website you set up in part 3.

Further enhancements are left to the reader; suggestions include:

  • Automatically notifying sphider-plus to reindex when a document is added. (You’ll need a newer version of sphider-plus to do this. Unfortunately there is a cost associated with this, but it’s pretty cheap. Get it from here).
  • There is a bug in pdfsandwich (actually, I think the bug is probably in tesseract or hocr2pdf, both of which are called by pdfsandwich): under certain circumstances which I haven’t been able to nail down, sometimes you’ll find that in the finished PDF one page of a multi-page document will only show the OCR’d layer, not the original document. Track down this bug, fix it and notify the maintainer of the appropriate package so that the upstream package can also be fixed.
  • This isn’t terribly good for bulk scanning – if you want to scan in 50 one-page documents, you have to scan them individually otherwise they’ll be treated as a single 50 page document. Edit the script so we can somehow communicate with it that certain documents should be split into their constituent pages and store the resulting PDFs in this way.
  • Like all OCR-based solutions, this won’t give you a perfect representation of the source text in the finished PDF. But I’m quite sure the accuracy can be improved, very likely without having to make significant changes to how this operates. Carry out some experiments to figure out optimum settings for accuracy and edit the scripts accordingly.

Document Storage: Part 5

Document Storage Project

This is Part 5: Uploading Scanned Images.

There are two components to this part: configuring somewhere for the files to be uploaded to, and setting up your MFD to upload to it. Most modern MFDs will upload to a CIFS share, which is what we’re going to use here. First things first, we need to install Samba:

apt-get install samba

Now we need to set up Samba. We’ll have user-level security (it’ll be much easier to lock things down if we want to increase security at a later date, and besides share-level security went out with the Ark) and a single share called incoming. We also need a user for the MFD to log into Samba with; we’ll call this user “scanner”. We’ll also have a group called “scanner” so we can be a little more flexible over who can access this share should we wish.

Edit /etc/samba/smb.conf as follows:


[global]
# "security = user" is always a good idea. This will require a Unix account
# in this server for every user accessing the server. See
# /usr/share/doc/samba-doc/htmldocs/Samba3-HOWTO/ServerType.html
# in the samba-doc package for details.
   security = user

[incoming]
        path = /home/incoming
        guest ok = no
        browseable = no
        read only = no
        valid users = @scanner

Now, we need a new user for the MFD. Samba requires that users also have corresponding Unix accounts, so first we create a Unix account, then we set their Samba password. We also need to ensure the permissions on /home/incoming are correct – the following commands deal with this:

  useradd scanner
  smbpasswd -a scanner
  chgrp scanner /home/incoming
  chmod g+rwx /home/incoming
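
Before moving on, it can be worth confirming that the group change took effect.  Here is a rough sketch using GNU stat; the helper name is made up, and the -c format flags assume GNU coreutils.

```shell
# group_can_write DIR GROUP: succeed if DIR belongs to GROUP and the
# group permission bits are exactly rwx.
group_can_write() {
    [ "$(stat -c '%G' "$1")" = "$2" ] || return 1
    perms="$(stat -c '%A' "$1")"      # e.g. drwxrwxr-x
    [ "$(printf '%s' "$perms" | cut -c5-7)" = "rwx" ]
}

# Example:
#   group_can_write /home/incoming scanner && echo "permissions look right"
```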

Make sure you choose a password that is not only secure, but possible to type in on your MFD! Check this works by connecting to the following folder in Windows:


You’ll need to use the username/password for the scanner user you set up.

For the final part of this, you need to set up your MFD to scan to this directory.

I’ve chosen an Oki MB451 multifunction unit for a number of reasons:

  • It’s cheap.
  • It has a double-sided document feeder for scanning. More and more documents are being sent double-sided; it seems like a step back to have a document feeder that can’t deal with this.
  • It supports scanning directly to email and CIFS share without requiring extra software on the PC. (This is important; certainly a few years ago a lot of manufacturers claimed their products could do this but it wasn’t apparent until after you’d taken it out of the box that their product didn’t do any of it without additional software on your PC. Certain large photocopier-type units still have this restriction, though sometimes you can buy an optional bolt-on to overcome it. I prefer avoiding the need for extra bolt-ons because they’re usually extortionately priced and often difficult to source).
  • It has a nice big display. These units can be a pig to set up at the best of times; a large display often goes some way to alleviate this problem.
  • You can set up lots of profiles – preconfigured shortcuts that say “everything scanned under this profile should be stored under this name in this share accessed with this username and password; files should have this format”. Unfortunately you can’t nail a profile to say “everything scanned under this profile is double-sided” but you can’t have everything!
  • The printer supports Postscript, which means it’ll be pretty much guaranteed to work under any OS I can throw at it for a long time to come.

I won’t go into detail regarding MFD configuration – there are simply too many on the market and they all vary. It’s enough to explain that I’ve set up a profile called “Correspondence” and I’ve pointed it at \\(hostname)\incoming.

With the profile I’ve set up, scanned documents will be stored under \\(hostname)\incoming\Correspondence-#####.pdf.

Test this all works by scanning a document and making sure it appears in the /home/incoming directory on your Linux box.
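
If you would rather not keep re-running ls while walking back from the scanner, a small polling loop can do the waiting for you.  This is only a sketch: the function name, the one-second polling interval and the default timeout are all arbitrary choices for illustration.

```shell
# wait_for_scan DIR PATTERN [TIMEOUT]: poll DIR once a second until a file
# matching the glob PATTERN exists, giving up after TIMEOUT seconds
# (default 60). Prints the first match on success.
wait_for_scan() {
    dir="$1"; pattern="$2"; remaining="${3:-60}"
    while :; do
        # $pattern is deliberately unquoted so the shell expands the glob.
        for f in "$dir"/$pattern; do
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
                return 0
            fi
        done
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example:
#   wait_for_scan /home/incoming 'Correspondence-*.pdf' 120
```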

There’s only one thing left to do – tie all this together so incoming documents are automatically OCR’d (so they’re indexable in Sphider) and made available via Apache….

