Category Archives: General

Cloud Backup Tutorial

I have been knee deep in backups for the past few weeks, but I think I can finally see light at the end of the tunnel.  What looked like a simple enough idea to implement turned out to be a much more complicated task to accomplish.  I don’t know why, but there seems to be practically no information out there covering this topic.  Maybe it’s just because backups suck?  Either way, backups are extremely important to the vitality of a company, and without a workable copy of your data you are screwed if something happens to the originals.  So today I am going to write about managing cloud data and cloud backups and hopefully shine some light on this seemingly foreign topic.

Part of being a cloud based company means dealing with cloud based storage.  Some of the terms involved are slightly different than the standard backup and storage terminology.  Things like buckets, object based storage, S3, GCS and boto all come to mind when dealing with cloud based storage and backups.  It turns out that there are a handful of tools out there for dealing with these storage requirements, which I will be discussing today.

The Google and Amazon APIs are nice because they allow for creating third party tools to manage the storage, outside of their official and standard tools.  In my journey to find a solution I ran across several workable tools that I would like to mention.  The end goal of this project was to sync a massive amount of files and data from S3 storage to GCS.  I found that the following tools all provided at least some of my requirements, and each has its own set of uses.  They are included here in no real order:

  • duplicity/duply – This tool works with S3 for small scale storage.
  • Rclone – This one looks very promising and supports S3 to GCS sync (see the quick example after this list).
  • aws-cli – The official command line tool supported by AWS.
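
To give an idea of how simple the Rclone option is, once you have defined an S3 remote and a GCS remote with rclone config (the remote names below are placeholders for whatever you called yours), a bucket to bucket sync is a one liner:

rclone sync s3remote:source-bucket gcsremote:destination-bucket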

S3cmd – This was the first tool that came close to doing what I wanted.  It’s a really nice tool for smallish sets of files, has some handy features and is capable of syncing S3 buckets.  Unfortunately, the way it is designed does not allow it to read and write a very large number of files efficiently, so it is best suited to smaller sets of data.
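
Just to give a flavor of the tool, a bucket to bucket sync with s3cmd looks something like this (recent versions support remote to remote sync; the bucket names here are placeholders):

s3cmd sync s3://source-bucket/ s3://destination-bucket/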

s3s3mirror – This is an extremely fast copy tool written in Java and hosted on Github.  This thing is awesome at copying data quickly – it was able to copy about 6 million files in a little over 5 hours the other day.  One extremely nice feature is that it has an intelligent sync built in, so it knows which files have already been copied over.  Even better, it is faster still when it only needs to perform reads.  So once your initial sync has completed, additional syncs are blazing fast.

This is a jar file so you will need to have Java installed on your system to run it.

sudo apt-get install default-jre-headless

Then you will need to grab the code from Github.

git clone https://github.com/cobbzilla/s3s3mirror.git

And to run it.

./s3s3mirror.sh first-bucket/ second-bucket/

That’s pretty much it.  There are some handy flags but this is the main command.  There is an -r flag for changing the retry count, a -v flag for verbosity and troubleshooting, as well as a --dry-run flag to see what will happen without actually copying anything.
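
Putting those flags together, a cautious first run might look something like the following (the retry count here is just an example value, and the flags are the ones described above):

./s3s3mirror.sh -v -r 5 --dry-run first-bucket/ second-bucket/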

The only downside of this tool is that it only seems to support S3 at this point – although the source is posted to Github, so it could be adapted to work with GCS, which is something I am actually looking at doing.

Gsutil – The Python command line tool created and developed by Google.  This is the most powerful tool that I have found so far.  It has a ton of command line options, can communicate with other cloud providers, is open source and is under active development and maintenance.  Gsutil is scriptable and handles failures gracefully – it can retry failed copies and resume interrupted transfers, and it has the intelligence to check which files and directories already exist for scenarios where synchronizing buckets is important.

The first step to using gsutil after installation is to run through the configuration with the gsutil config command.  Follow the instructions to link gsutil with your account.  After the initial configuration has been run you can modify or update all the gsutil goodies by editing the config file – which lives in ~/.boto by default.  One config change that is worth mentioning is the pair of parallel_process_count and parallel_thread_count settings.  These control how much data gets shoved through gsutil at once – so on really beefy boxes you can crank these numbers up quite a bit higher than their defaults.  To utilize the parallel processing you simply need to set the -m flag on your gsutil command.

gsutil -m cp -r /path/to/files gs://bucket-name
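
For reference, those two parallelism settings live in the [GSUtil] section of the boto config file.  A snippet might look something like the following (the counts here are only an illustration, so tune them to your own hardware):

[GSUtil]
parallel_process_count = 12
parallel_thread_count = 10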

One very nice feature of gsutil is that it has built in functionality to interact with AWS and S3 storage.  To enable this functionality you need to add your AWS access key ID and secret access key to your ~/.boto config file.  After that, you can test out the updated config by taking a look at your buckets that live on S3.

gsutil ls s3://
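
For reference, the AWS keys go in the [Credentials] section of ~/.boto and look something like this (the values below are placeholders, so substitute your own keys):

[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY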

So your final command to sync an S3 bucket to Google Cloud would look similar to the following:

gsutil -m cp -R s3://bucket-name gs://bucket-name

Notice the -R flag, which makes this a recursive copy so that everything in one bucket gets copied to the other, instead of a single layer copy.
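
If you plan on re-running the copy later to pick up changes, newer versions of gsutil also ship an rsync command that only transfers what differs between the two buckets (same bucket names assumed as above):

gsutil -m rsync -r s3://bucket-name gs://bucket-name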

There is one final tool that I’d like to cover, which isn’t a command line tool but turns out to be incredibly useful for copying large sets of data from S3 into GCS: the GCS Online Import tool.  Follow the link, fill out the interest form, and after a little while you should hear from somebody at Google about setting up and using your new account.  It is free to use and the support is very good.  Once you have been approved for using this tool you will need to provide a little bit of information for setting up sync jobs, your AWS ID and key, as well as allow your Google account to sync the data.  But it is all very straightforward, and if you have any questions the support is excellent.  This tool saved me from having to sync my S3 storage to GCS manually, which would have taken at least 7 days (and that was even with a monster EC2 instance).

Ultimately, the tools you choose will depend on your specific requirements.  I ended up using a combination of s3s3mirror, AWS bucket versioning, the Google cloud import tool and gsutil.  But my requirements are probably different from the next person’s, and each backup scenario is unique, so a combination of these various tools provides the flexibility to handle pretty much any scenario.  Let me know if you have any questions or know of other tools that I have failed to mention here.  Cloud backups are an interesting and unique challenge that I am still mastering, so I would love to hear any tips and tricks you may have.

Review: Webmin Administrator’s Cookbook

I just recently finished reading the Webmin Administrator’s Cookbook and thought I would share some of my thoughts and opinions about the book.  While I don’t typically review books on the blog, I thought this would be a good opportunity to discuss a nice book.  This book is written by a very knowledgeable and credible author – Michal Karzynski.  His background includes over a decade of experience as a developer in various programming languages, as well as work in scientific research.

This book is a good read for everyone from seasoned veterans and professionals all the way down to aspiring and freshly minted admins.

The book itself covers a broad, inclusive set of topics, including logging, user management, backups, web server administration and many others.  The basic theme of the book is to use Webmin as a sort of framework for discussing the various administrative topics and tasks that the tool covers.  From their website, Webmin is described as follows:

Webmin is a web-based interface for system administration for Unix. Using any modern web browser, you can setup user accounts, Apache, DNS, file sharing and much more.

This works out to be a perfect tool for aspiring sysadmins because it really does a nice job of cloaking a lot of the nitty gritty complexity and detail that can be overwhelming and confusing for admins or users who are new to the concepts and tooling that Webmin covers.  By using Webmin, one can learn about a large number of interesting topics without having to worry about how to type in all of the commands or how to install and configure the tools that come bundled up in Webmin.  This allows users to really increase their productivity.  Couple the Webmin tool with a cookbook of nice concrete examples and you have a great recipe for learning how to use a powerful tool correctly.

Wrapping such a broad spectrum of topics and tools into a web based tool can be complicated.  But used as reference material, this book does a great job of making everything clear, with good examples explaining how everything works together, as well as pictorial examples that really do a nice job of tying the written concepts together with concrete, real world usage.  Now is also a good time to mention that this book follows a nice pattern of organizing topics.  From the outset, the book starts with the more basic administrative topics and principles, covering each topic thoroughly with good descriptions and solid examples.  The book progresses quite nicely through the different topics and eventually gets into some of the more obscure ones.

The Webmin Administrator’s Cookbook does a nice job of combining many complex system administration topics into a nice, easy to follow and read reference guide that can be utilized by all different levels of Linux and administrative experience.  If you use Webmin in any capacity at all, this book would be a great reference and guide to help you be more productive in your day to day with this tool.

You can find more information about the book here.  While you are at it, check out the author, Michal Karzynski’s blog for more interesting and useful tips - http://michal.karzynski.pl.

Set up PEM key authentication

Many times it is useful to use keys to authenticate to your servers.  This can dramatically improve security and is a great way to manage servers in bulk as well.  You just need to keep track of your keys rather than having to remember a large number of passwords.  The steps to get PEM key authentication working are fairly straightforward, but it never hurts to walk through the process of getting it set up correctly.

Side Note: I’d like to also mention briefly, that I have these steps set up to work with Chef, so every server that gets deployed using Chef will use PEM keys out of the box, which works out very nicely.  If you’re interested I can expound on that topic a little more, just let me know.

The first step in the process is to generate some keys using openssl.  If you don’t have openssl, go download and install it.  If you do have openssl but haven’t updated in a while, please update to avoid the Heartbleed vulnerability that was recently exploited (nearly all distributions have released a patched version at this point, so updating should be trivial).

We want to generate our key and create a PEM file out of it.  Here are the steps:

cd ~/.ssh
# Generate a new DSA key pair (accept the defaults when prompted)
ssh-keygen -t dsa -b 1024
# Convert the private key to PEM format
openssl dsa -in id_dsa -outform pem > test.pem
# Add the public key to the list of keys allowed to log in
cat id_dsa.pub >> authorized_keys

You can leave the values blank (the defaults) when ssh-keygen prompts you.

Now you should have listings similar to the following in your ~/.ssh directory:


  • authorized_keys – This holds the public keys that the PEM file gets authenticated against
  • id_dsa – This is the private portion of the key pair that we generated in the steps above
  • id_dsa.pub – This is the public portion of the key, which is what gets added to authorized_keys
  • test.pem – This is the file that will be used to authenticate.  Essentially the private key minus the passphrase

Now you just need to copy the test.pem file that was just generated to a different host, using scp or rsync, in order to log in with your PEM key.  Once that is done, the command to connect to the remote host using your key should look similar to the following:

ssh -i /path/to/pem user@server-name
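
If you go the scp route for the copy step, it might look something like this (the destination host and path are just an illustration):

scp ~/.ssh/test.pem user@other-host:~/.ssh/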

Next steps.  At this point you should have working PEM authentication on your server.  It is probably a good idea at this point to start looking at hardening the security of the host as well as its SSH configuration.  Small things can go a long way.  For example, disabling root login, disabling password authentication, etc. will stop a very large number of attacks from hitting your server now that you are authenticating with PEM keys.
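
As a concrete starting point, those two settings live in /etc/ssh/sshd_config and look like this (just remember to restart the SSH service afterwards, e.g. sudo service ssh restart on Ubuntu):

PermitRootLogin no
PasswordAuthentication no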

Using a self signed cert with Nginx

After the recent Heartbleed incident (which I’m sure many of you remember well) I had to reassign some certificates.  It turns out that this was a great opportunity for a blog post.  Since I do not create and assign certs very frequently, it is a good opportunity to take some notes and hopefully ease the process for others.  After patching the vulnerable version of OpenSSL, there are really only a few steps needed to accomplish this.  Assuming you already have nginx installed, which is trivial to do on Ubuntu, the first step is to create the necessary crt and key files.

sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/nginx/cert.key -out /etc/nginx/cert.crt
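
If you want to double check what was generated, openssl can print the subject and validity dates of the new cert:

sudo openssl x509 -in /etc/nginx/cert.crt -noout -subject -dates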

Next you will need to tell nginx to load up your new certs in its config.  Here is an example of what the server block in your /etc/nginx/sites-available config might look like.  Notice that the ssl_certificate and ssl_certificate_key entries correspond to the cert files we created above, which we stuck in the /etc/nginx directory.  If you decide to place these certs in a different location you will need to modify your config file to reflect that location.

server {
    listen *:443;
    ssl on;
    ssl_certificate /etc/nginx/cert.crt;
    ssl_certificate_key /etc/nginx/cert.key;
    ssl_session_timeout 5m;
    ssl_protocols SSLv3 TLSv1;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv3:+EXP;
    ssl_prefer_server_ciphers on;
}

Just to cover all our bases here, we will also redirect any requests that come in on port 80 (the default web port) back to 443 for SSL.  This is a simple addition and adds an additional layer of security.

server {
    listen 80;
    return 301 https://$host$request_uri;
}

The final step is to reload your configuration and test to make sure everything works.

sudo service nginx reload

If your nginx fails to reload, more than likely there is some sort of configuration or syntax error in your config file.  Comb through it for any potential errors or mistakes.  Once your config is loaded properly you can check your handiwork by attempting to hit your site using http://.  If your config is working properly it should automatically redirect you to https://.
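
When you are hunting down one of those config errors, nginx’s built in syntax check will usually point you right at the offending line:

sudo nginx -t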

That’s all it takes.  I think it might be a good exercise to try something like this with Chef but for now this process works okay by hand.  Let me know what you think or if this can be improved.

Setting up a private git repo in Chef

As a Chef newbie I really might have bitten off more than I could chew when I originally looked at how to get private Github repos working, but I am glad I pushed through and got a solution working.  I have no idea if this is the preferred method or if there are easier ways, but this worked for me, so I am hoping that if any other Chef newbies stumble across this problem they can use this post as a guide or reference.

First, I’d like to give credit where it is due.  I used this post as a template as well as the SSH wrapper section in the deploy documentation on the Chef website.

The first issue I had problems with is that when you connect to Github via SSH it wants the Chef client to accept its host key fingerprint.  By default, if you don’t modify anything, SSH will just sit there waiting for the fingerprint to be accepted.  That is why the SSH Git wrapper is used – it tells SSH on the Chef client that we don’t care about verifying the Github server’s key, just accept it.  Here’s what my SSH Git wrapper looks like:

#!/usr/bin/env bash
# Skip host key checking and use the deploy key for whatever arguments git passes in
/usr/bin/env ssh -o "StrictHostKeyChecking=no" -i "/home/vagrant/.ssh/id_rsa" "$@"

You just need to have your Chef recipe drop this wrapper script onto the node:

# Set up github to use SSH authentication 
cookbook_file "/home/vagrant/.ssh/wrap-ssh4git.sh" do 
  source "wrap-ssh4git.sh" 
  owner "vagrant" 
  mode 00700 
end
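
To actually make use of the wrapper, point the git resource at it with the ssh_wrapper attribute.  Here is a minimal sketch, with a placeholder repository URL and destination path:

# Clone the private repo using the SSH wrapper we dropped above
git "/home/vagrant/private-repo" do
  repository "git@github.com:your-org/private-repo.git"
  revision "master"
  ssh_wrapper "/home/vagrant/.ssh/wrap-ssh4git.sh"
  action :sync
end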

The next problem is that when using key authentication, you must specify both a public and a private key.  This isn’t an issue if you are setting up the server and configs by hand, because you can just generate a key on the fly and hand it to Github to tell it who you are.  When you are spinning instances up and down you don’t have this luxury (actually you might, but it seemed like a pain in the ass).

To get around this, we create a couple of templates in our cookbook to allow our Chef client to connect to Github with an already established public and private key – the id_rsa and id_rsa.pub files shown below.  Here’s what the configs look like in Chef:

# Public key 
template "/home/vagrant/.ssh/id_rsa.pub" do 
  source "id_rsa.pub" 
  owner "vagrant" 
  mode 0600 
end 
 
# Private key 
template "/home/vagrant/.ssh/id_rsa" do 
  source "id_rsa" 
  owner "vagrant" 
  mode 0600 
end

After that is taken care of, the only other minor caveat is that if you are cloning a huge repo then it might time out unless you override the default timeout value, which is set to 600 seconds (10 minutes).  I had some trouble finding this information in the docs, but thanks to Seth Vargo I was able to find what I was looking for.  This is easy enough to accomplish – just use the following snippet, which goes inside the git resource block alongside the other attributes, to override the default value.

timeout 9999

That should be it.  There are probably other, easier ways to accomplish this and so I definitely think the adage “there’s more than one way to skin a cat” applies here.  If you happen to know another way I’d love to hear it.