Fixing docker-machine shared folder performance issues

It is a known issue that vboxsf (Virtualbox Shared Folders) has performance problems.  This ugly fact becomes a problem if you use docker-machine with the default Virtualbox driver to mount volumes, both on Windows and OS X, especially when mounting directories with a large amount (~17k and above files).  Linux does not suffer from this performance problem since it is able to run Docker natively and does not require you to run docker-machine.

There are various issues floating around Github referencing this problem, most of which remain unresolved.

Unfortunately there is currently not a proper fix for the vboxsf performance issues on OS X and Windows.  In fact, I reached out to the Virtualbox developers around a year ago asking what the deal was and was basically told that fixing shared folder performance was not a high priority issue for their dev team.

After hearing the unsettling news, I set out to find a good way to deal with shared volumes.  I stumbled across a few different approaches to solving the problem but most of them ended up being glitchy (at the time) or overly complicated.  There is a nice write up that mentions many of the tools that I tried myself.

Having tried most of the methods out there, easily the best workaround I have found is to use NFS file shares to mount the “Users” directory using a tool called docker-machine-nfs.  It is easy to install and run and most importantly it just works out of the box, which is exactly what most folks are looking for.

Sadly this method does not work on Windows.  And as far as I can tell there is not a good workaround to this problem if you are running docker-machine on Windows.  It does sounds like some folks maybe have had some success using samba but I have not attempted to get fast volumes working on Windows so can’t say for sure what the best approach is.

To install docker-machine-nfs

curl -s https://raw.githubusercontent.com/adlogix/docker-machine-nfs/master/docker-machine-nfs.sh |
  sudo tee /usr/local/bin/docker-machine-nfs > /dev/null && \
  sudo chmod +x /usr/local/bin/docker-machine-nfs

To run it

Make sure you already have created a docker-machine VM and verify that it is running.  Then run the following command.

docker-machine-nfs <machine-name>

And that’s pretty much it…  It requires admin access to do the NFS mounting so you might need to punch in your password, other than that you can pretty much follow along with what the output is doing.

There are a few caveats to be aware of.

I have noticed that on newer versions of docker-machine, if you run the script too quickly after creating the VM, docker-machine ends up having issues communicating with the Docker daemon.  The work around (for now) is just to wait ~30 seconds the docker-machine VM to boot fully before running the command to mount nfs.

There is also currently an issue on the docker-machine side on version 0.5.5 and above that breaks docker-machine-nfs on the first run, which is described here.  The workaround for that issue is to modify the script and place a “sleep 20” in the script, as per the comments in the issue.  The author appears to have brought the issue up with docker-machine developers so should fixed properly in the near future.

Read More

Dockerizing Sentry

I have created a Github project that has basic instructions for getting started.  You can take a look over there for ideas of how all of this works and to get ideas for your own set up.

I used the following links as reference for my approach to Dockerizing Sentry.

https://registry.hub.docker.com/u/slafs/sentry
https://github.com/rchampourlier/docker-sentry

If you have configurations to use, it is probably a good idea to start from there.  You can check my Github repo for what a basic configuration looks like.  If you are starting from scratch or are using version 7.1.x or above you can use the “sentry init” command to generate a skeleton configuration to work from.

For this setup to work you will need the following prebuilt Docker images/containers. I suggest using something simple like docker-compose to stitch the containers together.

  • redis – https://registry.hub.docker.com/_/redis/
  • postgres – https://registry.hub.docker.com/_/postgres/
  • memcached – https://hub.docker.com/_/memcached/
  • nginx – https://hub.docker.com/_/nginx/

NOTE: If you are running this on OS X you may need to do some trickery and give special permission on the host (mac) level e.g. create ~/docker/postgres directory and give it the correct permission (I just used 777 recursively for testing, make sure to lock it down if you put this in production).

I wrote a little script in my Github project that will take care of setting up all of the directories on the host OS that need to be set up for data to persist.  The script also generates a self signed cert to use for proxying Sentry through Nginx.  Without the certificate, the statistics pages in the Sentry web interface will be broken.

To run the script, run the following command and follow the prompts.  Also make sure you have docker-compose installed beforehand to run all the needed command.

sudo ./setup.sh

The certs that get generated are self signed so you will see the red lock in your browser.  I haven’t tried it yet but I imagine using Let’s Encrytpt to create the certificates would be very easy.  Let me know if you have had any success generating Nginx certs for Docker containers, I might write a follow up post.

Preparing Postgres

After setting up directories and creating certificates, the first thing necessary to getting up and going is to add the Sentry superuser to Postgres (at least 9.4).  To do this, you will need to fire up the Postgres container.

docker-compose up -d postgres

Then to connect to the Postgres DB you can use the following command.

docker-compose run postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres'

Once you are logged in to the Postgres container you will need to set up a few Sentry DB related things.

First, create the role.

CREATE ROLE sentry superuser;

And then allow it to login.

ALTER ROLE sentry WITH LOGIN;

Create the Sentry DB.

CREATE DATABASE sentry;

When you are done in the container, \q will drop out of the postgresql shell.

After you’re done configuring the DB components you will need to “prime” Sentry by running it a first time.  This will probably take a little bit of time because it also requires you to build and pull all the other needed Docker images.

docker-compose build
docker-compose up

You will quickly notice if you try to browse to the Sentry URL (e.g. the IP/port of your Sentry container or docker-machine IP if you’re on OS X) that you will get errors in the logs and 503’s if you hit the site.

Repair the database (if needed)

To fix this you will need to run the following command on your DB to repair it if this is the first time you have run through the set up.

docker-compose run sentry sentry upgrade

The default Postgres database username and password is sentry in this setup, as part of the setup the upgrade prompt will ask you got create a new user and password, and make note of what those are.  You will definitely want to change these configs if you use this outside of a test or development environment.

After upgrading/preparing the database, you should be able to bring up the stack again.

docker-compose up -d && docker-compose logs

Now you should be able to get to the Sentry URL and start configuring .  To manage the username/password you can visit the /admin url and set up the accounts.

 

Next steps

The Sentry server should come up and allow you in but will likely need more configuration.  Using the power of docker-compose it is easy to add in any custom configurations you have.  For example, if you need to adjust sentry level configurations all you need to do is edit the file in ./sentry/sentry.conf.py and then restart the stack to pick up the changes.  Likewise, if you need to make changes to Nginx or celery, just edit the configuration file and bump the stack – using “docker-compose up -d”.

I have attempted to configure as many sane defaults in the base config to make the configuration steps easier.  You will probably want to check some of the following settings in the sentry/sentry.conf.py file.

  • SENTRY_ADMIN_EMAIL – For notifications
  • SENTRY_URL_PREFIX – This is especially important for getting stats working
  • SENTRY_ALLOW_ORIGIN – Where to allow communications from
  • ALLOWED_HOSTS – Which hosts can communicate with Sentry

If you have the SENTRY_URL_PREFIX set up correctly you should see something similar when you visit the /queue page, which indicates statistics are working.

Sentry Queue

If you want to set up any kind of email alerting, make sure to check out the mail server settings.

docker-compose.yml example file

The following configuration shows how the Sentry stack should look.  The meat of the logic is in this configuration but since docker-compose is so flexible, you can modify this to use any custom commands, different ports or any other configurations you may need to make Sentry work in your own environment.

# Caching
redis:
  image: redis:2.8
  hostname: redis
  ports:
    - "6379:6379"
   volumes:
     - "/data/redis:/data"

memcached:
  image: memcached
  hostname: memcached
  ports:
    - "11211:11211"

# Database
postgres:
  image: postgres:9.4
  hostname: postgres
  ports:
    - "5432:5432"
  volumes:
    - "/data/postgres/etc:/etc/postgresql"
    - "/data/postgres/log:/var/log/postgresql"
    - "/data/postgres/lib/data:/var/lib/postgresql/data"

# Customized Sentry configuration
sentry:
  build: ./sentry
  hostname: sentry
  ports:
    - "9000:9000"
    - "9001:9001"
  links:
    - postgres
    - redis
    - celery
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"


# Celery
celery:
  build: ./sentry
  hostname: celery
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery worker -B -l WARNING"
  links:
    - postgres
    - redis
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Celerybeat
celerybeat:
  build: ./sentry
  hostname: celerybeat
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery beat -l WARNING"
  links:
    - postgres
    - redis
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Nginx
nginx:
  image: nginx
  hostname: nginx
  ports:
    - "80:80"
    - "443:443"
  links:
    - sentry
  volumes:
    - "./nginx/sentry.conf:/etc/nginx/conf.d/default.conf"
    - "./nginx/sentry.crt:/etc/nginx/ssl/sentry.crt"
    - "./nginx/sentry.key:/etc/nginx/ssl/sentry.key"

The Dockerfiles for each of these component are fairly straight forward.  In fact, the same configs can be used for the Sentry, Celery and Celerybeat services.

Sentry

# Kombu breaks in 2.7.11
FROM python:2.7.10

# Set up sentry user
RUN groupadd sentry && useradd --create-home --home-dir /home/sentry -g sentry sentry
WORKDIR /home/sentry

# Sentry dependencies
RUN pip install \
 psycopg2 \
 mysql-python \
 supervisor \
 # Threading
 gevent \
 eventlet \
 # Memcached
 python-memcached \
 # Redis
 redis \
 hiredis \
 nydus

# Sentry
ENV SENTRY_VERSION 7.7.4
RUN pip install sentry==$SENTRY_VERSION

# Set up directories
RUN mkdir -p /home/sentry/.sentry \
 && chown -R sentry:sentry /home/sentry/.sentry \
 && chown -R sentry /var/log

# Configs
COPY sentry.conf.py /home/sentry/.sentry/sentry.conf.py

#USER sentry
EXPOSE 9000/tcp 9001/udp

# Making sentry commands easier to run
RUN ln -s /home/sentry/.sentry /root

CMD sentry --config=/home/sentry/.sentry/sentry.conf.py start

Since the customized Sentry config is rather lengthy, I will point you to the Github repo again.  There are a few values that you will need to provide but they should be pretty self explanatory.

Once the configs have all been put in to place you should be good to go.  A bonus piece would be to add an Upstart service that takes care of managing the stack if the server either gets rebooted or the containers manage to get stuck in an unstable state.  The configuration is a fairly easy thing to do and many other guides and posts have been written about how to accomplish this.

Read More

Enable SSL for your WordPress blog

Updated: 11/18/16

The Let’s Encrypt client was recently renamed to “certbot”.  I have updated the post to use the correct name but if I miss something use certbot or let me know.

With the announcement of the public beta of the Let’s Encrypt project, it is now nearly trivial to get your site set up with an SSL certificate.  One of the best parts about the Let’s Encrypt project is that it is totally free, so there is pretty much no reason to protect your blog set up with an SSL certificate.  The other nice part of Let’s Encrypt is that it is very easy to get your certificate issued.

The first step to get started is grabbing the latest source code from GitHub for the project.  Log on to your WordPress server (I’m running Ubuntu) and clone the repo.  Make sure to install git if you haven’t already.

git clone https://github.com/letsencrypt/certbot.git

There is a shell script you can run to pretty much do everything for you, including installation of any packages and libraries it needs as well as configures paths and other components it needs to work.

cd certbot
./certbot-auto

After the bootstrap is done there should be some CLI options.  Run the command with the -h flag to print out help.

./certbot-auto -h

Since I am using Apache for my blog I will use the “–apache” option.

./certbot-auto --apache

There will be some prompts you need to go through for setting up the certificates and account creation.

let's encrypt

 

 

 

 

 

This process is still somewhat error prone, so if you make a typo you can just rerun the “./letsencrypt-auto” command and follow the prompts.

The certificates will be dropped in to /etc/letsencrypt/live/<website>.  Go double check them if needed.

This process will also generate a new apache configuration file for you to use.  You can check for the file in /etc/apache2/site-enabled.  The import part of this config should look similar to the following:

<VirtualHost *:443>
  UseCanonicalName Off
  ServerAdmin webmaster@localhost
  DocumentRoot /var/www/wordpress
  SSLCertificateFile /etc/letsencrypt/live/thepracticalsysadmin.com/cert.pem
  SSLCertificateKeyFile /etc/letsencrypt/live/thepracticalsysadmin.com/privkey.pem
  Include /etc/letsencrypt/options-ssl-apache.conf
  SSLCertificateChainFile /etc/letsencrypt/live/thepracticalsysadmin.com/chain.pem
</VirtualHost>

As a side note, you will probably want to redirect non https requests to use the encrypted connection.  This is easy enough to do, just go find your .htaccess file (mine was in /var/www/wordpress/.htaccess) and add the following rules.

<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{SERVER_PORT} 80
  RewriteRule ^(.*)$ https://example.com/$1 [R,L]
</IfModule>

Before we restart Apache with the new configuration let’s run a quick configtest to make sure it all works as expected.

apachectl configtest

If everything looks okay in the configtest then you can reload or restart apache.

service apache2 restart

Now when you visit your site you should get the nice shiny green lock icon on the address bar.  It is important to remember that the certificates issued by the Let’s Encrypt project are valid for 90 days so you will need to make sure to keep up to date and generate new certificates every so often.  The Let’s Encrypt folks are working on automating this process but for now you will need to manually generate new certificates and reload your web server.

let's encrypt

 

 

 

 

 

 

 

 

 

 

 

 

 

 

That’s it.  Your site should now be functioning with SSL.

Updating the certificate automatically

To take this process one step further We can make a script that can be run via cron (or manually) to update the certificate.

Here’s what the script looks like.

#!/usr/bin/env bash

dir="/etc/letsencrypt/live/example.com"
acme_server="https://acme-v01.api.letsencrypt.org/directory"
domain="example.com"
https="--standalone-supported-challenges tls-sni-01"

# Using webroot method
#/root/letsencrypt/certbot-auto --renew certonly --server $acme_server -a webroot --webroot-path=$dir -d $domain --agree-tos

# Using standalone method
service apache2 stop
# Previously you had to specify options to renew the cert but this has been deprecated
#/root/letsencrypt/certbot-auto --renew certonly --standalone $https -d $domain --agree-tos
# In newer versions you can just use the renew command
/root/letsencrypt/certbot-auto renew --quiet
service apache start

Notice that I have the “webroot” method commented out.  I run a service (Varnish) on port 80 that proxies traffic but also interferes with LE so I chose to run the standalone renewal method.  It is pretty easy, the main difference is that you need to turn off Apache before you run it since Apache binds to to ports 80/443.  But the downtime is okay in my case.

I chose to put the script in to a cron job and have it run every 45 days so that I don’t have to worry about logging on to my server to regenerate the certificate.  Here’s what a sample crontab for this job might look like.

0 0 */45 * * /root/renew_cert.sh

This is a straight forward process and will help with your search engine juices as well.

Read More

ECS cluster turnup with CoreOS and Terraform

Recently I have been evaluating different container clustering tools and technologies.  It has been a fun experience thus far, the tools and community being built around Docker have come a long time since I last looked.  So for today’s post I’d like to go over ECS a little bit.

ECS is essentially the AWS version of container management.  ECS takes care of managing your Docker (container) infrastructure by handling creation, management, destruction and scheduling as well as providing API integration with other AWS services, which is really powerful.  To get ECS up and running all you need to do is create an ECS cluster, either from the AWS console or from some other AWS integration like the CLI or Terraform, then install the agent on servers that you would like ECS to schedule work on.  After setting up the agent and cluster name you are basically ready to go, start by creating a task and then create a service to start running containers on the cluster.  Some cool new features have been announced at this years re:Invent conference but I haven’t had a chance yet to look at them yet.

First impression of ECS

The best part about testing ECS by far has been how easy it is to get set up and running.  It took less than 20 minutes to go from nothing to fully functioning cluster that was scheduling containers to hosts and receiving load.  I think the most powerful aspect of ECS is its integration with other AWS services.  For example, if you need to attach containers/services to a load balancer, the AWS infrastructure is already there so the different pieces of the infrastructure really mesh well together.

The biggest downside so far is that the ECS console interface is still clunky.  It is functional, and I have been able to use it to do everything I have needed but it just feels like it needs some polish and things are nested in menu’s and usually not easy to find.  I’m sure there are plans to improve the interface and as mentioned above some new features were recently announced, so I have a feeling there will be some nice improvements on the way.

I haven’t tried the CLI tool yet but it looks promising for automating containers and services.

Setting things up

Since I am a big fan of CoreOS I decided to try turning up my ECS cluster using CoreOS as the base OS and Terraform to do the heavy lifting and provisioning.

The first step is to create your cluster.  I noticed in the AWS console there was a configuration wizard that guides you through your first cluster which was annoying because there wasn’t a clean way to just create the cluster.  So you will need to follow the on screen instructions for getting your first environment set up.  If any of this is unclear there is a good guide for getting started with ECS here.

After your cluster has been created there is a menu that shows your ECS environments.

ECS cluster menu

 

 

 

 

 

 

 

 

 

 

Next, you will need to turn on the nodes that will be connecting to this cluster.  The first part of this is to get your cloud-config set up to connect to the cluster.  I used the CoreOS docs to set up the ECS agent, making sure to change the ECS_CLUSTER= section in the config.

#cloud-config

coreos:
  units:
  -
  name: amazon-ecs-agent.service
  command: start
  runtime: true
  content: |
  [Unit]
  Description=Amazon ECS Agent
  After=docker.service
  Requires=docker.service
  Requires=network-online.target
  After=network-online.target

  [Service]
  Environment=ECS_CLUSTER=my-cluster
  Environment=ECS_LOGLEVEL=warn
  Environment=ECS_CHECKPOINT=true
  ExecStartPre=-/usr/bin/docker kill ecs-agent
  ExecStartPre=-/usr/bin/docker rm ecs-agent
  ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent
  ExecStart=/usr/bin/docker run --name ecs-agent --env=ECS_CLUSTER=${ECS_CLUSTER} --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} --env=ECS_CHECKPOINT=${ECS_CHECKPOINT} --publish=127.0.0.1:51678:51678 --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/lib/aws/ecs:/data amazon/amazon-ecs-agent
  ExecStop=/usr/bin/docker stop ecs-agent

Note that the Environment=ECS_CLUSTER=my-cluster, this is the most important bit to get the server to check in to your cluster, assuming you named it “my-cluster”.  Feel free to add any other values your infrastructure may need.  Once you have the config how you want it, run it through the CoreOS cloud-config validator to make sure it checks out.  If everything looks okay there, your cloud-config should be ready to go.

You can find more info about how to configure the ECS agent in the docs here.

Once you have your cloud-config in order, you will need to get your Terraform “recipe” set up.  I used this awesome github project as the base for my own project.  The Terraform logic from there basically creates an AWS launch config and autoscaling group (and uses the cloud-config from above) to launch instances in to your cluster.  And the ECS agent takes care of the rest, once your servers are up and the agent is reporting in to the cluster.

launch_config.tf

resource "aws_launch_configuration" "ecs" {
  name = "ECS ${var.cluster_name}"
  image_id = "${var.ami}"
  instance_type = "${var.instance_type}"
  iam_instance_profile = "${var.iam_instance_profile}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_group_ids)}"]
  user_data = "${file("../cloud-config/ecs.yml")}"

  root_block_device = {
    volume_type = "gp2"
    volume_size = "40"
  }
}

Notice the user_data section.  This is where we inject the cloud config from above to provision CoreOS and launch the ECS agent.

autoscaler.tf

resource "aws_autoscaling_group" "ecs-cluster" {
  availability_zones = ["${split(",", var.availability_zones)}"]
  vpc_zone_identifier = ["${split(",", var.subnet_ids)}"]
  name = "ECS ${var.cluster_name}"
  min_size = "${var.min_size}"
  max_size = "${var.max_size}"
  desired_capacity = "${var.desired_capacity}"
  health_check_type = "EC2"
  launch_configuration = "${aws_launch_configuration.ecs.name}"
  health_check_grace_period = "${var.health_check_grace_period}"

  tag {
    key = "Env"
    value = "${var.environment_name}"
    propagate_at_launch = true
  }

  tag {
    key = "Name"
    value = "ECS ${var.cluster_name}"
    propagate_at_launch = true
  }
}

There are a few caveats I’d like to highlight with this approach.  First, I already have an AWS infrastructure in place that I was testing agains this.  So I didn’t have to do any of the extra work to create a VPC, or a gateway for the VPC.  I didn’t have to create the security groups and subnets either, I just added them to the Terraform code.

The other caveat is that if you want to use the Github project I linked to you will need to make sure that you populate the variables with your own environment specific values.  That is why having the VPC, subnets and security groups was handy for me.  Be sure to browse through the variables.tf file and substitute in your own values.  As an example,  I had to update the variables to use the CoreOS 766.4.0 image.  This AMI will be specific to your AWS region so make sure to look up the AMI first.

variable "ami" {
  /* CoreOS 766.4.0 */
  default = "ami-dbe71d9f"
  description = "AMI id to launch, must be in the region specified by the region variable"
}

Another part I had to modify to get the Github project to work was adding in my AWS credentials which look similar to the following.  Make sure to update these variables with your ID and secret.

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region = "${var.region}"
}

variable "access_key" {
  description = "AWS access key"
  default = "XXX"
}

variable "secret_key" {
  description = "AWS secret access key"
  default = "xxx"
}

Make sure to also copy/edit the autoscaling.tf and launch_config.tf files to reflect anything that is specific to your environment (Terraform will complain if there are issues).

After you have combed through the variables.tf and updated the Terraform files to your liking you can simply run terraform plan -input=false and see how Terraform will create the ASG for you.

If everything looks good, you can run terrafrom apply -input=false and Terraform will go out and start building your new ECS infrastructure for you.  After a few minutes check the EC2 console and your launch config and autoscaling group should be in there.  If that stuff all looks okay, check the ECS console and your new servers should show up and be ready to go to work for you!

NOTE: If you are starting from scratch, it is possible to do all of the infrastructure provisioning via Terraform but it is too far out of the scope of this post to cover because there are a lot of steps to it.

Read More

Graphite threshold alerting with Sensu

Instrumenting your code to report application level metrics is definitely one of the most powerful monitoring tasks you can accomplish.  It is damn satisfying to get working the first time as well.  Having the ability to look at your application and how it is performing at a granular level can help identify potential issues or bottlenecks but can also give you a greater understanding of how people are interacting with the application at a broad scale.  Everybody loves having these types of metrics to talk about their apps and products so this style of monitoring is a great win for the whole team.

I don’t want to dive in to the specifics of WHAT you should monitor here, that will be unique to every environment.  Instead of covering the what and how of instrumenting the code to report specific metrics, I will be running through an example of what the process might look like for instrumenting a check and alarm for monitoring and alerting purposes at an operations level.  I am not a developer, so I don’t spend a lot of time thinking about what types of things are important to collect metrics on.  Usually my job instead is to figure out how to monitor and alert effectively, based on the metrics that developers come up with.

Sensu has a great plugin to check Graphite thresholds in their plugin repo.  If you haven’t looked already, take a minute to glance over the options a little bit and see how the plugin works.  It is a pretty simple plugin but has been able to do everything I need it to.

One common monitoring task is to check how long requests are taking.  So in this example, we are querying the Graphite server and reporting a critical  status (status 1) if the request averages more than 7 seconds for a response time.

Here is the command you would run manually to check this threshold.  Make sure to download the script if you haven’t already, you can just copy the code directly or clone the repo if you are doing this manually.  If you are using Sensu you can use the sensu_plugin LWRP to grab the script (more on that below).

./check-data -s <servername:port> -t <graphite query> -c 7000 -u user -p password
./check-data -s graphite.example.com -t alias(stats.timer.server.response_time.mean, 'Mean') -c 7000 -u myuser -p awesomepassword

There are a few things to note.  The -s flag specifies which graphite server or endpoint to hit, -t specifies the target or the graphite query to run the script against, the -c flag sets the threshold, -u and -p are used if your Graphite server uses authentication.  If your Graphite instance is public it should probably use auth, otherwise if it is internal only, probably not as important.  Obviously these are just dummy values, included to give you a better idea of what a real command should look like.  Use your own values in their place.

The query we’re running is against a statsd metric that for mean response time for a request that gets recorded from the code (this is the developer instrumenting their code part I mentioned).  This check is specific to my environment so you will need to modify any of your queries to make sure to alert on a useful metric and threshold in your own environment.

Here’s an example of what the graphite graph (rendered in Grafana) looks like.

Sensu Graph

Obviously this is just a sample but it should give you the general idea of what to look for.

If you examine the script, there are a few Ruby Gem requirements to get the script to run, which you will need to be installed if you haven’t already.  They are sensu-plugin, json, json-uri and openssl.  You don’t need the sensu-plugin if you are just running the check manually but you WILL need to have it installed on the Sensu client that will be running the scheduled check.  That can be done manually or with the Sensu Chef recipe (specifically for turning on Sensu embedded ruby and ruby gems), which I recommend using anyway if you plan on doing any type of deployments at scale using Sensu.

Here is the Chef code looks like if you use Sensu to deploy this check automatically.

sensu_check "check_request_time" do 
  command "#{node['sensu']['plugindir']}/check-data.rb -s graphite.example.com -t \"alias(stats.timers.server.facedetection.response_time.mean, 'Mean')\" -c 7000 -a 360 -u myuser -p awesomepassword"
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60 
  standalone true 
  additional(:notification => "Request time above threshold", :occurrences => 5)
end

This should look familiar if you have any background using the Sensu Chef cookbook.  Basically we are using the sensu_check LWRP to execute the script with the different parameters we want, using the pagerduty and slack handlers, which are just fancy ways to pipe out the results of the check.  We are also saying we want to run this on a scheduled interval time of 60 seconds as a standalone check, which means it will be executed on the client node (not the Sensu server itself).  Finally, we are saying that after 5 failed checks we want to append a message to the handler that says what exactly is going wrong.

You can stick this logic in an existing recipe or create a new one that handles your metrics threshold checks.  I’m not sure what the best practice is for where to put the check but I have a recipe that runs standalone threshold checks that I stuck this logic in to and it seems to work.  Once the logic has been added you should be able to run chef-client for the new check to get picked up.

Read More