Export SNMP metrics with the Prometheus Operator

There are quite a few use cases for monitoring outside of Kubernetes, especially for previously built infrastructure and otherwise legacy systems.  Additional monitoring adds an extra layer of complexity to your monitoring setup and configuration, but fortunately Prometheus makes this extra complexity easier to manage and maintain, inside of Kubernetes.

In this post I will describe a nice clean way to monitor things that are internal to Kubernetes using Prometheus and the Prometheus Operator.  The advantage of this approach is that it allows the Operator to manage and monitor infrastructure, and it allows Kubernetes to do what it’s good at; make sure the things you want are running for you in an easy to maintain, declarative manifest.

If you are already familiar with the concepts in Kubernetes then this post should be pretty straight forward.  Otherwise, you can pretty much copy/paste most of these manifests into your cluster and you should have a good way to monitor things in your environment that are external to Kubernetes.

Below is an example of how to monitor external network devices using the Prometheus SNMP exporter.  There are many other exporters that can be used to monitor infrastructure that is external to Kubernetes but currently it is recommended to set up these configurations outside of the Prometheus Operator to basically separate monitoring concerns (which I plan on writing more about in the future).

Create the deployment and service

Here is what the deployment might look like.

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: snmp-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: snmp-exporter
  template:
    metadata:
    labels:
      app: snmp-exporter
  spec:
    containers:
    - image: oakman/snmp-exporter
    command: ["/bin/snmp_exporter"]
    args: ["--config.file=/etc/snmp_exporter/snmp.yml"]
    name: snmp-exporter
    ports:
    - containerPort: 9116
      name: metrics

And the accompanying service.

apiVersion: v1
kind: Service
metadata:
  labels:
    app: snmp-exporter
  name: snmp-exporter
spec:
  ports:
  - name: http-metrics
    port: 9116
    protocol: TCP
    targetPort: metrics
  selector:
    app: snmp-exporter

At this point you would have a pod in your cluster, attached to a static IP address.  To see if it worked you can check to make sure a service IP was created.  The service is basically what the Operator uses to create targets in Prometheus.

kubectl get sv

From this point you can 1) set up your own instance of Prometheus using Helm or by deploying via yml manifests or 2) set up the Prometheus Operator.

Today we will walk through option 2, although I will probably cover option 1 at some point in the future.

Setting up the Prometheus Operator

The beauty of using the Prometheus Operator is that it gives you a way to quickly add or change Prometheus specific configuration via the Kubernetes API (custom resource definition) and some custom objects provided by the operator, including AlertManager, ServiceMonitor and Prometheus objects.

The first step is to install Helm, which is a little bit outside of the scope of this post but there are lots of good guides on how to do it.  With Helm up and running you can easily install the operator and the accompanying kube-prometheus manifests which give you access to lots of extra Kubernetes metrics, alerts and dashboards.

helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm install --name prometheus-operator --set rbacEnable=true --namespace monitoring coreos/prometheus-operator
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring

After a few moments you can check to see that resources were created correctly as a quick test.

kubectl get pods -n monitoring

NOTE: You may need to manually add the “prometheus” service account to the monitoring namespace after creating everything.  I ran into some issues because Helm didn’t do this automatically.  You can check this with kubectl get events.

Prometheus Operator configuration

Below are steps for creating custom objects (CRDs) that the Prometheus Operator uses to automatically generate configuration files and handle all of the other management behind the scenes.

These objects are wired up in a way that configs get reloaded and Prometheus will automatically get updated when it sees a change.  These object definitions basically convert all of the Prometheus configuration into a format that is understood by Kubernetes and converted to Prometheus configuration with the operator.

First we make a servicemonitor for monitoring the the snmp exporter.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: snmp-exporter
    prometheus: kube-prometheus # tie servicemonitor to correct Prometheus
  name: snmp-exporter
spec:
  jobLabel: k8s-app
  selector:
    app: snmp-exporter
  namespaceSelector:
    matchNames:
    - monitoring

  endpoints:
  - interval: 60s
    port: http-metrics
    params:
      module:
      - if_mib # Select which SNMP module to use
      target:
      - 1.2.3.4 # Modify this to point at the SNMP target to monitor
    path: "/snmp"
    targetPort: 9116

Next, we create a custom alert and tie it our Prometheus Operator.  The alert doesn’t do anything useful, but is a good demo for showing how easy it is to add and manage alerts using the Operator.

Create an alert-example.yml configuration file, add it as a configmap to k8s and mount it in as a configuration with the ruleSelector label selector and the prometheus operator will do the rest. Below shows how to hook up a test rule into an existing Prometheus (kube-prometheus) alert manager, handled by the prometheus-operator.

kind: ConfigMap
apiVersion: v1
metadata:
 name: josh-test
 namespace: monitoring
 labels:
 role: alert-rules # Standard convention for organizing alert rules
 prometheus: kube-prometheus # tie to correct Prometheus
data:
 test.rules.yaml: |
 groups:
 - name: test.rules # Top level description in Prometeheus
 rules:
 - alert: TestAlert
 expr: vector(1)

Once you have created the rule definition via configmap just use kubectl to create it.

kubectl create -f alert-example.yml -n monitoring

Testing and troubleshooting

You will probably need to port forward the pod to get access to the IP and port in the cluster

kubectl port-forward snmp-exporter-<name> 9116

Then you should be able to visit the pod in your browser (or with curl).

localhost:9116

The exporter itself does a lot more so you will probably want to play around with it.  I plan on covering more of the details of other external exporters and more customized configurations for the SNMP exporter.

For example, if you want to do any sort of monitoring past basic interface stats, etc. you will need to generate and build your own set of MIBs to gather metrics from your infrastructure and also reconfigure your ServiceMonitor object in Kubernetes to use the correct MIBs so that the Operator updates the configuration correctly.

Conclusion

The amount of options for how to use Prometheus is one area of confusion when it comes to using Prometheus, especially for newcomers.  There are lots of ways to do things and there isn’t much direction on how to use them, which can also be viewed as a strength since it allows for so much flexibility.

In some situations it makes sense to use an external (non Operator managed Prometheus) when you need to do things like manage and tune your own configuration files.  Likewise, the Prometheus Operator is a great fit when you are mostly only concerned about managing and monitoring things inside Kubernetes and don’t need to do much external monitoring.

That said, there is some support for external monitoring using the Prometheus Operator, which I want to write about in a different post.  This support is limited to a handful of different external exporters (for the time being) so the best advice is to think about what kind of monitoring is needed and choose the best solution for your own use case.  It may turn out that both types of configurations are needed, but it may also end up being just as easy to use one method or another to manage Prometheus and its configurations.

Read More

Monitoring email flow with MFM

This is a sponsored post by the folks over at EveryCloud.  They have recently developed and released a new tool to help manage and troubleshoot email issues, which is starting to get some traction, especially among Exchange environments.  As a mail admin in a previous life, I can sympathize with desire for better monitor tools.  Here’s their post.


Managing mail flow is a challenge for every systems administrator, and the price of a mistake is very high. Any interruption in mail flow can spell disaster for a company, disrupting daily operations and leaving the management team, the IT team and the systems administrator scrambling for solutions.

While there are a number of mail flow solutions on the market, they tend to be quite pricey, making it difficult for systems administrators, especially those who work for small businesses and start-ups, to justify the cost.

For those who do not already know, the makers behind the EveryCloud mail flow monitor have recently launched a free service – Mail Flow Monitor (MFM). EveryCloud MFM tool is the only free round-trip mail flow monitor on the market, giving systems administrators the ability to observe their organizations’ email systems 24 hours a day, 7 days a week and 365 days a year, all without spending a penny.

mfm dashboard

Some of the features of Mail Flow Monitor include:

  • A full-featured round trip monitor, with start-to-finish email tracking and monitoring
  • Systems administrators can receive real-time text and email alerts whenever a delay or rejection occurs – to your cell phone as well an email or to your alternative email address.
  • Timely monitoring means issues can be addressed quickly, before they spiral out of control
  • The system sends a test email every few minutes to a monitoring mailbox on your server. You set up a forward to send the emails back and the Everycloud team does the rest.
  • MFM is cloud based, which means there is nothing to update or manage.
  • MSP’s and IT Resellers can create an account and manage as many customers as they wish via the EveryCloud Partner Area, all completely free!

When you consider that competing mail filtering solutions generally cost about $30 a month, it is easy to see the saving potential. That $360 annual cost savings may not seem like much, but since it is assessed on a domain level, the charges can add up quickly. In addition, the per-domain charges can make managing a complex IT operation difficult, an extra level of hard work that systems administrators do not need.

From the smallest startups to the largest multinational corporations, modern businesses live and die on their email. An unexpected email breakdown, significant bottleneck or major failure could make the firm’s email inaccessible and unreliable for hours or even days, and every minute of downtime is costing the company money.

Read More

Tips for monitoring Rancher Server

Last week I encountered an interesting bug in Rancher that managed to cause some major problems across my Rancher infrastructure.  Basically, the bug was causing of the Rancher agent clients to continuously bounce between disconnected/reconnected/finished and reconnecting states, which only manifested itself either after a 12 hour period or by deactivating/activating agents (for example adding a new host to an environment).  The only way to temporarily fix the issue was to restart the rancher-server container.

With some help, we were eventually able to resolve the issue.  I picked up a few nice lessons along the way and also became intimately acquainted with some of the inner workings of Rancher.  Through this experience I learned some tips on how to effectively monitor the Rancher server environment that I would otherwise not have been exposed to, which I would like to share with others today.

All said and done, I view this experience as a positive one.  Hitting the bug has not only helped mitigate this specific issue for other users in the future but also taught me a lot about the inner workings of Rancher.  If you’re interested in the full story you can read about all of the details about the incident, including steps to reliably reproduce and how the issue was ultimately resolved here.  It was a bug specific to Rancher v1.5.1-3, so upgrading to 1.5.4 should fix this issue if you come across it.

Before diving into the specifics for this post, I just want to give a shout out to the Rancher community, including @cjellik, @ibuildthecloud, @ecliptok and @shakefu.  The Rancher developers, team and community members were extremely friendly and helpful in addressing and fixing the issue.  Between all the late night messages in the Rancher slack, many many logs, countless hours debugging and troubleshooting I just wanted to say thank you to everyone for the help.  The small things go a long way, and it just shows how great the growing Rancher community is.

Effective monitoring

I use Sysdig as the main source of container and infrastructure monitoring.  To accomplish the metric collection, I run the Sysdig agent as a systemd service when a server starts up so when a server dies and goes away or a new one is added, Sysdig is automatically started up and begins dumping that metric data into the Sysdig Cloud for consumption through the web interface.

I have used this data to create custom dashboards which gives me a good overview about what is happening in the Rancher server environment (and others) at any given time.

sysdig dashboard

The other important thing I discovered through this process, was the role that the Rancher database plays.  For the Rancher HA setup, I am using an externally hosted RDS instance for the Rancher database and was able to fine found some interesting correlations as part of troubleshooting thanks to the metrics in Sysdig.  For example, if the database gets stressed it can cause other unintended side effects, so I set up some additional monitors and alerts for the database.

Luckily Sysdig makes the collection of these additional AWS metrics seamless.  Basically, Sysdig offers an AWS integration which pull in CloudWatch metrics and allows you to add them to dashboards and alert on them from Sysdig, which has been very nice so far.

Below are some useful metrics in helping diagnose and troubleshoot various Rancher server issues.

  • Memory usage % (server)
  • CPU % (server)
  • Heap used over time (server)
  • Number of network connections (server)
  • Network bytes by application (server)
  • Freeable memory over time (RDS)
  • Network traffic over time (RDS)

As you can see, there are quite a few things you can measure with metrics alone.  Often though, this isn’t enough to get the entire picture of what is happening in an environment.

Logs

It is also important to have access to (useful) logs in the infrastructure in order to gain insight into WHY metrics are showing up the way they do and also to help correlate log messages and errors to what exactly is going on in an environment when problems occur.  Docker has had the ability for a while now to use log drivers to customize logging, which has been helpful to us.  In the beginning, I would just SSH into the server and tail the logs with the “docker logs” command but we quickly found that to be cumbersome to do manually.

One alternative to tailing the logs manually is to configure the Docker daemon to automatically send logs to a centralized log collection system.  I use Logstash in my infrastructure with the “gelf” log driver as part of the bootstrap command that runs to start the Rancher server container, but there are other logging systems if Logstash isn’t the right fit.  Here is what the relevant configuration looks like.

...
--log-driver=gelf \
--log-opt gelf-address=udp://<logstash-server>:12201 \
--log-opt tag=rancher-server \
...

Just specify the public address of the Logstash log collector and optionally add tags.  The extra tags make filtering the logs much easier, so I definitely recommend adding at least one.

Here are a few of the Logstash filters for parsing the Rancher logs.  Be aware though, it is currently not possible to log full Java stack traces in Logstash using the gelf input.

if [tag] == "rancher-server" {
    mutate { remove_field => "command" }
    grok {
      match => [ "host", "ip-(?<ipaddr>\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})" ]
    }

    # Various filters for Rancher server log messages
    grok {
     match => [ "message", "time=\"%{TIMESTAMP_ISO8601}\" level=%{LOGLEVEL:debug_level} msg=\"%{GREEDYDATA:message_body}\"" ]
     match => [ "message", "%{TIMESTAMP_ISO8601} %{WORD:debug_level} (?<context>\[.*\]) %{GREEDYDATA:message_body}" ]
     match => [ "message", "%{DATESTAMP} http: %{WORD:http_type} %{WORD:debug_level}: %{GREEDYDATA:message_body}" ]
   }
 }

There are some issues open for addressing this, but it doesn’t seem like there is much movement on the topic, so if you see a lot of individual messages from stack traces that is the reason.

One option to mitigate the problem of stack traces would be to run a local log collection agent (in a container of course) on the rancher server host, like Filebeat or Fluentd that has the ability to clean up the logs before sending it to something like Logstash, ElasticSearch or some other centralized logging.  This approach has the added benefit of adding encryption to the logs, which GELF does not have (currently).

If you don’t have a centralized logging solution or just don’t care about rancher-server logs shipping to it – the easiest option is to tail the logs locally as I mentioned previously, using the json-file log format.  The only additional configuration I would recommend to the json-file format is to turn on log rotation which can be accomplished with the following configuration.

...
 --log-driver=json-file \
 --log-opt max-size=100mb \
 --log-opt max-file=2 \
...

Adding these logging options will ensure that the container logs for rancher-server will never full up the disk on the server.

Bonus: Debug logs

Additional debug logs can be found inside of each rancher-server container.  Since these debug logs are typically not needed in day to day operations, they are sort of an easter egg, tucked away.  To access these debug logs, they are located in /var/lib/cattle/logs/ inside of the rancher-server container.  The easiest way to analyze the logs is to get them off the server and onto a local machine.

Below is a sample of how to do this.

docker exec -it <rancher-server> bash
cd /var/lib/cattle/logs
cp cattle-debug.log /tmp

Then from the host that the container is sitting on you can docker cp the logs out of the container and onto the working directory of the host.

docker cp <rancher-server>:/tmp/cattle-debug.log .

From here you can either analyze the logs in a text editor available on the server, or you can copy the logs over to a local machine.  In the example below, the server uses ssh keys for authentication and I chose to copy the logs from the server into my local /tmp directory.

 scp -i ~/.ssh/<rancher-server-pem> user@rancher-server:/tmp/cattle-debug.log /tmp/cattle-debug.log

With a local copy of the logs you can either examine the logs using your favorite text editor or you can upload them elsewhere for examination.

Conclusion

With all of our Rancher server metrics dumping into Sysdig Cloud along with our logs dumping into Logstash it has made it easier for multiple people to quickly view and analyze what was going on with the Rancher servers.  In HA Rancher environments with more than one rancher-server running, it also makes filtering logs based on the server or IP much easier.  Since we use 2 hosts in our HA setup we can now easily filter the logs for only the server that is acting as the master.

As these container based grow up, they also become much more complicated to troubleshoot.  With better logging and monitoring systems in place it is much easier to tell what is going on at a glance and with the addition of the monitoring solution we can be much more proactive about finding issues earlier and mitigating potential problems much faster.

Read More

Dockerizing Sentry

I have created a Github project that has basic instructions for getting started.  You can take a look over there for ideas of how all of this works and to get ideas for your own set up.

I used the following links as reference for my approach to Dockerizing Sentry.

https://registry.hub.docker.com/u/slafs/sentry
https://github.com/rchampourlier/docker-sentry

If you have configurations to use, it is probably a good idea to start from there.  You can check my Github repo for what a basic configuration looks like.  If you are starting from scratch or are using version 7.1.x or above you can use the “sentry init” command to generate a skeleton configuration to work from.

For this setup to work you will need the following prebuilt Docker images/containers. I suggest using something simple like docker-compose to stitch the containers together.

  • redis – https://registry.hub.docker.com/_/redis/
  • postgres – https://registry.hub.docker.com/_/postgres/
  • memcached – https://hub.docker.com/_/memcached/
  • nginx – https://hub.docker.com/_/nginx/

NOTE: If you are running this on OS X you may need to do some trickery and give special permission on the host (mac) level e.g. create ~/docker/postgres directory and give it the correct permission (I just used 777 recursively for testing, make sure to lock it down if you put this in production).

I wrote a little script in my Github project that will take care of setting up all of the directories on the host OS that need to be set up for data to persist.  The script also generates a self signed cert to use for proxying Sentry through Nginx.  Without the certificate, the statistics pages in the Sentry web interface will be broken.

To run the script, run the following command and follow the prompts.  Also make sure you have docker-compose installed beforehand to run all the needed command.

sudo ./setup.sh

The certs that get generated are self signed so you will see the red lock in your browser.  I haven’t tried it yet but I imagine using Let’s Encrytpt to create the certificates would be very easy.  Let me know if you have had any success generating Nginx certs for Docker containers, I might write a follow up post.

Preparing Postgres

After setting up directories and creating certificates, the first thing necessary to getting up and going is to add the Sentry superuser to Postgres (at least 9.4).  To do this, you will need to fire up the Postgres container.

docker-compose up -d postgres

Then to connect to the Postgres DB you can use the following command.

docker-compose run postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres'

Once you are logged in to the Postgres container you will need to set up a few Sentry DB related things.

First, create the role.

CREATE ROLE sentry superuser;

And then allow it to login.

ALTER ROLE sentry WITH LOGIN;

Create the Sentry DB.

CREATE DATABASE sentry;

When you are done in the container, \q will drop out of the postgresql shell.

After you’re done configuring the DB components you will need to “prime” Sentry by running it a first time.  This will probably take a little bit of time because it also requires you to build and pull all the other needed Docker images.

docker-compose build
docker-compose up

You will quickly notice if you try to browse to the Sentry URL (e.g. the IP/port of your Sentry container or docker-machine IP if you’re on OS X) that you will get errors in the logs and 503’s if you hit the site.

Repair the database (if needed)

To fix this you will need to run the following command on your DB to repair it if this is the first time you have run through the set up.

docker-compose run sentry sentry upgrade

The default Postgres database username and password is sentry in this setup, as part of the setup the upgrade prompt will ask you got create a new user and password, and make note of what those are.  You will definitely want to change these configs if you use this outside of a test or development environment.

After upgrading/preparing the database, you should be able to bring up the stack again.

docker-compose up -d && docker-compose logs

Now you should be able to get to the Sentry URL and start configuring .  To manage the username/password you can visit the /admin url and set up the accounts.

 

Next steps

The Sentry server should come up and allow you in but will likely need more configuration.  Using the power of docker-compose it is easy to add in any custom configurations you have.  For example, if you need to adjust sentry level configurations all you need to do is edit the file in ./sentry/sentry.conf.py and then restart the stack to pick up the changes.  Likewise, if you need to make changes to Nginx or celery, just edit the configuration file and bump the stack – using “docker-compose up -d”.

I have attempted to configure as many sane defaults in the base config to make the configuration steps easier.  You will probably want to check some of the following settings in the sentry/sentry.conf.py file.

  • SENTRY_ADMIN_EMAIL – For notifications
  • SENTRY_URL_PREFIX – This is especially important for getting stats working
  • SENTRY_ALLOW_ORIGIN – Where to allow communications from
  • ALLOWED_HOSTS – Which hosts can communicate with Sentry

If you have the SENTRY_URL_PREFIX set up correctly you should see something similar when you visit the /queue page, which indicates statistics are working.

Sentry Queue

If you want to set up any kind of email alerting, make sure to check out the mail server settings.

docker-compose.yml example file

The following configuration shows how the Sentry stack should look.  The meat of the logic is in this configuration but since docker-compose is so flexible, you can modify this to use any custom commands, different ports or any other configurations you may need to make Sentry work in your own environment.

# Caching
redis:
  image: redis:2.8
  hostname: redis
  ports:
    - "6379:6379"
   volumes:
     - "/data/redis:/data"

memcached:
  image: memcached
  hostname: memcached
  ports:
    - "11211:11211"

# Database
postgres:
  image: postgres:9.4
  hostname: postgres
  ports:
    - "5432:5432"
  volumes:
    - "/data/postgres/etc:/etc/postgresql"
    - "/data/postgres/log:/var/log/postgresql"
    - "/data/postgres/lib/data:/var/lib/postgresql/data"

# Customized Sentry configuration
sentry:
  build: ./sentry
  hostname: sentry
  ports:
    - "9000:9000"
    - "9001:9001"
  links:
    - postgres
    - redis
    - celery
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"


# Celery
celery:
  build: ./sentry
  hostname: celery
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery worker -B -l WARNING"
  links:
    - postgres
    - redis
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Celerybeat
celerybeat:
  build: ./sentry
  hostname: celerybeat
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery beat -l WARNING"
  links:
    - postgres
    - redis
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Nginx
nginx:
  image: nginx
  hostname: nginx
  ports:
    - "80:80"
    - "443:443"
  links:
    - sentry
  volumes:
    - "./nginx/sentry.conf:/etc/nginx/conf.d/default.conf"
    - "./nginx/sentry.crt:/etc/nginx/ssl/sentry.crt"
    - "./nginx/sentry.key:/etc/nginx/ssl/sentry.key"

The Dockerfiles for each of these component are fairly straight forward.  In fact, the same configs can be used for the Sentry, Celery and Celerybeat services.

Sentry

# Kombu breaks in 2.7.11
FROM python:2.7.10

# Set up sentry user
RUN groupadd sentry && useradd --create-home --home-dir /home/sentry -g sentry sentry
WORKDIR /home/sentry

# Sentry dependencies
RUN pip install \
 psycopg2 \
 mysql-python \
 supervisor \
 # Threading
 gevent \
 eventlet \
 # Memcached
 python-memcached \
 # Redis
 redis \
 hiredis \
 nydus

# Sentry
ENV SENTRY_VERSION 7.7.4
RUN pip install sentry==$SENTRY_VERSION

# Set up directories
RUN mkdir -p /home/sentry/.sentry \
 && chown -R sentry:sentry /home/sentry/.sentry \
 && chown -R sentry /var/log

# Configs
COPY sentry.conf.py /home/sentry/.sentry/sentry.conf.py

#USER sentry
EXPOSE 9000/tcp 9001/udp

# Making sentry commands easier to run
RUN ln -s /home/sentry/.sentry /root

CMD sentry --config=/home/sentry/.sentry/sentry.conf.py start

Since the customized Sentry config is rather lengthy, I will point you to the Github repo again.  There are a few values that you will need to provide but they should be pretty self explanatory.

Once the configs have all been put in to place you should be good to go.  A bonus piece would be to add an Upstart service that takes care of managing the stack if the server either gets rebooted or the containers manage to get stuck in an unstable state.  The configuration is a fairly easy thing to do and many other guides and posts have been written about how to accomplish this.

Read More

Graphite threshold alerting with Sensu

Instrumenting your code to report application level metrics is definitely one of the most powerful monitoring tasks you can accomplish.  It is damn satisfying to get working the first time as well.  Having the ability to look at your application and how it is performing at a granular level can help identify potential issues or bottlenecks but can also give you a greater understanding of how people are interacting with the application at a broad scale.  Everybody loves having these types of metrics to talk about their apps and products so this style of monitoring is a great win for the whole team.

I don’t want to dive in to the specifics of WHAT you should monitor here, that will be unique to every environment.  Instead of covering the what and how of instrumenting the code to report specific metrics, I will be running through an example of what the process might look like for instrumenting a check and alarm for monitoring and alerting purposes at an operations level.  I am not a developer, so I don’t spend a lot of time thinking about what types of things are important to collect metrics on.  Usually my job instead is to figure out how to monitor and alert effectively, based on the metrics that developers come up with.

Sensu has a great plugin to check Graphite thresholds in their plugin repo.  If you haven’t looked already, take a minute to glance over the options a little bit and see how the plugin works.  It is a pretty simple plugin but has been able to do everything I need it to.

One common monitoring task is to check how long requests are taking.  So in this example, we are querying the Graphite server and reporting a critical  status (status 1) if the request averages more than 7 seconds for a response time.

Here is the command you would run manually to check this threshold.  Make sure to download the script if you haven’t already, you can just copy the code directly or clone the repo if you are doing this manually.  If you are using Sensu you can use the sensu_plugin LWRP to grab the script (more on that below).

./check-data -s <servername:port> -t <graphite query> -c 7000 -u user -p password
./check-data -s graphite.example.com -t alias(stats.timer.server.response_time.mean, 'Mean') -c 7000 -u myuser -p awesomepassword

There are a few things to note.  The -s flag specifies which graphite server or endpoint to hit, -t specifies the target or the graphite query to run the script against, the -c flag sets the threshold, -u and -p are used if your Graphite server uses authentication.  If your Graphite instance is public it should probably use auth, otherwise if it is internal only, probably not as important.  Obviously these are just dummy values, included to give you a better idea of what a real command should look like.  Use your own values in their place.

The query we’re running is against a statsd metric that for mean response time for a request that gets recorded from the code (this is the developer instrumenting their code part I mentioned).  This check is specific to my environment so you will need to modify any of your queries to make sure to alert on a useful metric and threshold in your own environment.

Here’s an example of what the graphite graph (rendered in Grafana) looks like.

Sensu Graph

Obviously this is just a sample but it should give you the general idea of what to look for.

If you examine the script, there are a few Ruby Gem requirements to get the script to run, which you will need to be installed if you haven’t already.  They are sensu-plugin, json, json-uri and openssl.  You don’t need the sensu-plugin if you are just running the check manually but you WILL need to have it installed on the Sensu client that will be running the scheduled check.  That can be done manually or with the Sensu Chef recipe (specifically for turning on Sensu embedded ruby and ruby gems), which I recommend using anyway if you plan on doing any type of deployments at scale using Sensu.

Here is the Chef code looks like if you use Sensu to deploy this check automatically.

sensu_check "check_request_time" do 
  command "#{node['sensu']['plugindir']}/check-data.rb -s graphite.example.com -t \"alias(stats.timers.server.facedetection.response_time.mean, 'Mean')\" -c 7000 -a 360 -u myuser -p awesomepassword"
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60 
  standalone true 
  additional(:notification => "Request time above threshold", :occurrences => 5)
end

This should look familiar if you have any background using the Sensu Chef cookbook.  Basically we are using the sensu_check LWRP to execute the script with the different parameters we want, using the pagerduty and slack handlers, which are just fancy ways to pipe out the results of the check.  We are also saying we want to run this on a scheduled interval time of 60 seconds as a standalone check, which means it will be executed on the client node (not the Sensu server itself).  Finally, we are saying that after 5 failed checks we want to append a message to the handler that says what exactly is going wrong.

You can stick this logic in an existing recipe or create a new one that handles your metrics threshold checks.  I’m not sure what the best practice is for where to put the check but I have a recipe that runs standalone threshold checks that I stuck this logic in to and it seems to work.  Once the logic has been added you should be able to run chef-client for the new check to get picked up.

Read More