Tips for monitoring Rancher Server

Last week I encountered an interesting bug in Rancher that managed to cause some major problems across my Rancher infrastructure.  Basically, the bug caused some of the Rancher agent clients to continuously bounce between disconnected/reconnected/finished and reconnecting states, which only manifested itself either after a 12 hour period or after deactivating/activating agents (for example adding a new host to an environment).  The only way to temporarily fix the issue was to restart the rancher-server container.

With some help, we were eventually able to resolve the issue.  I picked up a few nice lessons along the way and also became intimately acquainted with some of the inner workings of Rancher.  Through this experience I learned some tips on how to effectively monitor the Rancher server environment that I would otherwise not have been exposed to, which I would like to share with others today.

All said and done, I view this experience as a positive one.  Hitting the bug has not only helped mitigate this specific issue for other users in the future but also taught me a lot about the inner workings of Rancher.  If you’re interested in the full story, you can read all of the details about the incident, including steps to reliably reproduce it and how the issue was ultimately resolved, here.  The bug was specific to Rancher v1.5.1-3, so upgrading to 1.5.4 should fix this issue if you come across it.

Before diving into the specifics for this post, I just want to give a shout out to the Rancher community, including @cjellik, @ibuildthecloud, @ecliptok and @shakefu.  The Rancher developers, team and community members were extremely friendly and helpful in addressing and fixing the issue.  Between all the late night messages in the Rancher Slack, the many, many logs, and the countless hours of debugging and troubleshooting, I just want to say thank you to everyone for the help.  The small things go a long way, and it just shows how great the growing Rancher community is.

Effective monitoring

I use Sysdig as the main source of container and infrastructure monitoring.  To accomplish the metric collection, I run the Sysdig agent as a systemd service that starts when a server boots, so whenever a server dies and goes away or a new one is added, the agent comes up automatically and begins dumping metric data into Sysdig Cloud for consumption through the web interface.
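For reference, below is a rough sketch of the kind of unit that does this.  It assumes the agent runs as a container; the unit name, volume mounts and <sysdig-access-key> placeholder are my own placeholders rather than the exact unit I use, so check the Sysdig agent documentation for the current run command.

# Sketch only: start the Sysdig agent container at boot via systemd.
# The unit name, mounts and <sysdig-access-key> are assumptions.
sudo tee /etc/systemd/system/sysdig-agent.service <<'EOF'
[Unit]
Description=Sysdig agent container
After=docker.service
Requires=docker.service

[Service]
Restart=always
ExecStartPre=-/usr/bin/docker rm -f sysdig-agent
ExecStart=/usr/bin/docker run --name sysdig-agent --privileged --net host --pid host \
  -e ACCESS_KEY=<sysdig-access-key> \
  -v /var/run/docker.sock:/host/var/run/docker.sock \
  -v /dev:/host/dev \
  -v /proc:/host/proc:ro \
  -v /boot:/host/boot:ro \
  -v /lib/modules:/host/lib/modules:ro \
  -v /usr:/host/usr:ro \
  sysdig/agent
ExecStop=/usr/bin/docker stop sysdig-agent

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now sysdig-agent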

I have used this data to create custom dashboards which give me a good overview of what is happening in the Rancher server environment (and others) at any given time.

sysdig dashboard

The other important thing I discovered through this process was the role that the Rancher database plays.  For the Rancher HA setup, I am using an externally hosted RDS instance for the Rancher database and was able to find some interesting correlations as part of troubleshooting, thanks to the metrics in Sysdig.  For example, if the database gets stressed it can cause other unintended side effects, so I set up some additional monitors and alerts for the database.

Luckily Sysdig makes the collection of these additional AWS metrics seamless.  Basically, Sysdig offers an AWS integration which pulls in CloudWatch metrics and allows you to add them to dashboards and alert on them from Sysdig, which has been very nice so far.

Below are some metrics that are useful for diagnosing and troubleshooting various Rancher server issues.

  • Memory usage % (server)
  • CPU % (server)
  • Heap used over time (server)
  • Number of network connections (server)
  • Network bytes by application (server)
  • Freeable memory over time (RDS)
  • Network traffic over time (RDS)

As you can see, there are quite a few things you can measure with metrics alone.  Often though, this isn’t enough to get the entire picture of what is happening in an environment.

Logs

It is also important to have access to (useful) logs in the infrastructure in order to gain insight into WHY metrics are showing up the way they do, and to help correlate log messages and errors to what exactly is going on in an environment when problems occur.  Docker has had the ability for a while now to use log drivers to customize logging, which has been helpful to us.  In the beginning, I would just SSH into the server and tail the logs with the “docker logs” command, but that quickly became cumbersome to do manually.

One alternative to tailing the logs manually is to configure the Docker daemon to automatically send logs to a centralized log collection system.  I use Logstash in my infrastructure with the “gelf” log driver as part of the bootstrap command that runs to start the Rancher server container, but there are other logging systems if Logstash isn’t the right fit.  Here is what the relevant configuration looks like.

...
--log-driver=gelf \
--log-opt gelf-address=udp://<logstash-server>:12201 \
--log-opt tag=rancher-server \
...

Just specify the public address of the Logstash log collector and optionally add tags.  The extra tags make filtering the logs much easier, so I definitely recommend adding at least one.
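On the Logstash side, all GELF needs is a matching gelf input listening on the same UDP port.  A minimal sketch is below; the config file path is an assumption and will depend on how your Logstash instance is laid out.

# Minimal Logstash input for the GELF driver above (file path is an assumption)
cat <<'EOF' | sudo tee /etc/logstash/conf.d/10-gelf-input.conf
input {
  gelf {
    port => 12201
  }
}
EOF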

Here are a few of the Logstash filters for parsing the Rancher logs.  Be aware though, it is currently not possible to log full Java stack traces in Logstash using the gelf input.

if [tag] == "rancher-server" {
    mutate { remove_field => "command" }
    grok {
      match => [ "host", "ip-(?<ipaddr>\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})" ]
    }

    # Various filters for Rancher server log messages
    grok {
     match => [ "message", "time=\"%{TIMESTAMP_ISO8601}\" level=%{LOGLEVEL:debug_level} msg=\"%{GREEDYDATA:message_body}\"" ]
     match => [ "message", "%{TIMESTAMP_ISO8601} %{WORD:debug_level} (?<context>\[.*\]) %{GREEDYDATA:message_body}" ]
     match => [ "message", "%{DATESTAMP} http: %{WORD:http_type} %{WORD:debug_level}: %{GREEDYDATA:message_body}" ]
   }
 }

There are some issues open for addressing this, but it doesn’t seem like there is much movement on the topic, so if you see stack traces split up into a lot of individual messages, that is the reason.

One option to mitigate the stack trace problem would be to run a local log collection agent (in a container of course) on the rancher-server host, such as Filebeat or Fluentd, which can clean up the logs before sending them to Logstash, Elasticsearch or some other centralized logging system.  This approach has the added benefit of supporting encryption for the logs in transit, which GELF does not (currently).

If you don’t have a centralized logging solution, or just don’t care about shipping the rancher-server logs to one, the easiest option is to tail the logs locally as I mentioned previously, using the json-file log driver.  The only additional configuration I would recommend for the json-file driver is to turn on log rotation, which can be accomplished with the following configuration.

...
 --log-driver=json-file \
 --log-opt max-size=100mb \
 --log-opt max-file=2 \
...

Adding these logging options will ensure that the container logs for rancher-server never fill up the disk on the server.
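If you want to confirm the options actually took effect on a running container, docker inspect will show the active log driver configuration (substitute your container name):

docker inspect -f '{{ json .HostConfig.LogConfig }}' <rancher-server>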

Bonus: Debug logs

Additional debug logs can be found inside each rancher-server container.  Since these debug logs are typically not needed in day to day operations, they are sort of an easter egg, tucked away in /var/lib/cattle/logs/ inside the rancher-server container.  The easiest way to analyze the logs is to get them off the server and onto a local machine.

Below is a sample of how to do this.

docker exec -it <rancher-server> bash
cd /var/lib/cattle/logs
cp cattle-debug.log /tmp

Then, from the host that the container is running on, you can docker cp the logs out of the container and into the working directory on the host.

docker cp <rancher-server>:/tmp/cattle-debug.log .

From here you can either analyze the logs in a text editor available on the server, or you can copy the logs over to a local machine.  In the example below, the server uses ssh keys for authentication and I chose to copy the logs from the server into my local /tmp directory.

 scp -i ~/.ssh/<rancher-server-pem> user@rancher-server:/tmp/cattle-debug.log /tmp/cattle-debug.log

With a local copy of the logs you can either examine the logs using your favorite text editor or you can upload them elsewhere for examination.

Conclusion

With all of our Rancher server metrics flowing into Sysdig Cloud and our logs shipping to Logstash, it has become much easier for multiple people to quickly view and analyze what is going on with the Rancher servers.  In HA Rancher environments with more than one rancher-server running, it also makes filtering logs based on the server or IP much easier.  Since we use 2 hosts in our HA setup, we can now easily filter the logs for only the server that is acting as the master.

As these container based environments grow, they also become much more complicated to troubleshoot.  With better logging and monitoring systems in place it is much easier to tell what is going on at a glance, and we can be much more proactive about finding issues earlier and mitigating potential problems faster.


Configure a Rancher HAProxy health check

If you are familiar with ELB/ALB you will know that there are slight idiosyncrasies between the two.  For example, ELB allows you to health check a back end server by TCP port, basically letting you check whether a back end comes up and is listening on a specified port.  ALB is slightly different in its method of health checking: it uses HTTP checks (layer 7) to ensure back end instances are up and listening.

This becomes a problem in Rancher, when you have multiple stacks in a single environment that are fronted by the Rancher HAProxy load balancer.  By default, the HAProxy config does not have a health check endpoint configured, so ALB is never able to know if the back end server is actually up and listening for requests.

A colleague and I recently discovered a neat trick for solving this problem if you are fronting your environment with an ALB.  The solution to this conundrum is to sprinkle a little bit of custom configuration into the Rancher HAProxy config.

In Rancher, you can modify the live settings without downtime.  Click on the load balancer that sits behind the ALB and navigate to the Custom haproxy.cfg tab.

haproxy config

Modify the HAProxy config by adding the following:

# Use to report haproxy's status
defaults
    mode http
    monitor-uri /_ping

Click the “Edit” button to apply these changes and you should be all set.

Next, find the health check configuration for the associated ALB in the AWS console and add a check to the /_ping path on port 80 (or whichever port you are exposing/plan to listen on).  It should look similar to the following example.

Health checks

Below is an example that maps a DNS name to an internal Nginx container that is listening for requests on port 80.

HAProxy configuration

The check in ALB ensures that the HAProxy load balancer in Rancher is up and running before allowing traffic to be routed to it.  You can verify that your Rancher load balancer is working if the instances behind your ALB start showing a status of healthy in the AWS console.
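You can also sanity check the monitor endpoint directly from any machine that can reach the load balancer.  As long as HAProxy is listening on the port, it answers the /_ping path itself and you should get a 200 back (substitute your load balancer address):

curl -i http://<rancher-lb-address>/_ping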

NOTE: If you don’t have any apps initially behind the Rancher load balancer (or that are listening on the port specified in the health check) the AWS instances behind ALB will remain unhealthy until you add configuration in Rancher for the stacks to be exposed, as pictured above.

After setting up HAProxy, publicly accessible services in private Rancher environments can easily be managed by updating the HAProxy config.  Just add a DNS name and a service to link to, and HAProxy is able to figure out how and where to route requests.  To map other services that aren’t listening on port 80, the process is very similar: use the above as a guideline and simply update the target port to whichever port the app is listening on internally.


Backup Route 53 zones

We have all heard about DNS catastrophes.  I just read a horror story on Reddit the other day, where an Azure root DNS zone was accidentally deleted with no backup.  I experienced a similar disaster a few years ago – a simple DNS change managed to knock out internal DNS for an entire domain, which contained hundreds of records.  Reading the post hit close to home, uncovering some of my own past anxiety, so I began poking around for solutions.  Immediately, I noticed that backing up DNS records is usually skipped over as part of the backup process.  Folks just tend to never do it, for whatever reason.

I did discover, though, that backing up DNS is easy.  So I decided to fix the problem.

I wrote a simple shell script that dumps all Route 53 zones for a given AWS account to JSON files and uploads them to an S3 bucket.  The script is only a handful of lines, which is perfect because it doesn’t take much effort to potentially save your bacon.

If you don’t host DNS in AWS, the script can be modified to work for other DNS providers (assuming they have public APIs).

Here’s the script:

#!/usr/bin/env bash

set -e

# Dump route 53 zones to a text file and upload to S3.

BACKUP_DIR=/home/<user>/dns-backup
BACKUP_BUCKET=<bucket>
# Use full paths for cron
CLIPATH="/usr/local/bin"

# Dump all zones to a file and upload to s3
function backup_all_zones () {
  local zones
  # Enumerate all zones
  zones=$($CLIPATH/aws route53 list-hosted-zones | jq -r '.HostedZones[].Id' | sed "s/\/hostedzone\///")
  for zone in $zones; do
    echo "Backing up zone $zone"
    $CLIPATH/aws route53 list-resource-record-sets --hosted-zone-id $zone > $BACKUP_DIR/$zone.json
  done

  # Upload backups to s3
  $CLIPATH/aws s3 cp $BACKUP_DIR s3://$BACKUP_BUCKET --recursive --sse
}

# Create backup directory if it doesn't exist
mkdir -p $BACKUP_DIR
# Back up all the things
time backup_all_zones

Be sure to update the <user> and <bucket> in the script to match your own environment settings.  Dumping the DNS records to JSON is nice because it allows for a more programmatic way of working with the data.  This script can be run manually, but it is much more useful if run automatically.  Just add the script to a cron job and schedule it to dump DNS periodically, as in the example below.
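Here is a sketch of a crontab entry that runs the backup nightly; the script path and log location are placeholders for wherever you end up putting things.

# m h dom mon dow  command
0 2 * * * /home/<user>/route53-backup.sh >> /home/<user>/dns-backup/backup.log 2>&1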

For this script to work, the aws cli and jq need to be installed.  The installation is skipped in this post, but is trivial.  Refer to the links for instructions.

The aws cli needs to be configured to use an API key with read access to Route 53 and the ability to write to S3.  Details are skipped for this step as well – be sure to consult the AWS documentation on setting up IAM permissions for help with setting up API keys.  Another, simplified approach is to use a pre-existing key with admin credentials (not recommended).
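For reference, a least-privilege policy for this script only needs to list Route 53 zones and records and write objects to the backup bucket.  Below is a rough sketch of what that might look like; the policy name and bucket are placeholders, and you still need to attach the policy to whichever user the cli's API key belongs to.

# Sketch only: a minimal IAM policy for the backup script (names are placeholders)
cat > route53-backup-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "route53:ListHostedZones",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<bucket>/*"
    }
  ]
}
EOF

aws iam create-policy --policy-name route53-backup --policy-document file://route53-backup-policy.json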


Easily login to Rancher containers locally

Sometimes managing containers through the Rancher web console can be tedious and painful, especially if you need to copy/paste things into or out of the terminal.  I recently discovered a nice little project on Github called Rancher SSH which allows you to connect to a container running in your Rancher environment as if it were local to the machine you are working on, much like SSH, hence the name.

I am still playing around with the functionality but so far it has been very nice and is very easy to get started with.  To get started you can either install it via Homebrew or with Golang.  I chose to use the homebrew option.

brew install fangli/dev/rancherssh
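If you would rather install it with Go instead of Homebrew, something along these lines should work; the import path is my assumption based on the project’s GitHub location, so double check the README.

go get github.com/fangli/rancherssh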

After it is finished installing (it might take a minute or two), you should have access to the rancherssh command from the CLI.  You might need to reload your shell in order to pick up tab completion for the command, but you should be able to run it and get some output.

rancherssh

In order to do anything useful with this tool, you will first need to create an API key for rancherssh in Rancher.  Navigate to the environment you’d like to create the key for and then click the API tab in Rancher.  Then click the “Add Environment API Key” button to bring up the dialog for creating a new key.

add api key

After you create your key, make note of the Access key (username) and Secret key (password).  You will need these to configure rancherssh in the step below.  First, create a file called config.yml somewhere that is easy to remember, and populate it similar to the following, updating the endpoint, access key and secret key.

endpoint: https://your.rancher.server/v1
user: access_key
password: secret_key

That’s pretty much it.  As long as the endpoint matches your environment correctly, you should now be able to connect to a container in your Rancher environment.  You’ll need to make sure you run the rancherssh command from the same directory that contains your config.yml file, but otherwise it should just work.

rancherssh my-stack_container_1

Optionally you can provide all of the configuration information to the CLI and just skip the config file completely.

rancherssh --endpoint="https://your.rancher.server/v1" --user="access_key" --password="secret_key" my-test-container_1

There is one last thing to mention.  rancherssh provides a nice fuzzy matching mechanism for connecting to containers.  For example, if you can’t remember which containers are available in a stack in Rancher, you can run a pattern that matches the stack, and rancherssh will tell you which containers are running in the stack and allow you to choose which one to connect to.

rancherssh %my-stack%

If there are multiple containers this command will allow you to pick which one to connect to.

Searching for container %my-stack%
We found more than one containers in system:
[1] my-stack_container_1, Container ID 1i91308 in project 1a216, IP Address 10.42.154.115
[2] my-stack_container_2, Container ID 1i94034 in project 1a216, IP Address 10.42.119.103
[3] my-stack_container_3, Container ID 1i94036 in project 1a216, IP Address 10.42.146.57

I didn’t have any issues at all getting started with this tool, and I would definitely recommend checking it out, especially if you do a lot of work in your Rancher containers.  It is fast, easy to use and really useful for the times when the Rancher UI is too cumbersome.


My take on the NoOps movement

I recently attended DevOps Days Portland, where Kelsey Hightower gave a nice keynote about NoOps.  I had heard the term NoOps in passing before the conference but never really thought much about it or its implications.  Kelsey’s talk got me thinking more and more about the idea and what it means to the DevOps world.

For those of you who aren’t familiar, NoOps is a newer tech buzzword that has emerged to describe the concept that an IT environment can become so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage software in-house.

Obviously the term NoOps has caused some friction between the development world and the operations/DevOps world because of its perceived meaning, along with a very controversial article entitled “I Don’t Want DevOps.  I Want NoOps.” that kicked the whole movement off and sparked the original debate back in 2011.  The main argument from people who work in operations is that there will always be servers running somewhere; as a developer you can’t just magically make servers go away.  I agree with that 100%.  It is incredibly short-sighted to assume that any environment can work in a way where operations, in some form, need not exist.

Interestingly though, if you dig into the goals and underlying meaning of NoOps, they are actually fairly reasonable to me when boiled down.  Here are just a few of them, borrowed from the article and Kelsey’s talk:

  • Improve the process of deploying apps
  • Not just VM’s, release management as well
  • Developers don’t want to deal with operations
  • Developers don’t care about hardware

All of these goals seem reasonable to me as an operations person, especially not having to work with developers.  Therefore, when I look at NoOps I don’t take its ACTUAL underlying meaning to be working against operations and DevOps; I look at it as developers trying to find a better way to get their jobs done, however misguided the wording and mindset.  From an operations perspective, I also see NoOps as a shift in the mindset of how to accomplish goals and improve processes and pipelines, which is something very familiar to people who have worked in DevOps.

Because of this perspective, I see an evolution in the way that operations and DevOps work, one that takes the best ideas from NoOps and applies them in practical ways.  Ultimately, operations people want to be just as productive as developers, and NoOps seems like a good set of ideas for getting on the same page.

To incorporate ideas from NoOps as cloud and distributed technologies continue to advance, operations folks need to embrace programming and automation in areas that have traditionally been managed manually as part of the day to day, in order to abstract away complicated infrastructure and make it easier for developers to accomplish their goals.  Examples include automatically provisioning networks and VLANs, or issuing and deploying certificates at the click of a button.  As more of the infrastructure gets abstracted away, it is important for operations to be able to automate these tasks.

If anything, I think NoOps makes sense as a concept for improving the lives of both developers and operations, which is one facet that DevOps aims to help solve.  So to me, the goals of NoOps are a good thing, even though there has been a lot of stigma around the term.  Just to reiterate, I think it is absurd for anybody to say that operations jobs will be going away anytime soon; the job and its responsibilities are just evolving to fit the direction other areas of the business are moving in.  In fact, the skills of managing cloud infrastructure, automating, and building robust systems will be in higher demand.

As an operations/DevOps person just remember to stay curious and always keep working on improving your skill set.
