Easy Prometheus Monitoring in Rancher

Docker monitoring, and container monitoring in general, has historically been a difficult area. There has been a lot of movement in the last year or so to beef up container monitoring tools, but in my experience the tools have either been expensive, or difficult to configure and complicated to use. The combination of Rancher and Prometheus has finally given me hope. It is now easy to set up and configure a distributed monitoring solution without paying a high price.

Prometheus has recently added support for Rancher via the Rancher exporter, which is great news. This is by far the easiest method I have found for experimenting with Prometheus.

For those that don’t know much about Prometheus, it is an up-and-coming project created by engineers at SoundCloud and hosted on GitHub. Prometheus is focused on monitoring, with a particular emphasis on container and Docker monitoring. It uses a polling-based model for “scraping” metrics out of predefined endpoints. The Prometheus Rancher exporter enables Prometheus to scrape Rancher server specific metrics, which are very useful to have. One other point worth mentioning is that Prometheus has a very nice, flexible design built upon different client libraries, in a similar way to Graphite, so adding support and instrumenting code for different platforms is easy. Check out the list of exporters in the Prometheus docs for ideas on how to get started exporting metrics.
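
As a rough illustration of the scraping model, an exporter just serves plain-text metrics over HTTP that Prometheus pulls on an interval, so you can always poke at an endpoint by hand with curl. The port and metric name below are only illustrative; each exporter publishes on its own port:

curl http://<exporter_host>:9173/metrics

# Output is plain text, one sample per line, along the lines of:
# rancher_service_health_status{name="web"} 1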

This post won’t cover setting up Rancher server or any of the Rancher environment since that is well documented in other places. I won’t touch on alerting here either, because I honestly haven’t had much time to dig into it yet. So, with that said, the first step I will focus on in this post is getting Prometheus set up and running. Luckily it is extremely easy to accomplish this using the Rancher catalog and the Prometheus template.

prometheus stack

Once Prometheus has been bootstrapped and everything is up, test it out by navigating to the Grafana home dashboard created by the bootstrap process. Since this is a simple demo, my dashboard is located at the IP of the server on port 3000, which is the only port that should need to be publicly exposed if you are interested in sharing the Grafana dashboard.

The default Grafana credentials for this catalog template are admin/admin for the username and password, which is noted in the catalog notes found here.  The Prometheus tools ship with some nice preconfigured dashboards, so after you have things set up, it is definitely worth checking out some of them.

grafana dashboard

If you look around the dashboards you will probably notice that metrics for the Rancher server aren’t available by default.  To enable these metrics we need to configure Prometheus to connect to the Rancher API, as noted in the Rancher monitoring guide.

Navigate to http://<SERVER_IP>:8080/v1/settings/graphite.host on your Rancher server, click edit in the top right, and then update the value to point to the server address where InfluxDB was deployed.
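
If you would rather script this than click through the API UI, I believe the same setting can be updated with a PUT against the Rancher v1 API; the payload shown here is an assumption based on the URL above, so double check it against your own install:

curl -X PUT \
    -H 'Content-Type: application/json' \
    -d '{"value": "<influxdb_host>"}' \
    http://<SERVER_IP>:8080/v1/settings/graphite.host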

influxdb host

After this setting has been configured, restart the Rancher server container, wait a few minutes and then check Grafana.
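
Restarting the container is just a normal Docker restart on the host running Rancher server (substitute whatever your Rancher server container is actually named):

docker restart rancher-server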

rancher server metrics

As you can see, metrics are now flowing into the dashboard.

Now that we have the basics configured, we can drill down into individual containers to get a more granular view of what is happening in the environment. This granularity is great because it gives a very detailed picture of exactly what is going on and gives us an easy way to share visuals with other team members. Prometheus offers a web interface to interact with the query language and visualize results, which is useful for figuring out what kinds of things to visualize in Grafana.

Navigate to the server that the Prometheus server container is deployed to on port 9090.  You should see a screen similar to the following.

promdash

There is documentation about how to get started with this tool, so I recommend taking a look and playing around with it yourself. Once you find some useful metrics visualized in the graph view, grab the query used to generate the graph and add a new dashboard to Grafana.
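
The same queries you build in the web interface can also be run against the Prometheus HTTP query API, which is handy for confirming a metric exists before wiring it into Grafana. The up metric used here is just the built-in scrape health metric, so it makes a safe first test:

curl 'http://<prometheus_host>:9090/api/v1/query?query=up'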

Prometheus offers a lot of power and flexibility and is a great tool for monitoring.  I am still very new to Prometheus but so far it looks very promising and I have to say I’m really impressed with the amount of polish and detail I was able to get in just an afternoon of experimenting.  I will be updating this post as I get more exposure to Prometheus and get more metrics and monitoring set up so stay tuned.


Set up SSL for Rancher Server

One issue you will probably run across if you start to use Rancher to manage your Docker containers is that it doesn’t serve pages over an encrypted connection by default. If you are looking to put Rancher into a production scenario, it is a good idea to serve encrypted pages. HA is another topic, but at this point I have not attempted to set it up because it is currently a much more complicated process. The Rancher folks are working on making HA easier in the near future (if you know an easy way to do it I would love to hear about it). I would argue, though, that if you can set up SSL for your Rancher server you are more than halfway to a full production setup.

The process of getting Rancher to proxy through an encrypted connection is straightforward, assuming you already have some certs to use. If you don’t already have any officially issued certificates, *I think* you should be okay with self-signed certs, but you won’t get that green lock that everybody loves. If you are just testing this setup, you should definitely be fine starting out with some self-signed certs. Here is a reference for creating some certs for Nginx to test with.
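
For reference, a quick throwaway self-signed cert and key can be generated with an openssl one-liner like the following; the test.com name just matches the example config later in this post, so swap in your own domain:

openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout test.com.key -out test.com.crt \
    -subj "/CN=test.com"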

Another important thing to be aware of is that these instructions are specific to the Nginx method outlined above. I have not tried the Apache method, though I would guess it should be very easy to adapt.

Take a look at the Rancher docs as a starting point; they are very good and will get you most of the way there. However, when I went through this process there were a few pieces of information I had to piece together myself, which is the bulk of what I will be sharing today.

The first step is to adapt the configuration in the docs into a full Nginx config that can be dropped into the official Nginx image from Docker Hub. Here is the config I used.

upstream rancher {
    server rancher-server:8080;
}

server {
    listen 443 ssl;
    server_name test.com;
    ssl_certificate /etc/rancher/test.com.crt;
    ssl_certificate_key /etc/rancher/test.com.key;

    access_log /var/log/nginx/access.log;
    error_log  /var/log/nginx/error.log;

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Port $server_port;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://rancher;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        # This allows the ability for the execute shell window to remain open for up to 15 minutes. Without this parameter, the default is 1 minute and will automatically close.
        proxy_read_timeout 900s;
    }
}

server {
    listen 80;
    server_name test.com;
    return 301 https://$server_name$request_uri;
}

There are a few important things to note about this config. One is naming the upstream the same as the Rancher server container, in this case rancher-server.

Note that I have used test.com as the server name and so the certs and names are all reflective of that value.  Obviously that will need to be updated with your own values.

Finally, we have added an additional logging section to the config that will pipe the logs to stdout/stderr so we can easily look at the requests from the host OS via the “docker logs” command.

To get the following Docker run command to work correctly, you will want to create a directory called /etc/rancher (or something easy to remember) and place this config, named rancher-nginx.conf, along with the certs you have created, into that location. Alternatively, you can modify the Docker run command and simply point the volume mounts at wherever you store the configuration and certs. For me, it makes the most sense to group these items together in /etc/rancher.

docker run -d --restart=always --name nginx \
    -v /etc/rancher/rancher-nginx.conf:/etc/nginx/conf.d/default.conf \
    -v /etc/rancher/test.com.crt:/etc/rancher/test.com.crt \
    -v /etc/rancher/test.com.key:/etc/rancher/test.com.key \
    -p 80:80 -p 443:443 --link=rancher-server nginx

This will mount in the correct configuration and certificates to the Nginx docker container, expose port 80 and 443 for web traffic (make sure to adjust any firewall rules you need to get traffic to pass through these ports), and link to the rancher-server container so that the traffic can be proxied.
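
Once the container is up, a couple of quick checks from the host will confirm the redirect and the proxy are working; this assumes test.com resolves to the host (an /etc/hosts entry is fine for testing) and the -k flag is only needed with self-signed certs:

# Port 80 should answer with a 301 to https, and 443 should serve the Rancher UI
curl -I http://test.com/
curl -kI https://test.com/

# Watch the proxied requests from the host
docker logs -f nginx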

Additionally, you will need to update any reference to the old address, http://<rancher-name>:8080/, to point to https://<rancher-name>/ instead. The main one is the host registration URL in the Rancher server settings, but if you were relying on any other outside tools to hit that endpoint they will also need to be updated to use https.


ECS cluster turnup with CoreOS and Terraform

Recently I have been evaluating different container clustering tools and technologies. It has been a fun experience thus far; the tools and community being built around Docker have come a long way since I last looked. So for today’s post I’d like to go over ECS a little bit.

ECS is essentially the AWS version of container management. ECS takes care of managing your Docker (container) infrastructure by handling creation, management, destruction and scheduling, as well as providing API integration with other AWS services, which is really powerful. To get ECS up and running, all you need to do is create an ECS cluster, either from the AWS console or from some other AWS integration like the CLI or Terraform, then install the agent on the servers that you would like ECS to schedule work on. After setting up the agent and cluster name you are basically ready to go: start by creating a task and then create a service to start running containers on the cluster. Some cool new features were announced at this year’s re:Invent conference, but I haven’t had a chance to look at them yet.

First impression of ECS

The best part about testing ECS by far has been how easy it is to get set up and running. It took less than 20 minutes to go from nothing to a fully functioning cluster that was scheduling containers to hosts and receiving load. I think the most powerful aspect of ECS is its integration with other AWS services. For example, if you need to attach containers/services to a load balancer, the AWS infrastructure is already there, so the different pieces of the infrastructure mesh really well together.

The biggest downside so far is that the ECS console interface is still clunky. It is functional, and I have been able to use it to do everything I have needed, but it feels like it needs some polish; things are nested in menus and usually not easy to find. I’m sure there are plans to improve the interface, and as mentioned above some new features were recently announced, so I have a feeling there are some nice improvements on the way.

I haven’t tried the CLI tool yet but it looks promising for automating containers and services.

Setting things up

Since I am a big fan of CoreOS I decided to try turning up my ECS cluster using CoreOS as the base OS and Terraform to do the heavy lifting and provisioning.

The first step is to create your cluster. I noticed the AWS console has a configuration wizard that guides you through your first cluster, which was annoying because there wasn’t a clean way to just create the cluster, so you will need to follow the on-screen instructions for getting your first environment set up. If any of this is unclear there is a good guide for getting started with ECS here.
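
If you would rather skip the wizard, the cluster itself can also be created with the plain AWS CLI, assuming your CLI credentials are already configured:

aws ecs create-cluster --cluster-name my-cluster
aws ecs list-clusters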

After your cluster has been created there is a menu that shows your ECS environments.

ECS cluster menu

Next, you will need to turn up the nodes that will be connecting to this cluster. The first part of this is to get your cloud-config set up to connect to the cluster. I used the CoreOS docs to set up the ECS agent, making sure to change the ECS_CLUSTER= value in the config.

#cloud-config

coreos:
  units:
    - name: amazon-ecs-agent.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Amazon ECS Agent
        After=docker.service
        Requires=docker.service
        Requires=network-online.target
        After=network-online.target

        [Service]
        Environment=ECS_CLUSTER=my-cluster
        Environment=ECS_LOGLEVEL=warn
        Environment=ECS_CHECKPOINT=true
        ExecStartPre=-/usr/bin/docker kill ecs-agent
        ExecStartPre=-/usr/bin/docker rm ecs-agent
        ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent
        ExecStart=/usr/bin/docker run --name ecs-agent --env=ECS_CLUSTER=${ECS_CLUSTER} --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} --env=ECS_CHECKPOINT=${ECS_CHECKPOINT} --publish=127.0.0.1:51678:51678 --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/lib/aws/ecs:/data amazon/amazon-ecs-agent
        ExecStop=/usr/bin/docker stop ecs-agent

Note the Environment=ECS_CLUSTER=my-cluster line; this is the most important bit for getting the server to check in to your cluster, assuming you named the cluster “my-cluster”. Feel free to add any other values your infrastructure may need. Once you have the config how you want it, run it through the CoreOS cloud-config validator to make sure it checks out. If everything looks okay there, your cloud-config should be ready to go.

You can find more info about how to configure the ECS agent in the docs here.
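
Once a node boots with this cloud-config, one way to verify the agent actually checked in is to hit its local introspection endpoint from the host; the port is the one published in the unit above, and a healthy agent returns JSON that includes the cluster name and a container instance ARN:

curl -s http://127.0.0.1:51678/v1/metadata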

Once you have your cloud-config in order, you will need to get your Terraform “recipe” set up. I used this awesome GitHub project as the base for my own project. The Terraform logic from there basically creates an AWS launch configuration and autoscaling group (using the cloud-config from above) to launch instances into your cluster. The ECS agent takes care of the rest once your servers are up and reporting in to the cluster.

launch_config.tf

resource "aws_launch_configuration" "ecs" {
  name = "ECS ${var.cluster_name}"
  image_id = "${var.ami}"
  instance_type = "${var.instance_type}"
  iam_instance_profile = "${var.iam_instance_profile}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_group_ids)}"]
  user_data = "${file("../cloud-config/ecs.yml")}"

  root_block_device = {
    volume_type = "gp2"
    volume_size = "40"
  }
}

Notice the user_data section.  This is where we inject the cloud config from above to provision CoreOS and launch the ECS agent.

autoscaler.tf

resource "aws_autoscaling_group" "ecs-cluster" {
  availability_zones = ["${split(",", var.availability_zones)}"]
  vpc_zone_identifier = ["${split(",", var.subnet_ids)}"]
  name = "ECS ${var.cluster_name}"
  min_size = "${var.min_size}"
  max_size = "${var.max_size}"
  desired_capacity = "${var.desired_capacity}"
  health_check_type = "EC2"
  launch_configuration = "${aws_launch_configuration.ecs.name}"
  health_check_grace_period = "${var.health_check_grace_period}"

  tag {
    key = "Env"
    value = "${var.environment_name}"
    propagate_at_launch = true
  }

  tag {
    key = "Name"
    value = "ECS ${var.cluster_name}"
    propagate_at_launch = true
  }
}

There are a few caveats I’d like to highlight with this approach. First, I already had an AWS infrastructure in place that I was testing against, so I didn’t have to do any of the extra work to create a VPC or a gateway for the VPC. I didn’t have to create the security groups and subnets either; I just added them to the Terraform code.

The other caveat is that if you want to use the GitHub project I linked to, you will need to make sure you populate the variables with your own environment-specific values. That is why having the VPC, subnets and security groups already in place was handy for me. Be sure to browse through the variables.tf file and substitute in your own values. As an example, I had to update the variables to use the CoreOS 766.4.0 image. The AMI id is specific to your AWS region, so make sure to look up the AMI first (see the lookup example after the snippet below).

variable "ami" {
  /* CoreOS 766.4.0 */
  default = "ami-dbe71d9f"
  description = "AMI id to launch, must be in the region specified by the region variable"
}
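
As a sketch of how to look the AMI up with the AWS CLI, the owner ID below is the account CoreOS published its AMIs from at the time; treat it as an assumption and verify it against the CoreOS AMI listing for your region:

aws ec2 describe-images \
    --owners 595879546273 \
    --filters "Name=name,Values=CoreOS-stable-766.4.0-hvm" \
    --region us-west-2 \
    --query 'Images[].[Name,ImageId]' \
    --output text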

Another part I had to modify to get the GitHub project to work was adding in my AWS credentials, which look similar to the following. Make sure to update these variables with your own ID and secret.

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region = "${var.region}"
}

variable "access_key" {
  description = "AWS access key"
  default = "XXX"
}

variable "secret_key" {
  description = "AWS secret access key"
  default = "xxx"
}

Make sure to also copy/edit the autoscaler.tf and launch_config.tf files to reflect anything that is specific to your environment (Terraform will complain if there are issues).

After you have combed through the variables.tf and updated the Terraform files to your liking you can simply run terraform plan -input=false and see how Terraform will create the ASG for you.

If everything looks good, you can run terraform apply -input=false and Terraform will go out and start building your new ECS infrastructure for you. After a few minutes, check the EC2 console and your launch config and autoscaling group should be there. If that all looks okay, check the ECS console and your new servers should show up and be ready to go to work for you!
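
You can also confirm from the command line that the new instances registered with the cluster (again assuming the cluster is named my-cluster):

aws ecs list-container-instances --cluster my-cluster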

NOTE: If you are starting from scratch, it is possible to do all of the infrastructure provisioning via Terraform but it is too far out of the scope of this post to cover because there are a lot of steps to it.


CoreOS etcd2 encryption

Etcd 2.1.1 Encryption and Authentication

New to etcd 2.1.0 is the ability to use authentication to secure your etcd resources. Encryption and authentication are relatively new additions, so I thought I would write a quick blog post to help me remember how to get these components up and running, as well as to help others, because some of the ideas were a little confusing to me at first.

I pieced together most of the information for this post from a few different sources.

The first was a pair of great tutorials (1, 2) for getting etcd encryption up and going. The second resource was the etcd-ca project by CoreOS for creating a CA and issuing certs; there are other ways of doing it, but this was a straightforward method. The third resource I recommend looking at is the Security page in the CoreOS docs, which shows examples of how to piece all of the commands and certs together. The last resource readers might find useful is the etcd2 documentation of the different flags and configuration options, which was helpful for finding all the options I needed to enable to get etcd2 working properly.

Requirements

To use the authentication feature you will need etcd 2.1.0 or greater, which means you will need a version of CoreOS that ships the correct binary (v752.1.0 or above), or the correct binary version/Docker image.

Authentication is still an “experimental” feature, so it may change at any time; therefore I have decided not to get into any of the details of how it works. If you are interested you can check out the docs on users and auth.

Running the CA server

At first I was concerned about running a CA server because I’ve had painful experiences with CAs in the past, but the etcd-ca tool makes this process easy and straightforward. There are a few other CA resources in the etcd2 encryption docs but I won’t cover them here.

The easiest way to use the etcd-ca tool is to run it in a Docker container and write the certs out to the host via a shared volume. The following steps will pull the repo and build the binary for running the tool.

docker pull golang
docker run -i -t -v $(pwd):/go golang /bin/bash
git clone https://github.com/coreos/etcd-ca
cd etcd-ca
./build
cd ./bin

Create the certs

After the etcd-ca binary has been built we can start creating certs.  The first thing necessary is to create the CA certs which will be used to sign all other certs.

./etcd-ca init

After creating the CA signing cert, we will create a certificate for the etcd server that clients will be authenticating to.

./etcd-ca new-cert -ip <etcd_server_ip> <hostname>
./etcd-ca sign <hostname>
./etcd-ca chain <hostname>
./etcd-ca export --insecure <hostname> | tar xvf -

Replace <etcd_server_ip> with the public address of the etcd server and <hostname> with the hostname of the etcd server.  In this example, something like core01 would be a good name.
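
If you want to double check what was generated, the exported certificate can be inspected with openssl to make sure the IP made it into the SAN field; the core01.crt filename is just a guess based on the example hostname, so adjust it to whatever the tar export actually produced:

openssl x509 -in core01.crt -noout -text | grep -A 1 'Subject Alternative Name'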

Optional – Client cert

This is not necessary in all scenarios for setting up encryption for etcd but if you are interested in having clients authenticate with their own cert it isn’t that much effort to add.

./etcd-ca new-cert -ip <etcd_server_ip> client
./etcd-ca sign client
./etcd-ca export --insecure client | tar xvf -

Note: You may need to rename the server/client key files generated by the export to the correct filenames. Also, if you screw up any of the certs or need to recreate them for any reason, you can simply delete them from the hidden .etcd-ca/ folder that contains all of the certificates.

Etcd cloud-config

The following cloud-config will configure etcd2 to use the certs we configured above.

There is currently an issue parsing a few of the etcd2 command line flags, so the workaround (for now) is to split the configuration up into a base config and then add the environment variables as a drop-in unit.

write_files:
  - path: /etc/systemd/system/etcd2.service.d/30-configuration.conf
    permissions: '0644'
    content: |
      [Service]
      # General settings
      Environment=ETCD_NAME=etcd-config
      Environment=ETCD_VERBOSE=1
      # Encryption
      Environment=ETCD_CLIENT_CERT_AUTH=1
      Environment=ETCD_TRUSTED_CA_FILE=/home/core/ca.crt
      Environment=ETCD_PEER_KEY_FILE=/home/core/server.key
      Environment=ETCD_PEER_CERT_FILE=/home/core/server.crt
      Environment=ETCD_CERT_FILE=/home/core/server.crt
      Environment=ETCD_KEY_FILE=/home/core/server.key

  - path: /home/core/ca.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      ca cert content
      -----END CERTIFICATE-----

  - path: /home/core/server.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      server cert content
      -----END CERTIFICATE-----

  - path: /home/core/server.key
    permissions: '0644'
    content: |
      -----BEGIN RSA PRIVATE KEY-----
      server key content
      -----END RSA PRIVATE KEY-----

  - path: /home/core/client.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      client cert content
      -----END CERTIFICATE-----

  - path: /home/core/client.key
    permissions: '0644'
    content: |
      -----BEGIN RSA PRIVATE KEY-----
      client key content
      -----END RSA PRIVATE KEY-----

coreos:
  etcd2:
    name: etcd
    discovery: https://discovery.etcd.io/a1c999ec1a23039996419e0a20cb1e35
    advertise-client-urls: https://$public_ipv4:2379
    initial-advertise-peer-urls: https://$private_ipv4:2380
    listen-client-urls: https://0.0.0.0:2379
    listen-peer-urls: https://$private_ipv4:2380
  units:
    - name: etcd2.service
      command: start

If you don’t want to bootstrap a node with cloud-config and instead are just interested in testing out encryption on an existing host, you can use the following commands. You will still need to make sure you follow the steps above to generate all of the necessary certs!

Manually start etcd2 with server certificate:

etcd2 -name infra0 -data-dir infra0 \
  -cert-file=/home/core/server.crt -key-file=/home/core/server.key \
  -advertise-client-urls=https://<server_ip>:2379 -listen-client-urls=https://<server_ip>:2379

and to test the connection use the following curl command.

curl --cacert /home/core/ca.crt https://172.17.8.101:2379/v2/keys/foo -XPUT -d value=bar -v

Manually start etcd2 with client certificate:

etcd2 -name infra0 -data-dir infra0 \
  -client-cert-auth -trusted-ca-file=/home/core/ca.crt -cert-file=/home/core/server.crt -key-file=/home/core/server.key \
  -advertise-client-urls https://<server_ip>:2379 -listen-client-urls https://<server_ip>:2379

Similar to the above command you will just need to add the client certs to authenticate.

curl --cacert /home/core/ca.crt --cert /home/core/client.crt --key /home/core/client.key \
  -L https://<server_ip>:2379/v2/keys/foo -XPUT -d value=bar -v

Another way to test the certs is with the etcdctl tool, by adding a few flags.

etcdctl --ca-file ca.crt --cert-file client.crt --key-file client.key --peers https://<server_ip>:2379 set /foo bar

etcdctl --ca-file ca.crt --cert-file client.crt --key-file client.key --peers https://172.17.8.101:2379 get /foo

Encrypting etcd was a confusing process for me at first due to the complexity of encryption, but after working through the above examples most of the process made sense. I tend to have a hard time wrapping my head around all of the different parts, so hopefully I have effectively shown how the encryption component works.

The etcd-ca tool is very nice for testing because it is simple and straightforward, but it lacks a few features of a full-fledged CA. I suggest looking at something like OpenSSL for a production scenario, especially if things like certificate revocation are important.


Change CoreOS default toolbox

This is a little trick that allows you to override the default base OS in the CoreOS “toolbox”. The toolbox is a neat feature that lets you debug and troubleshoot issues inside containers on CoreOS without having to do any of the work of setting up a container yourself.

The toolbox OS defaults to Fedora, which we’re going to change to Ubuntu. The toolbox reads a custom configuration from the .toolboxrc file, located at /home/core/.toolboxrc by default. To keep things simple, we will only be changing the few pieces of the config needed to get the toolbox to behave how we want; more can be changed, but we don’t really need to override anything else.

TOOLBOX_DOCKER_IMAGE=ubuntu
TOOLBOX_DOCKER_TAG=14.04

That’s pretty cool, but what if we want to have this config file be in place for all servers?  We don’t want to have to manually write this config file for every server we log in to.

To fix this issue, we will add a simple entry to the user-data file that gets fed into the CoreOS cloud-config when the server is created. You can find more information about the CoreOS cloud-configs here.

The bit in the cloud config that needs to change is the following.

write_files:
  - path: /home/core/.toolboxrc
    owner: core
    content: |
      TOOLBOX_DOCKER_IMAGE=ubuntu
      TOOLBOX_DOCKER_TAG=14.04

If you are already using cloud-config then this change should be easy: just add the “- path: /home/core/.toolboxrc” entry to your existing write_files section. New servers using this config will have the desired toolbox defaults.

This approach gives us an automated, reproducible way to clone our custom toolbox config to every server that uses cloud-config to bootstrap itself.  Once the config is in place simply run the “toolbox” command and it should use the custom values to pull the desired Ubuntu image.
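
As a quick sanity check of the new default, launch the toolbox and install something with apt; the packages here are just examples of debugging tools you might want:

toolbox

# inside the Ubuntu toolbox:
apt-get update && apt-get install -y tcpdump htop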

Then you can run your Ubuntu commands and debugging tools from within the toolbox.  Everything else will be the same, we just use Ubuntu now as our default toolbox OS.  Here is the post that gave me the idea to do this originally.
