Easy Prometheus Monitoring in Rancher

March 24, 2016May 16, 2016 Josh Reichardt 1 Comment

Docker monitoring and container monitoring in general is an area that has historically been difficult. There has been a lot of movement and progress in the last year or so to beef up container monitoring tools but in my experience the tools have either been expensive or difficult to configure and complicated to use. The combination or Rancher and Prometheus has finally given me hope. Now it is easy easy to setup and configure a distributed monitoring solution without paying a high price.

Prometheus has recently added support for Rancher via the Rancher exporter, which is great news. This is by far the easiest method I have discovered thus far for experimenting with Prometheus.

For those that don’t know much about Prometheus, it is an up and coming project created by engineers at Soundcloud which is hosted on Github. Prometheus is focused on monitoring, specifically focusing on container and Docker monitoring. Prometheus uses a polling based model for “scraping” metrics out of predefined endpoints. The Prometheus Rancher exporter enables Prometheus to scrape Rancher server specific metrics, which are very useful to have. To build on that, one other point worth mentioning here is that Prometheus has a very nice, flexible design built upon different client libraries in a similar way to Graphite, so adding support and instrumenting code for different types of platforms is easy to implement. Check out the list of exporters in the Prometheus docs for idea on how to get started exporting metrics.

This post won’t cover setting up Rancher server or any of the Rancher environment since it is well documented in other places. I won’t touch on alerting here either because I honestly haven’t had much time to dig into it much yet. So with that said, the first step I will focus on in this post is getting Prometheus set up and running. Luckily it is extremely easy to accomplish this using the Rancher catalog and the Prometheus template.

Once Prometheus has been bootstrapped and everything is up, test it out by navigating to the Grafana home dashboard created by the bootstrap process. Since this is a simple demo, my dashboard is located at the IP of the server using port 3000 which is the only port that should need to be publicly exposed if you are interested in sharing the Grafana dashboard.

The default Grafana credentials for this catalog template are admin/admin for the username and password, which is noted in the catalog notes found here. The Prometheus tools ship with some nice preconfigured dashboards, so after you have things set up, it is definitely worth checking out some of them.

If you look around the dashboards you will probably notice that metrics for the Rancher server aren’t available by default. To enable these metrics we need to configure Prometheus to connect to the Rancher API, as noted in the Rancher monitoring guide.

Navigate to http://<SERVER_IP>:8080/v1/settings/graphite.host on your Rancher server, then in the top right click edit, and then update the value there to point to the server address where InfluxDB was deployed to.

After this setting has been configured, restart the Rancher server container, wait a few minutes and then check Grafana.

As you can see, metrics are now flowing in the the dashboard.

Now that we have the basics configured, we can drill down in to individual containers to get a more granular view of what is happening in the environment. This type of granularity is great because it gives a very detailed view of what exactly is going on inside our environment and gives us an easy way to share visuals with other team members. Prometheus offers a web interface to interact with the query language and visual results, which is useful to help figure out what kinds of things to visualize in Grafana.

Navigate to the server that the Prometheus server container is deployed to on port 9090. You should see a screen similar to the following.

There is documentation about how to get started with using this tool, so I recommend taking a look and playing around with it yourself. Once you find some useful metrics, visualized in the graph view, grab the query used to generate the graph and add a new dashboard to Grafana.

Prometheus offers a lot of power and flexibility and is a great tool for monitoring. I am still very new to Prometheus but so far it looks very promising and I have to say I’m really impressed with the amount of polish and detail I was able to get in just an afternoon of experimenting. I will be updating this post as I get more exposure to Prometheus and get more metrics and monitoring set up so stay tuned.

xhyve vs vbox driver benchmarks for docker-machine

February 16, 2016February 15, 2018 Josh Reichardt 2 Comments

Getting a usable and productive dev environment working with Docker on OS X is not exactly trivial, although it is getting much better. If you have spent any time working with docker-machine and Docker on OS X you’ve probably run across some type of roadblock to getting your dev environment working.

If you have used docker-machine you are probably familiar with the Virtualbox driver, the driver that ships as by default. Obviously it works out of the box but if you have used Virtualbox for any amount of time you have probably discovered some of its quirks. My biggest gripe thus far with Vbox is that their shared folders technology to sync files between the host and VM is slooooow. In fact, I have written about my own workaround here.

I have run in to some other performance issues using VBox. This write up is a very detailed comparison of the performance between VBox and VMWare. The tl;dr of the post is that that the VMWare hypervisor has better performance. To Oracle’s credit though, many of these performance issues have actually been addressed in the vbox 5.0.0 release. So if you aren’t running on 5.x definitely make the jump. The Docker Toolbox ships the newer release so there is no reason not to upgrade.

Making the jump to VBox 5.x may, and most likely should solve your problems but I have been curious about what other options are out there. Recently, as of July 2015, the xhyve hypervisor project has been available on OS X. xhyve is a port of the byhve project, which aims to bring high performance virtualization with a light footprint to OS X. It is still very young but shows a lot of potential.

Even younger than the xhyve project itself is the xhyve driver for docker-machine. It is so young that it is still not an officially supported driver yet, though it looks like it is well on its way. Definitely keep an eye on the xhyve and docker-machine xhyve projects if you are looking for an alternative to either VBox or VMWare. The xhyve docker-machine driver project has recently closed a ticket to be added to brew so it is much less complicated to get working.

Xhyve installation

I will be going over the bare minimum installation instructions to getting everything working. If you are interested in more of the details on how to get the xhyve driver, I suggest taking a look at this awesome blog post. The post goes in to depth on how to install and use the docker-machine xhyve driver if you are interested in a more in depth look at how to get things working.

Make sure you have brew installed first. You will also need to have brew cask installed. After you have brew installed you should be able to get it from the command line with the following command.

brew tap caskroom/cask

Once you have cask installed you should be able to install the remaining components.

brew update
brew install xhyve
brew cask install dockertoolbox

This might take a little bit depending on how fast your internet connection is. After you have the toolbox installed, go grab the docker-machine xhyve driver.

brew install docker-machine-driver-xhyve

If you have the dockertoolbox installed already you might some errors in the output. This just means there was a version conflict somewhere. As of docker-machine version 0.5.6_1, support has been added for the xhyve driver.

There is currently a caveat to using this driver where you need to change some permissions. This should hopefully be fixed in the future but is at least something to be aware of.

sudo chown root:wheel $(brew --prefix)/opt/docker-machine-driver-xhyve/bin/docker-machine-driver-xhyve
sudo chmod u+s $(brew --prefix)/opt/docker-machine-driver-xhyve/bin/docker-machine-driver-xhyve

You will also need to clean out your /etc/exports file if you have made changes.

sudo mv /etc/exports{,.backup} && touch /etc/exports

Then create the machine.

docker-machine create --driver xhyve --xhyve-experimental-nfs-share test

If you can interact with the Docker daemon you should be in business.

Benchmark results

The remainder of the post describes the benchmark and performance results of the VBox driver and the xhyve driver. If you are only interested in getting the xhyve driver working then feel free to skim through the benchmarks, but be sure to take a look at the conclusion for the final verdict.

Below are the specs of the OS X machine that was used to run my benchmarks.

OS X 10.10.5
Virtualbox 5.0.12
xhyve 0.2.0
docker-machine-xyhve 0.2.2

Many of the ideas I used for the benchmark tests were taken from the post linked above. It shows in great detail the methodology that was used to benchmark each of the drivers, which is useful because it gives some really good insight into the tools that were used and how the tests were performed.

Benchmarking on the boot2docker VM is tricky because it is mostly a read only file system and there is no package manger. Therefore I relied on running the benchmarks inside containers, using a few different methodologies for my testing. The first was borrowed from the simple-container-benchmarks project on Dockerhub. This benchmark test gives a good idea of the overall write performance and CPU performance of a container running inside the VM. For network performance I used the iperf3 image located on Dockerhub.

Below are the results of a few random runs for both the VBox driver as well as the xhyve driver. I have left out the specific commands here as they are included in the links to each benchmark. Use the links to each project for specific instructions on how to run the benchmarks yourself if you are interested. The results were interesting because I was expecting the xhyve driver to outperform the VBox driver.

Virtualbox results

container benchmark results (FS write and CPU)

Client mode...
Target: 172.17.0.2
------------------------------
Performance benchmarks
------------------------------
dockerhost: tcp://192.168.99.100:2376
host: 172.17.0.2 a8b790317264
eth0: 172.17.0.2
date: Sat Jan 23 02:24:20 UTC 2016

------------------------------
FS write performance
------------------------------
1073741824 bytes (1.1 GB) copied, 2.39743 s, 448 MB/s
1073741824 bytes (1.1 GB) copied, 2.35377 s, 456 MB/s
1073741824 bytes (1.1 GB) copied, 1.9075 s, 563 MB/s
1073741824 bytes (1.1 GB) copied, 2.37838 s, 451 MB/s
1073741824 bytes (1.1 GB) copied, 2.03373 s, 528 MB/s
1073741824 bytes (1.1 GB) copied, 1.94024 s, 553 MB/s
1073741824 bytes (1.1 GB) copied, 1.99546 s, 538 MB/s
1073741824 bytes (1.1 GB) copied, 2.00287 s, 536 MB/s
1073741824 bytes (1.1 GB) copied, 1.5292 s, 702 MB/s
1073741824 bytes (1.1 GB) copied, 1.92617 s, 557 MB/s

------------------------------
CPU performance
------------------------------
268435456 bytes (268 MB) copied, 22.6775 s, 11.8 MB/s
268435456 bytes (268 MB) copied, 22.1466 s, 12.1 MB/s
268435456 bytes (268 MB) copied, 30.7552 s, 8.7 MB/s
268435456 bytes (268 MB) copied, 22.2861 s, 12.0 MB/s
268435456 bytes (268 MB) copied, 22.5571 s, 11.9 MB/s
268435456 bytes (268 MB) copied, 21.9901 s, 12.2 MB/s
268435456 bytes (268 MB) copied, 21.8232 s, 12.3 MB/s
268435456 bytes (268 MB) copied, 31.3903 s, 8.6 MB/s
268435456 bytes (268 MB) copied, 28.1219 s, 9.5 MB/s
268435456 bytes (268 MB) copied, 31.0172 s, 8.7 MB/s

------------------------------
System info
------------------------------
             total       used       free     shared    buffers     cached
Mem:       1019960     313288     706672     113104       7808     132532
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               2294.770
BogoMIPS:              4589.54
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

iperf results

Connecting to host 172.17.0.3, port 5201
[  4] local 172.17.0.4 port 39476 connected to 172.17.0.3 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.09 GBytes  17.9 Gbits/sec  2321   1.03 MBytes
[  4]   1.00-2.00   sec  2.46 GBytes  21.1 Gbits/sec  496    980 KBytes
[  4]   2.00-3.00   sec  2.24 GBytes  19.3 Gbits/sec  339   1.77 MBytes
[  4]   3.00-4.00   sec  2.54 GBytes  21.8 Gbits/sec  1355    389 KBytes
[  4]   4.00-5.00   sec  2.10 GBytes  18.0 Gbits/sec  106    495 KBytes
[  4]   5.00-6.00   sec  3.00 GBytes  25.7 Gbits/sec  217    411 KBytes
[  4]   6.00-7.00   sec  2.60 GBytes  22.4 Gbits/sec  440   1.72 MBytes
[  4]   7.00-8.00   sec  2.06 GBytes  17.7 Gbits/sec    0   1.72 MBytes
[  4]   8.00-9.00   sec  2.07 GBytes  17.8 Gbits/sec    0   1.72 MBytes
[  4]   9.00-10.00  sec  2.51 GBytes  21.6 Gbits/sec  876    713 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec  6150             sender
[  4]   0.00-10.00  sec  23.7 GBytes  20.3 Gbits/sec                  receiver

iperf Done.

xhyve results

container benchmark results (FS write and CPU)

Client mode...
Target: 172.17.0.2
------------------------------
Performance benchmarks
------------------------------
dockerhost:
host: 172.17.0.2 2c8d9ba61eae
eth0: 172.17.0.2
date: Sat Jan 23 02:08:15 UTC 2016

------------------------------
FS write performance
------------------------------
1073741824 bytes (1.1 GB) copied, 8.24671 s, 130 MB/s
1073741824 bytes (1.1 GB) copied, 5.89179 s, 182 MB/s
1073741824 bytes (1.1 GB) copied, 6.05392 s, 177 MB/s
1073741824 bytes (1.1 GB) copied, 5.37728 s, 200 MB/s
1073741824 bytes (1.1 GB) copied, 4.824 s, 223 MB/s
1073741824 bytes (1.1 GB) copied, 5.90409 s, 182 MB/s
1073741824 bytes (1.1 GB) copied, 5.22375 s, 206 MB/s
1073741824 bytes (1.1 GB) copied, 5.07298 s, 212 MB/s
1073741824 bytes (1.1 GB) copied, 5.89058 s, 182 MB/s
1073741824 bytes (1.1 GB) copied, 4.80828 s, 223 MB/s

------------------------------
CPU performance
------------------------------
268435456 bytes (268 MB) copied, 25.478 s, 10.5 MB/s
268435456 bytes (268 MB) copied, 31.3984 s, 8.5 MB/s
268435456 bytes (268 MB) copied, 24.698 s, 10.9 MB/s
268435456 bytes (268 MB) copied, 31.1973 s, 8.6 MB/s
268435456 bytes (268 MB) copied, 23.3705 s, 11.5 MB/s
268435456 bytes (268 MB) copied, 23.3973 s, 11.5 MB/s
268435456 bytes (268 MB) copied, 23.7405 s, 11.3 MB/s
268435456 bytes (268 MB) copied, 23.6118 s, 11.4 MB/s
268435456 bytes (268 MB) copied, 23.5606 s, 11.4 MB/s
268435456 bytes (268 MB) copied, 24.3341 s, 11.0 MB/s

------------------------------
System info
------------------------------
             total       used       free     shared    buffers     cached
Mem:       1020028     291632     728396      70356       6420      89824
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               2294.450
BogoMIPS:              4607.99
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K

iperf results

Connecting to host 172.17.0.2, port 5201
[  4] local 172.17.0.3 port 49244 connected to 172.17.0.2 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.29 GBytes  19.7 Gbits/sec    0   1.90 MBytes
[  4]   1.00-2.00   sec  2.84 GBytes  24.4 Gbits/sec  567    953 KBytes
[  4]   2.00-3.00   sec  2.16 GBytes  18.6 Gbits/sec  327    667 KBytes
[  4]   3.00-4.00   sec  2.32 GBytes  19.9 Gbits/sec  166   1.52 MBytes
[  4]   4.00-5.00   sec  2.63 GBytes  22.6 Gbits/sec  565    769 KBytes
[  4]   5.00-6.00   sec  2.71 GBytes  23.3 Gbits/sec  608    583 KBytes
[  4]   6.00-7.00   sec  2.67 GBytes  22.9 Gbits/sec  217   1.40 MBytes
[  4]   7.00-8.00   sec  2.98 GBytes  25.6 Gbits/sec  782    498 KBytes
[  4]   8.00-9.00   sec  2.80 GBytes  24.0 Gbits/sec  359   1.01 MBytes
[  4]   9.00-10.00  sec  2.43 GBytes  20.9 Gbits/sec  883    467 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  25.8 GBytes  22.2 Gbits/sec  4474             sender
[  4]   0.00-10.00  sec  25.8 GBytes  22.2 Gbits/sec                  receiver

iperf Done.

Conclusion

I was on on the fence about VBox performance but the proof is in the pudding here with the test results. The VBox driver had significantly better FS write performance (almost 2x). CPU performance was about equal overall, and network throughput was also very similar. I suspect CPU performance would be in favor of xhyve if these tests were run using VBox 4.x. Regardless, equal CPU performance, similar network throughput and significantly better FS writes tip the scale in favor of the VBox driver.

As frustrating as it can be at times to use Vbox, many of its past performance issues have been fixed as of the v5.0 release. The shared folder issue still exists but is largely taken care of by the great, easy to use tools that the Docker community has written, docker-machine-nfs is my favorite.

Surprisingly, or maybe not THAT surprisingly, xhyve actually performs worse that Virtualbox at this point. xyhve itself is still a super young project and the docker-machine xhyve driver is still super young so there is definitely some room for growth. That said, it was very straightforward to get xhyve and the docker-drive installed and configured, so I believe it is just a matter of time before the xhyve driver matures to a point where it can replace other drivers. One down side of the xhyve driver is that it also suffers from the host to VM shared folder issue and the current best work around is to use the –nfs-share flag that the xhyve docker-machine driver offers.

I will definitely have my eye on the xhyve project moving forward because it looks to be a great alternative to other virtualization technologies for OS X once it reaches a point of maturity. For now, VBox works more than sufficiently, has been around for a long time, is pretty much ubiquitous across platforms and the developers have shown that they are still actively working on improving the project with the recent 5.0 release.

Fixing docker-machine shared folder performance issues

January 17, 2016January 17, 2016 Josh Reichardt 8 Comments

It is a known issue that vboxsf (Virtualbox Shared Folders) has performance problems. This ugly fact becomes a problem if you use docker-machine with the default Virtualbox driver to mount volumes, both on Windows and OS X, especially when mounting directories with a large amount (~17k and above files). Linux does not suffer from this performance problem since it is able to run Docker natively and does not require you to run docker-machine.

There are various issues floating around Github referencing this problem, most of which remain unresolved.

Unfortunately there is currently not a proper fix for the vboxsf performance issues on OS X and Windows. In fact, I reached out to the Virtualbox developers around a year ago asking what the deal was and was basically told that fixing shared folder performance was not a high priority issue for their dev team.

After hearing the unsettling news, I set out to find a good way to deal with shared volumes. I stumbled across a few different approaches to solving the problem but most of them ended up being glitchy (at the time) or overly complicated. There is a nice write up that mentions many of the tools that I tried myself.

Having tried most of the methods out there, easily the best workaround I have found is to use NFS file shares to mount the “Users” directory using a tool called docker-machine-nfs. It is easy to install and run and most importantly it just works out of the box, which is exactly what most folks are looking for.

Sadly this method does not work on Windows. And as far as I can tell there is not a good workaround to this problem if you are running docker-machine on Windows. It does sounds like some folks maybe have had some success using samba but I have not attempted to get fast volumes working on Windows so can’t say for sure what the best approach is.

To install docker-machine-nfs

curl -s https://raw.githubusercontent.com/adlogix/docker-machine-nfs/master/docker-machine-nfs.sh |
  sudo tee /usr/local/bin/docker-machine-nfs > /dev/null && \
  sudo chmod +x /usr/local/bin/docker-machine-nfs

To run it

Make sure you already have created a docker-machine VM and verify that it is running. Then run the following command.

docker-machine-nfs <machine-name>

And that’s pretty much it… It requires admin access to do the NFS mounting so you might need to punch in your password, other than that you can pretty much follow along with what the output is doing.

There are a few caveats to be aware of.

I have noticed that on newer versions of docker-machine, if you run the script too quickly after creating the VM, docker-machine ends up having issues communicating with the Docker daemon. The work around (for now) is just to wait ~30 seconds the docker-machine VM to boot fully before running the command to mount nfs.

There is also currently an issue on the docker-machine side on version 0.5.5 and above that breaks docker-machine-nfs on the first run, which is described here. The workaround for that issue is to modify the script and place a “sleep 20” in the script, as per the comments in the issue. The author appears to have brought the issue up with docker-machine developers so should fixed properly in the near future.

Dockerizing Sentry

December 24, 2015December 25, 2015 Josh Reichardt Leave a comment

I have created a Github project that has basic instructions for getting started. You can take a look over there for ideas of how all of this works and to get ideas for your own set up.

I used the following links as reference for my approach to Dockerizing Sentry.

https://registry.hub.docker.com/u/slafs/sentry
https://github.com/rchampourlier/docker-sentry

If you have configurations to use, it is probably a good idea to start from there. You can check my Github repo for what a basic configuration looks like. If you are starting from scratch or are using version 7.1.x or above you can use the “sentry init” command to generate a skeleton configuration to work from.

For this setup to work you will need the following prebuilt Docker images/containers. I suggest using something simple like docker-compose to stitch the containers together.

redis – https://registry.hub.docker.com/_/redis/
postgres – https://registry.hub.docker.com/_/postgres/
memcached – https://hub.docker.com/_/memcached/
nginx – https://hub.docker.com/_/nginx/

NOTE: If you are running this on OS X you may need to do some trickery and give special permission on the host (mac) level e.g. create ~/docker/postgres directory and give it the correct permission (I just used 777 recursively for testing, make sure to lock it down if you put this in production).

I wrote a little script in my Github project that will take care of setting up all of the directories on the host OS that need to be set up for data to persist. The script also generates a self signed cert to use for proxying Sentry through Nginx. Without the certificate, the statistics pages in the Sentry web interface will be broken.

To run the script, run the following command and follow the prompts. Also make sure you have docker-compose installed beforehand to run all the needed command.

sudo ./setup.sh

The certs that get generated are self signed so you will see the red lock in your browser. I haven’t tried it yet but I imagine using Let’s Encrytpt to create the certificates would be very easy. Let me know if you have had any success generating Nginx certs for Docker containers, I might write a follow up post.

Preparing Postgres

After setting up directories and creating certificates, the first thing necessary to getting up and going is to add the Sentry superuser to Postgres (at least 9.4). To do this, you will need to fire up the Postgres container.

docker-compose up -d postgres

Then to connect to the Postgres DB you can use the following command.

docker-compose run postgres sh -c 'exec psql -h "$POSTGRES_PORT_5432_TCP_ADDR" -p "$POSTGRES_PORT_5432_TCP_PORT" -U postgres'

Once you are logged in to the Postgres container you will need to set up a few Sentry DB related things.

First, create the role.

CREATE ROLE sentry superuser;

And then allow it to login.

ALTER ROLE sentry WITH LOGIN;

Create the Sentry DB.

CREATE DATABASE sentry;

When you are done in the container, \q will drop out of the postgresql shell.

After you’re done configuring the DB components you will need to “prime” Sentry by running it a first time. This will probably take a little bit of time because it also requires you to build and pull all the other needed Docker images.

docker-compose build
docker-compose up

You will quickly notice if you try to browse to the Sentry URL (e.g. the IP/port of your Sentry container or docker-machine IP if you’re on OS X) that you will get errors in the logs and 503’s if you hit the site.

Repair the database (if needed)

To fix this you will need to run the following command on your DB to repair it if this is the first time you have run through the set up.

docker-compose run sentry sentry upgrade

The default Postgres database username and password is sentry in this setup, as part of the setup the upgrade prompt will ask you got create a new user and password, and make note of what those are. You will definitely want to change these configs if you use this outside of a test or development environment.

After upgrading/preparing the database, you should be able to bring up the stack again.

docker-compose up -d && docker-compose logs

Now you should be able to get to the Sentry URL and start configuring . To manage the username/password you can visit the /admin url and set up the accounts.

Next steps

The Sentry server should come up and allow you in but will likely need more configuration. Using the power of docker-compose it is easy to add in any custom configurations you have. For example, if you need to adjust sentry level configurations all you need to do is edit the file in ./sentry/sentry.conf.py and then restart the stack to pick up the changes. Likewise, if you need to make changes to Nginx or celery, just edit the configuration file and bump the stack – using “docker-compose up -d”.

I have attempted to configure as many sane defaults in the base config to make the configuration steps easier. You will probably want to check some of the following settings in the sentry/sentry.conf.py file.

SENTRY_ADMIN_EMAIL – For notifications
SENTRY_URL_PREFIX – This is especially important for getting stats working
SENTRY_ALLOW_ORIGIN – Where to allow communications from
ALLOWED_HOSTS – Which hosts can communicate with Sentry

If you have the SENTRY_URL_PREFIX set up correctly you should see something similar when you visit the /queue page, which indicates statistics are working.

If you want to set up any kind of email alerting, make sure to check out the mail server settings.

docker-compose.yml example file

The following configuration shows how the Sentry stack should look. The meat of the logic is in this configuration but since docker-compose is so flexible, you can modify this to use any custom commands, different ports or any other configurations you may need to make Sentry work in your own environment.

# Caching
redis:
  image: redis:2.8
  hostname: redis
  ports:
    - "6379:6379"
   volumes:
     - "/data/redis:/data"

memcached:
  image: memcached
  hostname: memcached
  ports:
    - "11211:11211"

# Database
postgres:
  image: postgres:9.4
  hostname: postgres
  ports:
    - "5432:5432"
  volumes:
    - "/data/postgres/etc:/etc/postgresql"
    - "/data/postgres/log:/var/log/postgresql"
    - "/data/postgres/lib/data:/var/lib/postgresql/data"

# Customized Sentry configuration
sentry:
  build: ./sentry
  hostname: sentry
  ports:
    - "9000:9000"
    - "9001:9001"
  links:
    - postgres
    - redis
    - celery
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"


# Celery
celery:
  build: ./sentry
  hostname: celery
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery worker -B -l WARNING"
  links:
    - postgres
    - redis
    - memcached
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Celerybeat
celerybeat:
  build: ./sentry
  hostname: celerybeat
  environment:
    - C_FORCE_ROOT=true
  command: "sentry celery beat -l WARNING"
  links:
    - postgres
    - redis
  volumes:
    - "./sentry/sentry.conf.py:/home/sentry/.sentry/sentry.conf.py"

# Nginx
nginx:
  image: nginx
  hostname: nginx
  ports:
    - "80:80"
    - "443:443"
  links:
    - sentry
  volumes:
    - "./nginx/sentry.conf:/etc/nginx/conf.d/default.conf"
    - "./nginx/sentry.crt:/etc/nginx/ssl/sentry.crt"
    - "./nginx/sentry.key:/etc/nginx/ssl/sentry.key"

The Dockerfiles for each of these component are fairly straight forward. In fact, the same configs can be used for the Sentry, Celery and Celerybeat services.

Sentry

# Kombu breaks in 2.7.11
FROM python:2.7.10

# Set up sentry user
RUN groupadd sentry && useradd --create-home --home-dir /home/sentry -g sentry sentry
WORKDIR /home/sentry

# Sentry dependencies
RUN pip install \
 psycopg2 \
 mysql-python \
 supervisor \
 # Threading
 gevent \
 eventlet \
 # Memcached
 python-memcached \
 # Redis
 redis \
 hiredis \
 nydus

# Sentry
ENV SENTRY_VERSION 7.7.4
RUN pip install sentry==$SENTRY_VERSION

# Set up directories
RUN mkdir -p /home/sentry/.sentry \
 && chown -R sentry:sentry /home/sentry/.sentry \
 && chown -R sentry /var/log

# Configs
COPY sentry.conf.py /home/sentry/.sentry/sentry.conf.py

#USER sentry
EXPOSE 9000/tcp 9001/udp

# Making sentry commands easier to run
RUN ln -s /home/sentry/.sentry /root

CMD sentry --config=/home/sentry/.sentry/sentry.conf.py start

Since the customized Sentry config is rather lengthy, I will point you to the Github repo again. There are a few values that you will need to provide but they should be pretty self explanatory.

Once the configs have all been put in to place you should be good to go. A bonus piece would be to add an Upstart service that takes care of managing the stack if the server either gets rebooted or the containers manage to get stuck in an unstable state. The configuration is a fairly easy thing to do and many other guides and posts have been written about how to accomplish this.

ECS cluster turnup with CoreOS and Terraform

October 9, 2015December 5, 2015 Josh Reichardt 6 Comments

Recently I have been evaluating different container clustering tools and technologies. It has been a fun experience thus far, the tools and community being built around Docker have come a long time since I last looked. So for today’s post I’d like to go over ECS a little bit.

ECS is essentially the AWS version of container management. ECS takes care of managing your Docker (container) infrastructure by handling creation, management, destruction and scheduling as well as providing API integration with other AWS services, which is really powerful. To get ECS up and running all you need to do is create an ECS cluster, either from the AWS console or from some other AWS integration like the CLI or Terraform, then install the agent on servers that you would like ECS to schedule work on. After setting up the agent and cluster name you are basically ready to go, start by creating a task and then create a service to start running containers on the cluster. Some cool new features have been announced at this years re:Invent conference but I haven’t had a chance yet to look at them yet.

First impression of ECS

The best part about testing ECS by far has been how easy it is to get set up and running. It took less than 20 minutes to go from nothing to fully functioning cluster that was scheduling containers to hosts and receiving load. I think the most powerful aspect of ECS is its integration with other AWS services. For example, if you need to attach containers/services to a load balancer, the AWS infrastructure is already there so the different pieces of the infrastructure really mesh well together.

The biggest downside so far is that the ECS console interface is still clunky. It is functional, and I have been able to use it to do everything I have needed but it just feels like it needs some polish and things are nested in menu’s and usually not easy to find. I’m sure there are plans to improve the interface and as mentioned above some new features were recently announced, so I have a feeling there will be some nice improvements on the way.

I haven’t tried the CLI tool yet but it looks promising for automating containers and services.

Setting things up

Since I am a big fan of CoreOS I decided to try turning up my ECS cluster using CoreOS as the base OS and Terraform to do the heavy lifting and provisioning.

The first step is to create your cluster. I noticed in the AWS console there was a configuration wizard that guides you through your first cluster which was annoying because there wasn’t a clean way to just create the cluster. So you will need to follow the on screen instructions for getting your first environment set up. If any of this is unclear there is a good guide for getting started with ECS here.

After your cluster has been created there is a menu that shows your ECS environments.

Next, you will need to turn on the nodes that will be connecting to this cluster. The first part of this is to get your cloud-config set up to connect to the cluster. I used the CoreOS docs to set up the ECS agent, making sure to change the ECS_CLUSTER= section in the config.

#cloud-config

coreos:
  units:
  -
  name: amazon-ecs-agent.service
  command: start
  runtime: true
  content: |
  [Unit]
  Description=Amazon ECS Agent
  After=docker.service
  Requires=docker.service
  Requires=network-online.target
  After=network-online.target

  [Service]
  Environment=ECS_CLUSTER=my-cluster
  Environment=ECS_LOGLEVEL=warn
  Environment=ECS_CHECKPOINT=true
  ExecStartPre=-/usr/bin/docker kill ecs-agent
  ExecStartPre=-/usr/bin/docker rm ecs-agent
  ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent
  ExecStart=/usr/bin/docker run --name ecs-agent --env=ECS_CLUSTER=${ECS_CLUSTER} --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} --env=ECS_CHECKPOINT=${ECS_CHECKPOINT} --publish=127.0.0.1:51678:51678 --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/lib/aws/ecs:/data amazon/amazon-ecs-agent
  ExecStop=/usr/bin/docker stop ecs-agent

Note that the Environment=ECS_CLUSTER=my-cluster, this is the most important bit to get the server to check in to your cluster, assuming you named it “my-cluster”. Feel free to add any other values your infrastructure may need. Once you have the config how you want it, run it through the CoreOS cloud-config validator to make sure it checks out. If everything looks okay there, your cloud-config should be ready to go.

You can find more info about how to configure the ECS agent in the docs here.

Once you have your cloud-config in order, you will need to get your Terraform “recipe” set up. I used this awesome github project as the base for my own project. The Terraform logic from there basically creates an AWS launch config and autoscaling group (and uses the cloud-config from above) to launch instances in to your cluster. And the ECS agent takes care of the rest, once your servers are up and the agent is reporting in to the cluster.

launch_config.tf

resource "aws_launch_configuration" "ecs" {
  name = "ECS ${var.cluster_name}"
  image_id = "${var.ami}"
  instance_type = "${var.instance_type}"
  iam_instance_profile = "${var.iam_instance_profile}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_group_ids)}"]
  user_data = "${file("../cloud-config/ecs.yml")}"

  root_block_device = {
    volume_type = "gp2"
    volume_size = "40"
  }
}

Notice the user_data section. This is where we inject the cloud config from above to provision CoreOS and launch the ECS agent.

autoscaler.tf

resource "aws_autoscaling_group" "ecs-cluster" {
  availability_zones = ["${split(",", var.availability_zones)}"]
  vpc_zone_identifier = ["${split(",", var.subnet_ids)}"]
  name = "ECS ${var.cluster_name}"
  min_size = "${var.min_size}"
  max_size = "${var.max_size}"
  desired_capacity = "${var.desired_capacity}"
  health_check_type = "EC2"
  launch_configuration = "${aws_launch_configuration.ecs.name}"
  health_check_grace_period = "${var.health_check_grace_period}"

  tag {
    key = "Env"
    value = "${var.environment_name}"
    propagate_at_launch = true
  }

  tag {
    key = "Name"
    value = "ECS ${var.cluster_name}"
    propagate_at_launch = true
  }
}

There are a few caveats I’d like to highlight with this approach. First, I already have an AWS infrastructure in place that I was testing agains this. So I didn’t have to do any of the extra work to create a VPC, or a gateway for the VPC. I didn’t have to create the security groups and subnets either, I just added them to the Terraform code.

The other caveat is that if you want to use the Github project I linked to you will need to make sure that you populate the variables with your own environment specific values. That is why having the VPC, subnets and security groups was handy for me. Be sure to browse through the variables.tf file and substitute in your own values. As an example, I had to update the variables to use the CoreOS 766.4.0 image. This AMI will be specific to your AWS region so make sure to look up the AMI first.

variable "ami" {
  /* CoreOS 766.4.0 */
  default = "ami-dbe71d9f"
  description = "AMI id to launch, must be in the region specified by the region variable"
}

Another part I had to modify to get the Github project to work was adding in my AWS credentials which look similar to the following. Make sure to update these variables with your ID and secret.

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region = "${var.region}"
}

variable "access_key" {
  description = "AWS access key"
  default = "XXX"
}

variable "secret_key" {
  description = "AWS secret access key"
  default = "xxx"
}

Make sure to also copy/edit the autoscaling.tf and launch_config.tf files to reflect anything that is specific to your environment (Terraform will complain if there are issues).

After you have combed through the variables.tf and updated the Terraform files to your liking you can simply run terraform plan -input=false and see how Terraform will create the ASG for you.

If everything looks good, you can run terrafrom apply -input=false and Terraform will go out and start building your new ECS infrastructure for you. After a few minutes check the EC2 console and your launch config and autoscaling group should be in there. If that stuff all looks okay, check the ECS console and your new servers should show up and be ready to go to work for you!

NOTE: If you are starting from scratch, it is possible to do all of the infrastructure provisioning via Terraform but it is too far out of the scope of this post to cover because there are a lot of steps to it.