ECS cluster turnup with CoreOS and Terraform

Recently I have been evaluating different container clustering tools and technologies.  It has been a fun experience thus far; the tools and community being built around Docker have come a long way since I last looked.  So for today’s post I’d like to go over ECS a little bit.

ECS is essentially the AWS version of container management.  ECS takes care of managing your Docker (container) infrastructure by handling creation, management, destruction and scheduling, as well as providing API integration with other AWS services, which is really powerful.  To get ECS up and running all you need to do is create an ECS cluster, either from the AWS console or from some other AWS integration like the CLI or Terraform, then install the agent on the servers that you would like ECS to schedule work on.  After setting up the agent and cluster name you are basically ready to go: create a task, then create a service to start running containers on the cluster.  Some cool new features were announced at this year’s re:Invent conference but I haven’t had a chance to look at them yet.

First impression of ECS

The best part about testing ECS by far has been how easy it is to get set up and running.  It took less than 20 minutes to go from nothing to a fully functioning cluster that was scheduling containers to hosts and receiving load.  I think the most powerful aspect of ECS is its integration with other AWS services.  For example, if you need to attach containers/services to a load balancer, the AWS infrastructure is already there, so the different pieces of the infrastructure really mesh well together.

The biggest downside so far is that the ECS console interface is still clunky.  It is functional, and I have been able to use it to do everything I have needed, but it feels like it needs some polish; things are nested in menus and often not easy to find.  I’m sure there are plans to improve the interface, and as mentioned above some new features were recently announced, so I have a feeling there are some nice improvements on the way.

I haven’t tried the CLI tool yet but it looks promising for automating containers and services.

Setting things up

Since I am a big fan of CoreOS I decided to try turning up my ECS cluster using CoreOS as the base OS and Terraform to do the heavy lifting and provisioning.

The first step is to create your cluster.  I noticed that the AWS console pushes you into a configuration wizard that guides you through your first cluster, which was annoying because there wasn’t a clean way to just create the cluster by itself.  So you will need to follow the on-screen instructions to get your first environment set up.  If any of this is unclear there is a good guide for getting started with ECS here.
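If you would rather skip the wizard, the cluster itself can also be created from the AWS CLI, as mentioned above.  Here is a minimal sketch (the cluster name my-cluster is just an example):

# Create an empty ECS cluster; instances join it later via the agent
aws ecs create-cluster --cluster-name my-cluster

# Confirm it exists and, once instances have been launched, that they registered
aws ecs describe-clusters --clusters my-cluster
aws ecs list-container-instances --cluster my-cluster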

After your cluster has been created there is a menu that shows your ECS environments.

ECS cluster menu

Next, you will need to turn up the nodes that will be connecting to this cluster.  The first part of this is to get your cloud-config set up to connect to the cluster.  I used the CoreOS docs to set up the ECS agent, making sure to change the ECS_CLUSTER= value in the config.

#cloud-config

coreos:
  units:
    - name: amazon-ecs-agent.service
      command: start
      runtime: true
      content: |
        [Unit]
        Description=Amazon ECS Agent
        After=docker.service
        Requires=docker.service
        Requires=network-online.target
        After=network-online.target

        [Service]
        Environment=ECS_CLUSTER=my-cluster
        Environment=ECS_LOGLEVEL=warn
        Environment=ECS_CHECKPOINT=true
        ExecStartPre=-/usr/bin/docker kill ecs-agent
        ExecStartPre=-/usr/bin/docker rm ecs-agent
        ExecStartPre=/usr/bin/docker pull amazon/amazon-ecs-agent
        ExecStart=/usr/bin/docker run --name ecs-agent --env=ECS_CLUSTER=${ECS_CLUSTER} --env=ECS_LOGLEVEL=${ECS_LOGLEVEL} --env=ECS_CHECKPOINT=${ECS_CHECKPOINT} --publish=127.0.0.1:51678:51678 --volume=/var/run/docker.sock:/var/run/docker.sock --volume=/var/lib/aws/ecs:/data amazon/amazon-ecs-agent
        ExecStop=/usr/bin/docker stop ecs-agent

Note the Environment=ECS_CLUSTER=my-cluster line; this is the most important bit for getting the server to check in to your cluster, assuming you named it “my-cluster”.  Feel free to add any other values your infrastructure may need.  Once you have the config how you want it, run it through the CoreOS cloud-config validator to make sure it checks out.  If everything looks okay there, your cloud-config should be ready to go.

You can find more info about how to configure the ECS agent in the docs here.
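Once an instance is up, a quick way to confirm the agent is running and has joined the cluster is to hit the agent’s local introspection API, which the unit above publishes on 127.0.0.1:51678.  A quick sketch, run on the instance itself:

curl http://127.0.0.1:51678/v1/metadata

If registration worked, the response includes the cluster name and the container instance ARN.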

Once you have your cloud-config in order, you will need to get your Terraform “recipe” set up.  I used this awesome GitHub project as the base for my own project.  The Terraform logic there basically creates an AWS launch configuration and autoscaling group (using the cloud-config from above) to launch instances into your cluster.  The ECS agent takes care of the rest once your servers are up and reporting in to the cluster.

launch_config.tf

resource "aws_launch_configuration" "ecs" {
  name = "ECS ${var.cluster_name}"
  image_id = "${var.ami}"
  instance_type = "${var.instance_type}"
  iam_instance_profile = "${var.iam_instance_profile}"
  key_name = "${var.key_name}"
  security_groups = ["${split(",", var.security_group_ids)}"]
  user_data = "${file("../cloud-config/ecs.yml")}"

  root_block_device = {
    volume_type = "gp2"
    volume_size = "40"
  }
}

Notice the user_data section.  This is where we inject the cloud config from above to provision CoreOS and launch the ECS agent.

autoscaler.tf

resource "aws_autoscaling_group" "ecs-cluster" {
  availability_zones = ["${split(",", var.availability_zones)}"]
  vpc_zone_identifier = ["${split(",", var.subnet_ids)}"]
  name = "ECS ${var.cluster_name}"
  min_size = "${var.min_size}"
  max_size = "${var.max_size}"
  desired_capacity = "${var.desired_capacity}"
  health_check_type = "EC2"
  launch_configuration = "${aws_launch_configuration.ecs.name}"
  health_check_grace_period = "${var.health_check_grace_period}"

  tag {
    key = "Env"
    value = "${var.environment_name}"
    propagate_at_launch = true
  }

  tag {
    key = "Name"
    value = "ECS ${var.cluster_name}"
    propagate_at_launch = true
  }
}

There are a few caveats I’d like to highlight with this approach.  First, I already had an AWS infrastructure in place that I was testing against, so I didn’t have to do any of the extra work to create a VPC or a gateway for the VPC.  I didn’t have to create the security groups and subnets either; I just referenced the existing ones in the Terraform code.

The other caveat is that if you want to use the GitHub project I linked to, you will need to populate the variables with your own environment-specific values.  That is why already having the VPC, subnets and security groups was handy for me.  Be sure to browse through the variables.tf file and substitute in your own values.  As an example, I had to update the variables to use the CoreOS 766.4.0 image.  The AMI ID will be specific to your AWS region so make sure to look it up first.

variable "ami" {
  /* CoreOS 766.4.0 */
  default = "ami-dbe71d9f"
  description = "AMI id to launch, must be in the region specified by the region variable"
}
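If you aren’t sure which AMI ID corresponds to a CoreOS release in your region, one option is to query the EC2 API.  This is just a sketch; the name filter and region below are assumptions you should adjust for your own setup (the CoreOS site also publishes the current AMI IDs per region):

aws ec2 describe-images --region us-west-2 \
  --filters "Name=name,Values=CoreOS-stable-766.4.0*" \
  --query "Images[*].[ImageId,Name]" --output text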

Another part I had to modify to get the GitHub project to work was adding in my AWS credentials, which looks similar to the following.  Make sure to update these variables with your own ID and secret.

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region = "${var.region}"
}

variable "access_key" {
  description = "AWS access key"
  default = "XXX"
}

variable "secret_key" {
  description = "AWS secret access key"
  default = "xxx"
}

Make sure to also copy/edit the autoscaler.tf and launch_config.tf files to reflect anything that is specific to your environment (Terraform will complain if there are issues).

After you have combed through the variables.tf and updated the Terraform files to your liking you can simply run terraform plan -input=false and see how Terraform will create the ASG for you.

If everything looks good, you can run terraform apply -input=false and Terraform will go out and start building your new ECS infrastructure for you.  After a few minutes, check the EC2 console and your launch configuration and autoscaling group should be there.  If that all looks okay, check the ECS console and your new servers should show up and be ready to go to work for you!

NOTE: If you are starting from scratch, it is possible to do all of the infrastructure provisioning via Terraform but it is too far out of the scope of this post to cover because there are a lot of steps to it.


Graphite threshold alerting with Sensu

Instrumenting your code to report application level metrics is definitely one of the most powerful monitoring tasks you can accomplish.  It is damn satisfying to get working the first time as well.  Having the ability to look at your application and how it is performing at a granular level can help identify potential issues or bottlenecks but can also give you a greater understanding of how people are interacting with the application at a broad scale.  Everybody loves having these types of metrics to talk about their apps and products so this style of monitoring is a great win for the whole team.

I don’t want to dive in to the specifics of WHAT you should monitor here, that will be unique to every environment.  Instead of covering the what and how of instrumenting the code to report specific metrics, I will be running through an example of what the process might look like for instrumenting a check and alarm for monitoring and alerting purposes at an operations level.  I am not a developer, so I don’t spend a lot of time thinking about what types of things are important to collect metrics on.  Usually my job instead is to figure out how to monitor and alert effectively, based on the metrics that developers come up with.

Sensu has a great plugin to check Graphite thresholds in their plugin repo.  If you haven’t looked already, take a minute to glance over the options a little bit and see how the plugin works.  It is a pretty simple plugin but has been able to do everything I need it to.

One common monitoring task is to check how long requests are taking.  So in this example, we are querying the Graphite server and reporting a critical status (exit status 2) if requests average more than 7 seconds (7000 ms) for a response time.

Here is the command you would run manually to check this threshold.  Make sure to download the script if you haven’t already, you can just copy the code directly or clone the repo if you are doing this manually.  If you are using Sensu you can use the sensu_plugin LWRP to grab the script (more on that below).

./check-data -s <servername:port> -t '<graphite query>' -c 7000 -u user -p password
./check-data -s graphite.example.com -t "alias(stats.timer.server.response_time.mean, 'Mean')" -c 7000 -u myuser -p awesomepassword

There are a few things to note.  The -s flag specifies which Graphite server or endpoint to hit, -t specifies the target (the Graphite query to run the check against), the -c flag sets the critical threshold, and -u and -p are used if your Graphite server uses authentication.  If your Graphite instance is public it should probably use auth; if it is internal only, it is probably not as important.  Obviously these are just dummy values, included to give you a better idea of what a real command should look like, so use your own values in their place.

The query we’re running is against a statsd metric that records the mean response time for a request, reported from the application itself (this is the developer instrumenting their code part I mentioned).  This check is specific to my environment, so you will need to modify your queries to make sure you alert on a useful metric and threshold in your own environment.

Here’s an example of what the graphite graph (rendered in Grafana) looks like.

Sensu Graph

Obviously this is just a sample but it should give you the general idea of what to look for.

If you examine the script, there are a few Ruby gem requirements to get it to run, which you will need to install if you haven’t already.  They are sensu-plugin, json, json-uri and openssl.  You don’t need sensu-plugin if you are just running the check manually, but you WILL need it installed on the Sensu client that will be running the scheduled check.  That can be done manually or with the Sensu Chef recipe (specifically by turning on Sensu embedded Ruby and Ruby gems), which I recommend using anyway if you plan on doing any type of deployment at scale with Sensu.
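If you do end up installing the gems by hand on a client, a rough sketch of that (assuming the standard Sensu embedded Ruby path of /opt/sensu/embedded) looks like this; the Chef cookbook does the equivalent for you when the embedded Ruby options are enabled:

sudo /opt/sensu/embedded/bin/gem install sensu-plugin json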

Here is what the Chef code looks like if you use Sensu to deploy this check automatically.

sensu_check "check_request_time" do 
  command "#{node['sensu']['plugindir']}/check-data.rb -s graphite.example.com -t \"alias(stats.timers.server.facedetection.response_time.mean, 'Mean')\" -c 7000 -a 360 -u myuser -p awesomepassword"
  handlers ["pagerduty", "slack"] 
  subscribers ["core"] 
  interval 60 
  standalone true 
  additional(:notification => "Request time above threshold", :occurrences => 5)
end

This should look familiar if you have any background using the Sensu Chef cookbook.  Basically we are using the sensu_check LWRP to execute the script with the different parameters we want, using the pagerduty and slack handlers, which are just fancy ways to pipe out the results of the check.  We are also saying we want to run this on a scheduled interval of 60 seconds as a standalone check, which means it will be executed on the client node (not the Sensu server itself).  Finally, we are saying that after 5 failed checks we want to append a message to the handler that says exactly what is going wrong.
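If you aren’t using Chef, the same standalone check can be defined directly in Sensu’s JSON config on the client.  A rough equivalent of the resource above, assuming the plugin lives in /etc/sensu/plugins and using a hypothetical file at /etc/sensu/conf.d/check_request_time.json, might look like this (restart sensu-client afterwards so it picks the check up):

{
  "checks": {
    "check_request_time": {
      "command": "/etc/sensu/plugins/check-data.rb -s graphite.example.com -t \"alias(stats.timers.server.facedetection.response_time.mean, 'Mean')\" -c 7000 -a 360 -u myuser -p awesomepassword",
      "handlers": ["pagerduty", "slack"],
      "interval": 60,
      "standalone": true,
      "notification": "Request time above threshold",
      "occurrences": 5
    }
  }
}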

You can stick this logic in an existing recipe or create a new one that handles your metric threshold checks.  I’m not sure what the best practice is for where to put the check, but I have a recipe that runs standalone threshold checks that I stuck this logic into, and it seems to work.  Once the logic has been added you should be able to run chef-client for the new check to get picked up.


Intro to Systemd

I have a rocky relationship with Systemd.  On the one hand I love how powerful and extensive it is.  On the other hand, I hate how cumbersome and clunky it can sometimes feel.  There are a TON of moving components and it is very confusing to use if you have no experience with it.  The aim of this post is NOT to debate relative merits of Systemd but instead to take users through a few basic examples of how to accomplish tasks with Systemd and get familiar with how to manage systems with this framework.

My background is primarily with Debian/Ubuntu so moving over to this init system has been a learning curve.  The problem I had when I first made the transition was that there weren’t a lot of great resources for making the change.

Most of my knowledge has been pieced together from various blog posts and sites, the best of which is over at the Arch wiki.  As Systemd continues to mature it is becoming easier to find good guides and resources, but the Arch wiki is still the de facto, go-to place for learning how to use Systemd.

Another thing that has helped is forcing myself to use CoreOS, which, for better or worse, has forced me to learn how to use Systemd.  It seems that most modern Linux distros are moving towards Systemd, so it’s probably worth it to at least begin experimenting with it at this point.  Ubuntu 15.04 and Debian 8 have both made the jump, along with the RedHat based distros, with more to follow.

While the learning curve can be a little steep to start with, you can learn about 80% of the things that Systemd can do with about 20% of the effort.  Getting the basics down takes a little bit of effort but will be more than enough to manage a system.  Before starting, as a disclaimer, I do not claim to be an expert, but I have learned the hard way how things work, and sometimes don’t work, with Systemd.

Basics of Systemd

So now that we have a little bit of background stuff out of the way, we can look at some of the meat and potatoes of Systemd.  The first thing to do is get an idea of what services are running on your system.

systemctl

This will give you a listing of every unit running on your system.  Usually there are a very large number of units listed, which isn’t all that useful on its own, but say, for example, there is a failed unit.  You can filter systemctl to only spit out failed units.

systemctl --failed

That’s pretty cool.  If, for example, there is a failed unit you can drill down into that unit specifically.

systemctl status <unit>

This will give you process and service information for the unit file and the last 10 lines of logs for the unit as well, which can be really handy for troubleshooting.

One of the easier ways to work with unit files is to use the edit subcommand.  This method removes much of the complexity of having to know exactly where all of the unit files live on the system.

systemctl edit --full <unit>

You can manage and interact with Systemd and the unit files in a straight forward way.  We will examine a few examples.

systemctl start <unit> - start a stopped service
systemctl stop <unit> - stop a running service
systemctl restart <unit> - restart a service
systemctl disable <unit> - stop unit from loading on boot
systemctl enable <unit>  - load unit on boot

If you make a change to a service or unit file (we will go over this later) you need to reload the Systemd daemon for it to pick up the changes to the file.

systemctl daemon-reload

This is super important (on CoreOS at least) when troubleshooting, because if you don’t run the reload, your process will not be updated and will not change its behavior.

There are many many more tips and tricks for working with systemctl but this should give users a basic grasp on interacting with components of the system.  Again, for lots more details I advise that you check out the wiki.

Journalctl

If you are coming from older Debian/Ubuntu distros you will need to get familiar with using journalctl to view and manipulate your log files.

One of the first changes people notice right away is that there aren’t any of the beloved log files in /var/log any more.  Well, there are still logs, but they aren’t what most folks are used to.

The files haven’t gone away, they are simply disguised as a binary file and are accessible via journald.  To check how much space your logs are taking up on disk you can use the following command.

journalctl --disk-usage

Journalctl is a powerful tool for doing just about everything you can think of with log files.  Because journald writes logs as a binary file, you need this tool to interact with them.  That’s okay though, because journalctl has pretty much all of the features needed to look through logs with its built-in flags.  For example, if you want to tail a log file, you can use the “-f” flag to follow it and the “-u” flag to specify the systemd unit name, as illustrated below (this is a CoreOS system so your results may vary on a different OS).

journalctl -f -u motdgen

To follow all the entries that journalctl is picking up (pretty much just tailing all the logs as an equivalent) you can just use this.

journalctl -f

Sometimes you only want to view the last X number of entries in a log file.  To limit the output to the most recent entries you can use the “-n” flag.

journalctl -n 100 -u motdgen

You can filter logs to only show messages since the last boot.

journalctl -b

There are many more options and the filtering capabilities of journalctl are very powerful.  Additionally, journalctl offers JSON output, filtering by log priority, granular filtering by time with the --since flag, filtering by unit with the --unit flag, filtering on different boots, and more.
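A couple of quick sketches of those options, reusing the motdgen unit from above:

journalctl -u motdgen --since "1 hour ago" -o json-pretty   # time filter with JSON output
journalctl -p err -b                                        # only error-priority messages since the last boot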

I suggest exploring on your own a little bit to see how far the capabilities can go.  Here is a link to all of the various flags that journalctl has, as well as a good writeup covering some of its more advanced capabilities.

Drop-in Units

Writing unit files is somewhat of an art form.  I feel that it can quickly become such a complicated topic that I won’t discuss it in full detail here.

By default, Systemd unit files are written to /usr/lib/systemd/system and custom defined unit files are written to /etc/systemd/system.  Systemd looks at both locations for which files to load up.  Because of this, you can extend base units that live in the /usr/lib path by augmenting them in /etc/systemd.

This has a few implications.  First, it makes unit files a little bit easier to manage.  It also makes extending units a little bit easier.  You don’t need to worry about manipulating the system level units; you just add functionality on top of them in /etc/systemd.

Drop-in units are a handy feature if you need to extend a basic unit.  To make the drop-in work, it must follow the format /etc/systemd/system/<unit>.d/override.conf, where <unit> is the full name of the unit that you are extending (etcd2.service.d in the example below) and the .conf file can be named whatever makes sense to you.

The following example extends the baked-in CoreOS etcd2 service.

/etc/systemd/system/etcd2.service.d/30-configuration.conf

[Service]
# General settings
Environment=ETCD_DEBUG=true

Replacing Cron

Well not exactly.  Systemd doesn’t use cron jobs so if you want to get something to run on a schedule you will need to use a timer unit and associate a service with it.

[Unit]
Description=Log file cleaner (runs daily at midnight)

[Timer]
OnCalendar=daily

Notice that there is a log-cleaner.timer as well as a log-cleaner.service (below).  The easiest way I have found is to create a service/timer pair, which ensures that the timer runs the service based on its schedule.

[Unit]
Description=Log file cleaner

[Service]
ExecStart=/usr/bin/bash -c "truncate -s 1m /data/*.log"

The only thing that the service is doing is running a shell command which will get run based on the timer directive in the timer unit file.
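One thing the unit files above don’t show is how to actually activate the pair.  A minimal sketch, assuming they were saved as /etc/systemd/system/log-cleaner.timer and log-cleaner.service (to enable the timer at boot you would also add an [Install] section with WantedBy=timers.target to the timer unit):

sudo systemctl daemon-reload
sudo systemctl start log-cleaner.timer
systemctl list-timers

The list-timers output shows when each active timer last fired and when it will fire next.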

The timer directive in the timer unit is pretty flexible.  For example, you can tell the timer to run based on the calendar either by day, week or month or alternately you can have the unit run services based on the system time by seconds, minutes or hours.

One of the downsides of timers is that there is not an easy way to email results, which is easily configurable with cron.  Likewise, timers don’t have a random delay for running jobs like cron has.  So timers are definitely no replacement for cron, but they have a lot of usable functionality; it is just confusing to learn at first, especially if you are used to cron.

Here is a more involved tutorial for working with timers.  Also, the Arch wiki once again is great and has a dedicated section for timers.

Conclusion

This introduction is really just the beginning.  There are so many more pieces to the Systemd puzzle; I have just covered the basics.  In fact, I have deliberately chosen not to cover many topics and concepts because they can convolute the learning process and slow things down.  I chose to focus instead on the most basic ideas that are used most frequently in day to day administration.  Feel free to go do your own research and play around with the more advanced topics.

Systemd aims to solve a lot of problems but brings about a lot of complexity as well to accomplish this task.  In my opinion it is definitely worth investing the time and energy to learn this system and service manager though; pretty much all of the Linux distros are moving towards it, and you would be doing yourself a disservice by not learning it.

Other areas of Systemd that aren’t covered in this post but are worth researching and playing with are:

  • Targets (analogous to runlevels)
  • Managing NTP with timedatectl
  • Network management with networkd
  • Hostname management with hostnamectl
  • Login and session management with loginctl

One of the easiest ways to get started with Systemd is by grabbing a Vagrant box that leverages Systemd.  I like the CoreOS Vagrant box but there are many other options for getting started.  One of the best ways to learn is by doing, so take some of these commands and resources and go start playing around with Systemd.  Feel free to reach out if you have any extra additions or need help getting started with Systemd.


CoreOS Tips and Tricks

One thing that was never clear to me when I started learning CoreOS was a good technique for rapidly testing out different CoreOS features.  I will spend some time walking folks through a few of the tips and tricks that I have learned so far along the way.

The folks at CoreOS have an awesome repo for testing out features locally, called coreos-vagrant.  If you haven’t heard of it or used it, go check it out.  Another great resource for getting started with the CoreOS Vagrant project is the docs on the CoreOS website; you should be able to find most of the use cases there.

So in this post I will be going over some of what is already detailed in the docs and README but will additionally fill readers in with a few extra tips and tricks I have discovered so far along the way.  I am surprised by all of the hidden secrets I frequently discover buried in CoreOS and its documentation.  It is always fun to find new features and capabilities of the OS that you didn’t know existed.

Vagrant

So to get started we need to briefly cover Vagrant.  Vagrant has made things soooo much easier to test.  If you haven’t heard of Vagrant, definitely go check it out and get familiar with it.  It is basically an interface for controlling VMs and their various components locally.

When I first started testing things out with CoreOS I would spin boxes up in either Digital Ocean or AWS with a cloud-config that I would agonize over, because I was afraid of screwing up small details or provisioning the server incorrectly.  It is also more of a hassle to provision a cloud server because it involves additional authentication keys for command line tools or manually creating instances via GUI tools.  When testing, you often destroy and recreate instances, so that additional overhead can become tedious.

Using Vagrant I can quickly and easily make changes to a configuration or even test out entirely different CoreOS versions in minutes and not care about getting small details wrong since a) it doesn’t cost me anything extra to run the instance locally and b) I can blow out and reprovision in a few seconds.

I think this local Vagrant approach also makes you a better CoreOS citizen because it forces you to look at what you’re doing and fix issues more often, since you are iterating more frequently and therefore testing more of CoreOS’s features and options (at least this has been my experience so far).

cloud-config

Cloud-config was initially a painful part of the learning process for me but I have grown to love it.  I like to test out new cloud-configs quite a bit and at first it was frustrating to screw up configs because that meant I had to redo the entire server bootstrap process in DO or AWS.  Terraform makes the provisioning process less painful but it is still a little bit of a hassle, especially when you are smoke testing configs to make sure they work.

Luckily it is dead simple to set up cloud-configs using Vagrant locally.  The repo comes with a “user-data.sample” file that you can copy to “user-data” and away you go, make any modifications you may need or config changes you want to test out.  The local testing via cloud-config discovery alone was a game changer for me.

To fix a problem with your cloud-config you can simply edit the user-data file that was copied in to place on the server and then rerun cloud-init to fix the provisioning.  Below is an example of how to do this cloud-init provisioning.

Before you provision any of your cloud-configs though, I recommend testing them by running them through the CoreOS cloud-config validator tool to help identify any potential problems your config might have before you even run it.  There is also an experimental validation flag in the cloud-init binary shipped with the OS if you want to try that out.  Most of the time I find it just as easy to copy the config into the online checker, but there are definitely scenarios and use cases where it might be a good idea to test locally; I just haven’t needed to yet.
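For the local route, a sketch of what that looks like is below.  The -validate flag is the experimental one mentioned above, so treat the exact flag name as an assumption and check coreos-cloudinit -help on your build before relying on it.

sudo /usr/bin/coreos-cloudinit -validate --from-file /path/to/user-data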

Next, if you have an existing config on your server and would like to modify the content and reprovision the server with the updated cloud-config, without destroying and recreating the server, use the following.

sudo /usr/bin/coreos-cloudinit --from-file /path/to/user-data

Then you can watch the logs to make sure they are doing what you expect.

journalctl -b _EXE=/usr/bin/coreos-cloudinit

If you don’t want to muck around with the cloud-config on the server, you can easily blow away the server, modify the user-data file on the host and just reprovision the Vagrant machine.  Obviously this method takes a little bit longer, but it isn’t a significant penalty and it is also easier to keep track of, since you know exactly what user-data values are being passed to the Vagrant machine from the host and can more easily stay on top of the changes you are making.

config.rb

The config.rb file gives you a great deal of flexibility when testing CoreOS out locally.  For example, you can control most options that CoreOS gets provisioned with, including the release version,

$image_version = "723.1.0"

You can specify any version inside the quotes to bootstrap the CoreOS instance.  This is handy for testing out new alpha features, or checking whether something that was broken in one release has been fixed, since you can quickly roll the version back or forward.

In the config.rb file you can also specify server level details for things like the hostname,

$instance_name_prefix="core"

How many instances to provision,

$num_instances=1

Custom memory or cpu’s for the instance,

$vm_memory = 1024
$vm_cpus = 1

Shared folders, forwarded ports, etc. can be set there as well.  Granted, these are Vagrant level configurations, but they still make working with CoreOS much easier in my opinion.

Additionally, there is an option to provision the instance with an etcd/etcd2 discovery token to bootstrap etcd when the server gets created.  If you have ever dealt with testing out etcd, this is a handy way to quickly bring servers up and down without ever having to worry about reissuing discovery tokens, etc.

Tips and Tricks

I have found a few other tips and tricks along the way that can be used when testing CoreOS locally or after it has been deployed.

The first tip is getting the OS to update manually (without reprovisioning via Vagrant).  For most testing purposes I usually turn off automatic reboots using the following keys in my cloud-configs.

coreos:
  update:
    group: alpha
    reboot-strategy: off

This will tell CoreOS to track the alpha channel (if a version is not specified in your config.rb) and to not reboot when an update is applied.

Sometimes it is easier to just manually update the OS than to destroy the VM and specify a new version.  To update manually you can run the following commands.

update_engine_client -check_for_update
journalctl -f (this will follow the update progress)
sudo reboot (after the updated version is downloaded)

After you see that the newest release has been downloaded you can reboot the server and it should boot up with the newest updates.
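To confirm what version you ended up on after the reboot, a couple of quick checks (a sketch):

cat /etc/os-release
update_engine_client -status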

Another cool trick is to customize the toolbox on CoreOS.  I’ve written about this before but figured I might as well mention it again since it is a useful trick.

By default the toolbox runs Fedora, but we are mainly an Ubuntu/Debian shop, so we are much more comfortable using the tools bundled with those distros.  It is pretty simple to configure the toolbox to automatically use Ubuntu when the instance is provisioned, using the following key in your cloud-config.

write_files:
  - path: /home/core/.toolboxrc
    owner: core
    content: |
      TOOLBOX_DOCKER_IMAGE=ubuntu
      TOOLBOX_DOCKER_TAG=14.04

When you run the “toolbox” command it will look for Ubuntu instead of the default Fedora image.

Another trick I have used a few times is overriding the update strategy on a server that has already been provisioned using environment variables.

As I have discovered, much of the configuration that takes place happens via environment variables.  So to update the reboot strategy you can modify the /etc/coreos/update.conf file.  The contents should look something like this:

GROUP=beta
REBOOT_STRATEGY=off

If you’d like to have the server use alpha images, change the key to GROUP=alpha, and so on for the other keys in the configuration.  After making your changes, you will need to restart the update-engine service.

sudo systemctl restart update-engine

The system should pick up the changes you made and you should be good to go.

The last trick I will highlight in this post is how to get “drop in” services working.  This is a core part of how systemd (especially on CoreOS) works, but few people realize how it works.  By creating a drop-in you are simply extending a service to read in extra bits of configuration.  For example, the following unit file extends the system etcd2 service.

Create the following file,

/etc/systemd/system/etcd2.service.d/30-configuration.conf

The etcd2 service will look in this location for its extra configuration.

[Service]
# General settings
Environment=ETCD_NAME=etcd-config
Environment=ETCD_VERBOSE=1

Inside the unit file we are just setting some extra environment variables that etcd2 can then use as flags to instruct it how to run.
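To apply a drop-in like this on a running system (as opposed to at provisioning time), reload the daemon and restart the service.  A quick sketch; systemctl cat is a handy way to confirm the drop-in was actually picked up:

sudo systemctl daemon-reload
systemctl cat etcd2.service
sudo systemctl restart etcd2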

There are obviously a lot more CoreOS tricks.  I have just highlighted a few of my favorites here.  I suggest looking at the CoreOS docs, there is a lot of good information over there.  Feel free to comment with your own tricks and I will be sure to try them out and get them added here.


CoreOS etcd2 encryption

Etcd 2.1.1 Encryption and Authentication

New to etcd 2.1.0 is the ability to use authentication to secure your etcd resources.  Encryption and authentication are relatively new additions, so I thought I would write a quick blog post to help me remember how to get these components up and running, and to help others, since some of the ideas were a little confusing to me at first.

I pieced together most of the information for this post from a few different sources.

The first were a pair of great tutorials (1, 2) for getting etcd encryption up and going.  The second resource was the etcd-ca project by CoreOS for creating a CA and issuing certs; there are other ways of doing it, but this was a straightforward method.  The third resource I recommend looking at is the Security page in the CoreOS docs, which shows examples of how to piece all of the commands and certs together.  The last resource readers might find useful is the etcd2 docs for the different flags and configuration options, which was helpful for finding all of the options I needed to enable to get etcd2 working properly.

Requirements

To use the authentication feature you will need etcd 2.1.0 or greater, which means you will need to be running a version of CoreOS that ships the correct binary, i.e. CoreOS v752.1.0 or above, OR you will need to supply the correct binary version/Docker image yourself.
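A quick way to check what you already have on a given CoreOS host (a sketch):

cat /etc/os-release
etcd2 --version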

Authentication is still an “experimental” feature so it may change at any time, therefore I have decided not to get in to any of the details of how it works.  If you are interested you can check out the docs on users and auth.

Running the CA server

At first I was concerned about running a CA server because I’ve had painful experiences in the past with CAs, but the etcd-ca tool makes this process easy and straightforward.  There are a few other CA resources in the etcd2 encryption docs but I won’t cover them here.

The easiest way to use the etcd-ca tool is to run it in a Docker container and write the certs out to the host via a shared volume.  The following steps will pull the repo and build the binary for running the tool.

docker pull golang
docker run -i -t -v $(pwd):/go golang /bin/bash
git clone https://github.com/coreos/etcd-ca
cd etcd-ca
./build
cd ./bin

Create the certs

After the etcd-ca binary has been built we can start creating certs.  The first step is to create the CA cert, which will be used to sign all of the other certs.

./etcd-ca init

After creating the CA signing cert, we will create a certificate for the etcd server that clients will be authenticating to.

./etcd-ca new-cert -ip <etcd_server_ip> <hostname>
./etcd-ca sign <hostname>
./etcd-ca chain <hostname>
./etcd-ca export --insecure <hostname> | tar xvf -

Replace <etcd_server_ip> with the public address of the etcd server and <hostname> with the hostname of the etcd server.  In this example, something like core01 would be a good name.

Optional – Client cert

This is not necessary in all scenarios for setting up encryption for etcd but if you are interested in having clients authenticate with their own cert it isn’t that much effort to add.

./etcd-ca new-cert -ip <etcd_server_ip> client
./etcd-ca sign client
./etcd-ca export --insecure client | tar xvf -

Note:  You may need to rename the generated server/client key files to match the filenames referenced below.  Also note that if you screw up any of the certs, or for any reason need to recreate them, you can simply delete them from the hidden .etcd-ca/ folder that contains all of the certificates.

Etcd cloud-config

The following cloud-config will configure etcd2 to use the certs we configured above.

There is currently an issue parsing a few of the etcd2 command line flags, so the workaround (for now) is to split the configuration up into a base config and then add the env vars as a drop-in.

write_files:
  - path: /etc/systemd/system/etcd2.service.d/30-configuration.conf
    permissions: '0644'
    content: |
      [Service]
      # General settings
      Environment=ETCD_NAME=etcd-config
      Environment=ETCD_VERBOSE=1
      # Encryption
      Environment=ETCD_CLIENT_CERT_AUTH=1
      Environment=ETCD_TRUSTED_CA_FILE=/home/core/ca.crt
      Environment=ETCD_PEER_KEY_FILE=/home/core/server.key
      Environment=ETCD_PEER_CERT_FILE=/home/core/server.crt
      Environment=ETCD_CERT_FILE=/home/core/server.crt
      Environment=ETCD_KEY_FILE=/home/core/server.key

  - path: /home/core/ca.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      ca cert content
      -----END CERTIFICATE-----

  - path: /home/core/server.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      server cert content
      -----END CERTIFICATE-----

  - path: /home/core/server.key
    permissions: '0644'
    content: |
      -----BEGIN RSA PRIVATE KEY-----
      server key content
      -----END RSA PRIVATE KEY-----

  - path: /home/core/client.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      client cert content
      -----END CERTIFICATE-----

  - path: /home/core/client.key
    permissions: '0644'
    content: |
      -----BEGIN RSA PRIVATE KEY-----
      client key content
      -----END RSA PRIVATE KEY-----

coreos:
  etcd2:
    name: etcd
    discovery: https://discovery.etcd.io/a1c999ec1a23039996419e0a20cb1e35
    advertise-client-urls: https://$public_ipv4:2379
    initial-advertise-peer-urls: https://$private_ipv4:2380
    listen-client-urls: https://0.0.0.0:2379
    listen-peer-urls: https://$private_ipv4:2380
  units:
    - name: etcd2.service
      command: start

If you don’t want to bootstrap a node with cloud-config and are instead just interested in testing out encryption on an existing host, you can use the following commands.  You will still need to make sure you follow the steps above to generate all of the necessary certs!

Manually start etcd2 with server certificate:

etcd2 -name infra0 -data-dir infra0 \
  -cert-file=/home/core/server.crt -key-file=/home/core/server.key \
  -advertise-client-urls=https://<server_ip>:2379 -listen-client-urls=https://<server_ip>:2379

and to test the connection use the following curl command.

curl --cacert /home/core/ca.crt https://172.17.8.101:2379/v2/keys/foo -XPUT -d value=bar -v

Manually start etcd2 with client certificate:

etcd2 -name infra0 -data-dir infra0 \
  -client-cert-auth -trusted-ca-file=/home/core/ca.crt \
  -cert-file=/home/core/server.crt -key-file=/home/core/server.key \
  -advertise-client-urls https://<server_ip>:2379 -listen-client-urls https://<server_ip>:2379

Similar to the above command you will just need to add the client certs to authenticate.

curl --cacert /home/core/ca.crt --cert /home/core/client.crt --key /home/core/client.key \
  -L https://<server_ip>:2379/v2/keys/foo -XPUT -d value=bar -v

Another way to test the certs is by using the etcdctl tool with a few extra flags.

etcdctl --ca-file ca.crt --cert-file client.crt --key-file client.key --peers https://<server_ip>:2379 set /foo bar

etcdctl --ca-file ca.crt --cert-file client.crt --key-file client.key --peers https://172.17.8.101:2379 get /foo

Encrypting etcd was a confusing process to me at first due to the complexity of encryption, but after working through the above examples most of the process made sense.  I seem to have a hard time wrapping my head around all of the different parts, so hopefully I have effectively shown how the encryption component works.

The etcd-ca tool is very nice for testing because it is simple and straightforward, but it lacks a few features of a full-fledged CA.  I suggest looking at something like OpenSSL for a production type scenario, especially if things like certificate revocation are important.
