Build a Pine64 Kubernetes Cluster with k3os

July 14, 2019July 14, 2019 Josh Reichardt Leave a comment

Kubernetes (k3os) arm64 cluster with custom 3D printed case

The k3os project was recently announced and I finally got a chance to test it out. k3os greatly simplifies the steps needed to create a Kubernetes cluster along with its counterpart, k3s, to reduce the overhead of running Kubernetes clusters. Paired with Rancher for the UI, all of these components make for an even better option. You can even run Rancher in your (arm64) k3os cluster via the Rancher Helm chart now.

Instead of using Etcd, k3s opts to use SQLite by default and does some other magic to reduce extra Kubernetes bloat and simplify management. Check here for more about k3s and how it works and how to run it.

k3os replaces some complicated OS components with much simpler ones. For example, instead of using Systemd it uses OpenRC, instead of Docker it uses containerd, it also leverages connman for configuring network components and it doesn’t use a package manager.

The method I am showing in this post uses the k3os overlay installation, which is detailed here. The reason for this choice is because the pine64 boards use u-boot to boot the OS and so special steps are needed to accommodate for the way this process is handled. The upside of this method is that these instructions should pretty much work for any of the Pi form factor boards, including the newly released Raspberry Pi4, with minimal changes.

Setup

If you haven’t downloaded and imaged your Pine64 yet, I like to use the ayufan images, which can be found here. You can easily write these images to a microSD card on OSX using something like Etcher.

Assuming the node is connected to your network, you can SSH into it.

ssh rock64@rock64 # or use the ip, password is rock64

When using the overlay installation, the first step is to download the k3os rootfs and lay it down on the host. This step applies to all nodes in the cluster.

curl -sL https://github.com/rancher/k3os/releases/download/v0.2.1/k3os-rootfs-arm64.tar.gz | tar --strip-components=1 -zxvf - -C

The above command is installing v0.2.1 which is the most current version as of writing this, so make sure to check if there is a newer version available.

After installing, lay down the following configurations into /k3os/system/config.yaml, modifying as needed. After the machine is rebooted this path will become read only so if you need to change the configuration again you will need to edit /etc/fstab to change the location to be writable again.

Server node

ssh_authorized_keys:
- ssh-rsa <your-public-ssh-key-to-login>
hostname: k3s-master

k3os:
  data_sources:
  - cdrom
  dns_nameservers:
  - 192.168.1.1 # update this to your local or public DNS server if needed
  ntp_servers:
  - 0.us.pool.ntp.org
  - 1.us.pool.ntp.org
  password: rancher
  token: <TOKEN>

The k3s config will be written to /etc/rancher/k3s/k3s.yaml on this node so make sure to grab it if you want to connect the cluster from outside this node. Reboot the machine to boot to the new filesystem and you should be greeted with the k3os splash screen.

Agent node

The agent uses nearly the same config, with the addition of the server_url. Just point the agent nodes to the server/master and you should be good to go. Again, reboot after creating the config and the host should boot to the new filesystem and everything should be ready.

ssh_authorized_keys:
- ssh-rsa <your-public-ssh-key-to-login>
hostname: k3s-node-1

k3os:
  data_sources:
  - cdrom
  dns_nameservers:
  - 192.168.1.1 # update this to your local or public DNS server if needed
  ntp_servers:
  - 0.us.pool.ntp.org
  - 1.us.pool.ntp.org
  password: rancher
  server_url: https://<server-ip-or-hostname>:6443
  token: <TOKEN>

You can do a lot more with the bootstrap configurations, such as setting labels or environment variables. Some folks in the community have had luck getting the wifi configuration working on the RPi4’s out of the box, but I haven’t been able to get it to work yet on my Pine64 cluster. Check the docs for more details on the various configuration options.

After the nodes have been rebooted and configs applied, the cluster “should just work”. You can check that the cluster is up using k3s using the kubectl passthrough command (checking from the master node below).

k3s-master [~]$ k3s kubectl get nodes
NAME         STATUS   ROLES    AGE     VERSION
k3s-master   Ready    <none>   6d1h    v1.14.1-k3s.4
k3s-node-1   Ready    <none>   7m10s   v1.14.1-k3s.4

NOTE: After installing the overlay filesystem there will be no package manager and no obvious way to upgrade the kernel so use this guide only for testing purposes. The project is still very young and a number of things still need to be added, including update mechanisms and HA. Be sure to follow the k3os issue tracker and Rancher Slack (#k3os channel) for updates and developments.

Conclusion

This is easily the best method I have found so far for getting a Kubernetes cluster up and running, minus the few caveats mentioned above, which I believe will be resolved very soon. I have been very impressed with how simple and easy it has been to configure and use. The next step for me is to figure out how to run Rancher and start working on some configurations for running workloads on the cluster. I will share more on that project in another post.

There are definitely some quirks to getting this setup working for the Pi and Pine64 based boards, but aren’t major problems by any means.

References

This post was heavily inspired by this gist for getting the overlay installation method working on Raspberry Pi.

Test Rancher 2.0 using Minikube

November 29, 2017 Josh Reichardt 1 Comment

If you haven’t heard yet, Rancher recently revealed news that they will be building out a new v2.0 of their popular container orchestration and management platform to be built specifically to run on top of Kubernetes. In the container realm, Kubernetes has recently become a clear favorite in the battle of orchestration and container management. There are still other options available, but it is becoming increasingly clear that Kubernetes has the largest community, user base and overall feature set so a lot of the new developments are building onto Kubernetes rather than competing with it directly. Ultimately I think this move to build on Kubernetes will be good for the container and cloud community as companies can focus more narrowly now on challenges tied specifically around security, networking, management, etc, rather than continuing to invent ways to run containers.

With Minikube and the Docker for Mac app, testing out this new Rancher 2.0 functionality is really easy. I will outline the (rough) process below, but a lot of the nuts and bolts are hidden in Minikube and Rancher. So if you’re really interested in learning about what’s happening behind the scenes, you can take a look at the Minikube and Rancher logs in greater detail.

Speaking of Minkube and Rancher, there are a few low level prerequisites you will need to have installed and configured to make this process work smoothly, which are listed out below.

Prerequisites

Tested on OSX
Get Minikube working – use the Kubernetes/Minikube notes as a reference (you may need to bump memory to 4GB)
Working version of kubectl
Install and configure docker for mac app

I won’t cover the installation of these perquisites, but I have blogged about a few of them before and have provided links above for instructions on getting started if you aren’t familiar with any of them.

Get Rancher 2.0 working locally

The quick start guide on the Rancher website has good details for getting this working – http://rancher.com/docs/rancher/v2.0/en/quick-start-guide/. On OSX you can use the Docker for Mac app to get a current version of Docker and compose. After Docker is installed, the following command will start the Rancher container for testing.

docker run -d --restart=unless-stopped -p 8080:8080 --name rancher-server rancher/server:preview

Check that you can access the Rancher 2.0 UI by navigating to http://localhost:8080 in your browser.

If you wanted to dummy a host name to make access a little bit easier you could just add an extra entry to /etc/hosts.

Import Minikube

You can import an existing cluster into the Rancher environment. Here we will import the local Minikube instance we got going earlier so we can test out some of the new Rancher 2.0 functionality. Alternately you could also add a host from a cloud provider.

In Rancher go to Hosts, Use Existing Kubernetes.

Then grab the IP address that your local machine is using on your network. If you aren’t familiar, on OSX you can reach into the terminal and type “ifconfig” and pull out the IP your machine is using. Also make sure to set the port to 8080, unless you otherwise modified the port map earlier when starting Rancher.

Registering the host will generate a command to run that applies configuration on the Kubernetes cluster. Just copy this kubectl command in Rancher and run it against your Minikube machine.

The above command will join Minikube into the Rancher environment and allow Rancher to manage it. Wait a minute for the Rancher components (mainly the rancher-agent continer/pod) to bootstrap into the Minikube environment. Once everything is up and running, you can check things with kubectl.

kubectl get pods --all-namespaces | grep rancher

Alternatively, to verify this, you can open the Kubernetes dashboard with the “minikube dashboard” command and see the rancher-agent running.

On the Rancher side of things, after a few minutes, you should see the Minikube instance show up in the Rancher UI.

That’s it. You now have a working Rancher 2.0 instance that is connected to a Kubernetes cluster (Minikube). Getting the environment to this point should give you enough visibility into Rancher and Kubernetes to start tinkering and learning more about the new features that Rancher 2.0 offers.

The new Rancher 2.0 UI is nice and simplifies a lot of the painful aspects of managing and administering a Kubernetes cluster. For example, on each host, there are metrics for memory, cpu, disk, etc. as well as specs about the server and its hardware. There are also built in conveniences for dealing with load balancers, secrets and other components that are normally a pain to deal with. While 2.0 is still rough around the edges, I see a lot of promise in the idea of building a management platform on top Kubernetes to make administrative tasks easier, and you can still exec to the container for the UI and check logs easily, which is one of my favorite parts about Rancher. The extra visualization is a nice touch for folks that aren’t interested in the CLI or don’t need to know how things work at a low level.

When you’re done testing, simply stop the rancher container and start it again whenever you need to test. Or just blow away the container and start over if you want to start Rancher again from scratch.

Tips for monitoring Rancher Server

April 17, 2017April 17, 2017 Josh Reichardt Leave a comment

Last week I encountered an interesting bug in Rancher that managed to cause some major problems across my Rancher infrastructure. Basically, the bug was causing of the Rancher agent clients to continuously bounce between disconnected/reconnected/finished and reconnecting states, which only manifested itself either after a 12 hour period or by deactivating/activating agents (for example adding a new host to an environment). The only way to temporarily fix the issue was to restart the rancher-server container.

With some help, we were eventually able to resolve the issue. I picked up a few nice lessons along the way and also became intimately acquainted with some of the inner workings of Rancher. Through this experience I learned some tips on how to effectively monitor the Rancher server environment that I would otherwise not have been exposed to, which I would like to share with others today.

All said and done, I view this experience as a positive one. Hitting the bug has not only helped mitigate this specific issue for other users in the future but also taught me a lot about the inner workings of Rancher. If you’re interested in the full story you can read about all of the details about the incident, including steps to reliably reproduce and how the issue was ultimately resolved here. It was a bug specific to Rancher v1.5.1-3, so upgrading to 1.5.4 should fix this issue if you come across it.

Before diving into the specifics for this post, I just want to give a shout out to the Rancher community, including @cjellik, @ibuildthecloud, @ecliptok and @shakefu. The Rancher developers, team and community members were extremely friendly and helpful in addressing and fixing the issue. Between all the late night messages in the Rancher slack, many many logs, countless hours debugging and troubleshooting I just wanted to say thank you to everyone for the help. The small things go a long way, and it just shows how great the growing Rancher community is.

Effective monitoring

I use Sysdig as the main source of container and infrastructure monitoring. To accomplish the metric collection, I run the Sysdig agent as a systemd service when a server starts up so when a server dies and goes away or a new one is added, Sysdig is automatically started up and begins dumping that metric data into the Sysdig Cloud for consumption through the web interface.

I have used this data to create custom dashboards which gives me a good overview about what is happening in the Rancher server environment (and others) at any given time.

The other important thing I discovered through this process, was the role that the Rancher database plays. For the Rancher HA setup, I am using an externally hosted RDS instance for the Rancher database and was able to fine found some interesting correlations as part of troubleshooting thanks to the metrics in Sysdig. For example, if the database gets stressed it can cause other unintended side effects, so I set up some additional monitors and alerts for the database.

Luckily Sysdig makes the collection of these additional AWS metrics seamless. Basically, Sysdig offers an AWS integration which pull in CloudWatch metrics and allows you to add them to dashboards and alert on them from Sysdig, which has been very nice so far.

Below are some useful metrics in helping diagnose and troubleshoot various Rancher server issues.

Memory usage % (server)
CPU % (server)
Heap used over time (server)
Number of network connections (server)
Network bytes by application (server)
Freeable memory over time (RDS)
Network traffic over time (RDS)

As you can see, there are quite a few things you can measure with metrics alone. Often though, this isn’t enough to get the entire picture of what is happening in an environment.

Logs

It is also important to have access to (useful) logs in the infrastructure in order to gain insight into WHY metrics are showing up the way they do and also to help correlate log messages and errors to what exactly is going on in an environment when problems occur. Docker has had the ability for a while now to use log drivers to customize logging, which has been helpful to us. In the beginning, I would just SSH into the server and tail the logs with the “docker logs” command but we quickly found that to be cumbersome to do manually.

One alternative to tailing the logs manually is to configure the Docker daemon to automatically send logs to a centralized log collection system. I use Logstash in my infrastructure with the “gelf” log driver as part of the bootstrap command that runs to start the Rancher server container, but there are other logging systems if Logstash isn’t the right fit. Here is what the relevant configuration looks like.

...
--log-driver=gelf \
--log-opt gelf-address=udp://<logstash-server>:12201 \
--log-opt tag=rancher-server \
...

Just specify the public address of the Logstash log collector and optionally add tags. The extra tags make filtering the logs much easier, so I definitely recommend adding at least one.

Here are a few of the Logstash filters for parsing the Rancher logs. Be aware though, it is currently not possible to log full Java stack traces in Logstash using the gelf input.

if [tag] == "rancher-server" {
    mutate { remove_field => "command" }
    grok {
      match => [ "host", "ip-(?<ipaddr>\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})" ]
    }

    # Various filters for Rancher server log messages
    grok {
     match => [ "message", "time=\"%{TIMESTAMP_ISO8601}\" level=%{LOGLEVEL:debug_level} msg=\"%{GREEDYDATA:message_body}\"" ]
     match => [ "message", "%{TIMESTAMP_ISO8601} %{WORD:debug_level} (?<context>\[.*\]) %{GREEDYDATA:message_body}" ]
     match => [ "message", "%{DATESTAMP} http: %{WORD:http_type} %{WORD:debug_level}: %{GREEDYDATA:message_body}" ]
   }
 }

There are some issues open for addressing this, but it doesn’t seem like there is much movement on the topic, so if you see a lot of individual messages from stack traces that is the reason.

One option to mitigate the problem of stack traces would be to run a local log collection agent (in a container of course) on the rancher server host, like Filebeat or Fluentd that has the ability to clean up the logs before sending it to something like Logstash, ElasticSearch or some other centralized logging. This approach has the added benefit of adding encryption to the logs, which GELF does not have (currently).

If you don’t have a centralized logging solution or just don’t care about rancher-server logs shipping to it – the easiest option is to tail the logs locally as I mentioned previously, using the json-file log format. The only additional configuration I would recommend to the json-file format is to turn on log rotation which can be accomplished with the following configuration.

...
 --log-driver=json-file \
 --log-opt max-size=100mb \
 --log-opt max-file=2 \
...

Adding these logging options will ensure that the container logs for rancher-server will never full up the disk on the server.

Bonus: Debug logs

Additional debug logs can be found inside of each rancher-server container. Since these debug logs are typically not needed in day to day operations, they are sort of an easter egg, tucked away. To access these debug logs, they are located in /var/lib/cattle/logs/ inside of the rancher-server container. The easiest way to analyze the logs is to get them off the server and onto a local machine.

Below is a sample of how to do this.

docker exec -it <rancher-server> bash
cd /var/lib/cattle/logs
cp cattle-debug.log /tmp

Then from the host that the container is sitting on you can docker cp the logs out of the container and onto the working directory of the host.

docker cp <rancher-server>:/tmp/cattle-debug.log .

From here you can either analyze the logs in a text editor available on the server, or you can copy the logs over to a local machine. In the example below, the server uses ssh keys for authentication and I chose to copy the logs from the server into my local /tmp directory.

 scp -i ~/.ssh/<rancher-server-pem> user@rancher-server:/tmp/cattle-debug.log /tmp/cattle-debug.log

With a local copy of the logs you can either examine the logs using your favorite text editor or you can upload them elsewhere for examination.

Conclusion

With all of our Rancher server metrics dumping into Sysdig Cloud along with our logs dumping into Logstash it has made it easier for multiple people to quickly view and analyze what was going on with the Rancher servers. In HA Rancher environments with more than one rancher-server running, it also makes filtering logs based on the server or IP much easier. Since we use 2 hosts in our HA setup we can now easily filter the logs for only the server that is acting as the master.

As these container based grow up, they also become much more complicated to troubleshoot. With better logging and monitoring systems in place it is much easier to tell what is going on at a glance and with the addition of the monitoring solution we can be much more proactive about finding issues earlier and mitigating potential problems much faster.

awesome-rancher

March 15, 2017 Josh Reichardt 2 Comments

Awesome lists are a newish phenomenon for organizing links and information that have been gaining popularity on Github. You can read more about awesome lists here, but essentially they are curated lists of helpful links and other various resources that are put together by members of their communities.

There are lots of awesome lists these days on Github, and there are even awesome lists for non-tech categories, including recipes and video games. Oddly enough though through my own searching, I quickly discovered that there was no awesome list for Rancher.

So I decided to give a little bit back to the Rancher community in my own way. If you aren’t aren’t familiar with Rancher, it is essentially a platform and orchestration engine for container based workloads. Out of all of the container management and orchestration platforms that have been growing out of the container popularity explosion, using Rancher has definitely the most pleasant experience and easiest to use. Also, if you are new to Rancher, please go check out the list on Github for more information, that is what it is there for.

Check out the awesome-rancher project here.

The goal for me and for the project is to make awesome-rancher the go to resource for finding useful information. Therefore, I want awesome-rancher to be as easy to use and as complete as possible so new users have a good jumping off place when they decide to dive into Rancher.

If you are a Rancher community member or even just notice a key resource missing from the list, please either let me know with a comment here, on twitter/facebook, or just open a PR on the project itself (which is probably the easiest).

Configure a Rancher HAProxy health check

March 1, 2017 Josh Reichardt Leave a comment

If you are familiar with ELB/ALB you will know that there are slight idiosyncrasies between the two. For example, ELB allows you to health check a back end server by TCP port. Basically allowing the user to check if a back end comes up and is listening on a specified port. ALB is slightly different in its method for health checking. ALB uses HTTP checks (layer 7) to ensure back end instances are up and listening.

This becomes a problem in Rancher, when you have multiple stacks in a single environment that are fronted by the Rancher HAProxy load balancer. By default, the HAProxy config does not have a health check endpoint configured, so ALB is never able to know if the back end server is actually up and listening for requests.

A colleague and I recently discovered a neat trick for solving this problem if you are fronting your environment with an ALB. The solution to this conundrum is to sprinkle a little bit of custom configuration to the Rancher HAProxy config.

In Rancher, you can modify the live settings without downtime. Click on the load balancer that sits behind the ALB and navigate to the Custom haproxy.cfg tab.

Modify the HAProxy config by adding the following:

# Use to report haproxy's status
defaults
    mode http
    monitor-uri /_ping

Click the “Edit” button to apply these changes and you should be all set.

Next, find the health check configuration for the associated ALB in the AWS console and add a check the the /_ping path on port 80 (or whichever port you are exposing/plan to listen on). It should look similar to the following example.

Below is an example that maps a DNS name to an internal Nginx container that is listening for requests on port 80.

The check in ALB ensures that the HAProxy load balancer in Rancher is up and running before allowing traffic to be routed to it. You can verify that your Rancher load balancer is working if the instances behind your ALB start showing a status of healthy in the AWS console.

NOTE: If you don’t have any apps initially behind the Rancher load balancer (or that are listening on the port specified in the health check) the AWS instances behind ALB will remain unhealthy until you add configuration in Rancher for the stacks to be exposed, as pictured above.

After setting up HAProxy, publicly accessible services in private Rancher environments can easily be managed by updating the HAProxy config. Just add a dns name and a service to link to and HAProxy is able to figure out how and where to route requests to. To map other services that aren’t listening on port 80, the process is very similar. Use the above as a guideline and simply update the target port to whichever port the app is listening on internally.