ELK stack

Performance tuning ELK stack

Building on my previous post describing how to quickly set up a centralized logging solution using the ElasticSearch + Logstash + Kibana (ELK) stack, we have a fully functioning, Docker based ELK stack agreggating and handling our logs.  The only problem is that performance is either waaay too slow or the stack seems to crash.  As I worked through this problem myself, I found a few settings that vastly improved the stability and performance of my ELK stack.

So in this post I will share some of the useful tweaks and changes that worked in my own environment to help squeeze additional performance out of the setup.

Hopefully these adjustments will help others!

Adjusting Logstash

Out of the box, Logstash does a pretty good job of setting things up with reasonable default values.  The main adjustment that I have found to be useful is setting the default number of Logstsash “workers” when the Logstash process starts.  A good rule of thumb would be one worker per CPU.  So if the server has 4 cpu’s, your Logstash start up command would look similar to the following.

/opt/logstash/bin/logstash --verbose -w 4 -f /etc/logstash/server.conf

The important bit to note is the “-w 4” part.  A poorly configured server.conf file may also lead to performance issues but that will be very specific to the user.  For the most part, unless the configuration contains many conditionals and expensive codec calls or excessive filtering, performance here should be stable.

If you are concerned about utilization I recommend watching cpu and memory consumption by the Logstash process, signs that there could be a configuration issue would be cpu maxing out.

The main thing to be aware of with a custom number of workers is that some codecs may not work properly because they are not thread safe.  If you are using the “multiline” codec in any of your inputs then you will not be able to leverage multiple workers, or if you are using multiple workers you won’t be able to use the codec until the thread safe problems have been fixed.  The good news is that this is a known issue and is being worked on, hopefully will be fixed by the time 1.5.0 is released.  It tripped me up initially and so I thought I would mention the issue.

Increase Java heap memory

It turns out that ElasticSearch is a bit of a memory hog once you start actually sending data through Logstash to have ES consume.  In my own testing, I discovered that logs would mysteriously stop being recorded in to ES and consequently would fail to show up in my dashboards.

The first tweak to make is to increase the amount of memory available to Java to process ES indices.  I have discovered that there is a script that ES uses to load up Java when it starts, which is passing in a value of 1GB of RAM to start.

After some digging, I discovered that the default ES configuration I was using was quickly running out of memory and was crashing because the ES heap memory couldn’t keep up with the load (mostly indexes).

Here is a sample of the errors I was seeing when ES and Logstash stopped processing logs.

message [out of memory][OutOfMemoryError[Java heap space]]

This was a good starting point for investigating.  Basically, what this means is that the ES container had a Java heap memory setting set to 1GB which was exhausting the the memory allocated to ES, even though there was much more memory available on the server.

To increase this memory limit, we will override this script with our own custom values.

This script is called “elasticsearch.sh.in” and we will be overriding the default value ES_MAX_MEM with a value of “4g” as show below.  The general guideline that has been recommended is to use a value here of about 50% of the total amount of memory.  So if your server has 8GB of memory then setting it to 4GB here will be the 50% we are looking for.

There are many other options that can be overridden but the most import value is the max memory value that we have updated.

We can inject this custom value as an environment variable in our Dockerfile which makes managing custom configurations much easier if we need to make additions or adjustments later on.

ENV ES_HEAP_SIZE=8g

I am posting the script that sets the values below as a reference in case there are other values you need to override.  Again, we can use use environmental variables to set these up in our Dockefile if needed.

#!/bin/sh

ES_CLASSPATH=$ES_CLASSPATH:$ES_HOME/lib/elasticsearch-1.5.0.jar:$ES_HOME/lib/*:$ES_HOME/lib/sigar/*

if [ "x$ES_MIN_MEM" = "x" ]; then
 ES_MIN_MEM=256m
fi
if [ "x$ES_MAX_MEM" = "x" ]; then
 ES_MAX_MEM=1g
fi
if [ "x$ES_HEAP_SIZE" != "x" ]; then
 ES_MIN_MEM=$ES_HEAP_SIZE
 ES_MAX_MEM=$ES_HEAP_SIZE
fi

# min and max heap sizes should be set to the same value to avoid
# stop-the-world GC pauses during resize, and so that we can lock the
# heap in memory on startup to prevent any of it from being swapped
# out.
JAVA_OPTS="$JAVA_OPTS -Xms${ES_MIN_MEM}"
JAVA_OPTS="$JAVA_OPTS -Xmx${ES_MAX_MEM}"

# new generation
if [ "x$ES_HEAP_NEWSIZE" != "x" ]; then
 JAVA_OPTS="$JAVA_OPTS -Xmn${ES_HEAP_NEWSIZE}"
fi

# max direct memory
if [ "x$ES_DIRECT_SIZE" != "x" ]; then
 JAVA_OPTS="$JAVA_OPTS -XX:MaxDirectMemorySize=${ES_DIRECT_SIZE}"
fi

# set to headless, just in case
JAVA_OPTS="$JAVA_OPTS -Djava.awt.headless=true"

# Force the JVM to use IPv4 stack
if [ "x$ES_USE_IPV4" != "x" ]; then
 JAVA_OPTS="$JAVA_OPTS -Djava.net.preferIPv4Stack=true"
fi

JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"

JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

# GC logging options
if [ "x$ES_USE_GC_LOGGING" != "x" ]; then
 JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails"
 JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCTimeStamps"
 JAVA_OPTS="$JAVA_OPTS -XX:+PrintClassHistogram"
 JAVA_OPTS="$JAVA_OPTS -XX:+PrintTenuringDistribution"
 JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCApplicationStoppedTime"
 JAVA_OPTS="$JAVA_OPTS -Xloggc:/var/log/elasticsearch/gc.log"
fi

# Causes the JVM to dump its heap on OutOfMemory.
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
# The path to the heap dump location, note directory must exists and have enough
# space for a full heap dump.
#JAVA_OPTS="$JAVA_OPTS -XX:HeapDumpPath=$ES_HOME/logs/heapdump.hprof"

# Disables explicit GC
JAVA_OPTS="$JAVA_OPTS -XX:+DisableExplicitGC"

# Ensure UTF-8 encoding by default (e.g. filenames)
JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"

Then when Elasticsearch starts up it will look for this custom configuration script and start Java with the desired 4GB of memory.  This is one easy way to squeeze performance out of your server without making any other changes or modifying your server.

Modify Elasticsearch configuration

This one is also very easy to put in to place.  We are already using the elasticsearch.yml so the only thing that needs to be done is to create some additional configurations to this file, rebuild the container, and restart the ES container with the updated values.

A good setting to configure to help control ES memory usage is to set the indices field cache size.  Limiting this indices cache size makes sense because you rarely need to retrieve logs that are older than a few days.  By default ES will hold old indices in memory and will never let them go.  So unless you have unlimited memory than it makes sense to limit the memory in this scenario.

To limit the cache size simply add the following value anywhere in your custom elasticsearch.yml configuration file.  This setting and adjusting the Java heap memory size should be enough to get started but there are a few other things that might be worth checking.

indices.fielddata.cache.size:  40%

If you only make one change, add this line to your ES configuration!  This setting will let go of the oldest indices first so you won’t be dropping new indices, 9/10 times this is probably what you want when accessing data in Logstash.  More information about controlling memory usage can be found here.

Another idea worth looking at for an easy performance boost would be disabling swap if it has been enabled.  Again, in most cloud environment and images swap is turned off, but it is always a setting worth checking.

To bypass the OS swap setting you can simply configure a no swap value in ES by adding the following to your elasticsearch.yml configurtion file.

bootstrap.mlockall: true

To check that this has value has been configured properly you can run this command.

curl http://localhost:9200/_nodes/process?pretty

This may cause memory warnings when ES starts up (eg, nuable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out. Increase RLIMIT_MEMLOCK (ulimit).) but you should be able to ignore these warnings.  If you are concerned, turn these limits off at the OS level which is demonstrated below.

Misc

Other low hanging fruit includes disabling open file limits on the OS.  ES can run in to problems if there is a cap on the amount of files that its processes can open or have open at a time.  I have run in to open file limit issues before and they are never fun to deal with.  This shouldn’t be an issue if you are running ES in a Docker container with the Ubuntu 14.04 base image.

If you aren’t sure about the open file limits for ES you can run the following command to get a better idea of the current limits.

ulimit -n

Make sure both the soft and hard limits are either set to unlimited or to an extremely high number.

This should take care of most if not all of the stability issues.  After putting these changes in place in my own environment I went from multiple crashes per day to so far none in over a week.  If you are still experiencing issues you might want to take a look at the ES production deployment guide for help or the #logstash and #elasticsearch IRC channels on freenode.

Josh Reichardt

Josh is the creator of this blog, a system administrator and a contributor to other technology communities such as /r/sysadmin and Ops School. You can also find him on Twitter and Facebook.