Oliver's Blog

Docker, IT Automation & more…

CoreOS Cluster Discovery & Troubleshooting


This post explores CoreOS cluster discovery problems and how to troubleshoot them. It is part 8 of my DDDocker (a dummy’s docker diary) series. Since it is one of my more popular posts on LinkedIn, I have decided to move it to my WordPress site.

In the DDDocker (7) LinkedIn post I showed how to easily set up a CoreOS cluster using Vagrant. However, I experienced problems with the health of the CoreOS cluster, caused by the presence of an HTTP proxy. This post explores CoreOS cluster health and the corresponding troubleshooting steps.

Based on what we have learned in DDDocker (7) and in the current post, we can install a healthy CoreOS cluster in roughly 10 to 15 minutes.

Version History

v1 (2015-07-21): Manual Cluster Discovery & Troubleshooting
v2 (2015-07-27): Added the Appendix “Healthy Docker (CoreOS) Cluster in less than 10 Minutes”
v3 (2015-08-19): Moved the Appendix “Healthy Docker (CoreOS) Cluster in less than 10 Minutes” to an independent blog post on WordPress
v4 (2016-06-03): Moved the article to WordPress and rewrote the introduction

Cluster Discovery

In the DDDocker (7) LinkedIn post I created a CoreOS cluster with three cluster nodes. This is our starting point for the current post.

Before following the instructions in the section “Process Management with fleet” of the CoreOS Quick Start Guide, let us test whether the cluster members are aware of each other.

Checking the Cluster Health

We will see below that the cluster is not healthy, since I have deployed it behind an HTTP proxy.

Connect to the first node using “vagrant ssh core-01”, or via PuTTY as described in the DDDocker (7) post. Within the terminal window, type:

fleetctl list-machines

Problems with Cluster Initialization behind an HTTP Proxy

In my case, this does not work.

The reason is that fleet depends on etcd, and the nodes cannot connect to the public etcd discovery server at https://discovery.etcd.io, which is configured in the user-data file.

This can be seen with

journalctl -b -u etcd | less

where we see a note saying “failed to connect discovery service[https://discovery.etcd.io/e86ff40c2076ccf901fdc0681f11417e]”.

The problem: I performed the installation behind an HTTP proxy, and etcd discovery does not yet support communication via HTTP proxies (a corresponding change request for CoreOS exists).

This is why the discovery server has not discovered any nodes on my token, which I can see in a browser at https://discovery.etcd.io/e86ff40c2076ccf901fdc0681f11417e.
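The browser check on the discovery URL can also be scripted from a machine that has Internet access. The following is only a minimal sketch: the function names are my own, and it assumes that the discovery service answers with the JSON layout of the etcd v2 keys API, that is, a top-level "node" whose "nodes" list holds one entry per registered cluster member.

```python
import json
import urllib.request


def registered_peers(discovery_json):
    """Return the values registered under a discovery token.

    Assumption: the response follows the etcd v2 keys-API layout,
    {"node": {"nodes": [{"value": ...}, ...]}}.
    """
    node = json.loads(discovery_json).get("node", {})
    # A token nobody has registered on carries no "nodes" list at all.
    return [child.get("value", "") for child in node.get("nodes", [])]


def fetch_discovery(url, timeout=10):
    """Download the raw JSON behind a discovery URL.

    urlopen honours http_proxy/https_proxy, just like a browser that is
    configured for the proxy would.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling registered_peers(fetch_discovery(url)) with the discovery URL from the user-data file should return an empty list as long as no node has managed to register, and one entry per node once discovery has worked.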

Workaround: Connect to the Internet Temporarily

There are two possible workarounds/solutions:

  1. connect to the Internet temporarily (for discovery process only)
  2. add a local etcd discovery agent as suggested on http://stackoverflow.com/questions/25019355/coreos-fleetctl-list-machines-show-error.

In my case, I have chosen option 1, using the hotspot function of my mobile phone. Only seconds after connecting to the Internet, the cluster nodes are seen on the public discovery agent, the cluster turns out to be “healthy”, and fleetctl also works fine on core-01.

All cluster nodes are visible and healthy. The same picture is seen on all three cluster nodes, and it does not change if I cut the Internet connection and hide behind the HTTP proxy again.
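If you prefer to check the result in a script rather than by eye, the output of fleetctl list-machines can be parsed. This is a rough sketch with hypothetical function names; it assumes the usual table layout with a MACHINE / IP / METADATA header row followed by one line per machine, and it treats any text without that header (an error message, for example) as "no machines".

```python
def parse_list_machines(output):
    """Turn `fleetctl list-machines` text into (machine_id, ip) tuples.

    Assumption: the first non-empty line is the MACHINE/IP/METADATA header.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    # Without the expected header the text is not a machine table.
    if not rows or rows[0][:2] != ["MACHINE", "IP"]:
        return []
    return [(fields[0], fields[1]) for fields in rows[1:] if len(fields) >= 2]


def all_nodes_present(output, expected=3):
    """True if the table lists exactly the expected number of machines."""
    return len(parse_list_machines(output)) == expected
```

The text to parse can be captured on each node in turn, for example by running the command via “vagrant ssh core-01” and saving its output; for the three-node cluster of this post, all_nodes_present should then hold on every node.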

Summary

We have seen that cluster discovery of a CoreOS cluster depends on the etcd discovery service. The easiest way to set up etcd is to use the public discovery service at https://discovery.etcd.io. However, you need to make sure that each cluster node can reach this service without having to pass an HTTP proxy, since etcd discovery does not support proxies yet. Even if curl or wget works, etcd discovery will still fail.
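A successful curl or wget can be misleading here, because both tools pick up the proxy from environment variables such as http_proxy and https_proxy. A test that deliberately ignores the proxy gives a better picture of what etcd discovery has to cope with. The following is a minimal sketch under that assumption; the function names are illustrative and not part of any CoreOS tooling.

```python
import os
import urllib.error
import urllib.request

PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def configured_proxies(environ=None):
    """Return the proxy variables that are set in the given environment."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in PROXY_VARS if environ.get(name)}


def reachable_without_proxy(url, timeout=5):
    """Try to open url while ignoring every proxy setting."""
    # An empty ProxyHandler switches off proxy auto-detection for this opener.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered with an error status, so the path itself works.
        return True
    except OSError:
        # Covers URLError, refused connections and timeouts.
        return False
```

If configured_proxies() returns anything and reachable_without_proxy("https://discovery.etcd.io") returns False, the machine most likely only reaches the Internet through the proxy, which is the situation in which discovery failed for me.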

There are two ways to resolve the etcd discovery issue:

  1. Temporarily connect to the Internet (shown in this post).
  2. Create your own etcd discovery service, as described on http://stackoverflow.com/questions/25019355/coreos-fleetctl-list-machines-show-error (not tested on my side). For etcd newbies like me, https://github.com/coreos/etcd/issues/1404 also seems to give some insight.

The temporary connection to the Internet worked like a charm, and the cluster is still healthy after disconnecting the cluster from the Internet.
