Docker Swarm with Ansible - a late #swarmweek entry

Introduction

In this tutorial we'll learn how to set up a Docker Swarm cluster using Ansible to orchestrate the basics.

Provision some nodes

First up, let's provision some machines to set this up on. I'm using DigitalOcean in this example, but you can use whichever cloud provider you like.

I've created three nodes:

  1. Manager
  2. Replica
  3. Node

I have the following inventory file to represent these:

./swarm_cluster

[node]
45.55.144.228  private_ip=10.132.48.106 ansible_ssh_user=root

[manager]
104.236.61.150  private_ip=10.132.44.190 ansible_ssh_user=root

[replica]
159.203.91.233 private_ip=10.132.48.105 ansible_ssh_user=root

(for simplicity, I'm using the actual IPs. Don't worry, this swarm no longer exists ;) ).
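Before going any further, it's worth checking that Ansible can actually reach all three machines (assuming your SSH key is already on them):

ansible all -i swarm_cluster -m ping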

We'll basically be following along with the official tutorial: Build a Swarm cluster for production

First up, let's bootstrap our nodes. We'll need Docker Engine and Consul on each one. I'm not going to go into detail on these roles, but you will find them in the accompanying GitHub repo for this tutorial, and the rest of the post assumes you have them available.
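For reference, the playbook below assumes a fairly conventional layout, with the roles sitting next to it - roughly like this (the exact contents live in the accompanying repo):

swarm.yml
swarm_cluster        # the inventory file above
roles/
  ubuntubase/
  docker/
  consul/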

Install the basics

swarm.yml


- hosts: 
  - all
  vars:
    - initial_cluster_size: 3
  pre_tasks:
    - name: Install ansible requirements
      pip:
        name: "docker-py"
        state: present
      tags:
        - swarm
  roles:
    - ubuntubase
    - docker
    - consul

  • initial_cluster_size is used to help Consul bootstrap its cluster. Set this to the size of your initial cluster (see the sketch below for how the consul role might use it).
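Inside the consul role that variable ends up as Consul's -bootstrap-expect flag. A minimal sketch of what the relevant template line might look like (the actual role in the repo may differ):

# roles/consul/templates/consul.upstart.j2 (hypothetical path)
exec /usr/local/bin/consul agent \
    -data-dir="/tmp/consul" -ui -bind={{ private_ip }} -client=0.0.0.0 \
    -bootstrap-expect {{ initial_cluster_size }} \
    -server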

and run the playbook:

ansible-playbook swarm.yml -i swarm_cluster

Set up the Consul cluster

Ok. So now we have Docker and Consul running on each of our servers. Let's SSH in and check what's what with Consul:


ssh root@104.236.61.150
...
# see the members:
root@manager:~# consul members
Node     Address             Status  Type    Build  Protocol  DC
manager  10.132.44.190:8301  alive   server  0.6.3  2         dc1

# only the local node has joined, so we need to manually join the others:

root@manager:~# consul join 10.132.48.106 10.132.48.105
Successfully joined cluster by contacting 2 nodes.

# now we have a consul cluster
root@manager:~# consul members
Node     Address             Status  Type    Build  Protocol  DC
manager  10.132.44.190:8301  alive   server  0.6.3  2         dc1
node     10.132.48.106:8301  alive   server  0.6.3  2         dc1
replica  10.132.48.105:8301  alive   server  0.6.3  2         dc1

You can also run consul monitor to view the logs. You should see that a leader has been elected:


[INFO] consul: adding server foo (Addr: 127.0.0.2:8300) (DC: dc1)
[INFO] consul: adding server bar (Addr: 127.0.0.1:8300) (DC: dc1)
[INFO] consul: Attempting bootstrap with nodes: [127.0.0.3:8300 127.0.0.2:8300 127.0.0.1:8300]
    ...
[INFO] consul: cluster leadership acquired

Notes:

  • Consul is clustered, so it doesn't matter which node you log into.
  • Running consul join joins our nodes into a cluster. This is supposed to happen automatically - but in my case it didn't (see the sketch after this list for one way to automate it).
  • You can run consul monitor to check the logs of our cluster
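If you'd rather automate the join than do it by hand, a rough sketch of an extra task (it could live in the consul role or a small play of its own) would be to shell out to consul join against the other nodes' private IPs:

    # hypothetical task - joins every other node from one host
    - name: Join the consul cluster
      command: "consul join {{ hostvars[item].private_ip }}"
      with_items: "{{ groups['all'] }}"
      when: item != inventory_hostname
      run_once: true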

Important notes about running consul:

If you look at the Upstart script we are using to run Consul, you can see that the command we run is:


/usr/local/bin/consul agent \
    -data-dir="/tmp/consul" -ui -bind=10.132.48.106 -client=0.0.0.0 \
    -bootstrap-expect 3 \
    -server
  • -client=0.0.0.0 is required because Swarm talks to Consul on the address we give it (below) on port 8500. By default Consul binds its client interfaces - including the HTTP API on port 8500 - to the loopback address (127.0.0.1) only, so Swarm cannot reach the Consul server on the private IP, no leader can be assigned, and basically it won't work. Providing -client=0.0.0.0 means that Swarm can communicate on <internal_ip>:8500. To quote from someone smarter than me:

Finally, the client_addr line tells Consul to listen on all interfaces (not just loopback, which is the default).
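A quick way to confirm that Consul's HTTP API really is reachable on the private interface (and not just on loopback) is to hit the status endpoint from any node:

root@manager:~# curl http://10.132.44.190:8500/v1/status/leader
# should print the current leader's address, e.g. "10.132.44.190:8300"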

Check out the consul UI

The Upstart script in our Consul setup also enables the Consul UI on every node. To view the UI locally, we'll tunnel through one of our servers:

# create the tunnel
ssh -N -f -L 8500:localhost:8500 root@45.55.144.228

Now you can view the consul UI on your local machine at localhost:8500/ui

Now that we have Consul installed and running, let's set up our Swarm:

Provision the Swarm

Set up the master

First up, let's set up the swarm manager:

swarm.yml

- hosts:
   - manager
  tasks:
    #$ docker run -d -p 4000:4000 swarm manage -H :4000 --replication --advertise 172.30.0.161:4000 consul://172.30.0.161:8500
    - name: Run swarm manager
      docker:
        name: swarm
        image: swarm
        command: "manage -H :4000 --replication --advertise {{private_ip}}:4000 consul://{{private_ip}}:8500"
        state: started
        ports:
          - "4000:4000"
        expose: 
          - 4000
      tags:
        - swarm

Notes:

  • We're only running this play on the manager node(s) from our inventory (hosts: manager)

Run the playbook again:

ansible-playbook swarm.yml -i swarm_cluster
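Since everything Swarm-related is tagged swarm, you can also limit the re-run to just those tasks and skip the base roles (assuming nothing else needs to change):

ansible-playbook swarm.yml -i swarm_cluster --tags swarm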

Now, if you log into the manager node and run docker ps, you should see that we have a swarm container running on our server:

root@manager:~# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
3b753e4772b4        swarm               "/swarm manage -H :40"   34 seconds ago      Up 33 seconds       2375/tcp            swarm
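It's also worth glancing at the container's logs - the manager logs its Consul connectivity and leader-election activity there, which is handy when debugging the issues mentioned above:

root@manager:~# docker logs swarm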

Set up the replica

Now, our replica is actually just another master node - Docker Swarm handles which one actually acts as the master. As such, we just add the replica to the hosts list of the play above:


- hosts:
   - manager
   - replica
  ...

This deviates from the official tutorial, but I found that when trying it the way the official tutorial works, I often got stuck. For example, my "manager" node would lose the master election and my replica node would become master. In the official tutorial this is problematic because port 4000 is not exposed on the replica - therefore, you can no longer communicate with the master node.
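With --replication enabled on both managers, you can ask either of them which one currently holds the primary role - standalone Swarm reports this in its info output (the exact field names may vary between Swarm versions):

root@manager:~# docker -H :4000 info
# look for lines like:
#   Role: primary
#   Primary: 10.132.44.190:4000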

Notes:

  • Note that in each case we're using the {{private_ip}} for Consul. This tutorial is slightly different from the official one in that we have installed Consul on all our nodes. Because Consul is clustered, talking to Consul on any node is equivalent.
  • See the note above about the -client address. Without -client=0.0.0.0 set, the swarm command above will not be able to communicate with Consul on the provided IP. It will also not be able to use localhost, because inside the container localhost refers to the container itself. This again deviates from the tutorial, but without this change the tutorial did not work for me.

Set up the node

To set up the node, add:

- hosts:
   - node
  tasks:
    #$ docker run -d swarm join --advertise=172.30.0.69:2375 consul://172.30.0.161:8500        
    - name: Run swarm node
      docker:
        name: swarm
        image: swarm
        command: "join --advertise={{private_ip}}:2375 consul://{{private_ip}}:8500"
        state: started
      tags:
        - swarm

You can now SSH into the node server and you should see that there is a swarm container running there.

Note: The master tries to communicate with the node's Docker daemon on port 2375. I needed to add DOCKER_OPTS="-H tcp://0.0.0.0:2375" to the config file in /etc/default/docker. I then also needed to pass -H :2375 when talking to the local Docker instance on the node.
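That /etc/default/docker change can be pushed out with Ansible too, rather than editing the file by hand. A minimal sketch for the node play, assuming the stock Ubuntu packaging (lineinfile and service are standard modules):

    # hypothetical tasks - make the Docker daemon listen on tcp port 2375
    - name: Make docker listen on tcp://0.0.0.0:2375
      lineinfile:
        dest: /etc/default/docker
        regexp: '^#?DOCKER_OPTS='
        line: 'DOCKER_OPTS="-H tcp://0.0.0.0:2375"'

    - name: Restart docker to pick up DOCKER_OPTS
      service:
        name: docker
        state: restarted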

Communicate with the Swarm

You now have a Docker Swarm running. Some things you can do:

Log into the swarm manager:

Check the status of the cluster:

docker -H :4000 info

Run a container on the cluster

docker -H :4000 run hello-world

Check which node the container ran on:

docker -H :4000 ps -a
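If you'd rather not pass -H :4000 every time, you can point any Docker client at the swarm by exporting DOCKER_HOST - here from your own machine, using the manager's public IP (assuming port 4000 is reachable from where you are):

export DOCKER_HOST=tcp://104.236.61.150:4000
docker info
docker run hello-world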

...

Issues I had that appear not to be documented:

  • Swarm is unable to communicate with Consul: you need to specify -client=0.0.0.0 for the consul agent.
  • The Docker Swarm master is unable to communicate with nodes on port 2375: you need to add DOCKER_OPTS="-H tcp://0.0.0.0:2375" to the config file in /etc/default/docker. This means that the normal docker command will no longer work on the node; you need to specify docker -H :2375 ....

Next Steps

  • Get the swarm working with Docker Compose
  • Test the new (beta) on-node-failure re-scheduling feature
  • Test with rolling over an entire node (Chaos Monkey style)
  • Look into Registrator to automatically register the nodes with consul ... (or should that actually already be happening?)

References: