mesos-module-dvdi Installation Walkthrough

As other people on the EMC {code} team started getting involved with Apache Mesos, I have been helping by fielding questions and troubleshooting installations of both Mesos and mesos-module-dvdi so that others can kick the tires and become subject matter experts. It was brought to my attention that a good, clean walkthrough of adding external volume/storage support to an existing production Mesos cluster might be of value to others, which brings us to this blog post. We will cover a couple of examples of launching Marathon tasks to make use of our newfound capability. We will also take a look at the functionality that mesos-module-dvdi brings to the table and talk about what those features mean to you.

Preflight Checklist

Before we begin, I want to make sure that everyone who is going to follow along has a properly configured Mesos cluster. For those that don’t have a Mesos cluster, there is a very good walkthrough guide by DigitalOcean that I would highly recommend. I would encourage performing every step in that guide and not skipping or taking shortcuts when deploying your cluster, as doing so can lead to wonky behavior.

Wonky Mesos Clusters

There are a number of open source projects out there that attempt to deploy a proper Mesos cluster, but each one that I have evaluated (there have been a lot) has always fallen short somewhere:

  • everpeace/vagrant-mesos currently only works for version 0.23.1 and older, and there haven’t been updates to the project in some time.
  • mesosphere/playa-mesos runs on VirtualBox, but I found a number of issues with how the nodes are configured and ran into issues using certain frameworks, such as the Elastic Search framework, because of VirtualBox’s networking layer. The configuration that is deployed is only a single master node, so there is no high availability. This project also hasn’t been updated in some time.
  • minimesos deploys a configuration that is backed by Docker containers, so if you need to modify something like the slave nodes (in our case, we do), this isn’t going to be a good option. Like the previous method, it is a single master with no high availability.

If you want to kick the tires on all the functionality that Mesos has to offer, I would still recommend that you deploy these nodes by hand. If that is too time consuming or you don’t have access to AWS or hardware resources, playa-mesos gets you most of the way there. The {code} team has a fork of that project which has some bug fixes as well as some enhancements. You can find our fork on GitHub at emccode/vagrant/playa-mesos, and if you choose to go this route, the {code} fork of playa-mesos has mesos-module-dvdi already pre-configured using VirtualBox as the external storage provider, in which case you can use this walkthrough to verify your settings. Just a reminder: the purpose of this walkthrough is adding external persistent volumes to production Mesos environments, in which case you will most likely take the steps below and incorporate them into a DevOps tool that your organization uses.

mesos-module-dvdi Installation Walkthrough

mesos-module-dvdi builds on top of functionality that is provided by REX-Ray and DVDCLI, so the first thing we need to do is install both packages on all your Mesos slave/agent nodes (NOTE: all steps outlined in this section need to be done on every slave/agent node as root). You can do that by running the following commands:

curl -sSL https://dl.bintray.com/emccode/rexray/install | sh -
curl -sSL https://dl.bintray.com/emccode/dvdcli/install | sh -

Then you need to configure REX-Ray with the storage provider you plan on using. Since I usually host my configurations on Amazon for demonstration purposes, I am going to configure REX-Ray to use EC2 and place the file at /etc/rexray/config.yml. This is what my configuration file looks like:

rexray:
  storageDrivers:
  - ec2
  volume:
    mount:
      preempt: true
    unmount:
      ignoreUsedCount: true
aws:
  accessKey: <YOUR_ACCESS_KEY>
  secretKey: <YOUR_SECRET_KEY>

You can do a simple test to make sure REX-Ray and DVDCLI are functioning properly by running the following command: rexray volume. If you need help configuring REX-Ray for other storage platforms, you can view the REX-Ray configuration guide.
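If you want a slightly more thorough smoke test, the quick sketch below (my own addition, using a throwaway volume name of testvol) exercises the full path through DVDCLI and REX-Ray; the exact flag spelling can vary between DVDCLI and REX-Ray versions, so adjust as needed.

# List the volumes REX-Ray can see from this node
rexray volume

# Exercise the Docker Volume Driver path end to end with a throwaway volume
# (testvol is just an example name; remember to clean it up on your storage platform afterwards)
dvdcli mount --volumedriver=rexray --volumename=testvol
dvdcli unmount --volumedriver=rexray --volumename=testvol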

Next, we need to install mesos-module-dvdi, which gives us the ability to provision, mount, and unmount external volumes for Mesos tasks. You can download the version that matches your Mesos configuration from our GitHub releases page. My Mesos cluster is running version 0.27.0, so I am going to download libmesos_dvdi_isolator-0.27.0.so and place the .so in the /usr/lib directory. We also need to create a Mesos isolator configuration file so that the Mesos slave/agent knows how to load the isolator module. Create the file /usr/lib/dvdi-mod.json with the following content (replacing the filename with your version of the downloaded isolator):

{
  "libraries": [
    {
      "file": "/usr/lib/libmesos_dvdi_isolator-0.27.0.so",
      "modules": [
        {
          "name": "com_emccode_mesos_DockerVolumeDriverIsolator",
          "parameters": [
            {
              "key": "isolator_command",
              "value": "/emc/dvdi_isolator"
            }
          ]
        }
      ]
    }
  ]
}

Finally, we just need to add the hooks for the Mesos slave/agent to pick up the configuration file and load the isolator. You can do that by running the following shell commands:

echo "docker,mesos" | tee /etc/mesos-slave/containerizers
echo "com_emccode_mesos_DockerVolumeDriverIsolator" | tee /etc/mesos-slave/isolation
echo "file:///usr/lib/dvdi-mod.json" | tee /etc/mesos-slave/modules
service mesos-slave restart

Success! This configuration will also allow you to run external volumes for Docker workloads, which is why you see docker in the first command. To learn more about using external storage with the docker containerizer in Apache Mesos, please check out this great article on the EMC {code} blog.
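As a quick sketch of what that looks like (assuming Docker 1.9 or newer on the agent node and a placeholder volume name of mydockervol), you should be able to attach a REX-Ray-backed volume to a plain Docker container straight from the Docker CLI:

# Create/attach an external volume through the rexray volume driver and write to it
docker run -ti --volume-driver=rexray -v mydockervol:/data busybox sh -c "touch /data/hello && ls /data"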

Very Nice Great Success

Exercising the mesos-module-dvdi

Now that we have all this configured correctly, let’s start out by creating a couple of external volumes by launching a Mesos task. Create a file called basic.json with the following content:

{
  "id": "hello-mesos",
  "cmd": "while [ true ] ; do touch /var/lib/rexray/volumes/firstvolume/hello1 ; sleep 5 ; done",
  "mem": 16,
  "cpus": 0.1,
  "instances": 1,
  "env": {
    "DVDI_VOLUME_NAME": "firstvolume",
    "DVDI_VOLUME_OPTS": "size=5,iops=150,volumetype=io1,newfstype=xfs,overwritefs=false",
    "DVDI_VOLUME_DRIVER": "rexray",
    "DVDI_VOLUME_NAME1": "secondvolume",
    "DVDI_VOLUME_DRIVER1": "rexray",
    "DVDI_VOLUME_OPTS1": "size=6,volumetype=gp2,newfstype=ext4,overwritefs=false"
  }
}

Then we can launch the Marathon task by running the following shell command:

curl -k -XPOST -d @basic.json -H "Content-Type: application/json" http://<marathon_host>:8080/v2/apps

This simply creates two new volumes called “firstvolume” and “secondvolume” and uses the default REX-Ray mount location /var/lib/rexray/volumes to mount these volumes.
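If you want to confirm the volumes actually landed, a quick sanity check on the slave/agent node that picked up the task (my own habit, not a required step) might look like this:

# Run on the slave/agent node where the task was placed
rexray volume                              # firstvolume and secondvolume should now show up in the list
ls /var/lib/rexray/volumes/firstvolume     # hello1 appears once the task has been running for a few seconds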

For a more complex example, let’s stand up a simple web server with an external volume and scribble some data on it. Create a file called web.json with the following content and replace YOUR_NAME_HERE with your first name (be careful not to remove the single quote trailing it):

{
  "id": "webserver",
  "uris": [
    "https://github.com/dvonthenen/goprojects/raw/master/bin/statichttpserver"
  ],
  "cmd": "echo 'Hello, YOUR_NAME_HERE' $(cat /var/lib/rexray/volumes/webdata/readme.txt | wc -l) >> /var/lib/rexray/volumes/webdata/readme.txt && chmod u+x statichttpserver &&  ./statichttpserver -port=$PORT -path=/var/lib/rexray/volumes/webdata",
  "mem": 16,
  "cpus": 0.1,
  "instances": 1,
  "constraints": [
    ["hostname", "UNIQUE"]
  ],
  "env": {
    "DVDI_VOLUME_NAME": "webdata",
    "DVDI_VOLUME_OPTS": "size=5,iops=150,volumetype=io1,newfstype=xfs,overwritefs=false",
    "DVDI_VOLUME_DRIVER": "rexray"
  }
}

Then we can launch the Marathon task by running the following shell command:

curl -k -XPOST -d @web.json -H "Content-Type: application/json" http://<marathon_host>:8080/v2/apps

You can find out what slave/agent node and port the web server is running on by clicking on the running task in the Marathon UI and looking at the application configuration information on the page (as seen below).

Marathon Task

If you open up that hostname and port in your web browser (you may have noticed that my FQDN isn’t the same since I configured my EC2 instances without elastic IPs), you will reach the web server we downloaded from my GitHub account, and you should see a file named readme.txt at the root. When you click readme.txt to display its contents, you should see your name inside the text file.
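If you prefer the command line to a browser, the same check can be done with curl; substitute the host and port that Marathon reported for the task:

curl http://<agent_host>:<task_port>/readme.txt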

Opening up the readme.txt

Although the second example is a fairly meaningful real-world example of using external volumes/storage, the real benefits don’t become apparent until you start doing things like failing the Mesos slave/agent node that is currently running the task. We can simulate this failure by simply stopping the Marathon task and relaunching it. The new instance of our task may start up on a different node, but the important takeaway is that the external volume from the first instance will be reattached to the new task, thus preserving any prior data. If you open readme.txt after the task enters the running state, you should see the previous “Hello” line ending in “0” now followed by a “Hello” line ending in “1” that this new instance has added. Now just imagine a database like PostgreSQL or an Elastic Search node as a Marathon task. Any failure of the Mesos slave/agent or a failed health check will trigger the task to start up from its previous state on a new node. Say “Hello” to high availability!
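One easy way to bounce the task (assuming the app id webserver from the example above and the same Marathon host placeholder as before) is Marathon’s restart endpoint; watch the task come back, possibly on a different node, with the volume reattached:

curl -k -XPOST http://<marathon_host>:8080/v2/apps/webserver/restart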

Persistence!

What’s Next…

Very cool! Hopefully this served as a good guide to adding external volume/storage support to your existing Apache Mesos cluster using REX-Ray, DVDCLI, and mesos-module-dvdi. I have a couple of ideas for some follow-up blog posts… thinking about how I do dev/test for mesos-module-dvdi with Docker containers, or perhaps more about frameworks. If you have any topics you might want me to discuss, please drop me a line on Twitter at @dvonthenen.

Enabling External Storage on Mesos Frameworks

There has been a huge push to take containers to the next level by twisting them to do much more. So much more, in fact, that many are starting to use them in ways that were never originally intended, even going against the founding principles of containers. The most noteworthy container principle being left on the cutting room floor is, without a doubt, being “stateless”. It is pretty evident that this trend is only accelerating… a simple search of Docker Hub for popular traditional databases yields results like MySQL, MariaDB, Postgres, and OracleLinux (Oracle suggests you might try running an Oracle instance in a Docker container. LAF!). Then there are all the NoSQL implementations like Elastic Search, Cassandra, MongoDB, and Couchbase, just to name a few. We are going to take a look at how we can bring these workloads into the next evolution of stateful containers, using the Mesos Elastic Search Framework as a proposed model.

Afterthought

The problem with stateful containers today is that pretty much every container implementation, whether it’s Docker, Apache Mesos, or anything else, has been architected with those original principles, such as being stateless, in mind. When the container goes away, anything associated with the container is gone. This obviously makes it difficult to maintain configuration or keep long-term data around. Why make this tradeoff to begin with? It keeps the design simple. The problem is that useful applications all have state somewhere. As a response, container implementations enabled the ability to store data on the local disks of compute nodes, thus tying workloads to a single node. However, on the failure of a particular node, you could potentially lose all data on the direct attached storage. Not good for high availability, but at least there was some answer to this need.

Enter the Elastic Search Mesos Framework

This brings me to a recently submitted Pull Request to the Elastic Search (ES) Mesos Framework project on GitHub to add support for external storage orchestration, and more importantly, to enable management of those external storage resources among Mesos slave/agent nodes. Before I jump into talking about the ES Framework, I probably should quickly talk about Mesos Frameworks in general. A Mesos Framework is a way to specialize a particular workload or application. This specialization can come in the form of tuning the application to best utilize the hardware on a given slave node, like a GPU for heavy graphics processing, or distributing scaled-out tasks across different racks within a datacenter for high availability. A Framework consists of two components: a Scheduler and an Executor. When a resource offer is passed along to a Scheduler, the Scheduler can evaluate the offer, apply its unique view or knowledge of its particular workload, and deploy specialized tasks or applications in the form of Executors to slave/agent nodes (seen below).

Mesos Architecture - Picture Thanks to DigitalOcean

The ES Framework behaves in the same way described above. The ES Scheduler and Executor are designed such that both components are implemented as Docker containers. The ES Scheduler is deployed to Marathon via Docker, and by default the Scheduler will create three Elastic Search nodes based on a special list of criteria that must be met. If an offer meets that criteria, an Elastic Search Executor in the form of a Docker container is created on the Mesos slave/agent node that made the offer. Within that Executor runs the task, which in this case is an Elastic Search node.
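To make that deployment step concrete, posting the Scheduler to Marathon looks roughly like the sketch below. The image name and the ZooKeeper/Mesos connection flags are placeholders rather than the framework’s exact values (check the ES Framework documentation for your version); the only flags taken from this post are --frameworkName and --externalVolumeDriver, which are covered later on.

curl -k -XPOST -H "Content-Type: application/json" http://<marathon_host>:8080/v2/apps -d '{
  "id": "elasticsearch-scheduler",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "<scheduler_image>", "network": "HOST" }
  },
  "args": ["--frameworkName=elasticsearch", "--externalVolumeDriver=rexray", "<zookeeper/mesos flags for your version>"],
  "cpus": 0.2,
  "mem": 512,
  "instances": 1
}'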

Deep Dive: How the External Storage Works

Let’s do a deep dive on the Pull Request and discuss why I made some of the decisions that I did. I first broke OfferStrategy.java apart into a base class containing everything common to both a “normal” Elastic Search strategy and one that makes use of external storage. OfferStrategyNormal.java retains the original functionality and behavior of the ES Scheduler, which is on by default. I then created OfferStrategyExternalStorage.java, which removes all checks for storage requirements. Since the storage used in this mode is managed externally, the Scheduler does not need to take storage requirements into account when it evaluates the criteria for deployment.

The critical piece in assigning external volumes to Elastic Search nodes is being able to uniquely associate the set of volumes containing the configuration and data of each Elastic Search node, represented by /tmp/config and /data. That means we need to create, at minimum, runtime-unique IDs. What do I mean by runtime-unique? It means that if there exists an ES node with an identifier of ID2, no other node has ID2. If an ID, let’s say ID2, is freed by a Mesos slave/agent node failure, we make every attempt to reuse that ID2. This identifier is defined as a task environment variable, as seen in ExecutorEnvironmentalVariables.java:

// Record this node's ID as a task environment variable
addToList(ELASTICSEARCH_NODE_ID, Long.toString(lNodeId));
LOGGER.debug("Elastic Node ID: " + lNodeId);

// Helper that appends a key/value pair to the executor's environment variable list
private void addToList(String key, String value) {
  envList.add(getEnvProto(key, value));
}

Why an environment variable? Because when the task, and therefore the Executor, is lost, the reference to the ES Node ID is freed, so when a new ES node is created to replace the failed one, the ES Node ID gets recycled. How do we determine which Node ID to use when selecting a new one or recycling an old one? We do this using the following function in ClusterState.java:

public long getElasticNodeId() {
    List<TaskInfo> taskList = getTaskList();

    //create a bitmask of all the node ids currently being used
    long bitmask = 0;
    for (TaskInfo info : taskList) {
        LOGGER.debug("getElasticNodeId - Task:");
        LOGGER.debug(info.toString());
        for (Variable var : info.getExecutor().getCommand().getEnvironment().getVariablesList()) {
            if (var.getName().equalsIgnoreCase(ExecutorEnvironmentalVariables.ELASTICSEARCH_NODE_ID)) {
                bitmask |= 1 << Integer.parseInt(var.getValue()) - 1;
                break;
            }
        }
    }
    LOGGER.debug("Bitmask: " + bitmask);

    //then find out which node ids are not being used
    long lNodeId = 0;
    for (int i = 0; i < 31; i++) {
        if ((bitmask & (1 << i)) == 0) {
            lNodeId = i + 1;
            LOGGER.debug("Found Free: " + lNodeId);
            break;
        }
    }

    return lNodeId;
}

We get the current running task list, find out which tasks have the environment variable set, build a bitmask, and then walk the bitmask starting from the least significant bit until we find a free ID. Fairly simple. As someone who doesn’t run Elastic Search in production, it was pointed out to me that this only supports 32 nodes, so a future commit will make this generic for an unlimited number of nodes.

Let’s Do This Thing

To run this, you need:

  • a Mesos configuration running 0.25.0 (the version supported by the ES Framework) with at least 3 slave/agent nodes
  • your Docker Volume Driver installed, REX-Ray for example
  • the volumes you plan on using pre-created, named after the --frameworkName parameter (default: elasticsearch) appended with the node id and config/data (example: elasticsearch1config and elasticsearch1data)
  • the ES Scheduler started with the command line parameter --externalVolumeDriver=rexray, or whatever volume driver you happen to be using

You are all set! Pretty easy, huh? Interested in seeing more? You can find a demo on YouTube below.
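As a rough sketch of the volume pre-creation step (assuming REX-Ray as the driver, the default framework name, a three node cluster, and example volume sizes; REX-Ray’s CLI flag spelling varies a bit between versions):

# Pre-create one config and one data volume per Elastic Search node id
for i in 1 2 3; do
  rexray volume create --volumename=elasticsearch${i}config --size=1
  rexray volume create --volumename=elasticsearch${i}data --size=10
done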

BONUS! The Elastic Search Framework has a facility (although only recommended for very advanced users) for running the Elastic Search JAR directly on the Mesos slave/agent node. In that case, code was also added in this Pull Request to use mesos-module-dvdi, which is a Mesos isolator, to create and mount your external volumes. You just need to install mesos-module-dvdi and DVDCLI.

The good news is that the Pull Request has been accepted, and it is currently slated for the 8.0 release of the Elastic Search Framework. The bad news is the next release looks to be version 7.2, so you are going to have to wait a little longer before you get an official release with this external volume support. HOWEVER, if you are interested in test driving the functionality, I have the Elastic Search Docker images used for the YouTube video up on my Docker Hub page. If you want to kick the tires first hand, you can visit https://hub.docker.com/r/dvonthenen/ for images and instructions on how to get up and running. Both the Scheduler image and the Executor image were auto-created as a result of the Gradle build done for the demo.

Frameworks.NEXT

What’s up next? This was a good exercise in adding on to an existing Mesos Scheduler and Executor, and the {code} team may have a Framework of our own on the way. Stay tuned!

Drops the Mic

Looking Back at SCaLE x14

For those that don’t know what SCaLE is… SCaLE stands for the Southern California Linux Expo, and this marked the 14th year the conference has been held, this time in Pasadena, CA on January 21-24. This was the first time I have attended SCaLE, and I have to say it was quite refreshing going to a conference where the primary purpose wasn’t trying to sell you a product but rather ideas and open source projects of interest. A lot of this is due to a very community-driven focus, which encompassed everything from session selection to a large volunteer staff.

There were a number of sessions and tracks ranging from Ubucon (everything Ubuntu), PostgreSQL, MySQL, Apache Bigtop, Open Source in Education, and Unikernels to Robotics using Golang (pictured below), and more. I took it upon myself to take a slice out of each pie to get a good feel for what the conference had to offer. Like everything in life, there was a range of the good, the bad, and the ugly… but I would have to say there wasn’t much, if any, in the latter category. For reference, you can get a copy of the conference schedule here as a sampling of what types of sessions SCaLE provides.

Gobot - Robots Powered by Golang

The other significant difference is the makeup of the attendees, which is significantly different from traditional conferences backed by big money; here it was predominately developers, DevOps peeps, and sysadmins. Gone are the sales and marketing people with their heavy focus on landing deals, and the private rooms off to the side where contracts are being drawn up. If you are looking to network with other developers and the actual users of a particular technology, this is a fantastic place to connect with those people.

Without further ado, here are some notes from sessions I found interesting. It should also be noted that I have included a lot of the session links in this blog as many of the slide decks are available on those pages.

IoT and Snappy Ubuntu Core

This session was titled Internet of Things gets ‘snappy’ with Ubuntu Core, and its main focus was introducing people to the idea of Snappy Ubuntu Core, which Ubuntu is pushing as the solution for building applications in the cloud and on embedded devices. Think of it as a minimal Ubuntu operating system built using the same libraries as traditional Ubuntu, just with a much smaller footprint, wrapped with convenient package management that can orchestrate installation and upgrades. The IoT demo was an application built using Snappy on a Raspberry Pi with an IP camera that can do facial tracking. Pretty cool! You can find more information about Snappy Ubuntu Core here.

IoT using Snappy Ubuntu Core on a Raspberry Pi

Juju Charms

This was my first exposure to Juju Charms. I had heard the name being thrown around before but didn’t have the time to take a look at the offering. The session Writing your first Juju Charm was a well-presented introduction for first-time users. The main takeaway is that Juju models applications and their relationships, encapsulated in what is called a charm, and that those charms, for example MySQL, are created, maintained, and tuned by subject matter experts. Pulling in those vetted charms helps provide a solid foundation to build your applications on. Dependencies in charms are automatically pulled in, and the likeness was compared to the layers of an onion; you just layer in the applications you want. Then, when pulling together your solution, it’s simply a matter of dragging and connecting these charms within the UI.

When writing your own charm, hooking these applications together is done by an event-driven engine. For example, your application contained within your own charm doesn’t get installed until the “http server started” and “MariaDB started” events are received. Overall, a good session. You can find more information about Juju here and here.

Docker and Unikernels

Although what I am going to write about here wasn’t a session at SCaLE per se, speakers from both the Docker and unikernel presentations at this special Docker LA Meetup SCaLE x14 Edition also had sessions at the conference. The first speaker was Jérôme Petazzoni from Docker (pictured on the right). We discussed Swarm and the advances made in the latest release, specifically around multi-network orchestration between container instances to satisfy the cluster use case. I did enjoy the fact that the discussion was almost completely driven by demonstration, but from the sound of the known issues walking into this demo, it makes me think that this isn’t ready for prime time yet. I definitely appreciate the candor and honesty from the presenter about these issues; it speaks to his professionalism.

The second half of the meetup quickly switched gears to talk about the Docker acquisition of Unikernel Systems, the primary driver of the open source ecosystem at unikernel.org. Presenting a 101-type course on unikernels were Amir Chaudhry and Richard Mortier (pictured on the left). The session was quite fascinating considering I had not heard the term unikernel until Docker made the announcement that morning. In a nutshell, the idea of a unikernel is that you take only the parts of a particular stack, say networking, that your application needs in order to run. This implies that the stack is modular and has clear separation of concerns. The claim is that this type of operating system has:

  • a smaller footprint in terms of size
  • better security because they have a lower penetration profile
  • better performance because unnecessary services aren’t running
  • quicker boot times than a traditional operating system, which lends itself well to cloud applications

While all this might be true, I have some reservations about dynamically pulling pieces of the operating system to build your final application image. Unikernel Systems, via their tools, claims to have solved that dependency nightmare. For the sake of argument, if we accept that they have been able to solve dynamic dependencies between modules, a lot of these operating systems have been rewritten from the ground up, which raises questions about stability. Additionally, these unikernels don’t have a traditional kernel or protection layer, meaning the application has full access all the way down the stack. Think about the security ramifications of this design.

It seems like unikernels are at odds with what Ubuntu Core decided to do, which was to create a common minimal operating system using the exact same libraries we have been using for decades to run cloud and embedded applications. I think I prefer the Ubuntu Core design, but I have a feeling the hype and backing of Docker’s marketing machine might smash that idea.

Jérôme Petazzoni of Docker, Amir Chaudhry and Richard Mortier of Unikernel Systems

Mentoring in Open Source

I attended the Dent the Universe: Mentoring in Open Source session after a fellow coworker who was interested in it was unable to attend SCaLE. I was pleasantly surprised with both the content of the discussion and the makeup of the audience. The audience had a mix of genders, races, backgrounds, and ages… as in there were teens, toddlers, and babies in the audience (pretty sure the babies didn’t really follow what was going on).

The upshot of the session is that relationships between mentors and mentees bring forth a sense of community like what we see in the open source world. It’s a platform for exchanging ideas, similar to a meetup but in an individual, one-on-one setting. There are many open source projects and organizations out there that have mentoring programs, including but not limited to Google, Apache, Fedora, and Ubuntu. The session covered what mentors are and aren’t, how you build that relationship, the dedication that mentors need to have, and even how mentees might go about getting a mentor. Of all the sessions I sat in on, I have to say I took the most notes in this one… maybe it was because the subject matter was so different, but nonetheless this was definitely an excellent session and well worth the time.

Nomad, a Golang Cluster Scheduler

I just want to give a quick mention of the Nomad – An introduction to Cluster Schedulers session. I found the overall design of Nomad intriguing based on the claimed feature set. Nomad is a HashiCorp project; if you don’t know who HashiCorp is, they are the ones who brought you projects like Vagrant and Consul. Nomad is written in Golang and supports scheduling for virtual infrastructure, Docker, Java applications, and more. It has support for global and local cluster regions (think multi-datacenter support) and facilities for deploying, scaling, rolling updates, and load balancing, and it integrates with Consul and more. It seems like scale isn’t their strong suit yet, as the presenter believes stress testing has only been done on 100 or so nodes; however, Nomad might be a good enough solution for a mid-sized business. I will be keeping an eye on this considering the success HashiCorp has had with their other projects.

Expo Floor at SCaLE x14

Conclusion

Overall, SCaLE was a great conference with a veritable buffet of topics that can interest a wide variety of people who attend. Hell, for the price, under $40 if you get in on a coupon code, you can’t beat the value for the level of information in the sessions and the discussions had with the many attendees and speakers. I would definitely recommend checking out this conference next year, whether you can convince your employer to fund the trip or you have the spare time and (not much) cash to come on your own dime; it’s definitely a worthwhile experience.