
Adventures in OpenStack with a SeaMicro 15000

The Chassis

Semaphore was recently approached by a friend of the company, now working at AMD, who happened to have one of their recently acquired SeaMicro 15000 high-density compute chassis available for some short-term testing. They offered to loan it to us for a bit in hopes we’d resell them, with a particular focus on OpenStack as an enterprise private cloud, since OpenStack requires some expertise to get up and running. Having never used OpenStack, and having little experience even with AWS and other cloud services, naturally we said, “Of course!”

A week or so later, AMD pulled up to the loading dock with 3 pallets of hardware. They’d brought a SeaMicro 15k chassis and 2 large disk arrays (we decided to set up only one of the arrays, given the limited cooling in our lab area). A lot of heavy lifting later, we had both devices on the lab bench, powered on, and ready to start deployment.

After power on, we got the following specs from the chassis and disk array:

  • 64 Server Cards, each with a single Intel Xeon E3-1265L V3 chip (4 cores, 8 threads), and 32GB of DDR3 RAM
  • 1 Storage Card with 8 x 480GB SSD drives in a RAID configuration
  • 2 MX (controller) cards, each with 2 10Gig Ethernet SFP+ ports
  • 60 x 3TB JBOD drives, attached to the chassis over 2 eSATA paths

The slickest thing about the SeaMicro chassis is that the compute cards are essentially just a Haswell CPU and its northbridge (minus the graphics controller). The southbridge is replaced by a set of ASICs which communicate with the SeaMicro chassis for disk and network I/O configuration and presentation. Each server card has 8 network paths that present to the server as Intel E1000 NICs. The virtual server NICs are fully configurable from the SeaMicro chassis using an IOS-like command line, with full 802.1Q VLAN tagging and trunking support (if desired). By default the chassis presents a single untagged VLAN to all the server NICs, as well as to the external 10Gig ports.

Disk I/O is even better. Since we had a RAID storage card, we configured a single volume of around 3TB, and with just a few lines of configuration on the chassis we were able to split the RAID volume into 64 virtual volumes of 48GB each and present one to each server card as a root volume. These were presented as hot-plug SCSI devices, and could be dynamically moved from one server card to another via a couple of quick config statements on the SeaMicro chassis. For the JBOD, we were able to assign disk ranges to lists of servers with a single command: feed it a list of server cards and a number of drives, and the SM would automatically assign that many disks (still hot-plug) to each server and attach them via the SeaMicro internal fabric and ASICs. Pretty cool stuff! (And invaluable during the OpenStack deployment; more on that later.)

On to OpenStack!

With the chassis powered on, root volumes assigned, and the JBOD volumes available to assign to whichever servers made the most sense, we were ready to get going with OpenStack. First hurdle: there is zero removable media available to the chassis. This is pretty normal for setups like this, but unlike something like VMware, there isn’t any easy way to mount an ISO for install. Fortunately installing a DHCP server is trivial on OS X, and it has a built-in TFTP server, so setting up a quick PXE boot environment for the bootstrap node took just a few minutes. A nice feature of the SeaMicro chassis is that the power-on command for the individual servers allows a one-time PXE boot order change that goes away on the next power on, so you don’t need to mess with boot order in the BIOS at all. We installed Ubuntu 14.04 on one of the server nodes for bootstrapping and then started to look at what we needed to do next.
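
We used OS X’s built-in DHCP and TFTP services for this, but doing the same thing from a Linux box is a one-file dnsmasq config along these lines (the interface, address range, and paths here are illustrative placeholders, not our actual setup):

# /etc/dnsmasq.d/pxe.conf -- minimal PXE boot environment (illustrative values)
interface=eth0
dhcp-range=192.168.0.100,192.168.0.200,12h
dhcp-boot=pxelinux.0        # hand PXE clients the PXELINUX bootloader
enable-tftp
tftp-root=/srv/tftp         # put pxelinux.0 and the Ubuntu netboot files here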

We’d received a SeaMicro/OpenStack Reference Architecture document from AMD, found a blog article on how a group from Canonical configured 10 SeaMicro chassis for OpenStack in about 6 hours, and located an OpenStack Rapid Deployment Guide for Ubuntu. This seemed like enough knowledge to be dangerous when starting from absolutely nothing, so we dove right in.

Bootstrapping the metal

The reference/rapid deployment architectures all appeared to use MaaS (Metal as a Service) for bootstrapping the individual server blades. MaaS also has a plugin for the SeaMicro chassis to further speed deployment, so once the MaaS admin page was up and running, we were off to the races:

maas maas node-group probe-and-enlist-hardware model=seamicro15k mac= username=admin password=seamicro power_control=restapi2

A few seconds later, MaaS was populated with 64 nodes, each with 8 displayed MAC addresses. Not too shabby. We deleted the bootstrap node from the MaaS node list since it was statically configured, then told MaaS to commission the other 63 nodes for automation. Using the SeaMicro REST API, MaaS powered on each server using PXE boot, ran a quick smoke test to confirm reachability, then powered it back off and listed it as ready for use. Easy as pie, and pretty impressive compared to the headaches of booting headless/diskless consoles of old. (I’m looking at you, SunOS.)
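
Commissioning can be kicked off from the MaaS web UI, or in one shot from the CLI; the syntax below is from the MaaS 1.x CLI of the era and may differ in newer releases:

# Accept (and thereby commission) every newly enlisted node still awaiting approval
maas maas nodes accept-all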

All the Ubuntu + OpenStack reference architectures use a service orchestration tool called Juju. It’s based on sets of scripts called “charms” that deploy an individual service to a machine, configure it, then add relationship hooks to other services (e.g., telling an API service that it’s going to be using MySQL as its shared backend database).
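
In practice each of those relationships is a one-liner. For example, telling Keystone to use the MySQL service as its backend database looks like this:

# Relate two already-deployed charms; Juju runs the relation hooks on both sides
juju add-relation keystone mysql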

Juju requires its own server (machine “0”) to run the orchestration tools and deploy services from, so after pointing Juju at the MaaS API, a quick bootstrap later I had a bootstrap server running, powered on and automatically provisioned by MaaS. Juju also deploys the deploying user’s ssh public key to each new server, for use with its internal “juju ssh <machine/service>” command, which is quite handy. (I’d later come to learn that password auth is basically nonexistent in cloud architectures, at least on initial deployment. Works for us.)
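
For reference, pointing Juju (the 1.x release that ships with Ubuntu 14.04) at MaaS is just a stanza in ~/.juju/environments.yaml followed by a bootstrap. The values below are placeholders rather than our actual config:

# ~/.juju/environments.yaml (Juju 1.x syntax; server address and key are placeholders)
environments:
  maas:
    type: maas
    maas-server: 'http://<maas-server-ip>/MAAS/'
    maas-oauth: '<MAAS API key from the user preferences page>'
    default-series: trusty

# Then bootstrap machine 0 onto a node MaaS has marked as ready:
juju switch maas
juju bootstrap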

Now it was time to start getting OpenStack deployed. The AMD-provided reference architecture didn’t quite match the Ubuntu one, which in turn didn’t match what I was seeing in the Canonical deployment test at all, so I had to make some decisions. By default, when you deploy a new Juju service it instantiates a new machine. That seemed wasteful on a deployment of this size, so it made sense to colocate some of the services, and the initial deployment looked a bit like this:

juju deploy mysql
juju deploy --config=openstack.cfg keystone
juju deploy --config=openstack.cfg nova-cloud-controller
juju deploy nova-compute
juju deploy glance --to 2
juju deploy rabbitmq-server --to 1
juju deploy openstack-dashboard --to 2

Networking headaches

Once completed, this (sorta) left us with 4 new servers with the basic OpenStack services running. Keystone, Glance and Horizon (the dashboard) were all colocated on one server, and MySQL and RabbitMQ on another. The Nova controller and the first Nova compute server were standalone. (Both the Ubuntu and AMD reference architectures used this basic layout.) After a lengthy series of “add-relation” commands, the services were associated and I had an apparently working OpenStack cloud with a functional dashboard. A few clicks later an instance was spawned running Ubuntu 14.04 Server. Success! Kinda… It didn’t appear to have any networking. The reference config from AMD had the “quantum-gateway” charm installed (the charm for the newly renamed Neutron networking service), but the config file supplied used a flat DHCP networking service through Nova, which didn’t appear to actually work out of the box. Most of the documentation used Neutron rather than Nova-Network anyway, and it seemed like a better solution for what we wanted to do. No problem, just change the nova-cloud-controller charm config to use Neutron instead, right?

Wrong.

The network configuration is baked into the configs at install time by Juju. While some config can be changed post-deploy, that wasn’t one of them. This was the first (of many) times that the “juju destroy-environment” command came in handy as a reset-to-zero button. After a few false starts, we had the above cloud config up and running again, this time with quantum-gateway (why the charm hasn’t been renamed to neutron, we don’t know) correctly deployed and configured to work with the Nova cloud controller. This also added the missing “Networks” option to Horizon, allowing us to create public and private subnets, as well as private tenant routers for L3 services between the networks. An instance was brought up again, and this time it could ping things! A floating external IP was easily associated with the instance, and with a few security group changes we could ping the instance from the outside world. Success! Since our keypair was automatically installed to the instance on creation, we opened an ssh session to the instance and… got absolutely nothing.
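
For reference, the CLI equivalent of the network setup we clicked through in Horizon looks roughly like the following; the names and address ranges are illustrative, and the syntax is from the Icehouse-era neutron/nova clients:

# External (provider) network with a floating IP pool, DHCP disabled
neutron net-create ext-net --router:external=True
neutron subnet-create ext-net 203.0.113.0/24 --disable-dhcp \
  --allocation-pool start=203.0.113.100,end=203.0.113.200

# Private tenant network and subnet
neutron net-create private
neutron subnet-create private 10.0.0.0/24 --name private-subnet

# Tenant router providing L3 between the two networks
neutron router-create tenant-router
neutron router-gateway-set tenant-router ext-net
neutron router-interface-add tenant-router private-subnet

# Allow ping and ssh in, then attach a floating IP to an instance
nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
nova floating-ip-create ext-net
nova add-floating-ip <instance-name> 203.0.113.101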

Neutron, as deployed by Juju, uses the ML2 (Modular Layer 2) plugin to allow for configurable tenant network backends. By default it uses GRE tunnels between compute nodes to tie the tenant networks together across the OpenStack-internal management network. This is great, but because GRE is an encapsulation protocol it adds overhead and reduces your effective MTU. Our attempts to run non-ICMP traffic were hitting MTU issues (as is common with GRE) and failing. The quantum-gateway Juju charm does have a knob to reduce the tenant network MTU, but since the SeaMicro supports jumbo frames across its fabric, we instead added DHCP option 26 to the MaaS server to raise the management network MTU to 9000 at boot time, and rebooted the whole cluster.
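
DHCP option 26 is the interface MTU option, so in ISC dhcpd terms (MaaS manages its own DHCP configuration, and where you add this varies by MaaS version) the change boils down to one line in the subnet declaration:

# DHCP option 26 (interface-mtu): have nodes bring their NICs up with jumbo
# frames, leaving headroom for the GRE encapsulation on tenant traffic
option interface-mtu 9000;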

SeaMicro Storage quirks

At this point we had a single working instance with full networking on our one compute node. There were two things left to do before the cloud could really be considered “working”: scale out compute capacity, and add persistent volume storage.

To this point, the instances were using temporary storage on the compute card that would be destroyed when the instance was terminated. This works for most instances, but there was a slight problem: our compute nodes only had 48GB of attached storage, and some of that was taken up by the hypervisor OS. That doesn’t leave a lot for instance storage. Since we had 60 3TB drives attached to the SeaMicro, we decided to give each compute node one disk, giving it 3TB for local non-persistent instance volumes. The initial plan was to add a total of 20 compute nodes, which surely would be as simple as typing “juju deploy -n 20 nova-compute”, right? This is where the biggest headache of using Juju/MaaS/SeaMicro came into play. Juju is machine agnostic: it grabs a random machine from MaaS based on constraints about RAM and CPU cores (if given; all our machines were identical, so there were no constraints). Juju tracks hostnames, which are derived from MaaS. MaaS assigns hostnames to devices as a random 5-character string in the local domain (.master in this case), and tracks the server MAC addresses. The SeaMicro chassis is only aware of the MAC addresses of the servers. On top of this, we needed to have the disk available to the compute node prior to deploying nova-compute onto it.

So, how to add the disk to the compute nodes? Well, first we needed to know which machines we were talking about. Juju can add an empty machine, although “juju add-machine” doesn’t have a “-n” flag, so type it 20 times and wait for the machines to boot. While waiting, get the hostnames of the last 20 machines from Juju. Then go over to MaaS’s web interface (which can only show 50 machines at a time), search for the random 5-character string for each of the 20 servers, and make note of the MAC address. Then go over to the SeaMicro command line and issue “show server summary | include <MAC>” to get the server number in the SeaMicro chassis. It’s annoyingly time-consuming, and if you end up destroying and rebuilding the Juju environment or even the compute nodes, you have to do it all over again, since MaaS randomly assigns servers to Juju. Ugh. (A rough script for the Juju-to-MaaS half of this lookup is below.)
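
Here’s a rough sketch of scripting the Juju-to-MaaS half of that lookup. It assumes a logged-in MaaS CLI profile named “maas” and the jq JSON tool, and the field names in the MaaS output may vary between releases; the final hop (matching MACs to server numbers) still happens on the SeaMicro CLI.

#!/bin/bash
# Map each Juju machine's MaaS hostname to its first MAC address so it can be
# matched against "show server summary" on the SeaMicro chassis.
juju status | awk '/dns-name:/ {print $2}' | sort -u | while read host; do
  short=${host%%.*}    # strip the .master domain to get the 5-character name
  mac=$(maas maas nodes list | \
    jq -r --arg h "$short" '.[] | select(.hostname | startswith($h)) | .macaddress_set[0].mac_address')
  echo "$host $mac"
done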

As a side note, since this was a fairly substantial portion of the time spent getting the initial install up and running, we reached out to AMD about these issues. They’re already on the problem, and are working with Canonical to further integrate the SeaMicro’s REST API with MaaS so the MaaS-assigned machine names match the server IDs in the chassis itself, and to expose attached disk resources to MaaS so they can be used as Juju constraints when assigning new nodes for a particular function. For example, when creating the storage nodes, Juju could be told to pick only machines with attached SSD resources. These two changes would significantly streamline the provisioning process, and make it much easier to determine which compute cards in the chassis are being used by Juju, rather than having to cross-reference them by MAC address in MaaS.

Fortunately, once the server list was generated, attaching the storage on the SeaMicro was easy: “storage assign-range 2/0,4/0,7/0… 1 disk external-disks” and the chassis would automatically assign and attach one of the JBOD drives to each of the listed servers as VDisk 1 (the 2nd disk attached to the server). Since a keypair is already installed on the Juju server, a little shell scripting made it fairly easy to automatically log in to each of the empty nodes and format and mount the newly attached disk on the instance storage path (a sketch of that loop is below). Deployment then works fairly automatically: “juju add-unit nova-compute --to 10” and so on for each of the 20 new machines. After spawning a few test instances, we were left with a working cloud of 21 compute nodes.
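
The disk-prep loop referenced above, as a rough sketch: the machine numbers and the /dev/sdb device name are assumptions (check “juju status” and the kernel log on a node first), and the mount point is nova-compute’s default instance path.

#!/bin/bash
# Format the newly attached JBOD disk on each of the 20 empty machines and
# mount it where nova-compute will keep its ephemeral instance disks.
for m in $(seq 6 25); do
  juju ssh $m 'sudo mkfs.ext4 -q /dev/sdb &&
    sudo mkdir -p /var/lib/nova/instances &&
    sudo mount /dev/sdb /var/lib/nova/instances &&
    echo "/dev/sdb /var/lib/nova/instances ext4 defaults 0 2" | sudo tee -a /etc/fstab'
done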

Storage Services

At this point what we were really missing was persistent volume storage, so instances wouldn’t lose their data when terminated. OpenStack offers a few ways to do this. The basic OpenStack volume service is “Cinder”, which uses pluggable backends, with LVM2 volume groups being the default. Since this exercise was a basic OpenStack proof of concept, we didn’t utilize any of the more advanced storage mechanisms available to OpenStack to start with, instead using the SeaMicro to assign 6 3TB JBOD drives to each of 3 Cinder nodes in an LVM configuration, for a total of ~54TB of non-redundant persistent volume storage (a sketch of the deployment is below). Cinder + LVM has some significant issues in terms of redundancy, but it was easy enough to set up. We created some mid-sized volumes from our Ubuntu Server image, started some instances from the volumes, then tore down the instances and re-created them on different hypervisors. As expected, all our data was still there. Performance wasn’t particularly bad, although we didn’t do much in the way of load testing. For file-I/O-heavy loads and redundancy, there are certainly better ways to approach storage, which we’ll explore in another writeup.
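
A sketch of how the Cinder deployment looks with Juju: the config file name, the assumption that the six JBOD disks show up as /dev/sdb through /dev/sdg, and the charm option names are illustrative, so check them against the cinder charm you’re deploying.

# cinder.cfg -- have the charm build the LVM volume group from the JBOD disks
cinder:
  block-device: "sdb sdc sdd sde sdf sdg"
  overwrite: "true"

# Three Cinder nodes, wired into the rest of the cloud
juju deploy -n 3 --config=cinder.cfg cinder
juju add-relation cinder mysql
juju add-relation cinder rabbitmq-server
juju add-relation cinder keystone
juju add-relation cinder nova-cloud-controller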

At this point, we hadn’t implemented any object storage. This can be done with the Ceph distributed storage system, or with Swift, OpenStack’s own S3-style object store. Since we were using local storage for Glance images and didn’t have a use case for object storage during this proof of concept, we decided to skip this step until we take a more thorough look at OpenStack’s various storage options and subsystems.

Console Juju

OpenStack offers a couple flavors of VNC console services: one proxied directly through the cloud controller (novnc), and a Java viewer (xvpvnc). These are fairly straightforward to set up, involving a console proxy service running on the cloud controller (or any other externally accessible server) and a VNC service running on each compute node. Configuration is just a few lines in /etc/nova/nova.conf on each of these servers. But there’s a caveat: there isn’t a Juju charm or configuration option for the VNC console services. Because the console services have their configuration in a file managed by Juju, any relationship change affecting the nova-cloud-controller or nova-compute service will cause the console configuration to be wiped out on every node. Additionally, the console config on the compute nodes needs to be in place (and the compute service restarted) BEFORE any instances are created on that node. If any instances exist beforehand, they won’t have console access; only new instances will. While this isn’t the end of the world, especially since one assumes the base relationships in Juju wouldn’t change much, it does highlight a potential problem with Juju: if you’re adding custom config that isn’t deployed with the charm, you run the risk of losing it. While we haven’t looked at how difficult custom charms are to write yet, this could clearly be a problem in other areas as well, for example using iSCSI as a Cinder and/or Ceph backend, or using something other than the built-in network backend for Neutron. While there will always be a tradeoff when using service orchestration tools, this does seem like a significant one, since being able to add custom config segments to a managed config file is fairly important.
It seems unlikely to us that large production OpenStack clouds are deployed in this manner. The potential to wipe out large amounts of configuration unexpectedly (or worse, end up with inconsistent configuration where newer compute units have different configs than older ones) is significant.

(Note: the scripts below auto-deploy the VNC console configuration to all compute nodes, inserting it into /etc/nova/nova.conf immediately after the RabbitMQ config.)

deploy_console.sh – Run from MaaS node
#!/bin/bash
# Push the console setup script to every nova-compute unit and run it there.
for i in `juju status nova-compute | grep public-address | awk '{print $2}'`; do
  scp setup_compute_console.sh ubuntu@$i:
  ssh ubuntu@$i 'sudo /bin/bash /home/ubuntu/setup_compute_console.sh'
done

setup_compute_console.sh – Run on each compute node (pushed out by deploy_console.sh)
#!/bin/bash
. /root/.profile
# The compute node's IP on the management bridge, used as the VNC proxy client address
IPADDR=`ifconfig br0 | grep "inet addr:" | cut -d: -f2 | awk '{print $1}'`
echo $IPADDR
# Only add the VNC config if it isn't already present
grep -q "vnc_enabled = true" /etc/nova/nova.conf
isvnc=$?
if [ "$isvnc" == "1" ]; then
  # Find the line of the RabbitMQ config and insert the VNC block right after it
  rline=`grep -n -m 1 rabbit_host /etc/nova/nova.conf | cut -f1 -d:`
  ((rline++))
  echo ${rline}
  sed -i "${rline} i\\
\\
vnc_enabled = true \\
vncserver_listen = 0.0.0.0 \\
vncserver_proxyclient_address = ${IPADDR} \\
novncproxy_base_url=\"http://192.168.243.7:6080/vnc_auto.html\" \\
xvpvncproxy_base_url=\"http://192.168.243.7:6081/console\" \\
" /etc/nova/nova.conf
  apt-get install novnc -y
  initctl restart nova-compute
fi

Parting thoughts

This article is long enough already, but our initial impression is that OpenStack is complicated, though not as bad as it looks. That’s obviously aided by rapid deployment tools, but once the architecture and the way the services interact make sense, most of the mystery is gone. Additionally, if you want a lot of compute resources in a small (and power-efficient) footprint, the SeaMicro 15000 is an incredible solution. Juju/MaaS have some ease-of-use issues with the SeaMicro, but at least some of them are already being addressed by AMD/Canonical.

Since our proof of concept was basically done, we had the option to go a couple of different directions here, the most obvious being an exploration of more advanced, efficient and redundant OpenStack storage. To do that, we’d need to tear down the Juju-based stack and go with something more flexible. Since this isn’t production, there’s no better way to do that than to just install everything from scratch, so stay tuned for that in upcoming articles.