OpenStack - Take 2 - Storage Architecture w/ X-IO HyperISE

Re-architecting things ...  Again.

After getting basic services configured, we ran into a bit of a conundrum where storage was concerned.  

OpenStack has two native storage methods, object and block (served by Swift and Cinder, respectively, in the native OpenStack architecture).  Swift is object-only, while Cinder is essentially a block abstraction layer that sits on top of a backend driver of some sort.  Swift is inherently redundant, replicates itself, and can easily use a variety of disk media, but we didn't have as much of a use case for it as we did for persistent VM volume storage.  

Cinder, on the other hand, has issues.  While it can have multiple API servers, a Cinder volume (when using LVM) is tied to a specific Cinder volume server.  Even though the SeaMicro can shift disks around, Cinder has no concept of migrating a volume between volume servers.  In essence, if a Cinder server goes offline, every volume attached to that server is unavailable until the server is restored or rebuilt, even though we could easily have made the underlying data available to another Cinder server.  That isn't ideal for the architecture we're trying to build.  

Enter Ceph.  Ceph is distributed and redundant (much like Swift), supports both object and block storage, and is a highly available Cinder backend option.  It's also incredibly resource intensive in terms of network and storage I/O.  Like Swift, it has inherent replication, but the differences are quick to spot: Swift is "eventually consistent".  An object written to Swift will be replicated to other locations (depending on the number of defined replicas) at some point in time, but replication isn't a blocking operation.  Ceph is immediately consistent: when data changes on a Ceph object or block, it is replicated immediately, which means write operations hit the network twice.  On top of that, Ceph writes each operation to its journal and then to the actual storage device, a sequential double-write on the same disks.  (btrfs is a copy-on-write filesystem and can avoid the double-write penalty, but it is not considered production ready.)  This means the architecture needs to be considered a good deal more carefully than just randomly assigning disks to our storage servers and creating a large LVM volume on them.
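
For reference, these are the kinds of ceph.conf knobs that drive the behavior described above.  This is a minimal sketch assuming the (default) filestore backend; the values are illustrative, not taken from our deployment:

[global]
# Number of copies kept of every object; each write is replicated to the
# other OSDs before the cluster considers it durable.
osd pool default size = 3
# Minimum number of copies that must be present for I/O to continue.
osd pool default min size = 2

[osd]
# Filestore journal size in MB.  Every write lands in the journal and then
# on the backing disk, which is the double-write penalty discussed above.
osd journal size = 10240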

 

Re-architecting the root disk layout

While testing the SeaMicro and OpenStack, we also had our partners over at X-IO drop off one of their new iSCSI-based HyperISE SSD+disk arrays.  While it isn't technically an ideal fit for Ceph, it performs faster than the external JBOD spindles attached to the SeaMicro, so we decided to include it in the design.  This gave us three tiers of storage:

  1. SeaMicro Internal SSD - ~3TB of full SSD RAID, currently divided into 64 48GB volumes as the server's root disks
  2. X-IO HyperISE - ~16TB usable auto-tiered SSD+disk RAID, with 20Gbps aggregate uplink into the SeaMicro
  3. SeaMicro external JBOD - 60 ~3TB individual drives attached to the SeaMicro via eSATA

The first thing that stuck out is that our decision to use the internal SSD as the root drives for our servers was a dumb one: it's the fastest disk on the system, and we were wasting it on tasks that needed no I/O performance whatsoever.  We decided to keep our bootstrap server on the SSD just to save the trouble of rebuilding it, but we unassigned all of the other servers' storage (something the SeaMicro makes quite easy), then wrote a script to delete the other 63 volumes and recycle them back into the RAID pool.  This left us with 3080GB free in our internal SSD pool for high-performance storage use.

This does leave us with the small problem of not having any root disks for our servers (other than the bootstrap server that we left configured).  Since we don't care about performance here, we're going to carve up 6 of the 3TB JBOD drives into 11 volumes each and assign those to the other servers.  (This leaves 54 JBOD drives unused, which will become important later.)  To do this, we first need to switch the JBOD slots into "volume" mode: storage set mgmt-mode volume slot 2.  We'll then take the first 6 disks (JBOD2/6 - JBOD2/11 on our chassis) and create 11 volumes on each disk, giving us a total of 66 volumes (3 more than we need, but that's fine), and finally assign the new volumes to vdisk 0 on each server (other than our bootstrap server):

 

seasm15k01# storage clear-metadata disk JBOD2/6-81 slot 2
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter 'yes' again if you really want to proceed (yes/no): yes
seasm15k01# storage clear-metadata disk JBOD5/0-75 slot 5 
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter 'yes' again if you really want to proceed (yes/no): yes
seasm15k01# storage create pool 2/rootpool-1 disk JBOD2/6
Pool 2/rootpool-1 created successfully.
seasm15k01# storage create pool 2/rootpool-2 disk JBOD2/7
Pool 2/rootpool-2 created successfully.
seasm15k01# storage create pool 2/rootpool-3 disk JBOD2/8
Pool 2/rootpool-3 created successfully.
seasm15k01# storage create pool 2/rootpool-4 disk JBOD2/9
Pool 2/rootpool-4 created successfully.
seasm15k01# storage create pool 2/rootpool-5 disk JBOD2/10
Pool 2/rootpool-5 created successfully.
seasm15k01# storage create pool 2/rootpool-6 disk JBOD2/11
  Pool 2/rootpool-6 created successfully.
seasm15k01# storage create volume-prefix 2/rootpool-1/rootvol size max\#11 count 11 
***********
seasm15k01# storage create volume-prefix 2/rootpool-2/rootvol size max\#11 count 11 
***********
seasm15k01# storage create volume-prefix 2/rootpool-3/rootvol size max\#11 count 11 
***********
seasm15k01# storage create volume-prefix 2/rootpool-4/rootvol size max\#11 count 11 
***********
seasm15k01# storage create volume-prefix 2/rootpool-5/rootvol size max\#11 count 11 
***********
seasm15k01# storage create volume-prefix 2/rootpool-6/rootvol size max\#11 count 11 
***********
seasm15k01(config)# storage assign-range 0/0-31/0,33/0-63/0 0 volume rootvol uuid 
seasm15k01(config)# end
seasm15k01# show storage assign brief 
*****************************************************************
  server    vdisk     type                    id                   assignment    property  
-------------------------------------------------------------------------------------------
    0/0       0      volume    rootvol(2/rootpool-3/rootvol-10)      active         RW     
    1/0       0      volume     rootvol(2/rootpool-4/rootvol-9)      active         RW     
    2/0       0      volume     rootvol(2/rootpool-4/rootvol-8)      active         RW     
    3/0       0      volume     rootvol(2/rootpool-4/rootvol-7)      active         RW     
    4/0       0      volume     rootvol(2/rootpool-4/rootvol-6)      active         RW     
    5/0       0      volume     rootvol(2/rootpool-4/rootvol-5)      active         RW     
    6/0       0      volume     rootvol(2/rootpool-4/rootvol-4)      active         RW     
    7/0       0      volume     rootvol(2/rootpool-4/rootvol-3)      active         RW     
    8/0       0      volume     rootvol(2/rootpool-4/rootvol-2)      active         RW     
    9/0       0      volume     rootvol(2/rootpool-4/rootvol-1)      active         RW     
   10/0       0      volume     rootvol(2/rootpool-4/rootvol-0)      active         RW     
   11/0       0      volume    rootvol(2/rootpool-1/rootvol-10)      active         RW     
   12/0       0      volume     rootvol(2/rootpool-5/rootvol-9)      active         RW     
   13/0       0      volume     rootvol(2/rootpool-5/rootvol-4)      active         RW     
   14/0       0      volume     rootvol(2/rootpool-5/rootvol-5)      active         RW     
   15/0       0      volume     rootvol(2/rootpool-5/rootvol-6)      active         RW     
   16/0       0      volume     rootvol(2/rootpool-5/rootvol-7)      active         RW     
   17/0       0      volume     rootvol(2/rootpool-5/rootvol-0)      active         RW     
   18/0       0      volume     rootvol(2/rootpool-5/rootvol-1)      active         RW     
   19/0       0      volume     rootvol(2/rootpool-5/rootvol-2)      active         RW     
   20/0       0      volume     rootvol(2/rootpool-5/rootvol-3)      active         RW     
   21/0       0      volume     rootvol(2/rootpool-3/rootvol-1)      active         RW     
   22/0       0      volume     rootvol(2/rootpool-3/rootvol-0)      active         RW     
   23/0       0      volume     rootvol(2/rootpool-3/rootvol-3)      active         RW     
   24/0       0      volume     rootvol(2/rootpool-3/rootvol-2)      active         RW     
   25/0       0      volume     rootvol(2/rootpool-3/rootvol-5)      active         RW     
   26/0       0      volume     rootvol(2/rootpool-3/rootvol-4)      active         RW     
   27/0       0      volume     rootvol(2/rootpool-3/rootvol-7)      active         RW     
   28/0       0      volume     rootvol(2/rootpool-3/rootvol-6)      active         RW     
   29/0       0      volume    rootvol(2/rootpool-2/rootvol-10)      active         RW     
   30/0       0      volume     rootvol(2/rootpool-3/rootvol-8)      active         RW     
   31/0       0      volume    rootvol(2/rootpool-4/rootvol-10)      active         RW     
   32/0       0      volume      RAIDVOL(7/RAIDPOOL/RAIDVOL-0)       active         RW     
   33/0       0      volume    rootvol(2/rootpool-5/rootvol-10)      active         RW     
   34/0       0      volume     rootvol(2/rootpool-6/rootvol-1)      active         RW     
   35/0       0      volume    rootvol(2/rootpool-6/rootvol-10)      active         RW     
   36/0       0      volume     rootvol(2/rootpool-6/rootvol-0)      active         RW     
   37/0       0      volume     rootvol(2/rootpool-3/rootvol-9)      active         RW     
   38/0       0      volume     rootvol(2/rootpool-6/rootvol-2)      active         RW     
   39/0       0      volume     rootvol(2/rootpool-6/rootvol-3)      active         RW     
   40/0       0      volume     rootvol(2/rootpool-6/rootvol-4)      active         RW     
   41/0       0      volume     rootvol(2/rootpool-6/rootvol-5)      active         RW     
   42/0       0      volume     rootvol(2/rootpool-6/rootvol-6)      active         RW     
   43/0       0      volume     rootvol(2/rootpool-6/rootvol-7)      active         RW     
   44/0       0      volume     rootvol(2/rootpool-6/rootvol-8)      active         RW     
   45/0       0      volume     rootvol(2/rootpool-6/rootvol-9)      active         RW     
   46/0       0      volume     rootvol(2/rootpool-2/rootvol-2)      active         RW     
   47/0       0      volume     rootvol(2/rootpool-2/rootvol-3)      active         RW     
   48/0       0      volume     rootvol(2/rootpool-2/rootvol-0)      active         RW     
   49/0       0      volume     rootvol(2/rootpool-2/rootvol-1)      active         RW     
   50/0       0      volume     rootvol(2/rootpool-2/rootvol-6)      active         RW     
   51/0       0      volume     rootvol(2/rootpool-2/rootvol-7)      active         RW     
   52/0       0      volume     rootvol(2/rootpool-2/rootvol-4)      active         RW     
   53/0       0      volume     rootvol(2/rootpool-2/rootvol-5)      active         RW     
   54/0       0      volume     rootvol(2/rootpool-2/rootvol-8)      active         RW     
   55/0       0      volume     rootvol(2/rootpool-2/rootvol-9)      active         RW     
   56/0       0      volume     rootvol(2/rootpool-1/rootvol-6)      active         RW     
   57/0       0      volume     rootvol(2/rootpool-1/rootvol-7)      active         RW     
   58/0       0      volume     rootvol(2/rootpool-1/rootvol-4)      active         RW     
   59/0       0      volume     rootvol(2/rootpool-1/rootvol-5)      active         RW     
   60/0       0      volume     rootvol(2/rootpool-1/rootvol-2)      active         RW     
   61/0       0      volume     rootvol(2/rootpool-1/rootvol-3)      active         RW     
   62/0       0      volume     rootvol(2/rootpool-1/rootvol-0)      active         RW     
   63/0       0      volume     rootvol(2/rootpool-1/rootvol-1)      active         RW     
* 64 entries

 

Once done, we're left with a layout very similar to what we had before, but using the JBOD drives instead.  Because we're running redundant controllers, losing a single JBOD drive costs us at most one controller and 10 compute servers.  (Note: in a production environment, we would either be booting from an iSCSI SAN or have more internal RAID resources on the SeaMicro to insulate against drive failures.  This layout is something of a quirk of our particular environment.)

Of course, since we just wiped out all of our root drives, we need to rebuild the stack.  Again.  We're getting pretty good at this.  The only real difference is that we'll change our DHCP configuration to distribute the 3 controller and 3 storage servers across the 6 JBOD drives (1 controller or storage server and 9-10 compute servers per drive).  To make that work, we'll use the following assignments:

  • controller-0 - Server 0/0 (rootpool-3)
  • controller-1 - Server 1/0 (rootpool-4)
  • controller-2 - Server 11/0 (rootpool-1)
  • storage-0 - Server 12/0 (rootpool-5)
  • storage-1 - Server 29/0 (rootpool-2)
  • storage-2 - Server 34/0 (rootpool-6)

When changing the DHCP config file, we simply swap each compute entry's MAC address with the appropriate controller or storage MAC address and keep the same IP assignments as our previous build; no other changes are necessary.
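
For illustration, a host entry in the ISC dhcpd config before and after the swap might look like this (the MAC addresses and IP below are made up for the example; only the hardware ethernet line actually changes):

# Before: this fixed address belonged to a generic compute node
host compute-12 {
    hardware ethernet 00:22:99:aa:bb:cc;
    fixed-address 10.0.100.112;
}

# After: the same IP now follows controller-2's NIC 0 MAC (server 11/0)
host controller-2 {
    hardware ethernet 00:22:99:dd:ee:ff;
    fixed-address 10.0.100.112;
}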

(On the plus side, getting our OpenStack deployment back to this point was fairly painless by following the previous writeups in this blog.)

With HAProxy, MariaDB+Galera, RabbitMQ, and Keystone re-deployed, we can circle back to how to get the storage component of OpenStack in place.  

 

Network Connectivity to the storage servers

We've already assigned and preseeded our storage servers, but now we need to decide how to configure them.  Because we're settling on a storage backend with replication requirements as well as iSCSI connectivity, we need more than one storage network.  On our SeaMicro, we've already assigned VLAN 100 as our management VLAN (this VLAN is internal to the SeaMicro).  We'll now create VLAN 150 (Client Storage) as the storage network between the clients and the Ceph servers, and VLAN 200 (Storage Backend) as the iSCSI and replication network.  On the storage servers themselves, NIC 0 is already assigned as the management NIC.  We're going to assign NICs 1-3 to the client storage network (3Gbps aggregate throughput per server) and NICs 4-7 to the backend network (4Gbps aggregate throughput per server), which gives the higher-overhead replication and iSCSI traffic more bandwidth.  (Our iSCSI array has already been connected to interfaces TenGig 0/1 and TenGig 7/1 in VLAN 200.)

 

seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# switch system-vlan 150
seasm15k01(config)# switch system-vlan 200
seasm15k01(config)# server id 12/0
seasm15k01(config-id-12/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-12/0)# exit
seasm15k01(config)# server id 29/0
seasm15k01(config-id-29/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-29/0)# exit
seasm15k01(config)# server id 34/0
seasm15k01(config-id-34/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# end
seasm15k01# show vlan
Default Vlan                    : 0   
Number of User Configured Vlans : 3   
Number of Default Vlans         : 1   
Flags : T = Tagged              U = Untagged 
      : I = Incomplete bond state because of difference in the bond member configuration.
      : D = interface configured for untagged traffic drop
      : P = Vlan pass through enabled
Vlan    Port Members                                    
----- -------------------------------------------------------------------------------------------------------------
100     srv 0/0/0      (U ), srv 1/0/0      (U ), srv 17/0/0     (U ), srv 16/0/0     (U ), srv 32/0/0     (U )
        srv 33/0/0     (U ), srv 49/0/0     (U ), srv 48/0/0     (U ), srv 2/0/0      (U ), srv 3/0/0      (U )
        srv 19/0/0     (U ), srv 18/0/0     (U ), srv 34/0/0     (U ), srv 35/0/0     (U ), srv 51/0/0     (U )
        srv 50/0/0     (U ), srv 6/0/0      (U ), srv 7/0/0      (U ), srv 23/0/0     (U ), srv 22/0/0     (U )
        srv 38/0/0     (U ), srv 39/0/0     (U ), srv 55/0/0     (U ), srv 54/0/0     (U ), srv 10/0/0     (U )
        srv 11/0/0     (U ), srv 27/0/0     (U ), srv 26/0/0     (U ), srv 42/0/0     (U ), srv 43/0/0     (U )
        srv 59/0/0     (U ), srv 58/0/0     (U ), srv 14/0/0     (U ), srv 15/0/0     (U ), srv 31/0/0     (U )
        srv 30/0/0     (U ), srv 46/0/0     (U ), srv 47/0/0     (U ), srv 63/0/0     (U ), srv 62/0/0     (U )
        srv 12/0/0     (U ), srv 13/0/0     (U ), srv 29/0/0     (U ), srv 28/0/0     (U ), srv 44/0/0     (U )
        srv 45/0/0     (U ), srv 61/0/0     (U ), srv 60/0/0     (U ), srv 8/0/0      (U ), srv 9/0/0      (U )
        srv 25/0/0     (U ), srv 24/0/0     (U ), srv 40/0/0     (U ), srv 41/0/0     (U ), srv 57/0/0     (U )
        srv 56/0/0     (U ), srv 4/0/0      (U ), srv 5/0/0      (U ), srv 21/0/0     (U ), srv 20/0/0     (U )
        srv 36/0/0     (U ), srv 37/0/0     (U ), srv 53/0/0     (U ), srv 52/0/0     (U )
150     srv 34/0/1     (U ), srv 34/0/2     (U ), srv 34/0/3     (U ), srv 12/0/1     (U ), srv 12/0/2     (U )
        srv 12/0/3     (U ), srv 29/0/1     (U ), srv 29/0/2     (U ), srv 29/0/3     (U )
200     te 0/1         (U ), te 7/1         (U ), srv 34/0/4     (U ), srv 34/0/5     (U ), srv 34/0/6     (U )
        srv 34/0/7     (U ), srv 12/0/4     (U ), srv 12/0/5     (U ), srv 12/0/6     (U ), srv 12/0/7     (U )
        srv 29/0/4     (U ), srv 29/0/5     (U ), srv 29/0/6     (U ), srv 29/0/7     (U )


With our NICs in the correct VLANs, we need to decide how to use them.  Because we're using iSCSI on the backend, we could use MPIO there, which is the typical recommendation for iSCSI.  However, that doesn't help us much with the client-side network or replication.  Since our iSCSI array already presents 4 MPIO targets, we have distinct enough flows to take advantage of LACP, provided it's configured with a layer 3+4 hashing algorithm.  On top of that, an awesome feature of the SeaMicro is automatic LACP between its internal fabric and the server cards.  All we need to do is configure Linux for LACP NIC bonding (mode 4) with the right hash and we're good to go.  Let's start by installing the interface bonding software with "apt-get install ifenslave".

We then add the bonding module to the system:

echo "bonding" >> /etc/modules
modprobe bonding

Then add the following to /etc/network/interfaces:

auto eth1
iface eth1 inet manual
bond-master bond0

auto eth2
iface eth2 inet manual
bond-master bond0

auto eth3
iface eth3 inet manual
bond-master bond0

auto bond0
iface bond0 inet static
address 10.1.2.10
netmask 255.255.255.0
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth1 eth2 eth3
bond-xmit-hash-policy layer3+4

auto eth4
iface eth4 inet manual
bond-master bond1

auto eth5
iface eth5 inet manual
bond-master bond1

auto eth6
iface eth6 inet manual
bond-master bond1

auto eth7
iface eth7 inet manual
bond-master bond1

auto bond1
iface bond1 inet static
address 10.2.3.10
netmask 255.255.255.0
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth4 eth5 eth6 eth7
bond-xmit-hash-policy layer3+4

At this point, the easiest way to get the bonded interfaces active is to just reboot the server.  They should be functional when it restarts.
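
If you want to sanity-check the bond before looking at IP connectivity, the Linux bonding driver exposes its state under /proc.  Something like the following shows the negotiated mode, hash policy, and slave membership; "Bonding Mode" should report IEEE 802.3ad dynamic link aggregation and each slave should show an MII Status of up:

root@storage-0:~# grep -E "Bonding Mode|Transmit Hash Policy|Slave Interface|MII Status" /proc/net/bonding/bond0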

root@storage-0:~# ip addr
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 00:22:99:ec:05:01 brd ff:ff:ff:ff:ff:ff
    inet 10.1.2.10/24 brd 10.1.2.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::222:99ff:feec:501/64 scope link 
       valid_lft forever preferred_lft forever
11: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 00:22:99:ec:05:06 brd ff:ff:ff:ff:ff:ff
    inet 10.2.3.10/8 brd 10.255.255.255 scope global bond1
       valid_lft forever preferred_lft forever
    inet6 fe80::222:99ff:feec:506/64 scope link 
       valid_lft forever preferred_lft forever

 

And on the SeaMicro, they can be seen in its LACP bond info command:

seasm15k01# show lacp info server 12/0 

Server ID 12/0

Bond ID   Slave ID      Slave-State     Actor-state Partner-state VLAN-id     Bond-MAC     
-------------------------------------------------------------------------------------
320            4          bundled        3d        3d         200        00:22:99:ec:05:06
320            5          bundled        3d        3d         200        00:22:99:ec:05:06
320            6          bundled        3d        3d         200        00:22:99:ec:05:06
320            7          bundled        3d        3d         200        00:22:99:ec:05:06
322            1          bundled        3d        3d         150        00:22:99:ec:05:03
322            2          bundled        3d        3d         150        00:22:99:ec:05:03
322            3          bundled        3d        3d         150        00:22:99:ec:05:03
-------------------------------------------------------------------------------------


iSCSI and MPIO

Now that we have a bundled uplink, we can bring up the X-IO ISE array.  Since the X-IO presents 4 targets, we don't need more than one session per target on the storage server side; 4 targets are enough to utilize our full LACP link.  We'll start by installing the required utilities with "apt-get install multipath-tools open-iscsi".

We're not bothering with internal security right now, so we'll leave off CHAP authentication for the iSCSI sessions, which makes them fairly easy to discover and log in to:
 

root@storage-0:~# iscsiadm -m discovery -t st -p 10.1.2.1
10.1.2.1:3260,1 iqn.2004-11.com.x-io:3fe10004-t2
10.1.2.1:3260,1 iqn.2004-11.com.x-io:3fe10004-t1
root@storage-0:~# iscsiadm -m discovery -t st -p 10.1.2.2
10.1.2.2:3260,1 iqn.2004-11.com.x-io:3fe10004-t4
10.1.2.2:3260,1 iqn.2004-11.com.x-io:3fe10004-t3
root@storage-0:~# iscsiadm -m node -L all

Once login has been confirmed and the drives are visible on the system, set iSCSI to automatically connect on start in /etc/iscsi/iscsid.conf: "node.startup = automatic"
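
If you'd rather not edit the file by hand, a one-liner along these lines flips the setting (assuming the stock Ubuntu iscsid.conf, which ships with node.startup = manual):

root@storage-0:~# sed -i 's/^node.startup = manual/node.startup = automatic/' /etc/iscsi/iscsid.conf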

On our system, we now have drives /dev/sdb-e visible in dmesg.  We need to quickly create a basic /etc/multipath.conf file:

 

defaults {
    user_friendly_names     yes
}

blacklist {
    devnode "sda$"
}

blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
}

devices {
    device {
        vendor                  "XIOTECH"
        product                 "ISE3400"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker            tur
        path_selector           "round-robin 0"
        no_path_retry           12
        rr_min_io               1
    }
}


Once the config file is in place, restart multipath with "service multipath-tools restart" and the multipath device should be available for configuration:
 

root@storage-0:~# multipath -ll
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 34:0:0:0 sde 8:64 active ready running
  |- 32:0:0:0 sdb 8:16 active ready running
  |- 33:0:0:0 sdc 8:32 active ready running
  `- 35:0:0:0 sdd 8:48 active ready running
root@storage-0:~# sgdisk /dev/mapper/mpath0 -p
Creating new GPT entries.
Disk /dev/mapper/mpath0: 10848567296 sectors, 5.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 74C6457B-38C6-41A3-8EC6-AC1A70018AC1
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 10848567262
Partitions will be aligned on 2048-sector boundaries
Total free space is 10848567229 sectors (5.1 TiB)


Once this is all confirmed, we'll do the same on the other 2 servers and their iSCSI exported volumes.
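
For example, assuming passwordless SSH from an admin host, a quick loop like this repeats the discovery and login on the other two storage servers (hostnames and portal addresses as used above):

for h in storage-1 storage-2; do
    ssh root@$h "iscsiadm -m discovery -t st -p 10.1.2.1; \
                 iscsiadm -m discovery -t st -p 10.1.2.2; \
                 iscsiadm -m node -L all"
done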
 

 

 

Leveraging MPIO and the SeaMicro Fabric

Because of the SeaMicro's abstraction layer between the server cards and storage on the chassis, a unique ability exists to present the same disk to a server via multiple ASIC paths.  Since we're already using MPIO for the iSCSI connection, it's fairly trivial to increase performance between the storage servers and the SSD-based disk on the SeaMicro chassis.  Vdisk 0 is already in use by the root volume, so we'll start with vdisk 1 and assign our volumes to the servers.  We have a specific use in mind for the SSD volume, which we'll get into in the next article, but for now we're going to create three 500GB volumes and get them attached.
 

seasm15k01# storage create volume-prefix 7/RAIDPOOL/Journal size 500 count 3 
***
seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# storage assign 12/0 1 volume 7/RAIDPOOL/Journal-0 
seasm15k01(config)# storage assign 12/0 2 volume 7/RAIDPOOL/Journal-0 
seasm15k01(config)# storage assign 12/0 3 volume 7/RAIDPOOL/Journal-0 
seasm15k01(config)# storage assign 12/0 5 volume 7/RAIDPOOL/Journal-0 
seasm15k01(config)# storage assign 29/0 5 volume 7/RAIDPOOL/Journal-1 
seasm15k01(config)# storage assign 29/0 3 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 2 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 1 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 34/0 1 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 2 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 3 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 5 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# 


Once the storage assignment is complete, we can move to the storage server and create a quick script to pull the serial number from a drive.  (Note: the SeaMicro appears to present the same UUID for all volumes, so we cannot use UUID blacklisting in this case; instead we're blacklisting "devnode sda$" in the multipath config.)

/root/getDiskSerialNum:

#!/bin/bash

/usr/bin/sginfo -s $1 | cut -d\' -f2 | tr -d '\n'
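
The sginfo utility comes from the sg3-utils package (an assumption based on a stock Ubuntu install), and the script has to be executable before multipath can use it as a callout.  A quick check that it returns something sane:

apt-get install sg3-utils
chmod +x /root/getDiskSerialNum
/root/getDiskSerialNum /dev/sdb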


We can use the serial number returned by the script to identify the multipath-presented disk, and then modify our config to whitelist it in /etc/multipath.conf:
 

defaults {
    user_friendly_names     yes
}

blacklist {
    devnode "sda$"
    device {
        vendor "ATA"
        product "*"
    }
}

blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
    device {
        vendor "ATA"
        product "SMvDt6S3HSgAOfPp"
    }
}

devices {
    device {
        vendor                  "XIOTECH"
        product                 "ISE3400"
        path_grouping_policy    multibus
        getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker            tur
        path_selector           "round-robin 0"
        no_path_retry           12
        rr_min_io               1
    }
    device {
        vendor                  "ATA"
        user_friendly_names     yes
        rr_min_io               1
        no_path_retry           queue
        rr_weight               uniform
        path_grouping_policy    group_by_serial=1
        getuid_callout          "/root/getDiskSerialNum /dev/%n"
    }
}

Checking our active multipath links now shows both the iSCSI multipath and the direct-attached SSD multipath devices available:

root@storage-0:~# multipath -ll
mpath1 (35000c5001feb99f0) dm-2 ATA,SMvDt6S3HSgAOfPp
size=500G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 4:0:0:0  sdb 8:16  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:0  sdc 8:32  active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 12:0:0:0 sdd 8:48  active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
  `- 20:0:0:0 sde 8:64  active ready running
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 34:0:0:0 sdi 8:128 active ready running
  |- 33:0:0:0 sdf 8:80  active ready running
  |- 32:0:0:0 sdg 8:96  active ready running
  `- 35:0:0:0 sdh 8:112 active ready running

This leaves us with two high speed volumes available.

 

Just a Bunch Of Disks

The last piece of our storage architecture is slower but high-capacity spindle storage.  We left most of the JBOD disks unallocated on the SeaMicro chassis; now we're going to create full-disk volumes out of them and assign 18 to each of the storage servers.  A quirk of the SeaMicro: pools cannot span multiple disks unless they are in a RAID configuration, so we end up needing to create 54 JBOD pools first and then a single volume in each pool.  Fortunately this process is fairly easy to script, as sketched below.
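
A rough sketch of that script follows.  It only prints SeaMicro CLI commands to paste into the chassis console, following the same create pool / create volume-prefix pattern we used for the root volumes; the disk ID list is a placeholder and the exact size syntax should be verified against your chassis documentation before use:

#!/bin/bash
# Emit "storage create" commands for each remaining JBOD disk.
# Fill DISKS in with the real unused disk IDs on your chassis,
# e.g. "JBOD2/12 JBOD2/13 ... JBOD5/0 ...".
DISKS="JBOD2/12 JBOD2/13 JBOD5/0"
i=1
for d in $DISKS; do
    slot=${d:4:1}    # slot number, the digit after "JBOD"
    echo "storage create pool $slot/jbodpool-$i disk $d"
    # The chassis appends its own index to volume-prefix names.
    echo "storage create volume-prefix $slot/jbodpool-$i/jbodvol size max\\#1 count 1"
    i=$((i+1))
done

Once the pools and volumes are created, show storage volume brief reports the resulting layout: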

seasm15k01# show storage volume brief 
A = Assigned, U = Unassigned, L = Linear, S = Stripe
  slot       pool name      volume name     prov. size     actual size    attr  
--------------------------------------------------------------------------------
    2       jbodpool-1       jbodvol-1          2794GB       2794.00GB     AL   
    2       jbodpool-2       jbodvol-2          2794GB       2794.00GB     AL   
    2       jbodpool-3       jbodvol-3          2794GB       2794.00GB     AL   
    2       jbodpool-4       jbodvol-4          2794GB       2794.00GB     AL   
    2       jbodpool-5       jbodvol-5          2794GB       2794.00GB     AL   
    2       jbodpool-6       jbodvol-6          2794GB       2794.00GB     AL   
    2       jbodpool-7       jbodvol-7          2794GB       2794.00GB     AL   
    2       jbodpool-8       jbodvol-8          2794GB       2794.00GB     AL   
    2       jbodpool-9       jbodvol-9          2794GB       2794.00GB     AL   
    2       jbodpool-10     jbodvol-10          2794GB       2794.00GB     AL   
    2       jbodpool-11     jbodvol-11          2794GB       2794.00GB     AL   
    2       jbodpool-12     jbodvol-12          2794GB       2794.00GB     AL   
    2       jbodpool-13     jbodvol-13          2794GB       2794.00GB     AL   
    2       jbodpool-14     jbodvol-14          2794GB       2794.00GB     AL   
    2       jbodpool-15     jbodvol-15          2794GB       2794.00GB     AL   
    2       jbodpool-16     jbodvol-16          2794GB       2794.00GB     AL   
    2       jbodpool-17     jbodvol-17          2794GB       2794.00GB     AL   
    2       jbodpool-18     jbodvol-18          2794GB       2794.00GB     AL   
    2       jbodpool-19     jbodvol-19          2794GB       2794.00GB     AL   
    2       jbodpool-20     jbodvol-20          2794GB       2794.00GB     AL   
    2       jbodpool-21     jbodvol-21          2794GB       2794.00GB     AL   
    2       jbodpool-22     jbodvol-22          2794GB       2794.00GB     AL   
    2       jbodpool-23     jbodvol-23          2794GB       2794.00GB     AL   
    2       jbodpool-24     jbodvol-24          2794GB       2794.00GB     AL   
    2       rootpool-1       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-1       rootvol-8           254GB        254.00GB     UL   
    2       rootpool-1       rootvol-9           254GB        254.00GB     UL   
    2       rootpool-1      rootvol-10           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-8           254GB        254.00GB     AL   
    2       rootpool-2       rootvol-9           254GB        254.00GB     AL   
    2       rootpool-2      rootvol-10           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-8           254GB        254.00GB     AL   
    2       rootpool-3       rootvol-9           254GB        254.00GB     AL   
    2       rootpool-3      rootvol-10           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-8           254GB        254.00GB     AL   
    2       rootpool-4       rootvol-9           254GB        254.00GB     AL   
    2       rootpool-4      rootvol-10           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-5       rootvol-8           254GB        254.00GB     UL   
    2       rootpool-5       rootvol-9           254GB        254.00GB     AL   
    2       rootpool-5      rootvol-10           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-0           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-1           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-2           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-3           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-4           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-5           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-6           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-7           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-8           254GB        254.00GB     AL   
    2       rootpool-6       rootvol-9           254GB        254.00GB     AL   
    2       rootpool-6      rootvol-10           254GB        254.00GB     AL   
    5       jbodpool-25     jbodvol-25          2794GB       2794.00GB     AL   
    5       jbodpool-26     jbodvol-26          2794GB       2794.00GB     AL   
    5       jbodpool-27     jbodvol-27          2794GB       2794.00GB     AL   
    5       jbodpool-28     jbodvol-28          2794GB       2794.00GB     AL   
    5       jbodpool-29     jbodvol-29          2794GB       2794.00GB     AL   
    5       jbodpool-30     jbodvol-30          2794GB       2794.00GB     AL   
    5       jbodpool-31     jbodvol-31          2794GB       2794.00GB     AL   
    5       jbodpool-32     jbodvol-32          2794GB       2794.00GB     AL   
    5       jbodpool-33     jbodvol-33          2794GB       2794.00GB     AL   
    5       jbodpool-34     jbodvol-34          2794GB       2794.00GB     AL   
    5       jbodpool-35     jbodvol-35          2794GB       2794.00GB     AL   
    5       jbodpool-36     jbodvol-36          2794GB       2794.00GB     AL   
    5       jbodpool-37     jbodvol-37          2794GB       2794.00GB     AL   
    5       jbodpool-38     jbodvol-38          2794GB       2794.00GB     AL   
    5       jbodpool-39     jbodvol-39          2794GB       2794.00GB     AL   
    5       jbodpool-40     jbodvol-40          2794GB       2794.00GB     AL   
    5       jbodpool-41     jbodvol-41          2794GB       2794.00GB     AL   
    5       jbodpool-42     jbodvol-42          2794GB       2794.00GB     AL   
    5       jbodpool-43     jbodvol-43          2794GB       2794.00GB     AL   
    5       jbodpool-44     jbodvol-44          2794GB       2794.00GB     AL   
    5       jbodpool-45     jbodvol-45          2794GB       2794.00GB     AL   
    5       jbodpool-46     jbodvol-46          2794GB       2794.00GB     AL   
    5       jbodpool-47     jbodvol-47          2794GB       2794.00GB     AL   
    5       jbodpool-48     jbodvol-48          2794GB       2794.00GB     AL   
    5       jbodpool-49     jbodvol-49          2794GB       2794.00GB     AL   
    5       jbodpool-50     jbodvol-50          2794GB       2794.00GB     AL   
    5       jbodpool-51     jbodvol-51          2794GB       2794.00GB     AL   
    5       jbodpool-52     jbodvol-52          2794GB       2794.00GB     AL   
    5       jbodpool-53     jbodvol-53          2794GB       2794.00GB     AL   
    5       jbodpool-54     jbodvol-54          2794GB       2794.00GB     AL   
    7        RAIDPOOL        Journal-0           500GB        500.00GB     AL   
    7        RAIDPOOL        Journal-1           500GB        500.00GB     AL   
    7        RAIDPOOL        Journal-2           500GB        500.00GB     AL   
    7        RAIDPOOL        RAIDVOL-0            48GB         48.00GB     AL   
* 124 entries


Once that's done, we can assign the disks from these pools to our storage servers with a single command:

seasm15k01(config)# storage assign-range 12/0,29/0,34/0 4,6-22 volume jbodvol uuid 


Now on our three storage servers, we have the following drives available:

root@storage-0:~# cat /proc/partitions 
major minor  #blocks  name

   8        0  266338304 sda
   8        1  232951808 sda1
   8        2          1 sda2
   8        5   33383424 sda5
   8       16 2929721344 sdb
   8       32 2929721344 sdc
   8       48  524288000 sdd
   8       64 2929721344 sde
   8       80 2929721344 sdf
   8       96  524288000 sdg
   8      112 2929721344 sdh
   8      128 2929721344 sdi
   8      144  524288000 sdj
   8      160 2929721344 sdk
   8      176 2929721344 sdl
   8      192 2929721344 sdm
   8      208 2929721344 sdn
   8      224 2929721344 sdo
   8      240  524288000 sdp
  65        0 2929721344 sdq
  65       16 2929721344 sdr
  65       32 2929721344 sds
  65       48 2929721344 sdt
  65       64 2929721344 sdu
  65       80 2929721344 sdv
  65       96 2929721344 sdw
 252        0  524288000 dm-0
  65      144 5424283648 sdz
  65      160 5424283648 sdaa
  65      112 5424283648 sdx
  65      128 5424283648 sdy
 252        1 5424283648 dm-1



You can see our partitioned root drive on sda, the directly attached SSD at dm-0, and the iSCSI target at dm-1.  The rest of the available devices are the individual JBOD drives.

Now we're ready to actually do something with all of this disk.

Previous: OpenStack - Take 2 - The Keystone Identity Service
