
OpenStack – Take 2 – Storage Architecture w/ X-IO HyperISE

Re-architecting things…again.

After getting basic services configured, we ran into a bit of a conundrum where storage was concerned.

OpenStack has two storage methods, block and object (served by Cinder and Swift, respectively, in the native OpenStack architecture). Swift is object-only, and Cinder is basically a block abstraction layer that sits over a backend driver of some sort. Swift is inherently redundant, replicates itself, and can easily use a variety of disk media, but we didn't have as much of a use case for it as we did for volume storage for persistent VM volumes.

Cinder, on the other hand, has issues. While it can have multiple API servers, a cinder volume (when using LVM) is basically tied to a specific cinder volume server. Even though the SeaMicro has the capability of shifting disks around, cinder doesn't have any concept of migration. What this means in essence is that if a cinder server goes offline, all volumes attached to that server are unavailable until it is restored or rebuilt, even though we could have easily made the data available to another cinder server. This isn't ideal for the architecture we're trying to build.

Enter Ceph. Ceph is distributed and redundant (much like Swift), supports both object and block storage, and is a redundant, highly available cinder backend option. It's also incredibly resource intensive in terms of network and storage I/O. Like Swift, it has inherent replication, but the differences are quick to spot: Swift is "eventually consistent." An object written to Swift will be replicated to other locations (depending on the number of defined replicas) at some point in time, but replication isn't a blocking operation. Ceph is immediately consistent: when data changes on a Ceph object or block, it is replicated immediately, which means you take a double-hit on the network for write operations. On top of that, Ceph writes everything twice on each storage device: first to its journal, then to the actual data store. (btrfs is a copy-on-write filesystem and can avoid the double-write penalty, but it is not considered production ready.) This means the architecture needs to be considered a good deal more carefully than just randomly assigning disks to our storage servers and creating a large LVM volume on them.
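
To put rough numbers on that (an illustration, not a benchmark): with a replication size of 2 and the journal on the same device as the data, a client stream writing 100 MB/s means the primary OSD receives 100 MB/s and forwards another 100 MB/s to its replica, and each of the two OSDs writes the data twice (journal, then data). That single 100 MB/s client stream therefore turns into roughly 200 MB/s of traffic across the storage networks and about 400 MB/s of aggregate disk writes, which is why the disk and network layout matters so much here.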

Re-architecting the root disk layout

While testing the SeaMicro and OpenStack, we also had our partners over at X-IO drop by with one of their new iSCSI-based HyperISE SSD+disk arrays.  While it isn't technically appropriate for Ceph, it would perform faster than the external JBOD spindles attached to the SeaMicro, so we decided to include it in the design.  This gave us three tiers of storage:

  1. SeaMicro Internal SSD – ~3TB of full SSD RAID, currently divided into 64 48GB volumes as the server’s root disks
  2. X-IO HyperISE – ~16TB usable auto-tiered SSD+disk RAID, with 20Gbps aggregate uplink into the SeaMicro
  3. SeaMicro external JBOD – 60 ~3TB individual drives attached to the SeaMicro via eSATA

The first thing that stuck out is that our decision to use the internal SSD as the root drives for our servers was a dumb one: that's the fastest disk in the system, and we were wasting it on tasks that needed no I/O performance whatsoever.  We decided to keep our bootstrap server on the SSD just to save the trouble of rebuilding it, but we unassigned all the other servers' storage (something the SeaMicro makes quite easy), then wrote a script to delete the other 63 volumes and recycle them back into the RAID pool (sketched below).  This left us with 3080GB free in our internal SSD pool for high performance storage use.
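
That cleanup script didn't need to do anything clever; it just generated one CLI line per volume to paste into the chassis. A minimal sketch, assuming the internal SSD volumes are named 7/RAIDPOOL/RAIDVOL-0 through -63 (RAIDVOL-0, still assigned to the bootstrap server, shows up in the listings later in this post) and that a "storage delete volume" command exists on your firmware; verify the exact syntax against your chassis CLI before pasting anything:

#!/bin/bash
# Emit SeaMicro CLI delete commands for internal SSD root volumes 1-63.
# RAIDVOL-0 is skipped: it is still the bootstrap server's root disk.
# NOTE: "storage delete volume" is assumed syntax -- check your CLI help first.
for i in $(seq 1 63); do
    echo "storage delete volume 7/RAIDPOOL/RAIDVOL-${i}"
done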

This does leave us with the small problem of not having any root disks for our servers (other than the bootstrap server that we left configured).  Since we don't care about performance, we're going to just carve 6 of the 3TB JBOD drives into 11 volumes each and assign those to the other servers.  (This leaves us with 54 JBOD drives unused, which will become important later.)  To do this, we need to switch the JBOD slots into “volume” mode first: storage set mgmt-mode volume slot 2.  We'll then take the first 6 disks (JBOD2/6 – JBOD2/11 on our chassis) and create 11 volumes on each disk, giving us a total of 66 volumes (3 more than we need, but that's fine), and finally assign the new volumes to vdisk 0 on each server (other than our bootstrap server):


seasm15k01# storage clear-metadata disk JBOD2/6-81 slot 2
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter ‘yes’ again if you really want to proceed (yes/no): yes
seasm15k01# storage clear-metadata disk JBOD5/0-75 slot 5
All data on the specified disk(s) will be lost with this operation
Are you sure you want to proceed (yes/no): yes
Please enter ‘yes’ again if you really want to proceed (yes/no): yes
seasm15k01# storage create pool 2/rootpool-1 disk JBOD2/6
Pool 2/rootpool-1 created successfully.
seasm15k01# storage create pool 2/rootpool-2 disk JBOD2/7
Pool 2/rootpool-2 created successfully.
seasm15k01# storage create pool 2/rootpool-3 disk JBOD2/8
Pool 2/rootpool-3 created successfully.
seasm15k01# storage create pool 2/rootpool-4 disk JBOD2/9
Pool 2/rootpool-4 created successfully.
seasm15k01# storage create pool 2/rootpool-5 disk JBOD2/10
Pool 2/rootpool-5 created successfully.
seasm15k01# storage create pool 2/rootpool-6 disk JBOD2/11
Pool 2/rootpool-6 created successfully.
seasm15k01# storage create volume-prefix 2/rootpool-1/rootvol size max#11 count 11
***********
seasm15k01# storage create volume-prefix 2/rootpool-2/rootvol size max#11 count 11
***********
seasm15k01# storage create volume-prefix 2/rootpool-3/rootvol size max#11 count 11
***********
seasm15k01# storage create volume-prefix 2/rootpool-4/rootvol size max#11 count 11
***********
seasm15k01# storage create volume-prefix 2/rootpool-5/rootvol size max#11 count 11
***********
seasm15k01# storage create volume-prefix 2/rootpool-6/rootvol size max#11 count 11
***********
seasm15k01(config)# storage assign-range 0/0-31/0,33/0-63/0 0 volume rootvol uuid
seasm15k01(config)# end
seasm15k01# show storage assign brief
*****************************************************************
server vdisk type id assignment property
——————————————————————————————-
0/0 0 volume rootvol(2/rootpool-3/rootvol-10) active RW
1/0 0 volume rootvol(2/rootpool-4/rootvol-9) active RW
2/0 0 volume rootvol(2/rootpool-4/rootvol-8) active RW
3/0 0 volume rootvol(2/rootpool-4/rootvol-7) active RW
4/0 0 volume rootvol(2/rootpool-4/rootvol-6) active RW
5/0 0 volume rootvol(2/rootpool-4/rootvol-5) active RW
6/0 0 volume rootvol(2/rootpool-4/rootvol-4) active RW
7/0 0 volume rootvol(2/rootpool-4/rootvol-3) active RW
8/0 0 volume rootvol(2/rootpool-4/rootvol-2) active RW
9/0 0 volume rootvol(2/rootpool-4/rootvol-1) active RW
10/0 0 volume rootvol(2/rootpool-4/rootvol-0) active RW
11/0 0 volume rootvol(2/rootpool-1/rootvol-10) active RW
12/0 0 volume rootvol(2/rootpool-5/rootvol-9) active RW
13/0 0 volume rootvol(2/rootpool-5/rootvol-4) active RW
14/0 0 volume rootvol(2/rootpool-5/rootvol-5) active RW
15/0 0 volume rootvol(2/rootpool-5/rootvol-6) active RW
16/0 0 volume rootvol(2/rootpool-5/rootvol-7) active RW
17/0 0 volume rootvol(2/rootpool-5/rootvol-0) active RW
18/0 0 volume rootvol(2/rootpool-5/rootvol-1) active RW
19/0 0 volume rootvol(2/rootpool-5/rootvol-2) active RW
20/0 0 volume rootvol(2/rootpool-5/rootvol-3) active RW
21/0 0 volume rootvol(2/rootpool-3/rootvol-1) active RW
22/0 0 volume rootvol(2/rootpool-3/rootvol-0) active RW
23/0 0 volume rootvol(2/rootpool-3/rootvol-3) active RW
24/0 0 volume rootvol(2/rootpool-3/rootvol-2) active RW
25/0 0 volume rootvol(2/rootpool-3/rootvol-5) active RW
26/0 0 volume rootvol(2/rootpool-3/rootvol-4) active RW
27/0 0 volume rootvol(2/rootpool-3/rootvol-7) active RW
28/0 0 volume rootvol(2/rootpool-3/rootvol-6) active RW
29/0 0 volume rootvol(2/rootpool-2/rootvol-10) active RW
30/0 0 volume rootvol(2/rootpool-3/rootvol-8) active RW
31/0 0 volume rootvol(2/rootpool-4/rootvol-10) active RW
32/0 0 volume RAIDVOL(7/RAIDPOOL/RAIDVOL-0) active RW
33/0 0 volume rootvol(2/rootpool-5/rootvol-10) active RW
34/0 0 volume rootvol(2/rootpool-6/rootvol-1) active RW
35/0 0 volume rootvol(2/rootpool-6/rootvol-10) active RW
36/0 0 volume rootvol(2/rootpool-6/rootvol-0) active RW
37/0 0 volume rootvol(2/rootpool-3/rootvol-9) active RW
38/0 0 volume rootvol(2/rootpool-6/rootvol-2) active RW
39/0 0 volume rootvol(2/rootpool-6/rootvol-3) active RW
40/0 0 volume rootvol(2/rootpool-6/rootvol-4) active RW
41/0 0 volume rootvol(2/rootpool-6/rootvol-5) active RW
42/0 0 volume rootvol(2/rootpool-6/rootvol-6) active RW
43/0 0 volume rootvol(2/rootpool-6/rootvol-7) active RW
44/0 0 volume rootvol(2/rootpool-6/rootvol-8) active RW
45/0 0 volume rootvol(2/rootpool-6/rootvol-9) active RW
46/0 0 volume rootvol(2/rootpool-2/rootvol-2) active RW
47/0 0 volume rootvol(2/rootpool-2/rootvol-3) active RW
48/0 0 volume rootvol(2/rootpool-2/rootvol-0) active RW
49/0 0 volume rootvol(2/rootpool-2/rootvol-1) active RW
50/0 0 volume rootvol(2/rootpool-2/rootvol-6) active RW
51/0 0 volume rootvol(2/rootpool-2/rootvol-7) active RW
52/0 0 volume rootvol(2/rootpool-2/rootvol-4) active RW
53/0 0 volume rootvol(2/rootpool-2/rootvol-5) active RW
54/0 0 volume rootvol(2/rootpool-2/rootvol-8) active RW
55/0 0 volume rootvol(2/rootpool-2/rootvol-9) active RW
56/0 0 volume rootvol(2/rootpool-1/rootvol-6) active RW
57/0 0 volume rootvol(2/rootpool-1/rootvol-7) active RW
58/0 0 volume rootvol(2/rootpool-1/rootvol-4) active RW
59/0 0 volume rootvol(2/rootpool-1/rootvol-5) active RW
60/0 0 volume rootvol(2/rootpool-1/rootvol-2) active RW
61/0 0 volume rootvol(2/rootpool-1/rootvol-3) active RW
62/0 0 volume rootvol(2/rootpool-1/rootvol-0) active RW
63/0 0 volume rootvol(2/rootpool-1/rootvol-1) active RW
* 64 entries

Once done, we're left with a very similar layout to what we had before, but using the JBOD drives instead. Because we're running redundant controllers, losing a single JBOD drive costs us at most one controller and 10 compute servers. (Note: in a production environment, we would either be booting from an iSCSI SAN, or have more internal RAID resources on the SeaMicro to insulate against drive failures. This layout is something of a quirk of our particular environment.)

Of course, since we just wiped out all of our root drives, we need to rebuild the stack. Again. We’re getting pretty good at this. The only real difference is that we’ll change our DHCP configuration to distribute the 3 controller and 3 storage servers across the 6 JBOD drives (1 controller/storage and 9-10 compute resources per drive). To make that work, we’ll use the following assignments:

  • controller-0 – Server 0/0 (rootpool-3)
  • controller-1 – Server 1/0 (rootpool-4)
  • controller-2 – Server 11/0 (rootpool-1)
  • storage-0 – Server 12/0 (rootpool-5)
  • storage-1 – Server 29/0 (rootpool-2)
  • storage-2 – Server 34/0 (rootpool-6)

When changing the DHCP config file, we'll simply swap each compute entry's MAC address with the appropriate controller or storage MAC address, keeping the same IP assignments as our previous build; no other changes are necessary.
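
For illustration (assuming the ISC dhcpd setup from the earlier writeups; the MAC, IP, and filename below are placeholders for whatever your existing entries already contain), the reworked controller-0 entry ends up looking something like this:

host controller-0 {
    hardware ethernet 00:22:99:aa:bb:cc;   # placeholder: MAC of server 0/0's first NIC
    fixed-address 10.0.100.20;             # placeholder: reuse the IP from the old compute entry
    option host-name "controller-0";
    filename "pxelinux.0";                 # unchanged from the existing preseed/PXE entries
}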

(On the plus side, getting our OpenStack deployment back to this point was fairly painless by following the previous writeups in this blog.)

With HAProxy, MariaDB+Galera, RabbitMQ and Keystone re-deployed, we can circle back to how to get the storage component of OpenStack in place.

Network Connectivity to the storage servers

We've already assigned and preseeded our storage servers, but now we need to decide how to configure them. Because we're settling on a storage backend with replication requirements as well as iSCSI connectivity, we need more than one storage network available. On our SeaMicro, we've already assigned VLAN 100 as our management VLAN (this VLAN is internal to the SeaMicro). We'll now create VLAN 150 (Client Storage) as the storage network between the clients and the Ceph servers, as well as VLAN 200 (Storage Backend) as the iSCSI and replication network. On the storage servers themselves, we've already assigned NIC 0 as the management NIC. We're going to assign NICs 1-3 to the client storage network (3Gbps aggregate throughput per server) and NICs 4-7 to the backend network (4Gbps aggregate throughput per server), giving the higher-overhead replication and iSCSI traffic more bandwidth to work with. (Our iSCSI array has already been connected to interfaces TenGig 0/1 and TenGig 7/1 in VLAN 200.)


seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# switch system-vlan 150
seasm15k01(config)# switch system-vlan 200
seasm15k01(config)# server id 12/0
seasm15k01(config-id-12/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-12/0)# exit
seasm15k01(config)# server id 29/0
seasm15k01(config-id-29/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# exit
seasm15k01(config-id-29/0)# exit
seasm15k01(config)# server id 34/0
seasm15k01(config-id-34/0)# nic 1
seasm15k01(config-nic-1)# untagged-vlan 150
seasm15k01(config-nic-1)# nic 2
seasm15k01(config-nic-2)# untagged-vlan 150
seasm15k01(config-nic-2)# nic 3
seasm15k01(config-nic-3)# untagged-vlan 150
seasm15k01(config-nic-3)# nic 4
seasm15k01(config-nic-4)# untagged-vlan 200
seasm15k01(config-nic-4)# nic 5
seasm15k01(config-nic-5)# untagged-vlan 200
seasm15k01(config-nic-5)# nic 6
seasm15k01(config-nic-6)# untagged-vlan 200
seasm15k01(config-nic-6)# nic 7
seasm15k01(config-nic-7)# untagged-vlan 200
seasm15k01(config-nic-7)# end
seasm15k01# show vlan
Default Vlan : 0
Number of User Configured Vlans : 3
Number of Default Vlans : 1
Flags : T = Tagged U = Untagged
: I = Incomplete bond state because of difference in the bond member configuration.
: D = interface configured for untagged traffic drop
: P = Vlan pass through enabled
Vlan Port Members
—– ————————————————————————————————————-
100 srv 0/0/0 (U ), srv 1/0/0 (U ), srv 17/0/0 (U ), srv 16/0/0 (U ), srv 32/0/0 (U )
srv 33/0/0 (U ), srv 49/0/0 (U ), srv 48/0/0 (U ), srv 2/0/0 (U ), srv 3/0/0 (U )
srv 19/0/0 (U ), srv 18/0/0 (U ), srv 34/0/0 (U ), srv 35/0/0 (U ), srv 51/0/0 (U )
srv 50/0/0 (U ), srv 6/0/0 (U ), srv 7/0/0 (U ), srv 23/0/0 (U ), srv 22/0/0 (U )
srv 38/0/0 (U ), srv 39/0/0 (U ), srv 55/0/0 (U ), srv 54/0/0 (U ), srv 10/0/0 (U )
srv 11/0/0 (U ), srv 27/0/0 (U ), srv 26/0/0 (U ), srv 42/0/0 (U ), srv 43/0/0 (U )
srv 59/0/0 (U ), srv 58/0/0 (U ), srv 14/0/0 (U ), srv 15/0/0 (U ), srv 31/0/0 (U )
srv 30/0/0 (U ), srv 46/0/0 (U ), srv 47/0/0 (U ), srv 63/0/0 (U ), srv 62/0/0 (U )
srv 12/0/0 (U ), srv 13/0/0 (U ), srv 29/0/0 (U ), srv 28/0/0 (U ), srv 44/0/0 (U )
srv 45/0/0 (U ), srv 61/0/0 (U ), srv 60/0/0 (U ), srv 8/0/0 (U ), srv 9/0/0 (U )
srv 25/0/0 (U ), srv 24/0/0 (U ), srv 40/0/0 (U ), srv 41/0/0 (U ), srv 57/0/0 (U )
srv 56/0/0 (U ), srv 4/0/0 (U ), srv 5/0/0 (U ), srv 21/0/0 (U ), srv 20/0/0 (U )
srv 36/0/0 (U ), srv 37/0/0 (U ), srv 53/0/0 (U ), srv 52/0/0 (U )
150 srv 34/0/1 (U ), srv 34/0/2 (U ), srv 34/0/3 (U ), srv 12/0/1 (U ), srv 12/0/2 (U )
srv 12/0/3 (U ), srv 29/0/1 (U ), srv 29/0/2 (U ), srv 29/0/3 (U )
200 te 0/1 (U ), te 7/1 (U ), srv 34/0/4 (U ), srv 34/0/5 (U ), srv 34/0/6 (U )
srv 34/0/7 (U ), srv 12/0/4 (U ), srv 12/0/5 (U ), srv 12/0/6 (U ), srv 12/0/7 (U )
srv 29/0/4 (U ), srv 29/0/5 (U ), srv 29/0/6 (U ), srv 29/0/7 (U )

With our NICs in the correct VLANs, we now need to decide how to use them. Because we're using iSCSI on the backend, we could use MPIO there, which is the typical recommendation for iSCSI. However, that doesn't help us much with the client side network or replication. Since our iSCSI array is presenting 4 MPIO targets already, we have distinct enough flows that we can take advantage of LACP if it's configured with a layer 3+4 hashing algorithm. On top of that, an awesome feature of the SeaMicro is auto-LACP between its internal fabric and the server cards. All we need to do is configure Linux for LACP NIC bonding (mode 4) with the right hash and we're good to go. Let's start by installing the interface bonding software with "apt-get install ifenslave".

We then add the bonding module to the system:


echo "bonding" >> /etc/modules
modprobe bonding

Then add the following to /etc/network/interfaces:


auto eth1
iface eth1 inet manual
bond-master bond0

auto eth2
iface eth2 inet manual
bond-master bond0

auto eth3
iface eth3 inet manual
bond-master bond0

auto bond0
iface bond0 inet static
address 10.1.2.10
netmask 255.255.255.0
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth1 eth2 eth3
bond_xmit_hash_policy layer3+4

auto eth4
iface eth4 inet manual
bond-master bond1

auto eth5
iface eth5 inet manual
bond-master bond1

auto eth6
iface eth6 inet manual
bond-master bond1

auto eth7
iface eth7 inet manual
bond-master bond1

auto bond1
iface bond1 inet static
address 10.2.3.10
netmask 255.255.255.0
bond-mode 4
bond-miimon 100
bond-lacp-rate 0
bond-slaves eth4 eth5 eth6 eth7
bond_xmit_hash_policy layer3+4

At this point, the easiest way to get the bonded interfaces active is to just reboot the server. They should be functional when it restarts.


root@storage-0:~# ip addr
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:22:99:ec:05:01 brd ff:ff:ff:ff:ff:ff
inet 10.1.2.10/24 brd 10.1.2.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::222:99ff:feec:501/64 scope link
valid_lft forever preferred_lft forever
11: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:22:99:ec:05:06 brd ff:ff:ff:ff:ff:ff
inet 10.2.3.10/24 brd 10.2.3.255 scope global bond1
valid_lft forever preferred_lft forever
inet6 fe80::222:99ff:feec:506/64 scope link
valid_lft forever preferred_lft forever
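
On the Linux side, the bonding driver also exposes its state under /proc/net/bonding, which is a quick way to confirm the 802.3ad mode, the layer3+4 hash policy, and that every slave is up and bundled:

cat /proc/net/bonding/bond0
cat /proc/net/bonding/bond1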

And on the SeaMicro, they can be seen in its LACP info command:



seasm15k01# show lacp info server 12/0
Server ID 12/0
Bond ID Slave ID Slave-State Actor-state Partner-state VLAN-id Bond-MAC
————————————————————————————-
320 4 bundled 3d 3d 200 00:22:99:ec:05:06
320 5 bundled 3d 3d 200 00:22:99:ec:05:06
320 6 bundled 3d 3d 200 00:22:99:ec:05:06
320 7 bundled 3d 3d 200 00:22:99:ec:05:06
322 1 bundled 3d 3d 150 00:22:99:ec:05:03
322 2 bundled 3d 3d 150 00:22:99:ec:05:03
322 3 bundled 3d 3d 150 00:22:99:ec:05:03
————————————————————————————-

iSCSI and MPIO

Now that we have a bundled uplink, we can bring up the X-IO ISE array. Since the X-IO presents 4 targets, we don't need more than one session per target on each storage server; 4 paths are enough to utilize our full LACP link. We'll start by installing the required utilities with "apt-get install multipath-tools open-iscsi".

We're not bothering with internal security right now, so we'll leave off any CHAP authentication for the iSCSI sessions, making them fairly easy to discover and log in to:


root@storage-0:~# iscsiadm -m discovery -t st -p 10.1.2.1
10.1.2.1:3260,1 iqn.2004-11.com.x-io:3fe10004-t2
10.1.2.1:3260,1 iqn.2004-11.com.x-io:3fe10004-t1
root@storage-0:~# iscsiadm -m discovery -t st -p 10.1.2.2
10.1.2.2:3260,1 iqn.2004-11.com.x-io:3fe10004-t4
10.1.2.2:3260,1 iqn.2004-11.com.x-io:3fe10004-t3
root@storage-0:~# iscsiadm -m node -L all

Once login has been confirmed and the drives are visible on the system, set iSCSI to automatically connect on start in /etc/iscsi/iscsid.conf:



node.startup = automatic
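
If you'd rather not open an editor, a sed one-liner does the same thing (this assumes the stock Debian/Ubuntu iscsid.conf, which ships with the value set to manual):

sed -i 's/^node.startup = manual/node.startup = automatic/' /etc/iscsi/iscsid.conf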

On our system, we now have drives /dev/sdb-e visible in dmesg. We need to quickly create a basic /etc/multipath.conf file:


defaults {
    user_friendly_names yes
}
blacklist {
    devnode "sda$"
}
blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
}
devices {
    device {
        vendor "XIOTECH"
        product "ISE3400"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker tur
        path_selector "round-robin 0"
        no_path_retry 12
        rr_min_io 1
    }
}

Once the config file is in place, restart multipath with “service multipath-tools restart” and the multipath device should be available for configuration:


root@storage-0:~# multipath -ll
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 34:0:0:0 sde 8:64 active ready running
|- 32:0:0:0 sdb 8:16 active ready running
|- 33:0:0:0 sdc 8:32 active ready running
`- 35:0:0:0 sdd 8:48 active ready running
root@storage-0:~# sgdisk /dev/mapper/mpath0 -p
Creating new GPT entries.
Disk /dev/mapper/mpath0: 10848567296 sectors, 5.1 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 74C6457B-38C6-41A3-8EC6-AC1A70018AC1
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 10848567262
Partitions will be aligned on 2048-sector boundaries
Total free space is 10848567229 sectors (5.1 TiB)

Once this is all confirmed, we’ll do the same on the other 2 servers and their iSCSI exported volumes.
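
Rather than typing it all again, a quick loop handles the other two; this is just a sketch and assumes storage-1 and storage-2 resolve from wherever you run it and that root SSH keys are already distributed:

# Repeat discovery and login on the remaining storage servers
for host in storage-1 storage-2; do
    ssh root@${host} iscsiadm -m discovery -t st -p 10.1.2.1
    ssh root@${host} iscsiadm -m discovery -t st -p 10.1.2.2
    ssh root@${host} iscsiadm -m node -L all
done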

Leveraging MPIO and the SeaMicro Fabric

Because of the SeaMicro's abstraction layer between the server cards and storage on the chassis, a unique ability exists to present the same disk to a server via multiple ASIC paths.  Since we're already using MPIO for the iSCSI connection, it's fairly trivial to increase performance between the storage servers and the SSD-based disk on the SeaMicro chassis. Vdisk 0 is already in use by the root volume, so we'll start with vdisk 1 and assign our volumes to the servers.  We have a specific use in mind for the SSD volume which we'll get into in the next article, but for now we're going to create 3 500GB volumes and get them attached.


seasm15k01# storage create volume-prefix 7/RAIDPOOL/Journal size 500 count 3
***
seasm15k01# conf
Enter configuration commands, one per line. End with CNTL/Z.
seasm15k01(config)# storage assign 12/0 1 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 2 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 3 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 12/0 5 volume 7/RAIDPOOL/Journal-0
seasm15k01(config)# storage assign 29/0 5 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 3 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 2 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 29/0 1 volume 7/RAIDPOOL/Journal-1
seasm15k01(config)# storage assign 34/0 1 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 2 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 3 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)# storage assign 34/0 5 volume 7/RAIDPOOL/Journal-2
seasm15k01(config)#

Once the storage assignment is complete, we can move to the storage server and create a quick script to pull the serial number from the drive.  (Note: the SeaMicro appears to present the same UUID for all volumes, so we cannot use UUID blacklisting in this case; instead we blacklist "devnode sda$" in the multipath config.) /root/getDiskSerialNum:


#!/bin/bash
# Print the bare serial number of the given device (sginfo wraps it in single quotes).
/usr/bin/sginfo -s "$1" | cut -d\' -f2 | tr -d '\n'
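
Make the script executable before multipath tries to call it, and run it by hand against one of the new devices as a sanity check (the device name here is just an example):

chmod +x /root/getDiskSerialNum
/root/getDiskSerialNum /dev/sdb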

We can use the serial number pulled from the above script to determine the serial of the multipath presented disk, and then modify our config to whitelist it in /etc/multipath.conf:


defaults {
    user_friendly_names yes
}
blacklist {
    devnode "sda$"
    device {
        vendor "ATA"
        product "*"
    }
}
blacklist_exceptions {
    device {
        vendor "XIOTECH"
    }
    device {
        vendor "ATA"
        product "SMvDt6S3HSgAOfPp"
    }
}
devices {
    device {
        vendor "XIOTECH"
        product "ISE3400"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        path_checker tur
        path_selector "round-robin 0"
        no_path_retry 12
        rr_min_io 1
    }
    device {
        vendor "ATA"
        user_friendly_names yes
        rr_min_io 1
        no_path_retry queue
        rr_weight uniform
        path_grouping_policy group_by_serial=1
        getuid_callout "/root/getDiskSerialNum /dev/%n"
    }
}

Checking our active multipath links now shows both the iSCSI multipath and the direct-attached SSD multipath devices available:


root@storage-0:~# multipath -ll
mpath1 (35000c5001feb99f0) dm-2 ATA,SMvDt6S3HSgAOfPp
size=500G features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=1 status=active
| `- 4:0:0:0 sdb 8:16 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 8:0:0:0 sdc 8:32 active ready running
|-+- policy='round-robin 0' prio=1 status=enabled
| `- 12:0:0:0 sdd 8:48 active ready running
`-+- policy='round-robin 0' prio=1 status=enabled
`- 20:0:0:0 sde 8:64 active ready running
mpath0 (36001f932004f0000052a000200000000) dm-0 XIOTECH,ISE3400
size=5.1T features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 34:0:0:0 sdi 8:128 active ready running
|- 33:0:0:0 sdf 8:80 active ready running
|- 32:0:0:0 sdg 8:96 active ready running
`- 35:0:0:0 sdh 8:112 active ready running

This leaves us with two high speed volumes available.

Just a Bunch Of Disks

The last piece in our storage architecture is slower but high capacity spindle storage. We left most of the JBOD disks unallocated on the SeaMicro chassis; now we're going to create full-disk volumes out of those and assign 18 of them to each of the storage servers. A quirk of the SeaMicro is that pools cannot span multiple disks unless they are in a RAID configuration, so we end up needing to create 54 JBOD pools first, then assign a single volume to each pool. Fortunately this process is fairly easy to script; a sketch of such a generator is just below, and the resulting volume layout follows it:
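
The sketch only prints the CLI commands so they can be reviewed before pasting into the chassis; the disk ranges and the exact volume-creation syntax are assumptions modeled on the rootpool commands earlier in this post, so adjust both to match what is actually free on your chassis:

#!/bin/bash
# Emit SeaMicro CLI commands: one pool per unused JBOD disk, one full-disk volume per pool.
# The disk ranges below are placeholders -- substitute the disks still free on your chassis,
# and confirm the "size max count 1" form against your CLI before using the output.
n=1
for disk in JBOD2/{12..35}; do                     # remaining slot-2 disks (assumption)
    echo "storage create pool 2/jbodpool-${n} disk ${disk}"
    echo "storage create volume-prefix 2/jbodpool-${n}/jbodvol size max count 1"
    n=$((n+1))
done
for disk in JBOD5/{0..29}; do                      # slot-5 disks (assumption)
    echo "storage create pool 5/jbodpool-${n} disk ${disk}"
    echo "storage create volume-prefix 5/jbodpool-${n}/jbodvol size max count 1"
    n=$((n+1))
done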


seasm15k01# show storage volume brief
*****************************************************************************************************************************************************************************************************************************************************************************************************************************
A = Assigned, U = Unassigned, L = Linear, S = Stripe
slot pool name volume name prov. size actual size attr
——————————————————————————–
2 jbodpool-1 jbodvol-1 2794GB 2794.00GB AL
2 jbodpool-2 jbodvol-2 2794GB 2794.00GB AL
2 jbodpool-3 jbodvol-3 2794GB 2794.00GB AL
2 jbodpool-4 jbodvol-4 2794GB 2794.00GB AL
2 jbodpool-5 jbodvol-5 2794GB 2794.00GB AL
2 jbodpool-6 jbodvol-6 2794GB 2794.00GB AL
2 jbodpool-7 jbodvol-7 2794GB 2794.00GB AL
2 jbodpool-8 jbodvol-8 2794GB 2794.00GB AL
2 jbodpool-9 jbodvol-9 2794GB 2794.00GB AL
2 jbodpool-10 jbodvol-10 2794GB 2794.00GB AL
2 jbodpool-11 jbodvol-11 2794GB 2794.00GB AL
2 jbodpool-12 jbodvol-12 2794GB 2794.00GB AL
2 jbodpool-13 jbodvol-13 2794GB 2794.00GB AL
2 jbodpool-14 jbodvol-14 2794GB 2794.00GB AL
2 jbodpool-15 jbodvol-15 2794GB 2794.00GB AL
2 jbodpool-16 jbodvol-16 2794GB 2794.00GB AL
2 jbodpool-17 jbodvol-17 2794GB 2794.00GB AL
2 jbodpool-18 jbodvol-18 2794GB 2794.00GB AL
2 jbodpool-19 jbodvol-19 2794GB 2794.00GB AL
2 jbodpool-20 jbodvol-20 2794GB 2794.00GB AL
2 jbodpool-21 jbodvol-21 2794GB 2794.00GB AL
2 jbodpool-22 jbodvol-22 2794GB 2794.00GB AL
2 jbodpool-23 jbodvol-23 2794GB 2794.00GB AL
2 jbodpool-24 jbodvol-24 2794GB 2794.00GB AL
2 rootpool-1 rootvol-0 254GB 254.00GB AL
2 rootpool-1 rootvol-1 254GB 254.00GB AL
2 rootpool-1 rootvol-2 254GB 254.00GB AL
2 rootpool-1 rootvol-3 254GB 254.00GB AL
2 rootpool-1 rootvol-4 254GB 254.00GB AL
2 rootpool-1 rootvol-5 254GB 254.00GB AL
2 rootpool-1 rootvol-6 254GB 254.00GB AL
2 rootpool-1 rootvol-7 254GB 254.00GB AL
2 rootpool-1 rootvol-8 254GB 254.00GB UL
2 rootpool-1 rootvol-9 254GB 254.00GB UL
2 rootpool-1 rootvol-10 254GB 254.00GB AL
2 rootpool-2 rootvol-0 254GB 254.00GB AL
2 rootpool-2 rootvol-1 254GB 254.00GB AL
2 rootpool-2 rootvol-2 254GB 254.00GB AL
2 rootpool-2 rootvol-3 254GB 254.00GB AL
2 rootpool-2 rootvol-4 254GB 254.00GB AL
2 rootpool-2 rootvol-5 254GB 254.00GB AL
2 rootpool-2 rootvol-6 254GB 254.00GB AL
2 rootpool-2 rootvol-7 254GB 254.00GB AL
2 rootpool-2 rootvol-8 254GB 254.00GB AL
2 rootpool-2 rootvol-9 254GB 254.00GB AL
2 rootpool-2 rootvol-10 254GB 254.00GB AL
2 rootpool-3 rootvol-0 254GB 254.00GB AL
2 rootpool-3 rootvol-1 254GB 254.00GB AL
2 rootpool-3 rootvol-2 254GB 254.00GB AL
2 rootpool-3 rootvol-3 254GB 254.00GB AL
2 rootpool-3 rootvol-4 254GB 254.00GB AL
2 rootpool-3 rootvol-5 254GB 254.00GB AL
2 rootpool-3 rootvol-6 254GB 254.00GB AL
2 rootpool-3 rootvol-7 254GB 254.00GB AL
2 rootpool-3 rootvol-8 254GB 254.00GB AL
2 rootpool-3 rootvol-9 254GB 254.00GB AL
2 rootpool-3 rootvol-10 254GB 254.00GB AL
2 rootpool-4 rootvol-0 254GB 254.00GB AL
2 rootpool-4 rootvol-1 254GB 254.00GB AL
2 rootpool-4 rootvol-2 254GB 254.00GB AL
2 rootpool-4 rootvol-3 254GB 254.00GB AL
2 rootpool-4 rootvol-4 254GB 254.00GB AL
2 rootpool-4 rootvol-5 254GB 254.00GB AL
2 rootpool-4 rootvol-6 254GB 254.00GB AL
2 rootpool-4 rootvol-7 254GB 254.00GB AL
2 rootpool-4 rootvol-8 254GB 254.00GB AL
2 rootpool-4 rootvol-9 254GB 254.00GB AL
2 rootpool-4 rootvol-10 254GB 254.00GB AL
2 rootpool-5 rootvol-0 254GB 254.00GB AL
2 rootpool-5 rootvol-1 254GB 254.00GB AL
2 rootpool-5 rootvol-2 254GB 254.00GB AL
2 rootpool-5 rootvol-3 254GB 254.00GB AL
2 rootpool-5 rootvol-4 254GB 254.00GB AL
2 rootpool-5 rootvol-5 254GB 254.00GB AL
2 rootpool-5 rootvol-6 254GB 254.00GB AL
2 rootpool-5 rootvol-7 254GB 254.00GB AL
2 rootpool-5 rootvol-8 254GB 254.00GB UL
2 rootpool-5 rootvol-9 254GB 254.00GB AL
2 rootpool-5 rootvol-10 254GB 254.00GB AL
2 rootpool-6 rootvol-0 254GB 254.00GB AL
2 rootpool-6 rootvol-1 254GB 254.00GB AL
2 rootpool-6 rootvol-2 254GB 254.00GB AL
2 rootpool-6 rootvol-3 254GB 254.00GB AL
2 rootpool-6 rootvol-4 254GB 254.00GB AL
2 rootpool-6 rootvol-5 254GB 254.00GB AL
2 rootpool-6 rootvol-6 254GB 254.00GB AL
2 rootpool-6 rootvol-7 254GB 254.00GB AL
2 rootpool-6 rootvol-8 254GB 254.00GB AL
2 rootpool-6 rootvol-9 254GB 254.00GB AL
2 rootpool-6 rootvol-10 254GB 254.00GB AL
5 jbodpool-25 jbodvol-25 2794GB 2794.00GB AL
5 jbodpool-26 jbodvol-26 2794GB 2794.00GB AL
5 jbodpool-27 jbodvol-27 2794GB 2794.00GB AL
5 jbodpool-28 jbodvol-28 2794GB 2794.00GB AL
5 jbodpool-29 jbodvol-29 2794GB 2794.00GB AL
5 jbodpool-30 jbodvol-30 2794GB 2794.00GB AL
5 jbodpool-31 jbodvol-31 2794GB 2794.00GB AL
5 jbodpool-32 jbodvol-32 2794GB 2794.00GB AL
5 jbodpool-33 jbodvol-33 2794GB 2794.00GB AL
5 jbodpool-34 jbodvol-34 2794GB 2794.00GB AL
5 jbodpool-35 jbodvol-35 2794GB 2794.00GB AL
5 jbodpool-36 jbodvol-36 2794GB 2794.00GB AL
5 jbodpool-37 jbodvol-37 2794GB 2794.00GB AL
5 jbodpool-38 jbodvol-38 2794GB 2794.00GB AL
5 jbodpool-39 jbodvol-39 2794GB 2794.00GB AL
5 jbodpool-40 jbodvol-40 2794GB 2794.00GB AL
5 jbodpool-41 jbodvol-41 2794GB 2794.00GB AL
5 jbodpool-42 jbodvol-42 2794GB 2794.00GB AL
5 jbodpool-43 jbodvol-43 2794GB 2794.00GB AL
5 jbodpool-44 jbodvol-44 2794GB 2794.00GB AL
5 jbodpool-45 jbodvol-45 2794GB 2794.00GB AL
5 jbodpool-46 jbodvol-46 2794GB 2794.00GB AL
5 jbodpool-47 jbodvol-47 2794GB 2794.00GB AL
5 jbodpool-48 jbodvol-48 2794GB 2794.00GB AL
5 jbodpool-49 jbodvol-49 2794GB 2794.00GB AL
5 jbodpool-50 jbodvol-50 2794GB 2794.00GB AL
5 jbodpool-51 jbodvol-51 2794GB 2794.00GB AL
5 jbodpool-52 jbodvol-52 2794GB 2794.00GB AL
5 jbodpool-53 jbodvol-53 2794GB 2794.00GB AL
5 jbodpool-54 jbodvol-54 2794GB 2794.00GB AL
7 RAIDPOOL Journal-0 500GB 500.00GB AL
7 RAIDPOOL Journal-1 500GB 500.00GB AL
7 RAIDPOOL Journal-2 500GB 500.00GB AL
7 RAIDPOOL RAIDVOL-0 48GB 48.00GB AL
* 124 entries

Once that’s done, we can assign the disks from these pools to our storage servers with a single command:


seasm15k01(config)# storage assign-range 12/0,29/0,34/0 4,6-22 volume jbodvol uuid
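
Depending on timing, the newly assigned vdisks may not show up until the servers rescan their SCSI bus (or reboot); if any are missing, a rescan via sysfs usually does the trick:

for h in /sys/class/scsi_host/host*/scan; do echo "- - -" > "${h}"; done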

Now on our three storage servers, we have the following drives available:


root@storage-0:~# cat /proc/partitions
major minor #blocks name
8 0 266338304 sda
8 1 232951808 sda1
8 2 1 sda2
8 5 33383424 sda5
8 16 2929721344 sdb
8 32 2929721344 sdc
8 48 524288000 sdd
8 64 2929721344 sde
8 80 2929721344 sdf
8 96 524288000 sdg
8 112 2929721344 sdh
8 128 2929721344 sdi
8 144 524288000 sdj
8 160 2929721344 sdk
8 176 2929721344 sdl
8 192 2929721344 sdm
8 208 2929721344 sdn
8 224 2929721344 sdo
8 240 524288000 sdp
65 0 2929721344 sdq
65 16 2929721344 sdr
65 32 2929721344 sds
65 48 2929721344 sdt
65 64 2929721344 sdu
65 80 2929721344 sdv
65 96 2929721344 sdw
252 0 524288000 dm-0
65 144 5424283648 sdz
65 160 5424283648 sdaa
65 112 5424283648 sdx
65 128 5424283648 sdy
252 1 5424283648 dm-1

You can see our partitioned root drive on sda, the directly attached SSD at dm-0, and the iSCSI target at dm-1. The rest of the visible devices are the individual JBOD drives.

Now we’re ready to actually do something with all of this disk.