using kvm (linux-kvm)

kvm is a great virtual machine mechanism. I really like the small footprint (two modules) and the reuse of the qemu code. It works great out of the box, and with a few other utilities it's very easy to administer. Here I hope to compile some documentation of what we've done so far and what we learned while implementing kvm in our environment. Some of this will hopefully translate to your environment. First I'll mention the tools that go into setting up our environment and the resources we have. Then I'll go over our configurations and some possibly useful caveats.

packages

Just by installing the kvm package, you can start using kvm, but to implement a deployment, you need a few other packages. Here is what we have in production now. We add these packages using puppet. With these packages installed, let me tell you about the environment.
* Local package that we use to carry an updated kvm. When kvm was releasing a new version every week, it was useful to keep the kmod separate.
+ When is subversion not useful?

environment

We arrived at our current environment after testing a few setups that really didn't work very well. The current setup uses fibre channel for storage and blade servers for the hypervisors.

To do live migration you need shared storage. You can do NFS, which is really slow. You could do iSCSI, which is ironically even slower. Or you could use the qemu-nbd network block device server, which is better than the first two but means you're only as strong as the server running the nbd. This is also the case for iSCSI or NFS but those are both proven technologies running on stable servers or netapps if you are lucky. For these reasons, I recommend using fibre channel. We have fibre channel with a fibre channel switch with dual connections for redundancy. I'll cover how to configure that later with multipath.

We are using blades as the hypervisors. This has a few key advantages. The first is that latency between hypervisors is minimal. It depends on your manufacturer, but most blades feature very good bandwidth between blades of the same chassis using virtual ethernet devices. The other advantage is that, as far as the rest of the network is concerned, it doesn't matter whether a vm is living on blade 1 or blade 10. Nothing has to change with routing to get to the vm; it's still the same destination (assuming you don't have multiple switches in your blade chassis).

configuration for live migration

To allow for live migration, the disk devices must appear to be the same on all the hypervisors. As we are using fibre channel, all of our luns will appear as separate scsi devices, but the names will probably be different: what is sdb on machine x might be sdf on machine y. Also, since our fibre channel has redundant connections to the switch, each lun will appear twice on each server. To ensure the devices appear the same on all servers, we implemented multipath.

Using multipath is a critical step, and it's very straightforward to implement. When setting up a new virtual machine we carve out part of the fibre channel array for the device. Our array returns the id of the new lun when it is created; we make note of that id to make sure we are talking about the right slice of the disk.

First, we configure multipath to look at the scsi devices. By default everything is blacklisted, so comment out the lines that look like this in /etc/multipath.conf:

blacklist {
        devnode "*"
}
Change them to:
#blacklist {
#        devnode "*"
#}
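If you would rather script this change than edit the file by hand, a sed range can comment out the block. This is only a sketch: it assumes the block opens with a line reading "blacklist {" and closes at the next line that starts with "}", as in the stock file shown above. The function name is our own.

```shell
# Comment out the catch-all blacklist block in a multipath.conf.
# Assumes the block opens with "blacklist {" at the start of a line and
# closes with "}" at the start of a line, as in the default file.
# The result goes to stdout; the input file is left untouched.
comment_out_blacklist() {
    sed '/^blacklist {/,/^}/ s/^/#/' "$1"
}
```

Running comment_out_blacklist /etc/multipath.conf and redirecting the output to a scratch file lets you review the change before copying it into place.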
Next we'll tell multipath how to recognise the different luns on the system. This is usually at the bottom of the default multipath.conf file.
devices {
        device {
                vendor  "Vendor"
                product "OurFC"
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        }
}

This tells multipath how to identify luns on the system uniquely. For example, we have a lun configured for a test vm that appears on the system as device /dev/sdc. We run scsi_id against /block/sdc to determine the scsi_id of the lun.
[root@hypervisor0 ~]# /sbin/scsi_id -g -u -s /block/sdc
Vendor_FF01000033100100
In our case, since we have redundant links to the fibre channel, both /dev/sdc and /dev/sde return the same result:
[root@hypervisor0 ~]# /sbin/scsi_id -g -u -s /block/sde
Vendor_FF01000033100100
Now, to make this useful, we have to define an alias for this scsi_id, so we'll make an alias called vm1, since this is our first vm.
multipaths {
        multipath {
                wwid            Vendor_FF01000033100100
                alias           vm1
        }
}
Here we are telling multipath to create a device in /dev/mapper called vm1 and to route any data destined for that device to either /dev/sdc or /dev/sde, depending on which is available. You can specify round-robin across the paths, but that's not important right now. The important thing is that on any server we build, if we have the same multipath.conf file, the device /dev/mapper/vm1 will exist on that server. Moreover, it will be the correct lun on the fibre channel each time.
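Since every vm needs one of these entries, it can help to generate them rather than type them. A minimal sketch, with a function name of our own choosing, that prints a multipath entry for a given id and alias:

```shell
# Print a multipath entry mapping a scsi id to an alias.
# The output belongs inside the multipaths { } section of
# /etc/multipath.conf; this function does not edit the file itself.
multipath_entry() {
    wwid=$1
    alias_name=$2
    printf '        multipath {\n'
    printf '                wwid            %s\n' "$wwid"
    printf '                alias           %s\n' "$alias_name"
    printf '        }\n'
}
```

Calling multipath_entry Vendor_FF01000033100100 vm1 prints the inner block of the example above. Once the new configuration is loaded, multipath -ll should list the map under its alias along with the paths behind it.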

making a vm

The first step is to create a new lun in which the vm can live. After creating the new lun, tell the kernel to rescan the scsi bus for any new drives.
[root@hypervisor ~]# echo "- - -" > /sys/class/scsi_host/host3/scan
[root@hypervisor ~]# dmesg |grep sd
SCSI device sdag: drive cache: write back
 sdag: unknown partition table
sd 3:0:1:14: Attached scsi disk sdag
sd 3:0:1:14: Attached scsi generic sg35 type 0
Using dmesg we can see the device name assigned by the kernel to the new disk. We can then run scsi_id on the device to determine its scsi_id.
[root@hypervisor ~]# /sbin/scsi_id -g -u -s /block/sdag
Vendor_AA01000D33100100
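The rescan above is aimed at host3. On a machine with more than one HBA you may not know which host the new lun will show up on, so a loop over every entry in /sys/class/scsi_host is a reasonable sketch. The function name and the optional directory argument (there only so the loop can be tried against a scratch directory) are ours.

```shell
# Ask every scsi host to rescan all channels, targets and luns.
# $1 is the sysfs directory to walk; it defaults to the real one.
# Needs root when run against the real sysfs tree.
rescan_scsi_hosts() {
    base=${1:-/sys/class/scsi_host}
    for host in "$base"/host*; do
        # Skip the unexpanded pattern when no hosts exist.
        [ -e "$host/scan" ] || continue
        echo "- - -" > "$host/scan"
    done
}
```

After calling rescan_scsi_hosts with no arguments, the new disk should show up in dmesg just as it does above.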
We can then add the id to /etc/multipath.conf to create a device in /dev/mapper. So in our multipath.conf we put
multipaths {
        multipath {
                wwid            Vendor_AA01000D33100100
                alias           vm_test
        }
}
This will create a block device /dev/mapper/vm_test, which is what we will tell the vm to use as its hard drive. Now, if we use some version control (subversion) or configuration control (puppet), we can push this config file to all our hypervisors. This means that /dev/mapper/vm_test will exist on each of the hypervisors; moreover, it will be the correct block device for our system.
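Before relying on that for a migration, it is worth checking that the device really is present everywhere. The following is a rough sketch, assuming you can ssh to each hypervisor without a password; the function name, the SSH override variable, and any host names you pass in are ours and not part of the setup described here.

```shell
# Print the name of every host on which the given block device is absent.
# Usage: hosts_missing_device DEVICE HOST...
# Set SSH to override the remote command runner (defaults to ssh).
hosts_missing_device() {
    dev=$1
    shift
    for h in "$@"; do
        # test -b succeeds only if the path exists and is a block device.
        ${SSH:-ssh} "$h" test -b "$dev" < /dev/null || echo "$h"
    done
}
```

For example, hosts_missing_device /dev/mapper/vm_test followed by your list of hypervisors prints nothing when every one of them has the device.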
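In the vm's XML definition, the disk then points at the mapper device. This example is from a vm whose lun is aliased ldap:
<devices>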
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='block' device='disk'>
      <source dev='/dev/mapper/ldap'/>
      <target dev='hda' bus='ide'/>
    </disk>