Category Archives: Linux

Building a VM cluster with your bare hands.

I’ve been using KVM more and more frequently in the past year (with nearly 25 virtual guests in production). While there are graphical user interfaces for KVM out there, I’ve yet to see one that supports DRBD replication (although one may exist). For that reason, I’ve basically built my Virtualization cluster using a handful of open-source tools “by hand”.

Kernel-based Virtual Machine
A penguin juggling multiple tasks and OSes

KVM can run both Windows and Linux guests (as well as numerous other operating systems) and supports both CPU-based virtualization and para-virtualization. Virtual drivers running inside the guest speed up disk and network I/O, and a balloon memory driver reduces the guest’s actual memory utilization on the host machine.

I’ll stop here and say this: for home use, I’d probably suggest avoiding everything laid out in this article and simply using VirtualBox. I use it on my desktop extensively and have for many years, but for this article I’m focusing on building a cluster with KVM.

VirtualBox
VirtualBox running Meego (another project of mine)

With the absence of a proper Storage Area Network (SAN), I’m utilizing DRBD (Distributed Replicated Block Device) to provide VM disk replication across both virtual nodes. This allows for live migration of a guest from front-end node to front-end node. Additionally, this architecture will still allow for replacement of storage OR supplementing of storage with a SAN in the future.

VM Cluster
The VM Cluster Layout.

DRBD replicates individual LVM volumes (and not the RAID array as a whole) across my 2 host nodes. Each virtual guest has its own logical volume assigned to it, which is accessed via a DRBD block device interface (/dev/drbd<number>).

In the example image above, jabber0 and web0 (virtual “guests”) are running on virtual0 (a virtual “host” or “node”), with web1 (another “guest”) running on virtual1 (another “host” or “node”). The DRBD resource is set to Primary mode on the virtual host connected to a running guest, with the disk being copied to the Secondary (the non-running virtual host). Primary mode allows the virtual host (and its guests) to access the DRBD resource (and read/write from the connected logical volume).

As far as a virtual guest is concerned, there is no DRBD, only a /dev/vda or /dev/sda device.

Only during live-migration should the DRBD resources on both virtual hosts be placed into Primary (a situation called Dual Primary). As one virtual guest is paused prior to the other going active, data corruption will not occur.

The Virtual Server itself (Well, it looks like this)

Each node is presently a Dell PowerEdge 2950 with 32G of memory and over 1 terabyte of storage. Because DRBD mirrors every volume on both nodes, this gives approximately 1 terabyte of usable storage (and not a combined 2 terabytes).

Each node has 4 gigabit ethernet interfaces.

Interface   Purpose
eth0        Administrative access
eth1        DRBD replication
eth2        Connected to the world_br0 bridge for guest host routing
eth3        Connected to the world_br1 bridge for guest host routing

 

There are presently three ethernet bridges on each node:

 

Bridge Interface   Purpose
kickstart_br0      Used to kickstart guest machines
world_br0          Used to connect guest machines to the public network
world_br1          Used to connect guest machines to the public network

 

Connecting to a Guest:

Each guest is accessible via standard means (ssh) when configured correctly. Additionally, one can connect to each guest by VNCing to a unique port on the virtual host. (I do maintain a list of which DRBD ports and VNC ports are used for each of my virtual guests)

Configuring an LVM volume:

The “vmdisks” LVM volume group is approximately 1.3TB of disk storage, used to provide individual volumes to the guest instances. I use a logical volume of 30G for most guests.

Adding a logical volume for guest usage is simple, and must be done uniformly across all nodes:

lvcreate -L <size>M -n <name> vmdisks
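Since the same lvcreate has to be repeated on every node, it can help to generate the commands from one place. Here is a minimal sketch that only prints them; the guest name web2, the 30720M size (30G) and the idea of running lvcreate over ssh as root are my assumptions for illustration:

```shell
#!/bin/sh
# Nodes that must carry an identical logical volume.
NODES="virtual0.braindeadprojects.com virtual1.braindeadprojects.com"

# lvcreate_commands <volume name> <size in MB>
# Prints one lvcreate command per node; nothing is executed.
lvcreate_commands() {
    name="$1"; size_mb="$2"
    for node in $NODES; do
        echo "ssh $node lvcreate -L ${size_mb}M -n $name vmdisks"
    done
}

# Hypothetical example: a 30G volume for a guest called web2.
lvcreate_commands web2 30720
```

Printing first and running second makes it easy to confirm that the name and size are the same on both nodes before anything is created.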

 

Initial Configuration of the DRBD MD device:

The DRBD MD device is the actual block device that the Guest machine will interface with.

The following MUST BE DONE ACROSS ALL NODES, however only upon initial creation:

Update the /etc/drbd.conf file to add your new resource (here’s an example):

resource <resource name>
{
  net
  {
    allow-two-primaries;
  }

  syncer
  {
    rate 33M;
  }

  startup
  {
    become-primary-on virtual0.braindeadprojects.com;
  }

  on virtual0.braindeadprojects.com
  {
    device /dev/drbd<next available block>;
    disk /dev/vmdisks/<LVM logical volume name>;
    address 10.255.255.1:<next available port>;
    meta-disk internal;
  }

  on virtual1.braindeadprojects.com
  {
    device /dev/drbd<next available block>;
    disk /dev/vmdisks/<LVM logical volume name>;
    address 10.255.255.2:<next available port>;
    meta-disk internal;
  }

}
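Because only four values change from guest to guest, the stanza above lends itself to being generated. This is a rough sketch rather than anything I run in production; the function name and the example values (web2, minor 7, port 7795) are hypothetical:

```shell
#!/bin/sh
# drbd_resource_stanza <resource> <drbd minor> <logical volume> <port>
# Prints a resource stanza mirroring the example config above.
drbd_resource_stanza() {
    res="$1"; minor="$2"; lv="$3"; port="$4"
    cat <<EOF
resource $res
{
  net     { allow-two-primaries; }
  syncer  { rate 33M; }
  startup { become-primary-on virtual0.braindeadprojects.com; }

  on virtual0.braindeadprojects.com
  {
    device    /dev/drbd$minor;
    disk      /dev/vmdisks/$lv;
    address   10.255.255.1:$port;
    meta-disk internal;
  }

  on virtual1.braindeadprojects.com
  {
    device    /dev/drbd$minor;
    disk      /dev/vmdisks/$lv;
    address   10.255.255.2:$port;
    meta-disk internal;
  }
}
EOF
}

# Hypothetical values; review the output before adding it to /etc/drbd.conf.
drbd_resource_stanza web2 7 web2 7795
```

The output still has to be added to /etc/drbd.conf on every node by hand (or by whatever tool distributes your configs).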

 

After updating the config, create the block device and enable it:

#drbdadm create-md <resource name>
#drbdadm up <resource name>

 

At this point, all nodes have a record of this DRBD resource. /proc/drbd will have additional information.

The following must be done ONLY ON THE PRIMARY (MASTER) NODE:

#drbdadm -- --overwrite-data-of-peer primary <resource name>

 

This will begin an initial synchronization across the nodes. Again, this is only run on the “Master node” (the virtual host node that is initially running the VM guest).  At this time, the DRBD resource is available on ALL nodes, however until the synchronization is finished, reads/writes will take slightly longer.

An important note on synchronization:

The syncer{} stanza in the resource config plays an important role in how fast a drive is synchronized. The default sync speed is roughly 340K/sec, which means synchronizing a 30G drive takes approximately 28 hours.

This can safely be set to 33M/sec in my environment, reducing sync-time to roughly 20 minutes, depending upon load.
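For a back-of-the-envelope check of those figures, the arithmetic is just volume size divided by sync rate. The sketch below assumes binary units (1G = 1024M = 1048576K) and an idle link, so it only lands in the same ballpark as real-world numbers:

```shell
#!/bin/sh
# sync_seconds <size in GiB> <rate in KiB/sec>
# Integer estimate (rounded down) of a full synchronization, in seconds.
sync_seconds() {
    size_gib="$1"; rate_kib="$2"
    echo $(( size_gib * 1024 * 1024 / rate_kib ))
}

# A 30G volume at roughly 340K/sec, then at 33M/sec.
slow=$(sync_seconds 30 340)
fast=$(sync_seconds 30 $((33 * 1024)))
echo "340K/sec: about $((slow / 3600)) hours"
echo "33M/sec:  about $((fast / 60)) minutes"
```

Actual sync time also depends on load, which this estimate ignores.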

Sync rate becomes an important factor in instances where an entire node fails and the resources of the failover node cannot keep up. In such an event, a 3rd node should be added to the system, with drives synced to it.

Creating the VM Guest:

I’m utilizing libvirt as a convenient way to provision and manage virtual machines.

libVirt
libVirt, handling all your hypervisor and sardine needs since 2009

Creation of a guest is fairly easy, and can be done interactively or via a one-liner:

#virt-install --connect qemu:///system -n <Guest Name> -r <RAM in MB> --vcpus=1 \
--arch=<i686|x86_64|…> --vnc --vncport=<unused VNC port number> --noautoconsole --os-type linux --accelerate \
--network=bridge:<kickstart_br0|world_br0|world_br1> --hvm --disk path=/dev/drbd<resource number>,bus=virtio \
--<pxe|import|cdrom>

 

After this, the guest will automatically start, with its vnetX interface bridging with kickstart_br0.

I’ve installed dnsmasq on each host machine. It sits atop the kickstart_br0 interface, assigns the VM an IP in the 192.168.6.0/24 network (via DHCP), and PXE boots/kickstarts it off a mirroring server. (The 192.168.6.0/24 network is MASQUERADEd in iptables so requests appear to come from virtual[01].)

After kickstarting the guest, the reboot process tends to shut down the virtual guest so it may need to be restarted (Normal reboots are not prone to this shutdown). Once restarted, server configuration can be done via ssh from the host node, or VNC.

Trixbox-KVM
A KVM Trixbox guest: It's ALIVE!!!

Once the machine is built, customized and ready to be placed on the Internet, power down the VM guest and edit the XML config file (replacing kickstart_br0 with world_br0 or world_br1). If you find that the VM guest attempts to PXE boot once again, you may also need to change the boot device order (specifying hd instead of network).

You will also want to adjust the clock to source itself from the host machine.

# virsh
Welcome to virsh, the virtualization interactive terminal.
Type:  'help' for help with commands
       'quit' to quit
virsh # edit <guestname>


<os>
  <type arch='x86_64' machine='rhel5.4.0'>hvm</type>
  <boot dev='network'/>
  <boot dev='hd'/>
</os>


<interface type='bridge'>
  <mac address='54:52:00:2d:21:10'/>
  <source bridge='kickstart_br0'/>
  <target dev='vnet1'/>
</interface>


<clock offset='localtime'/>

 

To boot from CD:

 

<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <source file='/usr/src/systemrescuecd-x86-1.6.2.iso'/>
  <target dev='hdc' bus='ide'/>
  <readonly/>
  <address type='drive' controller='0' bus='1' unit='0'/>
</disk>

 

After editing the XML, you can restart the VM.

 

virsh # start <guestname>
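If you would rather not hand-edit the XML, the bridge swap and boot order change described above can be scripted against a dump of the guest definition. This is a sketch under a couple of assumptions: the XML uses single-quoted attributes exactly as shown above, and the guest name web2 in the usage comment is hypothetical:

```shell
#!/bin/sh
# to_production_xml <public bridge>
# Reads a libvirt domain XML on stdin, moves the guest from kickstart_br0
# to the given bridge, and drops the network boot entry so it boots from hd.
to_production_xml() {
    bridge="$1"
    sed -e "s/<source bridge='kickstart_br0'\/>/<source bridge='$bridge'\/>/" \
        -e "/<boot dev='network'\/>/d"
}

# Intended use while the guest is powered down (not run here):
#   virsh dumpxml web2 | to_production_xml world_br0 > /tmp/web2.xml
#   virsh define /tmp/web2.xml
```

A quick diff of the dumped and rewritten XML before running virsh define is a cheap sanity check.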

 

General usage:

 

I’ve made sure to install virt-top, an interface to the hypervisor similar to the “top” command. This gives a nice overview of the system:

#virt-top

 

virt-top
virt-top: An overview of CPU and memory

The shell API for libvirt makes manipulating guest instances easy. Here are a few of the more frequently used virsh commands:

#list [--all] (Lists running guests; --all includes non-running guests)
#start <guestname> (Starts guest instance)
#autostart <guestname> (Marks guest to be autostarted at node boot)
#destroy <guestname> (Immediately powers off guest)
#shutdown <guestname> (Powers down guest gracefully)
#suspend <guestname> (Pauses guest, keeping it in memory)
#reboot <guestname> (Reboots guest)
#edit <guestname> (Edits the guest XML config)
#migrate (See the migration section for more info)

 

Live migration:

Live migration between nodes can be done via ssh (with shared keys) or TLS. I’m currently utilizing the ssh method:

Prior to migration, the DRBD resource needs to be placed in Primary on both nodes:

#drbdadm primary <resource name>

 

After doing so, the following is run on the SENDING node only:

#virsh migrate --live <GUEST> qemu+ssh://<RECEIVING NODE>/system

 

As part of the migration process, the sending node copies memory and kernel state via ssh to the receiving node. During the migration process, the guest instance on the sending node is active, with the copy on the receiving node being marked as paused.

 

Once the state has been copied, the sending node pauses its guest instance and the receiving node un-pauses its copy. Actual migration time depends upon a number of factors, but it generally completes in under 35 seconds.

Following the migration, it’s essential to place the losing node into DRBD secondary mode. Should I accidentally try to start the guest on the losing node, this will prevent it from obtaining access to the disk (and corrupting data):

#drbdadm secondary <resource name>
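The whole sequence (dual Primary, migrate, demote the losing node) can be wrapped so that no step gets forgotten. The sketch below defaults to a dry run that only prints the commands; the helper names, the use of ssh to promote the peer, and the assumption that the DRBD resource is named after the guest are mine:

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_guest <guest> <drbd resource> <receiving node>
# Meant to be run on the SENDING node.
migrate_guest() {
    guest="$1"; res="$2"; peer="$3"
    # 1. Dual Primary: promote the resource on both nodes.
    run drbdadm primary "$res" || return 1
    run ssh "$peer" drbdadm primary "$res" || return 1
    # 2. Live-migrate; on failure stop here so this node stays Primary.
    run virsh migrate --live "$guest" "qemu+ssh://$peer/system" || return 1
    # 3. Demote the losing (sending) node.
    run drbdadm secondary "$res"
}

migrate_guest web0 web0 virtual1.braindeadprojects.com
```

Stopping at the first failure matters most for step 2: if the migration fails, the guest is still running locally and the local resource must remain Primary.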

 

Virtualizing Physical Machines:

 

Virtualizing a physical machine is extremely easy. Instead of PXE booting and kickstarting an install (--pxe), I use the --cdrom /usr/src/systemrescuecd-x86-1.6.2.iso flag when creating the virtual guest. On each virtual host, I have a copy of the excellent Gentoo-based SystemRescueCd.

When the system boots into the live CD, partition the drive (usually /dev/vda or /dev/sda) as you wish (taking advantage of LVM for non-boot partitions if possible).

Create a /mnt/migrate directory from the live cd, and mount your newly created partitions there.

mkdir -p /mnt/migrate
mount /dev/sda2 /mnt/migrate
for dir in boot proc dev sys; do mkdir -p /mnt/migrate/$dir; done
mount /dev/sda1 /mnt/migrate/boot

(Do the same for /var and any other directories you have partitioned separately)

Utilizing rsync over ssh, synchronize all files from the physical host to the virtual one (taking care that you perform the action correctly, so as not to overwrite the original server). A handful of files and directories NEED TO BE OMITTED, namely:

/proc
/dev
/sys
/lost+found
(possibly others)

I generally use an rsync command similar to this one:

#time rsync --exclude '/dev/' --exclude '/proc/' --exclude '/sys/' --exclude '/lost+found/' -ave ssh <originalserverip>:/ /mnt/migrate/
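Since the exclude list tends to grow ("possibly others"), it can be kept in one variable and the command built from it. A minimal sketch that prints the command rather than running it; the address 192.0.2.10 is a placeholder from the documentation range, not a real server:

```shell
#!/bin/sh
# Paths that must not be copied from the physical host.
EXCLUDES="/dev/ /proc/ /sys/ /lost+found/"

# rsync_command <original server> <destination directory>
# Prints the rsync command line; nothing is executed.
rsync_command() {
    src_host="$1"; dest="$2"
    set -- rsync -av -e ssh
    for path in $EXCLUDES; do
        set -- "$@" --exclude "$path"
    done
    echo "$@" "$src_host:/" "$dest"
}

rsync_command 192.0.2.10 /mnt/migrate/
```

Adding -n to the printed command turns it into an rsync dry run, which is a cheap way to confirm the direction of the copy before touching anything.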

The following devices need to be present in order to boot into the OS.  Either rsync them or manually make them with mknod.

/dev/console
/dev/initctl
/dev/null
/dev/zero

Another easy way to accomplish this is:

for file in {console,initctl,null,zero}; do cp -a /dev/$file /mnt/migrate/dev/$file ; done
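For the mknod route, the sketch below prints the commands for the four nodes listed above. The major/minor numbers are the standard Linux assignments; the permission modes are typical defaults and are my assumption, so compare them with the original server if in doubt:

```shell
#!/bin/sh
# device_node_commands <target root>
# Prints mknod commands for the minimal set of device nodes.
device_node_commands() {
    root="$1"
    echo "mknod -m 600 $root/dev/console c 5 1"
    echo "mknod -m 666 $root/dev/null c 1 3"
    echo "mknod -m 666 $root/dev/zero c 1 5"
    # initctl is a named pipe (FIFO), not a character device.
    echo "mknod -m 600 $root/dev/initctl p"
}

device_node_commands /mnt/migrate
```

Piping the output to sh as root creates the nodes.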

 

Following the rsync, the virtual guest will need a bootloader and an updated initial ramdisk. Both of these are best done in a chroot environment:

mount -o bind /dev/ /mnt/migrate/dev/
mount -t proc none /mnt/migrate/proc/
mount -o bind /sys/ /mnt/migrate/sys/
chroot /mnt/migrate/ /bin/bash

 

Inside the chroot environment, you will need to update /etc/mtab and /etc/fstab to reflect your new partitioning (at the very least drives will likely change to /dev/vda). You will also need to update /boot/grub/device.map to reflect hd0 as a valid drive.

Once these changes have been made, grub installation should be as simple as:

grub-install --no-floppy /dev/vda

 

(Although sometimes grub needs to be further caressed. See the Gentoo Handbook section on installing the grub bootloader for good documentation)

If a simple grub-install does not work:

# grub
grub> root (hd0,0)           <-- where hd0,0 is the first partition on the first drive
grub> setup (hd0)            <-- install grub on the mbr of the first drive
grub> quit

 

With the bootloader installed, we need to create a working initial ramdisk with the necessary drivers. Look for the most recently installed kernel in grub.conf and create the ramdisk (replace the version numbers with yours):

mkinitrd -f /boot/initrd-2.6.18-194.32.1.el5.img 2.6.18-194.32.1.el5

To take advantage of virtio (network and disk) modules on newer kernels, you may opt to include them in the initrd:

mkinitrd -f --with virtio_pci --with virtio_blk \
--with virtio_net /boot/initrd-2.6.18-194.32.1.el5.img 2.6.18-194.32.1.el5

 

Please be aware that many files (e.g. databases) should only be rsynced while the service that owns them is shut down. For this reason, it’s often best to shut down MySQL, PostgreSQL, and various other services PRIOR TO RSYNCing to prevent database corruption.

How to physicalize a virtual machine:

In the event of a major issue, converting a virtual machine back to a physical machine has the same process as physical to virtual, but reversed.

Of Note:

While Fedora currently supports (automatically, out of the box) SELinux profiles/labels for KVM guest instances, CentOS 5.6 does not. It will be incorporated in CentOS 6, however, and I plan on migrating to that OS eventually.

Final Thoughts:

As with everything, there are pros and cons to this methodology.

  • While I’ve always preferred avoiding GUIs, the fact is they standardize on what steps happen in which order (limiting the potential for user-induced errors.)
  • A high performance SAN (or perhaps an OpenFiler box) would make things much easier to configure and migrate, but at the same time introduce a possible single point of failure.
  • Utilizing an automation engine (like puppet) could limit the number of steps needed to provision a virtual guest across all nodes.
  • Outside of some possible restrictions (virtio drivers being specific to KVM, LVM2 support for Windows), migrating from the present-day system to VMware, VirtualBox, or <insert your favorite hypervisor here> should be fairly easy: simply create a guest and point it at an LVM logical volume.

All in all, the system has been in production for nearly a year now and is performing beautifully. And best of all, I’m saving on power and generating less heat.

 

Stop the bleeding!

I’ve been quite annoyed recently with my video card, the “nVidia Corporation GeForce 8400 GS (rev a1)“. A number of sites using Flash tend to bleed through Firefox or Chrome and into other tabs or even other workspaces.

This is annoying
I can still SEE YOU!

I’ve upgraded the nvidia-drivers a number of times, never actually fixing the problem. Other Gentoo users on the #gentoo channel of freenode have suggested migrating to gnash instead… and while I have contemplated this, I’ve noticed a number of things that don’t work well under gnash on my netbook.

Thankfully I’m not the only person to be experiencing this. Earlier today I came across a solution (that while not optimal) definitely fixes the problem:

 

Untick this box
No acceleration, but no bleeding either.

Disabling hardware acceleration thankfully stops the bleedthrough. (Just right click on a flash movie, select “Settings” and disable acceleration under “display”) You will need to restart your browsers for it to fully take effect.

GNS3 and Gentoo – fixing QEMU networking

I was hoping to have time to further familiarize myself with IPv6 this weekend –  but workplace emergencies, sleep, and a technical glitch slowed me down. Here’s the story of the technical glitch.

GNS3 - this thing rocks.

GNS3/dynamips can handle emulation of a number of routers, and I was happy to see that it can emulate host PCs as well. Unfortunately networking in GNS3  is done via UDP tunnels – something that requires a Qemu patch to implement.

For some reason QEMU won't really start if it's networked.

In Gentoo, simply building GNS3 from the Sunrise Overlay doesn’t provide sufficient support for this requirement. Emulated routers (and I’m assuming switches) work fine when networked together. Emulated host machines work fine when NOT connected to anything. But once you network things together you’ll discover that your Qemu host won’t start.

Here’s why – when you start a Qemu host, this is essentially what GNS3 runs:

/usr/bin/qemu -name QEMU1 -m 256 /mnt/virtualmachines/dynips/IPv6-tc/working/QEMU1/FLASH -hdb /mnt/virtualmachines/dynamips/IPv6-tc/working/QEMU1/SWAP -enable-kvm -net nic,vlan=0,macaddr=00:aa:00:8f:e4:00,model=rtl8139 -net udp,vlan=0,sport=20000,dport=10006,daddr=127.0.0.1 -net nic,vlan=1,macaddr=00:00:ab:fa:72:01,model=rtl8139 -serial telnet:127.0.0.1:3000,server,nowait -no-acpi -vnc :0

GNS3 attempts to use the -net udp option, something not offered in the default QEMU-KVM distribution. With no Gentoo USE flags to add the patch in either 0.13.0 or 0.14.50, I decided to simply upgrade to 0.14.50 and see if it was part of the default build:

# kvm --version
QEMU emulator version 0.14.50 (qemu-kvm-devel), Copyright (c) 2003-2008 Fabrice Bellard
# kvm --help | grep "net udp"
#

Nope, it’s not. So the next step is to remove app-emulation/qemu-kvm from your system and build it by hand.

First, remove kvm and download the source from kernel.org:

# emerge -C app-emulation/qemu-kvm

# cd /usr/src/

# wget http://mirrors.kernel.org/gentoo/distfiles/qemu-kvm-0.13.0.tar.gz

Unpack and patch it, using the patch provided by the GNS3 team.

# tar zxvpf qemu-kvm-0.13.0.tar.gz

# cd qemu-kvm-0.13.0/

Testing the patch, I ran into a handful of issues:

# patch --dry-run -p1 < /mnt/nas/downloads/qemu-0.13.0-mcast-udp.patch
patching file Makefile.objs
Hunk #1 FAILED at 25.
1 out of 1 hunk FAILED -- saving rejects to file Makefile.objs.rej
patching file block/raw-win32.c
Hunk #1 FAILED at 93.
Hunk #2 FAILED at 347.
2 out of 2 hunks FAILED -- saving rejects to file block/raw-win32.c.rej
patching file hw/e1000.c
Hunk #1 FAILED at 567.
1 out of 1 hunk FAILED -- saving rejects to file hw/e1000.c.rej
patching file net/udp.c
patching file net/udp.h
patching file net.c
Hunk #1 FAILED at 30.
Hunk #2 FAILED at 1075.
2 out of 2 hunks FAILED -- saving rejects to file net.c.rej
patching file net.h
Hunk #1 FAILED at 33.
1 out of 1 hunk FAILED -- saving rejects to file net.h.rej
patching file qemu-options.hx
Hunk #1 FAILED at 996.
1 out of 1 hunk FAILED -- saving rejects to file qemu-options.hx.rej

A quick modification and we’ve got a new patch file. This one applies nicely:

# patch --dry-run -p1 < /mnt/nas/downloads/gentoo-qemu-0.13.0.gns3.patch
patching file Makefile.objs
patching file QMP/qmp-commands.txt
patching file block/raw-win32.c
patching file config-all-devices.mak
patching file config-host.h
patching file config-host.h-timestamp
patching file config-host.ld
patching file config-host.mak
patching file hw/e1000.c
patching file libdis/config.mak
patching file libdis-user/config.mak
patching file libhw32/config.mak
patching file libhw64/config.mak
patching file net/udp.c
patching file net/udp.h
patching file net.c
patching file net.h
patching file qemu-doc.html
patching file qemu-img-cmds.texi
patching file qemu-img.1
patching file qemu-monitor.texi
patching file qemu-nbd.8
patching file qemu-options.hx
patching file qemu-options.texi
patching file qemu-tech.html
patching file qemu.1
patching file roms/seabios/config.mak
patching file roms/vgabios/config.mak
patching file x86_64-softmmu/config-devices.mak
patching file x86_64-softmmu/config-devices.mak.old
patching file x86_64-softmmu/config-target.mak
vonnegut qemu-kvm-0.13.0 #

So let’s apply it for real and then configure, build, and install our new Qemu:

# patch -p1 < /mnt/nas/downloads/gentoo-qemu-0.13.0.gns3.patch

# ./configure --prefix=/usr --target-list=i386-softmmu --enable-sdl

# make && make install

A quick check to ensure UDP tunneling is compiled in:

# /usr/bin/qemu --version
QEMU emulator version 0.13.0 (qemu-kvm-0.13.0), Copyright (c) 2003-2008 Fabrice Bellard
# /usr/bin/qemu --help | grep "net udp"
-net udp[,vlan=n]sport=sport,dport=dport,daddr=host
#

And voila – now not only does my host machine start when networked into my layout, it also has connectivity:

Yes, it finally works.

Next up – continue lab work with IPv6, and begin studying for my CCNP.

Back in the dark ages

Recently (don’t ask me why, seriously) I had to migrate a physical Redhat 9 server to a virtualized platform (KVM).

Yes, ideally one rebuilds the deprecated server anew, but due to time-constraints and a number of other issues that wasn’t a possibility. Unfortunately, sometimes you just have no choice but to kick the can down the street.

The migration from physical to virtual is simple – create the virtual guest, create an LVM logical volume for the disk,  format it and rsync the contents from the physical to the virtual drive. After that, install grub on the guest and voila – you’re done.

Keeping some consistency (although unnecessary), I went with an EXT3 filesystem on the “new” guest. Unfortunately, I came across the following snags:

  • The inode size used in Redhat 9 is 128 bytes, but modern systems (i.e. the host I formatted the partitions from) use 256 bytes.
  • There are a number of filesystem features that weren’t present in Redhat 9.

Here’s the output of tune2fs on a Redhat 9 EXT3 partition:

# tune2fs -l /dev/hda1
tune2fs 1.32 (09-Nov-2002)
Filesystem volume name:   /boot
Last mounted on:          <not available>
Filesystem UUID:          9e6fa853-18bc-4c08-bb7c-51c74e0c11ae
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery  sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              8032
Block count:              32096
Reserved block count:     1604
Free blocks:              17451
Free inodes:              7982
First block:              1
Block size:               1024
Fragment size:            1024
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2008
Inode blocks per group:   251
Filesystem created:       Tue Oct 26 12:48:49 2010
Last mount time:          Thu Oct 28 15:58:10 2010
Last write time:          Thu Oct 28 15:58:10 2010
Mount count:              11
Maximum mount count:      28
Last checked:             Tue Oct 26 12:48:49 2010
Check interval:           15552000 (6 months)
Next check after:         Sun Apr 24 12:48:49 2011
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal UUID:             <none>
Journal inode:            8
Journal device:           0x0000
First orphan inode:       0

The following is output from a Centos 5.5 formatted EXT3 filesystem:

# tune2fs -l /dev/mapper/VolGroup00-LogVol00
tune2fs 1.39 (29-May-2006)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          7384cac8-b098-4c85-be6d-643443ae3d3d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              6809088
Block count:              6807552
Reserved block count:     340377
Free blocks:              6204875
Free inodes:              6752521
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1022
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         32736
Inode blocks per group:   1023
Filesystem created:       Thu Oct 28 13:31:34 2010
Last mount time:          Thu Oct 28 17:43:57 2010
Last write time:          Thu Oct 28 17:43:57 2010
Mount count:              2
Maximum mount count:      -1
Last checked:             Thu Oct 28 13:31:34 2010
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      42eb54b3-8f66-4aef-8578-388c9863423c
Journal backup:           inode blocks

As you can see, some basic features and the inode size have changed in 10+ years. The solution: format the “new” guest to match the specs of the old:

#mke2fs -O has_journal,filetype,sparse_super,^ext_attr,^resize_inode,^dir_index \
-I 128 -j /dev/sda$disk
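Rather than eyeballing the two tune2fs listings, the list of features to switch off can be derived from them. The sketch below assumes each `tune2fs -l` output has been saved to a file; the function names are hypothetical:

```shell
#!/bin/sh
# features <file>
# Prints the "Filesystem features" of a saved tune2fs -l listing, one per line.
features() {
    sed -n 's/^Filesystem features:[[:space:]]*//p' "$1" | tr -s ' ' '\n' | sort
}

# features_to_disable <old listing> <new listing>
# Prints features found only in the NEW listing as ^a,^b,... for mke2fs -O.
features_to_disable() {
    tmp=$(mktemp -d) || return 1
    features "$1" > "$tmp/old"
    features "$2" > "$tmp/new"
    # comm -13 keeps lines unique to the second (new) file.
    comm -13 "$tmp/old" "$tmp/new" | sed 's/^/^/' | paste -s -d, -
    rm -rf "$tmp"
}

# Intended use (file names are placeholders):
#   features_to_disable redhat9.tune2fs centos55.tune2fs
```

For the two listings above this also reports large_file, which my mke2fs command leaves alone, so treat the result as a starting point and not a finished -O argument.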

And that’s it: rsync the files over to the guest VM, install grub, verify your fstab is good, and voila, you’ve transferred a physical server to a virtual one.