Monday 12 April 2010

Adding Virtual Linux boxes (running in a Windows host) to a Condor Pool

As the manager of a Condor pool, I decided to increase the number of CPUs available to Condor. Until now Condor at our site ran only on Linux machines, but there are also quite a lot of Windows boxes around that could be useful. Our users develop code in Linux, though, so my goal was to provide them with more Linux machines (even if these would be virtual machines running inside a Windows host).

Condor provides a Virtual Machine Universe, which would let us run a Linux virtual machine inside a Windows host, but on Windows its support is limited to VMware. I'm not sure whether we could use VMware Player to avoid licensing costs, but for the moment I have tried a different route: POVB.

POVB stands for Pools of Virtual Boxes and the project "is focused on creating Linux-based VirtualBox virtual machines to deploy Condor pools in an Windows environment".

The code can be downloaded from SourceForge, and installing it on a single machine is very easy. I'm running version 1.4.3 on a Windows XP (Service Pack 3) PC (AMD Athlon 64 X2 Dual Core Processor 6000+, 3.01 GHz, 3.25 GB of RAM), and these notes reflect this version. To install POVB on a single machine you just need to run the script INSTALL.BAT inside the povb-1.4.3 folder. This installs VirtualBox, a CentOS virtual machine, and the Windows services needed to get everything running. The script takes some time, since it downloads quite a lot of stuff from the Internet, so be patient. When it is done, you will see a new directory, C:\povb.

Before rebooting the computer, you need to change some basic settings in the file C:\povb\condor_status\personal_config.txt (I customized DOMAIN, CM_FULLNAME, CM_SHORTNAME and CM_IPADDRESS, and didn't touch the rest).

After rebooting the machine, everything needed will start automatically. By default you get a 32-bit CentOS virtual machine, although you can create your own VM and modify it to your heart's content (more on this later).

To verify that everything is working correctly, first check in the Windows host (with the Windows Task Manager) that the process VBoxHeadless.exe is running. (With a previous version of POVB I got stuck here, because the number of CPUs in my PC was not detected correctly.) If it is not running, you can start VirtualBox manually and try to start the povb VM to find out what is going wrong.

If VBoxHeadless.exe is running on the Windows PC, then check that the VM has registered with the Condor pool. The VM's name is "worker_" followed by its MAC address. For instance, for my newly added VM I get:

angelv@vaso:/etc/condor$ condor_status | grep -i worker
slot1@worker_EEFF0 LINUX      INTEL  Unclaimed Idle     1.000   821  0+00:00:04
slot2@worker_EEFF0 LINUX      INTEL  Unclaimed Idle     0.800   821  0+00:00:05
angelv@vaso:/etc/condor$


The full name of the machine (truncated in the listing above) is:

angelv@vaso:/etc/condor$ condor_status -l | grep -i worker | grep -i machine
Machine = "worker_EEFF090909E6.ll.iac.es"


The last step is to verify that the VM can actually run jobs. Once you know the name of the machine, you can check its attributes. In particular, we are interested in HOSTINFO_HostOsLoad and HOSTINFO_POVBLoad, since these were problematic in my case. You can check whether they show up with the following command:

angelv@vaso:/etc/condor$ condor_status -l worker_EEFF090909E6.ll.iac.es | grep -i hostinfo
CpuBusy = ((HOSTINFO_HostOsLoad - HOSTINFO_POVBLoad) >= 0.500000)
Start = ((HOSTINFO_HostOsKeyboardIdle > 15 * 60) && (((HOSTINFO_HostOsLoad - HOSTINFO_POVBLoad) <= 0.500000) || (State != "Unclaimed" && State != "Owner")))
HOSTINFO_HostOsLoad = 0.010000
HOSTINFO_POVBLoad = 0.010000
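When checking several VMs, the same test can be wrapped in a small function. This is a sketch, not part of POVB: the name check_hostinfo is hypothetical, and it only assumes that condor_status -l prints one "Attribute = value" per line, as above. It anchors the match at the start of the line so that mentions inside CpuBusy or Start do not count.

```shell
#!/bin/bash
# Hypothetical helper: check that a VM's ClassAd (read from stdin, e.g.
# "condor_status -l <machine>") defines both POVB load attributes.
# Returns 0 if both are present, 1 otherwise.
check_hostinfo() {
  local ad attr missing=0
  ad=$(cat)
  for attr in HOSTINFO_HostOsLoad HOSTINFO_POVBLoad; do
    # Match the attribute's own definition at the start of a line,
    # not a reference to it inside another expression.
    if ! grep -q "^${attr} = " <<< "$ad"; then
      echo "missing: $attr" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   condor_status -l worker_EEFF090909E6.ll.iac.es | check_hostinfo && echo "load attributes present"
```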


If you cannot see them, you might have run into the same problem I did, which seems to be related to the Windows regional settings. Note that the Start expression above depends on these attributes, so without them the VM will most likely never accept jobs. If HostOsLoad and POVBLoad are written with a decimal comma (e.g. 0,04) in the file C:\povb\condor_status\machine_stats.txt, then you have the same problem I did. The POVB developers are aware of it, but until they have a chance to fix it, the following workaround did the trick.
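A quick way to spot the decimal-comma problem is to look for digits separated by a comma after the load names. This is a minimal sketch under assumptions: the function name uses_decimal_comma is hypothetical, and it assumes each load name appears on the same line as its value, which may not match the exact layout of machine_stats.txt or of the VM's read_stats.sh output.

```shell
#!/bin/bash
# Hypothetical helper: succeed if HostOsLoad or POVBLoad (read from stdin)
# is written with a decimal comma, e.g. "HostOsLoad = 0,04".
# Assumption: the name and its value share a line; the real layout may differ.
uses_decimal_comma() {
  # After the name, skip non-digits, then require digits,comma,digits.
  grep -Eq '(HostOsLoad|POVBLoad)[^0-9]*[0-9]+,[0-9]+'
}

# Usage, e.g. inside the VM:
#   /home/condor/read_stats.sh | uses_decimal_comma && echo "comma problem: apply the workaround"
```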

In Windows, stop the POVB service, open VirtualBox, change the virtual hard disk povb_primary_hd.vdi from "Immutable" to "Normal" (otherwise any change made inside the VM is discarded when it is powered off), and start the povb VM manually (the default root password is YouReallyNeedToChangeMe!). Once it starts, switch to the condor user (su - condor), whose home directory /home/condor/ holds all the Condor stuff: the Condor software together with its configuration files, logs, etc. The main configuration file is /home/condor/etc/condor_config, with secondary configuration files in /home/condor/condor_config_local. The logs and the execute and spool directories are in /home/condor/local.localhost.

Of particular interest here is the file read_stats.sh. If you run it and get HostOsLoad and POVBLoad with commas, you can easily solve the problem by renaming this file to read_stats_orig.sh and creating a new read_stats.sh file that converts the comma into a decimal point:

$ cat read_stats.sh
#!/bin/bash
# Run the original script and turn the decimal comma into a decimal point
/home/condor/read_stats_orig.sh | sed 's/,/./'
$ chmod +x read_stats.sh
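If you need to apply the same fix in more than one VM, the renaming and the wrapper creation can be scripted. This is a minimal sketch to be run as the condor user inside the VM; the function name install_read_stats_wrapper and its optional directory argument are hypothetical (the directory defaults to /home/condor, where read_stats.sh lives in POVB 1.4.3). It refuses to run twice, so the original script is never overwritten.

```shell
#!/bin/bash
# Hypothetical helper: replace read_stats.sh with a wrapper that converts
# decimal commas into points. Directory defaults to /home/condor.
install_read_stats_wrapper() {
  local dir="${1:-/home/condor}"
  # Keep the original safe if the wrapper is already in place.
  if [ -e "$dir/read_stats_orig.sh" ]; then
    echo "wrapper already installed in $dir" >&2
    return 0
  fi
  mv "$dir/read_stats.sh" "$dir/read_stats_orig.sh" || return 1
  # Unquoted heredoc: $dir expands now, while \$@ stays literal in the file.
  cat > "$dir/read_stats.sh" <<EOF
#!/bin/bash
# Convert decimal commas (e.g. 0,04) into points (0.04) for Condor
"$dir/read_stats_orig.sh" "\$@" | sed 's/,/./'
EOF
  chmod +x "$dir/read_stats.sh"
}

# Usage (inside the VM, as the condor user):
#   install_read_stats_wrapper
```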

Once this is in place, HOSTINFO_HostOsLoad and HOSTINFO_POVBLoad will start appearing in the VM information reported by condor_status, and you will be able to use these VMs as regular Linux PCs in your Condor pool.


If you had to change the read_stats.sh file, you can then replace the povb_primary_hd.vdi file that ships with the POVB distribution with the modified one in C:\povb, so that new installations already include the fix (to be safe, copy it while VirtualBox is not running; you can stop the POVB service via the Control Panel).

Another issue in our setting is that I only want to run the Linux VMs after hours: even if the VM is not being used by Condor, VirtualBox can consume quite a lot of RAM, and I don't want our users to notice it. For this I created two scripts in C:\povb, one running net start povb_service and the other running net stop povb_service, and scheduled them according to our needs, so that the POVB service is not running during working hours.


With this in place, I have started deploying Linux VMs on a few Windows test PCs. If all goes well, the next step will be to create my own VMs. There is a guide for this at: http://sourceforge.net/apps/trac/poolsofvirtualb/wiki/Bootstrapping




