Cluster of Dual Pentium IIs: Kompuutaa

Kompuutaa (which is 'computer' translated into the Japanese syllabary then back into the Roman alphabet) is a Beowulf cluster, consisting of 24 dual-processor PentiumII machines (48 processors total) linked by fast ethernet, running Linux and communicating via MPI. It is being used for simulations of convection inside the Earth, and (sometimes) for simulations of seismic wave propagation. It resides in the Earth and Space Sciences Department at UCLA, and has been built up in stages, starting in the summer of 1998, reaching a 30-processor configuration in November 1998 and the present 48-processor configuration in August 1999.

The entire system was built from components, not by buying ready-made PCs. Prices of most major components have dropped by a factor of at least 2 since the start of the project, and CPUs have become much faster. A 48-processor configuration could now be built from components for a price of ~$30-40,000, giving fantastic performance for only the cost of a high-end Unix workstation!

Such clusters are now becoming widespread and there is nothing particularly unusual about this one, but I give some hardware/software specifications, design issues, and performance tests here which may be useful for anyone thinking about building their own. There are also some general usage instructions and instructions for using the PBS queueing system.

Specifications

Each unit

Front-end unit (01): As above except

Software

Networking

Design issues

SMP or single-CPU?

The conventional wisdom is that dual-processor (SMP) units do not work well because the two processors compete for the same memory bus (and, in this case, the same ethernet channel), so you don't get twice the speed. However, tests indicate that the applications we want to run do get good speedup on dual-CPU boards (some numbers are given later; of course, you don't get twice the speed by using 2 processors in different boxes, either). Other applications may behave differently, but here are some arguments in favor of using dual-CPU boards:

Alpha or PentiumII?

PCs using the DEC Alpha chip look very promising in principle, and we do indeed get faster performance from them (see later). However, despite the very high theoretical peak speed of the Alpha CPU, typical applications realize only a small fraction of it. In addition, the price of PII systems has dropped greatly in the last year whereas Alpha prices have not dropped much, so a dual-PII system now costs less than a single-Alpha system. The test given later shows that for the applications of interest to us, PentiumII CPUs give a better price:performance ratio.

Why 24 units?

This is set by the size of the fast ethernet switch: 24-way fast ethernet switches can be obtained for a little more than $1000, whereas larger ones are much more expensive (e.g., around $5000 for a 36-way switch).

Build or buy ready-made?

Building from components generally saves a lot of money, and allows you to get the exact specification desired. It takes about 1 hour to assemble each PC. It is not necessary to go through the Linux installation process for each PC since system disks can be easily 'cloned' using the 'dd' command, as described on Caltech's website.
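For example, the cloning step can be as simple as the following (the device names are hypothetical: the master system disk is assumed to appear as /dev/hda and the blank target disk, temporarily attached to the same machine, as /dev/hdb; the target must be at least as large as the source):

    # copy the master system disk to the target disk, partition table and boot sector included
    dd if=/dev/hda of=/dev/hdb bs=1024k

Per-node settings (hostname, IP address) then need to be edited on the copy before the new node is brought onto the network.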

Any hardware incompatibility problems?

We have had some minor problems.
(i) We tried 2 other types of motherboard and couldn't get them to work properly (which isn't to say that they absolutely can't be made to work, but we gave up). In particular, on a Soyo 6BI?? motherboard the BIOS would not recognize the boot partition on an existing system disk, did not seem to allow installation of the boot loader from the Red Hat Linux installer, and even refused to boot without a keyboard attached. By contrast, the ASUS P2B-D motherboards have worked perfectly every time, 26 of them so far.
(ii) The PCI ATA/33 IDE controller card did not work with Linux kernel 2.0.34 (which came with Red Hat 5.1), but it was a simple matter to download a more recent kernel (2.0.37) from ftp.kernel.org and install it (a sketch of the procedure is given after this list). However, an ATA/66 controller we have does not work even with kernel 2.2.5 (Red Hat 6.0).
(iii) A 37 GB IDE disk (made by IBM) could not be recognized by the motherboard's BIOS and had to be treated as 32 GB. The ATA/33 controller card can also only detect it as 32 GB - this is why we were trying an ATA/66 controller. Upgrades to the BIOS and/or Linux will no doubt fix this in the near future.
(iv) The Linux Disk Druid utility had problems with a large (16 GB) IDE disk - it couldn't recognize the proper number of cylinders and tried to treat it as only 8 GB. We got around this (for that system disk) by using fdisk to set the correct numbers. For data disks it is not necessary to set up partitions at all - just connect the disk, do mkfs followed by mount (see the example after this list), and it has worked every time and recognized the correct disk size.
(v) The 'tulip' ethernet driver we were using, which used to work, did not work with the latest (D1) version of the Netgear FA310TX ethernet adaptors. Downloading and installing the latest tulip driver from CESDIS fixed that.
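As a rough sketch of the kernel upgrade mentioned in (ii) (the version number and paths are illustrative, and LILO is assumed as the boot loader; this is the standard 2.0.x build procedure rather than a record of exactly what was typed):

    # unpack the new kernel source (downloaded from ftp.kernel.org)
    cd /usr/src
    tar xzf linux-2.0.37.tar.gz
    cd linux
    # configure and build the kernel and its modules
    make menuconfig
    make dep ; make clean
    make bzImage
    make modules ; make modules_install
    # install the new kernel image and update the boot loader
    cp arch/i386/boot/bzImage /boot/vmlinuz-2.0.37
    # (after adding a corresponding entry to /etc/lilo.conf)
    /sbin/lilo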
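And for (iv), a data disk can be set up without any partition table at all; in this sketch the device name (/dev/hdc) and mount point (/data) are placeholders rather than what is actually used on the cluster:

    # make an ext2 filesystem directly on the whole disk
    mkfs -t ext2 /dev/hdc
    # create a mount point and mount the new filesystem
    mkdir -p /data
    mount /dev/hdc /data

An entry in /etc/fstab makes the mount persistent across reboots.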

What are power and cooling requirements?

A UCLA electrician recently measured the current usage of the cluster as 13 Amps which, at 110 V, corresponds to about 1430 Watts. The cluster fills a small office; with the air conditioning on full blast, it is usually cold in there.

Performance

(i) Single-processor performance: PII vs. other CPUs.

These are some tests I did in summer 1998. Of course, faster processors of all these types are now available, but this comparison, in which the PII holds its own against much more expensive CPUs, was influential in choosing the PentiumII architecture. Two different codes were tested: a finite-volume convection simulation code (left 4 bars) and a finite-difference wave propagation code (right 4 bars). The height of each bar is proportional to speed (higher is better).

(ii) Parallel performance

Three applications have been tested: two different 3-D convection codes and a 3-D seismic wave propagation code. One of these doesn't scale too well, while the other two scale well with the right parameters.

Code 1: 3D Finite-difference elastic wave propagation

The scaling is tested either with a fixed problem size (4 million grid points) or with a problem size proportional to the number of nodes (2 million points per CPU). Another comparison is to use either 1 CPU per box (with the other one sitting idle) or both CPUs per box (i.e., the SMP issue). The code was compiled using g77. Plotted is the parallel EFFICIENCY (i.e., the percentage of ideal speedup).
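To make the plotted quantity explicit (taking the usual definitions of parallel efficiency, which is what "% of ideal speedup" amounts to): with T(p) the wall-clock time on p CPUs,

\[
    E_{\mathrm{fixed}}(p) = \frac{T(1)}{p \, T(p)}, \qquad
    E_{\mathrm{scaled}}(p) = \frac{T(1)}{T(p)},
\]

where the first form applies to the fixed-size (4 million point) runs and the second to the scaled-size (2 million points per CPU) runs; 100% corresponds to ideal speedup.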
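The 1-CPU-per-box versus 2-CPUs-per-box runs can be controlled through the MPI machine file. The sketch below assumes the MPICH implementation and made-up host names (node01, node02, ...) and executable name (wave3d); the actual names and launch command on the cluster may differ:

    # machines.1cpu : one MPI process per box (the second CPU sits idle)
    node01
    node02
    node03

    # machines.2cpu : two MPI processes per box (both CPUs used)
    node01:2
    node02:2
    node03:2

    # e.g., run on 6 CPUs using both CPUs per box
    mpirun -np 6 -machinefile machines.2cpu ./wave3d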

Code 2: 3D Spherical convection using a spectral transform method

Code 3: 3D Cartesian convection, multigrid, grid-based method

The efficiency tests below compare these two methods (Codes 2 and 3), as well as fixed vs. scaled problem sizes.