The Cluster Tux Project
By Steven Webb

Pic stolen from mklinux.org

Overview
    The TECOM Cluster Tux project was an attempt, in 1999, to move a highly computational task from a 300 MHz 8-CPU SGI Origin 2000 to a group of PCs running Linux (8 nodes, each with two 500 MHz Pentium III CPUs). The idea was inspired by the Beowulf project and other general Linux clustering projects. This is the documentation of the planning and construction of the first 8-node cluster for the project. When I left the project in 2005, I had over 500 nodes in 10+ clusters running MM5 for this same project, and by then we had moved away from Dolphin's product in favor of Myrinet.
What is MM5?
    MM5 is a piece of software that takes a bunch of weather data and runs it forward in time to make a forecast for a certain area. This forecast includes everything you can think of (moisture, air pressure, cloud cover, ...), and with the help of a package called Vis5D or VisAD you can manipulate this forecast in 3D.

    For more information, see the NCAR MM5 Homepage.
What is MPI?
    MPI stands for "Message-Passing Interface" and is a cross-platform method of inter-process communication. Its predecessor was PVM, which stands for "Parallel Virtual Machine". Both of these APIs allow processes to communicate over a network in a heterogeneous environment without having to worry about byte order or any of the other problems that often plague distributed heterogeneous programming.

    For more information, see the Argonne National Laboratory Computer Science Division's MPI page.
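
    Just to give a flavor of what an MPI program looks like, here is a minimal sketch in C. This is my own illustration (it is not part of MM5), and the exact compile and launch commands will depend on your MPI installation:

    /* Minimal MPI sketch: rank 0 sends an integer to rank 1.          */
    /* Compile with something like: mpicc hello.c -o hello             */
    /* Run with something like:     mpirun -np 2 ./hello               */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, payload;
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* start up MPI        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        if (rank == 0 && size > 1) {
            payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 got %d from rank 0 (of %d processes)\n",
                   payload, size);
        }

        MPI_Finalize();                         /* shut down MPI       */
        return 0;
    }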
Porting MM5 to MPI
Why do this?
    Max MIPS per $ = cheap computing power.

    Today's desktop machines are growing in CPU power much faster, and can be purchased much more cheaply, than today's supercomputers. An 8-processor SGI Origin 2000 with 1 GB of RAM costs around $120K, and the processors only come at 300 MHz. For the same amount of money (May 1999 dollars) it is possible to set up 24 PCs at $5000 each, every one with 1 GB of RAM and 900-1100 MHz of processing power. This is a substantial performance/cost increase ($120K SGI = 2400 MHz, $120K Intel cluster = 24000 MHz - a factor of 10!).
Design
  • Networking
    We originally decided to go with Gigabit Ethernet as the network transport of choice because the MM5 architecture is very network-dependent: although MM5 is optimized to run in a parallel environment, it was initially written to be run on a shared-memory machine. Since we're moving MM5 over to a cluster with distributed memory, the network bandwidth suddenly becomes a very serious bottleneck.

    Then we decided to add a Dolphin/Scali interconnect system, which (it turns out) has much lower latency than almost all other MPI-based solutions (see the small ping-pong sketch after this list for how that kind of latency difference can be measured).

  • Node hardware
    • 19" rack-mount ATX computer case
    • Epox KP6BS ATX dual-CPU motherboard
    • 2 x Intel Pentium III 500 MHz CPUs
    • 1 GB RAM in the master, 512 MB RAM in each slave
    • 3Com 3c985B SX Gigabit network card
    • 4.3 GB Maxtor hard drive

  • Node software
    • Linux
    • mpich (MPI implementation)
    • Scali's ScaMPI (for the Dolphin interconnect)
    • MM5v3 itself (see "How to use it" below)

  • Misc
    • Rack
    • Cisco 8-port Gigabit Ethernet switch
    • Dolphin/Scali interconnect system
    • One extra 10 GB Maxtor hard drive
    • B&W SVGA monitor
    • Keyboard

  • Price
    • Approximately $47,000 (in 1999 dollars and technology), not counting man-hours. This price includes both the Gigabit Ethernet and the Scali hardware; subtract $8K for the Gigabit Ethernet and you have the 'package' price. Today it can be done for much less with multi-core processors and much cheaper networking options.
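
    As mentioned in the Networking item above, interconnect latency is what makes or breaks a distributed-memory MM5 run. Here is a rough ping-pong sketch in C showing the kind of microbenchmark that can be used to compare the latency of Gigabit Ethernet against Scali. It is my own illustration, not taken from MM5 or the Scali software:

    /* Ping-pong latency sketch: ranks 0 and 1 bounce a 1-byte message */
    /* back and forth and report the average one-way time.             */
    #include <stdio.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        int rank, i;
        char byte = 0;
        double start, elapsed;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);            /* line everyone up    */
        start = MPI_Wtime();

        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        elapsed = MPI_Wtime() - start;
        if (rank == 0)
            printf("average one-way latency: %g microseconds\n",
                   elapsed / (2.0 * REPS) * 1.0e6);

        MPI_Finalize();
        return 0;
    }

    Run it across two nodes with something like "./mpirun -arch LINUX -np 2 ./pingpong" and compare the numbers you get over each interconnect.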
Current Status
    Here are some timings (Gigabit Ethernet, Myrinet & Scali):
    [timing chart]
    Thanks to Al Bourgeois for his work with MM5 and getting me these benchmarks!
    Some pics: More here
How to use it:
  • Under mpich:
    • Always do your work from "c-tux" or "node1". It is not necessary to log into any other nodes.
    • Copy your pre-processed data into /data.
    • Make sure that the links in /mm5v3/Run point to the right files in /data.
    • cd into /mm5v3/Run.
    • Type "sh clean.sh" to clean out the old debug and output files from any previous runs.
    • Type "./mpirun -arch LINUX -np 16 /mm5v3/Run/mm5.mpp".
    • To monitor your run, check out rsl.error.0000 by doing a "tail -f rsl.error.0000".

    • If you need to re-compile, let me know and I'll show you the ropes.
  • Under ScaMPI:
    • Use mpimon as shown in the user's manual found here.

This has been a Quicky production