[nSLUG] Wither workstations
nslug at fop.ns.ca
Sun Feb 13 16:07:45 AST 2005
On Sun, 13 Feb 2005, George N. White III wrote:
> On Sun, 13 Feb 2005, Dop Ganger wrote:
> I get the impression that you can't have any loose papers around these
> things or they will get sucked onto the fan grill. Moshe Bar says his
> dual G5 even sounds like a vacuum cleaner.
Well, it's not *quite* that bad with the DL machines, but I suspect that
might be because the 1U configuration means that there's no dead air spots
where heat accumulates.
> I think exporting the display will work for most users, but if the
> rack mount machines are used as headless workstations then
> we will want to bring switched ethernet to every machine (as opposed
> to the beowulf approach where you put a switch on the rack for
> message passing and only the master node is connected to the
Yes, you'll definitely want a switched infrastructure, preferably with a
gigabit backbone to/from the master node(s).
> I've been reading up on grid technology. Our workload is a mix of
> compiles, visualizations, with lots of fairly routine batch processing. In
> principle load balancing tools should help us get some extra batch
> processing done outside working hours, but in practice a lot of effort
> goes into managing disk space. A typical batch processing run reads a
> terabyte and generates 100GB of output -- some people just fire off
> jobs until they run out of space, clean up the mess, and start over with
> the first job that bombed. That's OK on a "personal" workstation, but I'm
> not sure the existing tools are robust enough to let those jobs loose on a
> grid. Looking at some of the published experiences with real grids, 20-30%
> of jobs fail due to running out of disk.
Disk handling is definitely fun. It sounds like you'll need a central
storage cluster (or SAN, or NAS, or whatever fits) then have plenty of
scratch space on the nodes.
A few things that may help: with SGE you can create queues that have
resource lists defined, so you can split up jobs between those that
require (say) 50GB and those that require 100GB of disk storage, which
helps isolate jobs from each other. You can also use the quota system to
stall or kill jobs that exceed their limitations. You can then wrap a
script around the job that will clean up the scratch space after they're
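
As a rough illustration of the wrapper idea, here is a minimal Python
sketch. It assumes each job gets its own directory under a node-local
scratch root and refuses to start if the free space is below what the job
asks for. The function name run_with_scratch and the JOB_SCRATCH
environment variable are made up for this example; they are not part of
SGE or any other scheduler.

```python
import os
import shutil
import subprocess
import tempfile


def run_with_scratch(cmd, scratch_root, required_bytes):
    """Run cmd with a private scratch directory, then remove that directory.

    Returns the command's exit code, or None if scratch_root did not have
    required_bytes free and the job was never started.
    """
    # Refuse to start when the scratch filesystem is already too full.
    if shutil.disk_usage(scratch_root).free < required_bytes:
        return None

    # One directory per job keeps jobs from trampling each other's files.
    scratch = tempfile.mkdtemp(prefix="job-", dir=scratch_root)
    try:
        # JOB_SCRATCH is a hypothetical variable the job is assumed to read.
        env = dict(os.environ, JOB_SCRATCH=scratch)
        return subprocess.run(cmd, env=env).returncode
    finally:
        # Clean up whether the job succeeded, failed, or raised.
        shutil.rmtree(scratch, ignore_errors=True)
```

A job would then be submitted as the wrapper plus the real command, e.g.
run_with_scratch(["./process_run.sh"], "/scratch", 100 * 10**9), where the
script name and the /scratch path are placeholders. The free-space check
is only a snapshot taken before the job starts, so it complements rather
than replaces quotas or queue resource limits.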
I've also heard good things about Rocks
(http://www.rocksclusters.org/Rocks/), which is more of a cluster
management system. You can then plug in cluster management
systems (SGE, Maui, Beowulf, etc) which gives you greater flexibility to
manage the cluster.