[nSLUG] Wither workstations
George N. White III
aa056 at chebucto.ns.ca
Mon Feb 14 08:49:14 AST 2005
On Sun, 13 Feb 2005, Dop Ganger wrote:
> On Sun, 13 Feb 2005, George N. White III wrote:
> Disk handling is definitely fun. It sounds like you'll need a central
> storage cluster (or SAN, or NAS, or whatever fits) then have plenty of
> scratch space on the nodes.
Yes. The guiding principles should be:
1. you don't need to take a box apart to swap a disk
2. replacement disks don't require any manual partitioning or restores
(e.g., RAID or automated configuration like Rocks for system disks).
3. you don't need to install the disk in a PC and run a DOS utility to
get the failure codes for the return authorization. Someday we will have
disks that sense when they are about to fail and submit the return
authorization request for you.
> A few things that may help; with SGE you can create queues that have
> resource lists defined, so you can split up jobs between those that
> require (say) 50GB and those that require 100GB of disk storage, which
> helps isolate jobs from each other. You can also use the quota system to
> stall or kill jobs that exceed their limitations. You can then wrap a
> script around the job that will clean up the scratch space after they're
We use quotas, but they are far from ideal for our workloads, since
many jobs create huge temporary files (e.g., uncompressed images).
Either you give out quotas that total far more than the available
disk space and hope no two users run big jobs at the same time,
or you adjust quotas early and often to track workloads. A system
that adjusted quotas dynamically based on actual job characteristics
would be a big help.
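A rough sketch of the wrapper idea, assuming Python is available on the
nodes. The names run_in_scratch, scratch_root and min_free_bytes are made
up for illustration and are not part of SGE, Condor or Rocks. It checks the
free space actually available before starting (rather than relying on a
fixed per-user quota), gives the job a private scratch directory, and
removes that directory whether or not the job succeeds:

```python
import os
import shutil
import subprocess
import tempfile


def run_in_scratch(cmd, scratch_root, min_free_bytes=0):
    """Run cmd in a private scratch directory and always remove it afterwards.

    Hypothetical helper: refuses to start when scratch_root has less than
    min_free_bytes free. Returns the job's exit status.
    """
    free = shutil.disk_usage(scratch_root).free
    if free < min_free_bytes:
        raise RuntimeError(
            f"only {free} bytes free in {scratch_root}, need {min_free_bytes}"
        )
    workdir = tempfile.mkdtemp(prefix="job-", dir=scratch_root)
    try:
        # The job gets its scratch directory as cwd and in $TMPDIR.
        env = dict(os.environ, TMPDIR=workdir)
        return subprocess.run(cmd, cwd=workdir, env=env).returncode
    finally:
        # Clean up even if the job failed or was interrupted.
        shutil.rmtree(workdir, ignore_errors=True)
```

This only checks free space at start-up, so two big jobs started at the
same moment can still collide; it is a stopgap, not the dynamic quota
system wished for above.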
I've been playing with Condor (it supports SGI, so people can start using
it immediately), but on FC2, version 6.6.8 couldn't determine the system
memory or swap space, and since you don't get sources you can't investigate.
Condor tries to be good about cleaning up after failed jobs.
> I've also heard good things about Rocks
> (http://www.rocksclusters.org/Rocks/), which is more of a cluster
> management system. You can then plug in cluster management
> systems (SGE, Maui, Beowulf, etc) which gives you greater flexibility to
> manage the cluster.
Our modellers have a Rocks cluster -- I'm hoping to share some rack space
with them. The big advantage of Rocks is that it does a fresh install of
the OS each time it restarts a node, so you don't have any persistent
state on the nodes.
George N. White III <aa056 at chebucto.ns.ca>