[Aces-support] ITRDA--again?

Sai Ravela ravela at MIT.EDU
Tue Dec 7 03:20:06 EST 2004


Folks,

	It's 3am and ITRDA is gone again. What is going on?
Time to vent.
For the last few days I have not been able to use ITRDA at all because of
all sorts of crashes on the head node
(I don't know if the Dell boxes are dying too, but they don't seem to be).
We *need* a stable server, even if it means an older machine and kernel (I
really don't care about the latest queueing or distribution), so work gets
done at a reasonable pace. This is the most essential thing as I see it. Do
we know what the last stable version is, and can we roll back to it?

	We do not have a policy for how upgrades are done and in my view we need
one. The head node is not a stomping ground for ad-hoc updates and changes,
and we can't afford a "things will be lumpy till they smooth out" policy
here. There are many examples of upgrade policies out there that seem to be
working. I suggest we borrow one of those and put it to good use here. It
goes something like this: If it ain't broken don't fix it.

How about we keep things as they are until we *know* an upgrade works? How
would we know this? A spare machine would help (there are four login servers
now and, AFAIK, the others are not used; there is also a cluster where one or
two nodes per site can be thoroughly tested first...yes! this makes sense!).
How about an announcement to the community that goes like this:

"Dear users, we plan to upgrade the following machines on such and so day
and time. Please let us know whether there are any objections. We have every
reason to believe that this upgrade is useful and has been tested on such
and so for so long and such. Thank you very much, XOXOXO"

OK, that should do it, by god, it's that simple!

ITRDA should not continue to be the unlucky scapegoat; we need to get stuff
done.

Sorry to vent, but I've had one too many crashes here.
What are we going to do about this?
End vent.

Sai

-----Original Message-----
From: aces-support-bounces at mitgcm.org
[mailto:aces-support-bounces at mitgcm.org]On Behalf Of
ACES-support at mitgcm.org
Sent: Monday, December 06, 2004 5:52 PM
To: ACES-support at mitgcm.org
Cc: ACES-support at mitgcm.org
Subject: Re: [Aces-support] home quota changed?


hello yuhua-

the quotas have not been changed, but
if you are writing data to your home
directory then you should re-consider
(for there are quotas on the home space).

this may be a consequence of either saturday's
or this-afternoon's crashes. we've backed off
to the last-known-working kernel, so the machine
should be relatively more stable...

[greg]

> From: zhyh at mit.edu
> Date: Mon, 6 Dec 2004 14:51:57 -0500 (EST)
> MIME-Version: 1.0
> Cc:
> Reply-To: ACES-support at mitgcm.org
>
> Hello,
>
> I failed to write any files to the /home directory. Has the quota been
> changed?
>
> Yuhua
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++
> Yuhua Zhou
>
> Remote Sensing & Data Assimilation Group
> Hydrology Program
> Ralph M. Parsons Laboratory
> Department of Civil and Environmental Engineering
> MIT, 48-212
> Cambridge, MA 02139
>
> _______________________________________________
> Aces-support mailing list
> Aces-support at acesgrid.org
> http://acesgrid.org/mailman/listinfo/aces-support
>



