[Aces-support] -bash: qstat : command not found
Peter H Israelsson
peteri at MIT.EDU
Thu Aug 4 14:48:45 EDT 2005
FYI, I just ran a test program and everything seems to be working fine
now: qsub works when called from another script. Thanks for the help.
Peter
Quoting aces-admin at techsquare.com:
> hello world-
>
> i believe that the compute nodes are
> back to normal. looks to have been corruption
> due to ao's crash earlier yesterday.
>
> continuing to look at the root cause(s),
> but please send mail if you notice anything
> else odd...
>
> [greg]
>
>
>> From: Simon Mcclusky <simon at mit.edu>
>> Mime-Version: 1.0
>> Date: Thu, 04 Aug 2005 09:47:51 -0400
>> Cc:
>> Reply-To: ACES-support at mitgcm.org, simon at mit.edu
>>
>> Peter,
>>
>> There seems to be something strange about the way ao compute nodes are
>> behaving. I can issue qsub commands that are accepted from the geojr
>> compute nodes, but not from the ao compute nodes. Until this is resolved
>> you can run your scripts on the geojr cluster.
>>
>> Greg - the problem on ao is the compute nodes see strange info for the
>> torque /bin directory where qsub resides:
>>
>> ?--------- ? ? ? ? ? bin
>>
>> I have no clue what this means!
>>
>> -Simon
>> On Thu, 2005-08-04 at 03:38, Peter H Israelsson wrote:
>> > I am having a similar but slightly different problem with the qsub
>> > command, as detailed below.
>> >
>> > First I need to explain how my jobs are run: Because my simulations
>> > are longer than the max walltime of the queues available to me, I
>> > have rewritten my code to automatically run as a sequence of smaller
>> > simulations. The code automatically stops itself when its run time
>> > is approaching the max walltime, and signals the calling pbs script
>> > that the simulation needs to be restarted from its current time
>> > level. Before exiting, the calling pbs script creates a new pbs
>> > script file and submits a new job, i.e., the last command it issues
>> > before exiting is "qsub [new_job]". That way, the next part of the
>> > simulation is assigned a new job number, and the walltime is reset.
>> >
>> > This process was working fine before the ao system went down last
>> > night, i.e., the code stopped and restarted itself automatically
>> > with no problems. However, since the system went down last night,
>> > this sequential process is now failing because it says that it
>> > cannot find the qsub command:
>> > /var/torque/mom_priv/jobs/8322.ao.SC: line 65: qsub: command not found
>> > I have tested this a number of times and get the same result each time.
>> >
>> > So something is different on ao since yesterday's reboot. The strange
>> > thing is that when I manually log on to ao.acesgrid.org, I do have
>> > access to qsub, qstat, etc. So I don't understand why the 'qsub'
>> > command doesn't work when issued by an existing job.
>> >
>> > Any ideas what is going on? Thanks.
>> >
>> > Regards,
>> > Peter
>> >
>> > PS Greg, I am confused by your last email because the module 'magick'
>> > you refer to is not listed when I type 'module avail'. Aren't the
>> > qsub, qstat, etc commands automatically loaded (in one of the
>> > 'default' modules such as 'torque/1.2.0p4')? Also, I get an error
>> > when I try typing 'module load magick', saying that the module
>> > cannot be found.
>> >
>> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > Peter H. Israelsson
>> > Massachusetts Institute of Technology
>> > Department of Civil & Environmental Engineering
>> > 48-114, 15 Vassar Street, Cambridge, MA 02139, USA
>> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >
>> >
>> > Quoting aces-admin at techsquare.com:
>> >
>> > > hello beghein-
>> > >
>> > > are you certain that your login bits are setup
>> > > to load the module magick ? i just tested this
>> > > and it worked for me...
>> > >
>> > > . ssh ts at ao.acesgrid.org
>> > > . ao: module list
>> > > . ao: qstat
>> > >
>> > > [greg]
>> > >
>> > >> Mime-Version: 1.0
>> > >> Date: Wed, 3 Aug 2005 16:32:03 -0400
>> > >> From: Caroline Beghein <beghein at mit.edu>
>> > >> Cc:
>> > >> Reply-To: ACES-support at mitgcm.org
>> > >>
>> > >> Hi
>> > >>
>> > >> Is there still something wrong with the cluster? Whether I login to
>> > >> ao or geojr, I cannot start any job. If I type qsub ... or qstat I
>> > >> get "-bash: qstat : command not found"
>> > >> What does that mean?
>> > >>
>> > >> Thanks
>> > >>
>> > >>
>> > >> --
>> > >> Caroline
>> > >>
>> > >>
>> > >>
>> > >> Caroline Beghein
>> > >> 77 Massachusetts avenue #54-526
>> > >> Cambridge, MA 02139
>> > >> tel.: +1 617 253 3589
>> > >> http://www.mit.edu/~beghein
>> > >> _______________________________________________
>> > >> Aces-support mailing list
>> > >> Aces-support at acesgrid.org
>> > >> http://acesgrid.org/mailman/listinfo/aces-support
>> > >>
>> --
>> Simon McClusky
>> RM 54-614, Dept EAPS, MIT,
>> 77 Massachusetts Ave,
>> Cambridge, MA 02139
>> USA
>>
>> email: simon at mit.edu
>> Ph: 617 253-3077
>> Fax: 617 253-1699
>> Cell:857 928-5891
>>
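
For the self-resubmitting job chain described in Peter's message, a guard
around the final qsub call can make a failure like "qsub: command not found"
easier to diagnose from inside a running job. What follows is only a minimal
sketch, not the script that was used on ao. The function names find_cmd and
resubmit are hypothetical, any fallback directory passed in is an assumption
about where the Torque binaries might live, and the sketch assumes a
bash-compatible shell.

```shell
#!/bin/bash
# Hypothetical helpers for the tail end of a self-resubmitting PBS script.
# Nothing here reflects the real ao configuration.

# Print the full path of a usable command, or return 1 if none is found.
# $1: command name (e.g. qsub); remaining args: fallback directories.
find_cmd() {
    local name=$1; shift
    local found dir
    # First try the PATH that the running job actually sees.
    if found=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi
    # Then try each caller-supplied fallback directory in order.
    for dir in "$@"; do
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Submit the next job in the chain, or fail loudly so the break is noticed.
# $1: job script to submit; remaining args: fallback directories for qsub.
resubmit() {
    local script=$1; shift
    local qsub_path
    if ! qsub_path=$(find_cmd qsub "$@"); then
        echo "resubmit: qsub not found in PATH ($PATH) or fallbacks: $*" >&2
        return 127
    fi
    "$qsub_path" "$script"
}
```

As the last step of a job script this might be called as
"resubmit next_part.pbs /some/torque/bin", where both the script name and the
directory are placeholders. The error message records the PATH seen by the
job, which is the piece of information that differed between an interactive
login and a running job in this thread.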