[Aces-support] -bash: qstat : command not found

Peter H Israelsson peteri at MIT.EDU
Fri Aug 5 10:52:50 EDT 2005


hello greg,

that explains it then...this morning things are working fine.

thanks, and let me know if you need me to kill any processes to free up
specific nodes.

peter



Quoting aces-admin at techsquare.com:

> hello peteri-
>
> you had a number of jobs running on
> the upper-ao nodes (a54-1727-069++)
> that i did not kill yesterday.
>
> it was not possible, however, to fix
> the "qsub not found" problem without
> killing your jobs, so the best that
> i could do was to make sure that these
> nodes were allocated to no further jobs.
>
> this morning, as your jobs had completed,
> i was able to fix the remaining nodes (a
> handful of them) and restore them to the queues.
>
> [greg]
>
>
>> Date: Fri,  5 Aug 2005 00:02:38 -0400
>> From: Peter H Israelsson <peteri at mit.edu>
>> MIME-Version: 1.0
>> Cc:
>> Reply-To: ACES-support at mitgcm.org
>>
>> It seems that I spoke too soon earlier.  I am getting the same error again
>> periodically, i.e., qsub is not found when issued by another calling
>> script. It is happening more often than not with my current batch of
>> simulations. However, I reran my test script and it worked fine.  So I
>> guess, as before, some of the nodes are having problems while some are OK.
>> I'll see if I can figure out which ones don't work.  Sorry to be the bearer
>> of bad news (again).
>>
>> Regards,
>> Peter
>>
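
One way to find the affected nodes, offered only as a sketch and not as what was actually run here: start each job with a small probe that prints the node name and whether qsub resolves on that node's PATH. The function name probe_qsub and its optional command-name argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical probe: report whether a command (default: qsub) is visible
# on this node. Meant to be called from inside a batch job, so the answer
# reflects the compute node's environment rather than the login node's.
probe_qsub() {
    local cmd="${1:-qsub}"
    local host
    host="$(uname -n)"
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$host: $cmd found at $(command -v "$cmd")"
        return 0
    else
        echo "$host: $cmd NOT found (PATH=$PATH)"
        return 1
    fi
}
```

Each job's output file then carries one line naming the node it ran on, so the nodes that fail can be read off the collected outputs.
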
>>
>> Quoting aces-admin at techsquare.com:
>>
>> > hello world-
>> >
>> > i believe that the compute nodes are
>> > back to normal. looks to have been corruption
>> > due to ao's crash earlier yesterday.
>> >
>> > continuing to look at the root cause(s),
>> > but please send mail if you notice anything
>> > else odd...
>> >
>> > [greg]
>> >
>> >
>> >> From: Simon Mcclusky <simon at mit.edu>
>> >> Mime-Version: 1.0
>> >> Date: Thu, 04 Aug 2005 09:47:51 -0400
>> >> Cc:
>> >> Reply-To: ACES-support at mitgcm.org, simon at mit.edu
>> >>
>> >> Peter,
>> >>
>> >> There seems to be something strange about the way ao compute nodes are
>> >> behaving. I can issue qsub commands that are accepted from the geojr
>> >> compute nodes, but not from the ao compute nodes. Until this is resolved
>> >> you can run your scripts on the geojr cluster.
>> >>
>> >> Greg - the problem on ao is the compute nodes see strange info for the
>> >> torque /bin directory where qsub resides:
>> >>
>> >> ?---------  ? ?    ?       ?            ? bin
>> >>
>> >> I have no clue what this means!
>> >>
>> >> -Simon
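
For context, ls prints question marks like these when it can read an entry's name from the parent directory but cannot stat the entry itself. A stale or damaged mount is one possible cause, though that is only an assumption about this case. A minimal check a job could run on a node, with a hypothetical function name and with the directory passed in as an argument:

```shell
#!/bin/bash
# Hypothetical check: can this node stat the directory that should hold
# qsub, and does that directory contain an executable qsub?
check_bin_dir() {
    local dir="$1"
    # ls -ld fails if the directory itself cannot be stat'ed.
    if ! ls -ld "$dir" >/dev/null 2>&1; then
        echo "cannot stat $dir"
        return 2
    fi
    if [ -x "$dir/qsub" ]; then
        echo "$dir/qsub is executable"
        return 0
    fi
    echo "$dir can be stat'ed but has no executable qsub"
    return 1
}
```

A return code of 2 would point at the filesystem rather than at PATH or module settings.
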
>> >> On Thu, 2005-08-04 at 03:38, Peter H Israelsson wrote:
>> >> > I am having a similar but slightly different problem with the qsub
>> >> > command, as detailed below.
>> >> >
>> >> > First I need to explain how my jobs are run: Because my simulations are
>> >> > longer than the max walltime of the queues available to me, I have
>> >> > rewritten my code to automatically run as a sequence of smaller
>> >> > simulations.  The code automatically stops itself when its run time is
>> >> > approaching the max walltime, and signals the calling pbs script that the
>> >> > simulation needs to be restarted from its current time level.  Before
>> >> > exiting, the calling pbs script creates a new pbs script file and submits
>> >> > a new job, i.e., the last command it issues before exiting is
>> >> > "qsub [new_job]".  That way, the next part of the simulation is assigned
>> >> > a new job number, and the walltime is reset.
>> >> >
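
The restart chain described above might look roughly like the following. This is a sketch under assumptions, not the actual scripts: the restart flag file, the run_segment.sh name, the 12-hour walltime, and the QSUB override variable are all hypothetical. run_segment.sh is assumed to run one segment of the simulation and then call resubmit_if_needed as its last step.

```shell
#!/bin/bash
# Hypothetical sketch of a self-resubmitting job chain.

# Write the PBS script for the next segment of the simulation.
write_next_job() {
    local segment="$1" outfile="$2"
    cat > "$outfile" <<EOF
#!/bin/bash
#PBS -N sim_seg${segment}
#PBS -l walltime=12:00:00
cd "\$PBS_O_WORKDIR"
./run_segment.sh ${segment}
EOF
}

# If the simulation left a restart flag, create and submit the next job.
# QSUB may be set to a full path when qsub is not on the job's PATH.
resubmit_if_needed() {
    local restart_flag="$1" segment="$2"
    [ -e "$restart_flag" ] || return 0
    local next=$((segment + 1))
    write_next_job "$next" "job_${next}.pbs"
    "${QSUB:-qsub}" "job_${next}.pbs"
}
```

Pointing QSUB at an absolute path would sidestep a missing PATH entry inside the job, but it would not help if the directory holding qsub cannot be read on the node at all.
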
>> >> > This process was working fine before the ao system went down last night,
>> >> > i.e., the code stopped and restarted itself automatically with no
>> >> > problems.  However, since the system went down last night, this
>> >> > sequential process is now failing because it says that it cannot find the
>> >> > qsub command:
>> >> > /var/torque/mom_priv/jobs/8322.ao.SC: line 65: qsub: command not found
>> >> > I have tested this a number of times and get the same result each time.
>> >> >
>> >> > So something is different on ao since yesterday's reboot.  The strange
>> >> > thing is that when I manually log on to ao.acesgrid.org, I do have access
>> >> > to qsub, qstat, etc.  So I don't understand why the 'qsub' command
>> >> > doesn't work when issued by an existing job.
>> >> >
>> >> > Any ideas what is going on?  Thanks.
>> >> >
>> >> > Regards,
>> >> > Peter
>> >> >
>> >> > PS Greg, I am confused by your last email because the module 'magick'
>> >> > you refer to is not listed when I type 'module avail'.  Aren't the qsub,
>> >> > qstat, etc. commands automatically loaded (in one of the 'default'
>> >> > modules such as 'torque/1.2.0p4')?  Also, I get an error when I try
>> >> > typing 'module load magick', saying that the module cannot be found.
>> >> >
>> >> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >   Peter H. Israelsson
>> >> >   Massachusetts Institute of Technology
>> >> >   Department of Civil & Environmental Engineering
>> >> >   48-114, 15 Vassar Street, Cambridge, MA 02139, USA
>> >> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >> >
>> >> >
>> >> > Quoting aces-admin at techsquare.com:
>> >> >
>> >> > > hello beghein-
>> >> > >
>> >> > > are you certain that your login bits are set up
>> >> > > to load the module magick ? i just tested this
>> >> > > and it worked for me...
>> >> > >
>> >> > > . ssh ts at ao.acesgrid.org
>> >> > > . ao: module list
>> >> > > . ao: qstat
>> >> > >
>> >> > > [greg]
>> >> > >
>> >> > >> Mime-Version: 1.0
>> >> > >> Date: Wed, 3 Aug 2005 16:32:03 -0400
>> >> > >> From: Caroline Beghein <beghein at mit.edu>
>> >> > >> Cc:
>> >> > >> Reply-To: ACES-support at mitgcm.org
>> >> > >>
>> >> > >> Hi
>> >> > >>
>> >> > >> Is there still something wrong with the cluster? Whether I login to
>> >> > >> ao or geojr, I cannot start any job. If I type qsub ... or qstat I
>> >> > >> get "-bash: qstat : command not found"
>> >> > >> What does that mean?
>> >> > >>
>> >> > >> Thanks
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> 	Caroline
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> Caroline Beghein
>> >> > >> 77 Massachusetts avenue #54-526
>> >> > >> Cambridge, MA 02139
>> >> > >> tel.: +1 617 253 3589
>> >> > >> http://www.mit.edu/~beghein
>> >> > >> _______________________________________________
>> >> > >> Aces-support mailing list
>> >> > >> Aces-support at acesgrid.org
>> >> > >> http://acesgrid.org/mailman/listinfo/aces-support
>> >> > >>
>> >> --
>> >> Simon McClusky
>> >> RM 54-614, Dept EAPS, MIT,
>> >> 77 Massachusetts Ave,
>> >> Cambridge, MA 02139
>> >> USA
>> >>
>> >> email: simon at mit.edu
>> >> Ph: 617 253-3077
>> >> Fax: 617 253-1699
>> >> Cell:857 928-5891




