[Aces-support] -bash: qstat : command not found

aces-admin at techsquare.com
Thu Aug 4 10:28:56 EDT 2005


hello world-

i believe that the compute nodes are
back to normal. it looks to have been corruption
caused by ao's crash yesterday.

continuing to look at the root cause(s), 
but please send mail if you notice anything 
else odd...

[greg]


> From: Simon McClusky <simon at mit.edu>
> Mime-Version: 1.0
> Date: Thu, 04 Aug 2005 09:47:51 -0400
> Cc: 
> Reply-To: ACES-support at mitgcm.org, simon at mit.edu
> 
> Peter,
> 
> There seems to be something strange about the way ao compute nodes are
> behaving. I can issue qsub commands that are accepted from the geojr
> compute nodes, but not from the ao compute nodes. Until this is resolved
> you can run your scripts on the geojr cluster.
> 
> Greg - the problem on ao is the compute nodes see strange info for the
> torque /bin directory where qsub resides:
> 
> ?---------  ? ?    ?       ?            ? bin
> 
> I have no clue what this means!
> 
> -Simon  
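
For reference, a listing such as "?---------  ? ? ... bin" is what ls prints
when it can read an entry's name from the parent directory but the stat call
on the entry itself fails. A stale NFS file handle after a server crash is one
common cause, though whether that applies to ao is an assumption that nothing
in this thread confirms. A minimal shell sketch for checking a tool directory
from a compute node might look like the following; check_tool_dir and the
path in the usage line are hypothetical names, not part of torque.

```shell
#!/bin/bash
# Hypothetical helper: report whether directory $1 can be stat'ed and
# whether it contains an executable named $2 (for example qsub).
check_tool_dir() {
    local dir="$1" tool="$2"
    # [ -d ] relies on stat; it fails for a missing directory and also
    # for one whose metadata cannot be read (the "?---------" case).
    if ! [ -d "$dir" ]; then
        echo "cannot stat: $dir"
        return 1
    fi
    if [ -x "$dir/$tool" ]; then
        echo "ok: $dir/$tool"
    else
        echo "missing: $dir/$tool"
        return 2
    fi
}
```

Usage, with a placeholder path: check_tool_dir /path/to/torque/bin qsub
Running it on both a login node and a compute node shows whether the two see
the same directory state.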
> On Thu, 2005-08-04 at 03:38, Peter H Israelsson wrote:
> > I am having a similar but slightly different problem with the qsub command, as
> > detailed below.
> > 
> > First I need to explain how my jobs are run: Because my simulations are longer
> > than the max walltime of the queues available to me, I have rewritten my code
> > to automatically run as a sequence of smaller simulations.  The code
> > automatically stops itself when its run time is approaching the max walltime,
> > and signals the calling pbs script that the simulation needs to be restarted
> > from its current time level.  Before exiting, the calling pbs script creates a
> > new pbs script file and submits a new job, i.e., the last command it issues
> > before exiting is "qsub [new_job]".  That way, the next part of the simulation
> > is assigned a new job number, and the walltime is reset.
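
The resubmission scheme described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not Peter's actual
script: the names SUBMIT_CMD, write_next_job, resubmit_if_needed and
run_segment.sh are hypothetical, as is the convention that the simulation
exits with status 2 to ask for a restart. The sketch also checks that the
submit command is on PATH before calling it, so that a node where qsub has
gone missing produces a clear error.

```shell
#!/bin/bash
# Command used to submit the follow-on job; defaults to qsub.
SUBMIT_CMD="${SUBMIT_CMD:-qsub}"

# Write a job script for segment number $1 into file $2.
# run_segment.sh is a hypothetical wrapper around the simulation.
write_next_job() {
    local segment="$1" script="$2"
    cat > "$script" <<EOF
#!/bin/bash
#PBS -N segment_${segment}
cd "\${PBS_O_WORKDIR:-.}"
./run_segment.sh ${segment}
EOF
}

# $1 = exit status of the simulation, $2 = number of the next segment.
# Assumption: exit status 2 means "stopped near max walltime, restart me".
resubmit_if_needed() {
    local rc="$1" next="$2" script="segment_${2}.pbs"
    if [ "$rc" -ne 2 ]; then
        return 0    # finished or failed: nothing to resubmit
    fi
    # Fail loudly if the submit command is not visible on this node.
    if ! command -v "$SUBMIT_CMD" >/dev/null 2>&1; then
        echo "error: $SUBMIT_CMD not found on $(uname -n); PATH=$PATH" >&2
        return 127
    fi
    write_next_job "$next" "$script"
    "$SUBMIT_CMD" "$script"
}
```

In a job script, the last lines would then be something like running the
simulation followed by: resubmit_if_needed $? 2
Setting SUBMIT_CMD=echo gives a dry run that writes the next script without
submitting it.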
> > 
> > This process was working fine before the ao system went down last night, i.e.,
> > the code stopped and restarted itself automatically with no problems.  However,
> > since the system went down last night, this sequential process is now failing
> > because it says that it cannot find the qsub command:
> > /var/torque/mom_priv/jobs/8322.ao.SC: line 65: qsub: command not found
> > I have tested this a number of times and get the same result each time.
> > 
> > So something is different on ao since yesterday's reboot.  The strange thing is
> > that when I manually log on to ao.acesgrid.org, I do have access to qsub,
> > qstat, etc.  So I don't understand why the 'qsub' command doesn't work when
> > issued by an existing job.
> > 
> > Any ideas what is going on?  Thanks.
> > 
> > Regards,
> > Peter
> > 
> > PS Greg, I am confused by your last email because the module 'magick' you refer
> > to is not listed when I type 'module avail'.  Aren't the qsub, qstat, etc.
> > commands automatically loaded (in one of the 'default' modules such as
> > 'torque/1.2.0p4')?  Also, I get an error when I try typing 'module load
> > magick', saying that the module cannot be found.
> > 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >   Peter H. Israelsson
> >   Massachusetts Institute of Technology
> >   Department of Civil & Environmental Engineering
> >   48-114, 15 Vassar Street, Cambridge, MA 02139, USA
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > 
> > Quoting aces-admin at techsquare.com:
> > 
> > > hello beghein-
> > >
> > > are you certain that your login bits are set up
> > > to load the module magick? i just tested this
> > > and it worked for me...
> > >
> > > . ssh ts at ao.acesgrid.org
> > > . ao: module list
> > > . ao: qstat
> > >
> > > [greg]
> > >
> > >> Mime-Version: 1.0
> > >> Date: Wed, 3 Aug 2005 16:32:03 -0400
> > >> From: Caroline Beghein <beghein at mit.edu>
> > >> Cc:
> > >> Reply-To: ACES-support at mitgcm.org
> > >>
> > >> Hi
> > >>
> > >> Is there still something wrong with the cluster? Whether I log in to
> > >> ao or geojr, I cannot start any job. If I type qsub ... or qstat I
> > >> get "-bash: qstat : command not found"
> > >> What does that mean?
> > >>
> > >> Thanks
> > >>
> > >>
> > >> --
> > >> 	Caroline
> > >>
> > >>
> > >>
> > >> Caroline Beghein
> > >> 77 Massachusetts avenue #54-526
> > >> Cambridge, MA 02139
> > >> tel.: +1 617 253 3589
> > >> http://www.mit.edu/~beghein
> > >> _______________________________________________
> > >> Aces-support mailing list
> > >> Aces-support at acesgrid.org
> > >> http://acesgrid.org/mailman/listinfo/aces-support
> > >>
> > >
> > >
> > 
> > 
> -- 
> Simon McClusky
> RM 54-614, Dept EAPS, MIT,
> 77 Massachusetts Ave,
> Cambridge, MA 02139
> USA
> 
> email: simon at mit.edu
> Ph: 617 253-3077
> Fax: 617 253-1699
> Cell:857 928-5891
> 
> 
> 
> 


