[Aces-support] -bash: qstat : command not found
aces-admin at techsquare.com
Fri Aug 5 10:07:19 EDT 2005
hello peteri-
you had a number of jobs running on
the upper-ao nodes (a54-1727-069++)
that i did not kill yesterday.
it was not possible, however, to fix
the "qsub not found" problem without
killing your jobs, so the best that
i could do was to make sure that these
nodes were allocated to no further jobs.
this morning, as your jobs had completed,
i was able to fix the remaining nodes (a
handful of them) and restore them to the queues.
[greg]
> Date: Fri, 5 Aug 2005 00:02:38 -0400
> From: Peter H Israelsson <peteri at mit.edu>
> Reply-To: ACES-support at mitgcm.org
>
> It seems that I spoke too soon earlier. I am getting the same error again
> periodically, i.e., qsub is not found when issued by another calling
> script. It is happening more often than not with my current batch of
> simulations. However, I reran my test script and it worked fine. So I
> guess that, as before, some of the nodes are having problems while some
> are OK. I'll see if I can figure out which ones don't work. Sorry to be
> the bearer of bad news (again).
>
> Regards,
> Peter
>
>
> Quoting aces-admin at techsquare.com:
>
> > hello world-
> >
> > i believe that the compute nodes are
> > back to normal. looks to have been corruption
> > due to ao's crash earlier yesterday.
> >
> > continuing to look at the root cause(s),
> > but please send mail if you notice anything
> > else odd...
> >
> > [greg]
> >
> >
> >> From: Simon McClusky <simon at mit.edu>
> >> Date: Thu, 04 Aug 2005 09:47:51 -0400
> >> Reply-To: ACES-support at mitgcm.org, simon at mit.edu
> >>
> >> Peter,
> >>
> >> There seems to be something strange about the way ao compute nodes are
> >> behaving. I can issue qsub commands that are accepted from the geojr
> >> compute nodes, but not from the ao compute nodes. Until this is resolved
> >> you can run your scripts on the geojr cluster.
> >>
> >> Greg - the problem on ao is the compute nodes see strange info for the
> >> torque /bin directory where qsub resides:
> >>
> >> ?--------- ? ? ? ? ? bin
> >>
> >> I have no clue what this means!
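The row of question marks is what `ls -l` prints when it can read an entry's name from the parent directory but the stat call on the entry itself fails, which would be consistent with the filesystem corruption mentioned elsewhere in this thread. A minimal sketch of a check that could be run from inside a job on a suspect node follows; the function name `check_dir` and the variable `TORQUE_BIN` are hypothetical, and the real location of the torque bin directory is not given in this thread.

```shell
#!/bin/bash
# Sketch only: check_dir and TORQUE_BIN are illustrative names, not part of torque.

# check_dir DIR: report whether DIR can be stat'ed and listed from this node.
check_dir() {
    local dir="$1"

    # [ -e ] is false when stat fails, whether the path is missing or unreadable.
    if [ ! -e "$dir" ]; then
        echo "$dir: missing or cannot be stat'ed"
        return 1
    fi

    # The entry exists; make sure it is a directory whose contents can be read.
    if [ ! -d "$dir" ] || ! ls "$dir" >/dev/null 2>&1; then
        echo "$dir: not a listable directory"
        return 1
    fi

    echo "$dir: ok"
}

# Example use inside a job script (TORQUE_BIN must be set by the user):
#   check_dir "$TORQUE_BIN" || echo "problem seen on node $(uname -n)"
```

Running this on each node and comparing the output is one way to separate the nodes that see the directory correctly from those that do not.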
> >>
> >> -Simon
> >> On Thu, 2005-08-04 at 03:38, Peter H Israelsson wrote:
> >> > I am having a similar but slightly different problem with the qsub
> >> > command, as detailed below.
> >> >
> >> > First I need to explain how my jobs are run: Because my simulations
> >> > are longer than the max walltime of the queues available to me, I
> >> > have rewritten my code to automatically run as a sequence of smaller
> >> > simulations. The code automatically stops itself when its run time is
> >> > approaching the max walltime, and signals the calling pbs script that
> >> > the simulation needs to be restarted from its current time level.
> >> > Before exiting, the calling pbs script creates a new pbs script file
> >> > and submits a new job, i.e., the last command it issues before
> >> > exiting is "qsub [new_job]". That way, the next part of the
> >> > simulation is assigned a new job number, and the walltime is reset.
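The final resubmission step described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not the script actually used: the names `resubmit_next`, `QSUB`, and `next_job.pbs` are hypothetical. The one design choice it shows is resolving the submit command before calling it, so that a node where qsub is unavailable reports its own name instead of silently ending the chain.

```shell
#!/bin/bash
# Sketch of the last step of a self-resubmitting PBS job script.
# resubmit_next, QSUB and next_job.pbs are hypothetical names.

# resubmit_next SCRIPT: submit SCRIPT as a new job, failing loudly if the
# submit command cannot be found on this compute node.
resubmit_next() {
    local next_script="$1"
    # QSUB may hold an absolute path; otherwise fall back to a PATH lookup.
    local qsub_cmd="${QSUB:-qsub}"

    # command -v succeeds only if the name resolves to something runnable.
    if ! command -v "$qsub_cmd" >/dev/null 2>&1; then
        echo "resubmit: '$qsub_cmd' not found on node $(uname -n)" >&2
        return 127
    fi

    "$qsub_cmd" "$next_script"
}

# Example use as the last command of the calling job script:
#   resubmit_next next_job.pbs
```

Setting `QSUB` to an absolute path would take PATH and module setup out of the picture, although it would not help on a node where the directory itself is unreadable.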
> >> >
> >> > This process was working fine before the ao system went down last
> >> > night, i.e., the code stopped and restarted itself automatically with
> >> > no problems. However, since the system went down last night, this
> >> > sequential process is now failing because it says that it cannot find
> >> > the qsub command:
> >> > /var/torque/mom_priv/jobs/8322.ao.SC: line 65: qsub: command not found
> >> > I have tested this a number of times and get the same result each time.
> >> >
> >> > So something is different on ao since yesterday's reboot. The strange
> >> > thing is that when I manually log on to ao.acesgrid.org, I do have
> >> > access to qsub, qstat, etc. So I don't understand why the 'qsub'
> >> > command doesn't work when issued by an existing job.
> >> >
> >> > Any ideas what is going on? Thanks.
> >> >
> >> > Regards,
> >> > Peter
> >> >
> >> > PS Greg, I am confused by your last email because the module 'magick'
> >> > you refer to is not listed when I type 'module avail'. Aren't the
> >> > qsub, qstat, etc. commands automatically loaded (in one of the
> >> > 'default' modules such as 'torque/1.2.0p4')? Also, I get an error when
> >> > I try typing 'module load magick', saying that the module cannot be
> >> > found.
> >> >
> >> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > Peter H. Israelsson
> >> > Massachusetts Institute of Technology
> >> > Department of Civil & Environmental Engineering
> >> > 48-114, 15 Vassar Street, Cambridge, MA 02139, USA
> >> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> >
> >> >
> >> > Quoting aces-admin at techsquare.com:
> >> >
> >> > > hello beghein-
> >> > >
> >> > > are you certain that your login bits are setup
> >> > > to load the module magick ? i just tested this
> >> > > and it worked for me...
> >> > >
> >> > > . ssh ts@ao.acesgrid.org
> >> > > . ao: module list
> >> > > . ao: qstat
> >> > >
> >> > > [greg]
> >> > >
> >> > >> Date: Wed, 3 Aug 2005 16:32:03 -0400
> >> > >> From: Caroline Beghein <beghein at mit.edu>
> >> > >> Reply-To: ACES-support at mitgcm.org
> >> > >>
> >> > >> Hi
> >> > >>
> >> > >> Is there still something wrong with the cluster? Whether I login to
> >> > >> ao or geojr, I cannot start any job. If I type qsub ... or qstat I
> >> > >> get "-bash: qstat : command not found"
> >> > >> What does that mean?
> >> > >>
> >> > >> Thanks
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Caroline
> >> > >>
> >> > >>
> >> > >>
> >> > >> Caroline Beghein
> >> > >> 77 Massachusetts avenue #54-526
> >> > >> Cambridge, MA 02139
> >> > >> tel.: +1 617 253 3589
> >> > >> http://www.mit.edu/~beghein
> >> > >> _______________________________________________
> >> > >> Aces-support mailing list
> >> > >> Aces-support at acesgrid.org
> >> > >> http://acesgrid.org/mailman/listinfo/aces-support
> >> > >>
> >> --
> >> Simon McClusky
> >> RM 54-614, Dept EAPS, MIT,
> >> 77 Massachusetts Ave,
> >> Cambridge, MA 02139
> >> USA
> >>
> >> email: simon at mit.edu
> >> Ph: 617 253-3077
> >> Fax: 617 253-1699
> >> Cell:857 928-5891
> >>