Hi Will and Dan,
Thanks for your messages. I did delete the core files previously;
I have generated new ones by not restricting the machines I am using.
They are in /afs/hep.wisc.edu/home/frankjp/core_files, along with a log
file. If either of you could help me figure out what is going on I would
greatly appreciate the help.
I find these, as I said, by noticing a process crash. By
restricting the allowed machines in my submit file, I can get my processes
to finish, so I don't think it is my code (I ran thousands of jobs with
this code several months ago without noticing this issue). I have no idea
how widespread it is, or whether it is something local to the machine I am
submitting from.
Will, I can try to set up my future runs in a local scratch directory
(is this what you recommend?) if you think that will help. However, each
job writes minimal data (a few hundred bytes; I changed this in response
to similar comments from Steve a while back by removing intermediate
output). Given this size, is the strain on AFS large? Is some other
variable the appropriate one to consider, and not output size? If so I
will certainly change my configuration, but I thought I had avoided this.
Thanks for your help,
Frank
On Wed, 13 Feb 2008, Will Maier via UW-HEP Help System wrote:
>
> Hi, Frank-
>
> On Wed, Feb 13, 2008 at 01:56:10PM +0000, Frank Petriello via UW-HEP Help System wrote:
> > I am submitting a question about Condor in hopes that perhaps
> > Sridhara or Dan Bradley can help. Please let me know if there is
> > someone else in Physics I can send this to.
>
> Dan or Sridhara may yet chime in, but my first suggestion would be
> to direct your jobs' output somewhere besides AFS. Writing job
> output to AFS puts a significant strain on our resources and can
> cause difficult-to-debug failures.
>
> Beyond that, I don't have access to the machines you mention, but I
> don't see anything odd in their ClassAds. Dan spot checked a few of
> them, and wasn't able to find any memory or similar problems,
> either.
>
> [...]
> > ----------
> > 001 (046.010.000) 02/13 02:03:20 Job executing on host:
> > <128.105.167.40:47354>
> >
> > 005 (046.010.000) 02/13 02:03:23 Job terminated.
> > (0) Abnormal termination (signal 11)
> > (1) Corefile in:
> > /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
>
> This core file doesn't seem to exist any more. Did you clean it up?
> If possible, a preserved core file might help us debug your
> problem.
>
> Thanks!
>
>