Message13653

Author frankjp
Recipients ajit, dan, dasu, frankjp, rader, radtke, wcmaier
Date 2008.02.13 15:11
Content
Hi Will and Dan,

	Thanks for your messages.  I did delete the core files previously; 
I have generated new ones by not restricting the machines I am using.  
They are in /afs/hep.wisc.edu/home/frankjp/core_files, along with a log 
file.  If either of you could help me figure out what is going on I would 
greatly appreciate the help.

	I find these, as I said, by noticing a process crash.  By
restricting the allowed machines in my submit file, I can get my processes
to finish, so I don't think it is my code (I've run thousands of jobs with
this code several months ago without noticing this issue).  I have no idea
how widespread it is, or whether it is something local to the machine I am
submitting from.
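
One way to gauge how widespread the crashes are might be to tally abnormal
terminations per execute host from the job log. What follows is only a
minimal sketch, assuming the log follows the layout of the excerpt quoted
further down (a three-digit event code, a job id in parentheses, the host in
angle brackets, and an "Abnormal termination (signal N)" line). The function
name crashes_by_host is hypothetical, not part of Condor.

```python
import re
from collections import Counter

# Event header, e.g. "005 (046.010.000) 02/13 02:03:23 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
# Execute host, e.g. "<128.105.167.40:47354>"
HOST_RE = re.compile(r"<([\d.]+):\d+>")
# Crash detail line, e.g. "(0) Abnormal termination (signal 11)"
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def crashes_by_host(log_text):
    """Count abnormal terminations per (host, signal) in a job log (assumed format)."""
    last_host = {}        # job id -> host the job most recently started on
    crashes = Counter()   # (host, signal) -> number of crashes
    current = None        # (event code, job id) of the event being read
    awaiting_host = False

    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = (header.group(1), header.group(2, 3, 4))
            # Event 001 is "Job executing on host"; the address may wrap
            # onto the following line, so keep looking until it is found.
            awaiting_host = current[0] == "001"

        if awaiting_host:
            host = HOST_RE.search(line)
            if host:
                last_host[current[1]] = host.group(1)
                awaiting_host = False

        # Event 005 is "Job terminated"; the signal is on a detail line.
        if current and current[0] == "005":
            signal = SIGNAL_RE.search(line)
            if signal:
                where = last_host.get(current[1], "unknown")
                crashes[(where, int(signal.group(1)))] += 1

    return crashes
```

Calling crashes_by_host(open("job.log").read()) (the file name is a
placeholder) and printing the result sorted by count would show whether the
signal 11 failures cluster on a few machines or are spread across the pool.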

	Will, I can try to set up my future runs in a local scratch directory
(is this what you recommend?) if you think that will help.  However, each
job writes minimal data (a few hundred bytes; I changed this in response
to similar comments from Steve a while back by removing intermediate
output).  Given this size, is the strain on AFS large?  Is some other
variable the appropriate one to consider, and not output size?  If so I
will certainly change my configuration, but I thought I avoided this.

Thanks for your help,
Frank

On Wed, 13 Feb 2008, Will Maier via UW-HEP Help System wrote:

> 
> Hi, Frank-
> 
> On Wed, Feb 13, 2008 at 01:56:10PM +0000, Frank Petriello via UW-HEP Help System wrote:
> > I am submitting a question about Condor in hopes that perhaps
> > Sridhara or Dan Bradley can help, please let me know if there is
> > someone else in Physics I can send this to.
> 
> Dan or Sridhara may yet chime in, but my first suggestion would be
> to direct your jobs' output somewhere besides AFS. Writing job
> output to AFS puts a significant strain on our resources and can
> cause difficult-to-debug failures.
> 
> Beyond that, I don't have access to the machines you mention, but I
> don't see anything odd in their ClassAds. Dan spot checked a few of
> them, and wasn't able to find any memory or similar problems,
> either.
> 
> [...]
> > ----------
> > 001 (046.010.000) 02/13 02:03:20 Job executing on host: 
> > <128.105.167.40:47354>
> > 
> > 005 (046.010.000) 02/13 02:03:23 Job terminated.
> > 	(0) Abnormal termination (signal 11)
> > 	(1) Corefile in: 
> > /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
> 
> This core file doesn't seem to exist any more. Did you clean it up?
> If possible, a preserved core file might help us debug your
> problem.
> 
> Thanks!
> 
> 

--
History
Date User Action Args
2008-02-13 15:11:36  frankjp  set     recipients: + frankjp
2008-02-13 15:11:36  frankjp  link    issue5109 messages
2008-02-13 15:11:34  frankjp  create