Message13652

Author dan
Recipients ajit, dan, dasu, frankjp, rader, radtke, wcmaier
Date 2008.02.13 11:41
Content
Frank,

Setting your job requirements to avoid problematic machines is certainly 
the first step to take.  How confident are you at this point that the 
problem is not more widespread?

Do you still have a copy of the core file somewhere?

I have run some tests on the problem machines that you mentioned over in 
the CS department.  Sometimes memory errors have been a source of 
machine-specific crashes like this, but I have not been able to 
reproduce any problems in my testing so far.

--Dan

Frank Petriello via UW-HEP Help System wrote:

>Hi,
>
>	I am submitting a question about Condor in hopes that perhaps
>Sridhara or Dan Bradley can help.  Please let me know if there is someone
>else in Physics I can send this to.
>
>	I am running ~100 jobs from my computer (kingfisher), and some of
>them crash after running fine for a few hours with the following error
>message.
>
>----------
>001 (046.010.000) 02/13 02:03:20 Job executing on host: <128.105.167.40:47354>
>
>005 (046.010.000) 02/13 02:03:23 Job terminated.
>	(0) Abnormal termination (signal 11)
>	(1) Corefile in: /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
>		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
>		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>		Usr 0 03:53:24, Sys 0 00:00:22  -  Total Remote Usage
>		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
>	5001569  -  Run Bytes Sent By Job
>	38691508  -  Run Bytes Received By Job
>	0  -  Total Bytes Sent By Job
>	0  -  Total Bytes Received By Job
>-----------
>
>Looking at my log file, it appears that this happens when the job is moved 
>to particular computers.  The "bad" nodes I've been able to identify are:
>
>macaroni16.cs.wisc.edu
>macaroni12.cs.wisc.edu
>macaroni04.cs.wisc.edu
>manta.cs.wisc.edu (128.105.167.40, the example above)
>p51.cs.wisc.edu
>
>My question is, is this a known problem (I couldn't find it in the mailing
>list archive after a quick perusal)?  Is it a mismatch between the Condor
>version on my computer and some of the CS ones?  I know how to work around
>this given the bad nodes (Requirements = Machine != "p51.cs.wisc.edu", etc.
>in my submit file), but I can only find these by crashing a program, and
>it would be helpful if a simple upgrade would solve this.
>
>Thanks for your help,
>Frank
>
>----------
>group: IT
>messages: 13645
>nosy: ajit, dan, dasu, frankjp, rader, radtke, wcmaier
>priority: triage
>status: unread
>title: condor question (Sridhara, Dan?)
>
>______________________________________
>UW-HEP Help System <help@hep.wisc.edu>
><https://help.hep.wisc.edu/issue5109>
>______________________________________
>  
>
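
For reference, the bad hosts can be pulled out of the job log automatically rather than by reading it by eye. The following is only a rough sketch: it assumes the log looks like the excerpt quoted above (a 001 event carrying the host address in angle brackets, and a 005 event followed by an "Abnormal termination (signal N)" line). The function names crashed_hosts and requirements_expr are made up for this illustration and are not part of Condor. Note that the log records addresses, while the Machine attribute holds host names, so translating one to the other is left to the reader.

```python
import re

# Patterns taken from the log excerpt in this thread; other log layouts
# are not handled.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\)")
HOST_RE = re.compile(r"<(\d+\.\d+\.\d+\.\d+):\d+>")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def crashed_hosts(log_text):
    """Map host address -> list of signals for abnormally terminated jobs."""
    last_host = {}  # job id -> address of the host the job last started on
    crashes = {}
    event, job = None, None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            event, job = header.groups()
        if event == "001":
            # The address may sit on the header line or on a wrapped line.
            host = HOST_RE.search(line)
            if host:
                last_host[job] = host.group(1)
        elif event == "005":
            sig = SIGNAL_RE.search(line)
            if sig and job in last_host:
                crashes.setdefault(last_host[job], []).append(int(sig.group(1)))
    return crashes


def requirements_expr(machines):
    """Build a submit-file Requirements line that excludes the given host names."""
    clauses = ['(Machine != "%s")' % name for name in sorted(machines)]
    return "Requirements = " + " && ".join(clauses)
```

Calling crashed_hosts(open("job.log").read()) on a job log (the file name here is just a placeholder) lists the addresses where jobs died, and requirements_expr(["p51.cs.wisc.edu", "manta.cs.wisc.edu"]) produces a line that can be pasted into the submit file.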
History
Date                 User  Action  Args
2008-02-13 11:41:54  dan   set     recipients: + dan, wcmaier, rader, dasu, ajit, radtke, frankjp
2008-02-13 11:41:54  dan   link    issue5109 messages
2008-02-13 11:41:54  dan   create