Message13645

Author frankjp
Recipients ajit, dan, dasu, frankjp, rader, radtke, wcmaier
Date 2008.02.13 07:56
Content
Hi,

	I am submitting a question about Condor in the hope that Sridhara
or Dan Bradley can help. Please let me know if there is someone else in
Physics I should send this to.

	I am running ~100 jobs from my computer (kingfisher), and some of
them crash after running fine for a few hours, with the following error
message:

----------
001 (046.010.000) 02/13 02:03:20 Job executing on host: <128.105.167.40:47354>

005 (046.010.000) 02/13 02:03:23 Job terminated.
	(0) Abnormal termination (signal 11)
	(1) Corefile in: /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 03:53:24, Sys 0 00:00:22  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	5001569  -  Run Bytes Sent By Job
	38691508  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
-----------

Looking at my log file, it appears that this happens when the job is moved 
to particular computers.  The "bad" nodes I've been able to identify are:

macaroni16.cs.wisc.edu
macaroni12.cs.wisc.edu
macaroni04.cs.wisc.edu
manta.cs.wisc.edu (128.105.167.40, the example above)
p51.cs.wisc.edu
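
The lookup described above (which host a job was on when it died with a signal) can be sketched in Python. This is only a rough sketch that assumes the user log looks like the excerpt quoted earlier: a "001" event carries the execute host address and a "005" event carries the "Abnormal termination" line. The function name crashing_hosts is hypothetical, and the result is keyed by IP address because that is what the log records.

```python
import re
from collections import Counter

# Event header, e.g. "005 (046.010.000) 02/13 02:03:23 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\)")
# Host address, e.g. "<128.105.167.40:47354>" (may sit on a wrapped line)
HOST_RE = re.compile(r"<(\d+\.\d+\.\d+\.\d+):\d+>")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def crashing_hosts(log_text):
    """Count abnormal terminations per execute-host IP address."""
    last_host = {}       # job id -> IP of the host it most recently started on
    crashes = Counter()  # host IP -> number of abnormal terminations
    event, job = None, None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            # A new event starts; remember its code and job id.
            event, job = header.groups()
        if event == "001":
            host = HOST_RE.search(line)
            if host:
                last_host[job] = host.group(1)
        elif event == "005" and SIGNAL_RE.search(line):
            crashes[last_host.get(job, "unknown")] += 1
    return crashes
```

Calling crashing_hosts(open("job.log").read()) on a log file (the file name here is just a placeholder) would give a count per address, which then still has to be mapped to host names by hand or with a reverse DNS lookup.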

My question is whether this is a known problem (I couldn't find it in the
mailing list archive after a quick look). Could it be a mismatch between
the Condor version on my computer and the version on some of the CS
machines? I know how to work around this once I know the bad nodes
(Requirements = Machine != "p51.cs.wisc.edu", etc. in my submit file),
but I can only find them by having a job crash, so it would be helpful
if a simple upgrade would solve this.
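
For reference, the workaround can be generated rather than typed by hand. The following is a minimal Python sketch, assuming the bad host names are already known. The helper name exclusion_requirements is hypothetical, and it only builds the text of a Requirements line that compares the Machine attribute against quoted strings.

```python
def exclusion_requirements(hostnames):
    """Build a submit-file Requirements line that excludes the given hosts."""
    # De-duplicate and sort so the output is stable between runs.
    hosts = sorted(set(hostnames))
    if not hosts:
        return ""  # nothing to exclude, so emit no Requirements line
    # Host names must be quoted strings in the expression.
    clauses = ['(Machine != "%s")' % host for host in hosts]
    return "Requirements = " + " && ".join(clauses)


bad_nodes = [
    "macaroni16.cs.wisc.edu",
    "macaroni12.cs.wisc.edu",
    "macaroni04.cs.wisc.edu",
    "manta.cs.wisc.edu",
    "p51.cs.wisc.edu",
]
print(exclusion_requirements(bad_nodes))
```

The printed line can be pasted into the submit file. It still only avoids nodes that have already crashed a job, so it does not replace finding the underlying cause.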

Thanks for your help,
Frank
History
Date                 User     Action  Args
2008-02-13 07:56:10  frankjp  set     recipients: + frankjp, wcmaier, rader, dan, dasu, ajit, radtke
2008-02-13 07:56:10  frankjp  link    issue5109 messages
2008-02-13 07:56:09  frankjp  create