Hi,
I am submitting a question about Condor in the hope that
Sridhara or Dan Bradley can help. Please let me know if there is someone
else in Physics I should send this to.
I am running ~100 jobs from my computer (kingfisher), and some of
them crash after running fine for a few hours, with the following error
message:
----------
001 (046.010.000) 02/13 02:03:20 Job executing on host: <128.105.167.40:47354>
005 (046.010.000) 02/13 02:03:23 Job terminated.
(0) Abnormal termination (signal 11)
(1) Corefile in: /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 03:53:24, Sys 0 00:00:22 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
5001569 - Run Bytes Sent By Job
38691508 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
-----------
Looking at my log file, it appears that this happens when the job is moved
to particular computers. The "bad" nodes I've been able to identify are:
macaroni16.cs.wisc.edu
macaroni12.cs.wisc.edu
macaroni04.cs.wisc.edu
manta.cs.wisc.edu (128.105.167.40, the example above)
p51.cs.wisc.edu
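For reference, this is roughly how the bad nodes can be pulled out of the
log file. It is only a sketch and assumes the log looks like the excerpt
above: "001" events carry the execute host address in angle brackets and
"005" events carry the "Abnormal termination" line. The names split_events
and crashing_hosts are my own, not part of Condor. Note that the log only
gives IP addresses, so mapping them to hostnames is a separate step.

```python
import re
from collections import Counter

# Event header, e.g. "001 (046.010.000) 02/13 02:03:20 Job executing on host:"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\)")
# Execute host address, e.g. "<128.105.167.40:47354>"
HOST_ADDR = re.compile(r"<(\d+\.\d+\.\d+\.\d+):\d+>")
# Crash line, e.g. "(0) Abnormal termination (signal 11)"
ABNORMAL = re.compile(r"Abnormal termination \(signal (\d+)\)")


def split_events(log_text):
    """Group log lines into [event_code, job_id, body] entries."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line.strip())
        if match:
            events.append([match.group(1), match.group(2), line])
        elif events:
            # Continuation line: attach it to the most recent event.
            events[-1][2] += "\n" + line
    return events


def crashing_hosts(log_text):
    """Count abnormal terminations per (host IP, signal number)."""
    last_host = {}      # job id -> IP of the host it most recently ran on
    crashes = Counter()
    for code, job_id, body in split_events(log_text):
        if code == "001":
            match = HOST_ADDR.search(body)
            if match:
                last_host[job_id] = match.group(1)
        elif code == "005":
            match = ABNORMAL.search(body)
            if match and job_id in last_host:
                crashes[(last_host[job_id], int(match.group(1)))] += 1
    return crashes
```

Calling crashing_hosts(open("my_job.log").read()) (the file name here is a
placeholder) gives a count of crashes per host address and signal.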
My question is: is this a known problem? (I couldn't find it in the mailing
list archive after a quick perusal.) Could it be a mismatch between the
Condor version on my computer and the version on some of the CS machines? I
know how to work around this once the bad nodes are known
(Requirements = (Machine != "p51.cs.wisc.edu"), etc. in my submit file),
but I can only find them by having a job crash, so it would be helpful if a
simple upgrade would solve this.
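To make the workaround concrete, here is a small sketch that builds the
Requirements line for the submit file from a list of hostnames. The helper
name requirements_excluding is hypothetical. It assumes the Machine
attribute holds the fully qualified hostname and that the hostnames are
written as quoted strings in the expression.

```python
def requirements_excluding(hostnames):
    """Build a submit-file Requirements line that avoids the given hosts."""
    unique = sorted(set(hostnames))
    if not unique:
        raise ValueError("need at least one hostname to exclude")
    # One quoted inequality per host, joined with logical AND.
    clauses = ['(Machine != "{}")'.format(name) for name in unique]
    return "Requirements = " + " && ".join(clauses)


bad_nodes = [
    "macaroni16.cs.wisc.edu",
    "macaroni12.cs.wisc.edu",
    "macaroni04.cs.wisc.edu",
    "manta.cs.wisc.edu",
    "p51.cs.wisc.edu",
]
print(requirements_excluding(bad_nodes))
```

The printed line can be pasted into the submit file as-is. It has to be
regenerated whenever another bad node turns up.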
Thanks for your help,
Frank