Issue5109

Title: condor question (Sridhara, Dan?)
Priority: normal
Status: chatting
Superseder:
Nosy List: ajit, dan, dasu, frankjp, rader, radtke, wcmaier
Assigned To:
Topic: Condor
Group: IT

Created 2008.02.13 07:56 by frankjp.
Last changed 2008.02.18 06:52 by wcmaier.

Messages
msg13657 From: dan To: ajit, dan, dasu, frankjp, rader, radtke, wcmaier Date: 2008.02.14 13:31
I have reproduced a problem by running the program by hand and 
suspending it, which causes it to try to write a checkpoint.  So far in 
my testing, it crashes on the macaroni machines in cs.wisc.edu and it 
crashes on my desktop at hep.wisc.edu, but it successfully writes 
checkpoints on many other hosts (including p51.cs.wisc.edu, strangely).  
I haven't found a common pattern yet.

I will try to find out why the checkpoint code is crashing here:

#0  0x099572d0 in adler32 ()
#1  0x09952aae in fill_window ()
#2  0x0995276d in deflate_slow ()
#3  0x09950e30 in deflate ()
#4  0x0991918d in SegMap::Write ()
#5  0x0991894e in Image::Write ()
#6  0x09918625 in Image::Write ()
#7  0x099184a4 in Image::Write ()
#8  0x09919893 in Checkpoint ()
#9  <signal handler called>
#10 0x08053796 in akze5_ ()
#11 0x08173a27 in v4nume1_ ()
#12 0x0816c417 in nlov4x0int_ ()
#13 0x08087965 in nloe0_ ()
#14 0x08088aeb in Integrate ()
#15 0x080885b6 in Vegas ()
#16 0x3f50624d in ?? ()
#17 0xe0000000 in ?? ()
#18 0x3f50624d in ?? ()
#19 0x00000000 in ?? ()
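
One way to read the backtrace above: frames listed before `<signal handler called>` ran inside the signal handler (here the checkpoint path, ending in functions whose names match zlib's deflate and adler32), while the frames after it belong to the job code that was interrupted. The Python sketch below only illustrates that split. The function name `split_at_signal_handler` and the regular expression are assumptions made for this example and are not part of Condor or gdb; the sample is an excerpt of the trace above.

```python
import re

# Excerpt of the backtrace quoted in this message.
BACKTRACE = """\
#0  0x099572d0 in adler32 ()
#1  0x09952aae in fill_window ()
#2  0x0995276d in deflate_slow ()
#3  0x09950e30 in deflate ()
#4  0x0991918d in SegMap::Write ()
#5  0x0991894e in Image::Write ()
#6  0x09918625 in Image::Write ()
#7  0x099184a4 in Image::Write ()
#8  0x09919893 in Checkpoint ()
#9  <signal handler called>
#10 0x08053796 in akze5_ ()
#11 0x08173a27 in v4nume1_ ()
#12 0x0816c417 in nlov4x0int_ ()
"""

# Matches "#N  0xADDR in name ()" as well as "#N  <signal handler called>".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(.+?)(?: \(\))?$")


def split_at_signal_handler(text):
    """Return (handler_frames, interrupted_frames) as lists of function names."""
    handler_frames, interrupted_frames = [], []
    seen_marker = False
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue
        name = match.group(2)
        if name == "<signal handler called>":
            seen_marker = True
            continue
        if seen_marker:
            interrupted_frames.append(name)
        else:
            handler_frames.append(name)
    return handler_frames, interrupted_frames


handler_frames, interrupted_frames = split_at_signal_handler(BACKTRACE)
print("crashed in:", handler_frames[0])
print("entered handler via:", handler_frames[-1])
print("interrupted job code at:", interrupted_frames[0])
```

On this excerpt the innermost frame is `adler32` and the handler was entered through `Checkpoint`, which is consistent with the crash being in the checkpoint code rather than in the job's own routines.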

--Dan

msg13653 From: frankjp To: ajit, dan, dasu, frankjp, rader, radtke, wcmaier Date: 2008.02.13 15:11
Hi Will and Dan,

	Thanks for your messages.  I did delete the core files previously; 
I have generated new ones by not restricting the machines I am using.  
They are in /afs/hep.wisc.edu/home/frankjp/core_files, along with a log 
file.  If either of you could help me figure out what is going on I would 
greatly appreciate the help.

	I find these, as I said, by noticing a process crash.  By
restricting the allowed machines in my submit file, I can get my processes
to finish, so I don't think it is my code (I've run thousands of jobs with
this code several months ago without noticing this issue).  I have no idea
how widespread it is, or whether it is something local to the machine I am
submitting from.

	Will, I can try to set up my future runs in a local scratch directory
(is this what you recommend?) if you think that will help.  However, each
job writes minimal data (a few hundred bytes, I changed this in response
to similar comments from Steve a while back by removing intermediate
output).  Given this size, is the strain on AFS large?  Is some other
variable the appropriate one to consider, and not output size?  If so I
will certainly change my configuration, but I thought I avoided this.

Thanks for your help,
Frank

msg13652 From: dan To: ajit, dan, dasu, frankjp, rader, radtke, wcmaier Date: 2008.02.13 11:41
Frank,

Setting your job requirements to avoid problematic machines is certainly 
the first step to take.  How confident are you at this point that the 
problem is not more widespread?

Do you still have a copy of the core file somewhere?

I have run some tests on the problem machines that you mentioned over in 
the CS department.  Sometimes memory errors have been a source of 
machine-specific crashes like this, but I have not been able to 
reproduce any problems in my testing so far.

--Dan

Frank Petriello via UW-HEP Help System wrote:

>[...]
>
>______________________________________
>UW-HEP Help System <help@hep.wisc.edu>
><https://help.hep.wisc.edu/issue5109>
>______________________________________
msg13651 From: wcmaier To: ajit, dan, dasu, frankjp, rader, radtke, wcmaier Date: 2008.02.13 11:37
Hi, Frank-

On Wed, Feb 13, 2008 at 01:56:10PM +0000, Frank Petriello via UW-HEP Help System wrote:
> I am submitting a question about Condor in hopes that perhaps
> Sridhara or Dan Bradley can help, please let me know if there is
> someone else in Physics I can send this to.

Dan or Sridhara may yet chime in, but my first suggestion would be
to direct your jobs' output somewhere besides AFS. Writing job
output to AFS puts a significant strain on our resources and can
cause difficult-to-debug failures.

Beyond that, I don't have access to the machines you mention, but I
don't see anything odd in their ClassAds. Dan spot checked a few of
them, and wasn't able to find any memory or similar problems,
either.

[...]
> ----------
> 001 (046.010.000) 02/13 02:03:20 Job executing on host: 
> <128.105.167.40:47354>
> 
> 005 (046.010.000) 02/13 02:03:23 Job terminated.
> 	(0) Abnormal termination (signal 11)
> 	(1) Corefile in: 
> /afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10

This core file doesn't seem to exist any more. Did you clean it up?
If possible, a preserved core file might help us debug your
problem.

Thanks!

-- 

o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*
msg13645 From: frankjp To: ajit, dan, dasu, frankjp, rader, radtke, wcmaier Date: 2008.02.13 07:56
Hi,

	I am submitting a question about Condor in hopes that perhaps
Sridhara or Dan Bradley can help, please let me know if there is someone
else in Physics I can send this to.

	I am running ~100 jobs from my computer (kingfisher), and some of
them crash after running fine for a few hours with the following error
message.

----------
001 (046.010.000) 02/13 02:03:20 Job executing on host: 
<128.105.167.40:47354>

005 (046.010.000) 02/13 02:03:23 Job terminated.
	(0) Abnormal termination (signal 11)
	(1) Corefile in: 
/afs/hep.wisc.edu/user/frankjp/Work/ttZ/NLOV_qqb_minimalLarin/condor_submits/core.46.10
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 03:53:24, Sys 0 00:00:22  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	5001569  -  Run Bytes Sent By Job
	38691508  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
-----------

Looking at my log file, it appears that this happens when the job is moved 
to particular computers.  The "bad" nodes I've been able to identify are:

macaroni16.cs.wisc.edu
macaroni12.cs.wisc.edu
macaroni04.cs.wisc.edu
manta.cs.wisc.edu (128.105.167.40, the example above)
p51.cs.wisc.edu
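
The log inspection described above (matching each abnormal termination to the host the job was last running on) could be automated roughly as follows. This is a sketch under stated assumptions, not a Condor tool: `crash_hosts`, the regular expressions, and the treatment of events `001` and `005` are inferred only from the log excerpt quoted in this message, and a real user log may contain other event types and layouts.

```python
import re
from collections import Counter

# Excerpt of the user log quoted above, wrapped as it appears in the message.
SAMPLE_LOG = """\
001 (046.010.000) 02/13 02:03:20 Job executing on host: 
<128.105.167.40:47354>

005 (046.010.000) 02/13 02:03:23 Job terminated.
\t(0) Abnormal termination (signal 11)
"""

# An event starts with a three-digit code and a job id in parentheses.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\)", re.M)
HOST_RE = re.compile(r"<(\d+\.\d+\.\d+\.\d+):\d+>")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def crash_hosts(log_text):
    """Count abnormal terminations per (host address, signal)."""
    events = list(EVENT_RE.finditer(log_text))
    last_host = {}       # job id -> address of the host it last executed on
    crashes = Counter()  # (host address, signal number) -> count
    for index, event in enumerate(events):
        # The body of an event runs up to the start of the next event.
        end = events[index + 1].start() if index + 1 < len(events) else len(log_text)
        body = log_text[event.start():end]
        code, job = event.group(1), event.group(2)
        if code == "001":
            host = HOST_RE.search(body)
            if host:
                last_host[job] = host.group(1)
        elif code == "005":
            signal = SIGNAL_RE.search(body)
            if signal and job in last_host:
                crashes[(last_host[job], int(signal.group(1)))] += 1
    return crashes


for (host, signal), count in crash_hosts(SAMPLE_LOG).items():
    print(host, "signal", signal, "count", count)
```

Run over a full log, the hosts with the highest counts would be the candidates for exclusion; the addresses still have to be mapped to host names by other means.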

My question is, is this a known problem (I couldn't find it in the mailing
list archive after a quick perusal)?  Is it a mismatch between the Condor
version on my computer and some of the CS ones?  I know how to work around
this given the bad nodes (Requirements = Machine != "p51.cs.wisc.edu", etc.
in my submit file), but I can only find these by crashing a program, and
it would be helpful if a simple upgrade would solve this.
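
As an illustration of the workaround just described, the exclusion expression can be generated from a list of host names instead of being typed by hand. The helper name `exclusion_requirements` is hypothetical; the only thing assumed about Condor is that host names in a ClassAd expression are double-quoted string literals compared against the `Machine` attribute. The sketch does not merge the result with any other requirements a submit file may already contain.

```python
def exclusion_requirements(hosts):
    """Build a submit-file Requirements line that avoids the given hosts."""
    clauses = ['(Machine != "%s")' % host for host in hosts]
    return "Requirements = " + " && ".join(clauses)


# Host names taken from the list of "bad" nodes above.
bad_nodes = [
    "macaroni16.cs.wisc.edu",
    "macaroni12.cs.wisc.edu",
    "macaroni04.cs.wisc.edu",
    "manta.cs.wisc.edu",
    "p51.cs.wisc.edu",
]

print(exclusion_requirements(bad_nodes))
```

Each host becomes its own parenthesized clause joined with `&&`, so adding or removing a node only changes the list.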

Thanks for your help,
Frank
History
Date                 User     Action  Args
2008-02-18 06:52:40  wcmaier  set     assignedto: dan
2008-02-14 13:31:28  dan      set     messages: + msg13657
2008-02-13 15:11:36  frankjp  set     messages: + msg13653
2008-02-13 11:41:54  dan      set     messages: + msg13652
2008-02-13 11:37:33  wcmaier  set     status: unread -> chatting; messages: + msg13651
2008-02-13 07:57:24  wcmaier  set     priority: triage -> normal; topic: + Condor
2008-02-13 07:56:10  frankjp  create