Issue5030

Title: garlic's down again
Priority: critical
Status: chatting
Superseder:
Nosy List: ajit, dan, dasu, rader, radtke, wcmaier
Assigned To:
Topic: AFS
Group: IT

Created 2008.01.05 15:51 by wcmaier.
Last changed 2008.01.07 10:01 by rader.

Messages
msg13413 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.07 10:01
Okay, garlic is out of production now.

The plan is to make wasabi into garlic and use the new hardware
for wasabi.  (Matt: talk to me about that after you're done
with the Stringer desktops.)

steve
--

 > ______________________________________
 > UW-HEP Help System <help@hep.wisc.edu>
 > <https://help.hep.wisc.edu/issue5030>
 > ______________________________________
msg13412 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.07 08:16
I just finished the "c-section" of garlic's old disk into
a different case and motherboard.

AFAICT, /afs/hep/osg/data recovered nicely on all systems.

I'll start moving root.osg here in a sec.

Will will order new hardware.

steve
--
msg13409 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.06 14:41
Yes, I kicked garlic at 1135ish and it crashed AGAIN at
1205ish.  So I kicked it again just now.

Both times (four total now) I've seen console msgs about
clock skew and ntpd being disabled.  So I also put in
a new battery.

Started yet another move of root.osg.

Fingers crossed.

steve
-- 

msg13408 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.06 10:31
On Sun, Jan 06, 2008 at 04:29:51PM +0000, Will Maier via UW-HEP Help System wrote:
> On Sat, Jan 05, 2008 at 06:00:22PM -0600, rader@ginseng.hep.wisc.edu wrote:
> > It's up now.  
> 
> And back down again. When I arrived, there were messages about hda
> on the console, so now we appear to have disk problems, too.

And back down yet again. This is insane.

Steve was able to move all volumes but osg.data off of garlic before
it crashed, though there appears to be some lingering oddness with
the VLDB. He'll go in and kick the box soon.

-- 

o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*
msg13407 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.06 10:29
(Accidentally sent only to steve earlier; resending...)

On Sat, Jan 05, 2008 at 06:00:22PM -0600, rader@ginseng.hep.wisc.edu wrote:
> It's up now.  

And back down again. When I arrived, there were messages about hda
on the console, so now we appear to have disk problems, too.

[...]
> Will, Matt: if either one of you comes in to kick it again, open
> er up and email around the battery type/spec.  

Battery info:

    Toshiba Lithium Battery
    CR2032
    3V
    Japan

Rather than swap batteries, though, I think we need to move all data
off this machine ASAP and replace it with a new box (with LSI RAID)
as soon as John can ship us one.

msg13406 From: rader To: ajit, dan, dasu, help, rader, radtke, wcmaier Date: 2008.01.06 09:17
> And back down again. When I arrived, there were messages about hda
> on the console, so now we appear to have disk problems, too.

Garlic came up, but its AFS was still hosed.  And I noticed dfafs's
"vos examine" was hanging rather randomly on other hosts/volumes too.

I tracked that down to VLDB weirdness:  "udebug anise 7003" said "I am
currently managing write trans 0.18" and "There are write locks held"
so I did "bos restart anise -all".

I'll start moving volumes to rosemary.

steve
--
msg13405 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.05 18:11
It's up now.  The oops trace was something about gettime
so I wonder if it's got a bad system clock battery.  I recall
tracking down very mysterious problems with rosemary that
we resolved with a new battery.

Will, Matt: if either one of you comes in to kick it again,
open er up and email around the battery type/spec.  

steve
--
msg13404 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.05 17:37
On Sat, Jan 05, 2008 at 11:25:12PM +0000, Will Maier via UW-HEP Help System wrote:
> And it's down again.

Steve went in and gave it a kick. Via IM, we agreed to copy the
volumes off of garlic and onto rosemary (which has the most space at
the moment). garlic's not giving us any extra reliability or
load balancing now, so moving the volumes onto something more stable
is a Good Thing. Steve will do this in a little bit.

msg13403 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.05 17:25
On Sat, Jan 05, 2008 at 10:10:53PM +0000, Will Maier via UW-HEP Help System wrote:
> garlic's been kicked.

And it's down again.

Matt, Steve: can either of you make it in? It seems like it might be
a good idea to move the volumes off of garlic, too.

msg13402 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.05 16:10
On Sat, Jan 05, 2008 at 09:51:49PM +0000, Will Maier via UW-HEP Help System wrote:
> garlic crashed a few minutes ago. I'm going to go in and restart
> it.

garlic's been kicked.

> Steve, Matt: can you try to figure out what's bringing it down?

When I got here, there was a trace on the console. The system
appears to have crashed while swapping; the stack included cpu_idle,
too. Either way, garlic almost certainly has hardware issues, and
I'd guess memory/CPU.

msg13401 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2008.01.05 15:51
garlic crashed a few minutes ago. I'm going to go in and restart it.

Steve, Matt: can you try to figure out what's bringing it down?

History
Date                 User     Action  Args
2008-01-07 10:01:01  rader    set     messages: + msg13413
2008-01-07 08:16:28  rader    set     messages: + msg13412
2008-01-07 06:34:42  wcmaier  link    issue5027 superseder
2008-01-07 06:34:09  wcmaier  set     assignedto: rader
2008-01-06 14:41:02  rader    set     messages: + msg13409
2008-01-06 10:31:33  wcmaier  set     messages: + msg13408
2008-01-06 10:29:47  wcmaier  set     messages: + msg13407
2008-01-06 09:17:44  rader    set     messages: + msg13406
2008-01-05 18:11:01  rader    set     messages: + msg13405
2008-01-05 17:37:41  wcmaier  set     messages: + msg13404
2008-01-05 17:25:12  wcmaier  set     messages: + msg13403
2008-01-05 16:10:53  wcmaier  set     status: unread -> chatting; messages: + msg13402
2008-01-05 15:51:49  wcmaier  create