Issue 4674

Title: Upgrade AFS
Priority: critical
Status: resolved
Superseder: AFS server for user home directories (4945)
Nosy List: ajit, dan, dasu, rader, radtke, wcmaier
Assigned To:
Topic: AFS
Group: IT

Created 2007.07.26 18:28 by walker.
Last changed 2007.11.19 13:23 by rader.

Messages
msg13085 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2007.11.05 13:36
The two new machines Steve ordered from Kingstar arrived late on
Friday. On one of the machines, the RAID card was poorly seated and
so didn't show up on POST. I checked the other, but its card was
seated correctly.

-- 

o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*
msg12580 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2007.09.19 09:07
Over the weekend I moved all volumes off anise.

We now have three DB servers.  I just added rosemary and
fennel this morning.  This will ease the pain when anise
gets upgraded.

I've also updated one client (login02) to use the new
DB servers.  The rest of the (HEP) machines will get
updated at 10am via Cfengine.
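
For context, an OpenAFS client typically finds its database servers through its CellServDB file, so adding rosemary and fennel means each client needs a stanza listing all three machines. The sketch below only generates such a stanza. The cell name hep.wisc.edu, the helper name cellservdb_stanza, and the 192.0.2.x addresses are placeholders for illustration; the change actually pushed by Cfengine is not recorded in this ticket.

```shell
#!/bin/sh
# Sketch: emit a CellServDB stanza for one cell.
# cellservdb_stanza is a hypothetical helper, not an OpenAFS tool.
# Usage: cellservdb_stanza CELL IP:HOST [IP:HOST ...]
cellservdb_stanza() {
    cell=$1
    shift
    # First line of a stanza: ">" followed by the cell name, then a comment.
    printf '>%s\t#%s\n' "$cell" "AFS cell"
    # One line per database server: address, then "#hostname".
    for entry in "$@"; do
        ip=${entry%%:*}     # text before the first colon
        host=${entry#*:}    # text after the first colon
        printf '%s\t#%s\n' "$ip" "$host"
    done
}
```

Calling `cellservdb_stanza hep.wisc.edu 192.0.2.1:anise 192.0.2.2:rosemary 192.0.2.3:fennel` prints a four-line stanza that could be reviewed before it is distributed to clients.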

Full details at ginseng:/u/l/tn/afs/add-db-server.  Will 
and Matt: please read that document!

steve
--

msg12566 From: rader To: ajit, dan, dasu, help, rader, radtke, wcmaier, wsmith Date: 2007.09.13 11:51
After building and testing what is probably the smallest, lightest
AFS-KRB5 cell in the world (on my 2.5lb sub-notebook), I'm pretty
comfortable with AFS-KRB5.  The executive summary is

 - afs2k5db migrates the AFS-KRB4 database (kaserver.DB0) to the KRB5 KDC

 - asetkey dumps the KRB5 AFS key (afs@HEP.WISC.EDU) into the AFS KeyFile

 - fakeka replaces kaserver for backward (AFS-KRB4) compatibility 

 - pam_krb5afs.so does KRB5 auth and generates tokens

 - kinit and aklog replace klog when AFS-KRB4 is turned off

AFS-KRB4 and AFS-KRB5 can be run in parallel so the migration should
be fairly painless.  Here's what Will and I decided on:

 - move anise's volumes (roughly users a through m) to garlic

 - bring up slave pts and vl database services on garlic
   and rosemary

 - replace anise's hardware and os

 - bring up KRB5 on anise

 - convert the AFS-KRB4 database to KRB5, stop kaserver and run fakeka

 - move all user volumes (including those on thyme) to anise

 - replace thyme's hardware and os

 - add RO user volumes to thyme (vos convertROtoRW DR goodness!)

 - migrate from pam_afs.so to pam_krb5afs.so

 - announce a sunset date for klog (KRB4)

 - sniff those still running KRB4 and re-notify them of the sunset

 - turn off fakeka
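
As a rough illustration of the first step in the plan above (moving anise's volumes to garlic), the sketch below prints one `vos move` command per volume rather than running anything. The helper name plan_moves, the partition letter a, and the volume names in the usage note are assumptions for illustration; the procedure actually used is not recorded in this ticket.

```shell
#!/bin/sh
# Sketch: print (do not run) one `vos move` command per volume name on stdin.
# plan_moves is a hypothetical helper; server and partition names are arguments.
# Usage: plan_moves FROM_SERVER FROM_PARTITION TO_SERVER TO_PARTITION < volumes
plan_moves() {
    from_server=$1
    from_part=$2
    to_server=$3
    to_part=$4
    while IFS= read -r vol; do
        [ -n "$vol" ] || continue   # skip blank lines
        printf 'vos move -id %s -fromserver %s -frompartition %s -toserver %s -topartition %s\n' \
            "$vol" "$from_server" "$from_part" "$to_server" "$to_part"
    done
}
```

For example, `printf 'user.alice\nuser.bob\n' | plan_moves anise a garlic a` prints two commands for review. Keeping the output as text makes it easy to check the list before anything touches a fileserver.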

steve
--
msg12128 From: radtke To: ajit, dan, rader, radtke, wcmaier Date: 2007.07.31 10:27
It seems to have happened again this morning, with anise being the 
problem.  Load is ~4 and afs was largely unusable.

-- 
Matthew Radtke
UW Physics
radtke@physics.wisc.edu
http://physics.wisc.edu/
msg12119 From: wcmaier To: ajit, dan, rader, radtke, wcmaier Date: 2007.07.30 17:41
On Mon, Jul 30, 2007 at 05:33:08PM -0500, Will Maier via UW-HEP Help System wrote:
> I'm adding some information to this ticket and renaming it...

Hm; rename didn't work...

msg12117 From: wcmaier To: rader, radtke, wcmaier Date: 2007.07.30 17:33
I'm adding some information to this ticket and renaming it, as I
don't think we already have a ticket for this specific issue. This
is a very high priority from both the CMS production and local
support perspectives.

The 'normal' state for our AFS fileservers in the last few weeks has
been:

    * load average > 3
    * > 40 blocked connections
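
A threshold like the first one above is easy to check from a script. This is a minimal sketch, assuming a helper named load_exceeds (hypothetical); it leaves aside the blocked-connection count, since the ticket does not say how that was measured.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) when VALUE is strictly greater than LIMIT.
# load_exceeds is a hypothetical helper name.
# Usage: load_exceeds VALUE LIMIT
load_exceeds() {
    # awk handles the floating-point comparison that plain sh cannot do.
    awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 > limit + 0) }'
}
```

On a Linux fileserver the one-minute load average is the first field of /proc/loadavg, so `load_exceeds "$(cut -d' ' -f1 /proc/loadavg)" 3 && echo "load high"` would flag the condition described here.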

These conditions worsen, typically in the late afternoon[0]. We've
migrated most of the OSG data off of rosemary and upgraded its AFS
software, but the high load and intermittent unavailability have
persisted. anise and thyme have also experienced serious but brief
outages in the last two weeks.

These outages are disruptive to users and cause numerous production
jobs to fail unrecoverably. Since they persist on rosemary and have
only occurred recently on our other fileservers, I've concluded that
the instability is caused by increased (but still below expected)
local use of our resources.

We should have hardware on hand to replace anise and thyme,
especially since they shouldn't require significant storage. While
it would be nice to upgrade to krb5[1] at the same time, I don't
think we can afford much more delay before upgrading the fileserver
hardware.

[0] I have no idea why.
[1] https://help.hep.wisc.edu/issue4100

msg12097 From: wcmaier To: rader, radtke, wcmaier Date: 2007.07.26 19:07
On Thu, Jul 26, 2007 at 06:40:55PM -0500, Will Maier via UW-HEP Help System wrote:
> I wonder if the 17:37 Nagios alert re: jklukas' home volume
> filling up is related. Perhaps the grad student volumes get too
> much traffic for poor old anise?

Holy damn. Lots of hosts want to talk to anise at the moment:

    $ for IP in $(rxdebug -long anise | sed -e '/^Connection/!d; s/.*from host \(.*\), port.*/\1/'| sort -u); do 
        dig +short -x $IP
      done | wc -l
    313

Looks like lots and lots of GLOW nodes, as well as some HEP stuff.
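
The one-liner above can be split into a reusable filter. This is a sketch, assuming rxdebug prints one "Connection from host <IP>, port <N>, ..." line per connection, which is the shape the sed expression expects. The function name unique_hosts is made up for this illustration.

```shell
#!/bin/sh
# Sketch: list the distinct client addresses in `rxdebug -long` output on stdin.
# Assumes connection lines look like "Connection from host <IP>, port <N>, ...".
# unique_hosts is a hypothetical helper name, not part of OpenAFS.
unique_hosts() {
    # Drop everything except "Connection" lines, keep only the host field,
    # then sort and de-duplicate.
    sed -e '/^Connection/!d' \
        -e 's/.*from host \([^,]*\), port.*/\1/' | sort -u
}
```

With this, `rxdebug -long anise | unique_hosts | wc -l` counts client addresses directly. The original loop counts lines of dig output instead, so its total can differ from the number of addresses when an address has no PTR record or has several.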

msg12096 From: walker To: rader, radtke, walker, wcmaier Date: 2007.07.26 19:00
Thanks Will!

msg12095 From: wcmaier To: rader, radtke, wcmaier Date: 2007.07.26 18:40
On Thu, Jul 26, 2007 at 06:38:52PM -0500, Will Maier via UW-HEP Help System wrote:
> On Thu, Jul 26, 2007 at 06:28:18PM -0500, walker via UW-HEP Help System wrote:
> > It seems AFS is frozen.

I wonder if the 17:37 Nagios alert re: jklukas' home volume filling
up is related. Perhaps the grad student volumes get too much traffic
for poor old anise?

msg12094 From: wcmaier To: rader, radtke, walker, wcmaier Date: 2007.07.26 18:38
On Thu, Jul 26, 2007 at 06:28:18PM -0500, walker via UW-HEP Help System wrote:
> It seems AFS is frozen.

I'm not seeing this now, though our monitoring shows that AFS was
having problems ~6:00PM CDT. Due to the heavy load here, AFS
sometimes slows down or freezes momentarily; these periods almost
always resolve themselves within 10 or 15 minutes. In all cases, we
have automated monitoring that alerts us to the conditions. It's
safe to assume that we're already looking into the problem --
there's usually no reason to send us a heads up (though we
appreciate the thought). ;)

We're in the process of upgrading critical parts of our AFS
infrastructure, and we hope that this will help lighten the load on
our servers. For now, it's best to ride out the heavy usage periods.
We hope that these will become fewer and farther between as we
complete our upgrades.

msg12093 From: walker To: rader, radtke, walker, wcmaier Date: 2007.07.26 18:28
Dear HEP help,

It seems AFS is frozen.

Devin
History
Date                 User     Action  Args
2007-11-19 13:23:06  rader    set     status: chatting -> resolved; superseder: + AFS server for user home directories
2007-11-05 13:36:20  wcmaier  set     messages: + msg13085
2007-09-19 09:07:51  rader    set     messages: + msg12580
2007-09-13 11:51:12  rader    set     messages: + msg12566
2007-09-13 11:47:22  rader    set     nosy: + dasu
2007-07-31 10:27:20  radtke   set     messages: + msg12128
2007-07-30 17:41:11  wcmaier  set     messages: + msg12119
2007-07-30 17:41:04  wcmaier  set     nosy: + dan, ajit; title: Slow computers -> Upgrade AFS
2007-07-30 17:33:08  wcmaier  set     priority: urgent -> critical; assignedto: rader; messages: + msg12117
2007-07-26 19:07:32  wcmaier  set     messages: + msg12097
2007-07-26 19:00:40  walker   set     messages: + msg12096
2007-07-26 18:40:55  wcmaier  set     status: resolved -> chatting; nosy: - walker; messages: + msg12095
2007-07-26 18:38:52  wcmaier  set     topic: + AFS; priority: triage -> urgent; status: unread -> resolved; messages: + msg12094
2007-07-26 18:28:18  walker   create