I'm adding some information to this ticket and renaming it, as I
don't think we already have a ticket for this specific issue. This
is a very high priority from both the CMS production and local
support perspectives.
The 'normal' state for our AFS fileservers in the last few weeks has
been:
* load average > 3
* > 40 blocked connections
These conditions worsen, typically in the late afternoon[0]. We've
migrated most of the OSG data off of rosemary and upgraded its AFS
software, but the high load and intermittent unavailability have
persisted. anise and thyme have also experienced serious but brief
outages in the last two weeks.
These outages are disruptive to users and cause numerous production
jobs to fail unrecoverably. Since they persist on rosemary despite
the migration and upgrade, and have only recently begun to occur on
our other fileservers, I've concluded that the instability is caused
by increased (but still lower than expected) local use of our
resources.
We should have hardware on hand to replace anise and thyme,
especially since they shouldn't require significant storage. While
it would be nice to upgrade to krb5[1] at the same time, I don't
think we can afford much more delay before upgrading the fileserver
hardware.
[0] I have no idea why.
[1] https://help.hep.wisc.edu/issue4100
--
o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*