Issue4936

Title: Connection timeout problem
Priority: urgent
Status: resolved
Superseder: AFS server for user home directories (issue 4945)
Nosy List: ajit, dan, dasu, rader, radtke, wcmaier
Assigned To:
Topic: AFS
Group: IT

Created 2007.11.06 04:11 by bdahmes.
Last changed 2007.11.08 08:04 by wcmaier.

Messages
msg13115 From: rader To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2007.11.07 09:31
> The sequence of events seems simple enough, but I don't think it
> suggests a cause for the failure. What could cause a volume to
> (seemingly) spontaneously require a salvage?

I don't know.  FWIW, vos backupsys hanging and leaving
volumes needing salvaging has happened a few times in
the more distant past.

steve
--
msg13112 From: wcmaier To: ajit, dan, dasu, rader, radtke, wcmaier Date: 2007.11.06 12:05
On Tue, Nov 06, 2007 at 06:32:02AM -0600, Will Maier via UW-HEP Help System wrote:
> I'm not familiar enough with afs2nsr to quickly understand what
> might have gone wrong.

I don't see anything in the nrs-afs log (/var/tmp/afsasm.log),
probably because I killed the backup process before it could log.

The clone of osg.app.cmssoft started just fine:

    Mon Nov  5 23:00:13 2007 1 Volser: Clone: Recloning volume 536874400 to volume 536874402

~80 minutes later, the volserver noticed that the transaction wasn't
going anywhere:

    Tue Nov  6 00:19:51 2007 trans 2407674 on volume 536874402 is older than 300 seconds

This continued until ~00:30. At ~04:00, the volserver started to
complain that osg.app.cmssoft needed salvaging:

    Tue Nov  6 03:53:38 2007 VAttachVolume: volume salvage flag is ON for /vicepa/V0536874400.vol; volume needs salvage
    Tue Nov  6 03:53:38 2007 1 Volser: ListVolumes: Could not attach volume 536874400 (V0536874400.vol) error=101

This continued until I started the salvage at ~06:30.
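
For reference, the gaps between the log entries quoted above can be checked with a short script. This is only a sketch: the three log lines are copied from this message, while the helper names `parse_timestamp` and `minutes_between` are made up for illustration and are not part of any AFS tooling.

```python
from datetime import datetime

# Log lines copied verbatim from the volserver excerpts quoted above.
LOG_LINES = [
    "Mon Nov  5 23:00:13 2007 1 Volser: Clone: Recloning volume 536874400 to volume 536874402",
    "Tue Nov  6 00:19:51 2007 trans 2407674 on volume 536874402 is older than 300 seconds",
    "Tue Nov  6 03:53:38 2007 VAttachVolume: volume salvage flag is ON for "
    "/vicepa/V0536874400.vol; volume needs salvage",
]


def parse_timestamp(line):
    """Parse the leading 'Mon Nov  5 23:00:13 2007' style timestamp of a log line."""
    fields = line.split()  # collapses the double space before single-digit days
    return datetime.strptime(" ".join(fields[:5]), "%a %b %d %H:%M:%S %Y")


def minutes_between(first, second):
    """Elapsed minutes from the timestamp of `first` to the timestamp of `second`."""
    delta = parse_timestamp(second) - parse_timestamp(first)
    return delta.total_seconds() / 60


if __name__ == "__main__":
    clone, stuck, salvage = LOG_LINES
    print("clone -> stuck transaction: %.1f min" % minutes_between(clone, stuck))
    print("clone -> salvage flag:      %.1f min" % minutes_between(clone, salvage))
```

The script only measures the time between the quoted entries; it says nothing about why the salvage flag was set.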

The sequence of events seems simple enough, but I don't think it
suggests a cause for the failure. What could cause a volume to
(seemingly) spontaneously require a salvage?

-- 

o--------------------------{ Will Maier }--------------------------o
| jabber:...wcmaier@xmpp.lfod.us | email:..will.maier@hep.wisc.edu |
| office:...........608.263.9692 | cell:..............608.438.6162 |
*--------------------[ UW High Energy Physics ]--------------------*
msg13097 From: wcmaier To: ajit, bdahmes, dan, dasu, rader, radtke, wcmaier Date: 2007.11.06 06:32
On Tue, Nov 06, 2007 at 11:40:35AM +0100, Sridhara Dasu wrote:
> Indeed AFS cmssoft area is out of commission.  Unfortunately, we
> have to wait till one of our sysadmins wakes up and fixes this.

For some reason, one of the backup processes had grabbed onto the
volume and didn't let go:

    /usr/bin/nsrfile -s -  afsasm -  /usr/bin/afs2nsr -s -p \
        '/v/o/osg.app.cmssoft.backup/' % -p \
        /v/o/osg.app.cmssoft.backup/ full

I killed this process and (on Steve's suggestion) started a salvage
of the cmssoft volume. The salvage completed a minute ago, and
access to the volume's data appears to be normal.

I'm not familiar enough with afs2nsr to quickly understand what
might have gone wrong. I imagine Steve will take a look at it when
he gets in.
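
One way to notice a backup process that has held on for too long is to scan a process listing for old `afs2nsr` entries. The following is a rough sketch, not something that was run during this incident: the function name `find_stale_backups` and the six-hour threshold are hypothetical, and it assumes a `ps` that supports the `etimes` (elapsed seconds) column, which older systems may lack.

```python
def find_stale_backups(ps_output, max_age_seconds=6 * 3600, needle="afs2nsr"):
    """Return (pid, age_seconds, command) for matching processes older than the limit.

    `ps_output` is expected to be the text printed by `ps -eo pid,etimes,args`.
    The six-hour default is an arbitrary, hypothetical threshold.
    """
    stale = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(parts) < 3:
            continue
        pid, etimes, command = parts
        if not (pid.isdigit() and etimes.isdigit()):
            continue
        if needle in command and int(etimes) > max_age_seconds:
            stale.append((int(pid), int(etimes), command))
    return stale


if __name__ == "__main__":
    import subprocess

    listing = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, age, command in find_stale_backups(listing):
        print("pid %d has been running for %d s: %s" % (pid, age, command))
```

This only reports candidates; whether to kill a process and salvage the volume remains a manual decision.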
msg13096 From: dasu To: ajit, bdahmes, dan, dasu, rader, radtke, wcmaier Date: 2007.11.06 04:40
Indeed AFS cmssoft area is out of commission.  Unfortunately, we have
to wait till one of our sysadmins wakes up and fixes this.

Sridhara

---------------------------------------------------------------------
Prof. Sridhara Rao Dasu                         Department of Physics
dasu@hep.wisc.edu                             University of Wisconsin
http://www.hep.wisc.edu/~dasu                    4289 Chamberlin Hall
608-262-3678 ( Office )                        1150 University Avenue
408-829-6625 (Wireless)                        Madison, WI 53706, USA


On Nov 6, 2007, at 11:11 AM, Bryan Michael DAHMES via UW-HEP Help System wrote:

>
> Hello,
>
> When logging into wisconsin, I see the following:
>
> Welcome to login05.hep.wisc.edu!
>
> Scientific Linux 4.4 UW-HEP 31Oct07.01 on a 2.4 GHz Pentium4
>
> ##################################################################
>
>        NOTICE: Do not use this system as a compute server.
>        Any CPU-intensive processes running on this machine
>        will be killed without warning or notification!
>        Please use the Condor batch system.  For more details
>        see http://www.hep.wisc.edu/computing/condor.html or
>        contact <condor-help@hep.wisc.edu>
>
> ##################################################################
> -bash: /afs/hep.wisc.edu/osg/app/cmssoft/cms/cmsset_default.sh: Connection timed out
>
> From there I can't seem to do any CMSSW-related work.
>
> Is there a problem?
>
> Thanks,
> bryan
>
> ----------
> group: IT
> messages: 13095
> nosy: ajit, bdahmes, dan, dasu, rader, radtke, wcmaier
> priority: triage
> status: unread
> title: Connection timeout problem
>
> ______________________________________
> UW-HEP Help System <help@hep.wisc.edu>
> <https://help.hep.wisc.edu/issue4936>
> ______________________________________
msg13095 From: bdahmes To: ajit, bdahmes, dan, dasu, rader, radtke, wcmaier Date: 2007.11.06 04:11
Hello,

When logging into wisconsin, I see the following:

Welcome to login05.hep.wisc.edu!

Scientific Linux 4.4 UW-HEP 31Oct07.01 on a 2.4 GHz Pentium4

##################################################################

       NOTICE: Do not use this system as a compute server.
       Any CPU-intensive processes running on this machine
       will be killed without warning or notification!
       Please use the Condor batch system.  For more details
       see http://www.hep.wisc.edu/computing/condor.html or
       contact <condor-help@hep.wisc.edu>

##################################################################
-bash: /afs/hep.wisc.edu/osg/app/cmssoft/cms/cmsset_default.sh: Connection timed out

From there I can't seem to do any CMSSW-related work.

Is there a problem?

Thanks,
bryan
History
Date                 User     Action  Args
2007-11-08 08:04:30  wcmaier  set     status: chatting -> resolved
                                      superseder: + AFS server for user home directories
2007-11-07 09:31:01  rader    set     messages: + msg13115
2007-11-06 12:05:13  wcmaier  set     messages: + msg13112
2007-11-06 11:06:33  wcmaier  set     nosy: - bdahmes
2007-11-06 10:21:44  wcmaier  set     priority: triage -> urgent
                                      topic: + AFS
                                      assignedto: wcmaier -> rader
2007-11-06 06:32:02  wcmaier  set     assignedto: wcmaier
                                      messages: + msg13097
2007-11-06 04:40:55  dasu     set     status: unread -> chatting
                                      messages: + msg13096
2007-11-06 04:11:02  bdahmes  create