Message14552

Author ajit
Recipients ajit, dan, dasu, devdatta, rader, radtke, wcmaier
Date 2008.07.22 11:51
Content
Hi Devdatta,

I looked at one of your jobs :

 >>>>
[ajit@login01 ~]$ condor_q -l 36814.0
 >>>>

which points to this log file :

 >>>>>
[ajit@login01 ~]$ less /scratch/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0002/rawHLT209BaurWp_14TeV_VLoCuts-0002.log
 >>>>>


where I see messages like this :
 >>>>>
...
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
...
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
...
.....
.....
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
...
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
 >>>>>

It looks like your job's memory consumption grew from 1.27GB to almost 1.9GB,
after which the job was evicted (the last message above). Apparently this job
was evicted twice, from 2 different worker nodes, after running for about 48
hours each time. I suspect that the eviction by condor is most likely due to
the memory limit on the WNs. Apparently, your application has quite a bit of
*memory leak* AND/OR the 5K events/job needs more memory (for whatever reason !)
than the system can provide. In that case, I would suggest using
*2.5K evts/job* and seeing how the jobs do.
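
The image-size messages quoted above can be pulled out of a condor user log with a few lines of Python. This is only a rough sketch: the helper names (`image_size_history`, `growth_kb`) are made up for illustration, the regular expression is written against the log lines shown in this message rather than a full log format, and it assumes the reported image size is in KB (which is how 1271060 reads as roughly 1.27GB).

```python
import re

# Matches condor user-log event 006 lines such as:
# 006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
IMAGE_SIZE_RE = re.compile(
    r"^006 \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) "
    r"Image size of job updated: (\d+)"
)


def image_size_history(log_text):
    """Return a list of (job_id, timestamp, size_kb) tuples found in the log text."""
    history = []
    for line in log_text.splitlines():
        m = IMAGE_SIZE_RE.match(line.strip())
        if m:
            cluster, proc, stamp, size_kb = m.groups()
            history.append((f"{int(cluster)}.{int(proc)}", stamp, int(size_kb)))
    return history


def growth_kb(history):
    """Difference between the last and first reported image sizes (assumed KB)."""
    if not history:
        return 0
    return history[-1][2] - history[0][2]


# Sample taken from the log excerpt in this message.
sample = """\
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
...
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
...
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
...
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
"""

history = image_size_history(sample)
for job, stamp, size_kb in history:
    # Dividing by 1e6 assumes the size is reported in KB.
    print(job, stamp, f"{size_kb / 1e6:.2f} GB")
print("growth:", growth_kb(history), "KB")
```

Running the same function over each of the 20 job logs would show whether they all grow in the same way.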

It seems all 20 jobs are suffering the same way (see the respective condor logs).

Thanks,
- Ajit




> Hello Ajit/Dan,
> 
> I have submitted jobs (input: ascii files) as per your instructions. I 
> submitted 20 jobs, each for 5K events, from the machine login01 on 18 July 
> 2008 and they are all *still running*!!
> Is this normal?
> The job ids are 36814.0...36814.19
> I checked the .err file, and there is nothing there, as the jobs have 
> not terminated yet. But for job 36814.07, I do not see it in the list of 
> running jobs upon doing "condor_q" and there is nothing in the .err file 
> either.
> 
> Could you please check?
> 
> With regards,
> 
> Devdatta.
> 
> 
> 2008/7/18 Ajit Mohapatra <ajit@hep.wisc.edu>:
> 
>     Hi Devdatta,
> 
> 
>         I should clarify.
> 
>         I want the full gen+sim+digi+reco+hlt chain for the production
>         of events (wgamma, to be precise). For this, I generate an ascii
>         file elsewhere and feed it into the gen module of the
>         above-mentioned chain. I should get an output root file
>         containing HLT and RECO information.
>         But sadly, in CMSSW, one cannot proceed linearly from gen to
>         reco and hlt. One needs a stream gen->sim->digi->raw->hlt and
>         then again raw->digi->reco.
>         The second part of the job gets the root file containing raw+hlt as
>         the input and gives another root file with reco+hlt as the output.
> 
> 
> 
>         I first did the gen->sim->digi->raw->hlt chain where the *ascii*
>         file was my input and I stored the corresponding raw root file
>         in dCache.
> 
> 
>     For this you used farmoutRandomSeedJobs script with the ascii files
>     as the "--extra-inputs".
> 
> 
>         Later I used that root file as input to the raw->digi->reco chain.
> 
> 
>     For this I believe you are using "farmoutAnalysisJobs" script. In
>     that case you don't need to pass the root files to the
>     "--extra-inputs". They should be read by cmsRun directly from the
>     storage with correct configuration in your cfg files.
> 
>     The "--extra-inputs" option (with this script) is provided only to
>     xfer ascii/txt files if someone wants to use PDF parameterizations
>     etc. in their analysis. If you are not using any, then there is no
>     need to use this option at all for your raw->digi->reco chain.
> 
> 
>         So can I now specify my input root file for "raw to reco" by the
>         --extra-inputs=</path/to/filename> option and include the line
>         untracked vstring fileNames = {"file:filename"}
>         in the .cfg file?
> 
> 
>     You don't have to use this option at all in this case.
> 
>     Thanks,
>     - Ajit
> 
> 
>         2008/7/18 Dan Bradley <dan@hep.wisc.edu>:
> 
> 
>            How did root files get into the list of condor input files?  Was
>            that a mistake?  Or was that intentional?  I thought there were
>            some extra ascii parameter files needed.  Input data files are
>            not handled by condor.  Those files are read directly from
>            dCache by cmsRun, using root's plugin for accessing dcache.
> 
>            --Dan
> 
> 
>            Ajit Mohapatra wrote:
> 
>                Dear Devdatta,
> 
>                    The jobs are now running with the modified scripts. Many
>                    thanks for that. And thanks to Dan too for updating the
>                    script.
> 
>                    But the jobs are taking an unusual amount of time. I have a
>                    job, the details of which I paste below, of only 100 events
>                    which is running for about 10 hours now.
> 
>                    -- Submitter: login01.hep.wisc.edu : <144.92.180.4:60004> : login01.hep.wisc.edu
> 
>                     ID       OWNER      SUBMITTED     RUN_TIME   ST PRI SIZE  CMD
>                    36803.0   devdatta   7/17 03:41   0+04:56:51  R  0   976.6 cmsRun.sh recoHLT2
> 
> 
>                Looking at your job description details, I see the following :
>                 >>>>
>                TransferInput =
>                "recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
>                 >>>>
> 
>                which indicates that you are asking condor to xfer a 200MB root
>                file from /pnfs i.e. the dCache storage system directly. The
>                storage system uses different protocols for file copy and condor
>                simply can't do that operation. That's why the job is failing
>                again and again while trying to copy the root file from the
>                storage to the worker node.
> 
>                Here is what you need to do :
> 
>                1) Kill the current job(s).
> 
>                2) Copy the relevant root files to the /scratch directory (in a
>                login machine) from where you want to submit the job. The
>                instructions for copying files from dCache to a local disk are
>                documented here :
>                http://www.hep.wisc.edu/cms/comp/faq.html#copy_files
> 
>                3) Once you have the root files in /scratch, pass the
>                "/scratch/.../rawHLT209BaurWp_14TeV_VLoCuts-0001.root" along
>                with the cfg as you are doing now (i.e. as an argument to the
>                TransferInput option)
> 
>                4) When you have submitted the new job from one of the login
>                machines, please send us an email with the job ID. We will see
>                how it is doing.
> 
> 
>                Thanks,
>                - Ajit
> 
> 
> 
>
History
Date User Action Args
2008-07-22 11:51:04 ajit set recipients: + ajit, wcmaier, rader, dan, dasu, radtke
2008-07-22 11:51:04 ajit link issue5355 messages
2008-07-22 11:51:02 ajit create