Message14569

Author ajit
Recipients ajit, dan, dasu, devdatta, rader, radtke, wcmaier
Date 2008.07.24 15:11
Content
Hi Devdatta,

I already clarified in an earlier email (did you miss it?) that you won't be
able to use the "farmoutRandomSeedJobs" script for analysis purposes, i.e. where
your jobs need to run on input data that is stored in the dCache storage
here (/pnfs/). Instead you have to use the "farmoutAnalysisJobs" script; its
usage is described with examples here:
http://www.hep.wisc.edu/cms/comp/AnalyzingManyEvents.html

Please let me know if you have trouble using that script.

Thanks,
- Ajit



> Hello Ajit,
> 
> I submitted jobs with 1K events each (gen+sim+digi+raw) and some of them 
> are complete.
> I need to go on to the second part, i.e. raw->digi->reco. Please
> clarify one confusion that I have about this:
> I have the raw files from the previous generation step 
> (gen+sim+digi+raw) in 
> /pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/ 
> 
> How do I read these root files in my raw2reco cfg file? I tried
> 
> source = PoolSource {
>     untracked vstring fileNames = {
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root",
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0024.root",
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0025.root",
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0022.root",
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0023.root",
>         "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0030.root"
>     }
> }
> 
> But that did not work. Should I have prefixed "file:" to the names?
> 
> I am running this job using farmoutRandomSeedJobs script.
> 
> Best regards,
> 
> Devdatta.
> 
> 2008/7/22 Ajit Mohapatra <ajit@hep.wisc.edu>:
> 
>     Hi Devdatta,
> 
>     I looked at one of your jobs :
> 
>      >>>>
>     [ajit@login01 ~]$ condor_q -l 36814.0
>      >>>>
> 
>     which points to this log file :
> 
>      >>>>>
>     [ajit@login01 ~]$ less
>     /scratch/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0002/rawHLT209BaurWp_14TeV_VLoCuts-0002.log
>      >>>>>
> 
> 
>     where I see messages like this :
>      >>>>>
>     ...
>     006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
>     ...
>     006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
>     ...
>     .....
>     .....
>     006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
>     ...
>     004 (36814.000.000) 07/20 05:58:00 Job was evicted.
>      >>>>>
> 
>     It looks like your job's memory consumption grew from 1.27GB to
>     almost 1.9GB before the job was evicted (the last message above).
>     This job was apparently evicted twice, from 2 different worker nodes,
>     after running for about 48 hours each. I suspect that the
>     eviction by condor is most likely due to the memory limit on the
>     WNs. Apparently, your application has quite a bit of *memory leak*
>     AND/OR the 5K events/job need more memory (for whatever reason!)
>     than the system can provide. In that case, I would suggest using
>     *2.5K evts/job* and seeing how the jobs do.
> 
>     It seems all 20 jobs are suffering the same way (see the respective
>     condor logs).
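
The growth seen in the log excerpt above can also be checked mechanically across all 20 logs. Below is a minimal Python sketch, assuming the condor user log keeps the "006 (cluster.proc.subproc) MM/DD HH:MM:SS Image size of job updated: N" line format shown in the excerpt, and that N is in KB (which is how the 1.27GB and 1.9GB figures above were read). The function names are hypothetical and are not part of condor or of the farmout scripts.

```python
import re

# Matches event 006 lines as they appear in the log excerpt above, e.g.
# "006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060"
IMAGE_SIZE_RE = re.compile(
    r"^\s*006 \((\d+)\.(\d+)\.\d+\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Image size of job updated: (\d+)\s*$"
)


def parse_image_sizes(log_text):
    """Return a list of (job_id, timestamp, size_kb) tuples, in log order."""
    samples = []
    for line in log_text.splitlines():
        match = IMAGE_SIZE_RE.match(line)
        if match:
            cluster, proc, stamp, size_kb = match.groups()
            job_id = "%d.%d" % (int(cluster), int(proc))
            samples.append((job_id, stamp, int(size_kb)))
    return samples


def image_size_growth_kb(samples):
    """Difference between the last and first reported image size, in KB."""
    if not samples:
        return 0
    return samples[-1][2] - samples[0][2]


if __name__ == "__main__":
    # Lines copied from the excerpt quoted above.
    excerpt = """\
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
...
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
...
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
...
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
"""
    samples = parse_image_sizes(excerpt)
    for job_id, stamp, size_kb in samples:
        # Approximate conversion (1 GB taken as 1e6 KB), as in the email.
        print(job_id, stamp, "%.2f GB" % (size_kb / 1e6))
    print("growth: %d KB" % image_size_growth_kb(samples))
```

A steadily rising image size over many hours, as here, is consistent with a leak; a size that jumps early and then stays flat would point more towards the per-job event count.
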
> 
>     Thanks,
>     - Ajit
> 
> 
> 
> 
>         Hello Ajit/Dan,
> 
>         I have submitted jobs (input: ascii files) as per your
>         instructions. I submitted 20 jobs of 5K events each from the
>         machine login01 on 18 July 2008 and they are all *still running*!!
>         Is this normal?
>         The job ids are 36814.0...36814.19
>         I checked the .err file, and there is nothing there, as the jobs
>         have not terminated yet. But for job 36814.07, I do not see it
>         on the list of running jobs upon doing "condor_q" and there is
>         nothing in the .err file either.
> 
>         Could you please check?
> 
>         With regards,
> 
>         Devdatta.
> 
> 
>         2008/7/18 Ajit Mohapatra <ajit@hep.wisc.edu>:
> 
> 
>            Hi Devdatta,
> 
> 
>                I should clarify.
> 
>                I want the full gen+sim+digi+reco+hlt chain for the production
>                of events (wgamma, to be precise). For this, I generate an ascii
>                file elsewhere and feed it into the gen module of the
>                above-mentioned chain. I should get an output root file
>                containing HLT and RECO information.
>                But sadly, in CMSSW, one cannot proceed linearly from gen to
>                reco and hlt. One needs a stream gen->sim->digi->raw->hlt and
>                then again raw->digi->reco.
>                The second part of the job gets the root file containing raw+hlt
>                as the input and gives another root file with reco+hlt as the
>                output.
> 
> 
> 
>                I first did the gen->sim->digi->raw->hlt chain where the *ascii*
>                file was my input and I stored the corresponding raw root file
>                in dCache.
> 
> 
>            For this you used the farmoutRandomSeedJobs script with the ascii
>            files as the "--extra-inputs".
> 
> 
>                Later I used that root file as input to the raw->digi->reco
>                chain.
> 
> 
>            For this I believe you are using the "farmoutAnalysisJobs" script. In
>            that case you don't need to pass the root files to
>            "--extra-inputs". They should be read by cmsRun directly from the
>            storage, given the correct configuration in your cfg files.
> 
>            The "--extra-inputs" option (with this script) is provided only to
>            xfer ascii/txt files if someone wants to use PDF parameterizations
>            etc. in their analysis. If you are not using any, then there is no
>            need to use this option at all for your raw->digi->reco chain.
> 
> 
>                So can I now specify my input root file for "raw to reco" by the
>                --extra-inputs=</path/to/filename> option and include the line
>                untracked vstring fileNames = {"file:filename"}
>                in the .cfg file?
> 
> 
>            You don't have to use this option at all in this case.
> 
>            Thanks,
>            - Ajit
> 
> 
>                2008/7/18 Dan Bradley <dan@hep.wisc.edu>:
> 
> 
>                   How did root files get into the list of condor input files?
>                   Was that a mistake?  Or was that intentional?  I thought there
>                   were some extra ascii parameter files needed.  Input data
>                   files are not handled by condor.  Those files are read
>                   directly from dCache by cmsRun, using root's plugin for
>                   accessing dcache.
> 
>                   --Dan
> 
> 
>                   Ajit Mohapatra wrote:
> 
>                       Dear Devdatta,
> 
>                           The jobs are now running with the modified scripts.
>                           Many thanks for that. And thanks to Dan too for
>                           updating the script.
> 
>                           But the jobs are taking an unusual amount of time. I
>                           have a job, the details of which I paste below, of
>                           only 100 events which has been running for about 10
>                           hours now.
> 
>                           -- Submitter: login01.hep.wisc.edu : <144.92.180.4:60004> : login01.hep.wisc.edu
>                            ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>                           36803.0   devdatta        7/17 03:41   0+04:56:51 R  0   976.6 cmsRun.sh recoHLT2
> 
> 
>                       Looking at your job description details, I see the
>                       following:
>                        >>>>
>                       TransferInput =
>                       "recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
> 
> 
>                        >>>>
> 
>                       which indicates that you are asking condor to xfer a
>                       200MB root file directly from /pnfs, i.e. the dCache
>                       storage system. The storage system uses different
>                       protocols for file copy and condor simply can't do that
>                       operation. That's why the job is failing again and again
>                       while trying to copy the root file from the storage to
>                       the worker node.
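
The problem described above can be caught before submission. Below is a minimal Python sketch, assuming the TransferInput value is a comma-separated list as in the job description quoted above, and that anything under /pnfs/ lives in dCache. The function name and the prefix test are hypothetical illustrations; they are not part of condor or of the farmout scripts.

```python
def find_dcache_inputs(transfer_input, dcache_prefix="/pnfs/"):
    """Return the TransferInput entries that live under dcache_prefix.

    transfer_input is the comma-separated value, with or without the
    surrounding double quotes shown in the condor_q -l output.
    """
    entries = [e.strip() for e in transfer_input.strip('"').split(",")]
    return [e for e in entries if e.startswith(dcache_prefix)]


if __name__ == "__main__":
    # Value copied from the job description quoted above.
    value = ("recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,"
             "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/"
             "rawHLT209BaurWp_14TeV_VLoCuts/"
             "rawHLT209BaurWp_14TeV_VLoCuts-0001.root")
    for path in find_dcache_inputs(value):
        print("condor cannot transfer:", path)
```

An empty result means every listed input is an ordinary local file that condor's file transfer can handle.
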
> 
>                       Here is what you need to do :
> 
>                       1) Kill the current job(s).
> 
>                       2) Copy the relevant root files to the /scratch directory
>                       (on a login machine) from where you want to submit the
>                       job. The instructions for copying files from dCache to a
>                       local disk are documented here:
>                       http://www.hep.wisc.edu/cms/comp/faq.html#copy_files
> 
>                       3) Once you have the root files in /scratch, pass
>                       "/scratch/.../rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
>                       along with the cfg as you are doing now (i.e. as an
>                       argument to the TransferInput option).
> 
>                       4) When you have the new job submitted from whichever
>                       login machine, please send us an email with the job ID.
>                       We will see how it is doing.
> 
> 
>                       Thanks,
>                       - Ajit
> 
> 
> 
> 
> 
>
History
Date                 User  Action  Args
2008-07-24 15:11:08  ajit  set     recipients: + ajit, wcmaier, rader, dan, dasu, radtke
2008-07-24 15:11:08  ajit  link    issue5355 messages
2008-07-24 15:11:02  ajit  create