Hi Devdatta,
I looked at one of your jobs :
>>>>
[ajit@login01 ~]$ condor_q -l 36814.0
>>>>
which points to this log file :
>>>>>
[ajit@login01 ~]$ less /scratch/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0002/rawHLT209BaurWp_14TeV_VLoCuts-0002.log
>>>>>
where I see messages like this :
>>>>>
...
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
...
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
...
.....
.....
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
...
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
>>>>>
It looks like your job's memory consumption grew from about 1.27 GB to almost
1.9 GB before the job was evicted (the last message above). Apparently this job
was evicted twice, from 2 different worker nodes, after running for about 48
hours each time. I suspect that condor evicted it because of the memory limit on
the WNs. Apparently, your application has a sizeable *memory leak* AND/OR
5K events/job needs more memory (for whatever reason!) than the system can
provide. In either case, I would suggest trying *2.5K evts/job* to see how the
jobs do.
It seems all 20 jobs are suffering the same way (see the respective condor logs).
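
If you want to check the other logs yourself, here is a rough Python sketch that pulls the "Image size of job updated" lines out of a condor log and reports the growth. It only assumes the line format shown above, and it treats the image size as kilobytes (which is how I read the numbers above). The function names are just illustrative.

```python
import re

# Matches condor user-log lines such as:
# 006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
IMAGE_SIZE_RE = re.compile(
    r"^006 \((\d+)\.(\d+)\.\d+\) (\d\d/\d\d \d\d:\d\d:\d\d) "
    r"Image size of job updated: (\d+)"
)


def image_size_history(log_lines):
    """Return a list of (timestamp, size_kb) tuples in log order."""
    history = []
    for line in log_lines:
        match = IMAGE_SIZE_RE.match(line.strip())
        if match:
            history.append((match.group(3), int(match.group(4))))
    return history


def growth_summary(history):
    """Return (first_kb, last_kb, growth_fraction), or None with < 2 samples."""
    if len(history) < 2:
        return None
    first, last = history[0][1], history[-1][1]
    return first, last, (last - first) / first


# The lines quoted above, used as sample input (assumed unit: kilobytes).
sample = [
    "006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060",
    "...",
    "006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916",
    "...",
    "006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372",
    "...",
    "004 (36814.000.000) 07/20 05:58:00 Job was evicted.",
]

history = image_size_history(sample)
first_kb, last_kb, growth = growth_summary(history)
print(f"{first_kb / 1e6:.2f} GB -> {last_kb / 1e6:.2f} GB ({growth:.0%} growth)")
```

To run it on a real log, pass `open(path)` instead of `sample`. A steady rise over the whole run, rather than a plateau, would point to a leak.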
Thanks,
- Ajit
> Hello Ajit/Dan,
>
> I have submitted jobs (input: ascii files) as per your instructions. I
> submitted 20 jobs of 5K events each from the machine login01 on 18 July
> 2008 and they are all *still running*!!
> Is this normal?
> The job ids are 36814.0...36814.19
> I checked the .err file, and there is nothing there, as the jobs have
> not terminated yet. But job 36814.07 does not appear in the list of
> running jobs when I do "condor_q", and there is nothing in its .err file
> either.
>
> Could you please check?
>
> With regards,
>
> Devdatta.
>
>
> 2008/7/18 Ajit Mohapatra <ajit@hep.wisc.edu>:
>
> Hi Devdatta,
>
>
> I should clarify.
>
> I want the full gen+sim+digi+reco+hlt chain for the production
> of events (wgamma, to be precise). For this, I generate an ascii
> file elsewhere and feed it into the gen module of the
> above-mentioned chain. I should get an output root file
> containing HLT and RECO information.
> But sadly, in CMSSW, one cannot proceed linearly from gen to
> reco and hlt. One needs a stream gen->sim->digi->raw->hlt and
> then again raw->digi->reco.
> The second part of the job takes the root file containing raw+hlt as
> input and gives another root file with reco+hlt as output.
>
>
>
> I first did the gen->sim->digi->raw->hlt chain where the *ascii*
> file was my input and I stored the corresponding raw root file
> in dCache.
>
>
> For this you used farmoutRandomSeedJobs script with the ascii files
> as the "--extra-inputs".
>
>
> Later I used that root file as input to the raw->digi->reco chain.
>
>
> For this I believe you are using the "farmoutAnalysisJobs" script. In
> that case you don't need to pass the root files to
> "--extra-inputs". They should be read by cmsRun directly from the
> storage with correct configuration in your cfg files.
>
> The "--extra-inputs" option (with this script) is provided only to
> xfer ascii/txt files if someone wants to use PDF parameterizations
> etc. in their analysis. If you are not using any, then there is no
> need to use this option at all for your raw->digi->reco chain.
>
>
> So can I now specify my input root file for "raw to reco" by the
> --extra-inputs=</path/to/filename> option and include the line
> untracked vstring fileNames = {"file:filename"}
> in the .cfg file?
>
>
> You don't have to use this option at all in this case.
>
> Thanks,
> - Ajit
>
>
> 2008/7/18 Dan Bradley <dan@hep.wisc.edu>:
>
>
> How did root files get into the list of condor input files? Was
> that a mistake? Or was that intentional? I thought there were some
> extra ascii parameter files needed. Input data files are not
> handled by condor. Those files are read directly from
> dCache by cmsRun, using root's plugin for accessing dCache.
>
> --Dan
>
>
> Ajit Mohapatra wrote:
>
> Dear Devdatta,
>
> The jobs are now running with the modified scripts. Many
> thanks for that. And thanks to Dan too for updating the script.
>
> But the jobs are taking an unusual amount of time. I have a
> job, the details of which I paste below, of only 100 events
> which has been running for about 10 hours now.
>
> -- Submitter: login01.hep.wisc.edu : <144.92.180.4:60004> : login01.hep.wisc.edu
> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
> 36803.0 devdatta 7/17 03:41 0+04:56:51 R 0 976.6 cmsRun.sh recoHLT2
>
>
> Looking at your job description details, I see the following :
> >>>>
> TransferInput =
> "recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
> >>>>
>
> which indicates that you are asking condor to xfer a 200MB root
> file from /pnfs, i.e. directly from the dCache storage system. The
> storage system uses different protocols for file copy and condor
> simply can't do that operation. That's why the job is failing
> again and again while trying to copy the root file from the
> storage to the worker node.
>
> Here is what you need to do :
>
> 1) Kill the current job(s).
>
> 2) Copy the relevant root files to the /scratch directory (on a
> login machine) from where you want to submit the job. The
> instructions for copying files from dCache to a local disk are
> documented here :
> http://www.hep.wisc.edu/cms/comp/faq.html#copy_files
>
> 3) Once you have the root files in /scratch, pass
> "/scratch/.../rawHLT209BaurWp_14TeV_VLoCuts-0001.root" along with
> the cfg as you are doing now (i.e. as an argument to the TransferInput
> option)
>
> 4) When you have submitted the new job from one of the login
> machines, please send us an email with the job ID. We will see
> how it is doing.
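
A quick way to catch this before submitting is to scan the TransferInput value for entries under /pnfs. The Python below is only a sketch: the helper names are made up for illustration, and it assumes the value is a comma-separated list in double quotes, as in the job description quoted above.

```python
def split_transfer_input(value):
    """Split a comma-separated TransferInput string into individual paths."""
    stripped = value.strip().strip('"')
    return [item.strip() for item in stripped.split(",") if item.strip()]


def dcache_entries(value, prefix="/pnfs/"):
    """Return the entries that live under the dCache namespace (/pnfs)."""
    return [path for path in split_transfer_input(value) if path.startswith(prefix)]


# The TransferInput value from the job description quoted above.
transfer_input = (
    '"recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,'
    "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/"
    'rawHLT209BaurWp_14TeV_VLoCuts-0001.root"'
)

bad = dcache_entries(transfer_input)
for path in bad:
    print("condor cannot transfer this file from dCache:", path)
```

Any path it prints should either be copied to /scratch first or left out of TransferInput.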
>
>
> Thanks,
> - Ajit
>
>
>