Created 2008.07.14 09:42 by ajit. Last changed 2008.08.04 08:32 by wcmaier.
| msg14569 (view) |
From: ajit |
To: ajit, dan, dasu, devdatta, rader, radtke, wcmaier |
Date: 2008.07.24 15:11 |
|
Hi Devdatta,
I already clarified in an earlier email (did you miss it?) that you won't be
able to use the "farmoutRandomSeedJobs" script for analysis purposes, i.e. where
you need to run your jobs on input data that is stored in the dCache storage
here, i.e. /pnfs/. Instead you have to use the "farmoutAnalysisJobs" script;
its usage is described with examples here :
http://www.hep.wisc.edu/cms/comp/AnalyzingManyEvents.html
Please let me know if you have trouble using that script.
Thanks,
- Ajit
> Hello Ajit,
>
> I submitted jobs with 1K events each (gen+sim+digi+raw) and some of them
> are complete.
> I need to go on to the second part, i.e. raw->digi->reco. Please
> clarify one confusion that I have about this:
> I have the raw files from the previous generation step
> (gen+sim+digi+raw) in
> /pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/
>
> How do I read these root files in my raw2reco cfg file? I tried
>
> source = PoolSource {
> untracked vstring fileNames = {
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root",
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0024.root",
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0025.root",
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0022.root",
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0023.root",
> "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0030.root"
> }
> }
>
> But that did not work. Should I have prefixed file: to the names?
>
> I am running this job using farmoutRandomSeedJobs script.
>
> Best regards,
>
> Devdatta.
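
The rule of thumb in the reply above (input data under /pnfs/ means "farmoutAnalysisJobs") can be sketched in Python as follows. This is only an illustration: the helper name "choose_farmout_script" is hypothetical, and falling back to "farmoutRandomSeedJobs" when no input lives in dCache is an assumption drawn from how the generation step was run earlier in this thread.

```python
def choose_farmout_script(input_paths):
    """Suggest a farmout script name based on where the input files live.

    Assumption: any path under /pnfs/ is dCache-resident input data, which
    the reply above says must be processed with farmoutAnalysisJobs.
    """
    needs_dcache = any(path.startswith("/pnfs/") for path in input_paths)
    if needs_dcache:
        return "farmoutAnalysisJobs"
    # Assumption: jobs without dCache input (e.g. generation from ascii
    # files) keep using farmoutRandomSeedJobs, as earlier in the thread.
    return "farmoutRandomSeedJobs"


raw_file = (
    "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/"
    "rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
)
print(choose_farmout_script([raw_file]))
```

The actual command-line arguments for either script are described on the AnalyzingManyEvents page linked above and are not reproduced here.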
|
| msg14552 (view) |
From: ajit |
To: ajit, dan, dasu, devdatta, rader, radtke, wcmaier |
Date: 2008.07.22 11:51 |
|
Hi Devdatta,
I looked at one of your jobs :
>>>>
[ajit@login01 ~]$ condor_q -l 36814.0
>>>>
which points to this log file :
>>>>>
[ajit@login01 ~]$ less /scratch/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0002/rawHLT209BaurWp_14TeV_VLoCuts-0002.log
>>>>>
where I see messages like this :
>>>>>
...
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
...
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
...
.....
.....
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
...
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
>>>>>
It looks like your job's memory consumption grew from 1.27GB to almost
1.9GB, and the job was later evicted (the message above). Apparently this job
was evicted twice from 2 different worker nodes after running for about 48
hours each. I suspect that the eviction by condor is most likely due to the
memory limit on the WNs. Apparently, your application has quite a bit of
*memory leak* AND/OR the 5K events/job needs more memory (for whatever reason!)
than the system can provide. In that case, I would suggest using
*2.5K evts/job* and seeing how the jobs do.
It seems all 20 jobs are suffering the same way (see the respective condor logs).
Thanks,
- Ajit
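
For anyone repeating this check on their own job, here is a minimal Python sketch that pulls the image-size updates and eviction events out of a condor user log. It only uses the line formats visible in the excerpt above. The function names are hypothetical, and treating the reported image size as KB is an assumption, though it is consistent with the 1.27GB and 1.9GB figures quoted in this message.

```python
import re

# Lines copied from the log excerpt quoted in this message.
LOG_EXCERPT = """\
006 (36814.000.000) 07/18 05:48:21 Image size of job updated: 1271060
006 (36814.000.000) 07/18 06:08:21 Image size of job updated: 1319916
006 (36814.000.000) 07/20 01:48:23 Image size of job updated: 1889372
004 (36814.000.000) 07/20 05:58:00 Job was evicted.
"""

IMAGE_SIZE = re.compile(r"^006 \(\S+\) (\S+ \S+) Image size of job updated: (\d+)")
EVICTED = re.compile(r"^004 \(\S+\) (\S+ \S+) Job was evicted")


def summarize(log_text):
    """Return ([(timestamp, size_kb), ...], [eviction_timestamp, ...])."""
    sizes, evictions = [], []
    for line in log_text.splitlines():
        match = IMAGE_SIZE.match(line)
        if match:
            sizes.append((match.group(1), int(match.group(2))))
            continue
        match = EVICTED.match(line)
        if match:
            evictions.append(match.group(1))
    return sizes, evictions


def kb_to_gb(size_kb):
    """Convert KB to GB (decimal), assuming the log reports KB."""
    return size_kb / 1e6


sizes, evictions = summarize(LOG_EXCERPT)
first_kb, last_kb = sizes[0][1], sizes[-1][1]
print("image size grew from %.2f GB to %.2f GB" % (kb_to_gb(first_kb), kb_to_gb(last_kb)))
print("evictions at:", evictions)
```

A steadily growing image size followed by an eviction, as seen here, is the pattern that suggests a memory problem rather than a scheduling one.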
> Hello Ajit/Dan,
>
> I have submitted jobs (input: ascii files) as per your instructions. I
> submitted 20 jobs, each for 5K events, from the machine login01 on 18 July
> 2008 and they are all *still running*!!
> Is this normal?
> The job ids are 36814.0...36814.19
> I checked the .err file, and there is nothing there, as the jobs have
> not terminated yet. But for job 36814.07, I do not see it on the list of
> running jobs upon doing "condor_q" and there is nothing in the .err file
> as well.
>
> Could you please check?
>
> With regards,
>
> Devdatta.
>
>
> 2008/7/18 Ajit Mohapatra <ajit@hep.wisc.edu>:
>
> Hi Devdatta,
>
>
> I should clarify.
>
> I want the full gen+sim+digi+reco+hlt chain for the production
> of events (wgamma, to be precise). For this, I generate an ascii
> file elsewhere and feed it into the gen module of the
> above-mentioned chain. I should get an output root file
> containing HLT and RECO information.
> But sadly, in CMSSW, one cannot proceed linearly from gen to
> reco and hlt. One needs a stream gen->sim->digi->raw->hlt and
> then again raw->digi->reco.
> The second part of the job gets the root file containing raw+hlt as
> the input and gives another root file with reco+hlt as the output.
>
>
>
> I first did the gen->sim->digi->raw->hlt chain where the *ascii*
> file was my input and I stored the corresponding raw root file
> in dCache.
>
>
> For this you used farmoutRandomSeedJobs script with the ascii files
> as the "--extra-inputs".
>
>
> Later I used that root file as input to the raw->digi->reco chain.
>
>
> For this I believe you are using the "farmoutAnalysisJobs" script. In
> that case you don't need to pass the root files to
> "--extra-inputs". They should be read by cmsRun directly from the
> storage with the correct configuration in your cfg files.
>
> The "--extra-inputs" option (with this script) is provided only to
> xfer ascii/txt files if someone wants to use PDF parameterizations
> etc. in their analysis. If you are not using any, then there is no
> need to use this option at all for your raw->digi->reco chain.
>
>
> So can I now specify my input root file for "raw to reco" by the
> --extra-inputs=</path/to/filename> option and include the line
> untracked vstring fileNames = {"file:filename"}
> in the .cfg file?
>
>
> You don't have to use this option at all in this case.
>
> Thanks,
> - Ajit
>
>
> 2008/7/18 Dan Bradley <dan@hep.wisc.edu>:
>
>
> How did root files get into the list of condor input files? Was
> that a mistake? Or was that intentional? I thought there were some
> extra ascii parameter files needed. Input data files are not
> handled by condor. Those files are read directly from
> dCache by cmsRun, using root's plugin for accessing dcache.
>
> --Dan
>
>
> Ajit Mohapatra wrote:
>
> Dear Devdatta,
>
> The jobs are now running with the modified scripts. Many
> thanks for that. And thanks to Dan too for updating
> the script.
>
> But the jobs are taking an unusual amount of time. I have a
> job, the details of which I paste below, of only 100 events
> which is running for about 10 hours now.
>
> -- Submitter: login01.hep.wisc.edu : <144.92.180.4:60004> : login01.hep.wisc.edu
>  ID       OWNER     SUBMITTED     RUN_TIME ST PRI SIZE  CMD
> 36803.0   devdatta  7/17 03:41  0+04:56:51 R  0   976.6 cmsRun.sh recoHLT2
>
>
> Looking at your job description details, I see the following :
> >>>>
> TransferInput = "recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/rawHLT209BaurWp_14TeV_VLoCuts-0001.root"
> >>>>
>
> which indicates that you are asking condor to xfer a 200MB root
> file from /pnfs, i.e. the dCache storage system, directly. The
> storage system uses different protocols for file copy and condor
> simply can't do that operation. That's why the job is failing
> again and again while trying to copy the root file from the
> storage to the worker node.
>
> Here is what you need to do :
>
> 1) Kill the current job(s).
>
> 2) Copy the relevant root files to the /scratch directory (on a
> login machine) from where you want to submit the job. The
> instructions for copying files from dCache to a local disk are
> documented here :
> http://www.hep.wisc.edu/cms/comp/faq.html#copy_files
>
> 3) Once you have the root files in /scratch, pass
> "/scratch/.../rawHLT209BaurWp_14TeV_VLoCuts-0001.root" along with
> the cfg as you are doing now (i.e. as an argument to the TransferInput
> option).
>
> 4) When you have the new job submitted from whichever login
> machine, please send us an email with the job ID. We will see
> how it is doing.
>
>
> Thanks,
> - Ajit
>
>
>
>
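
The TransferInput problem described in the quoted reply above can be caught before submission with a small check. The Python sketch below is a rough illustration only: the helper names are hypothetical, and it assumes the attribute value is a quoted, comma-separated list as shown in the "condor_q -l" excerpt.

```python
def split_transfer_input(value):
    """Split a TransferInput attribute value into individual paths."""
    stripped = value.strip().strip('"')
    return [item.strip() for item in stripped.split(",") if item.strip()]


def dcache_entries(value, prefix="/pnfs/"):
    """Return the entries that point into dCache, which condor cannot copy."""
    return [path for path in split_transfer_input(value) if path.startswith(prefix)]


# Value taken from the job description quoted above.
transfer_input = (
    '"recoHLT209BaurWp_14TeV_VLoCuts-0000.cfg,'
    "/pnfs/hep.wisc.edu/data5/uscms01/devdatta/rawHLT209BaurWp_14TeV_VLoCuts/"
    'rawHLT209BaurWp_14TeV_VLoCuts-0001.root"'
)

offending = dcache_entries(transfer_input)
for path in offending:
    print("condor cannot transfer this file from dCache:", path)
```

An empty result means every listed file is on a filesystem condor can read directly, such as /scratch or AFS.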
|
| msg14520 (view) |
From: ajit |
To: ajit, dan, dasu, devdatta, rader, radtke, wcmaier |
Date: 2008.07.16 13:01 |
|
Dear Devdatta,
> Dear Ajit,
>
> I did not hear from you and I am sorry to press this matter so much.
> Getting the jobs done is a bit urgent for us as we do not have the required
> samples to start working on.
The farmoutRandomSeedJobs script was updated with a new option which you can use
to list your input ascii/txt files (on the command line), and Condor will xfer
those files to the worker node so that the job will be able to access them
while running. But you shouldn't use the full path, i.e. /scratch/..../, to those
files in your cfg file (as you are doing now). Instead just use the file name,
since on the worker node the file should be in the directory from where the job
gets executed by condor.
I just forwarded you an email (from Dan) which describes the syntax that you can
follow. Somehow that mail didn't get to you yesterday.
If there is any confusion, please let me know. Also, please send the email to
"help@hep.wisc.edu" so that our problem tracking system can do its job right
and ensure that you get a quicker response from our help team. (Please follow
this suggestion for all future problems that you encounter at Wisconsin.)
Thanks,
- Ajit
>
|
| msg14519 (view) |
From: dan |
To: ajit, dan, dasu, rader, radtke, wcmaier |
Date: 2008.07.15 17:18 |
|
Hello Devdatta,
I added the option --extra-inputs=file1,file2,... to the farmout
scripts. Condor will transfer these files from whatever path you
specify (could be /scratch or /afs) to the working directory where the
job runs on the worker node. Since the files will appear in the working
directory, rather than at the original path, you would probably not want
to specify any path to these files in the CMSSW config file. In other
words, the job should assume that these files will be present in the
current working directory when it runs.
Let me know if the above description is confusing or if it doesn't solve
your problem.
--Dan
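
To make the path convention concrete, here is a minimal Python sketch. Only the "--extra-inputs=file1,file2,..." syntax comes from the message above. The helper names are hypothetical, and the duplicate-name check rests on the assumption that two files with the same base name would overwrite each other in the job's working directory.

```python
import os


def extra_inputs_option(paths):
    """Build the --extra-inputs=file1,file2,... option from full paths."""
    return "--extra-inputs=" + ",".join(paths)


def names_in_working_dir(paths):
    """Return the bare file names that the CMSSW config should refer to."""
    names = [os.path.basename(path) for path in paths]
    # Assumption: same-named inputs would collide in the working directory.
    if len(set(names)) != len(names):
        raise ValueError("duplicate file names among extra inputs")
    return names


paths = [
    "/scratch/devdatta/BaurAsciiFiles/lhc_Wp_VeryLowCuts/baurWp_lhcVeryLowCuts.ascii",
]
print(extra_inputs_option(paths))
print(names_in_working_dir(paths))
```

The first line printed is what goes on the farmout command line; the names in the second are what the config file should use in place of the /scratch path.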
|
| msg14518 (view) |
From: ajit |
To: ajit, dan, dasu, rader, radtke, wcmaier |
Date: 2008.07.15 17:14 |
|
He again forgot to send the email to HELP.
Devdatta Majumder wrote:
>
> When you say that "Previously this problem never occurred", I am
> assuming that these jobs were running fine a couple of days ago, but
> now they don't i.e. all of a sudden you are seeing that the jobs
> can't find/open the input files ? And you have been running these
> jobs via the condor batch system all this time, but not
> interactively, right ?
>
>
> Previously I could run jobs using *condor*... this problem started
> last week. Of course it is true that the datafiles then resided in my
> afs directory, but as I said, the file sizes are large (some 200MB), so I
> was forced to keep those in my /scratch area.
> Interactive running on a few events is fine; the problem is with the
> condor jobs.
>
> Interactive running on the login machine (i.e. cmsRun < input file)
> is different than condor jobs. Are you using the FarmOut analysis
> scripts here to submit your jobs to condor ?
>
>
> Yes, I am using the scripts from this webpage :
> http://www.hep.wisc.edu/cms/comp/index.html They are not the analysis
> scripts but farmoutRandomSeedJobs jobs.
>
>
> Assuming that you are familiar with condor submission mechanism
> (including the input file xfers option in condor), I see that your
> ascii file
> (/scratch/devdatta/BaurAsciiFiles/lhc_Wp_VeryLowCuts/baurWp_lhcVeryLowCuts.ascii)
> resides in the /scratch in login01 and that directory is not
> accessible from the worker nodes where your jobs end up when you
> submit via condor. So the file needs to be xfered to the Worker
> nodes for the jobs to find it and that can be configured in your
> condor command/JDL.
>
>
> I am not really familiar with this mechanism of xfering files to worker
> nodes. I am using the following for submitting my jobs from afs directory:
>
> farmoutRandomSeedJobs <outputFileName> 100000 500 ~/CMSSW_2_0_9/ ~/CMSSW_2_0_9/src/GeneratorInterface/BaurWgamInterface/test/<cfgFileName>
>
>
> If you can provide some tips, it would be helpful. Or, could you manage
> to find some spare time to try one of my cfgs and certify that I am not
> doing something stupid?
>
>
> Just to make sure i.e. these are 200MB (each) ascii files ?
>
>
> There are smaller files as well, but there are some with >100MB size. I
> am going to produce some more of these, but I require storage for the long term.
>
>
> Both your AFS home area and the /scratch of the login machines are
> safe. We don't remove anything from /scratch that the users want to
> keep for a bit longer than the usual time. In that case it would be
> useful for us to know.
>
>
> I want to keep my datafile around for quite some time, at least until I
> have the whole gen+sim+digi+reco chain done.
>
>
> Condor jobs have the options to xfer input files directly to the
> worker nodes where the job can use them, eliminating the need to put
> big files in AFS for sharing. In that case you can have your big
> files in /scratch on any of the login machines.
>
> Let me know how I can be of further help in resolving this.
>
>
> Well, for now, I would like to resolve the problem of submitting jobs to
> condor where the input is in my scratch area and get the condor jobs
> running. Second, I would like to have access to some tape archive (I
> assume it would be /pnfs) where I can keep my datafiles for long-term
> storage.
>
> With regards,
>
> Devdatta.
>
|
| msg14501 (view) |
From: ajit |
To: ajit, dan, dasu, devdatta, rader, radtke, wcmaier |
Date: 2008.07.14 11:11 |
|
Dear Devdatta,
> The job submission looks ok. I mean the jobs are submitted alright,
> although they are taking a bit too long to start running.
> The job ends and I get the .err, .log, .out file in my job directory,
> which is under /scratch. But I do not get any root output file in my
> /pnfs area.
> I am pasting below the different output files for a particular job:
>
> ==============================
> gen209BaurWp_lhcVeryLowCuts-0000.err
> ==============================
> %MSG-s CMSException: BaurWgamInterface:source{*ctor*} 13-Jul-2008
> 15:19:09 CDT pre-events
> cms::Exception caught in cmsRun
> ---- Configuration BEGIN
> Error occured while creating source BaurWgamInterface
> ---- Configuration BEGIN
> OpenBaurWgamFileError Cannot open BaurWgam input file, check file name
> and path.
> ---- Configuration END
> ---- Configuration END
The above entry in the error log says that it couldn't open the BaurWgam input
file. So I would guess that the job is simply dying there and not moving any
further. Does that give you any hint of what's happening, or is this printout
normal?
> %MSG
>
> =====================
> report.log
> =====================
> SENDING with Task:devdatta-login01.hep.wisc.edu-36763 Job:TaskMeta
> params : {'exe': 'cmsRun', 'taskId':
> 'devdatta-login01.hep.wisc.edu-36763', 'tool': 'farmout', 'jobId':
> 'TaskMeta', 'application': 'CMSSW_2_0_9', 'user': 'devdatta',
> 'scheduler': 'local-condor', 'taskType': 'simulation', 'GridName':
> '/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=devdatta/CN=670738/CN=Devdatta Majumder/CN=proxy',
> 'vo': 'cms'}
> SENDING with Task:devdatta-login01.hep.wisc.edu-36763 Job:0
> params : {'exe': 'cmsRun', 'taskId':
> 'devdatta-login01.hep.wisc.edu-36763', 'tool': 'farmout', 'sid':
> 'https://login01.hep.wisc.edu#1215974091#36763.0', 'jobId': '0',
> 'application': 'CMSSW_2_0_9', 'user': 'devdatta', 'scheduler':
> 'local-condor', 'taskType': 'simulation', 'GridName':
> '/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=devdatta/CN=670738/CN=Devdatta Majumder/CN=proxy',
> 'vo': 'cms'}
> SENDING with Task:devdatta-login01.hep.wisc.edu-36763 Job:0
> params : {'SyncCE': 'cmsgrid02.hep.wisc.edu', 'SyncGridJobId':
> 'https://login01.hep.wisc.edu#1215974091#36763.0'}
> SENDING with Task:devdatta-login01.hep.wisc.edu-36763 Job:0
> params : {'ExeTime': '12', 'ExeExitCode': '65'}
> SENDING with Task:devdatta-login01.hep.wisc.edu-36763 Job:0
> params : {'JobExitCode': '65'}
>
> ===============================
> gen209BaurWp_lhcVeryLowCuts-0000.out
> ===============================
> Parameters sent to Dashboard.
> Parameters sent to Dashboard.
> Parameters sent to Dashboard.
> Error: Cannot open BaurWgam input file
The last line (above) also indicates that the job cannot find/open the BaurWgam
input file. Are you sure that the file is getting xfered to Condor's
job directory (with the correct path etc.) on the worker node and is available
to the job?
> ls -ltr
> total 28
> -rw------- 1 devdatta devdatta 7419 Jul 13 15:15 x509up_u1129
> -rw-rw-r-- 1 devdatta devdatta 1809 Jul 13 15:15
> gen209BaurWp_lhcVeryLowCuts-0000.cfg
> -rwxr-xr-x 1 devdatta devdatta 3724 Jul 13 15:15 condor_exec.exe
> -rw-r--r-- 1 devdatta devdatta 1036 Jul 13 15:18 report.log
> -rw-r--r-- 1 devdatta devdatta 137 Jul 13 15:19
> gen209BaurWp_lhcVeryLowCuts-0000.out
> -rw-r--r-- 1 devdatta devdatta 363 Jul 13 15:19
> gen209BaurWp_lhcVeryLowCuts-0000.err
> End of ls output
> cmsRun exited with status 65
The above line indicates that the job (i.e. cmsRun) is exiting with code 65 as a
result of the BaurWgam input file open error.
> I would like to know something further. Is there a quota in the scratch
> areas? I had to delete the run areas of some earlier jobs to accommodate
> my ascii files, which are rather huge.
/scratch is a 500GB disk with no quota; it is a *temporary space*
for users to store input/output files while running jobs. The
responsibility lies with the users to clean up regularly, otherwise jobs will
start to fail when the space fills up completely.
> Besides, can I move my datafiles from /scratch to /pnfs? This is
> important as I do not have any backup copy of those and I believe the
> /scratch area is automatically cleaned periodically. Please let me know
> how I can directly copy any kind of files to my /pnfs area.
Are you talking about the output root files that your job is producing or the
input ascii files?
If it's the output root files, then assuming that you are using the FarmOut
scripts to submit your jobs to condor, the root files should be ending up in
/pnfs directly, not in /scratch. Isn't that the case? If not, we need
to figure out what's going wrong.
In case of the input ascii files, you can put them in your AFS home directory
here. How big are the files and what's the total size?
Thanks,
- Ajit
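
The question of whether the input file reached the job directory can be answered from the "ls -ltr" listing quoted above. The Python sketch below is a rough illustration: the helper names are hypothetical, the expected file name "baurWp_lhcVeryLowCuts.ascii" is taken from the original request in this ticket, and the parser assumes plain long-format entries without spaces in file names.

```python
# Listing copied from the job output quoted above (wrapped names rejoined).
LS_OUTPUT = """\
total 28
-rw------- 1 devdatta devdatta 7419 Jul 13 15:15 x509up_u1129
-rw-rw-r-- 1 devdatta devdatta 1809 Jul 13 15:15 gen209BaurWp_lhcVeryLowCuts-0000.cfg
-rwxr-xr-x 1 devdatta devdatta 3724 Jul 13 15:15 condor_exec.exe
-rw-r--r-- 1 devdatta devdatta 1036 Jul 13 15:18 report.log
-rw-r--r-- 1 devdatta devdatta 137 Jul 13 15:19 gen209BaurWp_lhcVeryLowCuts-0000.out
-rw-r--r-- 1 devdatta devdatta 363 Jul 13 15:19 gen209BaurWp_lhcVeryLowCuts-0000.err
"""


def listed_names(ls_text):
    """Collect file names from long-format ls output (no spaces in names)."""
    names = set()
    for line in ls_text.splitlines():
        fields = line.split()
        # A long-format entry has nine fields; the last one is the name.
        if len(fields) >= 9 and fields[0][0] in "-d":
            names.add(fields[-1])
    return names


def missing_inputs(ls_text, expected_names):
    """Return the expected input files that do not appear in the listing."""
    present = listed_names(ls_text)
    return [name for name in expected_names if name not in present]


print(missing_inputs(LS_OUTPUT, ["baurWp_lhcVeryLowCuts.ascii"]))
```

A non-empty result here matches the "Cannot open BaurWgam input file" error: the ascii file was never transferred to the worker node.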
> 2008/7/14 Ajit Mohapatra <ajit@hep.wisc.edu>:
>
> Dear Devdatta,
>
> It looks like you are submitting jobs from login01, right? Can you
> please provide some details about the nature of the problems you are
> experiencing (i.e. whether the issue is with job submission, running,
> crashing, etc.) and the log files that contain relevant info? That would
> help us debug/pinpoint the issue and resolve it.
>
> Thanks,
> - Ajit
>
> Dear Ajit,
>
> I am facing some problems in submitting jobs to the Wisconsin grid. I am
> trying to run gen+sim+digi+raw+hlt chain and then the rawToReco chain on
> some ascii datafile.
>
> Can I have some help in getting the jobs done? I have run the jobs in
> interactive mode, and the outputs are fine, but whenever I try to
> submit the jobs to the Grid using condor scripts, I run into trouble.
> This has been happening only in the past few days. Before that it was fine.
>
>
> My datafile is in :
> machine: login01:
> /scratch/devdatta/BaurAsciiFiles/lhc_Wp_VeryLowCuts/baurWp_lhcVeryLowCuts.asci
>
> gen+sim+digi+raw+hlt cfg file :
> /home/devdatta/CMSSW_2_0_9/src/GeneratorInterface/BaurWgamInterface/test/genSimDigiRawHLTBaurWmunuGamma.cfg
>
> rawToreco file :
> /home/devdatta/CMSSW_2_0_9/src/GeneratorInterface/BaurWgamInterface/test/raw2recoBaurWmunuGam.cfg
>
> So the gen+sim+digi+raw+hlt cfg file runs on
> baurWp_lhcVeryLowCuts.ascii
> and the output is the input to raw2recoBaurWmunuGam.cfg whose
> output is
> the final processed data.
>
> Sridhara asked me to contact you quite some back if I needed
> help, but I
> was managing fine with small datasets. If you could please help, it
> would be very nice.
>
> Best wishes,
>
> Devdatta.
>
> ----------
> group: IT
> messages: 14499
> nosy: ajit, dan, dasu, rader, radtke, wcmaier
> priority: triage
> status: unread
> title: Problem with condor job submission]
>
> ______________________________________
> UW-HEP Help System <help@hep.wisc.edu <mailto:help@hep.wisc.edu>>
> <https://help.hep.wisc.edu/issue5355>
> ______________________________________
>
>
>
|
| msg14500 (view) |
From: ajit |
To: ajit, dan, dasu, devdatta, rader, radtke, wcmaier |
Date: 2008.07.14 10:01 |
|
Dear Devdatta,
It looks like you are submitting jobs from login01, right ? Can you please
provide some details about the nature of the problems you are experiencing, i.e.
whether the issue is with job submission, running, crashing etc., and the log
files that contain relevant info ? That would help us pinpoint and resolve the
issue.
Thanks,
- Ajit
|
| msg14499 (view) |
From: ajit |
To: ajit, dan, dasu, rader, radtke, wcmaier |
Date: 2008.07.14 09:42 |
|
-------- Original Message --------
Subject: Problem with condor job submission
Date: Mon, 14 Jul 2008 11:51:06 +0530
From: Devdatta Majumder <devdatta@tifr.res.in>
To: Ajit Mohapatra <ajit@hep.wisc.edu>
Dear Ajit,
I am facing some problems in submitting jobs to the Wisconsin grid. I am
trying to run gen+sim+digi+raw+hlt chain and then the rawToReco chain on
some ascii datafile.
Can I have some help in getting the jobs done? I have run the jobs in
interactive mode, and the outputs are fine, but whenever I try to
submit the jobs to the Grid using condor scripts, I run into trouble.
This has been happening only in the past few days. Before that it was fine.
My datafile is in :
machine: login01:
/scratch/devdatta/BaurAsciiFiles/lhc_Wp_VeryLowCuts/baurWp_lhcVeryLowCuts.asci
gen+sim+digi+raw+hlt cfg file :
/home/devdatta/CMSSW_2_0_9/src/GeneratorInterface/BaurWgamInterface/test/genSimDigiRawHLTBaurWmunuGamma.cfg
rawToReco cfg file :
/home/devdatta/CMSSW_2_0_9/src/GeneratorInterface/BaurWgamInterface/test/raw2recoBaurWmunuGam.cfg
So the gen+sim+digi+raw+hlt cfg file runs on baurWp_lhcVeryLowCuts.ascii
and the output is the input to raw2recoBaurWmunuGam.cfg whose output is
the final processed data.
Sridhara asked me to contact you quite some time back if I needed help, but I
was managing fine with small datasets. If you could please help, it
would be very nice.
Best wishes,
Devdatta.
|
|
| Date | User | Action | Args |
| 2008-08-04 08:32:21 | wcmaier | set | status: chatting -> resolved |
| 2008-07-24 15:11:08 | ajit | set | messages: + msg14569 |
| 2008-07-22 11:51:04 | ajit | set | messages: + msg14552 |
| 2008-07-16 16:03:04 | dan | set | nosy: + devdatta |
| 2008-07-16 13:01:05 | ajit | set | messages: + msg14520 |
| 2008-07-15 17:18:32 | dan | set | messages: + msg14519 |
| 2008-07-15 17:14:20 | ajit | set | messages: + msg14518 |
| 2008-07-14 12:12:02 | wcmaier | set | assignedto: ajit |
| 2008-07-14 11:11:03 | ajit | set | messages: + msg14501 |
| 2008-07-14 10:05:33 | wcmaier | set | priority: triage -> normal; topic: + Condor, CMS; title: Problem with condor job submission] -> Problem with condor job submission |
| 2008-07-14 10:01:01 | ajit | set | status: unread -> chatting; messages: + msg14500 |
| 2008-07-14 09:42:55 | ajit | create | |
|