It is difficult to set up many computers, each with a powerful CPU, a GPU, and large memory, so we share the resources of one powerful computer instead.

We use the well-known job queuing system “HTCondor”. Any program that occupies the CPU and/or GPU for more than 10 minutes must be run through Condor; this threshold is sufficient for R&D work.

condor_submit submits your long jobs and schedules them onto the available CPUs and GPUs. The online manual is located here; read the stable version.

A simple example: job1/submit.condor

universe                = vanilla       # ordinary serial job
executable              = a.out
arguments               = infile$(Process).txt
should_transfer_files   = IF_NEEDED     # /home is on NFS, so usually no transfer
when_to_transfer_output = ON_EXIT
initialdir              = workdir/run$(Process)   # per-job working directory
transfer_input_files    = infile$(Process).txt
log                     = job1.log
request_GPUs            = 1             # remove this line for CPU-only jobs
queue 100                               # queue 100 jobs, $(Process) = 0..99

Submit your jobs with “condor_submit job1/submit.condor”.
See the status of the compute nodes with “condor_status”.
See the status of your jobs with “condor_q”.

$(Process) is replaced with the process number 0-99 for “queue 100”. In most cases you will need to prepare the “initialdir” directories beforehand to store outputs and logs. Since /home is shared via NFS, file transfer is not needed.
$(Process) is also useful when each job needs different input data; see the sketch below.
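For example, the per-job directories and input files can be prepared with a small script. A minimal sketch in Python, matching the directory and file names in the submit file above (the file contents here are just a hypothetical placeholder):

import os

# Prepare workdir/run0 .. workdir/run99 with one input file each,
# matching initialdir and transfer_input_files in the submit file.
for i in range(100):
    rundir = os.path.join("workdir", f"run{i}")
    os.makedirs(rundir, exist_ok=True)
    with open(os.path.join(rundir, f"infile{i}.txt"), "w") as f:
        f.write(f"parameter = {i}\n")  # hypothetical per-job input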

If you want to use a GPU, “request_GPUs = 1” must be set.

TensorFlow may allocate all available GPU memory to a single job by default, regardless of how much memory the job really needs.
I know this is done for performance reasons (it reduces memory fragmentation), but the gain is very small in most cases.
Please consider setting “gpu_options.allow_growth = True”; see the sketch below.
You can check your GPU usage with the “nvidia-smi” command.
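A minimal sketch, assuming the TensorFlow 1.x session API (where gpu_options lives); in TensorFlow 2.x the equivalent is tf.config.experimental.set_memory_growth:

import tensorflow as tf

# Grow GPU memory usage on demand instead of grabbing it all at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# ... build and run your graph with this session ...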

Remove “request_GPUs” unless you need a GPU. Your jobs then run in parallel on 12 CPU cores.
Not all 24 logical cores are used, in order to preserve single-core performance and to reserve some capacity for local users.

When you want to run multithreaded jobs, setting “universe = parallel” may help, but it does not work well at the moment…

universe                = parallel      # parallel universe for multi-core jobs
executable              = a.out
arguments               = infile$(Process).txt
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
initialdir              = workdir/run$(Process)
transfer_input_files    = infile$(Process).txt
log                     = job1.log
machine_count           = 1             # keep all threads on a single machine
request_cpus            = 12            # reserve 12 cores for this job
queue 100
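For the job itself to benefit from the 12 reserved cores, it has to be written to use them. A minimal sketch of such a multi-core program in Python (the work() function and the input format are hypothetical placeholders; the input file comes in as the first argument, matching “arguments” above):

import multiprocessing as mp
import sys

def work(item):
    # hypothetical per-item computation
    return item * item

if __name__ == "__main__":
    # Read one number per line from the input file given as an argument.
    with open(sys.argv[1]) as f:
        items = [int(line) for line in f]
    # Spread the work across the cores reserved via request_cpus.
    with mp.Pool(processes=12) as pool:
        results = pool.map(work, items)
    print(results)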