It is difficult to set up many computers with powerful CPUs, GPUs, and large memory, so we must share the resources of one powerful computer.
We use the well-known job queuing system “condor”. You must use condor to run any program that occupies the CPU and/or GPU for more than 10 minutes; ten minutes of direct use should be enough for R&D.
condor_submit submits your long jobs to the CPU and GPU. See the online condor manual for details, and read the stable version.
A simple example: job1/submit.conder
universe = vanilla
executable = a.out
arguments = infile$(Process).txt
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
initialdir = workdir/run$(Process)
transfer_input_files = infile$(Process).txt
log = job1.log
request_GPUs = 1
queue 100
Submit your jobs with “condor_submit job1/submit.conder”
See the status of the compute nodes with “condor_status”
See the status of your jobs with “condor_q”
$(Process) is replaced by the process number, which runs from 0 to 99 for “queue 100”. This is useful when each job needs different input.
In most cases you will need to prepare the “initialdir” directories in advance to store outputs and logs. Because /home is shared via NFS, file transfer is not needed.
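The per-job directories can be created with a short script before submitting. The following is a minimal sketch in Python, assuming the layout workdir/run$(Process) from the example above; the function name prepare_run_dirs and its parameters are hypothetical and not part of condor.

```python
from pathlib import Path


def prepare_run_dirs(base="workdir", count=100):
    """Create base/run0 ... base/run<count-1>.

    This mirrors "initialdir = workdir/run$(Process)" with "queue 100",
    where $(Process) takes the values 0 to count-1.
    """
    created = []
    for process in range(count):
        run_dir = Path(base) / f"run{process}"
        # exist_ok=True makes the script safe to run again before a resubmit.
        run_dir.mkdir(parents=True, exist_ok=True)
        created.append(run_dir)
    return created
```

Calling prepare_run_dirs() from the directory where you run condor_submit creates workdir/run0 to workdir/run99. As far as I understand, relative paths such as infile$(Process).txt are resolved against “initialdir”, so each input file should be placed in its own run directory.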
If you want to use a GPU, “request_GPUs = 1” must be set.
By default, TensorFlow may allocate all available GPU memory for one job, regardless of how much memory the job really needs.
I know this is meant to give the best memory bandwidth performance, but the gain is very small in most cases.
Please consider setting “gpu_options.allow_growth = True”.
You can check your usage with the command “nvidia-smi”.
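A minimal sketch of where the allow_growth option goes, assuming TensorFlow 1.x style sessions reached through tf.compat.v1 (present in recent 1.x releases and in 2.x); make_session is a hypothetical helper name, not a TensorFlow function.

```python
def make_session():
    """Return a TF1-style session that grows its GPU memory on demand."""
    # Imported inside the function so the heavy TensorFlow import
    # happens only when a session is actually needed.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    # Start with a small GPU allocation and grow it as the job needs more,
    # instead of reserving nearly all GPU memory up front.
    config.gpu_options.allow_growth = True
    return tf.compat.v1.Session(config=config)
```

Code written natively for TensorFlow 2.x would instead call tf.config.experimental.set_memory_growth on each GPU device before the GPUs are first used. Either way, “nvidia-smi” should then show the job holding only the memory it has actually requested.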
Remove “request_GPUs” unless you need a GPU. Your jobs then run in parallel on 12 CPU cores.
Not all 24 logical cores are used, in order to preserve single-core performance and to reserve capacity for local users.
When you want to run multithreaded jobs, setting “universe = parallel” may help, but it does not work well at present…
universe = parallel
executable = a.out
arguments = infile$(Process).txt
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
initialdir = workdir/run$(Process)
transfer_input_files = infile$(Process).txt
log = job1.log
machine_count = 1
request_cpus = 12
queue 100