We include configuration files for known supercomputers. We hope these help both users of those machines and new users looking for examples to adapt for similar clusters.
Additional examples from other clusters are welcome here.
```yaml
distributed:
  scheduler:
    bandwidth: 1000000000  # bytes/s (1 GB/s) estimated worker-worker bandwidth
  worker:
    memory:
      target: 0.90     # Avoid spilling to disk
      spill: False     # Avoid spilling to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
  comm:
    compression: null

jobqueue:
  pbs:
    cores: 36
    memory: 108GB
    processes: 4
    interface: ib0
    local-directory: $TMPDIR
    queue: regular
    project: null  # TODO, change me
    walltime: '00:30:00'
    resource-spec: select=1:ncpus=36:mem=109G
```
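Dask reads YAML configuration from its user config directory, so a file like the one above only needs to be placed where Dask will find it. A minimal sketch, assuming the default search path `~/.config/dask/` (overridable with the `DASK_CONFIG` environment variable); the filename and the abbreviated contents here are illustrative:

```shell
# Install a jobqueue config where Dask's config system will find it.
# Assumption: the default user config directory is ~/.config/dask/.
mkdir -p ~/.config/dask
cat > ~/.config/dask/jobqueue.yaml <<'EOF'
jobqueue:
  pbs:
    cores: 36
    memory: 108GB
    processes: 4
EOF
```

With this file in place, a cluster object such as `PBSCluster()` picks up these values without needing them repeated as keyword arguments.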
Note that the following config file assumes you are running the scheduler on a worker node. Currently the login node appears unable to talk to the worker nodes bidirectionally, so you need to request an interactive node with the following:
$ salloc -N 1 -C haswell --qos=interactive -t 04:00:00
Then you will run dask-jobqueue directly on that interactive node. Note that the distributed section is set up to avoid having Dask write to disk; this works around unreliable behavior we observed on the local filesystem.
```yaml
distributed:
  worker:
    memory:
      target: False    # Avoid spilling to disk
      spill: False     # Avoid spilling to disk
      pause: 0.80      # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker

jobqueue:
  slurm:
    cores: 64
    memory: 128GB
    processes: 4
    queue: debug
    walltime: '00:10:00'
    job-extra: ['-C haswell', '-L project, SCRATCH, cscratch1']
```
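To make the mapping from config values to the submitted batch job concrete, here is a rough sketch of the kind of SBATCH header such a config translates into. This is not dask-jobqueue's actual code, and the real generated script differs in detail (the worker command line, memory rounding, task counts), but it shows where each setting ends up:

```python
# Sketch: render an approximate SBATCH header from jobqueue-style settings.
# Illustrative only -- dask-jobqueue's real job script differs in detail.
config = {
    "cores": 64,
    "memory": "128GB",
    "queue": "debug",
    "walltime": "00:10:00",
    "job-extra": ["-C haswell", "-L project, SCRATCH, cscratch1"],
}

def sbatch_header(cfg):
    lines = [
        "#!/usr/bin/env bash",
        "#SBATCH -J dask-worker",
        f"#SBATCH -p {cfg['queue']}",          # queue -> partition
        f"#SBATCH --cpus-per-task={cfg['cores']}",
        f"#SBATCH --mem={cfg['memory']}",
        f"#SBATCH -t {cfg['walltime']}",
    ]
    # Each job-extra entry becomes one additional #SBATCH line.
    lines += [f"#SBATCH {extra}" for extra in cfg["job-extra"]]
    return "\n".join(lines)

print(sbatch_header(config))
```

This is also why `job-extra` takes raw scheduler flags like `-C haswell`: they are passed through to the batch script unmodified.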
```yaml
jobqueue:
  pbs:
    name: dask-worker
    cores: 36
    memory: 270GB
    processes: 6
    interface: ib0
    local-directory: $localscratch
    queue: high_mem  # Can also select batch or gpu_ssd
    project: arm
    walltime: 00:30:00  # Adjust this to job size
    job-extra: ['-W group_list=cades-arm']
```
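In dask-jobqueue, `cores` and `memory` describe one whole batch job, which is then split evenly among `processes` worker processes. A quick arithmetic check of what each worker gets under the config above:

```python
# Each of the `processes` workers receives cores/processes threads and
# memory/processes of the job's memory allocation.
cores, processes = 36, 6
memory_gb = 270  # "270GB" from the config above

threads_per_worker = cores // processes       # threads in each worker
memory_per_worker_gb = memory_gb / processes  # memory limit per worker

print(threads_per_worker, memory_per_worker_gb)
```

So this config launches jobs of six workers with 6 threads and 45 GB each, which is worth keeping in mind when sizing chunks or partitions.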