5.590 -L NUMA Resource Request

The -L option is available in the qsub and msub commands to allow administrators to place jobs at the "task" or "OS process" level and get maximum efficiency out of the available hardware.

Using the -L option requires a basic knowledge of the topologies of the available hardware where jobs will run. You will need to know how many cores, numanodes, sockets, etc. are available on the hosts within the cluster. The -L syntax is designed to allow for a wide variety of requests. However, if requests do not match the available hardware, you may have unexpected results.

In addition, multiple, non-symmetric resource requests can be made for the same job using the -L job submission syntax.

For example, the following command:

qsub -L tasks=4:lprocs=2:usecores:memory=500mb -L tasks=8:lprocs=4:memory=2gb

Creates two requests. The first request creates four tasks, each with two logical processors and 500 mb of memory; the logical processors are placed on cores. The second request creates eight tasks, each with four logical processors and 2 gb of memory; the logical processors may be placed on cores or threads, since the default placement is allowthreads.

This topic provides the -L option syntax and a description of the valid value and allocation options.

5.590.1 Syntax

-L tasks=#[:lprocs=#|all]
[:{usecores|usethreads|allowthreads}]
[:place={socket|numanode|core|thread}[=#]|node][:memory=#][:swap=#][:maxtpn=#][:gpus=#[:<mode>]][:mics=#][:gres=<gres>][:feature=<feature>]
[[:{cpt|cgroup_per_task}]|[:{cph|cgroup_per_host}]]

5.590.2 Valid Value

Value Description
tasks Specifies the number of job tasks; the rest of the resource request describes the resources needed by a single task.
  • Distributed memory systems - A single task must run within a single compute node/server; i.e., the task's resources must all come from the same compute node/server.
  • Shared memory systems - A single task may run on multiple compute nodes; i.e., the task's resources may come from multiple compute nodes.
This option is required for task-based resource allocation and placement.
qsub -L tasks=4

Creates four tasks, each with one logical processor. The tasks can run on a core or a thread (the default is allowthreads).

5.590.3 Available Options

The following table identifies the allocation options you can specify per task.

Value Description
lprocs

Specifies the quantity of "logical processors" required by a single task to which it will be pinned by its control-group (cgroup).

The "place" value specifies the total number of physical cores/threads to which a single task has exclusive access. The lprocs= keyword indicates the actual number of cores/threads to which the task has exclusive access for the task's cgroup to pin to the task.

  • When :lprocs is specified, and nothing is specified for #, the default is 1.
  • When :lprocs=all is specified, all cores or threads within the resource locality (or localities) specified by the place option are eligible for task placement. The user has not requested a specific quantity; the application takes whatever logical processors it receives within the allocated locality or localities, which cannot exceed one node.

qsub -L tasks=1:lprocs=4

One task is created and allocated four logical processors. When the job is executed, the pbs_mom where the job is running will create a cpuset with four processors in the set. Torque will make a best effort to allocate the four processors next to each other, but the placement is not guaranteed.

qsub -L tasks=1:lprocs=all:place=node

Places one task on a single node, and places all processing units in the cpuset of the task. The "lprocs=all" parameter specifies that the task will use all cores and/or threads available on the resource level requested.

usecores, usethreads, allowthreads

The usecores, usethreads, and allowthreads parameters are used to indicate whether the cgroup pins cores, threads, or either to a task, respectively. If no logical processor definition is given, the default is allowthreads for backward-compatible Moab scheduler and Torque resource manager behavior.

In this context, "cores" means an AMD Opteron core, a hyperthread-disabled Intel Xeon core, or thread 0 and only thread 0 of a hyperthread-enabled Intel Xeon core. The term "threads" refers to a hyperthread-enabled Intel Xeon thread. Likewise, "either" refers to an AMD Opteron core, a hyperthread-disabled Intel Xeon core, or any thread of a hyperthread-enabled Intel Xeon.

  • :usecores – Denotes that the logical processor definition for a task resource request is a physical core. This means if a core has hyper-threading enabled, the task will use only thread 0 of the core.
    qsub -L tasks=2:lprocs=2:usecores

    Two tasks are allocated with two logical processors per task. The usecores parameter indicates that the processor type must be a core or thread 0 of a hyper-threaded core.

  • :usethreads – Specifies that the logical processor definition for a task resource request is a hardware-based thread or virtual core.
    qsub -L tasks=2:lprocs=2:usethreads

    Two tasks are allocated with two logical processors per task. The usethreads parameter indicates that any type of hardware-based thread or virtual core can be used.

  • :allowthreads – Specifies that the logical processor definition for a task resource request can be either a physical core (e.g. AMD Opteron), or hardware-based thread of a core (hyperthread-enabled Intel Xeon).
    qsub -L tasks=2:lprocs=2:allowthreads

    Two tasks are allocated with two logical processors per task. The allowthreads parameter indicates that hardware threads or cores can be used.

place

Specifies placement of a single task on the hardware; specifically, it designates the hardware resource locality level and the quantity of locality-level resources. Placement at a specific locality level is always exclusive, meaning a job task has exclusive use of all logical processor and physical memory resources at the specified level of resource locality, even if it does not use them.

 

Valid Options:

If a place option is not specified, placement is governed by the usecores, usethreads, and allowthreads parameters.

  • socket[=#] – Refers to a socket within a compute node/server and specifies that each task is placed at the socket level with exclusive use of all logical processors and memory resources of the socket(s) allocated to a task. If a count is not specified, the default setting is 1.
    qsub -L tasks=2:lprocs=4:place=socket

    Two tasks are allocated with four logical processors each. Each task is placed on a socket where it will have exclusive access to all of the cores and memory of the socket. Although the socket may have more cores/threads than four, only four cores/threads will be bound in a cpuset per task per socket as indicated by "lprocs=4".

  • numanode[=#] – Refers to a numanode within a socket and specifies that each task is placed at the NUMA node level within a socket, with exclusive use of all logical processor and memory resources of the numanode(s) allocated to the task. If a count is not given, the default value is 1. A socket that is not divided into multiple numanodes is treated as containing one numanode.

    To illustrate the locality level to which this option refers, the following examples are provided:

    First, a Haswell-based Intel Xeon v3 processor with 10 or more cores is divided internally into two separate "nodes", each with an equal quantity of cores and its own local memory (referred to as a "numanode" in this topic).

    Second, an AMD Opteron 6xxx processor is a "multi-chip module" that contains two separate physical silicon chips each with its own local memory (referred to as a "numanode" in this topic).

    In both of the previous examples, a core in one "node" of the processor can access its own local memory faster than it can access the remote memory of the other "node" in the processor, which results in NUMA behavior.

    qsub -L tasks=2:lprocs=4:place=numanode

    Allocates two tasks with four logical processors each. Each task is placed on its own numanode and has exclusive use of all the logical processors and memory of that numanode.

    qsub -L tasks=2:lprocs=all:place=numanode=2

    Allocates two tasks, each receiving two numanodes. The "lprocs=all" specification indicates that all of the cores/threads of each task's numanodes will be bound in that task's cpuset.

  • core[=#] – Refers to a core within a numanode or socket and specifies each task is placed at the core level within the numanode or socket and has exclusive use of all logical processor and memory resources of the core(s) allocated to the task. Whether a core has SMT/hyper-threading enabled or not is irrelevant to this locality level. If a number of cores is not specified, it will default to the number of lprocs specified.

    The number of cores specified must be greater than or equal to the number of lprocs; otherwise, the job submission will be rejected.

    qsub -L tasks=2:place=core=2

    Two tasks with one logical processor each will be placed on two cores per task.

    qsub -L tasks=2:lprocs=2:place=core

    Two tasks are allocated with two logical processors per task. Each logical processor will be assigned its own core (two cores total, the same as the number of lprocs). Torque will attempt to place the logical processors on non-adjacent cores.

  • thread[=#] – Specifies each task is placed at the thread level within a core and has exclusive use of all logical processor and memory resources of the thread(s) allocated to a task.

    This affinity level refers to threads within a core and is applicable only to nodes with SMT or hyper-threading enabled. If a node does not have SMT or hyper-threading enabled, Moab will consider the node ineligible when allocating resources for a task. If a specific number of threads is not specified, it defaults to the number of lprocs specified.

    qsub -L tasks=2:lprocs=4:place=thread

    Allocates two tasks, each with four logical processors, which can be bound to any thread. Torque will make a best effort to bind the threads on the same numanode, but placement is not guaranteed. Because a thread count is not specified, Torque will place as many threads as the number of lprocs requested.

  • node – Specifies that each task is placed at the node level and has exclusive use of all the resources of the node(s) allocated to a task. This locality level usually refers to a physical compute node, blade, or server within a cluster.
    qsub -L tasks=2:lprocs=all:place=node

    Two tasks are allocated, one task per node, where each task has exclusive access to all the resources on its node. The "lprocs=all" specification directs Torque to create a cpuset with all of the processing units on the node. The "place=node" specification also claims all of the memory of the node/server.

memory

"memory" is roughly equivalent to the mem request for the qsub/msub -l resource request. However, with the -L qsub syntax, cgroups monitors the job memory usage and puts a ceiling on resident memory for each task of the job.

Specifies the maximum resident memory allocated per task. Allowable suffixes are kb (kilobytes), mb (megabytes), gb (gigabytes), tb (terabytes), pb (petabytes), and eb (exabytes). If a suffix is not provided, mb (megabytes) is the default.

If a task uses more resident memory than specified, the excess memory is moved to swap.

qsub -L tasks=4:lprocs=2:usecores:memory=1gb

Allocates four tasks with two logical processors each. Each task is given a limit of 1 gb of resident memory.

qsub -L tasks=2:memory=3500

Allocates two tasks with 3500 mb of resident memory each (no suffix was specified, so megabytes is assumed).

swap

Specifies the maximum allocated resident memory and swap space allowed per task.

Allowable suffixes are kb (kilobytes), mb (megabytes), gb (gigabytes), tb (terabytes), pb (petabytes), and eb (exabytes). If a suffix is not given, mb (megabytes) is assumed.

 

If a task exceeds the specified limit, the task will be killed; the associated job will be terminated.

If the swap limit is unable to be set, the job will still be allowed to run. All other cgroup-related failures will cause the job to be rejected.

 

When requesting swap, it is not required that you give a value for the :memory option.

  • If using :swap without a specified :memory value, Torque will supply a memory value up to the value of :swap, but not larger than the available physical memory.

    qsub -L tasks=4:lprocs=2:swap=4gb

    Allocates four tasks with two logical processors each. Each task is given a combined limit of 4 gb of resident memory and swap space. If a task exceeds the limit, the task is terminated.

  • If using :swap with a specified :memory value, Torque will supply resident memory only up to the :memory value; the rest of the limit can be supplied only from swap space.

    The :memory value must be smaller than or equal to the :swap value.

    qsub -L tasks=2:memory=3.5gb:swap=5gb

    Allocates two tasks, each with up to 3.5 gb of resident memory and a maximum of 5 gb of swap. If a task exceeds 3.5 gb of resident memory, the excess will be moved to swap space. However, if the task exceeds 5 gb of total swap, the task and job will be terminated.

maxtpn Specifies the maximum number of tasks per node, where "#" is the maximum number of tasks allocated per physical compute node. This restricts a task type to no more than "#" tasks per compute node and allows it to share a node with other task types or jobs. For example, a communication-intensive task may share a compute node with computation-intensive tasks.

The number of nodes and tasks per node will not be known until the job is run.

qsub -L tasks=7:maxtpn=4

Allocates seven tasks but a maximum of four tasks can run on a single node.

gpus Specifies the quantity of GPU accelerators to allocate to a task, which requires placement at the locality level to which an accelerator is connected, or higher. <mode> can be exclusive_process, exclusive_thread, or reseterr.

The task resource request must specify placement at the numanode (AMD only), socket, or node level. place=core and place=thread are invalid placement options when a task requests a PCIe-based accelerator device: allowing other tasks to use cores and threads on the same NUMA chip or socket as the task with the PCIe device(s) would violate the principle of consistent job execution time, since those tasks would likely interfere with the data transfers between the task's logical processors and its allocated accelerator(s).

:gpus=1

Allocates one GPU per task.

:gpus=2:exclusive_process:reseterr

Allocates two GPUs per task with exclusive access by process and resets error counters.
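
For illustration, the following full submission combines a GPU request with a valid placement level; the task count, lprocs value, and socket placement are assumptions about the target hardware, not requirements of the gpus option.

qsub -L tasks=2:lprocs=4:place=socket:gpus=1:exclusive_process

Allocates two tasks with four logical processors each, places each task on its own socket, and gives each task one GPU in exclusive_process mode.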

mics Specifies the quantity of Intel MIC accelerators to allocate to a task, which requires placement at the locality-level to which a MIC is connected or higher.

The task resource request must specify placement at the numanode (AMD only), socket, or node level. As with GPUs, place=core and place=thread are invalid placement options when a task requests a PCIe-based accelerator device, because other tasks using cores and threads on the same NUMA chip or socket would likely interfere with the data transfers between the task's logical processors and its allocated accelerator(s), violating the principle of consistent job execution time.

Allocating resources for MICs operates in the exact same manner as for GPUs. See gpus.

:mics=1

Allocates one MIC per task.

:mics=2

Allocates two MICs per task.
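
As with GPUs, a full submission must place the task at the numanode, socket, or node level. The following sketch assumes a cluster where each node has at least one MIC; the task and lprocs counts are arbitrary examples.

qsub -L tasks=2:lprocs=4:place=node:mics=1

Allocates two tasks with four logical processors each, places one task per node, and gives each task one MIC.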

gres Specifies the quantity of a specific generic resource <gres> to allocate to a task. If a quantity is not given, it defaults to one.

Specify multiple GRES by separating them with commas and enclosing all the GRES names, their quantities, and the commas within single quotation marks.

:gres=matlab=1

Allocates one Matlab license per task.

:gres='dvd,blu=2'

Allocates one DVD drive and two Blu-ray drives per task, represented by the "dvd" and "blu" generic resource names, respectively.

When scheduling, if a generic resource is node-locked, only compute nodes with the generic resource are eligible for allocation to a job task. If a generic resource is floating, it does not qualify or disqualify compute node resources from allocation to a job task.
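
For illustration, a full submission that attaches a generic resource to each task might look like the following sketch; the "matlab" generic resource name is an assumption and must be defined on the cluster.

qsub -L tasks=4:lprocs=2:gres=matlab

Allocates four tasks with two logical processors each and one matlab generic resource per task (the quantity defaults to one).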

feature Specifies one or more node feature names used to qualify compute nodes for task resources; i.e., a compute node must have all ("&") or at least one ("|") of the specified feature name(s) assigned, or the compute node's resources are ineligible for allocation to a job task.
:feature=bigmem
:feature='bmem&fio'
:feature='bmem|fio'
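
For illustration, a full submission using a feature constraint might look like the following sketch; the "bigmem" feature name is an assumption and must be assigned to nodes in the cluster.

qsub -L tasks=2:lprocs=4:feature=bigmem

Allocates two tasks with four logical processors each and restricts placement to compute nodes that have the bigmem feature assigned.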
cpt, cgroup_per_task, cph, cgroup_per_host

Specifies whether cgroups are created per task or per host. If submitting using msub, this information is passed through to Torque; it has no effect on Moab operations.

This option lets you specify how cgroups are created during job submission and can be used to override the Torque cgroup_per_task server parameter. If this option is not specified, the server parameter value is used. See Server Parameters for more information.

  • :cpt, :cgroup_per_task – The job will have one cgroup created per task.
  • :cph, :cgroup_per_host – The job will have one cgroup created per host; this is similar to pre-6.0 cpuset implementations.

Some MPI implementations launch only one process through the TM API and then fork each subsequent process that should be launched on that host. If such a job is set to have one cgroup per task, all of the processes on that host will be placed in the first task's cgroup. In that case, confirm that the cgroup_per_task Torque server parameter is set to FALSE (the default) or specify :cph or :cgroup_per_host at job submission.

If you know that your MPI implementation will communicate each process launch to the mom individually, set the cgroup_per_task Torque server parameter to TRUE or specify :cpt or :cgroup_per_task at job submission.
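
For illustration, the following sketches show how the cgroup behavior can be selected explicitly at submission time; the task and lprocs counts are arbitrary examples.

qsub -L tasks=4:lprocs=2:cpt

qsub -L tasks=4:lprocs=2:cgroup_per_host

The first request creates one cgroup per task; the second creates a single cgroup per host, overriding the cgroup_per_task server parameter for this job.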

 
