Advanced TORQUE Administration is a video tutorial of a session offered at Moab Con that offers further details on advanced TORQUE administration.
This collection of documentation for the TORQUE resource manager is intended as a reference for both users and system administrators.
The 1.0 Overview section provides the details for installation and initialization, advanced configuration options, and (optional) qmgr options necessary to get the system up and running. System Testing is also covered.
The 2.0 Submitting and Managing Jobs section covers different actions applicable to jobs. The first section, 2.1 Job Submission, details how to submit a job and request resources (nodes, software licenses, and so forth) and provides several examples. Other actions include monitoring, canceling, and preempting jobs, and keeping completed jobs.
The 3.0 Managing Nodes section covers administrator tasks relating to nodes, including adding nodes, changing node properties, and identifying node state. Section 3.4 Host Security explains how to configure restricted user access to nodes.
The 4.0 Setting Server Policies section details server-side configuration of queues and high availability.
The 5.0 Interfacing with a Scheduler section offers information about using the native scheduler versus an advanced scheduler.
The 6.0 Configuring Data Management section deals with issues of data management. For non-network file systems, the SCP/RCP Setup section details setting up SSH keys and nodes to automate transferring data. The NFS and Other Networked File Systems section covers configuration for these file systems. This chapter also addresses the use of File Stage-In/Stage-Out using the stagein and stageout directives of the qsub command.
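As a rough illustration of the stagein and stageout directives mentioned above, the following Python sketch assembles a qsub command line without running it. It assumes the directives are passed through qsub's -W option in the form local_path@hostname:remote_path; the helper name build_qsub_command, the host name, and the file paths are hypothetical and chosen only for this example. The 6.0 chapter remains the authoritative reference for the exact syntax supported by a given TORQUE version.

```python
import shlex


def build_qsub_command(script, stagein=None, stageout=None):
    """Build (but do not run) a qsub argument list with optional file staging.

    stagein and stageout are (local_path, hostname, remote_path) tuples.
    Assumption: each directive is passed as -W stagein=... or -W stageout=...
    in the form local_path@hostname:remote_path.
    """
    argv = ["qsub"]
    if stagein is not None:
        local, host, remote = stagein
        argv += ["-W", f"stagein={local}@{host}:{remote}"]
    if stageout is not None:
        local, host, remote = stageout
        argv += ["-W", f"stageout={local}@{host}:{remote}"]
    argv.append(script)
    return argv


# Hypothetical host and paths, for illustration only.
example = build_qsub_command(
    "job.sh",
    stagein=("/tmp/input.txt", "headnode", "/home/user/input.txt"),
    stageout=("/tmp/output.txt", "headnode", "/home/user/output.txt"),
)
# shlex.join renders the list as a shell-safe command string for inspection.
print(shlex.join(example))
```

Building the argument list separately from executing it makes it possible to inspect or log the command before handing it to something like subprocess.run on a host where qsub is installed.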
The 7.0 Interfacing with Message Passing section offers details on supporting MPI (Message Passing Interface).
The 8.0 Managing Resources section covers configuration, utilization, and states of resources.
The 9.0 Accounting section explains how jobs are tracked by TORQUE for accounting purposes.
The 10.0 Troubleshooting section offers help with general problems. It includes an FAQ (Frequently Asked Questions) list and instructions for setting up and using compute node checks and for debugging TORQUE.
The numerous appendices provide tables of commands, parameters, configuration options, error codes, the Quick Start Guide, and so forth.