Several steps are required to ensure that the server and the nodes are aware of each other and can communicate directly. Some of this configuration takes place directly within TORQUE using the qmgr command. Other settings are managed through the pbs_server nodes file, name resolution files such as /etc/hosts, and the /etc/hosts.equiv file.
Each node, as well as the server, must be able to resolve the name of every node with which it will interact. This can be accomplished using /etc/hosts, DNS, NIS, or other mechanisms. In the case of /etc/hosts, the file can be shared across systems in most cases.
A simple method of checking proper name service configuration is to verify that the server and the nodes can "ping" each other.
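For example, a shared /etc/hosts file might contain entries like the following (the addresses and hostnames here are only placeholders for your own systems):

10.0.0.1    headnode.myorganization.com    headnode
10.0.0.11   node01.myorganization.com      node01
10.0.0.12   node02.myorganization.com      node02

With name resolution in place, a quick check from the server would be:

> ping node01.myorganization.com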
Configuring job submission hosts
When jobs can be submitted from several different hosts, these hosts should be trusted via the R* commands (such as rsh and rcp). This can be enabled by adding the hosts to the /etc/hosts.equiv file of the machine executing the pbs_server daemon or using other R* command authorization methods. The exact specification can vary from OS to OS (see the man page for ruserok to find out how your OS validates remote users). In most cases, configuring this file is as simple as adding a line to your /etc/hosts.equiv file, as in the following:
/etc/hosts.equiv:
#[+ | -] [hostname] [username]
mynode.myorganization.com
...
Either the hostname or the username field may be replaced with a wildcard symbol (+). The (+) may be used as a stand-alone wildcard, but not joined to a username or hostname (for example, +node01 or +user01 is not valid). However, a (-) may be used in that manner to specifically exclude a user.
Following the Linux man page instructions for hosts.equiv may result in a failure. You cannot precede the user or hostname with a (+). To clarify, node1 +user1 will not work and user1 will not be able to submit jobs.
For example, the following lines will not work or will not have the desired effect:
+node02 user1
node02 +user1
These lines will work:
node03 +
+ jsmith
node04 -tjones
The most restrictive rules must precede more permissive rules. For example, to restrict user tsmith but allow all others, follow this format:
node01 -tsmith
node01 +
Note that when a hostname is specified, it must be the fully qualified domain name (FQDN) of the host. Job submission can be further secured using the server or queue acl_hosts and acl_host_enable parameters (for details, see Queue attributes).
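For example, to restrict a queue to a single trusted submission host, a sketch using qmgr might look like the following (the queue name batch and the hostname are placeholders, not defaults):

> qmgr -c 'set queue batch acl_host_enable = true'
> qmgr -c 'set queue batch acl_hosts = login01.myorganization.com'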
Using the "submit_hosts" service parameter
Trusted submit host access may be directly specified without using RCmd authentication by setting the server submit_hosts parameter via qmgr as in the following example:
> qmgr -c 'set server submit_hosts = host1'
> qmgr -c 'set server submit_hosts += host2'
> qmgr -c 'set server submit_hosts += host3'
Use of submit_hosts is potentially subject to DNS spoofing and should not be used outside of controlled and trusted environments.
Allowing job submission from compute hosts
If preferred, all compute nodes can be enabled as job submit hosts without setting .rhosts or hosts.equiv by setting the allow_node_submit parameter to true.
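For example:

> qmgr -c 'set server allow_node_submit = true'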
Configuring TORQUE on a multi-homed server
If the pbs_server daemon is to be run on a multi-homed host (a host possessing multiple network interfaces), the interface to be used can be explicitly set using the SERVERHOST parameter.
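SERVERHOST is set in the torque.cfg file in TORQUE's home directory (/var/spool/torque by default). As a sketch, with a placeholder hostname bound to the desired interface:

SERVERHOST torque-server.myorganization.com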
With some versions of Mac OS X, you must add the line $restricted *.<DOMAIN> to the pbs_mom configuration file to work around socket bind bugs in the OS.
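For example, assuming a placeholder domain, the line added to the pbs_mom configuration file (mom_priv/config under TORQUE's home directory) would be:

$restricted *.myorganization.com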
Specifying non-root administrators
By default, only root is allowed to start, configure, and manage the pbs_server daemon. Additional trusted users can be authorized using the managers and operators parameters. To configure these parameters, use the qmgr command, as in the following example:
> qmgr
Qmgr: set server managers += josh@*.fsc.com
Qmgr: set server operators += josh@*.fsc.com
All manager and operator specifications must include a user name and either a fully qualified domain name or a host expression.
To enable all users to be trusted as both operators and administrators, place the + (plus) character on its own line in the server_priv/acl_svr/operators and server_priv/acl_svr/managers files.
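As a minimal sketch (assuming the default TORQUE home directory of /var/spool/torque), each of /var/spool/torque/server_priv/acl_svr/managers and /var/spool/torque/server_priv/acl_svr/operators would then contain only the single line:

+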
Moab relies on emails from TORQUE about job events. To set up email, do the following:
To set up email

1. Use the --with-sendmail option at configure time so that TORQUE knows where the sendmail executable is located:

> ./configure --with-sendmail=<path_to_executable>

2. Set mail_domain in your server settings:

> qmgr -c 'set server mail_domain=clusterresources.com'

3. (Optional) Override the default mail_body_fmt and mail_subject_fmt values via qmgr:

> qmgr -c 'set server mail_body_fmt=Job: %i \n Name: %j \n On host: %h \n \n %m \n \n %d'
> qmgr -c 'set server mail_subject_fmt=Job %i - %r'
By default, users receive e-mails on job aborts. Each user can select which kind of e-mails to receive by using the qsub -m option when submitting the job. If you want to dictate when each user should receive e-mails, use a submit filter (for details, see Job submission filter ("qsub wrapper")).
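For example, a user who wants e-mail when the job begins, ends, or aborts could submit with the -m option (the script name is a placeholder):

> qsub -m abe myjob.sh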
MUNGE is an authentication service that creates and validates user credentials. It was developed by Lawrence Livermore National Laboratory (LLNL) to be highly scalable so it can be used in large environments such as HPC clusters. To learn more about MUNGE and how to install it, see http://code.google.com/p/munge/.
Configuring TORQUE to use MUNGE is a compile-time operation. When you are building TORQUE, use --enable-munge-auth as a command line option with ./configure.
> ./configure --enable-munge-auth
You can use only one authorization method at a time. If --enable-munge-auth is configured, the privileged port ruserok method is disabled.
TORQUE does not link any part of the MUNGE library into its executables. It calls the MUNGE and UNMUNGE utilities which are part of the MUNGE daemon. The MUNGE daemon must be running on the server and all submission hosts. The TORQUE client utilities call MUNGE and then deliver the encrypted credential to pbs_server where the credential is then unmunged and the server verifies the user and host against the authorized users configured in serverdb.
Authorized users are added to serverdb using qmgr and the authorized_users parameter. The syntax for authorized_users is authorized_users=<user>@<host>. To add an authorized user to the server you can use the following qmgr command:
> qmgr -c 'set server authorized_users=user1@hosta'
> qmgr -c 'set server authorized_users+=user2@hosta'
The previous example adds user1 and user2 from hosta to the list of authorized users on the server. Users can be removed from the list of authorized users by using the -= syntax as follows:
> qmgr -c 'set server authorized_users-=user1@hosta'
Users must be added with the <user>@<host> syntax. The user and the host portion can use the '*' wildcard to allow multiple names to be accepted with a single entry. A range of user or host names can be specified using a [a-b] syntax where a is the beginning of the range and b is the end.
> qmgr -c 'set server authorized_users=user[1-10]@hosta'
This allows user1 through user10 on hosta to run client commands on the server.
The MOM hierarchy allows you to override the compute nodes' default behavior of reporting status updates directly to the pbs_server. Instead, you configure compute nodes so that each node sends its status update information to another compute node. The compute nodes pass the information up a tree or hierarchy until eventually the information reaches a node that will pass the information directly to pbs_server. This can significantly reduce traffic and time required to keep the cluster status up to date.
The file that contains the configuration information is named mom_hierarchy. By default, it is located in the /var/spool/torque/server_priv directory. The file uses syntax similar to XML:
<path>
  <level>comma-separated node list</level>
  <level>comma-separated node list</level>
  ...
</path>
...
The <path></path> tag pair identifies a group of compute nodes. The <level></level> tag pair contains a comma-separated list of compute node names. Multiple paths can be defined with multiple levels within each path.
Within a <path> tag pair, the levels define the hierarchy. All nodes in the top level communicate directly with the server. All nodes in lower levels communicate with the first available node in the level directly above them. If the first node in the upper level goes down, the nodes in the subordinate level then communicate with the next node in the upper level. If no nodes are available in an upper level, the node communicates directly with the server.
If an upper-level node goes down and later becomes available again, the lower-level nodes will eventually detect that it is available and resume sending their updates to that node.
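For example, a small mom_hierarchy file with hypothetical node names might look like the following; node01 and node02 report directly to pbs_server, while node03 through node05 report to the first available of node01 and node02:

<path>
  <level>node01,node02</level>
  <level>node03,node04,node05</level>
</path>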