Introduction
This test determines whether the proper environment for checkpointing has been established.
Test steps
Submit a test job and then issue a hold on the job.
> qsub -c enabled test.sh
999.xxx.yyy
> qhold 999
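The two steps above can be sketched as a small shell helper. This is only an illustration: the function name submit_and_hold is hypothetical, and it assumes that qsub prints the new job identifier on standard output, as in the transcript above.

```shell
#!/bin/sh
# Sketch: submit a checkpoint-enabled job, then place a hold on it.
# submit_and_hold is a hypothetical helper name, not part of TORQUE.
submit_and_hold() {
    script=$1
    # qsub prints the new job identifier (for example 999.xxx.yyy) on stdout.
    job_id=$(qsub -c enabled "$script") || return 1
    # A successful qhold normally prints nothing; send anything it does
    # print to stderr so that only the job identifier reaches stdout.
    qhold "$job_id" >&2 || return 1
    printf '%s\n' "$job_id"
}

# Usage (requires a working TORQUE installation):
#   job_id=$(submit_and_hold test.sh)
```

In practice the job probably needs to be running before the hold can produce a checkpoint, so a short delay between the two commands may be necessary.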
Possible failures
Normally qhold produces no output. If an error message is produced saying that qhold is not a supported feature, then one of the following configuration errors might be present.
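Because a successful qhold is expected to be silent, a script can treat any output or a nonzero exit status as a sign of trouble. The following is a minimal sketch under that assumption; the helper name hold_is_quiet is hypothetical, and the exact wording of the error message is deliberately not matched.

```shell
#!/bin/sh
# Sketch: run qhold and report whether it stayed silent and exited with 0.
# hold_is_quiet is a hypothetical helper name, not part of TORQUE.
hold_is_quiet() {
    job_id=$1
    # Capture stdout and stderr together; any text at all is treated as suspicious.
    if output=$(qhold "$job_id" 2>&1) && [ -z "$output" ]; then
        return 0
    fi
    printf 'qhold reported a problem for job %s: %s\n' "$job_id" "$output" >&2
    return 1
}

# Usage (requires a working TORQUE installation):
#   hold_is_quiet 999 || echo "check the checkpoint configuration"
```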
Successful results
If no directory was configured for the checkpoint file, the default location is under the TORQUE directory, which in this example is /var/spool/torque/checkpoint.
Otherwise, go to the directory specified for the checkpoint image files. That directory is specified either with an option on job submission, for example -c dir=/home/test, or by setting an attribute on the execution queue with the command qmgr -c 'set queue batch checkpoint_dir=/home/test'.
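The lookup described above, a queue-level checkpoint_dir if one is set and the default otherwise, might be scripted as follows. This sketch assumes that qmgr -c 'print queue batch' prints attributes as lines of the form "set queue batch checkpoint_dir = /home/test". The function name queue_checkpoint_dir and the variable DEFAULT_CKPT_DIR are hypothetical, and a per-job -c dir= option is not considered here.

```shell
#!/bin/sh
# Sketch: pick the checkpoint directory for a queue, falling back to the default.
# Reads the output of "qmgr -c 'print queue <name>'" on stdin (assumed format:
# "set queue <name> checkpoint_dir = <path>").
# DEFAULT_CKPT_DIR and queue_checkpoint_dir are hypothetical names.
DEFAULT_CKPT_DIR=/var/spool/torque/checkpoint

queue_checkpoint_dir() {
    # Strip everything up to the value on the checkpoint_dir line, if present.
    dir=$(sed -n 's/^set queue [^ ]* checkpoint_dir *= *//p' | head -n 1)
    printf '%s\n' "${dir:-$DEFAULT_CKPT_DIR}"
}

# Usage (requires a working TORQUE installation):
#   qmgr -c 'print queue batch' | queue_checkpoint_dir
```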
Doing a directory listing shows the following.
# find /var/spool/torque/checkpoint
/var/spool/torque/checkpoint
/var/spool/torque/checkpoint/999.xxx.yyy.CK
/var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630
# find /var/spool/torque/checkpoint |xargs ls -l
-r-------- 1 root root 543779 2008-03-11 14:17 /var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630

/var/spool/torque/checkpoint:
total 4
drwxr-xr-x 2 root root 4096 2008-03-11 14:17 999.xxx.yyy.CK

/var/spool/torque/checkpoint/999.xxx.yyy.CK:
total 536
-r-------- 1 root root 543779 2008-03-11 14:17 ckpt.999.xxx.yyy.1205266630
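The listing above suggests a layout of one JOBID.CK directory per job containing files named ckpt.JOBID.TIMESTAMP. Assuming that layout holds, the presence of a checkpoint image could be checked with a sketch like the one below. The helper names list_checkpoint_files and has_checkpoint are hypothetical.

```shell
#!/bin/sh
# Sketch: look for checkpoint image files for one job.
# Assumes the layout <checkpoint root>/<job id>.CK/ckpt.<job id>.<timestamp>
# seen in the listing; both helper names are hypothetical.
list_checkpoint_files() {
    ckpt_root=$1
    job_id=$2
    # Errors (such as a missing .CK directory) are silenced; no output means no files.
    find "$ckpt_root/$job_id.CK" -type f -name "ckpt.$job_id.*" 2>/dev/null
}

has_checkpoint() {
    # Succeeds when at least one checkpoint file was found.
    [ -n "$(list_checkpoint_files "$1" "$2")" ]
}

# Usage:
#   has_checkpoint /var/spool/torque/checkpoint 999.xxx.yyy && echo "checkpoint found"
```

The image files in the listing are readable only by root, but listing them needs only access to the directories, which are world-readable in this example.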
Running qstat -f should show the job in a held state (job_state = H). Note that the attribute checkpoint_name is set to the name of the file seen above.
If a checkpoint directory has been specified, there will also be an attribute checkpoint_dir in the output of qstat -f.
$ qstat -f
Job Id: 999.xxx.yyy
    Job_Name = test.sh
    Job_Owner = [email protected]
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:06
    job_state = H
    queue = batch
    server = xxx.yyy
    Checkpoint = u
    ctime = Tue Mar 11 14:17:04 2008
    Error_Path = xxx.yyy:/home/test/test.sh.e999
    exec_host = test/0
    Hold_Types = u
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Mar 11 14:17:10 2008
    Output_Path = xxx.yyy:/home/test/test.sh.o999
    Priority = 0
    qtime = Tue Mar 11 14:17:04 2008
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 9402
    substate = 20
    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=test,
        PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
        bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
        PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
        PBS_O_QUEUE=batch
    euser = test
    egroup = test
    hashname = 999.xxx.yyy
    queue_rank = 3
    queue_type = E
    comment = Job started on Tue Mar 11 at 14:17
    exit_status = 271
    submit_args = test.sh
    start_time = Tue Mar 11 14:17:04 2008
    start_count = 1
    checkpoint_dir = /var/spool/torque/checkpoint/999.xxx.yyy.CK
    checkpoint_name = ckpt.999.xxx.yyy.1205266630
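The two attributes of interest in the qstat -f output, job_state and checkpoint_name, can be pulled out with a small filter. The sketch below assumes each attribute appears on its own line as "name = value"; values that wrap onto continuation lines, such as Variable_List, are not handled. The helper names qstat_attr and job_is_held_with_checkpoint are hypothetical.

```shell
#!/bin/sh
# Sketch: extract single-line attributes from "qstat -f" output read on stdin.
# qstat_attr and job_is_held_with_checkpoint are hypothetical helper names.
qstat_attr() {
    attr=$1
    # Match an optionally indented "attr = value" line and print only the value.
    sed -n "s/^[[:space:]]*${attr} = //p" | head -n 1
}

job_is_held_with_checkpoint() {
    # Read the whole report once so it can be searched twice.
    details=$(cat)
    state=$(printf '%s\n' "$details" | qstat_attr job_state)
    name=$(printf '%s\n' "$details" | qstat_attr checkpoint_name)
    # Success means the job is held (H) and a checkpoint file name is recorded.
    [ "$state" = "H" ] && [ -n "$name" ]
}

# Usage (requires a working TORQUE installation):
#   qstat -f 999 | job_is_held_with_checkpoint && echo "job held with checkpoint"
```

Attribute names are matched case-sensitively, so the Checkpoint attribute is not confused with checkpoint_name or checkpoint_dir.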
© 2012 Adaptive Computing