
24.4 Staging Data to or from a Shared File System in a Grid

To stage data to or from a shared file system in a grid

  1. If you have not already done so, configure your SSH keys and moab.cfg to support data staging. See Configuring the SSH keys for the Data Staging Transfer Script and Configuring Data Staging for more information.
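     The transfer scripts run as the submitting user and copy files over SSH, so that user needs passwordless SSH access between the source host and the destination cluster(s). The commands below are a minimal sketch only; the user and host names are hypothetical, and the authoritative procedure is in Configuring the SSH keys for the Data Staging Transfer Script.

       > ssh-keygen -t rsa                  # one-time key generation for the submitting user
       > ssh-copy-id annasmith@labs         # authorize the key on the source host
       > ssh-copy-id annasmith@clusterhead  # authorize the key on the destination cluster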
  2. Create your job templates for data staging jobs in moab.cfg. The templates in the example below create a compute job that stages data in before it starts and stages data out when it completes. For more information about creating job templates, see About Job Templates.

    1. Create a selectable master template, called ds in the example below, that creates the stage in and stage out system jobs. This name should match the DEFAULT_TEMPLATE value in ds_config.py. For more information, see Configuring Data Staging with Advanced Options.
    2. For the data stage in job template, called dsin in the example below, specify that it creates a data staging system job by setting DATASTAGINGSYSJOB to TRUE. Note that the name of this job template must match the name of the data stage in job template referenced in the master template.
    3. Set the staging job template's bandwidth GRES to the amount of bandwidth a single stage in job should use. This value indicates how many of the bandwidth units defined with NODECFG[GLOBAL] (see Configuring Data Staging) a data staging job created from this template consumes.
    4. Set JOBMIGRATEPOLICY to JUSTINTIME.
    5. Add FLAGS=GRESONLY to indicate that this data staging job does not require any compute resources.

    6. Create a trigger that executes the ds_move_scp, ds_move_rsync, or ds_move_multiplex script, depending on which file transfer utility you use. Set the attacherror, objectxmlstdin, and user flags so that any trigger stderr is attached as a message to the job, the job XML is passed to the script on stdin, and the script runs as the job's user, respectively.

      If you use the rsync protocol, you can configure your data staging jobs to report the actual number of bytes transferred and the total data size to be transferred. To do so, set the trigger's Sets attribute to ^BYTES_IN.^DATA_SIZE_IN for stage in jobs and to ^BYTES_OUT.^DATA_SIZE_OUT for stage out jobs. For example, a stage in trigger would look like the following:

      JOBCFG[dsin]   TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stagein",Flags=objectxmlstdin:user:attacherror,Sets=^BYTES_IN.^DATA_SIZE_IN

      A stage out trigger would look like the following:

      JOBCFG[dsout]   TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stageout",Flags=objectxmlstdin:user:attacherror,Sets=^BYTES_OUT.^DATA_SIZE_OUT

      These variables appear as events if you set the WIKIEVENTS parameter to TRUE (a sample moab.cfg excerpt with this setting follows the template listing below).

    7. Create the stage out job template, called dsout in the example below, by repeating steps 2b - 2f in a new template. Note that the name of this job template must match the name of the data stage out job template referenced in the master template.
      JOBCFG[ds]     TEMPLATEDEPEND=AFTEROK:dsin TEMPLATEDEPEND=BEFORE:dsout SELECT=TRUE
       
      JOBCFG[dsin]   DATASTAGINGSYSJOB=TRUE
      JOBCFG[dsin]   GRES=bandwidth:2
      JOBCFG[dsin]   FLAGS=GRESONLY
      JOBCFG[dsin]   TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stagein",Flags=attacherror:objectxmlstdin:user
       
      JOBCFG[dsout]  DATASTAGINGSYSJOB=TRUE
      JOBCFG[dsout]  GRES=bandwidth:2
      JOBCFG[dsout]  FLAGS=GRESONLY
      JOBCFG[dsout]  TRIGGER=EType=start,AType=exec,Action="/opt/moab/tools/data-staging/ds_move_rsync --stageout",Flags=attacherror:objectxmlstdin:user
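      These templates assume that the grid-wide bandwidth pool referenced in step 2c has already been defined with NODECFG[GLOBAL], and that WIKI events are enabled if you want the byte-count variables from step 2f to be logged. A minimal, hypothetical moab.cfg excerpt (the bandwidth value is illustrative; see Configuring Data Staging for the authoritative settings) might look like the following:

      # Illustrative values only
      NODECFG[GLOBAL]  GRES=bandwidth:10
      WIKIEVENTS       TRUE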

  3. Create the job using msub, adding resources and specifying a script as you normally would. Then configure Moab to stage the data for it. To do so:
    1. At the end of the command, use the --stagein/--stageout option and/or --stageinfile/--stageoutfile option.
      • The --stagein/--stageout option lets you specify a single file or directory to stage in or out. You must set the option equal to <source>%<destination>, where <source> and <destination> are both [<user>@]<host>:/<path>/[<fileName>]. See Staging a file or directory for format and details.

        Note that if you do not know the cluster where the job will run but want the data staged to the same location, you can use the $CLUSTERHOST variable in place of a host. If you choose to use the $CLUSTERHOST variable, you must first customize the ds_config.py file. For more information, see Configuring the $CLUSTERHOST variable.

        If the destination partition is down or does not have configured resources, the data staging workflow submission will fail.

        > msub ... --stagein=annasmith@labs:/patient-022678/%\$CLUSTERHOST:/davidharris/research/patientrecords <jobScript>

        Moab copies the /patient-022678 directory from the hospital's labs server to the cluster where the job will run prior to job start.
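         If you already know which cluster the job will run on, you can give the destination host explicitly instead of using the $CLUSTERHOST variable. The user, host, and path names below are hypothetical:

         > msub ... --stagein=annasmith@labs:/patient-022678/%davidharris@clusterhead:/davidharris/research/patientrecords <jobScript>

         Moab copies the /patient-022678 directory from the hospital's labs server to /davidharris/research/patientrecords on clusterhead prior to job start.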

      • The --stageinfile/--stageoutfile option lets you specify a file that lists the files and/or directories to stage in or out. You must set the option equal to the <path>/<fileName> of that list file. The file must contain at least one line in the format <source>%<destination>, where <source> and <destination> are both [<user>@]<host>:/<path>/[<fileName>]. See Staging multiple files or directories for more information.

        If the destination partition is down or does not have configured resources, the data staging workflow submission will fail.

        > msub ... --stageinfile=/davidharris/research/recordlist <jobScript>

        Moab copies all files specified in the /davidharris/research/recordlist file to the cluster where the job will run prior to job start.

        /davidharris/research/recordlist:

        annasmith@labs:/patient-022678/tests/blood02282014%$CLUSTERHOST:/davidharris/research/patientrecords/blood02282014
        annasmith@labs:/patient-022678/visits/stats02032014%$CLUSTERHOST:/davidharris/research/patientrecords/stats02032014
        annasmith@labs:/patient-022678/visits/stats02142014%$CLUSTERHOST:/davidharris/research/patientrecords/stats02142014
        annasmith@labs:/patient-022678/visits/stats02282014%$CLUSTERHOST:/davidharris/research/patientrecords/stats02282014
        annasmith@labs:/patient-022678/visits/stats03032014%$CLUSTERHOST:/davidharris/research/patientrecords/stats03032014
        annasmith@labs:/patient-022678/visits/stats03142014%$CLUSTERHOST:/davidharris/research/patientrecords/stats03142014
        annasmith@labs:/patient-022678/visits/stats03282014%$CLUSTERHOST:/davidharris/research/patientrecords/stats03282014

        Moab copies the seven patient record files from the hospital's labs server to the cluster where the job will run prior to job start.
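         Staging data out from a list file works the same way, with the source and destination reversed; the --stageoutsize option shown here is always required when staging out and is described in the next step. A hypothetical submission and list file (the paths are illustrative) might look like the following:

         > msub ... --stageoutfile=/davidharris/research/resultlist --stageoutsize=50 <jobScript>

         /davidharris/research/resultlist:

         $CLUSTERHOST:/davidharris/research/patientrecords/summary02282014%annasmith@labs:/patient-022678/results/summary02282014
         $CLUSTERHOST:/davidharris/research/patientrecords/summary03282014%annasmith@labs:/patient-022678/results/summary03282014

         Moab copies the two summary files from the cluster where the job ran back to the hospital's labs server when the job completes.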

    2. The --stageinsize/--stageoutsize option lets you specify the estimated size of the files and/or directories so that Moab can more quickly and accurately calculate how long the data staging will take and schedule your job accordingly. If you used the $CLUSTERHOST variable to stage in, setting --stageinsize is required. --stageoutsize is always required for staging data out. If you provide an integer, Moab assumes the number is in megabytes; to use a different unit, append a unit suffix. See Stage in or out file size for more information.

      > msub ... --stageinfile=/davidharris/research/recordlist --stageinsize=100  <jobScript>

      Moab copies the files specified in the /davidharris/research/recordlist file, which total approximately 100 megabytes, from the hospital's labs server to the cluster where the job will run prior to job start.
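      To express the size in a unit other than megabytes, append a unit suffix to the number. The following submission is hypothetical and assumes the suffixes documented in Stage in or out file size (GB in this case) are supported in your environment:

      > msub ... --stageinfile=/davidharris/research/recordlist --stageinsize=2GB --stageoutfile=/davidharris/research/resultlist --stageoutsize=500 <jobScript>

      Moab plans for roughly 2 gigabytes of data to stage in and 500 megabytes to stage out.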

  4. To see the status, errors, and other details associated with your data staging job, run checkjob -v. See "checkjob" for details.
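     For example, if msub returned the job ID Moab.1072 (the ID here is hypothetical), you would run:

       > checkjob -v Moab.1072

     Substitute whatever job ID msub actually printed for your submission.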
