
Users Guide

System Overview

Why have root nodes and compute nodes?

Instead of investing in one very large (read very expensive) multi-processor supercomputer, we've chosen to keep our architecture open and more capable of growth as technology changes. HPC combines inexpensive systems that have been grouped together for processing data in batch.

Our users log in to one of two interactive systems (called root nodes) and submit compute jobs to run in batch. The machine you log in to doesn't actually perform these jobs, but rather dispatches them to run on a "compute" node. Compute nodes complete the task and report the results back to the root nodes. Users do not need to concern themselves with which node processes their data -- this is all done automatically by our batch scheduling software.

As compute nodes are modular, we can easily expand the resources of HPC by adding new systems to meet changes in technology. Further, because users do not log in to the compute nodes, we can take them offline as needed for maintenance or repairs without impacting the entire system.

The root node/compute node model is essentially identical to the client/server model in software engineering; in this case the client is the root node and the server is the compute node. You, the client, request that some operation, your job, be performed; the server, our compute nodes, completes your task and reports the results.


How do I connect to HPC?

Before you can connect to HPC, you must first obtain an account. Our online Account Request Form is the fastest way to submit your application.

Once your account is active, you'll need a Secure Shell client on your workstation to access the system; HPC accepts only secure connections. You can use your SSH client to connect and enter commands interactively, or you may use an SFTP client to transfer files to and from the system.

Please read the Users Guide sections on SSH and SFTP for a more detailed explanation of the two protocols and how to obtain them for your workstation.

Decide which root node you should connect to. HPC has an Intel Itanium 2 root node and an Intel EM64T Xeon root node. If your code is built to run on a particular hardware architecture, it's important that you choose the right host. For example, code compiled for the Itanium architecture will not run on the EM64T Xeon root node.

The host you are logged on to will only display the queues available and the jobs running on that platform.

The hostnames for our root nodes are:
rir2001.uta.edu - To log into the Itanium root node.
rir8002.uta.edu - To log into the Intel EM64T Xeon root node.

From a Linux or UNIX workstation, enter:

% ssh username@hostname

Where username represents your username and hostname represents the name of the root node.
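For example, a user with the (hypothetical) username jdoe would connect to the Itanium root node with:

% ssh jdoe@rir2001.uta.edu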

From a Windows (or other OS) workstation you will need an appropriate SSH client.


Why do I log into a root node?

There are two general types of machines in the HPC server farm: root nodes and compute nodes. Think of the root node as an interactive server that you log in to and the compute nodes as batch processing machines. All jobs are submitted to run on the compute nodes from the root nodes using our batch scheduling software, LSF. 

From the root node you should do any testing and tuning of your script before you submit it to run in batch mode. You should never log on to the compute nodes directly; they have been tuned to process data in batch mode.


Where should I save my files?

Your home directory is stored on a Network Attached Storage (NAS) device. Each compute node also contains Direct Attached Storage (DAS) -- disks contained in that node and only readable on that node. Accessing the NAS requires traffic to flow across the network, and while this is fast, it's not as fast or reliable as accessing direct attach storage.

Each compute node contains a directory called /scratch, which is used for temporary storage while your job is executing. Use this space during job execution rather than the NAS; the NAS should be used only to stage any required input at job startup and to collect results when the job finishes.

Before your job terminates, move any files created in /scratch to the NAS. Space in /scratch is shared between all users of the system.

  • The Intel Itanium 2 (IA64-based) compute nodes each have 357GB of scratch space.
  • The Intel Xeon EM64T (x86-64-based) compute nodes each have 9GB of scratch space.

User home directories are mounted on /home. System files, including applications and compilers, are mounted on /usr/nas.
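As a sketch of the recommended workflow (the per-job scratch directory naming follows the LSF script examples later in this guide; $LSB_JOBID is set by LSF for the running job):

# Create a per-job scratch directory on the compute node and work there
set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID
mkdir $scratchname
cd $scratchname

# ... run your job here ...

# Copy results back to your NAS home directory, then clean up /scratch
cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname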


Where should I submit my job? How do I determine the right queue?

There are no "hard and fast" rules for where your jobs must be submitted. Queues are created to fit specific types of jobs (those that are meant to run in either parallel or sequential fashion). In general, jobs requiring more than one processor are considered "parallel" and those meant to run on only one processor are considered "sequential" or "serial." You can enter the command "bqueues" to get a list of all the queues on the system. Below is an example of the output.

QUEUE_NAME      PRIO  STATUS       MAX  JL/U  JL/P  JL/H
itanium_long      20  Open:Active   22     9     1     3
itanium_debug     25  Open:Active    5     2     1     -
itanium_p4        20  Open:Active   16     6     -     -

Each queue has several properties associated with it: priority, maximum number of jobs, job limit per user, job limit per host, and others that can be configured. A higher priority queue will run before a lower priority queue. The maximum jobs per queue specifies how many separate jobs can be running (as opposed to pending) in a given queue. The job limit per user specifies how many jobs a particular user can run in any given queue at any given time. The job limit per host limits the number of jobs from a given queue that can execute on a given compute node at any given time (for example, if a queue has a JL/H of 1 and one job from that queue is already running on a compute node, no more jobs from that queue will be accepted on that compute node until the running job completes).

Below is a list of all the queues and their properties:

Queues available on the "rir2001" root node.

itanium_debug - Itanium 2 processor, 2 hours max cpu, 500MB max memory, user limit 5, queue limit 10, serial jobs only.

itanium_long - Itanium 2 processor, 200 hours max cpu, 2GB max memory, user limit 22, queue limit 52, serial jobs only.

itanium_p4 - Itanium 2 processor, 100 hours max run time, 16GB max memory, user limit 2, queue limit 4, 4 processor parallel jobs only.

itanium_preemptable - Itanium 2 processor, 100 hours max cpu, 2GB max memory, user limit 14, queue limit 20, serial jobs only. This queue can be used when you have reached your user limit on the itanium_long queue, but there are still idle cpus available. Jobs in this queue can be preempted (suspended) by the system when jobs are submitted to the itanium_long and/or itanium_debug queues by other users.

itanium_low_priority - Itanium 2 processor, 100 hours max cpu, 2GB max memory, user limit 6, queue limit 10, serial jobs only. All jobs in this queue are niced to the maximum value and jobs in this queue can be preempted. This queue has the lowest priority and was created for a group of "restricted users" who do not have access to the other queues.

Queues available on the "rir8002" root node.

normal - user limit 40, queue limit 80, serial jobs only.

parallel - user limit is 320 processors, queue limit 456 (please submit jobs with processors in multiples of four).
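For example, a serial job script (the file name myjob.lsf is only a placeholder) can be directed to a specific queue with the -q option to bsub:

% bsub -q normal < myjob.lsf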


What is SSH?

SSH (Secure Shell) is a program to log into another computer over a network, to execute commands in a remote machine, and to move files from one machine to another. It provides strong authentication and secure communications over insecure network channels.

Many users of telnet, rlogin, ftp, and other such programs might not realize that their password is transmitted across the Internet unencrypted, but it is. SSH encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks.

SSH is intended as a replacement for telnet, rlogin, rcp, ftp and rsh.


Where do I get an SSH client?

For Microsoft Windows users, The University of Texas at Arlington has established an academic site license with SSH Communications Security which allows us to distribute a non-commercial version of SSH Secure Shell for Workstations to our faculty, students and staff. To obtain your copy, visit:

http://download.uta.edu/software/pc/SSHSecureShellClient-3.2.9.exe

(see below for special instructions)

The above download requires authentication. Please enter your UTA Network username and password (this should be the same as your MS Exchange or Windows account username and password). The file size should be 5,912 KB. If your browser asks for a domain name, enter "UTA". If your password doesn't work, try putting "UTA\" before your username (e.g. "UTA\username" instead of "username").

There are numerous SSH clients available both commercially and for free for almost all modern operating systems. Virtually all distributions of Linux and BSD Unix now ship with an SSH client and server by default. MacSSH, a modified version of BetterTelnet, implements the SSH2 protocol for MacOS.

The OpenSSH project was established to provide a free SSH implementation for all operating systems. Its code is portable to most operating systems and can be obtained free of charge from the project's web site.


Using the Load Sharing Facility (LSF)

What is LSF?

The fair and efficient use of CPU time is of great concern to the HPC committee, and a batch queue system is meant to address this concern. The queuing system used on HPC is the Load Sharing Facility (LSF) version 7.0, from Platform Computing Corporation. LSF manages resources and schedules, monitors, and analyzes the workload for a network of computers. With LSF, you can run a job remotely and it behaves as if it were run locally.

Reference manuals are available from Platform Computing for LSF. Licensing restrictions prevent us from providing the manuals over the public Internet. If you require access to these manuals, please contact the HPC support staff directly.


Submitting batch jobs.

To run a job using LSF you need to create a batch job file that contains the LSF batch instructions. After the batch job file is created, you submit it to the LSF system using the command bsub. You can get more information about the bsub command from the bsub man page.

Below is an example of a batch job that compiles a Fortran program called test.f, runs it, and puts the output from the run in a file called run.output.

#BSUB -q normal       # job queue name
#BSUB -J test         # job name
#BSUB -c 30           # time in minutes
#BSUB -o test.out     # output is sent to file test.out
#----------- end of options to batch ------------------

# Compile the program
f77 -o run test.f

# Run the program and redirect the output to run.output
./run > run.output

Assuming the batch job above is saved in a file called test.one, we would submit it with the following command:

bsub < test.one


Will I have to wait for all the jobs to run in the queue before my job will run?

LSF uses fairshare scheduling to divide the processing power of the LSF cluster among users and groups to provide fair access to resources. Without fairshare scheduling, a user could submit many long jobs at once and monopolize the cluster's resources for a long time, while other users' urgent jobs would have to wait in the queues until all of the first user's jobs were done.

Fairshare scheduling works by assigning a fixed number of shares to each user or group. These shares represent a fraction of the resources that are available in the cluster. The most important users or groups are the ones with the most shares. A user's dynamic priority depends on their share assignment, the dynamic priority formula, and the resources their jobs have already consumed. LSF tries to run first the job belonging to the user with the highest dynamic priority; the order of jobs within the queue is secondary. LSF calculates the dynamic priority based on the following information about each user:

- Number of shares assigned to the user or group. All faculty members, regardless of the number of accounts they own, have the same number of shares and therefore the same access to the facilities.

- Resources used by jobs belonging to the user or group.
1. Number of job slots reserved and in use.
2. Run time of running jobs.
3. Cumulative actual CPU time, adjusted so that recently used CPU time is weighted more heavily than CPU time used in the distant past.

Please note that the more resources a user consumes, as defined by items 1-3, the lower their priority becomes. This ensures that heavy users cannot dominate the system. Also note that a user's dynamic priority is calculated on a queue-by-queue basis, and is different in every queue.
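As a rough illustration (the exact formula and factor names vary with the LSF version and the cluster's configuration), the default dynamic priority calculation has the form:

dynamic priority = number_of_shares /
                   (cpu_time * CPU_TIME_FACTOR + run_time * RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR)

so a user's priority grows with the shares they are assigned and shrinks as their jobs accumulate CPU time, run time, and job slots.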


LSF commands.

Here is a list of some of the most useful commands in LSF:

bsub This is the command for submitting a job for execution.

bjobs This command gives status information for one or more jobs. The -l option gives resource usage information. The -u username option displays jobs submitted by username. bjobs -u all produces information on all users. If you desire detailed information on a job (particularly, if you want answers to questions like "why hasn't my job started?" or "what resources have been requested or used by a job?"), then use bjobs -l jobid.

bqueues This will list for each queue: its priority; its status; the maximum number of processors that the queue can use; the maximum number of processors that a job from the queue can use; and the total number of jobs, the number of pending jobs, the number of running jobs, and the number of suspended jobs in the queue. The -l option to this command will provide a plethora of information on each queue.

bkill jobid This command cancels pending jobs and sends a kill signal (by default) to running jobs. Note that this command can also be used to send other signals to a running job (see the man page). LSF is sometimes a procrastinator and may respond to this command with "Operation will be retried later". Never fear, it will get around to killing the job.

bhist This command gives a history of one or more batch jobs. The default is to display a summary of current jobs in the queue only. The -a option will display both finished and unfinished jobs. The -l option will give more information than the standard summary. The -u username option will cover all jobs belonging to username.
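A few typical invocations (the job ID 1234 and the username jdoe are placeholders):

% bjobs -u all                # status of every user's jobs
% bjobs -l 1234               # detailed information about job 1234
% bkill 1234                  # kill job 1234
% bhist -l -u jdoe            # detailed history of jdoe's jobs
% bqueues -l itanium_long     # full details for a single queue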


How can I get the job usage report at the end of my job's standard output file rather than in email?

You can add the usage report to the end of your job output file simply by adding the following as the last executable line of your batch script:

busage $LSB_JOBID

The environment variable $LSB_JOBID is the ID of the currently running job; it is described in the man page for bsub.

There is no way to disable the e-mail notification on HPC.


LSF exit codes explained:

LSF uses regular Unix signals/error codes. The exit code that LSF reports comes from whichever process was running on the compute node during your batch execution (for example, the system lsfbatch processes themselves, or, if a batch job spawns more than one process, any one of those processes).

For exit codes greater than 128, subtract 128 to get the number of the signal that terminated the job; for example, an exit code of 139 indicates the job was killed by signal 11 (SIGSEGV, a segmentation fault).


LSF Job States:

A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.

[Diagram: Job States]

Many jobs enter only three states:

  • PEND - Waiting in a queue for scheduling and dispatch
  • RUN - Dispatched to a host and running
  • DONE - Finished normally with a zero exit value

Pending Jobs:

A job remains pending until all conditions for its execution are met. Some of the conditions are:

  • Start time specified by the user when the job is submitted
  • Load conditions on qualified hosts
  • Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
  • Run windows during which jobs from the queue can run
  • Limits on the number of job slots configured for a queue, a host, or a user
  • Relative priority to other users and jobs
  • Availability of the specified resources
  • Job dependency and pre-execution conditions

Suspended Jobs:

A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (super user), or by LSF. There are three different states for suspended jobs:

  • PSUSP - Suspended by its owner or the LSF administrator while in PEND state
  • USUSP - Suspended by its owner or the LSF administrator after being dispatched
  • SSUSP - Suspended by LSF after being dispatched

After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.

If the load on the execution host or hosts becomes too high, batch jobs could be interfering with one another or with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.

LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise. Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.

A system-suspended job can later be resumed by LSF Batch if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.

View Suspension Reasons:

The bjobs command displays the reason why a job was suspended.
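For example (the job ID below is a placeholder):

% bjobs -l 1234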


Abnormal termination / job fails to start:

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

  • The job is canceled by its owner or the LSF administrator while pending, or after being dispatched to a host.
  • The job is not able to be dispatched before it reaches its termination deadline, and thus is aborted by LSF.
  • The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
  • The job exits with a non-zero exit status.

A common mistake on HPC is to submit executables compiled on Itanium to run on EM64T (or vice versa). Be sure to note that these architectures are not compatible. A good practice would be to compile for different architectures in different directories or to have your script compile your executable on the compute node.
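One quick way to confirm which architecture you are logged in to is the uname command (the values shown are the ones typically reported by Linux on these platforms):

% uname -m      # prints "ia64" on the Itanium 2 nodes and "x86_64" on the EM64T Xeon nodes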


LSF Scripts

Make sure that your scripts have execute permissions and submit the job using the bsub command:

General Purpose Example

#!/bin/csh
#
#BSUB -o myexec.out            # output is sent to file myexec.out
#
#BSUB -q queue_name            # job queue name
#BSUB -J job_name              # job name
#BSUB -c time_in_minutes       # CPU time limit in minutes

# Create a per-job scratch directory on the compute node
set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID

if ( ! -d $scratchname ) then
mkdir $scratchname
endif

cd $scratchname

# Run the executable from the scratch directory
$HOME/myexe

# Copy results back to the home directory and clean up /scratch
cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname

This is a general purpose example which can be adopted by most users. The script will execute a file named myexe in the $HOME path (usually defined by your login shell as your home directory). The script will send output to the file myexec.out located in your $HOME directory.

The job will be submitted to the batch queue named queue_name and given the job name job_name. The job will be terminated after time_in_minutes of CPU time.
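Assuming the script above is saved as myjob.csh (a placeholder name), it could be made executable and submitted with:

% chmod +x myjob.csh
% bsub < myjob.csh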


An example using Gaussian98:

#!/bin/csh
#
#BSUB -q alpha_debug
#BSUB -J g98-test

# Gaussian98 input file to run
set filename = "test001.com"

setenv g98root /usr/local/

# Create a per-job scratch directory and point Gaussian's scratch there
set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID

if ( ! -d $scratchname ) then
mkdir $scratchname
endif

setenv GAUSS_SCRDIR $scratchname
source /usr/local/g98/bsd/g98.login

$g98root/g98/bsd/clearipc

# Run Gaussian98 on the input file
cat $filename | /usr/local/g98/g98

$g98root/g98/bsd/clearipc

# Copy results back to the home directory and clean up /scratch
cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname

The script above runs the Gaussian98 job test001.com on the queue named alpha_debug with a job name of g98-test. Intermediate files are stored on the compute node and the results are moved into the user's home directory after completion.


Compile and execute a Fortran job on a compute node.

#!/bin/csh
#BSUB -J fortran_example
#BSUB -q int

set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID

if ( ! -d $scratchname ) then
mkdir $scratchname
endif

cd $scratchname
# Write the Fortran source file (contents below are an illustrative placeholder)
cat > primes.f << EOF
      program primes
      print *, 'Hello from the compute node'
      end
EOF

f77 -o primes primes.f
./primes

cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname

The script above creates a new Fortran source file, compiles it, and executes it. Local storage is used for the intermediate files and the results are copied back to the user's home directory.


An example that compiles and executes a C program.

#!/bin/csh
#
#BSUB -J C_example
#BSUB -q int

set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID

if ( ! -d $scratchname ) then
mkdir $scratchname
endif

cd $scratchname
cat > hello.c << EOF
#include <stdio.h>

int
main(argc, argv)
int argc;
char **argv;
{
printf("Hello world!\n");
return 0;
}
EOF

gcc hello.c -o hello

./hello

cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname

The script above creates a new C source file, compiles it, and executes it. Local storage is used for the intermediate files and the results are copied back to the user's home directory.


An example that compiles and executes an MPI program.

#!/bin/tcsh
#BSUB -o mpi_script.out
#BSUB -J mpi_ex
#BSUB -q int_p2

set scratchname = /scratch/$LOGNAME'.'$LSB_JOBID

if ( ! -d $scratchname ) then
mkdir $scratchname
endif

cd $scratchname

# Build a machine file listing the hosts LSF allocated to this job
if ( -f machinefile ) then
rm -f machinefile
endif

foreach host ( $LSB_HOSTS )
echo $host >> machinefile
end

# Copy the source to scratch, compile it with the MPI compiler wrapper, and run it
cp -p $HOME/hello_mpi.c .
mpicc hello_mpi.c -o hello_mpi
mpirun -np 2 -machinefile machinefile hello_mpi

cp -pr $scratchname $HOME/.
cd /scratch
rm -rf $scratchname

The script above compiles the MPI program hello_mpi.c located in the path referred to by $HOME, which is usually defined as the user's home directory. The script executes the compiled file, using local storage for any temporary files, and moves the results back to the user's home directory upon completion.

Also note that the script creates a machine file for MPI called machinefile. This file contains a list of hosts that will execute the MPI job. LSF specifies the available hosts in the $LSB_HOSTS environment variable.
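Assuming the script above is saved as mpi_script.csh (a placeholder name), it could be submitted requesting two processors with:

% bsub -n 2 < mpi_script.csh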


File transfer questions

What is scp?

scp - Secure copy

scp copies files between hosts on a network.  It uses ssh (version 2) for data transfer, and uses the same authentication and provides the same security as ssh. Unlike rcp, scp will ask for passwords or passphrases if they are needed for authentication.

Any file name may contain a host and user specification to indicate that the file is to be copied to/from that host.  Copies between two remote hosts are permitted.

Examples:

% scp my_file.cpp remote_user@remote_host:/tmp/my_file.cpp

Copies the local file my_file.cpp from the local host to a remote host, as a remote user, and places the file as /tmp/my_file.cpp on the remote machine.

% scp remote_user@remote_host:/home/username/my_file.cpp .

Copies the remote file /home/username/my_file.cpp from a remote host and places it in the current directory on the local host.


What is sftp?

sftp - Secure file transfer program

sftp is an interactive file transfer program, similar to ftp(1), which performs all operations over an encrypted ssh (version 2) transport. It may also use many features of ssh, such as public key authentication and compression. sftp connects and logs into the specified host, then enters an interactive command mode.
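A typical session looks like the following (the host is one of the root nodes; the file names are placeholders):

% sftp username@rir2001.uta.edu
sftp> put my_file.cpp
sftp> get results.out
sftp> quit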


General questions

Can I arrange a tour of HPC?

UTA Faculty may arrange tours of the HPC servers with the HPC staff. Students interested in visiting the area should have their faculty sponsor contact OIT as described below.

HPC is located in a secured section of the Arlington Regional Data Center in Fort Worth. An escort from the OIT systems group must accompany all tour groups. Your OIT escort may inform you of any items which are not allowed in the server room.

To arrange a tour, please call the OIT Helpdesk at (817) 272-2208 or send an E-mail to helpdesk@uta.edu. An HPC administrator will contact you to arrange a mutually convenient time for your tour.


Why are the node names so cryptic?

All node names within HPC are based either on the hardware platform and number of processors or on a specific function.
