Gadi is Australia’s most powerful supercomputer, a highly parallel cluster comprising more than 180,000 processor cores on ten different types of compute nodes. Gadi accommodates a wide range of tasks, from running climate models to genome sequencing, from designing molecules to astrophysical modelling. 

Introduction to Gadi is designed for new users, or for users who want a refresher on the basics of Gadi. It covers NCI's HPC systems, how user accounts and projects work, how to use the login nodes, compute and storage resource accounting, data storage essentials, and compute job scripting, submission and monitoring at NCI.

Course Information

Prerequisites

The only prerequisite for this course is an active NCI user account ready for login. If you do not have an NCI user account, you may still register for this course; however, you will not be able to take full advantage of the hands-on exercises.

Attendees are strongly encouraged to review the following pages, which contain essential background information, before the course.


Objectives

This course aims to empower attendees to work confidently on Gadi with a basic understanding of:

  • resource accounting
  • the difference among login, compute and data-mover nodes
  • job submission and management
  • module environment, for using software applications
  • basic skills to plan, track and manage jobs on Gadi
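The module environment mentioned above follows the standard environment-modules workflow. A minimal sketch is below; the application name is illustrative, and `module avail` will show what is actually installed on Gadi.

```shell
# Sketch of the environment-modules workflow on Gadi.
# The module name below (python3) is illustrative only.
module avail            # list software installed on the system
module load python3     # add an application to your environment
module list             # show currently loaded modules
module unload python3   # remove it again
```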


Learning Outcomes

At the completion of this course, you will be able to:

  • log in to Gadi
  • transfer data on and off Gadi
  • run module commands to customise your user environment and configure software applications
  • submit jobs
  • check and maintain compute, storage, and job status
  • estimate job cost
  • request resources adequate for your jobs
  • monitor job status, progress and resource utilisation
  • understand common reasons why jobs finish with errors
  • ask questions about jobs like a pro 
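The first two outcomes, logging in and transferring data, look roughly like the sketch below. The username `aa1234` and the file name are placeholders; `gadi.nci.org.au` is the login address, and the dedicated data-mover nodes at `gadi-dm.nci.org.au` should be used for transfers rather than the login nodes.

```shell
# Log in to Gadi (replace aa1234 with your NCI username):
ssh aa1234@gadi.nci.org.au

# Copy a file from your local machine to your Gadi home directory,
# going through the data-mover nodes rather than the login nodes:
scp results.tar.gz aa1234@gadi-dm.nci.org.au:~/
```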


Topics Covered

  • Login nodes and login environment
  • Shared filesystems and jobfs
  • Home, Lustre, and tape filesystems
  • Home and project folders 
  • Data transfer and data mover nodes
  • Compute grant, resource hours and PBS queues
  • Job submission and output/error logs
  • Applications, modules and software groups
  • Login, copyq, and different compute nodes
  • Job cost and resource hours
  • PBS directives
  • Tools for job monitoring before, during and after the run
  • Common reasons why jobs are not running
  • Common reasons why jobs fail
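The job scripting, PBS directives, and submission topics above come together in a job script. The sketch below assumes PBSPro as used on Gadi; the project code `ab12`, the resource requests, the module, and the workload script are all placeholders you would adapt to your own grant and application.

```shell
#!/bin/bash
# Minimal PBS job script sketch for Gadi. The project code ab12,
# resource requests, and workload below are placeholders.
#PBS -P ab12                   # project whose grant the job is charged to
#PBS -q normal                 # queue (determines node type and cost rate)
#PBS -l ncpus=4                # CPU cores requested
#PBS -l mem=16GB               # memory requested
#PBS -l walltime=01:00:00      # maximum run time
#PBS -l storage=scratch/ab12   # filesystems the job needs mounted
#PBS -l wd                     # start in the submission directory

module load python3            # illustrative module
python3 my_analysis.py         # hypothetical workload
```

Submit the script with `qsub job.sh` and check its status with `qstat`; after the job finishes, its stdout and stderr logs appear as `<jobname>.o<jobid>` and `<jobname>.e<jobid>` files in the submission directory.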