Python is one of the most widely used programming languages in the world, with applications in almost every data-oriented domain. The Python data science ecosystem is a rich platform for scaling up workflows, enhancing scientific research and improving insight. However, Python's performance can become a limiting factor when working with large datasets or demanding computations. Parallel computing and efficient data handling can overcome this barrier and increase research throughput.



Course Information

Prerequisites

  • Basic experience with Python is required.
  • Some familiarity with array processing in NumPy is helpful but not required, as the course includes a brief refresher.
  • The training session runs on the NCI Open OnDemand (OOD) service. Attendees are encouraged to review the following page for background information: Open OnDemand (OOD) Service


Objectives

This course is designed as a first parallel programming course for scientists. It aims to help attendees:

  • Understand array programming with NumPy
  • Work with large and possibly heterogeneous data using xarray
  • Perform parallel computation using Dask


Learning Outcomes

By the end of this course you will be able to:

  • Use vectorized computation with NumPy
  • Load, annotate and work with data using xarray
  • Serialise large datasets to file using xarray
  • Load data from the cloud using OPeNDAP and xarray
  • Parallelise common workflows and arbitrary code using Dask
  • Combine Dask and xarray for big data processing
  • Combine Dask and GPUs for maximum data throughput
  • Feel confident applying your data science skills to your own problems


Topics Covered

  • Array programming in NumPy
  • Array data structures and hierarchies
  • Loading and saving data efficiently to disk
  • Cloud-native computing
  • Parallel computing with Dask
  • Combining Python packages for enhanced functionality