This note walks you through setting up a Singularity environment on NYU HPC (Greene).

Most up-to-date doc: link

Connect to NYU Greene:

ssh [netid]@greene.hpc.nyu.edu
[Type in your password]

# Last login: Tue Mar  1 15:53:16 2022 from xx.xx.xx.xx
# [hl3797@log-2 hl3797]$
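
If you connect often, an entry in your local ~/.ssh/config saves some typing. A minimal sketch (substitute your own NetID):

# On your local machine, in ~/.ssh/config
Host greene
    HostName greene.hpc.nyu.edu
    User [netid]

# Then connecting is just:
ssh greene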

Get a compute node with a GPU:

srun --nodes=1 --cpus-per-task=4 --mem=32GB --time=2:00:00 --gres=gpu:1 --pty /bin/bash
# Wait until Slurm places you on a compute node
# [hl3797@gv001 ~]$
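
Note: if you need a specific card, Slurm usually accepts a GPU type in the gres request. A sketch (the exact type names depend on Greene's current configuration):

srun --nodes=1 --cpus-per-task=4 --mem=32GB --time=2:00:00 --gres=gpu:rtx8000:1 --pty /bin/bash

# Once on the node, confirm the GPU is visible:
nvidia-smi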

Prepare required files:

cd /scratch/$USER
# [hl3797@gv001 hl3797]$

cp /scratch/work/public/singularity/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif .
cp /scratch/work/public/overlay-fs-ext3/overlay-25GB-500K.ext3.gz .
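
Other overlay sizes are available in the same public directory; list them if 25 GB / 500K inodes does not fit your needs:

ls /scratch/work/public/overlay-fs-ext3/
# [available overlay images of various sizes]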

gunzip -vvv overlay-25GB-500K.ext3.gz
# Note: this takes a long time

ls
# cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif  overlay-25GB-500K.ext3

Launch the Singularity container (with GPU access):

singularity exec --nv \
    --bind /scratch/$USER \
    --overlay /scratch/$USER/overlay-25GB-500K.ext3:rw \
    /scratch/$USER/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif \
    /bin/bash
# Singularity> 
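
Note: an ext3 overlay can be mounted read-write by only one process at a time. Use :rw while installing packages (as above); for running jobs, mount it read-only (:ro) so several jobs can share the same overlay. A sketch:

singularity exec --nv \
    --overlay /scratch/$USER/overlay-25GB-500K.ext3:ro \
    /scratch/$USER/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif \
    /bin/bash

For now, stay in the :rw session started above and continue: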

cd /ext3
# Singularity> 

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh --no-check-certificate
# 2022-03-01 21:54:02 (151 MB/s) - 'Miniconda3-latest-Linux-x86_64.sh' saved [75660608/75660608]

bash ./Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
# PREFIX=/ext3/miniconda3
# Unpacking payload ...
# [...]
# installation finished.

wget https://raw.githubusercontent.com/hmdliu/MLLU-SP22-tmp/main/env.sh --no-check-certificate
# 2022-03-01 21:55:54 (4.76 MB/s) - 'env.sh' saved [143/143]

source /ext3/env.sh
# Singularity>
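
The downloaded env.sh puts the Miniconda install on your PATH inside the container. If you prefer to write it yourself, a typical version looks like the sketch below (the exact contents of the file above may differ):

#!/bin/bash
# Sketch of /ext3/env.sh: activate the conda install under /ext3.
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH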

# 'which' may be shadowed by a shell function; unset it so the check below
# reports the conda binary.
unset -f which
which python
# /ext3/miniconda3/bin/python

Install packages:

which pip
# /ext3/miniconda3/bin/pip

pip install torch torchvision torchaudio
pip install scikit-learn numpy scipy pandas matplotlib h5py addict tensorboard
# [normal installation info]
# Note: You may install more packages as needed. Use 'scikit-learn' here;
# the bare 'sklearn' package on PyPI is a deprecated alias.
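
If you need PyTorch wheels built against a specific CUDA version, pip can pull them from the PyTorch wheel index. A sketch (cu113 is an assumption; match the tag to your image's CUDA version):

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113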

Verify the PyTorch installation and GPU access:

Singularity> python
# Python 3.9.7 (default, Sep 16 2021, 13:09:58)
# [...]

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.current_device()
0
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(0)
'Tesla T4'
>>> exit()
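
The same check as a one-liner, handy inside job scripts:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# True Tesla T4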

Singularity> exit
# [hl3797@gv001 hl3797]$

Define a shortcut to your scratch directory:

SCRATCH=/scratch/$USER
echo $SCRATCH
# /scratch/[netid]
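
To make the shortcut permanent, append it to your ~/.bashrc (single quotes, so $USER expands when the file is sourced):

echo 'export SCRATCH=/scratch/$USER' >> ~/.bashrc
source ~/.bashrc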

Submit a test job:

mkdir $SCRATCH/test
cd $SCRATCH/test
# [hl3797@b-8-72 test]$

wget https://raw.githubusercontent.com/TeamOfProfGuo/Codebase-Files/main/test_gpu.py
# 2022-04-03 10:51:05 (86.9 MB/s) - ‘test_gpu.py’ saved [678/678]

wget https://raw.githubusercontent.com/TeamOfProfGuo/Codebase-Files/main/submit_job.slurm
# 2022-04-03 10:51:19 (60.5 MB/s) - ‘submit_job.slurm’ saved [439/439]

sbatch submit_job.slurm
# Submitted batch job 53617
# Note: The job can be pending for a while.
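
The downloaded submit_job.slurm wraps test_gpu.py in the same container. A minimal sketch of such a script, assuming the paths from this note (the actual file may differ):

#!/bin/bash
#SBATCH --job-name=test_gpu
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16GB
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:1
#SBATCH --output=test.out
#SBATCH --error=test.err

singularity exec --nv \
    --overlay /scratch/$USER/overlay-25GB-500K.ext3:ro \
    /scratch/$USER/cuda11.4.2-cudnn8.2.4-devel-ubuntu20.04.3.sif \
    /bin/bash -c "source /ext3/env.sh; python test_gpu.py"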

squeue -u $USER
#    JOBID  PARTITION       NAME    USER    ST   TIME  NODES  NODELIST(REASON)
# 17135904    rtx8000   test_gpu  hl3797     R   0:10      1  gr004
# Note: Wait until the 'test_gpu' job ends.

cat test.out
# Torch cuda available: True
# GPU name: Quadro RTX 8000
# 
# 
# CPU matmul elapsed: 1.7312142848968506 sec.
# GPU matmul elapsed: 0.15191888809204102 sec.

cat test.err
# /scratch/hl3797/test/test_gpu.py:34: UserWarning: Sample warning message.
#   warnings.warn("Sample warning message.")