Connecting to the Discovery Cluster
Intro to Discovery (how to set up and run SLURM jobs) tutorial video: https://rc.northeastern.edu/support/training/
Discovery account: apply by opening a ServiceNow ticket with an access request at https://rc-docs.northeastern.edu/en/latest/welcome/gettinghelp.html#submit-a-ticket
Login by typing, e.g., ssh lotterhos@login.discovery.neu.edu (but use your own username). Also check out the instructions for advanced ssh-ing: setting up ssh without a password and using an alias (e.g. ssh discovery instead of the full command ssh username@login.discovery.neu.edu). These steps will save you time in the long run.
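A minimal sketch of what the alias and passwordless setup can look like (using lotterhos as the username here; substitute your own, and see the linked instructions for the full details). On your local machine, add an entry to ~/.ssh/config:

Host discovery
    HostName login.discovery.neu.edu
    User lotterhos

Then create a key pair and copy the public key to Discovery so you are not prompted for a password:

ssh-keygen -t rsa
ssh-copy-id discovery

After that, ssh discovery logs you straight in.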
Check out the NU Discovery research computing docs, particularly the sections on Getting Started, Software, and Slurm.
Need help?
GET OFF THE LOGIN NODE!
When you sign into the cluster, you are on a login node! Don't run your code on the login node, or you will get an angry email from Research Computing.
To develop code and troubleshoot, you need to transfer over to another partition after you log in. Using the srun command with the --pty option gives you an interactive session that outputs the results of your code on the screen; a typical workflow is sketched after the examples below.
To run on the lotterhos partition:
srun -p lotterhos -N 1 --pty /bin/bash
Or if you are uploading and want a faster network connection: srun --constraint=ib -p lotterhos --pty /bin/bash
Or for 20 minutes on the debug partition:
srun -p debug -N 1 --pty /bin/bash
Or for up to 24 hours (check the partition limits) on the short partition:
srun -p short -N 1 --pty /bin/bash
See this page for information about default times on different partitions
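A sketch of a typical interactive workflow (the hostname and exit lines are just illustrations; pick whichever partition and time fit your task):

squeue -p lotterhos                       # check what is already running on our nodes
srun -p lotterhos -N 1 --pty /bin/bash    # request an interactive shell on a compute node
hostname                                  # confirm you are on a compute node (d3037/d3038), not the login node
# ...develop and test your code here...
exit                                      # release the node when you are done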
Overview presentation
Here is a presentation from Research Computing on terminology and optimizing workflows in the lab: Optimization_Lotterhos_labmeeting_Sekar-V2.pdf
Brandon’s recordings on how to set up RSA keys and run Jupyter Notebooks
Link to recording (Passcode: 0u?9rW&%)
Using Lotterhos nodes on Discovery
There is a lotterhos unix group, which you need to request Dr. L to add you to. It takes a day for the change to propagate, after which you should see that you are part of the 'lotterhos' group when typing groups $USER in the terminal. Once you confirm you are part of the unix group and have access to the nodes, you can submit regular jobs using the --partition lotterhos flag (a minimal job script sketch is shown below).
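For reference, a minimal batch job script might look like the sketch below (the job name, time, memory, module, and myscript.R are placeholders to adapt to your own analysis):

#!/bin/bash
#SBATCH --partition=lotterhos       # run on the lab nodes
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=5G                    # roughly the per-CPU share of memory on our nodes
#SBATCH --time=24:00:00             # request what you need; default is 7 days, max 30
#SBATCH --job-name=myjob
#SBATCH --output=myjob_%j.out

module load R                       # load whatever software your analysis needs (module names may differ)
Rscript myscript.R

Submit it with sbatch and then monitor it with squeue -p lotterhos.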
There are 2 nodes, each with 36 cores, so we can run parallel programs on up to 36 cores and serial jobs on up to 72 cores. Our nodes are d3037 and d3038. Each node has ~180 GB of memory (about 5 GB per CPU). Please communicate with Dr. L about your project and the best way to use resources, so that everyone in the lab has access if they need it.
To get info on the nodes, type sinfo -p lotterhos, which gives the node IDs; once you are on a node, lscpu gives info on its CPUs.
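For example (scontrol show node is standard SLURM; d3037 is one of our nodes from above):

sinfo -p lotterhos          # list the partition's nodes and their state
scontrol show node d3037    # CPUs, memory, and current load for a single node
lscpu                       # run after srun-ing onto a node to see its CPU details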
scontrol show partition lotterhos shows what we can do. There are no limits on job submissions; the default time is 7 days with a max of 30 days. When you run the command, you can see MaxTime=30-00:00:00.
squeue -p lotterhos shows what is currently running. You definitely want to check this before you srun (see the interactive session commands above) or submit jobs!
There is about 5 GB of memory per core, with 36 cores/node and 72 cores total across both nodes. So it should be possible to submit an array of 140 jobs at a time, each with 2.5 GB of memory, but I haven't tried it yet (see the sketch below).
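Untested, but a job array along those lines might look like this sketch (the 140-task count and 2.5 GB per task come from the estimate above; process_rep.R is a placeholder for your own script):

#!/bin/bash
#SBATCH --partition=lotterhos
#SBATCH --array=1-140               # 140 tasks; SLURM runs as many at once as the resources allow
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2500M                 # ~2.5 GB per task, per the estimate above
#SBATCH --time=24:00:00
#SBATCH --output=array_%A_%a.out

# each task receives its own index in SLURM_ARRAY_TASK_ID
Rscript process_rep.R ${SLURM_ARRAY_TASK_ID}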
If you are on campus, or off campus using NU's VPN, you can use this webpage to view resources: https://xdmod.rc.northeastern.edu/