Connecting to the Discovery Cluster

Edit this page


Intro to Discovery (how to set up and run SLURM jobs ) tutorial video: https://rc.northeastern.edu/support/training/

  • Discovery account. Apply by opening a Service-Now ticket with an access request at https://rc-docs.northeastern.edu/en/latest/welcome/gettinghelp.html#submit-a-ticket

  • Login by typing e.g. ssh lotterhos@login.discovery.neu.edu, but use your own username

    • also check out instructions for advanced ssh-ing to ssh without a password and to use an alias (eg ssh discovery instead of using the full command: ssh username@login.discovery.neu.edu). These steps will help you save time in the long run.
  • Check out NU Discovery research computing docs, particularly the sections on Getting Started, Software, and Slurm

  • Need help?

GET OFF THE LOGIN NODE!

When you sign into the cluster, you are on a login node! Don’t run your code on the login node or else you will get an angry email from research computing.

To develop code and troubleshoot, you need to transfer over to another partition after you login. Use the srun command with the --pty option will output the results of your code on the screen

To run on the lotterhos partition:

srun -p lotterhos -N 1 --pty /bin/bash

Or if you are uploading and want a faster network connection: srun --constraint=ib -p lotterhos --pty /bin/bash

OR For 20 minutes on the debug partition:

srun -p debug -N 1 --pty /bin/bash

OR for a few 24? hours on the short partition:

srun -p short -N 1 --pty /bin/bash

See this page for information about default times on different partitions

Overview presentation

Here is a presentation from Research Computing on terminology and optimizing workflows in the lab: Optimization_Lotterhos_labmeeting_Sekar-V2.pdf

Brandon’s recordings on how to set up RSA keys and run Jupyter Notebooks

link to recording Passcode: 0u?9rW&%

Using Lotterhos nodes on Discovery

There is a lotterhos unix group, which you need to request Dr. L to add you to. It takes a day for the change to propagate so you should be able to see you are part of the ‘lotterhos’ group when typing groups $USER in the terminal. Once you confirm you are part of the unix group and have access to the nodes, you can submit regular jobs using the --partition lotterhos

There are 2 nodes each with 36 cores, so we can run parallel programs with up to 36 cores, and serial on up to 72 cores. Our nodes are d3037 and d3038. EAch node has ~180Gb of memory (about 5Gb per CPU). Please communicate with Dr. L about your project and the best way to use resources, so that everyone in the lab has access if they need it.

To get info on the nodes type sinfo -p lotterhos, which gives the node IDs, and lscpu ##### to get info on the CPUs for that node.

scontrol show partition lotterhos shows what we can do. There are no limits on job submissions, the default time is 7 days with a max of 30 days. When you type the code, you can see MaxTime=30-00:00:00

squeue -p lotterhos shows what is current running. You definitely want to check this before you try to srun (see below) or submit jobs!

There is 5GB of memory on each core, with 36 cores/node and 72 cores total across both. So it should be possible to submit an array with 140 jobs at a time, each with 2.5 GB memory, but I haven’t tried it yet.

If you are on campus, or offcampus using NU’s VPN, you can use this webpage to view resources: https://xdmod.rc.northeastern.edu/