
Jupyter Lab/Notebook



Introduction

Background

Project Jupyter is an open source software stack that supports interactive data science and scientific computing across a wide array of programming languages (>130 supported kernels). The primary applications within Jupyter are:

  1. JupyterHub: Jupyter’s multi-user server. This application spawns, manages, and proxies multiple instances of the single-user JupyterLab server.

  2. JupyterLab: Jupyter’s next-generation notebook interface, which includes: Jupyter notebooks, text editor, terminal, file browser (with upload/download capacity), data viewer, markdown, context help, and external extensions.

[Screenshot of the JupyterLab interface]

Why Jupyter

Jupyter is popular among data scientists and researchers (Perkel, 2018) because it offers:

  • Interactive data exploration features
  • A browser-based user interface, making it easy to work on remote systems such as HPC and cloud
  • Language-agnostic support (>130 kernels)
  • File upload/download between remote and local systems
  • Easy ways to convert analyses and results into shareable formats such as slides, HTML, PDF, and LaTeX via nbconvert (see the example below)
  • Ease of sharing, collaborating on, and archiving analyses and results
  • A broad software stack that integrates with other open source projects
  • Customizability and extensibility
  • An open source code base
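
For example, nbconvert can be run from a terminal to export a notebook into other formats (a minimal sketch; the notebook name analysis.ipynb is hypothetical):

    # Export a notebook to a standalone HTML page
    jupyter nbconvert --to html analysis.ipynb

    # Export the same notebook to a Reveal.js slideshow
    jupyter nbconvert --to slides analysis.ipynb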

For more details about Jupyter and why you may want to use it for computational research, see: Why Jupyter


Launching JupyterLab

There are multiple approaches for accessing the Jupyter stack on Ceres.

The simplest way to launch JupyterLab is through the JupyterHub interface. To access it, you will need working SCINet credentials. To set up a SCINet account, see the quickstart guide. Below are the instructions for JupyterHub.

  1. Go to: https://jupyterhub.scinet.usda.gov/
  2. Log into JupyterHub (SCINet credentials)
    • Username: SCINet username
    • Verification Code: 6-digit time-sensitive code
    • Password: SCINet password
  3. Spawning a JupyterLab Instance

    The Spawning page includes a comprehensive set of options for customizing JupyterLab and the compute environment. There are two ways to spawn JupyterLab: with the standard environment (default) or with a user-defined container (optional). Example field values are shown after these instructions.

    Standard Options

    • Node Type (Required): Which partition (see Ceres partitions) to spawn JupyterLab on.
    • Number of Cores (Required): How many cores to allocate (must be an even number).
    • Job Duration (Required): How long Slurm (the Ceres resource allocation software) should allocate to this task.
    • Slurm Sbatch Args (Optional): Additional options for Slurm (see sbatch options). An example is --mem-per-cpu=6GB.
    • Working Directory (Optional): The directory in which to launch JupyterLab, e.g. /lustre/project/name_of_project. Defaults to your $HOME directory.

    Container Options

    • Full Path to the Container (Optional): If you wish to launch JupyterLab with a container, specify the Ceres path or Hub URL to the container.
    • Container Exec Args (Optional): Additional options for executing the container (see the singularity exec options). An example is --bind /lustre/project/name_of_project.
  4. Terminating JupyterLab

    To end the JupyterLab instance go to: File --> Hub Control Panel --> Stop Server
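
As a hedged illustration, a spawner form for a short interactive session might be filled in as follows (all values are hypothetical; adapt them to your project):

    Node Type:           short
    Number of Cores:     4
    Job Duration:        04:00:00
    Slurm Sbatch Args:   --mem-per-cpu=6GB
    Working Directory:   /lustre/project/name_of_project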

Below is a video (COMING SOON) showing the above process.


Environments and Software

Default Environment

The default environment includes:

  • Python, IDL, and R kernels
  • JupyterLab (and/or Jupyter Notebook)
  • RStudio (launched as an external process from within JupyterLab). Note: RStudio has been disabled in JupyterHub due to security issues. If you need to use RStudio on Ceres, see the RStudio Server Guide.
  • User conda environments (see below for details)
  • Ability to load Ceres maintained software (see below)
  • Slurm queue manager

Bring Your Own Environment

If you have an environment (e.g. a conda environment) in your $HOME directory (e.g. ~/.conda/envs/my_env) with a Jupyter kernel installed (i.e. IPyKernel, IRKernel, IJulia, idl_kernel, etc.), JupyterLab will detect this environment as a separate kernel (assuming it is not the base environment). For instance, a conda environment named my_env with IPyKernel installed will appear as Python [conda env:my_env] in the list of available kernels in JupyterLab. The one exception is the base environment, which already exists in the default Jupyter environment and will not be loaded from your home directory.
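
For example, such an environment can be created from a JupyterLab terminal or an SSH session (a minimal sketch; the environment name my_env and the package list are hypothetical):

    # Create a conda environment under ~/.conda/envs
    conda create --name my_env python numpy pandas
    conda activate my_env

    # Install a Jupyter kernel so JupyterLab can detect the environment
    conda install ipykernel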

Use Ceres Maintained Software

The default environment includes an extension (in the left sidebar of JupyterLab) for loading Ceres-maintained software into the current environment. This is the software visible with the module avail command.
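
The same software can also be listed and loaded from a JupyterLab terminal (a sketch; the module name below is a hypothetical example):

    # List software maintained on Ceres
    module avail

    # Load a specific package into the current session
    module load samtools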

Containerized Environment

JupyterHub will spawn an instance of JupyterLab using a Singularity container (see the container options above). The selected container must have JupyterLab installed. Users can specify a container in the Container Path section on the Spawner Options page. There are several ways to access containers on Ceres:

  • Pointing to a prebuilt container maintained either by yourself or by the VRSC (located at /references/containers/).
  • Pointing to a prebuilt container on an external hub, such as Docker Hub or Singularity Hub. When launching JupyterLab from a container located on a Hub for the first time, it will take 1-10 minutes to start because the container must be downloaded, built, and cached. Subsequent launches should be much faster (~10-20 seconds) because the image is cached in your $HOME directory. If the container is modified on the Hub, it will be re-downloaded, built, and cached (see the sketch below for pre-pulling an image).
    1. An example input for Container Path: docker://jupyter/datascience-notebook
    2. Project Jupyter maintains a set of containers (Jupyter Stacks), which include images such as minimal-notebook, scipy-notebook, r-notebook, and datascience-notebook.
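
To avoid the one-time build delay during spawning, an image can be pre-pulled and cached from a terminal (a sketch; assumes the singularity command is available in your session):

    # Pre-pull and cache a Jupyter Stacks image from Docker Hub
    singularity pull docker://jupyter/datascience-notebook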

Best Practices

Resource Conservation

  • For short sessions, select partitions designated for shorter-duration jobs (such as brief-low with a 2 hr limit or short with a 48 hr limit) in the “Node Type” drop-down.
  • For serial computing (non-parallel code), enter 2 or 4 for “Number of Cores” in the spawner options. If a computation is not parallelized, requesting more cores will not speed it up. If you need more memory, use --mem-per-cpu=XXGB in the Slurm Sbatch Args on the spawner page.
  • For parallel computing, choose a reasonable number of cores to meet your needs.
  • Choose a reasonable job duration.
  • Remember to stop the Jupyter server when you are done working (File --> Hub Control Panel --> Stop Server). You can verify that the allocation has ended as shown below.
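
To confirm that no forgotten sessions are still consuming resources, you can list your jobs from any Ceres terminal (a minimal sketch using standard Slurm commands):

    # Show your own running and pending Slurm jobs
    squeue -u $USER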

Reproducible Research

  • Version Control: The gold standard is version control, using services such as GitHub or GitLab.
  • Legible and Interpretable Code: Coding documents should include information about the mechanics of the code (comments within code blocks) as well as the underlying scientific narrative (markdown cells surrounding analysis and results).
  • Archiving the Computational Environment: Containerized environments, such as Docker and Singularity, provide the best approach for archiving computational environments. Services such as Docker Hub and Singularity Hub can store images associated with specific research or publications. Another approach is to capture computational environments as text output, such as a conda environment.yml or pip requirements.txt file (see the sketch after this list).
  • Data Provenance/Archiving: If utilizing a public dataset, the source and version should be documented. If using non-public data, the data should be published to a public repository, such as NAL or AgCross.
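
As a minimal sketch of the text-output approach (the file names are conventional, not mandated):

    # Record the active conda environment so it can be rebuilt later
    conda env export > environment.yml

    # Or, for a pip-based environment
    pip freeze > requirements.txt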

A detailed tutorial about conducting reproducible research can be found at: Coming Soon!

Tutorials and Packages for Parallel Computing

Developing code/scripts that utilize the resources of a cluster can be challenging. Below are some software packages that may assist in parallelizing computations, as well as links to some Ceres-specific examples; a short sketch follows the list.

  1. Python - Dask, ipyparallel, Ray, joblib
  2. R - rslurm, parallel, doParallel, snow
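
As one hedged example, ipyparallel includes a command-line tool for starting a set of worker engines inside an existing JupyterLab allocation (a sketch; match the engine count to the cores you requested at spawn time):

    # Start 4 ipyparallel engines on the current node
    ipcluster start -n 4

From a notebook, ipyparallel.Client() can then connect to the running engines.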

Known Issues

  • Users launching RStudio from JupyterHub for the first time may encounter a timeout error. Refreshing the page should fix this.