Geospatial Analysis with Python on Ceres

Rowan Gaffney (rowan.gaffney@usda.gov)

1. Data on Ceres

Transferring Data

  1. I2 Connection (at select locations)
    • For very large data transfers
    • Secure Copy (SCP) to transfer data
    • Additional documentation on SCINet Basecamp
  2. Globus
    • Specifically designed for HPC data transfer, with an intuitive web-based GUI. Globus Link
    • Additional documentation on SCINet Basecamp
  3. JupyterLab
    • Upload files (from local storage to Ceres) by using the upload icon within the files tab
    • Download files (from Ceres to local storage) by right-clicking the file or folder and choosing "Download" or "Download Current Folder as Archive"

Reducing Data Size

Data Type

Scaling and changing the data type is an effective way to reduce the overall size of your data. A common data type is 64-bit floating point (float64), whose precision is often far greater than the precision of the data itself. Consider reflectance data ranging from 0.0 to 1.0 in floating point representation. Scaling by 10,000 and converting to int16 (16 bits) preserves four decimal places of precision and reduces the size by a factor of 4.

Consider the Below Example:

In [1]:
import numpy as np

# 1,000 x 10,000 random values in [0, 1), stored as 64-bit floats
array1 = np.random.random((1000,10000)).astype(np.float64)
print('This array, in '+str(array1.dtype)+', is '+str(array1.nbytes*1e-6)+' MB')

# Scale by 10,000, round to the nearest integer, and store as 16-bit integers
array2 = np.round(array1*10000.,0).astype(np.int16)
print('Scaling and converting the array to '+str(array2.dtype)+' results in a size of '+str(array2.nbytes*1e-6)+' MB')
This array, in float64, is 80.0 MB
Scaling and converting the array to int16 results in a size of 20.0 MB
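
The size reduction is only useful if the original values can be recovered closely enough. The following is a minimal sketch of the round trip, assuming reflectance-like values between 0 and 1 and the same scale factor of 10,000; the variable names are illustrative only. It packs the data to int16, unpacks it again, and reports the largest difference from the original.

```python
import numpy as np

# Hypothetical reflectance values in [0, 1), stored as float64.
rng = np.random.default_rng(0)
reflectance = rng.random((100, 100))

scale = 10000  # same scale factor as in the example above

# Pack: scale, round to the nearest integer, and store as int16.
packed = np.round(reflectance * scale).astype(np.int16)

# Unpack: convert back to floating point and undo the scaling.
restored = packed.astype(np.float64) / scale

# The rounding step limits the error to half of one scaled unit (0.5 / scale).
max_error = np.abs(restored - reflectance).max()
print('Size ratio (float64 / int16):', reflectance.nbytes / packed.nbytes)
print('Largest round-trip error:', max_error)
```

Note that int16 can only hold values up to 32,767, so with this scale factor any input above about 3.27 would overflow. Check the range of your data before choosing a scale factor and integer type.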

Data Compression

Most raster formats have internal lossless compression options. Depending on the nature of your data, this can reduce the size substantially. For detailed specifications of raster formats, see the GDAL specifications. Other common file formats, such as Zarr (Zarr Compression) and NetCDF (NetCDF Compression), have internal compression options as well.
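
To illustrate how much the benefit depends on the nature of the data, the sketch below compresses two arrays of identical size with zlib (DEFLATE) from the Python standard library. This is only a stand-in for a raster format's internal compression, not how any particular format is written, and both arrays are made-up examples: one has no spatial structure, the other is highly regular.

```python
import zlib
import numpy as np

# Two hypothetical int16 "rasters" of identical size.
rng = np.random.default_rng(0)
noisy = rng.integers(0, 10000, size=(500, 500)).astype(np.int16)  # no structure
smooth = np.tile(np.arange(500, dtype=np.int16), (500, 1))        # every row identical

ratios = {}
for name, raster in [('noisy', noisy), ('smooth', smooth)]:
    raw = raster.tobytes()
    compressed = zlib.compress(raw, level=6)
    # Lossless: decompressing returns exactly the original bytes.
    assert zlib.decompress(compressed) == raw
    ratios[name] = len(raw) / len(compressed)
    print(name, 'compression ratio:', round(ratios[name], 1))
```

The regular array compresses far better than the noisy one, which is why compression gains vary so much between datasets.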

Virtual Rasters

A common issue with geospatial analysis is working with data in different projections. A typical workflow may be to reproject data into a common projection, which duplicates the data. An alternative is to use Virtual Raster Files (.VRT), which are simple XML files that describe how the data should be transformed when opened.

In addition to re-projecting, Virtual Raster Files can be used to mosaic, alter resolutions, resample, and more. An efficient tool for building VRT files is gdalbuildvrt.
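
To give a feel for what a VRT file contains, the sketch below uses only the Python standard library to write the XML for a two-tile mosaic. It is an illustration under stated assumptions, not a replacement for gdalbuildvrt: the function name build_mosaic_vrt and the tile file names are hypothetical, and the georeferencing elements (SRS and GeoTransform) that a real VRT normally carries are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_mosaic_vrt(tiles, width, height, dtype='Int16'):
    """Return VRT XML text that places single-band tiles side by side.

    tiles: list of (filename, x_offset, tile_width, tile_height) tuples.
    """
    vrt = ET.Element('VRTDataset', rasterXSize=str(width), rasterYSize=str(height))
    band = ET.SubElement(vrt, 'VRTRasterBand', dataType=dtype, band='1')
    for filename, x_off, t_w, t_h in tiles:
        src = ET.SubElement(band, 'SimpleSource')
        ET.SubElement(src, 'SourceFilename', relativeToVRT='1').text = filename
        ET.SubElement(src, 'SourceBand').text = '1'
        # Which pixels to read from the source tile ...
        ET.SubElement(src, 'SrcRect', xOff='0', yOff='0',
                      xSize=str(t_w), ySize=str(t_h))
        # ... and where they land in the virtual mosaic.
        ET.SubElement(src, 'DstRect', xOff=str(x_off), yOff='0',
                      xSize=str(t_w), ySize=str(t_h))
    return ET.tostring(vrt, encoding='unicode')


# Hypothetical 512 x 512 tiles placed next to each other.
vrt_xml = build_mosaic_vrt(
    [('tile_west.tif', 0, 512, 512), ('tile_east.tif', 512, 512, 512)],
    width=1024, height=512)
print(vrt_xml)
```

The resulting text only points at the source files and describes where their pixels belong, so no pixel data is duplicated. In practice, gdalbuildvrt generates this XML (including the georeferencing) for you.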

2. Python Setup

Overall Setup - Background

Interface

  • JupyterLab: Web-based user interface (IDE for Python, R, IDL, etc.)
  • JupyterHub: A multi-user hub that spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server

Packages / Resources

  • Dask: Parallel Computing Library
  • Xarray: Labelled multi-dimensional array package
  • Numpy: Fundamental package for scientific computing with Python
  • Rasterio: Raster IO and processing.
  • Pangeo: NSF-funded project for Big Data Geoscience. Implements a system very similar to the container used on Ceres (titled: data_science_im_rs)
  • EarthML: Examples of machine learning and visualization for Earth science. Supported and maintained by Anaconda, as a collaboration with the NASA Goddard Space Flight Center.

Cluster

  • Dask Distributed (a Python library for parallel and distributed computing) uses Directed Acyclic Graphs to distribute data and processing across the cluster (very different from an MPI-style cluster). Components of a Dask cluster include:
    • Client
    • Scheduler
    • Workers
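
As a small illustration of the task-graph idea, the sketch below uses dask.delayed to build a graph of dependent tasks and then executes it. The function names load, chunk_sum, and combine are made up for this example. No cluster is started here, so Dask falls back to its local scheduler; with a distributed Client active, the same graph would instead be sent to the cluster's scheduler and run on its workers.

```python
import dask


@dask.delayed
def load(i):
    # Stand-in for reading one chunk of data.
    return list(range(i * 3, i * 3 + 3))


@dask.delayed
def chunk_sum(chunk):
    # Runs once per chunk; the four chunk tasks are independent of each other.
    return sum(chunk)


@dask.delayed
def combine(partials):
    # Depends on every chunk_sum task.
    return sum(partials)


# Building the graph is lazy: nothing has been computed yet.
partials = [chunk_sum(load(i)) for i in range(4)]
total = combine(partials)

# compute() hands the graph to a scheduler, which can run the
# independent tasks in parallel before combining the results.
result = total.compute()
print(result)
```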

Container

Step by Step Instructions

Below are the steps/commands to set up the Python/JupyterLab environment, followed by a soundless video of the process on Ceres. Note that you will need an existing SCINet account. Please visit the SCINet website for detailed instructions on setting up an account.

  1. Access JupyterHub

    Currently, to access JupyterHub, you need to port forward the application to your local system. In the future there will be a public URL, and you will no longer need this step. To port forward JupyterHub, run the following command in PowerShell (Windows) or a terminal (Linux). Note that you will need to replace USER.NAME with your SCINet user name.

    ssh -N -L 8000:jupyterhub.scinet.local:80 USER.NAME@login.scinet.science
    
  2. Open in Browser

    Open a web browser (Firefox, Chrome, Edge, etc.) and go to localhost:8000

  3. Spawn JupyterLab

    Once logged into JupyterHub, you are given a set of options when launching JupyterLab. Below are brief descriptions of each option, followed by the value to use for this tutorial. If this is the first time spawning a notebook from a container on Docker Hub or Singularity Hub (as in this example), it will take 4-10 minutes to download and build the container. The container is then cached in your home directory, so on subsequent tries JupyterLab should spawn in 10-30 seconds.

    Node Type: Ceres partition to use when running JupyterLab
          short OR brief-low (either works)
    Number of Cores: Number of Cores to allocate
          4
    Job Duration: Length of Job (HH:MM:SS)
          00:30:00
    Additional Slurm Options: Sbatch Options
          (leave blank)
    Notebook/Lab Options: Additional JupyterLab or Jupyter Notebook options
          --notebook-dir=/project/geospatial_tutorials/
    Enter the full path to the container image: Location of the container to use
          docker://rowangaffney/data_science_im_rs:latest
    Container Exec Options: Additional options for the singularity exec command
          --bind /etc/munge --bind /var/log/munge --bind /var/run/munge --bind /usr/bin/squeue --bind /usr/bin/sinfo --bind /usr/bin/scancel --bind /usr/bin/sbatch --bind /usr/bin/scontrol --bind /scinet01/gov/usda/ars/scinet/system/slurm:/etc/slurm --bind /run/munge --bind /usr/lib64 --bind /scinet01 --bind $HOME --bind /software/7/apps/envi -H $HOME:/home/jovyan

    Please be cognizant of the compute resources you are requesting (see best practices below).

    Best Practices

    • For short sessions (2 hours or less), please choose the brief-low partition in the "Node Type" drop-down, if available.
    • For serial computing (non-parallel code), enter 2 or 4 for the number of cores.
    • For parallel computing, choose a reasonable number of cores to meet your needs.
    • Choose a reasonable job duration (e.g., do not choose a 48-hour job duration just so you can leave your session open overnight).
    • Remember to stop the Jupyter server when you are done working (File --> Hub Control Panel --> Stop Server).

    Note that the data_science_im_rs container has two main environments (as of March 18, 2020):

    geo (Python geospatial environment), all packages:
    Name Version Build Channel
    _libgcc_mutex 0.1 conda_forge conda-forge
    _openmp_mutex 4.5 1_llvm conda-forge
    _py-xgboost-mutex 2.0 cpu_0 conda-forge
    _r-mutex 1.0.1 anacondar_1 conda-forge
    affine 2.3.0 py_0 conda-forge
    aiohttp 3.6.2 py36h516909a_0 conda-forge
    appdirs 1.4.3 py_1 conda-forge
    arrow-cpp 0.15.1 py36had5782a_4 conda-forge
    asciitree 0.3.3 py_2 conda-forge
    async-timeout 3.0.1 py_1000 conda-forge
    attrs 19.3.0 py_0 conda-forge
    backcall 0.1.0 py_0 conda-forge
    beautifulsoup4 4.8.2 py36_0 conda-forge
    binutils_impl_linux-64 2.33.1 h53a641e_8 conda-forge
    binutils_linux-64 2.33.1 h9595d00_16 conda-forge
    bleach 3.1.1 py_0 conda-forge
    blinker 1.4 py_1 conda-forge
    bokeh 1.4.0 py36_0 conda-forge
    boost-cpp 1.70.0 h8e57a91_2 conda-forge
    boto3 1.12.12 py_0 conda-forge
    botocore 1.15.12 py_0 conda-forge
    bottleneck 1.3.2 py36hc1659b7_0 conda-forge
    brotli 1.0.7 he1b5a44_1000 conda-forge
    bwidget 1.9.14 0 conda-forge
    bzip2 1.0.8 h516909a_2 conda-forge
    c-ares 1.15.0 h516909a_1001 conda-forge
    ca-certificates 2019.11.28 hecc5488_0 conda-forge
    cachetools 3.1.1 py_0 conda-forge
    cairo 1.16.0 hfb77d84_1002 conda-forge
    cartopy 0.17.0 py36h39d8c00_1011 conda-forge
    certifi 2019.11.28 py36_0 conda-forge
    cffi 1.13.2 py36h8022711_0 conda-forge
    cfitsio 3.470 hb60a0a2_2 conda-forge
    cftime 1.0.4.2 py36hc1659b7_0 conda-forge
    chardet 3.0.4 py36_1003 conda-forge
    click 7.0 py_0 conda-forge
    click-plugins 1.1.1 py_0 conda-forge
    cligj 0.5.0 py_0 conda-forge
    cloudpickle 1.3.0 py_0 conda-forge
    colorcet 2.0.1 py_0 conda-forge
    cryptography 2.8 py36h72c5cf5_1 conda-forge
    curl 7.68.0 hf8cf82a_0 conda-forge
    cycler 0.10.0 py_2 conda-forge
    cytoolz 0.10.1 py36h516909a_0 conda-forge
    dask 2.11.0 py_0 conda-forge
    dask-core 2.11.0 py_0 conda-forge
    dask-glm 0.2.0 py_1 conda-forge
    dask-jobqueue 0.7.0 py_0 conda-forge
    dask-labextension 1.1.0 py_0 conda-forge
    dask-ml 1.2.0 py_0 conda-forge
    dask-xgboost 0.1.10 py_0 conda-forge
    datashader 0.10.0 py_0 conda-forge
    datashape 0.5.4 py_1 conda-forge
    dbus 1.13.6 he372182_0 conda-forge
    decorator 4.4.2 py_0 conda-forge
    defusedxml 0.6.0 py_0 conda-forge
    distributed 2.11.0 py36_0 conda-forge
    docopt 0.6.2 py_1 conda-forge
    docutils 0.15.2 py36_0 conda-forge
    double-conversion 3.1.5 he1b5a44_2 conda-forge
    earthengine-api 0.1.213 py_0 conda-forge
    entrypoints 0.3 py36_1000 conda-forge
    et_xmlfile 1.0.1 py_1001 conda-forge
    expat 2.2.9 he1b5a44_2 conda-forge
    fasteners 0.14.1 py_3 conda-forge
    fastparquet 0.3.3 py36hc1659b7_0 conda-forge
    fiona 1.8.13 py36h900e953_0 conda-forge
    fontconfig 2.13.1 h86ecdb6_1001 conda-forge
    freetype 2.10.0 he983fc9_1 conda-forge
    freexl 1.0.5 h14c3975_1002 conda-forge
    fribidi 1.0.5 h516909a_1002 conda-forge
    fsspec 0.6.2 py_0 conda-forge
    future 0.18.2 py36_0 conda-forge
    gcc_impl_linux-64 7.3.0 hd420e75_5 conda-forge
    gcc_linux-64 7.3.0 h553295d_16 conda-forge
    gcsfs 0.6.0 py_0 conda-forge
    gdal 3.0.4 py36hbb6b9fb_1 conda-forge
    geopandas 0.7.0 py_1 conda-forge
    geos 3.8.0 he1b5a44_0 conda-forge
    geotiff 1.5.1 hcbe54f9_9 conda-forge
    geoviews 1.6.6 py_1 conda-forge
    geoviews-core 1.6.6 py_1 conda-forge
    gettext 0.19.8.1 hc5be6a0_1002 conda-forge
    gflags 2.2.2 he1b5a44_1002 conda-forge
    gfortran_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
    gfortran_linux-64 7.3.0 h553295d_16 conda-forge
    giflib 5.2.1 h516909a_2 conda-forge
    git 2.25.0 pl526hce37bd2_0 conda-forge
    glib 2.58.3 py36h6f030ca_1002 conda-forge
    glog 0.4.0 he1b5a44_1 conda-forge
    google-api-core 1.16.0 py36_1 conda-forge
    google-api-python-client 1.7.11 py_0 conda-forge
    google-auth 1.11.2 py_0 conda-forge
    google-auth-httplib2 0.0.3 py_3 conda-forge
    google-auth-oauthlib 0.4.1 py_2 conda-forge
    google-cloud-core 1.3.0 py_0 conda-forge
    google-cloud-storage 1.26.0 py_0 conda-forge
    google-resumable-media 0.5.0 py_1 conda-forge
    googleapis-common-protos 1.51.0 py36_1 conda-forge
    graphite2 1.3.13 hf484d3e_1000 conda-forge
    graphviz 2.42.3 h0511662_0 conda-forge
    grpc-cpp 1.25.0 h5321d42_1 conda-forge
    gsl 2.6 h294904e_0 conda-forge
    gst-plugins-base 1.14.5 h0935bb2_2 conda-forge
    gstreamer 1.14.5 h36ae1b5_2 conda-forge
    gxx_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
    gxx_linux-64 7.3.0 h553295d_16 conda-forge
    h5netcdf 0.8.0 py_0 conda-forge
    h5py 2.10.0 nompi_py36h513d04c_102 conda-forge
    harfbuzz 2.4.0 h9f30f68_3 conda-forge
    hdf4 4.2.13 hf30be14_1003 conda-forge
    hdf5 1.10.5 nompi_h3c11f04_1104 conda-forge
    heapdict 1.0.1 py_0 conda-forge
    holoviews 1.12.7 py_0 conda-forge
    httplib2 0.17.0 py36_0 conda-forge
    hvplot 0.5.2 py_0 conda-forge
    icu 64.2 he1b5a44_1 conda-forge
    idna 2.9 py_1 conda-forge
    idna_ssl 1.1.0 py36_1000 conda-forge
    imageio 2.8.0 py_0 conda-forge
    importlib_metadata 1.5.0 py36_0 conda-forge
    intake 0.5.4 py_0 conda-forge
    intake-esm 2019.12.13 py_0 conda-forge
    intake-parquet 0.2.3 py_0 conda-forge
    intake-sql 0.2.0 py_0 conda-forge
    intake-stac 0.2.3 py_0 conda-forge
    intake-xarray 0.3.1 py_0 conda-forge
    ipykernel 5.1.4 py36h5ca1d4c_0 conda-forge
    ipython 7.13.0 py36h5ca1d4c_0 conda-forge
    ipython_genutils 0.2.0 py_1 conda-forge
    ipywidgets 7.5.1 py_0 conda-forge
    jdcal 1.4.1 py_0 conda-forge
    jedi 0.16.0 py36_0 conda-forge
    jinja2 2.11.1 py_0 conda-forge
    jmespath 0.9.4 py_0 conda-forge
    joblib 0.14.1 py_0 conda-forge
    jpeg 9c h14c3975_1001 conda-forge
    json-c 0.13.1 h14c3975_1001 conda-forge
    jsonschema 3.2.0 py36_0 conda-forge
    jupyter 1.0.0 py_2 conda-forge
    jupyter-server-proxy 1.2.0 py_0 conda-forge
    jupyter_client 6.0.0 py_0 conda-forge
    jupyter_console 6.0.0 py_0 conda-forge
    jupyter_core 4.6.3 py36_0 conda-forge
    kealib 1.4.10 h58c409b_1005 conda-forge
    kiwisolver 1.1.0 py36hc9558a2_0 conda-forge
    krb5 1.16.4 h2fd8d38_0 conda-forge
    ld_impl_linux-64 2.33.1 h53a641e_8 conda-forge
    libblas 3.8.0 15_openblas conda-forge
    libcblas 3.8.0 15_openblas conda-forge
    libclang 9.0.1 default_hde54327_0 conda-forge
    libcurl 7.68.0 hda55be3_0 conda-forge
    libdap4 3.20.4 hd3bb157_0 conda-forge
    libedit 3.1.20170329 hf8c457e_1001 conda-forge
    libevent 2.1.10 h72c5cf5_0 conda-forge
    libffi 3.2.1 he1b5a44_1006 conda-forge
    libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
    libgdal 3.0.4 h022d3c0_1 conda-forge
    libgfortran-ng 7.3.0 hdf63c60_5 conda-forge
    libgomp 9.2.0 h24d8f2e_2 conda-forge
    libiconv 1.15 h516909a_1005 conda-forge
    libkml 1.3.0 h4fcabce_1010 conda-forge
    liblapack 3.8.0 15_openblas conda-forge
    libllvm8 8.0.1 hc9558a2_0 conda-forge
    libllvm9 9.0.1 hc9558a2_0 conda-forge
    libnetcdf 4.7.3 nompi_h9f9fd6a_101 conda-forge
    libopenblas 0.3.8 h5ec1e0e_0 conda-forge
    libpng 1.6.37 hed695b0_0 conda-forge
    libpq 12.2 hae5116b_0 conda-forge
    libprotobuf 3.10.1 h8b12597_0 conda-forge
    libsodium 1.0.17 h516909a_0 conda-forge
    libspatialindex 1.9.3 he1b5a44_3 conda-forge
    libspatialite 4.3.0a hd318ce7_1035 conda-forge
    libssh2 1.8.2 h22169c7_2 conda-forge
    libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
    libtiff 4.1.0 hc3755c2_3 conda-forge
    libtool 2.4.6 h14c3975_1002 conda-forge
    libuuid 2.32.1 h14c3975_1000 conda-forge
    libwebp 1.0.2 h56121f0_5 conda-forge
    libxcb 1.13 h14c3975_1002 conda-forge
    libxgboost 0.90 he1b5a44_4 conda-forge
    libxkbcommon 0.10.0 he1b5a44_0 conda-forge
    libxml2 2.9.10 hee79883_0 conda-forge
    libxslt 1.1.33 h31b3aaa_0 conda-forge
    llvm-openmp 9.0.1 hc9558a2_2 conda-forge
    llvmlite 0.31.0 py36h8b12597_0 conda-forge
    locket 0.2.0 py_2 conda-forge
    lxml 4.5.0 py36h7ec2d77_0 conda-forge
    lz4-c 1.8.3 he1b5a44_1001 conda-forge
    make 4.3 h516909a_0 conda-forge
    markdown 3.2.1 py_0 conda-forge
    markupsafe 1.1.1 py36h516909a_0 conda-forge
    matplotlib-base 3.1.3 py36h250f245_0 conda-forge
    mechanicalsoup 0.12.0 py_0 conda-forge
    metpy 0.12.0 py_0 conda-forge
    mistune 0.8.4 py36h516909a_1000 conda-forge
    monotonic 1.5 py_0 conda-forge
    mpi 1.0 openmpi conda-forge
    mpi4py 3.0.3 py36h0299808_0 conda-forge
    msgpack-numpy 0.4.4.3 py_0 conda-forge
    msgpack-python 1.0.0 py36hc9558a2_0 conda-forge
    multidict 4.7.5 py36h516909a_0 conda-forge
    multipledispatch 0.6.0 py_0 conda-forge
    munch 2.5.0 py_0 conda-forge
    nbconvert 5.6.1 py36_0 conda-forge
    nbformat 5.0.4 py_0 conda-forge
    ncurses 6.1 hf484d3e_1002 conda-forge
    netcdf4 1.5.3 nompi_py36hd35fb8e_102 conda-forge
    networkx 2.4 py_0 conda-forge
    notebook 6.0.3 py36_0 conda-forge
    nspr 4.25 he1b5a44_0 conda-forge
    nss 3.47 he751ad9_0 conda-forge
    numba 0.48.0 py36hb3f55d8_0 conda-forge
    numcodecs 0.6.4 py36he1b5a44_0 conda-forge
    numpy 1.18.1 py36h95a1406_0 conda-forge
    oauth2client 4.1.3 py_0 conda-forge
    oauthlib 3.0.1 py_0 conda-forge
    olefile 0.46 py_0 conda-forge
    openjpeg 2.3.1 h981e76c_3 conda-forge
    openmpi 4.0.2 hdf1f1ad_3 conda-forge
    openpyxl 3.0.3 py_0 conda-forge
    openssl 1.1.1d h516909a_0 conda-forge
    owslib 0.19.1 py_0 conda-forge
    packaging 20.1 py_0 conda-forge
    pandas 1.0.1 py36hb3f55d8_0 conda-forge
    pandoc 2.9.2 0 conda-forge
    pandocfilters 1.4.2 py_1 conda-forge
    panel 0.8.0 0 conda-forge
    pango 1.42.4 ha030887_1 conda-forge
    param 1.9.3 py_0 conda-forge
    parquet-cpp 1.5.1 2 conda-forge
    parso 0.6.2 py_0 conda-forge
    partd 1.1.0 py_0 conda-forge
    pcre 8.44 he1b5a44_0 conda-forge
    perl 5.26.2 h516909a_1006 conda-forge
    pexpect 4.8.0 py36_0 conda-forge
    phantomjs 2.1.1 1 conda-forge
    pickleshare 0.7.5 py36_1000 conda-forge
    pillow 7.0.0 py36hefe7db6_0 conda-forge
    pint 0.11 py_1 conda-forge
    pip 20.0.2 py_2 conda-forge
    pixman 0.38.0 h516909a_1003 conda-forge
    pooch 1.0.0 py_0 conda-forge
    poppler 0.67.0 h14e79db_8 conda-forge
    poppler-data 0.4.9 1 conda-forge
    postgresql 12.2 hf1211e9_0 conda-forge
    proj 6.3.1 hc80f0dc_1 conda-forge
    prometheus_client 0.7.1 py_0 conda-forge
    prompt_toolkit 2.0.10 py_0 conda-forge
    protobuf 3.4.1 py36_0 conda-forge
    psutil 5.7.0 py36h516909a_0 conda-forge
    psycopg2 2.8.4 py36h72c5cf5_1 conda-forge
    pthread-stubs 0.4 h14c3975_1001 conda-forge
    ptyprocess 0.6.0 py_1001 conda-forge
    py-xgboost 0.90 py36_4 conda-forge
    pyarrow 0.15.1 py36h8b68381_1 conda-forge
    pyasn1 0.4.8 py_0 conda-forge
    pyasn1-modules 0.2.7 py_0 conda-forge
    pycparser 2.19 py_2 conda-forge
    pyct 0.4.6 py_0 conda-forge
    pyct-core 0.4.6 py_0 conda-forge
    pydap 3.2.2 py36_1000 conda-forge
    pydrive 1.3.1 py_1 conda-forge
    pyepsg 0.4.0 py_0 conda-forge
    pygments 2.5.2 py_0 conda-forge
    pyhdf 0.10.2 py36h3a4e923_0 conda-forge
    pyjwt 1.7.1 py_0 conda-forge
    pykdtree 1.3.1 py36hc1659b7_1002 conda-forge
    pyopenssl 19.1.0 py_1 conda-forge
    pyparsing 2.4.6 py_0 conda-forge
    pyproj 2.5.0 py36he3cd046_1 conda-forge
    pyqt 5.12.3 py36hcca6a23_1 conda-forge
    pyqt5-sip 4.19.18 pypi_0 pypi
    pyqtwebengine 5.12.1 pypi_0 pypi
    pyrsistent 0.15.7 py36h516909a_0 conda-forge
    pysal 1.14.4 py36_0 conda-forge
    pyshp 2.1.0 py_0 conda-forge
    pysocks 1.7.1 py36_0 conda-forge
    python 3.6.7 h357f687_1006 conda-forge
    python-dateutil 2.7.5 py_0 conda-forge
    python-graphviz 0.13.2 py_0 conda-forge
    python-snappy 0.5.4 py36hee44bf9_1 conda-forge
    pytz 2019.3 py_0 conda-forge
    pyviz_comms 0.7.3 py_0 conda-forge
    pywavelets 1.1.1 py36hc1659b7_0 conda-forge
    pyyaml 5.3 py36h516909a_0 conda-forge
    pyzmq 19.0.0 py36h1768529_0 conda-forge
    qt 5.12.5 hd8c4c69_1 conda-forge
    qtconsole 4.7.1 py_0 conda-forge
    qtpy 1.9.0 py_0 conda-forge
    r 3.6 r36_1003 conda-forge
    r-base 3.6.2 h7ed4ef7_1 conda-forge
    r-boot 1.3_24 r36h6115d3f_0 conda-forge
    r-class 7.3_15 r36hcdcec82_1001 conda-forge
    r-cluster 2.1.0 r36h9bbef5b_2 conda-forge
    r-codetools 0.2_16 r36h6115d3f_1001 conda-forge
    r-foreign 0.8_76 r36hcdcec82_0 conda-forge
    r-kernsmooth 2.23_16 r36hfa343cc_1 conda-forge
    r-lattice 0.20_40 r36hcdcec82_0 conda-forge
    r-mass 7.3_51.5 r36hcdcec82_0 conda-forge
    r-matrix 1.2_18 r36h7fa42b6_2 conda-forge
    r-mgcv 1.8_31 r36h7fa42b6_0 conda-forge
    r-nlme 3.1_144 r36h9bbef5b_0 conda-forge
    r-nnet 7.3_13 r36hcdcec82_0 conda-forge
    r-recommended 3.6 r36_1003 conda-forge
    r-rpart 4.1_15 r36hcdcec82_1 conda-forge
    r-spatial 7.3_11 r36hcdcec82_1003 conda-forge
    r-survival 3.1_8 r36hcdcec82_0 conda-forge
    rasterio 1.1.3 py36h900e953_0 conda-forge
    rasterstats 0.14.0 py_0 conda-forge
    re2 2020.03.03 he1b5a44_0 conda-forge
    readline 8.0 hf8c457e_0 conda-forge
    requests 2.23.0 py36_0 conda-forge
    requests-oauthlib 1.2.0 py_0 conda-forge
    rioxarray 0.0.21 py_0 conda-forge
    rpy2 3.1.0 py36r36hc1659b7_3 conda-forge
    rsa 4.0 py_0 conda-forge
    rtree 0.9.4 py36h7b0cdae_0 conda-forge
    ruamel.yaml 0.16.6 py36h516909a_0 conda-forge
    ruamel.yaml.clib 0.2.0 py36h516909a_0 conda-forge
    s3fs 0.4.0 py_0 conda-forge
    s3transfer 0.3.3 py36_0 conda-forge
    sat-stac 0.3.3 py_0 conda-forge
    scikit-image 0.16.2 py36hb3f55d8_0 conda-forge
    scikit-learn 0.22.1 py36hcdab131_1 conda-forge
    scipy 1.4.1 py36h921218d_0 conda-forge
    sed 4.7 h1bed415_1000 conda-forge
    selenium 3.141.0 py36h516909a_1000 conda-forge
    send2trash 1.5.0 py_0 conda-forge
    setuptools 45.2.0 py36_0 conda-forge
    shapely 1.7.0 py36h5d51c17_0 conda-forge
    simpervisor 0.3 py_1 conda-forge
    simplegeneric 0.8.1 py_1 conda-forge
    simplejson 3.17.0 py36h516909a_0 conda-forge
    six 1.14.0 py36_0 conda-forge
    snappy 1.1.8 he1b5a44_1 conda-forge
    snuggs 1.4.7 py_0 conda-forge
    sortedcontainers 2.1.0 py_0 conda-forge
    soupsieve 1.9.4 py36_0 conda-forge
    spectral 0.20 py_0 conda-forge
    sqlalchemy 1.3.13 py36h516909a_0 conda-forge
    sqlite 3.30.1 hcee41ef_0 conda-forge
    streamz 0.5.2 py_0 conda-forge
    tbb 2018.0.5 h2d50403_0 conda-forge
    tblib 1.6.0 py_0 conda-forge
    terminado 0.8.3 py36_0 conda-forge
    testpath 0.4.4 py_0 conda-forge
    threddsclient 0.4.2 py_0 conda-forge
    thrift 0.11.0 py36he1b5a44_1001 conda-forge
    thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
    tiledb 1.7.0 hcde45ca_2 conda-forge
    tk 8.6.10 hed695b0_0 conda-forge
    tktable 2.10 h555a92e_3 conda-forge
    toolz 0.10.0 py_0 conda-forge
    tornado 6.0.3 py36h516909a_4 conda-forge
    tqdm 4.43.0 py_0 conda-forge
    traitlets 4.3.3 py36_0 conda-forge
    typing_extensions 3.7.4.1 py36_0 conda-forge
    tzcode 2019a h516909a_1002 conda-forge
    tzlocal 2.0.0 py_0 conda-forge
    uriparser 0.9.3 he1b5a44_1 conda-forge
    uritemplate 3.0.1 py_0 conda-forge
    urllib3 1.25.7 py36_0 conda-forge
    wcwidth 0.1.8 py_0 conda-forge
    webencodings 0.5.1 py_1 conda-forge
    webob 1.8.6 py_0 conda-forge
    wheel 0.34.2 py_1 conda-forge
    widgetsnbextension 3.5.1 py36_0 conda-forge
    xarray 0.15.0 py_0 conda-forge
    xerces-c 3.2.2 h8412b87_1004 conda-forge
    xgboost 0.90 py36he1b5a44_4 conda-forge
    xgeo 1.0 py_0 conda-forge
    xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
    xorg-libice 1.0.10 h516909a_0 conda-forge
    xorg-libsm 1.2.3 h84519dc_1000 conda-forge
    xorg-libx11 1.6.9 h516909a_0 conda-forge
    xorg-libxau 1.0.9 h14c3975_0 conda-forge
    xorg-libxdmcp 1.1.3 h516909a_0 conda-forge
    xorg-libxext 1.3.4 h516909a_0 conda-forge
    xorg-libxpm 3.5.13 h516909a_0 conda-forge
    xorg-libxrender 0.9.10 h516909a_1002 conda-forge
    xorg-libxt 1.1.5 h516909a_1003 conda-forge
    xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
    xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
    xorg-xproto 7.0.31 h14c3975_1007 conda-forge
    xrviz 0.1.4 py_1 conda-forge
    xz 5.2.4 h14c3975_1001 conda-forge
    yaml 0.2.2 h516909a_1 conda-forge
    yarl 1.3.0 py36h516909a_1000 conda-forge
    zarr 2.4.0 py_0 conda-forge
    zeromq 4.3.2 he1b5a44_2 conda-forge
    zict 2.0.0 py_0 conda-forge
    zipp 3.1.0 py_0 conda-forge
    zlib 1.2.11 h516909a_1006 conda-forge
    zstd 1.4.4 h3b9ef0a_1 conda-forge

    r_geo (R geospatial environment), all packages:

    Name Version Build Channel
    _libgcc_mutex 0.1 conda_forge conda-forge
    _openmp_mutex 4.5 1_llvm conda-forge
    _r-mutex 1.0.1 anacondar_1 conda-forge
    binutils_impl_linux-64 2.33.1 h53a641e_8 conda-forge
    binutils_linux-64 2.33.1 h9595d00_16 conda-forge
    boost-cpp 1.70.0 h8e57a91_2 conda-forge
    bwidget 1.9.14 0 conda-forge
    bzip2 1.0.8 h516909a_2 conda-forge
    ca-certificates 2019.11.28 hecc5488_0 conda-forge
    cairo 1.16.0 hfb77d84_1002 conda-forge
    cfitsio 3.470 hb60a0a2_2 conda-forge
    curl 7.68.0 hf8cf82a_0 conda-forge
    expat 2.2.9 he1b5a44_2 conda-forge
    fontconfig 2.13.1 h86ecdb6_1001 conda-forge
    freetype 2.10.0 he983fc9_1 conda-forge
    freexl 1.0.5 h14c3975_1002 conda-forge
    fribidi 1.0.5 h516909a_1002 conda-forge
    gcc_impl_linux-64 7.3.0 hd420e75_5 conda-forge
    gcc_linux-64 7.3.0 h553295d_16 conda-forge
    geos 3.7.2 he1b5a44_2 conda-forge
    geotiff 1.5.1 hcd53e25_3 conda-forge
    gettext 0.19.8.1 hc5be6a0_1002 conda-forge
    gfortran_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
    gfortran_linux-64 7.3.0 h553295d_16 conda-forge
    giflib 5.1.7 h516909a_1 conda-forge
    glib 2.58.3 h6f030ca_1002 conda-forge
    graphite2 1.3.13 hf484d3e_1000 conda-forge
    gsl 2.6 h294904e_0 conda-forge
    gxx_impl_linux-64 7.3.0 hdf63c60_5 conda-forge
    gxx_linux-64 7.3.0 h553295d_16 conda-forge
    harfbuzz 2.4.0 h9f30f68_3 conda-forge
    hdf4 4.2.13 hf30be14_1003 conda-forge
    hdf5 1.10.5 nompi_h3c11f04_1104 conda-forge
    icu 64.2 he1b5a44_1 conda-forge
    jpeg 9c h14c3975_1001 conda-forge
    json-c 0.13.1 h14c3975_1001 conda-forge
    kealib 1.4.10 h58c409b_1005 conda-forge
    krb5 1.16.4 h2fd8d38_0 conda-forge
    ld_impl_linux-64 2.33.1 h53a641e_8 conda-forge
    libblas 3.8.0 15_openblas conda-forge
    libcblas 3.8.0 15_openblas conda-forge
    libcurl 7.68.0 hda55be3_0 conda-forge
    libdap4 3.20.4 hd3bb157_0 conda-forge
    libedit 3.1.20170329 hf8c457e_1001 conda-forge
    libffi 3.2.1 he1b5a44_1006 conda-forge
    libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
    libgdal 3.0.1 hf47eb90_8 conda-forge
    libgfortran-ng 7.3.0 hdf63c60_5 conda-forge
    libgomp 9.2.0 h24d8f2e_2 conda-forge
    libiconv 1.15 h516909a_1005 conda-forge
    libkml 1.3.0 h4fcabce_1010 conda-forge
    liblapack 3.8.0 15_openblas conda-forge
    libnetcdf 4.6.2 h303dfb8_1003 conda-forge
    libopenblas 0.3.8 h5ec1e0e_0 conda-forge
    libpng 1.6.37 hed695b0_0 conda-forge
    libpq 11.5 hd9ab2ff_2 conda-forge
    libsodium 1.0.17 h516909a_0 conda-forge
    libspatialite 4.3.0a h57ae47a_1030 conda-forge
    libssh2 1.8.2 h22169c7_2 conda-forge
    libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
    libtiff 4.1.0 hc3755c2_3 conda-forge
    libuuid 2.32.1 h14c3975_1000 conda-forge
    libxcb 1.13 h14c3975_1002 conda-forge
    libxml2 2.9.10 hee79883_0 conda-forge
    llvm-openmp 9.0.1 hc9558a2_2 conda-forge
    lz4-c 1.8.3 he1b5a44_1001 conda-forge
    make 4.3 h516909a_0 conda-forge
    ncurses 6.1 hf484d3e_1002 conda-forge
    openjpeg 2.3.1 h981e76c_3 conda-forge
    openssl 1.1.1d h516909a_0 conda-forge
    pango 1.42.4 ha030887_1 conda-forge
    pcre 8.44 he1b5a44_0 conda-forge
    pixman 0.38.0 h516909a_1003 conda-forge
    poppler 0.67.0 h14e79db_8 conda-forge
    poppler-data 0.4.9 1 conda-forge
    postgresql 11.5 hc63931a_2 conda-forge
    proj4 6.1.1 hc80f0dc_1 conda-forge
    pthread-stubs 0.4 h14c3975_1001 conda-forge
    r-assertthat 0.2.1 r36h6115d3f_1 conda-forge
    r-backports 1.1.5 r36hcdcec82_0 conda-forge
    r-base 3.6.2 h7ed4ef7_1 conda-forge
    r-base64enc 0.1_3 r36hcdcec82_1003 conda-forge
    r-class 7.3_15 r36hcdcec82_1001 conda-forge
    r-classint 0.4_2 r36h9bbef5b_0 conda-forge
    r-cli 2.0.2 r36h6115d3f_0 conda-forge
    r-codetools 0.2_16 r36h6115d3f_1001 conda-forge
    r-crayon 1.3.4 r36h6115d3f_1002 conda-forge
    r-dbi 1.1.0 r36h6115d3f_0 conda-forge
    r-digest 0.6.25 r36h0357c0b_0 conda-forge
    r-e1071 1.7_3 r36h0357c0b_0 conda-forge
    r-ellipsis 0.3.0 r36hcdcec82_0 conda-forge
    r-evaluate 0.14 r36h6115d3f_1 conda-forge
    r-fansi 0.4.1 r36hcdcec82_0 conda-forge
    r-fastmap 1.0.1 r36h0357c0b_0 conda-forge
    r-fnn 1.1.3 r36h0357c0b_1 conda-forge
    r-foreach 1.4.8 r36h6115d3f_0 conda-forge
    r-foreign 0.8_76 r36hcdcec82_0 conda-forge
    r-gdalutils 2.0.3.2 r36h6115d3f_0 conda-forge
    r-glue 1.3.1 r36hcdcec82_1 conda-forge
    r-gstat 2.0_4 r36hcdcec82_0 conda-forge
    r-htmltools 0.4.0 r36h0357c0b_0 conda-forge
    r-httpuv 1.5.2 r36h0357c0b_1 conda-forge
    r-intervals 0.15.1 r36h0357c0b_1003 conda-forge
    r-irdisplay 0.7 r36_1001 conda-forge
    r-irkernel 1.1 r36h6115d3f_0 conda-forge
    r-iterators 1.0.12 r36h6115d3f_0 conda-forge
    r-jsonlite 1.6.1 r36hcdcec82_0 conda-forge
    r-kernsmooth 2.23_16 r36hfa343cc_1 conda-forge
    r-later 1.0.0 r36h0357c0b_0 conda-forge
    r-lattice 0.20_40 r36hcdcec82_0 conda-forge
    r-magrittr 1.5 r36h6115d3f_1002 conda-forge
    r-maptools 0.9_9 r36hcdcec82_0 conda-forge
    r-mass 7.3_51.5 r36hcdcec82_0 conda-forge
    r-matrix 1.2_18 r36h7fa42b6_2 conda-forge
    r-mime 0.9 r36hcdcec82_0 conda-forge
    r-pbdzmq 0.3_3 r36h559a7a4_1002 conda-forge
    r-pillar 1.4.3 r36h6115d3f_0 conda-forge
    r-promises 1.1.0 r36h0357c0b_0 conda-forge
    r-r.methodss3 1.8.0 r36h6115d3f_0 conda-forge
    r-r.oo 1.23.0 r36h6115d3f_0 conda-forge
    r-r.utils 2.9.2 r36h6115d3f_0 conda-forge
    r-r6 2.4.1 r36h6115d3f_0 conda-forge
    r-rappdirs 0.3.1 r36hcdcec82_1003 conda-forge
    r-raster 3.0_12 r36h0357c0b_0 conda-forge
    r-rcpp 1.0.3 r36h0357c0b_0 conda-forge
    r-repr 1.1.0 r36h6115d3f_0 conda-forge
    r-reticulate 1.14 r36h0357c0b_0 conda-forge
    r-rgdal 1.4_7 r36h33584d0_0 conda-forge
    r-rgeos 0.5_2 r36h05224b2_0 conda-forge
    r-rlang 0.4.5 r36hcdcec82_0 conda-forge
    r-sf 0.8_0 r36h33584d0_0 conda-forge
    r-shiny 1.4.0 r36h6115d3f_0 conda-forge
    r-snow 0.4_3 r36h6115d3f_1001 conda-forge
    r-sourcetools 0.1.7 r36he1b5a44_1001 conda-forge
    r-sp 1.4_1 r36hcdcec82_0 conda-forge
    r-spacetime 1.2_3 r36h6115d3f_0 conda-forge
    r-units 0.6_5 r36h0357c0b_0 conda-forge
    r-utf8 1.1.4 r36hcdcec82_1001 conda-forge
    r-uuid 0.1_4 r36hcdcec82_0 conda-forge
    r-vctrs 0.2.3 r36hcdcec82_0 conda-forge
    r-xtable 1.8_4 r36h6115d3f_2 conda-forge
    r-xts 0.12_0 r36hcdcec82_0 conda-forge
    r-zeallot 0.1.0 r36h6115d3f_1001 conda-forge
    r-zoo 1.8_7 r36hcdcec82_0 conda-forge
    readline 8.0 hf8c457e_0 conda-forge
    sed 4.7 h1bed415_1000 conda-forge
    sqlite 3.30.1 hcee41ef_0 conda-forge
    tbb 2018.0.5 h2d50403_0 conda-forge
    tiledb 1.6.2 hcde45ca_3 conda-forge
    tk 8.6.10 hed695b0_0 conda-forge
    tktable 2.10 h555a92e_3 conda-forge
    tzcode 2019a h516909a_1002 conda-forge
    udunits2 2.2.27.6 h4e0c4b3_1001 conda-forge
    xerces-c 3.2.2 h8412b87_1004 conda-forge
    xorg-kbproto 1.0.7 h14c3975_1002 conda-forge
    xorg-libice 1.0.10 h516909a_0 conda-forge
    xorg-libsm 1.2.3 h84519dc_1000 conda-forge
    xorg-libx11 1.6.9 h516909a_0 conda-forge
    xorg-libxau 1.0.9 h14c3975_0 conda-forge
    xorg-libxdmcp 1.1.3 h516909a_0 conda-forge
    xorg-libxext 1.3.4 h516909a_0 conda-forge
    xorg-libxrender 0.9.10 h516909a_1002 conda-forge
    xorg-renderproto 0.11.1 h14c3975_1002 conda-forge
    xorg-xextproto 7.3.0 h14c3975_1002 conda-forge
    xorg-xproto 7.0.31 h14c3975_1007 conda-forge
    xz 5.2.4 h14c3975_1001 conda-forge
    zeromq 4.3.2 he1b5a44_2 conda-forge
    zlib 1.2.11 h516909a_1006 conda-forge
    zstd 1.4.4 h3b9ef0a_1 conda-forge

    Furthermore, you can launch RStudio (which uses the r_geo environment), a terminal, a help window, a markdown file, a text file, and an IDL kernel. To access the IDL library, you need to check out the license from the SCINet license server and bind-mount it properly (this is not shown in this example).

The following silent video is a media alternative for the text in steps 1-3 in the "Python Setup" Section above.
Link To Video

3. Cluster Setup

Overall Setup - Background

  • Uses the Dask Jobqueue library to submit jobs to SLURM. Each SLURM job runs X Python workers.

  • Scales across nodes and partitions.

  • Number of workers can be scaled up or down dynamically.

  • Subject to SLURM resource allocation.

  • JupyterLab has a Dask add-on to monitor the cluster.

  • Dask includes DataFrame (i.e., Pandas) and Array (i.e., NumPy) equivalents.

  • Dask is used by Xarray, a geospatial/multidimensional data package.
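
To show what "Array equivalent" means in practice, here is a minimal sketch that wraps a NumPy array in a chunked Dask array and runs a NumPy-style calculation on it. The array contents and the chunk size are arbitrary choices for illustration. No cluster is needed for this snippet; with a Client connected, the same code would run on the cluster's workers.

```python
import numpy as np
import dask.array as da

# An ordinary in-memory NumPy array (illustrative data).
data = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

# Split it into 250 x 250 chunks; each chunk is itself a NumPy array.
lazy = da.from_array(data, chunks=(250, 250))

# Same syntax as NumPy, but this only builds a task graph.
lazy_mean = (lazy * 2).mean(axis=0)

# compute() executes the graph, processing chunks in parallel where possible.
result = lazy_mean.compute()
print(lazy.numblocks, result.shape)
```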

Step by Step Instructions

Below are the steps/commands to set up the cluster. Below these steps is a GIF of the process on Ceres.

  1. Load Relevant Libraries
In [2]:
import os
import time
import dask_jobqueue as jq
from dask.distributed import Client,wait
import dask.array as da
  2. Set Up the Client

You need to specify:

  • Partition: You may want to change the partition (short, mem, brief-low, etc.) to whichever is available.
  • Location of Singularity Image/Container
  • SLURM job and python worker structure. In this example, for each SLURM JOB there are:
    • 2 Python workers (i.e. processes)
    • 6 cores per Python worker
    • 3.2 GB per core
    • The SLURM job will last 2 hours (wall time)
    • The SLURM job will be run on the short and brief-low partitions
    • Dask will launch using the docker://rowangaffney/data_science_im_rs:latest image and the geo environment
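
The per-job settings listed above determine how much memory and how many cores each SLURM job requests. The sketch below works through that arithmetic; `job_resources` is a hypothetical helper written for this illustration and is not part of Dask or dask_jobqueue.

```python
def job_resources(num_processes, threads_per_process, gb_per_core, num_jobs=1):
    """Summarize what num_jobs SLURM jobs request in total (hypothetical helper)."""
    cores_per_job = num_processes * threads_per_process
    mem_per_job_gb = gb_per_core * cores_per_job
    return {
        "workers": num_jobs * num_processes,
        "cores": num_jobs * cores_per_job,
        # Rounded to hide floating point noise in the product.
        "memory_gb": round(num_jobs * mem_per_job_gb, 2),
    }

# One SLURM job: 2 workers, 6 cores per worker, 3.2 GB per core.
per_job = job_resources(2, 6, 3.2)
print(per_job)

# Twelve such jobs, the number requested later in this section.
twelve_jobs = job_resources(2, 6, 3.2, num_jobs=12)
print(twelve_jobs)
```

With these settings each job requests 12 cores and 38.4 GB, and twelve jobs give 24 workers, 144 cores, and 460.8 GB, which matches the cluster summary shown after scaling.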
In [3]:
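partition='short,brief-low'  # comma-separated SLURM partitions to submit to
container_url = 'docker://rowangaffney/data_science_im_rs:latest'  # Singularity image source
conda_env = 'geo'  # conda environment inside the container
num_processes = 2  # Python workers per SLURM job
num_threads_per_processes = 6  # cores per Python worker
mem = 3.2*num_processes*num_threads_per_processes  # GB per SLURM job (3.2 GB per core)
n_cores_per_job = num_processes*num_threads_per_processes  # total cores per SLURM job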

clust = jq.SLURMCluster(queue=partition,
                        processes=num_processes,
                        cores=n_cores_per_job,
                        memory=str(mem)+'GB',
                        interface='ib0',
                        local_directory='$TMPDIR',
                        tmpdir_ssh='/project/cper_neon_aop/neon_2017/analysis/prepocessing/',
                        death_timeout=30,
                        python="singularity -vv exec --bind /usr/lib64 --bind /scinet01 --bind /software/7/apps/envi/bin/ {} /opt/conda/envs/{}/bin/python".format(container_url,conda_env),
                        walltime='02:00:00',
                        job_extra=["--output=/dev/null","--error=/dev/null"])
cl=Client(clust)
dash_addr = '''/user/{}/proxy/{}/status'''.format(os.environ['USER'],cl.scheduler_info()['services']['dashboard'])
print('Dask Lab Extension Address (paste into the dask search box): '+dash_addr)
cl
Dask Lab Extension Address (paste into the dask search box): /user/rowan.gaffney/proxy/8787/status
Out[3]:

Client

Cluster

  • Workers: 0
  • Cores: 0
  • Memory: 0 B
In [4]:
num_jobs=12
clust.scale(n=num_jobs*num_processes)
while (((cl.status == "running") and (len(cl.scheduler_info()["workers"]) < num_jobs*num_processes))):
    time.sleep(.1)
cl
Out[4]:

Client

Cluster

  • Workers: 24
  • Cores: 144
  • Memory: 460.80 GB
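
The polling loop in the cell above waits indefinitely, so it would hang if SLURM never grants the requested jobs. One possible safeguard is to poll with a timeout, sketched below. `wait_until` is a hypothetical helper using only the standard library, and the worker arrival is simulated so the example runs without a cluster. Recent versions of dask.distributed also provide `Client.wait_for_workers`, which may be preferable if it is available in the installed version.

```python
import time

def wait_until(predicate, timeout=300.0, poll_interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval)
    return bool(predicate())  # one last check at the deadline

# Stand-in for the scheduler: pretend one more worker has arrived on each poll.
arrived = []

def enough_workers(target=3):
    arrived.append(1)
    return len(arrived) >= target

ok = wait_until(enough_workers, timeout=5.0, poll_interval=0.01)
print(ok, len(arrived))

# On the cluster, the predicate could be (names taken from the cells above):
#   lambda: len(cl.scheduler_info()["workers"]) >= num_jobs * num_processes
```

If the helper returns False, the notebook can report that the partition is busy instead of blocking the kernel.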

A few quick examples:

  1. 60 GB data: Calculate the mean without holding the data in memory
  2. 600 GB data: Calculate the mean without holding the data in memory
  3. 60 GB data: Persist the data to memory and calculate the mean
In [5]:
t = da.random.random((10000,7500,100),chunks=(400,400,-1))
t
Out[5]:
Array Chunk
Bytes 60.00 GB 128.00 MB
Shape (10000, 7500, 100) (400, 400, 100)
Count 475 Tasks 475 Chunks
Type float64 numpy.ndarray
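
The byte and chunk counts in the summary above follow directly from the array shape, the chunk shape, and the 8-byte float64 item size. The sketch below reproduces them with plain Python; `chunk_summary` is a hypothetical helper written for this illustration, not a Dask function.

```python
import math

def chunk_summary(shape, chunks, itemsize=8):
    """Return (number of chunks, total bytes, bytes per full-size chunk)."""
    # As in Dask, -1 means a single chunk spanning the whole axis.
    chunk_shape = tuple(s if c == -1 else min(c, s) for s, c in zip(shape, chunks))
    # Chunks along an axis are rounded up, so edge chunks may be smaller.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
    total_bytes = math.prod(shape) * itemsize
    chunk_bytes = math.prod(chunk_shape) * itemsize
    return n_chunks, total_bytes, chunk_bytes

n_chunks, total_bytes, chunk_bytes = chunk_summary((10000, 7500, 100), (400, 400, -1))
print(n_chunks)            # 25 * 19 * 1 chunks
print(total_bytes / 1e9)   # total size in GB
print(chunk_bytes / 1e6)   # full-size chunk in MB
```

This gives 475 chunks, 60 GB in total, and 128 MB per full-size chunk. Chunks of roughly this size are small enough to fit comfortably within a worker's memory while keeping the number of tasks manageable.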
In [6]:
t2 = t.mean()
t2
Out[6]:
Array Chunk
Bytes 8 B 8 B
Shape () ()
Count 1132 Tasks 1 Chunks
Type float64 numpy.ndarray

Now we will dynamically load the data, compute the results, and drop the data.

In [7]:
t2.compute()
Out[7]:
0.49999816307535666

Let's try working with data larger than the cluster's total memory (600 GB of data versus 460.80 GB of memory).

In [8]:
t = da.random.random((100000,7500,100),chunks=(400,400,-1))
t
Out[8]:
Array Chunk
Bytes 600.00 GB 128.00 MB
Shape (100000, 7500, 100) (400, 400, 100)
Count 4750 Tasks 4750 Chunks
Type float64 numpy.ndarray
In [9]:
t2 = t.mean()
t2
Out[9]:
Array Chunk
Bytes 8 B 8 B
Shape () ()
Count 11208 Tasks 1 Chunks
Type float64 numpy.ndarray
In [10]:
t2.compute()
Out[10]:
0.5000005197202411
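
The computation above succeeds even though the array is larger than the cluster's memory, because chunks are generated, reduced, and released as the graph runs. The sketch below gives a rough back-of-the-envelope estimate of the memory held by in-flight chunks. It assumes each core holds about one full-size chunk at a time, which is a simplification; real usage will be higher because of intermediate results and overhead.

```python
# Sizes taken from the array and cluster summaries in this section.
data_gb = 100000 * 7500 * 100 * 8 / 1e9   # full array, float64
cluster_gb = 460.8                        # total memory across all workers
chunk_mb = 400 * 400 * 100 * 8 / 1e6      # one full-size chunk
cores = 144                               # total cores across all workers

# Assumption: roughly one chunk in flight per core at any moment.
in_flight_gb = cores * chunk_mb / 1000

print(data_gb > cluster_gb)   # the data cannot all be held at once
print(in_flight_gb)           # rough estimate of chunk memory in use
```

Under that assumption only about 18 GB of chunk data needs to be resident at any moment, far below the 460.8 GB available.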

Alternatively, we can load the data into the cluster's memory with the "persist" option.

In [11]:
t = da.random.random((10000,7500,100),chunks=(400,400,-1)).persist()
wait(t)
t
Out[11]:
Array Chunk
Bytes 60.00 GB 128.00 MB
Shape (10000, 7500, 100) (400, 400, 100)
Count 475 Tasks 475 Chunks
Type float64 numpy.ndarray
In [12]:
t.mean().compute()
Out[12]:
0.4999999739855532
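
The difference between persisting and computing can be seen on a small array. This is a minimal sketch that assumes dask is installed and uses the local default scheduler instead of the SLURM cluster; the array size is arbitrary.

```python
import dask.array as da

x = da.random.random((2000, 2000), chunks=(500, 500))

# Lazy: nothing has been generated yet, only a task graph exists.
lazy_mean = x.mean()

# persist() materializes the chunks but still returns a Dask array,
# so later operations reuse the stored chunks instead of regenerating them.
# (On a distributed cluster this returns immediately; on the local
# scheduler used here it blocks until the chunks are ready.)
persisted = x.persist()

# compute() returns a concrete result rather than a Dask collection.
result = float(persisted.mean().compute())

print(type(persisted).__name__, persisted.shape)
print(result)
```

Persisting is worthwhile when the same data feeds several computations and fits in cluster memory. For a single pass over data larger than memory, the lazy approach shown earlier is the better fit.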

The following silent video is a media alternative for the text in the "Cluster Setup" Section above.
Link To Video