Software Package/Environment Management Workshop: Python and conda

- Thursday, October 3, 2024
  - Session Recording

Managing packages and environments in Python

We will begin by focusing on package and environment management with the standard Python toolset: the venv and pip modules that are usually included with Python. Later, we will learn package and environment management with conda.

Choosing which version of Python to use

First, use the cluster’s environment module system to see which versions of Python are available for your project: module spider python or ml spider python.
Load the version of Python you’d like to use. E.g., module load python/3.12.5 or ml load python/3.12.5. (Note that you can use tab completion for module names!)
It is a good idea to run python or python3 and verify that you get the version you want!

Note: After you create your virtual environment, you no longer need to load the associated Python environment module. You can simply activate the virtual environment!
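If you would like to double-check the interpreter from inside Python itself, a minimal sketch using only the standard library follows. The helper name describe_interpreter is made up for this illustration.

```python
import sys


def describe_interpreter():
    """Summarize the running interpreter (hypothetical helper for illustration)."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    in_venv = sys.prefix != sys.base_prefix
    return {"version": version, "executable": sys.executable, "in_venv": in_venv}


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running this before and after activating a virtual environment should show the executable path and the in_venv flag change.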

Creating and managing virtual environments with `venv`

If you are not already in your workshop directory, change to it by running cd /90daydata/shared/$USER/.
Use the venv module and command to create a new virtual environment: python -m venv demo_venv.
You should now see a new folder called demo_venv. Let’s take a look inside and explore a bit!
You now have a virtual environment that is ready to use. The easiest way to use the environment is to activate it. Run source demo_venv/bin/activate. After running this command, notice that the command prompt has changed to indicate that you are working in the virtual environment named “demo_venv”.
To deactivate the environment and return to the “normal” command environment you had before, run deactivate.
To remove the environment, delete its directory: rm -I -r demo_venv.

That is pretty much everything you need to know about how to use and manage Python virtual environments! The venv command does have more options, but you most likely won’t need them.
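The same venv module can also be called from Python code, which is a convenient way to peek at what an environment directory contains. The sketch below is an illustration only: the function name create_and_inspect is hypothetical, and it builds the environment in a temporary directory so that nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path


def create_and_inspect(env_dir):
    """Create a virtual environment at env_dir and return its top-level entries."""
    # with_pip=False skips bootstrapping pip to keep this sketch fast;
    # `python -m venv` on the command line installs pip by default.
    venv.create(env_dir, with_pip=False)
    return sorted(entry.name for entry in Path(env_dir).iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "demo_venv"
        print(create_and_inspect(env_dir))
        # pyvenv.cfg records which Python installation the environment is based on.
        print((env_dir / "pyvenv.cfg").read_text())
```

For everyday work, the command-line form python -m venv demo_venv remains the simpler choice.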

Installing and managing Python packages in a virtual environment

The standard software tool for managing Python packages is pip, which is included with Python. If a Python virtual environment is activated, pip commands will automatically be applied to the active virtual environment.

If you have not already done so, create and activate a Python virtual environment using venv.
Use python -m pip install PACKAGE to install one or more Python packages. E.g., python -m pip install termcolor.
Try importing the newly installed package in Python to confirm the installation succeeded.
To upgrade a package that is already installed, use python -m pip install --upgrade PACKAGE.
To remove a package, use python -m pip uninstall PACKAGE.

pip is a fairly complex piece of software with many commands and options. (Fortunately, it also has good, detailed documentation!) However, the basic commands above, plus a few more we’ll learn in the next section, will likely be all that you need!
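Besides importing a package, you can ask Python which version of it (if any) is installed in the current environment. The following is a small sketch using the standard library's importlib.metadata; the helper name installed_version is an assumption made for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("pip", "termcolor"):
        print(name, "->", installed_version(name) or "not installed in this environment")
```

Run it with the virtual environment activated to check that environment rather than the system Python.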

Exercise 1: Use nano (or the text editor of your choice) to save the following Python program to a file called print_data.py (e.g., nano print_data.py). Try running the program: python print_data.py. What happens? Create a virtual environment for the program and use it to run the program. What does the program do?

```python
from termcolor import cprint
from faker import Faker

fake = Faker()
n = 4

def print_fake_data():
    cprint(f'Name: {fake.name()}', 'cyan')
    cprint(f'Company: {fake.company()}', 'white')
    cprint(f'Job: {fake.job()}\n', 'green')

for i in range(n):
    print_fake_data()
```

Using `requirements.txt` to automate package management

In order to make virtual environments and package management truly useful, we need a mechanism to easily and precisely record all of the packages an environment requires. pip can use a special “requirements file”, usually named requirements.txt, to do this. In its simplest form, requirements.txt simply lists the names of packages that are needed for an environment, with one package on each line. For example:

```
package_one
package_two
```

The requirements file can also provide information about the versions of a package that are acceptable.

```
package_one == 2.4.1
package_two > 1.3
```

The example above specifies that package_one must be version 2.4.1, and package_two can be any version as long as it is more recent than 1.3. There are many more possibilities; see the official documentation to learn more.
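To make the idea of a version constraint concrete, here is a deliberately simplified sketch of how a line such as package_two > 1.3 can be read and checked. This is not pip's actual implementation: it handles only one operator per line and purely numeric versions, and all of the function names are made up for this illustration.

```python
import re

# Simplified pattern: a package name, then optionally an operator and a version.
# Real requirement lines allow much more (extras, markers, multiple clauses).
REQUIREMENT = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(?:(==|>=|<=|>|<)\s*([0-9.]+))?\s*$")


def parse_requirement(line):
    """Split a simple requirement line into (name, operator, version)."""
    match = REQUIREMENT.match(line)
    if match is None:
        raise ValueError(f"unsupported requirement line: {line!r}")
    return match.groups()


def version_tuple(version):
    """Turn '2.4.1' into (2, 4, 1) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def satisfies(candidate, operator, required):
    """Check a purely numeric candidate version against one operator and version."""
    a, b = version_tuple(candidate), version_tuple(required)
    # Pad the shorter version with zeros so that 1.3 and 1.3.0 compare as equal.
    length = max(len(a), len(b))
    a += (0,) * (length - len(a))
    b += (0,) * (length - len(b))
    checks = {"==": a == b, ">=": a >= b, "<=": a <= b, ">": a > b, "<": a < b}
    return checks[operator]


if __name__ == "__main__":
    name, operator, required = parse_requirement("package_two > 1.3")
    for candidate in ("1.3", "1.3.1", "1.10"):
        print(f"{name} {candidate}: {satisfies(candidate, operator, required)}")
```

Note that versions are compared piece by piece as numbers, which is why 1.10 counts as more recent than 1.3.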

Not only are requirements files handy for documenting the packages in an environment, they also make it very easy to install the packages an environment requires! To install all packages specified in requirements.txt, run python -m pip install -r requirements.txt.

You can of course write requirements.txt by hand, but it is often quite helpful to let pip make it for you! Once you have your virtual environment set up the way you want, you can use python -m pip freeze to generate the contents of requirements.txt that will exactly recreate the environment. To write the command output directly to a file, run python -m pip freeze > requirements.txt. (Be careful, though, because that command will overwrite requirements.txt if it already exists.)
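To get a feel for what pip freeze reports, the sketch below builds similar name==version lines with the standard library's importlib.metadata. It is an approximation for illustration, not a replacement for pip freeze (which also handles editable installs and other special cases), and the function name freeze_lines is hypothetical.

```python
from importlib import metadata


def freeze_lines():
    """Return sorted 'name==version' lines for packages visible to this interpreter."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata has no package name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(freeze_lines()))
```

The exact pins (==) are what allow the environment to be recreated precisely later on.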

It is a good practice to include a requirements.txt file along with the code and documentation for a project. That way, you or anyone else who uses your code can easily create a suitable virtual environment for the project. Using a requirements.txt file therefore helps ensure that your code and analyses are fully reproducible.

Exercise 2: Create a requirements.txt file for the virtual environment you created for the previous exercise. How many packages are included in the file? Why?

Exercise 3: Create a new virtual environment and install all packages from requirements.txt from Exercise 2 into the virtual environment. Confirm that the program from Exercise 1 runs in your new virtual environment.

Using virtual environments with Jupyter notebooks

What we’ve learned so far is all you need for using Python from the command line. How, though, do you access your virtual environment from a Jupyter notebook? For this, we need to create a “Jupyter kernel” to make our environment available in notebooks.

From the command line, make sure the target virtual environment for the kernel is activated.
Install the ipykernel package: python -m pip install ipykernel.
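Create a “kernel specification” that will make the virtual environment available to Jupyter notebooks: python -m ipykernel install --user --name KERNELNAME --display-name "KERNEL NAME". (The kernel name may contain only letters, digits, hyphens, underscores, and periods; the optional display name is the label shown in Jupyter and may contain spaces.)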

You should now see your new kernel available for use with Jupyter notebooks. (It might take a minute or two for Jupyter to detect the new kernel.)

What if, after creating the Jupyter kernel, you need to change your virtual environment by adding or removing packages? That is no problem, and you do not need to create the kernel specification again. The kernel specification merely provides access to the virtual environment. If you modify the virtual environment, the changes will automatically be available to notebooks that use the environment via a Jupyter kernel.
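Why does this work? A kernel specification is essentially a small JSON file (kernel.json) that tells Jupyter which command to run to start the kernel. The sketch below builds a dictionary shaped like that file; the function name kernel_spec is made up for this example, and the exact fields written by your version of ipykernel may differ.

```python
import json
import sys


def kernel_spec(display_name, python_executable=sys.executable):
    """Build a dict shaped like the kernel.json that ipykernel writes."""
    return {
        # Jupyter starts the kernel by running this command; the first entry
        # is the Python interpreter inside the environment.
        "argv": [python_executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


if __name__ == "__main__":
    print(json.dumps(kernel_spec("Demo venv"), indent=2))
```

Because the specification stores only the path to the environment's Python interpreter, any packages you later add to or remove from the environment are picked up the next time the kernel starts.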

If you want to remove a Jupyter kernel, you can run jupyter kernelspec uninstall KERNELNAME from the command line. To see the names of all installed kernels, run jupyter kernelspec list. (Installing the ipykernel package should also give you the jupyter command, but you will need to be sure your Python virtual environment is active in order to use it.)

For the next exercise, we will use Open OnDemand on Atlas.

Log on to Open OnDemand on Atlas.
From the Open OnDemand landing page, select “Interactive Apps” > “Jupyter”. You will be taken to a page with multiple input fields to configure your Jupyter session.
Select the following inputs:
- Account: scinet_workshop1
- Partition: atlas
- QOS: normal 14-00:00:00
- Number of hours: 2
- Number of nodes: 1
- Number of tasks: 1
- Additional Slurm Parameters: --reservation=workshop --mem=8G
- Working Directory: /90daydata/shared/${USER}

Exercise 4: Using Open OnDemand, launch a new JupyterLab session and open a new Jupyter notebook using the default “Python 3” kernel. Paste the following into a code cell and run the code. What happens?

```python
import plotnine as pn
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': np.arange(100),
    'y': np.cumsum(np.random.randn(100))
})
(pn.ggplot(df, pn.aes(x='x', y='y'))
    + pn.geom_line()
    + pn.xlab('Time')
    + pn.ylab('An important measurement!')
)
```

Create a suitable virtual environment for this code, then create a Jupyter kernel for your notebook from the environment. Verify that the code runs.

Managing packages and environments with Anaconda

If all of the software components you need to manage in your virtual environment are Python packages, we strongly recommend using the venv and pip workflow described in detail above. Why? Because pip and venv:

Are included with Python and therefore available pretty much anywhere Python is available. This helps ensure that your workflow is portable and easy to share.
Install packages from PyPI, the Python Package Index, which ensures you have access to the most recent versions of Python packages.
Are officially supported by the Python project. This helps ensure future reproducibility.

However, if you need to manage other kinds of software, too, conda can provide a useful alternative. Conceptually, the process of managing software using conda is the same: you create a virtual environment and then manage software packages within that environment. We will go over the basics in this workshop; please see the official documentation for more information!

Load and initialize miniconda

First, load the environment module for miniconda so that you have access to the conda command:

On Ceres: module load miniconda or ml load miniconda.
On Atlas: module load miniconda3 or ml load miniconda3.

If you’ve not used conda before on the system, you will need to run conda init. By default, this will cause the conda “base” environment to automatically be activated every time you log in. This can be annoying! If you want to disable this, run conda config --set auto_activate_base false. (You can also undo all changes made by conda init by running conda init --reverse.) After running conda init for the first time, you will need to either exit your shell session and start a new one or run source ~/.bashrc.

Note: mamba is a drop-in replacement for conda that is generally faster and more robust. However, recent versions of conda have adopted code from the mamba project so that there is now less of a performance gap between the two. If mamba is available (mamba is currently available on Ceres but not on Atlas), you can simply replace conda with mamba in the commands below.

An important note about conda channels: conda installs software from a “channel”, which is essentially a remote software repository. By default, conda attempts to install software from a channel called “defaults”. However, use of the “defaults” channel requires a paid license, so USDA employees should not install software from “defaults”! Instead, we recommend you use the “conda-forge” channel, which is free to use. To see which channel(s) you are using, run conda config --show channels. If you see “defaults” listed, complete the following steps:

Run conda config --add channels conda-forge to add the “conda-forge” channel.
Run conda config --remove channels defaults to remove the “defaults” channel.

Creating and managing environments with `conda`

First, let’s cover what not to do! Most online documentation will tell you to create a new conda environment by running conda create -n ENVNAME, where “ENVNAME” is the name of the new environment. E.g., conda create -n conda_env. The problem with this is that all packages will be installed into a hidden directory inside your home directory (typically ~/.conda/envs) and you will quickly run out of space!

Instead, we need to tell conda to create the environment in a location that we choose. Proceed as follows:

Run conda create --prefix ENVNAME, where “ENVNAME” is the name of the new environment. E.g., conda create --prefix conda_env. This will create a directory called “ENVNAME” for the new environment.
Activate the environment by running conda activate /path/to/ENVNAME, where “/path/to/ENVNAME” is the file system path to the environment location. E.g., conda activate ./conda_env.
To deactivate the environment and return to the “normal” command environment you had before, run conda deactivate.
To remove the environment, delete its directory: e.g., rm -I -r conda_env. Alternatively, you can run the command conda remove --prefix /path/to/ENVNAME --all.

Installing and managing software in a conda environment

The conda command is also used to install and remove software from a conda environment.

Make sure the target conda environment is activated.
Run conda install PACKAGE to install software into a conda environment.
Run conda remove PACKAGE to uninstall software.

Note that a conda environment does not automatically include Python, so you will need to tell conda to install it!
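One way to tell whether the python you are running actually belongs to the active conda environment is to compare the interpreter's location with the CONDA_PREFIX environment variable, which conda sets when an environment is activated. The following is a minimal sketch; the function name python_is_from_env is hypothetical.

```python
import os
import sys
from pathlib import Path


def python_is_from_env(env_prefix, python_prefix=sys.prefix):
    """Return True if the interpreter at python_prefix lives inside env_prefix."""
    if not env_prefix:
        return False  # no conda environment is active
    env_path = Path(env_prefix).resolve()
    python_path = Path(python_prefix).resolve()
    return python_path == env_path or env_path in python_path.parents


if __name__ == "__main__":
    # conda sets CONDA_PREFIX to the active environment's directory on activation.
    active_env = os.environ.get("CONDA_PREFIX")
    print("Active conda environment:", active_env)
    print("Python comes from that environment:", python_is_from_env(active_env))
```

If the second line prints False while an environment is active, the python command is coming from somewhere else on your system, not from the conda environment.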

Exercise 5: If you have not already done so, create a new conda environment called conda_env. Activate the environment and launch Python (python). The most recent version of Python available from the conda-forge channel is 3.12.6. What do you notice? Use conda to install the latest version of Python from conda-forge into your conda environment and verify that it works.

Conda provides an alternative way to manage Python packages. Although you can still use pip from within a conda environment, doing so can introduce a variety of complications, and the official recommendation is to use conda to manage Python packages within a conda environment whenever possible.

Exercise 6: Modify your conda environment so that you can run the Python script you created for Exercise 1.

Using `environment.yml` to automate package management

Just as we can use requirements.txt to specify the packages to include in a Python virtual environment, we can use a file typically called environment.yml to specify the packages to include in a conda environment.

To automatically generate the contents of environment.yml for an activated conda environment, run conda env export --from-history. To save the output directly to a file, run conda env export --from-history > environment.yml.

To create a new conda environment that matches the contents of an environment file, run conda env create --prefix ENVNAME --file environment.yml, where “ENVNAME” is the name of the new environment. E.g., conda env create --prefix conda_env --file environment.yml.

Exercise 7: Save the configuration of the conda environment you created for Exercise 6 and use it to create a new conda environment. Verify that you have the correct version of Python in the new environment and are able to run the Python script you created for Exercise 1.

Using conda environments with Jupyter notebooks

The process to make a conda environment available to Jupyter notebooks is nearly the same as for Python virtual environments.

Make sure the target conda virtual environment for the kernel is activated.
Install the ipykernel package: conda install ipykernel.
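Create a “kernel specification” that will make the virtual environment available to Jupyter notebooks: python -m ipykernel install --user --name KERNELNAME --display-name "KERNEL NAME". (The kernel name may contain only letters, digits, hyphens, underscores, and periods; the optional display name is the label shown in Jupyter and may contain spaces.)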