Lecture 12: Organization and Packaging of Python Projects¶
A complex research project often relies on many different programs and software packages to accomplish its goals. An important part of scientific computing is deciding how to organize and structure the code you use for research. A well-structured project can make you a more efficient and effective researcher.
Just putting all of your code into git repositories won't magically turn a mess of scripts into a beautiful, well-organized project. More deliberate effort is required.
Types of Projects¶
Not all projects are created equal. Based on my experience, I distinguish three types of "research code" scenarios commonly encountered in the geosciences.
- Exploratory analyses: When exploring a new idea, a single notebook or script is often all we need.
- A Single Paper: The "paper" is a standard unit of scientific output. The code related to a single paper usually belongs together.
- Reusable software elements: In the course of our research computing, we often identify specialized routines that we want to package for reuse in other projects, or by other scientists. This is where "scripts" become "software."
This lecture outlines some suggested practices for each category.
Exploratory Analysis¶
When starting something new, we are often motivated to just start coding and get some results quickly. This is fine! Jupyter notebooks are an ideal format for open-ended exploratory analysis, since they are totally self-contained: they encapsulate text, code, and figures. If we find something cool or useful, it is important to preserve these exploratory notebooks.
A dedicated GitHub repository can be overkill for a single file. Instead, I recommend GitHub's "gist" mechanism for saving and sharing such "one-off" notebooks and code snippets. Gists are like mini repos you can easily share and embed. (You can create one right now by going to https://gist.github.com/.)
You can upload any file (including an .ipynb notebook file) by dragging and dropping it into the gist website.
You have the choice of making your gist public or secret. (There is no private option, but a secret gist can only be seen by others if you give them the URL.)
GitHub's rendering of Gists is a bit buggy. For a more consistent rendering experience, you can share your gist via http://nbviewer.ipython.org/.
A Single Paper¶
Scientific Reproducibility¶
Reproducibility is a cornerstone of the scientific process. However, today one often reads that science is in the midst of a reproducibility crisis. This crisis may be due to increasing complexity and cost of scientific analysis, together with mounting pressure to publish as much and as quickly as possible.
Today almost all earth science relies on some form of computation, from simple statistical analysis and curve fitting to advanced numerical simulation. In principle, computational science should be highly reproducible. Keep in mind that the audience for a reproducible project is not just other scientists: it is also you, a year from now, or whenever you need to repeat or build on earlier work. Many scientists build on their Ph.D. work for a decade following graduation. Extra time spent on reproducibility now will make you more productive in the long run.
We begin with an important observation.
An article about computational science … is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
Donoho, D. et al. (2009), Reproducible research in computational harmonic analysis, Comp. Sci. Eng. 11(1):8–18, doi: 10.1109/MCSE.2009.15
Sandve et al. (2013) give some specific recommendations for computational reproducibility.
- For every result, keep track of how it was produced
- Avoid manual data-manipulation steps
- Archive the exact versions of all external programs used
- Version-control all custom scripts
- Record all intermediate results, when possible in standard formats
- For analyses that include randomness, note underlying random seeds
- Always store raw data behind plots
- Generate hierarchical analysis output, allowing layers of increasing detail to be inspected
- Connect textual statements to underlying results
- Provide public access to scripts, runs, and results
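A minimal sketch of two of these recommendations: noting the random seed, and storing the raw numbers behind a result together with how they were produced. The function names and the fields of the record are illustrative assumptions, not something prescribed by Sandve et al.

```python
import json
import platform
import random


def run_analysis(seed, n=5):
    """Toy stand-in for an analysis step that involves randomness."""
    rng = random.Random(seed)  # a dedicated, explicitly seeded generator
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def make_record(seed, values):
    """Bundle a result with the information needed to regenerate it."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "values": values,
    }


seed = 1234
record = make_record(seed, run_analysis(seed))

# Saving this text next to a figure keeps the raw numbers and the seed together.
print(json.dumps(record, indent=2))
```

Because the generator is seeded explicitly, rerunning `run_analysis(1234)` returns the same values, so the stored record can be checked against a fresh run.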
These recommendations suggest a certain structure for a project.
Project Layout¶
A reproducible single-paper project directory structure might look something like this:
README.md
LICENSE
environment.yml
data/intermediate_results.csv
notebooks/process_raw_data.ipynb
notebooks/figure1.ipynb
notebooks/figure2.ipynb
notebooks/helper.py
manuscript/manuscript.tex
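To illustrate how the pieces of this layout might fit together, here is a sketch of the kind of function notebooks/helper.py could hold so that every figure notebook reads data/intermediate_results.csv in the same way. The function name `load_results` and the CSV columns are hypothetical. The demonstration writes a small temporary file so that it runs anywhere.

```python
import csv
import tempfile
from pathlib import Path


def load_results(csv_path):
    """Read a CSV of intermediate results into a list of row dictionaries."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))


# Demonstration with a throwaway file standing in for
# data/intermediate_results.csv (the column names are made up).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "intermediate_results.csv"
    csv_path.write_text("station,temp_c\nA,10.5\nB,12.0\n")
    rows = load_results(csv_path)

# csv returns strings, so convert before doing arithmetic.
mean_temp = sum(float(row["temp_c"]) for row in rows) / len(rows)
print(rows)
print(mean_temp)
```

In a notebook inside notebooks/, such a function would be imported with `from helper import load_results`.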
Reusable Software Elements¶
Scientific software can perhaps be grouped into two categories: single-use "scripts" that are used in a very specific context to do a very specific thing (e.g., to generate a specific figure for a paper), and reusable components which encapsulate a more generic workflow. Once you find yourself repeating the same chunks of code in many different scripts or projects, it's time to start composing reusable software elements.
In Assignment 2, we wrote several functions for unit conversion. Now let's write a module for these functions. Open a file called temperature_unit_convert.py in a text editor. (The file should be in the same directory as the notebook you are working in now.) Populate it with the functions you defined. Mine looks like this:
"""
A python module for unit conversion for temperature.
"""


def k_to_c(temp):
    """Convert temperature from Kelvin to Celsius.

    PARAMETERS
    ----------
    temp : float
        Temperature in Kelvin.

    RETURNS
    -------
    temp_c : float
        Temperature in Celsius.
    """
    temp_c = temp - 273.15
    return temp_c


def c_to_k(temp):
    """Convert temperature from Celsius to Kelvin.

    PARAMETERS
    ----------
    temp : float
        Temperature in Celsius.

    RETURNS
    -------
    temp_k : float
        Temperature in Kelvin.
    """
    temp_k = temp + 273.15
    return temp_k


def temp_to_F(temp, C=True):
    """Convert temperature to Fahrenheit.

    PARAMETERS
    ----------
    temp : float
        Temperature in Celsius or Kelvin.
    C : bool, default: True
        If True, input temperature is in Celsius. If False, input temperature is in Kelvin.

    RETURNS
    -------
    temp_F : float
        Temperature in Fahrenheit.
    """
    if C:
        temp_F = (temp * 9/5) + 32
    else:
        temp_c = k_to_c(temp)
        temp_F = (temp_c * 9/5) + 32
    return temp_F


def temp_from_F(temp, C=True):
    """Convert temperature from Fahrenheit to Celsius or Kelvin.

    PARAMETERS
    ----------
    temp : float
        Temperature in Fahrenheit.
    C : bool, default: True
        If True, output temperature is in Celsius. If False, output temperature is in Kelvin.

    RETURNS
    -------
    temp_out : float
        Temperature in Celsius or Kelvin.
    """
    temp_c = (temp - 32) * 5/9
    if C:
        return temp_c
    else:
        temp_k = c_to_k(temp_c)
        return temp_k
The module begins with a docstring explaining what it does. It then defines four conversion functions, each with its own docstring.
Now let's import our module
import temperature_unit_convert
help(temperature_unit_convert)
Help on module temperature_unit_convert:

NAME
    temperature_unit_convert - A python module for unit conversion for temperature.

FUNCTIONS
    c_to_k(temp)
        Convert temperature from Celsius to Kelvin.

        PARAMETERS
        ----------
        temp : float
            Temperature in Celsius.

        RETURNS
        -------
        temp_k : float
            Temperature in Kelvin.

    k_to_c(temp)
        Convert temperature from Kelvin to Celsius.

        PARAMETERS
        ----------
        temp : float
            Temperature in Kelvin.

        RETURNS
        -------
        temp_c : float
            Temperature in Celsius.

    temp_from_F(temp, C=True)
        Convert temperature from Fahrenheit to Celsius or Kelvin.

        PARAMETERS
        ----------
        temp : float
            Temperature in Fahrenheit.
        C : bool, default: True
            If True, output temperature is in Celsius. If False, output temperature is in Kelvin.

        RETURNS
        -------
        temp_out : float
            Temperature in Celsius or Kelvin.

    temp_to_F(temp, C=True)
        Convert temperature to Fahrenheit.

        PARAMETERS
        ----------
        temp : float
            Temperature in Celsius or Kelvin.
        C : bool, default: True
            If True, input temperature is in Celsius. If False, input temperature is in Kelvin.

        RETURNS
        -------
        temp_F : float
            Temperature in Fahrenheit.

FILE
    /Users/xiaomengjin/Dropbox/0_Rutgers/3_Teaching/Research_Computing/Lectures/temperature_unit_convert.py
And let's try using it to make a calculation
temperature_unit_convert.c_to_k(0)
273.15
temperature_unit_convert.temp_to_F(300, C = False)
80.33000000000004
We could also import just the function we need
from temperature_unit_convert import c_to_k
c_to_k(0)
273.15
If we change the module, we need to either restart our kernel or else reload the module.
from importlib import reload
reload(temperature_unit_convert)
<module 'temperature_unit_convert' from '/Users/xiaomengjin/Dropbox/0_Rutgers/3_Teaching/Research_Computing/Lectures/temperature_unit_convert.py'>
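One caveat is worth a sketch: reloading refreshes the module object, but a name previously bound with `from module import name` keeps pointing at the old function. The example below demonstrates this with a throwaway module written to a temporary directory. The module name `demo_reload_module` and its contents are hypothetical, and bytecode caching is switched off only to keep the demonstration deterministic.

```python
import importlib
import shutil
import sys
import tempfile
from pathlib import Path

# Skip bytecode caching so the edited source is always re-read in this demo.
sys.dont_write_bytecode = True

# Write a tiny module to a temporary directory, then make it importable.
tmp = tempfile.mkdtemp()
module_file = Path(tmp) / "demo_reload_module.py"
module_file.write_text("def version():\n    return 'old'\n")
sys.path.insert(0, tmp)
importlib.invalidate_caches()

import demo_reload_module
from demo_reload_module import version as imported_version

# Simulate editing the module on disk, then reload it.
module_file.write_text("def version():\n    return 'edited'\n")
importlib.reload(demo_reload_module)

print(demo_reload_module.version())  # looked up on the reloaded module
print(imported_version())            # still bound to the old function object

# Clean up the temporary directory and the search-path entry.
sys.path.remove(tmp)
shutil.rmtree(tmp)
```

After a reload, either access functions through the module (`module.function`) or repeat the `from ... import ...` statement.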
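Modules are a simple way to share code between different scripts or notebooks in the same project. By default, a module file must sit in the same directory as the script that imports it (more precisely, somewhere on Python's import search path, sys.path, whose first entry is normally the script's directory). This is a big limitation in practice; it makes it awkward to share modules between different projects.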
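A small sketch of how this lookup works, assuming only the standard library: Python searches the directories listed in sys.path, so a module stored elsewhere becomes importable once its directory is appended. The module name `demo_shared_tools` and its function are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a module in a directory that is NOT the current one.
other_dir = tempfile.mkdtemp()
(Path(other_dir) / "demo_shared_tools.py").write_text(
    "def double(x):\n    return 2 * x\n"
)

# The new directory is not on the search path yet.
was_on_path = other_dir in sys.path
print(was_on_path)

# Add the directory to the search path; the import now succeeds.
sys.path.append(other_dir)
importlib.invalidate_caches()
import demo_shared_tools

print(demo_shared_tools.double(21))
```

Editing sys.path by hand in every script is fragile, which is part of the motivation for installable packages.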
Once you have a piece of code that is general-purpose enough to share between projects, you need to create a package.
Packages¶
Packages are python's way of encapsulating reusable code elements for sharing with others. Packaging is a huge and complicated topic. We will just scratch the surface.
We have already interacted with many packages. Browse some of their GitHub repositories to explore the structure of a large python package:
- NumPy: https://github.com/numpy/numpy
- Pandas: https://github.com/pandas-dev/pandas
- Xarray: https://github.com/pydata/xarray
These packages all have a common basic structure. Imagine we wanted to turn our temperature unit conversion module into a package. It would look like this.
README.md
LICENSE
environment.yml
requirements.txt
setup.py
temperature_unit_convert/__init__.py
temperature_unit_convert/temperature_unit_convert.py
temperature_unit_convert/tests/__init__.py
temperature_unit_convert/tests/test_unit_convert.py
The actual package is contained in the temperature_unit_convert subdirectory. The other files are auxiliary files which help others understand and install your package. Here is an overview of what they do:
File Name | Purpose |
---|---|
README.md | Explains what the package is for. |
LICENSE | Defines the legal terms under which others can use the package. Open source is encouraged! |
environment.yml | A conda environment file which describes the package's dependencies. |
requirements.txt | A file which describes the package's dependencies for pip. |
setup.py | A special python script which installs your package. |
The actual package¶
The directory temperature_unit_convert is the actual package. Any directory that contains an __init__.py file is recognized by python as a package. This file can be blank, but it needs to be present. From the root directory, we can import a module from the package as follows:
from temperature_unit_convert import temperature_unit_convert
Yes, this is a bit redundant. That's because the temperature_unit_convert.py module has the same name as the temperature_unit_convert package directory.
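One common way to shorten such imports is to re-export the functions from __init__.py. The sketch below builds a throwaway package in a temporary directory to show the idea. The names `demo_units` and `convert` are hypothetical stand-ins, and the lecture's own package leaves __init__.py blank.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (all names here are hypothetical).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_units"
pkg.mkdir()
(pkg / "convert.py").write_text(
    "def c_to_k(temp):\n    return temp + 273.15\n"
)
# Re-export the function so users need not know which module defines it.
(pkg / "__init__.py").write_text("from .convert import c_to_k\n")

# Put the package's parent directory on the search path, as if we were
# running Python from the project root.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_units import c_to_k                         # short form, via __init__.py
from demo_units.convert import c_to_k as c_to_k_long  # explicit form

print(c_to_k(0))
print(c_to_k is c_to_k_long)
```

Both imports resolve to the same function object; the re-export only changes how users spell the import.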
However, this import will only work from the parent directory. It is not globally accessible from your python environment.
setup.py is the magic file that makes your package installable and accessible anywhere. Here is an extremely basic setup.py:
from setuptools import setup

setup(
    name='temperature_unit_convert',
    version='0.1.0',
    author='Xiaomeng Jin',
    packages=['temperature_unit_convert'],
    install_requires=['numpy'],
)
There is a dizzying range of options for setup.py. More fields are required if you want to upload your package to PyPI (so it is installable via pip).
To run the setup script, we call the following from the command line
python setup.py install
The package files are copied to our python library directory. If we plan to keep developing the package, we can install it in "developer mode" as
python setup.py develop
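In this case, the installed package is linked back to your source directory rather than copied, so edits take effect without reinstalling. (Recent versions of setuptools deprecate running setup.py directly; pip install . and pip install -e . are the equivalent pip commands for a regular and a development install.)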
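After installing, one way to confirm that the package is visible to the active environment is to ask for its version through the standard library's importlib.metadata (Python 3.8 and later). The helper name `installed_version` is an illustrative assumption.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "temperature_unit_convert" is the name passed to setup(); this prints
# None in any environment where the package has not been installed.
print(installed_version("temperature_unit_convert"))
```

The lookup uses the distribution name given to `setup(name=...)`, which can differ from the name you import.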
Testing¶
A software package requires tests to ensure that it works properly.
Tests don't have to be complicated. They are simply a check to verify that your code does what it is supposed to do.
To add tests to our project, we create the file temperature_unit_convert/tests/test_unit_convert.py. (We also need an __init__.py file in the tests directory.) The example below shows a test function for our package.
import pytest

from temperature_unit_convert.temperature_unit_convert import k_to_c, c_to_k, temp_from_F, temp_to_F


def test_unit_convert():
    # some known results
    assert c_to_k(0) == pytest.approx(273.15)
    assert temp_to_F(100) == pytest.approx(212)

    # Verify the "round trip" conversion from Celsius to Kelvin and back.
    # Floats are compared with a tolerance rather than exact equality.
    for orig in [10, 20, 30]:
        new = k_to_c(c_to_k(orig))
        assert new == pytest.approx(orig)

    # Verify the "round trip" conversion from Celsius to Fahrenheit and back.
    for orig_c in [100, 90, 95]:
        new = temp_from_F(temp_to_F(orig_c))
        assert new == pytest.approx(orig_c)

    # now check that we can't pass the wrong number of arguments
    with pytest.raises(TypeError):
        k_to_c(1, 2, 3)
We will use pytest to run our tests. If you don't have pytest installed in your active python environment, take a minute to run pip install pytest from the command line. Now run
pytest -v
from the root directory of your project. You should see a notification that the tests passed. Try playing around with the tests to cause something to fail.
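As a starting point for experimenting, here is a sketch of two further tests that check known reference values and show why floating-point results should be compared with a tolerance. To stay self-contained, it defines minimal copies of two of the module's functions instead of importing them, and it uses math.isclose from the standard library (pytest.approx serves the same purpose inside a pytest test file). The test names are hypothetical.

```python
import math


# Minimal copies of two functions from the module, so the example is
# self-contained; in a real test file they would be imported instead.
def c_to_k(temp):
    return temp + 273.15


def temp_to_F(temp, C=True):
    temp_c = temp if C else temp - 273.15
    return (temp_c * 9 / 5) + 32


def test_known_values():
    # Freezing and boiling points of water are easy reference values.
    assert math.isclose(c_to_k(0), 273.15)
    assert math.isclose(temp_to_F(0), 32.0)
    assert math.isclose(temp_to_F(100), 212.0)
    assert math.isclose(temp_to_F(273.15, C=False), 32.0)


def test_float_tolerance():
    # 300 K is about 80.33 F, but the float result carries rounding error,
    # so an exact == comparison against 80.33 would fail.
    result = temp_to_F(300, C=False)
    assert result != 80.33
    assert math.isclose(result, 80.33)


# pytest would discover and call these functions automatically;
# calling them by hand also works.
test_known_values()
test_float_tolerance()
```

Changing one of the reference values (for example, 212.0 to 213.0) is a quick way to see what a failing test looks like.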
Publishing a Python package to GitHub¶
Go to GitHub and create a new repository named temperature_unit_convert. Make it public.
In a terminal, go back to your package directory. First clean the directory:
python setup.py clean --all
Initialize a git repository:
git init
Add and commit your package:
git add *
git commit -m 'initial commit'
Follow the instructions shown on GitHub to push this repository to your new GitHub repository:
git remote add origin https://github.com/xjin49/temperature_unit_convert.git
git branch -M main
git push -u origin main
Now you should be able to see your python package on GitHub.