Lecture 9 Dask for Parallel Computing and Big Data¶
In past lectures, we learned how to use numpy, pandas, and xarray to analyze various types of geoscience data. In this lecture, we address an increasingly common problem: what happens if the data we wish to analyze is "big data"?
What is Dask?¶
Dask is a tool that helps us easily extend our familiar python data analysis tools to medium and big data, i.e. datasets that can't fit in our computer's RAM. In many cases, dask also allows us to speed up our analysis by using multiple CPU cores. Dask can help us work more efficiently on our laptop, and it can also help us scale up our analysis on HPC and cloud platforms. Most importantly, dask is almost invisible to the user, meaning that you can focus on your science rather than the details of parallel computing.
Dask was created by the brilliant Matt Rocklin. You can learn more about it at https://dask.org.
Dask provides collections for big data and a scheduler for parallel computing.
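To get a feel for how the scheduler executes a task graph, here is a minimal sketch using `dask.delayed` (the functions `inc` and `add` are purely illustrative):

import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# nothing has run yet; we have only built a small task graph
total = add(inc(1), inc(2))
total.compute()  # the scheduler executes the graph and returns 5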
Dask Arrays¶
A dask array looks and feels a lot like a numpy array. However, a dask array doesn't directly hold any data. Instead, it symbolically represents the computations needed to generate the data. Nothing is actually computed until the actual numerical values are needed. This mode of operation is called "lazy"; it allows one to build up complex, large calculations symbolically before turning them over to the scheduler for execution.
If we want to create a numpy array of all ones, we do it like this:
import numpy as np
shape = (1000, 4000)
ones_np = np.ones(shape)
ones_np
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
This array contains exactly 32 MB of data:
ones_np.nbytes / 1e6
32.0
Now let's create the same array using dask's array interface.
import dask
import dask.array as da
ones = da.ones(shape)
ones
ones = da.ones((100, 1000, 1000))
# Dask will determine an optimal chunk size.
ones
By default, dask treats the entire array as a single chunk; if the array is too big, dask will automatically split it into multiple chunks. "Chunks" describes how the array is split up into many sub-arrays.
There are several ways to specify chunks. In this lecture, we will use a block shape.
chunk_shape = (1000, 1000)
shape = (1000, 4000)
ones = da.ones(shape, chunks=chunk_shape)
ones
|
ones.chunksize
(1000, 1000)
Notice that we just see a symbolic representation of the array, including its shape, dtype, and chunksize.
No data has been generated yet.
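For reference, chunks can be specified in a few other ways as well; these sketches all build the same lazy array of ones, just with different chunk layouts (the exact sizes picked by "auto" depend on dask's configuration):

ones_uniform = da.ones(shape, chunks=1000)                 # one size for every axis
ones_per_axis = da.ones(shape, chunks={0: 1000, 1: 2000})  # per-axis sizes
ones_auto = da.ones(shape, chunks="auto")                  # let dask choose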
When we call `.compute()` on a dask array, the computation is triggered and the dask array becomes a numpy array.
ones.compute()
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
In order to understand what happened when we called `.compute()`, we can visualize the dask graph, the symbolic operations that make up the array:
# Install graphviz: mamba install -c conda-forge python-graphviz graphviz
ones.visualize()
Our array has four chunks. To generate it, dask calls `np.ones` four times and then concatenates them together into one array.
Rather than immediately loading a dask array (which puts all the data into RAM), it is more common to want to reduce the data somehow. For example:
sum_of_ones = ones.sum()
sum_of_ones.visualize()
Here we see dask's strategy for finding the sum. This simple example illustrates the beauty of dask: it automatically designs an algorithm appropriate for custom operations with big data.
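Conceptually, the graph computes a partial sum for each chunk and then aggregates the partial results. A pure-numpy sketch of the same strategy, for illustration only:

# compute one partial sum per chunk, then combine them
partial_sums = [np.ones(chunk_shape).sum() for _ in range(4)]
total = np.sum(partial_sums)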
If we make our operation more complex, the graph gets more complex.
fancy_calculation = (ones * ones[::-1, ::-1]).mean()
fancy_calculation.visualize()
A Bigger Calculation¶
The examples above were toy examples; the data (32 MB) is nowhere near big enough to warrant the use of dask.
We can make it a lot bigger!
bigshape = (200000, 4000)
big_ones = da.ones(bigshape, chunks=chunk_shape)
big_ones
big_ones.nbytes / 1e6
6400.0
This dataset is 6.4 GB, rather than 32 MB! This is probably close to or greater than the amount of RAM available in your computer. Nevertheless, dask has no problem working on it: each (1000, 1000) float64 chunk is only 8 MB, and dask only needs to hold a few chunks in memory at any one time.
Do not try to `.visualize()` this array!
When doing a big calculation, dask also has some tools to help us understand what is happening under the hood:
from dask.diagnostics import ProgressBar
big_calc = (big_ones * big_ones[::-1, ::-1]).mean()
with ProgressBar():
result = big_calc.compute()
result
[########################################] | 100% Completed | 2.38 s
1.0
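Beyond the progress bar, `dask.diagnostics` also provides profilers that record what each task and each resource did during the computation; a minimal sketch:

from dask.diagnostics import Profiler, ResourceProfiler

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    big_calc.compute()

prof.results[:3]   # per-task records: key, task, start/end times, worker id
rprof.results[:3]  # periodic samples of CPU and memory usage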
Reduction¶
All the usual numpy methods work on dask arrays. You can also apply numpy functions directly to a dask array, and it will stay lazy.
big_ones_reduce = (np.cos(big_ones)**2).mean(axis=0)
big_ones_reduce
Plotting also triggers computation, since we need the actual values:
from matplotlib import pyplot as plt
plt.plot(big_ones_reduce)
[<matplotlib.lines.Line2D at 0x7f50b7bfa820>]
Dask + Xarray¶
import numpy as np
import xarray as xr

# open the dataset eagerly first (numpy-backed)
ds = xr.tutorial.open_dataset("air_temperature")

# re-open it with chunks, which tells xarray to open the data as dask arrays
ds = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={
        "lat": "auto",
        "lon": "auto",
        "time": 1000,
    },
)
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 dask.array<chunksize=(1000, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
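You can inspect the chunk layout of a dask-backed xarray object, and re-chunk it after opening; a small sketch:

ds.air.chunks    # chunk sizes along each dimension of the DataArray
ds.chunks        # mapping of dimension name -> chunk sizes for the Dataset
ds_rechunked = ds.chunk({"time": 500})  # returns a re-chunked copy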
Extracting underlying data¶
There are two ways to pull out the underlying array object in an xarray object.

- `.to_numpy` or `.values` will always return a NumPy array. For dask-backed xarray objects, this means that compute will always be called.
- `.data` will return a Dask array.
```{tip}
Use `to_numpy` or `as_numpy` instead of `.values` so that your code generalizes to other array types (like CuPy arrays, sparse arrays).
```
ds.air.data # dask array, not numpy
ds.air.as_numpy().data  # numpy array
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ..., [296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 , 294.79 ], [296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 , 295.1 ], [296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 , 296.6 ]], [[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999, 238.7 ], [244.59999, 244.39 , 244. , ..., 230.29999, 232. , 235.7 ], [256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ], ..., [295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 , 295. ], [296.19998, 296.5 , 296.29 , ..., 296.4 , 296. , 295.6 ], [296.4 , 296.29 , 296.4 , ..., 297. , 297. , 296.79 ]], ..., [[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999, 244.89 ], [249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 , 242.68999], [262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999, 245.18999], ..., [294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998, 294.69 ], [296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 , 294.79 ], [298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 , 295.79 ]], [[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999, 244.79 ], [249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999, 244.29 ], [262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 , 246.89 ], ..., [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ], [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999], [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
```{exercise}
:label: data-values

Try calling `ds.air.values` and `ds.air.data`. Do you understand the difference?
```
ds.air.to_numpy()
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ..., [296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 , 294.79 ], [296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 , 295.1 ], [296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 , 296.6 ]], [[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999, 238.7 ], [244.59999, 244.39 , 244. , ..., 230.29999, 232. , 235.7 ], [256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ], ..., [295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 , 295. ], [296.19998, 296.5 , 296.29 , ..., 296.4 , 296. , 295.6 ], [296.4 , 296.29 , 296.4 , ..., 297. , 297. , 296.79 ]], ..., [[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999, 244.89 ], [249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 , 242.68999], [262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999, 245.18999], ..., [294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998, 294.69 ], [296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 , 294.79 ], [298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 , 295.79 ]], [[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999, 244.79 ], [249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999, 244.29 ], [262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 , 246.89 ], ..., [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ], [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999], [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Lazy computation¶
Xarray seamlessly wraps dask, so all computation is deferred until explicitly requested.
mean = ds.air.mean("time")
mean
<xarray.DataArray 'air' (lat: 25, lon: 53)>
dask.array<mean_agg-aggregate, shape=(25, 53), dtype=float32, chunksize=(25, 53), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dask actually constructs a graph of the required computation. Here it's pretty simple: the full array is subdivided into 3 subarrays along the time dimension (chunks of 1000 time steps). Dask will load each of these subarrays in a separate thread using the default single-machine scheduler. You can visualize dask "task graphs", which represent the requested computation:
mean.data # dask array
# visualize the graph for the underlying dask array
# we ask it to visualize the graph from left to right because it looks nicer
dask.visualize(mean.data, rankdir="LR", optimize_graph=True)
Getting concrete values¶
At some point, you will want to actually get concrete values (usually a numpy array) from dask.
There are two ways to compute values on dask arrays.

- `.compute()` returns an xarray object just like a dask array.
- `.load()` replaces the dask array in the xarray object with a numpy array. This is equivalent to `ds = ds.compute()`.
ds.load()
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
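Note that `.compute()` returns a new object and leaves the original untouched, which is handy when you want to keep the lazy version around; a small sketch:

ds_lazy = xr.tutorial.open_dataset("air_temperature", chunks={"time": 1000})
ds_eager = ds_lazy.compute()  # new numpy-backed Dataset; ds_lazy stays lazy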
Reading a multi-file dataset with dask¶
ds_multiple = xr.open_mfdataset('/scratch/xj103/rcaes/GEOS_Chem_Sims/GEOSChem.SpeciesConcHourlyNO2.2020*_2100z.nc4')
ds_multiple
<xarray.Dataset>
Dimensions:          (time: 30, lev: 47, lat: 81, lon: 65)
Coordinates:
  * time             (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30...
  * lev              (lev) float64 0.9925 0.9775 0.9625 ... 0.0001387 3.8e-05
  * lat              (lat) float64 27.0 27.25 27.5 27.75 ... 46.5 46.75 47.0
  * lon              (lon) float64 -130.0 -129.7 -129.4 ... -110.6 -110.3 -110.0
Data variables:
    SpeciesConc_NO2  (time, lev, lat, lon) float32 dask.array<chunksize=(1, 47, 81, 65), meta=np.ndarray>
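`open_mfdataset` takes options that control how the files are combined and opened; a hedged sketch with a hypothetical file pattern and variable name:

ds_hourly = xr.open_mfdataset(
    "GEOSChem.SpeciesConcHourlyNO2.2020*_2100z.nc4",  # hypothetical relative path
    combine="by_coords",  # align the files using their coordinate values
    parallel=True,        # open and preprocess the files in parallel with dask
)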
%time ds_multiple.SpeciesConc_NO2.mean(dim='time').compute()
CPU times: user 462 ms, sys: 89.5 ms, total: 552 ms
Wall time: 2.52 s
<xarray.DataArray 'SpeciesConc_NO2' (lev: 47, lat: 81, lon: 65)> array([[[4.9062303e-11, 4.9062081e-11, 4.9061859e-11, ..., 4.9071434e-11, 4.9071955e-11, 4.9072468e-11], [1.6923058e-11, 1.6950108e-11, 1.6952899e-11, ..., 9.9948189e-11, 9.9900643e-11, 9.9872791e-11], [1.6929434e-11, 1.6926667e-11, 1.6934249e-11, ..., 9.9928288e-11, 9.9878307e-11, 9.9870709e-11], ..., [8.5456746e-11, 8.5386170e-11, 8.5320598e-11, ..., 4.0959905e-10, 4.0894488e-10, 4.1077211e-10], [8.5364313e-11, 8.5232314e-11, 8.5183277e-11, ..., 4.0959866e-10, 4.0940437e-10, 4.1138837e-10], [8.0792273e-10, 8.0792262e-10, 8.0792262e-10, ..., 8.0796042e-10, 8.0796181e-10, 8.0796331e-10]], [[4.8373788e-11, 4.8373711e-11, 4.8373645e-11, ..., 4.8377212e-11, 4.8377392e-11, 4.8377573e-11], [1.6216743e-11, 1.6300355e-11, 1.6300242e-11, ..., 9.9929100e-11, 9.9903898e-11, 9.9897722e-11], [1.6222215e-11, 1.6223784e-11, 1.6226818e-11, ..., 9.9919122e-11, 9.9897909e-11, 9.9892851e-11], ... [1.2932498e-09, 1.2932503e-09, 1.2932567e-09, ..., 1.2929462e-09, 1.2929313e-09, 1.2928540e-09], [1.2932498e-09, 1.2936645e-09, 1.2936429e-09, ..., 1.2944885e-09, 1.2945151e-09, 1.2928540e-09], [1.2952455e-09, 1.2952455e-09, 1.2952455e-09, ..., 1.2952455e-09, 1.2952455e-09, 1.2952455e-09]], [[1.3088649e-09, 1.3088649e-09, 1.3088649e-09, ..., 1.3088649e-09, 1.3088649e-09, 1.3088649e-09], [1.3110013e-09, 1.3099452e-09, 1.3099236e-09, ..., 1.3121275e-09, 1.3121271e-09, 1.3131806e-09], [1.3110013e-09, 1.3109217e-09, 1.3109236e-09, ..., 1.3131171e-09, 1.3131758e-09, 1.3131806e-09], ..., [1.3325602e-09, 1.3325523e-09, 1.3325528e-09, ..., 1.3367689e-09, 1.3354362e-09, 1.3367826e-09], [1.3325604e-09, 1.3321834e-09, 1.3321892e-09, ..., 1.3367548e-09, 1.3367537e-09, 1.3367827e-09], [1.3349080e-09, 1.3349080e-09, 1.3349080e-09, ..., 1.3349080e-09, 1.3349080e-09, 1.3349080e-09]]], dtype=float32) Coordinates: * lev (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05 * lat (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0 * lon (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0
rolling_mean = ds_multiple.SpeciesConc_NO2.rolling(time=5).mean()
rolling_mean # contains dask array
<xarray.DataArray 'SpeciesConc_NO2' (time: 30, lev: 47, lat: 81, lon: 65)>
dask.array<truediv, shape=(30, 47, 81, 65), dtype=float32, chunksize=(5, 47, 81, 65), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30T21:30:00
  * lev      (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05
  * lat      (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0
  * lon      (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0
Attributes:
    long_name:         Dry mixing ratio of species NO2
    units:             mol mol-1 dry
    averaging_method:  time-averaged
computed = rolling_mean.compute()
computed # has real numpy values
<xarray.DataArray 'SpeciesConc_NO2' (time: 30, lev: 47, lat: 81, lon: 65)> array([[[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ... 1.27299549e-09, 1.27348032e-09, 1.27298894e-09], [1.27392474e-09, 1.27276323e-09, 1.27275179e-09, ..., 1.27361832e-09, 1.27358102e-09, 1.27298905e-09], [1.27646427e-09, 1.27646427e-09, 1.27646427e-09, ..., 1.27646427e-09, 1.27646427e-09, 1.27646427e-09]], [[1.25867883e-09, 1.25867883e-09, 1.25867883e-09, ..., 1.25867883e-09, 1.25867883e-09, 1.25867883e-09], [1.25231525e-09, 1.25114186e-09, 1.25108790e-09, ..., 1.25833632e-09, 1.25830835e-09, 1.25872757e-09], [1.25231525e-09, 1.25225497e-09, 1.25224831e-09, ..., 1.25869781e-09, 1.25869914e-09, 1.25872757e-09], ..., [1.28924105e-09, 1.28924049e-09, 1.28924071e-09, ..., 1.29703770e-09, 1.29703892e-09, 1.29704647e-09], [1.28924105e-09, 1.28915789e-09, 1.28917210e-09, ..., 1.29697664e-09, 1.29696454e-09, 1.29704658e-09], [1.29171085e-09, 1.29171085e-09, 1.29171085e-09, ..., 1.29171085e-09, 1.29171085e-09, 1.29171085e-09]]]], dtype=float32) Coordinates: * time (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30T21:30:00 * lev (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05 * lat (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0 * lon (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0 Attributes: long_name: Dry mixing ratio of species NO2 units: mol mol-1 dry averaging_method: time-averaged