Lecture 9 Dask for Parallel Computing and Big Data¶
In past lectures, we learned how to use numpy, pandas, and xarray to analyze various types of geoscience data. In this lecture, we address an increasingly common problem: what happens if the data we wish to analyze is "big data"?
What is Dask?¶
Dask is a tool that helps us easily extend our familiar python data analysis tools to medium and big data, i.e. datasets that can't fit in our computer's RAM. In many cases, dask also allows us to speed up our analysis by using multiple CPU cores. Dask can help us work more efficiently on our laptop, and it can also help us scale up our analysis on HPC and cloud platforms. Most importantly, dask is almost invisible to the user, meaning that you can focus on your science rather than the details of parallel computing.
Dask was created by the brilliant Matt Rocklin. You can learn more about it at https://dask.org.
Dask provides collections for big data and a scheduler for parallel computing.
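To get a feel for how the scheduler executes a task graph, here is a minimal sketch using `dask.delayed` (the functions `inc` and `add` are purely illustrative):

import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# nothing has run yet; we have only built a small task graph
total = add(inc(1), inc(2))
total.compute()  # the scheduler executes the graph and returns 5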
Dask Arrays¶
A dask array looks and feels a lot like a numpy array. However, a dask array doesn't directly hold any data. Instead, it symbolically represents the computations needed to generate the data. Nothing is actually computed until the actual numerical values are needed. This mode of operation is called "lazy"; it allows one to build up complex, large calculations symbolically before turning them over to the scheduler for execution.
If we want to create a numpy array of all ones, we do it like this:
import numpy as np
shape = (1000, 4000)
ones_np = np.ones(shape)
ones_np
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
This array contains exactly 32 MB of data:
ones_np.nbytes / 1e6
32.0
Now let's create the same array using dask's array interface.
import dask
import dask.array as da
ones = da.ones(shape)
ones
ones = da.ones((100, 1000, 1000))
# Dask will determine an optimal chunk size.
ones
By default, dask treats the entire array as a single chunk; if the array is too big, dask will automatically split it into multiple chunks. "Chunks" describes how the array is split up into many sub-arrays.
There are several ways to specify chunks. In this lecture, we will use a block shape.
chunk_shape = (1000, 1000)
shape = (1000, 4000)
ones = da.ones(shape, chunks=chunk_shape)
ones
|
ones.chunksize
(1000, 1000)
Notice that we just see a symbolic representation of the array, including its shape, dtype, and chunksize.
No data has been generated yet.
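For reference, chunks can be specified in a few other ways as well; these sketches all build the same lazy array of ones, just with different chunk layouts (the exact sizes picked by "auto" depend on dask's configuration):

ones_uniform = da.ones(shape, chunks=1000)                 # one size for every axis
ones_per_axis = da.ones(shape, chunks={0: 1000, 1: 2000})  # per-axis sizes
ones_auto = da.ones(shape, chunks="auto")                  # let dask choose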
When we call `.compute()` on a dask array, the computation is triggered and the dask array becomes a numpy array.
ones.compute()
array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])
In order to understand what happened when we called `.compute()`, we can visualize the dask graph, the symbolic operations that make up the array:
# Install graphviz: mamba install -c conda-forge python-graphviz graphviz
ones.visualize()
Our array has four chunks. To generate it, dask calls `np.ones` four times and then concatenates them together into one array.
Rather than immediately loading a dask array (which puts all the data into RAM), it is more common to want to reduce the data somehow. For example:
sum_of_ones = ones.sum()
sum_of_ones.visualize()
Here we see dask's strategy for finding the sum. This simple example illustrates the beauty of dask: it automatically designs an algorithm appropriate for custom operations with big data.
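Conceptually, the graph computes a partial sum for each chunk and then aggregates the partial results. A pure-numpy sketch of the same strategy, for illustration only:

# compute one partial sum per chunk, then combine them
partial_sums = [np.ones(chunk_shape).sum() for _ in range(4)]
total = np.sum(partial_sums)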
If we make our operation more complex, the graph gets more complex.
fancy_calculation = (ones * ones[::-1, ::-1]).mean()
fancy_calculation.visualize()
A Bigger Calculation¶
The examples above were toy examples; the data (32 MB) is nowhere near big enough to warrant the use of dask.
We can make it a lot bigger!
bigshape = (200000, 4000)
big_ones = da.ones(bigshape, chunks=chunk_shape)
big_ones
big_ones.nbytes / 1e6
6400.0
This dataset is 6.4 GB, rather than 32 MB! This is probably close to or greater than the amount of RAM available in your computer. Nevertheless, dask has no problem working on it: each (1000, 1000) float64 chunk is only 8 MB, and dask only needs to hold a few chunks in memory at any one time.
Do not try to `.visualize()` this array!
When doing a big calculation, dask also has some tools to help us understand what is happening under the hood:
from dask.diagnostics import ProgressBar
big_calc = (big_ones * big_ones[::-1, ::-1]).mean()
with ProgressBar():
result = big_calc.compute()
result
[########################################] | 100% Completed | 2.38 s
1.0
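Beyond the progress bar, `dask.diagnostics` also provides profilers that record what each task and each resource did during the computation; a minimal sketch:

from dask.diagnostics import Profiler, ResourceProfiler

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    big_calc.compute()

prof.results[:3]   # per-task records: key, task, start/end times, worker id
rprof.results[:3]  # periodic samples of CPU and memory usage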
Reduction¶
All the usual numpy methods work on dask arrays. You can also apply numpy functions directly to a dask array, and it will stay lazy.
big_ones_reduce = (np.cos(big_ones)**2).mean(axis=0)
big_ones_reduce
Plotting also triggers computation, since we need the actual values:
from matplotlib import pyplot as plt
plt.plot(big_ones_reduce)
[<matplotlib.lines.Line2D at 0x7f50b7bfa820>]
Dask + Xarray¶
import numpy as np
import xarray as xr

# open the dataset eagerly first (numpy-backed)
ds = xr.tutorial.open_dataset("air_temperature")

# re-open it with chunks, which tells xarray to open the data as dask arrays
ds = xr.tutorial.open_dataset(
    "air_temperature",
    chunks={
        "lat": "auto",
        "lon": "auto",
        "time": 1000,
    },
)
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 dask.array<chunksize=(1000, 25, 53), meta=np.ndarray>
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
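You can inspect the chunk layout of a dask-backed xarray object, and re-chunk it after opening; a small sketch:

ds.air.chunks    # chunk sizes along each dimension of the DataArray
ds.chunks        # mapping of dimension name -> chunk sizes for the Dataset
ds_rechunked = ds.chunk({"time": 500})  # returns a re-chunked copy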
Extracting underlying data¶
There are two ways to pull out the underlying array object in an xarray object.

- `.to_numpy` or `.values` will always return a NumPy array. For dask-backed xarray objects, this means that compute will always be called.
- `.data` will return a Dask array.
```{tip}
Use `to_numpy` or `as_numpy` instead of `.values` so that your code generalizes to other array types (like CuPy arrays, sparse arrays).
```
ds.air.data # dask array, not numpy
ds.air.as_numpy().data  # numpy array
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ..., [296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 , 294.79 ], [296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 , 295.1 ], [296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 , 296.6 ]], [[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999, 238.7 ], [244.59999, 244.39 , 244. , ..., 230.29999, 232. , 235.7 ], [256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ], ..., [295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 , 295. ], [296.19998, 296.5 , 296.29 , ..., 296.4 , 296. , 295.6 ], [296.4 , 296.29 , 296.4 , ..., 297. , 297. , 296.79 ]], ..., [[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999, 244.89 ], [249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 , 242.68999], [262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999, 245.18999], ..., [294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998, 294.69 ], [296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 , 294.79 ], [298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 , 295.79 ]], [[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999, 244.79 ], [249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999, 244.29 ], [262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 , 246.89 ], ..., [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ], [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999], [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
```{exercise}
:label: data-values

Try calling `ds.air.values` and `ds.air.data`. Do you understand the difference?
```
ds.air.to_numpy()
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 , 238.59999], [243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999, 239.29999], [250. , 249.79999, 248.89 , ..., 233.2 , 236.39 , 241.7 ], ..., [296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 , 294.69998], [295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 , 295.19998], [296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 , 296.6 ]], [[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999, 235.79999], [243.59999, 244.09999, 244.2 , ..., 231. , 232.5 , 235.7 ], [253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 , 238.5 ], ..., [296.4 , 295.9 , 296.19998, ..., 295.4 , 295.1 , 294.79 ], [296.19998, 296.69998, 296.79 , ..., 295.6 , 295.5 , 295.1 ], [296.29 , 297.19998, 297.4 , ..., 296.4 , 296.4 , 296.6 ]], [[242.29999, 242.2 , 242.29999, ..., 234.29999, 236.09999, 238.7 ], [244.59999, 244.39 , 244. , ..., 230.29999, 232. , 235.7 ], [256.19998, 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ], ..., [295.6 , 295.4 , 295.4 , ..., 296.29 , 295.29 , 295. ], [296.19998, 296.5 , 296.29 , ..., 296.4 , 296. , 295.6 ], [296.4 , 296.29 , 296.4 , ..., 297. , 297. , 296.79 ]], ..., [[243.48999, 242.98999, 242.09 , ..., 244.18999, 244.48999, 244.89 ], [249.09 , 248.98999, 248.59 , ..., 240.59 , 241.29 , 242.68999], [262.69 , 262.19 , 261.69 , ..., 239.39 , 241.68999, 245.18999], ..., [294.79 , 295.29 , 297.49 , ..., 295.49 , 295.38998, 294.69 ], [296.79 , 297.88998, 298.29 , ..., 295.49 , 295.49 , 294.79 ], [298.19 , 299.19 , 298.79 , ..., 296.09 , 295.79 , 295.79 ]], [[245.79 , 244.79 , 243.48999, ..., 243.29 , 243.98999, 244.79 ], [249.89 , 249.29 , 248.48999, ..., 241.29 , 242.48999, 244.29 ], [262.38998, 261.79 , 261.29 , ..., 240.48999, 243.09 , 246.89 ], ..., [293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 , 294.29 ], [296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 , 294.38998], [297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 , 295.19 ]], [[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999, 241.79 ], [249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 , 241.68999], [262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 , 246.29 ], ..., [293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 , 294.69 ], [296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 , 295.19 ], [297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 , 295.69 ]]], dtype=float32)
Lazy computation¶
Xarray seamlessly wraps dask, so all computation is deferred until explicitly requested.
mean = ds.air.mean("time")
mean
<xarray.DataArray 'air' (lat: 25, lon: 53)>
dask.array<mean_agg-aggregate, shape=(25, 53), dtype=float32, chunksize=(25, 53), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dask actually constructs a graph of the required computation. Here it's pretty simple: the full array is subdivided into 3 subarrays along the time dimension (chunks of 1000 time steps). Dask will load each of these subarrays in a separate thread using the default single-machine scheduler. You can visualize dask "task graphs", which represent the requested computation:
mean.data # dask array
# visualize the graph for the underlying dask array
# we ask it to visualize the graph from left to right because it looks nicer
dask.visualize(mean.data, rankdir="LR", optimize_graph=True)
Getting concrete values¶
At some point, you will want to actually get concrete values (usually a numpy array) from dask.
There are two ways to compute values on dask arrays.

- `.compute()` returns an xarray object just like a dask array.
- `.load()` replaces the dask array in the xarray object with a numpy array. This is equivalent to `ds = ds.compute()`.
ds.load()
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
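Note that `.compute()` returns a new object and leaves the original untouched, which is handy when you want to keep the lazy version around; a small sketch:

ds_lazy = xr.tutorial.open_dataset("air_temperature", chunks={"time": 1000})
ds_eager = ds_lazy.compute()  # new numpy-backed Dataset; ds_lazy stays lazy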
Reading a multi-file dataset with dask¶
ds_multiple = xr.open_mfdataset('/scratch/xj103/rcaes/GEOS_Chem_Sims/GEOSChem.SpeciesConcHourlyNO2.2020*_2100z.nc4')
ds_multiple
<xarray.Dataset>
Dimensions:          (time: 30, lev: 47, lat: 81, lon: 65)
Coordinates:
  * time             (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30...
  * lev              (lev) float64 0.9925 0.9775 0.9625 ... 0.0001387 3.8e-05
  * lat              (lat) float64 27.0 27.25 27.5 27.75 ... 46.5 46.75 47.0
  * lon              (lon) float64 -130.0 -129.7 -129.4 ... -110.6 -110.3 -110.0
Data variables:
    SpeciesConc_NO2  (time, lev, lat, lon) float32 dask.array<chunksize=(1, 47, 81, 65), meta=np.ndarray>
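`open_mfdataset` takes options that control how the files are combined and opened; a hedged sketch with a hypothetical file pattern and variable name:

ds_hourly = xr.open_mfdataset(
    "GEOSChem.SpeciesConcHourlyNO2.2020*_2100z.nc4",  # hypothetical relative path
    combine="by_coords",  # align the files using their coordinate values
    parallel=True,        # open and preprocess the files in parallel with dask
)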
%time ds_multiple.SpeciesConc_NO2.mean(dim='time').compute()
CPU times: user 462 ms, sys: 89.5 ms, total: 552 ms
Wall time: 2.52 s
<xarray.DataArray 'SpeciesConc_NO2' (lev: 47, lat: 81, lon: 65)> array([[[4.9062303e-11, 4.9062081e-11, 4.9061859e-11, ..., 4.9071434e-11, 4.9071955e-11, 4.9072468e-11], [1.6923058e-11, 1.6950108e-11, 1.6952899e-11, ..., 9.9948189e-11, 9.9900643e-11, 9.9872791e-11], [1.6929434e-11, 1.6926667e-11, 1.6934249e-11, ..., 9.9928288e-11, 9.9878307e-11, 9.9870709e-11], ..., [8.5456746e-11, 8.5386170e-11, 8.5320598e-11, ..., 4.0959905e-10, 4.0894488e-10, 4.1077211e-10], [8.5364313e-11, 8.5232314e-11, 8.5183277e-11, ..., 4.0959866e-10, 4.0940437e-10, 4.1138837e-10], [8.0792273e-10, 8.0792262e-10, 8.0792262e-10, ..., 8.0796042e-10, 8.0796181e-10, 8.0796331e-10]], [[4.8373788e-11, 4.8373711e-11, 4.8373645e-11, ..., 4.8377212e-11, 4.8377392e-11, 4.8377573e-11], [1.6216743e-11, 1.6300355e-11, 1.6300242e-11, ..., 9.9929100e-11, 9.9903898e-11, 9.9897722e-11], [1.6222215e-11, 1.6223784e-11, 1.6226818e-11, ..., 9.9919122e-11, 9.9897909e-11, 9.9892851e-11], ... [1.2932498e-09, 1.2932503e-09, 1.2932567e-09, ..., 1.2929462e-09, 1.2929313e-09, 1.2928540e-09], [1.2932498e-09, 1.2936645e-09, 1.2936429e-09, ..., 1.2944885e-09, 1.2945151e-09, 1.2928540e-09], [1.2952455e-09, 1.2952455e-09, 1.2952455e-09, ..., 1.2952455e-09, 1.2952455e-09, 1.2952455e-09]], [[1.3088649e-09, 1.3088649e-09, 1.3088649e-09, ..., 1.3088649e-09, 1.3088649e-09, 1.3088649e-09], [1.3110013e-09, 1.3099452e-09, 1.3099236e-09, ..., 1.3121275e-09, 1.3121271e-09, 1.3131806e-09], [1.3110013e-09, 1.3109217e-09, 1.3109236e-09, ..., 1.3131171e-09, 1.3131758e-09, 1.3131806e-09], ..., [1.3325602e-09, 1.3325523e-09, 1.3325528e-09, ..., 1.3367689e-09, 1.3354362e-09, 1.3367826e-09], [1.3325604e-09, 1.3321834e-09, 1.3321892e-09, ..., 1.3367548e-09, 1.3367537e-09, 1.3367827e-09], [1.3349080e-09, 1.3349080e-09, 1.3349080e-09, ..., 1.3349080e-09, 1.3349080e-09, 1.3349080e-09]]], dtype=float32) Coordinates: * lev (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05 * lat (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0 * lon (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0
rolling_mean = ds_multiple.SpeciesConc_NO2.rolling(time=5).mean()
rolling_mean # contains dask array
<xarray.DataArray 'SpeciesConc_NO2' (time: 30, lev: 47, lat: 81, lon: 65)>
dask.array<truediv, shape=(30, 47, 81, 65), dtype=float32, chunksize=(5, 47, 81, 65), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30T21:30:00
  * lev      (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05
  * lat      (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0
  * lon      (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0
Attributes:
    long_name:         Dry mixing ratio of species NO2
    units:             mol mol-1 dry
    averaging_method:  time-averaged
computed = rolling_mean.compute()
computed # has real numpy values
<xarray.DataArray 'SpeciesConc_NO2' (time: 30, lev: 47, lat: 81, lon: 65)> array([[[[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ..., [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan]], [[ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], [ nan, nan, nan, ..., nan, nan, nan], ... 1.27299549e-09, 1.27348032e-09, 1.27298894e-09], [1.27392474e-09, 1.27276323e-09, 1.27275179e-09, ..., 1.27361832e-09, 1.27358102e-09, 1.27298905e-09], [1.27646427e-09, 1.27646427e-09, 1.27646427e-09, ..., 1.27646427e-09, 1.27646427e-09, 1.27646427e-09]], [[1.25867883e-09, 1.25867883e-09, 1.25867883e-09, ..., 1.25867883e-09, 1.25867883e-09, 1.25867883e-09], [1.25231525e-09, 1.25114186e-09, 1.25108790e-09, ..., 1.25833632e-09, 1.25830835e-09, 1.25872757e-09], [1.25231525e-09, 1.25225497e-09, 1.25224831e-09, ..., 1.25869781e-09, 1.25869914e-09, 1.25872757e-09], ..., [1.28924105e-09, 1.28924049e-09, 1.28924071e-09, ..., 1.29703770e-09, 1.29703892e-09, 1.29704647e-09], [1.28924105e-09, 1.28915789e-09, 1.28917210e-09, ..., 1.29697664e-09, 1.29696454e-09, 1.29704658e-09], [1.29171085e-09, 1.29171085e-09, 1.29171085e-09, ..., 1.29171085e-09, 1.29171085e-09, 1.29171085e-09]]]], dtype=float32) Coordinates: * time (time) datetime64[ns] 2020-09-01T21:30:00 ... 2020-09-30T21:30:00 * lev (lev) float64 0.9925 0.9775 0.9625 ... 0.0004141 0.0001387 3.8e-05 * lat (lat) float64 27.0 27.25 27.5 27.75 28.0 ... 46.25 46.5 46.75 47.0 * lon (lon) float64 -130.0 -129.7 -129.4 -129.1 ... -110.6 -110.3 -110.0 Attributes: long_name: Dry mixing ratio of species NO2 units: mol mol-1 dry averaging_method: time-averaged