Lecture 7: Advaned Pandas¶
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Groupby methods¶
Download wildfire data Spatial_Database_Big_Wildfires_US_all.csv
from Canvas, and upload the data to your current working directory.
df = pd.read_csv('Spatial_Database_Big_Wildfires_US_all.csv', parse_dates=['DISCOVERY_DATE'])
df.head()
FOD_ID | FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17 | FS-1418878 | FED | FS-FIRESTAT | USCAENF | Eldorado National Forest | NaN | POWER | POWER | NaN | ... | 295.0 | 16823.0 | G | 38.523333 | -120.211667 | USFS | CA | 5 | 6005.0 | Amador County |
1 | 18 | FS-1418881 | FED | FS-FIRESTAT | USCAENF | Eldorado National Forest | BHA3 | FREDS | FREDS | NaN | ... | 291.0 | 7700.0 | G | 38.780000 | -120.260000 | USFS | CA | 17 | 6017.0 | El Dorado County |
2 | 40 | FS-1418920 | FED | FS-FIRESTAT | USNCNCF | National Forests in North Carolina | BKC8 | AUSTIN CREEK | NaN | NaN | ... | 44.0 | 125.0 | D | 36.001667 | -81.590000 | MISSING/NOT SPECIFIED | NC | 27 | 37027.0 | Caldwell County |
3 | 119 | FS-1419153 | FED | FS-FIRESTAT | USNENBF | Nebraska National Forest | BEW8 | THOMPSON BUTTE | NaN | NaN | ... | 198.0 | 119.0 | D | 43.899167 | -102.954722 | USFS | SD | 103 | 46103.0 | Pennington County |
4 | 120 | FS-1419156 | FED | FS-FIRESTAT | USNENBF | Nebraska National Forest | BEW8 | CHARLES DRAW | NaN | NaN | ... | 197.0 | 119.0 | D | 43.892778 | -102.948056 | USFS | SD | 103 | 46103.0 | Pennington County |
5 rows × 28 columns
df.columns
Index(['FOD_ID', 'FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM', 'NWCG_REPORTING_UNIT_ID', 'NWCG_REPORTING_UNIT_NAME', 'FIRE_CODE', 'FIRE_NAME', 'MTBS_FIRE_NAME', 'COMPLEX_NAME', 'FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_TIME', 'NWCG_CAUSE_CLASSIFICATION', 'NWCG_GENERAL_CAUSE', 'NWCG_CAUSE_AGE_CATEGORY', 'CONT_DATE', 'CONT_DOY', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'LATITUDE', 'LONGITUDE', 'OWNER_DESCR', 'STATE', 'COUNTY', 'FIPS_CODE', 'FIPS_NAME'], dtype='object')
df = df.set_index('FOD_ID')
df[['DISCOVERY_DATE','DISCOVERY_DOY','DISCOVERY_TIME' ]]
DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | |
---|---|---|---|
FOD_ID | |||
17 | 2004-10-06 | 280 | 1415.0 |
18 | 2004-10-13 | 287 | 1618.0 |
40 | 2005-02-12 | 43 | 1520.0 |
119 | 2005-07-16 | 197 | 1715.0 |
120 | 2005-07-16 | 197 | 1730.0 |
... | ... | ... | ... |
400732975 | 2019-08-09 | 221 | 2134.0 |
400732976 | 2020-03-01 | 61 | 1330.0 |
400732977 | 2020-05-13 | 134 | 1300.0 |
400732982 | 2020-08-17 | 230 | 755.0 |
400732984 | 2020-11-20 | 325 | 1110.0 |
60713 rows × 3 columns
df
FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FOD_ID | |||||||||||||||||||||
17 | FS-1418878 | FED | FS-FIRESTAT | USCAENF | Eldorado National Forest | NaN | POWER | POWER | NaN | 2004 | ... | 295.0 | 16823.0 | G | 38.523333 | -120.211667 | USFS | CA | 5 | 6005.0 | Amador County |
18 | FS-1418881 | FED | FS-FIRESTAT | USCAENF | Eldorado National Forest | BHA3 | FREDS | FREDS | NaN | 2004 | ... | 291.0 | 7700.0 | G | 38.780000 | -120.260000 | USFS | CA | 17 | 6017.0 | El Dorado County |
40 | FS-1418920 | FED | FS-FIRESTAT | USNCNCF | National Forests in North Carolina | BKC8 | AUSTIN CREEK | NaN | NaN | 2005 | ... | 44.0 | 125.0 | D | 36.001667 | -81.590000 | MISSING/NOT SPECIFIED | NC | 27 | 37027.0 | Caldwell County |
119 | FS-1419153 | FED | FS-FIRESTAT | USNENBF | Nebraska National Forest | BEW8 | THOMPSON BUTTE | NaN | NaN | 2005 | ... | 198.0 | 119.0 | D | 43.899167 | -102.954722 | USFS | SD | 103 | 46103.0 | Pennington County |
120 | FS-1419156 | FED | FS-FIRESTAT | USNENBF | Nebraska National Forest | BEW8 | CHARLES DRAW | NaN | NaN | 2005 | ... | 197.0 | 119.0 | D | 43.892778 | -102.948056 | USFS | SD | 103 | 46103.0 | Pennington County |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
400732975 | ICS209_2019_10805427 | INTERAGCY | IA-ICS209 | USORMAF | Malheur National Forest | MMY5 | 204 COW | 204 COW | NaN | 2019 | ... | NaN | 9668.0 | G | 44.285050 | -118.459800 | USFS | OR | Baker | 41001.0 | Baker County |
400732976 | ICS209_2020_11699713 | INTERAGCY | IA-ICS209 | USOKCNA | Cherokee Nation Tribe | NaN | CAMERA | NaN | NaN | 2020 | ... | NaN | 401.0 | E | 36.303830 | -94.903820 | UNDEFINED FEDERAL | OK | Delaware | 40041.0 | Delaware County |
400732977 | ICS209_2020_11703752 | INTERAGCY | IA-ICS209 | USFLFLS | Florida Forest Service | NaN | 22ND AVE SE | NaN | NaN | 2020 | ... | NaN | 1000.0 | F | 26.191111 | -81.523889 | Private | FL | Collier | 12021.0 | Collier County |
400732982 | ICS209_2020_11831809 | INTERAGCY | IA-ICS209 | USWAMCR | Mid Columbia National Wildlife Refuge Complex | NaN | TAYLOR POND | TAYLOR POND | NaN | 2020 | ... | 233.0 | 24892.0 | G | 46.670340 | -120.114500 | UNDEFINED FEDERAL | WA | Yakima | 53077.0 | Yakima County |
400732984 | ICS209_2020_11977822 | INTERAGCY | IA-ICS209 | USVAVAF | George Washington & Jefferson National Forests | NaN | MIDDLE MOUNTAIN | NaN | NaN | 2020 | ... | 327.0 | 105.0 | D | 38.578900 | -79.148450 | UNDEFINED FEDERAL | VA | Rockingham | 51165.0 | Rockingham County |
60713 rows × 27 columns
An Example:¶
Question: Find out the top 10 states with largest number of wildfires.
This is an example of a "one-liner" that you can accomplish with groupby.
df.groupby('STATE').FPA_ID.count().nlargest(10).plot(kind='bar', figsize=(12,6))
<Axes: xlabel='STATE'>
What Happened?¶
Let's break apart this operation a bit. The workflow with groubpy
can be divided into three general steps:
Split: Partition the data into different groups based on some criterion.
Apply: Do some caclulation within each group. Different types of "apply" steps might be
- Aggregation: Get the mean or max within the group.
- Transformation: Normalize all the values within a group.
- Filtration: Eliminate some groups based on a criterion.
Combine: Put the results back together into a single object.
The groupby
method¶
Both Series
and DataFrame
objects have a groupby method. It accepts a variety of arguments, but the simplest way to think about it is that you pass another series, whose unique values are used to split the original object into different groups.
df.groupby(df.STATE)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x17f832520>
There is a shortcut for doing this with dataframes: you just pass the column name:
df.groupby('STATE')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x17f7c5340>
The GroubBy
object¶
When we call, groupby
we get back a GroupBy
object:
gb = df.groupby('STATE')
gb
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x17f832490>
The length tells us how many groups were found:
len(gb)
50
All of the groups are available as a dictionary via the .groups
attribute:
groups = gb.groups
len(groups)
50
groups['NJ']
Index([ 247140, 384407, 578890, 579164, 579296, 580135, 580956, 581165, 582330, 582680, ... 400271442, 400389794, 400482043, 400528139, 400531379, 400587134, 400594776, 400602511, 400613352, 400632920], dtype='int64', name='FOD_ID', length=111)
list(groups.keys())
['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'PR', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']
Iterating and selecting groups¶
You can loop through the groups if you want.
for key, group in gb:
display(group.head())
print(f'The key is "{key}"')
break
FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FOD_ID | |||||||||||||||||||||
6689 | FS-1431539 | FED | FS-FIRESTAT | USAKTNF | Tongass National Forest | BRD1 | MUSKEG | NaN | NaN | 2005 | ... | 126.0 | 305.0 | E | 59.087222 | -135.441389 | STATE OR PRIVATE | AK | 220 | 2220.0 | Sitka City and Borough |
109456 | FS-334441 | FED | FS-FIRESTAT | USAKTNF | Tongass National Forest | NaN | MILL | NaN | NaN | 1998 | ... | 189.0 | 118.0 | D | 55.681667 | -132.615000 | STATE OR PRIVATE | AK | NaN | NaN | NaN |
147361 | FS-374211 | FED | FS-FIRESTAT | USAKCGF | Chugach National Forest | NaN | KENAI LAKE | KENAI LAKE | NaN | 2001 | ... | 188.0 | 3260.0 | F | 60.410278 | -149.473611 | USFS | AK | NaN | NaN | NaN |
174677 | W-374459 | FED | DOI-WFMI | USAKAKA | Alaska Regional Office | B391 | B391 | 532391 | NaN | 1995 | ... | 226.0 | 2850.0 | F | 66.832700 | -160.736100 | TRIBAL | AK | NaN | NaN | NaN |
213301 | W-36457 | FED | DOI-WFMI | USAKAKD | Alaska Fire Service | A029 | 203029 | NaN | NaN | 1992 | ... | 126.0 | 170.0 | D | 57.065900 | -154.085700 | BIA | AK | NaN | NaN | NaN |
5 rows × 27 columns
The key is "AK"
And you can get a specific group by key.
gb.get_group('NJ')
FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FOD_ID | |||||||||||||||||||||
247140 | W-234848 | FED | DOI-WFMI | USPADWP | Delaware Water Gap National Recreation Area | NaN | WORTHINGTO | WORTHINGTO | NaN | 1999 | ... | 102.0 | 623.0 | E | 40.995895 | -75.120000 | NPS | NJ | NaN | NaN | NaN |
384407 | FWS-2007NJERRDFR2 | FED | FWS-FMIS | USNJERR | Edwin B. Forsythe National Wildlife Refuge | DFR2 | NJ NJFFS WF ASSIST WARREN GROVE | WARREN GROVE | NaN | 2007 | ... | 141.0 | 17050.0 | G | 39.707500 | -74.309722 | STATE | NJ | NaN | NaN | NaN |
578890 | SFO-2006NJDEPA032704 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | NaN | NaN | NaN | 2006 | ... | NaN | 104.0 | D | 40.304400 | -74.201100 | PRIVATE | NJ | Middlesex | 34023.0 | Middlesex County |
579164 | SFO-2006NJDEPB012703 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | RARITAN CENTER | NaN | NaN | 2006 | ... | NaN | 450.0 | E | 40.296100 | -74.214000 | PRIVATE | NJ | Middlesex | 34023.0 | Middlesex County |
579296 | SFO-2006NJDEPB032108 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | SUNRISE LAKE | NaN | NaN | 2006 | ... | NaN | 136.0 | D | 39.482800 | -74.510900 | STATE | NJ | Burlington | 34005.0 | Burlington County |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
400587134 | SFO-2020NJDEPA03-200223155509 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | NaN | NaN | NaN | 2020 | ... | NaN | 103.0 | D | 40.970480 | -75.117710 | MISSING/NOT SPECIFIED | NJ | Warren | 34041.0 | Warren County |
400594776 | SFO-2020NJDEPC03-200409234633 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | NaN | SPLIT DITCH | NaN | 2020 | ... | NaN | 1518.0 | F | 39.312640 | -75.090240 | MISSING/NOT SPECIFIED | NJ | Cumberland | 34011.0 | Cumberland County |
400602511 | SFO-2020NJDEPC06-200519235337 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | NaN | BIG TIMBER | NaN | 2020 | ... | NaN | 2107.0 | F | 39.651250 | -74.892050 | MISSING/NOT SPECIFIED | NJ | Camden | 34007.0 | Camden County |
400613352 | SFO-2020NJDEPB09-200709201959 | NONFED | ST-NASF | USNJNJS | New Jersey Forest Fire Service | NaN | NaN | NaN | NaN | 2020 | ... | NaN | 204.0 | D | 40.111570 | -74.412330 | MISSING/NOT SPECIFIED | NJ | Ocean | 34029.0 | Ocean County |
400632920 | ICS209_2019_10720324 | INTERAGCY | IA-ICS209 | USNJNJS | New Jersey Forest Fire Service | NaN | SPRING HILL FIRE | SPRING HILL FIRE | NaN | 2019 | ... | NaN | 11638.0 | G | 39.770000 | -74.450000 | MISSING/NOT SPECIFIED | NJ | Burlington | 34005.0 | Burlington County |
111 rows × 27 columns
Aggregation¶
Now that we know how to create a GroupBy
object, let's learn how to do aggregation on it.
One way us to use the .aggregate
method, which accepts another function as its argument. The result is automatically combined into a new dataframe with the group key as the index.
By default, the operation is applied to every column. That's usually not what we want. We can use both .
or []
syntax to select a specific column to operate on. Then we get back a series.
# Find out the biggest fire size in each state.
gb.FIRE_SIZE.aggregate(np.max)
STATE AK 606945.0 AL 4394.0 AR 8294.0 AZ 538049.0 CA 589368.0 CO 208913.0 CT 1300.0 DE 2000.0 FL 158000.0 GA 309200.0 HI 25000.0 IA 4000.0 ID 367785.0 IL 1475.0 IN 1549.0 KS 55000.0 KY 10000.0 LA 12877.0 MA 800.0 MD 3208.0 ME 1092.0 MI 21069.0 MN 92682.0 MO 4761.0 MS 5717.0 MT 270723.3 NC 45294.0 ND 51627.0 NE 74500.0 NH 329.0 NJ 19225.0 NM 297845.0 NV 416821.2 NY 5050.0 OH 2865.0 OK 662700.0 OR 558198.3 PA 7949.2 PR 2702.0 SC 19130.0 SD 83508.0 TN 17140.0 TX 479549.0 UT 357185.0 VA 16021.0 VT 270.0 WA 255575.0 WI 7499.4 WV 5832.0 WY 176878.0 Name: FIRE_SIZE, dtype: float64
gb.max( numeric_only=True)
FIRE_YEAR | DISCOVERY_DOY | DISCOVERY_TIME | CONT_DOY | FIRE_SIZE | LATITUDE | LONGITUDE | FIPS_CODE | |
---|---|---|---|---|---|---|---|---|
STATE | ||||||||
AK | 2020 | 363 | 2359.0 | 365.0 | 606945.0 | 70.138100 | -132.333333 | 2290.0 |
AL | 2020 | 365 | 2356.0 | 362.0 | 4394.0 | 34.999182 | -85.102382 | 1133.0 |
AR | 2020 | 365 | 2330.0 | 365.0 | 8294.0 | 36.499280 | -90.531700 | 5149.0 |
AZ | 2020 | 362 | 2359.0 | 362.0 | 538049.0 | 37.050000 | -109.020278 | 4027.0 |
CA | 2020 | 365 | 2356.0 | 366.0 | 589368.0 | 41.998889 | -114.250800 | 6115.0 |
CO | 2020 | 350 | 2330.0 | 365.0 | 208913.0 | 41.000000 | -102.067461 | 8125.0 |
CT | 2017 | 316 | 2201.0 | 144.0 | 1300.0 | 41.781171 | -72.475777 | 9009.0 |
DE | 2005 | 107 | 1400.0 | 103.0 | 2000.0 | 39.250000 | -75.305800 | NaN |
FL | 2020 | 366 | 2359.0 | 365.0 | 158000.0 | 30.991900 | -80.090000 | 12133.0 |
GA | 2020 | 365 | 2326.0 | 365.0 | 309200.0 | 34.986715 | -80.973820 | 13321.0 |
HI | 2020 | 361 | 2337.0 | 366.0 | 25000.0 | 22.135260 | -154.903534 | 15009.0 |
IA | 2020 | 359 | 2328.0 | 359.0 | 4000.0 | 43.483347 | -90.425753 | 19195.0 |
ID | 2020 | 337 | 2358.0 | 349.0 | 367785.0 | 48.993611 | -111.050000 | 16087.0 |
IL | 2020 | 350 | 2248.0 | 350.0 | 1475.0 | 42.400000 | -87.804337 | 17197.0 |
IN | 2020 | 322 | 2112.0 | 322.0 | 1549.0 | 41.615600 | -85.245469 | 18137.0 |
KS | 2020 | 365 | 2345.0 | 365.0 | 55000.0 | 39.999700 | -94.612500 | 20209.0 |
KY | 2020 | 358 | 2359.0 | 360.0 | 10000.0 | 38.717621 | -81.990090 | 21237.0 |
LA | 2020 | 366 | 2350.0 | 364.0 | 12877.0 | 32.998195 | -89.708889 | 22127.0 |
MA | 2016 | 320 | 1700.0 | 320.0 | 800.0 | 42.719058 | -70.524722 | 25027.0 |
MD | 2020 | 365 | 2341.0 | 365.0 | 3208.0 | 39.705251 | -75.330750 | 24047.0 |
ME | 2020 | 330 | 2141.0 | 330.0 | 1092.0 | 47.333333 | -67.012675 | 23031.0 |
MI | 2020 | 309 | 2100.0 | 309.0 | 21069.0 | 47.420000 | -82.835379 | 26165.0 |
MN | 2020 | 354 | 2359.0 | 347.0 | 92682.0 | 48.998941 | -90.031667 | 27171.0 |
MO | 2020 | 365 | 2358.0 | 360.0 | 4761.0 | 40.593912 | -89.508361 | 29229.0 |
MS | 2020 | 365 | 2330.0 | 365.0 | 5717.0 | 35.034700 | -88.166400 | 28163.0 |
MT | 2020 | 354 | 2351.0 | 365.0 | 270723.3 | 48.999200 | -104.005700 | 30111.0 |
NC | 2020 | 361 | 2345.0 | 363.0 | 45294.0 | 36.548333 | -75.563890 | 37199.0 |
ND | 2020 | 353 | 2350.0 | 353.0 | 51627.0 | 48.995400 | -96.867199 | 38105.0 |
NE | 2020 | 365 | 2359.0 | 365.0 | 74500.0 | 43.042806 | -95.590920 | 31175.0 |
NH | 2016 | 319 | 1530.0 | 320.0 | 329.0 | 44.071389 | -71.197222 | 33005.0 |
NJ | 2020 | 353 | 2300.0 | 350.0 | 19225.0 | 41.187500 | -74.063100 | 34041.0 |
NM | 2020 | 365 | 2340.0 | 365.0 | 297845.0 | 37.030278 | -102.970000 | 35061.0 |
NV | 2020 | 364 | 2345.0 | 365.0 | 416821.2 | 41.996190 | -114.046212 | 32510.0 |
NY | 2020 | 324 | 2339.0 | 327.0 | 5050.0 | 44.876262 | -72.708313 | 36113.0 |
OH | 2020 | 320 | 2000.0 | 323.0 | 2865.0 | 41.680000 | -81.195625 | 39163.0 |
OK | 2020 | 365 | 2359.0 | 366.0 | 662700.0 | 36.999700 | -93.649444 | 40295.0 |
OR | 2020 | 352 | 2358.0 | 366.0 | 558198.3 | 46.066000 | -116.516111 | 41071.0 |
PA | 2020 | 333 | 2228.0 | 334.0 | 7949.2 | 41.774440 | -74.311800 | 42129.0 |
PR | 2018 | 365 | 2145.0 | 239.0 | 2702.0 | 18.425734 | -65.325278 | 72147.0 |
SC | 2020 | 365 | 2300.0 | 346.0 | 19130.0 | 35.150000 | -78.766700 | 45091.0 |
SD | 2020 | 365 | 2315.0 | 365.0 | 83508.0 | 45.989340 | -96.476360 | 46137.0 |
TN | 2020 | 364 | 2353.0 | 366.0 | 17140.0 | 36.614350 | -81.703333 | 47185.0 |
TX | 2020 | 365 | 2330.0 | 366.0 | 479549.0 | 36.615150 | -93.541944 | 48507.0 |
UT | 2020 | 340 | 2359.0 | 356.0 | 357185.0 | 41.994444 | -109.050700 | 49057.0 |
VA | 2020 | 362 | 2345.0 | 348.0 | 16021.0 | 39.166800 | -75.648060 | 51810.0 |
VT | 2015 | 142 | 1220.0 | 143.0 | 270.0 | 43.948333 | -72.320833 | 50027.0 |
WA | 2020 | 327 | 2355.0 | 365.0 | 255575.0 | 48.999900 | -116.943056 | 53077.0 |
WI | 2020 | 345 | 2153.0 | 335.0 | 7499.4 | 46.583333 | -87.629000 | 55141.0 |
WV | 2020 | 364 | 2359.0 | 365.0 | 5832.0 | 39.411365 | -78.643929 | 54109.0 |
WY | 2020 | 352 | 2337.0 | 352.0 | 176878.0 | 45.138658 | -104.068800 | 56045.0 |
gb.FIRE_SIZE.aggregate(np.max).nlargest(10)
STATE OK 662700.0 AK 606945.0 CA 589368.0 OR 558198.3 AZ 538049.0 TX 479549.0 NV 416821.2 ID 367785.0 UT 357185.0 GA 309200.0 Name: FIRE_SIZE, dtype: float64
There are shortcuts for common aggregation functions:
gb.FIRE_SIZE.max().nlargest(10)
STATE OK 662700.0 AK 606945.0 CA 589368.0 OR 558198.3 AZ 538049.0 TX 479549.0 NV 416821.2 ID 367785.0 UT 357185.0 GA 309200.0 Name: FIRE_SIZE, dtype: float64
gb.FIRE_SIZE.mean().nlargest(10)
STATE AK 16440.375603 NV 6138.109191 OR 5595.310524 WA 5071.806130 ID 4657.836385 CA 4086.919845 MT 3571.759463 AZ 3084.376006 UT 2874.009960 CO 2809.130881 Name: FIRE_SIZE, dtype: float64
We can also apply multiple functions at once:
gb.FIRE_SIZE.aggregate([np.max, np.mean])
max | mean | |
---|---|---|
STATE | ||
AK | 606945.0 | 16440.375603 |
AL | 4394.0 | 270.382601 |
AR | 8294.0 | 332.404468 |
AZ | 538049.0 | 3084.376006 |
CA | 589368.0 | 4086.919845 |
CO | 208913.0 | 2809.130881 |
CT | 1300.0 | 285.846154 |
DE | 2000.0 | 705.000000 |
FL | 158000.0 | 1273.882815 |
GA | 309200.0 | 1447.480901 |
HI | 25000.0 | 1499.055450 |
IA | 4000.0 | 307.217379 |
ID | 367785.0 | 4657.836385 |
IL | 1475.0 | 279.925000 |
IN | 1549.0 | 287.605128 |
KS | 55000.0 | 1432.532363 |
KY | 10000.0 | 365.679315 |
LA | 12877.0 | 616.683622 |
MA | 800.0 | 298.357143 |
MD | 3208.0 | 498.654240 |
ME | 1092.0 | 275.581250 |
MI | 21069.0 | 1021.730381 |
MN | 92682.0 | 849.779631 |
MO | 4761.0 | 290.755380 |
MS | 5717.0 | 294.757127 |
MT | 270723.3 | 3571.759463 |
NC | 45294.0 | 726.763361 |
ND | 51627.0 | 691.042681 |
NE | 74500.0 | 1785.755841 |
NH | 329.0 | 233.250000 |
NJ | 19225.0 | 1247.720721 |
NM | 297845.0 | 2738.802666 |
NV | 416821.2 | 6138.109191 |
NY | 5050.0 | 510.412069 |
OH | 2865.0 | 295.750000 |
OK | 662700.0 | 1124.858015 |
OR | 558198.3 | 5595.310524 |
PA | 7949.2 | 580.139743 |
PR | 2702.0 | 350.566406 |
SC | 19130.0 | 342.640423 |
SD | 83508.0 | 1091.602361 |
TN | 17140.0 | 394.399034 |
TX | 479549.0 | 1584.134866 |
UT | 357185.0 | 2874.009960 |
VA | 16021.0 | 616.936929 |
VT | 270.0 | 175.666667 |
WA | 255575.0 | 5071.806130 |
WI | 7499.4 | 425.588632 |
WV | 5832.0 | 372.253441 |
WY | 176878.0 | 2761.473651 |
gb.FIRE_SIZE.aggregate([np.median, np.mean]).nlargest(10, 'mean').plot(kind='bar')
<Axes: xlabel='STATE'>
df.columns
Index(['FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM', 'NWCG_REPORTING_UNIT_ID', 'NWCG_REPORTING_UNIT_NAME', 'FIRE_CODE', 'FIRE_NAME', 'MTBS_FIRE_NAME', 'COMPLEX_NAME', 'FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_TIME', 'NWCG_CAUSE_CLASSIFICATION', 'NWCG_GENERAL_CAUSE', 'NWCG_CAUSE_AGE_CATEGORY', 'CONT_DATE', 'CONT_DOY', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'LATITUDE', 'LONGITUDE', 'OWNER_DESCR', 'STATE', 'COUNTY', 'FIPS_CODE', 'FIPS_NAME'], dtype='object')
df['NWCG_CAUSE_CLASSIFICATION'].unique()
array(['Human', 'Natural', 'Missing data/not specified/undetermined'], dtype=object)
df['NWCG_GENERAL_CAUSE'].unique()
array(['Equipment and vehicle use', 'Power generation/transmission/distribution', 'Debris and open burning', 'Natural', 'Missing data/not specified/undetermined', 'Recreation and ceremony', 'Smoking', 'Railroad operations and maintenance', 'Arson/incendiarism', 'Fireworks', 'Other causes', 'Misuse of fire by a minor', 'Firearms and explosives use'], dtype=object)
df.groupby('NWCG_GENERAL_CAUSE').count()['FPA_ID'].nlargest(10).plot(kind = 'bar')
<Axes: xlabel='NWCG_GENERAL_CAUSE'>
df.groupby('NWCG_CAUSE_CLASSIFICATION').count()['FPA_ID'].nlargest().plot(kind = 'bar')
<Axes: xlabel='NWCG_CAUSE_CLASSIFICATION'>
df.groupby('FIRE_YEAR').FPA_ID.count().plot()
<Axes: xlabel='FIRE_YEAR'>
gb = df.groupby('STATE')
df_CA = gb.get_group('CA')
df_CA.groupby('FIRE_YEAR').FIRE_SIZE.sum().plot()
<Axes: xlabel='FIRE_YEAR'>
Groupby multiple index¶
gb = df.groupby(['STATE','FIRE_YEAR'])
len(gb)
1231
list(gb.groups.keys())[:100:10]
[('AK', 1992), ('AK', 2002), ('AK', 2012), ('AL', 1993), ('AL', 2003), ('AL', 2013), ('AR', 1994), ('AR', 2004), ('AR', 2014), ('AZ', 1995)]
gb.FIRE_SIZE.sum()
STATE FIRE_YEAR AK 1992 141007.000 1993 684669.800 1994 259901.600 1995 42526.000 1996 596706.400 ... WY 2016 254804.700 2017 118803.020 2018 220718.000 2019 41022.200 2020 285791.255 Name: FIRE_SIZE, Length: 1231, dtype: float64
### Select group with multiple index must use tuple!
gb.get_group(('CA', 2003))
FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FOD_ID | |||||||||||||||||||||
157551 | FS-385211 | FED | FS-FIRESTAT | USCAINF | Inyo National Forest | NaN | DEXTER | DEXTER WFU | NaN | 2003 | ... | 286.0 | 2515.0 | F | 37.831389 | -118.795000 | USFS | CA | NaN | NaN | NaN |
158728 | FS-386431 | FED | FS-FIRESTAT | USCAPNF | Plumas National Forest | 4300 | ROWLAND | NaN | NaN | 2003 | ... | 163.0 | 114.0 | D | 39.951389 | -120.068889 | USFS | CA | NaN | NaN | NaN |
159533 | FS-387254 | FED | FS-FIRESTAT | USCASTF | Stanislaus National Forest | 7648 | MUDD | MUD WFU | MUD COMPLEX | 2003 | ... | 300.0 | 4102.0 | F | 38.424722 | -119.961111 | USFS | CA | NaN | NaN | NaN |
159534 | FS-387255 | FED | FS-FIRESTAT | USCASTF | Stanislaus National Forest | 5555 | WHITT | WHITT | MUD COMPLEX | 2003 | ... | 300.0 | 1014.0 | F | 38.378056 | -119.999722 | USFS | CA | NaN | NaN | NaN |
160202 | FS-388122 | FED | FS-FIRESTAT | USCALPF | Los Padres National Forest | 2996 | DEL VENTURI | NaN | NaN | 2003 | ... | 225.0 | 861.0 | E | 36.071111 | -121.390000 | MISSING/NOT SPECIFIED | CA | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
15000827 | ICS209_2003_CA-KRN-37559 | INTERAGCY | IA-ICS209 | USCAKRN | Kern County Fire Department | NaN | HILLSIDE | UNNAMED | NaN | 2003 | ... | 198.0 | 835.0 | E | 35.168056 | -118.468056 | MISSING/NOT SPECIFIED | CA | KERN | 6029.0 | Kern County |
15000828 | ICS209_2003_CA-KRN-38766 | INTERAGCY | IA-ICS209 | USCAKRN | Kern County Fire Department | NaN | MERCURY | UNNAMED | NaN | 2003 | ... | 203.0 | 340.0 | E | 35.200556 | -118.518611 | MISSING/NOT SPECIFIED | CA | KERN | 6029.0 | Kern County |
15000829 | ICS209_2003_CA-LAC-03004054 | INTERAGCY | IA-ICS209 | USCALAC | Los Angeles County Fire Department | NaN | AIRPORT | NaN | NaN | 2003 | ... | 8.0 | 245.0 | D | 33.407778 | -118.402500 | MISSING/NOT SPECIFIED | CA | LOS ANGELES | 6037.0 | Los Angeles County |
201940026 | ICS209_2003-CA-KRN-0333259 | INTERAGCY | IA-ICS209 | USCAKRN | Kern County Fire Department | NaN | TEJON | TEJON | NaN | 2003 | ... | 183.0 | 1155.0 | F | 34.871389 | -118.882778 | MISSING/NOT SPECIFIED | CA | Kern | 6029.0 | Kern County |
400280041 | ICS209_2003_CA-KRN-33853 | INTERAGCY | IA-ICS209 | USCAKRN | Kern County Fire Department | NaN | GRAPEVINE | GRAPEVINE | NaN | 2003 | ... | NaN | 1830.0 | F | 34.916944 | -118.918333 | MISSING/NOT SPECIFIED | CA | Kern | 6029.0 | Kern County |
210 rows × 27 columns
gb.get_group(('CA', 2003)).describe()
FIRE_YEAR | DISCOVERY_DATE | DISCOVERY_DOY | DISCOVERY_TIME | CONT_DOY | FIRE_SIZE | LATITUDE | LONGITUDE | FIPS_CODE | |
---|---|---|---|---|---|---|---|---|---|
count | 210.0 | 210 | 210.000000 | 210.000000 | 198.000000 | 210.000000 | 210.000000 | 210.000000 | 116.000000 |
mean | 2003.0 | 2003-08-04 13:42:51.428571392 | 216.571429 | 1317.295238 | 223.934343 | 4823.423333 | 37.258685 | -120.151254 | 6052.965517 |
min | 2003.0 | 2003-01-06 00:00:00 | 6.000000 | 10.000000 | 7.000000 | 101.000000 | 32.566700 | -124.091100 | 6001.000000 |
25% | 2003.0 | 2003-07-05 00:00:00 | 186.000000 | 1129.000000 | 185.000000 | 200.000000 | 35.580192 | -121.577778 | 6029.000000 |
50% | 2003.0 | 2003-07-30 12:00:00 | 211.500000 | 1348.500000 | 216.000000 | 360.000000 | 37.188207 | -120.191944 | 6045.000000 |
75% | 2003.0 | 2003-09-03 00:00:00 | 246.000000 | 1532.250000 | 255.000000 | 1224.500000 | 39.145239 | -118.755139 | 6073.000000 |
max | 2003.0 | 2003-11-28 00:00:00 | 332.000000 | 2320.000000 | 365.000000 | 280059.000000 | 41.998333 | -114.660556 | 6113.000000 |
std | 0.0 | NaN | 50.829739 | 418.434515 | 59.655787 | 22885.276576 | 2.302258 | 1.918151 | 30.874763 |
### Find out the largest fire in CA, 2020
df.loc[gb['FIRE_SIZE'].idxmax().loc['CA',2020]]
FPA_ID IRW-2020-CAMNF-000730 SOURCE_SYSTEM_TYPE INTERAGCY SOURCE_SYSTEM IA-IRWIN NWCG_REPORTING_UNIT_ID USCAMNF NWCG_REPORTING_UNIT_NAME Mendocino National Forest FIRE_CODE NFP4 FIRE_NAME DOE MTBS_FIRE_NAME AUGUST COMPLEX COMPLEX_NAME AUGUST COMPLEX FIRE_YEAR 2020 DISCOVERY_DATE 2020-08-16 00:00:00 DISCOVERY_DOY 229 DISCOVERY_TIME NaN NWCG_CAUSE_CLASSIFICATION Natural NWCG_GENERAL_CAUSE Natural NWCG_CAUSE_AGE_CATEGORY NaN CONT_DATE 11/11/2020 CONT_DOY 316.0 FIRE_SIZE 589368.0 FIRE_SIZE_CLASS G LATITUDE 39.765255 LONGITUDE -122.672914 OWNER_DESCR USFS STATE CA COUNTY Glenn FIPS_CODE 6021.0 FIPS_NAME Glenn County Name: 400629554, dtype: object
df.groupby(['FIRE_YEAR','NWCG_CAUSE_CLASSIFICATION']).FPA_ID.count().unstack('NWCG_CAUSE_CLASSIFICATION').plot(kind = 'bar', stacked = True)
<Axes: xlabel='FIRE_YEAR'>
df.groupby(['FIRE_YEAR','NWCG_CAUSE_CLASSIFICATION']).FPA_ID.count().plot(kind = 'bar')
<Axes: xlabel='FIRE_YEAR,NWCG_CAUSE_CLASSIFICATION'>
Named aggregation¶
- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column.
df.groupby('NWCG_GENERAL_CAUSE').aggregate(total_size = ('FIRE_SIZE', 'sum'))
total_size | |
---|---|
NWCG_GENERAL_CAUSE | |
Arson/incendiarism | 8.403065e+06 |
Debris and open burning | 5.211647e+06 |
Equipment and vehicle use | 9.101695e+06 |
Firearms and explosives use | 6.280094e+05 |
Fireworks | 5.020915e+05 |
Missing data/not specified/undetermined | 3.140607e+07 |
Misuse of fire by a minor | 3.598069e+05 |
Natural | 1.044037e+08 |
Other causes | 4.257120e+05 |
Power generation/transmission/distribution | 3.368717e+06 |
Railroad operations and maintenance | 7.531811e+05 |
Recreation and ceremony | 4.408299e+06 |
Smoking | 7.751060e+05 |
df.groupby('NWCG_GENERAL_CAUSE').aggregate(total_incidents = ('FPA_ID', 'count'))
total_incidents | |
---|---|
NWCG_GENERAL_CAUSE | |
Arson/incendiarism | 10056 |
Debris and open burning | 7299 |
Equipment and vehicle use | 5165 |
Firearms and explosives use | 283 |
Fireworks | 433 |
Missing data/not specified/undetermined | 14204 |
Misuse of fire by a minor | 345 |
Natural | 18802 |
Other causes | 117 |
Power generation/transmission/distribution | 1322 |
Railroad operations and maintenance | 858 |
Recreation and ceremony | 1237 |
Smoking | 592 |
# Applying different functions to dataframe columns
df.groupby(['NWCG_GENERAL_CAUSE']).aggregate({'FIRE_SIZE':'sum', 'FPA_ID':'count'})
FIRE_SIZE | FPA_ID | |
---|---|---|
NWCG_GENERAL_CAUSE | ||
Arson/incendiarism | 8.403065e+06 | 10056 |
Debris and open burning | 5.211647e+06 | 7299 |
Equipment and vehicle use | 9.101695e+06 | 5165 |
Firearms and explosives use | 6.280094e+05 | 283 |
Fireworks | 5.020915e+05 | 433 |
Missing data/not specified/undetermined | 3.140607e+07 | 14204 |
Misuse of fire by a minor | 3.598069e+05 | 345 |
Natural | 1.044037e+08 | 18802 |
Other causes | 4.257120e+05 | 117 |
Power generation/transmission/distribution | 3.368717e+06 | 1322 |
Railroad operations and maintenance | 7.531811e+05 | 858 |
Recreation and ceremony | 4.408299e+06 | 1237 |
Smoking | 7.751060e+05 | 592 |
Filtration¶
# Take the 1st row from each group
df.groupby(['FIRE_YEAR']).nth(10)
FPA_ID | SOURCE_SYSTEM_TYPE | SOURCE_SYSTEM | NWCG_REPORTING_UNIT_ID | NWCG_REPORTING_UNIT_NAME | FIRE_CODE | FIRE_NAME | MTBS_FIRE_NAME | COMPLEX_NAME | FIRE_YEAR | ... | CONT_DOY | FIRE_SIZE | FIRE_SIZE_CLASS | LATITUDE | LONGITUDE | OWNER_DESCR | STATE | COUNTY | FIPS_CODE | FIPS_NAME | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FOD_ID | |||||||||||||||||||||
745 | FS-1420517 | FED | FS-FIRESTAT | USMTCGF | Custer Gallatin National Forest | BK2U | LAMPKIN | NaN | NaN | 2005 | ... | 89.0 | 236.0 | D | 45.744722 | -104.180000 | USFS | MT | 11 | 30011.0 | Carter County |
7175 | FS-1432509 | FED | FS-FIRESTAT | USMSMNF | National Forests in Mississippi | B76Z | OP-6 | NaN | NaN | 2006 | ... | 13.0 | 200.0 | D | 31.091944 | -89.079167 | USFS | MS | 111 | 28111.0 | Perry County |
18178 | FS-1447814 | FED | FS-FIRESTAT | USGACHF | Chattahoochee-Oconee National Forest | C67F | FORESTRY TRAIL | NaN | NaN | 2007 | ... | 83.0 | 117.0 | D | 34.573333 | -85.168889 | STATE OR PRIVATE | GA | 55 | 13055.0 | Chattooga County |
26970 | FS-1458681 | FED | FS-FIRESTAT | USMOMTF | Mark Twain National Forest | D4LW | BOWLING PIN | NaN | NaN | 2008 | ... | 84.0 | 113.0 | D | 37.748611 | -90.998889 | STATE OR PRIVATE | MO | 221 | 29221.0 | Washington County |
34327 | FS-1475293 | FED | FS-FIRESTAT | USNMCIF | Cibola National Forest | ER0H | ELENA | ELENA | NaN | 2009 | ... | 70.0 | 2400.0 | F | 35.636944 | -99.819167 | STATE OR PRIVATE | OK | 129 | 40129.0 | Roger Mills County |
43014 | FS-256761 | FED | FS-FIRESTAT | USMTKNF | Kootenai National Forest | NaN | THREE GOATS | NaN | NaN | 1992 | ... | 222.0 | 209.0 | D | 48.401667 | -116.080000 | USFS | ID | NaN | NaN | NaN |
54983 | FS-279512 | FED | FS-FIRESTAT | USNMCIF | Cibola National Forest | NaN | NaN | NaN | NaN | 1993 | ... | 175.0 | 106.0 | D | 34.133333 | -107.951667 | USFS | NM | NaN | NaN | NaN |
61790 | FS-286434 | FED | FS-FIRESTAT | USMTBRF | Bitterroot National Forest | NaN | LITTLE CLEARWATER | BITTER-NEZ COMPLEX (MAGRUDER) | BITTER-NEZ COMPLEX | 1994 | ... | 297.0 | 785.0 | E | 45.710000 | -114.838333 | USFS | ID | NaN | NaN | NaN |
77216 | FS-301978 | FED | FS-FIRESTAT | USCORGF | Rio Grande National Forest | NaN | JOHN | NaN | NaN | 1995 | ... | 211.0 | 201.0 | D | 37.066667 | -105.816667 | USFS | CO | NaN | NaN | NaN |
85678 | FS-310483 | FED | FS-FIRESTAT | USLAKIF | Kisatchie National Forest | NaN | 085 SIMMONS ROAD | NaN | NaN | 1996 | ... | 75.0 | 124.0 | D | 31.650000 | -92.533333 | USFS | LA | NaN | NaN | NaN |
97767 | FS-322642 | FED | FS-FIRESTAT | USSCFMF | Francis Marion & Sumter National Forests | NaN | BUCK HALL | NaN | NaN | 1997 | ... | 91.0 | 151.0 | D | 33.033333 | -79.550000 | USFS | SC | NaN | NaN | NaN |
105736 | FS-330703 | FED | FS-FIRESTAT | USFLFNF | National Forests in Florida | 5422 | OAK | OAK | NaN | 1998 | ... | 198.0 | 20100.0 | G | 30.340000 | -82.506667 | USFS | FL | NaN | NaN | NaN |
115178 | FS-340214 | FED | FS-FIRESTAT | USCASTF | Stanislaus National Forest | 7687 | PILOT | PILOT | NaN | 1999 | ... | 242.0 | 4028.0 | F | 37.823056 | -120.017500 | STATE OR PRIVATE | CA | NaN | NaN | NaN |
125447 | FS-350535 | FED | FS-FIRESTAT | USAZCOF | Coconino National Forest | 2300 | GULF | NaN | NaN | 2000 | ... | 203.0 | 1000.0 | F | 34.583889 | -111.467500 | USFS | AZ | 5 | 4005.0 | Coconino County |
137257 | FS-363242 | FED | FS-FIRESTAT | USFLFNF | National Forests in Florida | 5220 | HARMS ROAD/TLH SUPPO | NaN | NaN | 2001 | ... | 67.0 | 209.0 | D | 30.037222 | -84.498056 | STATE OR PRIVATE | FL | 129 | 12129.0 | Wakulla County |
147261 | FS-374101 | FED | FS-FIRESTAT | USKYDBF | Daniel Boone National Forest | 4563 | CANE CREEK | CANE CREEK | NaN | 2002 | ... | 102.0 | 990.0 | E | 36.778889 | -84.315000 | USFS | KY | 235 | 21235.0 | Whitley County |
157440 | FS-385097 | FED | FS-FIRESTAT | USMIHIF | Hiawatha National Forest | NaN | LOVEGROVE | NaN | NaN | 2003 | ... | 119.0 | 172.0 | D | 46.145833 | -84.954167 | STATE OR PRIVATE | MI | NaN | NaN | NaN |
166506 | FS-394950 | FED | FS-FIRESTAT | USAZTNF | Tonto National Forest | A02B | WEBBER | WEBBER | NaN | 2004 | ... | 115.0 | 4311.0 | F | 34.426111 | -111.367500 | USFS | AZ | 7 | 4007.0 | Gila County |
1326854 | SFO-2010-CACDFTUU000251 | NONFED | ST-CACDF | USCATUU | Tulare Unit | NaN | CURVE | NaN | NaN | 2010 | ... | NaN | 108.0 | D | 36.031944 | -118.858056 | MISSING/NOT SPECIFIED | CA | TULARE | 6107.0 | Tulare County |
20020200 | FS-1493543 | FED | FS-FIRESTAT | USMOMTF | Mark Twain National Forest | F1CR | BETHEL | NaN | NaN | 2011 | ... | 71.0 | 433.0 | E | 36.788611 | -92.763333 | USFS | MO | NaN | NaN | NaN |
201430181 | FS-1509564 | FED | FS-FIRESTAT | USMTBRF | Bitterroot National Forest | EKS4 | BURNT STRIP | BURNT STRIP | NaN | 2012 | ... | 296.0 | 3570.0 | F | 45.847778 | -114.646667 | USFS | ID | 049 | 16049.0 | Idaho County |
201760265 | FS-1519193 | FED | FS-FIRESTAT | USCATNF | Tahoe National Forest | HU11 | AMERICAN | AMERICAN | NaN | 2013 | ... | 281.0 | 27440.0 | G | 39.118889 | -120.646111 | USFS | CA | 061 | 6061.0 | Placer County |
300000132 | FS-1527353 | FED | FS-FIRESTAT | USIDNCF | Nez Perce - Clearwater National Forests | JDT1 | ELEVATOR | ELEVATOR MOUNTAIN | SELWAY COMPLEX | 2014 | ... | 297.0 | 4227.0 | F | 45.970833 | -114.818056 | USFS | ID | Idaho | 16049.0 | Idaho County |
300200527 | FS-6352433 | FED | FS-FIRESTAT | USIDIPF | Idaho Panhandle National Forest | J114 | GRASSY MOUNTAIN | NaN | GRIZZLY COMPLEX | 2015 | ... | NaN | 860.0 | E | 47.785000 | -116.205278 | USFS | ID | 079 | 16079.0 | Shoshone County |
400000402 | FS-6738266 | FED | FS-FIRESTAT | USWYMRF | Medicine Bow-Routt National Forests, Thunder B... | KL7H | BROADWAY | BROADWAY | NaN | 2016 | ... | NaN | 2121.0 | F | 41.041389 | -106.762778 | USFS | WY | 007 | 56007.0 | Carbon County |
400006154 | FS-6824721 | FED | FS-FIRESTAT | USMTFNF | Flathead National Forest | EKS8 | DOLLY VARDEN | NaN | NaN | 2017 | ... | NaN | 424.0 | E | 48.012222 | -113.178333 | USFS | MT | 029 | 30029.0 | Flathead County |
400300244 | FS-6885846 | FED | FS-FIRESTAT | USWYMRF | Medicine Bow-Routt National Forests, Thunder B... | L24P | GLENDO FIRE | NaN | NaN | 2018 | ... | 223.0 | 137.1 | D | 42.446667 | -105.241667 | PRIVATE | WY | 031 | 56031.0 | Platte County |
400500130 | FS-6954521 | FED | FS-FIRESTAT | USNMGNF | Gila National Forest | MDB3 | BLACK | BLACK | NaN | 2019 | ... | 227.0 | 1235.0 | F | 33.163611 | -107.905278 | USFS | NM | 017 | 35017.0 | Grant County |
400510712 | W-736893 | FED | DOI-WFMI | USOKANA | Anadarko Agency | MZY5 | WEST CACHE | NaN | NaN | 2020 | ... | 8.0 | 170.0 | D | 34.612610 | -98.650800 | PRIVATE | OK | NaN | NaN | NaN |
29 rows × 27 columns
Time grouping with the weather data¶
df = pd.read_csv('../Assignments/Assignment_3/Millbrook_NY_daily_weather.csv')
df.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
df.describe()
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2557.0 | 2557.000000 | 2557.00 | 2.557000e+03 | 2545.000000 | 2545.000000 | 2545.000000 | 2545.000000 | 2547.000000 | 2545.000000 | ... | 2173.000000 | 2281.000000 | 2545.000000 | 2427.000000 | 2545.000000 | 2545.000000 | 2542.000000 | 2545.000000 | 2544.000000 | 0.0 |
mean | 64756.0 | 2.549059 | -73.74 | 4.179000e+01 | 15.973635 | 4.037642 | 10.003733 | 10.130059 | 3.216726 | 12.908444 | ... | 0.189869 | 0.192877 | 0.150320 | 0.156753 | 12.382318 | 12.353084 | 12.155901 | 12.077367 | 11.994340 | NaN |
std | 0.0 | 0.518602 | 0.00 | 7.106817e-15 | 10.648586 | 9.568906 | 9.865217 | 9.721821 | 8.131592 | 8.015033 | ... | 0.071355 | 0.073708 | 0.029615 | 0.022373 | 9.531614 | 9.468884 | 8.960360 | 8.190971 | 7.334731 | NaN |
min | 64756.0 | -9.000000 | -73.74 | 4.179000e+01 | -12.300000 | -26.000000 | -18.400000 | -19.200000 | 0.000000 | 0.030000 | ... | 0.029000 | 0.030000 | 0.070000 | 0.023000 | -1.800000 | -1.600000 | -0.500000 | 0.500000 | 1.400000 | NaN |
25% | 64756.0 | 2.422000 | -73.74 | 4.179000e+01 | 6.900000 | -3.100000 | 2.000000 | 2.100000 | 0.000000 | 6.070000 | ... | 0.141000 | 0.144000 | 0.133000 | 0.145000 | 2.700000 | 2.800000 | 3.125000 | 3.800000 | 4.700000 | NaN |
50% | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 16.900000 | 3.900000 | 10.200000 | 10.500000 | 0.000000 | 11.820000 | ... | 0.216000 | 0.211000 | 0.157000 | 0.160000 | 12.300000 | 12.200000 | 12.000000 | 12.000000 | 11.900000 | NaN |
75% | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 25.200000 | 11.900000 | 18.700000 | 18.800000 | 2.150000 | 19.450000 | ... | 0.244000 | 0.246000 | 0.170000 | 0.170000 | 21.800000 | 21.700000 | 21.075000 | 20.100000 | 19.100000 | NaN |
max | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 36.500000 | 23.400000 | 28.900000 | 28.400000 | 133.400000 | 31.250000 | ... | 0.359000 | 0.335000 | 0.223000 | 0.218000 | 28.600000 | 28.500000 | 27.100000 | 25.400000 | 23.500000 | NaN |
8 rows × 27 columns
df.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_TYPE', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
df['LST_DATE'] = pd.to_datetime(df['LST_DATE'])
df = df.drop(columns = ['SUR_TEMP_DAILY_TYPE'])
df = df.set_index('LST_DATE')
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | 1.69 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | 6.25 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | 5.69 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | 9.17 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | 9.34 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | 4.00 | ... | NaN | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN |
2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | 7.73 | ... | NaN | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN |
2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | 6.66 | ... | NaN | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN |
2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | 5.39 | ... | NaN | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN |
2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | 1.25 | ... | NaN | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN |
2557 rows × 27 columns
This timeseries has daily resolution, and the daily plots are somewhat noisy.
df.T_DAILY_MEAN.plot()
<Axes: xlabel='LST_DATE'>
A common way to analyze such data in climate science is to create a "climatology," which contains the average values in each month or day of the year. We can do this easily with groupby. Recall that df.index is a pandas DateTimeIndex object.
monthly_climatology = df.groupby(df.index.month).mean(numeric_only=True)
monthly_climatology
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
1 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.442130 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2 | 64756.0 | 2.564424 | -73.74 | 41.79 | 4.866162 | -5.901010 | -0.519192 | -0.220202 | 3.449495 | 8.359444 | ... | 0.250415 | 0.219145 | 0.163520 | 0.165425 | 0.557071 | 0.518687 | 0.707071 | 1.256566 | 2.156061 | NaN |
3 | 64756.0 | 2.564857 | -73.74 | 41.79 | 9.015668 | -2.634101 | 3.184793 | 3.370046 | 2.426728 | 12.813917 | ... | 0.240800 | 0.224130 | 0.168618 | 0.166180 | 3.385714 | 3.270507 | 3.157604 | 3.182488 | 3.337788 | NaN |
4 | 64756.0 | 2.564857 | -73.74 | 41.79 | 14.930476 | 1.948095 | 8.438095 | 8.733810 | 3.217143 | 14.977381 | ... | 0.229071 | 0.235376 | 0.165395 | 0.165433 | 9.685714 | 9.460476 | 8.950000 | 8.119524 | 7.199524 | NaN |
5 | 64756.0 | 2.564857 | -73.74 | 41.79 | 20.996774 | 8.094009 | 14.548848 | 14.772811 | 3.182949 | 17.912673 | ... | 0.208286 | 0.221042 | 0.156848 | 0.161083 | 16.721198 | 16.471889 | 15.606019 | 14.184793 | 12.509217 | NaN |
6 | 64756.0 | 2.564857 | -73.74 | 41.79 | 25.874286 | 12.033810 | 18.952857 | 19.285714 | 2.290000 | 21.610095 | ... | 0.138119 | 0.162114 | 0.136386 | 0.153190 | 22.399524 | 22.177619 | 21.112381 | 19.530476 | 17.661429 | NaN |
7 | 64756.0 | 2.564857 | -73.74 | 41.79 | 29.040092 | 15.894470 | 22.465899 | 22.322120 | 3.880184 | 20.864101 | ... | 0.106465 | 0.113350 | 0.118535 | 0.139206 | 25.527189 | 25.406912 | 24.316204 | 22.814286 | 21.129630 | NaN |
8 | 64756.0 | 2.297069 | -73.74 | 41.79 | 28.139048 | 15.543810 | 21.838571 | 21.615714 | 3.969048 | 18.131429 | ... | 0.134129 | 0.128214 | 0.122652 | 0.136627 | 24.912857 | 24.918095 | 24.244762 | 23.358571 | 22.286190 | NaN |
9 | 64756.0 | 2.564857 | -73.74 | 41.79 | 23.720096 | 11.202871 | 17.460287 | 17.344019 | 3.948804 | 14.033301 | ... | 0.144627 | 0.151890 | 0.127550 | 0.141841 | 20.640191 | 20.736364 | 20.594258 | 20.666986 | 20.554067 | NaN |
10 | 64756.0 | 2.590664 | -73.74 | 41.79 | 17.773488 | 5.738140 | 11.753023 | 11.740465 | 3.637963 | 9.193674 | ... | 0.190428 | 0.187633 | 0.141386 | 0.144750 | 14.663256 | 14.799070 | 15.074419 | 15.892558 | 16.645581 | NaN |
11 | 64756.0 | 2.593429 | -73.74 | 41.79 | 10.377512 | -1.277990 | 4.547368 | 4.720574 | 3.301905 | 6.567895 | ... | 0.234833 | 0.237167 | 0.166742 | 0.161895 | 7.128708 | 7.302392 | 7.955288 | 9.385167 | 10.990909 | NaN |
12 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.055760 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
12 rows × 27 columns
Each row in this new dataframe respresents the average values for the months (1=January, 2=February, etc.)
We can apply more customized aggregations, as with any groupby operation. Below we keep the mean of the mean, max of the max, and min of the min for the temperature measurements.
monthly_T_climatology = df.groupby(df.index.month).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
monthly_T_climatology
T_DAILY_MEAN | T_DAILY_MAX | T_DAILY_MIN | |
---|---|---|---|
LST_DATE | |||
1 | -2.678241 | 19.8 | -26.0 |
2 | -0.519192 | 24.9 | -24.7 |
3 | 3.184793 | 26.8 | -17.4 |
4 | 8.438095 | 30.6 | -11.3 |
5 | 14.548848 | 33.4 | -3.1 |
6 | 18.952857 | 34.5 | 1.5 |
7 | 22.465899 | 36.2 | 8.2 |
8 | 21.838571 | 36.5 | 6.0 |
9 | 17.460287 | 32.7 | -1.6 |
10 | 11.753023 | 29.9 | -5.9 |
11 | 4.547368 | 24.4 | -15.9 |
12 | -0.217512 | 17.9 | -21.8 |
monthly_T_climatology.plot(marker='o')
<Axes: xlabel='LST_DATE'>
If we want to do it on a finer scale, we can group by day of year.
daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
daily_T_climatology.plot(marker='.')
<Axes: xlabel='LST_DATE'>
Calculating anomalies¶
A common mode of analysis in climate science is to remove the climatology from a signal to focus only on the "anomaly" values. This can be accomplished with transformation.
def standardize(x):
return (x - x.mean())/x.std()
anomaly = df.groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')
<Axes: xlabel='LST_DATE'>
anomaly
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | NaN | -1.577491 | NaN | NaN | 0.208563 | 1.082782 | 0.727292 | 0.654996 | -0.398067 | -1.242257 | ... | -0.992876 | -0.857404 | -0.607020 | -1.209148 | 3.178622 | 3.416310 | 3.829867 | 3.857182 | 3.970999 | NaN |
2016-01-02 | NaN | -1.577491 | NaN | NaN | 0.121662 | 0.606696 | 0.396565 | 0.374943 | -0.398067 | 0.186367 | ... | -1.244252 | -1.019588 | -0.794239 | -1.392266 | 2.009687 | 2.285460 | 2.996505 | 3.569060 | 3.783166 | NaN |
2016-01-03 | NaN | -1.577491 | NaN | NaN | 0.504027 | 0.883133 | 0.744698 | 0.619989 | -0.398067 | 0.010922 | ... | -1.411835 | -1.116899 | -0.856646 | -1.575385 | 1.842696 | 2.024495 | 2.626121 | 3.088858 | 3.595334 | NaN |
2016-01-04 | NaN | -1.577491 | NaN | NaN | -0.295464 | -1.051924 | -0.734867 | -0.885294 | -0.398067 | 1.101187 | ... | -1.537523 | -1.181773 | -1.043866 | -1.697463 | 1.091238 | 1.415576 | 2.255738 | 2.800736 | 3.313586 | NaN |
2016-01-05 | NaN | -1.577491 | NaN | NaN | -1.286137 | -1.220858 | -1.326693 | -1.620432 | -0.398067 | 1.154447 | ... | -1.830794 | -1.279083 | -1.043866 | -1.758503 | 0.005798 | 0.371715 | 1.329779 | 2.224493 | 3.031838 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | NaN | 0.407307 | NaN | NaN | -0.963856 | -0.556494 | -0.802459 | -0.738821 | -0.464967 | -0.249287 | ... | NaN | NaN | -0.529551 | -1.086642 | -1.375941 | -1.382812 | -1.435337 | -1.298599 | -1.289971 | NaN |
2022-12-28 | NaN | 0.407307 | NaN | NaN | 0.530772 | -0.213428 | 0.176035 | 0.238415 | -0.464967 | 1.203695 | ... | NaN | NaN | -0.670202 | -1.168774 | -1.375941 | -1.436291 | -1.494016 | -1.360778 | -1.424874 | NaN |
2022-12-29 | NaN | 0.407307 | NaN | NaN | 1.132268 | 0.562984 | 0.885923 | 0.947390 | -0.464967 | 0.786888 | ... | NaN | NaN | -0.881180 | -1.250906 | -1.324114 | -1.436291 | -1.552695 | -1.485136 | -1.492325 | NaN |
2022-12-30 | NaN | 0.407307 | NaN | NaN | 2.207671 | 1.772744 | 2.094651 | 1.962948 | -0.464967 | 0.292173 | ... | NaN | NaN | -0.881180 | -1.333038 | -1.272286 | -1.382812 | -1.552695 | -1.547315 | -1.559777 | NaN |
2022-12-31 | NaN | 0.407307 | NaN | NaN | 1.587947 | 1.375509 | 1.557438 | 1.943787 | 0.279886 | -1.320521 | ... | NaN | NaN | -0.810854 | -1.415170 | -1.220458 | -1.382812 | -1.552695 | -1.547315 | -1.694679 | NaN |
2557 rows × 27 columns
anomaly
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | NaN | -1.577491 | NaN | NaN | 0.208563 | 1.082782 | 0.727292 | 0.654996 | -0.398067 | -1.242257 | ... | -0.992876 | -0.857404 | -0.607020 | -1.209148 | 3.178622 | 3.416310 | 3.829867 | 3.857182 | 3.970999 | NaN |
2016-01-02 | NaN | -1.577491 | NaN | NaN | 0.121662 | 0.606696 | 0.396565 | 0.374943 | -0.398067 | 0.186367 | ... | -1.244252 | -1.019588 | -0.794239 | -1.392266 | 2.009687 | 2.285460 | 2.996505 | 3.569060 | 3.783166 | NaN |
2016-01-03 | NaN | -1.577491 | NaN | NaN | 0.504027 | 0.883133 | 0.744698 | 0.619989 | -0.398067 | 0.010922 | ... | -1.411835 | -1.116899 | -0.856646 | -1.575385 | 1.842696 | 2.024495 | 2.626121 | 3.088858 | 3.595334 | NaN |
2016-01-04 | NaN | -1.577491 | NaN | NaN | -0.295464 | -1.051924 | -0.734867 | -0.885294 | -0.398067 | 1.101187 | ... | -1.537523 | -1.181773 | -1.043866 | -1.697463 | 1.091238 | 1.415576 | 2.255738 | 2.800736 | 3.313586 | NaN |
2016-01-05 | NaN | -1.577491 | NaN | NaN | -1.286137 | -1.220858 | -1.326693 | -1.620432 | -0.398067 | 1.154447 | ... | -1.830794 | -1.279083 | -1.043866 | -1.758503 | 0.005798 | 0.371715 | 1.329779 | 2.224493 | 3.031838 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | NaN | 0.407307 | NaN | NaN | -0.963856 | -0.556494 | -0.802459 | -0.738821 | -0.464967 | -0.249287 | ... | NaN | NaN | -0.529551 | -1.086642 | -1.375941 | -1.382812 | -1.435337 | -1.298599 | -1.289971 | NaN |
2022-12-28 | NaN | 0.407307 | NaN | NaN | 0.530772 | -0.213428 | 0.176035 | 0.238415 | -0.464967 | 1.203695 | ... | NaN | NaN | -0.670202 | -1.168774 | -1.375941 | -1.436291 | -1.494016 | -1.360778 | -1.424874 | NaN |
2022-12-29 | NaN | 0.407307 | NaN | NaN | 1.132268 | 0.562984 | 0.885923 | 0.947390 | -0.464967 | 0.786888 | ... | NaN | NaN | -0.881180 | -1.250906 | -1.324114 | -1.436291 | -1.552695 | -1.485136 | -1.492325 | NaN |
2022-12-30 | NaN | 0.407307 | NaN | NaN | 2.207671 | 1.772744 | 2.094651 | 1.962948 | -0.464967 | 0.292173 | ... | NaN | NaN | -0.881180 | -1.333038 | -1.272286 | -1.382812 | -1.552695 | -1.547315 | -1.559777 | NaN |
2022-12-31 | NaN | 0.407307 | NaN | NaN | 1.587947 | 1.375509 | 1.557438 | 1.943787 | 0.279886 | -1.320521 | ... | NaN | NaN | -0.810854 | -1.415170 | -1.220458 | -1.382812 | -1.552695 | -1.547315 | -1.694679 | NaN |
2557 rows × 27 columns
Note that the transform is different from apply.
df.groupby(df.index.month).transform('mean')
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.44213 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2016-01-02 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.44213 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2016-01-03 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.44213 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2016-01-04 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.44213 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2016-01-05 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.44213 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.05576 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
2022-12-28 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.05576 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
2022-12-29 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.05576 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
2022-12-30 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.05576 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
2022-12-31 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.05576 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
2557 rows × 27 columns
df.groupby(df.index.month).apply('mean')
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
1 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.442130 | 2.217130 | 5.655139 | ... | 0.256699 | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN |
2 | 64756.0 | 2.564424 | -73.74 | 41.79 | 4.866162 | -5.901010 | -0.519192 | -0.220202 | 3.449495 | 8.359444 | ... | 0.250415 | 0.219145 | 0.163520 | 0.165425 | 0.557071 | 0.518687 | 0.707071 | 1.256566 | 2.156061 | NaN |
3 | 64756.0 | 2.564857 | -73.74 | 41.79 | 9.015668 | -2.634101 | 3.184793 | 3.370046 | 2.426728 | 12.813917 | ... | 0.240800 | 0.224130 | 0.168618 | 0.166180 | 3.385714 | 3.270507 | 3.157604 | 3.182488 | 3.337788 | NaN |
4 | 64756.0 | 2.564857 | -73.74 | 41.79 | 14.930476 | 1.948095 | 8.438095 | 8.733810 | 3.217143 | 14.977381 | ... | 0.229071 | 0.235376 | 0.165395 | 0.165433 | 9.685714 | 9.460476 | 8.950000 | 8.119524 | 7.199524 | NaN |
5 | 64756.0 | 2.564857 | -73.74 | 41.79 | 20.996774 | 8.094009 | 14.548848 | 14.772811 | 3.182949 | 17.912673 | ... | 0.208286 | 0.221042 | 0.156848 | 0.161083 | 16.721198 | 16.471889 | 15.606019 | 14.184793 | 12.509217 | NaN |
6 | 64756.0 | 2.564857 | -73.74 | 41.79 | 25.874286 | 12.033810 | 18.952857 | 19.285714 | 2.290000 | 21.610095 | ... | 0.138119 | 0.162114 | 0.136386 | 0.153190 | 22.399524 | 22.177619 | 21.112381 | 19.530476 | 17.661429 | NaN |
7 | 64756.0 | 2.564857 | -73.74 | 41.79 | 29.040092 | 15.894470 | 22.465899 | 22.322120 | 3.880184 | 20.864101 | ... | 0.106465 | 0.113350 | 0.118535 | 0.139206 | 25.527189 | 25.406912 | 24.316204 | 22.814286 | 21.129630 | NaN |
8 | 64756.0 | 2.297069 | -73.74 | 41.79 | 28.139048 | 15.543810 | 21.838571 | 21.615714 | 3.969048 | 18.131429 | ... | 0.134129 | 0.128214 | 0.122652 | 0.136627 | 24.912857 | 24.918095 | 24.244762 | 23.358571 | 22.286190 | NaN |
9 | 64756.0 | 2.564857 | -73.74 | 41.79 | 23.720096 | 11.202871 | 17.460287 | 17.344019 | 3.948804 | 14.033301 | ... | 0.144627 | 0.151890 | 0.127550 | 0.141841 | 20.640191 | 20.736364 | 20.594258 | 20.666986 | 20.554067 | NaN |
10 | 64756.0 | 2.590664 | -73.74 | 41.79 | 17.773488 | 5.738140 | 11.753023 | 11.740465 | 3.637963 | 9.193674 | ... | 0.190428 | 0.187633 | 0.141386 | 0.144750 | 14.663256 | 14.799070 | 15.074419 | 15.892558 | 16.645581 | NaN |
11 | 64756.0 | 2.593429 | -73.74 | 41.79 | 10.377512 | -1.277990 | 4.547368 | 4.720574 | 3.301905 | 6.567895 | ... | 0.234833 | 0.237167 | 0.166742 | 0.161895 | 7.128708 | 7.302392 | 7.955288 | 9.385167 | 10.990909 | NaN |
12 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.055760 | 3.121198 | 4.639954 | ... | 0.254506 | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN |
12 rows × 27 columns
Resampling¶
Another common operation is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset index syntax.
Below we resample the dataset by taking the mean over each month.
df.resample('M').mean().plot(y='T_DAILY_MEAN', marker='o')
<Axes: xlabel='LST_DATE'>
Just like with groupby, we can apply any aggregation function to our resample operation.
df.resample('M').max().plot(y='T_DAILY_MAX', marker='o')
<Axes: xlabel='LST_DATE'>
Rolling Operations¶
df.rolling(30, center=True).T_DAILY_MEAN.mean().plot()
df.rolling(30, center=True, win_type='triang').T_DAILY_MEAN.mean().plot()
<Axes: xlabel='LST_DATE'>
df.loc[(df.T_DAILY_MEAN.isnull())]
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2017-10-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2018-10-13 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 4.6 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-10 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-11 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-12 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-13 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-14 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-15 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-08-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2019-09-25 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2021-01-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2021-11-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12 rows × 27 columns
df_roll = df.rolling(30, center=True).T_DAILY_MEAN.mean()
list(df_roll.loc[df_roll.isnull()].index)
[Timestamp('2016-01-01 00:00:00'), Timestamp('2016-01-02 00:00:00'), Timestamp('2016-01-03 00:00:00'), Timestamp('2016-01-04 00:00:00'), Timestamp('2016-01-05 00:00:00'), Timestamp('2016-01-06 00:00:00'), Timestamp('2016-01-07 00:00:00'), Timestamp('2016-01-08 00:00:00'), Timestamp('2016-01-09 00:00:00'), Timestamp('2016-01-10 00:00:00'), Timestamp('2016-01-11 00:00:00'), Timestamp('2016-01-12 00:00:00'), Timestamp('2016-01-13 00:00:00'), Timestamp('2016-01-14 00:00:00'), Timestamp('2016-01-15 00:00:00'), Timestamp('2017-09-20 00:00:00'), Timestamp('2017-09-21 00:00:00'), Timestamp('2017-09-22 00:00:00'), Timestamp('2017-09-23 00:00:00'), Timestamp('2017-09-24 00:00:00'), Timestamp('2017-09-25 00:00:00'), Timestamp('2017-09-26 00:00:00'), Timestamp('2017-09-27 00:00:00'), Timestamp('2017-09-28 00:00:00'), Timestamp('2017-09-29 00:00:00'), Timestamp('2017-09-30 00:00:00'), Timestamp('2017-10-01 00:00:00'), Timestamp('2017-10-02 00:00:00'), Timestamp('2017-10-03 00:00:00'), Timestamp('2017-10-04 00:00:00'), Timestamp('2017-10-05 00:00:00'), Timestamp('2017-10-06 00:00:00'), Timestamp('2017-10-07 00:00:00'), Timestamp('2017-10-08 00:00:00'), Timestamp('2017-10-09 00:00:00'), Timestamp('2017-10-10 00:00:00'), Timestamp('2017-10-11 00:00:00'), Timestamp('2017-10-12 00:00:00'), Timestamp('2017-10-13 00:00:00'), Timestamp('2017-10-14 00:00:00'), Timestamp('2017-10-15 00:00:00'), Timestamp('2017-10-16 00:00:00'), Timestamp('2017-10-17 00:00:00'), Timestamp('2017-10-18 00:00:00'), Timestamp('2017-10-19 00:00:00'), Timestamp('2018-09-29 00:00:00'), Timestamp('2018-09-30 00:00:00'), Timestamp('2018-10-01 00:00:00'), Timestamp('2018-10-02 00:00:00'), Timestamp('2018-10-03 00:00:00'), Timestamp('2018-10-04 00:00:00'), Timestamp('2018-10-05 00:00:00'), Timestamp('2018-10-06 00:00:00'), Timestamp('2018-10-07 00:00:00'), Timestamp('2018-10-08 00:00:00'), Timestamp('2018-10-09 00:00:00'), Timestamp('2018-10-10 00:00:00'), Timestamp('2018-10-11 00:00:00'), Timestamp('2018-10-12 00:00:00'), Timestamp('2018-10-13 00:00:00'), Timestamp('2018-10-14 00:00:00'), Timestamp('2018-10-15 00:00:00'), Timestamp('2018-10-16 00:00:00'), Timestamp('2018-10-17 00:00:00'), Timestamp('2018-10-18 00:00:00'), Timestamp('2018-10-19 00:00:00'), Timestamp('2018-10-20 00:00:00'), Timestamp('2018-10-21 00:00:00'), Timestamp('2018-10-22 00:00:00'), Timestamp('2018-10-23 00:00:00'), Timestamp('2018-10-24 00:00:00'), Timestamp('2018-10-25 00:00:00'), Timestamp('2018-10-26 00:00:00'), Timestamp('2018-10-27 00:00:00'), Timestamp('2018-10-28 00:00:00'), Timestamp('2019-07-27 00:00:00'), Timestamp('2019-07-28 00:00:00'), Timestamp('2019-07-29 00:00:00'), Timestamp('2019-07-30 00:00:00'), Timestamp('2019-07-31 00:00:00'), Timestamp('2019-08-01 00:00:00'), Timestamp('2019-08-02 00:00:00'), Timestamp('2019-08-03 00:00:00'), Timestamp('2019-08-04 00:00:00'), Timestamp('2019-08-05 00:00:00'), Timestamp('2019-08-06 00:00:00'), Timestamp('2019-08-07 00:00:00'), Timestamp('2019-08-08 00:00:00'), Timestamp('2019-08-09 00:00:00'), Timestamp('2019-08-10 00:00:00'), Timestamp('2019-08-11 00:00:00'), Timestamp('2019-08-12 00:00:00'), Timestamp('2019-08-13 00:00:00'), Timestamp('2019-08-14 00:00:00'), Timestamp('2019-08-15 00:00:00'), Timestamp('2019-08-16 00:00:00'), Timestamp('2019-08-17 00:00:00'), Timestamp('2019-08-18 00:00:00'), Timestamp('2019-08-19 00:00:00'), Timestamp('2019-08-20 00:00:00'), Timestamp('2019-08-21 00:00:00'), Timestamp('2019-08-22 00:00:00'), Timestamp('2019-08-23 00:00:00'), Timestamp('2019-08-24 00:00:00'), Timestamp('2019-08-25 00:00:00'), Timestamp('2019-08-26 00:00:00'), Timestamp('2019-08-27 00:00:00'), Timestamp('2019-08-28 00:00:00'), Timestamp('2019-08-29 00:00:00'), Timestamp('2019-08-30 00:00:00'), Timestamp('2019-08-31 00:00:00'), Timestamp('2019-09-11 00:00:00'), Timestamp('2019-09-12 00:00:00'), Timestamp('2019-09-13 00:00:00'), Timestamp('2019-09-14 00:00:00'), Timestamp('2019-09-15 00:00:00'), Timestamp('2019-09-16 00:00:00'), Timestamp('2019-09-17 00:00:00'), Timestamp('2019-09-18 00:00:00'), Timestamp('2019-09-19 00:00:00'), Timestamp('2019-09-20 00:00:00'), Timestamp('2019-09-21 00:00:00'), Timestamp('2019-09-22 00:00:00'), Timestamp('2019-09-23 00:00:00'), Timestamp('2019-09-24 00:00:00'), Timestamp('2019-09-25 00:00:00'), Timestamp('2019-09-26 00:00:00'), Timestamp('2019-09-27 00:00:00'), Timestamp('2019-09-28 00:00:00'), Timestamp('2019-09-29 00:00:00'), Timestamp('2019-09-30 00:00:00'), Timestamp('2019-10-01 00:00:00'), Timestamp('2019-10-02 00:00:00'), Timestamp('2019-10-03 00:00:00'), Timestamp('2019-10-04 00:00:00'), Timestamp('2019-10-05 00:00:00'), Timestamp('2019-10-06 00:00:00'), Timestamp('2019-10-07 00:00:00'), Timestamp('2019-10-08 00:00:00'), Timestamp('2019-10-09 00:00:00'), Timestamp('2019-10-10 00:00:00'), Timestamp('2020-12-21 00:00:00'), Timestamp('2020-12-22 00:00:00'), Timestamp('2020-12-23 00:00:00'), Timestamp('2020-12-24 00:00:00'), Timestamp('2020-12-25 00:00:00'), Timestamp('2020-12-26 00:00:00'), Timestamp('2020-12-27 00:00:00'), Timestamp('2020-12-28 00:00:00'), Timestamp('2020-12-29 00:00:00'), Timestamp('2020-12-30 00:00:00'), Timestamp('2020-12-31 00:00:00'), Timestamp('2021-01-01 00:00:00'), Timestamp('2021-01-02 00:00:00'), Timestamp('2021-01-03 00:00:00'), Timestamp('2021-01-04 00:00:00'), Timestamp('2021-01-05 00:00:00'), Timestamp('2021-01-06 00:00:00'), Timestamp('2021-01-07 00:00:00'), Timestamp('2021-01-08 00:00:00'), Timestamp('2021-01-09 00:00:00'), Timestamp('2021-01-10 00:00:00'), Timestamp('2021-01-11 00:00:00'), Timestamp('2021-01-12 00:00:00'), Timestamp('2021-01-13 00:00:00'), Timestamp('2021-01-14 00:00:00'), Timestamp('2021-01-15 00:00:00'), Timestamp('2021-01-16 00:00:00'), Timestamp('2021-01-17 00:00:00'), Timestamp('2021-01-18 00:00:00'), Timestamp('2021-01-19 00:00:00'), Timestamp('2021-11-02 00:00:00'), Timestamp('2021-11-03 00:00:00'), Timestamp('2021-11-04 00:00:00'), Timestamp('2021-11-05 00:00:00'), Timestamp('2021-11-06 00:00:00'), Timestamp('2021-11-07 00:00:00'), Timestamp('2021-11-08 00:00:00'), Timestamp('2021-11-09 00:00:00'), Timestamp('2021-11-10 00:00:00'), Timestamp('2021-11-11 00:00:00'), Timestamp('2021-11-12 00:00:00'), Timestamp('2021-11-13 00:00:00'), Timestamp('2021-11-14 00:00:00'), Timestamp('2021-11-15 00:00:00'), Timestamp('2021-11-16 00:00:00'), Timestamp('2021-11-17 00:00:00'), Timestamp('2021-11-18 00:00:00'), Timestamp('2021-11-19 00:00:00'), Timestamp('2021-11-20 00:00:00'), Timestamp('2021-11-21 00:00:00'), Timestamp('2021-11-22 00:00:00'), Timestamp('2021-11-23 00:00:00'), Timestamp('2021-11-24 00:00:00'), Timestamp('2021-11-25 00:00:00'), Timestamp('2021-11-26 00:00:00'), Timestamp('2021-11-27 00:00:00'), Timestamp('2021-11-28 00:00:00'), Timestamp('2021-11-29 00:00:00'), Timestamp('2021-11-30 00:00:00'), Timestamp('2021-12-01 00:00:00'), Timestamp('2022-12-18 00:00:00'), Timestamp('2022-12-19 00:00:00'), Timestamp('2022-12-20 00:00:00'), Timestamp('2022-12-21 00:00:00'), Timestamp('2022-12-22 00:00:00'), Timestamp('2022-12-23 00:00:00'), Timestamp('2022-12-24 00:00:00'), Timestamp('2022-12-25 00:00:00'), Timestamp('2022-12-26 00:00:00'), Timestamp('2022-12-27 00:00:00'), Timestamp('2022-12-28 00:00:00'), Timestamp('2022-12-29 00:00:00'), Timestamp('2022-12-30 00:00:00'), Timestamp('2022-12-31 00:00:00')]
df.rolling('30D', center=True).T_DAILY_MEAN.mean().plot()
<Axes: xlabel='LST_DATE'>
Filling missing data¶
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | 1.69 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | 6.25 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | 5.69 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | 9.17 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | 9.34 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | 4.00 | ... | NaN | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN |
2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | 7.73 | ... | NaN | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN |
2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | 6.66 | ... | NaN | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN |
2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | 5.39 | ... | NaN | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN |
2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | 1.25 | ... | NaN | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN |
2557 rows × 27 columns
df_sub = df.loc['2017-01-01':'2017-12-31']
df_sub['SOIL_MOISTURE_10_DAILY'].plot()
<Axes: xlabel='LST_DATE'>
# Fill values forward
df_sub.ffill()['SOIL_MOISTURE_10_DAILY'].plot()
# Fill values backward
df_sub.bfill()['SOIL_MOISTURE_10_DAILY'].plot()
<Axes: xlabel='LST_DATE'>
df_sub.fillna(method = 'ffill', limit = 3)['SOIL_MOISTURE_10_DAILY'].plot()
<Axes: xlabel='LST_DATE'>