Lecture 6: Pandas Basics¶
Let's start by importing pandas library
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Pandas Data Structures: Series¶
A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an index. The index contains the labels that we use to access the data.
There are many ways to create a Series. We will just show a few.
# Create a series from list
names = ['Xiaomeng', 'Siyi','Siyi', 'Matt']
values = [32, 25,26, 22]
ages = pd.Series(values, index=names)
ages
Xiaomeng 32 Siyi 25 Siyi 26 Matt 22 dtype: int64
ages.plot(kind='bar')
<Axes: >
Arithmetic operations and most numpy function can be applied to Series.
An important point is that the Series keep their index during such operations.
np.sqrt(ages)
Xiaomeng 5.656854 Siyi 5.000000 Siyi 5.099020 Matt 4.690416 dtype: float64
We can access the underlying index object if we need to:
ages.index
Index(['Xiaomeng', 'Siyi', 'Siyi', 'Matt'], dtype='object')
Indexing¶
We can get values back out using the index via the .loc
attribute
ages.loc['Xiaomeng']
32
ages.loc['Siyi']
Siyi 25 Siyi 26 dtype: int64
Or by raw position using .iloc
ages.iloc[2]
26
We can pass a list or array to loc to get multiple rows back:
ages.loc[['Matt', 'Siyi']]
Matt 22 Siyi 25 Siyi 26 dtype: int64
And we can even use slice notation
ages.loc['Xiaomeng':'Matt']
Xiaomeng 32 Siyi 25 Siyi 26 Matt 22 dtype: int64
ages.iloc[:2]
Xiaomeng 32 Siyi 25 dtype: int64
If we need to, we can always get the raw data back out as well
ages.values # a numpy array
array([32, 25, 26, 22])
ages.index # a pandas Index object
Index(['Xiaomeng', 'Siyi', 'Siyi', 'Matt'], dtype='object')
Pandas Data Structures: DataFrame¶
There is a lot more to Series, but they are limit to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of series that share the same index. It's a lot like a table in a spreadsheet.
Below we create a DataFrame.
# first we create a dictionary
data = {'age': [32, 25, 22],
'height': [160, np.NaN, np.NaN],
'is_teacher': [True, False, False]}
df = pd.DataFrame(data, index=['Xiaomeng', 'Siyi', 'Matt'])
df
age | height | is_teacher | |
---|---|---|---|
Xiaomeng | 32 | 160.0 | True |
Siyi | 25 | NaN | False |
Matt | 22 | NaN | False |
# You can set the style of the table
df.style.highlight_max()
age | height | is_teacher | |
---|---|---|---|
Xiaomeng | 32 | 160.000000 | True |
Siyi | 25 | nan | False |
Matt | 22 | nan | False |
Pandas handles missing data very elegantly, keeping track of it through all calculations.
Statistics¶
A wide range of statistical functions are available on both Series and DataFrames.
df.min()
age 22 height 160.0 is_teacher False dtype: object
df.mean()
age 26.333333 height 160.000000 is_teacher 0.333333 dtype: float64
df.count()
age 3 height 1 is_teacher 3 dtype: int64
df.std()
age 5.131601 height NaN is_teacher 0.577350 dtype: float64
df.describe()
age | height | |
---|---|---|
count | 3.000000 | 1.0 |
mean | 26.333333 | 160.0 |
std | 5.131601 | NaN |
min | 22.000000 | 160.0 |
25% | 23.500000 | 160.0 |
50% | 25.000000 | 160.0 |
75% | 28.500000 | 160.0 |
max | 32.000000 | 160.0 |
We can get a single column as a Series using python's getitem syntax on the DataFrame object.
df['height']
Xiaomeng 160.0 Siyi NaN Matt NaN Name: height, dtype: float64
...or using attribute syntax.
df.height
Xiaomeng 160.0 Siyi NaN Matt NaN Name: height, dtype: float64
Indexing works very similar to series
df.loc['Xiaomeng']
age 32 height 160.0 is_teacher True Name: Xiaomeng, dtype: object
df.iloc[2]
age 22 height NaN is_teacher False Name: Matt, dtype: object
But we can also specify the column we want to access
df.loc['Xiaomeng', 'age']
32
df.iloc[:2, 0]
Xiaomeng 32 Siyi 25 Name: age, dtype: int64
If we make a calculation using columns from the DataFrame, it will keep the same index:
Which we can easily add as another column to the DataFrame:
2023 - df['age']
Xiaomeng 1991 Siyi 1998 Matt 2001 Name: age, dtype: int64
df['year'] = 2023 - df['age']
df
age | height | is_teacher | year | |
---|---|---|---|---|
Xiaomeng | 32 | 160.0 | True | 1991 |
Siyi | 25 | NaN | False | 1998 |
Matt | 22 | NaN | False | 2001 |
# Modify values
df.loc['Siyi', 'height'] = 165
# Don't run it many times, it will keep adding values.
df.loc['Xiaomeng', 'age'] += 1
df
age | height | is_teacher | year | |
---|---|---|---|---|
Xiaomeng | 33 | 160.0 | True | 1991 |
Siyi | 25 | 165.0 | False | 1998 |
Matt | 22 | NaN | False | 2001 |
Add new rows¶
df.loc['Kerry'] = [25, np.NaN, False, 1998]
df
age | height | is_teacher | year | |
---|---|---|---|---|
Xiaomeng | 33 | 160.0 | True | 1991 |
Siyi | 25 | 165.0 | False | 1998 |
Matt | 22 | NaN | False | 2001 |
Kerry | 25 | NaN | False | 1998 |
Merge with new series¶
education = pd.Series(['PhD', 'masters', 'bachelor','PhD'],
index=['Xiaomeng', 'Siyi', 'Matt', 'Lisa'],
name='education')
education
Xiaomeng PhD Siyi masters Matt bachelor Lisa PhD Name: education, dtype: object
# returns a new DataFrame
df_join = df.join(education)
df_join
age | height | is_teacher | year | education | |
---|---|---|---|---|---|
Xiaomeng | 33 | 160.0 | True | 1991 | PhD |
Siyi | 25 | 165.0 | False | 1998 | masters |
Matt | 22 | NaN | False | 2001 | bachelor |
Kerry | 25 | NaN | False | 1998 | NaN |
df_join_right = df.join(education, how = 'right')
df_join_right
age | height | is_teacher | year | education | |
---|---|---|---|---|---|
Xiaomeng | 33.0 | 160.0 | True | 1991.0 | PhD |
Siyi | 25.0 | 165.0 | False | 1998.0 | masters |
Matt | 22.0 | NaN | False | 2001.0 | bachelor |
Lisa | NaN | NaN | NaN | NaN | PhD |
Using concat to append new rows¶
df_combined = pd.concat([df_join, df_join_right])
df_combined
age | height | is_teacher | year | education | |
---|---|---|---|---|---|
Xiaomeng | 33.0 | 160.0 | True | 1991.0 | PhD |
Siyi | 25.0 | 165.0 | False | 1998.0 | masters |
Matt | 22.0 | NaN | False | 2001.0 | bachelor |
Kerry | 25.0 | NaN | False | 1998.0 | NaN |
Xiaomeng | 33.0 | 160.0 | True | 1991.0 | PhD |
Siyi | 25.0 | 165.0 | False | 1998.0 | masters |
Matt | 22.0 | NaN | False | 2001.0 | bachelor |
Lisa | NaN | NaN | NaN | NaN | PhD |
df_combined.drop_duplicates()
age | height | is_teacher | year | education | |
---|---|---|---|---|---|
Xiaomeng | 33.0 | 160.0 | True | 1991.0 | PhD |
Siyi | 25.0 | 165.0 | False | 1998.0 | masters |
Matt | 22.0 | NaN | False | 2001.0 | bachelor |
Kerry | 25.0 | NaN | False | 1998.0 | NaN |
Lisa | NaN | NaN | NaN | NaN | PhD |
### This is equivalent to:
df.join(education, how = 'outer')
age | height | is_teacher | year | education | |
---|---|---|---|---|---|
Kerry | 25.0 | NaN | False | 1998.0 | NaN |
Lisa | NaN | NaN | NaN | NaN | PhD |
Matt | 22.0 | NaN | False | 2001.0 | bachelor |
Siyi | 25.0 | 165.0 | False | 1998.0 | masters |
Xiaomeng | 33.0 | 160.0 | True | 1991.0 | PhD |
Working with real data in Pandas¶
Importing: read_csv()¶
As you can see, pd.read_csv() has quite a few parameters. Don't be overwhelmed – most of these are optional arguments that allow you to specify exactly how your data file is structured and which part(s) you want to import. In particular, the sep parameter allows the user to specify the type of delimiter used in the file. The default is a comma, but you can actually pass other common delimiters (such as sep='\t', which is a tab) to import other delimited files. The only required argument is a string specifying the filepath of your file.
pd.read_csv?
df = pd.read_csv('../Assignments/Assignment_3/Millbrook_NY_daily_weather.csv')
Examine the dataframe¶
# Show the first five rows.
df.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
# Show the first five rows.
df.tail()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2552 | 2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | ... | NaN | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN |
2553 | 2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | ... | NaN | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN |
2554 | 2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | ... | NaN | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN |
2555 | 2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | ... | NaN | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN |
2556 | 2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | ... | NaN | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN |
5 rows × 29 columns
# Both df.head() and df.tail() can also accept an integer argument, e.g. df.head(n), where the first n rows will be printed.
df.head(10)
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 | 2016-01-06 | 64756 | 2.422 | -73.74 | 41.79 | 5.0 | -16.4 | -5.7 | -5.9 | 0.0 | ... | NaN | 0.186 | 0.146 | 0.137 | -0.1 | 0.4 | 1.7 | 3.6 | 6.2 | NaN |
6 | 2016-01-07 | 64756 | 2.422 | -73.74 | 41.79 | 4.6 | -11.7 | -3.6 | -4.8 | 0.0 | ... | NaN | 0.183 | 0.144 | NaN | -0.2 | 0.2 | 1.4 | 3.2 | 5.8 | NaN |
7 | 2016-01-08 | 64756 | 2.422 | -73.74 | 41.79 | 6.9 | -11.8 | -2.5 | -2.0 | 0.0 | ... | NaN | 0.180 | 0.144 | 0.137 | -0.3 | 0.0 | 1.1 | 2.8 | 5.5 | NaN |
8 | 2016-01-09 | 64756 | 2.422 | -73.74 | 41.79 | 6.6 | -0.1 | 3.2 | 4.1 | 0.0 | ... | NaN | 0.179 | 0.141 | 0.136 | -0.1 | 0.1 | 1.0 | 2.6 | 5.2 | NaN |
9 | 2016-01-10 | 64756 | 2.422 | -73.74 | 41.79 | 15.7 | 3.4 | 9.5 | 9.0 | 24.2 | ... | NaN | 0.204 | 0.149 | 0.134 | 0.6 | 0.5 | 1.1 | 2.4 | 4.9 | NaN |
10 rows × 29 columns
df.describe()
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2557.0 | 2557.000000 | 2557.00 | 2.557000e+03 | 2545.000000 | 2545.000000 | 2545.000000 | 2545.000000 | 2547.000000 | 2545.000000 | ... | 2173.000000 | 2281.000000 | 2545.000000 | 2427.000000 | 2545.000000 | 2545.000000 | 2542.000000 | 2545.000000 | 2544.000000 | 0.0 |
mean | 64756.0 | 2.549059 | -73.74 | 4.179000e+01 | 15.973635 | 4.037642 | 10.003733 | 10.130059 | 3.216726 | 12.908444 | ... | 0.189869 | 0.192877 | 0.150320 | 0.156753 | 12.382318 | 12.353084 | 12.155901 | 12.077367 | 11.994340 | NaN |
std | 0.0 | 0.518602 | 0.00 | 7.106817e-15 | 10.648586 | 9.568906 | 9.865217 | 9.721821 | 8.131592 | 8.015033 | ... | 0.071355 | 0.073708 | 0.029615 | 0.022373 | 9.531614 | 9.468884 | 8.960360 | 8.190971 | 7.334731 | NaN |
min | 64756.0 | -9.000000 | -73.74 | 4.179000e+01 | -12.300000 | -26.000000 | -18.400000 | -19.200000 | 0.000000 | 0.030000 | ... | 0.029000 | 0.030000 | 0.070000 | 0.023000 | -1.800000 | -1.600000 | -0.500000 | 0.500000 | 1.400000 | NaN |
25% | 64756.0 | 2.422000 | -73.74 | 4.179000e+01 | 6.900000 | -3.100000 | 2.000000 | 2.100000 | 0.000000 | 6.070000 | ... | 0.141000 | 0.144000 | 0.133000 | 0.145000 | 2.700000 | 2.800000 | 3.125000 | 3.800000 | 4.700000 | NaN |
50% | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 16.900000 | 3.900000 | 10.200000 | 10.500000 | 0.000000 | 11.820000 | ... | 0.216000 | 0.211000 | 0.157000 | 0.160000 | 12.300000 | 12.200000 | 12.000000 | 12.000000 | 11.900000 | NaN |
75% | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 25.200000 | 11.900000 | 18.700000 | 18.800000 | 2.150000 | 19.450000 | ... | 0.244000 | 0.246000 | 0.170000 | 0.170000 | 21.800000 | 21.700000 | 21.075000 | 20.100000 | 19.100000 | NaN |
max | 64756.0 | 2.622000 | -73.74 | 4.179000e+01 | 36.500000 | 23.400000 | 28.900000 | 28.400000 | 133.400000 | 31.250000 | ... | 0.359000 | 0.335000 | 0.223000 | 0.218000 | 28.600000 | 28.500000 | 27.100000 | 25.400000 | 23.500000 | NaN |
8 rows × 27 columns
# Basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2557 entries, 0 to 2556 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LST_DATE 2557 non-null object 1 WBANNO 2557 non-null int64 2 CRX_VN 2557 non-null float64 3 LONGITUDE 2557 non-null float64 4 LATITUDE 2557 non-null float64 5 T_DAILY_MAX 2545 non-null float64 6 T_DAILY_MIN 2545 non-null float64 7 T_DAILY_MEAN 2545 non-null float64 8 T_DAILY_AVG 2545 non-null float64 9 P_DAILY_CALC 2547 non-null float64 10 SOLARAD_DAILY 2545 non-null float64 11 SUR_TEMP_DAILY_TYPE 2557 non-null object 12 SUR_TEMP_DAILY_MAX 2545 non-null float64 13 SUR_TEMP_DAILY_MIN 2545 non-null float64 14 SUR_TEMP_DAILY_AVG 2545 non-null float64 15 RH_DAILY_MAX 2545 non-null float64 16 RH_DAILY_MIN 2545 non-null float64 17 RH_DAILY_AVG 2545 non-null float64 18 SOIL_MOISTURE_5_DAILY 2164 non-null float64 19 SOIL_MOISTURE_10_DAILY 2173 non-null float64 20 SOIL_MOISTURE_20_DAILY 2281 non-null float64 21 SOIL_MOISTURE_50_DAILY 2545 non-null float64 22 SOIL_MOISTURE_100_DAILY 2427 non-null float64 23 SOIL_TEMP_5_DAILY 2545 non-null float64 24 SOIL_TEMP_10_DAILY 2545 non-null float64 25 SOIL_TEMP_20_DAILY 2542 non-null float64 26 SOIL_TEMP_50_DAILY 2545 non-null float64 27 SOIL_TEMP_100_DAILY 2544 non-null float64 28 Unnamed: 28 0 non-null float64 dtypes: float64(26), int64(1), object(2) memory usage: 579.4+ KB
## Note that the SUR_TEMP_DAILY_TYPE doesn't have numerical values, we need to remove this column
#del df['SUR_TEMP_DAILY_TYPE']
df = df.drop('SUR_TEMP_DAILY_TYPE', axis = 1)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2557 entries, 0 to 2556 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LST_DATE 2557 non-null object 1 WBANNO 2557 non-null int64 2 CRX_VN 2557 non-null float64 3 LONGITUDE 2557 non-null float64 4 LATITUDE 2557 non-null float64 5 T_DAILY_MAX 2545 non-null float64 6 T_DAILY_MIN 2545 non-null float64 7 T_DAILY_MEAN 2545 non-null float64 8 T_DAILY_AVG 2545 non-null float64 9 P_DAILY_CALC 2547 non-null float64 10 SOLARAD_DAILY 2545 non-null float64 11 SUR_TEMP_DAILY_MAX 2545 non-null float64 12 SUR_TEMP_DAILY_MIN 2545 non-null float64 13 SUR_TEMP_DAILY_AVG 2545 non-null float64 14 RH_DAILY_MAX 2545 non-null float64 15 RH_DAILY_MIN 2545 non-null float64 16 RH_DAILY_AVG 2545 non-null float64 17 SOIL_MOISTURE_5_DAILY 2164 non-null float64 18 SOIL_MOISTURE_10_DAILY 2173 non-null float64 19 SOIL_MOISTURE_20_DAILY 2281 non-null float64 20 SOIL_MOISTURE_50_DAILY 2545 non-null float64 21 SOIL_MOISTURE_100_DAILY 2427 non-null float64 22 SOIL_TEMP_5_DAILY 2545 non-null float64 23 SOIL_TEMP_10_DAILY 2545 non-null float64 24 SOIL_TEMP_20_DAILY 2542 non-null float64 25 SOIL_TEMP_50_DAILY 2545 non-null float64 26 SOIL_TEMP_100_DAILY 2544 non-null float64 27 Unnamed: 28 0 non-null float64 dtypes: float64(26), int64(1), object(1) memory usage: 559.5+ KB
# Return the column names
df.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
# Return the index range (number of rows)
df.index
RangeIndex(start=0, stop=2557, step=1)
# Return the DataFrame shape
df.shape
(2557, 28)
# Return the DataFrame values as a Numpy array
df.values
array([['2016-01-01', 64756, 2.422, ..., 6.0, 7.6, nan], ['2016-01-02', 64756, 2.422, ..., 5.7, 7.4, nan], ['2016-01-03', 64756, 2.422, ..., 5.2, 7.2, nan], ..., ['2022-12-29', 64756, 2.622, ..., 1.9, 3.7, nan], ['2022-12-30', 64756, 2.622, ..., 1.8, 3.6, nan], ['2022-12-31', 64756, 2.622, ..., 1.8, 3.4, nan]], dtype=object)
Indexing¶
df.iloc acts just like the index operator works with arrays.
# Using iloc
df.iloc[5,5]
5.0
In addition to indexing a single value, df.iloc can be used to select multiple rows and columns via slicing: df.iloc[row_start:row_end:row_step, col_start:col_end:col_step].
# Select first three rows, and last 12 columns
df.iloc[:3,12:]
SUR_TEMP_DAILY_MIN | SUR_TEMP_DAILY_AVG | RH_DAILY_MAX | RH_DAILY_MIN | RH_DAILY_AVG | SOIL_MOISTURE_5_DAILY | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -4.2 | 0.7 | 90.5 | 51.7 | 69.1 | 0.256 | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | -7.8 | -1.4 | 77.7 | 44.8 | 57.4 | 0.248 | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | -5.7 | -0.4 | 72.7 | 48.7 | 61.0 | 0.243 | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
# First 5 columns, every 40th row
df.iloc[::40,:5]
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | |
---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 |
40 | 2016-02-10 | 64756 | 2.422 | -73.74 | 41.79 |
80 | 2016-03-21 | 64756 | 2.422 | -73.74 | 41.79 |
120 | 2016-04-30 | 64756 | 2.422 | -73.74 | 41.79 |
160 | 2016-06-09 | 64756 | 2.422 | -73.74 | 41.79 |
... | ... | ... | ... | ... | ... |
2360 | 2022-06-18 | 64756 | 2.622 | -73.74 | 41.79 |
2400 | 2022-07-28 | 64756 | 2.622 | -73.74 | 41.79 |
2440 | 2022-09-06 | 64756 | 2.622 | -73.74 | 41.79 |
2480 | 2022-10-16 | 64756 | 2.622 | -73.74 | 41.79 |
2520 | 2022-11-25 | 64756 | 2.622 | -73.74 | 41.79 |
64 rows × 5 columns
When indexing a single row, df.loc (like df.iloc) transforms the row into a Series, with the column names as the index:
# Index a row
df.iloc[1]
LST_DATE 2016-01-02 WBANNO 64756 CRX_VN 2.422 LONGITUDE -73.74 LATITUDE 41.79 T_DAILY_MAX 2.9 T_DAILY_MIN -3.6 T_DAILY_MEAN -0.4 T_DAILY_AVG -0.3 P_DAILY_CALC 0.0 SOLARAD_DAILY 6.25 SUR_TEMP_DAILY_MAX 8.7 SUR_TEMP_DAILY_MIN -7.8 SUR_TEMP_DAILY_AVG -1.4 RH_DAILY_MAX 77.7 RH_DAILY_MIN 44.8 RH_DAILY_AVG 57.4 SOIL_MOISTURE_5_DAILY 0.248 SOIL_MOISTURE_10_DAILY 0.227 SOIL_MOISTURE_20_DAILY 0.199 SOIL_MOISTURE_50_DAILY 0.152 SOIL_MOISTURE_100_DAILY 0.144 SOIL_TEMP_5_DAILY 2.8 SOIL_TEMP_10_DAILY 3.1 SOIL_TEMP_20_DAILY 4.2 SOIL_TEMP_50_DAILY 5.7 SOIL_TEMP_100_DAILY 7.4 Unnamed: 28 NaN Name: 1, dtype: object
In addition to df.iloc, rows of a DataFrame can be accessed using df.loc, which "locates" rows based on their labels. Unless you have set a custom index (which we will see later), the row "labels" are the same as the integer index.
df.loc[1]
LST_DATE 2016-01-02 WBANNO 64756 CRX_VN 2.422 LONGITUDE -73.74 LATITUDE 41.79 T_DAILY_MAX 2.9 T_DAILY_MIN -3.6 T_DAILY_MEAN -0.4 T_DAILY_AVG -0.3 P_DAILY_CALC 0.0 SOLARAD_DAILY 6.25 SUR_TEMP_DAILY_MAX 8.7 SUR_TEMP_DAILY_MIN -7.8 SUR_TEMP_DAILY_AVG -1.4 RH_DAILY_MAX 77.7 RH_DAILY_MIN 44.8 RH_DAILY_AVG 57.4 SOIL_MOISTURE_5_DAILY 0.248 SOIL_MOISTURE_10_DAILY 0.227 SOIL_MOISTURE_20_DAILY 0.199 SOIL_MOISTURE_50_DAILY 0.152 SOIL_MOISTURE_100_DAILY 0.144 SOIL_TEMP_5_DAILY 2.8 SOIL_TEMP_10_DAILY 3.1 SOIL_TEMP_20_DAILY 4.2 SOIL_TEMP_50_DAILY 5.7 SOIL_TEMP_100_DAILY 7.4 Unnamed: 28 NaN Name: 1, dtype: object
# Select multiple rows by position
df.iloc[100:200]
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 2016-04-10 | 64756 | 2.422 | -73.74 | 41.79 | 9.7 | -3.1 | 3.3 | 3.5 | 0.0 | ... | 0.222 | 0.197 | 0.151 | 0.133 | 6.6 | 6.3 | 6.1 | 6.2 | 6.6 | NaN |
101 | 2016-04-11 | 64756 | 2.422 | -73.74 | 41.79 | 12.7 | 4.3 | 8.5 | 8.1 | 1.4 | ... | 0.218 | 0.194 | 0.150 | 0.134 | 7.3 | 7.1 | 6.8 | 6.5 | 6.6 | NaN |
102 | 2016-04-12 | 64756 | 2.422 | -73.74 | 41.79 | 13.0 | -0.9 | 6.0 | 8.4 | 12.5 | ... | 0.250 | 0.210 | 0.154 | 0.135 | 9.0 | 8.6 | 7.7 | 6.9 | 6.7 | NaN |
103 | 2016-04-13 | 64756 | 2.422 | -73.74 | 41.79 | 12.6 | -3.1 | 4.7 | 5.1 | 0.0 | ... | 0.236 | 0.207 | 0.156 | 0.137 | 8.5 | 8.2 | 7.8 | 7.3 | 6.9 | NaN |
104 | 2016-04-14 | 64756 | 2.422 | -73.74 | 41.79 | 15.1 | -0.2 | 7.5 | 7.3 | 0.0 | ... | 0.224 | 0.199 | 0.153 | 0.142 | 9.3 | 9.0 | 8.3 | 7.5 | 7.1 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
195 | 2016-07-14 | 64756 | 2.422 | -73.74 | 41.79 | 30.4 | 21.6 | 26.0 | 24.5 | 4.2 | ... | 0.070 | 0.053 | 0.099 | NaN | 25.9 | 25.7 | 24.3 | 22.6 | 20.9 | NaN |
196 | 2016-07-15 | 64756 | 2.422 | -73.74 | 41.79 | 31.5 | 18.9 | 25.2 | 25.3 | 0.0 | ... | 0.067 | 0.053 | 0.099 | NaN | 26.7 | 26.3 | 24.5 | 22.7 | 21.0 | NaN |
197 | 2016-07-16 | 64756 | 2.422 | -73.74 | 41.79 | 30.4 | 16.3 | 23.4 | 23.1 | 0.0 | ... | 0.062 | 0.052 | 0.099 | NaN | 26.5 | 26.2 | 24.7 | 23.0 | 21.2 | NaN |
198 | 2016-07-17 | 64756 | 2.422 | -73.74 | 41.79 | 30.7 | 16.7 | 23.7 | 23.5 | 0.0 | ... | 0.057 | 0.050 | 0.097 | NaN | 27.3 | 27.0 | 25.0 | 23.1 | 21.3 | NaN |
199 | 2016-07-18 | 64756 | 2.422 | -73.74 | 41.79 | 33.4 | 15.5 | 24.4 | 22.8 | 4.4 | ... | 0.052 | 0.047 | 0.096 | NaN | 26.2 | 26.1 | 24.9 | 23.3 | 21.5 | NaN |
100 rows × 28 columns
Slicing using df.loc is similar to df.iloc, with the exception that the stop value is inclusive:
df.loc[100:200]
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 2016-04-10 | 64756 | 2.422 | -73.74 | 41.79 | 9.7 | -3.1 | 3.3 | 3.5 | 0.0 | ... | 0.222 | 0.197 | 0.151 | 0.133 | 6.6 | 6.3 | 6.1 | 6.2 | 6.6 | NaN |
101 | 2016-04-11 | 64756 | 2.422 | -73.74 | 41.79 | 12.7 | 4.3 | 8.5 | 8.1 | 1.4 | ... | 0.218 | 0.194 | 0.150 | 0.134 | 7.3 | 7.1 | 6.8 | 6.5 | 6.6 | NaN |
102 | 2016-04-12 | 64756 | 2.422 | -73.74 | 41.79 | 13.0 | -0.9 | 6.0 | 8.4 | 12.5 | ... | 0.250 | 0.210 | 0.154 | 0.135 | 9.0 | 8.6 | 7.7 | 6.9 | 6.7 | NaN |
103 | 2016-04-13 | 64756 | 2.422 | -73.74 | 41.79 | 12.6 | -3.1 | 4.7 | 5.1 | 0.0 | ... | 0.236 | 0.207 | 0.156 | 0.137 | 8.5 | 8.2 | 7.8 | 7.3 | 6.9 | NaN |
104 | 2016-04-14 | 64756 | 2.422 | -73.74 | 41.79 | 15.1 | -0.2 | 7.5 | 7.3 | 0.0 | ... | 0.224 | 0.199 | 0.153 | 0.142 | 9.3 | 9.0 | 8.3 | 7.5 | 7.1 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196 | 2016-07-15 | 64756 | 2.422 | -73.74 | 41.79 | 31.5 | 18.9 | 25.2 | 25.3 | 0.0 | ... | 0.067 | 0.053 | 0.099 | NaN | 26.7 | 26.3 | 24.5 | 22.7 | 21.0 | NaN |
197 | 2016-07-16 | 64756 | 2.422 | -73.74 | 41.79 | 30.4 | 16.3 | 23.4 | 23.1 | 0.0 | ... | 0.062 | 0.052 | 0.099 | NaN | 26.5 | 26.2 | 24.7 | 23.0 | 21.2 | NaN |
198 | 2016-07-17 | 64756 | 2.422 | -73.74 | 41.79 | 30.7 | 16.7 | 23.7 | 23.5 | 0.0 | ... | 0.057 | 0.050 | 0.097 | NaN | 27.3 | 27.0 | 25.0 | 23.1 | 21.3 | NaN |
199 | 2016-07-18 | 64756 | 2.422 | -73.74 | 41.79 | 33.4 | 15.5 | 24.4 | 22.8 | 4.4 | ... | 0.052 | 0.047 | 0.096 | NaN | 26.2 | 26.1 | 24.9 | 23.3 | 21.5 | NaN |
200 | 2016-07-19 | 64756 | 2.422 | -73.74 | 41.79 | 27.1 | 12.0 | 19.6 | 20.3 | 0.0 | ... | 0.051 | 0.046 | 0.095 | NaN | 25.2 | 25.2 | 24.3 | 23.1 | 21.6 | NaN |
101 rows × 28 columns
Indexing columns¶
In addition to integer indexing with df.iloc, columns can be accessed in two ways: dot notation . or square brackets []. The former takes advantage of the fact that the columns are effectively "attributes" of the DataFrame and returns a Series:
df.T_DAILY_MEAN
0 1.5 1 -0.4 2 1.6 3 -6.9 4 -10.3 ... 2552 -4.4 2553 0.7 2554 4.4 2555 10.7 2556 7.9 Name: T_DAILY_MEAN, Length: 2557, dtype: float64
# Indexing a DataFrame column. Returns a Series
df['T_DAILY_MEAN']
0 1.5 1 -0.4 2 1.6 3 -6.9 4 -10.3 ... 2552 -4.4 2553 0.7 2554 4.4 2555 10.7 2556 7.9 Name: T_DAILY_MEAN, Length: 2557, dtype: float64
Using single brackets, the result is a Series. However, using double brackets, it is possible to return the column as a DataFrame:
# Indexing a column as a DataFrame
df[['T_DAILY_MEAN']]
T_DAILY_MEAN | |
---|---|
0 | 1.5 |
1 | -0.4 |
2 | 1.6 |
3 | -6.9 |
4 | -10.3 |
... | ... |
2552 | -4.4 |
2553 | 0.7 |
2554 | 4.4 |
2555 | 10.7 |
2556 | 7.9 |
2557 rows × 1 columns
# Indexing multiple columns
df[['T_DAILY_MAX','T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG']]
T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | |
---|---|---|---|---|
0 | 3.4 | -0.5 | 1.5 | 1.3 |
1 | 2.9 | -3.6 | -0.4 | -0.3 |
2 | 5.1 | -1.8 | 1.6 | 1.1 |
3 | 0.5 | -14.4 | -6.9 | -7.5 |
4 | -5.2 | -15.5 | -10.3 | -11.7 |
... | ... | ... | ... | ... |
2552 | -0.8 | -8.0 | -4.4 | -3.8 |
2553 | 7.4 | -6.1 | 0.7 | 1.3 |
2554 | 10.7 | -1.8 | 4.4 | 5.0 |
2555 | 16.6 | 4.9 | 10.7 | 10.3 |
2556 | 13.2 | 2.7 | 7.9 | 10.2 |
2557 rows × 4 columns
Working with Datetime objects¶
Like the data we are working with in this exercise, many environmental datasets include timed records. The standard datetime library is the primary way of manipulating dates and times in Python, but there are additional third-party packages that provide additional support.
A few worth exploring are dateutil, an extension of the datetime library useful for parsing timestamps, and pytz, which provides a smooth way of tackling time zones.
Though we will not review datetime objects in depth here, it is useful to understand the basics of how to deal with datetime objects in Python as you will no doubt encounter them in the future.
For now, we will focus on a few pandas functions built on the datetime library to handle datetime objects.
The pd.date_range() function allows you to build a DatetimeIndex with a fixed frequency. This can be done by specifying a start date and an end date as follows:
pd.date_range('4/1/2017','4/30/2017')
DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16', '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20', '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24', '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28', '2017-04-29', '2017-04-30'], dtype='datetime64[ns]', freq='D')
# Specify start and end, monthly frequency
pd.date_range('4/1/2017','4/30/2018', freq = 'M')
DatetimeIndex(['2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='M')
# Specify start and end, 5min frequency
# datetime64 is 64-bit integer, which represents an offset from 1970-01-01T00:00:00
pd.date_range('4/1/2017','4/30/2018', freq = '5min')
DatetimeIndex(['2017-04-01 00:00:00', '2017-04-01 00:05:00', '2017-04-01 00:10:00', '2017-04-01 00:15:00', '2017-04-01 00:20:00', '2017-04-01 00:25:00', '2017-04-01 00:30:00', '2017-04-01 00:35:00', '2017-04-01 00:40:00', '2017-04-01 00:45:00', ... '2018-04-29 23:15:00', '2018-04-29 23:20:00', '2018-04-29 23:25:00', '2018-04-29 23:30:00', '2018-04-29 23:35:00', '2018-04-29 23:40:00', '2018-04-29 23:45:00', '2018-04-29 23:50:00', '2018-04-29 23:55:00', '2018-04-30 00:00:00'], dtype='datetime64[ns]', length=113473, freq='5T')
Dealing with existing timestamps¶
df.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
df['LST_DATE']
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: object
While the values certainly resemble datetime objects, they are stored in pandas as "objects," which basically means that pandas doesn't recognize the data type – it doesn't know how to handle them. Using the pd.to_datetime() function, we can convert this column to datetime objects:
pd.to_datetime(df['LST_DATE'])
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: datetime64[ns]
# Set the LST_DATE as datetime object, you can also do so by setting the parse_dates when read in the csv data.
df['LST_DATE'] = pd.to_datetime(df['LST_DATE'])
# Set the Date column as index:
df = df.set_index('LST_DATE')
# Now it's more intuitive to interpret the data:
df['T_DAILY_MEAN']
LST_DATE 2016-01-01 1.5 2016-01-02 -0.4 2016-01-03 1.6 2016-01-04 -6.9 2016-01-05 -10.3 ... 2022-12-27 -4.4 2022-12-28 0.7 2022-12-29 4.4 2022-12-30 10.7 2022-12-31 7.9 Name: T_DAILY_MEAN, Length: 2557, dtype: float64
df['T_DAILY_MEAN'].plot()
<Axes: xlabel='LST_DATE'>
# Reset the index
df = df.reset_index()
# X axis is not labeled with date
df['T_DAILY_MEAN'].plot()
<Axes: >
df = df.set_index('LST_DATE')
Now that we have a DatetimeIndex, we can access specific attributes of the datetime objects like the year, day, hour, etc. To do this, we add the desired time period using dot notation: df.index.attribute. For a full list of attributes, see the pd.DatetimeIndex documentation. For example:
# Get the hour of each record
df.index.hour
Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype='int32', name='LST_DATE', length=2557)
# Get the year of each record, and assign this to a new column
df['year'] = df.index.year
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | 1.69 | ... | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN | 2016 |
2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | 6.25 | ... | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN | 2016 |
2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | 5.69 | ... | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN | 2016 |
2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | 9.17 | ... | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN | 2016 |
2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | 9.34 | ... | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | 4.00 | ... | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN | 2022 |
2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | 7.73 | ... | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN | 2022 |
2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | 6.66 | ... | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN | 2022 |
2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | 5.39 | ... | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN | 2022 |
2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | 1.25 | ... | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN | 2022 |
2557 rows × 28 columns
# Get the unique year values
df.index.year.unique()
Index([2016, 2017, 2018, 2019, 2020, 2021, 2022], dtype='int32', name='LST_DATE')
Plotting Values¶
We can now quickly make plots of the data
df['T_DAILY_MAX'].plot()
<Axes: xlabel='LST_DATE'>
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(14,14))
df['T_DAILY_MAX'].plot(ax=ax[0,0])
df['P_DAILY_CALC'].plot(ax=ax[0,1])
df[['SOIL_MOISTURE_5_DAILY','SOIL_MOISTURE_10_DAILY','SOIL_MOISTURE_20_DAILY']].boxplot(ax=ax[1,0])
ax[1, 0].set_xticklabels(ax[1, 0].get_xticklabels(), rotation=90);
df[['T_DAILY_MAX','T_DAILY_MIN']].plot(ax=ax[1,1])
<Axes: xlabel='LST_DATE'>
Resampling¶
Since pandas understands time, we can use it to do resampling.
# monthly reampler object
rs_obj = df.resample('M')
rs_obj
<pandas.core.resample.DatetimeIndexResampler object at 0x182eb5ac0>
rs_obj.mean()
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-31 | 64756.0 | 2.422 | -73.74 | 41.79 | 3.503226 | -7.193548 | -1.858065 | -1.761290 | 1.106452 | 6.740000 | ... | 0.198903 | 0.146742 | 0.138700 | 0.487097 | 0.674194 | 1.429032 | 2.632258 | 4.593548 | NaN | 2016.0 |
2016-02-29 | 64756.0 | 2.422 | -73.74 | 41.79 | 5.848276 | -6.382759 | -0.262069 | 0.072414 | 4.920690 | 8.379310 | ... | 0.202136 | 0.150586 | 0.139520 | 1.158621 | 1.158621 | 1.344828 | 1.810345 | 2.827586 | NaN | 2016.0 |
2016-03-31 | 64756.0 | 2.422 | -73.74 | 41.79 | 13.122581 | -0.638710 | 6.251613 | 6.393548 | 0.890323 | 14.435484 | ... | 0.186258 | 0.144839 | 0.134581 | 6.529032 | 6.370968 | 5.987097 | 5.593548 | 5.267742 | NaN | 2016.0 |
2016-04-30 | 64756.0 | 2.422 | -73.74 | 41.79 | 14.613333 | 0.840000 | 7.726667 | 8.300000 | 2.033333 | 17.865000 | ... | 0.179200 | 0.142267 | 0.133700 | 10.166667 | 9.926667 | 9.410000 | 8.773333 | 8.076667 | NaN | 2016.0 |
2016-05-31 | 64756.0 | 2.422 | -73.74 | 41.79 | 21.058065 | 8.051613 | 14.577419 | 14.751613 | 2.812903 | 17.959355 | ... | 0.176935 | 0.141516 | 0.135032 | 16.796774 | 16.448387 | 15.416129 | 14.083871 | 12.445161 | NaN | 2016.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-08-31 | 64756.0 | 2.622 | -73.74 | 41.79 | 31.280645 | 16.145161 | 23.709677 | 23.235484 | 0.938710 | 19.360323 | ... | 0.072581 | 0.081000 | 0.113581 | 26.316129 | 26.054839 | 25.116129 | 23.645161 | 22.390323 | NaN | 2022.0 |
2022-09-30 | 64756.0 | 2.622 | -73.74 | 41.79 | 23.266667 | 10.700000 | 16.970000 | 16.823333 | 4.263333 | 14.341333 | ... | 0.200300 | 0.121967 | 0.108500 | 20.686667 | 20.650000 | 20.593333 | 20.896667 | 20.746667 | NaN | 2022.0 |
2022-10-31 | 64756.0 | 2.622 | -73.74 | 41.79 | 17.025806 | 3.867742 | 10.441935 | 10.351613 | 2.580645 | 10.186774 | ... | 0.260903 | 0.161194 | 0.133742 | 13.241935 | 13.177419 | 13.483871 | 14.574194 | 15.383871 | NaN | 2022.0 |
2022-11-30 | 64756.0 | 2.622 | -73.74 | 41.79 | 12.380000 | 0.453333 | 6.420000 | 6.910000 | 2.956667 | 7.440000 | ... | 0.268000 | 0.164967 | 0.152633 | 8.196667 | 8.230000 | 8.882759 | 10.240000 | 11.470000 | NaN | 2022.0 |
2022-12-31 | 64756.0 | 2.622 | -73.74 | 41.79 | 5.051613 | -4.961290 | 0.038710 | 0.467742 | 3.751613 | 5.166452 | ... | 0.279500 | 0.172452 | 0.165065 | 2.070968 | 2.183871 | 2.709677 | 4.341935 | 5.867742 | NaN | 2022.0 |
84 rows × 28 columns
We can chain all of that together
df_mm = df.resample('M').mean()
df_mm[['T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_MAX']].plot()
<Axes: xlabel='LST_DATE'>
Applying Functions¶
In addition to manipulating individual columns, you can apply a function to an entire Series or DataFrame using the pandas function df.apply(). For example, consider our original DataFrame df, which consists of temperature values in °C:
def convert_CtoF(degC):
""" Converts a temperature to from Celsius to Fahrenheit
Parameters
----------
degC : float
Temperature value in °C
Returns
-------
degF : float
Temperature value in °F
"""
degF = (degC *(9/5)) + 32
return degF
df['T_DAILY_MEAN'].apply(convert_CtoF)
LST_DATE 2016-01-01 34.70 2016-01-02 31.28 2016-01-03 34.88 2016-01-04 19.58 2016-01-05 13.46 ... 2022-12-27 24.08 2022-12-28 33.26 2022-12-29 39.92 2022-12-30 51.26 2022-12-31 46.22 Name: T_DAILY_MEAN, Length: 2557, dtype: float64