Python Data Science
Working with Data: Introducing Pandas
Dr. Jonathan Lamb, Faculty Fellow, CRMDA, <jonathanplamb@ku.edu>
Dr. Paul Johnson, Dept. of Political Science, <pauljohn@ku.edu>
Keywords: pandas, data science, dataframe
* Special thanks to Donne Martin, who provided a Notebook with many suggestions that we have adopted.
Python is a general-purpose interactive language intended for flexibility and ease of use (especially congenial to beginning programmers).
Numpy is a scientific calculation library that supplies "more formal structure" (throws away some flexibility, replaces it with rigorous computer-programmer stuff).
Pandas is a data analysis library built on top of NumPy; it supplies the DataFrame and tools for working with it.
Wes McKinney, the lead developer of Pandas, wrote Python for Data Analysis: very readable, plenty of examples.
Function notation: x = funct() creates an object x. For clarity, refer to it as funct(), not just funct.
Method notation: x.mthd() asks x to carry out an instruction .mthd()
Property notation: x.value retrieves a pre-existing element from x. Note: no parentheses.
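Here is each notation in action, a tiny sketch using plain Python objects (no pandas needed):

vals = [3, 1, 2]
vals2 = sorted(vals)    # function notation: sorted() creates a new object
vals.sort()             # method notation: we ask vals to sort itself
z = complex(3, 4)
z.real                  # property notation: no parentheses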
Put these 2 lines at the top of any data-oriented Python project.
import numpy as np
import pandas as pd
After that, the functions from numpy are available as np.funct(), and the functions from pandas are available as pd.funct(). For example, we'll run
* pd.DataFrame(...) to create a data frame
* pd.Series(...) to create a "series" object
The short names pd and np are conventions, not requirements (you could use myPrettyPanda if you want).
The term Data Frame is terminology adapted from R.
It is safe to say that Pandas and NumPy are among the most widely used add-on libraries for Python. All self-respecting Pythonistas should be aware of them.
The big prize is the DataFrame, which is not unlike a "spreadsheet".
A Series is one column from a DataFrame.
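A quick check of that claim, a sketch with a throwaway frame (df_tiny is a hypothetical name):

df_tiny = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
type(df_tiny["x"])   # pandas.core.series.Series: one column is a Series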
I keep text files in Zip containers, to save on storage space. This accesses the data without explicitly extracting the zip contents into a text file.
import os
from zipfile import ZipFile
fn = os.path.join("..", "..", "data", "2017-18_playerBoxScore.csv.zip")
csv_name = "2017-18_playerBoxScore.csv"
plyr = pd.read_csv(ZipFile(fn, "r").open(csv_name), parse_dates=['gmDate'])
plyr.head()
plyr.columns
Question: Do tall players get more rebounds?
%matplotlib inline
plyr['playRPM'] = plyr.playTRB / plyr.playMin
# plyr.playRPM
plyr.plot.scatter(x="playHeight", y="playRPM", alpha = 0.4, figsize=(8,5))
# extract data about the greatest player ever from Davidson
steph = plyr.loc[plyr.playLNm.isin(["Curry"]), ["teamAbbr", "playTRB", "playFGM", "playFG%"]]
steph.plot.scatter(x="playFGM", y="playTRB", c="DarkBlue", figsize = (8,4))
Def: Series. A series is a "container" for "one column" of information.
Can be:
* logical: True or False
* integer
* floating point
* characters
* categorical variable like R "factor"
It is a "variable", one value for each observed "case".
Python shows you parts in different ways, but remember it is a "container" with an array in it.
Many Pandas Series use NumPy arrays as the data containers.
The data storage in NumPy (hence Pandas) is "strictly typed". For example, int64, float64, etc. specify how much memory to use.
We will see dtype, a NumPy abbreviation for "data type".
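For instance, each kind of content gets its own dtype (a small sketch):

print(pd.Series([True, False]).dtype)   # bool
print(pd.Series([1, 2, 3]).dtype)       # int64
print(pd.Series([1.5, 2.5]).dtype)      # float64
print(pd.Series(["a", "b"]).dtype)      # object (characters)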
In "real life", I usually import a DataFrame
from a file. Take columns out of that.
Here, we have some "toy examples" to show how to use Series.
ser_1 = pd.Series([1, 1, 2, -3, -5, 8, 13, 4, 1, 6, 7])
ser_1
Review the first 3 elements
ser_1.head(3)
Note the output includes dtype: int64. A 64 bit integer is used for storage, the Pandas default.
Review the last 6 elements
ser_1.tail(6)
Run dir() to see what is in there. Try not to faint.
dir(ser_1)
The index is an "item name", similar to a name
in an R vector.
Get the index of the Series:
ser_1.index
Create a Series with a custom index using character strings:
ser_2 = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2
ser_2copy = ser_2.copy()
## Caution: must copy, else is a reference. Discuss amongst yourselves
ser_2copy.index = ['x1', 'x2', 'x3', 'x4', 'y']
ser_2copy
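To see why the .copy() matters, here is a minimal sketch with a throwaway Series (ser_tmp and ser_ref are hypothetical names):

ser_tmp = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser_ref = ser_tmp                   # a reference: both names point at one object
ser_ref.index = ['x', 'y', 'z']
ser_tmp                             # ser_tmp's index changed too!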
Pull out a number by a numeric index.
Here's a trick question. What is the value of item 4 in that variable?
ser_2[4]
Same number by its character index value
ser_2['a']
Check that those are actually equal with ==
ser_2[4] == ser_2['e']
Get a set of values from a Series by passing in a list of names:
ser_2[['c', 'a', 'b']]
Stop and try this: confuse yourself by using numbers-as-characters in the element index.
ser_2b = pd.Series([1, 1, 2, -3, -5], index=['5', '4', '3', '2', '1'])
# uncomment, run this:
# ser_2b[4]
# run this
# ser_2b["4"]
Sorry, I was a bit careless here, thinking about the Series as if it were an R vector or a NumPy array.
Although those accesses succeeded, it is best practice in Python to use the accessor methods named .loc[] and .iloc[].
* .loc[] is for accessing by index as-a-character (by label)
* .iloc[] is for accessing by numeric position of the index, starting at 0
It is odd that they use hard brackets, but it is "syntactic sugar" to enhance the user experience.
Why? Many users expect Series to behave exactly like NumPy arrays, but they do not always comply. If you use the iloc and loc methods, then all of the "slice selecting" customs of NumPy will be available.
ser_2
ser_2.iloc[3]
ser_2.loc["d"]
McKinney's Python for Data Analysis, 2ed has an example in which the bracket method fails, observing "For more precise handling, use loc (for labels) or iloc (for integers)" (p. 147).
ser = pd.Series([1, 2, 3, 4, 666])
Without iloc, this effort to choose the last element in the Series fails:
ser[-1]
KeyError Traceback (most recent call last)
<ipython-input-129-44969a759c20> in <module>
----> 1 ser[-1]
... (pandas indexing internals elided) ...
KeyError: -1
ser.iloc[-1]
Select from a Series based on a logical condition, AKA "filter":
ser_2[[True, False, True, False, True]]
Calculate a True/False variable and let it do the filtering
ser_2[ser_2 > 0]
Pandas inherited this idea from NumPy. A range within a NumPy array can be selected by a colon ":" separated pair of index values. This is referred to as a slice.
# reminder
ser_2
Select a slice from a Series (items 1, 2, and 3):
ser_2.iloc[1:4]
Interesting to note that the range includes items 1, 2, and 3, NOT 4
This selects the items 2 through the end
ser_2.iloc[2: ]
Select the items from 2 to the end, leaving off the last element
ser_2.iloc[2: -1]
Select a range from the index, by name
ser_2.loc["b":"d"]
Select a slice from a Series with labels (note the end point is inclusive):
ser_2['a':'b']
Assign to a Series slice (note the end point is inclusive):
ser_2.loc['a':'b'] = 0
ser_2
Series can be added and multiplied in the way you usually expect.
Scalar multiply (2 is a scalar):
ser_2 * 2
NumPy
has functions for logarithms np.log()
, exponentials np.exp()
, trigonometry, etc.
Apply a numpy math function:
ser_2_exp = np.exp(ser_2)
print(ser_2_exp)
# Recall ser_2 is an integer Series
print(ser_2)
ser_2.dtype
Note that dtype int64
changes to float64
automatically when the scalar is a floating point number.
# Now the magic happens
ser_2 * 2.6
Series can be added. An int64
array plus a float64
will generate a float64
ser_2 + ser_2_exp
A Series is like a fixed-length, ordered dictionary. In fact, we can create a series by passing in a dictionary:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = pd.Series(dict_1)
ser_3
Note that Python/Pandas chose integer storage because it noticed that we gave it integers.
On the other hand, if we insert just one floating point number, then the variable changes to float64
:
pd.Series({'foo' : 100.1, 'bar' : 200, 'baz' : 300})
Absolute values:
np.abs(ser_2) # absolute value!
Watch what happens if we include a "None" value or a NumPy "NaN" in a Python list that we use to create a Pandas Series:
pd.Series([100, 200, 300, None, 33, np.NaN])
Pandas adopted the NumPy convention which allows "missing values" to be entered and stored as NaN ("not a number").
HOWEVER, NumPy defined this feature only for floating point numbers.
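We can confirm the consequence: the list above contained only integers (plus missing markers), yet the resulting Series is float64, because the integer dtype cannot hold NaN:

pd.Series([100, 200, 300, None, 33, np.NaN]).dtype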
The same thing happens if we have a value of None
in a dict
initializer:
pd.Series({'foo' : 100, 'bar' : 200, 'baz' : 300, 'happy': None})
If we make a mistake and include an extra element in the list we propose as the Series index, watch what happens:
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = pd.Series({'foo' : 100, 'bar' : 200, 'baz' : 300}, index=index)
ser_4
In "qux" is a 4th name, but there are only 3 elements. Rather than rejecting the name, Pandas fills in "NaN":
missing values are variously referred to as None
Null
or NaN
in Python discussion.
Reflecting this terminological uncertainty, there are two equivalent methods to check if elements are missing, .isnull() and .isna().
# .isna()
ser_4.isna()
# .isnull()
pd.isnull(ser_4)
Interestingly, in a character variable, the value of None
is preserved as a missing indicator:
ser_4b = pd.Series(["fred", "barney", None])
print(ser_4b)
ser_4b.isnull()
However, NumPy allows NaN only for floating point numbers.
Integers are promoted to floats if missing values exist.
That was true until Pandas 0.24! This is still somewhat 'off the beaten path'. One must explicitly ask for the data type Int64.
Observe:
pd.Series([1, 2, 3, None, 44], dtype="Int64")
For the first time, Pandas 0.24 allows missing values "NaN" in floats and in the special Int64 data type.
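To see the contrast directly, compare the dtypes with and without the explicit request:

print(pd.Series([1, 2, 3, None, 44]).dtype)                 # float64 (promoted)
print(pd.Series([1, 2, 3, None, 44], dtype="Int64").dtype)  # Int64 (nullable integer)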
Some functions (especially stats and plotting functions) require the actual NumPy array, not the pd.Series that contains it.
To extract the "actual data", in Pandas 0.24 we have options (syntax changed, documentation warns many web pages are outdated). I don't honestly know if one way is better than another.
ser_1.array
Note that's an attribute
, but there is also a method
, .to_numpy()
.
ser_1.to_numpy()
You'll also see people retrieving the same thing with the NumPy function np.asarray():
np.asarray(ser_1)
A Series has a name attribute, which is empty by default:
ser_4.name
But if we assign a name
, it will decorate our output:
ser_4.name = 'foobarbazqux'
ser_4
And we can also name a Series's index, which appears as a label over the row names:
ser_4.index.name = 'kulabel'
ser_4
Create a DataFrame from a dict of lists; the dict keys become the column names:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
We can choose (and order) a subset of the columns at creation time:
df_2 = pd.DataFrame(data_1, columns=['year', 'state'])
df_2
We can add a new Series (column) by name using input from a Python list:
df_1['unempla'] = [4.1, 3.9, 6.2, 5.0, 6.0]
df_1
Similarly, assign a pre-existing Series to a column (note the index is used to align a partial column):
unemplb = pd.Series([6.0, 6.0, 6.1], index=[2, 3, 4])
print(unemplb)
df_1['unemplb'] = unemplb
df_1
Create a new column by copying an old one, possibly with calculation:
df_1['unemplasquared'] = df_1['unempla']**2 # ** is exponent
df_1
It is difficult to think of a realistic case in which you might need to do this, but you can: create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = pd.DataFrame(pop)
df_4
Many will quake in fear, as if meeting the Wizard.
dir(df_4)
Like Series, None
and np.NaN
values in Python input appear as NaN
:
data_2 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
'year' : [2012, 2013, 2014, 2014, 2015],
'pop' : [5.0, 5.1, 5.2, 4.0, None]}
# shuffle columns for fun
df_2 = pd.DataFrame(data_2, columns=['year', 'state', 'pop'])
df_2
Retrieve a column by name, returning a Series:
df_2['state']
Retrieve a column by attribute, returning a Series:
df_2.state
Choose two columns by name to create a new, smaller data frame
df_2[["state", "year"]]
Use the iloc (index location) method to retrieve a row by its numeric position:
df_2.iloc[0, ]
To show selection by name, we need more interesting row index values:
df_2.index = ["a", "e", "i", "o", "u"]
df_2
df_2.loc["o"]
df_2.loc[ ["a", "u"] ]
Choose rows and columns by name at same time:
df_2.loc[ ["a", "u"], ["year", "pop"] ]
Select from a DataFrame based on a logical condition filter:
df_2[df_2['pop'] > 4.5]
It is recommended to select columns by name (avoids mistakes), but it is allowed to select columns by numbers.
If only one index is specified, iloc assumes we want rows.
# asks for row 1 only
df_2.iloc[1:2]
To ask for a single column, we specify 2 pieces, rows and columns; ":" means all rows. Here 1:2 selects the second column (position 1):
df_2.iloc[: , 1:2]
We must specify the row selection and column selection in a consistent way.
Observe the error we get from
> df_2.loc[0:2, 'pop']
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-116-18676538f98d> in <module>
----> 1 df_2.loc[0:2, 'pop']
... (pandas indexing internals elided) ...
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>
Select a slice of rows from a specific column of a DataFrame:
df_2.loc["e":"o", 'pop']
Select from a DataFrame based on a filter:
df_2[df_2['pop'] > 5]
Delete one column:
# Insert a goof, on purpose!
df_2['goof'] = 5
df_2
del df_2['goof']
df_2
Try This: run the following, then inspect df. See what you have. Inspect column names, find the index.
ncol = 33
df = pd.DataFrame(np.random.randn(50, ncol))
# list comprehension to assign column names
df.columns = ["x" + str(i) for i in range(1, ncol + 1)]
Until Pandas 0.24, the recommended way was the .values attribute.
Now, they suggest the .to_numpy() method to take out the NumPy array.
Danger: a NumPy array is a homogeneous collection of values (all are floats, integers, or characters).
df_2.to_numpy()
If the columns are different data types, the 2D ndarray's dtype will accommodate all of the values.
The whole thing is promoted to an array of Python objects (here, effectively characters).
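A quick check of that promotion:

df_2.to_numpy().dtype   # dtype('O'), i.e., generic Python objects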
The old .values attribute still exists, but is deprecated (will vanish):
df_2.values
Check the dimensions (rows, columns):
df_1.shape
Review the dtypes
of all columns
df_1.dtypes
We can transpose a DataFrame, which means that we can make the rows columns and columns rows.
df_2.T
We will assign a name to the row index and column names.
# Check what we have first
df_2.index.name
df_2.columns.name
# See how it prints without the names
df_2
Now we assign names to the index and columns
df_2.index.name = 'My_index_is_named_Jim'
df_2
Set the DataFrame columns name:
df_2.columns.name = 'happy_new_column_name_is_too_long'
df_2
Insert more reasonable names before proceeding
df_2.columns.name = "df_2"
df_2.index.name = "index"
df_2
Sometimes Pandas procedures will delete rows or add new ones and the row names or column names become inconsistent/unhelpful.
The DataFrame offers methods like .reset_index() and .reindex() as efficient ways to adjust the indices.
The .reindex() method can be used in 2 ways, either with parameters
* labels and axis (0 for rows, 1 for columns)
or
* index = new row names and columns = new column names
Question: Why bother with this?
Answers:
1. I really don't know, and
2. potential efficiency. If the data were humongous, it would be slow to rewrite the whole data frame. .reindex() has an argument copy=False that will prevent the re-mapping of memory, as long as the index values remain the same.
# Here's the dataframe as it currently stands.
df_2
To erase the existing index and insert numbers for the rows, use .reset_index(). Note this keeps the old row names as a variable; use the drop=True parameter to prevent that. .reset_index() also accepts an "inplace" argument.
df_2.reset_index()
Reindexing rows returns a new frame with the rows in a particular order:
kk2 = df_2.reindex(["a", "e", "u", "i", "o"], fill_value=0)
kk2
If the new index names do not match old names, then rows of missing scores will be "filled in"
df_2.reindex(["w", "a", "e", "y", "u", "i", "o"])
If you don't want NaN
, but would rather have something else, specify fill_value
df_2.reindex(["w", "a", "e", "y" "u", "i", "o"], fill_value=0)
Columns can be reindexed, too:
df_2.reindex(columns=['state', 'pop', 'unempl', 'year'])
Here we specify a non-consecutive index to demonstrate an interesting feature in .reindex(). It will interpolate data, "filling in" the missing values in the index:
ser_5 = pd.Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
ser_5.reindex(range(5), method='ffill')
ser_5.reindex(range(5), method='bfill')
# Can you tell the difference between 'ffill' and 'bfill'?
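# 'ffill' fills forward: index 1 copies 'foo' from index 0, index 3 copies 'bar'.
# 'bfill' fills backward: index 1 copies 'bar' from index 2, index 3 copies 'baz'.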
Common R user mistake
Pandas will not add columns from 2 different data frames unless the index values match.
Adding series in Pandas behaves more like a horizontal "merge".
First let's create two Series of random numbers, with a different set of indices.
np.random.seed(0)
ser_6 = pd.Series(np.random.randn(5),
index=['a', 'b', 'c', 'd', 'e'])
ser_6
np.random.seed(1)
ser_7 = pd.Series(np.random.randn(5),
index=['a', 'c', 'e', 'f', 'g'])
ser_7
Now, let's add the series together.
ser_6 + ser_7
# So what has the math done here?
We can use the .add()
method and set a fill value instead of NaN for indices that do not overlap:
ser_6.add(ser_7, fill_value=0)
Suppose you wanted to add ser_6 and ser_7, one-for-one, ignoring the indices? It appears some gymnastics are needed.
ser_6.reset_index(drop=True, inplace=True)
ser_6
ser_7.reset_index(drop=True, inplace=True)
ser_7
# ser_6.add(ser_7) or
ser_6 + ser_7
Likewise, adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:
np.random.seed(0)
df_8 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
columns=['a', 'b', 'c'])
df_8
np.random.seed(1)
df_9 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
columns=['b', 'c', 'd'])
df_9
df_8 + df_9
As before, we can set a fill value instead of NaN for indices that do not overlap:
df_10 = df_8.add(df_9, fill_value=0)
df_10
Pandas supports arithmetic operations between DataFrames and Series. Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
ser_8 = df_10.iloc[0]
df_11 = df_10 - ser_8
df_11
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:
ser_9 = pd.Series(range(3), index=['a', 'd', 'e'])
ser_9
df_11 - ser_9
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
df_10
ser_10 = pd.Series([100, 200, 300])
ser_10
df_10.sub(ser_10, axis=0)
The axis
argument above is pandas' way of saying to index by row (axis 0). As you can probably guess, axis 1 refers to columns.
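For contrast, a sketch of the default direction: with axis=1, the Series index is matched against the DataFrame's column names (ser_cols is a hypothetical name):

ser_cols = pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])
df_10.sub(ser_cols, axis=1)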
ser_4
Sort a Series by its index:
ser_4.sort_index()
df_12 = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['three', 'one', 'two'],
columns=['c', 'a', 'b', 'd'])
df_12
We can sort a DataFrame by its index:
df_12.sort_index()
Sort a DataFrame by columns in descending order:
df_12.sort_index(axis=1, ascending=False)
Sort a DataFrame's values by column:
df_12.sort_values(by=['d', 'c'])
DataFrames can rank over rows or columns. This is easier to illustrate than it is to explain, so see below!
df_13 = pd.DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
Rank a DataFrame over rows:
# don't run this
# df_13.rank()
Rank a DataFrame over columns:
# df_13.rank(axis=1)
Labels do not have to be unique in Pandas:
ser_12 = pd.Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12
ser_12.index.is_unique
Select Series elements:
ser_12['foo']
Select DataFrame elements:
df_14 = pd.DataFrame(np.random.randn(5, 4),
index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
df_14.loc['bar']
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.
df_2
## Throws away categorical & character variables!
df_2.describe()
Sum down each column (the default):
df_2.sum()
Sum over the rows:
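# axis=1 sums across each row instead of down each column
df_2.sum(axis=1)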
This notebook can serve as a resource for you as you work on Pandas.
Pandas elements that we did not mention
19 Essential Snippets in Pandas
A lovely post about Pandas Ufuncs
Next up: something more fun!