
Python Data Science

Working with Data: Introducing Pandas

Dr. Jonathan Lamb, Faculty Fellow, CRMDA, <jonathanplamb@ku.edu>
Dr. Paul Johnson, Dept. of Political Science, <pauljohn@ku.edu>

Keywords: pandas, data science, dataframe

* Special thanks to Donne Martin, who provided a Notebook with many suggestions that we have adopted.

Standing on the Shoulders of Giants

Python -> Numpy -> Pandas

Python is a general-purpose interactive language designed for flexibility and ease of use (especially congenial to beginning programmers).

  • It does not deliver a high-performance numerical computing framework "out of the box".

Numpy is a scientific calculation library that supplies "more formal structure" (it trades away some flexibility for rigorous computer-programming constructs).

  • vectors
  • matrices, arrays
  • data types, etc

Pandas

  1. builds on Numpy. Somewhere "under there" in Pandas we find Numpy
  2. adds terminology, objects, and functions for data analysis
  3. adopts many terms and concepts for a not-entirely-different-from-R user experience

How to learn more

Python for Data Analysis, by Wes McKinney, the lead developer of Pandas. Very readable, plenty of examples.

Functions, Methods, Properties

Function notation: x = funct() creates an object x. For clarity, refer to it as funct(), not just funct.

Method notation: x.mthd() asks x to carry out an instruction .mthd()

Property notation: x.value retrieves a pre-existing element from x. Note: No Parentheses
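
A minimal sketch of all three notations at once (the names here are placeholders):

import pandas as pd

ser = pd.Series([3, 1, 2])   # function notation: pd.Series() creates an object
ser.sort_values()            # method notation: ask ser to carry out an instruction
ser.index                    # property notation: retrieve a pre-existing element, no parentheses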

How to get started

Put these 2 lines at the top of any data-oriented Python project.

In [1]:
import numpy as np
import pandas as pd

After that, the functions from

  • numpy are available as np.funct()
  • pandas are available as pd.funct()

For example, we'll run

  • pd.DataFrame(...) to create a data frame
  • pd.Series(...) to create a "series" object

The short names pd and np are conventions, not requirements.

  • (You could replace with myPrettyPanda if you want)
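
For instance, this works, although nobody would recommend it:

import pandas as myPrettyPanda
myPrettyPanda.Series([1, 2, 3])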

Pandas Highlights: What Do We Get?

  1. Data Importers! Import tabular data from Excel, CSV, and other formats.
  2. A Data Frame, terminology adapted from R.
  3. A wealth of data selection and transformation tools.
  4. Basic tools for creating tables and graphs.
  5. Categorical variables (version of R factors)
  6. Excellent comprehensive online documentation: https://pandas.pydata.org/pandas-docs/stable

It is safe to say that Pandas and NumPy are among the most widely used add-on libraries for Python. All self-respecting Pythonistas should be aware of them.

Pandas Data Structure Overview

The big prize is the DataFrame, not so different from a "spreadsheet".

A Series is one column from a DataFrame

Little example to whet the appetite

I keep text files in Zip containers to save on storage space. This accesses the data without explicitly extracting the zip contents into a text file.

In [2]:
import os
from zipfile import ZipFile
fn = os.path.join("..", "..", "data", "2017-18_playerBoxScore.csv.zip")
csv_name = "2017-18_playerBoxScore.csv"
plyr = pd.read_csv(ZipFile(fn, "r").open(csv_name), parse_dates=['gmDate'])
plyr.head()
Out[2]:
gmDate gmTime seasTyp playLNm playFNm teamAbbr teamConf teamDiv teamLoc teamRslt ... playFT% playORB playDRB playTRB opptAbbr opptConf opptDiv opptLoc opptRslt opptDayOff
0 2017-10-17 08:00 Regular Brown Jaylen BOS East Atlantic Away Loss ... 0.5000 1 5 6 CLE East Central Home Win 0
1 2017-10-17 08:00 Regular Irving Kyrie BOS East Atlantic Away Loss ... 1.0000 2 2 4 CLE East Central Home Win 0
2 2017-10-17 08:00 Regular Tatum Jayson BOS East Atlantic Away Loss ... 1.0000 4 6 10 CLE East Central Home Win 0
3 2017-10-17 08:00 Regular Horford Al BOS East Atlantic Away Loss ... 0.7143 0 7 7 CLE East Central Home Win 0
4 2017-10-17 08:00 Regular Hayward Gordon BOS East Atlantic Away Loss ... 0.0000 0 1 1 CLE East Central Home Win 0

5 rows × 51 columns

In [3]:
plyr.columns
Out[3]:
Index(['gmDate', 'gmTime', 'seasTyp', 'playLNm', 'playFNm', 'teamAbbr',
       'teamConf', 'teamDiv', 'teamLoc', 'teamRslt', 'teamDayOff', 'offLNm1',
       'offFNm1', 'offLNm2', 'offFNm2', 'offLNm3', 'offFNm3', 'playDispNm',
       'playStat', 'playMin', 'playPos', 'playHeight', 'playWeight',
       'playBDate', 'playPTS', 'playAST', 'playTO', 'playSTL', 'playBLK',
       'playPF', 'playFGA', 'playFGM', 'playFG%', 'play2PA', 'play2PM',
       'play2P%', 'play3PA', 'play3PM', 'play3P%', 'playFTA', 'playFTM',
       'playFT%', 'playORB', 'playDRB', 'playTRB', 'opptAbbr', 'opptConf',
       'opptDiv', 'opptLoc', 'opptRslt', 'opptDayOff'],
      dtype='object')

Question: Do tall players get more rebounds?

In [4]:
%matplotlib inline
plyr['playRPM'] = plyr.playTRB / plyr.playMin
# plyr.playRPM
plyr.plot.scatter(x="playHeight", y="playRPM", alpha = 0.4, figsize=(8,5))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e457d2e80>
In [5]:
# extract data about the greatest player ever from Davidson
steph = plyr.loc[plyr.playLNm.isin(["Curry"]), ["teamAbbr", "playTRB", "playFGM", "playFG%"]]
steph.plot.scatter(x="playFGM", y="playTRB", c="DarkBlue", figsize = (8,4))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8e455144e0>

Series

Def: A Series is a "container" for "one column" of information.

Its contents can be:

* logical: True or False
* integer
* floating point
* characters
* categorical variable like R "factor"

It is a "variable", one value for each observed "case".

Python shows you parts in different ways, but remember it is a "container" with an array in it.

Many Pandas Series use NumPy arrays as the data containers.

The data storage in NumPy (hence Pandas) is "strictly typed": int64, float64, and so on. The type specifies how much memory each value uses.

We will see dtype, the NumPy abbreviation for data type.
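
A quick sketch of the strict typing in action (each literal list picks its own dtype):

pd.Series([True, False]).dtype    # dtype('bool')
pd.Series([1, 2, 3]).dtype        # dtype('int64')
pd.Series([1.5, 2.0]).dtype       # dtype('float64')
pd.Series(["a", "b"]).dtype       # dtype('O'): generic Python objects (here, strings)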

1. Create

In "real life", I usually import a DataFrame from a file. Take columns out of that.

Here, we have some "toy examples" to show how to use Series.

Create a Pandas Series from an ordinary Python array:

In [6]:
ser_1 = pd.Series([1, 1, 2, -3, -5, 8, 13, 4, 1, 6, 7])
ser_1
Out[6]:
0      1
1      1
2      2
3     -3
4     -5
5      8
6     13
7      4
8      1
9      6
10     7
dtype: int64

Review the first 3 elements

In [7]:
ser_1.head(3)
Out[7]:
0    1
1    1
2    2
dtype: int64

Note the output includes dtype: int64. A 64 bit integer is used for storage, the Pandas default.

Review the last 6 elements

In [8]:
ser_1.tail(6)
Out[8]:
5      8
6     13
7      4
8      1
9      6
10     7
dtype: int64

2. Inspect the Series Object

Run dir() to see what is in there. Try not to faint

In [9]:
dir(ser_1)
Out[9]:
['T',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_prepare__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__long__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__unicode__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_add_numeric_operations',
 '_add_series_only_operations',
 '_add_series_or_dataframe_operations',
 '_agg_by_level',
 '_agg_examples_doc',
 '_agg_see_also_doc',
 '_aggregate',
 '_aggregate_multiple_funcs',
 '_align_frame',
 '_align_series',
 '_binop',
 '_box_item_values',
 '_builtin_table',
 '_can_hold_na',
 '_check_inplace_setting',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_percentile',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_axes_dict_for_slice',
 '_construct_axes_dict_from',
 '_construct_axes_from_arguments',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_sliced',
 '_convert',
 '_create_indexer',
 '_cython_table',
 '_deprecations',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_expand_axes',
 '_find_valid_index',
 '_formatting_values',
 '_from_axes',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_index_resolvers',
 '_get_item_cache',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_value',
 '_get_values',
 '_get_values_tuple',
 '_get_with',
 '_gotitem',
 '_iget_item_cache',
 '_index',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_init_dict',
 '_init_mgr',
 '_internal_names',
 '_internal_names_set',
 '_is_builtin_func',
 '_is_cached',
 '_is_copy',
 '_is_cython_func',
 '_is_datelike_mixed_type',
 '_is_homogeneous_type',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_numeric_mixed_type',
 '_is_view',
 '_ix',
 '_ixs',
 '_map_values',
 '_maybe_cache_changed',
 '_maybe_update_cacher',
 '_metadata',
 '_ndarray_values',
 '_needs_reindex_multi',
 '_obj_with_exclusions',
 '_protect_consolidate',
 '_reduce',
 '_reindex_axes',
 '_reindex_indexer',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_repr_data_resource_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_selected_obj',
 '_selection',
 '_selection_list',
 '_selection_name',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_is_copy',
 '_set_item',
 '_set_labels',
 '_set_name',
 '_set_subtyp',
 '_set_value',
 '_set_values',
 '_set_with',
 '_set_with_engine',
 '_setup_axes',
 '_shallow_copy',
 '_slice',
 '_stat_axis',
 '_stat_axis_name',
 '_stat_axis_number',
 '_take',
 '_to_dict_of_blocks',
 '_try_aggregate_string_function',
 '_typ',
 '_unpickle_series_compat',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 '_xs',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'argmax',
 'argmin',
 'argsort',
 'array',
 'as_matrix',
 'asfreq',
 'asof',
 'astype',
 'at',
 'at_time',
 'autocorr',
 'axes',
 'base',
 'between',
 'between_time',
 'bfill',
 'bool',
 'clip',
 'clip_lower',
 'clip_upper',
 'combine',
 'combine_first',
 'compound',
 'compress',
 'copy',
 'corr',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'data',
 'describe',
 'diff',
 'div',
 'divide',
 'divmod',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtype',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'ewm',
 'expanding',
 'factorize',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'flags',
 'floordiv',
 'from_array',
 'ftype',
 'ftypes',
 'ge',
 'get',
 'get_dtype_counts',
 'get_ftype_counts',
 'get_values',
 'groupby',
 'gt',
 'hasnans',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iloc',
 'imag',
 'index',
 'infer_objects',
 'interpolate',
 'is_monotonic',
 'is_monotonic_decreasing',
 'is_monotonic_increasing',
 'is_unique',
 'isin',
 'isna',
 'isnull',
 'item',
 'items',
 'itemsize',
 'iteritems',
 'ix',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'loc',
 'lt',
 'mad',
 'map',
 'mask',
 'max',
 'mean',
 'median',
 'memory_usage',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'name',
 'nbytes',
 'ndim',
 'ne',
 'nlargest',
 'nonzero',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'ptp',
 'put',
 'quantile',
 'radd',
 'rank',
 'ravel',
 'rdiv',
 'rdivmod',
 'real',
 'reindex',
 'reindex_axis',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'repeat',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'searchsorted',
 'select',
 'sem',
 'set_axis',
 'shape',
 'shift',
 'size',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'std',
 'strides',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'timetuple',
 'to_clipboard',
 'to_csv',
 'to_dense',
 'to_dict',
 'to_excel',
 'to_frame',
 'to_hdf',
 'to_json',
 'to_latex',
 'to_list',
 'to_msgpack',
 'to_numpy',
 'to_period',
 'to_pickle',
 'to_sparse',
 'to_sql',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unique',
 'unstack',
 'update',
 'value_counts',
 'values',
 'var',
 'view',
 'where',
 'xs']

The index is an "item name", similar to a name in an R vector.

Get the index of the Series:

In [10]:
ser_1.index
Out[10]:
RangeIndex(start=0, stop=11, step=1)

Create a Series with a custom index using character strings:

In [11]:
ser_2 = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2
Out[11]:
a    1
b    1
c    2
d   -3
e   -5
dtype: int64
We can replace the Series's index in place:
In [12]:
ser_2copy = ser_2.copy()
## Caution: must copy, else is a reference. Discuss amongst yourselves
ser_2copy.index = ['x1', 'x2', 'x3', 'x4', 'y']
ser_2copy
Out[12]:
x1    1
x2    1
x3    2
x4   -3
y    -5
dtype: int64
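
Why the copy? Plain assignment binds a second name to the same object, so changing the index through either name changes the one shared Series. A small sketch of the trap, using a throwaway series:

ser_x = pd.Series([10, 20], index=['a', 'b'])
ser_ref = ser_x                 # a reference, NOT a copy
ser_ref.index = ['p', 'q']      # mutates the one shared object
ser_x.index                     # ser_x's index has changed too!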

3. Retrieve values from the series

Pull out a number by a numeric index.

Here's a trick question. What is the value of item 4 in that variable? (Because the index labels are characters, the integer 4 falls back to positional access, so we get the fifth element.)
In [13]:
ser_2[4]
Out[13]:
-5

Now pull out a number by its character index value:

In [14]:
ser_2['a']
Out[14]:
1

Check that those are actually equal with ==

In [15]:
ser_2[4] == ser_2['e']
Out[15]:
True

Get a set of values from a Series by passing in a list of names:

In [16]:
ser_2[['c', 'a', 'b']]
Out[16]:
c    2
a    1
b    1
dtype: int64

Stop and try this: confuse yourself by using numbers-as-characters in the element index.

In [17]:
ser_2b = pd.Series([1, 1, 2, -3, -5], index=['5', '4', '3', '2', '1'])
In [18]:
# uncomment, run this:
# ser_2b[4]
In [19]:
# run this
# ser_2b["4"]

Minor Apology

Sorry, I was a bit careless here, thinking about the Series as if it were an R vector or a NumPy array.

Although those accesses succeeded, it is best practice in Python to use the accessor methods named .loc[] and .iloc[].

  • .loc[] is for accessing by index as-a-character
  • .iloc[] is for accessing by numeric position of index, starting at 0

It is odd that they use square brackets on methods, but it is "syntactic sugar" to enhance the user experience.

Why? Many users expect Series to behave exactly like NumPy arrays, but they do not always comply. If you use the .iloc and .loc accessors, all of the "slice selecting" customs of NumPy become available.

In [20]:
ser_2
Out[20]:
a    1
b    1
c    2
d   -3
e   -5
dtype: int64
In [21]:
ser_2.iloc[3]
Out[21]:
-3
In [22]:
ser_2.loc["d"]
Out[22]:
-3

McKinney's Python for Data Analysis, 2ed has an example in which the bracket method fails, observing "For more precise handling, use loc (for labels) or iloc (for integers)" (p. 147).

In [23]:
ser = pd.Series([1, 2, 3, 4, 666])

Without iloc, this effort to choose the last element in the Series fails:

ser[-1]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-129-44969a759c20> in <module>
----> 1 ser[-1]

~/.local/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
    866         key = com.apply_if_callable(key, self)
    867         try:
--> 868             result = self.index.get_value(self, key)
    869
    870         if not is_scalar(result):

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
   4373         try:
   4374             return self._engine.get_value(s, k,
-> 4375                                           tz=getattr(series.dtype, 'tz', None))
   4376         except KeyError as e1:
   4377             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: -1

In [24]:
ser.iloc[-1]
Out[24]:
666

Filtering (select by logical vector)

Select from a Series based on a logical condition, AKA "filter":

In [25]:
ser_2[[True, False, True, False, True]]
Out[25]:
a    1
c    2
e   -5
dtype: int64

Calculate a True/False variable and let it do the filtering

In [26]:
ser_2[ser_2 > 0]
Out[26]:
a    1
b    1
c    2
dtype: int64
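
Conditions can be combined with & (and) and | (or); wrap each comparison in parentheses, because of operator precedence. A sketch:

ser_2[(ser_2 > 0) & (ser_2 < 2)]    # the elements strictly between 0 and 2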

Slicing with the colon

Pandas inherited this idea from NumPy. A range within a NumPy array can be selected with a colon-separated (":") pair of index values. This is referred to as a slice.

In [27]:
# reminder
ser_2
Out[27]:
a    1
b    1
c    2
d   -3
e   -5
dtype: int64

Select a slice from a Series (items 1, 2, and 3):

In [28]:
ser_2.iloc[1:4]
Out[28]:
b    1
c    2
d   -3
dtype: int64

Note that the range includes items 1, 2, and 3, but NOT 4: the right endpoint is excluded.

This selects the items from position 2 through the end

In [29]:
ser_2.iloc[2: ]
Out[29]:
c    2
d   -3
e   -5
dtype: int64

Select the items from position 2 up to (but not including) the last element

In [30]:
ser_2.iloc[2: -1]
Out[30]:
c    2
d   -3
dtype: int64

Select a range from the index, by name

In [31]:
ser_2.loc["b":"d"]
Out[31]:
b    1
c    2
d   -3
dtype: int64

Select a slice from a Series with labels (note the end point is inclusive):

In [32]:
ser_2['a':'b']
Out[32]:
a    1
b    1
dtype: int64

Assign to a Series slice (note the end point is inclusive):

In [33]:
ser_2.loc['a':'b'] = 0
ser_2
Out[33]:
a    0
b    0
c    2
d   -3
e   -5
dtype: int64

4. Math with a Series: easy!

Series can be added and multiplied in the way you usually expect.

Scalar multiply (2 is a scalar):

In [34]:
ser_2 * 2
Out[34]:
a     0
b     0
c     4
d    -6
e   -10
dtype: int64

NumPy has functions for logarithms np.log(), exponentials np.exp(), trigonometry, etc.

Apply a numpy math function:

In [35]:
ser_2_exp = np.exp(ser_2)
print(ser_2_exp)
a    1.000000
b    1.000000
c    7.389056
d    0.049787
e    0.006738
dtype: float64

Some Python "dynamic-typing magic" remains

In [36]:
# Recall ser_2 is an integer Series
print(ser_2)
a    0
b    0
c    2
d   -3
e   -5
dtype: int64
In [37]:
ser_2.dtype
Out[37]:
dtype('int64')

Note that dtype int64 changes to float64 automatically when the scalar is a floating point number.

In [38]:
# Now the magic happens
ser_2 * 2.6
Out[38]:
a     0.0
b     0.0
c     5.2
d    -7.8
e   -13.0
dtype: float64

Series can be added. An int64 array plus a float64 will generate a float64

In [39]:
ser_2 + ser_2_exp
Out[39]:
a    1.000000
b    1.000000
c    9.389056
d   -2.950213
e   -4.993262
dtype: float64

A Series is like a fixed-length, ordered dictionary. In fact, we can create a series by passing in a dictionary:

In [40]:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = pd.Series(dict_1)
ser_3
Out[40]:
foo    100
bar    200
baz    300
dtype: int64

Note that Pandas chose integer storage because we supplied nothing but integers.

On the other hand, if we insert just one floating point number, then the variable changes to float64:

In [41]:
pd.Series({'foo' : 100.1, 'bar' : 200, 'baz' : 300})
Out[41]:
foo    100.1
bar    200.0
baz    300.0
dtype: float64

Absolute values:

In [42]:
np.abs(ser_2) # absolute value!
Out[42]:
a    0
b    0
c    2
d    3
e    5
dtype: int64

5. Missing Values

Watch what happens if we include a "None" value or a NumPy "NaN" in a Python list that we use to create a Pandas Series:

In [43]:
pd.Series([100, 200, 300, None, 33, np.NaN])
Out[43]:
0    100.0
1    200.0
2    300.0
3      NaN
4     33.0
5      NaN
dtype: float64

Pandas adopted the NumPy approach, which allows "missing values" to be entered and stored as NaN (not a number).

HOWEVER, NumPy defines this feature only for floating point numbers.

The same thing happens if we have a value of None in a dict initializer:

In [44]:
pd.Series({'foo' : 100, 'bar' : 200, 'baz' : 300, 'happy': None})
Out[44]:
foo      100.0
bar      200.0
baz      300.0
happy      NaN
dtype: float64

If we make a mistake and include an extra element in the list we propose as the Series index, watch what happens:

In [45]:
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = pd.Series({'foo' : 100, 'bar' : 200, 'baz' : 300}, index=index)
ser_4
Out[45]:
foo    100.0
bar    200.0
baz    300.0
qux      NaN
dtype: float64

In "qux" is a 4th name, but there are only 3 elements. Rather than rejecting the name, Pandas fills in "NaN":

Terminology:

Missing values are variously referred to as None, Null, or NaN in Python discussions.

Reflecting this terminological uncertainty, there are two equivalent methods to check whether elements are missing: .isnull() and .isna()

In [46]:
# .isna()
ser_4.isna()
Out[46]:
foo    False
bar    False
baz    False
qux     True
dtype: bool
In [47]:
# .isnull()
pd.isnull(ser_4)
Out[47]:
foo    False
bar    False
baz    False
qux     True
dtype: bool

Interestingly, in a character variable, the value of None is preserved as a missing indicator:

In [48]:
ser_4b = pd.Series(["fred", "barney", None])
print(ser_4b)
ser_4b.isnull()
0      fred
1    barney
2      None
dtype: object
Out[48]:
0    False
1    False
2     True
dtype: bool

Problem: there was no such thing as a missing integer

NumPy allows NaN only for floating point numbers, so integers are promoted to floats if missing values exist.

Until Pandas 0.24, that is! This is still somewhat 'off the beaten path'. One must explicitly ask for the nullable data type Int64 (note the capital I).

Observe:

In [49]:
pd.Series([1, 2, 3, None, 44], dtype="Int64")
Out[49]:
0      1
1      2
2      3
3    NaN
4     44
dtype: Int64

So Pandas 0.24, for the first time, allows missing values ("NaN") both in floats and in the special Int64 data type.

6. Extract the NumPy array from the Pandas series

Some functions (especially stats and plotting functions) require the actual NumPy array, not the pd.Series that contains it.

To extract the "actual data", Pandas 0.24 gives us several options (the syntax changed recently, and the documentation warns that many web pages are outdated). I don't honestly know if one way is better than another.

In [50]:
ser_1.array
Out[50]:
<PandasArray>
[1, 1, 2, -3, -5, 8, 13, 4, 1, 6, 7]
Length: 11, dtype: int64

Note that .array is an attribute, but there is also a method, .to_numpy().

In [51]:
ser_1.to_numpy()
Out[51]:
array([ 1,  1,  2, -3, -5,  8, 13,  4,  1,  6,  7])

You'll also see people retrieving the same array with the NumPy function asarray:

In [52]:
np.asarray(ser_1)
Out[52]:
array([ 1,  1,  2, -3, -5,  8, 13,  4,  1,  6,  7])

7. Other nutty stuff we found

We can name a Series.

Initially, a series does not have a name

In [53]:
ser_4.name

But if we assign a name, it will decorate our output:

In [54]:
ser_4.name = 'foobarbazqux'
ser_4
Out[54]:
foo    100.0
bar    200.0
baz    300.0
qux      NaN
Name: foobarbazqux, dtype: float64

And we can also name a Series's index, which appears as a label over the row names:

In [55]:
ser_4.index.name = 'kulabel'
ser_4
Out[55]:
kulabel
foo    100.0
bar    200.0
baz    300.0
qux      NaN
Name: foobarbazqux, dtype: float64

DataFrame

A DataFrame is a tabular data structure containing an ordered collection of rows and columns. Think of it like an Excel spreadsheet. Each column can have a different type. DataFrames have both row and column indices.

1. Create

Create a DataFrame from a Python dict:

In [56]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
Out[56]:
state year pop
0 VA 2012 5.0
1 VA 2013 5.1
2 VA 2014 5.2
3 MD 2014 4.0
4 MD 2015 4.1
Create a DataFrame, but keep only some of the columns:
In [57]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state'])
df_2
Out[57]:
year state
0 2012 VA
1 2013 VA
2 2014 VA
3 2014 MD
4 2015 MD

We can add a new Series (column) by name using input from a Python list:

In [58]:
df_1['unempla'] = [4.1, 3.9, 6.2, 5.0, 6.0]
df_1
Out[58]:
state year pop unempla
0 VA 2012 5.0 4.1
1 VA 2013 5.1 3.9
2 VA 2014 5.2 6.2
3 MD 2014 4.0 5.0
4 MD 2015 4.1 6.0

Similarly, assign a pre-existing Series to a column (note the index is used to align a partial column):

In [59]:
unemplb = pd.Series([6.0, 6.0, 6.1], index=[2, 3, 4])
print(unemplb)
df_1['unemplb'] = unemplb
df_1
2    6.0
3    6.0
4    6.1
dtype: float64
Out[59]:
state year pop unempla unemplb
0 VA 2012 5.0 4.1 NaN
1 VA 2013 5.1 3.9 NaN
2 VA 2014 5.2 6.2 6.0
3 MD 2014 4.0 5.0 6.0
4 MD 2015 4.1 6.0 6.1

Create a new column by copying an old one, possibly with calculation:

In [60]:
df_1['unemplasquared'] = df_1['unempla']**2 # ** is exponent
df_1
Out[60]:
state year pop unempla unemplb unemplasquared
0 VA 2012 5.0 4.1 NaN 16.81
1 VA 2013 5.1 3.9 NaN 15.21
2 VA 2014 5.2 6.2 6.0 38.44
3 MD 2014 4.0 5.0 6.0 25.00
4 MD 2015 4.1 6.0 6.1 36.00

It is difficult to think of a realistic case in which you might need to do this, but you can create a DataFrame from a nested dict of dicts. The keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified:
In [61]:
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = pd.DataFrame(pop)
df_4
Out[61]:
VA MD
2013 5.1 NaN
2014 5.2 4.0
2015 NaN 4.1

2. Behold the inner structure

Many will quake in fear, as if meeting the Wizard.

In [62]:
dir(df_4)
Out[62]:
['MD',
 'T',
 'VA',
 '_AXIS_ALIASES',
 '_AXIS_IALIASES',
 '_AXIS_LEN',
 '_AXIS_NAMES',
 '_AXIS_NUMBERS',
 '_AXIS_ORDERS',
 '_AXIS_REVERSED',
 '_AXIS_SLICEMAP',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_priority__',
 '__array_wrap__',
 '__bool__',
 '__bytes__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__div__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdiv__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmatmul__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__unicode__',
 '__weakref__',
 '__xor__',
 '_accessors',
 '_add_numeric_operations',
 '_add_series_only_operations',
 '_add_series_or_dataframe_operations',
 '_agg_by_level',
 '_agg_examples_doc',
 '_agg_summary_and_see_also_doc',
 '_aggregate',
 '_aggregate_multiple_funcs',
 '_align_frame',
 '_align_series',
 '_box_col_values',
 '_box_item_values',
 '_builtin_table',
 '_check_inplace_setting',
 '_check_is_chained_assignment_possible',
 '_check_label_or_level_ambiguity',
 '_check_percentile',
 '_check_setitem_copy',
 '_clear_item_cache',
 '_clip_with_one_bound',
 '_clip_with_scalar',
 '_combine_const',
 '_combine_frame',
 '_combine_match_columns',
 '_combine_match_index',
 '_consolidate',
 '_consolidate_inplace',
 '_construct_axes_dict',
 '_construct_axes_dict_for_slice',
 '_construct_axes_dict_from',
 '_construct_axes_from_arguments',
 '_constructor',
 '_constructor_expanddim',
 '_constructor_sliced',
 '_convert',
 '_count_level',
 '_create_indexer',
 '_cython_table',
 '_deprecations',
 '_dir_additions',
 '_dir_deletions',
 '_drop_axis',
 '_drop_labels_or_levels',
 '_ensure_valid_index',
 '_expand_axes',
 '_find_valid_index',
 '_from_arrays',
 '_from_axes',
 '_get_agg_axis',
 '_get_axis',
 '_get_axis_name',
 '_get_axis_number',
 '_get_axis_resolvers',
 '_get_block_manager_axis',
 '_get_bool_data',
 '_get_cacher',
 '_get_index_resolvers',
 '_get_item_cache',
 '_get_label_or_level_values',
 '_get_numeric_data',
 '_get_value',
 '_get_values',
 '_getitem_bool_array',
 '_getitem_frame',
 '_getitem_multilevel',
 '_gotitem',
 '_iget_item_cache',
 '_indexed_same',
 '_info_axis',
 '_info_axis_name',
 '_info_axis_number',
 '_info_repr',
 '_init_mgr',
 '_internal_names',
 '_internal_names_set',
 '_is_builtin_func',
 '_is_cached',
 '_is_copy',
 '_is_cython_func',
 '_is_datelike_mixed_type',
 '_is_homogeneous_type',
 '_is_label_or_level_reference',
 '_is_label_reference',
 '_is_level_reference',
 '_is_mixed_type',
 '_is_numeric_mixed_type',
 '_is_view',
 '_ix',
 '_ixs',
 '_join_compat',
 '_maybe_cache_changed',
 '_maybe_update_cacher',
 '_metadata',
 '_needs_reindex_multi',
 '_obj_with_exclusions',
 '_protect_consolidate',
 '_reduce',
 '_reindex_axes',
 '_reindex_columns',
 '_reindex_index',
 '_reindex_multi',
 '_reindex_with_indexers',
 '_repr_data_resource_',
 '_repr_fits_horizontal_',
 '_repr_fits_vertical_',
 '_repr_html_',
 '_repr_latex_',
 '_reset_cache',
 '_reset_cacher',
 '_sanitize_column',
 '_selected_obj',
 '_selection',
 '_selection_list',
 '_selection_name',
 '_series',
 '_set_as_cached',
 '_set_axis',
 '_set_axis_name',
 '_set_is_copy',
 '_set_item',
 '_set_value',
 '_setitem_array',
 '_setitem_frame',
 '_setitem_slice',
 '_setup_axes',
 '_shallow_copy',
 '_slice',
 '_stat_axis',
 '_stat_axis_name',
 '_stat_axis_number',
 '_take',
 '_to_dict_of_blocks',
 '_try_aggregate_string_function',
 '_typ',
 '_unpickle_frame_compat',
 '_unpickle_matrix_compat',
 '_update_inplace',
 '_validate_dtype',
 '_values',
 '_where',
 '_xs',
 'abs',
 'add',
 'add_prefix',
 'add_suffix',
 'agg',
 'aggregate',
 'align',
 'all',
 'any',
 'append',
 'apply',
 'applymap',
 'as_matrix',
 'asfreq',
 'asof',
 'assign',
 'astype',
 'at',
 'at_time',
 'axes',
 'between_time',
 'bfill',
 'bool',
 'boxplot',
 'clip',
 'clip_lower',
 'clip_upper',
 'columns',
 'combine',
 'combine_first',
 'compound',
 'copy',
 'corr',
 'corrwith',
 'count',
 'cov',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'describe',
 'diff',
 'div',
 'divide',
 'dot',
 'drop',
 'drop_duplicates',
 'droplevel',
 'dropna',
 'dtypes',
 'duplicated',
 'empty',
 'eq',
 'equals',
 'eval',
 'ewm',
 'expanding',
 'ffill',
 'fillna',
 'filter',
 'first',
 'first_valid_index',
 'floordiv',
 'from_dict',
 'from_records',
 'ftypes',
 'ge',
 'get',
 'get_dtype_counts',
 'get_ftype_counts',
 'get_values',
 'groupby',
 'gt',
 'head',
 'hist',
 'iat',
 'idxmax',
 'idxmin',
 'iloc',
 'index',
 'infer_objects',
 'info',
 'insert',
 'interpolate',
 'isin',
 'isna',
 'isnull',
 'items',
 'iteritems',
 'iterrows',
 'itertuples',
 'ix',
 'join',
 'keys',
 'kurt',
 'kurtosis',
 'last',
 'last_valid_index',
 'le',
 'loc',
 'lookup',
 'lt',
 'mad',
 'mask',
 'max',
 'mean',
 'median',
 'melt',
 'memory_usage',
 'merge',
 'min',
 'mod',
 'mode',
 'mul',
 'multiply',
 'ndim',
 'ne',
 'nlargest',
 'notna',
 'notnull',
 'nsmallest',
 'nunique',
 'pct_change',
 'pipe',
 'pivot',
 'pivot_table',
 'plot',
 'pop',
 'pow',
 'prod',
 'product',
 'quantile',
 'query',
 'radd',
 'rank',
 'rdiv',
 'reindex',
 'reindex_axis',
 'reindex_like',
 'rename',
 'rename_axis',
 'reorder_levels',
 'replace',
 'resample',
 'reset_index',
 'rfloordiv',
 'rmod',
 'rmul',
 'rolling',
 'round',
 'rpow',
 'rsub',
 'rtruediv',
 'sample',
 'select',
 'select_dtypes',
 'sem',
 'set_axis',
 'set_index',
 'shape',
 'shift',
 'size',
 'skew',
 'slice_shift',
 'sort_index',
 'sort_values',
 'squeeze',
 'stack',
 'std',
 'style',
 'sub',
 'subtract',
 'sum',
 'swapaxes',
 'swaplevel',
 'tail',
 'take',
 'timetuple',
 'to_clipboard',
 'to_csv',
 'to_dense',
 'to_dict',
 'to_excel',
 'to_feather',
 'to_gbq',
 'to_hdf',
 'to_html',
 'to_json',
 'to_latex',
 'to_msgpack',
 'to_numpy',
 'to_panel',
 'to_parquet',
 'to_period',
 'to_pickle',
 'to_records',
 'to_sparse',
 'to_sql',
 'to_stata',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unstack',
 'update',
 'values',
 'var',
 'where',
 'xs']

3. Missing Values

As with Series, None and np.NaN values in Python input appear as NaN:

In [63]:
data_2 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, None]}
# shuffle columns for fun
df_2 = pd.DataFrame(data_2, columns=['year', 'state', 'pop'])
df_2
Out[63]:
year state pop
0 2012 VA 5.0
1 2013 VA 5.1
2 2014 VA 5.2
3 2014 MD 4.0
4 2015 MD NaN

4. Extracting Information

Pulling out columns (er, Series) by name in two ways

We can retrieve a column by its name, which is sometimes called a key. This returns a Series object:

In [64]:
df_2['state']
Out[64]:
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object

Retrieve a column by attribute, returning a Series:

In [65]:
df_2.state
Out[65]:
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object
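
Caution: attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. This frame's 'pop' column illustrates the collision:

df_2.pop       # the DataFrame's .pop() method, not the column!
df_2['pop']    # bracket notation is always unambiguous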

Choose two columns by name to create a new, smaller data frame

In [66]:
df_2[["state", "year"]]
Out[66]:
state year
0 VA 2012
1 VA 2013
2 VA 2014
3 MD 2014
4 MD 2015

Pulling out rows (by name or number)

Use the iloc (index location) accessor to retrieve a row by its numeric position:

In [67]:
df_2.iloc[0, ]
Out[67]:
year     2012
state      VA
pop         5
Name: 0, dtype: object

To show selection by name, we need more interesting row index values:

In [68]:
df_2.index = ["a", "e", "i", "o", "u"]
df_2
Out[68]:
year state pop
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN
In [69]:
df_2.loc["o"]
Out[69]:
year     2014
state      MD
pop         4
Name: o, dtype: object
In [70]:
df_2.loc[ ["a", "u"] ]
Out[70]:
year state pop
a 2012 VA 5.0
u 2015 MD NaN

Choose rows and columns by name at same time:

In [71]:
df_2.loc[ ["a", "u"], ["year", "pop"] ]
Out[71]:
year pop
a 2012 5.0
u 2015 NaN

Select from a DataFrame based on a logical condition filter:

In [72]:
df_2[df_2['pop'] > 4.5]
Out[72]:
year state pop
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2

It is recommended to select columns by name (this avoids mistakes), but selecting columns by number is allowed.

How does Pandas know if you want rows or columns?

If only one element is specified, it seems to assume we want rows

In [73]:
# asks for row 1 only
df_2.iloc[1:2]
Out[73]:
year state pop
e 2013 VA 5.1

To ask for a single column (here the second column, "state"), we specify 2 pieces, rows and columns.

":" means all rows

In [74]:
df_2.iloc[: , 1:2]
Out[74]:
state
a VA
e VA
i VA
o MD
u MD

We must specify the row selection and column selection in a consistent way.

Observe the error we get from

> df_2.loc[0:2, 'pop']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-116-18676538f98d> in <module>
----> 1 df_2.loc[0:2, 'pop']

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1492             except (KeyError, IndexError, AttributeError):
   1493                 pass
-> 1494             return self._getitem_tuple(key)
   1495         else:
   1496             # we by definition only have the 0th axis

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
    866     def _getitem_tuple(self, tup):
    867         try:
--> 868             return self._getitem_lowerdim(tup)
    869         except IndexingError:
    870             pass

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_lowerdim(self, tup)
   1015                     return section
   1016                 # This is an elided recursive call to iloc/loc/etc'
-> 1017                 return getattr(section, self.name)[new_key]
   1018 
   1019         raise IndexingError('not applicable')

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1498 
   1499             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1500             return self._getitem_axis(maybe_callable, axis=axis)
   1501 
   1502     def _is_scalar_access(self, key):

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1865         if isinstance(key, slice):
   1866             self._validate_key(key, axis)
-> 1867             return self._get_slice_axis(key, axis=axis)
   1868         elif com.is_bool_indexer(key):
   1869             return self._getbool_axis(key, axis=axis)

~/.local/lib/python3.7/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
   1531         labels = obj._get_axis(axis)
   1532         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1533                                        slice_obj.step, kind=self.name)
   1534 
   1535         if isinstance(indexer, slice):

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   4671         """
   4672         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 4673                                                  kind=kind)
   4674 
   4675         # return a slice

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   4870         start_slice = None
   4871         if start is not None:
-> 4872             start_slice = self.get_slice_bound(start, 'left', kind)
   4873         if start_slice is None:
   4874             start_slice = 0

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   4796         # For datetime indices label may be a string that has to be converted
   4797         # to datetime boundary according to its resolution.
-> 4798         label = self._maybe_cast_slice_bound(label, side, kind)
   4799 
   4800         # we need to look up the label

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in _maybe_cast_slice_bound(self, label, side, kind)
   4748         # this is rejected (generally .loc gets you here)
   4749         elif is_integer(label):
-> 4750             self._invalid_indexer('slice', label)
   4751 
   4752         return label

~/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py in _invalid_indexer(self, form, key)
   3065                         "indexers [{key}] of {kind}".format(
   3066                             form=form, klass=type(self), key=key,
-> 3067                             kind=type(key)))
   3068 
   3069     # --------------------------------------------------------------------

TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>

Select a slice of rows from a specific column of a DataFrame:

In [75]:
df_2.loc["e":"o", 'pop']
Out[75]:
e    5.1
i    5.2
o    4.0
Name: pop, dtype: float64

Select from a DataFrame based on a filter:

In [76]:
df_2[df_2['pop'] > 5]
Out[76]:
year state pop
e 2013 VA 5.1
i 2014 VA 5.2

5. Delete column(s)

Delete one column:

In [77]:
# Insert a goof, on purpose!
df_2['goof'] = 5
df_2
Out[77]:
year state pop goof
a 2012 VA 5.0 5
e 2013 VA 5.1 5
i 2014 VA 5.2 5
o 2014 MD 4.0 5
u 2015 MD NaN 5
In [78]:
del df_2['goof']
df_2
Out[78]:
year state pop
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

Try This: run the following, then inspect df. See what you have. Inspect the column names, find the index.

In [79]:
ncol = 33
df = pd.DataFrame(np.random.randn(50, ncol))
# list comprehension to assign column names x1 ... x33
df.columns = ["x" + str(i) for i in range(1, ncol + 1)]
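
A few inspection calls you might try (a sketch):

df.shape       # (50, 33)
df.columns     # Index(['x1', 'x2', ..., 'x33'], dtype='object')
df.index       # RangeIndex(start=0, stop=50, step=1)
df.head()      # the first five rows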

6. Extract the NumPy array from the DataFrame

Until Pandas 0.24, the recommended route was the .values attribute.

Now, they suggest .to_numpy() method to take out the NumPy array.

Danger: a NumPy array is a homogeneous collection of values (all floats, all integers, or all characters)

In [80]:
df_2.to_numpy()
Out[80]:
array([[2012, 'VA', 5.0],
       [2013, 'VA', 5.1],
       [2014, 'VA', 5.2],
       [2014, 'MD', 4.0],
       [2015, 'MD', nan]], dtype=object)

If the columns have different data types, the 2D ndarray's dtype will accommodate all of the values.

Here the whole thing is promoted to an array of generic Python objects (dtype=object).

The old .values attribute still exists, but it is deprecated (it will vanish):

In [81]:
df_2.values
Out[81]:
array([[2012, 'VA', 5.0],
       [2013, 'VA', 5.1],
       [2014, 'VA', 5.2],
       [2014, 'MD', 4.0],
       [2015, 'MD', nan]], dtype=object)

7. Operations on entire DataFrames

Ask the data frame for its number of rows and columns (not a method, just a property)

In [82]:
df_1.shape
Out[82]:
(5, 6)

Review the dtypes of all columns

In [83]:
df_1.dtypes
Out[83]:
state              object
year                int64
pop               float64
unempla           float64
unemplb           float64
unemplasquared    float64
dtype: object

Transpose

We can transpose a DataFrame, which means that we can make the rows columns and columns rows.

In [84]:
df_2.T
Out[84]:
a e i o u
year 2012 2013 2014 2014 2015
state VA VA VA MD MD
pop 5 5.1 5.2 4 NaN

8. Nutty Stuff Again: names for house keeping

We can name the inner properties of the DataFrame itself.

We will assign a name to the row index and column names.

In [85]:
# Check what we have first
df_2.index.name
In [86]:
df_2.columns.name
In [87]:
# See how it prints without the names
df_2
Out[87]:
year state pop
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

Now we assign names to the index and columns

In [88]:
df_2.index.name = 'My_index_is_named_Jim'
df_2
Out[88]:
year state pop
My_index_is_named_Jim
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

Set the DataFrame columns name:

In [89]:
df_2.columns.name = 'happy_new_column_name_is_too_long'
df_2
Out[89]:
happy_new_column_name_is_too_long year state pop
My_index_is_named_Jim
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

Insert more reasonable names before proceeding

In [90]:
df_2.columns.name = "df_2"
df_2.index.name = "index"
df_2
Out[90]:
df_2 year state pop
index
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

Reindexing

Why Re-index?

Sometimes Pandas procedures delete rows or add new ones, and the row names or column names become inconsistent or unhelpful.

The DataFrame offers methods like .reset_index() and .reindex() as efficient ways to adjust the indices.

The .reindex() method can be used in 2 ways, either with the parameters

  • labels
  • axis (0 for rows, 1 for columns)

or with

  • index = new row names
  • columns = new column names
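
A sketch of the two equivalent calling styles, using the df_2 frame from above:

df_2.reindex(labels=["a", "e", "i"], axis=0)   # rows, via labels + axis
df_2.reindex(index=["a", "e", "i"])            # rows, via index=
df_2.reindex(labels=["pop", "year"], axis=1)   # columns, via labels + axis
df_2.reindex(columns=["pop", "year"])          # columns, via columns=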

Question Why bother with this?

Answers

  1. ? I really don't know, and

  2. Potential efficiency. If the data were humongous, it would be slow to rewrite the whole data frame. .reindex() has a copy=False argument that prevents the re-mapping of memory, as long as the index values remain the same.

In [91]:
# Here's the dataframe as it currently stands.
df_2
Out[91]:
df_2 year state pop
index
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN

To erase the existing index and insert numbers for the rows, use .reset_index(). Note this keeps the old row names as a variable.

  • Use the drop=True parameter to prevent that.
  • .reset_index() also has an inplace parameter.
In [92]:
df_2.reset_index()
Out[92]:
df_2 index year state pop
0 a 2012 VA 5.0
1 e 2013 VA 5.1
2 i 2014 VA 5.2
3 o 2014 MD 4.0
4 u 2015 MD NaN
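
A sketch of the drop=True variant, which discards the old labels instead of keeping them as a column:

df_2.reset_index(drop=True)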

Reindexing rows returns a new frame with the rows in a particular order:

In [93]:
kk2 = df_2.reindex(["a", "e", "u", "i", "o"], fill_value=0)

If a new index name does not match any old name, a row of missing scores will be "filled in"

In [94]:
df_2.reindex(["w", "a", "e", "y", "u", "i", "o"])
Out[94]:
df_2 year state pop
index
w NaN NaN NaN
a 2012.0 VA 5.0
e 2013.0 VA 5.1
y NaN NaN NaN
u 2015.0 MD NaN
i 2014.0 VA 5.2
o 2014.0 MD 4.0

If you don't want NaN, but would rather have something else, specify fill_value. (Watch the next cell closely: the missing comma between "y" and "u" makes Python concatenate them into the single label "yu", which is why the output has a row named yu.)

In [95]:
df_2.reindex(["w", "a", "e", "y" "u", "i", "o"], fill_value=0)
Out[95]:
df_2 year state pop
index
w 0 0 0.0
a 2012 VA 5.0
e 2013 VA 5.1
yu 0 0 0.0
i 2014 VA 5.2
o 2014 MD 4.0
In [96]:
df_2.reindex(columns=['state', 'pop', 'unempl', 'year'])
Out[96]:
df_2 state pop unempl year
index
a VA 5.0 NaN 2012
e VA 5.1 NaN 2013
i VA 5.2 NaN 2014
o MD 4.0 NaN 2014
u MD NaN NaN 2015

Letting the index fill in data

Here we specify a non-consecutive index to demonstrate an interesting feature in .reindex: it will "fill in" values for the missing index entries:

In [97]:
ser_5 = pd.Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
In [98]:
ser_5.reindex(range(5), method='ffill')
Out[98]:
0    foo
1    foo
2    bar
3    bar
4    baz
dtype: object
In [99]:
ser_5.reindex(range(5), method='bfill')

# Can you tell the difference between 'ffill' and 'bfill'?
Out[99]:
0    foo
1    bar
2    bar
3    baz
4    baz
dtype: object

Arithmetic and Data Alignment

Common R user mistake

Pandas will not add columns from 2 different data frames unless the index values match.

Adding series in Pandas behaves more like a horizontal "merge".

First let's create two Series of random numbers, with a different set of indices.

In [100]:
np.random.seed(0)
ser_6 = pd.Series(np.random.randn(5),
               index=['a', 'b', 'c', 'd', 'e'])
ser_6
Out[100]:
a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64
In [101]:
np.random.seed(1)
ser_7 = pd.Series(np.random.randn(5),
               index=['a', 'c', 'e', 'f', 'g'])
ser_7
Out[101]:
a    1.624345
c   -0.611756
e   -0.528172
f   -1.072969
g    0.865408
dtype: float64

Now, let's add the series together.

In [102]:
ser_6 + ser_7

# So what has the math done here?
Out[102]:
a    3.388398
b         NaN
c    0.366982
d         NaN
e    1.339386
f         NaN
g         NaN
dtype: float64

We can use the .add() method and set a fill value instead of NaN for indices that do not overlap:

In [103]:
ser_6.add(ser_7, fill_value=0)
Out[103]:
a    3.388398
b    0.400157
c    0.366982
d    2.240893
e    1.339386
f   -1.072969
g    0.865408
dtype: float64

Suppose you wanted to add ser_6 and ser_7 element-by-element, ignoring the indices? It appears some gymnastics are needed.

In [104]:
ser_6.reset_index(drop=True, inplace=True)
ser_6
Out[104]:
0    1.764052
1    0.400157
2    0.978738
3    2.240893
4    1.867558
dtype: float64
In [105]:
ser_7.reset_index(drop=True, inplace=True)
ser_7
Out[105]:
0    1.624345
1   -0.611756
2   -0.528172
3   -1.072969
4    0.865408
dtype: float64
In [106]:
# ser_6.add(ser_7) or
ser_6 + ser_7
Out[106]:
0    3.388398
1   -0.211599
2    0.450566
3    1.167925
4    2.732966
dtype: float64
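
Another way to ignore the indices (a sketch): extract the underlying NumPy arrays, which carry no index at all, and add those:

ser_6.to_numpy() + ser_7.to_numpy()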

Likewise, adding DataFrame objects results in the union of index pairs for rows and columns if the pairs are not the same, resulting in NaN for indices that do not overlap:

In [107]:
np.random.seed(0)
df_8 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['a', 'b', 'c'])
df_8
Out[107]:
a b c
0 0.548814 0.715189 0.602763
1 0.544883 0.423655 0.645894
2 0.437587 0.891773 0.963663
In [108]:
np.random.seed(1)
df_9 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                 columns=['b', 'c', 'd'])
df_9
Out[108]:
b c d
0 0.417022 0.720324 0.000114
1 0.302333 0.146756 0.092339
2 0.186260 0.345561 0.396767
In [109]:
df_8 + df_9
Out[109]:
a b c d
0 NaN 1.132211 1.323088 NaN
1 NaN 0.725987 0.792650 NaN
2 NaN 1.078033 1.309223 NaN

As before, we can set a fill value instead of NaN for indices that do not overlap:

In [110]:
df_10 = df_8.add(df_9, fill_value=0)
df_10
Out[110]:
a b c d
0 0.548814 1.132211 1.323088 0.000114
1 0.544883 0.725987 0.792650 0.092339
2 0.437587 1.078033 1.309223 0.396767

Pandas supports arithmetic operations between DataFrames and Series. Match the index of the Series on the DataFrame's columns, broadcasting down the rows:

In [111]:
ser_8 = df_10.iloc[0]
df_11 = df_10 - ser_8
df_11
Out[111]:
a b c d
0 0.000000 0.000000 0.000000 0.000000
1 -0.003930 -0.406224 -0.530438 0.092224
2 -0.111226 -0.054178 -0.013864 0.396653

Match the index of the Series on the DataFrame's columns, broadcasting down the rows and unioning the indices that do not match:

In [112]:
ser_9 = pd.Series(range(3), index=['a', 'd', 'e'])
ser_9
Out[112]:
a    0
d    1
e    2
dtype: int64
In [113]:
df_11 - ser_9
Out[113]:
a b c d e
0 0.000000 NaN NaN -1.000000 NaN
1 -0.003930 NaN NaN -0.907776 NaN
2 -0.111226 NaN NaN -0.603347 NaN

Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:

In [114]:
df_10
Out[114]:
a b c d
0 0.548814 1.132211 1.323088 0.000114
1 0.544883 0.725987 0.792650 0.092339
2 0.437587 1.078033 1.309223 0.396767
In [115]:
ser_10 = pd.Series([100, 200, 300])
ser_10
Out[115]:
0    100
1    200
2    300
dtype: int64
In [116]:
df_10.sub(ser_10, axis=0)
Out[116]:
a b c d
0 -99.451186 -98.867789 -98.676912 -99.999886
1 -199.455117 -199.274013 -199.207350 -199.907661
2 -299.562413 -298.921967 -298.690777 -299.603233

The axis argument above is pandas' way of saying to index by row (axis 0). As you can probably guess, axis 1 refers to columns.

Sorting and Ranking

In [117]:
ser_4
Out[117]:
kulabel
foo    100.0
bar    200.0
baz    300.0
qux      NaN
Name: foobarbazqux, dtype: float64

Sort a Series by its index:

In [118]:
ser_4.sort_index()
Out[118]:
kulabel
bar    200.0
baz    300.0
foo    100.0
qux      NaN
Name: foobarbazqux, dtype: float64
In [119]:
df_12 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12
Out[119]:
c a b d
three 0 1 2 3
one 4 5 6 7
two 8 9 10 11

We can sort a DataFrame by its index:

In [120]:
df_12.sort_index()
Out[120]:
c a b d
one 4 5 6 7
three 0 1 2 3
two 8 9 10 11

Sort a DataFrame by columns in descending order:

In [121]:
df_12.sort_index(axis=1, ascending=False)
Out[121]:
d c b a
three 3 0 2 1
one 7 4 6 5
two 11 8 10 9

Sort a DataFrame's values by column:

In [122]:
df_12.sort_values(by=['d', 'c'])
Out[122]:
c a b d
three 0 1 2 3
one 4 5 6 7
two 8 9 10 11

DataFrames can rank over rows or columns. This is easier to illustrate than it is to explain, so see below!

In [123]:
df_13 = pd.DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
Out[123]:
foo bar baz
0 7 -5 -1
1 -5 4 2
2 7 2 3
3 4 0 0
4 2 4 5
5 0 7 9
6 4 7 9
7 7 8 5

Rank a DataFrame over rows:

In [124]:
# don't run this
# df_13.rank()

Rank a DataFrame over columns:

In [125]:
# df_13.rank(axis=1)

Axis Indexes with Duplicate Values

Labels do not have to be unique in Pandas:

In [126]:
ser_12 = pd.Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12
Out[126]:
foo    0
foo    1
bar    2
bar    3
baz    4
dtype: int64
In [127]:
ser_12.index.is_unique
Out[127]:
False

Select Series elements:

In [128]:
ser_12['foo']
Out[128]:
foo    0
foo    1
dtype: int64

Select DataFrame elements:

In [129]:
df_14 = pd.DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
Out[129]:
0 1 2 3
foo -2.363469 1.135345 -1.017014 0.637362
foo -0.859907 1.772608 -1.110363 0.181214
bar 0.564345 -0.566510 0.729976 0.372994
bar 0.533811 -0.091973 1.913820 0.330797
baz 1.141943 -1.129595 -0.850052 0.960820
In [130]:
df_14.loc['bar']
Out[130]:
0 1 2 3
bar 0.564345 -0.566510 0.729976 0.372994
bar 0.533811 -0.091973 1.913820 0.330797

Summarizing and Computing Descriptive Statistics

Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.

In [131]:
df_2
Out[131]:
df_2 year state pop
index
a 2012 VA 5.0
e 2013 VA 5.1
i 2014 VA 5.2
o 2014 MD 4.0
u 2015 MD NaN
In [132]:
## Throws away categorical & character variables!
df_2.describe()
Out[132]:
df_2 year pop
count 5.000000 4.000000
mean 2013.600000 4.825000
std 1.140175 0.556028
min 2012.000000 4.000000
25% 2013.000000 4.750000
50% 2014.000000 5.050000
75% 2014.000000 5.125000
max 2015.000000 5.200000
In [133]:
df_2.sum()
Out[133]:
df_2
year          10068
state    VAVAVAMDMD
pop            19.3
dtype: object

Sum over the rows:
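
A sketch of that step (axis=1 makes the sum run across each row, skipping non-numeric values):

df_2.sum(axis=1)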

Whew! That was a lot.

This notebook can serve as a resource for you as you work with Pandas.

And yet we barely scratched the surface

Pandas elements that we did not mention

  1. combining data frames (merge, concatenate)
  2. special features for categorical variables
  3. countless others for data exploration

A few other helpful links:

19 Essential Snippets in Pandas

Useful Pandas Snippets

A lovely post about Pandas Ufuncs

Pandas Data Types

Next up: something more fun!