pandas course
http://oit2.scps.nyu.edu/~meretzkm/pandas/
mark.meretzky@gmail.com
-
Outline for the pandas course.
-
The pandas course is intended for students who have already taken this Python course or the equivalent.
Bibliography
Nowadays all the information you need is online and free,
so there is no textbook for this course.
The following documentation includes tutorials and references.
- The Python language:
https://docs.python.org/
- The NumPy library:
https://numpy.org/devdocs/
- The pandas library:
https://pandas.pydata.org/pandas-docs/stable/
- The matplotlib library:
https://matplotlib.org/contents.html
- Many of our online datasets will come from NYC OpenData:
https://opendata.cityofnewyork.us/
Start with the Department of Health restaurant inspection results and view the data.
The course
-
Teaser.
Download a dataset into a pandas DataFrame.
Don’t worry about cleaning the data.
-
Baby names:
a dataset of the 32,033 names shared by the
3,487,353 American babies of 2018,
male and female.
-
Restaurant:
a dataset of 391,000 restaurant inspections by the NYC Department of Health.
-
Install Python and pandas.
-
Array.
Heterogeneous vs. homogeneous containers: Python list vs. Python array.array.
Compare space with sys.getsizeof; compare speed with timeit.repeat or with IPython %time.
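As a rough sketch of that comparison, the following measures a list against an array.array of signed 8-byte integers. The element count `n` and all variable names are illustrative assumptions, not part of the course materials. Note that sys.getsizeof of a list counts only its pointers, so the sizes of the int objects are added separately.

```python
import array
import sys
import timeit

n = 10_000  # illustrative size, chosen arbitrarily

as_list = list(range(n))               # container of pointers to int objects
as_array = array.array("q", range(n))  # homogeneous: "q" means signed long long

# A list stores pointers; add the size of each int object it points to.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = sys.getsizeof(as_array)  # the raw values live inside the array

# Time 3 repetitions of 100 calls each and keep the fastest repetition.
list_seconds = min(timeit.repeat(lambda: sum(as_list), repeat=3, number=100))
array_seconds = min(timeit.repeat(lambda: sum(as_array), repeat=3, number=100))

print("bytes:  ", list_bytes, array_bytes)
print("seconds:", list_seconds, array_seconds)
```

The timings vary from machine to machine, so only the space comparison should be expected to come out the same way every time.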
-
NumPy ndarray.
-
ndarray.
The dtypes: signed vs. unsigned int, different sizes of int and float.
Single vs. multi-dimensional.
Vectorized operations instead of Python for loops.
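A minimal sketch of these three points, with made-up values and variable names: the same numbers stored under different dtypes, a two-dimensional array, and one vectorized expression doing the work of a nested for loop.

```python
import numpy as np

# Same values, different dtypes: signedness and width are chosen per array.
small = np.array([1, 2, 3], dtype=np.int8)       # signed, 1 byte per element
big = np.array([1, 2, 3], dtype=np.int64)        # signed, 8 bytes per element
unsigned = np.array([1, 2, 3], dtype=np.uint16)  # unsigned, 2 bytes per element
reals = np.array([1.5, 2.5], dtype=np.float32)   # 4-byte float

itemsizes = (small.itemsize, big.itemsize, unsigned.itemsize, reals.itemsize)

# Multi-dimensional: reshape six elements into 2 rows by 3 columns.
grid = np.arange(6).reshape(2, 3)

# Vectorized: every element is doubled without a Python for loop.
doubled = grid * 2

print(itemsizes)
print(doubled)
```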
-
arange: Python range function vs. NumPy np.arange function.
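A small illustration of the difference, using arbitrary start, stop, and step values: range produces a lazy sequence of Python ints, while np.arange builds an ndarray and also accepts a float step.

```python
import numpy as np

r = range(0, 10, 2)      # lazy sequence of Python ints
a = np.arange(0, 10, 2)  # ndarray holding the same values

# np.arange accepts a float step; range would raise TypeError here.
quarters = np.arange(0, 1, 0.25)

print(list(r))
print(a)
print(quarters)
```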
-
Series.
A pandas Series is a one-dimensional array of values (typically backed by an np.ndarray) with an index.
-
Documentation about class Series.
-
Install pandas.
-
Create a Series object with the default index.
-
Iterate through a Series using a for loop.
-
Default.
Examine the default index of a Series.
-
Explicit.
Give an explicit index to a Series.
-
Replace the index of a Series.
This will not change the order of the rows.
-
Index and slice a Series.
-
Reindex a Series.
This could change the order of the rows.
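The index topics above can be sketched in one place. The values and labels below are invented for illustration. The key contrast is that assigning to `.index` only relabels the rows, while `reindex` looks rows up by their existing labels and can therefore reorder them or introduce missing values.

```python
import pandas as pd

# Default index: a RangeIndex counting from 0.
s = pd.Series([10, 20, 30])

# Explicit index supplied at creation.
t = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Index and slice: by label with loc, by position with iloc.
by_label = t.loc["b"]
by_position = t.iloc[0:2]

# Replace the index: rows are relabeled, their order is untouched.
u = t.copy()
u.index = ["x", "y", "z"]

# Reindex: rows are looked up by existing label, so the order can change.
# A label that is not present ("d") yields NaN.
v = t.reindex(["c", "a", "d"])

print(s.index)
print(u)
print(v)
```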
-
Satisfy.
Select the rows of a Series that satisfy a condition.
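One common way to do this is with a boolean mask. A minimal sketch, with made-up numbers and an arbitrary threshold:

```python
import pandas as pd

s = pd.Series([5, 12, 7, 20], index=["a", "b", "c", "d"])

mask = s > 6        # a Series of booleans with the same index as s
selected = s[mask]  # keeps only the rows where the mask is True

print(mask)
print(selected)
```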
-
Data.
Create a Series from data.
-
Reductions and aggregations.
-
Vectorized operations on a Series.
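A brief sketch contrasting the two ideas, with invented values: a reduction collapses a Series to a single number, while a vectorized operation returns a new Series of the same length with the index preserved.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])

# Reductions: many values in, one value out.
total = s.sum()
largest = s.max()

# Aggregation: several reductions at once, returned as a Series
# indexed by the names of the reductions.
summary = s.agg(["min", "max"])

# Vectorized operations: applied elementwise, no Python for loop.
roots = np.sqrt(s)
scaled = s * 10

print(total, largest)
print(summary)
print(roots)
```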
-
Align the indices of two Series objects.
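Alignment can be illustrated with two small Series whose labels only partly overlap (the labels and values are made up). Arithmetic matches rows by label rather than by position, and a label missing from either side produces NaN unless a fill value is given.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Rows are matched by label; "a" and "d" appear on one side only, so they are NaN.
total = left + right

# Treat a missing value on either side as 0 instead.
filled = left.add(right, fill_value=0)

# align returns both Series reindexed to the union of the labels.
left_aligned, right_aligned = left.align(right)

print(total)
print(filled)
```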
-
Tall.
A tall Series holding a column of strings from a real dataset.
-
Baby names.
A tall Series holding a column of numbers from a real dataset.
-
Plot a Series using matplotlib.pyplot.
-
DataFrame
-
Create a DataFrame object with the default index.
-
Iterate through the rows of a DataFrame with itertuples rather than with iterrows.
Iterate through the columns with and without items.
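One reason to prefer itertuples can be seen in a small sketch (the column names and values here are invented). iterrows packs each row into a single Series, which has one dtype, so an integer column is upcast to float when a float column sits beside it. itertuples yields a namedtuple per row and does not upcast across columns.

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2], "x": [0.5, 1.5]}, index=["a", "b"])

# itertuples: one namedtuple per row, with fields Index, n, x.
first_tuple = next(df.itertuples())

# iterrows: (label, Series) pairs; the row Series has a single dtype.
label, first_row = next(df.iterrows())

# Without items: iterating the DataFrame itself yields column labels only.
labels = list(df)

# With items: (label, column Series) pairs.
columns = {name: column.tolist() for name, column in df.items()}

print(first_tuple)
print(first_row.dtype)
print(labels, columns)
```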
-
Read a numeric column using read_csv.
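A self-contained sketch of read_csv. To avoid depending on a download, the CSV text is a made-up stand-in held in a string; in the course the argument would presumably be a file name or URL instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, invented for this example.
csv_text = "year,births\n2016,100\n2017,110\n2018,120\n"

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

births = df["births"]  # the numeric column, as a Series

print(df.dtypes)
print(births.sum())
```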
-
Year of birth: download and proofread a simple DataFrame.
-
Add a column to a DataFrame.
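Two common ways to add a column, sketched with invented data: assignment with square brackets changes the DataFrame in place, while assign returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

df = pd.DataFrame({"boys": [3, 5], "girls": [4, 2]})

# In place: df gains a "total" column.
df["total"] = df["boys"] + df["girls"]

# Not in place: assign returns a copy with the extra column.
with_share = df.assign(girl_share=df["girls"] / df["total"])

print(df)
print(with_share)
```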
-
reindex vs. reset_index vs. assign to index.
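The three operations can be compared side by side on one small DataFrame (labels and scores are made up). reindex selects and reorders rows by existing label, reset_index moves the index into a column and installs a fresh default index, and assigning to index simply relabels the rows.

```python
import pandas as pd

df = pd.DataFrame({"score": [90, 80, 70]}, index=["a", "b", "c"])

# reindex: look up rows by label; the unknown label "z" gives a NaN row.
reordered = df.reindex(["c", "a", "z"])

# reset_index: the old index becomes a column named "index".
flattened = df.reset_index()

# Assign to index: same rows in the same order, with new labels.
relabeled = df.copy()
relabeled.index = ["x", "y", "z"]

print(reordered)
print(flattened)
print(relabeled)
```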
-
Hierarchical index.
-
The groupby method in pandas.
-
Pivot table is a special case of groupby.
See also pivot.
-
Cross tabulation is a special case of pivot_table.
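The relationship among groupby, pivot_table, and crosstab can be sketched on a toy table. The boroughs, grades, and scores below are invented and are not taken from the restaurant dataset.

```python
import pandas as pd

# Made-up inspection records, for illustration only.
df = pd.DataFrame({
    "boro": ["Bronx", "Bronx", "Queens", "Queens"],
    "grade": ["A", "B", "A", "A"],
    "score": [10, 20, 12, 8],
})

# groupby: mean score per (boro, grade) pair, as a Series with a two-level index.
by_group = df.groupby(["boro", "grade"])["score"].mean()

# pivot_table: the same means, with grade spread across the columns.
table = df.pivot_table(values="score", index="boro", columns="grade", aggfunc="mean")

# crosstab: counts of rows per (boro, grade) pair.
counts = pd.crosstab(df["boro"], df["grade"])

print(by_group)
print(table)
print(counts)
```

A combination that never occurs (Queens with grade B here) is simply absent from the groupby result, NaN in the pivot table, and 0 in the cross tabulation.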
-
Time series
To do
-
Indexing and slicing by the job you want to get done.
Select a row by label or position.
Select several rows by label or position.
Drop rows by label or position.
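A possible shape for this topic, using an invented Series: loc works by label, iloc works by position, and drop takes labels, so dropping by position goes through the index.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Select one row.
one_by_label = s.loc["b"]
one_by_position = s.iloc[1]

# Select several rows.
several_by_label = s.loc[["a", "c"]]
several_by_position = s.iloc[[0, 2]]

# Drop rows: drop expects labels, so positions are translated via s.index.
dropped_by_label = s.drop(["a", "c"])
dropped_by_position = s.drop(s.index[[0, 2]])

print(several_by_label)
print(dropped_by_position)
```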
-
Vectorized operations on strings:
reduce each name to its initial, or capitalize it.
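Both operations are available through the str accessor of a Series. A minimal sketch with a few made-up names:

```python
import pandas as pd

names = pd.Series(["emma", "LIAM", "olivia"])

# First character of every string, without a Python for loop.
initials = names.str[0]

# First letter uppercase, the rest lowercase.
capitalized = names.str.capitalize()

print(initials.tolist())
print(capitalized.tolist())
```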
-
Standalone example of unique, maybe Index.is_unique.
Example of index values that are not unique.
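Such an example might look like the following sketch, which uses invented values and a deliberately repeated label. With a non-unique index, selecting by label returns every matching row rather than a single value.

```python
import pandas as pd

s = pd.Series([3, 1, 3], index=["a", "b", "a"])

# unique: distinct values, in order of first appearance.
distinct_values = s.unique()

# is_unique: False because the label "a" occurs twice.
index_is_unique = s.index.is_unique

# Selecting a repeated label returns a Series, not a scalar.
rows_for_a = s.loc["a"]

print(distinct_values)
print(index_is_unique)
print(rows_for_a)
```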
-
GroupBy.count and GroupBy.size:
-
Standalone groupby.
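For the count and size item, a standalone groupby sketch on made-up data with one missing score shows the difference: size counts the rows in each group, while count counts only the non-missing values.

```python
import pandas as pd

# Invented data; the None becomes NaN in the float column.
df = pd.DataFrame({
    "grade": ["A", "A", "B"],
    "score": [10.0, None, 7.0],
})

grouped = df.groupby("grade")

sizes = grouped.size()             # rows per group, missing values included
counts = grouped["score"].count()  # non-missing scores per group

print(sizes)
print(counts)
```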