The pandas and NumPy libraries for the Python language provide data structures and data analysis tools for financial and scientific computing. With these libraries, Python code uses far less time and memory and becomes simpler to write and maintain. This course—for students who have successfully completed Introduction to Python Programming INFO1-CE9990 or the equivalent—is designed to prepare you for a position as a pandas developer and data scientist.
Learn to create and manipulate the three major pandas data structures,
the ndarray, Series, and DataFrame.
Read, write, and compress these structures
using file formats including
plain text, CSV, JSON, XML, HTML tables, and binary.
Perform arithmetic and other vectorized operations
along the rows and columns of data without having to code a Python for loop.
Sort, search, merge, and summarize your tables of data.
Practice reshaping and transforming the pandas data structures,
create pivot tables and cross tabulations,
and reformat and redistribute the information in flexible ways.
Index your data by a fixed or variable time series.
Add additional dimensions to your data
via the hierarchical indexing of rows and columns.
Use the “split-apply-combine” paradigm
to divide the data into groups,
process each group separately,
and recombine the results.
Learn to format string and numeric data;
display the results as text and/or graphics.
Copious in-class examples are drawn from online datasets such as
NYC OpenData.
Further information:
http://oit2.scps.nyu.edu/~meretzkm/pandas/
This Python course is ideal for those who have already completed
Introduction to Python Programming/INFO1-CE9990 or the equivalent.
Participants should know how to iterate
through the native Python data structures list, tuple, set, and dictionary,
and how to create new ones using Python comprehensions.
[The following note should not be included in the catalog description:
Participants do not need to know how to create
Python classes, objects, iterators, and generators.]
list and tuple may contain values of different datatypes,
offering convenient flexibility.
Homogeneous containers such as the Python Standard Library array.array
and the NumPy ndarray must contain values of the same datatype,
offering savings in space and time.
Install NumPy and pandas on your Mac, PC, or Linux machine,
and measure the memory and speed tradeoffs
between these two kinds of containers.
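One way this measurement might look is sketched below. The container size and the choice of summation as the timed operation are assumptions made for illustration; actual timings vary by machine.

```python
import sys
import timeit

import numpy as np

n = 100_000
py_list = list(range(n))                  # container that may hold mixed types
np_array = np.arange(n, dtype=np.int64)   # homogeneous container

# Memory: the list stores pointers to separate int objects;
# the ndarray stores the raw 8-byte integers contiguously.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = np_array.nbytes

# Speed: time the same summation on both containers.
list_seconds = timeit.timeit(lambda: sum(py_list), number=20)
array_seconds = timeit.timeit(lambda: np_array.sum(), number=20)

print(f"list:    {list_bytes:>10,} bytes  {list_seconds:.4f} s")
print(f"ndarray: {array_bytes:>10,} bytes  {array_seconds:.4f} s")
```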
ndarray in NumPy.
An ndarray is important because the data in each pandas Series
is held in an ndarray.
Create and initialize an ndarray with one or more axes (dimensions).
Survey the available dtypes (types of data)
that can be stored in the ndarray,
and choose the correct size of int and float data for your application.
Use the placeholder value np.nan (“not a number”)
for missing information.
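The following is a minimal sketch of creating an ndarray, choosing a dtype, and marking missing data. The sample values and variable names are made up for illustration.

```python
import numpy as np

# Two axes (2 rows, 3 columns), with an explicitly chosen small integer type.
counts = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int8)
print(counts.ndim, counts.shape, counts.dtype, counts.itemsize)

# The same values as 64-bit floats take eight bytes per item instead of one.
wide = counts.astype(np.float64)

# np.nan is a float, so an array that holds it needs a float dtype.
readings = np.array([20.5, np.nan, 22.0])
missing = np.isnan(readings)          # boolean mask of the missing items
mean_present = np.nanmean(readings)   # mean that skips the missing items
```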
Perform vectorized arithmetic on an ndarray
without writing Python for loops,
and apply ufuncs (universal functions) to the data in the rows and columns
by specifying the appropriate axis.
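As a small illustration of vectorized arithmetic and of reducing along an axis, assuming a made-up 2-by-3 array:

```python
import numpy as np

prices = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])

# Vectorized arithmetic: one expression, no explicit loop.
doubled = prices * 2

# A ufunc applied elementwise.
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

# Reducing with a ufunc along an axis:
# axis=0 collapses the rows (one result per column),
# axis=1 collapses the columns (one result per row).
column_totals = np.add.reduce(prices, axis=0)
row_maxima = np.maximum.reduce(prices, axis=1)
```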
Join, concatenate, and merge two ndarrays;
form unions and intersections of the data.
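A brief sketch of these operations, using small made-up arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

below = np.concatenate([a, b], axis=0)    # add b as a new row
beside = np.concatenate([a, a], axis=1)   # place two copies side by side

x = np.array([3, 1, 2, 3])
y = np.array([2, 3, 4])
union = np.union1d(x, y)          # sorted unique values found in either array
common = np.intersect1d(x, y)     # sorted unique values found in both arrays
```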
Sort and reorder the rows and columns;
reshape and transpose the ndarray
for insertion into a pandas Series or DataFrame.
Select and access the individual items, rows, and columns
of an ndarray by indexing and slicing;
learn the difference between a copy and a view.
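The difference between a copy and a view can be seen in a short sketch like this one (the array contents are arbitrary):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

item = grid[1, 2]       # a single item
row = grid[1]           # one row
column = grid[:, 2]     # one column

view = grid[:2, :2]     # basic slicing returns a view of the same memory
view[0, 0] = 99         # so this write also changes grid

duplicate = grid[:2, :2].copy()   # an explicit copy owns its own memory
duplicate[0, 0] = -1              # so grid is unaffected
```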
Format an ndarray as a Python string,
display it as graphics,
or save and compress it in a file.
Series in pandas.
A Series is important because each column of a DataFrame is a Series.
A Series is like an ordered dictionary.
The keys of the Series (not necessarily unique) are called the index;
the values of the Series are stored in a one-dimensional NumPy ndarray.
Create and initialize a Series
from a Python data structure or from a NumPy ndarray.
Give the default index to the Series or create your own index.
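A minimal sketch of these ways of building a Series; the labels and values are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From an ndarray with the default index 0, 1, 2, ...
default_index = pd.Series(np.array([1.5, 2.5, 3.5]))

# From an ndarray with an index of our own.
labeled = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# Index labels need not be unique; a repeated label selects several rows.
repeated = pd.Series([1, 2, 3], index=["a", "a", "b"])
both_a = repeated.loc["a"]
```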
Perform vectorized arithmetic and other operations on one or two Series objects.
Join two Series objects that have the same index or different indexes.
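When two Series have different indexes, pandas aligns them by label. A small sketch with made-up data:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels; a label missing from either side gives NaN.
total = left + right

# fill_value substitutes 0 for a label missing from one side.
filled = left.add(right, fill_value=0)

# Joining the two Series as columns keeps the union of the labels.
side_by_side = pd.concat([left, right], axis=1, keys=["left", "right"])
```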
Select one or more rows from a Series using indexing and slicing,
sort the rows by index or by value,
and write the contents of the Series to a string or to a file.
DataFrame in pandas.
A DataFrame is important because it is the principal pandas data structure.
It’s like a spreadsheet consisting of rows and columns.
Each column is a named Series,
and all the columns share the same index.
Create a DataFrame and initialize it from a NumPy ndarray,
a pandas Series,
or a Python data structure,
rearranging the latter if necessary to get it into DataFrame format.
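The sketch below shows a few of the possible starting points; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a two-dimensional ndarray, naming the columns.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From Series that share an index: each Series becomes a column.
from_series = pd.DataFrame({
    "x": pd.Series([1, 2], index=["a", "b"]),
    "y": pd.Series([3, 4], index=["a", "b"]),
})

# From a list of dictionaries, one dictionary per row.
records = [{"name": "a", "n": 1}, {"name": "b", "n": 2}]
from_records = pd.DataFrame(records)

# From a list of rows, supplying the column names separately.
rows = [["a", 1], ["b", 2]]
from_rows = pd.DataFrame(rows, columns=["name", "n"])
```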
Index a DataFrame with loc vs. iloc and take.
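A short sketch contrasting label-based and position-based access on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "x"]       # loc selects by row and column labels
by_position = df.iloc[1, 0]       # iloc selects by integer positions

label_slice = df.loc["a":"b"]     # a label slice includes both endpoints
position_slice = df.iloc[0:1]     # a position slice excludes the endpoint

picked = df.take([2, 0])          # take selects rows by position, in the order given
```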
Perform elementary operations on a DataFrame
such as slicing and selecting rows and columns.
Combine two DataFrames with a database-style “join” operation.
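A database-style join might be sketched as follows, assuming two hypothetical tables that share a dept_id column:

```python
import pandas as pd

employees = pd.DataFrame({
    "dept_id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Cy", "Di"],
})
depts = pd.DataFrame({
    "dept_id": [1, 2],
    "dept": ["Sales", "Ops"],
})

# Inner join: keep only the rows whose key appears in both tables.
inner = employees.merge(depts, on="dept_id", how="inner")

# Left join: keep every employee; an unmatched key gets a missing dept.
left = employees.merge(depts, on="dept_id", how="left")
```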
Copy information between a DataFrame
and a Python data structure such as a list of lists.
DataFrame from a file on your local machine,
or from a website or server.
Investigate the most common file formats including plain text,
CSV (comma-separated values),
JSON (JavaScript Object Notation),
XML (Extensible Markup Language),
HTML tables,
and various “serialized” or “marshalled”
binary formats.
Designate a column of the incoming data to act as the index.
Perform type inference and data conversion;
eliminate or fill in missing or redundant data.
Read and process a large file in separate pieces.
When you’re done with the DataFrame,
write it back to a file.
Interact with web APIs and databases.
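Reading, cleaning, chunking, and writing could look roughly like this. To keep the sketch self-contained, an in-memory string stands in for a file or URL, and the column names are invented.

```python
import io

import pandas as pd

# Stand-in for a CSV file; note the missing amount in the second row.
csv_text = (
    "id,when,amount\n"
    "1,2024-01-05,10.5\n"
    "2,2024-01-06,\n"
    "3,2024-01-07,7.0\n"
)

# Use the id column as the index and convert the when column to dates.
df = pd.read_csv(io.StringIO(csv_text), index_col="id", parse_dates=["when"])
df["amount"] = df["amount"].fillna(0.0)   # fill in the missing value

# Process a large file in pieces of two rows at a time.
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["amount"].sum()

# Write the DataFrame back out (here to a string buffer rather than a file).
out = io.StringIO()
df.to_csv(out)
```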
Series.
Separate continuous numeric data into bins,
either of equal length or containing equal numbers of items.
Select random samples or permutations of the data.
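Binning and sampling can be sketched with the pandas functions cut and qcut and the sample method. The data and the seed below are arbitrary.

```python
import pandas as pd

ages = pd.Series([1, 5, 12, 18, 25, 40, 63, 90])

equal_width = pd.cut(ages, bins=3)   # three bins of equal length
equal_count = pd.qcut(ages, q=4)     # four bins holding equal numbers of items

# A random sample of three items, and a random permutation of all of them.
sample = ages.sample(n=3, random_state=0)
shuffled = ages.sample(frac=1, random_state=0)
```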
Series or DataFrame.
A hierarchical index or MultiIndex is like a tree with many branches,
organized into two or more levels.
Create a hierarchical index
with the methods from_arrays, from_product, or from_tuples.
Shift information back and forth between a hierarchical index and the
conventional rows and columns of a data structure
with the methods stack and unstack of a DataFrame or Series.
Hierarchical and partial indexing along different axes
(i.e., horizontal vs. vertical hierarchical indexing).
Swap the levels of a hierarchical index.
Select, sort, and summarize data based on a given level.
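The hierarchical-index operations above might be sketched as follows, with invented year and quarter labels:

```python
import pandas as pd

# Two equivalent ways to build the same two-level index.
index = pd.MultiIndex.from_product(
    [["2023", "2024"], ["Q1", "Q2"]], names=["year", "quarter"]
)
same_index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)

sales = pd.Series([10, 20, 30, 40], index=index)

one_year = sales.loc["2024"]       # partial indexing on the outer level
wide = sales.unstack("quarter")    # move the quarter level into the columns
long_again = wide.stack()          # and move it back into the rows

swapped = sales.swaplevel("year", "quarter").sort_index()
by_quarter = sales.groupby(level="quarter").sum()   # summarize by one level
```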
Series or DataFrame into groups.
Horizontal vs. vertical grouping with the groupby method
of a Series or DataFrame.
Specify which row (or column) belongs to which group by using the items
in the index or values of the Series or DataFrame,
or by using values taken from a completely different data structure.
Iterate with a for loop through the resulting groups.
Or apply a function to each group separately,
and recombine the results using the
“split-apply-combine” paradigm.
If a group is a DataFrame,
the function may be applied to the entire group,
or separately to each row or column of the group.
For each group,
the applied function may return a single number,
a Series, a DataFrame,
or some other data structure.
The recombined results
may therefore have a hierarchical index for their rows and/or columns,
which can then be unstacked for clarity.
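A compact sketch of split-apply-combine, using a hypothetical table of teams and points and a hypothetical helper function low_high:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [1, 3, 2, 4, 6],
})

# Split: iterate through the groups with a for loop (here, a comprehension).
sizes = {name: len(group) for name, group in df.groupby("team")}

# Apply a function that returns a single number per group, then combine.
means = df.groupby("team")["points"].mean()


def low_high(points):
    """Hypothetical helper that returns a Series for each group."""
    return pd.Series({"low": points.min(), "high": points.max()})


# Each group yields a Series, so the combined result has a two-level index.
summary = df.groupby("team")["points"].apply(low_high)

# Unstack the inner level into columns for a clearer table.
table = summary.unstack()
```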
Explore some of the ways of dividing the data into groups or bins,
including by quantile,
or by a Categorical object returned by the cut function,
or by a sequence of values
that would be called “enumerations” in other languages.
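Grouping by bins might look like the sketch below. The bin edges, labels, and data are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [5, 15, 25, 35, 45, 70],
    "spend": [1, 2, 3, 4, 5, 6],
})

# Group by labeled bins with explicit edges.
bands = pd.cut(df["age"], bins=[0, 18, 65, 120],
               labels=["child", "adult", "senior"])
by_band = df.groupby(bands, observed=True)["spend"].sum()

# Group by quantile: two bins holding equal numbers of rows.
halves = pd.qcut(df["age"], q=2, labels=["younger", "older"])
by_half = df.groupby(halves, observed=True)["spend"].mean()
```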
datetime.datetime offers microsecond precision;
the pandas datatypes pd.Timestamp and pd.Period
offer nanosecond precision plus awareness of timezones and daylight saving
time.
The intervals in a time series may be fixed (daily, weekly, quarterly)
or variable (specified as part of the dataset,
or by a rule such as “the last business day of each month”).
Create a fixed-interval time series
by specifying a starting point, frequency, and ending point,
or by specifying a starting point, frequency, and number of intervals.
Store the resulting series
as a column or row in a Series or DataFrame,
or as a pandas DatetimeIndex.
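The ways of specifying a time series described above might be sketched as follows; the dates are arbitrary.

```python
import pandas as pd

# Starting point, frequency, and ending point.
by_end = pd.date_range(start="2024-01-01", end="2024-01-10", freq="D")

# Starting point, frequency, and number of intervals (weekly, on Mondays).
by_count = pd.date_range(start="2024-01-01", periods=4, freq="W-MON")

# A rule-based series: the last business day of each month.
month_ends = pd.date_range(start="2024-01-01", periods=3,
                           freq=pd.offsets.BMonthEnd())

# Use the dates as the DatetimeIndex of a Series.
weekly = pd.Series(range(4), index=by_count)
```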
Shift a time series forward or backward
(lead or lag).
Resample a time series to convert it from one frequency to another
(e.g.,
downsample and aggregate monthly to quarterly,
or upsample and interpolate quarterly to monthly).
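Shifting and resampling might be sketched like this. The month-start ("MS") and quarter-start ("QS") frequencies and the data are choices made for this illustration.

```python
import pandas as pd

monthly = pd.Series(
    range(1, 13),
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)

lagged = monthly.shift(1)    # each value moves one month later
led = monthly.shift(-1)      # each value moves one month earlier

# Downsample and aggregate: monthly values summed into quarters.
quarterly = monthly.resample("QS").sum()

# Upsample and interpolate: quarterly values spread across the months.
sparse = pd.Series(
    [10.0, 40.0],
    index=pd.date_range("2024-01-01", periods=2, freq="QS"),
)
filled = sparse.resample("MS").asfreq().interpolate()
```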
Convert between Timestamps and Periods
(closed on the left, closed on the right).
Apply a moving or rolling window function
to your data;
use an exponentially weighted function to give more weight to
the most recent observations.
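A small sketch of a rolling window and an exponentially weighted average; the window size and smoothing factor are arbitrary.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling mean over a window of three observations;
# the first two results are NaN because the window is not yet full.
moving = s.rolling(window=3).mean()

# Exponentially weighted mean: each new observation gets weight alpha,
# and the running average so far gets weight 1 - alpha.
weighted = s.ewm(alpha=0.5, adjust=False).mean()
```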
Emphasize the similarity between groupby, resample, and rolling.
Read and write dates and times in customizable formats.
Categorical objects.
Create pivot tables as a special case of grouping;
create cross tabulations as a special case of pivot tables.
Create indicator matrices with the get_dummies function
(they look like diagrams of guitar chord tablatures).
Pivot between “long” and “wide” formats for data.
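These reshaping tools might be sketched as follows, on an invented table of sales by city and year:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [10, 20, 30, 40],
})

# Pivot table: one row per city, one column per year ("wide" format).
wide = df.pivot_table(index="city", columns="year",
                      values="sales", aggfunc="sum")

# Cross tabulation: how many rows fall into each city/year combination.
counts = pd.crosstab(df["city"], df["year"])

# Indicator matrix: one column per city, marking the rows that belong to it.
dummies = pd.get_dummies(df["city"])

# Back to "long" format: one row per city/year pair.
long = wide.reset_index().melt(id_vars="city", var_name="year",
                               value_name="sales")
```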
“Broadcast” values to perform arithmetic
with two data structures of different sizes and shapes.
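Broadcasting can be illustrated with a short sketch, using made-up arrays of different shapes:

```python
import numpy as np
import pandas as pd

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])       # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])         # shape (3,)
column = np.array([[100.0], [200.0]])      # shape (2, 1)

plus_row = matrix + row         # row is repeated for every row of matrix
plus_column = matrix + column   # column is repeated for every column of matrix

# In pandas, a Series is matched to the column labels of a DataFrame
# and then broadcast down the rows.
df = pd.DataFrame(matrix, columns=["a", "b", "c"])
centered = df - df.mean()
```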
Mark Meretzky
mark.meretzky@gmail.com
http://oit2.scps.nyu.edu/~meretzkm/
The homework will consist of one or two Python programs per week,
for ten weeks.
Each program will download information from a website or server
and store it in a pandas DataFrame.
The programs can be run on Mac, PC, or Linux,
using either native Python or Jupyter notebooks.
Grades will be based entirely on the homework:
there will be no tests.
Nowadays all the information you need is online and free, so there is no textbook for this course. The following documentation includes tutorials and references.
https://docs.python.org/
https://numpy.org/devdocs/
https://pandas.pydata.org/pandas-docs/stable/
https://opendata.cityofnewyork.us/
If you would still like to buy a textbook, I recommend O’Reilly’s Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed. (2017) by Wes McKinney, the creator of pandas. See the errata.