Course proposal: Python and pandas

Catalog description

The pandas and NumPy libraries for the Python language provide data structures and data analysis tools for financial and scientific computing. With these libraries, Python code uses far less time and memory and becomes simpler to write and maintain. This course—for students who have successfully completed Introduction to Python Programming INFO1-CE9990 or the equivalent—is designed to prepare you for a position as a pandas developer and data scientist.

Learn to create and manipulate the three major data structures of NumPy and pandas: the ndarray, Series, and DataFrame. Read, write, and compress these structures using file formats including plain text, CSV, JSON, XML, HTML tables, and binary. Perform arithmetic and other vectorized operations along the rows and columns of data without having to code a Python for loop. Sort, search, merge, and summarize your tables of data. Practice reshaping and transforming the pandas data structures, create pivot tables and cross tabulations, and reformat and redistribute the information in flexible ways. Index your data by a fixed or variable time series. Add dimensions to your data via the hierarchical indexing of rows and columns. Use the “split-apply-combine” paradigm to divide the data into groups, process each group separately, and recombine the results. Learn to format string and numeric data; display the results as text and/or graphics. Copious in-class examples are drawn from online datasets such as NYC OpenData.
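
As a small taste of the vectorized operations and the split-apply-combine paradigm described above, here is a minimal sketch. The column names and values are invented for illustration and are not taken from any course dataset.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Bronx", "Queens"],
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1, 2, 3, 4],
})

# Vectorized arithmetic: multiply two whole columns without a for loop.
df["total"] = df["price"] * df["quantity"]

# Split-apply-combine: split the rows by borough, sum each group,
# and recombine the sums into a single Series indexed by borough.
totals = df.groupby("borough")["total"].sum()
print(totals)
```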

Further information:
http://oit2.scps.nyu.edu/~meretzkm/pandas/

You’ll walk away with …

Ideal for …

Prerequisites

This Python course is ideal for those who have already completed Introduction to Python Programming/INFO1-CE9990 or the equivalent. Participants should know how to iterate through the native Python data structures list, tuple, set, and dictionary, and how to create new ones using Python comprehensions. [The following note should not be included in the catalog description: Participants would not need to know how to create Python classes, objects, iterators, and generators.]
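
As a rough self-check of the expected level, a participant should be comfortable reading a snippet like the following. The variable names and values are hypothetical and serve only as an illustration.

```python
# Hypothetical sample values, for illustration only.
temperatures = [72, 65, 80, 65]               # list
point = (40.7, -74.0)                         # tuple
unique = {t for t in temperatures}            # set comprehension
squares = {t: t * t for t in unique}          # dictionary comprehension
warm = [t for t in temperatures if t >= 70]   # list comprehension

# Iterate through a dictionary in sorted key order.
for key, value in sorted(squares.items()):
    print(key, value)

# Iterate through a tuple.
for coordinate in point:
    print(coordinate)
```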

Course outline

  1. From Python to pandas: heterogeneous vs. homogeneous containers. Heterogeneous containers such as the Python list and tuple may contain values of different datatypes, offering convenient flexibility. Homogeneous containers such as the Python Standard Library array.array and the NumPy ndarray must contain values of the same datatype, offering savings in space and time. Install NumPy and pandas on your Mac, PC, or Linux machine, and measure the memory and speed tradeoffs between these two kinds of containers.
  2. The n-dimensional array ndarray in NumPy. An ndarray will be important because the data in each pandas Series is held in an ndarray. Create and initialize an ndarray with one or more axes (dimensions). Survey the available dtypes (types of data) that can be stored in the ndarray, and choose the correct size of int and float data for your application. Use the placeholder value np.nan (“not a number”) for missing information. Perform vectorized arithmetic on an ndarray without writing Python for loops, and apply ufuncs (universal functions) to the data in the rows and columns by specifying the appropriate axis. Join, concatenate, and merge two ndarrays; form unions and intersections of the data. Sort and reorder the rows and columns; reshape and transpose the ndarray for insertion into a pandas Series or DataFrame. Select and access the individual items, rows, and columns of an ndarray by indexing and slicing; learn the difference between a copy and a view. Format an ndarray as a Python string, display it as graphics, or save and compress it in a file.
  3. The one-dimensional Series in pandas. A Series will be important because each column of a DataFrame is a Series. A Series is like an ordered dictionary. The keys of the Series (not necessarily unique) are called the index; the values of the Series are stored in a one-dimensional NumPy ndarray. Create and initialize a Series from a Python data structure or from a NumPy ndarray. Give the default index to the Series or create your own index. Perform vectorized arithmetic and other operations on one or two Series objects. Join two Series objects that have the same index or different indexes. Select one or more rows from a Series using indexing and slicing, sort the rows by index or by value, and write the contents of the Series to a string or to a file.
  4. The two-dimensional DataFrame in pandas. A DataFrame is important because it is the principal pandas data structure. It’s like a spreadsheet consisting of rows and columns. Each column is a named Series, and all the columns share the same index. Create a DataFrame and initialize it from a NumPy ndarray, a pandas Series, or a Python data structure, rearranging the latter if necessary to get it into DataFrame format. Index a DataFrame with loc vs. iloc and take. Perform elementary operations on a DataFrame such as slicing and selecting rows and columns. Combine two DataFrames with a database-style “join” operation. Copy information between a DataFrame and a Python data structure such as a list of lists.
  5. Read a DataFrame from a file on your local machine, or from a website or server. Investigate the most common file formats including plain text, CSV (comma-separated values), JSON (JavaScript Object Notation), XML (Extensible Markup Language), HTML tables, and various “serialized” or “marshalled” binary formats. Designate a column of the incoming data to act as the index. Perform type inference and data conversion; eliminate or fill in missing or redundant data. Read and process a large file in separate pieces. When you’re done with the DataFrame, write it back to a file. Interact with web APIs and databases.
  6. Clean and prepare the data. When reading a large dataset from a file or server, some of the material can arrive damaged or missing. Filter out the damaged items or records, or replace them with default values such as the average of the data that is intact. Weed out duplicates, detect outliers and clamp them to a sane range of values, and rename the rows and columns. Transform the data with a function, a Python dictionary, or a pandas Series. Separate continuous numeric data into bins, either of equal length or containing equal numbers of items. Select random samples or permutations of the data.
  7. Hierarchical indexing adds extra dimensions to a Series or DataFrame. A hierarchical index or MultiIndex is like a tree with many branches, organized into two or more levels. Create a hierarchical index with the MultiIndex methods from_arrays, from_product, or from_tuples. Shift information back and forth between a hierarchical index and the conventional rows and columns of a data structure with the methods stack and unstack of a DataFrame or Series. Apply hierarchical and partial indexing along different axes (i.e., horizontal vs. vertical hierarchical indexing). Swap the levels of a hierarchical index. Select, sort, and summarize data based on a given level.
  8. Divide the rows or columns of a Series or DataFrame into groups. Compare horizontal and vertical grouping with the groupby method of a Series or DataFrame. Specify which row (or column) belongs to which group by using the items in the index or values of the Series or DataFrame, or by using values taken from a completely different data structure. Iterate with a for loop through the resulting groups. Or apply a function to each group separately, and recombine the results using the “split-apply-combine” paradigm. If a group is a DataFrame, the function may be applied to the entire group, or separately to each row or column of the group. For each group, the applied function may return a single number, a Series, a DataFrame, or some other data structure. The recombined results may therefore have a hierarchical index for their rows and/or columns, which can then be unstacked for clarity. Explore some of the ways of dividing the data into groups or bins, including by quantile, or by a Categorical object returned by the cut function, or by a sequence of values that would be called “enumerations” in other languages.
  9. Index the rows or columns by a series of times or periods. The Python datatype datetime.datetime offers microsecond precision; the pandas datatypes pd.Timestamp and pd.Period offer nanosecond precision plus awareness of timezones and daylight saving time. The intervals in a time series may be fixed (daily, weekly, quarterly) or variable (specified as part of the dataset, or by a rule such as “the last business day of each month”). Create a fixed-interval time series by specifying a starting point, frequency, and ending point, or by specifying a starting point, frequency, and number of intervals. Store the resulting series as a column or row in a Series or DataFrame, or as a pandas DatetimeIndex. Shift a time series forward or backward (lead or lag). Resample a time series to convert it from one frequency to another (e.g., downsample and aggregate monthly to quarterly, or upsample and interpolate quarterly to monthly). Convert between Timestamps and Periods (closed on the left, closed on the right). Apply a moving or rolling window function to your data; use an exponentially weighted function to give more weight to the most recent observations. Emphasize the similarity between groupby, resample, and rolling. Read and write dates and times in customizable formats.
  10. Advanced data wrangling. Pandas not only offers savings in space and time, but also allows you to restructure and reshape your data with very little code. Save space with dimension tables; save time with Categorical objects. Create pivot tables as a special case of grouping; create cross tabulations as a special case of pivot tables. Create indicator matrices with the get_dummies function (they look like diagrams of guitar chord tablatures). Pivot between “long” and “wide” formats for data. “Broadcast” values to perform arithmetic with two data structures of different sizes and shapes.
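
To give a flavor of the outline above, here is a minimal sketch that touches units 1, 2, 3, 9, and 10. All of the data is invented for illustration, and the exact byte count for the Python list depends on the Python version, so only the ndarray size is stated in a comment.

```python
import sys

import numpy as np
import pandas as pd

# Unit 1: a heterogeneous list vs. a homogeneous ndarray of the same integers.
values = list(range(1000))
array = np.arange(1000, dtype=np.int64)
# The list's cost includes the list object plus each separate int object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes, array.nbytes)  # array.nbytes is 1000 * 8 = 8000

# Unit 2: np.nan marks missing data; reduce along an axis without a for loop.
grid = np.array([[1.0, 2.0], [3.0, np.nan]])
column_sums = np.nansum(grid, axis=0)  # sums down each column, ignoring nan

# Unit 3: arithmetic on two Series aligns them by index, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
aligned = a + b  # "x" has no partner in b, so its result is NaN

# Unit 9: a fixed-frequency time series (start, frequency, number of
# intervals), downsampled from daily to monthly by summing.
days = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.Series(1.0, index=days)
monthly = daily.resample("MS").sum()  # "MS" labels each bin by month start

# Unit 10: an indicator matrix with one column per distinct value.
dummies = pd.get_dummies(pd.Series(["a", "b", "a"]))

print(column_sums, aligned, monthly, dummies, sep="\n")
```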

Teacher

Mark Meretzky
mark.meretzky@gmail.com
http://oit2.scps.nyu.edu/~meretzkm/

Homework

The homework will consist of one or two Python programs per week, for ten weeks. Each program will download information from a website or server and store it in a pandas DataFrame. The programs can be run on Mac, PC, or Linux, using either native Python or Jupyter notebooks. Grades will be based entirely on the homework; there will be no tests.
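
A typical homework program might have roughly the following shape. This is only a sketch: the CSV text is a hypothetical stand-in for data fetched from a server, so that the example runs offline, and the column names are invented. In an actual assignment, pd.read_csv can be given a URL in place of the file-like object.

```python
import io

import pandas as pd

# Hypothetical stand-in for text downloaded from a website or server.
csv_text = """id,borough,complaints
1,Bronx,12
2,Queens,7
3,Bronx,5
"""

# Read the data into a DataFrame, using the "id" column as the index.
df = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Inspect the inferred column types, then summarize by borough.
print(df.dtypes)
print(df.groupby("borough")["complaints"].sum())
```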

Bibliography

Nowadays all the information you need is online and free, so there is no textbook for this course. The following documentation includes tutorials and references.

  1. The Python language:
    https://docs.python.org/
  2. The NumPy library:
    https://numpy.org/devdocs/
  3. The pandas library:
    https://pandas.pydata.org/pandas-docs/stable/
  4. Many of our online datasets will come from NYC OpenData:
    https://opendata.cityofnewyork.us/

For students who would like to buy a textbook, I recommend O’Reilly’s Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd ed. (2017) by Wes McKinney, the creator of pandas. See the errata.