pandas course
http://oit2.scps.nyu.edu/~meretzkm/pandas/
mark.meretzky@gmail.com

  1. Outline for the pandas course.
  2. The pandas course is intended for students who have already taken the Python course or the equivalent.

Bibliography

Nowadays all the information you need is online and free, so there is no textbook for this course. The following documentation includes tutorials and references.

  1. The Python language:
    https://docs.python.org/
  2. The NumPy library:
    https://numpy.org/devdocs/
  3. The pandas library:
    https://pandas.pydata.org/pandas-docs/stable/
  4. The matplotlib library:
    https://matplotlib.org/contents.html
  5. Many of our online datasets will come from NYC OpenData:
    https://opendata.cityofnewyork.us/
    Start with the Department of Health restaurant inspection results and choose View data.

The course

  1. Teaser. Download a dataset into a pandas DataFrame. Don’t worry about cleaning the data.
    1. Baby names: a dataset of the 32,033 names shared by the 3,487,353 American babies, male and female, born in 2018.
    2. Restaurant: a dataset of 391,000 restaurant inspections by the NYC Department of Health.
  2. Install Python and pandas.
  3. Array. Heterogeneous vs. homogeneous containers: Python list vs. Python array.array. Compare space with sys.getsizeof; compare speed with timeit.repeat or with IPython %time.
  4. NumPy ndarray.
    1. ndarray. The dtypes: signed vs. unsigned int, different sizes of int and float. Single vs. multi-dimensional. Vectorized operations instead of Python for loops.
    2. arange: Python range function vs. NumPy np.arange function.
  5. Series. A pandas Series is a one-dimensional array of values (typically stored as an np.ndarray) with an index.
    1. Documentation about class Series.
    2. Install pandas.
    3. Create a Series object with the default index.
    4. Iterate through a Series using a for loop.
    5. Default. Examine the default index of a Series.
    6. Explicit. Give an explicit index to a Series.
    7. Replace the index of a Series. This will not change the order of the rows.
    8. Index and slice a Series.
    9. Reindex a Series. This could change the order of the rows.
    10. Satisfy. Select the rows of a Series that satisfy a condition.
    11. Data. Create a Series from data.
    12. Reductions and aggregations.
    13. Vectorized operations on a Series.
    14. Align the indices of two Series objects.
    15. Tall. A tall Series holding a column of strings from a real dataset.
    16. Baby names. A tall Series holding a column of numbers from a real dataset.
    17. Plot a Series using matplotlib.pyplot.
  6. DataFrame
    1. Create a DataFrame object with the default index.
    2. Iterate through the rows of a DataFrame with itertuples rather than with iterrows. Iterate through the columns with and without items.
    3. Read a numeric column using read_csv.
    4. Year of birth: download and proofread a simple DataFrame.
    5. Add a column to a DataFrame.
  7. reindex vs. reset_index vs. assign to index.
  8. Hierarchical index.
  9. The groupby method in pandas.
  10. A pivot table is a special case of groupby. See also pivot.
  11. Cross tabulation is a special case of pivot_table.
  12. Time series
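
Items 5 and 7 above distinguish three ways of changing an index: assigning to the index attribute, reindex, and reset_index. The following is a minimal sketch of the difference. The Series and its labels are invented for illustration and do not come from the course datasets.

```python
import pandas as pd

# Hypothetical toy Series, invented for illustration.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Assigning to the index attribute relabels the rows.
# The values stay in the same order.
relabeled = s.copy()
relabeled.index = ["x", "y", "z"]

# reindex looks rows up by their existing labels, so it can reorder them.
# A label that is not in the index ("d") gets NaN.
reordered = s.reindex(["c", "a", "d"])

# reset_index moves the old index into a column, installs the
# default integer index, and returns a DataFrame.
reset = s.reset_index()

print(relabeled)
print(reordered)
print(reset)
```

Note that reindex never changes which value belongs to which label, whereas assigning to index changes the labels and leaves the values where they are.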
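
Items 9 through 11 above say that a pivot table is a special case of groupby, and that cross tabulation is a special case of pivot_table. The sketch below tries to make that concrete by building the same table both ways. The column names borough, grade, and score and all the values are assumptions made up for this example, not the real restaurant dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
    "grade":   ["A", "B", "A", "A", "B"],
    "score":   [10, 20, 12, 8, 25],
})

# groupby: mean score for each (borough, grade) pair,
# then unstack to move grade from the row index into the columns.
by_group = df.groupby(["borough", "grade"])["score"].mean().unstack("grade")

# pivot_table: the same table in one call.
by_pivot = df.pivot_table(values="score", index="borough",
                          columns="grade", aggfunc="mean")

# crosstab: how many rows fall into each (borough, grade) pair.
counts_crosstab = pd.crosstab(df["borough"], df["grade"])

# pivot_table with aggfunc="count" produces the same counts.
counts_pivot = df.pivot_table(values="score", index="borough",
                              columns="grade", aggfunc="count", fill_value=0)

print(by_group)
print(by_pivot)
print(counts_crosstab)
print(counts_pivot)
```

In this sketch pivot_table saves the explicit unstack step, and crosstab saves having to name a values column and an aggregation function.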

To do

  1. Indexing and slicing by the job you want to get done.
    Select a row by label or position.
    Select several rows by label or position.
    Drop rows by label or position.
  2. Vectorized operations on strings: reduce each name to its initial, or capitalize it.
  3. Standalone example of unique, maybe Index.is_unique. Example of an index whose values are not unique.
  4. GroupBy.count vs. GroupBy.size.
  5. Standalone groupby example.
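
The to-do items above could be sketched roughly as follows. This is only a starting point under stated assumptions: the DataFrame, its row labels r1 through r4, and the column names name and births are all invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame(
    {"name": ["emma", "liam", "emma", "noah"],
     "births": [5.0, 7.0, None, 3.0]},     # one missing value on purpose
    index=["r1", "r2", "r3", "r4"],
)

# 1. Select and drop rows by label (loc, drop) or by position (iloc).
by_label = df.loc["r1"]            # one row, returned as a Series
by_position = df.iloc[0]           # the same row, chosen by position
several = df.iloc[[0, 3]]          # the rows at positions 0 and 3
dropped = df.drop(index="r2")      # every row except the one labeled "r2"

# 2. Vectorized string operations through the .str accessor.
initials = df["name"].str[0]
capitalized = df["name"].str.capitalize()

# 3. unique and Index.is_unique, with an index whose values repeat.
distinct_names = df["name"].unique()
dup = pd.Series([1, 2, 3], index=["a", "b", "b"])
print(dup.index.is_unique)         # False, because "b" appears twice
print(dup.loc["b"])                # both rows labeled "b"

# 4. size counts the rows in each group;
#    count counts only the non-missing values.
sizes = df.groupby("name").size()
counts = df.groupby("name")["births"].count()
print(sizes)
print(counts)
```

Here the emma group has two rows but only one non-missing births value, so size and count disagree for that group.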