The following program creates a Series of the baby boy names of 2018,
in order of decreasing popularity.
Names of equal popularity are in alphabetical order.
Names given to fewer than five babies are not listed.
""" Baby names example from Wes McKinney, "Python for Data Analysis", 2nd. ed., pp. 424-440. Social Security Administration, Office of the Chief Actuary http://www.ssa.gov/oact/babynames/limits.html Download and unzip the .zip file for National Data: https://www.ssa.gov/oact/babynames/names.zip I placed the resulting folder in my /Users/myname/Downloads folder. """ import sys import os import pandas as pd year = 2018 #most recent available filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt") #year of birth names = ["name", "sex", "births"] df = pd.read_csv(filename, names = names) df = df[df.sex == "M"] #Keep the male rows. df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True) index = pd.Index(data = df["name"], name = "name") series = pd.Series(data = df.births.array, index = index, name = "Number of Births") with pd.option_context("display.min_rows", 40): #context manager print(series) sys.exit(0)
name
Liam         19837
Noah         18267
William      14516
James        13525
Oliver       13389
Benjamin     13381
Elijah       12886
Lucas        12585
Mason        12435
Logan        12352
Alexander    11989
Ethan        11854
Jacob        11770
Michael      11620
Daniel       11173
Henry        10649
Jackson      10323
Sebastian    10054
Aiden         9979
Matthew       9924
             ...
Zien             5
Zier             5
Zierre           5
Zihir            5
Zim              5
Zin              5
Zishe            5
Zmari            5
Zoel             5
Zola             5
Zuber            5
Zubeyr           5
Zyell            5
Zyheem           5
Zykeem           5
Zylas            5
Zyran            5
Zyrie            5
Zyron            5
Zzyzx            5
Name: Number of Births, Length: 14004, dtype: int64
Do something interesting with this Series,
starting by verifying that the names in the index are unique.
(There should be only one row with each name.)
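One way to make this check is the is_unique attribute of the index. The following is only a sketch: it builds a five-row stand-in Series (rows copied from the output above) so that it can run without the data file. With the real data, the same two tests would be applied to the series created by the program above.

```python
import pandas as pd

#Stand-in for the real Series: five rows copied from the output above.
index = pd.Index(data = ["Liam", "Noah", "William", "Zyron", "Zzyzx"], name = "name")
series = pd.Series(data = [19837, 18267, 14516, 5, 5], index = index, name = "Number of Births")

#True if no name appears more than once in the index.
print(f"{series.index.is_unique = }")

#If the index were not unique, this would hold the repeated names.
duplicates = series.index[series.index.duplicated()]
print(f"{len(duplicates) = }")
```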
What is the total number of names?
What is the total number of baby boys?
How many names had 10,000 or more babies?
How many babies had names that were shared by 10,000 or more other babies?
What percent of the babies had names in the top 10?
What initial had the most names?
(It was A.)
What initial had the most babies?
How many names started with an A?
How many babies had names that started with an A?
What were the ten most popular names that started with A?
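Here is a possible starting point for these questions, not the only way to answer them. It is a sketch that runs on an eight-row stand-in Series (rows copied from the outputs on this page), so the numbers it prints are not the real answers. The variable names are my own, and n is 3 only because the stand-in is so small. I read "shared by 10,000 or more other babies" as "given to 10,000 or more babies".

```python
import pandas as pd

#Stand-in for the real Series: eight rows copied from the outputs on this page.
data = {
    "Liam": 19837, "Noah": 18267, "William": 14516, "Alexander": 11989,
    "Aiden": 9979, "Anthony": 8003, "Zyron": 5, "Zzyzx": 5
}
series = pd.Series(data = data, name = "Number of Births")
series.index.name = "name"

numberOfNames = len(series)      #total number of names
numberOfBabies = series.sum()    #total number of baby boys
print(f"{numberOfNames = }")
print(f"{numberOfBabies = }")

popular = series[series >= 10_000]      #names with 10,000 or more babies
numberOfPopularNames = len(popular)
numberOfPopularBabies = popular.sum()   #babies who had one of those names
print(f"{numberOfPopularNames = }")
print(f"{numberOfPopularBabies = }")

n = 3   #Would be 10 for the real data.
percent = 100 * series.nlargest(n).sum() / numberOfBabies
print(f"{percent:.1f}% of the babies had names in the top {n}.")

initials = series.index.str[0]   #first letter of each name
namesPerInitial = series.groupby(initials).count()
babiesPerInitial = series.groupby(initials).sum()
print(f"{namesPerInitial.idxmax() = }")    #initial with the most names
print(f"{babiesPerInitial.idxmax() = }")   #initial with the most babies

a = series[series.index.str.startswith("A")]   #the names that start with A
print(f"{len(a) = }")
print(f"{a.sum() = }")
print(a.nlargest(10))   #the (up to) ten most popular names that start with A
```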
import sys
import os
import pandas as pd

year = 2018
filename = os.path.expanduser(f"~/Downloads/names/yob{year}.txt")   #year of birth
names = ["name", "sex", "births"]
df = pd.read_csv(filename, names = names)

df = df[df.sex == "M"]
df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True)

series = df["births"]
series.index = df["name"]
series.name = "Number of births for each name"
series.index.name = "name"

groups = series.groupby(series.index.str[0])

#Initials in alphabetical order
numberOfNames = groups.count()
numberOfNames.name = "Number of names for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfNames)
print()

#The same Series, in descending numerical order.
#print(numberOfNames.sort_values(ascending = False))
#print()

numberOfBabies = groups.sum()
numberOfBabies.name = "Number of babies for each initial"
with pd.option_context("display.max_rows", 6):
    print(numberOfBabies)
print()

#The same Series, in descending numerical order.
#print(numberOfBabies.sort_values(ascending = False))
#print()

numberOfBabies = groups.apply(lambda group: group.sort_values(ascending = False)[:3])
numberOfBabies.index.names = ["initial", "name"]
numberOfBabies.name = "Three most popular names for each initial"
with pd.option_context("display.max_rows", 3 * 26):
    print(numberOfBabies)

sys.exit(0)
name
A    1569
B     668
C     712
     ...
X      60
Y     259
Z     402
Name: Number of names for each initial, Length: 26, dtype: int64

name
A    180564
B     91509
C    135023
      ...
X      7864
Y      8163
Z     26063
Name: Number of babies for each initial, Length: 26, dtype: int64

initial  name
A        Alexander      11989
         Aiden           9979
         Anthony         8003
B        Benjamin       13381
         Brayden         4383
         Bryson          4194
C        Carter          9312
         Christopher     7261
         Caleb           6929
D        Daniel         11173
         David           9697
         Dylan           8549
E        Elijah         12886
         Ethan          11854
         Eli             6027
F        Finn            2316
         Felix           1638
         Finley          1280
G        Grayson         8538
         Gabriel         8335
         Greyson         4728
H        Henry          10649
         Hudson          6540
         Hunter          6066
I        Isaac           8417
         Isaiah          6614
         Ian             4675
J        James          13525
         Jacob          11770
         Jackson        10323
K        Kayden          3972
         Kai             3421
         Kingston        3330
L        Liam           19837
         Lucas          12585
         Logan          12352
M        Mason          12435
         Michael        11620
         Matthew         9924
N        Noah           18267
         Nathan          6790
         Nolan           5607
O        Oliver         13389
         Owen            9288
         Oscar           1945
P        Parker          3978
         Patrick         2111
         Preston         1954
Q        Quinn            828
         Quentin          511
         Quinton          456
R        Ryan            6905
         Robert          5140
         Roman           4364
S        Sebastian      10054
         Samuel          9734
         Santiago        4647
T        Theodore        7020
         Thomas          6779
         Tyler           3298
U        Uriel            580
         Uriah            461
         Ulises           236
V        Vincent         3552
         Victor          2213
         Valentino        396
W        William        14516
         Wyatt           9127
         Weston          3760
X        Xavier          4298
         Xander          2257
         Xzavier          253
Y        Yusuf            485
         Yosef            328
         Yousef           285
Z        Zachary         3528
         Zion            2153
         Zayden          2126
Name: Three most popular names for each initial, dtype: int64
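The "top few names for each initial" step can also be written without apply and a lambda, by calling the nlargest method of the groupby object. This is an alternative sketch, not the method used in the program above. It runs on a six-row stand-in Series (rows copied from the output above) and keeps the top two names for each initial, because the stand-in has only three names per initial.

```python
import pandas as pd

#Stand-in for the real Series: six rows copied from the output above.
data = {
    "Benjamin": 13381, "Alexander": 11989, "Aiden": 9979,
    "Anthony": 8003, "Brayden": 4383, "Bryson": 4194
}
series = pd.Series(data = data, name = "Number of births for each name")
series.index.name = "name"

groups = series.groupby(series.index.str[0])   #one group for each initial

#The two largest values in each group.  (Would be 3 for the real data.)
top = groups.nlargest(2)
top.index.names = ["initial", "name"]
top.name = "Two most popular names for each initial"
print(top)
```

The result has the same two-level index (initial, name) as the Series printed by the program above.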
"Divide the frequencies into bins." import sys import os import numpy as np import pandas as pd year = 2018 #most recent available filename = os.path.expanduser(f"~/python/names/yob{year}.txt") #year of birth names = ["name", "sex", "births"] df = pd.read_csv(filename, names = names) df = df[df.sex == "M"] #Keep the male rows. df.sort_values(ascending = [False, True], by = ["births", "name"], inplace = True) index = pd.Index(data = df["name"], name = "name") series = pd.Series(data = df.births.array, index = index, name = "Number of Births") with pd.option_context("display.min_rows", 10): #context manager print(series) print() bins = np.arange(0, 21_000, 1_000) #smallest int in each of the 20 categories seriesOfBins = pd.cut(series, bins = bins, right = False) #first bin includes 0 but not 1_000 seriesOfBins.name = "Which bin does each name belong to?" #Examine the dtype of the seriesOfBins. dtype = seriesOfBins.dtype print(f"{dtype.name = }") print(f"{dtype.ordered = }") print(f"{type(dtype.categories) = }") print(f"{dtype.categories.closed = }") print(f"{len(dtype.categories) = }") print() for category in dtype.categories: print(category) print() print(seriesOfBins) print() seriesOfCounts = seriesOfBins.value_counts(sort = False) seriesOfCounts.name = "Number of names in each bin" print(seriesOfCounts) sys.exit(0)
name
Liam       19837
Noah       18267
William    14516
James      13525
Oliver     13389
           ...
Zylas          5
Zyran          5
Zyrie          5
Zyron          5
Zzyzx          5
Name: Number of Births, Length: 14004, dtype: int64

dtype.name = 'category'
dtype.ordered = True
type(dtype.categories) = <class 'pandas.core.indexes.interval.IntervalIndex'>
dtype.categories.closed = 'left'
len(dtype.categories) = 20

[0, 1000)
[1000, 2000)
[2000, 3000)
[3000, 4000)
[4000, 5000)
[5000, 6000)
[6000, 7000)
[7000, 8000)
[8000, 9000)
[9000, 10000)
[10000, 11000)
[11000, 12000)
[12000, 13000)
[13000, 14000)
[14000, 15000)
[15000, 16000)
[16000, 17000)
[17000, 18000)
[18000, 19000)
[19000, 20000)

name
Liam       [19000, 20000)
Noah       [18000, 19000)
William    [14000, 15000)
James      [13000, 14000)
Oliver     [13000, 14000)
                ...
Zylas           [0, 1000)
Zyran           [0, 1000)
Zyrie           [0, 1000)
Zyron           [0, 1000)
Zzyzx           [0, 1000)
Name: Which bin does each name belong to?, Length: 14004, dtype: category
Categories (20, interval[int64]): [[0, 1000) < [1000, 2000) < [2000, 3000) < [3000, 4000) < ... < [16000, 17000) < [17000, 18000) < [18000, 19000) < [19000, 20000)]

[0, 1000)         13672
[1000, 2000)        131
[2000, 3000)         72
[3000, 4000)         34
[4000, 5000)         22
[5000, 6000)         14
[6000, 7000)         15
[7000, 8000)          6
[8000, 9000)         11
[9000, 10000)         9
[10000, 11000)        3
[11000, 12000)        5
[12000, 13000)        4
[13000, 14000)        3
[14000, 15000)        1
[15000, 16000)        0
[16000, 17000)        0
[17000, 18000)        0
[18000, 19000)        1
[19000, 20000)        1
Name: Number of names in each bin, dtype: int64
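To see what right = False does at the edges of the bins, it may help to try pd.cut on a few hand-picked numbers. In this sketch, 5 and 19837 come from the output above. The values 999, 1000, and 20000 are hypothetical, chosen only because they sit on or next to a bin edge.

```python
import numpy as np
import pandas as pd

#5 and 19_837 are from the output above; the other values are hypothetical edge cases.
births = pd.Series(data = [5, 999, 1_000, 19_837, 20_000])

bins = np.arange(0, 21_000, 1_000)   #21 edges, which bound 20 bins
seriesOfBins = pd.cut(births, bins = bins, right = False)   #Each bin is closed on the left, open on the right.
print(seriesOfBins)
print()

#999 is in the bin that ends at 1000, but 1000 itself starts the next bin.
print(f"{seriesOfBins.iloc[1] = }")
print(f"{seriesOfBins.iloc[2] = }")

#20000 is the open right edge of the last bin, so it falls in no bin at all.
print(f"{seriesOfBins.isna().sum() = }")
```

A value that falls outside every bin becomes NaN instead of raising an exception, so it is worth checking that the largest edge is greater than the largest value in the Series.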