read_csv usually reads from a file or URL. But the read_csv in this program reads from a str to make it easy to experiment with different inputs. The str is more than one line, so we write it as a triple-quoted string literal. Since the values in the second column are integers (1, 11, 21, 31, 41), read_csv deduces that the dtype of this column is np.int64.
"Create a pd.DataFrame from a string that looks like a two-column CSV file."

import sys
import io
import pandas as pd

#a six-line string
s = """\
temp,humidity
0,1
10,11
20,21
30,31
40,41"""

infile = io.StringIO(s)
df = pd.read_csv(infile)   #The argument of read_csv is usually a filename or URL.
df.index.name = "day"
print(df)
print()

print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)
     temp  humidity
day
0       0         1
1      10        11
2      20        21
3      30        31
4      40        41

The dtypes of the 2 columns are:
temp        int64
humidity    int64
dtype: object
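To see why read_csv is willing to read from the io.StringIO object, it may help to check that this object behaves like an open text file. The following is only a minimal sketch; the names n_lines and first_line are ours and do not appear in the program above.

```python
import io

# The same six-line string as in the program above.
s = """\
temp,humidity
0,1
10,11
20,21
30,31
40,41"""

# The backslash right after the opening quotes keeps the string
# from starting with an empty line.
n_lines = len(s.splitlines())

infile = io.StringIO(s)          # an in-memory object with the methods of a text file
first_line = infile.readline()   # read one line, as we could from a file on disk

print(n_lines)
print(repr(first_line))
```

Because read_csv only needs something it can read text from, the same call should work unchanged when infile is replaced by a filename or URL.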
If one or more values are missing, or are written as NA (“not available”) or NULL, as in each of the following four strings,

s = """\
temp,humidity
0,1
10,11
20,
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20,NA
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20,NULL
30,31
40,41"""

the corresponding value in the DataFrame is np.nan (“not a number”), which is of type float and prints as NaN. The dtype of the column therefore becomes np.float64.
     temp  humidity
day
0       0       1.0
1      10      11.0
2      20       NaN
3      30      31.0
4      40      41.0

The dtypes of the 2 columns are:
temp          int64
humidity    float64
dtype: object
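As a quick check of the claim above, the following sketch reads all four variants in a loop and records the dtype of the humidity column and whether row 2 is NaN. The names header, footer, third_rows, and results are ours, chosen for illustration.

```python
import io
import numpy as np
import pandas as pd

header = "temp,humidity\n0,1\n10,11\n"
footer = "\n30,31\n40,41"

# Four ways of writing the third row with no usable humidity.
third_rows = ["20,", "20", "20,NA", "20,NULL"]

results = {}
for row in third_rows:
    df = pd.read_csv(io.StringIO(header + row + footer))
    is_nan = bool(np.isnan(df.loc[2, "humidity"]))
    results[row] = (df["humidity"].dtype, is_nan)

for row, (dtype, is_nan) in results.items():
    print(f"{row!r:10} dtype = {dtype}, row 2 is NaN: {is_nan}")
```

All four variants should report a float dtype, since an integer column has no way to hold np.nan.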
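Suppose we accidentally typed 2l, with a lowercase letter L, instead of 21. read_csv can no longer treat the column as numeric, and the dtype of the column becomes object.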
s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""
     temp humidity
day
0       0        1
1      10       11
2      20       2l
3      30       31
4      40       41

The dtypes of the 2 columns are:
temp         int64
humidity    object
dtype: object
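In a longer file it can be hard to spot which row caused the column to lose its numeric dtype. One possible way to find it, sketched below, is pd.to_numeric with errors = "coerce", which turns every value that cannot be parsed as a number into NaN. The names as_numbers and bad_rows are ours.

```python
import io
import pandas as pd

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

df = pd.read_csv(io.StringIO(s))

# Values that cannot be parsed as numbers become NaN,
# so the NaNs mark the rows that contain typos.
as_numbers = pd.to_numeric(df["humidity"], errors = "coerce")
bad_rows = df[as_numbers.isna()]

print(bad_rows)
```

Note that this approach would also flag values that were genuinely missing in the input, not only typos.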
"Create a pd.DataFrame from a string that looks like a two-column CSV file."
import sys
import io
import numpy as np
import pandas as pd
#a six-line string
s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""
def converter(humidity):   #Convert a string into a float.
    try:
        f = np.float64(humidity)
    except ValueError:
        return np.nan   #humidity was not a valid number
    else:
        return f        #humidity was a valid number

#Apply converter to every value in the humidity column.
converters = {
    "humidity": converter
}
infile = io.StringIO(s)
df = pd.read_csv(infile, converters = converters)
df.index.name = "day"
print(df)
print()
print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)
     temp  humidity
day
0       0       1.0
1      10      11.0
2      20       NaN
3      30      31.0
4      40      41.0

The dtypes of the 2 columns are:
temp          int64
humidity    float64
dtype: object
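read_csv hands each value of the column to the converter as a str, so the converter can also be tried on its own, without a DataFrame. The sketch below uses a slightly shortened version of the same function; the sample inputs are ours.

```python
import math
import numpy as np

def converter(humidity):   # same logic as the converter in the program above
    try:
        return np.float64(humidity)
    except ValueError:
        return np.nan      # humidity was not a valid number

# A valid number, the typo, and an empty field.
for text in ["11", "2l", ""]:
    print(repr(text), "->", converter(text))
```

The empty string is included because a missing field would also reach the converter as a str and end up as np.nan.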
df now has np.nan in every place where there was no valid number. Let’s delete any row that has an np.nan.
seriesOfBools = df["humidity"].isnull()
print(f"The following {seriesOfBools.sum()} row(s) have no valid humidity:")
print(df[seriesOfBools])
print()

df.dropna(how = "any", inplace = True)   #or how = "all"
print(df)
The following 1 row(s) have no valid humidity:
     temp  humidity
day
2      20       NaN

     temp  humidity
day
0       0       1.0
1      10      11.0
3      30      31.0
4      40      41.0
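The difference between how = "any" and how = "all" does not show up in the example above, because only one column has a NaN. The sketch below uses a small hypothetical DataFrame named demo, which is ours and not part of the program, to contrast the two, and also shows the subset argument of dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 2 is partly missing.
demo = pd.DataFrame({
    "temp":     [0.0, np.nan, 20.0],
    "humidity": [1.0, np.nan, np.nan]
})

dropped_any = demo.dropna(how = "any")          # drop a row if at least one value is NaN
dropped_all = demo.dropna(how = "all")          # drop a row only if every value is NaN
dropped_sub = demo.dropna(subset = ["temp"])    # consider the temp column only

print(list(dropped_any.index))
print(list(dropped_all.index))
print(list(dropped_sub.index))
```

Without inplace = True, dropna returns a new DataFrame and leaves demo unchanged, which makes it easy to compare the results side by side.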
Instead of deleting rows, we could replace every absent humidity with a fixed value such as 0, or with the mean of the present humidities.
value = {
    "temp": 32,
    "humidity": 0   #or "humidity": df["humidity"].mean(skipna = True)
}

df.fillna(value = value, inplace = True)
print(df)
     temp  humidity
day
0       0       1.0
1      10      11.0
2      20       0.0
3      30      31.0
4      40      41.0
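To make the mean alternative concrete, here is a minimal sketch on the humidity values alone. It works on a standalone pd.Series that we build by hand, rather than on the df of the program.

```python
import numpy as np
import pandas as pd

humidity = pd.Series([1.0, 11.0, np.nan, 31.0, 41.0], name = "humidity")

# (1 + 11 + 31 + 41) / 4: the NaN is left out of both the sum and the count.
mean = humidity.mean(skipna = True)

# fillna returns a new Series here; humidity itself is unchanged.
filled = humidity.fillna(mean)

print(mean)
print(filled.tolist())
```

With these values the mean happens to equal the 21 that the typo 2l was meant to be, but that is a coincidence of this evenly spaced data.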
"Create a pd.DataFrame from a string that looks like a two-column CSV file."
import sys
import io
import numpy as np
import pandas as pd
#a six-line string
s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""
def converter(humidity):   #Convert a string into a float.
    try:
        f = np.float64(humidity)
    except ValueError:
        return np.nan   #humidity was not a valid number
    else:
        return f        #humidity was a valid number

#Apply converter to every value in the humidity column.
converters = {
    "humidity": converter
}

infile = io.StringIO(s)
df = pd.read_csv(infile, converters = converters)
df.index.name = "day"

#Replace each np.nan with the value given for its column.
value = {
    "temp": 32,
    "humidity": df["humidity"].mean(skipna = True)
}
df.fillna(value = value, inplace = True)
print(f"{sys.getsizeof(df) = :,}")
df["humidity"] = df["humidity"].astype(np.int16)
print(f"{sys.getsizeof(df) = :,}")
print()
print(df)
print()
print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)
sys.getsizeof(df) = 224
sys.getsizeof(df) = 194

     temp  humidity
day
0       0         1
1      10        11
2      20        21
3      30        31
4      40        41

The dtypes of the 2 columns are:
temp        int64
humidity    int16
dtype: object
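The 30 bytes saved above are what we would expect from five values shrinking from 8 bytes each to 2 bytes each. One way to see this column by column, sketched below, is the memory_usage method of the DataFrame. The DataFrame here is rebuilt by hand from the values in the output, and the names before and after are ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp":     np.array([0, 10, 20, 30, 40], dtype = np.int64),
    "humidity": np.array([1, 11, 21, 31, 41], dtype = np.int64)
})

before = df.memory_usage(index = False)   # bytes per column, not counting the index
df["humidity"] = df["humidity"].astype(np.int16)
after = df.memory_usage(index = False)

print(before["humidity"], after["humidity"])

# The smaller dtype is safe only if every value fits in its range.
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)
```

The conversion to np.int16 is reasonable here because every humidity lies well inside the range printed on the last line.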