Read a numeric column

read_csv deduces the dtype of the column.

read_csv usually reads from a file or URL. But the read_csv in this program reads from a str to make it easy to experiment with different inputs. The str is more than one line, so we write it as a triple-quoted string literal.

Since the values in the second column are integers (1, 11, 21, 31, 41), read_csv deduces that the dtype of this column is np.int64.

"Create a pd.DataFrame from a string that looks like a two-column CSV file."

import sys
import io
import pandas as pd

#a six-line string

s = """\
temp,humidity
0,1
10,11
20,21
30,31
40,41"""

infile = io.StringIO(s)
df = pd.read_csv(infile)   #The argument of read_csv is usually a filename or URL.
df.index.name = "day"

print(df)
print()

print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)

     temp  humidity
day                
0       0         1
1      10        11
2      20        21
3      30        31
4      40        41

The dtypes of the 2 columns are:
temp        int64
humidity    int64
dtype: object

A missing or sentinel value becomes np.nan.

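If one or more values are missing (an empty field, or a row that stops short), or are the sentinel NA (“not available”) or NULL, as in each of the four strings below,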

s = """\
temp,humidity
0,1
10,11
20,
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20,NA
30,31
40,41"""

s = """\
temp,humidity
0,1
10,11
20,NULL
30,31
40,41"""

the value stored in the DataFrame is np.nan (“not a number”), which is of type float and prints as NaN. An np.int64 column cannot hold np.nan, so the dtype of the column becomes np.float64.

     temp  humidity
day                
0       0       1.0
1      10      11.0
2      20       NaN
3      30      31.0
4      40      41.0

The dtypes of the 2 columns are:
temp          int64
humidity    float64
dtype: object
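
The four strings above can be checked in one loop. This is a minimal sketch: the names header, footer, variants, and results, and the labels used as dictionary keys, are illustrative and not part of the original program.

```python
import io
import pandas as pd

header = "temp,humidity\n0,1\n10,11\n"
footer = "\n30,31\n40,41"

# Four ways of leaving the third humidity absent (the labels are illustrative).
variants = {
    "empty field": "20,",
    "short row":   "20",
    "NA":          "20,NA",
    "NULL":        "20,NULL",
}

results = {}
for label, row in variants.items():
    df = pd.read_csv(io.StringIO(header + row + footer))
    results[label] = df["humidity"]
    # Report the dtype of the column and how many of its values are np.nan.
    print(f"{label:12} {df['humidity'].dtype}  missing = {df['humidity'].isna().sum()}")
```

Each variant should report one missing value in a float64 column.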

An invalid value changes the column to object.

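Suppose we accidentally typed 2l (with a lowercase letter L) instead of 21. Because 2l is not a number, read_csv can no longer treat the column as numeric, and the dtype of the column becomes object.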

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

     temp humidity
day               
0       0        1
1      10       11
2      20       2l
3      30       31
4      40       41

The dtypes of the 2 columns are:
temp         int64
humidity    object
dtype: object
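
Before repairing the column, it can help to find out which entries are at fault. The sketch below assumes that every valid humidity is a non-negative integer written with digits only (no sign, no decimal point); the name bad is illustrative.

```python
import io
import pandas as pd

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

df = pd.read_csv(io.StringIO(s))

# The column was not parsed as numbers, so its elements are Python strs.
print(df["humidity"].map(type).unique())

# str.isdigit() is True only for a string made up entirely of digits.
# Assumption: valid humidities have no sign and no decimal point.
bad = df[~df["humidity"].str.isdigit()]
print(bad)
```

For data that may contain signs or decimal points, this digit test would be too strict and would flag valid numbers.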

Change an invalid value to np.nan.

"Create a pd.DataFrame from a string that looks like a two-column CSV file."

import sys
import io
import numpy as np
import pandas as pd

#a six-line string

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

def converter(humidity): #Convert a string into a float.
    try:
        f = np.float64(humidity)
    except ValueError:
        return np.nan    #humidity was not a valid number
    else:
        return f         #humidity was a valid number

converters = {
    "humidity": converter
}

infile = io.StringIO(s)
df = pd.read_csv(infile, converters = converters)
df.index.name = "day"

print(df)
print()

print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)

     temp  humidity
day                
0       0       1.0
1      10      11.0
2      20       NaN
3      30      31.0
4      40      41.0

The dtypes of the 2 columns are:
temp          int64
humidity    float64
dtype: object
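
As an alternative to writing a converter, the column can be repaired after it has been read. This is a sketch of that swapped-in technique, not the approach of the program above: it reads the column as is and then calls pd.to_numeric with errors = "coerce".

```python
import io
import pandas as pd

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

df = pd.read_csv(io.StringIO(s))

# errors = "coerce" turns every string that is not a valid number into np.nan.
df["humidity"] = pd.to_numeric(df["humidity"], errors = "coerce")
df.index.name = "day"

print(df)
print()
print(df.dtypes)
```

The converter runs once per value while the file is being read; pd.to_numeric converts the whole column in one call afterwards.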

Delete every row containing np.nan.

df now has np.nan in every place where there was no valid number. Let’s delete every row that contains an np.nan.

seriesOfBools = df["humidity"].isnull()
print(f"The following {seriesOfBools.sum()} row(s) have no valid humidity:")
print(df[seriesOfBools])
print()

df.dropna(how = "any", inplace = True)   #how = "all" would drop only the rows in which every value is np.nan
print(df)

The following 1 row(s) have no valid humidity:
     temp  humidity
day                
2      20       NaN

     temp  humidity
day                
0       0       1.0
1      10      11.0
3      30      31.0
4      40      41.0
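
dropna can also be told to look at certain columns only, and to return a new DataFrame instead of changing df in place. A minimal sketch, building the DataFrame directly so that the example is self-contained; the name cleaned is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp":     [0, 10, 20, 30, 40],
    "humidity": [1.0, 11.0, np.nan, 31.0, 41.0]
})
df.index.name = "day"

# subset restricts the search for np.nan to the listed columns.
# Without inplace = True, dropna returns a new DataFrame and leaves df alone.
cleaned = df.dropna(subset = ["humidity"])

print(cleaned)
print()
print(f"{len(df)} rows before, {len(cleaned)} rows after")
```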

Change every np.nan to a different value.

We could replace every absent humidity with a fixed value such as 0, or with the mean of the present humidities.

value = {
    "temp":     32,
    "humidity":  0    #or "humidity": df["humidity"].mean(skipna = True)
}

df.fillna(value = value, inplace = True)
print(df)

     temp  humidity
day                
0       0       1.0
1      10      11.0
2      20       0.0
3      30      31.0
4      40      41.0
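
To see what the mean alternative would produce, here is the arithmetic as a short sketch on the humidity column alone. The names mean and filled are illustrative.

```python
import numpy as np
import pandas as pd

humidity = pd.Series([1.0, 11.0, np.nan, 31.0, 41.0], name = "humidity")

# mean skips np.nan by default (skipna = True): (1 + 11 + 31 + 41) / 4
mean = humidity.mean()
filled = humidity.fillna(mean)

print(mean)              # 21.0
print(filled.tolist())   # [1.0, 11.0, 21.0, 31.0, 41.0]
```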

Change the column of np.float64s to a column of np.int16s.

"Create a pd.DataFrame from a string that looks like a two-column CSV file."

import sys
import io
import numpy as np
import pandas as pd

#a six-line string

s = """\
temp,humidity
0,1
10,11
20,2l
30,31
40,41"""

def converter(humidity): #Convert a string into a float.
    try:
        f = np.float64(humidity)
    except ValueError:
        return np.nan    #humidity was not a valid number
    else:
        return f         #humidity was a valid number

converters = {
    "humidity": converter
}

infile = io.StringIO(s)
df = pd.read_csv(infile, converters = converters)
df.index.name = "day"

value = {
    "temp":     32,
    "humidity": df["humidity"].mean(skipna = True)
}

df.fillna(value = value, inplace = True)

print(f"{sys.getsizeof(df) = :,}")
df["humidity"] = df["humidity"].astype(np.int16)
print(f"{sys.getsizeof(df) = :,}")
print()

print(df)
print()

print(f"The dtypes of the {len(df.columns)} columns are:")
print(df.dtypes)
sys.exit(0)

sys.getsizeof(df) = 224
sys.getsizeof(df) = 194

     temp  humidity
day                
0       0         1
1      10        11
2      20        21
3      30        31
4      40        41

The dtypes of the 2 columns are:
temp        int64
humidity    int16
dtype: object
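
The fillna step has to come before astype, because an integer column cannot hold np.nan. The sketch below shows the failure and then compares the storage of the two columns with nbytes, which counts only the bytes of the values themselves. The names failed, filled, and small are illustrative.

```python
import numpy as np
import pandas as pd

humidity = pd.Series([1.0, 11.0, np.nan, 31.0, 41.0], name = "humidity")

# Converting while an np.nan is still present raises an exception.
try:
    humidity.astype(np.int16)
    failed = False
except ValueError as error:
    failed = True
    print(f"astype failed: {error}")

# After the np.nan has been filled in, the conversion succeeds.
filled = humidity.fillna(humidity.mean())
small = filled.astype(np.int16)

# 5 values * 8 bytes each, then 5 values * 2 bytes each
print(filled.nbytes, small.nbytes)
```

An np.int16 can hold only the values -32768 through 32767, so this conversion suits a column whose values are known to stay in that range.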