Loading Data

Read then Launch

This content is best viewed in HTML because Jupyter Notebook cannot display some content (e.g. figures, equations) properly. Finish reading this page first, then launch it as an interactive notebook in Google Colab (faster, Google account needed) or Binder by clicking the rocket symbol at the top.

Data frame and basic operations

In Python, pandas is a commonly used library for reading data from files into data frames. Take the Auto.csv file as an example. First, look at the CSV file itself: it has a header row, values are separated by commas, and missing values are marked by "?". The read_csv function reads a CSV file into a data frame. It has many parameters; in a notebook, appending ? to the function name (e.g. pd.read_csv?) displays its documentation.

The following code loads the libraries needed for this section and shows how to read the textbook's CSV file Auto.csv into a data frame named auto_df.

import pandas as pd
import urllib
from matplotlib import pyplot as plt

%matplotlib inline

data_url = "https://github.com/pykale/transparentML/raw/main/data/Auto.csv"
auto_df = pd.read_csv(data_url, header=0, na_values="?")
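To see what na_values="?" changes, the following minimal sketch reads a tiny in-memory CSV twice, once without and once with the argument. The CSV text and the names csv_text, raw_df, and clean_df are made up for illustration and are not part of Auto.csv.

```python
import io

import pandas as pd

# Hypothetical miniature of a CSV file: a header row, comma-separated values,
# and "?" standing in for a missing horsepower value.
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,chevrolet chevelle malibu
25.0,4,?,ford pinto
"""

# Without na_values, the "?" is kept as text, so the column is not numeric.
raw_df = pd.read_csv(io.StringIO(csv_text))

# With na_values="?", the marker is read as NaN and the column is numeric.
clean_df = pd.read_csv(io.StringIO(csv_text), header=0, na_values="?")

print(raw_df["horsepower"].dtype)
print(clean_df["horsepower"].dtype)
print(clean_df["horsepower"].isnull().sum())
```

Reading the marker as NaN at load time means later steps such as .describe() and .isnull() treat the column as numeric with missing entries.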

The .head() method can be used to get the first 5 (by default) rows of the data frame.

auto_df.head()

	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140.0	3449	10.5	70	1	ford torino

The .describe() method returns summary statistics of the data frame. For a data frame with mixed types, it summarises only the numerical columns by default. Use the include argument to choose which columns are summarised, e.g. include="all" for mixed types, include=[np.number] for numerical columns (with numpy imported as np), and include=["O"] for object columns.

auto_df.describe()

	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin
count	397.000000	397.000000	397.000000	392.000000	397.000000	397.000000	397.000000	397.000000
mean	23.515869	5.458438	193.532746	104.469388	2970.261965	15.555668	75.994962	1.574307
std	7.825804	1.701577	104.379583	38.491160	847.904119	2.749995	3.690005	0.802549
min	9.000000	3.000000	68.000000	46.000000	1613.000000	8.000000	70.000000	1.000000
25%	17.500000	4.000000	104.000000	75.000000	2223.000000	13.800000	73.000000	1.000000
50%	23.000000	4.000000	146.000000	93.500000	2800.000000	15.500000	76.000000	1.000000
75%	29.000000	8.000000	262.000000	126.000000	3609.000000	17.100000	79.000000	2.000000
max	46.600000	8.000000	455.000000	230.000000	5140.000000	24.800000	82.000000	3.000000

auto_df.describe(include="all")

	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin	name
count	397.000000	397.000000	397.000000	392.000000	397.000000	397.000000	397.000000	397.000000	397
unique	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	304
top	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	ford pinto
freq	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6
mean	23.515869	5.458438	193.532746	104.469388	2970.261965	15.555668	75.994962	1.574307	NaN
std	7.825804	1.701577	104.379583	38.491160	847.904119	2.749995	3.690005	0.802549	NaN
min	9.000000	3.000000	68.000000	46.000000	1613.000000	8.000000	70.000000	1.000000	NaN
25%	17.500000	4.000000	104.000000	75.000000	2223.000000	13.800000	73.000000	1.000000	NaN
50%	23.000000	4.000000	146.000000	93.500000	2800.000000	15.500000	76.000000	1.000000	NaN
75%	29.000000	8.000000	262.000000	126.000000	3609.000000	17.100000	79.000000	2.000000	NaN
max	46.600000	8.000000	455.000000	230.000000	5140.000000	24.800000	82.000000	3.000000	NaN
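The other two include options mentioned above can be sketched on a small data frame. The data frame df below is made up for illustration, and its name column is given the object dtype explicitly so that include=["O"] selects it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with two numerical columns and one object column.
df = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 16.0],
        "cylinders": [8, 8, 8],
        "name": pd.Series(
            ["ford pinto", "ford pinto", "amc rebel sst"], dtype=object
        ),
    }
)

# Summary statistics of the numerical columns only.
numeric_summary = df.describe(include=[np.number])

# Summary of the object columns only: count, unique, top, freq.
object_summary = df.describe(include=["O"])

print(numeric_summary)
print(object_summary)
```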

The dimensions of a data frame are given by the .shape attribute, the same as for numpy arrays. Because it is an attribute rather than a method, it is written without parentheses.

auto_df.shape

(397, 9)
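Since shape is a plain tuple of (rows, columns), it can be unpacked directly. A minimal sketch on a made-up data frame df:

```python
import pandas as pd

# Hypothetical toy data frame with 3 rows and 2 columns.
df = pd.DataFrame({"mpg": [18.0, 15.0, 18.0], "cylinders": [8, 8, 8]})

# Unpack the tuple into the number of rows and the number of columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a data frame gives the number of rows.
print(len(df))
```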

Indexing in a pandas data frame is similar to indexing in numpy arrays. A row, a column, or a rectangular subset of rows and columns can be accessed with the .iloc[] or .loc[] indexer. iloc indexes by integer position, and loc indexes by labels (row and column names).

auto_df.iloc[:4, :2]

	mpg	cylinders
0	18.0	8
1	15.0	8
2	18.0	8
3	16.0	8

auto_df.loc[[0, 1, 2, 3], ["mpg", "cylinders"]]

	mpg	cylinders
0	18.0	8
1	15.0	8
2	18.0	8
3	16.0	8
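In auto_df the row labels happen to coincide with the row positions, so iloc and loc look interchangeable. The sketch below uses a made-up data frame with row labels 10 to 13 (an assumption for illustration) to show where they differ, including the fact that a label slice with loc includes its end point while a position slice with iloc does not.

```python
import pandas as pd

df = pd.DataFrame(
    {"mpg": [18.0, 15.0, 18.0, 16.0], "cylinders": [8, 8, 8, 8]},
    index=[10, 11, 12, 13],  # hypothetical non-default row labels
)

# Position-based: rows at positions 0 and 1, column at position 0.
by_position = df.iloc[:2, :1]

# Label-based: rows labelled 10 through 12 (end label included), column "mpg".
by_label = df.loc[10:12, ["mpg"]]

print(by_position)
print(by_label)
```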

Slicing the data frame directly is an alternative way to select the first 4 rows.

auto_df[:4]

	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin	name
0	18.0	8	307.0	130.0	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165.0	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150.0	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150.0	3433	12.0	70	1	amc rebel sst

The column names can be obtained with the list function or the .columns attribute.

print(list(auto_df))
print(auto_df.columns)

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

The .isnull() and .sum() methods can be chained to count the NaNs in each variable.

auto_df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      5
weight          0
acceleration    0
year            0
origin          0
name            0
dtype: int64

# There are 397 observations in the data and only 5 have missing values, so we can simply drop the rows with missing values.
print(auto_df.shape)
auto_df = auto_df.dropna()
print(auto_df.shape)

(397, 9)
(392, 9)
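Before dropping rows, it can help to look at which rows contain missing values, and dropna can also be restricted to particular columns through its subset argument. A minimal sketch on a made-up data frame df:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with one NaN in each of two columns.
df = pd.DataFrame(
    {
        "mpg": [18.0, np.nan, 16.0],
        "horsepower": [130.0, 165.0, np.nan],
    }
)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]

# Drop rows with a missing value in any column.
dropped_any = df.dropna()

# Drop rows only when "horsepower" is missing.
dropped_subset = df.dropna(subset=["horsepower"])

print(rows_with_missing)
print(dropped_any.shape, dropped_subset.shape)
```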

The type of a variable can be changed. The following example converts cylinders into a categorical variable.

auto_df["cylinders"] = auto_df["cylinders"].astype("category")
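The effect of the conversion can be inspected through the column's dtype and its categories. A minimal sketch on a made-up column of cylinder counts:

```python
import pandas as pd

# Hypothetical toy column of cylinder counts.
df = pd.DataFrame({"cylinders": [8, 4, 6, 4, 8]})
df["cylinders"] = df["cylinders"].astype("category")

# The dtype is now "category" and the distinct values become the categories.
print(df["cylinders"].dtype)
print(list(df["cylinders"].cat.categories))

# Number of observations in each category.
print(df["cylinders"].value_counts())
```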

Visualising data

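A column of a data frame can be referred to by name using .column_name, as in auto_df.mpg. The third argument of plt.plot is a format string; "ro" draws red circles. See the plt.plot documentation for more options.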

plt.plot(auto_df.cylinders, auto_df.mpg, "ro")
plt.show()
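Attribute-style access such as auto_df.mpg returns the same column as bracket-style access, but it only works when the column name is a valid Python identifier. The sketch below uses a made-up data frame whose second column name, "model year", contains a space and is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data frame; "model year" is not a valid attribute name.
df = pd.DataFrame({"mpg": [18.0, 15.0], "model year": [70, 71]})

# Attribute access and bracket access return the same column.
same = df.mpg.equals(df["mpg"])
print(same)

# A column name containing a space can only be reached with brackets.
years = df["model year"]
print(list(years))
```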

The .hist() method plots histograms of the variables in a data frame. Use the column argument to restrict the plot to particular variables.

auto_df.hist(column=["cylinders", "mpg"])
plt.show()
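The number of bars in a histogram can be controlled with the bins argument of .hist(). The following is a minimal sketch on made-up mpg values; the "Agg" backend line is only there so the sketch also runs outside a notebook without a display, and it can be left out in Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed for script use only

import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical toy mpg values.
df = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 24.0, 31.0, 27.0]})

# Histogram of a single column with 4 bins; returns an array of Axes objects.
axes = df.hist(column="mpg", bins=4)

# Each subplot is titled with the name of the column it shows.
first_title = axes.ravel()[0].get_title()
print(first_title)

plt.close("all")
```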

Exercises

1. This exercise is related to the College dataset. It contains a number of features for \(777\) different universities and colleges in the US.

a. Use the read_csv() function to read the data and print the first \(20\) rows of the loaded data. Make sure that you have the directory set to the correct location for the data.

# Write your code below to answer the question