1.6. Organisation

1.6.1. The rest of the course

The rest of this course is organised into nine chapters, with appendices at the end. Each chapter will focus on either the system or the process.

We have seven chapters on ML systems, which are: Chapter 2 on linear regression, Chapter 3 on logistic regression and and linear discriminant analysis, Chapter 6 on feature selection and regularisation, Chapter 7 on trees and ensembles, Chapter 8 on generalised linear models and support vector machines, Chapter 9 on principal component analysis and \(K\)-means/hierarchical clustering, and Chapter 10 on neural networks and deep learning, including convolutional and recurrent neural networks (CNNs and RNNs). These chapters will cover the system side of ML. Real-world applications will be used to illustrate the concepts and techniques.

We have two chapters on ML processes, which are: Chapter 4 on hypothesis testing and software development, and Chapter 5 on cross validation and bootstrap. These chapters will cover the process side of ML, including leave-one-out/k-fold cross validation,bootstrap, types of errors, significance of results, and the software development life cycle on GitHub.

1.6.2. Real-world datasets used

In this course, we will use real-world datasets to introduce machine learning from the perspective of AI transparency. We will use the following datasets from the textbook. You can click on the name of the dataset to see the actual data.

Table 1.2 Datasets used in this course, from the textbook (to refine)

Name

Data provided

Machine learning problem

Advertising

Sales, TV, radio, newspaper

Predict sales based on TV, radio, and newspaper advertising

Auto

Gas mileage, horsepower, and other information for cars.

Predict gas mileage for a car.

Bikeshare

Hourly usage of a bike sharing program in Washington, DC.

Predict the number of bikes rented per hour.

Boston

Housing values and other information about Boston census tracts.

Predict the median value of a house.

BrainCancer

Survival times for patients diagnosed with brain cancer.

Predict the survival time for a patient.

Caravan

Information about individuals offered caravan insurance.

Predict whether an individual will buy caravan insurance.

Carseats

Information about car seat sales in 400 stores.

Predict the sales of a car seat.

College

Demographic characteristics, tuition, and more for USA colleges.

Predict the number of applications received by a college.

Credit

Information about credit card debt for 10,000 customers.

Predict the amount of credit card debt for a customer.

Default

Customer default records for a credit card company.

Predict whether a customer will default on a credit card payment.

Fund

Returns of 2,000 hedge fund managers over 50 months.

Predict the returns of a hedge fund manager.

Heart

Information about heart disease for 303 patients.

Predict whether a patient has heart disease.

Hitters

Records and salaries for baseball players.

Predict the salary of a baseball player.

Iris

Measurements of 150 iris flowers.

Predict the species of an iris flower.

Khan

Gene expression measurements for four cancer types.

Predict the cancer type for a patient.

NCI60

Gene expression measurements for 64 cancer cell lines.

Find clusters or groups among the cell lines for personalised treatment.

OJ

Sales information for Citrus Hill and Minute Maid orange juice.

Predict the sales of orange juice.

Portfolio

Past values of financial assets, for use in portfolio allocation.

Predict the value of a financial asset.

Publication

Time to publication for 244 clinical trials.

Predict the time to publication for a clinical trial.

Smarket

Daily percentage returns for S&P 500 over a 5-year period.

Predict whether the stock index with increase or decrease.

USArrests

Crime statistics per 100,000 residents in 50 states of USA.

Predict the crime rate in a state.

Wage

Income survey data for men in central Atlantic region of USA.

Predict the income of men

Weekly

1,089 weekly stock market returns for 21 years.

Predict the stock market return in a week

The above datasets show the diverse range of problems that machine learning can solve, which shows only the tip of the iceberg actually. Applications of machine learning are everywhere, from healthcare to finance, from manufacturing to agriculture, from transportation to education, and so on. The datasets used in this course are from the textbook, which is a good starting point for learning about machine learning. However, you can also find many other datasets online, such as Kaggle, UCI Machine Learning Repository, OpenML, Google Dataset Search, and so on.

1.6.3. Machine learning models

This course focuses on machine learning models (or methods) that are most widely used in practice, while NOT aiming to be exhaustive in covering all the models. The following table shows the machine learning models that we will cover in this course.

Table 1.3 Machine learning models/methods

Method

Description

Example

Linear regression

A linear model for regression.

Predicting the price of a house.

Logistic regression

A linear model for classification.

Predicting whether a customer will default on a credit card payment.

Support vector machine

A kernel-based model for classification.

Predicting whether a customer will default on a credit card payment.

Decision tree

A nonlinear model for classification and regression.

Predicting whether a customer will default on a credit card payment.

Random forest

An ensemble of decision trees for classification and regression.

Predicting whether a customer will default on a credit card payment.

Neural network

A nonlinear model for classification and regression.

Predicting whether a customer will default on a credit card payment.

\(K\)-means

A clustering model.

Finding groups of similar customers.

Principal component analysis

A dimensionality reduction model.

Finding the most important features of a dataset.

No single model will perform well in all possible scenarios. Therefore, it is important to understand the assumptions and trade-offs of each model so that you can choose the right model for a given problem.

1.6.4. Exercises

1. Choose three or more datasets of your interest from Table 1.2. Click on the name of each chosen dataset to explore and get a sense of the data. You may not be able to get a beautiful view or a view at all for those larger ones. Write down the possible machine learning problems using terminology in Table 1.1 that can be solved using each of your chosen dataset.

Compare your answer with the solution below

Dataset

Machine learning problems

Advertising

Regression

Auto

Regression

Bikeshare

Regression

Boston

Regression

BrainCancer

Regression

Caravan

Classification

Carseats

Regression

College

Regression

Credit

Regression

Default

Classification

Fund

Regression

Hitters

Regression

Iris

Classfication

Khan

Classification

NCI60

Clustering

OJ

Regression

Portfolio

Regression

Publication

Regression

Smarket

Classification

USArrests

Clustering

Wage

Regression

Weekly

Classification