bugl
bugl
HomeLearnPatternsPathsSearch
HomeLearnPatternsPathsSearch

Loading lesson path

Learn/Python/Data Science and Scientific Python
Python•Data Science and Scientific Python

Machine Learning - Scale

Flash cards

Review the key moves

1/4
Core idea

What is the main idea behind Machine Learning - Scale?

Lesson checks

Practice each idea before moving on

Short Mimo-style checks built from this lesson's code, terms, and sequence.

1Quick choice

Which statement best captures the main point of this lesson?

2Fill blank

Complete the missing token from the example code.

___ pandas
3Order

Put the learning moves in the order that makes the concept easiest to apply.

The answer to this problem is scaling.
When your data has different values, and even different measurement units, it can be difficult to compare them.
Predict CO2 Values

Scale Features

When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time?

The answer to this problem is scaling. We can scale data into new values that are easier to compare.

Take a look at the table below, it is the same data set that we used in the multiple regression chapter , but this time the volume column contains values in liters instead of cm 3 (1.0 instead of 1000).

CarModelVolumeWeightCO2
ToyotaAygo1.079099
MitsubishiSpace Star1.2116095
SkodaCitigo1.092995
Fiat5000.986590
MiniCooper1.51140105
VWUp!1.0929105
SkodaFabia1.4110990
MercedesA-Class1.5136592
FordFiesta1.5111298
AudiA11.6115099
HyundaiI201.198099
SuzukiSwift1.3990101
FordFiesta1.0111299
HondaCivic1.6125294
HundaiI301.6132697
OpelAstra1.6133097
BMW11.6136599
Mazda32.21280104
SkodaRapid1.61119104
FordFocus2.01328105
FordMondeo1.6158494
OpelInsignia2.0142899
MercedesC-Class2.1136599
SkodaOctavia1.6141599
VolvoS602.0141599
MercedesCLA1.51465102
AudiA42.01490104
AudiA62.01725114
VolvoV701.61523109
BMW52.01705114
MercedesE-Class2.11605115
VolvoXC702.01746117
FordB-Max1.61235104
BMW21.61390108
OpelZafira1.61405109
MercedesSLK2.51395120

It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into comparable values, we can easily see how much one value is compared to the other.

There are different methods for scaling data, in this tutorial we will use a method called standardization.

z = (x - u) / s

Where z is the new value, x is the original value, u is the mean and s is the standard deviation.

If you take the weight column from the data set above, the first value is 790, and the scaled value will be:

If you take the volume column from the data set above, the first value is 1.0, and the scaled value will be:

(1.0 - 1.61 ) / 0.38 = -1.59

Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.

You do not have to do this manually, the Python sklearn module has a method called StandardScaler() which returns a Scaler object with methods for transforming data sets.

Example

import pandas
from sklearn import linear_model
from
sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]

scaledX = scale.fit_transform(X)

print(scaledX)

Predict CO2 Values

The task in the Multiple Regression chapter was to predict the CO2 emission from a car when you only knew its weight and volume.

When the data set is scaled, you will have to use the scale when you predict values:

Example

import pandas
from sklearn import linear_model
from
sklearn.preprocessing import StandardScaler
scale = StandardScaler()

df = pandas.read_csv("data.csv")
X = df[['Weight', 'Volume']]

y = df['CO2']

scaledX = scale.fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)
scaled =
scale.transform([[2300, 1.3]])
predictedCO2 = regr.predict([scaled[0]])

print(predictedCO2)

Previous

Matplotlib Bars

Next

Matplotlib Histograms