Data Prog with Python

Week5

Standardization of Data

standardized_x = (x - mean) / std, computed separately for each column (feature)

import sklearn.preprocessing as skp
scaler=skp.StandardScaler().fit(Dataset)
standardized_Dataset=scaler.transform(Dataset)

# One-call alternative; axis=0 standardizes each column (feature)
standardized_Dataset = skp.scale(Dataset, axis=0)
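
As a quick check of the formula above, the sketch below standardizes a small hand-made array three ways and compares the results. The toy values in `Dataset` are an assumption chosen for illustration; they are not part of the course data.

```python
import numpy as np
import sklearn.preprocessing as skp

# Hypothetical toy data: 4 samples (rows), 2 features (columns).
Dataset = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

# Manual standardization, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (Dataset - Dataset.mean(axis=0)) / Dataset.std(axis=0)

# The two scikit-learn routes from the notes.
via_scaler = skp.StandardScaler().fit(Dataset).transform(Dataset)
via_scale = skp.scale(Dataset, axis=0)

print(np.allclose(manual, via_scaler))  # True
print(np.allclose(manual, via_scale))   # True
```

After standardization every column has mean 0 and standard deviation 1, whatever its original scale was.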

Normalization of Data

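Min-max normalization rescales each column to the range [0, 1]: normalized_x = (x - min) / (max - min). In scikit-learn this is what skp.MinMaxScaler computes. The Normalizer used below does something different: it rescales each row (sample) to unit norm, x / ||x||.

# Normalizer is stateless, so fit() learns nothing from the data
normalizer = skp.Normalizer(norm="l2").fit(Dataset)
normalized_Dataset = normalizer.transform(Dataset)

# One-call alternative; norm can be "l1", "l2" or "max"
normalized_Dataset = skp.normalize(Dataset, norm="l2")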
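
The min-max formula and the `Normalizer` code act along different axes, which is easy to miss. The sketch below puts the two side by side on a hand-made array (the values are an assumption chosen so the results are easy to verify by hand).

```python
import numpy as np
import sklearn.preprocessing as skp

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
Dataset = np.array([[3.0, 4.0],
                    [6.0, 8.0],
                    [0.0, 5.0]])

# Min-max scaling works per column: (x - min) / (max - min).
minmax = skp.MinMaxScaler().fit_transform(Dataset)
print(minmax)
# [[0.5  0.  ]
#  [1.   1.  ]
#  [0.   0.25]]

# Unit-norm scaling works per row: each row is divided by its L2 norm.
unit = skp.normalize(Dataset, norm="l2")
print(np.linalg.norm(unit, axis=1))  # every row now has length 1
```

Note that the first two rows become identical after unit-norm scaling, because they point in the same direction and differ only in length.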

Binarization of Data

# Values greater than the threshold become 1, all others become 0
binarizer = skp.Binarizer(threshold=0.1).fit(Dataset)
binarized_Dataset = binarizer.transform(Dataset)

# One-call alternative
binarized_Dataset = skp.binarize(Dataset, threshold=0.1)
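
A small sketch of the threshold behaviour, using made-up values: the comparison is strict, so a value exactly equal to the threshold maps to 0.

```python
import numpy as np
import sklearn.preprocessing as skp

# Hypothetical toy data, including a value exactly at the threshold.
Dataset = np.array([[0.05, 0.1, 0.5],
                    [-1.0, 0.11, 0.0]])

binarized_Dataset = skp.binarize(Dataset, threshold=0.1)
print(binarized_Dataset)
# [[0. 0. 1.]
#  [0. 1. 0.]]
```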

Missing Data Imputation

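Available strategies: mean, median, most_frequent

# skp.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
# and always imputes column by column, so there is no axis argument
import numpy as np
from sklearn.impute import SimpleImputer

# Treat zeros as missing values
imp = SimpleImputer(missing_values=0, strategy="mean")

# Treat NaN as missing values
imp = SimpleImputer(missing_values=np.nan, strategy="mean")

imp.fit_transform(Dataset)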
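
A minimal worked example of mean imputation, assuming a recent scikit-learn where the imputer lives in `sklearn.impute` as `SimpleImputer` (the older `skp.Imputer` is no longer available). The toy array is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
Dataset = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(Dataset)

print(imp.statistics_)  # per-column means of the observed values: [2. 6.]
print(filled)
# [[1. 6.]
#  [3. 4.]
#  [2. 8.]]
```

Each missing entry is replaced by the mean of the observed values in its own column; swapping `strategy` for "median" or "most_frequent" changes only the statistic used.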

PCA

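import sklearn.decomposition as skd

# n is the number of principal components to keep
# whiten=True additionally rescales the components to unit variance;
# the components are uncorrelated either way
pca = skd.PCA(n_components=n, whiten=False)
pca.fit(Dataset)
Dataset_reduced_dim = pca.transform(Dataset)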
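
The sketch below illustrates what `n_components` and `whiten` do, on synthetic data generated for this purpose (the data, the seed and the noise level are all assumptions). The third column is almost a linear combination of the first two, so two components should capture nearly all of the variance.

```python
import numpy as np
import sklearn.decomposition as skd

# Hypothetical synthetic data: 200 samples, 3 features,
# where the third feature is approximately a + b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
Dataset = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=200)])

# Keep 2 of the 3 dimensions.
pca = skd.PCA(n_components=2, whiten=False)
pca.fit(Dataset)
Dataset_reduced_dim = pca.transform(Dataset)
print(Dataset_reduced_dim.shape)            # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1

# With whiten=True the reduced columns also have unit variance,
# so their covariance matrix is approximately the identity.
whitened = skd.PCA(n_components=2, whiten=True).fit_transform(Dataset)
print(np.round(np.cov(whitened, rowvar=False), 3))
```

`explained_variance_ratio_` is a convenient way to judge how many components are worth keeping before fixing `n_components`.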

Exercise

Load the 'diabetes' dataset from the sklearn dataset library and do the following:

  • Standardize the data
  • Normalize the data
  • Reduce the dimension of the data to 4 columns with PCA
  • Cluster the input features into 4 clusters with the k-means clustering library of the scipy package

import sklearn.preprocessing as skp
import sklearn.decomposition as skd
import scipy.cluster.vq as spcv
from sklearn.datasets import load_diabetes

# Load the data: 442 samples, 10 features
diabetes = load_diabetes(return_X_y=False)
Dataset = diabetes.data
Target = diabetes.target
print(Dataset.shape)

# Standardize each column, then scale each row to unit L2 norm
standardized_Dataset = skp.scale(Dataset, axis=0)
Normalized_Dataset = skp.normalize(standardized_Dataset, norm='l2')

# Reduce to 4 principal components
pca = skd.PCA(n_components=4, whiten=False)
pca.fit(Normalized_Dataset)
Dataset_Reduced_Dim = pca.transform(Normalized_Dataset)
print(Dataset_Reduced_Dim.shape)

# kmeans returns the centroids and the mean distortion (not a variance)
centroids, distortion = spcv.kmeans(Dataset_Reduced_Dim, 4)
# vq assigns each row to its nearest centroid
# (named cluster_ids because `id` would shadow a Python builtin)
cluster_ids, dist = spcv.vq(Dataset_Reduced_Dim, centroids)
print(centroids)
print('---------------------------------------------------')
print(cluster_ids)
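
To help read the output of the clustering step, the sketch below calls `spcv.vq` on its own with centroids fixed by hand, so the result is deterministic. The centroids and points are made-up values; in the exercise the centroids come from `spcv.kmeans`, whose random initialization can change the cluster numbering from run to run.

```python
import numpy as np
import scipy.cluster.vq as spcv

# Hypothetical centroids and observations, chosen by hand.
centroids = np.array([[0.0, 0.0],
                      [10.0, 10.0]])
points = np.array([[1.0, 0.0],
                   [9.0, 10.0],
                   [0.0, 3.0],
                   [10.0, 6.0]])

# vq returns, for every row, the index of the nearest centroid
# and the Euclidean distance to that centroid.
cluster_ids, dist = spcv.vq(points, centroids)
print(cluster_ids)  # [0 1 0 1]
print(dist)         # [1. 1. 3. 4.]

# Number of points assigned to each cluster.
sizes = np.bincount(cluster_ids, minlength=len(centroids))
print(sizes)        # [2 2]
```

Applying the same `np.bincount` call to the cluster indices from the exercise shows how the 442 samples are split across the 4 clusters.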

Last updated 5 years ago
