7 Data Preprocessing
Data preprocessing comes after visualization. You may have read many quotes making the point that the majority of time in data science is spent cleaning and preprocessing data. Depending on the data, this is often true.
Let’s start with the Sonar dataset and add some missing values for this example.
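The original code for this step is not shown here, so the following is only a sketch of one way to add missing values to Sonar. It assumes the dataset comes from the mlbench package and falls back to a random stand-in of the same shape if mlbench is not installed. The seed and the choice of cases are arbitrary assumptions. The counts (5 missing values each in V1 and V2, never in the same case) are chosen to agree with the summary in the next section: 10 'NA' values in total, 5/208 = 2.40% missing in V1, and at most 1/61 = 1.64% missing in any case.

```r
# Load Sonar from the mlbench package when it is installed; otherwise build a
# random stand-in with the same shape (208 cases, 60 numeric features + Class)
# so the sketch still runs. The stand-in values are NOT the real Sonar data.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("Sonar", package = "mlbench")
} else {
  set.seed(1)
  Sonar <- as.data.frame(matrix(runif(208 * 60), nrow = 208))  # columns V1..V60
  Sonar$Class <- factor(sample(c("M", "R"), 208, replace = TRUE))
}

set.seed(2020)                        # arbitrary seed, for reproducibility only
na_rows <- sample(nrow(Sonar), 10)    # 10 distinct cases
Sonar$V1[na_rows[1:5]] <- NA          # 5 missing values in V1
Sonar$V2[na_rows[6:10]] <- NA         # 5 missing values in V2, in different cases

colSums(is.na(Sonar))[c("V1", "V2")]  # missing values per affected feature
```

Because the cases are sampled at random, the case with the most missing values will generally not be case #10 as in the summary shown in the next section.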
7.1 Check data
To check your data, run the data check command, which prints a summary such as:
Dataset: Sonar

[[ Summary ]]
208 cases with 61 features:
* 60 continuous features
* 0 integer features
* 1 categorical feature, which is not ordered
* 0 constant features
* 0 duplicated cases
* 2 features include 'NA' values; 10 'NA' values total
** Max percent missing in a feature is 2.40% (V1)
** Max percent missing in a case is 1.64% (case #10)

[[ Recommendations ]]
* Consider imputing missing values or use complete cases only
The output lists useful information about your dataset, followed by recommendations.
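To make the numbers in such a summary concrete, here is a minimal base R sketch that computes the same kinds of counts. The helper `na_summary` is hypothetical and written only for illustration; it is not the package's check function and does not reproduce its output format. The `toy` data frame is made up for the example.

```r
# Hypothetical helper (not part of any package): counts feature types,
# constant features, duplicated cases, and missing values in a data frame.
na_summary <- function(x) {
  stopifnot(is.data.frame(x))
  na_by_feature <- colSums(is.na(x))  # number of NAs in each column
  na_by_case <- rowSums(is.na(x))     # number of NAs in each row
  count_if <- function(f) sum(vapply(x, f, logical(1)))
  list(
    n_cases = nrow(x),
    n_features = ncol(x),
    n_continuous = count_if(is.double),
    n_integer = count_if(function(col) is.integer(col) && !is.factor(col)),
    n_categorical = count_if(is.factor),
    # A feature is constant if it has at most one distinct non-missing value
    n_constant = count_if(function(col) length(unique(col[!is.na(col)])) <= 1),
    n_duplicated = sum(duplicated(x)),
    n_features_with_na = sum(na_by_feature > 0),
    n_na = sum(na_by_feature),
    max_pct_na_feature = 100 * max(na_by_feature) / nrow(x),
    max_pct_na_case = 100 * max(na_by_case) / ncol(x)
  )
}

# Made-up data: one continuous, one integer, one categorical feature
toy <- data.frame(
  a = c(1.5, NA, 3.2, 4.1),
  b = c(1L, 2L, 3L, 4L),
  g = factor(c("x", "y", "x", "y"))
)
s <- na_summary(toy)
str(s)
```

The percentages follow the same arithmetic as in the summary above: missing values in a feature are divided by the number of cases, and missing values in a case are divided by the number of features.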
To clean / preprocess the data, use the preprocess command. In this case we want to impute missing data. By default, preprocess uses the missRanger package (as the log output shows) to predict missing values from the available data using random forests in an iterative procedure.
[2020-06-23 08:18:30 preprocess] Imputing missing values using missRanger...

Missing value imputation by random forests

  Variables to impute: V1, V2
  Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class
iter 1: ..
iter 2: ..
iter 3: ..
iter 4: ..
iter 5: ..
[2020-06-23 08:18:33 preprocess] Done
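The log above reports that the imputation is done with missRanger. As a rough illustration of that step on its own, the sketch below calls `missRanger::missRanger()` directly rather than through preprocess. It uses the built-in `iris` data with a few values removed as a stand-in, so it runs without Sonar. The seed, the number of trees, and the choice of columns are assumptions made for the example, and the code is skipped if missRanger is not installed.

```r
# Stand-in data: iris with 5 values removed from each of two features
dat <- iris
set.seed(2020)  # arbitrary seed, for reproducibility only
dat$Sepal.Length[sample(nrow(dat), 5)] <- NA
dat$Petal.Width[sample(nrow(dat), 5)] <- NA

if (requireNamespace("missRanger", quietly = TRUE)) {
  # Impute every feature with missing values, using all features as predictors.
  # num.trees is forwarded to ranger; 100 is an arbitrary choice to keep it fast.
  imputed <- missRanger::missRanger(
    dat,
    formula = . ~ .,
    num.trees = 100,
    seed = 2020,
    verbose = 0
  )

  # No missing values remain, and observed values are left untouched
  stopifnot(!anyNA(imputed))
  observed <- !is.na(dat$Sepal.Length)
  stopifnot(all(imputed$Sepal.Length[observed] == dat$Sepal.Length[observed]))
} else {
  message("Package 'missRanger' is not installed; skipping the imputation sketch.")
}
```

Only the missing entries are filled in; each is predicted by a random forest trained on the other features, and the procedure repeats over several iterations, which is what the `iter` lines in the log correspond to.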
Let’s now check our preprocessed data:
Dataset: Sonar.pre

[[ Summary ]]
208 cases with 61 features:
* 60 continuous features
* 0 integer features
* 1 categorical feature, which is not ordered
* 0 constant features
* 0 duplicated cases
* 0 features include 'NA' values

[[ Recommendations ]]
* Everything looks good