
3 Stages of model building in machine learning

1. Data collection, preprocessing, and cleaning

    In this step we gather the data: the features (inputs) and the labels (targets).

    Data preprocessing:

    1. Data cleaning includes simple techniques such as removing outliers and handling missing values, replacing them with the mean or median in regression problems and the mode in classification problems.

  • Mean is the average of a sequence: the sum of the values divided by the number of values.
  • Median is the middle value (or the mean of the two middle values) of a sorted sequence; it is preferred when outliers in the sequence might skew the average.
  • Mode is the value that appears most often in a sequence.
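
    The mean/median replacement described above can be sketched with scikit-learn's SimpleImputer; the toy array below is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A feature column with one missing value and one outlier (100.0)
X = np.array([[1.0], [2.0], [np.nan], [100.0]])

# Replace missing values with the mean of the column
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# Replace missing values with the median, which is robust
# when an outlier like 100.0 would skew the mean
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)
```

    Note how the outlier pulls the mean up to about 34.3 while the median stays at 2.0, which is why the median is preferred when outliers are present.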

    2. Handling (encoding) categorical variables:

    Transform categorical variables into numbers; tool: LabelEncoder()
    Differentiate those categories without implying an order; tools: OneHotEncoder(), get_dummies()
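
    A minimal sketch of the two encoding tools mentioned above; the "city" column is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Paris", "London", "Paris", "Tokyo"]})

# LabelEncoder: one integer per category (implies an order,
# so it is usually reserved for target labels)
le = LabelEncoder()
codes = le.fit_transform(df["city"])

# get_dummies: one binary column per category, no implied order
dummies = pd.get_dummies(df["city"], prefix="city")
```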

    3. Feature scaling:

    Standardize or equalize the numeric features; scaling brings all values to the same magnitude.
  • Normalization rescales values to the range between 0 and 1. It is also known as Min-Max scaling.
  • Standardization rescales values so they are centered around the mean with a unit standard deviation.
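
    The two scaling techniques can be sketched side by side; the data here is invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# Normalization (Min-Max scaling): rescales to the [0, 1] range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean, unit standard deviation
X_std = StandardScaler().fit_transform(X)
```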

    4. Splitting the dataset into training and testing sets
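
    The split is usually done with scikit-learn's train_test_split; the toy arrays below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature
y = np.arange(10)                 # matching labels

# Hold out 20% of the data for testing; random_state makes
# the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```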


2. Algorithm selection 

After examining the dataset, we have to decide which algorithm to choose.

If we have a regression problem, such as salary prediction or price evaluation,
we use: Linear regression, Polynomial regression, Support vector regression
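
For instance, linear regression on made-up salary data (years of experience vs. salary; the numbers are invented, and any of the listed regressors could be swapped in):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])          # years of experience
y = np.array([30000, 35000, 40000, 45000])  # salary

# Fit a line through the data and predict for 5 years of experience
model = LinearRegression().fit(X, y)
pred = model.predict([[5]])
```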

If we have a classification problem, such as spam detection or disease diagnosis,
we use: Logistic regression, K-nearest neighbors classifier, Support vector classifier, Decision tree, Random forest
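
As a sketch, here is a decision tree on scikit-learn's built-in iris dataset; any of the listed classifiers could replace it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the built-in iris dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the classifier and measure accuracy on the held-out set
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
score = clf.score(X_test, y_test)
```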

If we have a clustering problem, we use the k-means algorithm.
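
A minimal k-means sketch on two well-separated point clouds; the points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1],         # cloud near (1, 1)
              [10, 10], [10, 11], [11, 10]])  # cloud near (10, 10)

# Ask k-means for 2 clusters; each point gets a cluster label
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```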

3. Model evaluation and error analysis

  • Finding the accuracy of the model; evaluating model performance for regression and classification problems.
  • For regression, the most commonly used metrics are the mean squared error (MSE) and the root mean squared error (RMSE).
  • In classification we use the confusion matrix for model analysis.
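
The metrics above can be sketched on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, confusion_matrix

# Regression: MSE averages the squared errors; RMSE is its square root
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# Classification: confusion matrix (rows = actual, columns = predicted)
c_true = [1, 0, 1, 1, 0]
c_pred = [1, 0, 0, 1, 0]
cm = confusion_matrix(c_true, c_pred)
```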





