As a Data Scientist or Machine Learning Engineer working on a real-world machine learning project, you should go through this checklist to make sure that everything goes well throughout the whole pipeline.
There are eight main steps:
- Frame the problem
- Get the data
- Explore the data
- Prepare the data
- Model the data
- Fine-tune the models
- Present the solution
- Launch the ML system
1. Frame the problem
The very first step is to define the (business) objective and state the assumptions made about the solution. The following questions need to be answered:
- What are the current solutions/workarounds (if any)?
- How should the problem be framed (supervised/unsupervised, online/offline, etc.)?
- How should performance be measured (metrics)?
- What is the minimum performance needed to reach the business objective?
- How could the problem be solved manually?
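To make the questions about metrics and minimum performance concrete, here is a minimal sketch for a regression problem. The numbers and the `MIN_ACCEPTABLE_RMSE` threshold are hypothetical; in a real project the threshold comes from the business objective.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: less sensitive to outliers than RMSE."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, for illustration only.
y_true = [200.0, 150.0, 320.0, 275.0]
y_pred = [210.0, 140.0, 300.0, 280.0]
MIN_ACCEPTABLE_RMSE = 15.0  # assumed business threshold

score = rmse(y_true, y_pred)
meets_objective = score <= MIN_ACCEPTABLE_RMSE
print(f"RMSE={score:.2f}, MAE={mae(y_true, y_pred):.2f}, meets objective: {meets_objective}")
# prints: RMSE=12.50, MAE=11.25, meets objective: True
```

Deciding on the metric and the threshold before any model is trained keeps later comparisons honest.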
2. Get the data
In this step, the required data is gathered and split into a training set and a test set. The following steps must be considered:
- List the type of data required and the quantity
- Find an appropriate source for the data
- Check how much space it will take
- Check legal obligations, and get authorization if necessary
- Get the data
- Convert the data into the required format
- Ensure sensitive information is deleted or protected
- Split the data into a training set and a test set
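The final split can be sketched in a few lines. This is a minimal illustration that assumes the data fits in memory as a list of records; the function name `split_train_test`, the 20% ratio, and the seed are illustrative choices, not part of the checklist.

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = list(rows)                   # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed -> same split on every run
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for real records
train_set, test_set = split_train_test(rows)
print(len(train_set), len(test_set))  # prints: 80 20
```

Fixing the seed matters: if the split changes between runs, test records gradually leak into training. Libraries such as scikit-learn offer an equivalent helper (`train_test_split`).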
3. Explore the data
The gathered data is explored to gain insights and to understand it more deeply. The following sub-steps should be considered in this step:
- Make a copy of the data to explore
- Visualize the data
- Study each attribute (name, data type, missing values)
- Study the correlations between attributes
- Document everything learned about data
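Two of the sub-steps above, studying attributes and studying correlations, can be sketched without any external library. The records and attribute names below are hypothetical, and `None` is assumed to mark a missing value.

```python
import math

# Hypothetical records; None marks a missing value.
records = [
    {"rooms": 3, "income": 4.0, "price": 200.0},
    {"rooms": 4, "income": 5.5, "price": 260.0},
    {"rooms": 2, "income": None, "price": 150.0},
    {"rooms": 5, "income": 7.0, "price": 330.0},
]

def summarize(records):
    """Return {attribute: (type names seen, number of missing values)}."""
    summary = {}
    for name in records[0]:
        values = [r[name] for r in records]
        present = [v for v in values if v is not None]
        types = sorted({type(v).__name__ for v in present})
        summary[name] = (types, len(values) - len(present))
    return summary

def pearson(xs, ys):
    """Pearson correlation coefficient, skipping pairs with a missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

summary = summarize(records)
corr = pearson([r["income"] for r in records], [r["price"] for r in records])
print(summary)
print(f"income vs. price correlation: {corr:.3f}")
```

In practice a data-frame library does this in one or two calls, but the logic is the same: count what is missing, then measure how attributes move together.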
4. Prepare the data
Once the data is understood, it needs to be cleaned and pre-processed before the models are trained.
- Data cleaning:
  - Fix or remove outliers (binning & clipping)
  - Fill in missing values or drop rows/columns (scrubbing)
- Feature engineering:
  - Select appropriate features
  - Discretize continuous features
  - Feature crossing: create a new feature by combining two or more features
  - Feature scaling: standardize or normalize features
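The cleaning and feature-engineering items above can each be written as a small reusable function. The following is a sketch under stated assumptions: the feature values are made up, the clipping bounds are arbitrary, and feature crossing is shown as a simple product, which is only one of several ways to combine features.

```python
import statistics

def clip(values, low, high):
    """Clip outliers into [low, high]; missing values (None) pass through."""
    return [v if v is None else max(low, min(high, v)) for v in values]

def fill_missing_with_median(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def cross(a, b):
    """Feature crossing: combine two features by element-wise product."""
    return [x * y for x, y in zip(a, b)]

income = [4.0, None, 5.0, 6.0, 50.0]   # 50.0 is an assumed outlier
rooms = [3, 4, 2, 5, 4]

income_clean = fill_missing_with_median(clip(income, 0.0, 10.0))
income_scaled = standardize(income_clean)
income_x_rooms = cross(income_clean, rooms)
print(income_clean)     # prints: [4.0, 5.5, 5.0, 6.0, 10.0]
print(income_x_rooms)   # prints: [12.0, 22.0, 10.0, 30.0, 40.0]
```

In a real project the median, mean, and standard deviation should be computed on the training set only and then reused unchanged on the test set and on new data.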
5. Model the data
Now it's time to train various models (algorithms) and short-list the most promising ones. The following sub-steps need to be considered:
- Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters
- Measure and compare their performance (use N-fold cross-validation)
- Analyze the most significant variables for each algorithm
- Analyze the types of errors the models make
- Short-list the top three to five most promising models
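The compare-then-short-list loop above can be sketched with N-fold cross-validation written by hand. The two "models" here (a mean baseline and a one-feature least-squares line) and the synthetic data are stand-ins chosen to keep the example self-contained; a real project would compare models from the categories listed above.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_line(xs, ys):
    """Least-squares line y = a*x + b for a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def cross_val_rmse(fit, xs, ys, k=5):
    """Average RMSE over k folds; each fold is held out exactly once."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - model(xs[i])) ** 2 for i in fold]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / len(scores)

# Synthetic data: y is roughly 3x + 2 with a little noise.
rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [3 * x + 2 + rng.uniform(-1, 1) for x in xs]

results = {name: cross_val_rmse(fit, xs, ys)
           for name, fit in [("mean baseline", fit_mean), ("linear", fit_line)]}
shortlist = sorted(results, key=results.get)  # best (lowest RMSE) first
print(results, shortlist)
```

Every candidate is scored on the same folds, so the scores are directly comparable. Keeping a trivial baseline in the comparison shows whether a model has learned anything at all.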
6. Fine-tune the models
From the few short-listed models, we need to find the best-performing one. The major tasks to be considered are:
- Fine-tune the hyper-parameters using cross-validation:
  - Grid Search / Random Search
  - Bayesian Optimization
- Try ensemble methods (combining the best models)
- Keep the best model, then measure its performance on the test set.
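Grid search and random search differ only in how candidate hyper-parameter values are chosen; both score each candidate with cross-validation. The sketch below tunes the number of neighbours of a hand-written k-nearest-neighbours regressor on synthetic data. The model, the candidate values, and the search budget are all assumptions made for illustration; Bayesian optimization is not shown.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def cv_rmse(xs, ys, k_neighbors, n_folds=5, seed=0):
    """Cross-validated RMSE of k-NN regression for one hyper-parameter value."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # same folds for every candidate
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        tr = [i for i in idx if i not in held_out]
        tx, ty = [xs[i] for i in tr], [ys[i] for i in tr]
        sq = [(ys[i] - knn_predict(tx, ty, xs[i], k_neighbors)) ** 2 for i in fold]
        scores.append(math.sqrt(sum(sq) / len(sq)))
    return sum(scores) / len(scores)

# Synthetic, noisy data standing in for the training set.
rng = random.Random(7)
xs = [i / 10 for i in range(60)]
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

# Grid search: evaluate every value in a fixed list.
grid = [1, 3, 5, 9, 15]
grid_scores = {k: cv_rmse(xs, ys, k) for k in grid}
best_k = min(grid_scores, key=grid_scores.get)

# Random search: sample a fixed budget of candidates from a range instead.
sampler = random.Random(0)
candidates = sorted({sampler.randint(1, 20) for _ in range(5)})
random_scores = {k: cv_rmse(xs, ys, k) for k in candidates}
best_k_random = min(random_scores, key=random_scores.get)
print(best_k, best_k_random)
```

Only the training data is used here. The test set stays untouched until the final model has been selected.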
7. Present the solution
This is a much less technical step. Everything done so far should be properly documented and presented in the best way possible, highlighting the big picture. Remember the following steps:
- Document everything that has been done so far
- Create a nice presentation
- Explain why this solution achieves the business objective
- Don’t forget to present interesting points noticed along the way
- Describe what worked and what did not (list assumptions & system’s limitations)
- Ensure that key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).
8. Launch the ML system
Finally, it's time to launch and deploy the ML system on the desired platform. As it is a software solution, testing and maintenance also come under this step. The following things need to be done:
- Get your solution ready for production (plug in production data, write unit tests, etc.)
- Monitor the performance at regular intervals
- Retrain the models on a regular basis on fresh data
A few general tips apply to all of the steps above:
- Always work on copies of the data
- Never look at the test set (use it only for final model testing)
- Automate as much as possible across all of the steps above
- Write functions for all data transformations (reuse)
- Feel free to adapt the checklist to your needs
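The monitoring and retraining items in step 8 can be automated with a small helper. The sketch below keeps a rolling window of absolute errors and signals when the mean error crosses a threshold. The class name `PerformanceMonitor`, the window size, and the threshold are hypothetical; a production system would also need a way to collect the true values.

```python
from collections import deque

class PerformanceMonitor:
    """Track a rolling mean of absolute errors and flag degradation."""

    def __init__(self, window=50, threshold=10.0):
        self.errors = deque(maxlen=window)  # keeps only the latest `window` errors
        self.threshold = threshold

    def record(self, y_true, y_pred):
        self.errors.append(abs(y_true - y_pred))

    def rolling_mae(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retraining(self):
        # Decide only once the window is full, to avoid reacting to a few samples.
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.rolling_mae() > self.threshold

monitor = PerformanceMonitor(window=3, threshold=10.0)
for y_true, y_pred in [(100, 98), (120, 125), (90, 93)]:
    monitor.record(y_true, y_pred)
healthy = monitor.needs_retraining()    # False: errors are small

for y_true, y_pred in [(100, 130), (120, 90), (90, 60)]:  # simulated drift
    monitor.record(y_true, y_pred)
degraded = monitor.needs_retraining()   # True: rolling MAE is now 30.0
print(healthy, degraded)  # prints: False True
```

A signal like this can trigger the retraining job on fresh data instead of relying on a fixed calendar alone.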