Challenges of Machine Learning

Machine learning enables software systems to learn things by themselves. We don't need to explicitly program computer systems; instead, they can learn from examples. Despite its advantages, implementing machine learning in computer systems comes with several challenges:
  • Requires Huge Amount of Data
  • Sampling Bias
  • Feature Engineering
  • Hyperparameter Tuning
  • Overfitting and Underfitting

Requires Huge Amount of Data

Data is central to machine learning: it is the only thing a model can learn from.
We humans can learn new things even when experiencing them for the first time. For example, a toddler can learn to recognize a dog after seeing, hearing, and playing with one just a few times. Teaching a machine to recognize a dog is far harder.
Even simple problems, such as recognizing handwritten digits, require thousands of examples to train a machine. Complex problems such as face recognition and speech recognition require millions. In general, the more data a machine has, the more it can learn. But not every field has enough data available to train machines.
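To make the size effect concrete, here is a minimal sketch (synthetic made-up data, not a real benchmark) that trains a trivial nearest-centroid classifier on two overlapping classes, once with 5 examples per class and once with 500:

```python
import random

random.seed(42)

def make_data(n_per_class):
    """Two 1-D classes drawn from Gaussians centred at 0.0 and 3.0."""
    xs, ys = [], []
    for label, mean in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            xs.append(random.gauss(mean, 1.0))
            ys.append(label)
    return xs, ys

def nearest_centroid_fit(xs, ys):
    """A minimal 'model': just the mean of each class."""
    c0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    c1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return c0, c1

def accuracy(centroids, xs, ys):
    c0, c1 = centroids
    preds = [0 if abs(x - c0) < abs(x - c1) else 1 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

test_x, test_y = make_data(1000)               # fixed evaluation set
small = nearest_centroid_fit(*make_data(5))    # 5 examples per class
large = nearest_centroid_fit(*make_data(500))  # 500 examples per class

acc_small = accuracy(small, test_x, test_y)
acc_large = accuracy(large, test_x, test_y)
print(f"5 examples/class:   {acc_small:.2f}")
print(f"500 examples/class: {acc_large:.2f}")
```

With more examples the estimated class centres settle near the true means, so test accuracy stabilizes near the best this simple model can do.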

Sampling Bias

The data we collect can be biased, and an ML model trained on biased data will also make biased predictions.
For example, suppose you want to train a machine to classify dogs vs. cats, so you collect lots of images of each to feed into the machine. If you like cats more than dogs and end up with 70% cat images and 30% dog images, the trained model will be biased and will mostly try to predict cat. That's why a training sample should be unbiased and balanced.
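One simple way to correct the imbalance above is to downsample the majority class. Here is a minimal sketch using a hypothetical 70/30 cat-vs-dog dataset (labels and counts are illustrative):

```python
import random

random.seed(0)

# Hypothetical imbalanced collection: 70% "cat", 30% "dog".
dataset = [("cat", i) for i in range(700)] + [("dog", i) for i in range(300)]

def downsample_to_balance(data):
    """Randomly drop majority-class examples until all classes are equal."""
    by_label = {}
    for label, item in data:
        by_label.setdefault(label, []).append((label, item))
    n_min = min(len(v) for v in by_label.values())   # size of smallest class
    balanced = []
    for v in by_label.values():
        balanced.extend(random.sample(v, n_min))     # keep n_min per class
    random.shuffle(balanced)
    return balanced

balanced = downsample_to_balance(dataset)
counts = {label: sum(1 for l, _ in balanced if l == label)
          for label in ("cat", "dog")}
print(counts)
```

Downsampling throws data away; when data is scarce, oversampling the minority class or weighting the loss function are common alternatives.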

Poor Quality Data

Most of the time, real-world data is noisy and full of errors. Some training instances may have missing values, and some fields may hold an integer where a string is expected, among many other forms of noise. This poor-quality data must be cleaned up before it is fed into a training algorithm. Data cleaning and pre-processing is widely considered the most time-consuming task in a machine learning pipeline.

Feature Engineering

Feature engineering is another difficult but important step in machine learning. Not every feature (attribute) in the data is useful for a given problem. For example, to build a stock price prediction system you don't need details such as who bought or sold the stock; instead, you will use the opening and closing prices. Similarly, you can create new features by combining two or more existing ones, such as a price-change feature computed by subtracting the opening price from the closing price.
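The cleaning and feature-engineering steps described above can be sketched in a few lines. The record layout and field names below are hypothetical, chosen only to mirror the stock-price example:

```python
# Hypothetical raw records: numbers stored as strings, one missing value.
raw_rows = [
    {"open": "100.0", "close": "103.5"},
    {"open": None,    "close": "98.0"},   # missing value -> dropped
    {"open": "99.5",  "close": "101.0"},
]

def clean_and_engineer(rows):
    cleaned = []
    for row in rows:
        if row["open"] is None or row["close"] is None:
            continue                                     # drop incomplete records
        o, c = float(row["open"]), float(row["close"])   # strings -> floats
        cleaned.append({"open": o, "close": c,
                        "price_change": c - o})          # new derived feature
    return cleaned

rows = clean_and_engineer(raw_rows)
print(rows)
```

In practice, whether to drop incomplete rows or impute the missing values (e.g. with a column mean) depends on how much data you can afford to lose.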

Hyperparameter Tuning

Even with clean data, a machine learning model trained with default parameter values often does not perform well. You need to tune various hyperparameters during training, such as the learning rate, number of steps, batch size, and cost function. Since every algorithm has a practically unlimited number of possible parameter values, testing them all manually to find the best one is infeasible.
Instead, you can use techniques such as grid search, randomized search, or Bayesian optimization to fine-tune the hyperparameters based on your training-set size and problem.
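Grid search, the simplest of these techniques, just tries every combination of candidate values and keeps the best. A minimal sketch on a toy problem (gradient descent minimizing the made-up loss (w - 3)², tuning learning rate and step count):

```python
import itertools

def train(lr, steps):
    """Gradient descent on the toy loss (w - 3)^2; returns the final loss."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)   # derivative of the loss
        w -= lr * grad
    return (w - 3.0) ** 2

# Exhaustive grid search over two hyperparameters (a minimal sketch; real
# tools include sklearn's GridSearchCV and RandomizedSearchCV).
grid = {"lr": [0.001, 0.05, 0.3], "steps": [10, 100]}
best = min(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=lambda params: train(**params),
)
print("best hyperparameters:", best)
```

Note that grid search grows exponentially with the number of hyperparameters, which is exactly why randomized search and Bayesian optimization exist.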

Overfitting and Underfitting

Overfitting occurs when a model fits the training set so closely that it cannot generalize well to the test set. There can be various reasons for a model overfitting the training set: the dataset may be sparse or unbalanced, it may contain noise, or the model may be too complex for the amount of data available. To prevent overfitting, you can regularize the model by constraining it; the amount of regularization is typically controlled by a hyperparameter (often called alpha).

Conversely, underfitting occurs when the model cannot learn the underlying structure of the data. The model is so simple that it predicts poorly on both the training set and the test set. Possible causes include an overly simple algorithm, too large a regularization value, features that are irrelevant to the target, or very noisy data. You can try fixing any of these issues to prevent underfitting.
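The alpha hyperparameter's role in both failure modes can be seen in a tiny sketch. For a one-parameter linear model, L2 regularization minimizes sum((w·x - y)²) + alpha·w², which has the closed form w = Σxy / (Σx² + alpha). The data points below are made up:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x plus noise (made-up numbers)

def ridge_fit(xs, ys, alpha):
    """Closed-form L2-regularized fit of y = w*x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

w_none  = ridge_fit(xs, ys, alpha=0.0)     # no regularization
w_some  = ridge_fit(xs, ys, alpha=10.0)    # moderate shrinkage
w_heavy = ridge_fit(xs, ys, alpha=1000.0)  # so strong the model underfits
print(w_none, w_some, w_heavy)
```

Increasing alpha shrinks the weight toward zero: a moderate value constrains the model against overfitting, while an excessive value drives the weight so low that the model underfits, exactly the trade-off described above.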
