Primary Questions to Ask in Solving Data Mining Problems

Rosmina Joy Cabauatan
Mar 31, 2021

Selecting the right technique for a data mining problem is a key step toward a correct solution. Five primary questions serve as a basis for selecting an appropriate data mining technique.

Have you tried crafting a data mining problem statement before? What problem statements have you formulated? Do they fall under one of the following question categories? If not, refine them so that they align with one of the categories. It is always good to review previously acquired knowledge before you proceed. Read the article on crafting a data mining problem statement.

Question 1. How much or how many?

More companies are trying to get into the mask-making business as everybody scrounges for protection against the COVID-19 pandemic. Suppose that you work in one of these companies and are thinking of addressing the need for protective gear while demand is very high. Which would you suggest to your company?

Should it venture into the production of other types of personal protective equipment, or should it skip personal protective equipment and venture into AI-driven machines, such as the Danish Sanitizing Robot, instead?

Either choice depends on expected sales or budget, and both are numeric values. What data are needed to answer these questions? To answer them, formulate a "How much" question, which leads to a numeric prediction. For example,

How much budget will be allotted for COVID-19 vaccines in the Philippines?

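To solve a numeric prediction problem, use regression. This data mining technique models the relationship between one or more input variables and a numeric target, so that the fitted relationship can be used to predict the target for new inputs.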
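As a rough illustration of numeric prediction, the sketch below fits a straight line by ordinary least squares and uses it to forecast one more month. The function names `fit_line` and `predict` and the sales figures are assumptions made up for this example; they are not real data or the method of any particular tool.

```python
# Minimal sketch: fit y = slope * x + intercept by ordinary least squares.
# All numbers are hypothetical and only illustrate the "How much" question.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: month number vs. units of protective gear sold (thousands).
months = [1, 2, 3, 4, 5]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(months, units_sold)
forecast = predict(model, 6)
print(forecast)  # 22.0 for this made-up, perfectly linear data
```

Real sales data is rarely this clean, so a practical model would usually involve more input variables and some check of how well the line fits.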

Question 2. Is it a Class A or Class B?

Suppose you want to answer the question: Will this type of personal protective equipment pass the standard or not? There are two possible outcomes to this question.

Standard
Not Standard

This is a Class A or Class B question. Class A is the Standard outcome, and Class B is the Not Standard outcome. This problem is called prediction of classes. Other examples of this type of question are:

Is the email spam or legit?
Is the sentiment positive or negative?

One good question related to COVID-19 vaccines is,

Which vaccine will the Philippine government prefer, Sputnik V of Russia or Ad5-nCoV of China?

In predicting classes of data, classification is the appropriate data mining technique to use. Classification assigns items in a dataset to target categories or classes. The goal is to accurately predict the target class for each case in the data.
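One simple way to see classification at work is a k-nearest-neighbors sketch: a new item receives the label held by most of its closest labeled examples. The feature choices (filtration efficiency and breathing resistance), the sample values, and the function name `knn_predict` are all hypothetical, chosen only to mirror the Standard / Not Standard example above.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical features: (filtration efficiency in %, breathing resistance in Pa).
train = [
    ((98.0, 30.0), "Standard"),
    ((97.0, 35.0), "Standard"),
    ((96.0, 32.0), "Standard"),
    ((80.0, 60.0), "Not Standard"),
    ((75.0, 65.0), "Not Standard"),
    ((82.0, 58.0), "Not Standard"),
]

label_good = knn_predict(train, (95.0, 33.0))
label_bad = knn_predict(train, (78.0, 62.0))
print(label_good)  # Standard
print(label_bad)   # Not Standard
```

Other classifiers, such as decision trees or logistic regression, answer the same Class A or Class B question; k-nearest neighbors is used here only because it fits in a few lines.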

Question 3. How is this organized?

Suppose you want to understand how the data of a car company is organized in terms of purchases. By knowing the structure, you will know a specific customer group to target for a specific promotional campaign.

This type of problem can be answered using the clustering data mining technique. Clustering is about finding groups of data points that are close together but far from other groups of points. Clustering therefore depends on how the distance between data points is measured. Two common distance measures are Euclidean distance and cosine distance.
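The sketch below shows the two distance measures and a bare-bones k-means loop that groups points around the nearest centroid. K-means is one possible clustering method, picked here as an assumption; the customer data, the starting centroids, and the function names are invented for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of its cluster; keep the old one if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers of a car company: (age, purchases per year).
customers = [(22, 1), (25, 2), (24, 1), (48, 6), (52, 7), (50, 5)]
centroids, clusters = kmeans(customers, centroids=[(20.0, 1.0), (55.0, 6.0)])
print(clusters[0])  # the three younger, low-purchase customers
print(clusters[1])  # the three older, high-purchase customers
```

Euclidean distance is used inside `kmeans` because averaging points into a centroid pairs naturally with it; cosine distance is more common when only the direction of a vector matters, as with text data.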

Again, with COVID-19, a good question could be,

Which age group is susceptible to COVID-19?

Question 4. Is it weird behavior?

In this type of problem, you are looking for irregularities in data through the behavior of the data points. In data mining, these irregularities are called anomalies, or sometimes outliers. An anomaly is an observation that deviates greatly from most of the other observations. Anomaly detection flags unexpected or unusual behaviors that can serve as the basis for detecting problems. It is commonly applied in computer networks (for example, intrusion detection) and social media, but it can also be applied in other fields such as healthcare, usually to monitor the conditions of patients.
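A very simple way to flag an observation that deviates greatly from the rest is a z-score check: measure how many standard deviations each value sits from the mean. The sketch below assumes a threshold of 2.0 and uses made-up budget figures; both are illustrative choices, not recommended settings or real data.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mean) / std > threshold
    ]

# Hypothetical project budgets in millions; one project is unusually large.
budgets = [10, 12, 11, 9, 10, 11, 12, 10, 95, 11]

flagged = find_anomalies(budgets)
print(flagged)  # [(8, 95)]
```

A threshold of 2.0 is an assumption that suits this tiny sample, because with only ten values a single outlier can never score above 3.0. Larger datasets often use 3.0, or a median-based score that is less distorted by the outlier itself.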

For example,

Is this unusually funded project of the Philippine Health Insurance Corporation within its priorities?

Question 5. What should be done next?

This type of question is commonly addressed with reinforcement learning, sometimes called adaptive learning. Reinforcement learning is learning how to best react to situations through trial and error. In machine learning, reinforcement learning is researched for artificial decision-makers, referred to as agents. Reinforcement learning enables an agent to learn from experience and adapt to new situations without human intervention. The agent learns to make decisions by sampling its environment, which provides the data in the form of feedback on each action.
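The trial-and-error idea can be sketched with an epsilon-greedy agent that chooses between two actions and keeps a running estimate of each action's reward. This is a deliberately small bandit-style example, not a full reinforcement learning system; the action names, payoffs, noise range, and the function name `run_bandit` are all assumptions invented for illustration.

```python
import random

def run_bandit(base_rewards, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best estimate, sometimes explore."""
    rng = random.Random(seed)
    actions = list(base_rewards)
    estimates = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def pull(action):
        # Environment feedback: a made-up payoff plus bounded random noise.
        return base_rewards[action] + rng.uniform(-0.25, 0.25)

    def update(action, reward):
        # Incremental running mean of the rewards seen for this action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    # Try every action once so each estimate rests on real feedback.
    for a in actions:
        update(a, pull(a))

    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)              # explore
        else:
            action = max(actions, key=estimates.get)  # exploit
        update(action, pull(action))

    return estimates, counts

# Hypothetical payoffs; they do not reflect any real business outcome.
estimates, counts = run_bandit({"option_a": 1.0, "option_b": 2.0})
best_action = max(estimates, key=estimates.get)
print(best_action)  # option_b, since its payoff range is always higher here
```

The agent is never told which option pays more. It discovers that from the rewards it samples, which is the sense in which the environment provides the data.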

For example,

Should your company skip the production of personal protective equipment and venture into AI-driven machines, such as the Danish Sanitizing Robot, instead?