Crafting a Data Mining Problem Statement

Rosmina Joy Cabauatan
Mar 31, 2021

Updated: Mar 31, 2021

Crafting a problem statement is the initial step towards identifying the right data mining technique to use for model generation. It begins with the question: how can data solve a problem? A problem can be solved using data by going through the following steps:

Know the practical motivation of solving the problem using data
Identify the data to solve the problem
Formulate the data mining problem
Prepare the data
Explore the data
Create visualizations for pattern recognition
Apply inferential analysis
Create a statistical inference
Know the ethical considerations

Stage 1. Know the practical motivation of solving the problem using data

A problem is worth solving with data if you can answer yes to both of the following questions:

Is the problem related to data?
Can you solve the problem using the data?

Stage 2. Identify the data to solve the problem

In collecting data, you must answer the following questions:

Does the data you are trying to collect match the problem?
Does the data represent reality?
How will you collect the relevant data?

For example, if a poll is used to survey people's opinions about COVID-19 mass swab testing, how should it be conducted so that it reflects reality?

It's always good to review a lesson in statistics about getting a sample of a population. The figure below shows the commonly used sampling methods.

In simple random sampling, every member of the population has an equal chance of being selected. Stratified sampling is appropriate when the population has mixed characteristics and you want every characteristic to be proportionally represented in the sample: the population is divided into subgroups (strata) and individuals are sampled from each one. In systematic sampling, every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals. Cluster sampling also involves dividing the population into subgroups, but each subgroup should have characteristics similar to those of the whole population. Instead of sampling individuals from each subgroup, entire subgroups are randomly selected.
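The four methods can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production sampler. The population, the field names (age_group, district), and the function names are all hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def simple_random_sample(population, k, rng):
    """Every member has an equal chance of being selected."""
    return rng.sample(population, k)


def systematic_sample(population, k, rng):
    """Pick a random start, then take members at a regular interval."""
    step = len(population) // k
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(k)]


def stratified_sample(population, key, fraction, rng):
    """Sample the same fraction of individuals from every stratum."""
    strata = defaultdict(list)
    for member in population:
        strata[key(member)].append(member)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


def cluster_sample(population, key, n_clusters, rng):
    """Randomly select whole clusters and keep every member of each."""
    clusters = defaultdict(list)
    for member in population:
        clusters[key(member)].append(member)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [m for name in chosen for m in clusters[name]]


# Hypothetical population: 100 respondents tagged with an age group and a district.
rng = random.Random(42)
population = [
    {"id": i, "age_group": ["18-39", "40-59", "60+"][i % 3], "district": f"D{i % 5}"}
    for i in range(100)
]

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda p: p["age_group"], 0.1, rng)
clustered = cluster_sample(population, lambda p: p["district"], 2, rng)
```

Note the difference in the last two functions: the stratified sampler draws individuals from every subgroup, while the cluster sampler keeps all members of a few randomly chosen subgroups.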

Stage 3. Formulate the data mining problem

A data mining problem statement must reflect the data that will be used to solve the problem. For example,

Can age, blood type, location, and travel history predict whether someone is likely to be infected with COVID-19?

What about,

How do I know whether someone is infected with COVID-19?

Is the statement correctly crafted? How would you restructure it to make it correct?
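One rough way to test a statement is to write down the variables it names. The sketch below is only an illustration: the ProblemStatement class, its field names, and the names_its_data check are hypothetical, and the question strings are paraphrased from the two examples above.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemStatement:
    """Hypothetical container linking a question to the data behind it."""
    question: str
    predictors: list = field(default_factory=list)  # input variables named in the question
    target: str = ""                                # outcome the question asks about

    def names_its_data(self):
        # The statement reflects its data only if it names inputs and an outcome.
        return bool(self.predictors) and bool(self.target)


first = ProblemStatement(
    question="Can age, blood type, location, and travel history predict infection?",
    predictors=["age", "blood_type", "location", "travel_history"],
    target="infected",
)
second = ProblemStatement(
    question="How do I know whether someone is infected?",
)

print(first.names_its_data())
print(second.names_its_data())
```

The second statement names no input variables, so nothing in it says which data would be used to answer it.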

Stage 4. Prepare the data

Preparing the data for analysis is an essential step in knowledge discovery. Data has quality if it satisfies the requirements of its intended use. The following factors make up data quality:

accuracy
completeness
consistency
timeliness
believability
interpretability
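Some of these factors can be checked mechanically. The sketch below, which assumes a small set of made-up records and hypothetical helper names, measures completeness as the share of filled values, flags implausible ages as accuracy problems, and flags blood types that do not follow one agreed spelling as consistency problems. Timeliness, believability, and interpretability usually need human judgment and are not covered here.

```python
# Hypothetical patient records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "blood_type": "O", "location": "Manila"},
    {"id": 2, "age": None, "blood_type": "A", "location": "Cebu"},
    {"id": 3, "age": 51, "blood_type": "o", "location": None},
    {"id": 4, "age": -7, "blood_type": "AB", "location": "Davao"},
]

VALID_BLOOD_TYPES = {"A", "B", "AB", "O"}


def completeness(rows, column):
    """Share of rows that have a value in the given column."""
    filled = sum(1 for row in rows if row[column] is not None)
    return filled / len(rows)


def accuracy_issues(rows):
    """IDs of rows whose age falls outside a plausible range (assumed 0 to 120)."""
    return [r["id"] for r in rows if r["age"] is not None and not 0 <= r["age"] <= 120]


def consistency_issues(rows):
    """IDs of rows whose blood type is not written in the agreed format."""
    return [r["id"] for r in rows if r["blood_type"] not in VALID_BLOOD_TYPES]


print(completeness(records, "age"))
print(accuracy_issues(records))
print(consistency_issues(records))
```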

Read the article "Data Preprocessing: An Essential Step to Knowledge Discovery" to learn about the challenges and major tasks in preparing data for analysis.

Stage 5. Explore the data

In this stage, statistical properties of the data such as the mean, median, variance, and distribution are studied. You can also create a graphical presentation to help you gain insight from the data. An example of a question to ask is:

What initial insights can you extract from the distribution of COVID-19 cases in the Philippines by age group?
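As a minimal sketch of this stage, the snippet below computes the mean, median, and sample variance with Python's statistics module and prints a text histogram by age group. The ages are invented for illustration and are not real case data. The 20-year grouping and the age_group helper are assumptions of this example.

```python
import statistics
from collections import Counter

# Hypothetical ages of confirmed cases; not real data.
ages = [23, 35, 35, 41, 47, 52, 58, 63, 67, 79]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
variance_age = statistics.variance(ages)  # sample variance (divides by n - 1)


def age_group(age):
    """Bucket an age into a 20-year group label; 80 and above is open-ended."""
    if age >= 80:
        return "80+"
    low = (age // 20) * 20
    return f"{low}-{low + 19}"


distribution = Counter(age_group(a) for a in ages)

print(f"mean={mean_age}, median={median_age}, variance={variance_age:.1f}")
for group in sorted(distribution):
    # One '#' per case gives a quick text histogram of the distribution.
    print(f"{group:>6} {'#' * distribution[group]}")
```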

Stage 6. Visualization and Pattern Recognition

Exploratory analysis is commonly referred to as descriptive analytics. In this type of analytics, the statistical characteristics of data are presented in the form of visualizations. Visualization is the presentation of information using graphical representations to facilitate the comparison of data and the recognition of patterns for initial decision making.

Visualizations can be classified by intent as either explore-and-calculate or communicate. An explore-and-calculate visualization calls for further analysis: the viewer must reason about the information it conveys. A communicate visualization explains the information and hints at a decision. Visualizations can also be classified according to purpose, such as comparison, composition, distribution, and relationship. The following questions can serve as a guide in determining which type is best suited to specific data.

How many variables are needed in the chart?
How many data points will you display for each variable?

Stage 7. Inferential Analysis

Inferential statistical analysis infers the properties of a population by testing hypotheses and deriving estimates. Findings from a sample group are generalized to the larger population.
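A hypothesis test can be sketched as follows. This example uses a two-sided z-test for a proportion, which is one common choice among many and is not prescribed by this article. The poll numbers, the null hypothesis value, the 0.05 significance level, and the function name proportion_z_test are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: the population proportion equals p0."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


# Hypothetical poll: 230 of 400 sampled respondents support mass testing.
# H0: exactly half of the population supports it (p0 = 0.5).
p_hat, z, p_value = proportion_z_test(230, 400, 0.5)
reject_h0 = p_value < 0.05

print(f"sample proportion={p_hat:.3f}, z={z:.2f}, p-value={p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

The test relies on the normal approximation, which is reasonable only when the sample is large and was drawn in a way that represents the population.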

Stage 8. Statistical Inference

Statistical inference becomes more dependable when the learning procedure or model is generalized and the confidence of the general model's predictions is estimated.

For example, a claim that COVID-19 cases will go down by exactly 10% in the coming month is not dependable. However, a claim that COVID-19 cases will decrease by 3% to 5% in the following month, stated with 95% confidence, is more dependable. In statistics, such a range is referred to as a confidence interval.
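A confidence interval for a mean can be computed as in the sketch below. It uses the normal approximation, and the list of percentage decreases is invented for illustration. For a sample this small, a t-distribution would give a somewhat wider and more appropriate interval. The function name mean_confidence_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin


# Hypothetical month-over-month percentage decreases in cases across 12 regions.
decreases = [3.1, 4.8, 4.2, 3.9, 5.0, 3.5, 4.4, 4.1, 3.7, 4.6, 3.3, 4.0]

low, high = mean_confidence_interval(decreases)
print(f"95% CI for the mean decrease: {low:.2f}% to {high:.2f}%")
```

Raising the confidence level widens the interval: a 99% interval from the same data is wider than the 95% one.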

Stage 9. Ethical Considerations

It is also essential to safeguard data that requires the utmost confidentiality. Hence, in analyzing data, it is necessary to consider the privacy of individuals and organizations.