What is Data Mining?
Data mining is the process of extracting information from existing data sets and making predictions based on that information. It is used a lot in finance because much financial data is unstructured (i.e., not put into any special format), which makes it hard to analyze with traditional techniques like linear regression or time-series analysis: there are simply too many variables involved to analyze every aspect of the system at once.
Fintalent’s data mining consultants note that data mining is often confused with data analysis, which is more of the initial act of cleaning and manipulating the data itself. Data mining is just one step beyond it, making predictions based on the cleaned-up data and trying different combinations of variables to see what gives the best result.
Data mining can be used to predict stock prices, individual or collective behavior (e.g., how much a customer will spend in a given month), credit risk, fraud or even election outcomes (e.g., who will win a presidential election). These problems may seem very different, but they share a similar structure and are often solved in the same way.
Financial data is often not structured in the way we expect it to be, which makes it extremely difficult to analyze using traditional methods. What follows gives some examples of what data mining is used to predict, and then describes the main approaches and how they could be used in finance.
Data mining is often used to predict future investment outcomes, like which mutual fund will perform the best during a given period of time or how a specific stock will behave over the next few years. It can also be used to predict human behavior, such as how much each customer will spend in a given month or what their preferred stock quote will be at their next trade. The potential applications are almost limitless and it is only by sifting through the clean financial data that it becomes possible to find interesting ways to make predictions with it.
Data Mining Approaches
Data mining through statistics – Statistics is the scientific study of data, and it can be used in many different ways to analyze a data set. The two main branches are descriptive statistics and inferential statistics. The former is used to summarize data with graphs, tables and summary measures (e.g., mean, median, standard deviation) to see what the data looks like and how it can be interpreted. It is very useful in exploratory research where you’re trying to find out if there are any trends or correlations within the data set. The latter is used to draw conclusions that go beyond the observed data, such as forecasting future events (e.g., predicting next year’s quarterly profits based on this year’s revenue). It is also used in financial data analysis to build a model that can predict how certain events will affect the company’s stock.
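As a minimal sketch of the descriptive side, the summary measures mentioned above can be computed with Python's standard `statistics` module. The revenue figures and the helper name `describe` are made up for illustration.

```python
import statistics

# Hypothetical quarterly revenues (in thousands); values are invented for illustration.
quarterly_revenue = [120.0, 135.0, 128.0, 150.0, 142.0, 160.0]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(quarterly_revenue)
print(summary)
```

A summary like this is usually the first look at a data set, before any model is fitted.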
Data mining through correlation – Correlation is a measure of the relationship between two variables in a data set, or among several variables in a group (e.g., if you have correlated variables, like price and quantity, then you may think that a higher price will lead to more sales, although correlation alone does not prove that one causes the other). It is used to generate predictions by considering past values and looking for patterns in them that can be applied in the future (e.g., looking at all past values of a stock and trying to see if they followed a pattern or not). For example, a stock that has gone down in value over the last twenty years might be expected to continue doing so, although a past pattern is no guarantee of future behavior.
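One common way to put a number on such a relationship is the Pearson correlation coefficient, which ranges from -1 (perfectly opposite movement) to +1 (perfectly aligned movement). The sketch below computes it by hand; the price and quantity values are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: unit price and units sold over five periods.
price = [10, 12, 14, 16, 18]
quantity = [100, 92, 85, 70, 64]

r = pearson_correlation(price, quantity)
print(round(r, 3))
```

In this invented data set the coefficient comes out strongly negative, meaning quantity tends to fall as price rises. The coefficient only describes how the two series move together; it says nothing about why.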
Data mining through regression – Regression goes one step beyond correlation: instead of only measuring whether two variables, like price and quantity, move together, it fits an equation that describes how one depends on the other. It is used for prediction by looking at past values and trying to predict future events based on them (e.g., predicting the quarterly profit from past quarterly revenue: profit = a + b × revenue, where a and b are estimated from the data). The regression model is typically written out as an equation like this to make it more understandable.
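A minimal sketch of simple linear regression, assuming a straight-line relationship between revenue and profit and fitting it by ordinary least squares. The function name `fit_line` and the data are hypothetical, and real financial relationships are rarely this clean.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical past quarters: revenue and profit (same currency units).
revenue = [100.0, 120.0, 140.0, 160.0]
profit = [12.0, 16.0, 20.0, 24.0]

intercept, slope = fit_line(revenue, profit)

# Use the fitted equation to predict profit for a new revenue figure.
predicted_profit = intercept + slope * 180.0
print(intercept, slope, predicted_profit)
```

The fitted intercept and slope are the "equation" of the model; prediction is then just plugging a new revenue value into it.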
Data mining through clustering – Clustering is the grouping of similar data points together. It is used for categorizing data (e.g., grouping books by genre) or discovering connections and dependencies among different variables that are hard to spot from summary statistics alone (e.g., finding out which types of users tend to buy two different types of books). This can also help in coming up with recommendations to improve the way a company manages its inventory, how it communicates with its customers, or its marketing strategies.
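One widely used clustering technique is k-means, which the text above does not name specifically; it is swapped in here only as an illustration. The sketch below is a tiny one-dimensional version applied to hypothetical monthly customer spend, with starting centers chosen by hand.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, clusters


# Hypothetical monthly spend per customer.
monthly_spend = [20, 22, 25, 200, 210, 190, 23, 205]

centers, clusters = kmeans_1d(monthly_spend, centers=[0, 100])
print(centers)
print(clusters)
```

With this data the customers separate into a low-spend group and a high-spend group, which could then be treated differently in marketing or inventory planning.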
Directed data mining – This is where you assign specific attributes to different customers, who are then grouped into the same data set. It can be used for following the development of a relationship (e.g., if my customer is an early adopter, I will offer them more attractive promotions). Actual customer behavior can also be predicted based on certain attributes (e.g., if this person has an interest in technology, we can make a prediction about what types of programs they will like).
Multi-dimensional analytics – This is where you take multiple variables and look at how they are related in different combinations to one another. It is used to compare data sets and come up with conclusions that cannot be drawn from a single variable alone. It can also be used to give insights into various aspects of the subject matter (e.g., putting all the different data sets together will let you see how all of them are related).
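A simple way to look at several variables in different combinations is a pivot (cross-tabulation) that aggregates one measure along two chosen dimensions. The sketch below is one possible illustration; the record fields (`sector`, `quarter`, `volume`) and the helper `pivot_sum` are hypothetical.

```python
from collections import defaultdict


def pivot_sum(records, row_key, col_key, value_key):
    """Sum `value_key` for every (row_key, col_key) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for record in records:
        table[record[row_key]][record[col_key]] += record[value_key]
    return {row: dict(cols) for row, cols in table.items()}


# Hypothetical trade records.
trades = [
    {"sector": "tech", "quarter": "Q1", "volume": 100.0},
    {"sector": "tech", "quarter": "Q2", "volume": 150.0},
    {"sector": "energy", "quarter": "Q1", "volume": 80.0},
    {"sector": "tech", "quarter": "Q1", "volume": 50.0},
    {"sector": "energy", "quarter": "Q2", "volume": 60.0},
]

# The same data viewed along two different combinations of dimensions.
by_sector_quarter = pivot_sum(trades, "sector", "quarter", "volume")
by_quarter_sector = pivot_sum(trades, "quarter", "sector", "volume")
print(by_sector_quarter)
print(by_quarter_sector)
```

Swapping the row and column keys gives a different view of the same records, which is the basic idea behind slicing data along multiple dimensions.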
Data mining through association rules – This type of data mining is used for discovering patterns in large data sets: items or attribute values that tend to occur together, so that a rule of the form (if A, then B) emerges from them. It can be thought of as looking at a jigsaw puzzle where you have each piece but not the order in which they go into place relative to one another. The puzzle is the data set, the pieces are the individual items or attributes, and the rules describe which pieces belong next to each other (e.g., when customers who buy one type of book also tend to buy another, there is an association between them and it makes sense to think of them as belonging together).
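Association rules are usually scored with two measures: support (how often the items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both for hypothetical book purchases; the basket contents are invented.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of transactions that contain all of `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction also containing `consequent`."""
    with_antecedent = count_containing(transactions, antecedent)
    if with_antecedent == 0:
        return 0.0
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / with_antecedent


# Hypothetical purchase baskets, one set of book genres per customer.
baskets = [
    {"mystery", "thriller"},
    {"mystery", "thriller", "cookbook"},
    {"mystery", "cookbook"},
    {"thriller"},
    {"mystery", "thriller"},
]

rule_support = support(baskets, {"mystery", "thriller"})
rule_confidence = confidence(baskets, {"mystery"}, {"thriller"})
print(rule_support, rule_confidence)
```

A rule is typically kept only if both measures exceed thresholds chosen by the analyst; full algorithms such as Apriori automate the search over many item combinations.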
Predictive modeling – Predictive modeling is a form of data mining where one variable is used to predict another based on how it changes over time (e.g., using the quarterly sales in an electronics store to predict next year’s sales). There are also more advanced versions of this where one variable is used to predict many other variables at the same time. It is used for a variety of purposes (e.g., creating sales forecasts, predicting certain aspects of financial performance, projecting returns, etc.).
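As a minimal sketch of the idea, the example below forecasts the next quarter's sales as the average of the most recent quarters and then checks how well that rule would have done on past data. The moving-average method is chosen here only for simplicity, and the sales figures and function names are hypothetical.

```python
def moving_average_forecast(history, window=4):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


def backtest_mae(series, window=4):
    """Mean absolute error of one-step-ahead forecasts made over the series."""
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    errors = [
        abs(series[t] - moving_average_forecast(series[:t], window))
        for t in range(window, len(series))
    ]
    return sum(errors) / len(errors)


# Hypothetical quarterly sales for an electronics store.
quarterly_sales = [100, 110, 120, 130, 140, 150]

next_quarter = moving_average_forecast(quarterly_sales)
error = backtest_mae(quarterly_sales)
print(next_quarter, error)
```

Because these invented sales rise steadily, the moving average lags behind the trend, and the backtest error makes that visible. Evaluating a model on data it has not seen is what separates predictive modeling from simply describing the past.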
Data mining through Causal Inference – This uses statistics to determine cause-and-effect relationships between different variables based on the behavior observed in the data set (e.g., working out which factors actually drive company profitability, rather than merely move together with it). It can be thought of as “what if?” scenarios that show how a change in one variable would affect another.
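The simplest causal estimate is a difference in average outcomes between a group that received some change and a group that did not. The sketch below assumes customers were assigned to the two groups at random, which is a strong assumption; without it, the difference may reflect other factors. The spend figures and the function name are hypothetical.

```python
def average_treatment_effect(treated, control):
    """Difference in mean outcome between treated and control groups.

    Only a valid causal estimate if group assignment was random.
    """
    if not treated or not control:
        raise ValueError("both groups must be non-empty")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Hypothetical monthly spend for customers who did / did not receive a promotion.
spend_with_promotion = [120.0, 130.0, 125.0, 135.0]
spend_without_promotion = [100.0, 110.0, 105.0, 115.0]

effect = average_treatment_effect(spend_with_promotion, spend_without_promotion)
print(effect)
```

This answers a "what if?" question directly: how much spend would change, on average, if a customer were given the promotion.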
Data mining through a process – This type of data mining is similar to causal inference, but it focuses on a specific action that has been taken (or not) and on the outcome of that action. It can be thought of as an association between two variables that occurs due to a process (e.g., associating a customer’s purchase activity with their behavior on social networking sites).