Big data thinking


From sample to full data

In the past, data analysis usually relied on sample data. A sample is a subset drawn from the full population according to the principle of random sampling, and sample-based thinking is still very common. At its core, sampling exists because traditional methods make large-scale, full-population analysis difficult: the cost is high and the efficiency is low. For example, when we were young we often saw periodic large-scale censuses, which required a large number of grassroots personnel to register households door to door, resulting in a long work cycle and low efficiency. Once registration was complete, analysts at a later stage carried out their analysis and inference based on sample thinking. In the era of big data, a great deal of information is digitized and available over the network in real time, and new big data technologies can process massive amounts of data quickly and efficiently. Full-data analysis can now be done easily and at lower cost. Sample analysis reasons from the part to the whole, whereas full analysis directly reflects the objective facts of all the data.
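To make the contrast concrete, here is a minimal sketch that compares an estimate from a random sample with the value computed over the full data set. The population is synthetic (made-up household sizes), and the function name `compare_sample_to_full` is hypothetical, chosen only for this illustration.

```python
import random
import statistics


def compare_sample_to_full(population, sample_size, seed=0):
    """Return (full_mean, sample_mean, absolute_error) for one random sample."""
    rng = random.Random(seed)
    # Simple random sampling without replacement.
    sample = rng.sample(population, sample_size)
    full_mean = statistics.fmean(population)
    sample_mean = statistics.fmean(sample)
    return full_mean, sample_mean, abs(full_mean - sample_mean)


# Assumption: a synthetic "census" of 100,000 household sizes between 1 and 8.
data_rng = random.Random(42)
population = [data_rng.randint(1, 8) for _ in range(100_000)]

full_mean, sample_mean, error = compare_sample_to_full(population, 1_000)
print(f"full mean:   {full_mean:.4f}")
print(f"sample mean: {sample_mean:.4f}")
print(f"abs error:   {error:.4f}")
```

The sample mean only approximates the full mean, and the gap depends on which records happen to be drawn. When the "sample" is the entire population, the gap disappears, which is the point of full analysis.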

From precise to fuzzy

In traditional data analysis, because the amount of data is small, analysts can examine it precisely, even down to a single record. When an anomaly occurs, they can investigate the cause in depth for that individual record. But in the era of big data, with the explosion of data volume, it has become difficult for analysts to attend to that level of detail. Precise thinking is built on small amounts of data, and rules designed that way can break down or change abruptly when confronted with massive amounts of data. Therefore, in the era of big data, analysis puts more emphasis on high-probability events, which is what is meant by fuzziness. This is not to say that we should abandon rigorous and precise thinking, but that we should add fuzzy thinking when working with big data. The most typical case is Google's flu prediction. Google used people's search queries to estimate the likelihood of flu in a given area. This is a form of fuzzy thinking: it cannot be absolutely accurate, but it aims for a high probability of being right.

From causality to association

From the time each of us starts school, cause and effect is one of the typical sentence patterns we learn in Chinese class. From an early age we also learn many formulas in math class, and the reasoning and proofs behind those formulas constantly emphasize causality. To this day, whenever we encounter a problem or a phenomenon, we ask ourselves why. Causal thinking has clearly left a deep imprint on each of us. But anyone who studies data mining knows the "beer and diapers" story. It goes like this. Wal-Mart staff noticed a strange pattern when periodically reviewing product sales data:

Every weekend, a certain supermarket chain sold a lot of beer and diapers. To find out why, they sent staff to investigate. Through observation and interviews, they learned that in American families with young children, the wife often asks her husband to buy diapers on the way home from work, and the husbands, while picking up the diapers, also buy the beer they love to drink while watching football games over the weekend. As a result, sales of beer and diapers rose together. Once they understood the reason, Wal-Mart staff broke with convention and tried placing beer and diapers together.

As a result, sales of both beer and diapers increased sharply, bringing considerable profit to the retailer. The story shows that diapers and beer, two products with no obvious relationship to each other, turned out to sell together. A data mining technique called association rule analysis is designed to uncover exactly this kind of association in data. Through data mining we can see that items are associated, but we do not necessarily know the cause. The association describes the phenomenon from the data perspective, while causality explains it from the business perspective.
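As a rough illustration of what association rule analysis measures, the sketch below computes the three standard metrics (support, confidence and lift) for the rule "diapers -> beer". The shopping baskets are invented for this example and are not real sales data, and the function name `association_metrics` is hypothetical.

```python
def association_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if antecedent in t)
    count_c = sum(1 for t in transactions if consequent in t)
    count_both = sum(
        1 for t in transactions if antecedent in t and consequent in t
    )

    # Support: share of all baskets that contain both items.
    support = count_both / n
    # Confidence: share of antecedent baskets that also contain the consequent.
    confidence = count_both / count_a if count_a else 0.0
    # Lift: confidence relative to how common the consequent is overall.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Assumption: eight made-up shopping baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
    {"milk", "bread", "chips"},
]

support, confidence, lift = association_metrics(baskets, "diapers", "beer")
print(f"support={support}, confidence={confidence}, lift={lift}")
# prints: support=0.375, confidence=0.75, lift=1.5
```

A lift above 1 means the two items appear together more often than they would if they were independent. The metrics say nothing about why; finding the reason still takes the kind of investigation described in the story.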

