Becoming a Data Analyst in Seven Weeks: How to Build a Thinking Framework for Data Analysis

Author: Qin Lu. Public account: Qin Lu (tracykanc)
Companion video course for this series, taught by Qin Lu: Become a Data Analyst in Seven Weeks

This article is the eighth tutorial in the series How to Become a Data Analyst in Seven Weeks. If you want to understand why the series was written, read the seven-week guide first. Tip: if you are already familiar with data analysis thinking, you can skip this article or read only the parts you need.

Someone once asked me what data analysis thinking is. If analytical thinking is a structured way of reasoning, then data analysis thinking adds one more rule to it:

Not “I think”, but “the data proves it”.

This is the watershed. “I think” is thinking based on intuition and experience. You cannot rely on your own intuition for all your work, and a company can rely on it even less. Proving things with data is the most direct expression of data analysis. It is a data-oriented way of thinking rather than a skill: the former guides, the latter is merely applied.

As individuals, how should we build data analysis thinking?

Build your indicator system

Before we talk about indicators, let us go back several decades. Peter Drucker, the father of modern management, is credited with a classic line:

If you can’t measure it, then you can’t grow it effectively.

Measurement means defining and evaluating the business against a unified standard. That standard is an indicator. Suppose Lao Wang next door opens a fruit shop and you ask him every day how business is going. He might answer “selling well”, “very good”, or “it has been down lately”. These are all vague words: to him, selling well may mean selling 50 units, while to you it means selling 100.

This is the cognitive trap created by “I think”. Move the case into a company and you run into more problems. One operations colleague says the product is performing well because many users leave reviews and praise every day, and shows you a few screenshots. Another says the product has problems because the promoted goods are selling poorly. Whom do you believe?

In fact, it is hard to believe either of them. Opinion-based judgments like these all stem from a lack of data analysis thinking.

To describe his business, Lao Wang should use sales volume; that is his indicator. To describe an Internet product, you should likewise use indicators such as activity rate, usage rate, and conversion rate.

If you can’t describe the business with indicators, then you can’t grow it effectively.

Understanding and using indicators is the first step in data analysis thinking. Next you need to build an indicator system. Isolated indicators cannot bring out the value of data. Like analytical thinking, indicators can be organized into a structure.

Take Internet products as an example. A user goes through these steps from start to finish, and e-commerce apps and content platforms are broadly similar. Think about which indicators you would need at each step.

The following chart shows what it means to turn a process into indicators. This is the difference between having data analysis thinking and lacking it, and it is also what typical data-driven operations look like.

An indicator system has no universal template; different business models have different indicator systems. Mobile apps differ from websites, SaaS differs from e-commerce, and low-frequency consumption differs from high-frequency consumption. A wedding-related app, for example, does not need to consider repurchase rate; Internet finance must have risk-control indicators; and in e-commerce, seller and buyer indicators are different.

These take industry experience and business knowledge to learn and master. Are there any general techniques and caveats?

Good and bad indicators

Not all indicators are good, and assuming they are is a common beginner’s mistake. Let us return to Lao Wang’s fruit shop. Is sales volume a good indicator?

Prices have been rising lately. Lao Wang raised his fruit prices accordingly, but did not dare raise them much. Although fruit sales volume did not change noticeably, Lao Wang made very little money over the month, not even enough to set aside some private savings.

Lao Wang sold 2,000 units of fruit of all kinds this month, yet ended up with a loss. On closer inspection, although sales volume was high, inventory was also high, and every month several hundred units sat unsold for too long and were written off as spoilage.

Both examples show that looking only at sales volume is unreliable. Sales volume is a measure, but it is not a good indicator. A small business owner like Lao Wang should take the fruit shop’s profit as the core indicator.

Good indicators should be core driving indicators. All indicators matter, but some matter more. Take sales volume versus profit, or total users versus active users: in each pair the latter is more important than the former.

Core indicators are not just numbers written in the weekly report; they are the goal of the entire operations team, product team, and even R&D team.

Core driving indicators are tied to the company’s development: they mark the company’s key direction during a given phase. Remember that it is a phase. The core driving indicators differ at different times, and they also differ from business to business.

The common core indicators at Internet companies are user count and activity rate. User count represents market size and share, and activity rate represents the health of the product. However, these are the core indicators of the growth stage. During the product’s 1.0 period, we should focus on polishing the product and raising its quality before promoting it heavily, so retention rate is the core indicator at that point. In the later stages, once the product has a certain user base, commercialization matters more than activity, and we pay attention to money-related indicators such as click-through rate and profit margin.

Core driving indicators are generally the company’s overall goals. Looking at individual job responsibilities, you can also find your own core indicators. Content operations, for example, can focus on read count and reading time.

Core driving indicators should bring the greatest benefit to the company and to the individual. Remember the 80/20 rule? Roughly 20% of the indicators bring 80% of the results, and that 20% is the core.

Good indicators have one more characteristic: they should be a rate or a ratio.

Take the number of active users as an example. We have 100,000 active users. What does that tell us? By itself, nothing. If the product has tens of millions of registered users, then 100,000 active users is very unhealthy and the product is in decline. If the product has only four or five hundred thousand users, then its stickiness is very high.

Because a raw active-user count means little, operations and product teams care more about activity rate. This indicator is a ratio: the number of active users divided by the total number of users. So when setting up an indicator, always consider whether it can be expressed as a ratio.
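
As a minimal sketch of the ratio described above, the following Python snippet computes activity rate for the same 100,000 active users against two user bases. The function name and the exact totals (20 million and 450,000) are illustrative assumptions, not figures from a real product.

```python
def activity_rate(active_users: int, total_users: int) -> float:
    """Return active users as a fraction of all registered users."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return active_users / total_users


# The same 100,000 active users against two hypothetical user bases.
large_base = activity_rate(100_000, 20_000_000)  # "tens of millions" of users
small_base = activity_rate(100_000, 450_000)     # "four or five hundred thousand"

print(f"large base: {large_base:.1%}")
print(f"small base: {small_base:.1%}")
```

The raw count is identical in both cases; only the ratio separates a declining product from a sticky one.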

What are the bad indicators?

The first is the vanity indicator, which has no practical significance.

The product got hundreds of thousands of impressions in the app store. Does that matter? No, what I need is actual downloads. Do downloads matter much? Not much, because what I really want is users who register successfully. Impressions and downloads are both vanity indicators, just to different degrees.

New media teams chase read counts on WeChat public accounts. If you rely on readership to sell advertising, then read counts are meaningful. If you rely on articles to sell goods, you should pay more attention to conversion rate and product sales. After all, an exaggerated title can bring a high read count, and in that case the read count is a vanity indicator. Unfortunately, many bosses still tirelessly pursue 100,000+ reads, even if it takes inflating the numbers.

Vanity indicators are meaningless. They tend to look good and can dress up the performance of operations and product teams, but we must avoid them.

The second bad indicator is the a posteriori (lagging) indicator, which reflects only what has already happened.

For example, suppose I define a churned user as one who has not opened the app for three months. The churned users counted each day have then already been away for a long time. By that point it is too late, and it is hard to win them back with any measure. I now know that some earlier operational misstep hurt users, but is that still useful?

The ROI (return on investment) of a campaign is also an a posteriori indicator: you can only know a campaign’s return after paying its cost. By then the cost has been spent, and whether the campaign was good or bad is already settled. If the campaign runs for a long time, there is still room to adjust. If it is short, this indicator can only be used for the post-mortem and cannot drive the business.
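
For reference, ROI is commonly computed as net gain divided by cost. The sketch below uses that common definition with invented campaign figures; the article itself does not give a formula, so treat the function as an assumption.

```python
def roi(revenue: float, cost: float) -> float:
    """Return on investment: net gain divided by cost (common definition)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) / cost


# Hypothetical campaign: both inputs are only known after the money is spent.
campaign_roi = roi(revenue=13_000, cost=10_000)
print(f"campaign ROI: {campaign_roi:.0%}")
```

Both arguments become available only once the campaign has run, which is exactly why the indicator lags.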

The third bad indicator is the overly complex indicator, which traps the analysis in a pile of indicators.

Indicators can be subdivided and broken down. Activity rate, for example, can be subdivided into daily, weekly, and monthly activity rate, as well as existing-user activity rate. An analysis should pick the indicators that fit the case. For a weather tool, choose daily activity rate; for a social app, weekly activity rate; for a lower-frequency product, monthly activity rate.
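
The daily, weekly, and monthly variants differ only in the window over which a user counts as active. A minimal sketch, assuming an event log of (user_id, date) pairs and approximating a month as 30 days; the log, the user ids, and the window lengths are all made up for illustration.

```python
from datetime import date, timedelta


def active_rate(events, total_users, end, window_days):
    """Share of total_users with at least one event in the window ending on `end` (inclusive)."""
    start = end - timedelta(days=window_days - 1)
    active = {user_id for user_id, day in events if start <= day <= end}
    return len(active) / total_users


# Hypothetical open-app events for a product with 5 registered users.
EVENTS = [
    ("u1", date(2024, 1, 7)),
    ("u2", date(2024, 1, 7)),
    ("u1", date(2024, 1, 6)),
    ("u3", date(2024, 1, 3)),
    ("u4", date(2023, 12, 20)),
]
TOTAL_USERS = 5
TODAY = date(2024, 1, 7)

daily = active_rate(EVENTS, TOTAL_USERS, TODAY, window_days=1)
weekly = active_rate(EVENTS, TOTAL_USERS, TODAY, window_days=7)
monthly = active_rate(EVENTS, TOTAL_USERS, TODAY, window_days=30)
print(daily, weekly, monthly)
```

Which window to report depends on how often the product is naturally used, as the paragraph above describes.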

Each product has a few indicators that suit it. Do not pile a bunch of indicators onto it. If you prepare twenty or thirty indicators for an analysis, you will find you do not know where to start.

Structure of indicators

Since indicators can be too many and too complex, how do we choose the right ones?

Like the pyramid structure in analytical thinking, indicators have an inherent structure shaped like a tree. Building the indicator structure is guided by the business process and by structure.

Suppose you work in content operations and need to analyze the existing business and improve the content-related numbers. What would you do?

We turn pyramid thinking into a data analysis method.

Start from the content operations process: content collection, content editing, user browsing, user click, user reading, user comment or forward, continue to the next article.

This is a standard process, and indicators can be set up for each step. Content collection can use a hot-topic index to see which topics are hottest. User browsing and user clicks are covered by the standard PV and UV statistics, and user reading is measured by reading time.

Building the indicator framework from the process lets you cover the relevant data comprehensively, with nothing left out.
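
The process-based framework can be sketched as a small funnel: count PV (events) and UV (distinct users) at each step, then the share of users who move on to the next step. The stage names and the event log below are hypothetical, chosen to mirror the later steps of the process described above.

```python
from collections import defaultdict

STAGES = ["browse", "click", "read", "share"]  # hypothetical stage names


def funnel(events):
    """Return (stage, PV, UV) for each stage from (user_id, stage) events."""
    pv = defaultdict(int)
    uv = defaultdict(set)
    for user_id, stage in events:
        pv[stage] += 1
        uv[stage].add(user_id)
    return [(stage, pv[stage], len(uv[stage])) for stage in STAGES]


def step_conversion(rows):
    """UV conversion rate from each stage to the next one."""
    rates = {}
    for (prev, _, prev_uv), (nxt, _, next_uv) in zip(rows, rows[1:]):
        rates[f"{prev}->{nxt}"] = next_uv / prev_uv if prev_uv else 0.0
    return rates


# Invented event log: u1 browses twice and goes all the way to sharing.
EVENTS = [
    ("u1", "browse"), ("u1", "browse"), ("u2", "browse"),
    ("u3", "browse"), ("u4", "browse"),
    ("u1", "click"), ("u2", "click"),
    ("u1", "read"),
    ("u1", "share"),
]

rows = funnel(EVENTS)
rates = step_conversion(rows)
for stage, page_views, unique_users in rows:
    print(stage, page_views, unique_users)
print(rates)
```

Laying the indicators out step by step makes it easy to see where users drop off, and which step has no indicator at all.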

The indicators listed in this framework must still follow the principles above: there must be core driving indicators, vanity indicators should be removed, the list should be trimmed where appropriate, and indicators should not be added just for the sake of adding them.

Dimensional analysis

Once you have indicators, you can start analyzing. Data analysis falls roughly into three categories: the first analyzes data along dimensions, the second uses statistical knowledge such as data distributions and hypothesis testing, and the last uses machine learning. Let’s start with dimensional analysis.

A dimension is a parameter that describes an object. In a concrete analysis, we can think of it as an angle from which to analyze things. Sales is one angle, activity rate is one angle, and time is one angle, so all of them can be treated as dimensions.

Once we have dimensions, we can combine them to form a data model. A data model is not a profound concept; it is a data cube.

The figure above shows a data model (data cube) made up of three dimensions: product type, time, and region. From it we can read the sales of electronic products in Shanghai in the second quarter of 2010, or the sales of books in Jiangsu in the first quarter of 2010.

The data model organizes complex data in a structured way. The indicators we discussed earlier can all serve as dimensions. Here are some examples:

Combine the three dimensions of user type, activity, and time to observe how different user groups use the product. For example, is group A’s usage duration notably higher?

Combine the three dimensions of product type, order amount, and region to observe whether sales of different products differ across regions.

A data model lets you observe data from different angles and levels, which makes analysis more flexible and meets different analysis needs. This process is called OLAP (online analytical processing). It does involve more complex data modeling and data warehousing, but we do not need to know those in detail.

The data model also has several common operations: drill-down, roll-up, and slice.

Drill-down subdivides a dimension further. For example, Zhejiang Province is split into Hangzhou, Wenzhou, and Ningbo, and the first quarter of 2010 into January, February, and March. Roll-up is the opposite of drill-down: it aggregates a dimension, for example merging Zhejiang, Shanghai, and Jiangsu into a single Jiangsu-Zhejiang-Shanghai dimension. Slicing selects a specific value of a dimension, such as only Shanghai or only the first quarter of 2010. Although data cubes are multidimensional, we can only observe and compare data in two dimensions, that is, in tables.

The tree structure in the figure above represents drill-down (subdividing source and time); specific data is then obtained by slicing the Route dimension on the value air.
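
To make the three operations concrete, here is a minimal sketch in plain Python. It treats the cube as a list of fact rows and implements roll-up and drill-down as aggregation at coarser or finer levels, and slice as a filter. All names and sales figures are invented for illustration; a real OLAP system would do this in a data warehouse or BI tool.

```python
from collections import defaultdict

FIELDS = ("product", "quarter", "month", "province", "city", "sales")
FACTS = [
    ("electronics", "2010Q1", "2010-01", "Zhejiang", "Hangzhou", 120),
    ("electronics", "2010Q1", "2010-02", "Zhejiang", "Ningbo", 80),
    ("books", "2010Q1", "2010-03", "Zhejiang", "Wenzhou", 30),
    ("electronics", "2010Q1", "2010-01", "Shanghai", "Shanghai", 200),
    ("books", "2010Q1", "2010-02", "Jiangsu", "Nanjing", 50),
    ("electronics", "2010Q2", "2010-04", "Shanghai", "Shanghai", 260),
]
ROWS = [dict(zip(FIELDS, fact)) for fact in FACTS]


def aggregate(rows, dims):
    """Sum sales grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


def slice_rows(rows, **fixed):
    """Keep only rows where each named dimension equals the given value."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


# Roll-up: aggregate cities up to the province level.
by_province = aggregate(ROWS, ["province"])
# Drill-down: split Zhejiang into its cities, and 2010Q1 into months.
zhejiang_cities = aggregate(slice_rows(ROWS, province="Zhejiang"), ["city"])
q1_months = aggregate(slice_rows(ROWS, quarter="2010Q1"), ["month"])
# Slice: fix province = Shanghai, leaving a two-dimensional table.
shanghai = aggregate(slice_rows(ROWS, province="Shanghai"), ["product", "quarter"])

print(by_province)
print(zhejiang_cities)
print(q1_months)
print(shanghai)
```

Each result is a flat table keyed by one or two dimensions, which is the two-dimensional view the text says we end up comparing.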

As you may have guessed, the PivotTable we use so often is a form of dimensional analysis: put the dimensions you want to analyze into the rows and columns, then compute sums, counts, and averages. Here is a picture from an earlier case: average salary computed along the dimensions of city and years of work experience.
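
The same pivot can be sketched in a few lines of Python: group records by a row dimension and a column dimension, then average the value in each cell. The records, field names, and salary figures below are invented; they stand in for the dataset used in the earlier case.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records; salary units are arbitrary.
RECORDS = [
    {"city": "Beijing", "experience": "1-3 years", "salary": 10.0},
    {"city": "Beijing", "experience": "1-3 years", "salary": 12.0},
    {"city": "Beijing", "experience": "3-5 years", "salary": 18.0},
    {"city": "Shanghai", "experience": "1-3 years", "salary": 9.0},
    {"city": "Shanghai", "experience": "1-3 years", "salary": 11.0},
    {"city": "Shanghai", "experience": "3-5 years", "salary": 20.0},
]


def pivot_mean(records, row_key, col_key, value_key):
    """Average value_key for every (row, column) combination."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[row_key], rec[col_key])].append(rec[value_key])
    return {cell: mean(values) for cell, values in cells.items()}


table = pivot_mean(RECORDS, "city", "experience", "salary")
for (city, experience), avg_salary in sorted(table.items()):
    print(f"{city:10} {experience:10} {avg_salary:.1f}")
```

Swapping `mean` for `sum` or `len` gives the sum and count variants a PivotTable offers.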

Besides Excel, dimensional analysis can be done with BI tools, R, and Python. BI tools are relatively the simplest.

When it comes to dimensional analysis, one core idea deserves emphasis: comparison. Comparing across dimensions is probably one of the best shortcuts for newcomers to improve quickly. Compare past and present along the time trend, compare regions, compare product types, compare user groups. A single number has no analytical meaning; only by combining multiple numbers can data deliver its full value.

Suppose I want to analyze the company’s profit, where profit = sales - cost. I find the indicators/dimensions involved in sales, such as product type, region, and user group, and by repeatedly combining and breaking them down I locate the cause of a problem or of good performance. The same is true for cost.
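
A minimal sketch of this breakdown: compute profit = sales - cost per row, then regroup the same rows along different dimensions to see where a problem sits. The products, regions, and amounts are invented for illustration.

```python
# Hypothetical rows: one per (product, region) with sales and cost.
ROWS = [
    {"product": "fruit", "region": "east", "sales": 100, "cost": 60},
    {"product": "fruit", "region": "west", "sales": 80, "cost": 90},
    {"product": "snacks", "region": "east", "sales": 50, "cost": 30},
    {"product": "snacks", "region": "west", "sales": 40, "cost": 35},
]


def profit_by(rows, dim):
    """Total profit (sales - cost) grouped by one dimension."""
    totals = {}
    for row in rows:
        totals[row[dim]] = totals.get(row[dim], 0) + row["sales"] - row["cost"]
    return totals


total_profit = sum(r["sales"] - r["cost"] for r in ROWS)
by_product = profit_by(ROWS, "product")
by_region = profit_by(ROWS, "region")

print(total_profit)
print(by_product)
print(by_region)
```

In this made-up data the total and the product view both look healthy, and only the region view reveals that the west is losing money, which is the point of comparing along several dimensions.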

This is the right way to think in data analysis. To summarize: we establish and filter indicators from the business, use indicators as dimensions, and analyze along dimensions.

Many people will ask, what is the difference between indicators and dimensions?

Dimensions are the angles from which we describe and observe things; indicators are the standards by which we measure data. Dimensions are the broader concept and are not limited to measured data: time and city are dimensions, yet we cannot express them as indicators. Indicators (retention, bounce rate, browsing time, and so on), on the other hand, can become dimensions. In plain terms: dimensions > indicators.

At this point we have a thinking framework for data analysis. It is only a framework because specific skills are still missing, such as how to verify that a given dimension is the key factor affecting the data, or how to use machine learning to improve the business. These involve knowledge of data and statistics and will be explained later.

Here I want to stress that data analysis is a process, not a result. Remember the line “If you can’t measure it, then you can’t grow it effectively”? The ultimate goal of data analysis is to grow the business. If data analysis needs a performance metric, it should not be whether the analysis was right or wrong, but how much the final numbers improved.

Data analysis needs feedback. When my analysis shows that some factor drives a business result, I verify it: I tell the operations and product staff, then look at the numbers after the change. Everything is judged by the results. If the results do not improve, reflect on the analysis process.

This is also a key element of data analysis: it is result-oriented. If an analysis amounts only to presenting a report, with no follow-up and no improvement measures afterwards, then its value is zero.

Business guides data, and data drives business. This is the only way.


Now for the answer to the question from the previous article, which may have kept everyone waiting.

You are a data analyst at Taobao and need to estimate Double 11 sales. You cannot get any data for Double 11 itself or for the days before it; you can only get data from November 12 onward. How would you estimate it?

Because it is an open question, there is no fixed answer.

Everyone’s answers fall into two categories:

The first uses the sales of the following Double 11 to infer the 2016 figure. The drawback is that it takes a year; the advantage is that it is almost absurdly simple.

The second uses sales data from after November 12 to extrapolate backward, applying some weights along the way. The drawback is that Double 11 is a spike, so the prediction error is large; the advantage is that it is easy to carry out.

Since the question is mainly about analytical thinking and the aim is to find possible approaches, are there any other ways?

Let’s open our minds. Sales reflect the goods sold; are there other dimensions that do the same? We may think of return rate and product review rate. Since Double 11 goods can only be returned, exchanged, or reviewed once they have been received, from the 12th onward, we can estimate total sales from the usual average daily rates of these two indicators together with the total number of returns/exchanges and reviews on Double 11 products. The return rate will certainly run a little high (after all, Double 11 sees a lot of returns), so the product review rate is the more accurate of the two.
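
The arithmetic behind this idea is a simple division: if a known share of orders usually gets reviewed (or returned), then the observed count divided by that share estimates the number of orders. A minimal sketch, with all counts and rates invented; the function name is hypothetical.

```python
def estimate_orders(observed_count: int, baseline_rate: float) -> float:
    """Estimate order volume as observed events divided by the usual event rate."""
    if not 0 < baseline_rate <= 1:
        raise ValueError("baseline_rate must be in (0, 1]")
    return observed_count / baseline_rate


# Assumed inputs: reviews and returns observed from Nov 12 onward on Double 11
# items, and the usual (non-Double-11) daily average rates.
from_reviews = estimate_orders(observed_count=1_000_000, baseline_rate=0.25)
from_returns = estimate_orders(observed_count=450_000, baseline_rate=0.10)

print(f"estimate from reviews: {from_reviews:,.0f} orders")
print(f"estimate from returns: {from_returns:,.0f} orders")
```

Because the Double 11 return rate is higher than usual, dividing by the usual return rate overstates the order count, which is why the review-based estimate is the one to trust more. Multiplying the order estimate by an average order value would turn it into a sales amount.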

Is there any other way? Of course. For example, many people pay for Double 11 purchases with Ant Huabei, so could sales be estimated from the share of subsequent repayments?

And if we widen the scope further? Although I do not have Taobao’s data for the day, I can look for external data, such as Jingdong: find out how many times its usual sales Jingdong’s Double 11 sales are, then use that multiple to estimate Taobao’s.
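
The external-data approach reduces to applying a peer platform’s peak-to-usual multiple to your own usual daily sales. A minimal sketch with invented numbers; it assumes the two platforms spike by a similar factor, which is the weak point of the method.

```python
def estimate_from_peer(own_usual_daily_sales: float,
                       peer_usual_daily_sales: float,
                       peer_event_sales: float) -> float:
    """Scale own usual daily sales by the peer's event-to-usual multiple."""
    if peer_usual_daily_sales <= 0:
        raise ValueError("peer_usual_daily_sales must be positive")
    multiple = peer_event_sales / peer_usual_daily_sales
    return own_usual_daily_sales * multiple


# All figures are made up and unitless.
estimate = estimate_from_peer(
    own_usual_daily_sales=100,   # from data available after Nov 12
    peer_usual_daily_sales=20,   # peer platform on an ordinary day
    peer_event_sales=300,        # peer platform on Double 11
)
print(estimate)
```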

The overall analysis structure breaks down as follows:

External data:

  • Double 11 sales on Jingdong and other platforms

Internal data:

  • Product data: product review rate, return rate, product sales
  • Payment data: Ant Huabei usage rate, etc.