Author: Liu Shunxiang. WeChat public account: Data Analysis 1480 (WeChat ID: lsxxx2011)

Companion tutorial: Teach yourself text mining https://edu.hellobi.com/course/181

If you already have a batch of data, you can play with it using statistics, mining algorithms, visualization methods and other techniques. But how do you play when you have no data at all? Next, we will walk through a data analysis that starts from nothing.

This article explains the data analysis process in detail under the following outline:

**1. Data source acquisition;**

**2. Data exploration and cleaning;**

**3. Model construction (clustering and linear regression);**

**4. Model prediction;**

**5. Model evaluation.**

## **First, data source acquisition**

For this article I want to analyze second-hand housing data for Shanghai: which factors affect the price? Which listings naturally group together? How can the price of a listing be predicted? But I have no such sample of data. How, then, do I answer these questions?

In the Internet age, online information is so abundant that grabbing even a small slice of it gives you plenty to work with. Previous issues have already covered how to extract information from the web, using Python as a flexible and convenient tool for writing crawlers. For example:

Crawling Tmall review data with Python

Fetching Douban book information with Python

Downloading web page images with Python

The Shanghai second-hand housing data was likewise obtained with a crawler. The crawled platform is Lianjia, whose listing page looks like this:

The fields I need are the ones in the red box for each listing in each district of Shanghai: **community name, floor plan, area, region, floor, orientation, price and unit price**. A few screenshots of the Python crawler code follow; the full crawler source, the data, and the analysis code are available from the Baidu cloud disk link at the end of this article.

The code in the figure above constructs all of the links that need to be crawled.

The code in this figure extracts the specified fields from each page.

The crawled data looks like this (more than 28,000 listings in total):

## **Second, data exploration and cleaning (all in R)**

Once the data is in hand, the habit is to start with an exploratory analysis, that is, to understand what the data looks like.
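A quick structural pass is usually the very first step. The sketch below runs on a few simulated rows whose column names mirror the crawled fields; the frame `demo.house` and its values are illustrative assumptions, not the author's actual data:

```r
# Simulated rows mirroring the crawled fields (names and values are assumptions)
demo.house <- data.frame(
  户型 = c('2室1厅', '3室2厅', '2室1厅'),
  面积 = c(75.3, 120.8, 68.0),
  价格.W. = c(420, 980, 365),
  建筑时间 = c('2005年建', '', '1998年建'),
  stringsAsFactors = FALSE
)
str(demo.house)               # column types and a preview of each field
summary(demo.house$价格.W.)   # five-number summary of price (in 万元)
colSums(demo.house == '')     # blank strings that should become NA later
```

On the real data, the same three calls reveal the column types, the price range, and the blank construction-year strings that the cleaning steps below will deal with.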

## **1. Floor plan distribution**

```r
# Floor plan distribution
library(ggplot2)
type_freq <- data.frame(table(house$户型))
# Drawing
type_p <- ggplot(data = type_freq, mapping = aes(x = reorder(Var1, -Freq), y = Freq)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) +
  xlab('户型') + ylab('套数')
type_p
```

We find that only a handful of floor plans account for a large number of listings, while all the others are rare: a **long-tailed (severely skewed) distribution**. We therefore merge every floor plan with fewer than 1,000 listings into a single category.

```r
# Merge floor plans with fewer than 1,000 listings into 'other'
type <- c('2室2厅', '2室1厅', '3室2厅', '1室1厅', '3室1厅', '4室2厅', '1室0厅', '2室0厅')
house$type.new <- ifelse(house$户型 %in% type, house$户型, '其它')
type_freq <- data.frame(table(house$type.new))
# Drawing
type_p <- ggplot(data = type_freq, mapping = aes(x = reorder(Var1, -Freq), y = Freq)) +
  geom_bar(stat = 'identity', fill = 'steelblue') +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.5)) +
  xlab('户型') + ylab('套数')
type_p
```

## **2. Distribution of area and price**

```r
# Normality test for area
norm.test(house$面积)
```

```r
# Normality test for price
norm.test(house$价格.W.)
```

**The norm.test function above is a custom function of mine**; its code is also in the download link. The figures show that **neither the area nor the price of the listings follows a normal distribution, so we cannot directly run an analysis of variance on them or build a linear regression model**: both of these statistical methods assume normally distributed data. We will explain how to deal with this problem later.
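The author's actual norm.test() is in the download; as a rough idea of what such a helper does, a hypothetical stand-in might pair a visual check with a Shapiro-Wilk test (this sketch is mine, not the author's code):

```r
# Hypothetical stand-in for the author's norm.test(): histogram with a
# normal density overlay, followed by a Shapiro-Wilk test.
norm.test <- function(x, breaks = 50) {
  x <- x[!is.na(x)]
  m <- mean(x); s <- sd(x)
  hist(x, breaks = breaks, freq = FALSE, main = 'Normality check')
  curve(dnorm(x, mean = m, sd = s), add = TRUE, col = 'red', lwd = 2)
  # shapiro.test() accepts at most 5000 observations, so subsample if needed
  shapiro.test(if (length(x) > 5000) sample(x, 5000) else x)
}
```

A heavily right-skewed variable such as house prices would give a tiny p-value here, matching the skew visible in the histogram.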

## **3. Floor distribution**

The floor field in the raw data has a total of 151 distinct levels, such as "5 floors above ground", "low/6 floors", "middle/11 floors" and "high/40 floors". We collapse these 151 levels into three, low, middle and high, which helps with the modeling later.

```r
# Collapse floors into low, middle and high zones
house$floow <- ifelse(substring(house$楼层, 1, 2) %in% c('低区', '中区', '高区'),
                      substring(house$楼层, 1, 2), '低区')
# Percentage distribution of the floor zones
percent <- paste(round(prop.table(table(house$floow)) * 100, 2), '%', sep = '')
df <- data.frame(table(house$floow))
df <- cbind(df, percent)
df
```

The three floor zones are roughly evenly distributed; the high zone is the largest, at 36.1%.

## **4. Average price by region in Shanghai**

```r
# Average second-hand housing price by region in Shanghai
avg_price <- aggregate(house$单价.平方米, by = list(house$区域), mean)
# Drawing
p <- ggplot(data = avg_price, mapping = aes(x = reorder(Group.1, -x), y = x, group = 1)) +
  geom_area(fill = 'lightgreen') +
  geom_line(colour = 'steelblue', size = 2) +
  geom_point() + xlab('') + ylab('均价')
p
```

Clearly, the three most expensive regions for second-hand housing in Shanghai are Jing'an, Huangpu and Xuhui, all with average unit prices above 75,000 RMB per square meter. The three cheapest are Chongming, Jinshan and Fengxian.

## **5. Heavy missingness in construction year**

The construction-year variable has 6,216 **missing values, 22% of the total sample**. Although the missingness is severe, I cannot simply throw the variable away. **So, grouping the listings by region, I replace each region's missing values with that region's mode (mode replacement by group)**. Two custom functions do this:

```r
library(Hmisc)  # provides impute()
# Custom mode function
stat.mode <- function(x, rm.na = TRUE){
  y <- if (rm.na) x[!is.na(x)] else x
  res <- names(table(y))[which.max(table(y))]
  return(res)
}
# Custom function: impute a column's missing values group by group
my.impute <- function(data, category.col = NULL, miss.col = NULL, method = stat.mode){
  for (i in as.character(unique(data[, category.col]))){
    idx <- data[, category.col] == i
    data[idx, miss.col] <- impute(data[idx, miss.col], method)
  }
  return(data)
}
# Convert blank construction-year strings to missing values
house$建筑时间[house$建筑时间 == ''] <- NA
# Impute the missing values by region, then keep the modeling variables
final_house <- subset(my.impute(house, '区域', '建筑时间'),
                      select = c(type.new, floow, 面积, 价格.W., 单价.平方米, 区域, 建筑时间))
# New field: building age relative to 2016
final_house <- transform(final_house,
                         builtdate2now = 2016 - as.integer(substring(as.character(建筑时间), 1, 4)))
# Drop the original construction-year field
final_house <- subset(final_house, select = -建筑时间)
```

The final clean data set is as follows:

Next, we can further analyze such clean data sets, such as clustering, linear regression and so on.

## **Third, model construction**

With so many houses, how do I classify them? Which properties belong together? This calls for a clustering algorithm. **We use the simple and fast k-means algorithm for the clustering.** But before clustering, how many clusters should I choose? According to the **clustering principle that differences within a group should be small and differences between groups should be large**, we plot the within-group sum of squared deviations for different numbers of clusters. Three numeric variables enter the clustering: area, price and unit price:

```r
tot.wssplot <- function(data, nc, seed = 1234){
  # Within-group sum of squares when all samples form a single cluster
  tot.wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc){
    # The random seed must be fixed for reproducibility
    set.seed(seed)
    tot.wss[i] <- kmeans(data, centers = i, iter.max = 100)$tot.withinss
  }
  plot(1:nc, tot.wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares",
       col = 'blue', lwd = 2, main = 'Choose best Clusters')
}
# Plot the within-group sum of squares for different cluster numbers
standard <- data.frame(scale(final_house[, c('面积', '价格.W.', '单价.平方米')]))
myplot <- tot.wssplot(standard, nc = 15)
```

When all samples form a single cluster, the within-group sum of squared deviations is at its maximum. As the number of clusters grows, it decreases, until in the extreme case each sample is its own cluster and the sum is 0. In the figure, the decline flattens out beyond 5 clusters, so we treat 5 as the inflection point and cluster the data into 5 groups.

```r
# Cluster the samples into 5 groups
set.seed(1234)
clust <- kmeans(x = standard, centers = 5, iter.max = 100)
table(clust$cluster)
```

```r
# Distribution of regions within each cluster
table(final_house$区域, clust$cluster)
```

```r
# Average area by floor plan
aggregate(final_house$面积, list(final_house$type.new), mean)
```

```r
# Average area, price and unit price by cluster
aggregate(final_house[, 3:5], list(clust$cluster), mean)
```

Looking at these averages, the 28,000+ listings can be roughly summarized into the following groups:

**a. Large-home type (3 rooms 2 halls, 4 rooms 2 halls): cluster 2.** The average area is above 130 square meters; these homes are found mainly in Qingpu, Huangpu, Songjiang and similar areas (visible in the per-cluster regional distribution table).

**b. Prime-location type (high unit price): cluster 1.** Typical areas are Huangpu, Xuhui, Changning, Pudong and so on (visible in the per-cluster regional distribution table).

**c. Mass-market type (small area, affordable price, many listings): clusters 4 and 5.** Typical areas include Baoshan, Hongkou, Minhang, Pudong, Putuo, Yangpu, etc.

**d. In-between type (listings hovering between the large-home and prime-location types): cluster 3.** Typical areas include Fengxian, Jiading, Qingpu, Songjiang and similar places; these are also areas likely to rise quickly in the future.

```r
# Scatter plot of area vs. unit price, colored by cluster
p <- ggplot(data = final_house[, 3:5],
            mapping = aes(x = 面积, y = 单价.平方米, color = factor(clust$cluster)))
p <- p + geom_point(pch = 20, size = 3)
p + scale_colour_manual(values = c("red", "blue", "green", "black", "orange"))
```

Next I want to **build a linear regression equation from the available variables (price, area, unit price, floor zone, floor plan, building age, cluster label)** to identify and forecast the drivers of housing prices. Since the data contain discrete variables such as floor plan and floor zone, these must first be **converted into dummy variables**.

```r
# Build dummy variables for floor plan, floor zone and cluster
library(caret)  # provides dummyVars()
# Convert the discrete variables to factors so they can be dummied in one pass
final_house$cluster <- factor(clust$cluster)
final_house$floow <- factor(final_house$floow)
final_house$type.new <- factor(final_house$type.new)
# Pick out all factor variables
factors <- names(final_house)[sapply(final_house, class) == 'factor']
# Turn the factor variables into the right-hand side of a formula
formula <- as.formula(paste('~', paste(factors, collapse = '+')))
dummy <- dummyVars(formula = formula, data = final_house)
pred <- predict(dummy, newdata = final_house)
head(pred)
```

```r
# Bind the dummy variables back onto the final_house data set
final_house2 <- cbind(final_house, pred)
# Select the columns to be modeled
model.data <- subset(final_house2, select = -c(1, 2, 3, 8, 17, 18, 24))
# Fit a linear regression to the data
fit1 <- lm(价格.W. ~ ., data = model.data)
summary(fit1)
```

The summary looks decent: only the building age and the '2室0厅' floor plan parameters are insignificant, while everything else is significant at the 0.01 confidence level. **But don't celebrate too early**: linear regression assumes the dependent variable is normally, or approximately normally, distributed, and as shown earlier the price in this sample is clearly skewed. So here we **apply a Box-Cox transformation** and, based on its estimated lambda, transform the y variable accordingly:

```r
# Box-Cox transformation
library(car)
powerTransform(fit1)
```

According to the results, **the estimated lambda of 0.23 is very close to 0, so per the standard Box-Cox mapping we apply a log transformation to the price.**
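As a cross-check, the same lambda can be estimated with MASS::boxcox, which ships with R. The sketch below runs on simulated log-normal data (so the true lambda is 0), not on the housing data:

```r
library(MASS)  # recommended package bundled with R
set.seed(42)
x <- rnorm(500)
y <- exp(0.3 * x + rnorm(500, sd = 0.5))  # log-normal response: true lambda is 0
bc <- boxcox(lm(y ~ x), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda maximizing the profile log-likelihood
# lambda near 0 -> use log(y); lambda near 1 -> leave y untransformed
lambda
```

On simulated data the estimate lands near 0, which is exactly the situation that justifies the log transform used below.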

```r
fit2 <- lm(log(价格.W.) ~ ., data = model.data)
summary(fit2)
```

**This result is clearly much better than fit1**: only the middle floor zone is merely significant at the 0.1 confidence level, while the remaining variables are all significant at the 0.01 level. **The adjusted R-squared also rises to 94.3%**, meaning these independent variables explain 94.3% of the variation in housing prices.

Finally, let’s take another look at the diagnostic results of the final model:

```r
# Qualitative model diagnostics with the plot method
opar <- par(no.readonly = TRUE)
par(mfrow = c(2, 2))
plot(fit2)
par(opar)
```

The figure shows that the model roughly satisfies the assumptions of the linear regression model: the residuals follow a normal distribution with mean 0 (top-left and Q-Q plots) and roughly constant variance (bottom-left). Based on this model, we can make targeted forecasts of housing prices.
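Since fit2 models log(price), any forecast must be exponentiated back to the price scale. A minimal self-contained sketch of that pattern, on simulated data with illustrative names (not the author's columns or coefficients):

```r
# Hedged sketch: predicting with a log-price model on simulated data
set.seed(123)
demo <- data.frame(面积 = runif(200, 30, 200))
demo$价格 <- exp(0.5 + 0.012 * demo$面积 + rnorm(200, sd = 0.1))
fit <- lm(log(价格) ~ 面积, data = demo)
new.house <- data.frame(面积 = 90)
pred.log <- predict(fit, newdata = new.house)  # prediction on the log scale
pred.price <- exp(pred.log)                    # back-transform to the price scale
pred.price
```

Forgetting the exp() step is a common mistake with log-transformed targets; the raw predict() output here is a log-price, not a price.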

That's it for today. If you have any questions, please leave a message or add me on WeChat (lsx19890717) to chat. The crawler code, R language scripts and data from this article are available at the following link:

**Link: http://pan.baidu.com/s/1c1BFhXe Password: 36dm**