1. Is there sufficient price transparency in real estate transactions?
1) REINS: Knowledge Enclosed Within the Real Estate Industry
There is a website called REINS (Real Estate Information Network System).
REINS is a database accessible only to licensed real estate agents. It serves as a platform for sharing listing and transaction information so that agents can do business with one another. You have probably seen the flyers posted in front of real estate offices; REINS is the source of that information. If the public could access it, there would be far less need to go through a broker to find a property. In fact, many people already use real estate search websites.
However, what if you wanted to gather this data, process it, and determine whether a particular property is reasonably priced? Currently, under Japanese law, you need a real estate agent's license to do this. In an era where massive amounts of data, such as currency exchange or stock market trading data, can be easily obtained (sometimes even for free), allowing for statistical analysis to determine fair prices, the real estate industry has been desperately avoiding the broad sharing of such fundamental information.
The excuse from the real estate industry might be something along these lines:
"Real estate is highly individual and only meaningful when explained in detail. Therefore, if users were to analyze this data on their own, it could lead to misleading predictions or simply be a waste of time. For this reason, access to such information should be restricted to professionals."
But if that’s the case, how do they explain that stock market analysis, which is inherently highly individualized, is widely accessible, allowing investors to trade with confidence, thereby invigorating and expanding the market? Even if real estate professionals hoard knowledge, transactions won’t occur unless those who want to trade have easy access to necessary information. It should be considered that such industry practices are, in fact, hindering the revitalization of real estate transactions in Japan.
2) What information do users actually want?
Naturally, as a buyer, one would want to know the fair purchase price of a property based on market conditions, and as a seller, one would want to know the fair selling price of their property, also based on market conditions.
However, what the real estate agent is likely to do is show you a few flyers for nearby properties sold in the past (REINS information) and say something like, "For properties similar to this one, recent prices have been in this range." The agent will then probably suggest, "Considering this information, this seems to be a fair price."
If a friendly real estate agent says this, many people might go along with the suggested price simply because they feel they can trust what the agent is saying. However, it's important to remember that real estate agents are also working hard to meet their sales targets. If they wanted to, they could easily cherry-pick the flyer information pulled from REINS, deciding to "overlook" certain data that might not be convenient. Moreover, just looking at a few nearby transaction records doesn’t guarantee that you’ll arrive at a fair price.
If end users could take a broader view and understand what a statistically fair price might be, they could make decisions more efficiently and with greater confidence.
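To make the cherry-picking concern concrete, here is a minimal sketch. All figures are made up for illustration, and `fair_price` is a hypothetical helper rather than part of the analysis later in this article. It compares an estimate based on the median unit price of every available comparable with one based on only the three most expensive comparables.

```python
from statistics import median

# Hypothetical comparables: (price in units of 10,000 yen, floor area in m2).
# These numbers are invented for illustration only.
comparables = [
    (2980, 25), (3150, 25), (3400, 30), (3600, 30),
    (3300, 25), (4200, 35), (2700, 25), (3900, 30),
]


def unit_prices(records):
    """Price per square metre for each (price, area) record."""
    return [price / area for price, area in records]


def fair_price(records, area):
    """Median unit price of the records, scaled to the given floor area."""
    return median(unit_prices(records)) * area


# Estimate for a 30 m2 unit using every comparable
all_estimate = fair_price(comparables, 30)

# Estimate when only the three highest unit prices are shown
top3 = sorted(comparables, key=lambda r: r[0] / r[1], reverse=True)[:3]
picked_estimate = fair_price(top3, 30)

print(round(all_estimate), round(picked_estimate))  # prints: 3600 3900
```

With these invented numbers, showing only three hand-picked flyers shifts the "fair" price by 3 million yen, which is the kind of gap a broader statistical view would expose.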
3) Real Estate Information Library
There is a government-operated website called the Real Estate Information Library.
The opening section of the document titled "Summary of the Committee for Reviewing Improvements to the Real Estate Transaction Information Provision System" from the Ministry of Land, Infrastructure, Transport and Tourism states the following:
"Real estate is an important asset for the public, and the development of the real estate market is a critical policy issue for revitalizing local economies and improving the quality of life. However, there is a significant information gap between consumers and real estate professionals in the real estate market. In particular, obtaining price information for existing homes is difficult, leading many consumers to feel uncertain about the fairness of transaction prices.
In light of the realities of the real estate market, since April 2007, a consumer information service called RMI (REINS Market Information) has been provided using real estate transaction price (transaction price) information for condominiums and single-family homes held by the Designated Real Estate Distribution Organization (REINS), which is a transaction information exchange system among professionals."
The Ministry of Land, Infrastructure, Transport and Tourism has invested considerable time and cost in the development and operation of the Land Comprehensive Information System (Real Estate Information Library). If this system were to become ineffective, it would reflect poorly on the Ministry. However, there are no substantial evaluation documents assessing its usability. Therefore, this article aims to briefly review it.
Based on the document, it seems that the system may have been diluted in a way that reflects a desire to avoid providing detailed transaction information to users. For example, surveys might have been conducted with previous transaction participants who naturally prefer not to have their transaction details shared publicly. However, prospective buyers and sellers often have a strong interest in knowing past transaction data to make informed decisions. Given the current ease of accessing historical transaction data online, it raises the question of whether a land information system that significantly obscures specific property details is still useful.
Given that the data is available, it would be necessary to assess how effectively the system serves its intended purpose and whether it meets the needs of its users.
2. Can reasonable pricing be achieved with the data from the Real Estate Information Library?
1) Reviewing the contents of the data
Let's take a look at the actual data. For example, reviewing past data from Tokyo might show something like this:
At first glance, it is clear that the data has many gaps. Transaction prices are heavily rounded, and floor area is reported in 5-square-meter increments, which is too coarse for small properties in central Tokyo. Addresses stop at the district level, for example Iidabashi in Chiyoda Ward. It is regrettable that the system seems to have been overly diluted under the guise of personal information protection.
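A quick piece of arithmetic shows why 5-square-meter increments hurt small properties most. The sketch below assumes the published area is rounded to the nearest step, so the true area can differ by up to half a step. The rounding rule is an assumption, and `max_relative_error` is a hypothetical helper.

```python
def max_relative_error(area_m2, step_m2=5.0):
    """Worst-case relative error when the area is rounded to the nearest step.

    Assumes rounding to the nearest multiple of step_m2, so the true value
    may be off by at most step_m2 / 2.
    """
    return (step_m2 / 2) / area_m2


for area in (20, 40, 80):
    print(f"{area} m2: up to +/-{max_relative_error(area):.1%}")
```

Under this assumption a 20 m2 studio carries up to 12.5% uncertainty in area alone, four times the figure for an 80 m2 unit.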
Moreover, as will be shown later, the data is quite dirty, and it appears that even basic data cleaning was not performed before uploading it to the database. This means users are forced to clean the data themselves each time, resulting in a very low level of usability, which is also disappointing.
2) Considerations for Defining the Target When Creating a Model
The analysis was narrowed to second-hand condominiums because they account for a large number of transactions. Within Tokyo, it specifically targeted the central wards shown in the table below, where transactions are relatively numerous. Even with this focus, the number of data points over the five-year period from Q1 2018 to Q4 2022 appears quite small, and there is a strong impression that a significant amount of information remains undisclosed.
Area | Number of Transactions |
---|---|
Minato-ku | 4561 |
Shibuya-ku | 3364 |
Shinjuku-ku | 5651 |
Chiyoda-ku | 1592 |
Shinagawa-ku | 5645 |
Bunkyo-ku | 3748 |
Meguro-ku | 3061 |
Total | 27622 |
3) Source Code
The process from data loading to cleaning is as follows:
In particular, the config.xlsx file contains one worksheet per categorical variable, each holding a table for converting categorical values into numerical values. This allows qualitative information to be replaced with quantitative values. For example, the following table replaces land type categories with numerical IDs.
ID | Type |
---|---|
1 | Residential Land (Land) |
2 | Residential Land (Land and Building) |
3 | Used Condominium, etc. |
4 | Agricultural Land |
5 | Forest Land |
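As a minimal sketch of how such a lookup table can be applied, the snippet below hard-codes the table above as a dictionary and maps category names to IDs. The names `TYPE_TABLE`, `UNKNOWN_ID` and `encode` are hypothetical, and assigning one extra ID to values missing from the table is an assumption made for illustration.

```python
# Lookup table mirroring the worksheet shown above (ID, Type)
TYPE_TABLE = [
    (1, "Residential Land (Land)"),
    (2, "Residential Land (Land and Building)"),
    (3, "Used Condominium, etc."),
    (4, "Agricultural Land"),
    (5, "Forest Land"),
]

# Reverse mapping from category name to numeric ID
TYPE_TO_ID = {name: id_ for id_, name in TYPE_TABLE}

# Assumption: every value missing from the table shares one extra ID
UNKNOWN_ID = len(TYPE_TABLE) + 1


def encode(values):
    """Convert a list of category names into numeric IDs."""
    return [TYPE_TO_ID.get(v, UNKNOWN_ID) for v in values]


print(encode(["Used Condominium, etc.", "Forest Land", "Parking Lot"]))
# prints: [3, 5, 6]
```

A dictionary lookup like this is a single pass over the data, whereas comparing every row against every table entry grows with the size of both.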
Details of other individual modifications are documented in the source code below, so please refer to it as needed.
Additionally, since the original data is in Japanese, variable names and other code content will not be translated.
import glob
import re

import pandas as pd

FILE = "./data/All_20053_20224/13_Tokyo_20053_202242.csv"  # data from the Land Comprehensive Information System
df_data = pd.read_csv(FILE, encoding="cp932")

# Strip the colon from column names (e.g. "最寄駅:距離(分)" becomes "最寄駅距離(分)")
df_data.columns = [re.sub(":", "", c) for c in df_data.columns]

# Restrict the target to used condominiums
target = "中古マンション等"
df_data = df_data[df_data["種類"] == target].reset_index(drop=True)

# Categorical variables ("用途" is not used)
Category = ["種類", "地区名", "地域", "間取り", "土地の形状", "建物の構造", "今後の利用目的",
            "前面道路方位", "前面道路種類", "都市計画", "改装", "取引の事情等"]
# Numeric columns, named as in the raw CSV before the colon is stripped
COL2 = ["最寄駅:距離(分)", "取引価格(総額)", "面積(㎡)", "間口",
        "建築年", "前面道路幅員(m)", "建ぺい率(%)", "容積率(%)",
        "取引時点"]

# Load the file that defines the tables for converting categorical values to integers
ITEM = "./data/config/config.xlsx"
file = glob.glob(ITEM)
# sheet_name=None returns a dict {sheet name: DataFrame}, one sheet per categorical variable
df_item = pd.read_excel(file[0], sheet_name=None)

df_data2 = df_data.copy()
def get_change(df_data2):
    """Replace each categorical value with its row index in the matching config sheet."""
    L = len(df_data2)
    for c in Category:
        M = len(df_item[c])
        for i in range(L):
            hit = 0
            for m in range(M):
                if df_data2.at[i, c] == df_item[c].at[m, c]:
                    df_data2.at[i, c] = m
                    hit = 1
                    break
            if hit == 0:
                # Values missing from the table all share the extra index M
                df_data2.at[i, c] = M
    return df_data2
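# Data cleaning starts here
# Fix 取引時点: rewrite it as a date string so rows can be filtered by date comparison
import unicodedata

Q = ["1", "2", "3", "4"]
for i in range(len(df_data2)):
    # NFKC turns full-width digits into ASCII digits, so both notations match
    c = unicodedata.normalize("NFKC", str(df_data2.at[i, "取引時点"]))
    for q in Q:
        string = "年第" + q + "四半期"
        # Match the whole quarter label; testing only the digit would also hit digits in the year
        if string in c:
            # Use the last month of the quarter: 1 -> /03/01, ..., 4 -> /12/01
            nqc = "/{:02d}/01".format(int(q) * 3)
            df_data2.at[i, "取引時点"] = c.replace(string, nqc)
            break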
# Fix 建築年: convert to the Western calendar so that building age can be computed
def get_chikunen(df_data2):
    wareki = {"昭和": 1925, "平成": 1988, "令和": 2018}  # era name -> offset to the Western year
    L = len(df_data2)
    today = 2023
    for i in range(L):
        s = df_data2.at[i, "建築年"]
        y = 0
        if type(s) != float:  # NaN (missing value) is a float
            s = str(s)
            for w in wareki.keys():
                if w in s:
                    num = re.search(r"\d+", s)
                    if num is not None:
                        y = int(num.group()) + wareki[w]
                    break
        if y == 0:
            y = 1950  # fallback for missing or unparseable values
        df_data2.at[i, "築年数"] = today - y
    return df_data2
# Fix 面積: normalize inconsistent notation
def get_m2(df_data2):
    s = "㎡以上"
    L = len(df_data2)
    for i in range(L):
        c = str(df_data2.at[i, "面積(㎡)"])
        if s in c:
            df_data2.at[i, "面積(㎡)"] = c.replace(s, "").replace(",", "")
        elif "m" in c:
            # Extract a comma-separated number that is followed by "m"
            d = re.findall(r"([1-9]\d{0,2}(,\d{3})*)m", c)
            if d:
                df_data2.at[i, "面積(㎡)"] = int(d[0][0].replace(",", ""))
    return df_data2


def ifin(s, c):
    """Return True if any string in s occurs in str(c)."""
    for ss in s:
        if ss in str(c):
            return True
    return False
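# 最寄駅距離(分)
s = ["分", "時", "H", "M", "?"]  # values containing any of these strings are replaced with max_minutes
max_minutes = 30
for i in range(len(df_data2)):
    c = df_data2.at[i, "最寄駅距離(分)"]
    # Missing values (NaN is a float) and non-numeric strings both become max_minutes
    if type(c) == float or ifin(s, c):
        df_data2.at[i, "最寄駅距離(分)"] = max_minutes

# Fix 間口: strip extra text
s = "m以上"
for i in range(len(df_data2)):
    c = df_data2.at[i, "間口"]
    if type(c) == float:  # skip NaN and valid numeric values such as 15.5
        pass
    elif s in c:
        df_data2.at[i, "間口"] = c.replace(s, "")

# Apply the helpers defined above. The original listing does not show these calls;
# they are assumed to run here because later steps use 築年数 and numeric categories.
df_data2 = get_chikunen(df_data2)
df_data2 = get_m2(df_data2)
df_data2 = get_change(df_data2)

# Save the dataset for machine learning
df_data2.to_csv("./data/temp/df_data2.csv", encoding="cp932", index=False)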
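The quarter and era conversions in the listing above can also be written as small standalone functions, which makes them easy to test in isolation. The following is a sketch under stated assumptions, not the code used for the results in this article. `quarter_to_date` and `era_to_western` are hypothetical names, and the handling of 元年 (the first year of an era) is an addition the original listing does not have.

```python
import re
import unicodedata

_QUARTER = re.compile(r"(\d{4})年第(\d)四半期")
_ERA = re.compile(r"(昭和|平成|令和)(元|\d+)年")
ERA_OFFSET = {"昭和": 1925, "平成": 1988, "令和": 2018}


def quarter_to_date(text):
    """Turn a label such as '2019年第1四半期' into '2019/03/01'; None if it does not match."""
    # NFKC converts full-width digits to ASCII digits before matching
    m = _QUARTER.fullmatch(unicodedata.normalize("NFKC", text))
    if m is None:
        return None
    year, quarter = int(m.group(1)), int(m.group(2))
    return f"{year}/{quarter * 3:02d}/01"


def era_to_western(text):
    """Turn a Japanese era year such as '平成10年' into 1998; None if it does not match."""
    m = _ERA.search(unicodedata.normalize("NFKC", text))
    if m is None:
        return None
    # Assumption: '元' denotes year 1 of the era
    n = 1 if m.group(2) == "元" else int(m.group(2))
    return ERA_OFFSET[m.group(1)] + n


print(quarter_to_date("2019年第１四半期"), era_to_western("令和元年"))
# prints: 2019/03/01 2019
```

Returning `None` for unparseable input leaves the choice of fallback, such as the fixed year 1950 used above, to the caller.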
Up to this point, we have covered the data preprocessing for analysis.
From here on, we will proceed with the analysis tasks.
df = pd.read_csv("./data/temp/df_data2.csv", encoding="cp932")

# Filter by transaction date
from datetime import datetime as dt

df["取引時点"] = pd.to_datetime(df["取引時点"], format="%Y/%m/%d")
th = "取引時点"
th_date = "2019/1/1"
date = dt.strptime(th_date, "%Y/%m/%d")
df = df[date < df[th]].reset_index(drop=True)

# Select the variables used for the analysis
# (not used: "建ぺい率(%)", "容積率(%)", "都市計画", "取引の事情等")
col_apart = ["市区町村コード", "地区名", "間取り", "建物の構造", "今後の利用目的",
             "改装", "最寄駅距離(分)", "面積(㎡)", "築年数"]
# (not used: "取引時点", "用途", "取引の事情等", "都市計画", "建ぺい率(%)", "容積率(%)")
col_X = ["種類", "地域", "間取り", "土地の形状", "建物の構造", "今後の利用目的",
         "前面道路方位", "前面道路種類", "改装",
         "最寄駅距離(分)", "面積(㎡)", "間口", "前面道路幅員(m)", "築年数"]
col_y = ["取引価格(総額)"]

# Build the explanatory variables
X = df.loc[:, col_apart].copy()
X["label"] = df[col_y[0]]

# Filter by municipality code
th = "市区町村コード"
X = X[13101 <= X[th]]
X = X[X[th] <= 13113]
# Filter by floor area
th = "面積(㎡)"
X = X[20 < X[th]]
X = X[X[th] < 40]

dfx = X.reset_index(drop=True)
L = len(dfx)
M = int(0.95 * L)
# iloc excludes the end position, so row M lands only in the test set
# (label-based .loc[0:M] is inclusive and would put row M in both sets)
X_train, X_test = dfx.iloc[:M][col_apart], dfx.iloc[M:][col_apart]
y_train, y_test = dfx.iloc[:M]["label"], dfx.iloc[M:]["label"]
from datetime import datetime as dt

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker


def str2time(s):
    if type(s) == str:
        return dt.strptime(s, "%Y-%m-%d %H:%M:%S")


# Generic time-series plotting helper; it is not called in this analysis
def make_fig(df, title):
    df['Time'] = df['放送日'].apply(str2time)
    fig = plt.figure(figsize=(8, 4), facecolor='gray')
    ax = fig.add_subplot(111, xlabel='Time', ylabel=title, fc='gray')
    ax.xaxis.set_major_locator(ticker.MultipleLocator(100))  # one major tick every 100 units
    ax.xaxis.label.set_color('w')
    ax.yaxis.label.set_color('w')
    ax.spines['top'].set_color('w')
    ax.spines['left'].set_color('w')
    ax.spines['right'].set_color('w')
    ax.spines['bottom'].set_color('w')
    ax.tick_params(axis='x', colors='w')
    ax.tick_params(axis='y', colors='w')
    ax.plot(df['Time'], df[title], color='orange')
    ax.set_title(title)
    fig.autofmt_xdate(rotation=45)
    plt.grid(linestyle='--', color='white')
    plt.show()
    file = './' + title + '.png'
    fig.savefig(file)
def get_pic(x, y, label):
    """Scatter plot of x against y with the diagonal y = x as a reference line."""
    fig = plt.figure(figsize=(7, 5))
    x_max = max(x)
    x_min = min(x)
    print('min', x_min, 'max', x_max, int((x_max - x_min) / 10))
    pitch = int((x_max - x_min) / 10)
    if pitch == 0:
        pitch = 1
    x1 = np.arange(int(x_min), int(x_max), pitch)
    y2 = x1  # a perfect prediction lies on y = x
    # Add Axes to the Figure
    ax = fig.add_subplot(111)
    ax.plot(x1, y2, color="brown", label='perfect')
    ax.scatter(x, y, label=label, color="red", s=10)
    plt.xlabel("prediction")
    plt.ylabel("observation")
    plt.legend()
    plt.show()


from sklearn import linear_model

CLF = linear_model.LinearRegression()


def tankaiki(X, Y):
    """Simple linear regression; returns coefficients, intercept and R2."""
    CLF.fit(X, Y)
    return CLF.coef_, CLF.intercept_, CLF.score(X, Y)


def important(gbm):
    """Report how well a trained model fits the test set and show feature importance."""
    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
    lgb_score = mean_squared_error(y_test, y_pred)
    a, b, r2 = tankaiki(np.array(y_test).reshape(-1, 1), y_pred)
    print('model: Y = {}X + {}: R2:{}'.format(a[0], b, r2))
    lgb.plot_importance(gbm, height=0.5, figsize=(8, 16), importance_type='gain')
    importance = pd.DataFrame(gbm.feature_importance(importance_type='gain'),
                              index=X_train.columns, columns=['importance'])
    importance = importance.sort_values('importance', ascending=False)
    display(importance)  # display() is provided by Jupyter/IPython
# Note: the validation set is the training set itself, so early stopping monitors training error
X_eval = X_train
y_eval = y_train
# Training set
lgb_train = lgb.Dataset(X_train, y_train,
                        # categorical_feature=categorical_features,
                        free_raw_data=False)
# Validation set
lgb_eval = lgb.Dataset(X_eval, y_eval, reference=lgb_train,
                       # categorical_feature=categorical_features,
                       free_raw_data=False)
# Set the parameters
params = {
    'objective': 'mean_squared_error',
    'metric': 'l1',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'learning_rate': 0.05,
    'num_leaves': 21,
    # 'max_depth': 10,
    # 'min_data_in_leaf': 15,
    'seed': 42,
    # 'num_iteration': 150
}
# Training
evaluation_results = {}  # container for the training history
# LightGBM 4.0 removed the early_stopping_rounds, evals_result and verbose_eval
# arguments of lgb.train(); the callbacks below are the equivalent.
model = lgb.train(params,                                  # parameters set above
                  lgb_train,                               # training dataset
                  num_boost_round=1000,                    # maximum number of boosting rounds
                  valid_names=['train', 'valid'],          # names shown in the training history
                  valid_sets=[lgb_train, lgb_eval],        # datasets used for validation
                  callbacks=[
                      lgb.early_stopping(stopping_rounds=20, verbose=False),  # early stopping
                      lgb.record_evaluation(evaluation_results),              # save the history
                      lgb.log_evaluation(period=0),                           # suppress progress output
                  ])
# Predict on the test data
y_pred = model.predict(X_test, num_iteration=model.best_iteration)
lgb_score = mean_squared_error(y_test, y_pred)
importance = pd.DataFrame(model.feature_importance(importance_type='gain'),
                          index=X_train.columns, columns=['importance'])
importance = importance.sort_values('importance', ascending=False)
display(importance)  # display() is provided by Jupyter/IPython
get_pic(y_pred, y_test, 'prediction')
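The listing above validates on the training rows themselves, which gives early stopping little to work with. One possible alternative is a three-way split into training, validation and test rows that never overlap. The sketch below only computes index ranges. `split_indices` and its default fractions are hypothetical, and it was not used to produce the results in this article.

```python
def split_indices(n_rows, valid_frac=0.10, test_frac=0.05):
    """Split row positions 0..n_rows-1 into contiguous train/valid/test ranges.

    The ranges are half-open, so no row belongs to two sets.
    """
    n_test = int(n_rows * test_frac)
    n_valid = int(n_rows * valid_frac)
    n_train = n_rows - n_valid - n_test
    train = range(0, n_train)
    valid = range(n_train, n_train + n_valid)
    test = range(n_train + n_valid, n_rows)
    return train, valid, test


train_idx, valid_idx, test_idx = split_indices(200)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The resulting ranges could be passed to `DataFrame.iloc` to build the three datasets, with the validation rows used for early stopping and the test rows kept for the final check.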
4) Analysis Results
First, when ranking the importance of variables for the predictive model, the results are as follows:
The top three variables, building age (築年数), area (面積), and district name (地区名), align well with intuition. The variables that follow, municipal code (市区町村コード), distance to nearest station in minutes (最寄駅距離(分)), layout (間取り), renovation (改装), and building structure (建物の構造), also seem reasonable. This suggests that the model has produced results that are qualitatively consistent with expectations.
Next, the scatter plot comparing the model's predictions with the actual values is as follows. The axes are in units of ten million yen, and the spread is considerable. For properties with an estimated value of around 30 million yen, there is likely to be an error of approximately plus or minus 10 million yen.
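To put a number on a spread like this, one option is the mean absolute error together with its size relative to the actual price. The sketch below uses made-up prices in units of 10,000 yen; `mae` and `mape` are hypothetical helpers, and the figures are not taken from the model above.

```python
def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the observed value."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented example values (units of 10,000 yen), for illustration only
observed = [2800, 3500, 3100, 4200]
predicted = [3000, 3000, 3300, 3700]

print(mae(observed, predicted))            # prints: 350.0
print(f"{mape(observed, predicted):.1%}")
```

Reporting both figures helps separate an error that is large in yen from one that is large relative to the price of the property.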
However, individual properties differ greatly from one another, so these results alone do not prove that the model's accuracy is poor. There is, of course, considerable room for improvement: prices and areas are rounded, other potentially influential factors may be missing, and the amount of training data may be insufficient.
At the very least, it can be said that even with such a small and rough dataset, it is possible to obtain reasonable results. If the information were more comprehensive, there is a good chance that significant performance improvements could be achieved.
3. Summary
- The Land Comprehensive Information System, when combined with statistical processing such as machine learning, has the potential to provide reasonably accurate price guidelines.
- However, the data is not sufficiently comprehensive for predicting individual property prices reliably.
- The lack of data is a significant hindrance.
- Can the Land Comprehensive Information System be upgraded to include more comprehensive information? As it stands, its usefulness is extremely limited, and a vast amount of taxpayers' money and daily input from stakeholders are wasted, making it a system that does not serve national interests effectively.
- Is it possible for general users to access REINS, even for a fee?
For work inquiries, please contact here:(info@garnetstar.jp)