“Running into debt isn’t so bad. It’s running into creditors that hurts.” –Unknown
Lending Club (“LC”) is the world’s largest peer-to-peer online lending platform. It uses advanced data analytics to reduce the cost of lending and borrowing for individuals. The function of peer-to-peer companies is to match people who have money to invest with people who want to borrow money. As a leader in this industry, LC completed its initial public offering in December 2014, which significantly improved public trust in this fast-growing company.
In this project, we will explore data published by Lending Club and try to dig out some inspiring and valuable insights.
In this analysis, we use 11 packages for data manipulation, data visualization, missingness and correlation analysis, and feature engineering.
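The full list of 11 packages is not reproduced here, but the functions called later imply at least the following; a minimal loading block, limited to packages actually evidenced by the code below:
# Packages exercised by the code in this write-up (a subset of the 11 used)
library(dplyr)        # data manipulation: select, filter, group_by, summarise
library(ggplot2)      # plotting: line charts, bar charts, boxplots
library(choroplethr)  # state-level maps: state_choropleth()
library(treemap)      # tree maps: treemap()
library(VIM)          # missingness visualization: aggr()
library(corrplot)     # correlation maps: corrplot()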
The Lending Club Issued Loans dataset contains information about 1,021,286 loans and 111 variables, covering loans issued from 2007 through the first quarter of 2016 (2016Q1). The dataset is about 550 MB and is available on the Lending Club corporate website.
# Preparation of the first part of data: 12-13/14/16Q1.
loandf_org1 <- rbind(loandf_12_13, loandf_14, loandf_16Q1)
loandf_org1$issue_d <- as.Date(gsub("^", "01-", loandf_org1$issue_d), format = "%d-%b-%y")
# Preparation of the second part of data: 07-11/15, which has a different
# date format from the part above
loandf_org2 <- rbind(loandf_07_11, loandf_15)
loandf_org2$issue_d <- as.Date(gsub("^", "01-", loandf_org2$issue_d), format = "%d-%y-%b") # format the date
loandf_org <- rbind(loandf_org1, loandf_org2)
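A quick sanity check, not part of the original code, confirms that both date formats parsed cleanly after combining the two parts:
# Count issue dates that failed to parse and inspect the overall date range
sum(is.na(loandf_org$issue_d))
range(loandf_org$issue_d, na.rm = TRUE)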
Forty-one variables were selected from both technical and statistical perspectives in this analysis. A detailed explanation of all variables is available in the LC Data Dictionary.
# 41 variables, selection criteria: limited missingness, understandable,
# useful
var_name = c("id", "member_id", "loan_amnt", "term", "int_rate", "installment",
"grade", "sub_grade", "emp_title", "emp_length", "home_ownership", "annual_inc",
"verification_status", "issue_d", "loan_status", "purpose", "title", "addr_state",
"dti", "earliest_cr_line", "open_acc", "total_acc", "total_pymnt", "total_rec_prncp",
"total_rec_int", "open_il_6m", "open_il_12m", "open_il_24m", "mths_since_rcnt_il",
"total_bal_il", "il_util", "all_util", "total_rev_hi_lim", "mort_acc", "mths_since_recent_bc",
"mths_since_recent_bc_dlq", "num_actv_bc_tl", "num_op_rev_tl", "tot_hi_cred_lim",
"total_bal_ex_mort", "total_bc_limit")
loandf <- loandf_org %>% select(one_of(var_name))
# Save the pre-selected dataset
save(loandf, file = "./Lending_Club/loandf.RData")
Seven trends are investigated:
* The growth of issued loans, both in terms of dollars and volume
* The geographical distribution of loans
* The purposes for loans
* The interest rate changes over time
* The interest rate distribution over different grades
* The grade distribution over different home ownership statuses
* The relation between annual income and its verification status
The first step to understanding the growth of the company is to examine the volume and the amount of issued loans.
# Loan amount and volume changes from 2007 to 2016
amt_df <- loandf %>%
select(issue_d, loan_amnt) %>%
na.omit() %>% # remove incomplete cases (rows with 'NA')
group_by(issue_d) %>%
dplyr::summarise(amount = sum(loan_amnt, na.rm = T), volume = n(), avgAmt = amount/volume)
# Changes of amount
g_amt <- ggplot(amt_df, aes(x = issue_d))
g_amt + geom_line(aes(y = amount), color = 'red') +
labs(title = 'Loan Amount by Month', x = 'Date Issued', y = 'Amount($)') +
theme_linedraw()
# Changes of volume
g_vol <- ggplot(amt_df, aes(x = issue_d))
g_vol + geom_line(aes(y = volume), color = 'red') +
labs(title = 'Loan Volume by Month', x = 'Date Issued', y = 'Volume') +
theme_linedraw()
Lending Club was launched in 2007, and its business has grown significantly since the start of 2012. From 2014 to 2016Q1, monthly loan amount and volume increased drastically, although, as the two line graphs above show, the growth during these spikes is volatile. The growth paths of loan amount and volume are similar in the two graphs. In the long term, we can still expect a high growth rate based on the company’s performance over the date range shown.
# Grade, term, volume, and year
# Extract 'year' from 'issue_d'
loandf$issue_yr <- format(loandf$issue_d, "20%y")
# Compare loans with different terms
gtvy_df <- loandf %>% select(issue_yr, grade, term)
gtvy_df <- gtvy_df[complete.cases(gtvy_df), ]
gtvy_df <- gtvy_df[!gtvy_df$issue_yr == 2016, ]
# Two bar charts
g_gtvy <- ggplot(gtvy_df, aes(x = issue_yr))
g_gtvy + geom_bar(aes(fill = grade)) + facet_grid(~term) + labs(title = "Loan Volume by Year",
x = "Issued Year", y = "Volume") + theme_bw()
As seen from these two bar charts, the number of 36-month loans is greater than that of 60-month loans. In addition, the proportion of loans in each grade varies with the term. For 36-month loans, the majority are in grades A, B, and C. For 60-month loans, only a small percentage are in grade A, and most are in grades B, C, D, and E.
## Changes of average amount per loan
g_avgAmt <- ggplot(amt_df, aes(x = issue_d, y = avgAmt))
g_avgAmt + geom_point(color = "cadetblue4", size = 0.5) + geom_smooth(color = "red",
linetype = "dashed", size = 0.7, se = FALSE) + labs(title = "Average Loan Amount by Month",
x = "Date Issued", y = "avgAmount") + theme_bw()
In order to learn more about the growth, we fit a basic trend line to examine changes in the size of a single loan. The chart above shows that the average amount of a single loan increased at a roughly constant rate from 2007 to 2012, grew at a decreasing rate from 2013 to 2014, and remained roughly unchanged from 2015 through 2016Q1.
Understanding the geographical distribution helps us identify the markets of LC’s current business, and it also shows where potential customers are.
# Loan issued locations by volume or amount
locVol_df <- select(loandf, addr_state)
locVol_df <- locVol_df %>% na.omit() %>% group_by(addr_state) %>% dplyr::summarise(value = n())
locAmt_df <- select(loandf, addr_state, loan_amnt)
locAmt_df$loan_amnt <- as.numeric(locAmt_df$loan_amnt) # Integer overflow: +/-2*10^9
locAmt_df <- locAmt_df %>% na.omit() %>% group_by(addr_state) %>% dplyr::summarise(value = sum(loan_amnt,
na.rm = TRUE))
# Detailed parameter preparation for these functions is hidden due to limited space
# Draw two maps
state_choropleth(locVol_df, title = "Loan Volume by State", num_colors = 9)
state_choropleth(locAmt_df, title = "Loan Amount by State", num_colors = 9)
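As the comment notes, some preparation is hidden: state_choropleth() expects a data frame with a region column of lowercase state names and a value column. A minimal sketch of that hidden step, run before the two map calls above, might look like this (abbr_to_region is a hypothetical helper, not part of the original code):
# Hypothetical helper: convert two-letter state abbreviations to the lowercase
# full state names that choroplethr expects in its 'region' column
abbr_to_region <- function(abbr) {
  lookup <- c(setNames(tolower(state.name), state.abb),
              DC = "district of columbia")
  unname(lookup[abbr])
}
locVol_df <- locVol_df %>% mutate(region = abbr_to_region(addr_state)) %>% select(region, value)
locAmt_df <- locAmt_df %>% mutate(region = abbr_to_region(addr_state)) %>% select(region, value)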
From a geographical perspective, California, Texas, New York, Florida, and Illinois have the largest dollar amounts and volumes of loans. California is home to Lending Club’s headquarters, so it is reasonable that this state has more business. For Texas, New York, Florida, and Illinois, the high volume and amount of loans may be related to Lending Club’s promotional activities in these states.
Peer-to-peer lending focuses on both individual borrowing and business borrowing, so it’s important to understand why consumers decide to borrow money.
# Purposes of loans
prp_df <- loandf %>% select(purpose, loan_amnt) %>% na.omit() %>% group_by(purpose) %>%
dplyr::summarise(volume = n(), average_amnt = sum(as.numeric(loan_amnt),
    na.rm = TRUE)/n())
prp_df <- prp_df[!prp_df$purpose == "", ]
treemap(prp_df, index = "purpose", vSize = "volume", vColor = "average_amnt",
range = c(6000, 16000), type = "manual", palette = c("yellow", "green",
"orange", "orange2", "firebrick"), algorithm = "pivotSize", sortID = "-size",
title = "Purposes of Loans", title.legend = "Avg_Amnt", fontfamily.labels = "serif",
fontsize.labels = 16, fontsize.legend = 10, fontface.labels = 1, position.legend = "bottom",
force.print.labels = T, border.col = "white")
Unsurprisingly, debt consolidation is the most common reason for borrowing. The greatest advantage of peer-to-peer lending is the low cost. Loans issued by LC usually charge lower interest rates compared with money provided by traditional banks. Most consumers choose to consolidate debt to enjoy lower borrowing costs.
Notice that the tree map above encodes three variables: the purpose of the loan, the average amount per loan, and the total volume of loans. It gives an overall view of the relationship between loan purposes and the volume and amount of loans.
The size of each rectangle is proportional to the volume of loans for that purpose; debt consolidation and credit card refinancing are the most common reasons for borrowing.
The color reflects the average amount per loan. Loans for debt consolidation, credit cards, housing, and small business usually have higher average amounts than loans for other purposes.
We explored the two loan terms, 36 months and 60 months, and analyzed how the distribution of interest rates differs across grades for these two terms. This helps us examine LC’s core advantage over traditional banks: the algorithm LC applies to evaluate the credit risk of different borrowers.
# Term, interest rate and grade: dataset preprocessing
loandf$int_rate = as.numeric(sub("%", "", loandf$int_rate))
tig_df <- select(loandf, int_rate, sub_grade, grade, term, issue_yr)
tig_df <- tig_df[tig_df$issue_yr %in% c(2010, 2011, 2012, 2013, 2014, 2015),
]
# Stacked boxplots
g_tig <- ggplot(tig_df, aes(grade, int_rate))
g_tig + geom_boxplot(outlier.size = 0.5, color = "red") + facet_grid(term ~
issue_yr) + labs(title = "Interest Rate Distribution by Grade", x = "Grade",
y = "Interest Rate(%)") + theme_bw()
First, looking horizontally across the graph above, the interest rates of loans in different grades show a clear trend over the years. Interest rates increased from 2010 to 2015, especially for low-grade loans such as grades D, E, F, and G. The gaps between grades also widened over time, until the interest rate intervals between adjacent grades became almost equal. This is possibly due in part to updates to LC’s credit policy, and it suggests LC has become increasingly proficient at evaluating loan risk and managing debt.
However, the vertical comparison is less encouraging. According to risk theory, a longer duration carries a higher risk that the loan will not be repaid, so long-term rates are generally higher than short-term ones. In the graph above, we cannot find a clear difference between long-term and short-term loans.
In order to better explore the relationship among term, grade, and interest rate, we analyze the distribution of interest rates separately for the 35 subgrades.
# Interest rate , term and sub_grade:
tisub_df <- mutate(tig_df, term = ifelse(term == " 36 months", "36", "60")) # for more clear display in the following graph
tisub_df1 <- filter(tisub_df, grade %in% c("A", "B", "C", "D"))
tisub_df2 <- filter(tisub_df, grade %in% c("E", "F", "G"))
# Grades A, B, C, D
g_tisub1 <- ggplot(tisub_df1, aes(term, int_rate))
g_tisub1 + geom_boxplot(outlier.size = 0.5, aes(color = term)) + guides(color = F) +
facet_wrap(~sub_grade, nrow = 1) + labs(title = "Interest Rate Distribution by Term of Grades ABCD",
x = "Term", y = "Interest Rate(%)") + theme_bw()
# Grades E, F, G
g_tisub2 <- ggplot(tisub_df2, aes(term, int_rate))
g_tisub2 + geom_boxplot(outlier.size = 0.5, aes(color = term)) + guides(color = F) +
facet_wrap(~sub_grade, nrow = 1) + labs(title = "Interest Rate Distribution by Term of Grades EFG",
x = "Term", y = "Interest Rate(%)") + theme_bw()
As seen from the graphs above, the interest rate generally increases as the subgrade decreases. For subgrades G1-G5, however, the interest rate on 60-month loans appears unchanged, and the rate on 36-month loans in subgrade G5 looks like an outlier. These points are worth deeper analysis with inside knowledge of LC’s business.
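One quick check we can make without inside information is whether the odd-looking G5 boxes simply reflect tiny sample sizes; a minimal sketch:
# Count loans in each G sub-grade / term cell; very small cells would explain
# the unstable-looking G5 boxplots
tisub_df2 %>% filter(grade == "G") %>% dplyr::count(sub_grade, term)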
Last but not least, we find that interest rates for different terms are not as different as risk theory implies. One reason may be inefficiency in LC’s loan pricing, although LC claims it applies ‘Risk Modifiers’ for loan terms.
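To put a number on how small the term spread actually is, we can compare median interest rates by grade and term; a minimal sketch:
# Median interest rate by grade and term
tig_df %>%
    group_by(grade, term) %>%
    dplyr::summarise(median_rate = median(int_rate, na.rm = TRUE))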
After exploring some data about evaluating grades, we continue to examine the effect of grades.
# Relation between loan status (default or not), grade, and term
deft_df <- loandf %>% select(grade, loan_status, term)
deft_df <- deft_df[!deft_df$loan_status %in% c("", "Does not meet the credit policy. Status:Fully Paid",
"Does not meet the credit policy. Status:Charged Off"), ]
## The distribution of grades for loans
g_deft = ggplot(data = deft_df, aes(x = grade))
g_deft + geom_bar(fill = "dodgerblue") + labs(title = "Number of Loans by Grade",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 1: Current & Fully paid
deft_df1 <- filter(deft_df, loan_status %in% c("Current", "Fully Paid"))
g_deftSb1 = ggplot(data = deft_df1, aes(x = grade))
g_deftSb1 + geom_bar(fill = "chartreuse3") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Current & Fully paid",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 2: In Grace Period & Late(16-30 days) & Late(31-120 days)
deft_df2 <- filter(deft_df, loan_status %in% c("In Grace Period", "Late (16-30 days)",
"Late (31-120 days)"))
g_deftSb2 = ggplot(data = deft_df2, aes(x = grade))
g_deftSb2 + geom_bar(fill = "darkgoldenrod1") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Grace Period & Late",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 3: Default & Charged off
deft_df3 <- filter(deft_df, loan_status %in% c("Charged Off", "Default"))
g_deftSb3 = ggplot(data = deft_df3, aes(x = grade))
g_deftSb3 + geom_bar(fill = "firebrick2") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Charged off & Default",
x = "Grade", y = "Count") + theme_bw()
Based on the above four graphs, the proportion of low-grade loans (grades D, E, F, and G) grows as the loan status moves from the best statuses, ‘Current & Fully Paid’, to the worst, ‘Charged Off & Default’. In other words, the long right tail of the distribution becomes shorter, and the two most common grades shift from B and C to C and D. This is not an exact examination of the efficiency of the grades, but it still provides information about the risk of loans in different grades.
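A more direct, though still rough, check is the share of loans per grade that end up in the worst statuses; a minimal sketch (loans that are still ‘Current’ dilute this rate, so it is only a lower bound):
# Rough 'bad rate' per grade: share of Charged Off / Default loans among all
# loans in that grade (many loans are still Current, so this understates risk)
bad_rate_df <- deft_df %>%
    mutate(bad = loan_status %in% c("Charged Off", "Default")) %>%
    group_by(grade) %>%
    dplyr::summarise(volume = n(), bad_rate = mean(bad))
bad_rate_df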
Next, we analyze loan grades across different home ownership statuses.
# Home_ownership <-> grade (bar chart with facets)
# Keep only 'MORTGAGE', 'OWN', and 'RENT'
mort_df <- loandf %>% select(home_ownership, grade, sub_grade)
table(mort_df$home_ownership)
##
## ANY MORTGAGE NONE OTHER OWN RENT
## 2 3 510387 50 182 103665 406982
mort_df <- mort_df[mort_df$home_ownership %in% c("MORTGAGE", "OWN", "RENT"),
] # Other categories have only a few observations
g_mort <- ggplot(mort_df, aes(grade))
g_mort + geom_bar(aes(fill = grade)) + facet_wrap(~home_ownership) + labs(x = "Grade",
y = "Number of Loans", title = "Issued Loans of Different Home Ownership") +
theme_bw()
Based on the bar chart, people in the ‘MORTGAGE’ and ‘RENT’ groups clearly borrow far more than people in the ‘OWN’ group. That is likely because people who own a house outright usually have a better financial situation than others.
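Because the three groups differ so much in size, the grade composition is easier to compare as within-group proportions; a minimal sketch:
# Grade composition within each home-ownership group (row proportions)
round(prop.table(table(droplevels(as.factor(mort_df$home_ownership)), mort_df$grade),
                 margin = 1), 3)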
The following is a simple analysis related to fraud detection.
# Verified or not and annual_inc
vrf_raw_df <- loandf %>% select(verification_status, annual_inc)
vrf_raw_df <- vrf_raw_df[complete.cases(vrf_raw_df), ]
vrf_df <- vrf_raw_df %>% group_by(verification_status) %>% dplyr::summarise(mean = mean(annual_inc),
std = sd(annual_inc))
data.frame(vrf_df)
## verification_status mean std
## 1 Not Verified 68695.11 45698.13
## 2 Source Verified 79705.72 84944.55
## 3 Verified 77891.39 58326.88
The average annual income of borrowers whose income is ‘Not Verified’ is lower than that of the other statuses.
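Since annual income is heavily right-skewed, the group medians and sizes give a more robust comparison; a minimal sketch:
# Median annual income and group size by verification status
vrf_raw_df %>%
    group_by(verification_status) %>%
    dplyr::summarise(n = n(), median_inc = median(annual_inc))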
The third part of our analysis takes a technical perspective. We usually spend most of our time on data cleaning and missingness checks before applying any statistical model to a dataset.
# Missingness
colnames(loandf)
## [1] "id" "member_id"
## [3] "loan_amnt" "term"
## [5] "int_rate" "installment"
## [7] "grade" "sub_grade"
## [9] "emp_title" "emp_length"
## [11] "home_ownership" "annual_inc"
## [13] "verification_status" "issue_d"
## [15] "loan_status" "purpose"
## [17] "title" "addr_state"
## [19] "dti" "earliest_cr_line"
## [21] "open_acc" "total_acc"
## [23] "total_pymnt" "total_rec_prncp"
## [25] "total_rec_int" "open_il_6m"
## [27] "open_il_12m" "open_il_24m"
## [29] "mths_since_rcnt_il" "total_bal_il"
## [31] "il_util" "all_util"
## [33] "total_rev_hi_lim" "mort_acc"
## [35] "mths_since_recent_bc" "mths_since_recent_bc_dlq"
## [37] "num_actv_bc_tl" "num_op_rev_tl"
## [39] "tot_hi_cred_lim" "total_bal_ex_mort"
## [41] "total_bc_limit" "issue_yr"
aggr(loandf, prop = T, number = F, label = T, gap = T, only.miss = T)
Nine variables have about 80% missing values, and six variables have about 5% missing values. We usually handle variables differently depending on their percentage of missingness. Here we remove the nine heavily missing variables because too much information has been lost; for the other six, we can either apply imputation methods or keep only the complete observations.
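A minimal sketch of that removal step, assuming an 80% cutoff as described above:
# Share of missing values per variable; drop columns above the 80% threshold
miss_share <- colMeans(is.na(loandf))
sort(miss_share, decreasing = TRUE)[1:10]      # inspect the worst offenders
loandf_reduced <- loandf[, miss_share < 0.8]   # keep the usable columns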
## earliest_cr_line: the dates have to be reformatted (the raw format is inconsistent)
## This function formats inconsistent date strings such as 'Dec-12' and '12-Dec'
tdate2 <- function(x) {
ifelse((is.na(as.numeric(substring(x, 1, 1)))), as.Date(gsub("^", "01-",
x), format = "%d-%b-%y"), as.Date(gsub("^", "01-", x), format = "%d-%y-%b"))
}
loandf$earliest_cr_line <- as.Date(tdate2(loandf$earliest_cr_line))
loandf$earliest_cr_line <- as.Date(ifelse(loandf$earliest_cr_line > "2016-03-01",
format(loandf$earliest_cr_line, "19%y-%m-%d"), format(loandf$earliest_cr_line)))
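The correction above is needed because %y maps two-digit years 00-68 to 20xx, so an earliest credit line such as ‘Jan-65’ would first parse as 2065; any parsed date later than the last issue month in the data (2016-03-01) must therefore really be a 19xx date. A quick check of the result, not in the original code:
# After the fix, no earliest credit line should post-date the data's last month
range(loandf$earliest_cr_line, na.rm = TRUE)
sum(loandf$earliest_cr_line > as.Date("2016-03-01"), na.rm = TRUE)  # expect 0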
## Only the part without too much missingness is applied to this correlation
## map
cor_var_name = c("loan_amnt", "int_rate", "installment", "sub_grade", "annual_inc",
"issue_d", "dti", "earliest_cr_line", "open_acc", "total_acc", "total_pymnt",
"total_rec_prncp", "total_rec_int", "total_rev_hi_lim", "mort_acc", "mths_since_recent_bc",
"num_actv_bc_tl", "num_op_rev_tl", "tot_hi_cred_lim", "total_bal_ex_mort",
"total_bc_limit")
cor_var <- select(loandf, one_of(cor_var_name))
cor_var <- cor_var[complete.cases(cor_var), ] # remove incomplete cases
# Feature engineering 1: calculate the difference between earliest_cr_line and
# issue_d
cor_var$credit_tm <- as.numeric(cor_var$issue_d - cor_var$earliest_cr_line)
# Feature engineering 2: convert sub_grade to a numeric value
cor_var$num_subgrade <- as.numeric(as.factor(cor_var$sub_grade))
cor_var <- select(cor_var, -sub_grade, -issue_d, -earliest_cr_line) # remove old variables
Feature engineering helps us extract real information from raw data. It can also provide useful features for future models.
Before we apply advanced models, it is necessary to check the models’ assumptions. A correlation map is one of the most important ways to detect multicollinearity among the predictor variables.
# correlation map between subgrades and other factors
summary(cor_var) # check basic information of all variables
## loan_amnt int_rate installment annual_inc
## Min. : 1000 Min. : 5.32 Min. : 23.26 Min. : 0
## 1st Qu.: 8475 1st Qu.: 9.75 1st Qu.: 267.06 1st Qu.: 46200
## Median :13875 Median :12.99 Median : 391.28 Median : 65000
## Mean :15109 Mean :13.17 Mean : 446.67 Mean : 76320
## 3rd Qu.:20000 3rd Qu.:16.20 3rd Qu.: 586.32 3rd Qu.: 90000
## Max. :40000 Max. :28.99 Max. :1536.95 Max. :9550000
## dti open_acc total_acc total_pymnt
## Min. : -1.00 Min. : 1.0 Min. : 2.00 Min. : 0
## 1st Qu.: 12.26 1st Qu.: 8.0 1st Qu.: 17.00 1st Qu.: 2687
## Median : 18.07 Median :11.0 Median : 24.00 Median : 5732
## Mean : 18.83 Mean :11.8 Mean : 25.51 Mean : 8382
## 3rd Qu.: 24.56 3rd Qu.:15.0 3rd Qu.: 32.00 3rd Qu.:11392
## Max. :9999.00 Max. :90.0 Max. :176.00 Max. :57232
## total_rec_prncp total_rec_int total_rev_hi_lim mort_acc
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.000
## 1st Qu.: 1718 1st Qu.: 589.7 1st Qu.: 14300 1st Qu.: 0.000
## Median : 3847 Median : 1251.7 Median : 24200 Median : 1.000
## Mean : 6386 Mean : 1936.7 Mean : 32841 Mean : 1.752
## 3rd Qu.: 8586 3rd Qu.: 2456.4 3rd Qu.: 40603 3rd Qu.: 3.000
## Max. :40000 Max. :23559.0 Max. :9999999 Max. :52.000
## mths_since_recent_bc num_actv_bc_tl num_op_rev_tl tot_hi_cred_lim
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.: 5.000 1st Qu.: 49602
## Median : 14.00 Median : 3.000 Median : 8.000 Median : 112526
## Mean : 24.79 Mean : 3.772 Mean : 8.395 Mean : 172921
## 3rd Qu.: 30.00 3rd Qu.: 5.000 3rd Qu.:11.000 3rd Qu.: 250175
## Max. :616.00 Max. :36.000 Max. :83.000 Max. :9999999
## total_bal_ex_mort total_bc_limit credit_tm num_subgrade
## Min. : 0 Min. : 0 Min. : 1096 Min. : 1.0
## 1st Qu.: 21622 1st Qu.: 7900 1st Qu.: 4199 1st Qu.: 7.0
## Median : 37968 Median : 15100 Median : 5510 Median :11.0
## Mean : 50126 Mean : 21642 Mean : 6071 Mean :11.9
## 3rd Qu.: 62969 3rd Qu.: 28100 3rd Qu.: 7458 3rd Qu.:16.0
## Max. :2921551 Max. :1105500 Max. :25933 Max. :35.0
M <- cor(cor_var) # compute the correlation matrix
corrplot(M, method = "number", title = "Correlation Map of Subgrade & Factors",
type = "lower", order = "FPC", number.cex = 0.5, tl.cex = 0.8)
The correlation map shows the pairwise correlation between each pair of variables in the reduced dataset.
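A compact way to read the map is to list the variable pairs with the strongest correlations; a minimal sketch using an arbitrary 0.8 cutoff:
# Variable pairs whose absolute correlation exceeds 0.8
high_cor <- which(abs(M) > 0.8 & upper.tri(M), arr.ind = TRUE)
data.frame(var1 = rownames(M)[high_cor[, 1]],
           var2 = colnames(M)[high_cor[, 2]],
           cor  = round(M[high_cor], 2))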
In this project, we gave a brief and clear introduction to the growth of LC’s main business. LC is clearly fast-growing, but the increase in loan volume and amount is erratic. This is surprising, since we expect public companies to be more stable, and it suggests some customers still cannot completely trust LC and its products. The drastic fluctuation of LC’s stock price also supports this conclusion.
However, LC’s business model still gives it a great advantage over traditional banks. LC is improving its risk and credit evaluation technology and trying to extend its market from individuals to businesses. Based on the latest growth data, we still believe LC’s issued loans will continue to grow rapidly.
The geographical distribution of issued loans may be influenced by several factors, such as each state’s financial standing, state culture, and LC’s advertising strategies. We need more data for deeper exploration, so combining the current dataset with other datasets would be a good next step.
FICO is widely accepted by traditional banks as a credit index, but LC claims that its algorithm considers more than 2,000 variables in credit evaluation and risk management. We would need more data and more advanced models to compare these two risk grading systems.