“Running into debt isn’t so bad. It’s running into creditors that hurts.” –Unknown
Lending Club (“LC”) is the world’s largest peer-to-peer online lending platform. It uses advanced data analytics to reduce the cost of lending and borrowing for individuals. The function of peer-to-peer companies is to match people who have money to invest with people who want to borrow money. As a leader in this industry, LC completed its initial public offering in December 2014, which significantly improved public trust in this fast-growing company.
In this project, we will explore data published by Lending Club and try to dig out some inspiring and valuable insights.
In this analysis, we use 11 packages for data manipulation, data visualization, missingness and correlation analysis, and feature engineering.
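The full list of 11 packages is not reproduced here, but the functions called later imply at least the following; a minimal loading block, limited to packages actually evidenced by the code below:
# Packages exercised by the code in this write-up (a subset of the 11 used)
library(dplyr)        # data manipulation: select, filter, group_by, summarise
library(ggplot2)      # plotting: line charts, bar charts, boxplots
library(choroplethr)  # state-level maps: state_choropleth()
library(treemap)      # tree maps: treemap()
library(VIM)          # missingness visualization: aggr()
library(corrplot)     # correlation maps: corrplot()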
The Lending Club Issued Loans dataset contains information about 1,021,286 loans and 111 variables, covering loans issued from 2007 through the first quarter of 2016 (2016Q1). The dataset is about 550 MB and is available on the Lending Club corporate website.
# Preparation of the first part of data: 12-13/14/16Q1.
loandf_org1 <- rbind(loandf_12_13, loandf_14, loandf_16Q1)
loandf_org1$issue_d <- as.Date(gsub("^", "01-", loandf_org1$issue_d), format = "%d-%b-%y")
# Preparation of the second part of data: 07-11/15, which has a different
# date format from the part above
loandf_org2 <- rbind(loandf_07_11, loandf_15)
loandf_org2$issue_d <- as.Date(gsub("^", "01-", loandf_org2$issue_d), format = "%d-%y-%b") # format the date
loandf_org <- rbind(loandf_org1, loandf_org2)
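A quick sanity check, not part of the original code, confirms that both date formats parsed cleanly after combining the two parts:
# Count issue dates that failed to parse and inspect the overall date range
sum(is.na(loandf_org$issue_d))
range(loandf_org$issue_d, na.rm = TRUE)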
Forty-one variables were selected from both technical and statistical perspectives in this analysis. A detailed explanation of all variables is available in the LC Data Dictionary.
# 41 variables, selection criteria: limited missingness, understandable,
# useful
var_name = c("id", "member_id", "loan_amnt", "term", "int_rate", "installment",
"grade", "sub_grade", "emp_title", "emp_length", "home_ownership", "annual_inc",
"verification_status", "issue_d", "loan_status", "purpose", "title", "addr_state",
"dti", "earliest_cr_line", "open_acc", "total_acc", "total_pymnt", "total_rec_prncp",
"total_rec_int", "open_il_6m", "open_il_12m", "open_il_24m", "mths_since_rcnt_il",
"total_bal_il", "il_util", "all_util", "total_rev_hi_lim", "mort_acc", "mths_since_recent_bc",
"mths_since_recent_bc_dlq", "num_actv_bc_tl", "num_op_rev_tl", "tot_hi_cred_lim",
"total_bal_ex_mort", "total_bc_limit")
loandf <- loandf_org %>% select(one_of(var_name))
# Save the pre-selected dataset
save(loandf, file = "./Lending_Club/loandf.RData")
Seven trends are investigated:
* The growth of issued loans, both in terms of dollars and volume
* The geographical distribution of loans
* The purposes for loans
* The interest rate changes over time
* The interest rate distribution over different grades
* The grade distribution over different home ownership statuses
* The relation between annual income and its verification status
The first step to understanding the growth of the company is to examine the volume and the amount of issued loans.
# Loan amount and volume changes from 2007 to 2016
amt_df <- loandf %>%
select(issue_d, loan_amnt) %>%
na.omit() %>% # remove incomplete cases (rows with 'NA')
group_by(issue_d) %>%
dplyr::summarise(amount = sum(loan_amnt, na.rm = T), volume = n(), avgAmt = amount/volume)
# Changes of amount
g_amt <- ggplot(amt_df, aes(x = issue_d))
g_amt + geom_line(aes(y = amount), color = 'red') +
labs(title = 'Loan Amount by Month', x = 'Date Issued', y = 'Amount($)') +
theme_linedraw()
# Changes of volume
g_vol <- ggplot(amt_df, aes(x = issue_d))
g_vol + geom_line(aes(y = volume), color = 'red') +
labs(title = 'Loan Volume by Month', x = 'Date Issued', y = 'Volume') +
theme_linedraw()
Lending Club was launched in 2007, and its business has grown significantly since the start of 2012. From 2014 to 2016Q1, monthly loan amount and volume increased drastically, although, as the two line graphs above show, the growth during these spikes is volatile. The growth paths of loan amount and volume are similar in the two graphs. In the long term, we can still expect a high growth rate based on the company’s performance over the date range shown.
# Grade, term, volume, and year
# Extract 'year' from 'issue_d'
loandf$issue_yr <- format(loandf$issue_d, "20%y")
# Compare loans with different terms
gtvy_df <- loandf %>% select(issue_yr, grade, term)
gtvy_df <- gtvy_df[complete.cases(gtvy_df), ]
gtvy_df <- gtvy_df[!gtvy_df$issue_yr == 2016, ]
# Two bar charts
g_gtvy <- ggplot(gtvy_df, aes(x = issue_yr))
g_gtvy + geom_bar(aes(fill = grade)) + facet_grid(~term) + labs(title = "Loan Volume by Year",
x = "Issued Year", y = "Volume") + theme_bw()
As seen from these two bar charts, the number of 36-month loans is greater than that of 60-month loans. In addition, the proportion of loans in each grade varies with the term. For 36-month loans, the majority are in grades A, B, and C. For 60-month loans, only a small percentage are in grade A, and most are in grades B, C, D, and E.
## Changes of average amount per loan
g_avgAmt <- ggplot(amt_df, aes(x = issue_d, y = avgAmt))
g_avgAmt + geom_point(color = "cadetblue4", size = 0.5) + geom_smooth(color = "red",
linetype = "dashed", size = 0.7, se = FALSE) + labs(title = "Average Loan Amount by Month",
x = "Date Issued", y = "avgAmount") + theme_bw()
In order to learn more about the growth, we fit a basic trend line to examine changes in the size of a single loan. The chart above shows that the average amount of a single loan increased at a roughly constant rate from 2007 to 2012, grew at a decreasing rate from 2013 to 2014, and remained roughly unchanged from 2015 through 2016Q1.
Understanding the geographical distribution helps us identify the markets of LC’s current business, and it also shows where potential customers are.
# Loan issued locations by volume or amount
locVol_df <- select(loandf, addr_state)
locVol_df <- locVol_df %>% na.omit() %>% group_by(addr_state) %>% dplyr::summarise(value = n())
locAmt_df <- select(loandf, addr_state, loan_amnt)
locAmt_df$loan_amnt <- as.numeric(locAmt_df$loan_amnt) # Integer overflow: +/-2*10^9
locAmt_df <- locAmt_df %>% na.omit() %>% group_by(addr_state) %>% dplyr::summarise(value = sum(loan_amnt,
na.rm = TRUE))
# Detailed parameter preparation for these functions is hidden due to limited space
# Draw two maps
state_choropleth(locVol_df, title = "Loan Volume by State", num_colors = 9)
state_choropleth(locAmt_df, title = "Loan Amount by State", num_colors = 9)
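As the comment notes, some preparation is hidden: state_choropleth() expects a data frame with a region column of lowercase state names and a value column. A minimal sketch of that hidden step, run before the two map calls above, might look like this (abbr_to_region is a hypothetical helper, not part of the original code):
# Hypothetical helper: convert two-letter state abbreviations to the lowercase
# full state names that choroplethr expects in its 'region' column
abbr_to_region <- function(abbr) {
  lookup <- c(setNames(tolower(state.name), state.abb),
              DC = "district of columbia")
  unname(lookup[abbr])
}
locVol_df <- locVol_df %>% mutate(region = abbr_to_region(addr_state)) %>% select(region, value)
locAmt_df <- locAmt_df %>% mutate(region = abbr_to_region(addr_state)) %>% select(region, value)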
From a geographical perspective, California, Texas, New York, Florida, and Illinois have the largest dollar amounts and volumes of loans. California is home to Lending Club’s headquarters, so it is reasonable that this state has more business. For Texas, New York, Florida, and Illinois, the high volume and amount of loans may be related to Lending Club’s promotional activities in these states.
Peer-to-peer lending focuses on both individual borrowing and business borrowing, so it’s important to understand why consumers decide to borrow money.
# Purposes of loans
prp_df <- loandf %>% select(purpose, loan_amnt) %>% na.omit() %>% group_by(purpose) %>%
dplyr::summarise(volume = n(), average_amnt = sum(as.numeric(loan_amnt),
    na.rm = TRUE)/n())
prp_df <- prp_df[!prp_df$purpose == "", ]
treemap(prp_df, index = "purpose", vSize = "volume", vColor = "average_amnt",
range = c(6000, 16000), type = "manual", palette = c("yellow", "green",
"orange", "orange2", "firebrick"), algorithm = "pivotSize", sortID = "-size",
title = "Purposes of Loans", title.legend = "Avg_Amnt", fontfamily.labels = "serif",
fontsize.labels = 16, fontsize.legend = 10, fontface.labels = 1, position.legend = "bottom",
force.print.labels = T, border.col = "white")
Unsurprisingly, debt consolidation is the most common reason for borrowing. The greatest advantage of peer-to-peer lending is the low cost. Loans issued by LC usually charge lower interest rates compared with money provided by traditional banks. Most consumers choose to consolidate debt to enjoy lower borrowing costs.
Notice that the tree map above encodes three variables: the purpose of the loan, the average amount per loan, and the total volume of loans. It gives an overall view of the relationship between loan purposes and the volume and amount of loans.
The size of each rectangle is proportional to the volume of loans for that purpose; debt consolidation and credit card refinancing are the most common reasons for borrowing.
The color reflects the average amount per loan. Loans for debt consolidation, credit cards, housing, and small business usually have higher average amounts than loans for other purposes.
We explored the two loan terms, 36 months and 60 months, and analyzed how the distribution of interest rates differs across grades for these two terms. This helps us examine LC’s core advantage over traditional banks: the algorithm LC applies to evaluate the credit risk of different borrowers.
# Term, interest rate and grade: dataset preprocessing
loandf$int_rate = as.numeric(sub("%", "", loandf$int_rate))
tig_df <- select(loandf, int_rate, sub_grade, grade, term, issue_yr)
tig_df <- tig_df[tig_df$issue_yr %in% c(2010, 2011, 2012, 2013, 2014, 2015),
]
# Stacked boxplots
g_tig <- ggplot(tig_df, aes(grade, int_rate))
g_tig + geom_boxplot(outlier.size = 0.5, color = "red") + facet_grid(term ~
issue_yr) + labs(title = "Interest Rate Distribution by Grade", x = "Grade",
y = "Interest Rate(%)") + theme_bw()
First, looking horizontally across the graph above, the interest rates of loans in different grades show a clear trend over the years. Interest rates increased from 2010 to 2015, especially for low-grade loans such as grades D, E, F, and G. The gaps between grades also widened over time, until the interest rate intervals between adjacent grades became almost equal. This is possibly due in part to updates to LC’s credit policy, and it suggests LC has become increasingly proficient at evaluating loan risk and managing debt.
However, the vertical comparison is less encouraging. According to risk theory, a longer duration carries a higher risk that the loan will not be repaid, so long-term rates are generally higher than short-term ones. In the graph above, we cannot find a clear difference between long-term and short-term loans.
In order to better explore the relationship among term, grade, and interest rate, we analyze the distribution of interest rates separately for the 35 subgrades.
# Interest rate , term and sub_grade:
tisub_df <- mutate(tig_df, term = ifelse(term == " 36 months", "36", "60")) # for more clear display in the following graph
tisub_df1 <- filter(tisub_df, grade %in% c("A", "B", "C", "D"))
tisub_df2 <- filter(tisub_df, grade %in% c("E", "F", "G"))
# Grades A, B, C, D
g_tisub1 <- ggplot(tisub_df1, aes(term, int_rate))
g_tisub1 + geom_boxplot(outlier.size = 0.5, aes(color = term)) + guides(color = F) +
facet_wrap(~sub_grade, nrow = 1) + labs(title = "Interest Rate Distribution by Term of Grades ABCD",
x = "Term", y = "Interest Rate(%)") + theme_bw()
# Grades E, F, G
g_tisub2 <- ggplot(tisub_df2, aes(term, int_rate))
g_tisub2 + geom_boxplot(outlier.size = 0.5, aes(color = term)) + guides(color = F) +
facet_wrap(~sub_grade, nrow = 1) + labs(title = "Interest Rate Distribution by Term of Grades EFG",
x = "Term", y = "Interest Rate(%)") + theme_bw()
As seen from the graphs above, the interest rate generally increases as the subgrade decreases. For subgrades G1-G5, however, the interest rate on 60-month loans appears unchanged, and the rate on 36-month loans in subgrade G5 looks like an outlier. These points are worth deeper analysis with inside knowledge of LC’s business.
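One quick check we can make without inside information is whether the odd-looking G5 boxes simply reflect tiny sample sizes; a minimal sketch:
# Count loans in each G sub-grade / term cell; very small cells would explain
# the unstable-looking G5 boxplots
tisub_df2 %>% filter(grade == "G") %>% dplyr::count(sub_grade, term)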
Last but not least, we find that interest rates for different terms are not as different as risk theory implies. One reason may be inefficiency in LC’s loan pricing, although LC claims it applies ‘Risk Modifiers’ for loan terms.
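To put a number on how small the term spread actually is, we can compare median interest rates by grade and term; a minimal sketch:
# Median interest rate by grade and term
tig_df %>%
    group_by(grade, term) %>%
    dplyr::summarise(median_rate = median(int_rate, na.rm = TRUE))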
After exploring some data about evaluating grades, we continue to examine the effect of grades.
# Relation between loan status (default or not), grade, and term
deft_df <- loandf %>% select(grade, loan_status, term)
deft_df <- deft_df[!deft_df$loan_status %in% c("", "Does not meet the credit policy. Status:Fully Paid",
"Does not meet the credit policy. Status:Charged Off"), ]
## The distribution of grades for loans
g_deft = ggplot(data = deft_df, aes(x = grade))
g_deft + geom_bar(fill = "dodgerblue") + labs(title = "Number of Loans by Grade",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 1: Current & Fully paid
deft_df1 <- filter(deft_df, loan_status %in% c("Current", "Fully Paid"))
g_deftSb1 = ggplot(data = deft_df1, aes(x = grade))
g_deftSb1 + geom_bar(fill = "chartreuse3") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Current & Fully paid",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 2: In Grace Period & Late(16-30 days) & Late(31-120 days)
deft_df2 <- filter(deft_df, loan_status %in% c("In Grace Period", "Late (16-30 days)",
"Late (31-120 days)"))
g_deftSb2 = ggplot(data = deft_df2, aes(x = grade))
g_deftSb2 + geom_bar(fill = "darkgoldenrod1") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Grace Period & Late",
x = "Grade", y = "Count") + theme_bw()
# Subgroup 3: Default & Charged off
deft_df3 <- filter(deft_df, loan_status %in% c("Charged Off", "Default"))
g_deftSb3 = ggplot(data = deft_df3, aes(x = grade))
g_deftSb3 + geom_bar(fill = "firebrick2") + facet_wrap(~loan_status) + labs(title = "Volume Distribution in Charged off & Default",
x = "Grade", y = "Count") + theme_bw()
Based on the above four graphs, the proportion of low-grade loans (grades D, E, F, and G) grows as the loan status moves from the best statuses, ‘Current & Fully Paid’, to the worst, ‘Charged Off & Default’. In other words, the long right tail of the distribution becomes shorter, and the two most common grades shift from B and C to C and D. This is not an exact examination of the efficiency of the grades, but it still provides information about the risk of loans in different grades.
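A more direct, though still rough, check is the share of loans per grade that end up in the worst statuses; a minimal sketch (loans that are still ‘Current’ dilute this rate, so it is only a lower bound):
# Rough 'bad rate' per grade: share of Charged Off / Default loans among all
# loans in that grade (many loans are still Current, so this understates risk)
bad_rate_df <- deft_df %>%
    mutate(bad = loan_status %in% c("Charged Off", "Default")) %>%
    group_by(grade) %>%
    dplyr::summarise(volume = n(), bad_rate = mean(bad))
bad_rate_df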
Next, we analyze loan grades across different home ownership statuses.
# Home_ownership <-> grade (bar chart with facets)
# Keep only 'MORTGAGE', 'OWN', and 'RENT'
mort_df <- loandf %>% select(home_ownership, grade, sub_grade)
table(mort_df$home_ownership)
##
## ANY MORTGAGE NONE OTHER OWN RENT
## 2 3 510387 50 182 103665 406982
mort_df <- mort_df[mort_df$home_ownership %in% c("MORTGAGE", "OWN", "RENT"),
] # Other categories have only a few observations
g_mort <- ggplot(mort_df, aes(grade))
g_mort + geom_bar(aes(fill = grade)) + facet_wrap(~home_ownership) + labs(x = "Grade",
y = "Number of Loans", title = "Issued Loans of Different Home Ownership") +
theme_bw()
Based on the bar chart, people in the ‘MORTGAGE’ and ‘RENT’ groups clearly borrow far more than people in the ‘OWN’ group. That is likely because people who own a house outright usually have a better financial situation than others.
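Because the three groups differ so much in size, the grade composition is easier to compare as within-group proportions; a minimal sketch:
# Grade composition within each home-ownership group (row proportions)
round(prop.table(table(droplevels(as.factor(mort_df$home_ownership)), mort_df$grade),
                 margin = 1), 3)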
The following is a simple analysis related to fraud detection.
# Verified or not and annual_inc
vrf_raw_df <- loandf %>% select(verification_status, annual_inc)
vrf_raw_df <- vrf_raw_df[complete.cases(vrf_raw_df), ]
vrf_df <- vrf_raw_df %>% group_by(verification_status) %>% dplyr::summarise(mean = mean(annual_inc),
std = sd(annual_inc))
data.frame(vrf_df)
## verification_status mean std
## 1 Not Verified 68695.11 45698.13
## 2 Source Verified 79705.72 84944.55
## 3 Verified 77891.39 58326.88
The average annual income of borrowers whose income is ‘Not Verified’ is lower than that of the other statuses.
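Since annual income is heavily right-skewed, the group medians and sizes give a more robust comparison; a minimal sketch:
# Median annual income and group size by verification status
vrf_raw_df %>%
    group_by(verification_status) %>%
    dplyr::summarise(n = n(), median_inc = median(annual_inc))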
The third part of our analysis takes a technical perspective. We usually spend most of our time on data cleaning and missingness checks before applying any statistical model to a dataset.
# Missingness
colnames(loandf)
## [1] "id" "member_id"
## [3] "loan_amnt" "term"
## [5] "int_rate" "installment"
## [7] "grade" "sub_grade"
## [9] "emp_title" "emp_length"
## [11] "home_ownership" "annual_inc"
## [13] "verification_status" "issue_d"
## [15] "loan_status" "purpose"
## [17] "title" "addr_state"
## [19] "dti" "earliest_cr_line"
## [21] "open_acc" "total_acc"
## [23] "total_pymnt" "total_rec_prncp"
## [25] "total_rec_int" "open_il_6m"
## [27] "open_il_12m" "open_il_24m"
## [29] "mths_since_rcnt_il" "total_bal_il"
## [31] "il_util" "all_util"
## [33] "total_rev_hi_lim" "mort_acc"
## [35] "mths_since_recent_bc" "mths_since_recent_bc_dlq"
## [37] "num_actv_bc_tl" "num_op_rev_tl"
## [39] "tot_hi_cred_lim" "total_bal_ex_mort"
## [41] "total_bc_limit" "issue_yr"
aggr(loandf, prop = T, number = F, label = T, gap = T, only.miss = T)
Nine variables have about 80% missing values, and six variables have about 5% missing values. We usually handle variables differently depending on their percentage of missingness. Here we remove the nine heavily missing variables because too much information has been lost; for the other six, we can either apply imputation methods or keep only the complete observations.
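A minimal sketch of that removal step, assuming an 80% cutoff as described above:
# Share of missing values per variable; drop columns above the 80% threshold
miss_share <- colMeans(is.na(loandf))
sort(miss_share, decreasing = TRUE)[1:10]      # inspect the worst offenders
loandf_reduced <- loandf[, miss_share < 0.8]   # keep the usable columns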
## earliest_cr_line: the dates have to be reformatted (the raw format is inconsistent)
## This function formats inconsistent date strings such as 'Dec-12' and '12-Dec'
tdate2 <- function(x) {
ifelse((is.na(as.numeric(substring(x, 1, 1)))), as.Date(gsub("^", "01-",
x), format = "%d-%b-%y"), as.Date(gsub("^", "01-", x), format = "%d-%y-%b"))
}
loandf$earliest_cr_line <- as.Date(tdate2(loandf$earliest_cr_line))
loandf$earliest_cr_line <- as.Date(ifelse(loandf$earliest_cr_line > "2016-03-01",
format(loandf$earliest_cr_line, "19%y-%m-%d"), format(loandf$earliest_cr_line)))
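The correction above is needed because %y maps two-digit years 00-68 to 20xx, so an earliest credit line such as ‘Jan-65’ would first parse as 2065; any parsed date later than the last issue month in the data (2016-03-01) must therefore really be a 19xx date. A quick check of the result, not in the original code:
# After the fix, no earliest credit line should post-date the data's last month
range(loandf$earliest_cr_line, na.rm = TRUE)
sum(loandf$earliest_cr_line > as.Date("2016-03-01"), na.rm = TRUE)  # expect 0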
## Only the part without too much missingness is applied to this correlation
## map
cor_var_name = c("loan_amnt", "int_rate", "installment", "sub_grade", "annual_inc",
"issue_d", "dti", "earliest_cr_line", "open_acc", "total_acc", "total_pymnt",
"total_rec_prncp", "total_rec_int", "total_rev_hi_lim", "mort_acc", "mths_since_recent_bc",
"num_actv_bc_tl", "num_op_rev_tl", "tot_hi_cred_lim", "total_bal_ex_mort",
"total_bc_limit")
cor_var <- select(loandf, one_of(cor_var_name))
cor_var <- cor_var[complete.cases(cor_var), ] # remove incomplete cases
# Feature engineering 1: calculate the difference between earliest_cr_line and
# issue_d
cor_var$credit_tm <- as.numeric(cor_var$issue_d - cor_var$earliest_cr_line)
# Feature engineering 2: convert sub_grade to a numeric value
cor_var$num_subgrade <- as.numeric(as.factor(cor_var$sub_grade))
cor_var <- select(cor_var, -sub_grade, -issue_d, -earliest_cr_line) # remove old variables
Feature engineering helps us extract real information from raw data. It can also provide useful features for future models.
Before we apply advanced models, it is necessary to check the models’ assumptions. A correlation map is one of the most important ways to detect multicollinearity among the predictor variables.
# correlation map between subgrades and other factors
summary(cor_var) # check basic information of all variables
## loan_amnt int_rate installment annual_inc
## Min. : 1000 Min. : 5.32 Min. : 23.26 Min. : 0
## 1st Qu.: 8475 1st Qu.: 9.75 1st Qu.: 267.06 1st Qu.: 46200
## Median :13875 Median :12.99 Median : 391.28 Median : 65000
## Mean :15109 Mean :13.17 Mean : 446.67 Mean : 76320
## 3rd Qu.:20000 3rd Qu.:16.20 3rd Qu.: 586.32 3rd Qu.: 90000
## Max. :40000 Max. :28.99 Max. :1536.95 Max. :9550000
## dti open_acc total_acc total_pymnt
## Min. : -1.00 Min. : 1.0 Min. : 2.00 Min. : 0
## 1st Qu.: 12.26 1st Qu.: 8.0 1st Qu.: 17.00 1st Qu.: 2687
## Median : 18.07 Median :11.0 Median : 24.00 Median : 5732
## Mean : 18.83 Mean :11.8 Mean : 25.51 Mean : 8382
## 3rd Qu.: 24.56 3rd Qu.:15.0 3rd Qu.: 32.00 3rd Qu.:11392
## Max. :9999.00 Max. :90.0 Max. :176.00 Max. :57232
## total_rec_prncp total_rec_int total_rev_hi_lim mort_acc
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.000
## 1st Qu.: 1718 1st Qu.: 589.7 1st Qu.: 14300 1st Qu.: 0.000
## Median : 3847 Median : 1251.7 Median : 24200 Median : 1.000
## Mean : 6386 Mean : 1936.7 Mean : 32841 Mean : 1.752
## 3rd Qu.: 8586 3rd Qu.: 2456.4 3rd Qu.: 40603 3rd Qu.: 3.000
## Max. :40000 Max. :23559.0 Max. :9999999 Max. :52.000
## mths_since_recent_bc num_actv_bc_tl num_op_rev_tl tot_hi_cred_lim
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 6.00 1st Qu.: 2.000 1st Qu.: 5.000 1st Qu.: 49602
## Median : 14.00 Median : 3.000 Median : 8.000 Median : 112526
## Mean : 24.79 Mean : 3.772 Mean : 8.395 Mean : 172921
## 3rd Qu.: 30.00 3rd Qu.: 5.000 3rd Qu.:11.000 3rd Qu.: 250175
## Max. :616.00 Max. :36.000 Max. :83.000 Max. :9999999
## total_bal_ex_mort total_bc_limit credit_tm num_subgrade
## Min. : 0 Min. : 0 Min. : 1096 Min. : 1.0
## 1st Qu.: 21622 1st Qu.: 7900 1st Qu.: 4199 1st Qu.: 7.0
## Median : 37968 Median : 15100 Median : 5510 Median :11.0
## Mean : 50126 Mean : 21642 Mean : 6071 Mean :11.9
## 3rd Qu.: 62969 3rd Qu.: 28100 3rd Qu.: 7458 3rd Qu.:16.0
## Max. :2921551 Max. :1105500 Max. :25933 Max. :35.0
M <- cor(cor_var) # compute the correlation matrix
corrplot(M, method = "number", title = "Correlation Map of Subgrade & Factors",
type = "lower", order = "FPC", number.cex = 0.5, tl.cex = 0.8)
The correlation map shows the pairwise correlation between each pair of variables in the reduced dataset.
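A compact way to read the map is to list the variable pairs with the strongest correlations; a minimal sketch using an arbitrary 0.8 cutoff:
# Variable pairs whose absolute correlation exceeds 0.8
high_cor <- which(abs(M) > 0.8 & upper.tri(M), arr.ind = TRUE)
data.frame(var1 = rownames(M)[high_cor[, 1]],
           var2 = colnames(M)[high_cor[, 2]],
           cor  = round(M[high_cor], 2))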
In this project, we gave a brief and clear introduction to the growth of LC’s main business. LC is clearly fast-growing, but the increase in loan volume and amount is erratic. This is surprising, since we expect public companies to be more stable, and it suggests some customers still cannot completely trust LC and its products. The drastic fluctuation of LC’s stock price also supports this conclusion.
However, LC’s business model still gives it a great advantage over traditional banks. LC is improving its risk and credit evaluation technology and trying to extend its market from individuals to businesses. Based on the latest growth data, we still believe LC’s issued loans will continue to grow rapidly.
The geographical distribution of issued loans may be influenced by several factors, such as each state’s financial standing, state culture, and LC’s advertising strategies. We need more data for deeper exploration, so combining the current dataset with other datasets would be a good next step.
FICO is widely accepted by traditional banks as a credit index, but LC claims that its algorithm considers more than 2,000 variables in credit evaluation and risk management. We would need more data and more advanced models to compare these two risk grading systems.