HMDA Loan Application - An SQL exercise

Today we are going to use 2008 HMDA loan application data to look at the relationship between borrowers' demographic features and eventual loan acceptance or denial. We are also going to look at how these demographic factors interact with the different reasons given for loan denial. Before we begin: this is just a quick example of using SQL databases in your analysis pipelines, so while we are looking at real data about real people, other researchers and scholars have done much more complex and interesting work on this topic, and I encourage you to check them out (click here for reading list).

We are going to ask two questions of this dataset today, using SQL to query it.

Question 1

In Durham County, NC, does an applicant’s race affect loan denial rates, when no reason for denial is given?

Step 1

First we need to connect our R notebook to the local SQL database. Thankfully, with the RPostgres and DBI packages, this is exceedingly simple.

#I always load all the packages I expect to use first and set up my R environment properly
library(DBI) #load the database interface package
library(dplyr) #load dplyr for later
library(ggplot2) #we will plot later

#all the below settings can be changed to work with cloud databases as well
con <- dbConnect(RPostgres::Postgres(), #you'll need RPostgres for R to talk to PostgreSQL
                 dbname = 'mortgages', #this is the database I created
                 host = 'localhost', #currently I have this saved locally
                 port = 5432, #default local Postgres port
                 user = 'postgres')
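Before moving on, it's worth a quick sanity check that the connection works and that R can see the table (named hmda, as in the queries below):

#confirm the connection is live and the hmda table is visible
dbListTables(con)          #should include "hmda"
dbListFields(con, "hmda")  #column names, e.g. race1, actiontype, denial1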

Step 2

Now that we are connected to the database, we can send SQL queries from R either through SQL code chunks or by sending commands in R code chunks using dplyr's tbl(). I'm going to opt for the former because I like to keep my languages separate in my notebooks as much as possible.
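For completeness, here is a rough sketch of the tbl() route (a sketch only; it requires the dbplyr package, and it assumes PostgreSQL has folded the unquoted column names to lowercase, which it does by default):

#tbl() builds a lazy reference to the table; dplyr verbs are translated to
#SQL and nothing is pulled into R until collect() is called (needs dbplyr)
hmda_tbl <- tbl(con, "hmda")
hmda_tbl %>%
  filter(state == '37', county == '063') %>%
  count(race1, actiontype) %>%
  collect()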

In this first query, we are going to pull the primary applicant's race and the action taken on the loan application, and then compute the percentage of application outcomes by race, restricted to cases where no reason for denial was given (if the application was denied). I'm also doing some of the computation in the SQL query itself: on the second line, the window function SUM(COUNT(*)) OVER (PARTITION BY RACE1) divides each race-by-outcome count by that race's total, turning raw counts into within-race percentages of each outcome (i.e. ACTIONTYPE).

As a quick technical note for the code, make sure the code chunk header looks like this: {sql, connection=con, output.var="df.q1"}. This connects the code chunk to the database connection created above and saves the output of the SQL query to a variable I can plot later in ggplot2.

SELECT RACE1, ACTIONTYPE, 
      COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(PARTITION BY RACE1) AS Percentage
FROM hmda
WHERE STATE='37' AND COUNTY='063' AND DENIAL1 = ''
GROUP BY ACTIONTYPE, RACE1;
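One quirk worth flagging before we plot: PostgreSQL folds unquoted identifiers to lowercase, so even though the query writes RACE1, ACTIONTYPE, and Percentage, the columns arrive in R as race1, actiontype, and percentage. A quick peek in an R chunk confirms what came back:

#check the column names and types the SQL chunk handed to R
str(df.q1)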

Step 3

Now instead of just printing out a table, let’s plot the results for an easier interpretation.

ggplot(df.q1 %>% filter(race1 != 6, race1 != 7), #drop code 6 (race not provided by applicant) and code 7 (not applicable)
       aes(x = race1, y = percentage, 
           group = interaction(race1, factor(actiontype)), 
           fill = factor(actiontype))) +
  geom_col(color = "black") + #bar heights come straight from the percentage column
  ylab("Percent of applications (%)") + xlab("Race") + #set axis labels
  scale_fill_brewer(palette="Dark2", #use a default color scheme and set labels
                    name="Action to Loan Application", 
                    labels = c("Loan Originated",
                               "Application approved but not accepted",
                               "Application denied by financial institution",
                               "Application withdrawn by applicant",
                               "File closed for incompleteness",
                               "Loan purchased by institution")) +
  scale_x_discrete(labels = c("American Indian \nAlaska Native",
                              "Asian",
                              "Black \nAfrican American",
                              "Native Hawaiian \nPacific Islander",
                              "White")) +
  theme_bw() + #set of design elements that look nice
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) #angle the x-axis labels to fit

Each race's bars add up to 100% because we want to know what happens to applications as a function of race while accounting for the different base rates of applications across groups. Expressing the outcomes as within-race percentages is what lets us make between-race comparisons.
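If you want to verify that normalization, a quick check in R on the df.q1 we already pulled should return (floating point aside) 100 for every race:

#within each race, the outcome percentages should sum to 100
df.q1 %>%
  group_by(race1) %>%
  summarise(total = sum(percentage))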

What does this mean?

  • Most people (I hope) won’t be surprised, though they should be disappointed, to see that the percentage of White applicants with approved loan applications is higher than for most other racial demographics. This is especially true relative to Black and American Indian/Alaska Native applicants, but also to Native Hawaiian/Pacific Islander applicants.

  • Here is where interpreting data is especially important. You should also notice that, in purple, the percentage of applicants who are denied with no reason given is much larger for Black and American Indian/Alaska Native applicants.

  • This leads us to further questions we might want to start asking (and that some researchers have written about): why are Black and American Indian/Alaska Native applicants being denied at higher rates than White applicants with no stated reason?

Question 2

In Durham County, NC, is the income of accepted loan applications different based on race?

Step 1

Here, I will change my output variable to reflect that I am asking a new question: {sql, connection=con, output.var="df.q2"}. In this query we can ask SQL to turn the income variable into a numeric value we can average over using CAST. CAST will throw an error on the literal string 'NA', so we filter those rows out of INCOME in the WHERE clause.

SELECT RACE1, AVG(CAST(INCOME AS NUMERIC)) AS AVG_INCOME
FROM hmda
WHERE STATE='37' AND COUNTY='063' AND ACTIONTYPE = '1' AND INCOME <> 'NA'
GROUP BY RACE1;
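As an aside, the same query can be run from a plain R chunk with DBI::dbGetQuery(). The variant below is a slightly more defensive sketch: NULLIF() maps the literal string 'NA' to a SQL NULL, which AVG() simply ignores, so CAST never sees a non-numeric value:

#equivalent query sent from R; NULLIF makes the WHERE filter on 'NA' unnecessary
df.q2 <- dbGetQuery(con, "
  SELECT RACE1, AVG(CAST(NULLIF(INCOME, 'NA') AS NUMERIC)) AS AVG_INCOME
  FROM hmda
  WHERE STATE='37' AND COUNTY='063' AND ACTIONTYPE = '1'
  GROUP BY RACE1;")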

Step 2

Now let’s plot the results of the SQL query. The code for the plot here is pretty similar to above, so I’ve hidden it this time.
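For anyone following along, here is a minimal sketch of what that hidden code could look like (a sketch only, not my exact code; it assumes the AVG_INCOME alias from the query above, which arrives in R as avg_income, and that HMDA reports income in thousands of dollars):

#same filtering of race codes 6 and 7 as before, then a simple bar chart
ggplot(df.q2 %>% filter(race1 != 6, race1 != 7),
       aes(x = race1, y = avg_income)) +
  geom_col(color = "black", fill = "grey70") +
  ylab("Average income of accepted applicants ($1000s)") + xlab("Race") +
  scale_x_discrete(labels = c("American Indian \nAlaska Native",
                              "Asian",
                              "Black \nAfrican American",
                              "Native Hawaiian \nPacific Islander",
                              "White")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))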

What does this mean?

  • You’ll notice that among accepted loan applications in Durham County, NC, Black applicants had the lowest average income. The difference is stark in the graph, and there is a lot more at play here that contributes to this disparity.

  • However, though White applicants had the highest acceptance rate for loan applications, you’ll notice that they do not have the highest average income. In fact, the average incomes of the remaining racial groups are all within $10k of each other.

Why am I doing this?

I’ve been properly learning SQL through a Coursera course (here) this summer. A few years ago, I quickly loaded up some databases for a perception and decision-making project I was running simulations for, but in the hurry of getting a finished product out I didn’t have time to properly learn the fundamentals of the language. It turns out it is quite useful (surprising no data scientist, ever), especially when you start working with very large amounts of data. Obviously SQL databases are indispensable for many businesses in record keeping and other functions, but here I’m using them in a more exploratory research example, with data downloaded from Data.gov. There are a lot of rich, interesting datasets there; however, loading a ~2.5 GB file into R (or Python) seems excessive when I only need part of that dataset to answer the questions I’m interested in today.

As a first pass – a little exercise outside of class – I logged on to Data.gov and downloaded the first dataset I saw that seemed interesting: the 2008 Home Mortgage Disclosure Act (HMDA) Loan Application Register (LAR) data (see). For this example, I quickly converted this massive CSV into a database using PostgreSQL (see); a rough sketch of that one-time load is below. Though today I only asked simple questions of this data, I can keep building the database as I gather more data from different sources. That keeps everything in a much more dynamic framework, one that can grow with new datasets, rather than leaving a bunch of messy, weirdly named .csv files sitting in a folder. Open and well-kept data is good data.
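Here is roughly what that load looks like; this is only a sketch (the file path is a placeholder, COPY runs server-side so the file must be readable by the PostgreSQL server process, and the hmda table must already exist with columns matching the CSV header):

#one-time bulk load of the LAR csv into the hmda table; the path below is
#illustrative, not the location I actually used
dbExecute(con, "COPY hmda FROM '/path/to/2008_lar.csv'
                WITH (FORMAT csv, HEADER true);")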
