# Large Datasets: Claude Code for Economists with Paul Goldsmith-Pinkham | Markus Academy | Ep. 162-4

https://www.youtube.com/watch?v=4uwI1-9DafU

[00:08] Welcome back everybody.
[00:10] For another video of the mini video series with Paul Goldsmith-Pinkham on Claude Code for applied economists.
[00:19] Hi Paul. Good to be back.
[00:20] Hi, Markus.
[00:22] I'm looking forward to this. This is fun.
[00:24] So we're kind of taking off the training wheels a little bit today, in the sense that in the last three videos we got started.
[00:33] We sort of learned how Claude works.
[00:35] We've done some basic things making figures.
[00:37] Last time we did a little bit of work thinking about a database and downloading EDGAR files.
[00:42] Today we're going to do a lot more with kind of big data in a structured fashion.
[00:47] We're going to do mainly stuff from Claude Code in the command line.
[00:54] And my hope is that this is really useful because many applied economists work with really big data sets.
[01:03] You know, we now have access to these big administrative data sets, either through
[01:09] the web or somewhere else where you have many many observations, right?
[01:13] Like the one that we're going to do today is going to be what's called the HMDA data set.
[01:16] So this is the Home Mortgage Disclosure Act data set.
[01:18] It's a public data set.
[01:19] Everyone has access to it.
[01:22] And, you know, the whole thing together, when you get the CSVs, is something like 70 gigabytes.
[01:27] Really large files.
[01:29] You can get much more, but that's quite big.
[01:34] And often when we work with these files, you know, none of us are trained as data engineers.
[01:39] We end up working with a lot of flat CSVs.
[01:40] We maybe make a bunch of different copies.
[01:43] We end up just constructing a lot of stuff.
[01:44] And I want to show that one thing that is really nice about working with these agents in Claude Code is that you can more easily adopt better practices for storing data and working with things, in ways that actually open up a lot of new interesting work and improve replication and other things.
[02:03] So that's what we're going to do today.
[02:05] What I'm going to do is I'm just going to go straight to the command line.
[02:07] So
[02:11] we're going to go to the command line and I will show you here.
[02:17] Are you able to see this?
[02:18] So this is let me quickly.
[02:20] So I'm in a command line here.
[02:23] Where this is inside a folder that I have for you and I just want to kind of show you like that I have I've kind of established.
[02:32] I made some prompts beforehand and I want to kind of describe what the project is that we're going to do in detail.
[02:38] Are you able to read this okay?
[02:40] So we're not doing anything with Claude Code yet.
[02:41] I just want to read it together so I can tell you kind of the goal.
[02:44] And I'll tell you, there's a little bit of preprocessing here just because I knew the task and I got this set up.
[02:51] I set up a prompt.
[02:52] I could have very easily done this inside of Claude Code, but just for the purposes of getting started I wanted to do this.
[02:59] So first what we're going to be doing today is we're going to be trying to download and kind of harmonize a large administrative data set that's referred to as HMDA.
[03:10] It stands for the Home Mortgage Disclosure Act, which is a data
[03:12] set that's now produced every year that basically has near universal coverage of all mortgages in the United States when they were originated.
[03:18] Both originated and actually denied.
[03:22] So this data was originally constructed in order to deal with potential racial discrimination in mortgage origination.
[03:30] So the data for this we're going to build a panel that uses original data.
[03:33] So what I've done in this prompt, I set up a prompt to say I want to build a panel data set of mortgage lending in the US from 2007 to 2024 using HMDA data.
[03:45] The goal is to study the geographic expansion of fintech mortgage lenders over time.
[03:49] Right?
[03:51] So a pretty reasonable pretty reasonable problem.
[03:53] And this is something I knew a little bit more about, but of course you might iterate on it: there are two sources for this data.
[04:01] I didn't want Claude to have to spend a ton of time worrying about this.
[04:08] So the first one is 2007 to 2017.
[04:08] The CFPB has these zip files that are online.
[04:14] Mhm.
[04:14] And from 2018 to 2024 there's additional data that's stored here.
[04:21] You're able to get access to these; both of them are publicly available.
[04:26] I'm kind of describing them and they change over time.
[04:28] So what I'm going to tell it is that these are really big data sets.
[04:32] The whole thing is going to have 250 million rows.
[04:34] So it's a big data set.
[04:36] This is not going to fit inside of memory, right?
[04:39] So if you use Stata or something like this, it's quite costly to load this all into Stata or Python or any of these.
[04:49] So you wouldn't want to directly load this into memory.
[04:51] And so what I'm going to ask it to do is I want it to write a script that is going to pull this data down.
[04:57] So what I'm telling it here is: include resume capability.
[05:01] That way in case something gets broken.
[05:04] So that could happen to you, right?
[05:06] You're downloading something and your computer turns off or reboots.
[05:09] You want to have it so that it downloads correctly.
[05:13] Mhm.
[05:14] This is a a user agent header.
[05:16] So this is just the same thing that we had last time.
[05:18] In that when you talk to things on the internet, they want to know who you are.
[05:21] If you do it without this, it will sometimes block you.
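A minimal sketch of what "resume capability" plus a User-Agent header could look like, using only Python's standard library. This is not the script Claude wrote in the video; the contact string and the function names `build_request` and `download` are hypothetical. The idea is to ask the server for only the missing bytes with an HTTP `Range` header when a partial file already exists.

```python
import os
import urllib.error
import urllib.request

# Hypothetical contact string; replace it with your own name and email.
USER_AGENT = "hmda-research-script (contact: you@example.edu)"


def build_request(url, dest_path):
    """Return (request, bytes_already_on_disk) for a resumable download."""
    headers = {"User-Agent": USER_AGENT}
    already = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    if already > 0:
        # Ask the server for only the bytes we do not have yet.
        headers["Range"] = f"bytes={already}-"
    return urllib.request.Request(url, headers=headers), already


def download(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending when the server honors the Range header."""
    request, already = build_request(url, dest_path)
    try:
        with urllib.request.urlopen(request) as response:
            # 206 means only the missing part was sent; 200 means a full restart.
            mode = "ab" if response.status == 206 else "wb"
            with open(dest_path, mode) as out:
                while True:
                    chunk = response.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
    except urllib.error.HTTPError as err:
        # 416 usually means the file on disk is already complete.
        if not (err.code == 416 and already > 0):
            raise
```

Calling `download(url, path)` again after an interruption picks up where the partial file ends, provided the server supports range requests; otherwise it starts over.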
[05:26] Then I want to take this data and convert into just Parquets to make it more efficient.
[05:31] And I'm going to set up something called a DuckDB database.
[05:33] So we briefly talked about this last time, but Markus, I didn't give you all the details on this.
[05:39] What this is, really, if you've heard of a relational database or something with SQL.
[05:43] These are ways of storing data that keep track of relationships between data.
[05:50] So you might imagine, right, that I have one table that stores who all the lenders are.
[05:58] And then another table that has the loans.
[06:00] And of course in the table that has stuff about loans there is going to be a lender identifier.
[06:07] And you might say, 'Oh, I want to take all the lenders in here who have more than a hundred million dollars in assets and I want to link'
[06:14] these over and only pull their loans.
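The lender-and-loan example can be sketched as a SQL join. To keep this runnable with only the standard library it uses SQLite (which comes up later in the conversation) rather than DuckDB; the table names, column names, and values are made up for illustration, and this query uses only standard SQL that DuckDB should also accept.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lenders (lender_id TEXT PRIMARY KEY, name TEXT, assets_usd REAL);
CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, lender_id TEXT, amount_usd REAL);
INSERT INTO lenders VALUES ('A', 'Big Bank', 5e9), ('B', 'Small Lender', 2e7);
INSERT INTO loans VALUES (1, 'A', 300000), (2, 'B', 150000), (3, 'A', 450000);
""")

# Keep only loans made by lenders with more than $100 million in assets.
rows = con.execute("""
SELECT l.loan_id, d.name, l.amount_usd
FROM loans AS l
JOIN lenders AS d ON l.lender_id = d.lender_id
WHERE d.assets_usd > 100000000
ORDER BY l.loan_id
""").fetchall()

for row in rows:
    print(row)
```

The lender identifier is the only thing the two tables share, so lender attributes are stored once rather than repeated on every loan row.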
[06:17] And so these types of relational databases are very good at this kind of information.
[06:21] And what's really good is it has metadata in it.
[06:23] So metadata means additional information about the data that's in there.
[06:31] So it's things like what the name is, a description, the type of data it is, kind of what's valid about it.
[06:37] And what's really nice is that the metadata is going to give information to Claude or any other LLM that you use with it in the future so that it doesn't have to figure out what's going on, right?
[06:51] So imagine if I give you a data set, Markus: there's a difference between a data set where somebody gives you names that have really good labels, right?
[07:00] Versus one that is totally indecipherable.
[07:03] Right?
[07:05] Does that make sense?
[07:07] So labels are metadata as well.
[07:09] Exactly. So if you've ever used Stata, sometimes people talk about value labels and
[07:15] other things.
[07:17] Or variable labels.
[07:19] Those are often kind of additional meta information that's there that makes it kind of more useful to understand what's going on.
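One simple way to keep this kind of metadata next to the data is a data-dictionary table inside the same database. The sketch below is an assumption about how that could look, not the schema built in the video; the table `column_metadata` and the helper `describe` are hypothetical names, and SQLite stands in for DuckDB so that the example needs no extra packages.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loans (action_taken INTEGER, loan_amount REAL)")

# A data dictionary stored alongside the data it describes.
con.execute("""
CREATE TABLE column_metadata (
    table_name TEXT, column_name TEXT, description TEXT, valid_values TEXT
)
""")
con.executemany(
    "INSERT INTO column_metadata VALUES (?, ?, ?, ?)",
    [
        ("loans", "action_taken", "Outcome of the application",
         "1 = originated, 3 = denied"),
        ("loans", "loan_amount", "Loan amount in dollars", "positive number"),
    ],
)


def describe(con, table):
    """Return {column: description} for one table, read from the dictionary."""
    cursor = con.execute(
        "SELECT column_name, description FROM column_metadata WHERE table_name = ?",
        (table,),
    )
    return dict(cursor.fetchall())


labels = describe(con, "loans")
```

Anyone opening the database, a person or an LLM, can run one query to see what each column means instead of guessing from the names.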
[07:25] An important thing here is that there was a big change in the data in 2018.
[07:31] There were different column names and encoding schemes.
[07:34] And so I'm going to ask it to build a crosswalk and harmonize across this.
[07:35] All right.
[07:38] And so I'm telling it kind of the things I care about.
[07:39] And so I'm going to ask it to always use DuckDB in here.
[07:43] It can't kind of load everything and do anything.
[07:45] So it really needs to do everything kind of using SQL.
[07:48] So that's what we're going to do.
[07:51] We're going to start with that.
[07:52] DuckDB is the only provider?
[07:54] It's a particular provider like
[07:55] No, no, no.
[07:57] So there could be any So you can use lots of things.
[07:58] What's very nice about DuckDB is that it's a program that sits on top of what are called Parquet files.
[08:07] So I'll briefly explain, if you don't know: a Parquet file is just another way to store data, just like a CSV.
[08:11] CSV, right, stands for comma separated values.
[08:13] Mhm.
[08:16] Parquets are actually columnar.
[08:19] They basically focus on things in columns.
[08:21] So think about a CSV: every line is kind of a piece of data, right?
[08:27] So you have a line, each piece of information is separated by commas, and then you have a new line, and you kind of cycle through it.
[08:35] With Parquet, a whole column is stored together, and you look through a column, and there's information about the column stored alongside it.
[08:43] And so it can be much more efficient and kind of compressed a lot more compactly.
[08:46] They tend to be more efficient ways of storing data um directly.
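A toy illustration of why storing data by column can compress well, in plain Python. This is not the Parquet format itself, only the intuition: values within one column tend to repeat, so a simple scheme such as run-length encoding (one of the encodings Parquet can use) shrinks them a lot. The sample rows are made up.

```python
from itertools import groupby

# Row-oriented, like a CSV: one record per line.
rows = [
    {"year": 2017, "state": "CA", "amount": 300},
    {"year": 2017, "state": "CA", "amount": 450},
    {"year": 2017, "state": "TX", "amount": 200},
    {"year": 2017, "state": "TX", "amount": 250},
]

# Column-oriented: one list per variable.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


encoded_year = run_length_encode(columns["year"])
encoded_state = run_length_encode(columns["state"])
print(encoded_year, encoded_state)
```

Here the four `year` values collapse to a single pair and the four `state` values to two. A column layout also lets a query read only the columns it needs instead of scanning every full row.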
[08:54] And DuckDB sits on top of it and is very very fast.
[08:57] It's kind of one of the fastest ways of working with data.
[09:00] In Python there's something called Polars, which is similar.
[09:04] But DuckDB is a really nice one.
[09:06] I kind of encourage researchers to use it.
[09:08] There's also you could use something called SQLite.
[09:09] The main reason I'm encouraging this is that often with relational databases you have to do all this work to set up a server.
[09:15] Mhm.
[09:18] SQLite and DuckDB are both ways of getting the benefits of relational databases without having to do anything complicated like set up a server.
[09:26] So that's why we're going to do this.
[09:28] It's everything on No, you don't have to do anything.
[09:29] Yeah, yeah, it's just a file that sits there and you'll work with it and Python or R can both directly query it without anything special.
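To make "it's just a file" concrete, here is a small sketch using SQLite from Python's standard library. DuckDB works on the same single-file, no-server principle, but its Python package is a separate install, so it is not used here. The file name `hmda_demo.db` and the table contents are hypothetical.

```python
import os
import sqlite3
import tempfile

# The whole database is one ordinary file on disk; no server process is involved.
db_path = os.path.join(tempfile.mkdtemp(), "hmda_demo.db")

con = sqlite3.connect(db_path)
con.execute("CREATE TABLE loans (year INTEGER, amount REAL)")
con.executemany("INSERT INTO loans VALUES (?, ?)", [(2017, 300.0), (2024, 450.0)])
con.commit()
con.close()

# Later, or from another script, reopen the same file and query it.
con = sqlite3.connect(db_path)
total = con.execute("SELECT SUM(amount) FROM loans").fetchone()[0]
con.close()
print(db_path, total)
```

Sharing or backing up the database then amounts to copying one file.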
[09:38] So it's just a really nice way of being able to work with big data sets very efficiently.
[09:40] Mhm.
[09:43] So what we're going to do now is we are going to split this screen here.
[09:51] If you remember, we talked about one of the benefits of having this type of terminal; this is called Ghostty.
[10:01] So this is a terminal that we're using.
[10:03] If you remember, Markus, we talked about this, and one of the things I told you was that it can be very nice to have things side by side, for example.
[10:11] And you can see that now I have two terminals side by side, and so one thing I can do is look at the prompt here that I was going to use, and on the
[10:20] other one what I'll do is I'll make a a folder.
[10:25] and we'll go into it.
[10:25] Yes.
[10:27] And we'll say okay now we're going to load Claude.
[10:31] So here we So here we're going to load Claude.
[10:35] and now we have it.
[10:35] So here we're inside it says welcome back and I'm going to take this prompt over here and I'm going to copy it.
[10:44] The only complication that we're going to run into is... okay, so it doesn't want me to do that.
[10:52] This is the problem with the slightly big size.
[10:54] So I'm just going to make it smaller so I can copy it all at once.
[10:58] And going to copy this and we'll make it bigger again.
[11:03] And I'll go over here and I'm going to paste it in.
[11:04] And the one thing I'm going to say is focus only on this subdirectory.
[11:09] Don't scout in the folder above.
[11:11] So one thing that it will do is it will start looking at other folders in this directory and so of course I've
[11:20] built these videos to sort of do here but of course if it starts looking up above it'll say oh this is part of a broader educational series and so I'm going to say let's focus just here.
[11:31] Don't go scouting everywhere else.
[11:35] You could do more things to prevent it from knowing what's going on.
[11:41] That's called sandboxing but I don't want to kind of deal with all those issues.
[11:44] So I've pasted this prompt over and now I am going to run it here.
[11:52] So it's going to do this.
[11:55] So it says, I want to build a county-level data set that looks like this, blah blah blah.
[11:58] So now we can read what it's saying and I'll even make it a little bit bigger so we can focus on this part.
[12:05] So it says let me start by exploring the existing things.
[12:08] So it didn't listen to me.
[12:10] So it's very annoying.
[12:10] I'm going to say no you can't read the other folders.
[12:14] So it's going to get very annoying.
[12:16] So I'm going to say no don't read these.
[12:19] Yes you can do web search commands and no
[12:21] you can't read in there and no you can't read it.
[12:23] So now it's useful to look at what's going on.
[12:26] So it's doing two things.
[12:31] So it's trying to understand what's going on in here.
[12:33] Let me just pause before I let it run more. It says, "Let me begin by exploring the project structure and any related files, and then research the data formats to build a plan."
[12:35] So what's interesting about this is that it's launched two agents.
[12:45] So if you remember we talked about this last time that agents are kind of sub agents where it sends out to do things.
[12:47] And so what it's done here is it's launched two things.
[12:51] One a bunch of scouts to figure out what's going on in here and then the other is to research the data formats.
[12:55] And it's doing these in parallel.
[12:58] So these are two different kind of agents that have been sent out.
[13:00] And so I'm going to say yes you
[13:02] So this one is going to be allowed to search and I am not going to allow it to look here and no it's going to not be allowed to find anything in the previous one and then I'll allow it to look more.
[13:21] I'm allowing it to kind of see here; it really, really wants to.
[13:26] So what I'm going to do I think if it keeps asking me so okay so now what's useful is let me describe you.
[13:32] So if you look here this is what it does.
[13:34] Remember we talked about how agents work.
[13:35] They have their own prompt.
[13:38] Part of the reason they use agents is so that it doesn't use up your own context window.
[13:42] So this one it's saying explore project structure says explore this and parent directory to understand what files are there.
[13:48] What related files are there.
[13:49] Any existing things.
[13:52] Be thorough.
[13:54] Report the full content of key files.
[13:55] So it's looking and I'm not letting it do a lot of these things.
[13:58] And it makes a big difference?
[14:01] I'm sorry.
[14:02] Say again.
[14:04] If you say be thorough.
[14:05] Yes so I didn't tell it to do this.
[14:08] Claude wrote this prompt.
[14:10] Oh.
[14:11] And it's probably it's probably Anthropic has sort of learned that this is a really useful thing to do to say like be thorough.
[14:17] So yes, telling it to be thorough is good so that it double-checks itself.
[14:21] And then this one to research the format it says research the
[14:22] Home Mortgage Disclosure Act data formats for both pre-2018 and post-2018 eras.
[14:27] I need to understand the format of the data structure how to create a crosswalk.
[14:31] What are the key differences?
[14:34] Focus on these variables.
[14:35] Search the web for data dictionary.
[14:37] So it's searching and it's downloading information and it's looking through that.
[14:44] And um so I'm going to continue to not allow it to do this and I'm going to allow it to fetch this content.
[14:48] It's searching the Fed and other places to do this.
[14:54] And I'm going to say yes, you can look at the git repository.
[14:59] Um it's really aggressively trying to do this.
[15:01] So what I'm going to do is I'm going to say Oh interesting.
[15:07] So this one is actually... So I'm going to say, you should not look in the parent directory.
[15:19] So this is useful.
[15:21] So here it gave me three options but I want to tell
[15:23] it more.
[15:27] So it I can press tab and then it will let me add additional information.
[15:30] I'm saying no don't look in the parent directory.
[15:34] And so now it will continue to search the web and it will continue to search the web.
[15:38] So it's doing kind of what you and I would do if we were running a project.
[15:41] And so now you notice so here it says it's done exploring the project structure.
[15:45] I said no you don't need to look at the parent directory.
[15:48] Um.
[15:50] And by the way this is a good example of how persistent these agents will be unless you kind of directly tell them to stop.
[15:57] I kept saying no but then it just kept trying new ways to look in the parent directory.
[16:01] So I had to say no you like you know don't do that thing anymore.
[16:06] But why is it so persistent?
[16:08] It doesn't make sense.
[16:10] Um because when I said no I didn't tell it why it shouldn't be doing that.
[16:13] I see.
[16:16] And the problem is that that explore agent got written its own prompt.
[16:19] So it's really interesting.
[16:21] I mean it's a very interesting problem.
[16:23] So that remember this one here it doesn't
[16:25] see anything that I told it.
[16:26] So when I wrote that prompt for the, what would you call it, the head agent... Yes.
[16:31] The director. I told it, hey, only focus on this subdirectory.
[16:35] But then when it went to this one it said hey you should look in this directory and also look in the parent directory.
[16:39] It sort of screwed up this prompt.
[16:41] And so the sub-agent says, hey, I've got to look here.
[16:46] So then it says so when it tried to look in the bigger the parent directory I said permission for this tool I denied it remember and it says the tool use was rejected.
[16:54] Try a different approach.
[16:57] And so it's like oh well how am I going to look in there?
[16:59] So now it's trying all sorts of ways to do it because it's smart.
[17:03] And so it keeps saying hey it didn't work try something else.
[17:08] And so it's saying, all right, well, there's many ways to skin a cat, and so it keeps trying new things until eventually I said no, and now it says: "The user said you should not look in the parent directory."
[17:26] "I see there are access restrictions."
[17:28] "The settings indicate this is a constrained environment."
[17:31] "Based on what I can access and the restrictions in place, here is my summary."
[17:33] So there's really nothing in here.
[17:35] The directory is currently empty.
[17:37] There's these restrictions.
[17:39] You're allowed to do these things.
[17:40] So it's a little confused because it doesn't understand why I'm making these restrictions.
[17:46] It's saying that I want to keep this isolated and to only focus on external data sources.
[17:51] So it's going to report this back to the head agent.
[17:55] And now we have this other one that's gathering this other information.
[18:00] So remember the way that I'm toggling back and forth.
[18:01] You can't see me doing this, Markus, but I'm pressing Ctrl+O to toggle back and forth.
[18:06] And so now I'm going to continue to let it see this information.
[18:12] So it's still searching the web.
[18:13] You can see that it's done 33 different tool calls.
[18:16] So it's searching the web in lots of different ways and it's reading lots of information there.
[18:22] What it's trying to do is build up information
[18:27] to then construct a crosswalk in a database.
[18:31] So now it said excellent I have a thorough understanding of both data formats.
[18:35] Let me verify the actual column names by sampling the real data files and then design the plan.
[18:40] What it's going to do is take a little bit from the data just to understand this.
[18:45] So I'm going to say yes you can look at this information.
[18:48] Now it's doing it again, and now we've learned how to fix this, so I'm going to say no, do not look in the parent directory.
[18:57] Mhm. So first it makes a plan, then it experiments with the plan, and only then it implements step by step?
[19:04] Exactly. So it's kind of like what you would do if you were going to work on something, right?
[19:11] One of the key things that plan mode does, an important thing to keep in mind, is that plan mode will not edit things.
[19:20] So, remember we were talking about how these agents will kind of run wild sometimes if you give them the wrong ideas:
[19:29] in plan mode it will specifically not edit anything; Claude Code actually turns that ability off.
[19:38] And so that can be very nice.
[19:40] If you're saying look I know I just want to build a plan with you what to do.
[19:41] Let's go into plan mode.
[19:45] And you can actually turn that on directly by pressing shift and then tab.
[19:50] And that's very important but it will automatically go into plan mode as well if it understands that it's a complicated project.
[19:55] So here this is a relatively complicated project and so it knows that it needs to look up a bunch of information about what this looks like.
[20:03] So it's looking up resources of what the columns look like.
[20:07] And it's going to go in there.
[20:09] So it's building this up and it's continuing to look.
[20:11] You can see that this one is in white and that means that it's stopped doing this particular sub-agent.
[20:17] This plan agent is designing an implementation plan, and so it's doing a lot of work here by looking at what the data is.
[20:23] And we can actually read what the prompt is that it wrote for it.
[20:25] It says, "Designing a system to build a county-level panel data set of US
[20:30] mortgage lending from HMDA data. Here's
[20:32] the context." And tells a bunch of
[20:33] information. Downloads Here's what the
[20:36] requirements are. Here's the things that
[20:38] you need to do and you should use
[20:39] DuckDB.
[20:41] Here design the architecture for this
[20:43] system. File organization, download
[20:45] approach, CSV to Parquet conversion
[20:47] strategy using DuckDB.
[20:50] Harmonization approach. Should we create
[20:53] harmonized Parquet files or use DuckDB
[20:54] views?
[20:56] So what this means is that should I make
[20:57] all the files the same or should I do
[20:59] that afterwards within DuckDB to make
[21:01] them similar?
[21:03] Add metadata and then how should I
[21:05] handle the lender crosswalk and then
[21:07] finally kind of a crosswalk mapping for
state FIPS codes versus state
[21:11] >> So can you say again the harmonization
[21:13] approach is What what's the
[21:15] harmonization approach?
[21:16] >> Uh so I think what it's asking so I
[21:19] asked it to harmonize the variables if
[21:21] you remember that. So I wanted these
[21:22] things so that I could have a kind of
[21:23] consistent variable across years. They
[21:25] change in the data sets themselves. I
[21:28] just happen to know that from my own
[21:30] work with it. Um
[21:32] So what it's asking is, "Well, should I
[21:34] change the files themselves or should
[21:37] what I do is just store the raw data in
[21:39] there and then make a view on top of it
[21:41] which is like kind of a renaming
[21:43] restructuring." So those are slightly
[21:45] different approaches on how you might
[21:46] want to harmonize something.
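The "view on top of raw data" option can be sketched as follows. This is an illustration, not the plan's actual schema: the raw tables keep their original, era-specific column names, and a view renames and rescales them into one consistent shape. The column names below are assumptions modeled on the two HMDA eras (pre-2018 amounts in thousands of dollars, later amounts in dollars) and the rows are made up. SQLite stands in for DuckDB; the SQL here is plain enough that DuckDB should accept it too.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Raw tables are stored as delivered, with era-specific column names.
CREATE TABLE raw_2017 (as_of_year INTEGER, respondent_id TEXT, loan_amount_000s REAL);
CREATE TABLE raw_2018 (activity_year INTEGER, lei TEXT, loan_amount REAL);
INSERT INTO raw_2017 VALUES (2017, '0001', 250);
INSERT INTO raw_2018 VALUES (2018, 'LEI123', 310000);

-- The view does the renaming and unit conversion; no data is copied.
CREATE VIEW loans_harmonized AS
SELECT as_of_year AS year,
       respondent_id AS lender_id,
       loan_amount_000s * 1000 AS loan_amount
FROM raw_2017
UNION ALL
SELECT activity_year AS year,
       lei AS lender_id,
       loan_amount
FROM raw_2018;
""")

harmonized = con.execute(
    "SELECT year, lender_id, loan_amount FROM loans_harmonized ORDER BY year"
).fetchall()
print(harmonized)
```

The trade-off is roughly this: a view keeps the raw files untouched and makes the harmonization rules easy to change, while writing harmonized files costs disk space but avoids redoing the conversion on every query.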
[21:48] So it's still thinking very hard right
[21:50] now which is kind of interesting. So
[21:51] it's it's doing some work here. You can
[21:53] see that it's it's searching for
[21:55] information about how to do stuff.
[21:57] Um
[21:59] and
[22:01] I have a solid understanding of both. So
[22:04] let's read through what it's going to
[22:04] say just briefly.
[22:07] Um here's the goal. This is gives it the
[22:09] context. Goes, "This is the thing it's
[22:11] going to write to a file.
[22:13] Um build a harmonized data set of US
[22:15] mortgage lending. Data spans two
[22:17] different formats, pre-2018, post-2018
[22:20] with different column names, coding
[22:21] schemes, and lender identification
[22:22] systems. 250 million rows across 18
years. All processing uses DuckDB and
Parquet."
[22:29] So it's going to write a bunch of code
[22:30] here. It's going to have six different
[22:32] Python files. Um one is for configuring
[22:36] things. This is how to download. This is
[22:37] the conversion. This is the raw Parquet
[22:40] to harmonize Parquet then assemble a
[22:42] database. And then this is the thing
[22:44] that's going to run all the different
[22:45] pieces of the pipeline.
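A script that "runs all the different pieces" can be as simple as a loop over years and stages that skips work already done. The sketch below is a guess at the general shape, not the file Claude generated; `run_pipeline`, the stage functions, and `STAGES` are hypothetical names, and the stages are empty placeholders.

```python
def run_pipeline(years, stages, completed=None):
    """Run each stage for each year in order, skipping finished (year, stage) pairs."""
    completed = set() if completed is None else completed
    ran = []
    for year in years:
        for name, func in stages:
            if (year, name) in completed:
                continue  # Already done in an earlier run, so resume past it.
            func(year)
            completed.add((year, name))
            ran.append((year, name))
    return ran


# Placeholder stages: each would normally do the real work for one year.
def download(year):
    pass


def convert(year):
    pass


def harmonize(year):
    pass


STAGES = [("download", download), ("convert", convert), ("harmonize", harmonize)]

# Pretend the 2017 download finished in a previous run.
first = run_pipeline([2017, 2024], STAGES, completed={(2017, "download")})
print(first)
```

In a real pipeline the `completed` set would be derived from what exists on disk, for example whether a year's Parquet file is already there.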
So this Parquet again is like a
column summarized.
[22:50] >> Yeah, Parquet think of it as just a
[22:51] different version. It's like CSVs, but
[22:53] it's column-based. It's just a more
[22:55] efficient way of storing the data. And
[22:57] it's what DuckDB uses underneath it. So
[23:00] now it's going to go through what are
[23:01] all the different files, the
[23:02] configuration. So here is the template
[23:05] um for where the data is. Here is the
[23:08] data dictionary of how you map between
[23:09] FIPS codes and states, column name
[23:12] mappings. So this is how you're going to
[23:14] map between columns,
[23:16] the schema, and then variation of the
[23:18] columns.
[23:21] Um
[23:22] So then there's this download um this is
[23:25] the download procedure. This is how it's
[23:27] going to do the downloads. Um
[23:31] and then here's the conversion. So I
[23:32] think the kind of most of this is just,
[23:34] you know, programming to download data
[23:36] correctly. Here is how it's going to
[23:38] harmonize the schema. So what it's going
[23:40] to do is it's going to have years. It's
[23:42] going to have a lender identifier. It's
[23:44] going to have a lender name. It's going
to have the county and state code.
It's going to have
[23:51] the loan type, the loan purpose, the
[23:53] action taken, and the loan amount, and
[23:55] then the the race and income.
[23:57] So this is really going to kind of um
[24:00] convert this down. Um we could kind of
[24:04] We're going to have the raw data. So
[24:06] this is going to be a view that it's
[24:07] going to create. There's many other
[24:08] things we could do as well.
[24:09] So one of the things it's going to have
[24:11] to do, it's going to have to construct
[24:13] some different variable names depending
[24:14] on what things look like.
[24:17] It's also going to need to do some
[24:18] conversions to make things harmonize. Um
it's going to drop rows where it doesn't
have information about the county or
[24:24] state code. And then it's going to build
[24:26] the database based off of this. And so
[24:28] this is going to create a kind of a view
[24:30] with all the data and then it's going to
[24:32] create this metadata and then we're
[24:35] going to create this kind of harmonized
[24:37] version and the run pipeline is going to
[24:39] do all these pieces.
[24:41] So after we've done the coding, we're
[24:43] going to just run this to download the
[24:44] years. This is the different steps it's
[24:46] going to try and do. And then you can
[24:48] query the data to take a look at it. And
[24:50] so this is going to be the things that
[24:51] it's going to do. And so now if we're
[24:53] comfortable with this, we can say yes
[24:55] and auto accept the edits. So now I'm
[24:57] going to say
[24:59] So one thing that is useful is that this
[25:00] will save it into a folder
on your computer elsewhere. So if you
go to ~/.claude/plans, it will save the
plan there if you want to look at it
later.
[25:14] Um
[25:16] I'm going to make two That's all. I'm
[25:19] going to do this.
[25:21] Um
[25:21] I'm going to say yes. And so now what it
[25:24] will do is it's cleared my whole
[25:26] context. So what I've done is I've said,
[25:29] "Hey, um
[25:32] just get rid of all the work that you
[25:33] did before. Just take the plan." So
[25:34] remember we talked about context window
[25:36] management in the first video? It's
[25:38] exactly that. It's saying, "Hey, take
[25:40] the plan, read it, and let's start from
[25:41] the beginning." And so now it's starting
[25:44] from a clean directory. It's going to
[25:45] write the seven files.
[25:48] Um
[25:49] and it's going to create these seven
[25:51] ones. Here's a summary of what was
[25:53] built. All these different files. So
[25:55] config, download, convert, harmonize,
[25:58] build, run. A git ignore. So this is if
[26:02] you use git. Um this is a way of
[26:04] ignoring things that are kind of um not
[26:07] important. So now it's telling you you
[26:09] should run this. Now
[26:11] I could tell it so I could run this
[26:13] separately on the command line over
[26:14] here.
[26:16] Honestly, once you're already doing it
[26:17] in here, you can just tell it. You say,
[26:19] "Hey, just run it yourself." Um now it's
[26:21] going to look at the data. So it's
[26:22] saying, "Great, we have all of this
[26:23] data. Before we run the full pipeline,
[26:24] let's verify the column names in one
[26:27] pre-2018 and one post-2018 to make sure
[26:29] it's correct." We're going to say um
[26:31] yes.
[26:33] One thing I want to kind of just say
quickly here, Markus, is, you know,
[26:38] often if I kind of know what's going on
[26:40] in these things, there's this meme about
[26:43] you know these uh these drinking birds
[26:45] that kind of they tip back and forth.
[26:46] There's this joke about people who use
[26:48] this. You know, you're just hitting yes
[26:50] every time it kind of does this. Um so
[26:54] there's a there's a joke there too of
[26:57] like
[26:58] it's good to be reading carefully, but
if you kind of trust that most of the
time it's working reasonably well, you
can move very quickly; this can be very
quick.
[27:11] In future videos
[27:13] um
[27:14] in future videos we'll talk about ways
[27:16] in which you can actually do this in
[27:17] which you don't have to give any input
[27:19] and you can kind of let it make
[27:20] decisions on its own, but then of course
[27:21] you have to check the work. And that's
called YOLO mode. So now it's
[27:28] going to try and convert everything.
[27:30] YOLO mode. Yeah, you only live once. So
[27:32] now I'm going to let it convert one of
[27:34] the files
[27:35] into a Parquet.
[27:37] So it's going to try and do this.
[27:53] >> So now it says, "2017 converted
[27:55] successfully. 14 million rows, 114
[27:58] megabyte Parquet." So just to give a sense,
[28:01] right? Remember I was talking about
[28:02] how 2017 is 1.7 gigs as a CSV. When it's
[28:06] a Parquet file,
[28:08] it ends up being about 100 megs.
[28:12] I see.
[28:12] >> So this is a much more efficient way of
[28:13] storing data.
[28:16] So it somehow compresses it.
[28:18] Yeah, because it ends up not needing as
[28:21] much
[28:23] space
[28:24] for a lot of the things.
[28:25] >> Same information content.
[28:27] Exactly. It's just compressed much more
[28:30] efficiently.
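One reason a columnar format such as Parquet can hold the same information in far less space is dictionary encoding: a column with a few distinct values repeated millions of times is stored as a short list of those values plus one small integer per row. The sketch below illustrates only that idea in plain Python; it is not Parquet's actual on-disk format, and the toy "loan purpose" labels are made up for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    codes = []        # one small integer per original row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column; no information is lost."""
    return [dictionary[code] for code in codes]


# Toy column: three distinct labels repeated many times.
column = ["purchase", "refinance", "purchase", "home improvement"] * 1000
dictionary, codes = dictionary_encode(column)
```

Here 4,000 strings collapse to a three-entry dictionary and 4,000 small integers, and decoding returns exactly the original column.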
[28:31] So now it's going to try and harmonize
[28:33] for 2017. It's going to run the
[28:34] harmonization.
[28:38] And so it says, "Looks like there's an
[28:39] issue. Works. The respondent name
[28:41] warning is expected because it's not in
[28:43] these files. It'll be null. Now we're
[28:45] going to test a post-2018 year. Let me
[28:47] use 2024 because it's smaller."
[28:51] So now it's going to run this again.
[28:52] This is the one that converts it to a
[28:54] Parquet.
[29:07] So this one,
[29:09] 2024, is a 4.3 gigabyte file and now it's
[29:12] 450 megabytes.
[29:14] And it says, "I noticed the
[29:16] column names for post-2018. It converted
[29:18] hyphens to nothing, not underscores.
[29:20] This is different than I expected. The
[29:23] actual thing is this. I need to update
[29:24] this to make it work." So this is what's
[29:26] called thinking. Basically,
[29:28] it's figuring stuff out in the
[29:30] background.
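The column-name mismatch described here (hyphens dropped rather than turned into underscores) is the kind of thing a small normalization step can absorb. The following is a minimal sketch, not the code Claude wrote in the video: it matches headers by comparing them with all punctuation removed. The canonical column names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical canonical names, chosen only for illustration.
CANONICAL_COLUMNS = ["loan_amount", "county_code", "action_taken"]


def squash(name):
    """Lowercase a column name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Lookup from the squashed form to the canonical spelling.
SQUASHED_TO_CANONICAL = {squash(name): name for name in CANONICAL_COLUMNS}


def harmonize_column(raw_name):
    """Map a raw header to its canonical name, or return None if it is unknown."""
    return SQUASHED_TO_CANONICAL.get(squash(raw_name))
```

With this approach `loan-amount`, `loanamount`, and `Loan_Amount` all resolve to `loan_amount`, while an unrecognized header returns `None` so it can be flagged rather than silently dropped.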
[29:31] Uh it's now going to kind of try and
[29:34] make these consistent. So it now is
[29:38] verifying these. It's created kind of
[29:40] two tables. It has two years. It's
[29:42] building a database and the data looks
[29:43] correct.
[29:45] Okay. Los Angeles, Maricopa, Cook
[29:48] County, and Harris County are the top
[29:50] counties as expected. So, Cook County is
[29:52] in Chicago. I don't know where Harris
[29:54] County is. I think Maricopa is in
[29:55] California.
[30:01] "Let me run the remaining 16 years
[30:05] through convert and harmonize. I'll kick
[30:06] these off as a background task
[30:08] and I'll check these periodically." So,
[30:10] what it's saying is that it's going to
[30:12] um run these.
[30:14] But it can just run, right? This is
[30:17] the same thing as if you ran it in
[30:18] Python. So it doesn't need to stop
[30:20] Claude per se. What happens
[30:22] here, which is funny, is that
[30:24] to make it just pause
[30:27] and sit here, what it's done is
[30:29] said, "I'm going to wait 60 seconds
[30:31] and then look at the output." Okay. So,
[30:35] this is fine. We'll just leave this
[30:37] so that it can keep working.
[30:39] Um
[30:41] What I want to kind of show you while
[30:42] that's running is you can see
[30:46] what the DuckDB database will look like.
[30:48] So, in the database
[30:50] here
[30:52] on the left, this HMDA panel DuckDB file
[30:55] is stored here.
[30:58] This is the file that it's going to try
[30:59] and make. It's going through
[31:01] now and
[31:05] making Parquet files.
[31:11] It's doing these all in
[31:13] parallel.
[31:15] They're almost done.
[31:17] But you can see that this has gone from
[31:18] a 70 GB
[31:20] process to
[31:24] basically, this is going to end up
[31:26] being a GB, Mhm, just because it's much
[31:28] more efficiently stored. All right. So,
[31:30] now it's all done. So, we have all 18
[31:33] years converted. We're going to run
[31:35] harmonize on the remaining 16 years cuz
[31:38] we already did two. So, it's going to
[31:39] run this. It's doing this task.
[31:42] It's just going to basically
[31:44] convert these in the background
[31:46] and then it's going to
[31:48] make a database out of it. So, it's
[31:49] going to harmonize all of these.
[31:52] It has the same
[31:55] It's basically moving these ones. "All
[31:57] 18 years harmonized successfully.
[31:59] Now, let's build the final database."
[32:02] So, it has 291 million rows across 18
[32:05] years. "Let me run the full
[32:07] verification query." So, a nice thing
[32:08] here you can see, if
[32:10] you know, we all did a bunch on
[32:11] mortgages, is the peak of mortgage
[32:13] origination in 2007, and then of course
[32:16] things fell off tremendously. Yes. Here,
[32:18] during COVID, is when interest
[32:20] rates fell. So, that's why we had this
[32:21] massive increase, because there was a ton
[32:23] of refinancing.
[32:25] And then we've had
[32:27] a fall off because interest
[32:29] rates have gone up and there's been
[32:30] very little origination.
[32:32] So, now it's all done and we have this
[32:34] final database here. So, we can read
[32:37] through what it says. Here's the
[32:38] pipeline.
[32:39] Um one single database. One thing I
[32:41] noticed is that these purpose codes
[32:44] appear in the 2018 database. These are
[32:47] pre-2018 encodings for refinancing and
[32:49] home improvement respectively. The
[32:50] harmonization maps the raw values
[32:52] directly. If you want to recode these
[32:54] for this year, I can add that
[32:55] transformation. Otherwise, it is
[33:00] clean and ready to query." If I
[33:02] were doing this myself, I
[33:04] would do that recode. But now we're all set. So,
[33:06] the last thing that we can do,
[33:08] now that we have the stuff:
[33:10] let's look at what we have. We built
[33:12] this whole pipeline to download the data
[33:14] and harmonize things.
[33:16] And now I want to just say, "Okay, well,
[33:17] what's something we could do with this?"
[33:19] So,
[33:20] you know, a nice part of this is we
[33:22] could say, "I want to build some
[33:24] aggregated data."
[33:28] Just for the purposes of
[33:31] this,
[33:32] I want to build some views. What
[33:34] that means is I want to
[33:35] take the data
[33:36] and construct the amount of
[33:39] origination by county-year. Mhm.
[33:42] I want to basically build two
[33:44] aggregated databases. I
[33:46] wanted to do this by lender by year,
[33:49] but we'd have to do more work on
[33:50] this, because I want to classify things by
[33:54] fintech and non-fintech.
[33:58] I think the challenge, given what was set
[34:00] up, is I would need to do more work
[34:02] here. So what I want to do instead is
[34:05] just build out an aggregate
[34:07] database.
[34:09] Here's what I'm going to do: "I want
[34:10] to build out an aggregate database where
[34:12] what I'm going to be getting is county
[34:14] by year, total originations, dollar
[34:16] value, number of active lenders, the HHI
[34:19] which is like the market concentration,
[34:21] the denial rate, and the median loan
[34:22] amount."
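To make the requested county-by-year table concrete, here is a minimal plain-Python sketch of the collapse. It is not the function Claude wrote in the video. The field names (`county`, `year`, `lender`, `amount`, `outcome`) are hypothetical, market shares are assumed to be dollar shares of originated loans, and the denial rate is assumed to be denials over all rows in the cell. HHI is the sum of squared percentage shares, so it runs from near 0 to 10,000.

```python
from collections import defaultdict
from statistics import median


def county_year_summary(loans):
    """Collapse loan-level rows to one summary row per (county, year)."""
    groups = defaultdict(list)
    for loan in loans:
        groups[(loan["county"], loan["year"])].append(loan)

    summary = {}
    for key, rows in groups.items():
        originated = [r for r in rows if r["outcome"] == "originated"]
        denied = [r for r in rows if r["outcome"] == "denied"]

        # Dollar volume per lender, originated loans only.
        dollars_by_lender = defaultdict(float)
        for r in originated:
            dollars_by_lender[r["lender"]] += r["amount"]
        total_dollars = sum(dollars_by_lender.values())

        # HHI: sum of squared percentage market shares.
        hhi = None
        if total_dollars > 0:
            hhi = sum((100 * d / total_dollars) ** 2
                      for d in dollars_by_lender.values())

        summary[key] = {
            "originations": len(originated),
            "dollar_volume": total_dollars,
            "active_lenders": len(dollars_by_lender),
            "hhi": hhi,
            "denial_rate": len(denied) / len(rows),
            "median_loan_amount": (
                median(r["amount"] for r in originated) if originated else None
            ),
        }
    return summary


# Toy data for one county-year (06037 is the Los Angeles County FIPS code).
loans = [
    {"county": "06037", "year": 2017, "lender": "A", "amount": 300000, "outcome": "originated"},
    {"county": "06037", "year": 2017, "lender": "A", "amount": 100000, "outcome": "originated"},
    {"county": "06037", "year": 2017, "lender": "B", "amount": 100000, "outcome": "originated"},
    {"county": "06037", "year": 2017, "lender": "B", "amount": 200000, "outcome": "denied"},
]
summary = county_year_summary(loans)
```

In the toy cell, lender A holds 80% of dollars and lender B 20%, giving an HHI of 80² + 20² = 6,800. In practice the same collapse would be a single grouped query against the database rather than a Python loop over 291 million rows.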
[34:23] So, I'm going to
[34:25] um I'm going to ask it to do that.
[34:29] And so this should be very quick.
[34:31] It will just
[34:33] do this in a few... And one very
[34:36] natural thing that we can do, and I'll
[34:38] post about this online just to
[34:40] keep this from getting longer, is
[34:42] that now we can also, for example,
[34:44] crosswalk on a
[34:47] question of
[34:50] shadow banks versus non-shadow banks.
[34:52] This is a big question in the finance
[34:54] literature: how much mortgage lending is
[34:55] happening among shadow
[34:57] banks. And so we could very easily try
[34:59] to have this classify things
[35:02] following others in this approach.
[35:04] So, after we do this one and I show this
[35:07] in the director's cut, we can show the
[35:09] longer version of this. But, I want to
[35:10] kind of show you an example of how this
[35:12] can be done. It's going to say, "Let me
[35:13] create a a function for the build DB
[35:16] Python code which is going to create HHI
[35:20] and then collapse it down."
[35:24] So, it's going to do this. It's going to
[35:25] create a new little function here. So,
[35:27] that's what Claude proposed, no?
[35:30] Yes. I told it I just want to make
[35:32] this table and it's saying, "Here's how
[35:34] I would do it." And so I'm letting it
[35:36] go to town. It's going to run that.
[35:38] It's going to rebuild the database with
[35:40] the new table.
[35:43] So, now it's you'll notice it's found an
[35:44] error here. It says, "2018 LA County
[35:46] must have many fragmented rows instead
[35:48] of one.
[35:49] There must be format variation. Let me
[35:51] investigate." So, it looks like it's run
[35:53] into a problem here in which
[35:55] there are some issues.
[35:58] So, it looks like it found a FIPS code
[36:00] which is inconsistent in some of the
[36:01] post-2018.
[36:04] So, it's going to try and
[36:06] Sorry. No, no, go ahead. Yeah, yeah.
[36:09] >> What are FIPS codes?
[36:10] Mhm.
[36:11] FIPS, what does FIPS stand for?
[36:14] Oh, FIPS, good question. It stands for
[36:16] Federal...
[36:19] Good question. I don't actually know.
[36:21] It's Federal Information Processing
[36:23] something. So FIPS is a way
[36:26] you get these long codes. For a
[36:29] state, it's going to be a two-digit code.
[36:31] A county is the state plus a
[36:33] three-digit county code. And then you
[36:35] can actually aggregate it to the...
[36:38] This is the same code that gets used, for
[36:39] example, for census blocks and census
[36:41] tracts. It's a long multi-digit
[36:45] thing. And so what it's done here is
[36:46] figure out what the problem was.
[36:49] In the group by here,
[36:52] we need to derive the state FIPS code
[36:55] correctly. It looks like there were some
[36:56] errors there. So now it has done this
[36:58] correctly.
[36:58] correctly.
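The FIPS structure described here (two-digit state, then three-digit county) explains why a group-by can fragment: `6` and `06`, or `37` and `06037`, look like different places until they are padded consistently. A minimal sketch of a normalizer follows. The assumption that some files already carry the combined five-digit code is mine, made for illustration; the transcript does not say exactly which format variation Claude found.

```python
def county_fips(state_code, county_code):
    """Build a five-character county FIPS code: 2-digit state + 3-digit county."""
    state = str(state_code).strip().zfill(2)
    county = str(county_code).strip()
    # Assumption: some files already store the combined five-digit code.
    if len(county) == 5:
        return county
    return state + county.zfill(3)
```

Stored as zero-padded strings, `county_fips(6, 37)`, `county_fips("06", "037")`, and `county_fips("06", "06037")` all give `"06037"`, so every Los Angeles County row lands in the same group.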
[37:00] And it has one row per county year. LA
[37:03] County shows a clean time series across
[37:05] all 18 years. Tells a nice story. So, it
[37:08] looks like it's constructed this. And
[37:09] so now what I'm going to ask it to do
[37:11] is to make a histogram of HHI across
[37:13] counties for 2007, 2015, 2021,
[37:18] and 2024, and make a four-panel graph.
[37:22] Okay.
[37:28] And I'm going to ask it to
[37:30] make this histogram
[37:34] dollar weighted.
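"Dollar weighted" means each county contributes its lending volume to a histogram bin instead of a count of one, so a few very large counties can dominate the picture. The sketch below shows the mechanics in plain Python with made-up numbers; the bin edges, HHI values, and dollar volumes are illustrative assumptions, not values from the video.

```python
def weighted_histogram(values, weights, bin_edges):
    """Sum the weights falling in each half-open bin [edge_i, edge_{i+1})."""
    totals = [0.0] * (len(bin_edges) - 1)
    for value, weight in zip(values, weights):
        for i in range(len(totals)):
            if bin_edges[i] <= value < bin_edges[i + 1]:
                totals[i] += weight
                break
    return totals


def weighted_mean(values, weights):
    """Average of `values` where each entry counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data: one large competitive county and two small concentrated ones.
hhi = [300, 2500, 6000]
dollars = [900.0, 80.0, 20.0]
edges = [0, 1000, 5000, 10001]

bins = weighted_histogram(hhi, dollars, edges)
mean_hhi = weighted_mean(hhi, dollars)
```

In this toy example the unweighted average HHI is about 2,933, but the dollar-weighted mean is 590, because the large low-concentration county carries 90% of the weight.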
[37:36] Mhm.
[37:38] So,
[37:40] the reason I might be interested in this
[37:41] is I want to understand how
[37:44] lender concentration has varied over
[37:46] time. HHIs that are much
[37:49] larger are going to reflect counties
[37:51] where there's much more concentration
[37:53] in a single lender. What is HHI?
[37:55] Herfindahl Index? Herfindahl Index,
[37:57] exactly. It stands for
[37:59] the Herfindahl-Hirschman Index.
[38:02] That's right. Hirschman always gets
[38:04] short shrift when people describe
[38:06] it. Mhm.
[38:08] Which one is
[38:10] >> and it does that very, very quickly. So,
[38:11] you'll see it's already constructed
[38:13] this.
[38:14] And
[38:15] um
[38:17] Once it's done here, it's going to have
[38:19] its own opinion.
[38:23] It's going to say, "It's remarkably low
[38:26] across all four years.
[38:28] The weighted mean ranges from 308 to
[38:31] 356.
[38:34] The big dollar counties have hundreds of
[38:36] competing lenders. It's very right skewed."
[38:38] So, you know, the rural counties
[38:39] tend to be
[38:41] very concentrated. This was the most
[38:43] competitive, whereas these two are tied.
[38:47] So, the last thing that I want to do...
[38:56] Let me actually
[38:58] show you here. Markus, if it's okay, I
[39:00] want to
[39:00] >> Open up the PNG file here?
[39:02] >> Yes. So, I want to open it so you can
[39:04] see it. So, I'm going to share this now.
[39:05] I'm going to switch my screen for a
[39:06] second and share it. So, I want to be
[39:08] able to show you what that looks like.
[39:10] Um
[39:12] So, now
[39:13] um
[39:15] Where did it save it? Saved it under
[39:16] data. So, now we can look at this
[39:18] figure.
[39:21] Now, it's very funny to me that it used
[39:23] that um
[39:28] So, here is
[39:30] what it looks like. So, it made this
[39:32] one.
[39:33] If you remember
[39:35] I didn't tell it anything of what the
[39:36] figure to look like. This is all
[39:37] defaults.
[39:39] I frankly think it would have been very
[39:40] ugly. Um
[39:43] I think one thing that would be
[39:44] interesting here, for
[39:45] example, is you could now
[39:47] say, "Let's compare across
[39:49] counties." So you could do a scatter
[39:50] plot. You could ask how this has
[39:52] changed over time. The real
[39:54] thing that's interesting, though, is
[39:56] that the mean concentration has gone up,
[39:59] but I would say that it hasn't
[40:00] shifted dramatically across these.
[40:02] It's a relatively competitive
[40:05] industry, although who the major
[40:07] players are may have changed a little
[40:09] bit.
[40:10] Um
[40:12] The key benefit here that I
[40:14] just want to describe to you...
[40:17] What I'm going to do now is go
[40:18] back to
[40:21] the command line.
[40:28] I just want to show you very
[40:30] quickly
[40:33] how you could use this DuckDB file,
[40:34] just to give you a sense of what this
[40:36] looks like. So in the data set,
[40:39] we are
[40:41] here,
[40:43] and we take the data,
[40:47] Mhm, we look at the DuckDB file,
[40:49] we can run DuckDB on it, and
[40:53] we can take a look at what's inside of
[40:55] it. I mentioned that this is a great
[40:57] program. This is independent of...
[40:58] So on the left here, I'll make
[41:00] it bigger so you can see it. I'm
[41:02] now inside; I've run DuckDB to look at
[41:05] this thing and I can say, "Show me the
[41:06] tables in here." And so now what I
[41:08] have is a bunch of different tables.
[41:10] I have the main data here. I have the
[41:12] metadata. This is what
[41:15] my LLM would use. And then I've made this
[41:17] new thing called county year. So if you
[41:20] describe county year,
[41:23] this is the thing that I just made.
[41:26] So I have all this information. I can
[41:28] also describe the underlying raw data.
[41:32] And you can see that
[41:34] what's useful about it is it gets
[41:35] stored very efficiently. It has the
[41:38] strings. It has the loan type and
[41:40] the loan purpose, the amount, the
[41:42] origination, different pieces. I've
[41:44] selected a subset of things, so I
[41:46] could incorporate and do more if I
[41:48] wanted in the raw data, but this is
[41:51] just a very useful way of structuring
[41:53] data. Now I have one file that I just
[41:55] point the LLM to if I'm going to do
[41:57] something.
[41:58] So what I'm going to do now as a last
[42:01] exercise is
[42:06] ask
[42:09] the LLM to
[42:16] try and replicate some
[42:19] labeling that comes from the finance
[42:22] literature. There's
[42:24] been work by people trying to label which lenders
[42:25] are fintechs or shadow banks in the
[42:27] mortgage literature. So what I'm going
[42:29] to do is say, "I want to
[42:31] classify lenders as fintech or shadow
[42:34] banks versus traditional banks. Let's
[42:36] start with the lending classification
[42:38] from Fuster, Plosser, Schnabl, and
[42:40] Vickery," which is a paper
[42:42] in the Review of Financial Studies. They
[42:44] have a list in that paper. "I'm going to
[42:46] use that as a base and then I want to
[42:47] extend the classification to cover these
[42:49] other names. So store it in a lender
[42:51] classification and document what's going
[42:53] on here."
[42:54] What I'm going to do is
[42:55] say this and then I'm going to ask,
[42:58] "Tell me what it would take to do this
[43:03] task before you try anything
[43:08] complicated.
[43:10] Is this feasible?
[43:12] What might be missing?"
[43:15] So we're going to we're going to ask it
[43:16] to do this now and we'll see kind of
[43:18] what it's able to do.
[43:20] "Good question to ask before diving in."
[43:22] So this is very sycophantic. "Let me
[43:24] check what we actually have to work
[43:26] with."
[43:29] So it's reading the DuckDB database.
[43:33] Underneath here it's
[43:34] looking at the names, and it sees that
[43:36] there are no names. So that's a big
[43:37] problem. Yes. And that was why I was
[43:41] telling you earlier that it will see
[43:42] how doable it is, and it might
[43:45] need to construct...
[43:48] I see. So it's going to go back to the
[43:50] original data?
[43:51] Yeah, exactly. It's going to look to
[43:53] see what's in there.
[43:55] Um
[43:56] and it's going to figure it out for me.
[43:57] This is a little bit like you might ask
[43:59] an RA, right? You might say hey
[44:01] can you I want to do this. What is that
[44:03] going to entail?
[44:07] So it's looking at
[44:09] different files. My recollection here is
[44:11] that what it needs to do is
[44:17] figure out what the names are in here.
[44:20] And there's a crosswalk in this data.
[44:23] Mhm. We would need to download this
[44:25] data. This is a panel data set where
[44:27] we'd need to do the... And so
[44:29] it's going to say, "Well, here's
[44:31] what's missing. We have all the lender
[44:33] identifiers, but we have no names and no
[44:35] institution type."
[44:37] So what do we need? We need these HMDA
[44:39] panel or transmittal sheet
[44:41] files. Mhm. They are
[44:44] going to have a bunch of information
[44:45] that links across the two of these.
[44:48] So
[44:50] how can we do that?
[44:52] We need the names to do
[44:54] this. And so what's feasible? We
[44:56] could do an additional download to get
[44:58] this. "Should I download the panel
[45:00] files?" And so I'll say, "See if it is
[45:05] feasible to download panel files."
[45:09] And so we can ask.
[45:13] And if it's easy these are tiny files it
[45:15] looks like.
[45:16] And it's going to try and figure out
[45:18] what it's doing.
[45:19] So it's
[45:21] thinking about it, and now it's working out
[45:23] how to download these files.
[45:25] Remember, this is an
[45:28] agent that it has spun out to look for
[45:30] this and research how to do it:
[45:33] look for these sources
[45:35] and try to figure out if it's
[45:37] doable.
[45:39] So it's looking.
[45:41] It looks like it's done... or no,
[45:42] it's researching this right now.
[45:44] It's not done.
[45:45] And it's downloaded a bunch. It sees
[45:47] that it looks like it has found one of
[45:49] the data sources. So CFPB has it there
[45:52] for a bunch of these files and I'm going
[45:53] to say yes. So what it's trying to do
[45:55] now is trying to download it. So curl is
[45:56] a command to download data and it's
[45:58] going to download and look at the
[46:00] header.
[46:01] And we're going to say yeah, go ahead.
[46:04] All right. So that took a little while,
[46:06] but just to give you a
[46:08] sense, it did a ton of work.
[46:11] It ended
[46:14] up using something like 75,000 tokens to
[46:16] research this. And now it has
[46:18] a new plan of what it wants to do. It
[46:20] says, "All right, here's what we're going
[46:21] to do. We're going to redo this plan."
[46:28] Sorry, it says,
[46:30] "Here's what I found." So it used 75,000
[46:33] tokens
[46:34] to figure out what was going on. It says,
[46:36] "Yes, it's available for all of these
[46:37] years.
[46:39] 2024 has not been published yet."
[46:43] Each file is about one... It's not there
[46:45] for 2024. That's fine. It's going to use
[46:48] something else, but it won't have an
[46:49] other lender code. So let's do it. It
[46:52] took a while, 7 minutes and 41
[46:54] seconds, to figure it out, but now it
[46:56] knows how to do it. So we'll say, "Yes, go
[46:58] ahead
[46:59] and implement."
[47:01] So it's going to try and implement this.
[47:02] It's going to download 18 small files.
[47:04] It's going to merge these on
[47:06] lender ID, and then it will have names, and then
[47:08] it will classify based off of the list
[47:10] from the paper. And so
[47:13] it's going to try and do this.
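The plan described here has two steps: attach institution names to loans by lender identifier, then label each lender by matching its name against a list. A minimal sketch follows. The keyword tuple is a placeholder built from lender names mentioned in this video and is not the list published in the paper; the field names `lender_id` and `year` and the join key (lender, year) are also assumptions for illustration.

```python
# Placeholder keywords drawn from lender names mentioned in the video;
# this is NOT the classification list published in the paper.
FINTECH_KEYWORDS = ("quicken", "rocket mortgage", "better mortgage", "sofi")


def classify_lender(name):
    """Label a lender name 'fintech' or 'other'; 'unknown' if no name was found."""
    if name is None:
        return "unknown"
    lowered = name.lower()
    if any(keyword in lowered for keyword in FINTECH_KEYWORDS):
        return "fintech"
    return "other"


def attach_names(loans, names_by_lender_year):
    """Left-join institution names onto loans by (lender_id, year), then classify."""
    labeled = []
    for loan in loans:
        name = names_by_lender_year.get((loan["lender_id"], loan["year"]))
        labeled.append(
            {**loan, "lender_name": name, "lender_class": classify_lender(name)}
        )
    return labeled
```

Keeping unmatched lenders as `"unknown"` rather than folding them into `"other"` makes it easy to check how many loans failed to pick up a name, which matters for a year where the institution file is not yet available.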
[47:15] Um
[47:17] And just to give you
[47:19] a sense of how this data set works,
[47:20] there's a file that's just about
[47:22] the institutions that are there, and it
[47:23] has a name and lender codes. And so,
[47:27] you know, one of the useful things
[47:28] is that you can classify banks
[47:30] based off of what they look like
[47:32] in this information.
[47:34] So it's going to try and implement the
[47:35] code. So
[47:37] we'll let this we'll let this run for a
[47:38] second.
[47:40] And so it's made some decisions
[47:42] on what the core list is. This it
[47:44] knows from Fuster et al. Here's all the
[47:47] Mhm. Names. The names. And then
[47:49] it's made some decisions on what are
[47:51] additional fintech lenders. We may or
[47:53] may not agree with these, but these are
[47:55] certainly a reasonable place to start.
[47:57] Better Mortgage, better.com, and SoFi
[47:59] are a good place to start.
[48:01] And now it has
[48:03] basically downloaded this data.
[48:08] This is
[48:10] something I wouldn't really like,
[48:12] and we didn't specify this, so this is
[48:14] why it's doing it. These are small files,
[48:16] and so it's actually just running
[48:18] it from the command line; it's basically
[48:19] writing it out itself. This is not
[48:21] replicable, though. You know,
[48:23] this would not be good if you were
[48:24] trying to have a pipeline where you
[48:26] could run it from scratch.
[48:28] So I wouldn't traditionally like that. I
[48:29] would go back and I'd change that. I
[48:31] would say, "Actually no, write a script to
[48:32] download it." Now it's going to load it
[48:34] in.
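The replicability point can be made concrete: instead of one-off download commands typed in the terminal, the downloads live in a script that anyone can rerun from scratch. A minimal sketch is below. The URL template is a hypothetical placeholder (the transcript does not give the real address of the panel files), and the file naming and year range in the usage comment are likewise assumptions.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical URL pattern; substitute the real location of the files.
URL_TEMPLATE = "https://example.org/hmda/panel_{year}.csv"


def plan_downloads(years, out_dir):
    """Return (url, destination) pairs without touching the network."""
    out_dir = Path(out_dir)
    return [
        (URL_TEMPLATE.format(year=year), out_dir / f"panel_{year}.csv")
        for year in years
    ]


def download_all(years, out_dir):
    """Fetch each file once; files already on disk are skipped so reruns are cheap."""
    for url, destination in plan_downloads(years, out_dir):
        destination.parent.mkdir(parents=True, exist_ok=True)
        if not destination.exists():
            urlretrieve(url, destination)


# Example call (not executed here; the year range is an assumption):
# download_all(range(2007, 2024), "data/raw/panel")
```

Separating the plan from the fetch keeps the list of sources inspectable and testable without network access, and the skip-if-present check lets the whole pipeline be rerun without downloading everything again.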
[48:35] It's got all of these it's understanding
[48:37] the data. Data looks great. We've got
[48:39] this. Got a nonbank.
[48:42] Now it's going to rebuild the database.
[48:46] And now it has its first error.
[48:49] Mhm.
[48:51] Exit code one.
[48:54] So we could see what it says.
[48:58] How do you look at this? Or adjust?
[49:01] It's the same way; I would say
[49:03] press Ctrl+O. So there was some
[49:05] error that happened there, and we'll
[49:06] say,
[49:07] what happened? Oh, they have Windows
[49:09] line endings. This is the sort of
[49:10] thing I would never know. It says,
[49:12] "Windows line endings and fixed-width
[49:13] padding. Let me fix the reading
[49:15] options."
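Both problems named in the error are common in older government files: rows end in `\r\n` rather than `\n`, and fields are padded with trailing spaces to a fixed width. A minimal sketch of a reader that tolerates both, using only Python's `csv` module, is below. The sample rows are invented for illustration, and this is not the fix Claude applied in the video, which adjusted the reading options of its own loader.

```python
import csv
import io


def read_padded_csv(text):
    """Parse CSV text with Windows (CRLF) line endings and space-padded fields."""
    # newline="" leaves line endings untouched so the csv module handles CRLF itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Strip the fixed-width padding from every field.
    return [[field.strip() for field in row] for row in reader]


# Invented sample with CRLF endings and trailing padding in the name column.
raw = "lender_id,name\r\n0000001,ACME BANK      \r\n0000002,EXAMPLE MORTGAGE  \r\n"
rows = read_padded_csv(raw)
```

Without the strip, `"ACME BANK      "` and `"ACME BANK"` would be treated as different lenders in any later name match or join.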
[49:16] So I need to add this for these messy
[49:18] pre-2018. This is the kind of thing
[49:19] that's awful when you work with data. Um
[49:24] and so it's updating the code to address
[49:26] this. It's going to rerun it now.
[49:32] It says, "Excellent. 26 fintech
[49:34] lenders identified, all loans, all years.
[49:37] Let me verify that these look
[49:38] reasonable."
[49:43] And now it has market share over time.
[49:46] And what you can see is, here's LA
[49:48] County:
[49:49] it has grown substantially, and of
[49:51] course as of 2020 it went up to 70%,
[49:54] and now it's around 60. And
[49:56] here's Rocket Mortgage,
[49:58] which is a huge lender. It's
[49:59] the largest lender.
[50:02] And it looks like it's great. So it's
[50:04] basically created everything exactly
[50:05] from this classification.
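Once each loan carries a lender class, a market share over time is a grouped ratio. The sketch below computes the fintech share of origination dollars by year in plain Python. Whether the share in the video is measured in dollars or in loan counts is not stated, so the dollar basis is an assumption, as are the field names and toy numbers.

```python
from collections import defaultdict


def fintech_share_by_year(loans):
    """Return {year: fintech dollars / all dollars}, with years in ascending order."""
    fintech = defaultdict(float)
    total = defaultdict(float)
    for loan in loans:
        total[loan["year"]] += loan["amount"]
        if loan["lender_class"] == "fintech":
            fintech[loan["year"]] += loan["amount"]
    return {year: fintech[year] / total[year] for year in sorted(total)}


# Toy data: the fintech share rises from 25% to 75% between two years.
loans = [
    {"year": 2010, "amount": 100.0, "lender_class": "fintech"},
    {"year": 2010, "amount": 300.0, "lender_class": "other"},
    {"year": 2020, "amount": 300.0, "lender_class": "fintech"},
    {"year": 2020, "amount": 100.0, "lender_class": "other"},
]
shares = fintech_share_by_year(loans)
```

The resulting year-to-share mapping is exactly the series a line or stacked-area figure would plot.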
[50:08] And now what we could do is, I'm going
[50:10] to say, "Please make a figure of the
[50:13] trend in
[50:15] fintech origination share over
[50:20] time
[50:21] in the data set
[50:25] using Kieran Healy styling
[50:28] for
[50:30] plotting."
[50:33] So now it's going to make a figure.
[50:35] And then we'll look at it and we will
[50:39] summarize.
[50:44] So while this is running,
[50:46] let me start by summarizing
[50:48] what we did at a high level.
[50:50] I think there's a lot of stuff we did.
[50:51] This ended up being really long.
[50:53] I'm sorry for everybody who stuck with
[50:54] us. Appreciate it.
[50:56] What we did was take big
[50:59] data sets. We worked with Claude
[51:02] Code to
[51:03] write a script to download them,
[51:05] put them all in the pipeline,
[51:07] and then clean them up and work in a
[51:08] framework that
[51:14] lets us work with them very efficiently.
[51:16] DuckDB is the thing that I was
[51:17] arguing you should use. There are other
[51:19] things like SQLite that you can use, but
[51:21] they don't require a server or anything
[51:23] else and they have a lot of metadata. So
[51:24] now you could take that data set and you
[51:26] could always go back to it and
[51:28] the LLM would understand
[51:29] what's going on.
[51:31] We then did a crosswalk where
[51:33] we basically mapped information about
[51:35] the lenders directly from the literature,
[51:37] and we were able to pull that
[51:39] data in and harmonize a lot of the
[51:41] information there. And that lets us do a
[51:44] lot. This data set is actually,
[51:46] for a lot of researchers in finance,
[51:48] one of the main things that
[51:49] you'll start with when you're working
[51:50] with it. So let me end by
[51:52] showing you what you can see in this
[51:54] final data set. I'm going to
[51:55] share it with you. And now it'll look a
[51:57] little nicer than our other figure.
[52:01] So
[52:02] here.
[52:03] So here
[52:05] is you know, it decided to do like this.
[52:10] So this is the
[52:13] um
[52:14] that.
[52:18] It's interesting in that
[52:20] this is basically non-bank lenders,
[52:22] right? So the light blue here, and
[52:26] actually I'll read you what it says
[52:28] here: the
[52:30] light blue band is
[52:32] non-bank lenders. There are
[52:34] a lot of non-bank lenders in the United
[52:36] States that make loans, and they've
[52:39] always done that. They've always
[52:40] been on the outside.
[52:43] They were a lot of the ones who
[52:44] were doing securitization and other
[52:45] things. This blue band is non-bank
[52:47] lenders who Fuster and co-authors
[52:50] classify as fintech. And what you can
[52:52] see is that they've grown as a
[52:54] proportion of overall lending by quite a
[52:56] bit during this period. Now, whether or
[52:58] not these other places are non-fintech
[53:00] is a big question, but this is
[53:02] right around when Quicken became big, and
[53:03] you can see that they've grown as
[53:04] an overall share. And more generally,
[53:07] this is the collapse in traditional bank
[53:09] lending overall. They're the residual
[53:12] and they've gone down. They now make up
[53:14] only 40% of all lending, relative to
[53:17] during the crisis and what we
[53:19] >> of the dark blue?
[53:21] Yes. Okay.
[53:23] Rocket Mortgage is sort of in
[53:27] here.
[53:29] Um
[53:31] So it's very cool. There's a lot
[53:33] of other things you could do. You could
[53:34] work with Claude to clean this
[53:35] stuff up. All the slow stuff we've
[53:37] already done now. And so, you know,
[53:40] I'd really encourage you. There are
[53:42] a lot of really cool big data sets
[53:45] out there. I've
[53:48] posted online about this, but I just want to
[53:49] tell you,
[53:50] we're not going
[53:52] to make anyone watch us download
[53:55] a bunch of big data sets again, but
[53:57] there's stuff
[53:59] from what's called IPEDS, which is
[54:02] information about post-secondary
[54:03] institutions. So if you're interested in
[54:05] colleges or universities, the federal
[54:07] government posts a huge amount of
[54:08] information about that, all
[54:10] structured, and I've posted code on how
[54:12] you can harmonize that and put it all
[54:13] into one big DuckDB database, if
[54:16] you're interested in studying
[54:17] universities.
[54:18] The same thing is true for a lot of other
[54:19] databases. And so I think
[54:22] the world's your oyster if you're
[54:23] interested in working with data. If
[54:24] you're a graduate student, if you're a
[54:25] junior researcher and you want to
[54:27] learn about new data, I think it's
[54:28] become a lot easier to work with these
[54:30] big data sets and clean
[54:32] and work with them. So hopefully this
[54:34] was useful. Thank you for sticking with
[54:35] us for this long video, and
[54:38] next time, I think, Markus, we're going to
[54:40] talk about writing and the
[54:42] general ways in which you can use LLMs
[54:43] for those sorts of tasks: writing
[54:45] reports, iterating on proofs, and
[54:48] other things, yeah?
[54:49] Very good. Thanks a lot, Paul. So
[54:52] next time we're looking forward to how to
[54:53] write your referee report for this
[54:55] project we started today.
[54:57] Exactly. Exactly. All right. Thanks so
[54:59] much, Markus. Thanks a lot. Thanks to
[55:01] everybody and uh see you soon again
[55:04] for the next mini video.
[55:08] Bye-bye.
[55:09] Bye.
[55:11] All right.