# Web scraping: Claude Code for Economists with Paul Goldsmith-Pinkham | Markus Academy | Ep. 162-3

https://www.youtube.com/watch?v=wqLZrKdevHs

[00:08] Welcome back, everybody, to the mini video series on how to use Claude Code for applied economists, with Paul Goldsmith-Pinkham.
[00:16] Hi Paul.
[00:18] Hi Markus, how are you doing?
[00:21] Now we have the third video.
[00:24] First we did installation and terminal setup for Claude Code, and Co-work as well.
[00:29] And then we talked about data analysis with a simple example, and today Paul will show us how to scrape data from the web.
[00:38] And we're looking forward to Paul's presentation.
[00:42] Great.
[00:44] So today we're really going to be deep in the woods of coding, right at the command line.
[00:53] Previously we had done some work using Co-work and the other tools we were talking about.
[01:03] Today we're going to really be focused on
[01:10] doing a real data collection exercise right at the command line, just so you can see how powerful Claude Code is.
[01:16] These are things that I think would be doable in Claude Co-work, but for today we're going to focus on Claude Code.
[01:21] Before, we were downloading data directly from a government API.
[01:29] That's one of the benefits of that approach, right?
[01:30] The data is very well structured.
[01:33] We were downloading from an API, or from tables that were already built to be used.
[01:39] But for a lot of what we do, the data isn't really that structured at all.
[01:41] We're scraping something from the web, or we're given data that's very messy, and we have to do a lot of work to turn it into something structured.
[01:49] So what we're going to do today is use something from the SEC, the Securities and Exchange Commission's website, which is called EDGAR.
[01:57] Now, this is still a very structured data set, but it's really used for regulatory filings, as different companies file things.
[02:03] And we're going to take what are called 10-Ks, which are annual filings made by companies.
[02:10] We're going to pull stuff out.
[02:12] We're going to build a database, and then we're going to do some basic analysis.
[02:16] And we're going to do it all with Claude Code.
[02:18] So, the example that I want to do is thinking about tariffs.
[02:23] This is just something I picked.
[02:25] We could pick lots of stories, but I think this would be a fun one.
[02:28] In 10-Ks there is something called Item 1A, the risk factors, where you describe the material risks that face your company.
[02:39] So, when you have new risks, you have to talk about them.
[02:41] There's no quantity per se described in it.
[02:43] But one thing that's interesting to measure is how much these tariff risks changed and came up, and how often tariffs are described as a material risk over time for these firms.
[02:53] This is related to a lot of work, such as the work by Baker, Bloom, and Davis on policy uncertainty and macro.
[03:01] And risk factors generally are really interesting if you're interested in this type of problem. There's no reason to focus per se on tariffs, but pulling this data can be kind of a pain.
[03:08] I'm going to show you
[03:11] it's really kind of a straightforward problem.
[03:15] So, we're going to do a number of things, but we're not going to have to do too much ourselves.
[03:20] Mainly, Markus and I will be sitting here watching what's going on and talking about it.
[03:23] We're going to scrape 10-K filings from EDGAR.
[03:26] We're going to extract the information from Item 1A.
[03:28] We're going to put this into a database, and then we're going to query that database to talk about various trends in it.
[03:35] And maybe we'll make a picture that looks nice to describe this.
[03:39] So, I'm going to show you what this looks like, and I'll move over to the command line in a second, but I'm going to give a structured prompt for this just to make life easy.
[03:51] I'm going to say I want to build a data set of 10-K risk factors from SEC EDGAR to study how tariff-related risk disclosures changed over this time period.
[04:03] We could change that if you wanted.
[04:04] Here's what I want to know from EDGAR.
[04:09] Or rather, here's what I know about it, excuse me.
[04:10] Every company has a CIK, which just means
[04:13] that if you have a company, you have a number in EDGAR that's associated with you.
[04:18] Here is where the API is.
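A minimal sketch of how a CIK turns into an API address. The URL pattern below is an assumption taken from SEC's public EDGAR documentation, not something shown in the video, and `submissions_url` is a hypothetical helper name. The example CIK, 39911, is the one read out for Gap Inc. later in this episode.

```python
def submissions_url(cik):
    """Build the EDGAR submissions URL for one company.

    Assumption: the endpoint is keyed by the CIK left-padded with
    zeros to ten digits, as in
    https://data.sec.gov/submissions/CIK##########.json
    """
    cik_number = int(cik)  # accepts 39911 as well as "0000039911"
    if cik_number <= 0:
        raise ValueError("CIK must be a positive integer")
    return f"https://data.sec.gov/submissions/CIK{cik_number:010d}.json"


print(submissions_url(39911))
# prints https://data.sec.gov/submissions/CIK0000039911.json
```

Accepting both the integer and the zero-padded string form keeps the helper usable whichever way a ticker file happens to store the number.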
[04:21] So for this one, I'm going to give Claude a little more context up front than usual.
[04:26] So, you know, I'm going to actually tell it a little more explicitly the things that I know.
[04:30] I'm not going to be quite as vague as I was in the last prompt, mainly to make life easier for it.
[04:35] If you remember from the first video, I talked about how well it does being a function of how clear and precise you are.
[04:40] So, the more precise I am, the less it has to search randomly for things.
[04:44] I'm going to say, "Look, I want to build a data set of 10-K risk factors to study how tariff-related risk disclosures changed."
[04:51] Here's what I know.
[04:54] So, I know how it's structured.
[04:56] You have to have a CIK.
[04:59] Here's the EDGAR filing API, and here's how you should rate limit it.
[05:01] And you need to use something called the User-Agent header, which, if you remember from the last video, is what lets the script get through to these websites.
[05:10] And I want 10-Ks. What's the rate limit?
[05:14] Sorry.
[05:15] Oh, the rate limit is so that it doesn't just try and get too many files from this.
[05:18] So, I'm saying I see.
[05:20] You can only download, you know, 10 every second here.
[05:22] It's to keep you from spamming the server and getting blocked.
[05:27] And most of the time Claude would know beforehand, but this is roughly the rate that you can use.
[05:33] And the user agent, yes, that's just a way to tell the Python script that it writes that its requests basically have to have a user agent associated with them, which is going to be my name and my email address.
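The two constraints described here, at most about 10 requests per second and a User-Agent header carrying a name and email, can be sketched with the Python standard library alone. This is an illustrative sketch, not the script Claude wrote in the video: `RateLimiter`, `build_request`, and `fetch` are hypothetical names, and the contact string is a placeholder.

```python
import time
import urllib.request


class RateLimiter:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def build_request(url, contact):
    """Attach the identifying User-Agent header (name plus email)."""
    return urllib.request.Request(url, headers={"User-Agent": contact})


def fetch(url, limiter, contact):
    """Download one URL, waiting first so the rate limit is respected."""
    limiter.wait()
    with urllib.request.urlopen(build_request(url, contact), timeout=30) as resp:
        return resp.read()


# Placeholder identity; a real script would use your own name and email.
limiter = RateLimiter(max_per_second=10)
contact = "Jane Doe jane.doe@example.com"
```

Calling `fetch(url, limiter, contact)` in a loop then never exceeds the configured rate, because every call passes through the same limiter.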
[05:44] Okay.
[05:47] I'm going to say I want to do it for 30 firms of different kinds.
[05:49] I'm going to just give it some examples.
[05:52] Obviously, if we were doing this for a research project, we'd say do it for every publicly listed stock, you know, all 500, and maybe I'd give it a list of names.
[05:58] But for here, I'm just going to let Claude make some decisions.
[06:00] Okay.
[06:02] So, I'm going to do this now.
[06:03] I'm going to take this over.
[06:04] So, I'm going to pause for 1 second as I move over and do this on the command line.
[06:08] Great.
[06:08] So, here we are at the command line, Markus.
[06:10] What I'm going to do is make a folder inside one of the things
[06:15] that I've been doing.
[06:17] I'm just going to make a folder that we'll go in.
[06:20] We're going to start a new fresh Claude um session.
[06:24] And here, I'm going to tell it to basically create um I'm going to give it the prompt that we just talked about.
[06:31] So, we're going to start the project.
[06:32] So, I'm going to put it in.
[06:34] It's going to have this pasted text and I'm going to say I want to build a data set, blah blah blah.
[06:38] It's everything we just talked about.
[06:39] Mhm.
[06:39] So, it's going to start thinking and it's going to say, "Okay, I'm going to enter plan mode."
[06:43] So, now it's saying, "Okay, I need to make a plan on how to do this."
[06:46] So, let me think about how to do this.
[06:49] "Okay, it's going to look here."
[06:52] Um it's thinking.
[06:54] And so, one thing you can do is look, and it's saying it ran a command.
[06:59] The command is called ls, which means it looks inside the folder.
[07:01] It says, "Hey, this is a brand new folder.
[07:04] There's nothing in here."
[07:04] Yeah.
[07:06] So, it's thinking.
[07:06] So, you remember how we talked about how there's a context window where it does lots of stuff.
[07:08] So, now it says, "Wants to build a thing that does this.
[07:11] This is a substantial project that involves all these things.
[07:14] It's a non-trivial implementation.
[07:14] Let me enter
[07:16] plan mode to design the approach before doing this."
[07:20] Now, it's saying, "Okay, let me look at this and research the Edgar API before designing it."
[07:25] I'm going to say, "Yes, you can fetch this content from here."
[07:26] I'm going to say, "Yes, and don't ask again."
[07:31] I'm giving it lots of permissions to go search things on the web.
[07:34] And here's how you can search.
[07:35] It's looking up documentation.
[07:37] So, right now it's running all of these commands, asking, you know, what are the things here?
[07:42] What's going on here?
[07:44] So, it's running a lot of different tool commands all at the same time.
[07:49] And it's trying to figure out how to query EDGAR to do different things.
[07:54] So, I'm giving it a lot of permissions just so it can go and search and understand how to do this.
[08:00] This is a bit like what you and I would do if we were going and looking this up.
[08:02] It just reads faster than I do.
[08:06] So, it's working.
[08:10] Once it's kind of figured out how these tools work, it's going to then try to make a plan.
[08:14] So, it's figuring out, okay, well, here's the the different filing things.
[08:17] Here's what it's doing.
[08:20] And so, it's it's searching.
[08:22] Still searching.
[08:23] You can see it's refining what it's doing.
[08:24] The way to think about these is that each one is a tool: it goes and does something, and then it sends something back to the agent.
[08:33] And we actually don't see it in this window, but that's what it's doing.
[08:36] Now, it's it's trying to pull something from this person's GitHub page.
[08:39] I'm going to say yes.
[08:41] You can go look at that.
[08:44] But it's still making a plan, no?
[08:45] Okay, great.
[08:47] So, we had to stop really quick there, Marcus.
[08:48] Sorry about that.
[08:50] But one thing that's great about this is that I interrupted while it was in the middle of this, and that can happen for a number of reasons.
[08:53] Maybe you hit the button by accident because you have fat fingers, or maybe your computer turns off, or something like this.
[09:02] The amazing thing is, because of how the context window works, which we talked about, you see it says "Interrupted. What should Claude do instead?" because I hit Escape.
[09:09] I can say, "Oh, please continue."
[09:13] So, it just says, "Oh, well, I'll just keep working then."
[09:15] So, I can just say this.
[09:17] It's going to say, "Okay, well, I'll just keep
[09:18] working. Let me research some more."
[09:20] it?
[09:21] So, I interrupted it when I when we uh
[09:24] Okay.
[09:24] So, it, you know, it's saying, "Oh, we have these errors."
[09:26] It's going to continue planning. Let's keep going.
[09:29] So, obviously there were some errors here that that occurred.
[09:31] It's going to keep working.
[09:32] It needs a user agent.
[09:35] So, I know the API well.
[09:37] So, it's going to do this.
[09:38] Let me clarify a few things before we do it.
[09:40] So, it tried to do this, and now this is the coolest part.
[09:42] One thing that's really lovely about Claude Code is that it's going to ask me questions about how I want to do it.
[09:45] Remember when we did Claude Co-work, it did the same thing.
[09:49] Yeah. How do you want this structured?
[09:51] And so, do you want a SQL-like database?
[09:53] So, with things?
[09:56] Do you want a flat file?
[09:58] Do you want both?
[10:00] I'm actually going to tell it I want something called DuckDB,
[10:01] which is very similar to a SQL-like database.
[10:04] It's just how I prefer to do things.
[10:05] Um
[10:07] So, this is a fourth option.
[10:08] You can say, "DuckDB database,
[10:10] please."
[10:12] Remember, I'm very polite.
[10:13] So, then it asks, "Do you want a keyword search?
[10:17] Extract the full text
[10:20] and then flag paragraphs?
[10:22] Do you want to extract the full section and do the keyword analysis later?
[10:27] Or do you want to do LLM classification?" But then I would have to pay for it.
[10:30] I'm going to do the keyword search.
[10:31] We're just going to look for words that are related to tariffs.
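The keyword-search option chosen here can be sketched as splitting a risk-factor section into paragraphs and keeping those that mention any term on a list. This is a sketch under assumptions: the keyword list below is illustrative (the terms are ones named elsewhere in this episode, but the full list Claude used is not shown), and `flag_paragraphs` is a hypothetical name.

```python
import re

# Illustrative keyword list, not the exact one used in the video.
TARIFF_KEYWORDS = [
    "tariff",
    "trade war",
    "trade dispute",
    "anti-dumping",
    "countervailing",
]


def build_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword.

    The leading word boundary avoids matching inside other words;
    leaving the end open lets "tariff" also match "tariffs".
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(r"\b(?:" + alternatives + r")", re.IGNORECASE)


def flag_paragraphs(section_text, keywords=TARIFF_KEYWORDS):
    """Return the paragraphs of a section that mention any keyword."""
    pattern = build_pattern(keywords)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section_text) if p.strip()]
    return [p for p in paragraphs if pattern.search(p)]


example = (
    "We face intense competition.\n\n"
    "New tariffs on imports could raise our costs.\n\n"
    "A trade war may reduce demand for our products."
)
flagged = flag_paragraphs(example)
print(len(flagged))  # prints 2
```

Counting flagged paragraphs per filing gives a simple intensity measure without paying for LLM classification, which is the trade-off discussed above.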
[10:35] And it says, "What entity should we use for the user agent?"
[10:38] Remember you have to give this header.
[10:41] It's saying, "Do you want to fill it in later? Or do you want to do it right now?"
[10:44] I'm going to say I'm going to do it right now.
[10:46] I'll provide it.
[10:49] And then it's going to ask me.
[10:50] It's going to say, "Okay, well, please write your thing in now."
[10:53] And so, you know, you asked this before is that the planning takes a while.
[10:57] It's kind of similar to you'd say, um So, what name and So, So, what it's going to say is
[11:03] I It's saying, "Oh, it's very kind of interesting."
[11:06] It's saying, "Use placeholder text for now. I'll type it."
[11:10] And it's saying, "Write it here in number three."
[11:12] So, when it says, "I'll type it," it's saying, "Please put it here."
[11:14] So, I'll say, "My name is Paul Goldsmith-Pinkham.
[11:16] My email is paulgp@gmail.com."
[11:20] So, I did that.
[11:22] And it's going to do this.
[11:23] So, it's going to put this here.
[11:25] And on what we were saying about planning,
[11:26] what's really remarkable is that with many projects, the plan is almost as important as the execution.
[11:34] I see.
[11:37] It's similar to when we do research: often what we're doing is trying to figure out what the plan is, right?
[11:41] We have a vague idea, and we iterate.
[11:45] When you work in a software engineering setting, much of what you're doing is related to that.
[11:49] So, it's thinking really hard.
[11:50] You see, if we went into here, it says, "Let me design this."
[11:54] I have all the info we need.
[11:56] Here's what we're going to do.
[11:57] Here's how to implement it.
[11:59] Here's my plan.
[12:00] I'm going to do this.
[12:02] It's going to do these targets.
[12:05] DuckDB schema.
[12:06] Here are words I'm going to look for.
[12:08] Here are all these words.
[12:10] Now, I'm thinking through the target list I should use.
[12:11] If I'd given it a list, it would have been easier for it.
[12:13] So, retail, tech.
[12:16] Now, I need to figure out how to do this.
[12:17] I'm also thinking about whether to use the API or whether just downloading the
[12:21] complete thing is better.
[12:23] So, I'm going to access the filing index through this, use Beautiful Soup, which is just a package to parse the HTML, and then look for something in the text.
[12:30] But, if I wanted to change something now as in oh, there was something wrong with Yeah, that's a great question.
[12:35] Yeah, so add another firm or something?
[12:38] Great. So, yeah, yeah. So, here's what we're going to do.
[12:39] So, here's the plan summary.
[12:41] So, we now have a plan, it says.
[12:43] Okay.
[12:44] So, remember how we talked about managing the context window, and how it's useful to be in charge of it: you do research, and then you can... This is what we talked about in the first video.
[12:53] You would write down a plan.
[12:55] Claude has now automated this directly.
[12:57] They're actually just plans I see.
[12:59] in here, and what it does in plan mode is construct a file, which is kind of the compressed memory of all the planning it did, turned into an executable idea.
[13:07] Okay. So, here it's saying, "Here's the plan summary.
[13:09] I'm going to make a script that looks up the CIKs for all of these.
[13:14] Then it's going to find them here.
[13:16] It's going to get all their filings, and I'm going to filter for 10-Ks.
[13:20] I'm going to download and parse each 10-K and look
[13:22] for Item 1A.
[13:24] I'm going to look for certain words, and then I'm going to store this in a database."
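The "get all their filings and filter for 10-Ks" step can be sketched as follows. The JSON layout is an assumption based on SEC's public submissions API (parallel lists under `filings.recent`), the sample values are made up for illustration, and `filter_10k` is a hypothetical name rather than a function from the script in the video.

```python
def filter_10k(submissions, first_year, last_year):
    """Keep only 10-K filings filed between first_year and last_year.

    Assumes the submissions JSON stores recent filings as parallel
    lists: form, filingDate, accessionNumber, primaryDocument.
    """
    recent = submissions["filings"]["recent"]
    rows = zip(
        recent["form"],
        recent["filingDate"],
        recent["accessionNumber"],
        recent["primaryDocument"],
    )
    return [
        {"filing_date": date, "accession": accession, "document": doc}
        for form, date, accession, doc in rows
        if form == "10-K" and first_year <= int(date[:4]) <= last_year
    ]


# Made-up sample in the assumed layout; no real filings are referenced.
sample = {
    "filings": {
        "recent": {
            "form": ["10-K", "8-K", "10-K", "10-Q"],
            "filingDate": ["2025-03-18", "2025-02-01", "2019-03-19", "2024-11-26"],
            "accessionNumber": ["A-1", "A-2", "A-3", "A-4"],
            "primaryDocument": ["annual2025.htm", "event.htm", "annual2019.htm", "q3.htm"],
        }
    }
}

annual_reports = filter_10k(sample, 2022, 2025)
print([f["document"] for f in annual_reports])  # prints ['annual2025.htm']
```

Filtering on the exact form string leaves out amended filings such as 10-K/A; whether to include those is a research design choice the video does not discuss.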
[13:28] And what it's going to do is cache the data, which means that after it's downloaded the HTML, it won't download it again.
[13:33] Remember we talked about this last time, when it was re-downloading things, which felt slow?
[13:38] This is a good example: it's just going to download all the raw data into one place so that it doesn't get stuck re-downloading every time.
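The caching idea can be sketched as "check the disk before hitting the network". This is a minimal sketch, not the code from the video: `cached_fetch` is a hypothetical name, the cache file naming scheme is an assumption, and the downloader is passed in as a function so the example runs offline.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, cache_dir, download):
    """Return the content for url, downloading only on a cache miss.

    download is any function mapping a URL to bytes, so a rate-limited
    fetcher can be plugged in without changing this function.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Hash the URL to get a safe, unique file name (an arbitrary choice).
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name

    if path.exists():
        return path.read_bytes()  # cache hit: no network access

    content = download(url)
    path.write_bytes(content)
    return content
```

With this in place, rerunning the pipeline after fixing one ticker only downloads the files that are not already on disk, which matches the quick rerun seen later in the episode.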
[13:46] And then it's going to give lots of details on what the plan is, because this is basically instructions to itself.
[13:51] So, it's going to say, "All right, I'm going to build a pipeline that does this.
[13:54] Here are the target companies, so it's going to be three different kinds.
[13:58] Here's the architecture. Here's how you're going to do it."
[14:00] This is just the longer version of what I described above.
[14:03] But that's where the one of the keywords I would like to instead of tariffs, I want to
[14:08] Yeah, you want to say you want to add another one.
[14:10] You want to say like "What if we added in Trump as well?"
[14:14] Like Trump or Yeah, so totally.
[14:17] So, it asks, "This is the plan. Would you like to do it?"
[14:21] And so, you could say yes, clear context, and auto-accept.
[14:23] So, what that means is... Mhm.
[14:26] Remember how you wipe the window and you just read the plan and go?
[14:29] One is you could say just yes, I don't want to wipe it.
[14:31] I just want to go from here.
[14:34] Or yes, but I don't want to just auto-accept what you're going to do.
[14:36] I see.
[14:38] Or four, what I'm going to say is
[14:40] "Please add in references to Liberation Day.
[14:44] Perhaps Liberation
[14:45] The Liberation Day.
[14:45] Great idea.
[14:50] Yes.
[14:50] Um to the tariff keywords.
[14:57] So, I It's going to say here, "Tool use rejected with this message.
[14:58] Please add in this."
[15:00] And it's thinking, "Good call.
[15:02] Liberation Day is very positive."
[15:04] Claude is always very positive.
[15:06] With April 2nd, 2025 tariff now, let me update the keywords.
[15:08] So, add Liberation Day and Liberation Day tariff to the keyword list.
[15:10] And so, it's going to do that and it will put it here.
[15:14] Liberation Day, Liberation Day tariff.
[15:17] And everything else is pretty similar.
[15:19] It would be weird.
[15:21] So, in the interest of this, I'm going to
[15:24] say yes.
[15:25] I'm going to say go ahead and auto-accept edits.
[15:28] So, that's going to let it write everything. It won't be able to execute things, but it will be able to write and edit files on its own.
[15:32] So, but now you compress everything, the old stuff, no?
[15:39] Yes, it will get rid of all the planning it did before when it clears the context window.
[15:42] That's sort of what it's saying: it's going to clear all the old stuff, but it's going to save this plan that we just read through into a file.
[15:48] In fact, you can see it right here.
[15:51] It's going to have this plan right here as a file that it saves.
[15:55] It actually keeps track of all your files, these plans.
[15:59] And so, it says here I could actually this three meek knitting I call Yeah, so it just has The way that it does this is that it just combines three now three different words together.
[16:07] So, it's an adverb, noun, and noun.
[16:12] Um so,
[16:15] I'm going to say yes.
[16:16] Clear context and accept and accept edits.
[16:20] So, now it has this in here.
[16:23] Um and
[16:25] Um.
[16:29] Um and so, now it's running.
[16:32] So, it's going to write some information here.
[16:35] It's thinking again, essentially, but it had a plan already.
[16:38] Yeah, but the thinking is just that it's thinking about how to actually implement the plan.
[16:42] Now it has a plan. It knows how it wants to do it.
[16:44] Okay.
[16:46] Um.
[16:47] It's And we can look at how it's thinking.
[16:48] It wants me to implement this.
[16:50] Let me create the files, requirements.txt.
[16:51] That's just for the packages, and then... How did you switch on looking into the thinking?
[16:59] Oh, "thinking" is just a word that it's using.
[17:00] So, it could be like "smooshing" or anything.
[17:02] This is just a random verb. Or do you mean if you want to look into the thinking process?
[17:05] Oh, I'm sorry.
[17:05] Great question.
[17:07] Yes, I should have said this.
[17:09] So, if you hit Ctrl+O, it will show what it's thinking.
[17:11] And if you hit Ctrl+O again, it will go back.
[17:13] I should have said that.
[17:15] Yeah.
[17:16] What does the O stand for?
[17:16] I think just "open" is probably the easiest way to think of that.
[17:21] Like it opens up all the stuff that's going on behind the scenes.
[17:25] I see.
[17:25] So here it just wrote a bunch
[17:27] of code.
[17:29] It wrote 480... so it just wrote this whole script here.
[17:31] It's a Python script, and it just wrote all of this.
[17:33] You know, it has all the phases that we talked about.
[17:35] So here's phase one, which does the lookup.
[17:37] Mhm.
[17:39] I mean, this is all code that would take a little while to write yourself, and it did it, you know, in just a minute or two.
[17:44] Are you able to double-check this code, or do you just run with it?
[17:47] I mean, the benefit is that if you have a way of knowing that it worked correctly, that's helpful.
[17:50] So you should know that it's going to pull, you know, three to four filings for 30 companies.
[17:56] So you should know you should have 128 five like rows of filing information.
[18:01] Okay.
[18:03] And so that's one way to know.
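The check described here, knowing roughly how many rows to expect before looking at the output, can be written down explicitly. A minimal sketch, with a hypothetical helper name, using the figures stated above of three to four filings for each of 30 companies:

```python
def check_row_count(n_rows, n_firms, min_filings, max_filings):
    """Compare an observed row count with the range implied by the design.

    Returns (ok, (low, high)) so the caller can report the expected range.
    """
    low = n_firms * min_filings
    high = n_firms * max_filings
    return low <= n_rows <= high, (low, high)


ok, expected = check_row_count(n_rows=112, n_firms=30, min_filings=3, max_filings=4)
print(ok, expected)  # prints True (90, 120)
```

A count outside the range does not say what went wrong, only that something did, which is usually enough to prompt a closer look at the scraper's log.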
[18:05] So now it's going to just do Now it's going to try and run stuff.
[18:07] So here it's installing dependencies.
[18:13] Mhm.
[18:13] And so it's going to say, "Hey, you already have these things, so you don't need to install them."
[18:17] So it didn't need to install anything new, and now it's going to run this scraper.
[18:20] It will take several minutes since it downloads 90 plus filings.
[18:23] I don't know if that's true.
[18:25] Um so it could not find something.
[18:29] it's actually running here in the background.
[18:32] Um could not find the CIK for this company.
[18:35] I see.
[18:36] Now it's downloading here.
[18:38] So it did a bunch very quickly.
[18:41] Here's a 30. It just It's looking for the filings.
[18:44] Mhm. And it's gotten most of those.
[18:46] So this is figuring out where all the filings are and now it's downloading the actual documents.
[18:51] So this is actually coming from its own command line that it's running.
[18:55] It's running this downloader here.
[18:56] It's downloading and extracting information.
[19:00] I think we're missing it for one company.
[19:04] So it has all but one, and so it's running here.
[19:07] Now we're up to 22, 26%.
[19:11] It's downloading.
[19:11] Can I ask later on which company the CIK stands for?
[19:17] Oh, GPS, the one that it missed?
[19:19] Yes.
[19:20] Yeah. One thing that's nice is you can queue your questions.
[19:22] You could say, "What company was it unable to find
[19:29] a CIK for?"
[19:32] Okay.
[19:32] And then you can edit it: if you change your mind, you can press up to edit it.
[19:40] Um Okay.
[19:40] Oops, sorry.
[19:43] And so, that will answer it.
[19:45] So, we're almost done, 93%.
[19:47] And so, what's nice is you could go and um look at the folder really quick.
[19:52] So, if we go and look in the folder... well, I guess I won't do this because I don't want to.
[19:58] What I can tell you is it's downloaded 116 of these already.
[20:02] So, it it did it.
[20:02] It's running this command and it's it's here.
[20:06] Um the pipeline ran successfully.
[20:08] Here's the summary.
[20:10] It did 29 out of 30.
[20:10] Do you So,
[20:15] And GPS is Gap Inc.
[20:18] It's the one ticker it couldn't find a CIK for.
[20:19] That's because Gap Inc. rebranded and it's now traded under GAP in some data sets.
[20:23] "If you would like, I can look up the correct ticker mapping and add it."
[20:27] Yes, please add it.
[20:31] Okay.
[20:32] So, now we'll see what it does in response to this question.
[20:39] And it will go and it will look. So, what it's doing is it's looking in SEC EDGAR and the company tickers for where Gap is.
[20:50] So, I'm giving it permission to do this.
[20:56] See if I can... Pretty cool.
[20:59] Yeah. So, it's searching. It's trying to figure out what its CIK is. It looks like it's done.
[21:05] CIK 39911. "Let me check why. Was it case sensitivity or something else?"
[21:14] And it looks like it was because it was GAP.
[21:17] Mhm. Not GPS. "And so I'll update the script to use GAP instead." And so it's going to change it in here to be correct. "Now let me rerun. Since it's cached, it's only going to download one file." Okay. Okay.
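The ticker problem seen here (the list used GPS, while the lookup table has GAP) can be sketched as a dictionary lookup with an optional alias table. The file layout is an assumption based on SEC's public `company_tickers.json` (entries with `cik_str`, `ticker`, and `title`); the sample entry and the helper names are illustrative, and only the CIK 39911 and the two tickers come from the video.

```python
def build_ticker_index(company_tickers):
    """Map upper-cased ticker symbols to integer CIKs.

    Assumes the layout {"0": {"cik_str": ..., "ticker": ..., "title": ...}, ...}.
    """
    return {
        entry["ticker"].upper(): int(entry["cik_str"])
        for entry in company_tickers.values()
    }


def lookup_cik(ticker, index, aliases=None):
    """Return the CIK for a ticker, or None if it is not in the index.

    aliases maps outdated symbols to current ones, for example after
    a company changes its ticker.
    """
    symbol = ticker.upper()
    symbol = (aliases or {}).get(symbol, symbol)
    return index.get(symbol)


# One illustrative entry in the assumed layout.
sample_tickers = {"0": {"cik_str": 39911, "ticker": "GAP", "title": "Gap Inc."}}
index = build_ticker_index(sample_tickers)

print(lookup_cik("GPS", index))                  # prints None
print(lookup_cik("GPS", index, {"GPS": "GAP"}))  # prints 39911
```

Returning `None` instead of raising makes it easy to collect every failed ticker in one pass and report them together, which is how the missing firm surfaced in the run above.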
[21:29] So it should be very quick. Yes, it's cached.
[21:32] Only four new files. It was about 43 seconds. The cache files are written here. Da da da. Okay.
[21:39] So now we can ask the next question, because it said we only got 112 out of the 120 sections extracted, with 113 tariff paragraphs.
[21:49] So, "Why didn't we get 120 Item 1A paragraphs?
[21:57] It seems like we are missing a few."
[22:01] "Good call. Eight filings failed the Item 1A extraction.
[22:06] So let me check which ones." It's going to check. Looks like it's from several of these: Honeywell and Intel, every filing for both. "The regex must not be matching what they're doing."
[22:18] So there's an element to which, if you know it's mandatory to have these, one check is to say, "Hey, every year there should be an item for this." Yeah. So, figure out what's going on.
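The "every filing should have this item" check can be sketched as a small audit over the extraction results. The record layout and the function name are hypothetical, and the sample rows are made up; only the pattern (all filings failing for particular firms, as with Honeywell and Intel here) comes from the video.

```python
from collections import defaultdict


def firms_with_no_sections(filings):
    """Return tickers for which Item 1A extraction failed on every filing.

    filings is an iterable of dicts with a 'ticker' and a 'section_text'
    that is None or empty when extraction failed.
    """
    total = defaultdict(int)
    missing = defaultdict(int)
    for filing in filings:
        total[filing["ticker"]] += 1
        if not filing["section_text"]:
            missing[filing["ticker"]] += 1
    return sorted(t for t in total if missing[t] == total[t])


# Made-up extraction results for illustration.
results = [
    {"ticker": "HON", "section_text": None},
    {"ticker": "HON", "section_text": None},
    {"ticker": "INTC", "section_text": ""},
    {"ticker": "WMT", "section_text": "Item 1A. Risk Factors ..."},
    {"ticker": "WMT", "section_text": None},
]

print(firms_with_no_sections(results))  # prints ['HON', 'INTC']
```

A firm that fails on every filing points to a systematic parsing problem, while a single missing year is more likely a formatting edge case in one document.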
[22:28] And so it's writing this code. You know, the joke with these LLM things is that often you just hit enter. You don't even look at the code every time, because there are so many things it's going to check. But obviously, you know,
[22:38] "So in both of these, 'Item 1A' and 'Risk Factors' are on separate lines, with a newline between them. The current regex requires them to be on the same line.
[22:45] The fix is to allow newlines."
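The fix described here can be illustrated with two patterns. The actual regex Claude wrote is not shown in the video, so both patterns below are assumptions meant only to show the difference between requiring the heading on one line and tolerating line breaks inside it.

```python
import re

# Strict: only spaces or tabs may separate "Item 1A" from "Risk Factors",
# so the whole heading has to sit on a single line.
STRICT = re.compile(r"item[ \t]+1a\.?[ \t]+risk[ \t]+factors", re.IGNORECASE)

# Tolerant: \s also matches newlines, so the heading may be split
# across lines, as in the filings that failed.
TOLERANT = re.compile(r"item\s+1a\.?\s+risk\s+factors", re.IGNORECASE)

same_line = "ITEM 1A. RISK FACTORS\nOur business is subject to..."
split_lines = "Item 1A.\n\nRisk Factors\nOur business is subject to..."

print(bool(STRICT.search(same_line)), bool(STRICT.search(split_lines)))
# prints True False
print(bool(TOLERANT.search(same_line)), bool(TOLERANT.search(split_lines)))
# prints True True
```

A more permissive pattern can also match the table of contents entry for Item 1A, so a real parser would still need a rule for choosing among multiple matches.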
[22:47] It's going to do some work here to fix this. Regex, if you're not familiar, is just a way to look for patterns in strings. Now it's going to rerun it.
[22:55] So, "After you're done with this, instead of trying to fix these two, can you show which firms are mentioning tariffs most?"
[23:17] "Can you show": the "show" is missing?
[23:22] Oh, yes. Thank you. It probably wouldn't even care. It would probably make its best guess.
[23:29] Okay. Let's see how it did. Big improvement: 119.
[23:33] So, it's going to keep trying to do this. It's still trying to figure it out.
[23:37] But it's just 3M now.
[23:40] It's managed to break 3M.
[23:43] So, it's going to say it's likely a filing formatting edge case.
[23:46] Mhm.
[23:46] >> The 10-K might have an unusual structure. Yada yada yada.
[23:50] Let's proceed. I'm going to ask it to do this thing. It's going to look, and it's going to say what's the top word... So, it's going to give me some information. Mhm.
[23:59] It wrote a script to do this. And so, here's what it'll do. So, I said, "Which firms mention tariffs the most?"
[24:07] The ones that mention it the most are manufacturing.
[24:11] Deere, John Deere, leads with the most tariff mentions, increasing from three to four over time.
[24:16] Manufacturing, tech hardware; retail firms mention it the least.
[24:21] They both jumped to three.
[24:22] AVGO dropped from four to zero. Mhm.
[24:26] And Walmart only started mentioning tariffs in 2025. Oh, yeah. So, most paragraphs have it.
[24:34] question is is please download the 10Ks
[24:40] for um 2010
[24:43] um for these same firms and identify
[24:48] um
[24:49] the um um
[24:51] um identify the tariff
[24:54] measures there as well.
[24:57] It looks like even in 2010 it mentions it. Mhm.
[25:01] So, we're going to sort of see even in 2010, you know, 18 companies mention it. Okay.
[25:09] So, it looks like it was already widespread here, but the intensity has increased. More paragraphs per company in previous years? Not sure if that's... So, 25 across 18, versus 26 to 35 per year in roughly the same number.
[25:25] In 2010 it was narrower, mainly tariff, trade restrictions, anti-dumping, countervailing. Now it's, you know, trade tension, trade dispute, trade war.
[25:33] I see.
[25:35] And so on. So, you know, you could do a lot more with this. It's sort of interesting. It's not something that I was aware of, how common it is.
[25:41] I think the next thing, if I was going to do this, which we could do some other time, or you, the viewer, can do, is you could do this for a lot more firms.
[25:48] I mean, I think the biggest difference that's happened here, right, is that in this time period there's just far more going on. It's hitting a lot more firms, and so understanding the cross-section of firms here would be very interesting. Right.
[26:02] >> So, hopefully this is interesting. I think if you're interested in thinking about macro risks and parsing this kind of data, the cost (I mean, we did this in about 30 minutes) is way lower, and you could start thinking about other risk factors and other things to do.
[26:16] It doesn't require a team of RAs or anything else, and hopefully this could be useful for you going forward.
[26:23] Fantastic.
[26:25] All right.
[26:29] Great.
[26:30] So, I think with that we'll wrap up, Markus, unless you have anything else you wanted to add.
[26:35] I think just to summarize: we ended up taking a little bit more than our usual time, but we basically worked with this LLM to ask a question of EDGAR, which is the Securities and Exchange Commission's website.
[26:54] We downloaded a bunch of basically raw filing data and asked Claude to write parsing scripts for us.
[27:01] It put it into a structured database, and then we were able to ask questions in a pretty quick amount of time.
[27:05] So, next time we're going to think about much bigger data sets and how to structure them, and I'm looking forward to joining them.
[27:13] That is good. Thanks a lot, Paul, and thanks to all of you for following, and be excited about the next mini video coming up soon.
[27:24] Cheers.
[27:25] >> Great. Bye-bye.
[27:26] >> Cheers. Bye-bye.
