Large Datasets: Claude Code for Economists with Paul Goldsmith-Pinkham | Markus Academy | Ep. 162-4

For economists and data analysts who work with large administrative datasets and want to improve their data processing workflows.

TL;DR

This video demonstrates how to use Claude Code to process large datasets, specifically the HMDA data, for economic analysis. It covers downloading, harmonizing, and storing data efficiently using tools like Parquet and DuckDB, enabling better data management and replication for economists working with big data.

In This Video

  1. 00:08 Introduction to Claude Code for Economists

    Welcome back to the mini-series on Claude Code for applied economists. Today, we're taking off the training wheels.

  2. 00:42 Working with Large Datasets in Claude Code

    Today, we'll focus on handling big data in a structured way using Claude Code, especially with large administrative datasets.

  3. 01:13 The HMDA Dataset: Size and Scope

    We'll use the HMDA dataset, which contains mortgage origination data. It's publicly available and around 70 gigabytes.

  4. 01:44 Improving Data Practices with Claude Code

    Claude Code helps integrate better data storage and working practices, improving replication and opening up new research avenues.

  5. 02:05 Setting Up the Command Line Environment

    We'll start by going to the command line to set up the project and describe the goals for downloading and harmonizing data.

  6. 02:59 Downloading and Harmonizing HMDA Data

    The goal is to download and harmonize the HMDA data from 2007-2024 to study fintech mortgage lenders' geographic expansion.

  7. 05:31 Introducing DuckDB for Data Management

    We'll set up a DuckDB database, a relational database good for managing data relationships and storing metadata efficiently.

Questions & Answers

How do you download and harmonize large datasets like HMDA using Claude Code?
The video demonstrates downloading and harmonizing the HMDA dataset using Claude Code. The workflow involves writing a script to pull the data, converting it to Parquet files for efficiency, and using DuckDB for storage and querying, which matters most for datasets that don't fit into memory.
What is HMDA data and why is it used?
HMDA stands for the Home Mortgage Disclosure Act. It's a public dataset containing information on mortgage originations and denials in the US, originally constructed to address potential racial discrimination in mortgage lending.
Why use Parquet files instead of CSV for large datasets?
Parquet files are more efficient than CSVs for storing large datasets because they are column-oriented, which allows better compression and faster retrieval when a query needs only some of the columns. DuckDB can efficiently query data stored in Parquet format.
What is DuckDB and how is it useful for economists?
DuckDB is an in-process SQL database engine that can sit on top of Parquet files and query them efficiently. It's useful for economists working with large datasets because it supports relational database operations and stores metadata, making the data more understandable and usable by LLMs.
How does Claude Code help with large data and replication?
Claude Code helps economists integrate better practices for storing and working with large datasets. This improves data management, enhances replication of results, and allows for more efficient analysis of big data that might not fit into memory.

Source

YouTube video. Original: https://www.youtube.com/watch?v=4uwI1-9DafU
Transcript captured and processed by youtube-transcript.ai on 2026-06-25.