Large Datasets: Claude Code for Economists with Paul Goldsmith-Pinkham | Markus Academy | Ep. 162-4


Who this is for: economists and data analysts who work with large administrative datasets and want to improve their data processing workflows.

TL;DR

This video demonstrates how to use Claude Code to process large datasets, specifically the HMDA data, for economic analysis. It covers downloading, harmonizing, and storing data efficiently using tools like Parquet and DuckDB, enabling better data management and replication for economists working with big data.

Key Takeaways

Applied economists often work with datasets too large to load into memory in standard tools like Stata or Python, which calls for specialized approaches.

Claude Code offers a way to integrate better data storage and working practices, improving efficiency, replication, and enabling new types of analysis.

The HMDA dataset, containing public mortgage origination and denial data, is a prime example of a large administrative dataset (70GB+) that requires careful handling.

Using Parquet files for data storage is more efficient than CSVs because they are column-oriented, allowing for better compression and faster querying.

DuckDB is a fast analytical database that can query Parquet files directly with SQL, without loading the full dataset into memory.

Incorporating metadata into datasets, such as descriptive labels for variables and values, significantly enhances understanding and facilitates easier interaction with LLMs for data analysis.

When downloading large files, implementing resume capabilities and including a user agent header are crucial for robustness and to avoid being blocked by servers.

Harmonizing data across different versions or sources, like the HMDA data before and after 2018, is essential for creating consistent analytical datasets.
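The harmonization step can be sketched in a few lines of Python. This is a minimal illustration, not the script from the video: the column names in the two mappings and the assumption that pre-2018 loan amounts are recorded in thousands of dollars are placeholders that should be checked against the actual HMDA file layouts, and `harmonize_row` is a hypothetical helper.

```python
# Hypothetical mappings from era-specific column names to harmonized names.
# Verify both against the real HMDA documentation before relying on them.
PRE_2018_COLUMNS = {
    "as_of_year": "year",
    "respondent_id": "lender_id",
    "loan_amount_000s": "loan_amount",
}
POST_2018_COLUMNS = {
    "activity_year": "year",
    "lei": "lender_id",
    "loan_amount": "loan_amount",
}


def harmonize_row(row: dict, year: int) -> dict:
    """Rename columns to a common schema and put loan amounts in dollars."""
    mapping = PRE_2018_COLUMNS if year < 2018 else POST_2018_COLUMNS
    out = {new_name: row[old_name] for old_name, new_name in mapping.items()}
    out["year"] = int(out["year"])
    amount = float(out["loan_amount"])
    if year < 2018:
        # Assumed unit change: older files report thousands of dollars.
        amount *= 1000
    out["loan_amount"] = amount
    return out
```

Keeping the mappings as plain dictionaries makes the harmonization rules easy to review and to extend when another variable changes name or units between file versions.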

In This Video

00:08 Introduction to Claude Code for Economists

Welcome back to the mini-series on Claude Code for applied economists. Today, we're taking off the training wheels.

00:42 Working with Large Datasets in Claude Code

Today, we'll focus on handling big data in a structured way using Claude Code, especially with large administrative datasets.

01:13 The HMDA Dataset: Size and Scope

We'll use the HMDA dataset, which contains mortgage origination data. It's publicly available and around 70 gigabytes.

01:44 Improving Data Practices with Claude Code

Claude Code helps integrate better data storage and working practices, improving replication and opening up new research avenues.

02:05 Setting Up the Command Line Environment

We'll start by going to the command line to set up the project and describe the goals for downloading and harmonizing data.

02:59 Downloading and Harmonizing HMDA Data

The goal is to download and harmonize the HMDA data from 2007-2024 to study fintech mortgage lenders' geographic expansion.

05:31 Introducing DuckDB for Data Management

We'll set up a DuckDB database, a relational database well suited to managing relationships between tables and storing metadata efficiently.

Questions & Answers

How do you download and harmonize large datasets like HMDA using Claude Code?

The video demonstrates downloading and harmonizing the HMDA dataset using Claude Code. It involves writing a script to pull data, convert it to Parquet files for efficiency, and use DuckDB for data storage and querying, especially for large datasets that don't fit into memory.
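The download step, with the resume capability and user agent header mentioned in the takeaways, might look roughly like the sketch below. It uses only the Python standard library. The `USER_AGENT` string and the helper names `build_request` and `download` are placeholders introduced here, and the sketch assumes the server supports HTTP range requests.

```python
import os
import urllib.request

# Placeholder identity string; replace with your own project and contact details.
USER_AGENT = "hmda-research-downloader/0.1 (contact: you@example.edu)"


def build_request(url: str, dest: str) -> urllib.request.Request:
    """Build a GET request that resumes from whatever is already saved at dest."""
    headers = {"User-Agent": USER_AGENT}
    already = os.path.getsize(dest) if os.path.exists(dest) else 0
    if already > 0:
        # Ask the server for only the bytes we do not have yet.
        headers["Range"] = f"bytes={already}-"
    return urllib.request.Request(url, headers=headers)


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Stream url to dest in chunks, appending if the server honors Range."""
    with urllib.request.urlopen(build_request(url, dest)) as resp:
        # 206 Partial Content means the resume worked; any other success
        # status means the server sent the whole file, so start over.
        mode = "ab" if resp.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                block = resp.read(chunk_size)
                if not block:
                    break
                out.write(block)
```

Calling `download(url, dest)` again after an interrupted run picks up where the partial file ends. One case the sketch does not handle: if the file is already complete, many servers answer a range request with status 416, which `urlopen` raises as an `HTTPError`.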

What is HMDA data and why is it used?

HMDA stands for the Home Mortgage Disclosure Act. It's a public dataset containing information on mortgage origination and denial in the US, originally constructed to address potential racial discrimination in mortgage lending.

Why use Parquet files instead of CSV for large datasets?

Parquet files are more efficient for storing large datasets than CSVs because they are column-oriented, allowing for better compression and faster data retrieval. DuckDB can efficiently query data stored in Parquet format.

What is DuckDB and how is it useful for economists?

DuckDB is a program that sits on top of Parquet files, enabling efficient data querying. It's useful for economists working with large datasets as it allows for relational database-like operations and stores metadata, making data more understandable and usable by LLMs.

How does Claude Code help with large data and replication?

Claude Code helps economists integrate better practices for storing and working with large datasets. This improves data management, enhances replication of results, and allows for more efficient analysis of big data that might not fit into memory.

Key Terms

HMDA — Home Mortgage Disclosure Act, a public dataset on US mortgage origination and denial used to track lending practices.

Parquet — An efficient, column-oriented data file format that offers better compression and performance compared to CSV for large datasets.

DuckDB — A fast, in-process analytical data management system that works well with Parquet files and SQL queries.

Metadata — Additional information about data, such as descriptions, data types, and valid values, which helps in understanding and using the data.
