youtube-transcript.ai

AI Engineering Speedrun: Complete Course in 15 Minutes (Chip Huyen Book)

Watch with subtitles, summary & AI chat
Add the free Subkun extension — works directly on YouTube.
  • Watch
  • Subtitles
  • Summary
  • Ask AI
Try free →

AI engineering focuses on building applications using pre-trained foundation models, differing from traditional ML by leveraging existing models rather than building from scratch. Key areas include understanding foundation models, prompt engineering, retrieval augmented generation (RAG), agents, and optimizing inference for speed and cost. This field has rapidly grown due to improved AI models and lower barriers to entry, enabling sophisticated applications.

Full Transcript

https://www.youtube.com/watch?v=UktQgawjqis

[00:01] Hey everyone, today we're diving into.
[00:03] Hey everyone, today we're diving into the book AI engineering by Chip Huan.
[00:05] the book AI engineering by Chip Huan.
[00:08] 800 pages of really great content about this in- demand field that's offering.
[00:09] this in- demand field that's offering salaries of $300,000 or more.
[00:13] In this video, I'm summarizing everything from.
[00:14] the book to help you get a highle overview of the field.
[00:16] We'll talk about foundation models, prompt engineering.
[00:18] foundation models, prompt engineering, rag, fine-tuning, agents, how to build a.
[00:21] rag, fine-tuning, agents, how to build a system, improving inference, and more.
[00:25] system, improving inference, and more.
[00:28] I also want to mention this is a super highle overview of a very detailed.
[00:29] also want to mention this is a super highle overview of a very detailed technical book.
[00:31] technical book.
[00:33] Don't expect to learn all the details just from watching this.
[00:35] video.
[00:37] I really recommend using this as a way to get an overview of what the.
[00:39] a way to get an overview of what the field looks like and use it as a jumping.
[00:41] field looks like and use it as a jumping off point for your own research and.
[00:42] off point for your own research and exploration.
[00:45] So what exactly is AI engineering and how is it different from.
[00:47] engineering and how is it different from traditional machine learning?
[00:49] Let AI engineering has exploded recently for.
[00:51] two simple reasons.
[00:53] AI models have gotten dramatically better at solving real problems while the barrier to.
[00:54] gotten dramatically better at solving real problems while the barrier to building with them has gotten much.
[00:56] building with them has gotten much lower.
[00:57] This perfect storm has created.
[00:59] lower.
[00:59] This perfect storm has created one of the fastest growing engineering
[01:01] one of the fastest growing engineering disciplines today.
[01:04] At its core, AI engineering is about building applications on top of foundation models.
[01:09] Those massive AI systems trained by companies like OpenAI or Google.
[01:13] Unlike traditional machine learning engineers who build models from scratch, AI engineers leverage existing ones, focusing less on training and more on adaptation.
[01:22] These foundation models work through a process called selfs supervision.
[01:28] Instead of requiring humans to painstakingly label data, these models can learn by predicting parts of their input data.
[01:34] This breakthrough solved the data labeling bottleneck that held back AI for years.
[01:38] As these models scaled up with more data and computing power, they evolved from simple language models to what we now call large language models or LLMs.
[01:45] And they didn't stop there.
[01:49] They've expanded to handle multiple types of data, including images and video, often becoming large multimodal models.
[01:54] Nowadays, we're seeing foundation models power everything from coding assistants like GitHub copilot to image generation tools, writing aids, customer support
[02:03] tools, writing aids, customer support bots, and sophisticated data analysis
[02:06] bots, and sophisticated data analysis systems.
[02:08] Now that we've covered what AI engineering is, let's dig deeper into
[02:10] engineering is, let's dig deeper into foundation models themselves, how
[02:13] foundation models themselves, how they're trained, how they work, and why
[02:15] they're trained, how they work, and why understanding their architecture matters
[02:17] understanding their architecture matters for AI engineers.
[02:19] for AI engineers. Foundation models at their core can only know what they've
[02:22] their core can only know what they've been trained on.
[02:24] been trained on. This might seem obvious, but it has profound
[02:25] obvious, but it has profound implications.
[02:28] implications. If a model hasn't seen examples of a specific language or
[02:29] examples of a specific language or concept during training, it simply won't
[02:32] concept during training, it simply won't have that knowledge.
[02:33] have that knowledge. Most large foundation models are trained on
[02:35] foundation models are trained on webcrolled data which brings some
[02:36] webcrolled data which brings some inherent problems.
[02:38] inherent problems. This data often contains clickbait, misinformation,
[02:40] contains clickbait, misinformation, toxic content, and fake news.
[02:43] toxic content, and fake news. To combat this, teams use various filtering
[02:45] this, teams use various filtering techniques.
[02:47] techniques. For instance, OpenAI only used Reddit links with at least three
[02:49] used Reddit links with at least three upvotes when training GPD2.
[02:51] upvotes when training GPD2. The language distribution in training data is also
[02:53] distribution in training data is also heavily skewed.
[02:56] heavily skewed. About half of all crawled data is in English, which means
[02:58] crawled data is in English, which means languages with millions of speakers are
[02:59] languages with millions of speakers are often underrepresented.
[03:01] often underrepresented. This is why specialized models for specific
[03:03] specialized models for specific languages and domains are becoming
[03:05] languages and domains are becoming increasingly important.
[03:07] In terms of model architecture, most foundation models use transformer architectures.
[03:11] based on the attention mechanism.
[03:13] But to understand why transformers were such a breakthrough, we need to look at what came before.
[03:17] Transformers were invented to solve the problems of sequence to sequence models, which used recurrent neural networks, RNNs, for tasks like translation.
[03:25] These had two main components.
[03:27] An encoder that processes inputs and a decoder that generates outputs.
[03:31] Both work sequentially token by token.
[03:33] The problem is that the decoder only has access to a compressed representation of the entire input.
[03:42] Imagine trying to answer detailed questions about a book when all you have is a brief summary.
[03:48] Also, input processing and output generation are done sequentially, so it's slow for long sequences.
[03:54] Transformers solved this with the attention mechanism, which allows the model to weigh the importance of different input tokens when generating each output token.
[04:00] It's like being able to reference any page in the book while answering questions.
[04:04] Plus, transformers
[04:06] answering questions.
[04:09] Plus, transformers can process input tokens in parallel, making them much faster.
[04:11] During inference, transformers work in two steps.
[04:13] Prefill, processing all input tokens parallel, and decode, generating one output token at a time.
[04:20] The attention mechanism uses three types of vectors.
[04:22] Query vectors Q, key vectors K, and value vectors V.
[04:30] The model computes how much attention to give each input token by comparing the Q and K vectors.
[04:35] A high similarity score means that the token's content V will heavily influence the output.
[04:40] This is why longer context windows are computationally expensive.
[04:43] More tokens mean more K and V vectors to compute and store.
[04:47] Attention is almost always multi-headed, allowing the model to focus on different groups of tokens simultaneously.
[04:50] In Llama 27B, there are 32 attention heads, for example.
[04:58] A complete transformer consists of multiple transformer blocks, each containing an attention module and a neural network module.
[05:00] Now that we understand models a
[05:08] module. Now that we understand models a little more, let's talk about one of the most crucial yet underappreciated aspects of AI engineering, evaluation.
[05:15] For some applications, figuring out evaluation can consume the majority of your development effort.
[05:21] It's how you mitigate risks, uncover opportunities, and gain visibility into where your system is failing.
[05:27] Evaluating AI systems is significantly harder than traditional ML models because problems are complex and responses are open-ended.
[05:34] Foundation models are black boxes. You can only evaluate them by observing their outputs, not by understanding their internal workings.
[05:42] Publicly available evaluation benchmarks quickly become saturated, meaning the model achieves perfect scores as models improve.
[05:48] So let's start with some fundamental metrics used to evaluate language models during training.
[05:52] Cross entropy and perplexity. These metrics essentially measure how well the model predicts the next token in a sequence.
[06:01] Language models learn the distribution of their training data. The better a model learns this distribution, the better it becomes at predicting what comes next, resulting
[06:09] at predicting what comes next, resulting in lower cross entropy.
[06:11] Perplexity is simply the exponential of cross entropy.
[06:14] simply the exponential of cross entropy.
[06:15] It measures the amount of uncertainty a model has when predicting the next token.
[06:17] model has when predicting the next token.
[06:19] While perplexity is useful for guiding training, it becomes less reliable for models that have undergone significant post- training with SFT or RLHF.
[06:21] guiding training, it becomes less reliable for models that have undergone
[06:23] reliable for models that have undergone significant post- training with SFT or
[06:26] significant post- training with SFT or RLHF.
[06:29] RLHF. For some tasks, we can perform exact evaluation where there's no ambiguity about the correct answer, like multiple choice questions.
[06:31] exact evaluation where there's no ambiguity about the correct answer, like
[06:33] ambiguity about the correct answer, like multiple choice questions.
[06:35] multiple choice questions. In coding tasks, functional correctness translates to execution accuracy.
[06:38] tasks, functional correctness translates to execution accuracy.
[06:40] to execution accuracy. Does the code run and produce the expected output?
[06:43] and produce the expected output? One of the most powerful and common methods for evaluating AI models in production is using another AI model as a judge.
[06:44] the most powerful and common methods for evaluating AI models in production is
[06:47] evaluating AI models in production is using another AI model as a judge.
[06:49] using another AI model as a judge. These AI judges are fast, easy to use, and relatively cheap compared to human evaluators.
[06:52] AI judges are fast, easy to use, and relatively cheap compared to human
[06:53] relatively cheap compared to human evaluators.
[06:55] evaluators. Now that we understand evaluation, let's tackle one of the most crucial decisions in AI engineering, model selection.
[06:57] evaluation, let's tackle one of the most crucial decisions in AI engineering,
[07:00] crucial decisions in AI engineering, model selection.
[07:02] model selection. With the increasing number of readily available foundation models, the challenge isn't developing models, but selecting the right one for your application.
[07:04] number of readily available foundation models, the challenge isn't developing
[07:06] models, the challenge isn't developing models, but selecting the right one for your application.
[07:08] models, but selecting the right one for your application.
[07:11] your application.
[07:11] The selection process typically involves two key steps.
[07:13] typically involves two key steps.
[07:13] Finding the best achievable performance
[07:15] Finding the best achievable performance on the task and mapping models along a
[07:17] on the task and mapping models along a cost performance axis.
[07:20] cost performance axis.
[07:20] Your criteria for evaluating a model can be organized into
[07:22] evaluating a model can be organized into four buckets.
[07:24] four buckets. domain specific
[07:24] domain specific capabilities, general capabilities,
[07:26] capabilities, general capabilities, instruction following capabilities, and
[07:28] instruction following capabilities, and cost latency.
[07:31] cost latency. When evaluating models,
[07:31] When evaluating models, you also need to differentiate between
[07:32] you also need to differentiate between hard attributes, impossible to change,
[07:35] hard attributes, impossible to change, and soft attributes can be improved
[07:37] and soft attributes can be improved through adaptation.
[07:39] through adaptation. A high-level
[07:39] A high-level workflow for model selection looks like
[07:41] workflow for model selection looks like this.
[07:43] this. Filter out models whose hard
[07:43] Filter out models whose hard attributes dawn.
[07:45] attributes dawn. Most companies won't
[07:45] Most companies won't build foundation models from scratch.
[07:48] build foundation models from scratch. So
[07:48] So another question is whether to use
[07:49] another question is whether to use commercial model APIs or host an
[07:51] commercial model APIs or host an open-source model yourself.
[07:54] open-source model yourself. For a model
[07:54] For a model to be accessible to users, a machine
[07:56] to be accessible to users, a machine needs to host and run it.
[08:00] needs to host and run it. The service
[08:00] The service that hosts the model and handles queries
[08:02] that hosts the model and handles queries is often called the inference service.
[08:04] is often called the inference service.
[08:04] Whether to host a model yourself or use
[08:06] Whether to host a model yourself or use a model API depends on several factors.
[08:09] a model API depends on several factors.
[08:09] Data privacy, data lineage, performance,
[08:11] Data privacy, data lineage, performance, and control.
[08:14] Now, let's dive into what and control.
[08:15] Now, let's dive into what might be the most accessible yet surprisingly nuanced aspect of AI engineering, prompt engineering.
[08:17] Prompt engineering.
[08:20] Prompt engineering refers to the process of crafting instructions that guide a model to generate your desired outcome.
[08:24] It's the easiest and most common model adaptation technique because unlike fine-tuning, it doesn't change the model's weights.
[08:32] You're just telling the model what you want it to do.
[08:35] While it's the most accessible entry point to AI engineering, don't be fooled into thinking that it's simplistic.
[08:44] Effective prompt engineering requires the same experimental rigor as any machine learning task.
[08:47] Prompts typically consist of one or more of these components.
[08:51] Task description, examples, and the concrete task.
[08:55] How much prompt engineering you need depends on the model's robustness to prompt perturbation.
[09:00] It's also worth noting that different models have different preferred prompt structures.
[09:03] Teaching models what to do via prompts is known as in context learning.
[09:08] Each example in your prompt is called a shot.
[09:11] So we get
[09:13] Your prompt is called a shot.
[09:17] So we get terms like fshot, zero shot, and oneshot learning.
[09:19] Many modern models distinguish between system prompts, task description, role, and user prompts, specific query.
[09:26] Key strategies for effective prompt engineering include write clear and explicit instructions.
[09:31] Ask the model to adopt a persona, provide examples, specify the output format, break complex tasks into simpler subtasks, and give the model time to think using chain of thoughtpromoting.
[09:42] Iterate systematically.
[09:45] This is so important.
[09:46] Different techniques work better for different models.
[09:48] So, experimentation is crucial.
[09:50] Now that we've covered prompt engineering, let's explore how to give foundation models access to information beyond what they were trained on.
[09:58] Two dominant patterns have emerged for providing models with the information they need.
[10:02] Retrieval augmented generation ra and the agentic pattern.
[10:06] Rag allows models to retrieve relevant information from external data sources while the agentic pattern enables models to use tools like web
[10:15] enables models to use tools like web search and APIs to gather information.
[10:18] search and APIs to gather information actively.
[10:20] actively. Retrieval augmented generation enhances a model's generation.
[10:22] enhances a model's generation capabilities by retrieving relevant.
[10:24] capabilities by retrieving relevant information from external memory sources.
[10:26] information from external memory sources. A rag system consists of two.
[10:28] sources. A rag system consists of two main components. A retriever fetches.
[10:31] main components. A retriever fetches information and a generator produces a.
[10:34] information and a generator produces a response. The success of a rag system.
[10:36] response. The success of a rag system heavily depends on its retriever. A.
[10:39] heavily depends on its retriever. A retriever performs two main functions.
[10:41] retriever performs two main functions, indexing and querying. How you index.
[10:44] indexing and querying. How you index your data determines how you retrieve it.
[10:46] your data determines how you retrieve it later.
[10:47] later. Typically you split documents into.
[10:49] Typically you split documents into smaller chunks. Retrieval algorithms.
[10:51] smaller chunks. Retrieval algorithms include term based retrieval ideas.
[10:55] include term based retrieval ideas retrieval, semantic similarity using.
[10:57] retrieval, semantic similarity using vector databases. A production retrieval.
[11:00] vector databases. A production retrieval system typically combines several.
[11:02] system typically combines several approaches. Tactics to improve retrieval.
[11:04] approaches. Tactics to improve retrieval include chunking, reranking, query.
[11:07] include chunking, reranking, query rewriting, and contextual retrieval.
[11:09] rewriting, and contextual retrieval. It's also important to note that rag.
[11:11] It's also important to note that rag isn't limited to text. It can also be.
[11:14] isn't limited to text. It can also be used with multimodal and tabular data.
[11:16] used with multimodal and tabular data.
[11:18] The agentic pattern is a more active approach to extending AI capabilities.
[11:21] approach to extending AI capabilities.
[11:23] At its broadest definition, an agent is anything that can observe its environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:26] anything that can observe its environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:28] environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:30] those observations, take actions that affect the environment, and learn from the outcomes.
[11:32] affect the environment, and learn from the outcomes.
[11:35] the outcomes. What makes agents powerful is the set of tools they have access to.
[11:38] is the set of tools they have access to.
[11:40] Chat GPT, for example, is an agent that can search the web, execute Python code, and generate images.
[11:43] can search the web, execute Python code, and generate images.
[11:45] and generate images. Complex tasks require planning.
[11:47] require planning. There are many possible ways to decompose a task, and not all will be successful or efficient.
[11:49] possible ways to decompose a task, and not all will be successful or efficient.
[11:52] not all will be successful or efficient. Agents can fail in various ways, so robust evaluation is important.
[11:54] Agents can fail in various ways, so robust evaluation is important.
[11:57] robust evaluation is important. Failures can include planning failures and tool failures.
[11:59] can include planning failures and tool failures.
[12:01] failures. One key challenge for agents is memory.
[12:03] is memory. A memory system allows a model to retain and utilize information across interactions.
[12:05] model to retain and utilize information across interactions.
[12:08] across interactions. By combining rag for information access, tools for capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:10] for information access, tools for capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:12] capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:15] complex tasks, and memory systems for continuity, agents can tackle
[12:17] continuity, agents can tackle increasingly sophisticated problems.
[12:19] increasingly sophisticated problems.
[12:21] Now, let's dive into one of the most practical aspects of AI engineering, inference optimization.
[12:23] practical aspects of AI engineering, inference optimization.
[12:26] A model's real world usefulness boils down to two factors: cost and speed.
[12:27] world usefulness boils down to two factors: cost and speed.
[12:31] To optimize inference, we need to understand bottlenecks.
[12:32] inference, we need to understand bottlenecks.
[12:35] AI workloads generally face two types.
[12:39] Computebound limiting factor is power or memory bandwidth bound.
[12:42] is power or memory bandwidth bound.
[12:44] Limiting factor is data movement.
[12:46] Inference APIs typically come in two types.
[12:49] types. Online APIs optimized for latency and batch APIs optimized for cost.
[12:53] and batch APIs optimized for cost. Key inference performance metrics include latency, time to first token, time per output token, and throughput.
[12:55] inference performance metrics include latency, time to first token, time per output token, and throughput.
[12:58] output token, and throughput. Model compression reduces size to improve speed through quantization, pruning and distillation.
[13:01] compression reduces size to improve speed through quantization, pruning and distillation.
[13:03] speed through quantization, pruning and distillation.
[13:05] To overcome the sequential bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:08] distillation. To overcome the sequential bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:10] bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:12] we can use speculative decoding, inference with reference or parallel decoding.
[13:14] inference with reference or parallel decoding. Finally, service level
[13:17] Finally, service level optimization like batching, static, optimization like batching, static, dynamic, continuous, and caching can significantly improve performance.
[13:22] The optimal strategy depends on your needs.
[13:27] For low latency, replica parallelism is often best.
[13:29] For most use cases, quantization yields the biggest gains.
[13:34] And that wraps up our journey through AI engineering.
[13:36] We've covered foundation models, evaluation, prompt engineering, rag, agents, fine-tuning, data set engineering, and optimization.
[13:44] This was a super high-level overview of a detailed book.
[13:49] I highly recommend checking out AI engineering by Chip Huan for the full depth.
[13:54] I had a great time putting this together.
[13:56] Let me know in the comments which book you want me to summarize next.
[14:00] Don't forget to subscribe.
[14:01] Thanks for watching and see you next time.

Cite this page

If you're using ChatGPT, Claude, Gemini, or another AI assistant, paste this URL into the chat:

https://youtube-transcript.ai/docs/ai-engineering-speedrun-complete-course-in-15-minutes-chip-h-yfijffyeu3

The full transcript and summary on this page will be retrieved as context, so the assistant can answer questions about the video accurately.