# AI Engineering Speedrun: Complete Course in 15 Minutes (Chip Huyen Book)

https://www.youtube.com/watch?v=UktQgawjqis

[00:01] Hey everyone, today we're diving into.
[00:03] Hey everyone, today we're diving into the book AI engineering by Chip Huan.
[00:05] the book AI engineering by Chip Huan.
[00:08] 800 pages of really great content about this in- demand field that's offering.
[00:09] this in- demand field that's offering salaries of $300,000 or more.
[00:13] In this video, I'm summarizing everything from.
[00:14] the book to help you get a highle overview of the field.
[00:16] We'll talk about foundation models, prompt engineering.
[00:18] foundation models, prompt engineering, rag, fine-tuning, agents, how to build a.
[00:21] rag, fine-tuning, agents, how to build a system, improving inference, and more.
[00:25] system, improving inference, and more.
[00:28] I also want to mention this is a super highle overview of a very detailed.
[00:29] also want to mention this is a super highle overview of a very detailed technical book.
[00:31] technical book.
[00:33] Don't expect to learn all the details just from watching this.
[00:35] video.
[00:37] I really recommend using this as a way to get an overview of what the.
[00:39] a way to get an overview of what the field looks like and use it as a jumping.
[00:41] field looks like and use it as a jumping off point for your own research and.
[00:42] off point for your own research and exploration.
[00:45] So what exactly is AI engineering and how is it different from.
[00:47] engineering and how is it different from traditional machine learning?
[00:49] Let AI engineering has exploded recently for.
[00:51] two simple reasons.
[00:53] AI models have gotten dramatically better at solving real problems while the barrier to.
[00:54] gotten dramatically better at solving real problems while the barrier to building with them has gotten much.
[00:56] building with them has gotten much lower.
[00:57] This perfect storm has created.
[00:59] lower.
[00:59] This perfect storm has created one of the fastest growing engineering
[01:01] one of the fastest growing engineering disciplines today.
[01:04] At its core, AI engineering is about building applications on top of foundation models.
[01:09] Those massive AI systems trained by companies like OpenAI or Google.
[01:13] Unlike traditional machine learning engineers who build models from scratch, AI engineers leverage existing ones, focusing less on training and more on adaptation.
[01:22] These foundation models work through a process called selfs supervision.
[01:28] Instead of requiring humans to painstakingly label data, these models can learn by predicting parts of their input data.
[01:34] This breakthrough solved the data labeling bottleneck that held back AI for years.
[01:38] As these models scaled up with more data and computing power, they evolved from simple language models to what we now call large language models or LLMs.
[01:45] And they didn't stop there.
[01:49] They've expanded to handle multiple types of data, including images and video, often becoming large multimodal models.
[01:54] Nowadays, we're seeing foundation models power everything from coding assistants like GitHub copilot to image generation tools, writing aids, customer support
[02:03] tools, writing aids, customer support bots, and sophisticated data analysis
[02:06] bots, and sophisticated data analysis systems.
[02:08] Now that we've covered what AI engineering is, let's dig deeper into
[02:10] engineering is, let's dig deeper into foundation models themselves, how
[02:13] foundation models themselves, how they're trained, how they work, and why
[02:15] they're trained, how they work, and why understanding their architecture matters
[02:17] understanding their architecture matters for AI engineers.
[02:19] for AI engineers. Foundation models at their core can only know what they've
[02:22] their core can only know what they've been trained on.
[02:24] been trained on. This might seem obvious, but it has profound
[02:25] obvious, but it has profound implications.
[02:28] implications. If a model hasn't seen examples of a specific language or
[02:29] examples of a specific language or concept during training, it simply won't
[02:32] concept during training, it simply won't have that knowledge.
[02:33] have that knowledge. Most large foundation models are trained on
[02:35] foundation models are trained on webcrolled data which brings some
[02:36] webcrolled data which brings some inherent problems.
[02:38] inherent problems. This data often contains clickbait, misinformation,
[02:40] contains clickbait, misinformation, toxic content, and fake news.
[02:43] toxic content, and fake news. To combat this, teams use various filtering
[02:45] this, teams use various filtering techniques.
[02:47] techniques. For instance, OpenAI only used Reddit links with at least three
[02:49] used Reddit links with at least three upvotes when training GPD2.
[02:51] upvotes when training GPD2. The language distribution in training data is also
[02:53] distribution in training data is also heavily skewed.
[02:56] heavily skewed. About half of all crawled data is in English, which means
[02:58] crawled data is in English, which means languages with millions of speakers are
[02:59] languages with millions of speakers are often underrepresented.
[03:01] often underrepresented. This is why specialized models for specific
[03:03] specialized models for specific languages and domains are becoming
[03:05] languages and domains are becoming increasingly important.
[03:07] In terms of model architecture, most foundation models use transformer architectures.
[03:11] based on the attention mechanism.
[03:13] But to understand why transformers were such a breakthrough, we need to look at what came before.
[03:17] Transformers were invented to solve the problems of sequence to sequence models, which used recurrent neural networks, RNNs, for tasks like translation.
[03:25] These had two main components.
[03:27] An encoder that processes inputs and a decoder that generates outputs.
[03:31] Both work sequentially token by token.
[03:33] The problem is that the decoder only has access to a compressed representation of the entire input.
[03:42] Imagine trying to answer detailed questions about a book when all you have is a brief summary.
[03:48] Also, input processing and output generation are done sequentially, so it's slow for long sequences.
[03:54] Transformers solved this with the attention mechanism, which allows the model to weigh the importance of different input tokens when generating each output token.
[04:00] It's like being able to reference any page in the book while answering questions.
[04:04] Plus, transformers
[04:06] answering questions.
[04:09] Plus, transformers can process input tokens in parallel, making them much faster.
[04:11] During inference, transformers work in two steps.
[04:13] Prefill, processing all input tokens parallel, and decode, generating one output token at a time.
[04:20] The attention mechanism uses three types of vectors.
[04:22] Query vectors Q, key vectors K, and value vectors V.
[04:30] The model computes how much attention to give each input token by comparing the Q and K vectors.
[04:35] A high similarity score means that the token's content V will heavily influence the output.
[04:40] This is why longer context windows are computationally expensive.
[04:43] More tokens mean more K and V vectors to compute and store.
[04:47] Attention is almost always multi-headed, allowing the model to focus on different groups of tokens simultaneously.
[04:50] In Llama 27B, there are 32 attention heads, for example.
[04:58] A complete transformer consists of multiple transformer blocks, each containing an attention module and a neural network module.
[05:00] Now that we understand models a
[05:08] module. Now that we understand models a little more, let's talk about one of the most crucial yet underappreciated aspects of AI engineering, evaluation.
[05:15] For some applications, figuring out evaluation can consume the majority of your development effort.
[05:21] It's how you mitigate risks, uncover opportunities, and gain visibility into where your system is failing.
[05:27] Evaluating AI systems is significantly harder than traditional ML models because problems are complex and responses are open-ended.
[05:34] Foundation models are black boxes. You can only evaluate them by observing their outputs, not by understanding their internal workings.
[05:42] Publicly available evaluation benchmarks quickly become saturated, meaning the model achieves perfect scores as models improve.
[05:48] So let's start with some fundamental metrics used to evaluate language models during training.
[05:52] Cross entropy and perplexity. These metrics essentially measure how well the model predicts the next token in a sequence.
[06:01] Language models learn the distribution of their training data. The better a model learns this distribution, the better it becomes at predicting what comes next, resulting
[06:09] at predicting what comes next, resulting in lower cross entropy.
[06:11] Perplexity is simply the exponential of cross entropy.
[06:14] simply the exponential of cross entropy.
[06:15] It measures the amount of uncertainty a model has when predicting the next token.
[06:17] model has when predicting the next token.
[06:19] While perplexity is useful for guiding training, it becomes less reliable for models that have undergone significant post- training with SFT or RLHF.
[06:21] guiding training, it becomes less reliable for models that have undergone
[06:23] reliable for models that have undergone significant post- training with SFT or
[06:26] significant post- training with SFT or RLHF.
[06:29] RLHF. For some tasks, we can perform exact evaluation where there's no ambiguity about the correct answer, like multiple choice questions.
[06:31] exact evaluation where there's no ambiguity about the correct answer, like
[06:33] ambiguity about the correct answer, like multiple choice questions.
[06:35] multiple choice questions. In coding tasks, functional correctness translates to execution accuracy.
[06:38] tasks, functional correctness translates to execution accuracy.
[06:40] to execution accuracy. Does the code run and produce the expected output?
[06:43] and produce the expected output? One of the most powerful and common methods for evaluating AI models in production is using another AI model as a judge.
[06:44] the most powerful and common methods for evaluating AI models in production is
[06:47] evaluating AI models in production is using another AI model as a judge.
[06:49] using another AI model as a judge. These AI judges are fast, easy to use, and relatively cheap compared to human evaluators.
[06:52] AI judges are fast, easy to use, and relatively cheap compared to human
[06:53] relatively cheap compared to human evaluators.
[06:55] evaluators. Now that we understand evaluation, let's tackle one of the most crucial decisions in AI engineering, model selection.
[06:57] evaluation, let's tackle one of the most crucial decisions in AI engineering,
[07:00] crucial decisions in AI engineering, model selection.
[07:02] model selection. With the increasing number of readily available foundation models, the challenge isn't developing models, but selecting the right one for your application.
[07:04] number of readily available foundation models, the challenge isn't developing
[07:06] models, the challenge isn't developing models, but selecting the right one for your application.
[07:08] models, but selecting the right one for your application.
[07:11] your application.
[07:11] The selection process typically involves two key steps.
[07:13] typically involves two key steps.
[07:13] Finding the best achievable performance
[07:15] Finding the best achievable performance on the task and mapping models along a
[07:17] on the task and mapping models along a cost performance axis.
[07:20] cost performance axis.
[07:20] Your criteria for evaluating a model can be organized into
[07:22] evaluating a model can be organized into four buckets.
[07:24] four buckets. domain specific
[07:24] domain specific capabilities, general capabilities,
[07:26] capabilities, general capabilities, instruction following capabilities, and
[07:28] instruction following capabilities, and cost latency.
[07:31] cost latency. When evaluating models,
[07:31] When evaluating models, you also need to differentiate between
[07:32] you also need to differentiate between hard attributes, impossible to change,
[07:35] hard attributes, impossible to change, and soft attributes can be improved
[07:37] and soft attributes can be improved through adaptation.
[07:39] through adaptation. A high-level
[07:39] A high-level workflow for model selection looks like
[07:41] workflow for model selection looks like this.
[07:43] this. Filter out models whose hard
[07:43] Filter out models whose hard attributes dawn.
[07:45] attributes dawn. Most companies won't
[07:45] Most companies won't build foundation models from scratch.
[07:48] build foundation models from scratch. So
[07:48] So another question is whether to use
[07:49] another question is whether to use commercial model APIs or host an
[07:51] commercial model APIs or host an open-source model yourself.
[07:54] open-source model yourself. For a model
[07:54] For a model to be accessible to users, a machine
[07:56] to be accessible to users, a machine needs to host and run it.
[08:00] needs to host and run it. The service
[08:00] The service that hosts the model and handles queries
[08:02] that hosts the model and handles queries is often called the inference service.
[08:04] is often called the inference service.
[08:04] Whether to host a model yourself or use
[08:06] Whether to host a model yourself or use a model API depends on several factors.
[08:09] a model API depends on several factors.
[08:09] Data privacy, data lineage, performance,
[08:11] Data privacy, data lineage, performance, and control.
[08:14] Now, let's dive into what and control.
[08:15] Now, let's dive into what might be the most accessible yet surprisingly nuanced aspect of AI engineering, prompt engineering.
[08:17] Prompt engineering.
[08:20] Prompt engineering refers to the process of crafting instructions that guide a model to generate your desired outcome.
[08:24] It's the easiest and most common model adaptation technique because unlike fine-tuning, it doesn't change the model's weights.
[08:32] You're just telling the model what you want it to do.
[08:35] While it's the most accessible entry point to AI engineering, don't be fooled into thinking that it's simplistic.
[08:44] Effective prompt engineering requires the same experimental rigor as any machine learning task.
[08:47] Prompts typically consist of one or more of these components.
[08:51] Task description, examples, and the concrete task.
[08:55] How much prompt engineering you need depends on the model's robustness to prompt perturbation.
[09:00] It's also worth noting that different models have different preferred prompt structures.
[09:03] Teaching models what to do via prompts is known as in context learning.
[09:08] Each example in your prompt is called a shot.
[09:11] So we get
[09:13] Your prompt is called a shot.
[09:17] So we get terms like fshot, zero shot, and oneshot learning.
[09:19] Many modern models distinguish between system prompts, task description, role, and user prompts, specific query.
[09:26] Key strategies for effective prompt engineering include write clear and explicit instructions.
[09:31] Ask the model to adopt a persona, provide examples, specify the output format, break complex tasks into simpler subtasks, and give the model time to think using chain of thoughtpromoting.
[09:42] Iterate systematically.
[09:45] This is so important.
[09:46] Different techniques work better for different models.
[09:48] So, experimentation is crucial.
[09:50] Now that we've covered prompt engineering, let's explore how to give foundation models access to information beyond what they were trained on.
[09:58] Two dominant patterns have emerged for providing models with the information they need.
[10:02] Retrieval augmented generation ra and the agentic pattern.
[10:06] Rag allows models to retrieve relevant information from external data sources while the agentic pattern enables models to use tools like web
[10:15] enables models to use tools like web search and APIs to gather information.
[10:18] search and APIs to gather information actively.
[10:20] actively. Retrieval augmented generation enhances a model's generation.
[10:22] enhances a model's generation capabilities by retrieving relevant.
[10:24] capabilities by retrieving relevant information from external memory sources.
[10:26] information from external memory sources. A rag system consists of two.
[10:28] sources. A rag system consists of two main components. A retriever fetches.
[10:31] main components. A retriever fetches information and a generator produces a.
[10:34] information and a generator produces a response. The success of a rag system.
[10:36] response. The success of a rag system heavily depends on its retriever. A.
[10:39] heavily depends on its retriever. A retriever performs two main functions.
[10:41] retriever performs two main functions, indexing and querying. How you index.
[10:44] indexing and querying. How you index your data determines how you retrieve it.
[10:46] your data determines how you retrieve it later.
[10:47] later. Typically you split documents into.
[10:49] Typically you split documents into smaller chunks. Retrieval algorithms.
[10:51] smaller chunks. Retrieval algorithms include term based retrieval ideas.
[10:55] include term based retrieval ideas retrieval, semantic similarity using.
[10:57] retrieval, semantic similarity using vector databases. A production retrieval.
[11:00] vector databases. A production retrieval system typically combines several.
[11:02] system typically combines several approaches. Tactics to improve retrieval.
[11:04] approaches. Tactics to improve retrieval include chunking, reranking, query.
[11:07] include chunking, reranking, query rewriting, and contextual retrieval.
[11:09] rewriting, and contextual retrieval. It's also important to note that rag.
[11:11] It's also important to note that rag isn't limited to text. It can also be.
[11:14] isn't limited to text. It can also be used with multimodal and tabular data.
[11:16] used with multimodal and tabular data.
[11:18] The agentic pattern is a more active approach to extending AI capabilities.
[11:21] approach to extending AI capabilities.
[11:23] At its broadest definition, an agent is anything that can observe its environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:26] anything that can observe its environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:28] environment, make decisions based on those observations, take actions that affect the environment, and learn from the outcomes.
[11:30] those observations, take actions that affect the environment, and learn from the outcomes.
[11:32] affect the environment, and learn from the outcomes.
[11:35] the outcomes. What makes agents powerful is the set of tools they have access to.
[11:38] is the set of tools they have access to.
[11:40] Chat GPT, for example, is an agent that can search the web, execute Python code, and generate images.
[11:43] can search the web, execute Python code, and generate images.
[11:45] and generate images. Complex tasks require planning.
[11:47] require planning. There are many possible ways to decompose a task, and not all will be successful or efficient.
[11:49] possible ways to decompose a task, and not all will be successful or efficient.
[11:52] not all will be successful or efficient. Agents can fail in various ways, so robust evaluation is important.
[11:54] Agents can fail in various ways, so robust evaluation is important.
[11:57] robust evaluation is important. Failures can include planning failures and tool failures.
[11:59] can include planning failures and tool failures.
[12:01] failures. One key challenge for agents is memory.
[12:03] is memory. A memory system allows a model to retain and utilize information across interactions.
[12:05] model to retain and utilize information across interactions.
[12:08] across interactions. By combining rag for information access, tools for capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:10] for information access, tools for capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:12] capability extension, planning for complex tasks, and memory systems for continuity, agents can tackle
[12:15] complex tasks, and memory systems for continuity, agents can tackle
[12:17] continuity, agents can tackle increasingly sophisticated problems.
[12:19] increasingly sophisticated problems.
[12:21] Now, let's dive into one of the most practical aspects of AI engineering, inference optimization.
[12:23] practical aspects of AI engineering, inference optimization.
[12:26] A model's real world usefulness boils down to two factors: cost and speed.
[12:27] world usefulness boils down to two factors: cost and speed.
[12:31] To optimize inference, we need to understand bottlenecks.
[12:32] inference, we need to understand bottlenecks.
[12:35] AI workloads generally face two types.
[12:39] Computebound limiting factor is power or memory bandwidth bound.
[12:42] is power or memory bandwidth bound.
[12:44] Limiting factor is data movement.
[12:46] Inference APIs typically come in two types.
[12:49] types. Online APIs optimized for latency and batch APIs optimized for cost.
[12:53] and batch APIs optimized for cost. Key inference performance metrics include latency, time to first token, time per output token, and throughput.
[12:55] inference performance metrics include latency, time to first token, time per output token, and throughput.
[12:58] output token, and throughput. Model compression reduces size to improve speed through quantization, pruning and distillation.
[13:01] compression reduces size to improve speed through quantization, pruning and distillation.
[13:03] speed through quantization, pruning and distillation.
[13:05] To overcome the sequential bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:08] distillation. To overcome the sequential bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:10] bottleneck of auto regressive models, we can use speculative decoding, inference with reference or parallel decoding.
[13:12] we can use speculative decoding, inference with reference or parallel decoding.
[13:14] inference with reference or parallel decoding. Finally, service level
[13:17] Finally, service level optimization like batching, static, optimization like batching, static, dynamic, continuous, and caching can significantly improve performance.
[13:22] The optimal strategy depends on your needs.
[13:27] For low latency, replica parallelism is often best.
[13:29] For most use cases, quantization yields the biggest gains.
[13:34] And that wraps up our journey through AI engineering.
[13:36] We've covered foundation models, evaluation, prompt engineering, rag, agents, fine-tuning, data set engineering, and optimization.
[13:44] This was a super high-level overview of a detailed book.
[13:49] I highly recommend checking out AI engineering by Chip Huan for the full depth.
[13:54] I had a great time putting this together.
[13:56] Let me know in the comments which book you want me to summarize next.
[14:00] Don't forget to subscribe.
[14:01] Thanks for watching and see you next time.