# Artificial Intelligence Full Course (2025) | AI Course For Beginners FREE | Intellipaat

https://www.youtube.com/watch?v=9tbaiFIm0HU

[00:00] Hello and welcome to this full course on artificial intelligence by Intellipaat.
[00:05] In this course, we'll take you through everything you need to know in order to get started with AI from the basics all the way to real world projects.
[00:12] But first, what exactly is AI?
[00:15] In simple words, AI is when machines are trained to think, learn, and solve problems just like humans.
[00:19] You see AI everywhere.
[00:22] When your phone suggests what to type next, when Google Maps finds the fastest route, or when Netflix recommends what to watch.
[00:28] Behind all of this is data, logic, and learning.
[00:30] And that's what AI is all about.
[00:33] Now, here's the exciting part.
[00:35] You can learn how to build these systems, too.
[00:37] In this course, we will take you through the very basics of artificial intelligence such as ANN.
[00:40] Then we'll explain deep learning models like CNN, RNN, LSTM, transformers, autoencoders, RA, etc.
[00:47] By the end of this course, you'll be equipped with the frameworks and the right conceptual knowledge to start building your own AI project.
[00:54] And yes, we'll keep it simple even if you've never written a single line of code before.
[00:58] So without wasting any time, let's quickly dive in.
[01:01] What is explainable AI?
[01:03] Before I answer this question, let us go back and understand why exactly it exists.
[01:09] First, let's say I have gone ahead and built a model which helps recognize if the patient has a disease based on their X-ray records and MRIs.
[01:19] I have done this by using a deep learning model to recognize the pattern out of the images.
[01:24] Now I gave this particular model to the doctors.
[01:27] They went ahead and put out some patient samples through the model and the model provided almost the right decisions.
[01:34] But this fact scared the heck out of doctors.
[01:36] They asked how exactly is this model coming to these particular deductions?
[01:41] What things or computations are happening behind the scenes to reach these particular decisions?
[01:46] and how can I trust the results generated by a machine to treat my patients?
[01:54] Basically, doctors, or you viewers, don't know anything about my model.
[01:57] It's kind of a black box for you or anyone else.
[02:03] I mean, you can provide an input, some magic will happen within the black box, and you'll receive the output, but you won't be able to interpret what exactly is going on inside this black box.
[02:14] Well, the irony is that even I, the developer who built this particular model, will not be able to explain how it came to a particular decision.
[02:22] You see, it's not just about me.
[02:24] Any data scientist, machine learning engineer or AI engineer will not be able to explain the computational details behind a predicted output.
[02:33] They'll not be able to explain how a model reached a certain decision for a particular instance in a testing sample.
[02:40] And this is a problem, right?
[02:42] Well, if there is a problem, there is going to be a solution as well.
[02:46] And that solution in our case is explainable AI, abbreviated as XAI.
[02:50] It helps humans who are worried about black-box situations understand how an AI model comes up with its results and how much we can trust those results.
[03:04] Let's explore the different components of explainable AI to better understand how exactly it works.
[03:12] In a basic sense, it comprises three different components.
[03:15] Number one is prediction accuracy.
[03:17] Then there is interpretability and justifiability.
[03:21] The prediction accuracy and interpretability are two technical aspects.
[03:26] Whereas justifiability is something that's going to solve our problem.
[03:31] That is understanding how a model reached a certain decision.
[03:34] Let's start by understanding the prediction accuracy.
[03:38] Prediction accuracy is a highly important metric for any machine learning engineer or data science professional.
[03:45] Right? Because the success of your machine learning model is going to be judged by how accurate its results are.
[03:51] Generally, when we compute the accuracy, we learn the pattern in the data and then check how accurate the learned model's results are on the testing set.
[04:01] Right? Well, in the case of explainable AI, that is not going to change.
[04:04] It will
[04:07] remain pretty much the same; it's just that you'll assess the performance by using techniques like LIME and running multiple simulations to compare results.
[04:17] LIME is a technique that helps explain the predictions of an AI model at the local level.
[04:22] Next one is interpretability.
[04:25] Interpretability, or in other words what we call traceability, is yet another important factor.
[04:34] Generally, when we build a model, we have a problem and a data set at our disposal, right?
[04:41] We try to visualize the attributes present in the data to figure out their correlation.
[04:45] We even drop certain columns that don't seem very relevant.
[04:50] Right?
[04:50] By doing all these things, we are actually trying to interpret the question and how a solution can be framed.
[04:57] In a similar manner, in the case of a machine learning or AI model, interpretability refers to how the learning of a model can be broken down into chunks.
[05:06] If it is a decision tree
[05:08] model, what are the rules the model is setting?
[05:11] What are the important features the learning process is revolving around?
[05:15] Figuring out answers to these questions is basically what is known as interpretability.
[05:21] To achieve this, we have a framework called DeepLIFT, which explains predictions by analyzing input feature changes.
[05:28] Finally, we have something called justifiability.
[05:33] As we already mentioned, this component is all about resolving human mistrust in an AI or ML model.
[05:38] Let me explain justifiability with a simple decision tree that predicts whether a student can get into a BTech program based on the marks he got.
[05:50] If he got more than 85% marks in computer science, he can easily get CSE, and even if he got only 80% marks but has excellent extracurricular skills, he can still get into the BTech program.
[06:00] So we arrive at the conclusion that certain factors within my data set lead to a change in the results.
[06:03] Figuring out these exact rules and the exact attributes that have more say in the decision are the main things that we try to break down.
[06:19] Not only that, we also try to understand how the output changes if there is a slight change in the input provided.
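
The BTech example can be sketched as a tiny rule-based function that returns both a decision and the rule that produced it. This is only an illustration: the function name `admit_to_btech`, the exact comparison operators and the wording of the reasons are assumptions, while the 85% and 80% thresholds come from the spoken example.

```python
def admit_to_btech(cs_marks, excellent_extracurriculars):
    """Return (decision, reason) for a toy admission rule.

    The thresholds mirror the spoken example; everything else here
    (names, operators, reason strings) is an illustrative assumption.
    """
    if cs_marks > 85:
        return True, "computer science marks above 85%"
    if cs_marks >= 80 and excellent_extracurriculars:
        return True, "marks of at least 80% plus excellent extracurriculars"
    return False, "neither admission rule was satisfied"


# Justifiability: every decision comes with the rule that produced it.
for marks, extra in [(90, False), (82, True), (82, False)]:
    decision, reason = admit_to_btech(marks, extra)
    print(marks, extra, "->", decision, "because", reason)

# Sensitivity: change one input slightly and see whether the output flips.
before, _ = admit_to_btech(85, False)
after, _ = admit_to_btech(86, False)
print("85 marks ->", before, "| 86 marks ->", after)
```

A real deep learning model has no such readable rules, which is why techniques like LIME approximate this kind of explanation after the fact.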
[06:29] Now let me put all these three components into a practical use case by taking the same example that I took in the beginning of the video.
[06:35] Say if a person has cancer, a model should be able to take X-ray images and detect the cancer cells accurately.
[06:43] It should not confuse it with any other disease.
[06:47] This highlights how much the outcome of this model can be trusted which is known as prediction accuracy.
[06:54] To understand the nature of disease, the AI model has to break it down into smaller chunks which would narrow down the scope of the data.
[07:01] The amount of data we humans are able to understand from the smallest chunks of data is defined as interpretability of the AI model.
[07:07] Why
[07:10] the cells in the X-ray were labeled as cancer cells, and not as any other disease, is what the justifiability of an AI model explains.
[07:17] So overall, explainable AI is not just about building trust or understanding how a decision was made; it is also about troubleshooting mispredictions and improving the model's performance.
[07:29] Hello everyone.
[07:31] Intellipaat offers an executive post-graduate certification in data science and artificial intelligence in collaboration with iHub IIT Roorkee.
[07:40] Through this particular course, you'll get to learn multiple tools like Python, PySpark, SciPy, NumPy, Pandas, Matplotlib, TensorFlow, Git, etc.
[07:52] You are going to learn multiple skills like data science, natural language processing, deep learning, fundamentals of generative AI, prompt engineering and application based generative AI as well as recent trends like agentic AI.
[08:08] This course is designed to get you ready for the AI world.
[08:08] So do
[08:11] check out the link available in the description.
[08:12] Also, through this course we have already helped thousands of learners take a positive step in their careers.
[08:18] You can check out their testimonials on our achievers channel.
[08:22] Need to be analyzed to determine the best course of action.
[08:25] This is the main function of the control algorithm and software.
[08:27] And this is where reinforcement learning comes into picture.
[08:30] This is the most complex part of the self-driving car, since it has to make decisions flawlessly.
[08:35] A flaw, as in the case of Uber's self-driving car, can be fatal.
[08:37] Okay.
[08:40] In today's world, the most famous self-driving cars are those from Tesla and Google.
[08:44] Tesla cars work by analyzing the environment using a software system known as Autopilot.
[08:48] Autopilot uses high-tech cameras to view and collect data from the world.
[08:53] Same as what we do with our eyes.
[08:55] It's called computer vision, or sophisticated image recognition.
[09:00] It then interprets this information and makes the best decisions based on it.
[09:04] So this was one of the use cases of reinforcement learning.
[09:07] Iris flower data set.
[09:10] Well, the project which we are
[09:11] going to perform is generally known as the hello world of machine learning.
[09:15] So, it's the best project to start with because it's so well understood.
[09:19] Okay. And why is it so well understood?
[09:21] Because all the attributes in this data set are numeric.
[09:25] So, all you have to do is figure out how to load and handle the data.
[09:27] So, it's a classification problem thereby allowing you to practice with perhaps an easier type of supervised learning.
[09:32] Okay.
[09:34] It also supports a multiclass classification problem that may require some specialized handling.
[09:39] So this iris data set consists of four different attributes:
[09:40] sepal length, sepal width, petal length and petal width. It consists of 150 rows, meaning it is very small and can easily fit into memory.
[09:51] All the numeric attributes are in the same unit and the same scale.
[09:53] Okay.
[09:55] So this data set doesn't require any special scaling or transformation to get started.
[09:59] Okay.
[10:02] So let's get started with your hello world machine learning project in Python.
[10:05] Okay.
[10:05] So this is my Jupyter notebook.
[10:07] So the very first thing that I'll be doing up here is importing all the required libraries.
[10:09] So I'm importing pandas.
[10:11] Then
[10:13] from pandas I'm importing scatter_matrix.
[10:15] Next I'll be importing matplotlib.pyplot as plt.
[10:18] Then we are importing model_selection from sklearn,
[10:20] classification_report from sklearn.metrics,
[10:22] confusion_matrix from sklearn.metrics,
[10:25] and accuracy_score, again from sklearn.metrics.
[10:28] And these are various machine learning models.
[10:35] Okay. KNN classifier, linear discriminant analysis, Gaussian naive Bayes and support vector machine.
[10:40] Okay, so we are importing all these models from our scikit-learn library.
[10:44] Okay, so let's execute it and next step is to load the data set.
[10:47] So the very first parameter that I'm passing up here is the URL of my data set.
[10:51] Okay, or the path of my data set basically.
[10:53] So there's my data set just to show you.
[10:58] So this link consists of my data set.
[11:00] Okay.
[11:03] Next, I'm defining an array called names.
[11:05] This consists of the names of the attributes: sepal length, sepal width, petal length, petal width and class.
[11:11] Okay. And next I'm defining a
[11:14] variable called dataset.
[11:17] So dataset = pandas.read_csv(url, names=names).
[11:21] What this does is fetch the data from this URL and name the attributes sepal length, sepal width, petal length, petal width and class.
[11:33] Okay.
[11:35] So this will basically load your data set.
[11:37] Fine.
[11:37] Let's execute it.
[11:39] So our data set is loaded.
[11:41] Now let's summarize the data set.
[11:43] Now let's check how many rows and columns our data set has.
[11:45] So let's execute this and check the shape of our data set.
[11:48] So our data set consists of five different columns and 150 rows.
[11:51] Okay.
[11:52] Next, if you want to check a sample of the data set, then you have the head function over here.
[11:56] So, this will give you the first 20 rows of your data set.
[11:58] Okay, 0 to 19.
[12:01] Fine.
[12:01] Next is the dataset.describe() function.
[12:04] So, this will give you various descriptive statistics for your data set, like count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile and max.
[12:13] So, count is basically the
[12:15] total count of sepal length values.
[12:18] Okay.
[12:18] Similarly, mean of all, similarly standard deviation and so on.
[12:22] Okay.
[12:22] Next, what if I want to check how many different classes there are, or how many values each class contains?
[12:28] So, we'll group by the class and we'll find the size of each class.
[12:31] Okay, let's execute it.
[12:34] So, we got the output up here as Iris-setosa 50, Iris-versicolor 50 and Iris-virginica also 50.
[12:40] So, this means that we have three different classes.
[12:42] Iris setosa, versicolor and virginica.
[12:47] All three of them consist of 50 values.
[12:49] Okay.
[12:49] Fine.
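
The loading and summary steps described above can be sketched roughly as follows. Because the exact URL used in the video is not shown in the transcript, this sketch assumes the copy of the Iris data bundled with scikit-learn instead of `pandas.read_csv(url, names=names)`; the column names are also an assumption, and the class labels come out as `setosa`, `versicolor`, `virginica` rather than `Iris-setosa` and so on.

```python
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of the URL from the video.
iris = load_iris(as_frame=True)
dataset = iris.frame.copy()
dataset.columns = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]

# Replace the numeric target codes (0, 1, 2) with readable class names.
dataset["class"] = dataset["class"].map(dict(enumerate(iris.target_names)))

print(dataset.shape)                    # (rows, columns)
print(dataset.head(20))                 # first 20 rows, index 0 to 19
print(dataset.describe())               # count, mean, std, min, quartiles, max
print(dataset.groupby("class").size())  # number of rows in each class
```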
[12:49] So next is data visualization.
[12:51] So now that we have some basic idea about our data, let's create some visualization out of it.
[12:55] So we are going to look at two different types of plots.
[12:59] The first would be a univariate plot, to understand each attribute, and next would be a multivariate plot, to understand the relationship between the attributes.
[13:09] Okay.
[13:09] So first let's plot a univariate plot, that is, a plot of each individual variable.
[13:13] Okay.
[13:13] So this is the univariate box-and-whisker plot.
[13:17] So the bottom part here represents the minimum value and this is the maximum value.
[13:23] Okay.
[13:26] And this green bar over here represents the median value.
[13:28] So if you see the mean of sepal length, it's around 5.84.
[13:30] The median shown by the bar is close to that.
[13:33] So it's about 5.8.
[13:33] Okay.
[13:36] These circles beyond the ends of the whiskers are nothing but the outliers.
[13:38] Okay.
[13:38] Yeah.
[13:40] One more thing.
[13:40] The minimum and the maximum value.
[13:42] Right.
[13:42] So the minimum is 4.3 for sepal length and the maximum is 7.9.
[13:44] Okay.
[13:47] So the minimum value you got up here is 4.3.
[13:49] Okay.
[13:49] So there's my 4.3 minimum value and here we have the maximum value as 7.9.
[13:52] Okay.
[13:54] And these circles are the outliers.
[13:57] Similarly for sepal width, the minimum is 2.0 and the maximum is 4.4.
[14:03] So the minimum from here is 2.0.
[14:05] This value is an outlier.
[14:05] And the maximum value is 4.4.
[14:08] This value, 4.4,
[14:12] is also an outlier, and the mean value of sepal width is 3.05.
[14:14] So we got mean somewhere around here.
[14:14] Okay.
[14:14] So
[14:20] in order to understand our data in a better way, we created a univariate box-and-whisker plot.
[14:24] Okay.
[14:26] This gives us a much clearer idea about the distribution of the input attributes.
[14:28] Okay.
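
To make the pieces of a box-and-whisker plot concrete, here is a minimal sketch that computes them by hand. The readings are a small hypothetical sample, not the real sepal-width column, and the 1.5 × IQR fence is the common convention (also Matplotlib's default) for deciding which points are drawn as outlier circles. The line inside the box marks the median, which can differ from the mean.

```python
import statistics

# Hypothetical sepal-width readings in cm (not the real Iris column).
widths = [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 3.5, 4.4]

# Quartiles: the box spans q1 to q3 and the line inside it is the median.
q1, median, q3 = statistics.quantiles(widths, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in widths if w < lower_fence or w > upper_fence]

print("mean:", statistics.mean(widths))
print("median:", median)
print("box:", q1, "to", q3)
print("outliers:", outliers)
```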
[14:30] Well, if you want, you can even create a histogram of each input variable to get an idea of the distribution.
[14:34] Okay.
[14:37] So let's execute it.
[14:37] So there's my histogram.
[14:40] So for the sepal measurements I can find a Gaussian distribution up here.
[14:42] Okay.
[14:44] So this is useful to note, as we can use algorithms that can exploit this assumption.
[14:46] Okay.
[14:48] Next we have the multivariate plot.
[14:50] Well, a multivariate plot is used to check the interaction between the variables.
[14:52] Okay.
[14:55] So let's just execute it and see what's the output.
[14:57] So here's a scatter plot of all pairs of attributes.
[15:00] So this can be helpful to spot structured relationships between input variables.
[15:04] Okay.
[15:06] So as you can see, we have a diagonal grouping of some pairs of attributes.
[15:08] So this suggests a high correlation and a predictable relationship.
[15:12] Okay.
[15:14] So from here you can say that sepal length is more dependent on petal length and petal width than on sepal width.
[15:19] Similarly, same
[15:21] thing over here, right.
[15:23] So petal length is more dependent on petal width than on sepal length and sepal width.
[15:29] Okay.
[15:29] And petal width is more dependent on petal length than on sepal length and sepal width.
[15:33] Okay.
[15:36] So these are the things that we can conclude after visualizing our data.
[15:39] Okay.
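
What the scatter plots suggest visually can be checked numerically with a correlation matrix. The sketch below is not part of the video's notebook; it assumes the Iris data bundled with scikit-learn and uses NumPy's `corrcoef` to print the Pearson correlation for each pair of attributes.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the video's CSV.
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]

# rowvar=False treats each column as one variable, giving a 4 x 4 matrix.
corr = np.corrcoef(iris.data, rowvar=False)

# Print each pair once; values near +1 or -1 indicate a strong linear relationship.
for i, row_name in enumerate(names):
    for j in range(i + 1, len(names)):
        print(f"{row_name} vs {names[j]}: {corr[i, j]:+.2f}")
```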
[15:39] Now it's time to create some model and estimate the accuracy on unseen data.
[15:46] Okay.
[15:46] So here we'll follow step-by-step procedure.
[15:47] Okay.
[15:47] First we'll separate out a validation data set.
[15:49] That is divide our data set into train and test part.
[15:51] Then we'll use a 10-fold cross-validation technique that will randomly distribute our data into 10 different parts.
[15:57] Okay.
[15:57] So that we can perform training and testing on each part.
[16:01] Then we can finally combine the results of all of them to get a more reliable accuracy estimate.
[16:05] Okay.
[16:05] Then we are going to build five to six different models to predict species from flower measurements.
[16:11] Okay.
[16:11] And then finally we'll select our best model.
[16:14] So our first step is creating a validation data set.
[16:16] That is dividing our data set into train and test.
[16:18] So here what we are doing is dividing our data set into train
[16:24] and test.
[16:26] Fine. array = dataset.values.
[16:29] So it consists of all the values inside your data set.
[16:31] Here you're defining X = array[:, 0:4].
[16:33] So here you're taking a portion of the array.
[16:36] Okay. And you're placing it in X.
[16:38] So what is that?
[16:40] The row part of the slice is left empty.
[16:43] So it will start from the very first row and take every row, and the columns are 0 to 4.
[16:45] So it will take up to, but not including, column index 4.
[16:49] So that is the first four columns of every row; X is storing the values of the four measurement attributes.
[16:51] Fine. Similarly we are defining a variable Y.
[16:53] So Y will consist of the column at index 4.
[16:56] So Y = array[:, 4], counting the columns as 0, 1, 2, 3, 4.
[16:58] So here number four represents my fifth column, the class.
[17:00] Okay. Next I'm defining a validation size as 0.20.
[17:02] So here validation size means it will split my data as 80% in one part and 20% in other.
[17:05] Okay. So next we are defining a seed value.
[17:08] So this seed value is used
[17:25] to initialize the randomization.
[17:28] Setting it to the same number each time guarantees that every time you execute the algorithm it'll come up with the same result.
[17:34] Okay.
[17:36] Next we are defining scoring as accuracy.
[17:43] So here we are defining X train X validation Y train and Y validation.
[17:46] So x train would be my training data from this part and x validation is the testing data from same.
[17:54] Okay.
[17:56] So according to my above code my 80% of the data will lie in the training part and the rest 20% would be treated as a validation data or the test data.
[18:04] As I have already mentioned the validation size is 0.20 or 20%.
[18:06] So this means that 20% of my data is test data.
[18:09] Fine.
[18:11] Similarly, we'll define y train and y validation that is 80% of the training data from this array and rest 20% of the testing data from same array.
[18:18] Okay.
[18:21] So here we are basically splitting our data set.
[18:23] Okay.
[18:23] On the basis of X and Y,
[18:26] with test size equal to the validation size that we have already specified as 0.20, and my random state equal to seed.
[18:33] So every time I execute it the algorithm will come up with the same result.
[18:38] Fine.
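
The split described above can be sketched with scikit-learn's `train_test_split`. The 0.20 validation size and the seed of 7 come from the walkthrough; loading the data from scikit-learn's bundled copy, rather than slicing `dataset.values`, is an assumption made to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data; X holds the four measurements, Y the class.
X, Y = load_iris(return_X_y=True)

validation_size = 0.20  # 20% held back for validation
seed = 7                # fixed seed so the split is reproducible

X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed
)

print(X_train.shape, X_validation.shape)  # training rows vs validation rows
```

Running the split again with the same seed returns exactly the same rows in each part, which is what makes later accuracy numbers comparable between runs.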
[18:40] So now that we have created a split among our data set and we have divided our data set into train and test.
[18:44] So our next step would be to build our model.
[18:46] Okay.
[18:48] So this is how we are building our model.
[18:51] Okay.
[18:51] So from the scikit-learn library we are calling logistic regression.
[18:53] We are calling linear discriminant analysis, KNN, decision tree, Gaussian naive Bayes and SVM or support vector machine.
[19:00] Okay.
[19:02] Here what we are doing we are creating 10 folds.
[19:04] Okay.
[19:04] So 10 folds.
[19:07] Let me just go step by step.
[19:07] So here we are defining a variable called kfold.
[19:09] So kfold = model_selection.KFold, where the number of splits we want to perform is 10 and the random state is again seed, that is seven.
[19:17] So every time it will perform the same random split.
[19:19] Okay.
[19:19] So this is how my data would be split into 10 different parts.
[19:22] Okay.
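
To show what splitting into 10 folds means mechanically, here is a small pure-Python sketch. It is not scikit-learn's `KFold` implementation; the helper `k_fold_indices` is a hypothetical name, and 120 is simply the number of training rows left after the 80/20 split.

```python
import random


def k_fold_indices(n_samples, n_splits, seed):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold splitting."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle every run

    # Spread the samples over the folds as evenly as possible.
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold is the test part exactly once; the other folds form the training part.
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_samples=120, n_splits=10, seed=7))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training rows, {len(test)} test rows")
```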
[19:22] Now once the data is split into 10 parts then
[19:27] we'll define a variable called cv_results.
[19:30] Okay.
[19:30] So cv_results = model_selection.cross_val_score.
[19:33] Inside that we are passing our model, whichever model we are choosing.
[19:37] We are passing the X_train data, the Y_train data, cv=kfold, whatever the value of kfold is, and then finally scoring, which we have initialized up here as accuracy.
[19:49] Okay.
[19:49] Then what we are doing is appending our result.
[19:51] So our cv_results would be appended to results, that is, to this list.
[19:56] So next we have results.append(cv_results).
[19:57] So this list would be updated with the cv_results value.
[20:03] And here we are appending the name, and finally we are printing our desired result.
[20:07] So one by one this will go in a loop.
[20:10] Okay.
[20:10] So every time this loop executes, you'll get the accuracy of one of the models.
[20:14] Okay.
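
The comparison loop described above might look roughly like this. It is a reconstruction under assumptions, not the notebook from the video: the data comes from scikit-learn's bundled Iris copy, the short model labels are made up, and `shuffle=True` is added because, as far as I know, recent scikit-learn versions reject a `random_state` on `KFold` unless shuffling is enabled. The accuracies it prints may differ from the ones quoted in the video.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

seed = 7
X, Y = load_iris(return_X_y=True)  # assumption: bundled data, not the video's CSV
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.20, random_state=seed
)

# The six model families named in the walkthrough; the labels are illustrative.
models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier(random_state=seed)),
    ("NB", GaussianNB()),
    ("SVM", SVC()),
]

results, names = [], []
for name, model in models:
    # 10 folds over the training part only; the validation part stays untouched.
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    results.append(cv_results)
    names.append(name)
    print(f"{name}: mean={cv_results.mean():.3f} std={cv_results.std():.3f}")
```

Keeping the per-fold scores in `results`, rather than only the means, is what makes a later box-and-whisker comparison of the algorithms possible.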
[20:14] So here right now our models list consists of all these values.
[20:19] Okay.
[20:19] As we are appending, that is, we are adding all these models one by one to our list.
[20:24] It consists of logistic regression, decision tree, Gaussian naive Bayes and
support vector machine. All of them are
in our list. Okay. And finally it is
printing the message. Let's check the
output. So we got the output as this. So
here the accuracy of logistic
regression is 0.96. That of linear
discriminant analysis is 0.97. KNN
0.98. Decision tree 0.96. Naive Bayes is 0.97. Support
vector machine is 0.99. So from this we
can say that support vector machine is
having the highest accuracy. Okay. Now
[20:57] our next step would be to compare the
[20:59] algorithm and select the best model. So
[21:00] just by looking at this output we can
[21:02] say that support vector machine is
[21:04] having the highest accuracy. But let's
[21:06] compare all these algorithm and see
[21:07] which one fit the best. So for that I'll
be plotting a box-and-whisker plot for
[21:12] the accuracy versus the name of the
[21:14] algorithm. Okay. So there's my algorithm
[21:16] comparison. Here I have the name of my
[21:18] algorithm and this is the accuracy for
[21:21] them. The green point which you can see
[21:22] up here is nothing but the mean values
or this value. Okay. For logistic
regression I had 0.96, for linear
discriminant analysis I had 0.97 and so
on. So from this graph what I can say is
K-nearest neighbors and support vector
machine, these are having the highest
accuracy, okay, as there is no minimum and
maximum value; they only have a
maximum value and a few outliers, these
outliers. Okay, so in case you are
removing these outliers, like this one,
these two and this, so by removing them
KNN will give you a 100% accuracy
result. By removing these two outliers,
naive Bayes would give you the highest accuracy
result, or a 100% accurate result. So
even in case of SVM, if you remove this
outlier, you'll get a model with 100%
accuracy. Okay. Now, next is making the
prediction. Well, just accuracy
[22:15] shouldn't be your metric to decide that
[22:17] this is your best model. Okay. So, for
selecting a model, you have several
other factors which you should consider
apart from accuracy. Okay. So,
accuracy will not always be the metric
to select the best model. Okay. So,
[22:29] let's move ahead and see how to make the
[22:31] prediction. So we have already
[22:32] calculated the accuracy right. So now
[22:35] from this graph we know that support
[22:36] vector machine is giving the highest
[22:38] accurate result even with the outlier.
[22:41] If this outlier is removed then it's the
[22:42] best one. Okay. Now let's make some
[22:44] prediction on it. Okay. So let's execute
[22:46] this. So here we are printing the
[22:48] accuracy score. We are printing the
[22:49] confusion matrix and the classification
[22:52] report for Y validation and prediction
[22:54] that is testing and predicted value. So
[22:57] here we are printing the accuracy score
[22:59] for the tested value and the predicted
[23:01] value. We are printing a confusion
[23:02] matrix for the tested value and the
[23:04] predicted value. And finally a
[23:05] classification report. Same for the
[23:07] tested value and predicted value. So if
[23:09] you check the score for tested value and
predicted value, the accuracy score
while using SVM or support vector
machine is 0.93. So there's my confusion
matrix up here. And finally we have the
precision, recall, F1 score and support
data for all three classes: Iris
setosa, Iris versicolor and Iris
virginica. So let me just explain to you
[23:29] what is precision, what is recall, what
is F1 score and support. So
precision tells how accurate or
precise your model is, that is, out of
those predicted positive, how many of
them are actual positive. So this is
precision. Okay. Next is recall. It
calculates how many of the actual
positives our model has captured by
labeling them as positive. Okay. Next is
F1 score. It is the harmonic mean of
precision and recall, used when you want to
seek a balance between the two, for example
when there is a large number of
actual negative values. Okay. And
finally we have support, which is the
number of samples of the true response
that lie in that particular class.
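
These four quantities can be computed directly from a confusion matrix. The matrix below is a made-up illustration, not the one shown in the video; its row totals were chosen only so that the supports (7, 12 and 11) match the counts mentioned in the walkthrough. Rows are the true classes and columns are the predicted classes.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
confusion = [
    [7, 0, 0],
    [0, 10, 2],
    [0, 0, 11],
]


def class_report(confusion, labels):
    """Compute precision, recall, F1 and support for every class."""
    report = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column total
        actually_k = sum(confusion[k])                      # row total = support
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actually_k}
    return report


report = class_report(confusion, labels)
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(len(labels))) / total

for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy:", round(accuracy, 2))
```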
Okay. So while predicting with SVM we
had seven true responses in case of Iris
setosa, 12 in case of Iris versicolor and 11
in case of Iris virginica. Okay. So
I made the prediction using the SVM model. If
you make a prediction using KNN, you'll
get a different result up here. In this
case we got the accuracy as 0.9, and the
precision score, recall score, F1
score and support are all
different. Okay. So here in case of KNN
[24:31] algorithm you can see that we have the
[24:33] accuracy as 0.9 or 90%. The confusion
[24:36] matrix provides an indication of the
[24:38] three errors that we made. Finally the
[24:40] classification report provides a
[24:41] breakdown of the class by precision,
recall, F1 score and support. Okay. So
we used Python to create a few machine
learning models and selected the best model
[24:49] out of them. So our best fit model is
[24:52] SVM as it is giving the highest accuracy
[24:55] and high values of precision and recall.
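
The prediction step can be sketched as follows: fit the chosen model on the training part, predict on the held-back validation part, and print the three reports. As before, the bundled Iris data and the default `SVC()` settings are assumptions, so the numbers printed may not match the 0.93 quoted in the video.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

seed = 7
iris = load_iris()  # assumption: bundled data instead of the video's CSV
X_train, X_validation, Y_train, Y_validation = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=seed
)

# Train on the 80% part, then predict the 20% the model has never seen.
model = SVC()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, predictions)
matrix = confusion_matrix(Y_validation, predictions)

print(accuracy)
print(matrix)
print(classification_report(Y_validation, predictions, target_names=iris.target_names))
```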
[24:58] Here's a quiz question for you guys.
[25:00] What is machine learning? Your options
[25:02] are a type of computer hardware, a
[25:05] programming language used for data
[25:06] analysis, a branch of artificial
[25:09] intelligence that focuses on the
[25:10] development of algorithms that can learn
[25:12] and make predictions based on data or a
[25:15] type of advanced robotics technology.
[25:18] Please mention your answers in the
[25:19] comment section. So, regression is a
[25:22] technique of finding the relationship
between two or more variables. Okay, it
is based on the fact that a change in the
dependent variable is associated with a
change in one or more independent
variables. Like you can see over here,
regression is a technique that models
the relationship between variable Y
and the values of variable X. For
example, as the temperature drops, people
put on more jackets to keep themselves warm.
So the selling of jackets and the drop in
temperature are directly related,
right? Here temperature is an independent
variable and the selling of jackets is
directly dependent on temperature. Okay.
[25:59] So the jacket selling is a dependent
[26:01] variable and it depends on what on the
[26:04] temperature. Fine. So let's move ahead.
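
A quick way to see such a relationship in numbers is the Pearson correlation between the two variables. The figures below are invented purely for illustration: as the temperature goes down, jacket sales go up, so the correlation comes out strongly negative.

```python
import math

# Hypothetical observations: daily temperature (°C) and jackets sold that day.
temperature_c = [30, 25, 20, 15, 10, 5, 0]
jackets_sold = [2, 5, 9, 14, 22, 30, 41]


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson(temperature_c, jackets_sold)
print(f"correlation between temperature and jackets sold: {r:.2f}")
```

Here temperature is the independent variable X and jackets sold is the dependent variable Y; a value of `r` close to -1 or +1 suggests that a regression model is worth fitting.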
[26:07] Next is regression use case. Let's see
[26:10] some of the use cases related to
[26:12] regression. First we have temperature
[26:15] versus number of cones sold at ice cream
store. So temperature is an independent
variable and the number of ice cream cones
sold at the ice cream store depends directly
on temperature. It's like if it's
summer time or when the temperature is
high, the number of ice cream cones sold
will be more. Right? Next is inches of
rainfall as an independent variable, and
new cars sold directly depends on it.
Fine. If there would be more rain then
the number of cars sold in the market
would be more. Fine. Next is daily
snowfall versus number of skier visits.
Again daily snowfall is an independent
variable. It does not depend on
anything. But the number of skier
visits directly depends on snowfall.
If there would be no snowfall then there
would be no skier visits. Right? More
snowfall means more skier visits. So
these types of problems come under the
[27:08] regression category. So if you think
[27:11] there's a relationship between two
things or two variables, then regression
would help to confirm it. For example,
take temperature as one variable and the number
of cones as the other variable. If you think
[27:20] that there's a relation between
[27:22] temperature and number of cones sold,
[27:24] then you can use regression to confirm
[27:26] it. Okay, let's move ahead. Next we have
[27:30] is the type of regression. Well, there
[27:32] are various types of regression but for
[27:34] now we'll mainly focus on linear
[27:36] regression and logistic regression.
[27:39] So let's see what is linear regression
[27:41] and logistic regression. So linear
[27:44] regression generally deals with
continuous variables. Okay. So what do I
[27:48] mean by continuous variable? For
[27:50] example, the amount of rainfall. Okay.
[27:52] So it's a continuous variable. It can
[27:54] range from 0 to 100 cm anything. Or the
[27:58] price of the house. Again, it's a
[28:00] continuous thing, right? The price of
[28:01] the house can range from $100 to around
[28:04] $5,000. Anything. Okay. On the other
[28:07] hand, logistic regression deals with
[28:10] categorical variable. These are the kind
[28:12] of variable in which you have to just
[28:14] predict either yes or no like whether it
[28:17] will rain tomorrow or not. Okay. So
[28:19] these are categorical variable. Next
[28:22] linear regression. It is used to solve
[28:24] the regression issues. We already
[28:26] discussed some of the regression issues
[28:28] previously. For example, selling of the
[28:30] car directly depends on amount of
[28:32] rainfall in a year. If it is raining
[28:35] more then the amount of cars sold in the
[28:37] market would be more. So it's a
[28:39] regression issue and it deals with
[28:41] continuous variable. Okay. On the other
[28:43] hand, logistic regression. Logistic
[28:45] regression is used to solve
[28:47] classification issues, like: is this mail
[28:50] spam or not? Okay. Or will it rain
[28:53] tomorrow or not? So these kinds of
[28:56] problems come under classification. You
[28:58] have to classify either yes or no. Okay.
[29:01] Next is when you have to predict a new
[29:03] point, the point is predicted using a
[29:06] straight line in case of linear
[29:07] regression. But the prediction in case
[29:10] of logistic regression is done by
[29:12] drawing an S-curve. Okay, don't worry.
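The contrast between the straight line and the S-curve can be sketched in a few lines of Python. This is only an illustration: the function names `linear_predict` and `logistic_predict` and the coefficients 0.8 and -2.0 are assumptions made up for the example, not values from the course.

```python
import math

def linear_predict(x, m, c):
    """Straight line y = m*x + c: the output is continuous and unbounded."""
    return m * x + c

def logistic_predict(x, m, c):
    """S-curve: squash the same linear score m*x + c into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(m * x + c)))

# Hypothetical coefficients, chosen only for illustration
m, c = 0.8, -2.0
for x in [0, 2.5, 5, 10]:
    # The linear output keeps growing; the logistic output flattens near 0 and 1
    print(x, linear_predict(x, m, c), round(logistic_predict(x, m, c), 3))
```

A logistic output above 0.5 would typically be read as "yes" and below 0.5 as "no", which is how the S-curve turns a continuous score into a two-way classification.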
[29:14] We'll see in detail how these lines are
[29:16] plotted. All right, for now we'll mainly
[29:18] focus on linear regression and how the
[29:21] straight line is formed. So what is
[29:23] linear regression? Well, linear
[29:25] regression is plotting a straight line
[29:28] of the form y = mx + c such that it
[29:31] predicts a new data point. In other
[29:34] words, you can understand that if your
[29:36] model is well trained using linear
[29:37] regression, then in that case, the
[29:40] predicted point would lie on the
[29:42] regression line. Okay? And your aim is
[29:44] to draw that regression line. A simple
[29:47] linear regression is useful for finding the
[29:49] relationship between two continuous
[29:51] variables. One of them is the independent
[29:53] variable and the other is the dependent
[29:56] variable. Let's understand linear
[29:58] regression in depth. Let's suppose we
[30:00] have two axes over here, the x-axis and
[30:02] y-axis. X has the independent variable and Y
[30:06] has the dependent variable. And your aim is
[30:09] to draw a line of regression. Okay. Now
[30:13] let's suppose we have a data point on
[30:15] the x-axis. It's an independent variable.
[30:17] So if the independent variable is
[30:19] increasing, what should be the change in
[30:22] the dependent variable? So if the dependent
[30:24] variable is also increasing, then in that
[30:26] case we'll get a positive line, right?
[30:28] Both the independent variable and the
[30:30] dependent variable are increasing, so we
[30:32] get a positive line. Next, if the
[30:35] dependent variable is decreasing and for
[30:38] the decrease in the dependent variable we
[30:40] have an increase in the independent
[30:42] variable. Okay, it's like the dependent
[30:44] variable is decreasing and the independent
[30:46] variable is increasing. So in that case
[30:48] we'll get a negative line. All right, so
[30:51] the line I'm talking about is the line
[30:53] of linear regression. All right, let's
[30:55] move ahead. Uh let's suppose this is our
[30:57] observation. Let's add some observation.
[31:01] Okay, now our aim is to draw a
[31:04] regression line that would help us to
[31:06] predict the point in future and we'll be
[31:08] doing this using the least squares method.
[31:11] Let's understand what exactly the least
[31:13] squares method is. So suppose there's an
[31:15] estimated or predicted value and
[31:17] this was the actual value. The
[31:19] difference between the actual value and the
[31:21] estimated value is nothing but the error,
[31:23] and our main goal is to reduce this
[31:26] error. Okay. And remember this red point
[31:28] which I am plotting over here is the
[31:30] predicted point and it lies on the
[31:32] linear regression line. Okay. So the closer
[31:35] the regression line lies to the actual
[31:38] values, the better it will be. Okay,
[31:40] so coming back, your main aim is to
[31:43] reduce this error, or reduce the
[31:45] difference between the actual value and
[31:46] the estimated or the predicted value.
[31:49] Okay, you have to repeat this step for
[31:51] every single point. Okay, and your goal
[31:55] is to minimize the error. Fine. So this
[31:58] is what the least squares method is. Don't
[32:00] worry, we'll understand about it in
[32:01] detail. Okay, for now uh let's just
[32:04] continue.
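As a rough sketch of the idea just described, the error of a candidate line can be measured by summing the squared differences between actual and predicted values over every point. The observations and the two candidate lines below are hypothetical, made up purely for illustration, and `sum_squared_error` is an assumed helper name.

```python
def sum_squared_error(xs, ys, m, c):
    """Sum of squared differences between each actual y and the line's prediction m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical observations, for illustration only
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Two candidate lines; the one with the smaller total error fits these points better
err_a = sum_squared_error(xs, ys, m=2.0, c=0.0)
err_b = sum_squared_error(xs, ys, m=1.0, c=1.0)
print(err_a, err_b)
```

Squaring keeps positive and negative errors from cancelling each other out, which is why the method is called least squares rather than least error.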
[32:05] So let's suppose we have speed on the
[32:07] x-axis and distance on the y-axis. So if
[32:10] you plot a graph you'll get a positive
[32:12] relation between both of them. It's like
[32:14] for a constant time if the speed is
[32:16] increasing the distance would also
[32:18] increase. Okay, we get the equation of the line
[32:21] as y = mx + c. Here y is the distance
[32:25] traveled in a fixed duration of time, and x
[32:27] is the speed of the vehicle. On the
[32:30] other hand, m is the slope, here the
[32:32] positive slope, of the line, and c is the
[32:35] y-intercept of the line, or the point
[32:37] where the line y = mx + c cuts the y-axis.
[32:42] Okay. Now let's consider another
[32:44] example. Suppose we have speed on the
[32:46] x-axis and time on the y-axis. Then for
[32:50] a fixed distance, if the time is
[32:52] increasing, the speed would decrease or
[32:54] if the speed is increasing, the time
[32:56] would decrease. Both are inversely
[32:58] related to each other. So we'll get a
[33:00] negative relation in that case. So here
[33:03] our equation of the line would change to y =
[33:06] -mx + c, where y is the time taken
[33:09] to travel a fixed distance. Here
[33:12] we'll get m as negative. So m is the
[33:15] negative slope of the line, x is again
[33:17] speed, and c is the point where the line y =
[33:20] -mx + c cuts or intersects the
[33:23] y-axis. Okay, let's move ahead. So when
[33:28] you plot a graph between the dependent
[33:30] and independent variable and you get a
[33:32] regression line of y= mx + c in that
[33:36] your y is dependent variable and x is
[33:39] independent variable. Okay. So let's
[33:42] understand linear regression in depth.
[33:44] For this we'll be plotting few random
[33:46] points and check whether the two
[33:48] variables which lie on the x-axis and on
[33:51] the y-axis are actually even related to
[33:53] each other or not. Okay, I'll see the
[33:56] relationship between x and y variable.
[33:58] So on the x-axis we have 1, 2, 3, 4 and 5.
[34:02] Let's plot them. X-axis: 0, 1, 2, 3, 4, 5, 6.
[34:08] Okay. And on the y-axis we have 4, 3, 4, 2,
[34:12] 5. So let's plot it again on the y-axis:
[34:15] 1, 2, 3, 4 and 5. Maximum is five. Okay.
[34:19] Now let's plot our points. So we need to
[34:22] plot x = 1 and y = 4. So x = 1 here and y = 4,
[34:30] this point. So the coordinates of this
[34:32] point would be (1, 4). Okay. Similarly the
[34:36] next coordinate that we have is x = 2 and
[34:39] y = 3, so (2, 3). Similarly, the next coordinate
[34:42] is (3, 4), then (4, 2), then (5, 5). Okay. Now our
[34:47] goal is to create a regression line
[34:49] which passes through all these points
[34:51] with the least error. Okay. So for that
[34:54] the very first step that we'll do is
[34:56] calculate the mean of both X and Y. Uh
[34:58] remember the line would pass through the
[35:00] mean of X and Y always. So the very
[35:04] first step that we'll do we'll calculate
[35:06] the mean of X and Y. So the mean of x is 1 +
[35:10] 2 + 3 + 4 + 5, that is 15, and 15 / 5 is 3.
[35:15] So we'll draw a line x = 3. So
[35:19] this is our line. Next we'll calculate
[35:21] the mean of y, that is 4 + 3 + 4 + 2 + 5,
[35:26] that is 18 / 5. Okay. And we'll draw the
[35:29] line y = 3.6. Okay. The equation of the
[35:32] green line is y = 3.6. So the
[35:36] intersection of both the lines would be
[35:38] the point through which the line of
[35:40] regression would pass. So the
[35:42] coordinates of this point are (3, 3.6).
[35:46] And the equation of the line is y= mx +
[35:49] c. Okay. Now our goal is to find the
[35:52] value of m and c. Let's see how we are
[35:55] going to find it. So for finding the
[35:57] value of m we have the formula: summation
[35:59] of (x - x bar) * (y - y bar) upon summation of
[36:05] (x - x bar) whole square. Okay. So let's
[36:08] calculate each term one by one. So
[36:10] first let's see x - x bar. So x - x bar is
[36:15] nothing but the distance of all the
[36:17] points from the line x = 3. Okay. So for the first
[36:21] point, the value of x is 1
[36:24] and x bar is 3. So 1 - 3, that is -2. For the
[36:29] next point, x is 2 and x bar is 3. So
[36:33] 2 - 3, that is -1. Then 3 - 3 is 0, 4 - 3 is 1, and 5
[36:40] - 3 is 2. Okay. So we got the values of x -
[36:44] x bar as -2, -1, 0, 1 and 2. Now we'll
[36:48] calculate the value of y - y bar. And
[36:51] what is that? It is the distance of all
[36:53] the points from the line y = 3.6.
[36:57] Okay. So, y - y bar. We have y as 4 and
[37:00] y bar as 3.6. So, 4 - 3.6.
[37:05] How much? 0.4.
[37:08] Next 3 - 3.6 that is - of 0.6.
[37:14] Next is 4 - 3.6, that is 0.4. Then 2 -
[37:20] 3.6, that is -1.6, then 5 - 3.6, that
[37:25] is 1.4. Okay. Now we'll calculate the
[37:29] value of (x - x bar) whole square. Okay. So
[37:33] let's calculate the value of (x - x bar)
[37:35] whole square. So this is the square of
[37:37] this green box. So -2 squared is 4, -1
[37:41] squared is 1, 0 squared is 0, 1 squared is 1, and 2 squared is 4.
[37:45] Okay. Now we have to calculate the
[37:48] product of x - x bar and y - y
[37:52] bar.
[37:54] This is (x - x bar) * (y - y bar). So -2 * 0.4,
[38:02] that is -0.8. Similarly, -1 * -0.6
[38:07] is 0.6. Then 0. Then 1 * -1.6
[38:12] is -1.6. Then 2 * 1.4, it's
[38:16] 2.8. So we got all the values. Now next
[38:20] our task would be to sum them. So the
[38:22] summation of (x - x bar) * (y - y bar) upon the
[38:26] summation of (x - x bar) whole square. So
[38:28] let's calculate the summation of each
[38:30] one of them. So the summation of (x - x bar)
[38:33] whole square is 4 + 4, that is 8, plus 2, that is 10. Okay.
[38:37] And the summation of -0.8 + 0.6 + 0 -
[38:42] 1.6 + 2.8 is 1. How? So it's like 2.8
[38:47] + 0.6, that is 3.4, and then
[38:53] -1.6 and -0.8, right? That's -2.4. So
[38:57] 3.4 - 2.4 is 1.0. So we'll get the value
[39:01] as 1 by 10. So we got the value of the slope
[39:05] as 0.1. Now let's talk about our
[39:08] equation of regression line. So for that
[39:11] we have to find an equation of a line
[39:13] such that we have a slope and it passes
[39:16] through a particular point. So our slope
[39:18] is 0.1 and it passes through the point 3
[39:21] and 3.6 which we have already calculated
[39:24] right where I said that the regression
[39:25] line would definitely pass through the
[39:27] mean of all these points. Okay. So we
[39:31] have y as 3.6, this line, m is 0.1, and
[39:38] we have x as 3, right, this line. So we
[39:43] have to calculate the value of c. We have
[39:46] the equation 3.6 = 0.1 * 3 + c, so c = 3.6
[39:52] - 0.3, that is 3.3. So we got the
[39:57] value of c as 3.3. So now we know that the
[40:00] regression line would intersect the y-axis at
[40:03] 3.3 and it will pass through
[40:06] the point (3, 3.6). So for the
[40:10] regression line, for m = 0.1 and c = 3.3,
[40:15] we'll get the equation as y = 0.1x +
[40:18] 3.3. So this is our regression line.
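The hand calculation above can be checked with a short plain-Python sketch. It uses the same five points from the walkthrough; the variable names (`x_bar`, `y_bar`, `numerator`, `denominator`) are just illustrative choices.

```python
# The five points from the worked example
xs = [1, 2, 3, 4, 5]
ys = [4, 3, 4, 2, 5]

# Step 1: means of x and y (the regression line passes through this point)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Step 2: slope m = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
denominator = sum((x - x_bar) ** 2 for x in xs)
m = numerator / denominator

# Step 3: intercept from y_bar = m * x_bar + c
c = y_bar - m * x_bar

print(round(m, 4), round(c, 4))  # prints 0.1 3.3
```

Rounding is used only for display, because floating-point arithmetic can leave tiny residues in the last digits.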
[40:21] Okay. So now that we have drawn our
[40:23] regression line, we have to check
[40:25] whether the dependent variable even
[40:28] depends on the independent variable or
[40:30] not. Or in other words, we have to find
[40:32] out if our independent variable is even
[40:34] related to the dependent variable or
[40:36] not. Right? So for that, what we'll do:
[40:38] for the given m = 0.1 and c = 3.3, we have to
[40:42] predict the values for y when x = 1, 2, 3, 4
[40:46] and 5. So for x = 1, m = 0.1 and c = 3.3,
[40:51] the predicted value of y would be 3.4.
[40:54] Okay. Similarly, for x = 2, y predicted
[40:58] is 3.5. For x = 3, y predicted is 3.6.
[41:03] Similarly, for x = 4, y predicted is 3.7.
[41:07] And for x = 5, y predicted is 3.8. All
[41:11] right? And the line passing through all
[41:13] the predicted values of coordinate X and
[41:17] Y predicted is nothing but the line of
[41:20] regression. So next is finding the best
[41:23] fit line. So in order to find the best
[41:25] fit line, you have to do multiple
[41:26] iterations for different values of m
[41:29] and c. What you'll do: you'll calculate
[41:31] the error for different values of m and c,
[41:34] and the line having the minimum error
[41:36] would be your final line, or the best
[41:38] line of regression. For now just check
[41:41] if our independent variable is even
[41:43] related to dependent variable or not.
[41:45] Right? This was our main goal. So for
[41:47] that we have to calculate R square. R
[41:50] square is a parameter which tells us
[41:52] whether the independent variable is
[41:54] related to dependent variable or not. If
[41:56] yes then by how much factor? If no then
[41:59] by how much? Okay. So the formula for R
[42:01] square is the sum of (predicted value minus mean)
[42:04] squared, divided by the sum of (actual value minus
[42:06] mean) squared.
[42:09] So it's like the summation of (y predicted
[42:11] minus y bar) whole square divided by the
[42:13] summation of (y - y bar) whole square. Okay. So
[42:16] let's calculate it. So what all we have
[42:20] for now? We have x and y predicted for
[42:23] x = 1, y = 3.4; 2, 3.5; 3, 3.6; 4, 3.7; 5, 3.8. So
[42:30] for now let's note the values of y, as we
[42:33] need them in the calculation. So we had y
[42:36] as 4, 3, 4, 2, 5. Okay. Now we'll calculate YP
[42:40] minus Y bar. The numerator of R square,
[42:43] that is YP minus Y bar. So YP - Y bar is
[42:47] 3.4 - 3.6. Okay, it's -0.2. Next
[42:53] is -0.1. Then 3.6 - 3.6 is 0. Then 3.7 - 3.6
[42:59] is 0.1. And 3.8 - 3.6 is 0.2. So next
[43:04] we'll calculate is y - y bar that is 4 -
[43:08] 3.6 that is 0.4.
[43:12] Okay. Next 3 - 3.6 that is minus of 0.6.
[43:17] 4 - 3.6 that is 0.4. 2 - 3.6 that is
[43:22] minus of 1.6. Then 5 - 3.6 that is 1.4.
[43:27] Okay. So let's move ahead. Next what we
[43:29] have to do? We have to square them. So
[43:32] first we'll calculate y predicted minus
[43:34] y bar whole square. So it's -0.2
[43:38] squared, that is 0.04.
[43:41] Then -0.1 squared, that is 0.01,
[43:45] 0 squared is 0, 0.1 squared is again 0.01, and 0.2 squared, that
[43:50] is 0.04.
[43:52] Okay. Then we'll calculate the value of
[43:54] y - y bar square that is 0.4 square that
[43:57] is 0.16.
[44:00] Then -0.6 whole square, that is
[44:03] 0.36, again 0.4 squared, that is 0.16, -1.6
[44:08] squared, that is 2.56, then again
[44:12] 1.4 squared, that is 1.96. So if you
[44:15] sum them, so for the formula of R
[44:18] square we have to sum the values of (YP
[44:20] minus Y bar) whole square and (Y - Y bar) whole
[44:24] square. Okay. So we have to sum
[44:26] them. So the summation of (YP - Y bar) squared
[44:29] is 0.1 and the summation of (Y - Y bar) squared
[44:33] is 5.2. So if we put these values in the
[44:36] formula we'll get R² as 0.1 / 5.2,
[44:42] which is approximately equal to 0.019.
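The R square arithmetic can be reproduced with a small plain-Python sketch, using the five points and the fitted line y = 0.1x + 3.3 from the worked example. The names `explained` and `total` are illustrative labels for the numerator and denominator, not terms used in the course.

```python
# Actual points and the fitted line from the worked example
xs = [1, 2, 3, 4, 5]
ys = [4, 3, 4, 2, 5]
m, c = 0.1, 3.3

y_bar = sum(ys) / len(ys)            # mean of the actual values, 3.6
y_pred = [m * x + c for x in xs]     # predictions on the regression line

# Numerator: sum of (y_pred - y_bar)^2; denominator: sum of (y - y_bar)^2
explained = sum((yp - y_bar) ** 2 for yp in y_pred)
total = sum((y - y_bar) ** 2 for y in ys)

r_squared = explained / total
print(round(explained, 4), round(total, 4), round(r_squared, 3))  # prints 0.1 5.2 0.019
```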
[44:46] That means the value of R² is almost
[44:49] equal to zero. So when you get R square
[44:51] value tending to zero, then in that case
[44:53] you can say that the independent variable is
[44:55] not at all related to your dependent
[44:57] variable. Okay. So it's like the higher the
[45:00] value of R square, the more you can say that
[45:02] your dependent variable depends
[45:04] on, or is related to, your independent
[45:06] variable. If you have the value of R
[45:08] square as 0.6, you'll find that the error
[45:11] in this case, or the difference between
[45:13] the actual point and the predicted point,
[45:15] is less as compared to the line where R²
[45:18] is almost equal to zero. So if you keep
[45:20] on increasing the value of R square, the
[45:22] error would keep on reducing, or the
[45:25] distance between the actual point and
[45:27] the predicted point would keep on
[45:29] reducing. For the line where R² tends to 1, that
[45:32] would be a line in which your predicted
[45:34] points and the actual points almost
[45:36] overlap each other. Okay, now let's
[45:38] just move forward and perform a hands-on
[45:41] in Python and let me show you how you
[45:43] can calculate and plot the linear
[45:45] regression in Python. So that is where
[45:47] I'm going to write my linear regression
[45:49] code. Well, if you see there are
[45:51] multiple data sets available out there
[45:53] on the internet, but for now we'll
[45:54] implement the simple linear regression
[45:56] on our own data set. For this, we'll be
[45:59] considering two variables, X and Y. X
[46:02] would be the house size, which ranges
[46:04] from 1,000 square ft to 10,000 square
[46:07] ft, and Y, which is the dependent
[46:09] variable, would be the cost of the house,
[46:11] which ranges from 300,000 to 1,200,000.
[46:15] Okay. Now since uh we are using simple
[46:18] linear regression so we have only one
[46:20] factor that is size of the house which
[46:22] is affecting the price of the house.
[46:24] Right? When you are dealing with
[46:26] multiple linear regression, we would
[46:27] have more than one factor affecting the
[46:29] house like locality or the number of
[46:32] room etc. Don't worry, we'll deal with
[46:34] them later. But for now, as we are
[46:36] dealing with simple linear regression,
[46:38] so we have one variable which is
[46:39] dependent on the other one. Okay. So now
[46:41] our goal is to find a regression line. A
[46:44] line which fits the best if we plot both
[46:47] our variables X and Y. So both x and y
[46:51] so that we can predict the response y
[46:53] that is the cost of the house for any
[46:55] new value of x that is size of the
[46:57] house. Okay. So let's start coding. So
[47:00] first of all let's import the required
[47:02] libraries. So we'll be needing numpy and
[47:04] matplotlib.
[47:06] So import
[47:08] numpy as np and I'll be importing
[47:14] matplotlib.pyplot
[47:21] as plt. Okay, as we have discussed
[47:24] already what numpy is and what
[47:26] matplotlib is, so just to give you a
[47:28] recap: numpy is a library for the Python
[47:30] programming language which is generally
[47:32] used in machine learning because we have
[47:34] to deal with large data in machine
[47:36] learning, and it is faster than the
[47:37] normal array or list.
[47:50] Okay. And matplotlib is a plotting
[47:52] library for Python which is generally
[47:54] used for plotting graphs. Okay. So
[47:56] here we are importing numpy as np and
[47:58] matplotlib.pyplot as plt. This is
[48:01] done to rename the long name to
[48:03] something shorter. Okay, just to ease
[48:06] it. That's it. So instead of writing
[48:08] numpy.array, the short form is np.array.
[48:12] That's it. Or instead of matplotlib.pyplot.plot of x
[48:16] and y, the short form would be plt.plot of
[48:19] x and y. That's it. That's just for ease
[48:21] of coding. Okay. So now that we have
[48:23] imported our libraries, our next task
[48:26] would be to create a function. So we
[48:27] have to create a function to estimate
[48:29] the coefficient of x and y values that
[48:31] are passed into this function.
[48:34] So we'll start with def and the name of the
[48:37] function, estimate the coefficient,
[48:43] of x and y.
[48:47] And in this, we have to define first
[48:49] the size of the data set, or the number of
[48:51] observations or points. So the number of
[48:53] observations or points, let's say n. My
[48:57] n equal np dot
[49:01] size of x, or size of y, anything. Okay,
[49:05] then we'll calculate the mean of x and y.
[49:08] Since we are using numpy, just calling
[49:10] the mean of numpy is sufficient. Okay,
[49:13] there's a predefined function mean in
[49:15] that.
[49:16] So mean of x and
[49:22] mean of y equal np dot
[49:27] mean of x
[49:30] and np dot mean of y. Fine. Next, we'll
[49:36] be calculating SS_xy
[49:38] and SS_xx, which are nothing but the sum
[49:42] of cross deviations of y and x, and the sum of squared deviations of x. So my SS_xy
[49:47] will be equal to np dot sum of y times x,
[49:56] minus n times
[50:00] mean of y multiplied by
[50:05] mean of x.
[50:08] Fine, these two variables, mean x and mean
[50:11] y. n is what?
[50:13] So np dot size of x. So total size of
[50:17] your observation multiplied by mean of y
[50:20] multiplied by mean of x. Okay. Next sum
[50:24] of squares of x,
[50:27] SS_xx, equal
[50:30] np dot sum of x * x, minus n *
[50:38] mean of x
[50:41] * again mean of x.
[50:44] Fine. Now we'll be calculating the
[50:46] regression coefficients, that is, the
[50:49] slope and the intercept of the
[50:51] regression line. So for
[50:54] that, let's say b1 equal
[50:59] SS_xy
[51:02] divided by
[51:06] SS_xx,
[51:08] and the value of my b0 will be equal to
[51:13] mean of y minus b1 *
[51:19] mean of x. Okay. And from this function
[51:23] I want to return the values of
[51:25] b0
[51:28] and b1. So this is my coefficient
[51:31] estimation function. Okay. So this is my
[51:34] estimate coefficient function which will
[51:36] be used to determine or estimate the
[51:38] coefficient when X and Y values are
[51:40] passed into the function. So my next
[51:42] function would be to plot the graph
[51:44] based on the calculated values. So I'll
[51:47] create a function for plotting my graph.
[51:48] So plot my regression line,
[51:55] and inside that I'll pass three parameters:
[51:58] x, y and b. Okay. So now we'll be
[52:02] plotting our points as per our data set.
[52:04] So since this is a function, let's
[52:06] define a function for that. So I need a
[52:08] scatter plot, plt.scatter,
[52:11] and inside that the x coordinate and y coordinate;
[52:13] these are the locations of the points on
[52:15] the graph. Then I'll specify the color.
[52:18] So I'll select the color as m. Color
[52:21] here is basically the color of the
[52:22] plotted point. You can change it to red
[52:24] or green or orange or any color
[52:26] depending on your need. Then we'll
[52:28] define a marker. So marker is the shape
[52:30] of the point like a circle or any other
[52:33] symbol. Okay. So I'll specify the marker as a
[52:35] circle, so marker equal o. So these are
[52:40] basically optional. You can just
[52:42] write plt.scatter with x and y;
[52:44] that is also enough. Okay. Next, what
[52:46] we'll do: we need to predict. So next we
[52:49] have to create a predicted response
[52:51] vector. So predicted values of y. So y
[52:54] predicted equal
[52:57] b0
[52:58] + b1 * x. Next we'll plot the
[53:04] regression line. So plt dot plot, x and
[53:09] the predicted values of y, y pred. And
[53:14] let's take the color as green.
[53:18] Okay. So next we'll label
[53:20] both the axes. So plt dot xlabel,
[53:24] it's what? Size.
[53:28] And plt dot ylabel, it's what? Cost.
[53:35] Okay. And finally plt dot show to show
[53:40] the plotted graph. Okay. Fine. Now
[53:43] finally let's create our data set and
[53:45] call these functions. Now let's add some
[53:47] points on our x-axis. So x = np dot array,
[53:53] with all the things I want to add: 1, 2, 3, 4, 5, 6, 7, 8,
[53:59] 9 and 10. These are house sizes ranging
[54:03] from 1,000 square ft to 10,000 square
[54:06] ft. Okay.
[54:08] Here 1 represents 1,000 and 10
[54:11] represents 10,000. Okay. And y = np
[54:15] dot array,
[54:17] the cost of the house: 300k,
[54:21] 350k,
[54:23] 500k,
[54:25] 700,
[54:28] 800, 850, 900,
[54:33] again 900, then 1,000,
[54:37] 1,000 and finally 1,200.
[54:41] So these are our arrays. Our next task
[54:43] would be to estimate the coefficients. So
[54:46] b equal estimate the coefficient
[54:52] of x and y. Both the values will be
[54:55] taken from here. Then print
[54:58] estimated coefficients,
[55:10] backslash n,
[55:12] b0 equal a placeholder, backslash n, b1 equal a placeholder, dot format,
[55:23] b of 0
[55:25] comma b of 1. And finally we'll call
[55:31] plot underscore
[55:35] regression line
[55:38] with the values of x, y and b. Here b holds the
[55:42] estimated coefficients. Okay, let's
[55:45] execute it
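The coefficient function dictated above can be reconstructed roughly as follows. The function name `estimate_coef` and the exact print format are assumptions, since the transcript does not show the code as typed. To keep the result checkable, the sketch runs on the five points from the earlier hand calculation rather than the house data.

```python
import numpy as np

def estimate_coef(x, y):
    """Estimate intercept b0 and slope b1 of the least-squares line y = b0 + b1 * x."""
    n = np.size(x)                       # number of observations
    mean_x, mean_y = np.mean(x), np.mean(y)

    # Sum of cross deviations of y and x, and sum of squared deviations of x
    ss_xy = np.sum(y * x) - n * mean_y * mean_x
    ss_xx = np.sum(x * x) - n * mean_x * mean_x

    b1 = ss_xy / ss_xx                   # slope
    b0 = mean_y - b1 * mean_x            # intercept
    return (b0, b1)

# Five points from the earlier worked example (expected: b0 about 3.3, b1 about 0.1)
x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 3, 4, 2, 5])
b = estimate_coef(x, y)
print("Estimated coefficients:\nb0 = {}\nb1 = {}".format(b[0], b[1]))
```

The shortcut `sum(x * y) - n * mean_x * mean_y` is algebraically the same as `sum((x - mean_x) * (y - mean_y))`, so this function gives the same slope as the manual method.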
[55:47] built-in function.
[55:50] Okay, I think we have missed these
[55:54] brackets over here
[55:57] and here.
[56:01] [Music]
[56:02] Okay, so we got the value of b0 as
[56:05] -7.5 and b1 as 137.72.
[56:09] We got an error: name plot regression line
[56:12] is not defined.
[56:14] Let's just copy this as it is.
[56:19] Okay.
[56:22] Double D. Double D. Where is that? Okay.
[56:26] Sorry.
[56:32] Okay. So, we got our regression line.
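The plotting helper described earlier can be sketched along these lines. This is a reconstruction under assumptions: the name `plot_regression_line` follows the spoken "plot underscore regression line", returning `y_pred` is an addition made here so the result can be inspected, and the data are the five points from the hand calculation rather than the house prices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_regression_line(x, y, b):
    """Scatter the observations and draw the fitted line b[0] + b[1] * x."""
    # Actual points: magenta circles (color and marker are optional)
    plt.scatter(x, y, color="m", marker="o")

    # Predicted response vector on the regression line
    y_pred = b[0] + b[1] * x

    # Regression line in green, then axis labels
    plt.plot(x, y_pred, color="g")
    plt.xlabel("Size")
    plt.ylabel("Cost")
    return y_pred  # returned only for inspection; not part of the spoken walkthrough

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 3, 4, 2, 5])
y_pred = plot_regression_line(x, y, (3.3, 0.1))
# plt.show()  # uncomment to display the figure, as done in the video
```

A `NameError` like the one encountered in the video usually means the function was called under a name that differs from the one used in its `def` line, so the two spellings need to match exactly.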
[56:35] The plot which we got is a line of
[56:37] linear regression. So, the plot which we
[56:40] got is our linear regression line. This
[56:42] green line is nothing but linear
[56:44] regression line. So what basically
[56:46] happens in simple linear regression is
[56:48] there is just one independent variable.
[56:51] So there is one dependent variable and
[56:53] one independent variable and you try to
[56:55] understand how does the dependent
[56:57] variable change with respect to that
[56:59] independent variable. So let's go ahead
[57:01] and start a demo. So our first task
[57:03] would be to load the Boston data set and
[57:06] we'll be loading the Boston data set
[57:08] with the help of the pd.read_csv function. So
[57:11] again for that we'll have to import the
[57:13] pandas library first. So I'll just type
[57:15] in import pandas as pd and then what
[57:18] I'll do is I will load the Boston csv
[57:21] data file and I'll store this in this
[57:23] object and name it as data. Now after
[57:26] that I'll have a glance at the first few
[57:28] records of this data set. Right? So this
[57:31] is our Boston data set and it comprises
[57:33] all of these columns. So there is
[57:35] crim, zn, indus, chas, nox, rm and so on. Now
[57:40] out of these, the main column which we
[57:42] are interested in is medv. So this medv
[57:45] column basically indicates the median
[57:47] price value of the houses in Boston. And
[57:51] for our simple linear regression, what
[57:53] we're going to do is we're going to take
[57:55] this as our dependent variable and we're
[57:57] going to take lstat as our independent
[58:00] variable. So here lstat signifies the
[58:03] percent of the population which is below the
[58:05] poverty line. All right. Now let's go
[58:07] ahead and have a glance at the shape of
[58:09] this data set. So we'll type in
[58:11] data.shape. So what we get is 506 and 14.
[58:15] Now this basically means that there are
[58:17] 506 rows in this data set and 14 columns
[58:20] in total. Now let me also go ahead and
[58:22] use the describe method on this data
[58:24] set. So I'll just type in data.describe().
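The loading and inspection steps can be sketched as below. Because the path of the Boston CSV file is not given in the video, this sketch reads a tiny in-memory stand-in instead; the three rows are made-up values used only to show the calls, and the real file has 506 rows and 14 columns.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the Boston CSV; the numbers are invented
csv_text = """crim,lstat,medv
0.1,10.0,20.0
0.2,20.0,15.0
0.3,5.0,30.0
"""

# With the real file this would be pd.read_csv("<path to the Boston CSV>")
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())       # first few records
print(data.shape)        # (rows, columns)
print(data.describe())   # count, mean, std, min, quartiles, max per column
```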
[58:28] Right now this just gives me the
[58:30] different aggregate functions on top of
[58:33] all of these columns. So I've got the
[58:34] count, mean, standard deviation, the
[58:37] minimum value, the 25 percentile, 50
[58:39] percentile, 75 percentile, and the
[58:42] maximum value. Right? So let me actually
[58:44] take this crime rate over here. Now this
[58:46] count which you see this basically tells
[58:48] you the number of records in each of
[58:50] these columns. And as we already know
[58:52] since all of these columns belong to the
[58:54] same data set and hence the number of
[58:56] records should be the same for all of
[58:58] these columns. And the mean crime rate
[59:00] we see that it's around 3.6, the minimum
[59:02] crime rate is 0.006, and the maximum is
[59:05] around 88, right? So these are the
[59:07] different summary statistics which you
[59:09] can get from this data set now let's
[59:12] just have a glance at our two main
[59:14] variables which would be our independent
[59:16] variable and the dependent variable so
[59:18] as I've already told you guys our
[59:19] independent variable would be this lstat
[59:22] column and the dependent variable would
[59:24] be this medv column. Now from this entire
[59:27] data set I'd be only segregating these
[59:30] two columns and storing that in data
[59:33] underscore and after that I'll have a
[59:35] glance at the first five records. All
[59:38] right, so this is my independent
[59:40] variable and this is my dependent
[59:42] variable. Now let me go ahead and make a
[59:44] plot with respect to these two and I'll
[59:46] be plotting the lstat column onto the
[59:49] x-axis and the medv column onto the y-axis.
[59:53] So for that purpose I'd have to load the
[59:54] matplotlib package. So I'll just type
[59:57] in import matplotlib.pyplot
[59:59] as plt and I'll type in data.plot. So
[01:00:02] lstat is mapped onto the x-axis and medv
[01:00:05] is mapped onto the y-axis. And I'll just
[01:00:08] go ahead and give the labels to them. So
[01:00:10] the label which I'm giving to the x-axis is
[01:00:12] lstat and the label which I'm giving to the
[01:00:14] y-axis is medv. And then I'll go ahead
[01:00:17] and just plot this. So on the y-axis we have
[01:00:19] medv, on the x-axis we have lstat. Now what we
[01:00:22] see is there is sort of an inverse
[01:00:24] relationship. So as lstat increases, medv
[01:00:28] decreases. Right? So as the population
[01:00:31] percentage which is below the poverty
[01:00:33] line increases the median value of the
[01:00:36] price of the house decreases which is
[01:00:39] actually quite intuitive isn't it? So if
[01:00:41] the population is below poverty line
[01:00:43] obviously they can't really afford
[01:00:45] luxury houses can they? So what we
[01:00:47] basically understand from this is there
[01:00:49] is sort of an inverse relationship
[01:00:51] between these two columns. Right now
[01:00:54] it's finally time to go ahead and build
[01:00:56] our model. And before we do that, we
[01:00:58] have to prepare our data first. So what
[01:01:00] I'm going to do is I'm going to take
[01:01:01] this lstat column and store it in x. And
[01:01:04] similarly, I'll take this medv column and
[01:01:06] store it in y. So this is done. Let me
[01:01:10] have a glance at the size of these two
[01:01:11] objects. So this tells me that there are
[01:01:13] 506 rows in x and y. Well, that is again
[01:01:17] very simple to understand. Now I'll
[01:01:19] finally go ahead and divide this data
[01:01:21] set into training and testing sets. Now
[01:01:24] it's very important to divide our data
[01:01:26] set into training and testing sets
[01:01:28] because if you go ahead and just build
[01:01:30] our model on top of the entire data set
[01:01:33] then there are chances of overfitting
[01:01:35] and it'll fail miserably when new data
[01:01:37] comes in and that is the main reason why
[01:01:39] we'd have to divide the data into training
[01:01:41] and testing sets. So I'll go ahead and
[01:01:44] import train_test_split from
[01:01:46] sklearn.model_selection and I'll use this
[01:01:49] function. So I'm passing in all of these
[01:01:51] parameters. So x is basically all of my
[01:01:53] features. Y is the labels or the
[01:01:55] dependent variable. And then I'll set
[01:01:57] the test size to be equal to 0.2. So
[01:02:00] this basically means that 20% of the
[01:02:02] records would be in the test set and the
[01:02:04] rest of the 80% records would be in the
[01:02:06] training set. Right? And I'll be storing
[01:02:08] all of these results in X_train, X_test,
[01:02:11] y_train, and y_test. So X_train over
[01:02:13] here basically denotes the training set
[01:02:16] for the lstat column. Similarly, X_test over
[01:02:18] here denotes the test set for the lstat
[01:02:21] column. And then we have y_train, which
[01:02:23] denotes the training set for the medv column.
[01:02:26] And then we have y_test, which is
[01:02:28] basically the test set for the medv column.
[01:02:31] Right? Now again let me go ahead and
[01:02:33] have a glance at the shapes of all of
[01:02:35] these four objects. So the shape of
[01:02:37] X_train is (404, 1).
[01:02:39] So this means that there are 404 records
[01:02:42] in X_train. And then we have X_test. So
[01:02:45] the shape of X_test is (102, 1). This means
[01:02:47] there are 102 records in X_test, and it's
[01:02:50] the same for y_train and y_test. All
[01:02:53] right. Now I'll go ahead and import
[01:02:55] LinearRegression from sklearn.linear_model
[01:02:58] and create an instance of it, and
[01:03:00] I'll store that in regressor. And then
[01:03:03] finally I'll go ahead and fit this model
[01:03:05] on top of the train set. So I'll type in
[01:03:07] regressor.fit and I'll pass in these
[01:03:10] two objects, which are basically X_train
[01:03:12] and y_train, which together are nothing but
[01:03:14] your training set. Right. So we have fit
[01:03:16] the model on top of the training set. As
[01:03:18] you already know, linear regression is
[01:03:19] basically built on top of a straight
[01:03:21] line, or in other terms it is basically
[01:03:23] y = mx + c. Here y is your dependent
[01:03:26] variable and x is your independent
[01:03:28] variable, and you're trying to understand
[01:03:30] how y varies with x. Now apart from y
[01:03:33] and x you have two other terms over here.
[01:03:35] So the two other terms are m and c. Now m
[01:03:38] is your slope and c is your intercept.
[01:03:41] So let's go ahead and find out the
[01:03:43] values of m and c. Right? So I'll go
[01:03:46] ahead and print the value of c, which is
[01:03:47] basically the y-intercept. So I'll print
[01:03:50] regressor.intercept_. So we see that
[01:03:52] the value of the intercept is 34.33.
[01:03:55] And similarly I'll go ahead and also
[01:03:57] print the value of the coefficient, which is
[01:03:58] nothing but the slope. So the slope
[01:04:00] value is -0.92.
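The split-and-fit steps described above can be sketched roughly as follows. This is a minimal sketch, not the notebook from the video: the Boston CSV is not available here, so synthetic `lstat` and `medv` arrays (an assumption, generated with a built-in negative slope) stand in for the real columns, and `random_state=0` is an arbitrary seed chosen only to make the run repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the two Boston columns (assumption, not the real data):
# 506 rows, with medv falling as lstat rises.
rng = np.random.default_rng(0)
lstat = rng.uniform(2, 38, size=506)
medv = 34.0 - 0.9 * lstat + rng.normal(0, 5, size=506)

X = lstat.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = medv

# Hold out 20% of the rows for testing, as in the walkthrough.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)

# Fit y = m*x + c on the training rows only.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print("intercept (c):", regressor.intercept_)
print("slope (m):", regressor.coef_[0])
```

With 506 rows and `test_size=0.2`, the split yields 404 training rows and 102 test rows, matching the shapes quoted in the walkthrough. The fitted numbers will differ from 34.33 and -0.92 because the data here is synthetic.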
[01:04:02] Now, you can see that there's a negative
[01:04:05] value associated with the coefficient,
[01:04:08] and this basically means that as the
[01:04:10] independent variable increases the
[01:04:13] dependent variable would decrease or in
[01:04:15] other terms there is an inverse
[01:04:17] relationship between the independent
[01:04:19] variable and the dependent variable. So
[01:04:22] now that we've built the model let's go
[01:04:23] ahead and predict the values on top of
[01:04:25] the test set. So I'll type in
[01:04:28] regressor.predict and I'll give X_test as the
[01:04:31] parameter inside this. All right. So I
[01:04:34] have also predicted the values on top of
[01:04:36] X_test. So it's finally time to go ahead
[01:04:39] and find out the error in prediction. So
[01:04:42] for that we'll be importing metrics from
[01:04:44] sklearn and I'll be having a glance at
[01:04:46] different metrics such as mean absolute
[01:04:48] error, mean squared error and root mean
[01:04:50] squared error. I'll click on run and we
[01:04:53] have these values. So mean absolute
[01:04:55] error comes out to be around 5.07. Mean
[01:04:58] squared error comes out to be 46.99
[01:05:01] and root mean squared error comes out to
[01:05:03] be 6.85.
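For reference, the three metrics can be written out by hand. The sketch below uses plain Python rather than `sklearn.metrics`, and the `y_true` and `y_pred` lists are made-up illustration values, not numbers from the video.

```python
import math

def mean_absolute_error(y_true, y_pred):
    # Average size of the errors, ignoring their sign.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    # Average of the squared errors; large misses are penalised more.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    # Square root of the MSE, back in the units of the target.
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical actual vs. predicted values, for illustration only.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [26.0, 20.6, 30.7, 33.4]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)

# Consistency check on the figures quoted above: RMSE is the square root of MSE.
rmse_from_quoted_mse = math.sqrt(46.99)
print(round(rmse_from_quoted_mse, 2))
```

The last line confirms that the quoted mean squared error of 46.99 and root mean squared error of 6.85 agree with each other.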
[01:05:05] Now what you need to understand is the
[01:05:07] lower the value of root mean squared
[01:05:10] error the better the model. And this
[01:05:12] basically helps you to compare multiple
[01:05:14] models. So let's say you have model one
[01:05:16] and model two. And model 1's root mean
[01:05:19] squared error is around 10. And model
[01:05:20] 2's root mean squared error is around
[01:05:22] five. So now when you see these two
[01:05:25] metrics, you can clearly say that model
[01:05:27] 2 is better than model one because its
[01:05:30] root mean squared error is lower than
[01:05:32] model one's. All right. So we have
[01:05:34] successfully built a simple linear
[01:05:36] regression model. Now let's go ahead and
[01:05:38] build a multiple linear regression model
[01:05:40] where we'll have multiple independent
[01:05:42] variables. Right? So we'll pretty much
[01:05:45] perform the same steps over here. I'll
[01:05:48] go ahead and load the pandas and numpy
[01:05:50] package and I'll load up this Boston
[01:05:52] data set and store this in this data set
[01:05:54] object. Again, I'll have a glance of the
[01:05:57] data set which is basically the same
[01:05:58] data set which we were dealing with.
[01:06:00] Now, this is where the change comes. So,
[01:06:02] I'll be taking all of the columns except
[01:06:06] the last column into this X object. And
[01:06:10] this X object basically denotes all of
[01:06:12] my independent variables. And in y
[01:06:15] object I'm just taking the final column.
[01:06:17] So here the final column represents this
[01:06:20] medv column. Right? So all of these
[01:06:22] columns over here that is the first 13
[01:06:25] columns would be my independent
[01:06:27] variables and the 14th column would be
[01:06:29] my dependent variable. Right? So I'm
[01:06:32] storing this in x and y. Now I'll do the
[01:06:35] same thing. I'll go ahead and divide
[01:06:37] this data set into train and test sets.
[01:06:39] And over here I'm setting the test size
[01:06:41] to be equal to 0.3. So this means that
[01:06:44] 30% of the records would be in the test
[01:06:46] set and the rest 70% of the records
[01:06:48] would be in the training set.
[01:06:51] Right? Now I'll also go ahead and fit
[01:06:54] the model on top of the train set. So
[01:06:56] I'll just say regressor.fit and I'll
[01:06:58] pass in X_train and y_train as the
[01:07:00] parameters. I'll click on run. Right? So
[01:07:03] we've successfully fit the model and
[01:07:04] we'll also go ahead and predict the
[01:07:06] values on top of the test set. So over
[01:07:08] here I'll type in regressor.predict
[01:07:10] and I'll pass in X_test as the
[01:07:13] parameter. Right. So we have also
[01:07:16] predicted the values on top of the test
[01:07:18] set. Now I'll go ahead and find out all
[01:07:21] of the metrics. So we'll see the same
[01:07:23] metrics, which are basically mean
[01:07:24] absolute error, mean squared error, and
[01:07:26] root mean squared error. Right? And this
[01:07:29] is what we get this time. Let me compare
[01:07:31] these values with the first model. So for
[01:07:35] the first model we saw that the root
[01:07:37] mean squared error came out to be 6.85,
[01:07:39] and for the second model the root mean
[01:07:42] squared error comes out to be 5.54.
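A comparison like this can be sketched as below. The data is synthetic (an assumption: 506 rows, 13 features, and a target built from hypothetical `true_weights`), so the RMSE values will not match 6.85 and 5.54. Unlike the walkthrough, which used a 20% test split for the first model and 30% for the second, this sketch scores both models on the same held-out rows so the comparison is like for like. The helper name `rmse_on_test_set` is made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 feature columns and the target (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(506, 13))
true_weights = rng.uniform(-2, 2, size=13)  # hypothetical coefficients
y = 22.0 + X @ true_weights + rng.normal(0, 1.0, size=506)

# 30% of the rows go to the test set, as in the second model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def rmse_on_test_set(columns):
    """Fit on the chosen feature columns and return RMSE on the held-out rows."""
    model = LinearRegression().fit(X_train[:, columns], y_train)
    predictions = model.predict(X_test[:, columns])
    return float(np.sqrt(mean_squared_error(y_test, predictions)))

rmse_single = rmse_on_test_set([12])             # one independent variable
rmse_multi = rmse_on_test_set(list(range(13)))   # all 13 independent variables
print("simple:", rmse_single, "multiple:", rmse_multi)
```

Because the synthetic target depends on all 13 features, the multiple regression model ends up with the lower RMSE here. On real data that outcome is not guaranteed and should be checked on held-out rows.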
[01:07:45] So as I have already told you guys if
[01:07:47] the root mean squared error is lower for
[01:07:49] a model, it basically means that this
[01:07:52] model is better than the other model. So we
[01:07:55] can basically conclude that model 2 is
[01:07:57] better than model one. And guys, this is
[01:07:59] how we can implement linear regression
[01:08:01] with the help of sklearn. So this is
[01:08:04] Lauren. She's looking for a property to
[01:08:06] buy but she's confused about how to start. So
[01:08:09] she goes to one of her friends, Josh, and
[01:08:12] asks him if he could help her find a
[01:08:15] property with a bigger garden area for
[01:08:17] X bucks. Josh, the guy on the right, agrees to
[01:08:20] help her find a property, but he
[01:08:23] himself doesn't know how.
[01:08:25] So what does he do? He goes to another friend
[01:08:28] of his, explains the whole situation to him,
[01:08:30] and asks him if he can do anything about
[01:08:32] it. He immediately says yes and starts
[01:08:35] doing some calculations. Then he says to
[01:08:38] Josh that spending X bucks can get her a
[01:08:41] property of area Y. Now Josh is confused
[01:08:44] and asks him how he found that out. The
[01:08:47] friend goes like, "Simple linear
[01:08:49] regression." Now let's see how exactly he
[01:08:52] used simple linear regression to solve
[01:08:54] this issue.
[01:08:59] So here are our dependent variable and
[01:09:01] independent variable, property size
[01:09:03] being the dependent variable and money
[01:09:05] being the independent variable. What does
[01:09:07] this guy want to do? He wants to find
[01:09:09] a relation between the property size and
[01:09:11] money. So it's like, if you want a house
[01:09:13] of bigger area or a bigger property size,
[01:09:16] then you have to spend more money. So
[01:09:18] both of them are directly proportional.
[01:09:19] Right? So in this case we'll get a
[01:09:21] positive linear regression line. Right?
[01:09:24] That is, spending more money will get her
[01:09:26] a bigger property. Okay?
[01:09:30] But our next scenario is that she wants
[01:09:33] a bigger garden area, right? So imagine
[01:09:36] the scenario: keeping the property area
[01:09:38] constant, the house area is inversely
[01:09:41] related to the garden area. It's
[01:09:43] like, if you want to increase the garden
[01:09:45] area, you have to reduce the house
[01:09:48] area. So take it like this: suppose you
[01:09:50] have already fixed the property and you
[01:09:53] want to construct a house with a garden
[01:09:54] in it, but now your demand is that you want to
[01:09:57] have a bigger garden area. So if the size
[01:09:59] of the property is fixed and you
[01:10:01] increase the size of the garden, then
[01:10:03] obviously you have to reduce the size of
[01:10:05] your house, right? So in this case, if you
[01:10:08] try to plot a regression line, you'd get
[01:10:10] a negative regression line, right? A bigger
[01:10:13] garden area means a smaller house. Right.
[01:10:17] Now let's take an example to see how
[01:10:19] exactly he made his prediction
[01:10:21] for the property. So in order to find the
[01:10:23] regression line, what he did was take the
[01:10:25] historical data of property areas sold at
[01:10:28] particular prices. Okay. And he
[01:10:30] plotted it on the graph.
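Fitting a line to historical (price, area) points and then reading off the area for a given budget might look like the sketch below. The `prices` and `areas` lists and the `budget` value are hypothetical numbers invented for illustration, and `fit_line` is a made-up helper that applies the ordinary least squares formulas for one independent variable.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past sales: price paid (independent) and property area (dependent).
prices = [100, 150, 200, 250, 300]
areas = [900, 1400, 1800, 2300, 2700]

slope, intercept = fit_line(prices, areas)

# "X bucks" on the price axis is projected onto the line to read off "area Y".
budget = 220
predicted_area = slope * budget + intercept
print(slope, intercept, predicted_area)
```

A positive slope here corresponds to the positive regression line in the story: more money buys a larger property.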
[01:10:34] So these were the plotted points of
[01:10:36] property areas sold in the past at
[01:10:38] particular prices. Okay. Now what he did was
[01:10:41] draw a regression line. Now to find
[01:10:43] out what property area she can buy with
[01:10:45] X bucks, he marks X on the independent
[01:10:49] variable axis, projects it onto the
[01:10:51] regression line, and then reads off, against that
[01:10:53] point, the area Y. This is
[01:10:56] the area Y. So this is how he predicted that
[01:11:00] Lauren can buy Y area of property for
[01:11:03] X bucks. All right. Now let's see what he
[01:11:06] can and what he cannot say from this. So
[01:11:09] he can say that if Lauren spends X
[01:11:11] amount of money she can buy a property
[01:11:14] area of Y. But what he cannot say is
[01:11:18] whether the property will have a good
[01:11:19] neighborhood or not, or whether the location
[01:11:22] will be a noiseless suburb or a bustling city.
[01:11:25] Right? So these are the questions which
[01:11:27] even he cannot answer using this graph.
[01:11:30] So questions like whether the property
[01:11:33] will have a good neighborhood, whether it will
[01:11:36] rain tomorrow or not, or whether this mail is
[01:11:39] spam or not, all these kinds of problems
[01:11:41] fall under a particular category known
[01:11:43] as classification problems in machine
[01:11:45] learning. Now with the linear regression
[01:11:47] algorithm we could not answer these
[01:11:49] problems. So that is where logistic
[01:11:51] regression comes into the picture. Now let
[01:11:53] us see where this logistic regression
[01:11:55] algorithm that we just talked about lies
[01:11:57] in the machine learning algorithm tree.
[01:12:00] So in machine learning we use two
[01:12:02] traditional learning techniques to build
[01:12:04] a predictive model. Supervised learning
[01:12:07] and unsupervised learning. Again look at
[01:12:10] the supervised learning. There are two
[01:12:12] categories regression and
[01:12:14] classification. Right? In regression we
[01:12:16] have linear regression and in
[01:12:18] classification we have logistic
[01:12:20] regression and SVM. So today's topic
[01:12:22] of discussion is logistic regression,
[01:12:25] which comes under the category of
[01:12:26] classification. Okay. So now that we
[01:12:29] have got a bit of an idea about
[01:12:30] logistic regression, let's go a little
[01:12:32] bit deeper and discuss what
[01:12:34] exactly logistic regression is and why
[01:12:37] we use it.
[01:12:40] So what is logistic regression? Well,
[01:12:42] logistic regression is a statistical
[01:12:44] classification model that deals with
[01:12:47] categorical dependent variables. Again,
[01:12:49] you must be wondering what these
[01:12:51] categorical dependent variables are. Well,
[01:12:53] these are discrete variables
[01:12:55] that have two or more categories without
[01:12:57] having any kind of natural order. For
[01:12:59] example, gender, or temperature and area when they are recorded as categories.
[01:13:02] Okay. So, you can say that logistic
[01:13:04] regression is generally used where the
[01:13:06] dependent variable is binary, that is,
[01:13:12] only two outcomes are possible: either
[01:13:14] yes or no, true or false, 1 or 0, etc.
[01:13:17] Right? And also remember that you
[01:13:20] can use both continuous and discrete
[01:13:22] input data with logistic regression.
[01:13:41] So before moving ahead, let's look at this graph. So over here
[01:13:43] we have two different variables: hours of
[01:13:46] studying and probability of passing the
[01:13:48] exam. Can you figure out which one of
[01:13:50] them is dependent and which one of them
[01:13:52] is independent? So if you have guessed
[01:13:55] that hours of studying is your
[01:13:57] independent variable and probability of
[01:13:59] passing the exam is the dependent variable,
[01:14:01] then I'd say that you are 100% correct.
[01:14:04] So now that you know what exactly
[01:14:07] logistic regression is, let's move ahead
[01:14:09] and see why we use logistic
[01:14:11] regression. Well, logistic regression
[01:14:13] can be used as a tool for applied
[01:14:15] statistics and discrete data analysis.
[01:14:18] Why? Because it gives its output in the
[01:14:21] form of probabilities, which helps us to
[01:14:23] easily classify the given data. Okay. So
[01:14:26] this is why we are using logistic
[01:14:28] regression. So now that we have
[01:14:30] successfully established the basics of
[01:14:32] logistic regression by understanding
[01:14:34] the what and why of it, let's go ahead and
[01:14:36] see how logistic regression can be
[01:14:38] applied for classifying data with the
[01:14:40] help of an example.
[01:14:43] So here we are using the example of a spam
[01:14:45] email classifier. We need to build a
[01:14:47] predictive model that would classify
[01:14:49] whether a mail is spam or not. So let's
[01:14:52] look at the approach that we are going
[01:14:54] to take while building this model. First
[01:14:57] we'll try to understand the variables
[01:14:58] on the basis of which we are
[01:15:00] classifying the mail. Next we'll plot
[01:15:03] the labeled data. Once we are done with
[01:15:05] plotting the labeled data we'll draw the
[01:15:06] regression curve. And finally we'll try
[01:15:08] to find the best-fitting curve using the
[01:15:10] maximum likelihood estimator. All right.
[01:15:14] So let's get started. So step one is
[01:15:18] defining the variables. So let's start
[01:15:20] off by understanding what the independent
[01:15:22] variable is in our case. So in our case the
[01:15:24] independent variable is the count of spam
[01:15:26] words. Well, here are some examples of
[01:15:29] commonly used spam words, like buy, get
[01:15:32] paid, guarantee, winner, unlimited, etc.
[01:15:35] Okay. So these are the kind of words
[01:15:38] which, when found in a mail, suggest the mail should be
[01:15:40] treated as spam. If the number of these
[01:15:42] kinds of words is high in a mail, then
[01:15:44] that mail is very likely a spam
[01:15:46] mail. Okay, just for a better
[01:15:48] representation, let me put them in a bag
[01:15:50] of spam words.
[01:15:53] So here's a bag of spam words. Let's put
[01:15:56] all the words in it one by one: buy,
[01:15:59] get paid, guarantee, winner, and
[01:16:02] unlimited. Okay. Now, what about our
[01:16:04] dependent variable? Well, our dependent
[01:16:07] variable is going to be the probability
[01:16:09] of the mail being spam. If the probability
[01:16:11] is one, that means the mail is spam. If
[01:16:14] it's zero, it's not a spam mail.
[01:16:17] Well, in general, a mail with fewer
[01:16:19] words from the list of spam
[01:16:21] words will be treated as not spam, and a mail
[01:16:24] with five or more spam words
[01:16:26] would be treated as a spam mail. But
[01:16:28] there can be cases where you might find
[01:16:30] mails with few spam words being spam.
[01:16:32] Also, you might find cases where a mail
[01:16:35] with a large number of spam words is not a
[01:16:38] spam mail. So here our aim is to build a
[01:16:40] predictive model to classify the mail
[01:16:43] with minimum error. Okay, let's see what
[01:16:46] is our next step. So our next step is
[01:16:48] plotting the labeled data. Let's say this
[01:16:50] is the set of data that we'll be using to
[01:16:53] build the model. This is a very small
[01:16:55] data set. But just remember that
[01:16:56] whenever you are using logistic
[01:16:58] regression, make sure that you are using
[01:17:00] a large amount of data. Logistic
[01:17:02] regression works pretty well with a large
[01:17:04] amount of data. It doesn't work that
[01:17:06] well with a small amount of data. Okay,
[01:17:08] here, just for the purpose of
[01:17:09] understanding, we are using a small data
[01:17:11] set. All right. So we have two variables:
[01:17:14] number of spam words and the probability
[01:17:16] of the mail being spam, for each mail.
[01:17:19] Okay.
[01:17:20] Next, as step three, we'll draw the
[01:17:23] regression line. So next what we'll do is
[01:17:25] plot our data set on the x-axis and
[01:17:28] y-axis, with the independent variable on the
[01:17:31] x-axis and the dependent variable on the
[01:17:33] y-axis. So the number of spam words in a
[01:17:36] mail is the independent variable. Right?
[01:17:39] And the probability of that particular
[01:17:41] mail being spam or not spam is the
[01:17:43] dependent variable. It depends on the
[01:17:46] number of spam words in the mail. Right?
[01:17:48] So let's plot these points one by one. So
[01:17:51] first we have one spam word, and the
[01:17:53] probability of this mail being spam is
[01:17:55] zero. So it will be plotted up here. So
[01:17:58] next we have five spam words in a
[01:18:00] mail, and the probability of that mail
[01:18:02] being spam is one. So it will be
[01:18:04] plotted somewhere here. Next, three
[01:18:06] spam words, and it is a spam mail. So it will be
[01:18:09] plotted again here. Two words, it's not
[01:18:12] a spam mail. So I'll be plotting it up
[01:18:14] here. Seven words, again a spam mail, up
[01:18:17] here. Four words, not a spam mail, here.
[01:18:21] Nine words, it's spam. Eight spam words,
[01:18:25] again it's not spam. So once we are
[01:18:27] done with the plotting, this is how our
[01:18:29] plotted data would look like. Now let's
[01:18:32] say that we have a new mail and now we
[01:18:34] want to figure out whether it's a spam
[01:18:36] or not. So before moving ahead let me
[01:18:39] just tell you: in a real-world scenario, to
[01:18:41] perform logistic regression you need a
[01:18:43] large amount of data, and also you
[01:18:45] might find many cases where a spam mail
[01:18:48] contains only two spam
[01:18:52] words, or it might be possible that
[01:18:55] you get a mail with more
[01:18:57] than five spam words and even in that
[01:18:59] case the mail is not spam. Okay. So
[01:19:02] here we are building a predictive model
[01:19:03] with the primary aim of reducing the error.
[01:19:06] Okay, now let's say we have a new mail
[01:19:08] up here. Now we need to figure out
[01:19:10] whether this mail is spam or not. But
[01:19:14] how do we do that? Well, first of all,
[01:19:16] we need to plot the regression curve that
[01:19:18] fits the data best. And that curve would
[01:19:20] be our logistic regression curve. But
[01:19:23] now the question is how to find out
[01:19:25] which is the best regression curve.
[01:19:27] Okay. Well, this involves a few
[01:19:29] steps. Well, the first step is to
[01:19:32] convert the y-axis from the scale of
[01:19:34] probability, confined between 0 and 1, to
[01:19:36] a scale of log odds. Then we draw an
[01:19:39] arbitrary regression line through the data
[01:19:41] that we already have. Then with the help
[01:19:44] of the sigmoid function, we'll convert the
[01:19:47] log odds to the probability of the mail
[01:19:49] being spam. We'll plot each mail on the
[01:19:52] basis of its new probability value, and
[01:19:54] this will form our regression curve.
[01:19:57] Then finally from this plot we'll find
[01:19:58] out the likelihood value of each
[01:20:00] mail, that is,
[01:20:09] the individual likelihoods. At last we'll
[01:20:12] take the log of the likelihood; that would be
[01:20:14] our log likelihood of the regression
[01:20:16] curve. Now the question is what
[01:20:19] terms such as log odds and log
[01:20:21] likelihood mean. So before moving ahead
[01:20:24] let's discuss that.
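As a preview of where these steps lead, the sketch below scores two candidate lines by their log likelihood on the eight plotted mails. The labels are taken as read out in the walkthrough, and the candidate `intercept` and `slope` values are hypothetical numbers picked only for illustration. Natural logarithms are assumed, as is conventional for logistic regression.

```python
import math

def sigmoid(z):
    # Converts a log odds value into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# (spam word count, label) pairs as read out above: 1 = spam, 0 = not spam.
mails = [(1, 0), (5, 1), (3, 1), (2, 0), (7, 1), (4, 0), (9, 1), (8, 0)]

def log_likelihood(intercept, slope, data):
    """Sum of log individual likelihoods for a line on the log odds scale."""
    total = 0.0
    for count, label in data:
        log_odds = intercept + slope * count        # project onto the line
        p_spam = sigmoid(log_odds)                  # back to a probability
        likelihood = p_spam if label == 1 else 1.0 - p_spam
        total += math.log(likelihood)
    return total

# Two hypothetical candidates: a flat line and an upward-sloping one.
ll_flat = log_likelihood(0.0, 0.0, mails)
ll_sloped = log_likelihood(-1.5, 0.35, mails)
print(ll_flat, ll_sloped)
```

The candidate with the higher (less negative) log likelihood fits the labeled mails better. Maximum likelihood estimation amounts to searching for the line that makes this value as large as possible.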
[01:20:26] So what does log of odds mean? Let us
[01:20:29] explore this with the help of an
[01:20:30] example. So before we proceed any
[01:20:32] further let me just clarify one
[01:20:34] thing: probability and odds
[01:20:37] are not the same thing. Let me explain
[01:20:39] this with an example. Suppose this
[01:20:41] guy goes fishing five times a
[01:20:43] week. So out of five times he catches a
[01:20:46] fish two times and he fails to catch one
[01:20:49] three times. Okay. So now in this case
[01:20:52] what are the probability and the odds of
[01:20:54] getting a fish for dinner? So let's
[01:20:56] first calculate the probability. So
[01:20:58] probability is chances for divided by
[01:21:01] total chances.
[01:21:03] So what is the probability of
[01:21:05] catching a fish? That is how many times
[01:21:07] he caught a fish, that is two, divided by
[01:21:10] the total chances he had to catch a fish,
[01:21:13] which was five. Right? So here the
[01:21:16] probability of getting a fish for
[01:21:18] dinner is 2/5. Okay. Next come the odds:
[01:21:21] chances for divided by chances against,
[01:21:25] that is, the ratio of how many times he
[01:21:27] caught a fish divided by how many times
[01:21:29] he failed to catch a fish. Okay. So he
[01:21:32] caught a fish two times and he
[01:21:34] failed to catch a fish three
[01:21:37] times. So the odds of getting a fish for
[01:21:39] dinner are 2/3. Okay. So now that we know
[01:21:43] odds, let's see what log of odds and
[01:21:46] log odds ratio are. Are they the same?
[01:21:48] Let's find out. For your information,
[01:21:50] log odds is also called the logit
[01:21:53] function. Okay. So in our previous
[01:21:55] example where the fisherman was catching
[01:21:57] a fish, let's add another factor to his
[01:21:59] fishing. Let's add weather as a factor.
[01:22:02] So then we can recreate the entire
[01:22:04] scenario like this: on rainy days he was
[01:22:06] successful three times and failed two times,
[01:22:08] and on sunny days he was successful two times
[01:22:11] and failed three times. Now the
[01:22:14] odds of catching a fish on a sunny day are
[01:22:17] how much? 2/3, right? And the odds
[01:22:20] of catching a fish on a rainy day are 3/2.
[01:22:48] And now the log of odds of catching a
[01:22:52] fish on a sunny day is just the log of
[01:22:55] 2/3. And similarly, the log of odds on a rainy
[01:22:58] day is the log of 3/2.
[01:23:07] Next is the log odds ratio. So the
[01:23:11] log odds ratio is nothing but the log of the ratio
[01:23:13] of the odds on a sunny day to
[01:23:16] the odds on a rainy day. That is the log
[01:23:18] of (2/3) / (3/2), which is nothing but the log
[01:23:22] of 0.44. So here we can say that log odds
[01:23:25] and log odds ratio are two different things.
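The fishing numbers can be checked with a few lines of Python. This sketch assumes the consistent version of the scenario (two successes and three failures on sunny days, three successes and two failures on rainy days) and uses the natural logarithm. The helper names `probability` and `odds` are made up for this example.

```python
import math

def probability(successes, failures):
    # Chances for divided by total chances.
    return successes / (successes + failures)

def odds(successes, failures):
    # Chances for divided by chances against.
    return successes / failures

# Overall week: 2 catches, 3 misses.
p_fish = probability(2, 3)
odds_fish = odds(2, 3)

# Split by weather (assumed counts, see the lead-in above).
odds_sunny = odds(2, 3)
odds_rainy = odds(3, 2)

log_odds_sunny = math.log(odds_sunny)
log_odds_rainy = math.log(odds_rainy)

# The odds ratio compares sunny days against rainy days.
odds_ratio = odds_sunny / odds_rainy
log_odds_ratio = math.log(odds_ratio)

print(p_fish, odds_fish)
print(log_odds_sunny, log_odds_rainy)
print(round(odds_ratio, 2), log_odds_ratio)
```

Note that the log of the odds ratio equals the difference of the two log odds, which is one reason the log scale is convenient.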
[01:23:30] Now let us go back to this step. So now
[01:23:32] that we have an understanding of log odds,
[01:23:34] we are ready to perform this step.
[01:23:36] Okay. So let's see how we convert the
[01:23:39] 0-to-1 axis to a minus infinity to plus
[01:23:42] infinity axis. So here we'll be
[01:23:45] converting the probability scale to a
[01:23:47] scale of log odds. Okay. So for the log
[01:23:50] odds we have the formula: log of the
[01:23:53] probability of spam divided by 1 minus the
[01:23:55] probability of spam. So here, for a spam mail, the
[01:23:58] probability of the mail being spam is 1.
[01:24:00] Okay. So we get the value of log odds as
[01:24:02] log(1 / (1 - 1)), that is log(1/0).
[01:24:07] It's positive infinity. How did we get that?
[01:24:09] So log(1/0) is nothing but log 1
[01:24:12] minus log 0. And log 0 up here is
[01:24:16] minus infinity. So minus of minus
[01:24:19] infinity is what? Plus infinity. Why is log
[01:24:22] 0 up here minus infinity? Let's
[01:24:24] see. So in general, say we have
[01:24:27] log of 0 with base b equal to c. If you
[01:24:30] convert it into exponential form, we get
[01:24:32] 0 = b^c. Right? No finite c satisfies this exactly, so we ask which way c must go for b^c to approach 0.
[01:25:19] If the value of
[01:25:22] b, the base, is less than 1, then in
[01:25:24] that case the value of c has to go to
[01:25:26] positive infinity. Okay. For example, 0.1^1000
[01:25:30] would be smaller than 0.1^100.
[01:25:35] Right? So the larger the value of c up here,
[01:25:37] the closer the number gets to zero.
[01:25:40] Right? And in the next case, if the value of
[01:25:42] b is greater than 1, then we have
[01:25:44] to push the value of c toward minus
[01:25:47] infinity. Why? For example, we have 10^-1
[01:25:50] and 10^-10. Which is
[01:25:54] smaller? 10^-10, right? It is much
[01:25:58] smaller, or closer to zero.
[01:26:01] So that's why we have to make the
[01:26:04] value of c as small as we can. So here, if
[01:26:07] the value of b is greater than 1, the
[01:26:09] value of c would have to tend to
[01:26:12] minus infinity in order to make the
[01:26:14] equation true. So in this case, for log(1/0),
[01:26:16] by default we have the base as 10, which is greater than 1.
[01:26:20] So that's why we took the value of c as
[01:26:22] minus infinity. So minus of minus
[01:26:25] infinity is plus infinity. So that's
[01:26:28] why we got plus infinity up here. I
[01:26:30] hope it is clear to you how we
[01:26:33] got the value of log odds as plus
[01:26:35] infinity. So we'll plot this at plus
[01:26:38] infinity log odds.
[01:26:40] Now next let's find the log odds of a
[01:26:43] non-spam mail. So here we use the same
[01:26:45] formula: log of the probability of the mail
[01:26:48] being spam divided by 1 minus that
[01:26:50] probability. For a non-spam mail that probability is 0, so
[01:26:52] we have log(0 / (1 - 0)), that is log(0/1),
[01:26:57] which is log 0 - log 1, which tends to
[01:27:01] minus infinity. Okay, similar concept.
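The probability-to-log-odds conversion, including the two limiting cases just discussed, can be sketched as a small function. The name `logit` follows the common term for this function. Returning `math.inf` and `-math.inf` at the endpoints is a choice made for this illustration, and the natural log is assumed (any base greater than 1 gives the same signs at the limits).

```python
import math

def logit(p):
    """Log odds of a probability p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return math.inf    # log(1/0): labeled spam mails sit at plus infinity
    if p == 0.0:
        return -math.inf   # log(0/1): labeled non-spam mails sit at minus infinity
    return math.log(p / (1.0 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, logit(p))
```

A probability of 0.5 maps to log odds of 0, probabilities below 0.5 map to negative log odds, and probabilities above 0.5 map to positive log odds.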
[01:27:05] Now we have our data up here. So first
[01:27:07] we'll assume one regression line.
[01:27:12] Then we'll
[01:27:14] project our data onto the regression
[01:27:16] line. Okay. Now let's just go back to
[01:27:19] the step where, with the help of the sigmoid
[01:27:21] function, we'll convert the log odds to
[01:27:24] the probability of the mail being spam.
[01:27:26] Okay. But what does this sigmoid
[01:27:29] function mean? So the sigmoid function is
[01:27:31] the standard logistic function. The
[01:27:33] logistic function is defined as $f(x) = \dfrac{L\,e^{k(x - x_0)}}{1 + e^{k(x - x_0)}}$.
[01:27:43] So here $L$ is the curve's maximum
[01:27:47] value, $k$ is the steepness of the curve, and
[01:27:50] $x_0$ is the x value of the sigmoid's midpoint.
[01:27:54] Okay. So here the sigmoid function is $\sigma(x) = \dfrac{e^{x}}{1 + e^{x}}$,
[01:27:57] that is, $k = 1$, $x_0 = 0$, and
[01:28:03] $L = 1$. So this mathematical sigmoid
[01:28:07] function forms an S-shaped curve which is
[01:28:09] confined between 0 and 1. Let's try to
[01:28:11] understand how the logistic function works
[01:28:13] with the help of an example.
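The relationship between the general logistic function and the sigmoid can be checked numerically. In this sketch the parameter names `L`, `k`, and `x0` follow the description above, and the logistic function is written in the algebraically equivalent form L / (1 + e^(-k(x - x0))).

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    # L: maximum value, k: steepness, x0: midpoint of the curve.
    # Equivalent to L * e^(k(x - x0)) / (1 + e^(k(x - x0))).
    return L / (1.0 + math.exp(-k * (x - x0)))

def sigmoid(x):
    # Standard logistic function: L = 1, k = 1, x0 = 0.
    return math.exp(x) / (1.0 + math.exp(x))

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, sigmoid(x), logistic(x))

# At the midpoint the curve is at half of its maximum value.
half_of_max = logistic(5.0, L=10.0, k=2.0, x0=5.0)
print(half_of_max)
```

With the default parameters the two functions agree, and the outputs trace the S-shaped curve: close to 0 for large negative inputs, 0.5 at zero, and close to 1 for large positive inputs.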
[01:28:16] So let's say we have a set of
[01:28:18] unspecified data, and on this data we
[01:28:20] need to apply a sigmoid function. So
[01:28:22] let's see what a sigmoid function can do
[01:28:25] by visualizing this graph.
[01:28:32] So plot this data and find out the
[01:28:36] respective y for these points.
[01:28:42] Okay. So what a sigmoid function is
[01:28:44] doing can be visualized from this
[01:28:46] graph. Right? So you are giving some
[01:28:48] values on your x-axis, and using the sigmoid
[01:28:51] function you can predict the probability
[01:28:53] on the y-axis. Right? So this is the
[01:28:55] reason why the sigmoid function is very
[01:28:57] useful while solving classification
[01:28:59] problems. It takes any real-valued number
[01:29:02] and maps it onto a value between 0 and 1.
[01:29:05] Okay.
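One practical caveat when coding "any real-valued number": evaluating e^x directly overflows floating point for large positive x. The sketch below shows a common workaround, branching on the sign of the input. The function names are made up for this example, and this is one possible approach rather than the method used in the video.

```python
import math

def naive_sigmoid(x):
    # Direct translation of e^x / (1 + e^x); overflows for large positive x.
    return math.exp(x) / (1.0 + math.exp(x))

def stable_sigmoid(x):
    # Only ever exponentiates a non-positive number, so it cannot overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

for value in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(value, stable_sigmoid(value))

try:
    naive_sigmoid(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print("naive version overflowed:", naive_overflowed)
```

Both versions give the same values for moderate inputs, and only the stable one handles the extremes.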
[01:29:07] Well, now that we have an idea of how the
[01:29:09] sigmoid function works, let us move
[01:29:10] ahead with our spam email classifier. So
[01:29:13] we are ready to perform this step, that
[01:29:16] is, converting the log odds graph into a
[01:29:18] sigmoid function graph. Now we have to
[01:29:21] find out the best MLE. Okay, our best
[01:29:24] maximum likelihood estimator. So we are
[01:29:27] going to plug in the log odds value of
[01:29:28] each mail to get the probability of each
[01:29:31] mail being spam. So we have the
[01:29:33] formula up here: probability equals $e^{\text{log odds}}$
[01:29:36] divided by $1 + e^{\text{log odds}}$. So one by
[01:29:40] one we'll place the log odds of each mail
[01:29:42] into this formula and calculate the
[01:29:44] probability for each mail. Okay. For
[01:29:47] example, I have this male which after
[01:29:49] projecting into the regression line
[01:29:51] gives us log value of minus of 3.2.
[01:29:54] Okay. So what will we do here? We'll
[01:29:57] calculate the probability using the
[01:29:58] value of -3.2. We'll place the
[01:30:01] value -3.2 in our formula:
[01:30:04] $e^{-3.2} / (1 + e^{-3.2})$.
[01:30:08] So from here we'll get the
[01:30:11] probability as roughly 0.03 (more precisely, about 0.039).
[01:30:14] So plot it accordingly onto the new
[01:30:16] graph of the probability of the mail being
[01:30:18] spam versus spam word count. On the
[01:30:21] basis of this probability, the mail would lie
[01:30:23] somewhere near zero. So according to
[01:30:26] the prediction, this mail is not a spam
[01:30:29] mail. Again, for another mail,
[01:30:31] projecting onto the regression line
[01:30:32] gives us a log odds value of 5.6. So when
[01:30:36] we put the value of 5.6 in the formula
[01:30:38] we get the probability as 0.99. So the
[01:30:41] probability of this mail being spam is
[01:30:45] 0.99. So again we plot this on the
[01:30:48] graph.
[01:30:49] Similarly, one by one, we'll calculate it for
[01:30:52] each one of them. For this mail, after
[01:30:54] projecting onto the regression line, we
[01:30:56] get a log odds value of -4.5.
[01:30:59] Putting -4.5 into the formula,
[01:31:02] we get the probability as 0.01. So the
[01:31:05] prediction for this mail is that it is not a spam
[01:31:07] mail, which is the same as the actual label. Right? So
[01:31:10] again we plot this mail onto the graph.
[01:31:13] All right. So similarly you can repeat
[01:31:15] this step for the rest of the emails as
[01:31:17] well.
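
The three conversions walked through above can be reproduced with a short Python sketch. The helper name `log_odds_to_probability` and the 0.5 decision threshold are assumptions for illustration; only the three log odds values (-3.2, 5.6, -4.5) come from the video. Note that -3.2 works out to about 0.039, which the video reports as 0.03.

```python
import math


def log_odds_to_probability(log_odds):
    """Convert a log odds value to a probability with the sigmoid formula."""
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))


# Log odds values quoted in the walkthrough.
log_odds_values = [-3.2, 5.6, -4.5]
probabilities = [log_odds_to_probability(z) for z in log_odds_values]

for z, p in zip(log_odds_values, probabilities):
    # The 0.5 cutoff is an assumed decision threshold, not stated in the video.
    label = "spam" if p >= 0.5 else "not spam"
    print(f"log odds = {z:>5}  ->  probability = {p:.3f}  ->  {label}")
```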
[01:31:19] And finally we get the S-curve up here.
[01:31:21] So that's our regression curve. But you
[01:31:23] must be wondering: is this the
[01:31:25] best-fitted curve, and how do we find out
[01:31:28] whether it's the best or not? Well, this is
[01:31:30] where the concept of maximum likelihood
[01:31:32] comes into the picture.
[01:31:34] So now that we have a regression curve,
[01:31:37] let's find the likelihood of this
[01:31:38] curve. First, find the individual
[01:31:41] likelihood of each mail. Again, you must
[01:31:43] be thinking, how do we get the likelihood
[01:31:45] value? Well, the likelihood of each mail is
[01:31:47] nothing but the probability value of
[01:31:49] each mail being spam. So the likelihood of the
[01:31:52] first mail being spam is 0.01.
[01:31:54] The likelihood of the second mail being spam is
[01:31:56] again 0.01. Similarly, the third is 0.03, the fourth
[01:32:00] 0.05, and so on up to the eighth mail at 0.99.
[01:32:04] Okay.
[01:32:05] So once you get the individual
[01:32:07] likelihood of each mail, multiply them
[01:32:09] to find the likelihood of the entire
[01:32:11] curve.
[01:32:13] Okay?
[01:32:15] Then calculate the log of the likelihood.
[01:32:17] For calculating the log-likelihood,
[01:32:20] you can just take the log of the
[01:32:21] previous multiplied result. Okay?
[01:32:27] Here we are adding all the logs because
[01:32:29] $\log(ab) = \log a + \log b$.
[01:32:33] Okay. So we got the value
[01:32:37] of the log-likelihood of this curve as
[01:32:40] -0.084.
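
The log-likelihood calculation can be sketched in Python as follows. This uses the standard formulation, in which a mail that is actually spam contributes its predicted probability p and a mail that is actually not spam contributes 1 - p. The labels and the two sets of predicted probabilities below are hypothetical values chosen for illustration; they are not the slide's data, so the totals will not match the -0.084 and -0.207 quoted in the video. The natural logarithm is also an assumption, since the video does not state the base.

```python
import math


def log_likelihood(labels, probabilities):
    """Sum of log-likelihoods over all mails.

    labels        : 1 for spam, 0 for not spam (the actual classes)
    probabilities : predicted probability of each mail being spam
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        likelihood = p if y == 1 else 1.0 - p  # likelihood of the observed label
        total += math.log(likelihood)          # log(a * b) = log(a) + log(b)
    return total


# Hypothetical data: four non-spam mails followed by four spam mails.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
curve_a = [0.01, 0.01, 0.03, 0.05, 0.95, 0.97, 0.99, 0.99]  # candidate line A
curve_b = [0.10, 0.15, 0.20, 0.30, 0.70, 0.80, 0.85, 0.90]  # candidate line B

ll_a = log_likelihood(labels, curve_a)
ll_b = log_likelihood(labels, curve_b)
best = "A" if ll_a > ll_b else "B"
print(f"log-likelihood A = {ll_a:.4f}, B = {ll_b:.4f}, better fit: {best}")
```

The candidate with the larger (less negative) log-likelihood is the better fit, which is the comparison the fitting procedure repeats for each candidate line.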
[01:32:44] Now let us rotate this line to find
[01:32:46] the best-fitted regression line. Again
[01:32:59] we calculate the individual
[01:33:01] log-likelihood of each mail.
[01:33:04] For this one, let's say we got the
[01:33:06] log-likelihoods shown on your
[01:33:08] screen. So the final value we got is
[01:33:11] -0.207.
[01:33:19] So now if we compare the log-likelihood
[01:33:21] values for these two regression lines,
[01:33:23] we'll see that line A has a bigger
[01:33:26] log-likelihood than line B. Right?
[01:33:32] The log-likelihood
[01:33:36] value for line A is -0.084,
[01:33:40] whereas for line B it is -0.207.
[01:33:44] So -0.084 is greater than -0.207.
[01:33:49] Right? So therefore we can say that line
[01:33:51] A has a better likelihood value than line
[01:33:54] B. Now again we'll rotate the line.
[01:33:57] We'll keep on rotating the line until we
[01:34:00] get the maximum value of the log-likelihood,
[01:34:04] and then finally we'll choose the line
[01:34:06] with the maximum
[01:34:08] log-likelihood, and that line will be the
[01:34:10] best-fitted regression line. So let's
[01:34:12] quickly go to Jupyter Notebook and start
[01:34:14] with a demo. Right? So this is Jupyter
[01:34:16] Notebook, guys, and our first task is
[01:34:18] to load the heart disease data set,
[01:34:20] and for that purpose we have to
[01:34:21] import the pandas package. So I'll just
[01:34:24] type in import pandas as pd, and I'll use
[01:34:26] the read_csv function from the pandas
[01:34:29] package. So I'll type in pd.read_csv
[01:34:32] and I'll pass in the name of the data
[01:34:34] set, which is heart.csv, and
[01:34:36] I'll store this in the dataset object.
[01:34:39] Now let me have a glance at the first
[01:34:41] few records of this data set. So this is
[01:34:44] our data set, which comprises all of
[01:34:46] these columns, and we're going to build
[01:34:48] the logistic regression model on top
[01:34:50] of this column over here, which is
[01:34:51] target. So target will be our
[01:34:54] dependent variable and the rest of the
[01:34:56] columns will be the independent
[01:34:57] variables. All right. And this target
[01:35:00] column has one and
[01:35:02] zero values. The value one
[01:35:04] means that the patient has
[01:35:06] heart disease, and zero
[01:35:07] means that the patient does not have
[01:35:09] heart disease. Right? Now let me also
[01:35:11] have a glance at the shape of this data
[01:35:13] set. So I'll just type in
[01:35:15] print(dataset.shape), and this gives me a value of
[01:35:18] (303, 13). So 303 means that there are
[01:35:22] 303 records in this data set, and 13 is the number of
[01:35:26] columns. Now let me actually have a
[01:35:27] glance at the value counts of this
[01:35:29] target column. So value_counts
[01:35:32] tells me the frequency of
[01:35:34] these two values. I have two
[01:35:37] values in this column, which are
[01:35:38] 1 and 0. There are 165 records
[01:35:43] where the value is 1 and 138
[01:35:46] records where the value is 0. So this
[01:35:49] means that in this data set
[01:35:51] there are 165 patients who have
[01:35:54] heart disease and 138 patients who
[01:35:57] do not have heart disease. Now I'll
[01:35:58] go ahead and visualize this. So
[01:36:01] I'll load the matplotlib
[01:36:03] and seaborn packages, and I will pass
[01:36:05] this target column to the x-axis, and
[01:36:08] the data is our data set, which is
[01:36:10] this heart disease data set.
[01:36:11] What I'm doing is building
[01:36:13] a bar plot of the counts, and I'll plot it over
[01:36:16] here. Right? So this is the bar for
[01:36:19] the value 0 and this is the bar
[01:36:21] for the value 1, and this
[01:36:23] tells us the same thing.
[01:36:25] 165 is the number of
[01:36:27] patients who have heart
[01:36:29] disease, and this bar is for all of
[01:36:31] those patients who do not have heart
[01:36:32] disease. All right. Now let me go ahead
[01:36:34] and divide the data set into feature
[01:36:36] and label sets. I'm storing all of
[01:36:39] the features in this X object. So all
[01:36:41] of these 12 columns will be my features,
[01:36:44] and this target column will be my label,
[01:36:46] that is, my dependent variable. All
[01:36:48] right. And this is how I'm going to
[01:36:50] divide the data set: I'm going to
[01:36:51] extract all of the columns except the
[01:36:53] last column and store them in this X
[01:36:55] object. Similarly, I'll take only the
[01:36:57] last column and store it in this y
[01:37:00] object. Right. Now let me have a glance
[01:37:02] at the independent variables
[01:37:04] and the target variable. X.head() gives me
[01:37:07] all of the independent variables, and
[01:37:09] y.head() gives me the target. Right? So
[01:37:12] now that we have our independent
[01:37:14] variables and the dependent variable,
[01:37:15] let me go ahead and divide this data set
[01:37:17] into training and testing sets. For
[01:37:19] that purpose, I have to load the
[01:37:21] train_test_split function from
[01:37:23] sklearn.model_selection. Over here,
[01:37:26] I'm setting test_size equal to
[01:37:28] 0.2. This means that 20% of the
[01:37:30] records go to the test set and the remaining
[01:37:32] 80% of the records go to the training set.
[01:37:35] Right? I'll click on run again. Now we
[01:37:38] have divided the data set into training
[01:37:40] and testing sets. Now finally it's time
[01:37:42] to build the model, and for that purpose
[01:37:44] I'll be importing LogisticRegression
[01:37:46] from sklearn.linear_model, and I'm going
[01:37:49] to create an instance of it. So I'll
[01:37:51] just call LogisticRegression()
[01:37:53] and I'll name that instance
[01:37:56] log_model, and I'm going to fit this model on
[01:37:58] the training set. So I'm
[01:38:00] passing X_train and y_train as the
[01:38:02] arguments. Right. I'll click on run.
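
The steps so far (split into features and label, train/test split, model fit) can be sketched as below. Because heart.csv is not available here, the sketch builds a synthetic stand-in with 303 rows, 12 feature columns, and a binary target column; the column names, the random data, `random_state=0`, and `max_iter=1000` are all assumptions for illustration and are not taken from the video. Variable names such as `dataset`, `X_train`, and `log_model` are a guess at the notebook's names based on the narration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv (assumption: the real file is not available).
rng = np.random.default_rng(0)
n_rows = 303
dataset = pd.DataFrame(
    rng.normal(size=(n_rows, 12)),
    columns=[f"feature_{i}" for i in range(12)],  # hypothetical column names
)
signal = dataset["feature_0"] + 0.5 * dataset["feature_1"]
noise = rng.normal(scale=0.5, size=n_rows)
dataset["target"] = (signal + noise > 0).astype(int)  # last column is the label

# Features: every column except the last. Label: only the last column.
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

# 20% of the records go to the test set, the remaining 80% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # random_state is an assumption
)

# Create the model instance and fit it on the training set.
log_model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
log_model.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
```

With 303 records and `test_size=0.2`, scikit-learn puts 61 records in the test set, which matches the 61 predictions counted in the confusion matrix later in the demo.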
[01:38:05] Right. So we have successfully built the
[01:38:07] model on the training set. Now we're
[01:38:09] going to go ahead and predict the values
[01:38:11] for the test set. So I'll type in
[01:38:14] log_model.predict, and I'll pass in
[01:38:17] X_test as the argument and store the
[01:38:20] result in y_pred. So we have also
[01:38:23] predicted the values. Now it's time to
[01:38:25] calculate the accuracy. So I will type
[01:38:27] log_model.score and pass in
[01:38:31] X_test and y_test, since I want to calculate
[01:38:33] the accuracy of the predictions on
[01:38:35] the test set. Right? So the accuracy
[01:38:38] comes out to be 73%, which is actually
[01:38:40] not that bad. And let me also
[01:38:42] build a confusion matrix. A confusion
[01:38:45] matrix gives me a table of values
[01:38:47] which comprises the
[01:38:49] correctly predicted values and the
[01:38:50] misclassified values. So I have to
[01:38:52] import confusion_matrix from
[01:38:54] sklearn.metrics. And again I'll just
[01:38:56] pass in y_test and y_pred as the
[01:38:58] arguments to this function. And
[01:39:00] I'll print this out. All right. So this
[01:39:03] left diagonal which you see, that is, the main
[01:39:05] diagonal, represents all of
[01:39:06] those values which have been correctly
[01:39:08] classified. And this right diagonal
[01:39:10] represents all of those values which
[01:39:12] have been misclassified. And if you want
[01:39:14] to get the accuracy, all you have to do
[01:39:17] is add up 20 + 25 and divide it by the sum of all
[01:39:21] of the values, and you'll get the same
[01:39:22] accuracy. So let me add a
[01:39:24] new cell over here and calculate the
[01:39:26] accuracy from this confusion matrix.
[01:39:28] I have to divide the sum of this left diagonal by the sum of
[01:39:30] all of the values. So that would be (20 +
[01:39:33] 25) divided by (20 + 25 + 10 + 6), and this
[01:39:41] gives me a value of 0.7377, that is, 73.77%,
[01:39:44] which is the same as I got over here. So
[01:39:47] the accuracy is about 73%. Right? So we have
[01:39:50] built the confusion matrix. Now I'll
[01:39:51] also go ahead and build the ROC curve.
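
The accuracy calculation from the confusion matrix can be checked with a few lines of plain Python. The diagonal values 20 and 25 and the off-diagonal values 10 and 6 come from the video; which off-diagonal cell holds 10 and which holds 6 is an assumption here, and it does not affect the accuracy.

```python
# Confusion matrix from the demo: correct predictions on the main diagonal.
# The placement of 10 and 6 in the off-diagonal cells is assumed.
confusion = [
    [20, 10],
    [6, 25],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # 20 + 25
total = sum(sum(row) for row in confusion)                     # 20 + 10 + 6 + 25
accuracy = correct / total

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
```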
[01:39:54] The ROC curve shows the
[01:39:56] trade-off between the true positive
[01:39:58] rate and the false positive rate. So let
[01:40:01] me go ahead and plot this, and this is
[01:40:03] what we get over here. On the y-axis
[01:40:05] we have the true positive rate and on
[01:40:07] the x-axis we have the false positive
[01:40:09] rate. And basically you can understand
[01:40:11] this plot this way: the closer this
[01:40:13] curve is to the top-left corner over
[01:40:16] here, the better the model. That is, this
[01:40:18] curve needs to cover a greater area. And
[01:40:21] this red line which you see
[01:40:23] represents a classifier that guesses at random, which
[01:40:25] would give you around 50% accuracy, and
[01:40:28] the farther your curve is above
[01:40:30] this red line, the better your model. So
[01:40:32] classification is a process of grouping
[01:40:34] things according to the similar features
[01:40:36] they share. The example given up here
[01:40:38] shows a set of different kinds of garbage
[01:40:41] which are segregated by category
[01:40:43] into different bins. For example, paper,
[01:40:46] metal, plastic, e-waste, glass, and organic waste
[01:40:49] are all segregated properly and
[01:40:51] separated into different bins. So what
[01:40:55] are we classifying here? We are
[01:40:58] classifying the waste and putting it
[01:41:00] into different bins. Okay. So let's move
[01:41:03] ahead. Next is classification versus
[01:41:06] regression. Well, this is a very
[01:41:07] important thing to understand. So what
[01:41:10] is the difference between
[01:41:11] classification and regression? I am
[01:41:13] assuming you know that both are related
[01:41:15] to prediction. Right? Regression is used
[01:41:17] to predict a value from a continuous
[01:41:19] set. That is, it deals with continuous
[01:41:22] variables. On the other hand,
[01:41:24] classification is used to predict the
[01:41:26] class to which a particular
[01:41:29] data point belongs.
[01:41:31] So basically it deals with
[01:41:33] categorical variables. For example, the
[01:41:36] price of a house depends on its size.
[01:41:38] Right? The price here is a numerical value
[01:41:41] which can be continuous. So predicting it relates
[01:41:44] to regression. Similarly, the prediction of
[01:41:47] price can be in words like very costly,
[01:41:50] costly, affordable, cheap, and very
[01:41:52] cheap. So this relates to
[01:41:54] classification. So this was about
[01:41:57] regression and classification. I hope
[01:42:00] now you know when to use regression and
[01:42:02] when to use classification. So let's
[01:42:04] move ahead. Next are the types of
[01:42:06] classification algorithms. So we have
[01:42:09] logistic regression, decision
[01:42:11] tree, random forest, k nearest neighbors,
[01:42:13] and naive Bayes. Let's have a look at them one
[01:42:16] by one. So first is logistic
[01:42:19] regression. We have already learned
[01:42:21] about it in detail in our previous
[01:42:22] session. Let me just summarize things
[01:42:24] for you. Logistic regression is used
[01:42:27] when the dependent variable is
[01:42:28] categorical. For example, you have to
[01:42:30] predict whether the given mail is spam
[01:42:32] or not. Okay? So in that case, you'll be
[01:42:35] using logistic regression. For a
[01:42:37] detailed understanding of logistic
[01:42:38] regression, I'd suggest you go through
[01:42:40] our previous session. Okay, for now let's
[01:42:43] move ahead. Next is the decision tree.
[01:42:47] Well, a decision tree is a graphical
[01:42:48] representation of all the possible
[01:42:50] solutions to a decision. Here the
[01:42:53] decisions are mainly based on some
[01:42:55] conditions, and the
[01:42:57] output generated can be easily
[01:43:00] explained. From the image shown here,
[01:43:01] you can see that the main stem of the
[01:43:04] tree is nothing but the issue at hand.
[01:43:07] The branches of the tree, or the subtrees,
[01:43:10] are the possible decisions. Okay? And
[01:43:12] the leaf nodes, or the leaves of the tree,
[01:43:15] are the possible scenarios. Let me just give
[01:43:18] you an example of a decision tree. For
[01:43:20] example, you have to predict whether a
[01:43:23] given person is fit or not. So the very
[01:43:25] first thing that you'll check
[01:43:27] is whether the age is less than 30. If the condition is
[01:43:30] true, then you will check if the person
[01:43:32] eats a lot of pizza or not. If the
[01:43:35] person eats a lot of pizza, then he is
[01:43:37] unfit. In case he doesn't eat a lot of
[01:43:40] pizza, then you can say he is fit. But
[01:43:42] what if the person's age is not less than
[01:43:44] 30? In that case, you'll check another
[01:43:46] condition: whether he
[01:43:49] exercises in the morning or not. If he
[01:43:51] does, then he is fit. If not, then he is
[01:43:54] unfit. Okay. So this was about the decision
[01:43:58] tree. Next is the random forest. Well,
[01:44:02] a random forest builds multiple decision
[01:44:04] trees and merges them together to get a
[01:44:06] more accurate and stable prediction. So
[01:44:09] you can say that the decision tree is the
[01:44:11] basic building block of a random forest.
[01:44:13] Okay. Now the question arises: why is it
[01:44:16] called random? Well, it's called random
[01:44:28] because each decision
[01:44:30] tree in the forest considers a random
[01:44:33] subset of features when forming
[01:44:34] questions, and each tree only has access
[01:44:37] to a random subset of the training data. So
[01:44:39] this increases the diversity in the
[01:44:41] forest, leading to a more robust overall
[01:44:44] prediction, and hence the name random forest.
[01:44:46] Okay. So now when it comes to
[01:44:48] prediction, a random forest combines
[01:44:51] the estimates of all the individual decision
[01:44:53] trees, by averaging or voting. Well, you can say that
[01:44:55] there are two fundamental ideas
[01:44:56] behind the random forest, and both of
[01:44:59] them are well known to us in our daily
[01:45:01] life. The first is constructing a
[01:45:03] flowchart of questions and answers
[01:45:05] leading to a decision, and the second is
[01:45:07] the wisdom of a random and diverse
[01:45:10] crowd. The random forest is the
[01:45:12] combination of these ideas, which gives
[01:45:15] the random forest model its power.
[01:45:17] Well, if you talk about its training,
[01:45:19] the training is done using the bagging
[01:45:21] method. Now you would ask, what exactly
[01:45:24] is bagging? Well, we have already
[01:45:26] discussed it: not the terminology, but I
[01:45:29] have explained the concept to you. Bagging
[01:45:31] is building multiple decision trees by
[01:45:33] using random subsets of the training data
[01:45:35] set and finally letting the trees vote for a
[01:45:37] consensus prediction. One last thing:
[01:45:40] using a random forest, you can correct the
[01:45:42] decision tree's habit of overfitting the
[01:45:44] training data set. So this was about
[01:45:46] random forest. Let's move ahead. Next we
[01:45:49] have K nearest neighbors. Well, KNN
[01:45:52] algorithms store the available data and classify new
[01:45:55] data points based on a similarity
[01:45:57] measure. The K in KNN is the number of
[01:46:00] nearest neighbors we wish to take a vote
[01:46:02] from. It is one of the simplest and most
[01:46:04] used learning algorithms. Well, KNN is a
[01:46:07] non-parametric, lazy algorithm whose
[01:46:10] purpose is to use a database in which
[01:46:12] the data points are separated into
[01:46:14] several classes to predict the
[01:46:16] classification of a new sample point.
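
A minimal pure-Python sketch of the voting idea is shown below. The function name `knn_predict`, the Euclidean distance measure, and the coordinates of the labeled points are all assumptions made for illustration. The points are arranged so that the three nearest neighbors of the query contain two triangles and one square, while the five nearest contain three squares and two triangles.

```python
import math
from collections import Counter


def knn_predict(training_points, query, k):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Sort the stored points by Euclidean distance to the query.
    by_distance = sorted(
        training_points, key=lambda item: math.dist(item[0], query)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled points as ((x, y), class).
training_points = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.1), "triangle"),
    ((-1.2, 0.0), "square"),
    ((0.0, -2.0), "square"),
    ((2.1, 0.0), "square"),
    ((3.0, 3.0), "triangle"),
    ((-3.5, 3.5), "square"),
]
query = (0.0, 0.0)

print("k = 3:", knn_predict(training_points, query, k=3))
print("k = 5:", knn_predict(training_points, query, k=5))
```

There is no fitting step: all the work happens at prediction time, which is what makes the algorithm lazy, and the answer can change with the choice of k.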
[01:46:18] And when I say it's non-parametric, it
[01:46:21] means that it does not make any
[01:46:22] assumptions about the underlying data
[01:46:24] distribution. And when I say it's a lazy
[01:46:27] learning algorithm, it doesn't mean that
[01:46:29] it's lazy like a polar bear or
[01:46:31] something. Okay, it means that it does
[01:46:34] not use the training data points to do
[01:46:36] any generalization. In other words,
[01:46:39] there is no explicit training phase, or
[01:46:41] it's very minimal. So this implies that the
[01:46:44] training phase is pretty fast, and the
[01:46:46] lack of generalization means that KNN
[01:46:48] keeps all the training data, so all of
[01:46:51] it is needed during the testing
[01:46:53] phase. So in this example you can see
[01:46:56] that we have a test sample up here in
[01:46:58] green that we need to classify. We have to
[01:47:00] predict whether it belongs with the red
[01:47:02] triangles or with the blue
[01:47:04] squares. So the test sample should be
[01:47:06] classified either to the first class,
[01:47:08] the blue squares, or to the second class,
[01:47:10] the red triangles. If k = 3, that is,
[01:47:13] the inner circle, it is
[01:47:15] assigned to the second class because
[01:47:18] there are two triangles and only one
[01:47:20] square inside the inner circle. Now if, for
[01:47:22] example, k = 5, then it is assigned to
[01:47:25] the first class, because there are three squares
[01:47:28] versus two triangles inside the outer
[01:47:30] circle. Okay. I hope the example is
[01:47:33] clear to you. So let's move ahead. And
[01:47:35] finally, we have the naive Bayes
[01:47:37] classifier. The naive Bayes classifier is a
[01:47:40] probabilistic machine learning model
[01:47:41] that is used for classification. It is
[01:47:44] completely based on Bayes' theorem, with which we can
[01:47:46] find the probability of A happening
[01:47:48] given that B has already occurred.
[01:47:51] Here B is the evidence and A is the
[01:47:53] hypothesis. The assumption made here
[01:47:56] is that the features are independent,
[01:47:58] that is, the presence of one particular
[01:48:00] feature does not affect the others. Hence
[01:48:03] it is called naive. Okay. So this was the
[01:48:06] summary of all the different types of
[01:48:08] classification algorithms. So let's
[01:48:10] say this is our data set. Our data
[01:48:12] set consists of three different
[01:48:13] attributes: color, diameter, and label.
[01:48:16] The label consists of different types of
[01:48:18] fruit. We have three different types
[01:48:20] of fruit up here: mango, lemon, and
[01:48:22] cherry. And if you talk about color,
[01:48:25] we have two varieties of mango:
[01:48:27] one is green and one is yellow.
[01:48:30] Okay. Now our task is to create a
[01:48:32] decision tree for this data set. So now
[01:48:34] we'll split this data set on the basis
[01:48:36] of some condition and create a decision
[01:48:38] tree. For splitting the data set,
[01:48:40] let's consider the attribute diameter.
[01:48:43] As you can see, we have three
[01:48:44] different fruits with a common diameter of
[01:48:46] three and two fruits with diameters less
[01:48:49] than three. So we have a condition to
[01:48:51] split on over here: we can check if the
[01:48:54] diameter of the fruit is greater than or
[01:48:56] equal to three or not. If the condition
[01:48:57] is true, then we'll have green mango,
[01:49:00] yellow mango, and yellow lemon as a part
[01:49:03] of our split data set. And if the
[01:49:05] condition is false, then all
[01:49:08] we'll get is cherry. Okay. So here you
[01:49:10] can say that this is our leaf node. So
[01:49:13] if you reach this node, you
[01:49:16] can make your prediction and say that it's
[01:49:18] 100% cherry. Okay. But if the diameter
[01:49:20] is greater than or equal to three, then what
[01:49:22] will you predict? Is it a green mango,
[01:49:24] a yellow mango, or a yellow
[01:49:26] lemon? You can't say, right? So again you
[01:49:29] have to split this data set. So how can
[01:49:30] we split? Let's see if we have any
[01:49:33] common feature to split this data set on.
[01:49:35] As you can see, we have a common color
[01:49:37] over here: we have a yellow mango and
[01:49:39] a yellow lemon. So again we have a
[01:49:41] condition to split on over here. We'll
[01:49:43] check here if the color is yellow or
[01:49:46] not. If the condition is true, we'll
[01:49:48] get all the yellow fruits in this data
[01:49:50] set, and if the condition is false, we'll
[01:49:52] get all the non-yellow fruits. So
[01:49:55] we get green mango over here. If you
[01:49:57] reach this node, then you can say
[01:49:58] that the probability of finding a green
[01:50:01] mango at this particular node is 1, or
[01:50:04] 100%. And at this leaf node you can
[01:50:07] say that the probability of finding a
[01:50:09] yellow mango is 50% and the probability
[01:50:11] of finding a yellow lemon is also 50%.
[01:50:14] Okay. So now you can see that we have
[01:50:16] some terms over here like Gini impurity
[01:50:18] equals 0 or Gini impurity equals 0.44.
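
For reference, the Gini impurity of a node is one minus the sum of the squared class proportions in that node. The sketch below applies this standard definition to the nodes of the fruit example; the helper name `gini_impurity` is an assumption, and the label lists are reconstructed from the narration (two cherries, two mangoes, one lemon) rather than read from the slide.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # convention chosen here for an empty node
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Nodes from the fruit example, as described in the narration.
cherry_leaf = ["cherry", "cherry"]              # diameter < 3
diameter_node = ["mango", "mango", "lemon"]     # diameter >= 3
yellow_node = ["mango", "lemon"]                # diameter >= 3 and color yellow

print(f"cherry leaf:   {gini_impurity(cherry_leaf):.2f}")
print(f"diameter node: {gini_impurity(diameter_node):.2f}")
print(f"yellow node:   {gini_impurity(yellow_node):.2f}")
```

A pure node gives 0, and the node holding two mangoes and one lemon gives 4/9, about 0.44, which matches the value shown in the visualization.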
[01:50:21] So what does this mean in this
[01:50:22] visualization? For any particular
[01:50:24] node, if the Gini impurity equals zero, it
[01:50:27] means that that particular node is a
[01:50:29] pure node and does not have any mixed
[01:50:32] values. As you can see, this node has only
[01:50:34] cherries in it. If you focus on this
[01:50:36] node, you can see that the Gini impurity
[01:50:39] value is nonzero. It means that it's
[01:50:41] a mixed data set, with mixed values.
[01:50:44] Okay, it's not a pure node, and there
[01:50:47] is still a chance to split the data
[01:50:49] set further on some condition. Okay, so now if
[01:50:51] you focus on these conditions, like is the
[01:50:53] diameter greater than or equal to three, or is the color
[01:50:55] yellow, they have some value of
[01:50:57] information gain. It's like you
[01:50:59] gain the maximum information at your root
[01:51:01] node, and as you move towards the leaf
[01:51:03] nodes, the information gain
[01:51:05] decreases. As you can see, this is the
[01:51:08] parent node of this node, right, and this
[01:51:10] is the child node of this node. In this tree, the
[01:51:12] information gain at the parent
[01:51:14] node is greater than the
[01:51:17] information gain at the child
[01:51:19] node. Okay. So this was about how you
[01:51:22] can visualize a decision tree. Now let's
[01:51:25] move ahead and understand the
[01:51:26] decision tree terminology. The very
[01:51:28] first thing that we have is the root
[01:51:30] node. The root node represents the
[01:51:33] entire population or sample, and this
[01:51:36] population further gets divided into two
[01:51:38] or more homogeneous sets. Okay. So when
[01:51:40] we pass the entire data set to one
[01:51:43] particular node, that node is the root
[01:51:45] node. Okay. Next is the leaf node. Well, leaf
[01:51:50] nodes are the nodes which cannot be
[01:51:52] segregated into further nodes,
[01:51:54] like in our previous example the points
[01:51:56] where we got 100% cherry or 100% green mango.
[01:52:00] Those nodes were leaf nodes. So
[01:52:03] we cannot segregate them any
[01:52:04] further. It's the last point of that
[01:52:06] particular branch or that particular
[01:52:08] node. Next is splitting. Well, splitting
[01:52:11] is dividing the root node or a sub-node
[01:52:13] into different parts on the basis of
[01:52:14] some condition. Like in our example, we
[01:52:17] divided the tree on the basis of two
[01:52:18] conditions. The first one was whether the
[01:52:20] diameter is greater than or equal to three,
[01:52:23] and the second one was whether the color
[01:52:25] is yellow or not. Next is the branch
[01:52:28] or subtree. Well, a branch or subtree
[01:52:30] is formed by splitting the tree or a
[01:52:32] node. Next is pruning. Well, pruning is
[01:52:36] just the opposite of splitting. Pruning is
[01:52:38] basically removing unwanted branches from
[01:52:40] the tree. So if you want to
[01:52:41] reach a particular scenario, you
[01:52:43] would remove all the unneeded
[01:52:45] decisions and conditions in
[01:52:47] between. Okay. So this is the concept
[01:52:48] of pruning. Next are the parent node and child
[01:52:51] node. Well, the root node is a parent node,
[01:52:53] and all the other nodes branching from it
[01:52:55] are known as child nodes. Okay. So if you have any
[01:52:58] doubts up to here, please
[01:53:00] reach out to us. Okay. So this was all
[01:53:03] about decision tree terminology. Now
[01:53:05] let's move ahead and see how we can
[01:53:07] create a decision tree. So this is our
[01:53:10] sample data set. Our data set consists
[01:53:12] of outlook, temperature, humidity, windy
[01:53:15] and play. So based on outlook,
[01:53:17] temperature, humidity and windy
[01:53:19] condition, we have to predict whether a
[01:53:22] person can play or not. Okay. So how
[01:53:25] will we do that? To answer
[01:53:27] this question, I'll be creating a decision
[01:53:29] tree for it. Now in order to create a
[01:53:31] decision tree, the very first thing that
[01:53:33] we have to do is pick
[01:53:36] one of the attributes. So now the
[01:53:38] question arises: which one among outlook,
[01:53:40] temperature, humidity, and windy should
[01:53:42] you pick first? The answer is to
[01:53:44] determine the attribute that best
[01:53:46] classifies the training data. But how do
[01:53:48] we choose the best attribute, or how does
[01:53:51] a tree decide where to split?
[01:53:54] In order to split
[01:53:56] the tree we have various concepts. Let's
[01:53:58] have a look at them one by one. So at
[01:54:00] number one we have entropy. Entropy
[01:54:03] measures the randomness in the data. It
[01:54:05] is a metric which measures impurity.
[01:54:07] Computing it is the very first step in
[01:54:09] building a decision tree. Now once you
[01:54:11] find the entropy, your next task is
[01:54:13] to calculate the information gain.
[01:54:15] The information gain is the decrease in
[01:54:17] entropy after a data set is split on the
[01:54:19] basis of an attribute. Constructing a
[01:54:21] decision tree is all about finding the
[01:54:24] attribute that returns the highest
[01:54:25] information gain. Okay. And the
[01:54:27] attribute with the highest information
[01:54:29] gain will be the attribute which is
[01:54:31] selected as the root node, or the node
[01:54:33] from where we'll split. Now once you
[01:54:35] calculate the information gain, the next
[01:54:37] thing that you'll calculate is the Gini
[01:54:39] index. Now that your data set is
[01:54:41] divided, you have to check whether
[01:54:43] the split data is pure or not. For
[01:54:46] checking the purity of the data, we have
[01:54:48] a measure called the Gini index. It is a
[01:54:50] measure of impurity or purity used in
[01:54:53] building a decision tree. Okay. And
[01:54:55] finally we have reduction in variance.
[01:54:58] Well, reduction in variance is a
[01:54:59] criterion used for a continuous target
[01:55:01] variable, that is, for regression problems. The
[01:55:03] split with the lower variance is selected as
[01:55:05] the criterion to split the population. So
[01:55:08] as I said earlier the very first thing
[01:55:10] in order to construct a decision tree is
[01:55:12] to calculate its entropy. So this is the
[01:55:15] formula for entropy. Entropy of S, the total
[01:55:18] sample space, equals minus probability of yes
[01:55:21] multiplied by log base 2 of probability of yes,
[01:55:24] minus probability of no multiplied by
[01:55:26] log base 2 of probability of no. So
[01:55:29] if you take a look at the graph and the
[01:55:31] formula, from here you can say that
[01:55:34] if the number of yes equals the number of no, that
[01:55:36] is, the probability of yes in the total sample space
[01:55:38] becomes 0.5, that is this point. Okay. So
[01:55:42] here entropy becomes one, and if it
[01:55:45] contains all yes or all no, then in that
[01:55:47] case the probability of yes in the total sample space
[01:55:49] is one or zero. If it consists of all
[01:55:52] yes or all no, that is, all one or all zero, then in
[01:55:56] that case the entropy is zero. Okay. Now
[01:55:59] let me just show you what I just told
[01:56:00] you. So the formula for entropy was
[01:56:03] minus of probability of yes times of log
[01:56:06] of probability of yes minus probability
[01:56:09] of no multiplied by log of probability
[01:56:11] of no. So now when probability of yes
[01:56:14] equal probability of no that is total
[01:56:16] sample space consist of equal number of
[01:56:19] yes and no. So in that case both would
[01:56:22] have the equal probability of 0.5. So if
[01:56:25] you put 0.5 in the formula above you'll
[01:56:28] get minus 0.5 * log2(0.5) minus
[01:56:32] 0.5 * log2(0.5). Okay, which will
[01:56:36] evaluate to 1. So here it proves that
[01:56:39] entropy for the case when probability of
[01:56:42] yes equal probability of no would be 1.
[01:56:46] Or you can say that there is maximum
[01:56:48] randomness in the data set when
[01:56:50] probability of yes equals probability of
[01:56:52] no. Let's take another case where we
[01:56:54] have all yes. Okay. So probability of
[01:56:57] yes equal 1. Our total sample space
[01:57:00] consists only of yes. Okay. So for that
[01:57:03] we'll get the formula as E(S) equals minus 1
[01:57:07] * log base 2 of 1, which is nothing
[01:57:10] but zero. At this point, when the
[01:57:12] probability of yes is 1 then in that
[01:57:15] case the total entropy that you would
[01:57:17] get is zero. Now what if the probability
[01:57:20] of no is one that is your entire data
[01:57:23] set only consist of no there is no yes
[01:57:26] in the entire data set. So similarly you
[01:57:28] will get the value of entropy as zero.
[01:57:31] Okay I hope I made my point clear. So
[01:57:34] let's move ahead. So now coming back to
[01:57:36] our data set. So our data set consists
[01:57:38] of 14 different instances. Out of them
[01:57:41] we have nine yes and five no. So now if
[01:57:44] you add the value of yes and no to the
[01:57:46] formula, it results in minus 9/14 *
[01:57:50] log2(9/14) minus 5/14 * log2(5/14). So
[01:57:56] it boils down to 0.41 + 0.53 which
[01:57:59] equals 0.94. So here the total entropy
[01:58:02] for my entire sample space is 0.94.
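The three entropy cases above (equal yes and no, all yes, and the 9 yes / 5 no data set) can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and its two count arguments are illustrative choices, not something shown in the video.

```python
import math

def entropy(yes: int, no: int) -> float:
    """Binary entropy (log base 2) of a sample with the given class counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a class with zero members contributes nothing (0 * log 0 is taken as 0)
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(7, 7))            # equal number of yes and no: maximum randomness
print(entropy(14, 0))           # all yes: a pure sample
print(round(entropy(9, 5), 3))  # the 14-row data set with nine yes and five no
```

The last call reproduces the 0.94 quoted for the whole data set.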
[01:58:06] Fine. So now that we have calculated the
[01:58:08] entropy for our data set, our next task
[01:58:10] would be to calculate the information
[01:58:12] gain. So information gain equals the total entropy
[01:58:15] minus the weighted average of the
[01:58:17] entropy of each value of the attribute. So in order to
[01:58:19] find the root node you have to calculate
[01:58:21] information gain for each attribute and
[01:58:24] the attribute having the maximum value
[01:58:26] of information gain will be the
[01:58:27] attribute that will be selected as the
[01:58:30] root node. Okay. So let's do it one by
[01:58:33] one. So starting with outlook.
[01:58:36] So if we choose outlook we have three
[01:58:38] different values: sunny, overcast and
[01:58:41] rainy. In sunny we have two yes and
[01:58:44] three nos. In case of overcast we have
[01:58:46] all yes. And in case of rainy we have
[01:58:49] three yes and two no. So moving on ahead
[01:58:52] we'll calculate the entropy when outlook
[01:58:54] equals sunny, when outlook equal
[01:58:56] overcast and when it equals rainy. So
[01:58:59] we'll calculate all three individual
[01:59:01] entropies. So starting with when outlook
[01:59:05] equals sunny. So in that case we had two
[01:59:07] yes and three nos. So if we put those
[01:59:09] values in the formula for calculating
[01:59:11] the entropy it would look something like
[01:59:13] this and we'll get the result as 0.971.
[01:59:17] So in case of overcast we'll get the
[01:59:19] value as zero as the probability of yes
[01:59:23] in case of overcast is one. We have all
[01:59:25] yes over there right and the probability
[01:59:27] of no is zero as it is not available
[01:59:30] there in the data set. Right? So from
[01:59:32] here you can see that we calculated the
[01:59:34] entropy for outlook equal to overcast as
[01:59:38] zero. Now entropy for outlook equal to
[01:59:40] rainy: in case of rainy we had three yes and two
[01:59:44] no. So if you put all those values in the
[01:59:46] formula so you'll get the result as
[01:59:48] 0.971. Okay. So now we have calculated
[01:59:51] entropy for each value of outlook.
[01:59:53] So the next thing we'll be calculating is the
[01:59:55] information from outlook. So for
[01:59:57] calculating the information we had the
[01:59:59] formula: the weight of each value multiplied by
[02:00:01] its entropy, summed over all values. So total
[02:00:04] information from the outlook equals so
[02:00:06] summation of all the information from
[02:00:09] different features starting with sunny.
[02:00:12] So in case of sunny the weight
[02:00:13] is 5/14. So how did we get this 5/14
[02:00:16] value? Like in case of sunny we had two
[02:00:19] yes and three nos. So 2 + 3 is 5 divided
[02:00:22] by total number of data points that is
[02:00:24] 14. So 5/14 is the weight for
[02:00:27] sunny multiplied by the entropy value of
[02:00:30] sunny that is 0.971 plus 4/14. How did we get
[02:00:35] four? So we had four yes in case of
[02:00:38] overcast. Right? So here we got the
[02:00:40] weight as 4/14 multiplied by
[02:00:43] the entropy which was 0, plus again 5/14, from 3
[02:00:47] + 2, * 0.971 for rainy. So if you calculate all
[02:00:51] these values this results in 0.693.
[02:00:54] So the total information from the
[02:00:55] outlook is 0.693.
[02:00:58] Now we'll calculate the information
[02:00:59] gained from outlook. So total
[02:01:01] information gained from outlook is total
[02:01:04] entropy minus information from outlook
[02:01:08] which equals to 0.94 minus of 0.693
[02:01:12] which results to 0.247.
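The outlook calculation can be sketched in Python as follows. The helper names `entropy` and `information_gain` are assumptions made for illustration; the (yes, no) counts are the ones read out above.

```python
import math

def entropy(counts):
    """Entropy (log base 2) of a tuple of class counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = sum(parent)
    information = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - information

# (yes, no) counts for each value of outlook, as given in the walkthrough
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

gain_outlook = information_gain((9, 5), outlook.values())
print(round(gain_outlook, 3))
```

The printed value matches the 0.247 obtained by hand.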
[02:01:14] Okay. Now similarly we'll calculate the
[02:01:17] information gain for each attribute. We calculated
[02:01:19] it for outlook, temperature,
[02:01:22] humidity.
[02:01:23] So as you can see in case of outlook we
[02:01:26] got an information gain of 0.247, in case
[02:01:29] of temperature the information gain was
[02:01:30] 0.029 and in case of humidity the
[02:01:33] information gain was 0.152. So here we
[02:01:36] can see that our outlook is having the
[02:01:38] maximum information gain. So here we can
[02:01:41] conclude two things. First our outlook
[02:01:44] would be our root node as it is having
[02:01:46] the maximum value of information gain.
[02:01:48] Okay. And secondly, we won't use
[02:01:51] temperature to create a decision tree or
[02:01:53] we can avoid using temperature in our
[02:01:55] decision tree as the information gained
[02:01:57] from the temperature is very low. Okay.
[02:01:59] So here we got our root node as outlook.
[02:02:05] We can split it into sunny, overcast and
[02:02:07] rainy. But now sunny and rainy both have
[02:02:10] some numbers of yes and no. Right?
[02:02:12] Overcast is the only value which
[02:02:14] consists of only yes. So our overcast
[02:02:17] would be treated as a leaf node. But
[02:02:19] again we have to decide which node would
[02:02:22] be used when outlook is sunny and which
[02:02:24] node would be used when outlook is
[02:02:26] rainy. So again we have to decide which
[02:02:28] node would come when outlook is sunny or
[02:02:31] outlook is rainy. Okay. So let's see how
[02:02:33] we'll do that. So again we'll select it
[02:02:36] one by one. Next we'll calculate the
[02:02:38] entropy for windy. Okay. So in case of
[02:02:41] windy you can see we have two different
[02:02:42] values, false and true. In case of
[02:02:45] false we have six yes and two no. In
[02:02:48] case of true we have three yes and three
[02:02:51] nos. So next we'll calculate the entropy
[02:02:53] of windy equal true which we got as one.
[02:02:56] Why? Cuz it has equal number of yes and
[02:02:59] no. Okay. Then in this case the
[02:03:01] probability of yes equal probability of
[02:03:03] no. So remember the first case where I
[02:03:05] told you about having an equal
[02:03:07] number of yes and no. So for that case
[02:03:09] the total entropy that you'll calculate
[02:03:12] you'll get it as one. Okay. And for
[02:03:15] windy equal false, you'll get it as
[02:03:17] 0.811
[02:03:18] using the same formula again. Okay. So
[02:03:20] if you calculate the information from
[02:03:22] windy, you'll get it as 8/14 * 0.811.
[02:03:27] This is for windy equal to false, plus 6/14 *
[02:03:31] 1 for windy equal to true. And if you
[02:03:33] sum it, you'll get the value as
[02:03:35] 0.892.
[02:03:37] Now information gained from windy is
[02:03:39] total entropy minus information from
[02:03:42] windy which is 0.94 minus 0.892 that is
[02:03:46] 0.048.
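As a rough check, the windy numbers can be recomputed and the root node picked by taking the largest gain. The sketch below assumes the same hypothetical `entropy` helper as before; the gains for outlook, temperature and humidity are simply the values quoted in the walkthrough, not recomputed here.

```python
import math

def entropy(counts):
    """Entropy (log base 2) of a tuple of class counts; zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

total_entropy = entropy((9, 5))           # whole data set: nine yes, five no

# (yes, no) counts for each value of windy, as given above
windy = {"false": (6, 2), "true": (3, 3)}
info_windy = sum(sum(c) / 14 * entropy(c) for c in windy.values())
gain_windy = total_entropy - info_windy

# Gains quoted in the walkthrough for the other attributes (copied, not recomputed)
gains = {
    "outlook": 0.247,
    "temperature": 0.029,
    "humidity": 0.152,
    "windy": round(gain_windy, 3),
}
root = max(gains, key=gains.get)          # attribute with the highest information gain
print(round(info_windy, 3), gains["windy"], root)
```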
[02:03:48] Okay. So similarly you will calculate
[02:03:50] for other attributes such as humidity and
[02:03:52] temperature and finally you will get
[02:03:54] this decision tree as the final output.
[02:03:56] So this is the final decision tree that
[02:03:58] you get. Okay. So let's see how does it
[02:04:01] answer our query and see the scenario
[02:04:03] where we can play. So first outlook. So
[02:04:05] if outlook is sunny and the humidity is
[02:04:08] normal then we can play or if outlook is
[02:04:11] overcast then we can definitely play. No
[02:04:14] further condition attached to it. Or if
[02:04:16] the outlook is rainy and
[02:04:19] the wind is not strong, it's a
[02:04:21] weak wind, then in that case also we can
[02:04:24] play. Okay. So this was about how you
[02:04:26] calculate entropy and information gain
[02:04:29] to create a decision tree using the CART
[02:04:31] method, that is, classification and
[02:04:33] regression trees. Here's a quiz question for
[02:04:36] you guys. What is supervised learning?
[02:04:39] Your options are a machine learning
[02:04:41] technique where the model learns from
[02:04:42] labeled data to make predictions or
[02:04:44] decisions without human intervention. A
[02:04:47] machine learning technique where the
[02:04:49] model learns from unlabelled data to
[02:04:51] make predictions or decisions with high
[02:04:53] accuracy. A machine learning technique
[02:04:55] that doesn't involve data labeling
[02:04:57] making it highly efficient. Or a machine
[02:05:00] learning technique used exclusively for
[02:05:02] image recognition. Please mention your
[02:05:04] answers in the comment section.
[02:05:06] >> Decision tree regressor that is we'll be
[02:05:08] using decision tree as a regression
[02:05:09] algorithm and we'll be implementing this
[02:05:11] decision tree algorithm on top of the
[02:05:13] Boston data set. So before we go ahead
[02:05:16] and load up the data set, let me
[02:05:17] actually import the requisite packages.
[02:05:19] So we would require the NumPy, pandas and
[02:05:22] matplotlib packages, and with the help
[02:05:24] of the read_csv function, which comes
[02:05:26] from the pandas library, I load up the
[02:05:28] Boston.csv data set. Right. So now that
[02:05:32] we have loaded up the data set, let me
[02:05:34] have a glance at the first few records
[02:05:36] of this data set. So I'll just type in
[02:05:37] boston.head() and these are all of the
[02:05:40] columns present in this data set. All
[02:05:42] right. Now for our first model, we'll be
[02:05:44] taking MEDV as our dependent variable and
[02:05:48] RM to be our independent variable. So RM
[02:05:51] basically stands for the average number
[02:05:53] of rooms per dwelling and MEDV denotes
[02:05:55] the median value of the house. So
[02:05:58] we basically want to understand how does
[02:06:00] the median price value of the house
[02:06:02] change with respect to number of rooms
[02:06:04] per dwelling. And I'll go ahead and make
[02:06:07] a scatter plot between these two
[02:06:08] columns. So I'll map RM onto the x-axis
[02:06:12] and I will map MEDV onto the y-axis. Right?
[02:06:16] So let me click on run and this is what
[02:06:19] we get. So what we see is as the average
[02:06:21] number of rooms increases the median
[02:06:24] value of the price of the house also
[02:06:26] increases. And this again is quite
[02:06:28] intuitive, isn't it? So if the number of
[02:06:30] rooms increases, the size of the house
[02:06:32] would also increase. And this would in
[02:06:33] turn increase the price of the house.
[02:06:36] All right. So now that we've done a bit
[02:06:38] of visualization, let me actually go
[02:06:40] ahead and extract the features and the
[02:06:42] target from the original data frame. So
[02:06:45] I'll extract this RM column and store it
[02:06:47] in the X object. Similarly, I'll extract
[02:06:50] the MEDV column and store it in the Y
[02:06:52] object. So X basically denotes my
[02:06:55] feature column and Y basically denotes
[02:06:57] my target column. Right? Now I'll go
[02:06:59] ahead and divide this data set into
[02:07:01] training and testing sets. So for that
[02:07:04] purpose I'll be importing the train_test_split
[02:07:06] function from sklearn.model_selection
[02:07:08] and I'll be passing in all of
[02:07:10] these parameters. So X and Y I'll just
[02:07:12] pass in features and target and I'll set
[02:07:15] the test size to be equal to 0.20. And
[02:07:18] this means that 20% of the records would
[02:07:20] be in the test set and the rest of the
[02:07:22] 80% records would be in the train set.
[02:07:25] All right. So we have our testing and
[02:07:27] training sets ready. Now it's time to
[02:07:29] finally build the model. So I'll be
[02:07:31] importing DecisionTreeRegressor from
[02:07:34] sklearn.tree. So again I'm restating it
[02:07:36] guys. So even though we're using the
[02:07:38] decision tree algorithm, we are able to
[02:07:40] perform regression with this. Right? So
[02:07:43] now I'll create an instance of this. So
[02:07:45] I'll just call this class, DecisionTreeRegressor,
[02:07:47] and I'll name that
[02:07:48] instance regressor and I'll fit
[02:07:51] this model on top of the train set. So
[02:07:53] the parameters which I'm passing inside
[02:07:55] this are X_train and y_train. So now
[02:07:58] that I've built the model let me go
[02:07:59] ahead and predict the values on top of
[02:08:01] the test set. So I'll type regressor.predict
[02:08:04] and the parameter which I'm
[02:08:06] passing inside this is X_test. Right. So
[02:08:09] now we have also predicted the values.
[02:08:11] Now it's time to find out the RMSE value
[02:08:14] or the root mean squared error value. And
[02:08:16] we'll be importing mean_squared_error
[02:08:18] from sklearn.metrics.
[02:08:20] And I'll be using this function. And I'll
[02:08:22] just pass in y_pred and y_test. So y_pred
[02:08:25] contains all of the predicted
[02:08:27] values. y_test contains all of the
[02:08:29] actual values. And I want to find out
[02:08:30] the error in prediction. Now since I've
[02:08:32] imported mean squared error, I'd
[02:08:34] actually have to find out the square
[02:08:35] root of it. So I'll type in np.sqrt
[02:08:38] and I'll pass in the value
[02:08:40] of MSE into this and this will give me
[02:08:42] the root mean squared error. So the root
[02:08:44] mean squared error for the model which we
[02:08:46] built is 7.102.
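The steps just described (split, fit, predict, RMSE) can be sketched end to end as below. Because the Boston.csv file used in the video is not reproduced here, this sketch swaps in a small synthetic stand-in: the `rooms` and `price` arrays, the noise level and the `random_state` values are all assumptions, so the RMSE it prints will not match the 7.102 quoted above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the RM (rooms) and MEDV (median value) columns -- an assumption
rng = np.random.default_rng(0)
rooms = rng.uniform(4.0, 9.0, size=200)
price = 9.0 * rooms - 34.0 + rng.normal(0.0, 5.0, size=200)

X = rooms.reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = price

# 20% of the records go to the test set, the remaining 80% to the train set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)        # root of the mean squared error
print(rmse)
```

With the real data, `X` and `y` would instead be the RM and MEDV columns taken from the loaded data frame.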
[02:08:49] Now again the model which we've built
[02:08:51] just has one independent variable.
[02:08:54] So this is a simple model. Now we'll go
[02:08:56] ahead and add multiple independent
[02:08:58] variables. So this time our features are
[02:09:01] the RM, LSTAT and AGE columns. So in the
[02:09:04] first model we just had one independent
[02:09:06] variable. And in the second model which
[02:09:08] we are building we'll be having three
[02:09:10] independent variables which are RM,
[02:09:12] LSTAT and AGE. And again our dependent
[02:09:15] variable is the MEDV column. And I'll
[02:09:17] store them in X and Y again. Now I'll go
[02:09:20] ahead and divide this data set into
[02:09:22] train and test split. And another
[02:09:24] difference which I'm making over here is
[02:09:26] I'm setting the test size to be equal to
[02:09:28] 0.30. So this means that 30% of the
[02:09:31] records would be in the test set and the
[02:09:33] rest 70% of the records would be in the
[02:09:35] training set. Right? I'll click on run
[02:09:37] and then again I'll import the decision
[02:09:39] tree regressor and I will fit the model
[02:09:41] on top of X train and Y train and I'll
[02:09:45] predict the values on top of X test. Now
[02:09:47] again I'll find out the RMSE value.
[02:09:50] Right? So this time the RMSE value comes
[02:09:53] to be around 5.509.
[02:09:55] So now let me compare it with the RMSE
[02:09:57] value of the first model. So the RMSE
[02:10:00] value of the first model is 7.10 and the
[02:10:03] RMSE value of the second model is 5.50.
[02:10:06] This basically means that the second
[02:10:08] model produces less error or in other
[02:10:10] words second model is better than the
[02:10:13] first model. Right? So that was an
[02:10:14] example where we use decision tree as a
[02:10:17] regression algorithm. Now we'll go ahead
[02:10:19] and use decision tree as a classifier
[02:10:21] and we'll be building this decision tree
[02:10:23] classifier on top of the Iris data set.
[02:10:26] So over here I'll type in pd.read_csv
[02:10:29] and I'm loading up the iris.csv
[02:10:32] file and now I'll have a glance at the
[02:10:34] head of this and these are all the
[02:10:36] columns. So I've got sepal length, sepal
[02:10:38] width, petal length, petal width and the
[02:10:40] species column. So now what I basically
[02:10:42] want to understand is what is the species
[02:10:45] of this flower based on these four
[02:10:48] columns over here. So these four columns
[02:10:50] would be my independent variables and
[02:10:53] species would be my dependent variable.
[02:10:55] So that is what I'm doing over here.
[02:10:57] From this entire Iris data frame, I'll
[02:10:59] extract these four columns and store
[02:11:01] them in the X object and I'll just
[02:11:03] extract the species column and I'll
[02:11:05] store it in the Y object. All right. Now
[02:11:08] I'll go ahead and divide this data set
[02:11:10] into training and testing set. So again
[02:11:12] I'd have to import the train test split
[02:11:14] and I'm setting the test size to be
[02:11:16] equal to 0.30.
[02:11:19] Now since we are building a classifier,
[02:11:22] we'll have to import DecisionTreeClassifier
[02:11:24] from sklearn.tree
[02:11:26] and I will create an instance of this.
[02:11:29] So I'll just use this class, DecisionTreeClassifier,
[02:11:31] and I'll store the
[02:11:33] result in this new object. I'll name
[02:11:35] that object to be classifier and I will
[02:11:38] fit this model on top of X train and Y
[02:11:41] train. Right? So we have successfully
[02:11:43] built the model on top of the train set.
[02:11:45] Now we have to go ahead and predict the
[02:11:47] values on top of the test set. So I'll
[02:11:49] type in classifier.predict and the
[02:11:51] parameter which I'm passing in this is X
[02:11:54] test. So this is also done. Now let me
[02:11:56] go ahead and find out how the prediction
[02:11:58] has been done. So I'll start off by
[02:11:59] creating a confusion matrix. So I'll
[02:12:02] import confusion matrix from
[02:12:03] sklearn.metrics
[02:12:05] and this takes in y_test and y_pred. So
[02:12:08] y_test comprises all of the actual
[02:12:10] values and y_pred comprises all of
[02:12:13] the predicted values. So I'll click on
[02:12:15] run over here. Now this main diagonal
[02:12:17] which you see, running from the top left, tells
[02:12:20] you all of the values which have been
[02:12:22] correctly predicted and the rest of the
[02:12:24] values have been misclassified. So what we
[02:12:27] see is there is just one value or one
[02:12:30] record which has been misclassified. The
[02:12:33] rest of the records have been correctly
[02:12:34] classified. Right, now I'll go ahead and
[02:12:37] calculate the accuracy score, and the
[02:12:39] accuracy score comes out to be 97.77%.
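A comparable classifier run can be sketched with the copy of the Iris data that ships with scikit-learn, used here as an assumed stand-in for the iris.csv file in the video. The `random_state` values are also assumptions, so the exact confusion matrix and accuracy may differ from the figures quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled Iris data: four measurement columns as features, species as target
iris = load_iris()
X, y = iris.data, iris.target

# 30% of the 150 records (45 rows) go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)
print(accuracy)
```

The accuracy is the sum of the main diagonal divided by the number of test records, which is why one misclassified record out of 45 gives the 97.77% mentioned above.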
[02:12:44] Right? So this is how we can use a
[02:12:45] decision tree algorithm. So the
[02:12:47] confusion matrix shows the ways in which
[02:12:49] a classification model is confused when
[02:12:52] it makes predictions. It is basically a
[02:12:54] summary of prediction results on a
[02:12:56] classification problem. The key to
[02:12:58] a confusion matrix is that it summarizes the
[02:13:00] counts of correct and incorrect
[02:13:01] predictions. So the image shown on your
[02:13:03] screen represents a confusion matrix.
[02:13:06] Let's see what exactly does it mean. But
[02:13:08] before that, let me just tell you how to
[02:13:10] create a confusion matrix. This would
[02:13:12] make things more clear for you. So let's
[02:13:14] see how. So for creating a confusion
[02:13:16] matrix, you'd be needing a test data set
[02:13:18] or a validation data set with expected
[02:13:21] outcome values. Then make a prediction
[02:13:23] for each row in your test data set. Then
[02:13:25] from the expected outcomes and
[02:13:27] predictions, count the number of correct
[02:13:29] predictions for each class and the number
[02:13:32] of incorrect predictions for each class,
[02:13:34] organized by the class that was
[02:13:36] predicted. Okay, let's see what exactly
[02:13:38] does it mean. So here's an example. We
[02:13:41] have some expected output and a
[02:13:42] predicted output for that. So from here
[02:13:44] you can see that all the red colored
[02:13:46] results are the incorrectly predicted
[02:13:48] values and the green ones are the
[02:13:50] correct ones. So in total we have seven
[02:13:53] correct predictions out of 10. Okay. So
[02:13:56] from here you can say that the accuracy
[02:13:58] of your model is 70%. Now here men
[02:14:01] classified as men are three, 1 2 and
[02:14:05] three, and women classified as women are
[02:14:07] four, 1 2 3 and four, and now men
[02:14:11] classified as women: men as women one and
[02:14:14] men as women two, okay, so two, and women
[02:14:17] classified as men: 1. Now if you create a
[02:14:20] confusion matrix out of it, you'll get
[02:14:22] something like this. Men classified as
[02:14:24] men 3, men classified as women 2, women
[02:14:27] classified as men 1 and women classified
[02:14:30] as women is four. So from here you can
[02:14:32] say that total actual men 3 + 2 is five.
[02:14:36] Total actual women 1 + 4 again five and
[02:14:39] total correct values men classified as
[02:14:42] men and women classified as women that
[02:14:44] is 3 + 4 it's 7. So from here you can
[02:14:47] say that there are more errors while
[02:14:49] predicting men as women rather than
[02:14:51] predicting women as men. Okay. So this
[02:14:54] was about how you can calculate a
[02:14:55] confusion matrix. Now let's come back
[02:14:58] and see how to interpret a given
[02:15:00] confusion matrix. This is the sample of
[02:15:03] a confusion matrix. So here we have
[02:15:05] created a confusion matrix for a fire
[02:15:07] alarm. So this represents an actual
[02:15:09] fire. This represents no actual fire.
[02:15:12] Here predicted fire positive and here
[02:15:15] predicted fire as negative. So if the
[02:15:17] alarm goes on in case of fire so it's a
[02:15:20] true positive event. The alarm goes on
[02:15:23] and there is no fire so it's a false
[02:15:25] positive event. There is no alarm in
[02:15:28] case of fire so it's a false negative
[02:15:30] event. And there is no alarm and there
[02:15:33] is no fire then that means it's a true
[02:15:36] negative event. Okay. So let me just
[02:15:38] explain you this example. This should
[02:15:40] make things more clear to you. So actual
[02:15:43] fire and predicted fire. So total true
[02:15:45] positive events we have 40 and total
[02:15:48] false negative events we have 10. So from
[02:15:50] here you can say that the total number of
[02:15:52] actual fires was 40 + 10, that is
[02:15:56] 50. Okay. Here we have false positive
[02:15:59] events as five and true negative events as
[02:16:02] 95. So the total number of times there
[02:16:04] was no fire was 5 + 95, that is
[02:16:08] 100. And this one is the predicted fire
[02:16:11] or not. So true positive plus false
[02:16:13] positive, that is 40 + 5. How many times
[02:16:16] the machine positively predicted the
[02:16:18] fire? So that is 40 + 5, 45. And how many
[02:16:22] times the machine predicted
[02:16:24] no fire? That is 10 + 95, that
[02:16:26] is 105. And total number of events that
[02:16:29] is 50 + 100 or 45 + 105 is 150. So we
[02:16:34] have mentioned n equal 150 up here. that
[02:16:37] is total number of events. Okay. So this
[02:16:40] is how you interpret a confusion matrix.
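The totals for the fire alarm matrix can be tallied in a few lines. This sketch uses the four counts given above with the standard definitions of the cells; the variable names, and the extra accuracy, precision and recall figures, are additions for illustration and are not computed in the video.

```python
# The four cells of the fire alarm confusion matrix, as given above
tp, fn = 40, 10   # fire: alarm went off / alarm stayed silent
fp, tn = 5, 95    # no fire: alarm went off / alarm stayed silent

actual_positive = tp + fn        # row total for actual fire
actual_negative = fp + tn        # row total for no fire
predicted_positive = tp + fp     # column total for predicted fire
predicted_negative = fn + tn     # column total for predicted no fire
n = actual_positive + actual_negative

# Common summary metrics (not computed in the video)
accuracy = (tp + tn) / n
precision = tp / predicted_positive
recall = tp / actual_positive

print(actual_positive, actual_negative, predicted_positive, predicted_negative, n)
print(accuracy, round(precision, 3), recall)
```

Both ways of adding up the table, by rows or by columns, give the same n of 150.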
[02:16:42] So let's move ahead. So now let me just
[02:16:45] show you in my Jupyter notebook how you
[02:16:47] can create a confusion matrix. Let me
[02:16:49] just open my Jupyter notebook. So this
[02:16:52] is my Jupyter notebook and what we are
[02:16:54] going to do is create a confusion matrix
[02:17:00] in Python.
[02:17:03] So the very first thing that I'll be
[02:17:05] doing up here is importing the required
[02:17:07] libraries. So from sklearn.metrics
[02:17:12] I'll be importing confusion_matrix.
[02:17:17] Next let's create some expected value.
[02:17:20] Let's say expected equal. So let's add
[02:17:24] some values in it like 1 1 0 1 0 0 1 0 0
[02:17:33] and zero. Now this is my expected value.
[02:17:36] Now let's create some predicted values
[02:17:38] for that. So predicted
[02:17:41] equals first it's predicting correct.
[02:17:45] Next let's say 0 0 0 1 0 0 0 1 0 0 1.
[02:17:53] So this is a predicted value. Now let's
[02:17:56] calculate the confusion matrix. So let's
[02:17:59] say results equal confusion
[02:18:03] matrix.
[02:18:05] Inside this we'll pass our expected and
[02:18:07] predicted value expected,
[02:18:10] predicted
[02:18:11] and print the result.
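The notebook cell being typed here would look roughly like the sketch below. The `expected` list is the one read out above. The `predicted` list does not transcribe cleanly, so the values used here are an assumption, chosen so that the cell reproduces the 2x2 result discussed in the walkthrough.

```python
from sklearn.metrics import confusion_matrix

# Expected (actual) labels, as read out in the walkthrough
expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Predicted labels: assumed values, picked to be consistent with the quoted result
predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

# Rows are expected classes (0, then 1); columns are predicted classes (0, then 1)
results = confusion_matrix(expected, predicted)
print(results)
# [[4 2]
#  [1 3]]
```

In scikit-learn's layout the first row holds the records whose expected value is 0 and the second row those whose expected value is 1.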
[02:18:14] That's it. Let's execute it. So here you
[02:18:17] got the result as 4 2 and 1 3. So what
[02:18:21] does it mean? So first we have is four.
[02:18:23] So 0 predicted as 0 is four times: 1 2 3
[02:18:28] and 4. So 0 predicted as 1 is two times:
[02:18:32] 0 predicted as 1, one; 0 predicted as 1, two.
[02:18:36] Okay. Next is one predicted as zero. So
[02:18:40] one predicted as zero is just once here.
[02:18:43] Okay. And next is one predicted as one
[02:18:46] that is three times. One predicted as 1
[02:18:48] 1 2 and three. So this is a confusion
[02:18:51] matrix. And what we can say from here?
[02:18:54] So the total number of correct
[02:18:55] predictions made by the machine is 4 + 3, that is 0
[02:18:58] classified as 0 and 1 classified as 1.
[02:19:01] Okay. And the total number of incorrect
[02:19:03] predictions is 2 + 1, that is three. So we
[02:19:06] have seven correct predictions and three
[02:19:09] incorrect predictions. Okay. And the
[02:19:11] total number of times the
[02:19:12] actual value was zero is 4 + 2,
[02:19:15] that is six times. And the total number of
[02:19:17] times the actual value was
[02:19:19] 1 was four times. Okay. So from
[02:19:22] here we can say that our machine
[02:19:24] predicts the result seven times correct
[02:19:26] and three times wrong. So the accuracy
[02:19:28] of our machine is 70%. So this was all
[02:19:31] about how you can create a confusion
[02:19:33] matrix. Well, the naive Bayes classifier is a
[02:19:36] commonly used algorithm in machine
[02:19:38] learning. It is a classification
[02:19:40] algorithm which is mainly based on Bayes'
[02:19:42] theorem. According to the naive Bayes algorithm,
[02:19:44] the presence of a particular feature in
[02:19:46] a class is completely unrelated to, or is
[02:19:48] independent of, the presence of any other
[02:19:50] feature. For example, a fruit may be
[02:19:53] considered to be an apple if it is red,
[02:19:55] round and about 3 inches in diameter. A
[02:19:59] naive Bayes classifier considers each of
[02:20:01] these features to contribute
[02:20:03] independently to the probability that
[02:20:06] the fruit is an apple regardless of any
[02:20:09] correlation between the features. But
[02:20:12] these features are not always
[02:20:14] independent and this is one of the
[02:20:16] disadvantage of the navebased algorithm
[02:20:18] and this is the reason why it is called
[02:20:21] nave because it makes assumption that
[02:20:24] may or may not be correct. So in a
[02:20:26] nutshell you can understand that this
[02:20:28] algorithm allows us to predict a class
[02:20:30] from a given set of features using the
[02:20:32] probability. So in the fruit
[02:20:34] example you can predict the class like
[02:20:37] whether the fruit is an apple, orange or
[02:20:39] banana based on its feature like its
[02:20:42] color, shape etc. Okay, so let's move
[02:20:45] ahead. Now before we move ahead and see
[02:20:47] how a naive Bayes classifier works, we need
[02:20:49] to understand the basics of it. Okay, so
[02:20:52] the very base of the naive Bayes
[02:20:53] classifier is conditional probability.
[02:20:56] Okay, so what exactly is this
[02:20:58] conditional probability? Well, it is
[02:21:00] used to calculate the probability of
[02:21:03] happening of the second event given that
[02:21:05] the first event has already happened.
[02:21:08] For example, drawing a second ace from a
[02:21:10] deck given that we already got the first
[02:21:13] ace or finding the probability of a
[02:21:15] disease given that you were already
[02:21:18] tested positive or finding the
[02:21:20] probability of liking Game of Thrones
[02:21:22] given that the person likes fiction.
[02:21:25] Okay, let's see what exactly it is. So
[02:21:28] basically you're defining two events
[02:21:30] over here. Event A is
[02:21:33] the event whose probability we are trying to
[02:21:34] calculate and event B is the condition
[02:21:37] that we know or the event that has
[02:21:39] already happened. So conditional
[02:21:41] probability is represented as
[02:21:43] P(A | B), read as A given B, which means the
[02:21:46] probability of the occurrence of event A
[02:21:49] given that B has already happened. So
[02:21:51] probability of A given B equals probability
[02:21:54] of A intersection B divided by
[02:21:56] probability of B. That is, the probability of
[02:21:58] the occurrence of both A and B divided by
[02:22:01] probability of B. Okay. Let's understand
[02:22:03] this with the help of an example. So
[02:22:05] suppose you have a jar containing six
[02:22:07] marbles out of which you have three
[02:22:10] black and three red. So what is the
[02:22:12] probability of getting a black given
[02:22:15] that the first one was black too. So
[02:22:17] let's see how we'll calculate it. So we
[02:22:19] have P of A as the probability of
[02:22:21] getting a black marble in the first
[02:22:23] turn. P of B is the probability of
[02:22:26] getting a black marble in the second
[02:22:28] turn. So probability of A, that is
[02:22:31] probability of getting the black marble
[02:22:32] in the first turn, is 3/6. Okay. And
[02:22:36] probability of B: how many marbles
[02:22:38] remain after we have taken one out? So
[02:22:40] we are left with two black marbles out of
[02:22:43] a total of five marbles. So the
[02:22:45] probability of again getting a black
[02:22:47] marble would be 2/5. Okay. And next is
[02:22:51] probability of A and B, or probability of
[02:22:54] A intersection B, which equals probability of A
[02:22:56] multiplied by this probability of B. Okay. So
[02:22:59] that is 3/6 * 2/5. 3/6 is nothing but
[02:23:03] half. So half multiplied by 2/5, that is
[02:23:06] 1/5. Okay. So probability of B given A
[02:23:10] equals probability of A intersection B
[02:23:13] upon probability of A, which equals
[02:23:16] 0.2 / 0.5, that is 0.4. Clear?
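The marble arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as spoken; the variable names are illustrative and are not part of the course material. Exact fractions are used so that 3/6, 2/5 and 1/5 show up without rounding.

```python
from fractions import Fraction

# Jar from the example: 3 black and 3 red marbles, drawn without replacement.
black, red = 3, 3
total = black + red

# P(A): the first marble drawn is black.
p_a = Fraction(black, total)                     # 3/6 = 1/2

# Chance of black on the second draw once one black marble is already out.
p_second_black = Fraction(black - 1, total - 1)  # 2/5

# P(A and B): both draws are black.
p_a_and_b = p_a * p_second_black                 # 1/5

# Conditional probability: P(B | A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)  # 1/2 1/5 2/5
print(float(p_b_given_a))           # 0.4
```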
[02:23:21] Fine. Let's take another example.
[02:23:24] Let's say John's favorite breakfast is
[02:23:26] cereal and his favorite lunch is pizza.
[02:23:29] So the probability of John having cereal
[02:23:31] for breakfast is 0.6.
[02:23:34] The probability of him having pizza for
[02:23:36] lunch is 0.5. And the probability of him
[02:23:40] having cereal for breakfast given that
[02:23:42] he eats pizza for lunch is 0.7.
[02:23:46] Okay. Now what if I want to know the
[02:23:49] probability of him having pizza given that he
[02:23:52] had a bowl of cereal for breakfast? So
[02:23:54] you have to calculate probability of B
[02:23:56] given A. That is, you have to calculate
[02:23:58] the reverse probability. So here in this
[02:24:01] example you already know the probability
[02:24:03] of him having cereal for breakfast. But
[02:24:06] after he has had cereal for breakfast, you
[02:24:08] have to calculate what is the
[02:24:10] probability of him having pizza for
[02:24:11] lunch. Okay. So now this is where
[02:24:15] Bayes' theorem comes into the picture. Bayes'
[02:24:17] theorem describes the probability of an
[02:24:19] event based on prior knowledge of
[02:24:21] the conditions that might be related to
[02:24:23] the event. In simple words, Bayes' theorem
[02:24:26] shows the relation between a conditional
[02:24:28] probability and its reverse form. If the
[02:24:31] conditional probability is probability
[02:24:33] of A given B, then you can use Bayes' rule to
[02:24:36] find the reverse probability, that is
[02:24:38] probability of B given A. Okay, let's
[02:24:41] see how. Here's a proof of Bayes' theorem.
[02:24:43] So according to the conditional probability
[02:24:45] formula, if you compute over here,
[02:24:48] probability of A given B equals
[02:24:50] probability of A intersection B upon
[02:24:52] probability of B, and probability of B
[02:24:55] given A equals probability of A
[02:24:57] intersection B upon probability of A.
[02:25:00] Correct? So from these equations we have
[02:25:02] probability of A intersection B common
[02:25:04] in both. So we can equate them, right? So
[02:25:08] probability of A intersection B equals
[02:25:10] probability of A given B multiplied by
[02:25:13] probability of B, which in turn equals
[02:25:15] probability of B given A multiplied by
[02:25:18] probability of A. So here if you reverse
[02:25:20] the conditional probability you'll get
[02:25:22] the formula as probability of B given A
[02:25:25] equals probability of A given B
[02:25:27] multiplied by probability of B divided by
[02:25:30] probability of A, and this is nothing but
[02:25:33] Bayes' theorem. So in this formula the
[02:25:35] probability of A given B is the
[02:25:38] probability of A being true given that B
[02:25:41] is already true. Probability of B given
[02:25:44] A is the probability of B being true
[02:25:47] given that A is already true, and
[02:25:49] probability of A is the probability of A
[02:25:52] being true, and probability of B is the
[02:25:54] probability of B being true. Okay. So
[02:25:57] coming back to where we left off. So now
[02:25:59] what if I want to know what is the
[02:26:00] probability of him having pizza given that
[02:26:02] he had a bowl of cereal for
[02:26:04] breakfast? So you have to calculate
[02:26:06] probability of B given A, that is, you have
[02:26:08] to calculate the probability of him
[02:26:10] having pizza for lunch given that he
[02:26:12] already had cereal for breakfast. So
[02:26:15] that equals probability of cereal given
[02:26:17] that he's having pizza for lunch,
[02:26:19] multiplied by probability of having
[02:26:22] pizza upon probability of having cereal.
[02:26:24] Okay, which in turn equates to 0.7 *
[02:26:29] probability of pizza, that is 0.5, upon
[02:26:31] probability of cereal, that is 0.6, which is about 0.58. Okay,
[02:26:34] so let's move ahead. So here's a Bayes'
[02:26:36] theorem use case: find out a
[02:26:39] patient's probability of having liver
[02:26:41] disease if they are an alcoholic. So
[02:26:43] here we are defining an event and a
[02:26:45] test. So event A is: the patient has liver
[02:26:48] disease. From the past data you can
[02:26:51] tell that 10% of the patients entering
[02:26:53] your clinic have liver disease. So here
[02:26:56] probability of A equals 0.10. Okay. So
[02:26:59] next is the test, B, that a patient is an
[02:27:02] alcoholic. 5% of the clinic's patients
[02:27:04] are alcoholics. So from here you can say
[02:27:07] that probability of B equals 0.05.
[02:27:10] Okay. Now you might also know that among
[02:27:12] those patients diagnosed with liver
[02:27:14] disease, 7% are alcoholics. So
[02:27:18] this is your B given A, correct? That is,
[02:27:20] the probability that a patient is an
[02:27:22] alcoholic given that they have liver
[02:27:25] disease is 7%. Okay. So now according to
[02:27:29] Bayes' theorem, probability of A given B
[02:27:32] equals 0.07 * 0.1 / 0.05, which in turn
[02:27:37] equals 0.14.
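As a rough check, Bayes' theorem can be written as a one-line function and applied to both examples from the talk. The function name `bayes` and its argument names are assumptions made for this sketch; only the numbers come from the examples above.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Liver disease example: A = has liver disease, B = is an alcoholic.
p_liver_given_alcoholic = bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05)

# John's meals: here the "hypothesis" is pizza for lunch and the
# "evidence" is cereal for breakfast, so P(cereal | pizza) = 0.7.
p_pizza_given_cereal = bayes(p_b_given_a=0.7, p_a=0.5, p_b=0.6)

print(round(p_liver_given_alcoholic, 2))  # 0.14
print(round(p_pizza_given_cereal, 3))     # 0.583
```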
[02:27:39] So in other words, if the patient is an
[02:27:41] alcoholic, their chance of having
[02:27:43] liver disease is 0.14, that is 14%.
[02:27:47] That's a large increase from the 10%
[02:27:49] suggested by past data. But it's still
[02:27:52] unlikely that any particular patient has
[02:27:54] liver disease. Okay. So coming back to
[02:27:57] our Bayes' theorem formula. So probability
[02:27:59] of A given B is the posterior. Probability
[02:28:02] of A is the prior, that is what
[02:28:05] you believe before you saw the evidence.
[02:28:07] Probability of B given A is the likelihood of
[02:28:10] seeing that evidence if your hypothesis
[02:28:12] is correct, and probability of B is the
[02:28:15] likelihood of that evidence under any
[02:28:17] circumstances.
[02:28:19] Okay. So this was all about conditional
[02:28:21] probability and Bayes' theorem, and
[02:28:24] whatever information or whatever
[02:28:25] knowledge you gained till now, we'll be
[02:28:27] using it to learn Naive Bayes in a
[02:28:29] step-by-step way. So next is the step-by-step guide to the
[02:28:32] Naive Bayes classifier. So here's my data
[02:28:35] set consisting of 14 different rows with
[02:28:38] five different attributes. From this I
[02:28:40] have to predict whether I'll play today
[02:28:43] or not. Okay. So from here outlook,
[02:28:47] temperature, humidity and windy are the
[02:28:50] main attributes which will help us to
[02:28:52] predict the class play, whether we are
[02:28:54] going to play or not. Okay. So the total number of
[02:28:56] samples here is 14. Total yes is 9. Total
[02:29:00] no is five. So probability of getting
[02:29:02] yes is 9/14 and probability of getting
[02:29:05] no is 5/14. Okay. Now the very first
[02:29:08] thing that we'll do is calculate the
[02:29:10] frequency table for each attribute. So
[02:29:12] for outlook we have sunny, overcast and
[02:29:15] rainy. So how many yes and how many no
[02:29:17] are there? We'll calculate the
[02:29:19] frequency for that. So we have 2 yes for
[02:29:21] sunny, four yes for overcast and three
[02:29:23] yes for rainy. So the total number of yes is
[02:29:26] again nine. And we have 3 no for sunny,
[02:29:29] zero no for overcast and 2 no for rainy
[02:29:32] that is in total five. Okay. Next is
[02:29:35] temperature. So we have 2 yes for hot,
[02:29:38] four yes for mild and 3 yes for cool.
[02:29:41] And similarly two no for hot, two no for
[02:29:44] mild and one no for cool. Coming to
[02:29:47] humidity, we have 3 yes when the humidity
[02:29:51] is high. We have 6 yes when the humidity
[02:29:54] is normal. And next we have four no when
[02:29:58] the humidity is high and we have just
[02:30:00] one no when the humidity is normal.
[02:30:03] Similarly for windy: if the
[02:30:06] weather is not windy we have 6 yes but
[02:30:09] if the weather is windy we have 3 yes.
[02:30:12] And next if the weather is not windy we
[02:30:15] have 2 no. And if the weather is windy
[02:30:18] we have 3 no. So now that we have
[02:30:20] calculated the frequency table for each
[02:30:22] attribute, the next thing that we are going
[02:30:24] to do is calculate the probability for each attribute,
[02:30:26] that is, we are going to calculate
[02:30:27] probability of A. Okay. So from here you
[02:30:30] can see that the number of times we get sunny
[02:30:33] is five out of 14. The number of times we get
[02:30:36] overcast is 4/14. The number of times
[02:30:40] we get rainy is 3 + 2, that is 5/14.
[02:30:43] Okay. Similarly you'll calculate it for
[02:30:46] temperature, humidity and windy. Okay.
[02:30:49] So here we got the probability for each
[02:30:50] attribute, that is probability of A. Next
[02:30:53] is the probability of each class,
[02:30:55] that is probability of B. Probability of
[02:30:57] B is nothing but the probability of playing
[02:30:59] the game or not. Right? That is
[02:31:02] probability of yes and probability of no,
[02:31:04] that is 9/14 and 5/14.
[02:31:07] Next we'll calculate the likelihood for
[02:31:09] each attribute. That is, we are going to
[02:31:11] calculate probability of A given B. So
[02:31:13] let's have a look at this example. So
[02:31:16] when outlook is sunny, the
[02:31:17] probability of sunny given yes
[02:31:19] is 2/9. And how many no are there?
[02:31:23] Three out of five. Right? So we got
[02:31:26] 3/5. Similarly for overcast we got 4/9
[02:31:29] and 0/5. For rainy, 3/9 and 2/5. Okay.
[02:31:35] You'll do a similar calculation for
[02:31:36] temperature, humidity and windy. Okay.
[02:31:39] So now we have to calculate probability
[02:31:42] of B given A. So we have the formula
[02:31:44] probability of A given B equals
[02:31:47] probability of A intersection B upon
[02:31:49] probability of B, which equals
[02:31:50] probability of A multiplied by
[02:31:52] probability of B given A upon
[02:31:54] probability of B. Correct? From here I
[02:31:57] want to find this value. I have this
[02:31:59] value, this value and this value. Okay.
[02:32:03] So let's see. If I want to play, what
[02:32:06] would be my ideal condition for playing?
[02:32:08] The outlook should be sunny. So the
[02:32:11] probability for outlook to be sunny is
[02:32:13] 5/14. Correct? The temperature should be
[02:32:16] cool. So the probability of the
[02:32:18] temperature being cool is 4/14. Next,
[02:32:21] humidity should be normal. So the
[02:32:23] probability of humidity being normal is
[02:32:25] 7/14. And the weather should not be
[02:32:27] windy. So the probability of the weather
[02:32:30] not being windy is 8/14. Okay. So now
[02:32:34] if you calculate the total probability of
[02:32:36] the ideal condition, that is P(X) equals
[02:32:40] 5/14 * 4/14 * 7/14 * 8/14, which in turn
[02:32:47] comes out to be 0.029.
[02:32:50] Okay. So this was the ideal condition.
[02:32:52] Now, the ideal condition to play the game. So
[02:32:55] probability of outlook equals sunny when
[02:32:58] play equals yes, that is 2/9. Probability
[02:33:01] of temperature being cool when play
[02:33:03] equals yes, that is 3/9. Next is
[02:33:07] probability of humidity being normal
[02:33:09] when play equals yes, that is 6/9, and
[02:33:13] probability of windy equals false when
[02:33:16] play equals yes, that is 6/9. Okay. So
[02:33:20] the probability of the ideal
[02:33:22] condition given that we play, probability of X given
[02:33:25] yes, equals the multiplication of all these
[02:33:27] values, that is 2/9 multiplied by 3/9
[02:33:30] multiplied by 6/9 * again 6/9. So you'll
[02:33:33] get the value as 0.033.
[02:33:36] Okay. So for the probability of playing the
[02:33:39] game in the ideal condition, we have to
[02:33:41] calculate probability of yes given X. So
[02:33:45] probability of yes given X equals
[02:33:48] probability of X given yes into
[02:33:51] probability of yes upon probability of
[02:33:54] X. So probability of yes given X equals
[02:33:57] 0.033
[02:33:59] multiplied by probability of yes, that is
[02:34:01] 9/14, upon probability of X that we just
[02:34:04] calculated, 0.029,
[02:34:07] which in turn results in 0.73.
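The whole walkthrough can be reproduced from the frequency tables alone. The sketch below hard-codes the counts read out above (the 14 raw rows are not shown in the talk, so they are not reconstructed here); the dictionary layout and function names are assumptions made for illustration. It first follows the spoken calculation, where P(X) is the product of the individual attribute probabilities, and then shows a commonly used alternative that normalises over the two classes instead.

```python
from fractions import Fraction

N = 14
class_counts = {"yes": 9, "no": 5}

# counts[attribute][value][class], taken from the frequency tables above.
counts = {
    "outlook": {"sunny": {"yes": 2, "no": 3},
                "overcast": {"yes": 4, "no": 0},
                "rainy": {"yes": 3, "no": 2}},
    "temperature": {"hot": {"yes": 2, "no": 2},
                    "mild": {"yes": 4, "no": 2},
                    "cool": {"yes": 3, "no": 1}},
    "humidity": {"high": {"yes": 3, "no": 4},
                 "normal": {"yes": 6, "no": 1}},
    "windy": {"false": {"yes": 6, "no": 2},
              "true": {"yes": 3, "no": 3}},
}

# The "ideal condition" X from the talk.
x = {"outlook": "sunny", "temperature": "cool",
     "humidity": "normal", "windy": "false"}

def prior(label):
    """P(class), e.g. 9/14 for yes."""
    return Fraction(class_counts[label], N)

def likelihood(x, label):
    """P(X | class) as a product of per-attribute likelihoods."""
    p = Fraction(1)
    for attr, value in x.items():
        p *= Fraction(counts[attr][value][label], class_counts[label])
    return p

def evidence(x):
    """P(X) as the product of per-attribute probabilities, as in the talk."""
    p = Fraction(1)
    for attr, value in x.items():
        p *= Fraction(sum(counts[attr][value].values()), N)
    return p

p_x = evidence(x)
p_x_given_yes = likelihood(x, "yes")
posterior_yes = p_x_given_yes * prior("yes") / p_x

print(round(float(p_x), 3))            # 0.029
print(round(float(p_x_given_yes), 3))  # 0.033
print(round(float(posterior_yes), 2))  # 0.73

# Alternative: normalise over both classes so the posteriors sum to 1.
scores = {c: likelihood(x, c) * prior(c) for c in class_counts}
normalised_yes = scores["yes"] / sum(scores.values())
print(round(float(normalised_yes), 2))  # 0.86
```

Either way the classifier picks "yes" for this condition; the two numbers differ only because the spoken version estimates P(X) by treating the attributes as independent of each other overall.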
[02:34:10] So from here you can say that the
[02:34:11] probability of playing the game in the ideal
[02:34:13] condition according to the Naive Bayes
[02:34:15] classifier is 0.73. What exactly is a
[02:34:18] support vector machine? Well, it is a
[02:34:20] supervised machine learning algorithm
[02:34:22] which classifies data based on its
[02:34:24] features. So let's say we have this
[02:34:27] input data comprising apples and
[02:34:29] tomatoes. Now we'll feed this data to a
[02:34:31] support vector machine. Now SVM will
[02:34:34] basically learn all the features
[02:34:36] associated with the input data and
[02:34:38] separate apples and tomatoes into
[02:34:40] different classes. So now that we know
[02:34:43] what exactly SVM is, let's understand
[02:34:45] the working mechanism behind SVM. So a
[02:34:48] support vector machine separates or
[02:34:50] classifies data based on hyperplanes. So
[02:34:53] over here this is our hyperplane, which
[02:34:56] separates apples and tomatoes into
[02:34:58] different classes. But the problem is
[02:35:01] there could be infinitely many ways to
[02:35:03] draw this hyperplane. So there could be
[02:35:06] a hyperplane like this, and this is also
[02:35:09] another possibility, and it could be
[02:35:11] any one out of all these different
[02:35:13] hyperplanes. So how do we determine
[02:35:16] which is the best one? So this is where
[02:35:18] we'd have to take the help of support
[02:35:20] vectors. So support vectors are
[02:35:22] basically the data points of each class nearest to
[02:35:25] our hyperplane. And we have to choose a
[02:35:27] hyperplane in such a way that the
[02:35:29] distance between the two support vectors is
[02:35:32] maximum. And the distance between these
[02:35:34] two support vectors is known as the margin.
[02:35:37] So the aim of the model is to maximize
[02:35:39] this margin between the support vectors.
[02:35:43] Now let's say we add a new data point to
[02:35:45] our sample and then implement the SVM
[02:35:48] model. So we'd have to draw a
[02:35:50] hyperplane in such a way that it best
[02:35:52] separates these two classes. So we'll
[02:35:54] start off by drawing a random
[02:35:56] hyperplane and then we'll find the support
[02:35:58] vectors and then we'll go ahead and find
[02:36:00] the margin between the support vectors.
[02:36:03] Now similarly we'll draw another
[02:36:05] hyperplane like this and again find the
[02:36:07] support vectors and also calculate the
[02:36:09] margin between the support vectors. So
[02:36:12] now when we compare these two
[02:36:14] hyperplanes over here, we see that the
[02:36:17] margin for the first hyperplane is
[02:36:19] greater than the margin for the second
[02:36:21] hyperplane. And that is why the first
[02:36:24] hyperplane would be the optimal one for
[02:36:26] the scenario.
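The comparison of two candidate hyperplanes can be sketched in plain Python. Everything here is an illustrative assumption: the toy coordinates for the apples and tomatoes, the two candidate lines, and the choice to measure the margin as twice the distance from the line to its nearest point. A real SVM solves an optimisation problem rather than comparing hand-picked lines.

```python
import math

# Toy 2D points (hypothetical values, not from the slides).
apples = [(1, 1), (2, 1), (1, 2)]
tomatoes = [(4, 4), (5, 4), (4, 5)]

def signed_distance(point, w, b):
    """Signed distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(*w)

def margin(w, b):
    """Width of the empty band around the line, or None if it fails to separate."""
    neg = [signed_distance(p, w, b) for p in apples]
    pos = [signed_distance(p, w, b) for p in tomatoes]
    if max(neg) >= 0 or min(pos) <= 0:
        return None  # some point is on the wrong side
    # The nearest point on either side limits how wide the band can be.
    return 2 * min(min(pos), -max(neg))

candidates = {
    "A": ((1, 1), -5.5),  # diagonal line x + y = 5.5
    "B": ((1, 0), -3.0),  # vertical line x = 3
}

margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=lambda name: margins[name])

for name, m in margins.items():
    print(name, round(m, 3))
print("best:", best)  # best: A
```

Both lines separate the two classes, but line A leaves a wider gap, so it is the better of the two candidates.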
[02:36:28] So till now the data which we saw was
[02:36:30] linearly separable. But what about this
[02:36:32] case? Can we actually draw a line
[02:36:34] over here to separate the circles and
[02:36:35] the triangles? Well, let's check it out.
[02:36:38] So I've drawn three hyperplanes and we
[02:36:41] see that in all these three cases it is
[02:36:44] not possible to separate the data into
[02:36:45] different classes. So this is where SVM
[02:36:48] uses something known as a kernel
[02:36:50] function. So this kernel function helps
[02:36:52] in transforming the 2D nonlinear data
[02:36:56] into higher dimensions so that we can
[02:36:58] separate the data using a hyperplane.
[02:37:00] So when we look at this image
[02:37:02] on the right side, we see that we can
[02:37:04] separate the circles and the triangles
[02:37:07] by drawing a hyperplane along the Z
[02:37:09] axis.
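A small sketch of the idea of lifting 2D data into a higher dimension: points on an inner ring (the circles) and an outer ring (the triangles) cannot be split by a straight line, but after adding a third coordinate z = x² + y² a flat plane separates them. The ring radii, the explicit `lift` function and the threshold are assumptions for illustration; in practice a kernel function achieves this effect implicitly, without building the extra coordinates.

```python
import math

def ring(radius, n=8):
    """Return n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

circles = ring(1.0)    # inner class
triangles = ring(3.0)  # outer class, surrounds the inner one

def lift(point):
    """Map (x, y) to (x, y, z) with z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

z_circles = [lift(p)[2] for p in circles]      # all close to 1
z_triangles = [lift(p)[2] for p in triangles]  # all close to 9

# In the lifted space the plane z = 5 sits between the two classes.
threshold = 5.0
separable = max(z_circles) < threshold < min(z_triangles)
print(separable)  # True
```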
[02:37:11] And these are some of the kernel
[02:37:12] functions which we can use along with
[02:37:14] our SVM model. So we have the polynomial
[02:37:17] kernel, Gaussian kernel, Gaussian radial
[02:37:19] basis function kernel and Laplace RBF kernel. Now
[02:37:22] it's time to head on to the demo. So
[02:37:25] we'll be using this cancer data set to
[02:37:27] implement SVM. So this data set is
[02:37:29] computed from a digitized image of a
[02:37:32] fine needle aspirate of a breast mass
[02:37:34] and the columns basically describe
[02:37:36] characteristics of the cell nuclei
[02:37:39] present in the image. So there are
[02:37:41] different features such as mean radius,
[02:37:43] mean texture, concavity error and so on.
[02:37:46] And the target column comprises two
[02:37:48] labels which are malignant and benign.
[02:37:50] So you're basically trying to classify
[02:37:52] whether the tumor of the patient is
[02:37:54] malignant or benign on the basis of all
[02:37:56] the features. So our first task would be
[02:37:58] to import the package and the data set
[02:38:01] required for this demo. So from the
[02:38:03] sklearn library, we are importing the
[02:38:05] datasets module, and then the data set
[02:38:07] which we need is breast cancer. So I'll
[02:38:09] just use this function load_breast_cancer
[02:38:11] and I will store this in this new
[02:38:14] object called cancer. I'll click on
[02:38:16] run. So we have successfully loaded this
[02:38:19] data set. Now let me have a glance at
[02:38:21] the features and the target values. So
[02:38:23] all I have to do is use this print
[02:38:25] function, and with the help of
[02:38:27] cancer.feature_names I'll get all the names of
[02:38:30] the features, and over here with
[02:38:32] cancer.target_names I'll get the target
[02:38:34] names. I'll click on run. So these are
[02:38:36] all of the different features which are
[02:38:38] present in the data set. So we have mean
[02:38:40] radius, mean texture, concavity error,
[02:38:42] symmetry error, worst concavity and so
[02:38:45] on. And then in the target we have two
[02:38:47] labels. First is malignant and then
[02:38:49] second is benign. And then again as I
[02:38:51] already told you guys, we're trying to
[02:38:52] determine whether the patient's tumor is
[02:38:55] malignant or benign. So now that we've
[02:38:58] had a glance at the features and the
[02:38:59] labels, let's have a glance at the shape
[02:39:01] of this data set. So for that we'll just
[02:39:03] type in cancer.data.shape. I'll
[02:39:06] click on run. So we have this value 569
[02:39:09] and 30. So this means that there are 569
[02:39:13] rows in this data set and 30 columns.
[02:39:16] Now let me have a glance at the first
[02:39:18] five records of all of the features. So
[02:39:20] cancer.data, and then I want to have
[02:39:22] a glance at the first five records. So
[02:39:24] this will go from zero to five. Right?
[02:39:27] So these are the first five records or
[02:39:29] the values for the first five records.
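The inspection steps described so far correspond roughly to the following scikit-learn calls. The object name `cancer` follows the talk; the exact notebook cells are not visible in the transcript, so treat this as a sketch rather than the presenter's code.

```python
from sklearn import datasets

# Load the breast cancer data set bundled with scikit-learn.
cancer = datasets.load_breast_cancer()

# Names of the 30 feature columns and of the two target labels.
print(cancer.feature_names)
print(cancer.target_names)  # ['malignant' 'benign']

# Number of rows and columns in the feature matrix.
print(cancer.data.shape)    # (569, 30)

# First five records of the feature matrix.
print(cancer.data[0:5])
```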
[02:39:31] Now similarly I'll have a glance at the
[02:39:34] target values. So cancer.target I'll
[02:39:36] just print it out. See that we have the
[02:39:38] values which are zero and one. So
[02:39:40] wherever we have the value zero it
[02:39:42] represents that the patient's tumor is
[02:39:45] malignant. And wherever we have one it
[02:39:47] represents that the patient's tumor is
[02:39:49] benign. Now it's time to go ahead and
[02:39:52] build the model. So before we go ahead
[02:39:54] and build the model we are actually
[02:39:55] supposed to divide our data set into
[02:39:57] training and test sets. So for that we'll
[02:40:00] be importing this train_test_split
[02:40:02] method from sklearn.model_selection, and
[02:40:06] after that, using this train_test_split
[02:40:08] function I will go ahead and divide this
[02:40:11] into training and testing sets. So this
[02:40:13] over here takes in these parameters.
[02:40:15] First parameter is the list of all of
[02:40:17] the features, which will come from
[02:40:19] cancer.data. Next parameter is basically the
[02:40:21] target values which is this result over
[02:40:23] here. And then we'll set the test size.
[02:40:25] So here I'm setting the test size to be
[02:40:27] 0.3. So this will mean that 30% of the
[02:40:31] records from the data set will be in the
[02:40:33] test set and the rest of the 70% of the
[02:40:36] records from the data set would go into
[02:40:38] the training set and then again I'll
[02:40:40] just set a seed value so that I can
[02:40:42] build the same model again. Now over
[02:40:44] here on the left hand side we see that
[02:40:46] we are storing this result into X_train,
[02:40:49] X_test, y_train and y_test. So these X
[02:40:53] labels basically represent all of the
[02:40:54] features and these y labels basically
[02:40:56] represent all the target values. So this
[02:40:59] X_train is the training set for all the
[02:41:02] features. X_test is the test set for all
[02:41:04] the features. Similarly, this y_train is
[02:41:07] the training set for all the target
[02:41:09] values and this y_test is the test set
[02:41:12] for all the target values. Right? So now
[02:41:14] we have successfully split this data set
[02:41:16] into training and testing sets. Now it's
[02:41:19] time to go ahead and build the model on
[02:41:21] top of the training set. So now what
[02:41:23] I'll do is I'll go ahead and import the svm
[02:41:26] module from the sklearn library, and after
[02:41:29] that I'll go ahead and build the model.
[02:41:32] So for that I would have to set the
[02:41:34] kernel. So as we saw in the slides, we
[02:41:36] can have different kernels. So for this
[02:41:39] demo I'm just setting the kernel to be
[02:41:40] linear. That is, we are building a linear
[02:41:42] model over here. So I'll use svm.SVC
[02:41:46] with kernel equal to linear, and I'll
[02:41:48] store this in clf. Now after that I will
[02:41:51] fit the model. So clf.fit, and it
[02:41:54] takes in two parameters. First is
[02:41:57] X_train. Next is y_train. That is, I'm
[02:42:00] basically building this model on top of
[02:42:02] the training set for both the features
[02:42:04] and the target values. And after I build
[02:42:07] the model, I'll go ahead and predict the
[02:42:09] values on top of the test set. So it'll
[02:42:11] be X_test. I'll click on run. So we've
[02:42:14] built the model, we have predicted the
[02:42:16] values. Now let's go ahead and find out
[02:42:18] the metrics of the model which we've
[02:42:20] built. So I'll import metrics from
[02:42:22] sklearn and then first we'll go ahead
[02:42:25] and build a confusion matrix. So again
[02:42:27] metrics.confusion_matrix. This over
[02:42:29] here takes in two parameters, which are
[02:42:31] y_test and y_pred. So this y_test comprises
[02:42:34] the actual values and this y_pred
[02:42:37] comprises the predicted values from
[02:42:39] the model which we've built. Now I'll
[02:42:42] click on run and this gives me this
[02:42:44] confusion matrix. So the 61 which you
[02:42:47] see it represents all of those values
[02:42:50] where the actual label was malignant and
[02:42:53] it has been correctly classified as
[02:42:55] malignant. So 61 are such cases and then
[02:42:59] we have this 104 which basically
[02:43:01] represents all of those values where the
[02:43:03] actual value was benign and it has been
[02:43:06] correctly predicted as benign, and this
[02:43:09] 2 which you see: over here the actual
[02:43:11] was malignant but it has been predicted
[02:43:13] as benign, and this four which you see,
[02:43:16] this represents all of those cases where
[02:43:18] the actual was benign but it has been
[02:43:20] predicted as malignant. So this left
[02:43:22] diagonal represents all of those values
[02:43:24] which have been correctly classified and
[02:43:26] this right diagonal represents all of
[02:43:29] those values which have been
[02:43:30] misclassified. So this is the information
[02:43:32] which we can get from the confusion
[02:43:33] matrix. Now I'll go ahead and also
[02:43:36] calculate the accuracy. So this again,
[02:43:38] metrics.accuracy_score, takes in two
[02:43:40] parameters. First is y_test and then we
[02:43:42] have y_pred. So see that the accuracy
[02:43:44] is 96%. So you can actually get this
[02:43:46] accuracy value from this confusion
[02:43:48] matrix. So as I have already told you
[02:43:50] guys this left diagonal represents all
[02:43:52] of those values which have been
[02:43:53] correctly classified. So we'd have to
[02:43:55] divide this left diagonal with all of
[02:43:57] the values. So let me go ahead and do
[02:43:59] that. That'll be 61 + 104 divided by 61
[02:44:05] + 104 + 2 + 4. Now I'll click on run. So
[02:44:10] you see that we get the same accuracy,
[02:44:12] which is 96.49%.
[02:44:13] So we have successfully built the SVM
[02:44:15] model and we've got an accuracy of 96%.
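Put together, the demo looks roughly like the sketch below. The calls (`train_test_split`, `svm.SVC`, `metrics.confusion_matrix`, `metrics.accuracy_score`) are the ones named in the talk, but the seed value is not stated in the transcript, so `random_state=42` is an assumption and the exact confusion matrix and accuracy may differ from the 61 / 104 / 2 / 4 and 96.49% shown in the video.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# 70% of the records for training, 30% for testing.
# random_state is an assumed seed; the talk does not say which value was used.
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42
)

# Linear-kernel support vector classifier, fitted on the training set.
clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

# Predictions for the held-out test set.
y_pred = clf.predict(X_test)

# Rows are actual labels, columns are predicted labels (0 = malignant, 1 = benign).
cm = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(cm)
print(accuracy)

# Accuracy recomputed by hand: correctly classified (diagonal) over all cases.
manual_accuracy = cm.trace() / cm.sum()
print(manual_accuracy)
```

The last line mirrors the (61 + 104) / (61 + 104 + 2 + 4) calculation from the talk: the diagonal of the confusion matrix divided by the total number of test records.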
[02:44:19] So you can consider ensemble modeling to
[02:44:21] be a collection of multiple models. So
[02:44:23] we basically combine several base models
[02:44:26] in order to produce one optimal
[02:44:28] predictive model. So let's take this
[02:44:30] example to understand ensemble modeling
[02:44:33] better. So let's say there's this data
[02:44:35] set on which you want to do some
[02:44:36] prediction. Now instead of building one
[02:44:39] model on this entire data set, we can
[02:44:42] take samples of this data set and build
[02:44:44] a model on each of the sampled data sets.
[02:44:48] So let's say we take K samples and then
[02:44:50] we'll build K models in total. Now each
[02:44:54] of these models will give us a result. So
[02:44:56] we'll take the aggregate of all the
[02:44:58] results and that would be our final
[02:45:00] answer. Now one such ensemble method
[02:45:03] involving decision trees is bagging. So
[02:45:06] let's understand bagging with this
[02:45:07] example. So let's say we have this data
[02:45:10] set A and there are n records in it. Now
[02:45:13] what we'll do is draw samples from this
[02:45:15] data set. So this actually will be
[02:45:17] sampling with replacement. Now I'll take
[02:45:20] one record from data set A. Take note of
[02:45:23] it. Enter the same sample in data set A1
[02:45:27] and then put the record back to where it
[02:45:29] came from. And I'll repeat this process
[02:45:31] n times. So that there are n records in
[02:45:34] data set A1 as well. So what you need to
[02:45:36] keep in mind is out of these n records
[02:45:38] in A1, some of them might have come
[02:45:41] twice, thrice or even several times over
[02:45:44] while some records from A might not have
[02:45:46] made it at all to A1. So this is how
[02:45:49] I've created A1. And then I'll go ahead
[02:45:52] and create multiple data sets the same
[02:45:54] way. So I have A1, A2, and it'll go on
[02:45:58] till Ax. And each of these have the same
[02:46:00] number of records as A. And the X over
[02:46:03] here, it could be anything. let's say
[02:46:05] 100, 500 or even thousand. So from just
[02:46:08] one data set A, we're able to create
[02:46:11] multiple data sets for our advantage. So
[02:46:14] again just for our sake, let's say data
[02:46:15] set A has 1,000 rows and the value of X
[02:46:18] is also 1,000. So this would be 1,000
[02:46:21] times 1,000, which would give us 1 million
[02:46:24] rows. That is, from just 1,000 rows of data
[02:46:27] we are able to get 1 million rows. Now what
[02:46:30] we'll do is for each of these X data
[02:46:32] sets we'll fit one decision tree each.
[02:46:35] So we have X decision trees coming from
[02:46:38] X data sets. So now we have a group of
[02:46:41] trees or in other words what we have
[02:46:43] over here is an ensemble of trees. Now
[02:46:46] let's say a new record comes our way.
[02:46:49] Then we're going to pass this record to
[02:46:51] each of these X trees and we're going to
[02:46:53] get each tree's prediction on what class
[02:46:56] this new record is going to represent.
[02:46:58] And since we have x trees, we'll have x
[02:47:01] predictions in total. That is let's say
[02:47:03] if x was 500, we would get 500
[02:47:05] predictions. Similarly, if x was 1,000,
[02:47:08] we would get 1,000 predictions. Now to
[02:47:10] get the final prediction, all we have to
[02:47:12] do is select that class which would have
[02:47:15] the majority of the votes across all the
[02:47:17] predictions from individual trees. So
[02:47:20] what we're really doing is aggregating the
[02:47:22] predictions across all of these trees.
[02:47:25] So guys, this is the concept of bagging.
[02:47:29] Now we'll head on to random forest. So
[02:47:31] random forest is just an extension of
[02:47:33] bagging. So till here the process would
[02:47:35] be the same. We'll be creating X
[02:47:38] bootstrap samples from the original data
[02:47:40] set. And this is where the difference
[02:47:42] comes. So now what we'll do is for each
[02:47:44] of these X data sets, we'll fit one
[02:47:47] decision tree. But the process of
[02:47:49] building the decision tree changes over
[02:47:51] here. So let's say this A1 data set has
[02:47:54] 10 independent variables. Now when it
[02:47:57] came to bagging, we considered all of
[02:48:00] these 10 independent variables to be a
[02:48:03] choice for the split candidate. But what
[02:48:05] happens in random forest is each time a
[02:48:07] node is being split in a decision tree,
[02:48:10] not all 10 variables will be provided to
[02:48:13] the algorithm. This is important. So I'm
[02:48:16] reiterating this guys. Each time a node
[02:48:18] is being split in a decision tree, not
[02:48:21] all the columns will be provided to the
[02:48:24] algorithm. So now the question arises
[02:48:26] what will be made available to the
[02:48:28] algorithm. So only a random subset of
[02:48:31] these 10 columns would be available to
[02:48:33] the algorithm. So let's say I want to
[02:48:35] split this root node. Now instead of
[02:48:38] providing it all the 10 columns, only a
[02:48:40] subset of columns will be provided. So
[02:48:43] let's say three columns. Now it could be
[02:48:45] any three out of the 10 and with those
[02:48:48] three the algorithm goes on to split the
[02:48:51] node and similarly for the left node
[02:48:53] over here it is again going to be
[02:48:55] provided with a random set of variables
[02:48:58] and it is not necessary that the left
[02:49:00] node should get the same three
[02:49:02] variables. It can be a different set of
[02:49:04] three columns altogether. So whenever we are
[02:49:06] splitting a node, it is given a random
[02:49:08] set of m predictors from the entire
[02:49:11] predictor space. And the reason this is
[02:49:13] done is to make each of these X trees
[02:49:16] very different. So let's compare bagging
[02:49:19] and random forest. So in bagging all
[02:49:22] trees had the entire predictor space
[02:49:25] available to them. So the eventual trees
[02:49:27] which you would end up building would be
[02:49:29] very similar to each other. And in the
[02:49:31] case of random forest, you bring in
[02:49:33] randomness with respect to the columns
[02:49:35] provided. That is, only a random subset of
[02:49:38] columns from the entire predictor space
[02:49:40] is provided at each split, and that is why the set
[02:49:43] of decision trees which we get could be
[02:49:45] pretty different from each other. So now
[02:49:48] we'll head on to the demo. So we'll be
[02:49:50] implementing random forest on top of
[02:49:52] this iris data set. Right? So we'll go
[02:49:54] ahead and load this iris data set. So
[02:49:57] first we'll import the datasets module
[02:49:59] from the sklearn library and then we'll load
[02:50:01] the iris data set and I'll store this in
[02:50:03] the iris object. I'll click on run. Now
[02:50:06] I'll go ahead and have a glance at the
[02:50:08] target names and feature names. So all I
[02:50:10] have to do is use iris.target_names
[02:50:13] which would give me the target names.
[02:50:14] And then I have iris.feature_names
[02:50:17] which would give me the feature names.
[02:50:18] I'll click on run. So these are the
[02:50:21] target names. So this is the target
[02:50:22] variable, which comprises these
[02:50:25] labels. So the iris species could
[02:50:27] be setosa, versicolor or
[02:50:30] virginica. And these are the features.
[02:50:33] So we have sepal length, sepal width,
[02:50:35] petal length and petal width. Now I'll
[02:50:38] have a glance at the first five records of
[02:50:40] this iris data set. So iris.data, and
[02:50:43] it'll go from zero to five. So these are
[02:50:46] the first five records of this iris data
[02:50:49] set and this will give us only the
[02:50:51] feature values. Now if I want to have a
[02:50:53] glance at the target values, I'd have to
[02:50:55] print iris.target.
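The inspection steps described so far can be reproduced with a short sketch. It assumes scikit-learn is installed and uses only the iris data set bundled with it; the variable name `iris` follows the video, everything else is illustrative.

```python
from sklearn import datasets

# Load the bundled iris data set (150 flowers, 4 measurements each).
iris = datasets.load_iris()

# Class labels and feature (column) names.
print(iris.target_names)
print(iris.feature_names)

# First five rows of feature values, then the integer-coded targets
# (0 = setosa, 1 = versicolor, 2 = virginica).
print(iris.data[0:5])
print(iris.target)

# Both objects are NumPy arrays.
print(type(iris.data), type(iris.target))
```

Slicing `iris.data[0:5]` returns only feature values, which is why the targets have to be printed separately.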
[02:50:57] So these are the target values. So if
[02:50:59] the value is zero, it represents that
[02:51:02] the iris species is setosa. If the value is
[02:51:04] one, it represents that the iris species
[02:51:07] is versicolor and if the value is two, it
[02:51:10] represents that the iris species would be
[02:51:12] virginica. Now let me go ahead and have
[02:51:14] a glance at the type of this iris data
[02:51:17] and iris target. So we see that both of
[02:51:20] them are numpy arrays and for a model
[02:51:22] building process first we'd have to
[02:51:25] convert these two numpy arrays into data
[02:51:28] frames. So I'll take the first column
[02:51:32] from the iris data and I'll name it as
[02:51:35] sepal length. Similarly I'll take the
[02:51:37] second column from iris data. I'll name
[02:51:38] it as sepal width. The third column I'll
[02:51:41] name it as petal length. And the fourth
[02:51:43] column I'll name it as petal width. And
[02:51:46] finally I'll take these values from
[02:51:47] iris.target and I'll name them to be
[02:51:50] species. So we are creating this data
[02:51:52] frame which would comprise five
[02:51:54] columns and those five columns would be
[02:51:56] sepal length, sepal width, petal length, petal width and
[02:52:00] species. And I'm storing this in this
[02:52:02] data object. I'll click on run. And we
[02:52:05] also have the head of this new data
[02:52:07] frame which we've just created. Right?
[02:52:09] So these are all of the columns which
[02:52:11] are present in this data frame. So now
[02:52:14] before we go ahead and build the model,
[02:52:16] we are again supposed to separate the
[02:52:18] features and the target column. So I
[02:52:22] will store all of the features in this X
[02:52:25] object and I will store the target in
[02:52:28] this Y object. I'll click on run. So all
[02:52:31] of these features are sepal length, sepal width, petal
[02:52:33] length and petal width, and the target
[02:52:35] would be species. In other words, we're
[02:52:37] trying to determine the species of
[02:52:40] the iris flower on the basis of these
[02:52:44] feature values. Now I'll have to divide
[02:52:47] this data set into training and testing
[02:52:50] set and for that purpose I'll be
[02:52:53] importing this train_test_split function
[02:52:55] from sklearn.model_selection and after
[02:52:59] that inside this function I'll pass in
[02:53:01] these parameters. So the first parameter
[02:53:03] is the object which comprises all of
[02:53:05] the features. The second parameter is
[02:53:07] the object which comprises all of the
[02:53:09] target labels and then I'll give in the
[02:53:11] test size which is 0.3. So this
[02:53:14] basically states that test data would
[02:53:17] have 30% of the entire data and the
[02:53:20] training data would comprise of 70% of
[02:53:23] the entire data. So this is the train
[02:53:26] test split and we are storing this in
[02:53:28] these four objects. So we have X train,
[02:53:31] X test, Y train and Y test. So this X
[02:53:35] train is basically the training set for
[02:53:37] the features. This X test is the test
[02:53:39] set for the features. And then we have Y
[02:53:41] train which is the training set for the
[02:53:43] target. And then we have Y test which is
[02:53:46] the test set for the target. So we've
[02:53:48] gone ahead and we've also divided the
[02:53:50] data set into training and testing sets.
[02:53:53] Now it's time to build the model on top
[02:53:55] of the training set and predict the
[02:53:56] values on top of the test set. So I'd
[02:53:59] have to import the RandomForestClassifier
[02:54:02] class from sklearn.ensemble
[02:54:05] and I'll create a model from it and I'll
[02:54:07] set the number of decision trees to be
[02:54:09] 100. So over here this parameter which
[02:54:11] you see, n_estimators equals 100. So this
[02:54:14] basically means that the random forest
[02:54:16] algorithm which we'll be creating it
[02:54:18] will have 100 decision trees. So we'll
[02:54:21] get the aggregate result of these 100
[02:54:23] trees. Now I'm storing this in this CLF
[02:54:26] object. Now I'll go ahead and fit or
[02:54:30] build this model on top of the train
[02:54:32] set. So we have X train and Y train. Now
[02:54:35] after I build the model I would have to
[02:54:37] predict the values. So I'll predict the
[02:54:39] values on top of the X test. So now that
[02:54:42] we've predicted the values, it's time to
[02:54:45] build the confusion matrix and find out
[02:54:47] how good our model is. So I'll import
[02:54:50] the metrics module from sklearn and then I'll
[02:54:52] go ahead and build the confusion matrix
[02:54:54] which takes in the y test object which
[02:54:56] basically has all of the actual values
[02:54:58] and then the next parameter is the y
[02:55:00] pred object which has all of the
[02:55:02] predicted values. Now I'll go ahead and
[02:55:04] build this confusion matrix. So this
[02:55:07] first row represents the setosa species.
[02:55:09] The second row represents the versicolor
[02:55:11] species and the third row represents the
[02:55:13] virginica species. And this left diagonal
[02:55:16] which you see this diagonal represents
[02:55:18] all of those values which have been
[02:55:21] correctly classified. So if we take this
[02:55:23] row here, then this means that in
[02:55:25] total there were 14 flowers which were
[02:55:28] actually setosa and all of them have been
[02:55:30] correctly classified as setosa. And over
[02:55:33] here we see that in total there were 16
[02:55:35] flowers of versicolor. Out of them, 15
[02:55:38] have been correctly classified as
[02:55:41] versicolor. And if you take this row, you
[02:55:42] see that there were 15 values of
[02:55:44] virginica. Out of them, 14 have been
[02:55:47] correctly classified. Now if you want to
[02:55:50] get the accuracy of this model which
[02:55:51] you've just built, we'll use this
[02:55:53] accuracy_score function and we'll pass in the
[02:55:55] same objects, y test and y pred.
[02:55:58] I'll click on run. So we see that the
[02:55:59] accuracy for the model which you built
[02:56:01] is 95.55%.
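The model-building steps walked through above can be sketched as follows, assuming pandas and scikit-learn are installed. Two things here are assumptions that the video does not mention: `random_state=0`, added so the split and the forest are repeatable, and `max_features="sqrt"`, which is how scikit-learn expresses "only a random subset of columns is offered at each split". Because the split is random, the accuracy printed here will not necessarily match the 95.55% seen in the video.

```python
import pandas as pd
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Build a five-column data frame from the two NumPy arrays.
data = pd.DataFrame({
    "sepal length": iris.data[:, 0],
    "sepal width": iris.data[:, 1],
    "petal length": iris.data[:, 2],
    "petal width": iris.data[:, 3],
    "species": iris.target,
})

# Separate the features from the target column.
X = data[["sepal length", "sepal width", "petal length", "petal width"]]
y = data["species"]

# 70% training data, 30% test data (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 100 decision trees; each split sees a random subset of the columns.
clf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred)
acc = metrics.accuracy_score(y_test, y_pred)
print(cm)
print("Accuracy:", acc)
```

The diagonal of `cm` holds the correctly classified flowers, so the accuracy is simply the diagonal sum divided by the total number of test records.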
[02:56:04] Now I'll go ahead and predict values for
[02:56:06] a single item. So let's say if I give
[02:56:08] individual values for each of the
[02:56:10] columns. So these are the values for
[02:56:12] sepal length, sepal width, petal length
[02:56:14] and petal width. So let's say if sepal
[02:56:16] length was three, sepal width was five,
[02:56:19] petal length was four and petal width
[02:56:22] was two, then what is the species of
[02:56:24] this iris flower? I'll click on run. So
[02:56:27] we see that if these are the values for the
[02:56:30] columns, then the output is array([2]), or
[02:56:33] in other words, it basically belongs to
[02:56:35] the virginica species. Similarly, if
[02:56:38] these are the values for sepal length,
[02:56:39] sepal width, petal length and petal
[02:56:41] width, then let me predict this. So
[02:56:44] again we see that this time also the
[02:56:46] species would be virginica. Now I'll go
[02:56:49] ahead and have a glance at the
[02:56:50] importance of the different features. So
[02:56:54] inside this pd.Series I'll get the
[02:56:58] importance of each of the individual
[02:57:00] features. So clf.feature_importances_,
[02:57:03] and I'll also have the feature names
[02:57:06] along with the feature importances and
[02:57:08] I'm basically creating a pd.Series object out
[02:57:10] of it and I am sorting it in descending
[02:57:13] order. So this parameter ascending is
[02:57:16] equal to false. This would basically
[02:57:18] give us these feature importances in
[02:57:20] descending order and I've stored that
[02:57:22] result in feature_imp. I'll click on run
[02:57:25] and these are the different feature
[02:57:27] importances. So we see that the most
[02:57:29] important features or the most important
[02:57:31] independent variables are petal width
[02:57:34] and petal length. So now I'm going to go
[02:57:37] ahead and make a plot of the feature
[02:57:39] importance. So I'll import the seaborn
[02:57:41] and matplotlib packages and then I'll
[02:57:44] make a bar plot for the feature
[02:57:46] importance. So on the x-axis I would
[02:57:48] have the feature importance and on the
[02:57:50] y-axis I would have the labels for each
[02:57:53] of these features. I'll click on run. So
[02:57:56] we see that the X label is feature
[02:57:58] importance score, the Y label is features and
[02:58:01] the title is visualizing important
[02:58:03] features and again we get the same
[02:58:05] result. So petal width and petal length
[02:58:08] are the two most important features when
[02:58:10] we build this model. Now since you've
[02:58:13] understood that the two most important
[02:58:14] features are petal width and petal
[02:58:16] length, we're going to go ahead and
[02:58:18] build a model with only these two
[02:58:21] features. So this time in the X object I
[02:58:25] will take only petal length and petal
[02:58:28] width as the features and again the
[02:58:31] target would be the species column. Now
[02:58:33] I'll again go ahead and divide this data
[02:58:35] set into train and test sets and again
[02:58:38] I'll set the test size to be 0.3. So in
[02:58:40] other words test set would comprise of
[02:58:43] 30% of all of the data values and
[02:58:45] training set would comprise of 70% of
[02:58:48] all of the data values. I'll click on
[02:58:50] run. So we've divided the data set into
[02:58:52] training and test set. Now again it's
[02:58:54] time to build the model and predict the
[02:58:56] values. So we'll use this random forest
[02:58:58] classifier. We'll build the model and
[02:59:00] then we'll fit the values and then we'll
[02:59:02] predict the values on top of X test. Now
[02:59:05] it's time to build the confusion matrix.
[02:59:07] I'll click on run and this is the
[02:59:09] confusion matrix which we get. So we see
[02:59:11] that all of the setosa flowers have
[02:59:13] been classified correctly. When it comes
[02:59:15] to versicolor, out of the 16, 15 have
[02:59:18] been classified correctly and when it
[02:59:20] comes to virginica, out of the 14, 13 have
[02:59:22] been classified correctly. Now again
[02:59:24] we'll calculate the accuracy. So we see
[02:59:27] that the accuracy is 95.55%.
[02:59:30] So in the first model which we built the
[02:59:33] accuracy was 95.55%
[02:59:35] and also in the second model which we
[02:59:37] built the accuracy is also 95.55%.
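The single-flower prediction, the feature importances and the two-feature model discussed above can be sketched together. This assumes pandas and scikit-learn are installed; the helper `fit_and_score` and the `random_state=0` settings are illustrative additions, not part of the video's notebook, and the exact accuracies depend on the random split.

```python
import pandas as pd
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=["sepal length", "sepal width",
                                     "petal length", "petal width"])
y = pd.Series(iris.target, name="species")


def fit_and_score(features):
    """Train a 100-tree forest on the given columns; return (model, test accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[features], y, test_size=0.3, random_state=0  # random_state: assumption
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model, metrics.accuracy_score(y_test, model.predict(X_test))


# Model on all four features, plus a prediction for one hand-entered flower.
clf, acc_all = fit_and_score(list(X.columns))
sample = pd.DataFrame([[3, 5, 4, 2]], columns=X.columns)
print("Predicted class:", clf.predict(sample))

# Feature importances, sorted in descending order.
feature_imp = pd.Series(clf.feature_importances_, index=X.columns)
feature_imp = feature_imp.sort_values(ascending=False)
print(feature_imp)

# Retrain using only the two petal measurements and compare accuracy.
_, acc_petal = fit_and_score(["petal length", "petal width"])
print("All features:", acc_all, "Petal only:", acc_petal)
```

If the two accuracies come out close, that supports the point made in the video: the sepal measurements add little once the petal measurements are available.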
[02:59:41] So we've understood that we don't have
[02:59:43] to include the other two independent
[02:59:46] variables. So without including sepal
[02:59:48] length and sepal width we've got the
[02:59:51] same accuracy value for this data
[02:59:54] set. So my first task would be to import
[02:59:56] the pandas library. So I'll type in
[02:59:59] import pandas as pd.
[03:00:03] Now it's time to load the data set. So I
[03:00:05] will use the read_csv function and
[03:00:08] inside this I will give in the name of
[03:00:10] the data set. So the name of the data
[03:00:13] set is iris.csv and I will load this
[03:00:15] in a new object and I will name that
[03:00:18] object to be iris. So I have loaded the
[03:00:21] file. Now I'll go ahead and have a
[03:00:23] glance at the top five records of this
[03:00:24] data set. So I would have to use the
[03:00:26] head function for that. So iris.head()
[03:00:29] and this would give me the top five
[03:00:30] records which are present in this data
[03:00:32] set. So we've got all of these columns
[03:00:34] which are present in the Iris data set.
[03:00:36] So we've got sepal length, sepal width,
[03:00:38] petal length, petal width and the
[03:00:39] species column. Now our task for this
[03:00:42] session would be to implement the
[03:00:43] XGBoost classifier on top of this data
[03:00:46] set to find out what species this
[03:00:49] iris flower belongs to on the basis of the
[03:00:52] rest of the columns. So on the basis of
[03:00:54] sepal length, sepal width, petal length
[03:00:56] and petal width, I'd want to know what
[03:00:58] the species of this particular flower is. So
[03:01:01] we've got three different species of
[03:01:02] iris flower. We've got setosa, versicolor,
[03:01:05] and virginica. All right. So before we go
[03:01:08] ahead and implement the algorithm, we'd
[03:01:10] have to divide the data set into
[03:01:11] training and testing set. Now again
[03:01:13] before that, we'd have to divide all of
[03:01:15] the predictors and the response
[03:01:17] variable. So all of these would be the
[03:01:19] predictors and this would be the
[03:01:21] response variable. So let me extract
[03:01:23] that. So inside the X object I will go
[03:01:26] ahead and I will store the rest of the
[03:01:28] columns except the species column. So it
[03:01:31] would be sepal
[03:01:34] length over here.
[03:01:37] So after sepal length I would also
[03:01:39] require the sepal width and before this
[03:01:41] I'd have to give in the name of the data
[03:01:42] frame over here. Right? So from Iris
[03:01:44] data frame I would require sepal
[03:01:48] width.
[03:01:50] So let me type in sepal width over here.
[03:01:54] And after sepal width I would require the
[03:01:56] petal length column. So I'll type in
[03:01:59] petal width. And after petal width I
[03:02:02] would require the petal length column.
[03:02:05] So I'll type in petal.length.
[03:02:09] So these are all of the columns the
[03:02:11] first four columns which I'm storing
[03:02:12] into the x object. And then I also need
[03:02:15] the dependent variable which would be
[03:02:17] the species column.
[03:02:20] So inside this I'll just give in
[03:02:23] species.
[03:02:25] So I've got all of the independent
[03:02:26] variables stored in the X object and
[03:02:28] I've got my dependent variables stored
[03:02:30] in the Y object. So now that I have my
[03:02:32] dependent and independent variables with
[03:02:34] me, I will go ahead and divide this data
[03:02:36] set into train and test set. So for that
[03:02:39] purpose, I need to import
[03:02:42] train_test_split from model_selection. So I'll type
[03:02:45] from sklearn
[03:02:47] dot model_selection,
[03:02:50] I'd be importing train_test_split.
[03:02:55] Now I have the train test split method.
[03:02:57] Let me use it. And inside this I will
[03:03:00] pass in X and Y. So X consists of all of
[03:03:02] the predictors. Y consists of the
[03:03:04] dependent variable. And after this I
[03:03:06] will give in the test size. So the test
[03:03:08] size is equal to 0.3. So this would mean
[03:03:11] that 30% of the records would go into
[03:03:13] the test set and the rest of the 70%
[03:03:15] records would go into the train set. Now
[03:03:18] this would return four objects and those
[03:03:21] four objects I'd be storing in X train X
[03:03:25] test Y train and there would be Y test
[03:03:30] after this.
[03:03:35] Right? So I've divided all of this into
[03:03:37] four parts. I've got X train X test Y
[03:03:39] train and Y test. So X train basically
[03:03:41] consists of the independent variables of
[03:03:42] the train set. X test consists of
[03:03:45] independent variables of test set. Y
[03:03:47] train would have the dependent variable
[03:03:49] of the train set and Y test would have
[03:03:51] the dependent variable of the test set.
[03:03:54] So now that is done, it's time to go
[03:03:55] ahead and import the XGBoost classifier
[03:03:58] model. So I will type import xgboost
[03:04:04] as xgb.
[03:04:06] Now I will go ahead and create an
[03:04:08] instance of this. So I'll type
[03:04:11] xgb.XGBClassifier. And let's say I
[03:04:15] want 10 decision trees inside this. So I
[03:04:18] will set the value for n_estimators. So
[03:04:20] with the help of the n_estimators
[03:04:22] parameter, I can decide the number of
[03:04:24] trees which would be present in this
[03:04:26] XGBoost classifier. And I'd want 10
[03:04:29] trees to be present in this. And also I
[03:04:31] can set the tree depth. So I'll set the
[03:04:34] maximum tree depth for all of these
[03:04:36] decision trees to be equal to five. And
[03:04:39] I will store this in a new instance and
[03:04:41] name this to be XGBC.
[03:04:45] So XGB and C. Right. So I have created
[03:04:49] an instance of this XGBoost classifier.
[03:04:52] Now it's time to fit the model. So XGBC
[03:04:55] dot fit and inside this I will pass in
[03:04:58] the X train
[03:05:00] and Y train objects. So I'm basically
[03:05:03] fitting this model on top of the train
[03:05:06] set. So I have successfully built this
[03:05:08] XGBoost classifier on top of the train
[03:05:10] set. Now it's time to predict the values.
[03:05:13] So to predict the values, I'll just type
[03:05:15] XGBC dot predict and I will pass in the
[03:05:20] X test inside this. Right? So because X
[03:05:23] test is basically all of the independent
[03:05:25] variables which are present in the test
[03:05:26] set. Now for all of these values I am
[03:05:29] going ahead and predicting a result and
[03:05:32] I will store this in, let's say, y_pred.
[03:05:36] So I have also predicted the values on
[03:05:38] top of the test set. Now it's time to
[03:05:40] build the confusion matrix and see how
[03:05:42] much classification is correct.
[03:05:46] So from sklearn.metrics I'd be
[03:05:51] importing the confusion_matrix function.
[03:05:54] Now once I import confusion_matrix,
[03:05:57] inside confusion_matrix I'll pass in
[03:05:59] the actual values which are present in
[03:06:01] the test set and the predicted values
[03:06:03] which are present in y_pred. So this is
[03:06:06] our confusion matrix. So what you
[03:06:09] see here are the actual values and these are
[03:06:10] the predicted values and this left
[03:06:12] diagonal would tell you how many of the
[03:06:15] values have been correctly predicted. So
[03:06:18] this is for setosa, versicolor and virginica.
[03:06:21] So we see that all of the setosa flowers
[03:06:23] have been correctly classified. When it
[03:06:25] comes to versicolor, 12 of them have been
[03:06:27] correctly classified, two of them have
[03:06:29] been incorrectly classified. Similarly
[03:06:31] for virginica 12 of them have been
[03:06:33] correctly classified and two of them
[03:06:34] have been incorrectly classified. To
[03:06:37] find out the accuracy I'll just divide
[03:06:39] the sum of the left diagonal by the sum of all the
[03:06:41] values. So it'll be 17 + 12 + 12
[03:06:46] divided by 17 + 12 + 12 + 2 + 2. So this
[03:06:54] gives us an accuracy of 91.11%.
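The accuracy arithmetic above can be checked with a few lines of plain Python. The counts are the ones read out in the video; where exactly the four misclassified flowers sit off the diagonal is an assumption made only to fill in the matrix, and it does not affect the result.

```python
# Rows are actual classes, columns are predicted classes
# (setosa, versicolor, virginica). Off-diagonal placement is assumed.
confusion = [
    [17, 0, 0],
    [0, 12, 2],
    [0, 2, 12],
]

# Correct predictions sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"{correct} / {total} = {accuracy:.4f}")  # 41 / 45 = 0.9111
```

The same calculation applies to any confusion matrix: diagonal sum divided by the sum of every cell.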
[03:06:57] And this is how we can build the XGBoost
[03:06:59] classifier. Click on run. Right?
[03:07:02] So we have successfully created this
[03:07:04] list of lists. Now let me have a glance
[03:07:07] at the first list. So you see that the
[03:07:09] first list or in other words the first
[03:07:11] row this is our first row and this
[03:07:14] contains these items burgers meatballs
[03:07:16] and eggs. So this is the first
[03:07:18] transaction which contains these items.
[03:07:22] So now that we've converted the pandas
[03:07:24] data frame into a list of lists we can
[03:07:26] finally go ahead and build the Apriori
[03:07:28] model on top of this list of lists. So
[03:07:31] for that purpose we'll be using this
[03:07:33] apriori function which takes in all of
[03:07:36] these predefined parameters. So first
[03:07:39] we'd have to give in the object on which
[03:07:42] we are supposed to build the Apriori
[03:07:45] records and then we'll set the values
[03:07:47] for minimum support, minimum confidence
[03:07:49] and minimum lift. So in this model we
[03:07:52] are setting the minimum support to be
[03:07:54] 0.0045,
[03:07:56] or in other words the minimum support
[03:07:59] threshold is 0.45%.
[03:08:02] Now this basically means that out of all
[03:08:05] of the transactions, an itemset should
[03:08:09] appear in at least 0.45% of the
[03:08:12] transactions.
[03:08:14] Next we have minimum confidence set to
[03:08:17] be 20%. So this minimum confidence level
[03:08:19] of 20% states that out of all the transactions
[03:08:22] containing the antecedent, at least 20% should
[03:08:26] also contain the consequent. And
[03:08:28] then finally we have the minimum lift
[03:08:30] threshold. So we see that the minimum
[03:08:32] lift threshold which we've set is three.
[03:08:35] And then we have the minimum length
[03:08:36] which is equal to two. So this minimum
[03:08:39] length of two states that there needs to
[03:08:41] be at least two elements in the rule.
[03:08:43] Now I will store this result in
[03:08:45] association rules and then I'll go ahead
[03:08:48] and convert this object into a list and
[03:08:50] I'll store this in association results.
[03:08:53] I'll click on run.
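To make the three thresholds concrete, here is a minimal plain-Python sketch of how support, confidence and lift are computed for a single rule. The transactions are hypothetical toy data, not the video's data set, and the thresholds at the end are chosen for this toy data rather than taken from the video. It does not implement the Apriori search itself, only the metrics that the search filters on.

```python
# Hypothetical toy transactions (each one is a set of purchased items).
transactions = [
    {"light cream", "chicken"},
    {"light cream", "chicken", "eggs"},
    {"light cream", "eggs"},
    {"chicken", "eggs"},
    {"eggs"},
    {"burgers", "eggs"},
    {"burgers"},
    {"mineral water"},
    {"mineral water", "eggs"},
    {"light cream"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent)
    confidence = both / support(antecedent)
    lift = confidence / support(consequent)
    return both, confidence, lift


sup, conf, lift = rule_metrics({"light cream"}, {"chicken"})
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")

# A rule is kept only if it clears all three thresholds
# (illustrative values for the toy data).
keep = sup >= 0.1 and conf >= 0.2 and lift >= 1.5
print("keep rule:", keep)
```

A lift above 1 means the consequent shows up more often alongside the antecedent than it does across all transactions, which is why a minimum lift is used to discard uninteresting rules.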
[03:08:56] Now let me go ahead and have a glance at
[03:08:58] the number of rules this algorithm has
[03:09:00] given me. So I'll just use the len
[03:09:02] function which will basically give me
[03:09:03] the length of this list. So you see that
[03:09:06] this algorithm has generated 48 rules in
[03:09:09] total. Now I'll go ahead and have a
[03:09:11] glance at the first rule. So I will just
[03:09:14] type in association results and since I
[03:09:17] want to have a look at the first rule
[03:09:18] I'll set in the index value to be equal
[03:09:20] to zero. Over here we see that this is
[03:09:24] our rule, light cream and chicken, and over
[03:09:26] here we have items_base, which is light
[03:09:28] cream, and items_add, which is chicken. So
[03:09:31] over here light cream would be our
[03:09:33] antecedent and chicken would be our
[03:09:35] consequent and we are trying to
[03:09:37] determine the association between light
[03:09:40] cream and chicken or in other words we
[03:09:42] are trying to determine if someone buys
[03:09:44] light cream what is the association or
[03:09:47] what is the correlation of him buying
[03:09:49] chicken along with light cream. So for
[03:09:52] this the support value is about 0.0045,
[03:09:55] the confidence value is 0.29 and the
[03:09:57] lift value is 4. So the support is about 0.45%,
[03:10:01] confidence is 29% and lift is four. So
[03:10:05] this confidence value of 0.29 states
[03:10:07] that out of all the transactions containing the antecedent,
[03:10:11] 29% also have the consequent, which is
[03:10:14] chicken, present in them. Now again
[03:10:16] similarly let me have a glance at the
[03:10:18] second rule. I'll click on run. So this
[03:10:21] is our second rule where the antecedent
[03:10:23] is mushroom cream sauce and the
[03:10:25] consequent is escalope. So we're trying
[03:10:27] to determine if someone buys mushroom
[03:10:30] cream sauce, what is the probability of
[03:10:32] him also buying escalope with the
[03:10:35] mushroom cream sauce. So for this the
[03:10:37] support value is 0.005. The confidence
[03:10:40] value is 30% and the lift is 3.79. So
[03:10:44] this was our first model. Now similarly
[03:10:46] we'll go ahead and build another model
[03:10:48] where we'll give different threshold
[03:10:50] values. So this time over here I am
[03:10:53] setting the minimum support to be equal
[03:10:55] to 1%, minimum confidence to be equal to
[03:10:59] 50% and minimum lift to be equal to two.
[03:11:02] I'll click on run.
[03:11:05] Now again let me have a glance at the
[03:11:06] number of rules given by this algorithm.
[03:11:09] I'll click on run. So you see that there
[03:11:11] are just four rules with these minimum
[03:11:12] support, minimum confidence and minimum
[03:11:14] lift values. So now again I will go
[03:11:17] ahead and have a glance at the first
[03:11:19] rule. So in this first rule we see that
[03:11:21] the antecedent comprises eggs and
[03:11:24] ground beef and the consequent comprises
[03:11:26] mineral water. So we are trying to
[03:11:28] determine if someone buys eggs and
[03:11:30] ground beef together what is the
[03:11:33] correlation of him also buying mineral
[03:11:35] water after he or she buys eggs and
[03:11:38] ground beef. And for this rule the
[03:11:40] support is 1% the confidence is 50% and
[03:11:44] the lift is 2.12. Now similarly let me
[03:11:47] also have a glance at the second rule.
[03:11:50] So this is the second rule where the
[03:11:52] antecedent is milk and ground beef and
[03:11:55] the consequent is mineral water. So this
[03:11:57] time we're trying to determine if
[03:11:59] someone buys milk and ground beef. What
[03:12:01] is the probability of him also buying
[03:12:03] mineral water along with these two? And
[03:12:06] this time the support is 0.011, the
[03:12:08] confidence is 0.50 and the lift is 2.11.
[03:12:12] So guys this is how we can find out
[03:12:14] association rules in Python. Now what is
[03:12:16] a recommendation engine? So a
[03:12:18] recommendation engine is basically a
[03:12:20] filtering system that seeks to predict
[03:12:23] and show the items that a user would
[03:12:25] like to purchase. Now it may not be
[03:12:27] entirely accurate but if it shows you
[03:12:29] what you like then it is doing its job
[03:12:31] right. So recommender systems have
[03:12:33] become increasingly popular in recent
[03:12:35] years and are utilized in a variety of
[03:12:37] areas including music, news, books,
[03:12:40] social tags and any other products in
[03:12:42] general. They are mostly used in the digital
[03:12:45] domain: the majority of today's e-commerce
[03:12:47] sites like eBay, Amazon and Alibaba make
[03:12:50] use of proprietary recommendation
[03:12:52] algorithms in order to better serve the
[03:12:54] customers with the products that they're
[03:12:56] bound to like. And if it is set up and
[03:12:59] configured properly, it can
[03:13:00] significantly boost revenues,
[03:13:02] conversions, and other important
[03:13:04] metrics. So moreover, they can have
[03:13:06] positive effects on the user experience
[03:13:08] as well, which translates into metrics
[03:13:11] that are harder to measure, but are
[03:13:12] nonetheless of much importance to online
[03:13:15] businesses, such as customer
[03:13:17] satisfaction and retention. And all of
[03:13:20] this is possible only with a
[03:13:21] recommendation engine. So recommendation
[03:13:23] engines basically are data filtering
[03:13:26] tools that make use of algorithms and
[03:13:28] data to recommend the most relevant
[03:13:30] items to a particular user. Or in simple
[03:13:32] terms they're nothing but an automated
[03:13:35] form of a shop counter guy. So you ask
[03:13:37] him for a product, and not only does he show
[03:13:39] you that product but also the related
[03:13:42] ones which you could buy. So they are
[03:13:44] well trained in cross-selling and
[03:13:45] upselling. And with the growing amount
[03:13:48] of information on the internet and with
[03:13:49] a significant rise in the number of
[03:13:51] users, it is becoming important for
[03:13:54] companies to search, map and provide
[03:13:56] them with a relevant chunk of
[03:13:58] information according to their
[03:14:00] preferences and tastes. So let's
[03:14:02] consider an example to better understand
[03:14:04] the concept of a recommendation engine.
[03:14:06] So if I'm not wrong, almost all of you
[03:14:08] must have used Amazon for shopping,
[03:14:10] right? And just so you know, 35% of
[03:14:13] Amazon.com's revenue is generated by its
[03:14:15] recommendation engine. So let's
[03:14:17] understand the strategy. Now, Amazon
[03:14:19] uses recommendations as a targeted
[03:14:21] marketing tool in both email campaigns
[03:14:24] and on most of its website pages. So
[03:14:26] Amazon will recommend many products from
[03:14:28] different categories based on what
[03:14:31] you're browsing and pull those products
[03:14:33] in front of you which you are likely to buy.
[03:14:35] So like the frequently bought together
[03:14:37] option that comes at the bottom of the
[03:14:39] product page to lure you into buying the
[03:14:42] combo. And this recommendation engine
[03:14:44] has one main goal: to increase the average
[03:14:48] order value. That is to upsell and
[03:14:50] cross-sell customers by providing product
[03:14:53] suggestions based on the items in their
[03:14:55] shopping cart. So, Amazon uses browsing
[03:14:58] history of a user to always keep those
[03:15:00] products in the eye of the customer and
[03:15:02] it uses the ratings and reviews of the
[03:15:04] customers to display the products with a
[03:15:06] greater average in the recommended and
[03:15:08] bestselling option. Now, Amazon wants to
[03:15:10] make you buy a package rather than just
[03:15:13] one product. Say you bought a phone,
[03:15:15] it'll then recommend you to buy a case
[03:15:17] or a screen protector. It'll further
[03:15:19] send these recommendations to your email
[03:15:21] and keep you engaged with the current
[03:15:23] trend of the product or the category. So
[03:15:26] now that we have defined recommender
[03:15:27] systems, their objective and usefulness,
[03:15:29] next we'll go through the different
[03:15:31] types of popular recommender systems in
[03:15:33] use. So mainly there are three types of
[03:15:35] recommendation engines. So collaborative
[03:15:37] filtering methods are based on
[03:15:39] collecting and analyzing a large amount
[03:15:42] of information on users' behaviors,
[03:15:45] activities or preferences and predicting
[03:15:48] what users will like based on their
[03:15:50] similarity to other users. While
[03:15:52] content-based filtering methods are based
[03:15:54] on a description of the item and a
[03:15:57] profile of the user's preference. So in
[03:15:59] a content based recommendation system,
[03:16:01] keywords are used to describe the items.
[03:16:03] And a user profile is built to indicate
[03:16:05] the type of item this user likes. So
[03:16:07] recent research has demonstrated that a
[03:16:10] hybrid approach combining collaborative
[03:16:12] filtering and content-based filtering
[03:16:14] could be more effective in some cases.
[03:16:16] So hybrid approaches can be implemented
[03:16:18] in several ways: by making content-based
[03:16:20] and collaborative predictions
[03:16:22] separately and then combining them, or by
[03:16:25] adding content-based capabilities to a
[03:16:28] collaborative approach and vice
[03:16:29] versa. Or we can even unify the
[03:16:31] approaches into just one model. So
[03:16:33] Netflix is a good example of a hybrid
[03:16:35] system. They make recommendations by
[03:16:37] comparing the watching and searching
[03:16:39] habits of similar users, that is
[03:16:41] collaborative filtering, as well as by
[03:16:44] offering movies that share
[03:16:46] characteristics with films that a user
[03:16:48] has rated highly, that is content-based
[03:16:50] filtering. So now we'll discuss
[03:16:52] collaborative filtering recommender
[03:16:54] systems in detail. So in this type of
[03:16:56] recommendation engine, filtering items
[03:16:58] from a large set of alternatives is done
[03:17:00] collaboratively by users' preferences. So
[03:17:04] the basic assumption in a collaborative
[03:17:05] filtering recommender system is that if
[03:17:08] two users shared the same interests
[03:17:11] in the past, they will also
[03:17:13] have similar tastes in the future. So if
[03:17:15] for example user A and user B have
[03:17:18] similar movie preferences and user A
[03:17:20] recently watched Titanic which user B
[03:17:22] has not seen yet, then this movie
[03:17:25] Titanic is recommended to user B. So the
[03:17:27] movie recommendations on Netflix are a
[03:17:29] good example of this type of recommender
[03:17:31] system. So there are two types of
[03:17:34] collaborative filtering recommender
[03:17:35] systems. So first is user-based
[03:17:38] collaborative filtering. So in user-based
[03:17:40] collaborative filtering, recommendations
[03:17:42] are generated by considering the
[03:17:44] preferences in the user's neighborhood.
[03:17:47] So this user-based collaborative
[03:17:48] filtering is done in two steps: by
[03:17:50] identifying similar users based on
[03:17:52] similar user preferences and
[03:17:55] recommending new items to an active user
[03:17:58] based on the rating given by similar
[03:18:00] users on the items not rated by the
[03:18:02] active user. So another type of
[03:18:04] collaborative filtering is item-based.
[03:18:06] Now in item-based collaborative
[03:18:08] filtering the recommendations are
[03:18:10] generated using the neighborhood of
[03:18:12] items. So unlike user-based
[03:18:14] collaborative filtering, we first find
[03:18:16] similarities between items and then
[03:18:18] recommend non-rated items which are
[03:18:20] similar to the items that the active user
[03:18:23] has rated in the past. So item-based
[03:18:26] recommender systems are constructed in
[03:18:28] two steps. So we calculate the item
[03:18:30] similarity based on the item preferences
[03:18:33] and find the top similar items among the
[03:18:36] items not rated by the active user and
[03:18:38] recommend them. So now let's discuss
[03:18:40] each of them in depth. So first let's
[03:18:43] try to understand how user-based
[03:18:45] collaborative filtering works. As
[03:18:47] previously mentioned the basic intuition
[03:18:49] behind user-based collaborative
[03:18:51] filtering systems is that people with
[03:18:53] similar tastes in the past will like
[03:18:56] similar items in the future as well. For
[03:18:58] example, if user A and user B have very
[03:19:01] similar purchase histories and if user A
[03:19:04] buys a new book which user B has not yet
[03:19:06] seen, then we can suggest this book to
[03:19:09] user B as they have similar taste. So
[03:19:11] from here we know that we need to
[03:19:12] compute the similarity between users in
[03:19:15] user-based collaborative filtering. So
[03:19:18] how do we measure the similarity? So
[03:19:19] there are basically two options: Pearson
[03:19:22] correlation and cosine similarity.
[03:19:24] So let u_ik denote the similarity
[03:19:27] between user i and user k, and v_ij
[03:19:31] denote the rating that user i gives to
[03:19:33] item j. So now both the measures, Pearson
[03:19:37] correlation and cosine similarity, are
[03:19:39] commonly used. So the difference is that
[03:19:41] Pearson correlation is invariant to
[03:19:44] adding a constant to all the elements.
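The difference between the two measures can be checked numerically. The snippet below is a minimal sketch, not the presenter's code: the two rating lists are made-up values, and `pearson` and `cosine` are illustrative helper names. It adds a constant to every rating of one user and compares both similarity measures before and after the shift.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cosine(x, y):
    """Cosine similarity between two equal-length rating lists."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical ratings of two users on the same four movies.
user_i = [1.0, 2.0, 3.0, 5.0]
user_k = [2.0, 2.0, 4.0, 5.0]

# The same user i, but rating every movie one point higher.
shifted_i = [r + 1.0 for r in user_i]

print("Pearson:", pearson(user_i, user_k), pearson(shifted_i, user_k))
print("Cosine: ", cosine(user_i, user_k), cosine(shifted_i, user_k))
```

The Pearson value stays the same after the shift because each user's mean is subtracted first, while the cosine value changes. This is one reason Pearson correlation is often preferred when some users rate systematically higher than others.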
[03:19:47] So now we can predict the user's opinion
[03:19:49] on the unrated items with this equation
[03:19:52] over here. So now let us illustrate it
[03:19:54] with a concrete example. So in the
[03:19:56] following matrix each row represents a
[03:19:59] user while the columns correspond to
[03:20:01] different movies except the last one
[03:20:03] which records the similarity between
[03:20:05] that particular user and the target
[03:20:07] user. So each cell represents the rating
[03:20:10] that the user gives to that movie. So
[03:20:12] assume user E is the target over here.
[03:20:15] Since user A and user F do not share any
[03:20:18] movie ratings in common with user E,
[03:20:20] their similarities with user E are not
[03:20:22] defined in Pearson correlation.
[03:20:24] Therefore we only need to consider user
[03:20:26] B, C and D. So based on Pearson
[03:20:30] correlation, we can compute the
[03:20:31] following similarity and this is the
[03:20:34] resultant table that we see over here.
[03:20:36] So in this table you can see that user D
[03:20:38] is very different from user E as the
[03:20:41] Pearson correlation between them is
[03:20:42] negative. So user D rated Me Before You
[03:20:45] higher than his average rating while
[03:20:47] user E did the exact opposite. So now we
[03:20:50] can start to fill in the blanks for the
[03:20:52] movies that user E has not rated based
[03:20:55] on other users. Now although computing
[03:20:57] user-based collaborative filtering is
[03:20:59] very simple, it suffers from several
[03:21:01] problems. So one main issue is that
[03:21:03] users' preferences can change over time.
[03:21:06] So it indicates that precomputing the
[03:21:09] matrix based on their neighboring users
[03:21:11] may lead to bad performance. And to
[03:21:13] tackle this problem, we can apply
[03:21:15] item-based collaborative filtering. So now we'll
[03:21:17] look at how item-based
[03:21:18] collaborative filtering works. So now
[03:21:20] instead of measuring the similarity
[03:21:22] between two users, item-based
[03:21:24] collaborative filtering recommends items
[03:21:27] based on their similarity with the items
[03:21:30] that the target user rated. So likewise
[03:21:33] the similarity can be computed with Pearson
[03:21:35] correlation or cosine similarity. And
[03:21:37] the major difference is that with
[03:21:39] item-based collaborative filtering we fill in
[03:21:41] the blanks vertically as opposed to the
[03:21:43] horizontal manner that user-based
[03:21:46] collaborative filtering uses. So the
[03:21:48] following table shows how to do so for
[03:21:50] the movie Me Before You. It successfully
[03:21:52] avoids the problem posed by dynamic user
[03:21:55] preference as item-based collaborative
[03:21:57] filtering is more static. However,
[03:21:59] several problems remain for this method
[03:22:01] as well. The first main issue is
[03:22:04] scalability. The computation grows with
[03:22:06] both the customers as well as the products
[03:22:08] and the worst case complexity is
[03:22:11] O(mn) with m users and n items. In
[03:22:14] addition, sparsity is another
[03:22:17] concern. So take a look at the above
[03:22:18] table again. Now although there is just
[03:22:20] one user who rated both The Matrix and
[03:22:23] Titanic, the similarity between them is
[03:22:26] one. So in extreme cases, we can have
[03:22:29] millions of users and the similarity between
[03:22:31] two fairly different movies could be
[03:22:33] very high simply because they have
[03:22:36] similar ratings from the only user who
[03:22:38] rated them both. In other words, the
[03:22:41] genres of The Matrix and Titanic are
[03:22:43] completely different, and yet
[03:22:45] the similarity over here is given to be
[03:22:47] one. So while building collaborative
[03:22:49] filtering recommender systems, we need
[03:22:51] to know how to calculate the similarity
[03:22:53] between users, how to calculate the
[03:22:55] similarity between items, how
[03:22:57] recommendations are generated, and how
[03:22:59] to deal with new items and new users
[03:23:02] whose data is not known. So the
[03:23:04] advantage of collaborative filtering
[03:23:05] systems is that they are simple to
[03:23:07] implement and very accurate. However,
[03:23:10] they have their own limitations,
[03:23:11] such as the cold start problem, which
[03:23:14] means collaborative filtering systems
[03:23:16] fail to recommend to first-time
[03:23:18] users whose information is not available
[03:23:20] in the system. So let's understand what
[03:23:22] dimensionality reduction is. So
[03:23:24] dimensionality reduction refers to the
[03:23:26] process of converting a set of data
[03:23:28] having vast dimensions into data with
[03:23:31] fewer dimensions, in turn ensuring that
[03:23:34] it conveys similar information
[03:23:35] concisely. In general, it is used to
[03:23:38] reduce the complexity of data while
[03:23:41] keeping the relevant structure intact.
[03:23:43] So this method is typically used to
[03:23:45] solve machine learning problems to
[03:23:47] obtain better features. Now let's have a
[03:23:49] look at the types of dimensionality
[03:23:51] reduction. Broadly dimensionality
[03:23:53] reduction has two classes. Feature
[03:23:56] elimination and feature extraction. So
[03:23:58] feature elimination is the process of
[03:24:01] removing some variables completely if
[03:24:03] they are redundant or if they're not
[03:24:05] providing any new information about the
[03:24:06] data set and this helps us to keep a
[03:24:09] data set small. However, one major
[03:24:12] drawback is we might lose some
[03:24:13] information. On the other hand, feature
[03:24:16] extraction is nothing but extracting the
[03:24:18] information of new variables from old
[03:24:20] variables. For example, say you have 10
[03:24:23] variables in your data set. Then feature
[03:24:25] extraction will create 10 new variables,
[03:24:28] each of which is a combination of all the 10 old
[03:24:30] variables. Well, PCA is one such
[03:24:33] technique that works on feature
[03:24:35] extraction. So why do we need
[03:24:37] dimensionality reduction? The main
[03:24:39] motive behind this technique is to
[03:24:41] decrease the unwanted dimensions in
[03:24:43] machine learning. For example, consider
[03:24:45] a motorbike rider in racing
[03:24:47] competitions. So the rider's position and
[03:24:50] movements are measured by various
[03:24:52] sources such as GPS sensors on the bike,
[03:24:55] gyrometers, multiple video feeds and
[03:24:57] smart devices. Now the data analyst has
[03:25:00] to analyze the racing strategy of a
[03:25:02] biker where he'll have a lot of
[03:25:05] variables or dimensions which are
[03:25:07] similar and of little incremental
[03:25:08] value. Hence such data has to be
[03:25:11] treated to reduce the number of
[03:25:12] dimensions or remove unwanted
[03:25:15] dimensions. Now let's have a look at
[03:25:17] some applications of dimensionality
[03:25:19] reduction. The best example which we can
[03:25:22] give is image processing. So I'm sure
[03:25:24] you might have come across this Facebook
[03:25:26] application, "Which celebrity do you look
[03:25:29] like?" The concept behind this is a dimension
[03:25:31] reduction technique to identify the
[03:25:34] matching celebrity image. The algorithm
[03:25:36] uses pixel data and each pixel is
[03:25:39] equivalent to one dimension. So in every
[03:25:42] image there is a high number of pixels
[03:25:44] that is a high number of dimensions and
[03:25:47] every dimension is important here. Now
[03:25:50] you can't just omit dimensions randomly
[03:25:52] to make better sense of your overall
[03:25:53] data set. So in such cases dimension
[03:25:56] reduction techniques help you to find
[03:25:58] the significant dimensions using various
[03:26:01] methods. Now let us move on to principal
[03:26:03] component analysis. So what is principal
[03:26:06] component analysis? The process of
[03:26:09] reducing the number of random variables
[03:26:11] of the data set under consideration by
[03:26:14] obtaining a set of principal variables
[03:26:16] is nothing but PCA. So the main idea
[03:26:19] behind PCA is to find the low-dimensional
[03:26:21] set of axes that precisely fit the data.
[03:26:25] For better understanding, consider a
[03:26:27] data set of vehicle properties which
[03:26:29] includes different variables such as
[03:26:32] number of wheels, the color of the car,
[03:26:34] the number of seats and so on.
[03:26:36] Many of these features are
[03:26:38] related and will be redundant. Hence
[03:26:40] this redundant data has to be eliminated
[03:26:43] to describe the car with few features.
[03:26:45] So this is where the PCA comes into the
[03:26:48] picture. For instance, think about the
[03:26:50] number of wheels. Almost all cars and
[03:26:53] buses will have four wheels and only in
[03:26:55] very rare cases some buses will have six
[03:26:58] wheels and hence we can say that this
[03:27:01] feature has less variance. So this
[03:27:03] feature will make the bus and car look
[03:27:05] the same, but in fact they're actually
[03:27:07] not. Now let's consider the height
[03:27:10] variable. Now it's very obvious that
[03:27:12] cars and buses will have completely
[03:27:14] different values when it comes to
[03:27:16] height. Well, this property can be taken
[03:27:18] into account to separate them as PCA
[03:27:21] does not take information of classes
[03:27:23] into account. It'll just consider the
[03:27:25] variance of each feature. Now let us
[03:27:29] move on to the working mechanism of
[03:27:30] PCA. So consider that some data is
[03:27:33] placed in an oval shape on the
[03:27:35] coordinate axes. In order to find the
[03:27:38] direction with more variance, we'd have
[03:27:40] to find the line where data is more
[03:27:42] spread out when projected onto it. Here
[03:27:44] the data is not spread out widely.
[03:27:46] Therefore, it will not have large
[03:27:48] variance and probably it is not the
[03:27:50] principal component we are looking for.
[03:27:52] Now let us try using the horizontal
[03:27:54] line. So on this line we can see that
[03:27:57] the data is more spread out and has
[03:27:59] large variance and hence it can be
[03:28:01] considered as the direction for our
[03:28:03] principal component. Now you might be
[03:28:05] thinking how can we find out this
[03:28:07] principal component? Well, we have
[03:28:10] eigenvectors and eigenvalues to find the principal
[03:28:12] component rather than drawing lines and
[03:28:15] unevenly shaped triangles. So what are
[03:28:17] these eigenvectors and eigenvalues? So
[03:28:20] these two exist in pairs. Every eigenvector
[03:28:23] has a corresponding eigenvalue. The line
[03:28:26] or the vector which we drew in our
[03:28:28] previous example is nothing but the
[03:28:30] eigenvector and the variance is the
[03:28:32] eigenvalue. So in the example above the
[03:28:35] eigenvalue is a number telling us how spread
[03:28:38] out the data is on the line and the
[03:28:40] eigenvector with the highest eigenvalue is
[03:28:43] therefore the principal component. For
[03:28:45] better understanding, let us consider
[03:28:47] that you have two measures, age and
[03:28:50] hours on the internet. So there are two
[03:28:52] variables and hence it'll be a
[03:28:54] two-dimensional data set and therefore
[03:28:56] it'll have two eigenvalues and two eigenvectors.
[03:29:00] The reason for this is that eigenvectors
[03:29:02] put the data into a new set of
[03:29:04] dimensions and the number of new dimensions has
[03:29:07] to be equal to the original number of
[03:29:08] dimensions. So this sounds complicated
[03:29:11] but again an example should make it
[03:29:13] clear. So you can see that the oval is on the X
[03:29:16] and Y axes where X refers to age and Y
[03:29:19] refers to hours on the internet. Now let
[03:29:21] us split the data by drawing a line. So
[03:29:24] the eigenvectors should be able to span
[03:29:26] the whole XY area and therefore they
[03:29:29] need to be orthogonal to each other.
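The eigenvector idea above can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions, not the presenter's code: the age and hours-on-the-internet values are synthetic, generated only so that the two variables are correlated, and names such as `principal` and `projected` are made up for the sketch.

```python
import numpy as np

# Synthetic two-variable data set: age (years) and hours on the internet.
# The relationship between the two is invented purely for illustration.
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, size=200)
hours = 8.0 - 0.1 * age + rng.normal(0.0, 0.5, size=200)
X = np.column_stack([age, hours])

# Move the data to the origin, then compute the 2x2 covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order and the matching eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort so that the direction with the largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

principal = eigvecs[:, 0]   # the principal component direction
projected = Xc @ eigvecs    # the data expressed in the new axes

print("eigenvalues:", eigvals)
print("dot product of the two eigenvectors:", eigvecs[:, 0] @ eigvecs[:, 1])
```

Two variables give two eigenvalue and eigenvector pairs, the two eigenvectors are orthogonal (their dot product is numerically zero), and the variance of the data along each new axis equals the corresponding eigenvalue.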
[03:29:31] Hence the second vector would look like
[03:29:34] this. Now we'll just reframe the data in
[03:29:37] these new dimensions and the resultant
[03:29:39] is as shown below. So you can also make
[03:29:42] sure that nothing changes by moving the
[03:29:44] data to the origin. So these
[03:29:46] eigenvectors basically give us a new set of
[03:29:48] axes and these axes are much more
[03:29:51] intuitive to the shape of the data,
[03:29:53] as these are the directions where most
[03:29:55] of the variation occurs. Now let us see
[03:29:58] how PCA is used to reduce the dimensions
[03:30:01] of a given data set. So it reduces the
[03:30:04] data down into basic components by
[03:30:06] taking out the unnecessary parts. So
[03:30:09] consider that you're measuring three
[03:30:11] variables: age, hours on internet and
[03:30:14] hours on mobile. Now imagine that the
[03:30:16] data forms an oval shape on a plane as
[03:30:19] shown in the figure. Since we have
[03:30:21] considered three measures, it'll have
[03:30:23] three eigenvectors and three eigenvalues
[03:30:26] for the data set and out of these two
[03:30:29] eigenvectors will have large eigenvalues and
[03:30:31] one eigenvector will have an eigenvalue of zero. Here
[03:30:34] EV1 is the first eigenvector, EV2 is the
[03:30:38] second eigenvector and EV3 is the third
[03:30:41] one. And since it has an eigenvalue of
[03:30:43] zero, we'll not consider that. So we can
[03:30:46] now rearrange the axes to the origin and
[03:30:49] we'll also represent the data in two
[03:30:51] dimensions as the eigenvalue of EV3 is zero. So what we've
[03:30:54] done here is nothing but dimension
[03:30:55] reduction and hence the
[03:30:57] three-dimensional problem is now reduced
[03:30:59] to two-dimensional problem. So reducing
[03:31:02] dimensions helps to simplify the data
[03:31:04] and make it easier to visualize. So this
[03:31:07] is how the PCA algorithm works. So here we
[03:31:09] have a multi-dimensional movie rating
[03:31:11] data set where every dimension or column
[03:31:14] is a movie and different users are rows.
[03:31:16] So we'll call this matrix A.
[03:31:19] You may have learned about PCA, which is
[03:31:21] principal component analysis. So if
[03:31:23] you use PCA on this particular data set,
[03:31:25] it can boil down this data set to a
[03:31:28] much smaller number of dimensions and
[03:31:30] those dimensions will best describe the
[03:31:32] variance in the data, and often the
[03:31:34] dimensions that PCA finds
[03:31:37] will correspond to the features that we
[03:31:39] have learned as humans in the real
[03:31:41] world. So for example, if you run PCA on
[03:31:44] this particular data set, it will
[03:31:45] find those latent features and extract
[03:31:48] them from the data. So PCA actually would
[03:31:50] not know what those features mean but
[03:31:52] still it will find those features. So
[03:31:53] we'll ask PCA to find three dimensions in
[03:31:55] this data, and it
[03:31:59] will boil our ratings down
[03:32:02] to three latent features that it
[03:32:04] identified. So PCA would not know what
[03:32:06] those features are and what to call
[03:32:08] them. But let's say that they end up
[03:32:10] being measures of each person's interest
[03:32:12] in action, sci-fi or comedy movies.
[03:32:15] So for example we might think of user
[03:32:17] one as being 30% interested in
[03:32:20] action movies, 50% interested in sci-fi
[03:32:22] movies and 20% interested in comedy
[03:32:25] movies. So now let's take a look at the
[03:32:27] columns in this new matrix. So each
[03:32:30] column here is a description of users
[03:32:32] that make up that particular feature. So
[03:32:33] we can say that the action can be
[03:32:35] described as the 30% of user one and the
[03:32:38] 10% of user two and the 30% of user
[03:32:41] three. So let's call this matrix V.
[03:32:43] So just like we can run PCA on our user
[03:32:46] rating data set to find profiles of
[03:32:48] different kinds of users, we can now
[03:32:50] flip things around and try to run PCA to find
[03:32:52] profiles of different kinds of movies.
[03:32:54] So if we rearrange our input data so
[03:32:56] that now the rows are the movies and the
[03:32:59] columns are the users, it will look
[03:33:01] like this. So we'll call this the
[03:33:03] transpose of the original ratings matrix
[03:33:05] or A^T for short. So now if we run
[03:33:08] PCA on this particular data set, it
[03:33:11] will identify some latent features again
[03:33:13] and it will describe each individual
[03:33:15] movie as some combination of them. So
[03:33:17] each column is now a description of
[03:33:19] some typical movie that exhibits some
[03:33:22] latent feature. So again these columns
[03:33:25] have no inherent meaning but in practice
[03:33:27] they might fall along the movie genre
[03:33:29] lines but in reality it is more complex
[03:33:31] to name these features. So now let's
[03:33:33] call this particular resulting matrix
[03:33:35] that describes typical movies U. So
[03:33:38] how do these matrices that
[03:33:40] describe typical users and typical
[03:33:42] movies help us predict ratings? So it
[03:33:44] turns out that the typical movie
[03:33:46] matrix, which is U here, and the typical
[03:33:49] user matrix transpose, which is V^T,
[03:33:51] are both factors of the
[03:33:53] original matrix A that we started with.
[03:33:55] So here A is an m-by-n matrix with m rows
[03:33:58] and n columns and U here is an m-by-m matrix,
[03:34:01] so it has m rows and m columns. And S is an
[03:34:03] m-by-n matrix and V^T is an n-by-n
[03:34:07] matrix. So if we have both U and
[03:34:10] V, we can reconstruct the whole
[03:34:12] matrix A. And even if there are missing
[03:34:14] values or missing ratings, if we have
[03:34:16] both U and V, we can fill in all
[03:34:19] those missing values in A using these two
[03:34:21] matrices. So that is why we call
[03:34:22] this matrix factorization. So we'll
[03:34:24] describe our training data A in terms of
[03:34:27] smaller matrices that are the
[03:34:28] factors of the ratings that we want to
[03:34:30] predict. So there is also a matrix S
[03:34:32] which is in the middle. So it is just a
[03:34:34] simple diagonal matrix which means that
[03:34:36] the non-zero elements will be on the
[03:34:38] diagonal of this matrix and rest of the
[03:34:40] elements will be zero. So the only
[03:34:41] purpose of this matrix is to scale the
[03:34:44] values so that we end up in the proper
[03:34:46] scale of the values that we calculate
[03:34:48] using U and V. So here U and V are both
[03:34:51] orthogonal matrices, which means that if
[03:34:53] you multiply these matrices with their
[03:34:55] transposes you'll get the identity
[03:34:56] matrix. So
[03:35:16] you can just multiply the scaling matrix into V or U
[03:35:19] and still think of A as just the product
[03:35:21] of the two matrices, that is U and V. So
[03:35:23] you can reconstruct the whole of A at once by
[03:35:26] multiplying these factors together and
[03:35:28] get the ratings for every combination of
[03:35:30] the users and the items, or the movies
[03:35:33] in this particular case. So once you
[03:35:35] have these factors, you can also just
[03:35:37] predict a rating for a specific user. So
[03:35:39] here is a screenshot of a real data set
[03:35:42] that we'll code in our coding
[03:35:43] environment. Suppose you want to
[03:35:47] predict the rating of customer 7
[03:35:49] for movie three. So the columns
[03:35:51] represent the movie ids and the rows
[03:35:54] represent the customer ids or the user
[03:35:56] ids. So if you want to predict
[03:35:58] the rating of customer seven, that is user
[03:36:00] seven, for movie three, we can do
[03:36:02] that by taking the dot product of the
[03:36:04] respective row in U for the user and
[03:36:06] the respective column in V^T for that
[03:36:09] particular movie, which will give us
[03:36:10] the predicted rating of user ID 7
[03:36:13] for movie 3. So finally we are going
[03:36:15] to tie this all together. So later on in
[03:36:17] this course we'll use a built-in
[03:36:19] recommender called SVD. So SVD is
[03:36:21] known to produce very accurate results
[03:36:23] and SVD stands for singular value
[03:36:25] decomposition. So it is just a way of
[03:36:28] computing the matrices U, S and V all
[03:36:31] together all at once very efficiently.
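The factorization can be tried out on a small, complete rating matrix with NumPy's `np.linalg.svd`. This is a minimal sketch and not the built-in recommender mentioned in the course: the ratings are made-up values, and the matrix is laid out with users as rows and movies as columns, so here the rows of U belong to users and the columns of V^T belong to movies.

```python
import numpy as np

# Hypothetical complete rating matrix: 5 users (rows) x 4 movies (columns).
A = np.array([
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
    [5, 5, 2, 1],
], dtype=float)

# full_matrices=True gives U as m x m and Vt as n x n, matching the shapes
# described above. s holds the singular values (the diagonal of S).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the m x n diagonal scaling matrix S from the singular values.
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)

# Multiplying the factors back together reconstructs A.
A_rebuilt = U @ S @ Vt

# A single rating is the dot product of one (scaled) row of U with one
# column of Vt. Here: user index 2, movie index 3.
pred = (U @ S)[2] @ Vt[:, 3]

# Keeping only the k largest singular values gives a low-rank approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("shapes:", U.shape, S.shape, Vt.shape)
print("reconstructed rating:", pred, "original:", A[2, 3])
```

Because this matrix has no missing entries, the full factorization reproduces every rating exactly. Truncating to `k` latent features is what yields a compact description of users and movies.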
[03:36:33] So all SVD does is to run PCA on both
[03:36:37] the users and the movies and it gives us
[03:36:39] back the matrices that we need which are
[03:36:41] the factors of the original rating
[03:36:43] matrix that we had. So SVD is just a way
[03:36:45] to get all of these factors in one go.
[03:36:47] So mathematically we can represent the
[03:36:49] expected rating or the predicted rating
[03:36:51] as the dot product of U and V^T, where
[03:36:54] p_u here is your U and q is your
[03:36:57] V. So what SVD does is it tries to
[03:36:59] find the values of p and q that
[03:37:02] minimize this particular loss
[03:37:04] function. So now the problem arises when
[03:37:06] we have missing data in our rating data
[03:37:08] set, which most of the time we'll
[03:37:10] have. So this is an actual screenshot of
[03:37:12] the Netflix Prize data where you can see
[03:37:15] most of the values are missing. So when
[03:37:17] our original matrix, that is A, has some
[03:37:19] missing values, we cannot actually
[03:37:20] run PCA on this matrix because it has
[03:37:23] missing values. It must be a complete
[03:37:25] matrix to run PCA. So one way to solve this is
[03:37:28] to just fill these missing values using
[03:37:30] the mean values. But there is a better
[03:37:32] way to do that. So if you remember from
[03:37:34] the previous slides that every rating
[03:37:36] can be described as a dot product of some
[03:37:39] row in the matrix U and some column in
[03:37:42] the matrix V^T. So if you want to
[03:37:44] find a rating for user one,
[03:37:46] we'll take the dot product of the
[03:37:49] corresponding row in the matrix U and
[03:37:52] the corresponding column in the matrix
[03:37:54] V^T. So let's assume that we have at
[03:37:56] least some known ratings for any given
[03:37:58] row and column in both matrices U and V.
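One way to make this concrete is to fit the two factor matrices on the known ratings only, using gradient descent on the squared prediction error. The code below is a generic sketch of that idea and not the exact loss shown on the slide: the function name `factorize_sgd`, the small rating matrix, and the values of `k`, `lr`, `reg` and `epochs` are all illustrative assumptions, and missing ratings are marked with NaN.

```python
import numpy as np

def factorize_sgd(A, k=2, lr=0.01, reg=0.02, epochs=2000, seed=0):
    """Learn P (users x k) and Q (items x k) so that P @ Q.T approximates
    the known entries of A. Missing entries are NaN and are skipped."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    P = rng.normal(scale=0.1, size=(m, k))
    Q = rng.normal(scale=0.1, size=(n, k))
    known = [(u, i) for u in range(m) for i in range(n)
             if not np.isnan(A[u, i])]
    for _ in range(epochs):
        # Visit the known ratings in a random order each epoch.
        for idx in rng.permutation(len(known)):
            u, i = known[idx]
            err = A[u, i] - P[u] @ Q[i]      # error on one known rating
            p_old = P[u].copy()
            # Gradient step on squared error with L2 regularization.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

# Hypothetical 4 users x 4 movies matrix with four missing ratings.
A = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

P, Q = factorize_sgd(A)
pred = P @ Q.T                    # every cell filled in, including the gaps
mask = ~np.isnan(A)
rmse = np.sqrt(np.mean((A[mask] - pred[mask]) ** 2))
print("RMSE on known ratings:", rmse)
print("predicted rating for user 0, movie 2:", pred[0, 2])
```

Each prediction is still a dot product of a user row and a movie row of the learned factors, so once training finishes the missing cells are filled in by the same multiplication that reproduces the known ones.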
[03:38:01] So we can treat this as a minimization
[03:38:03] problem where we actually try to find
[03:38:05] the values of those complete rows and
[03:38:08] columns that best minimize the errors in
[03:38:10] the known ratings in the matrix A. So
[03:38:12] there are a lot of machine learning
[03:38:13] techniques that we can use to solve this
[03:38:15] problem. So one technique is called
[03:38:17] stochastic gradient descent, which is SGD
[03:38:19] for short. So basically what it does is
[03:38:21] it just keeps iterating at some given
[03:38:24] learning rate until it arrives at a minimum
[03:38:26] error value. So again SGD is just one
[03:38:28] way of doing it. So there is another
[03:38:29] method called ALS, which is
[03:38:31] alternating least squares, that we can use to
[03:38:33] solve this problem. And when we say that
[03:38:35] we are using SVD for recommendations, it
[03:38:37] is not really just SVD because you
[03:38:39] cannot actually apply SVD to missing
[03:38:42] data. So it is just an SVD-inspired
[03:38:44] algorithm, not a pure SVD. So a specific
[03:38:47] variant of SVD is called SVD++,
[03:38:50] and there's another technique called
[03:38:52] restricted Boltzmann machines which are used
[03:38:54] to solve these problems. So the
[03:38:56] important points that you have to take
[03:38:57] away are that you can think of all the
[03:38:59] ratings for a set of users and
[03:39:01] items as a matrix A, and the matrix A can
[03:39:05] be factored into smaller matrices that
[03:39:07] describe the general categories of the
[03:39:08] users and the items and that can be
[03:39:10] multiplied together. So a quick way to
[03:39:12] get those matrices is by using a
[03:39:13] technique called SVD, which stands for
[03:39:15] singular value decomposition. So once we
[03:39:17] have those factor matrices, you can
[03:39:19] predict the rating of any given item. So
[03:39:21] the rating is predicted by taking the dot
[03:39:23] product of a row from one matrix and a column from the other. So techniques
[03:39:26] such as SGD and ALS, which is alternating
[03:39:28] least squares, can be used to learn the best
[03:39:31] values of those factor matrices when
[03:39:33] you have missing data. So now let's
[03:39:34] implement the SVD algorithm in Python on
[03:39:36] the MovieLens data set. So I'll start
[03:39:38] by importing the required libraries, that
[03:39:39] is numpy as np and pandas. So once you
[03:39:43] have imported these libraries, now let's
[03:39:45] import our data sets. So we have
[03:39:47] two data sets called ratings and movies.
[03:39:49] So first of all I'll import the ratings
[03:39:51] CSV which I have already uploaded to the
[03:39:53] Jupyter notebook. So we have four
[03:39:55] columns in this data set that is user
[03:39:57] ID, movie ID, rating and timestamp. So
[03:39:59] we'll use these columns and we'll store
[03:40:01] the data set in the ratings data frame. And
[03:40:03] after that if you want to see the head
[03:40:05] of this data frame, that is the first
[03:40:06] five rows, we'll run ratings.head()
[03:40:08] and it will print the first five rows of
[03:40:10] this data set. So we have user ID for a
[03:40:12] particular user and the movie ID that
[03:40:14] this particular user has rated and then
[03:40:16] the rating that this particular user has
[03:40:18] given to these movie IDs and then the
[03:40:20] timestamps. So now let's see the second
[03:40:22] data set, that is our movies.csv. So
[03:40:24] we'll use the read_csv function to read
[03:40:26] this data set and we have three columns,
[03:40:28] that is movie ID, title and genres. So
[03:40:31] we'll store this in the movies data frame.
[03:40:33] So once we have imported these two data
[03:40:35] sets, now let's look at the heads of both
[03:40:37] of these data sets. So
[03:40:40] movies.head() prints these three columns.
[03:40:42] So we have movie ID as the first column,
[03:40:44] which corresponds to the movie
[03:40:46] IDs in our ratings data set, and then we
[03:40:48] have the title of these movies the year
[03:40:50] of the release and then we have the
[03:40:51] genres of these movies. So if you want
[03:40:53] to look at the head of the ratings data
[03:40:54] set we have user IDs movie ids ratings
[03:40:57] and timestamps. So this is our MovieLens
[03:40:59] data set, and we
[03:41:00] are going to make a recommender
[03:41:03] system using the SVD algorithm. So before
[03:41:05] that we'll calculate how many unique
[03:41:07] users are there and how many unique
[03:41:09] movies are there in our data set. So in the
[03:41:11] ratings data set, our user ID column
[03:41:13] contains the user ID for different
[03:41:15] users. So each user might have given
[03:41:17] ratings to more than one movie. So here
[03:41:19] you can see this particular user with
[03:41:21] the user ID of one has given movie
[03:41:22] number two a rating of 3.5. Movie number
[03:41:25] 29 a rating of 3.5. So there are
[03:41:27] different movies and movie ids and the
[03:41:29] different ratings. So if you want to
[03:41:30] know how many unique users are there so
[03:41:32] we'll use the unique method of the
[03:41:35] column that is user ID, and then we
[03:41:37] only want to get the first element. So
[03:41:39] if you omit the .shape attribute from
[03:41:41] here and run this, you'll get an
[03:41:44] array, and if you just write .shape,
[03:41:48] you'll get a tuple where the first
[03:41:51] element of the tuple is the number of
[03:41:52] users or unique users in our data set.
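As a small illustration of the counting step, here is a sketch on a made-up ratings table (the rows are hypothetical, not the real data set). It shows the array returned by unique(), the tuple returned by .shape, and the first element of that tuple.

```python
import pandas as pd

# Hypothetical miniature ratings table.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [2, 29, 2, 32, 29],
    "rating": [3.5, 3.5, 4.0, 5.0, 2.0],
})

unique_users = ratings["userId"].unique()  # NumPy array of distinct user IDs
print(unique_users)
print(unique_users.shape)                  # one-element tuple

n_users = unique_users.shape[0]            # first element of the tuple
n_movies = ratings["movieId"].unique().shape[0]
print(f"Number of users: {n_users}, number of movies: {n_movies}")
```

pandas also offers `ratings["userId"].nunique()`, which returns the same count directly.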
[03:41:54] So if you want to access the first
[03:41:56] element you will just write the
[03:41:57] indexing. So in the indexing or the
[03:42:00] slicing we'll write the zero or the
[03:42:02] first index that is our users. So we
[03:42:04] have around 7,120 users in our whole
[03:42:07] data set. And then if you want to know
[03:42:08] how many movies or unique movies are
[03:42:10] there in our data set. So we'll use the
[03:42:12] same unique method on the movie ID
[03:42:14] column. And then we'll get the first
[03:42:15] element. So once we print this, so we
[03:42:18] have printed it using an f-string. So
[03:42:20] an f-string is just a string literal that
[03:42:21] can embed Python expressions. So if
[03:42:24] you have a Python expression that you
[03:42:25] want to print inside a string, as we
[03:42:27] have done with "number of users", you
[03:42:28] have to keep the Python expression in
[03:42:30] braces, and these Python
[03:42:32] expressions are the variables that we
[03:42:33] have just calculated above that is
[03:42:35] number of users and number of movies. So
[03:42:37] once you print this, you'll get number
[03:42:40] of users as 7,120 and number of movies
[03:42:43] as 14,26. So we have 7,100 unique users
[03:42:47] who have given different ratings to
[03:42:49] 14,26 movies. So now before moving ahead
[03:42:52] as I've already told you, our
[03:42:54] SVD algorithm takes a matrix,
[03:42:57] which is a real rating matrix. So that
[03:42:59] matrix should be arranged so that the
[03:43:02] columns are the
[03:43:04] movie IDs and the user IDs are represented by
[03:43:07] rows. So we'll do that by using the
[03:43:10] pivot method of the data frames. So our
[03:43:13] data frame is ratings. So we'll use the
[03:43:15] pivot method. So it will create a table
[03:43:17] with the movie IDs on the
[03:43:20] column side and the
[03:43:23] user IDs on the row side. So the index of
[03:43:25] this particular matrix will be user ID
[03:43:27] and the columns will be represented by
[03:43:29] movie ids and the values in the columns
[03:43:31] and the rows will be represented by our
[03:43:32] ratings. So these are the different
[03:43:34] ratings given by different users to
[03:43:35] different movies in this ratings data
[03:43:37] set. And if you want to fill all the
[03:43:39] missing values with zero, we'll write
[03:43:41] fillna(0). And if you delete this
[03:43:44] particular method from here and you
[03:43:46] print it, you'll get all the
[03:43:48] missing values represented by NaN values.
[03:43:50] And if you want to replace these NaN
[03:43:52] values by zero, we'll use
[03:43:55] fillna(0). So it will replace all the NaN
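The pivot step can be sketched like this on a made-up ratings table (the rows are hypothetical). Users go to the rows, movies to the columns, and the ratings fill the cells. Without fillna the unrated cells are NaN, and with fillna(0) they become zero.

```python
import pandas as pd

# Hypothetical miniature ratings table.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [2, 29, 2, 32],
    "rating": [3.5, 3.5, 4.0, 5.0],
})

# Rows are users, columns are movies, cells are ratings.
matrix_with_nan = ratings.pivot(index="userId", columns="movieId", values="rating")
print(matrix_with_nan)   # unrated (user, movie) pairs show up as NaN

rating_matrix = matrix_with_nan.fillna(0)
print(rating_matrix)     # the same table with NaN replaced by 0
```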
[03:43:57] values with zero. So now to implement
[03:44:00] the SVD algorithm in Python, we're
[03:44:02] going to use the library called
[03:44:04] scikit-surprise. So if you want to install this
[03:44:06] library, you have to run
[03:44:08] this particular command in your Anaconda
[03:44:11] prompt. So if you go to your Anaconda
[03:44:13] folder and in the Anaconda prompt you
[03:44:16] type this particular command and
[03:44:18] run it, you'll be asked to
[03:44:21] confirm with y or n. So you press y and
[03:44:23] then this particular package, which is
[03:44:24] scikit-surprise, will be installed in
[03:44:26] your Anaconda system. So since I have
[03:44:29] already installed this package, I
[03:44:30] will not install it again. So now
[03:44:32] we'll import all the required libraries
[03:44:34] and classes that we need to implement
[03:44:36] the SVD algorithm. So from the surprise
[03:44:38] package we'll import the Reader class,
[03:44:40] the Dataset class and the SVD
[03:44:42] algorithm. So there are different
[03:44:43] versions of the SVD algorithm that are
[03:44:45] present in the surprise package. So
[03:44:46] we're going to use the SVD and there's
[03:44:48] also one algorithm called the SVD++. So
[03:44:51] we are going to firstly use the SVD
[03:44:53] algorithm and also we want to evaluate
[03:44:55] the performance of our model. So for
[03:44:57] that we'll import the cross validate
[03:44:59] function from the model selection
[03:45:01] package or the model selection module of
[03:45:03] the surprise package. So once you've
[03:45:05] imported these two now you can
[03:45:07] instantiate an object of the Reader
[03:45:09] class. So we'll use this object to read
[03:45:11] data sets and we'll convert those data
[03:45:14] sets to a Dataset object that
[03:45:16] is used by the SVD algorithm.
[03:45:19] So once you run this particular code
[03:45:21] your data set will be loaded. So to load
[03:45:24] the data set we have the Dataset class
[03:45:26] and we'll use the load_from_df method of
[03:45:28] this class, and inside the method we have
[03:45:30] to mention our data frame and
[03:45:32] its columns. So there's a format
[03:45:33] that this particular class requires. So
[03:45:35] the format is that your first column is
[03:45:37] your user ID. The second column is your
[03:45:39] movie ID and the third column should be
[03:45:41] your ratings. And we'll use a reader
[03:45:43] class to read this particular data
[03:45:44] frame. And it will convert this
[03:45:46] particular data frame to a data set
[03:45:48] object that will be used by your SVD
[03:45:50] algorithm. And after that we'll
[03:45:51] instantiate an SVD class object that is
[03:45:54] called SVD that we'll use later on to
[03:45:56] fit the model. And then first of all
[03:45:58] we'll evaluate the performance of the
[03:45:59] model using the cross validation. So
[03:46:01] we're not going to implement training
[03:46:03] and test sets. We'll implement the cross
[03:46:05] validation and using the cross
[03:46:07] validation we'll get to know the
[03:46:08] performance of our model. So we have
[03:46:10] used the cross validate function that we
[03:46:12] have just imported from the model
[03:46:13] selection package. And
[03:46:16] this function takes two arguments,
[03:46:17] that is SVD, your model, and
[03:46:20] the data that
[03:46:22] you want to run SVD on. Then the
[03:46:24] measures are the one or
[03:46:26] more metrics that you want
[03:46:27] to evaluate the performance with. So
[03:46:29] we'll calculate the RMSE, which
[03:46:31] is the root mean squared error, and the
[03:46:32] mean absolute error, and then cv equals
[03:46:35] 3 means that we want this algorithm
[03:46:37] to run three-fold cross validation, and
[03:46:40] then we have verbose equals True. So
[03:46:42] whenever it runs
[03:46:44] cross validation, it will print the
[03:46:46] results along with the process. So after
[03:46:48] running this it might take some time
[03:46:50] because the cross validation process
[03:46:51] takes some time. So once
[03:46:53] it's over, we'll move ahead with the
[03:46:55] rest of the code.
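To make the two measures and the three folds concrete, here is a plain-Python sketch. It is not the surprise library's implementation: the ratings are made up, the fold assignment is a simple hypothetical scheme, and the "model" is just a baseline that predicts the mean of the training ratings.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold i tests on every k-th sample starting at i."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx


ratings = [3.5, 4.0, 5.0, 2.0, 3.0, 4.5]  # hypothetical ratings

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(len(ratings), k=3), start=1):
    baseline = sum(ratings[i] for i in train_idx) / len(train_idx)  # training mean
    actual = [ratings[i] for i in test_idx]
    predicted = [baseline] * len(actual)
    print(f"Fold {fold}: RMSE={rmse(actual, predicted):.4f}  MAE={mae(actual, predicted):.4f}")
```

Each fold is held out once while the rest is used for training, which is why three-fold cross validation reports three RMSE and three MAE values plus their mean.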
[03:46:58] So here are the results for our
[03:47:00] three-fold cross validation. And if you
[03:47:02] want to know more about these functions,
[03:47:03] so you can press shift and tab here. So
[03:47:05] it will open the documentation. So in
[03:47:07] the documentation, you can see that
[03:47:09] you use this if you want to use a
[03:47:11] custom data set that is stored in a
[03:47:13] pandas data frame. So it takes two
[03:47:15] arguments that is your data frame and
[03:47:16] the reader object of the reader class.
[03:47:18] And if you want to know more about this,
[03:47:20] so you can read this particular
[03:47:21] documentation. And after that, if you
[03:47:22] want to know more about the cross
[03:47:24] validate function, so you can click
[03:47:25] here. So it takes your algorithm as the
[03:47:27] first argument. So our algorithm is
[03:47:30] SVD and then the data is data and
[03:47:32] different measures as RMSE and MAE, and
[03:47:34] cv is the number of folds
[03:47:36] it will use for the cross validation. So
[03:47:38] it will be a three-fold cross
[03:47:39] validation.
[03:47:46] So these are different results for our
[03:47:47] cross validation. So for the fold
[03:47:49] one we have an RMSE of 0.84 and for
[03:47:53] fold three we have an RMSE of 0.8473,
[03:47:55] and this is the mean and the
[03:47:57] standard deviation. So now if you look
[03:48:00] at the head of your ratings data set so
[03:48:02] you'll see we have different user ids,
[03:48:03] movie ID and rating. So now we want
[03:48:05] to predict the ratings for user one
[03:48:08] for different movies which this user has
[03:48:10] not rated yet, so that we can recommend
[03:48:12] those movies, and we'll recommend all
[03:48:14] those movies where the prediction is
[03:48:16] five or the highest for this
[03:48:18] particular user. So first of all let's
[03:48:20] create a data
[03:48:22] frame. But before moving on,
[03:48:24] let's see how many movies this
[03:48:25] particular user has rated as five stars.
[03:48:28] So we'll take all the movies which are
[03:48:30] rated by user one, and here we take the
[03:48:32] movies which are rated by user one as
[03:48:34] four. So if you write five here, it
[03:48:36] will display all the movies which have
[03:48:37] been rated by user one as five stars.
[03:48:40] Then we have set the index of this
[03:48:42] particular data frame as movie ID. So
[03:48:44] these movie IDs will be treated as the
[03:48:46] index. And after that we'll join
[03:48:48] the movies and our ratings. So we'll join
[03:48:51] our ratings and the movies data frame
[03:48:53] using the title column. So once you run
[03:48:55] this, you'll get this particular data
[03:48:57] frame that will contain all the movie
[03:48:59] IDs along with the movie names that
[03:49:01] have been rated as four stars
[03:49:03] by this particular user. So you can see
[03:49:05] these are the movies that have been
[03:49:06] rated as four stars by this particular
[03:49:08] user. And if you write five here, so it
[03:49:10] will display all the movies that have
[03:49:12] been rated as five stars by this
[03:49:14] particular user. So based on these
[03:49:16] results and based on the movies that
[03:49:18] have been rated five stars by other
[03:49:20] users, we will recommend different
[03:49:22] movies that this particular user has not
[03:49:24] seen but that other users
[03:49:26] have seen. So now first of all let's
[03:49:28] copy our movies data set, which
[03:49:30] contains the title of the movies. So
[03:49:31] once you use the .copy method, it
[03:49:33] will create a copy of this particular
[03:49:35] data set and store it in user one. So
[03:49:37] if you now make any changes to user
[03:49:39] one, our original data, that is our
[03:49:41] movies data, will not be affected. So
[03:49:42] that is why we use the copy method. So
[03:49:44] once we have used the copy method, now
[03:49:46] we'll reset the index. So resetting the
[03:49:49] index is when you start the index from
[03:49:51] zero again. So if the index was movie
[03:49:53] IDs before, now it will be reset
[03:49:55] to 0, 1, and so on.
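The filtering, joining, copying and index-reset steps can be sketched on made-up data as follows. The rows and titles are hypothetical, and joining on the movie ID index and then keeping the title column is an assumption about how the notebook does it.

```python
import pandas as pd

# Hypothetical miniature versions of the two data frames.
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2],
    "movieId": [2, 29, 32, 2],
    "rating": [3.5, 5.0, 5.0, 4.0],
})
movies = pd.DataFrame({
    "movieId": [2, 29, 32],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure", "Drama", "Sci-Fi"],
})

# Movies that user 1 rated as five stars, indexed by movie ID.
five_star = ratings[(ratings["userId"] == 1) & (ratings["rating"] == 5)]
five_star = five_star.set_index("movieId")

# Join on the movie ID index to bring in the titles.
five_star = five_star.join(movies.set_index("movieId"))[["title", "rating"]]
print(five_star)

# Work on a copy so that changes do not touch the original movies data.
user_1 = movies.copy()
user_1 = user_1.reset_index()  # adds an "index" column counting from 0
user_1.loc[0, "title"] = "Changed"
print(movies.loc[0, "title"])  # the original title is still there
```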
[03:50:25] So this is what our user one data
[03:50:27] set looks like. So we have the index starting from 0
[03:50:30] again, and these are the movie IDs,
[03:50:32] these are the titles and these are
[03:50:33] the genres.
[03:50:35] So now we'll train our SVD algorithm
[03:50:38] using this particular data set. So
[03:50:40] we'll use again the data set class and
[03:50:42] we'll use the load from DF method from
[03:50:44] this class and we'll pass our data. So
[03:50:46] our data is ratings data. So it contains
[03:50:49] three columns that is user ID, movie ID
[03:50:51] and rating, and then we'll use the
[03:50:52] reader class to read this particular
[03:50:54] data set and it will convert into a data
[03:50:56] set object that will be used by our SVD
[03:50:58] algorithm. So after creating the data
[03:51:00] set, we will build the training set. So
[03:51:02] if you want to build the training set,
[03:51:03] we have a method called
[03:51:06] build_full_trainset. So it will take all
[03:51:07] the data set and it will build a
[03:51:09] training set from that particular data
[03:51:11] set. So we'll store that data set in
[03:51:12] train set and then we'll pass our train
[03:51:15] set to our SVD algorithm. So we have
[03:51:17] already instantiated an object of the
[03:51:19] SVD class that is SVD. So we'll use the
[03:51:21] fit method of this particular object and
[03:51:23] we'll pass our training set to this
[03:51:25] particular method. So once we pass
[03:51:27] our training set to this particular
[03:51:28] method. So our SVD algorithm will be
[03:51:31] fitted to this particular training set.
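To give a feel for what fitting does, here is a simplified plain-Python sketch of a latent factor model trained with stochastic gradient descent. It is a stand-in for illustration, not the surprise library's SVD class: the training triples are made up, and the factor count, learning rate, regularization and epoch count are hypothetical choices.

```python
import math
import random

# Hypothetical (user, movie, rating) triples standing in for the training set.
train = [
    (1, 2, 3.5), (1, 29, 3.5), (1, 32, 5.0),
    (2, 2, 4.0), (2, 32, 4.5),
    (3, 2, 2.5), (3, 29, 2.0), (3, 32, 3.0),
]


def predict(mean, p, q, user, movie):
    """Estimated rating: global mean plus dot product of user and movie factors."""
    return mean + sum(a * b for a, b in zip(p[user], q[movie]))


def fit_factors(triples, n_factors=2, lr=0.02, reg=0.02, n_epochs=2000, seed=0):
    """Learn user factors p and movie factors q by stochastic gradient descent."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in triples) / len(triples)
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)]
         for u in sorted({u for u, _, _ in triples})}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)]
         for i in sorted({i for _, i, _ in triples})}
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - predict(mean, p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mean, p, q


def rmse(model, triples):
    mean, p, q = model
    squared = [(r - predict(mean, p, q, u, i)) ** 2 for u, i, r in triples]
    return math.sqrt(sum(squared) / len(squared))


untrained = fit_factors(train, n_epochs=0)
trained = fit_factors(train)
print(f"Training RMSE before: {rmse(untrained, train):.3f}, after: {rmse(trained, train):.3f}")
print(f"Estimated rating of user 2 for unrated movie 29: {predict(*trained, 2, 29):.2f}")
```

Only the observed ratings are used in the updates, which is how this kind of model copes with the missing entries of the rating matrix.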
[03:51:32] So after fitting our algorithm to our
[03:51:34] training set, we'll now estimate the
[03:51:36] scores or the ratings by this particular
[03:51:38] user for different movies. So we'll
[03:51:40] store that in a new variable in our data
[03:51:43] frame that is user one. The variable
[03:51:44] name is estimated score. So we'll write
[03:51:46] user one then movie id. So we'll take
[03:51:49] all the movie ids and we'll apply this
[03:51:51] lambda function on all the movie ids. So
[03:51:53] this lambda function will take all the
[03:51:55] movie ids as an input and we'll use the
[03:51:58] .predict method. So a lambda function is
[03:52:00] an unnamed function.
[03:52:02] So it will use this
[03:52:04] particular method, which is the predict
[03:52:06] method of our SVD object, and it will
[03:52:08] pass x, which is our movie ID here, and
[03:52:11] it will find the estimated ratings for
[03:52:13] different movies that are not yet rated
[03:52:16] by this particular user, and based on
[03:52:17] those estimated
[03:52:20] ratings we'll only recommend
[03:52:22] those movies which are estimated to be
[03:52:24] the highest rated movies by this
[03:52:25] particular user. So after we pass this
[03:52:28] we will have all the estimated or
[03:52:30] predicted ratings for this particular
[03:52:32] user for different movies. So that
[03:52:34] data frame will also
[03:52:36] contain these columns. So movie
[03:52:38] IDs will be there, genres will be there,
[03:52:40] the index will be there. So we don't want
[03:52:42] these three columns. We just want the
[03:52:44] title of the movies and our movie
[03:52:46] IDs, which will be already
[03:52:47] stored as the index. So we'll drop all
[03:52:50] these three columns, and we write axis
[03:52:52] equals 1 if you want to drop anything
[03:52:53] from the column side, and we'll store
[03:52:55] the results
[03:52:58] in the data frame called user one. And
[03:52:59] now we'll sort the values of the user
[03:53:01] one. So our user one now will contain
[03:53:03] the estimated values or the predicted
[03:53:05] ratings for this particular user whose
[03:53:07] ID is one. So we'll
[03:53:09] sort the values of the estimated scores
[03:53:11] in descending order. So we'll have to
[03:53:13] write ascending equals False. So it will
[03:53:15] be sorted in descending order.
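The scoring, dropping and sorting steps can be sketched with pandas as follows. To keep the example runnable on its own, StandInModel is a hypothetical placeholder for the fitted SVD object: it only mimics a predict(user, movie) call that returns an object with an est field, and its scores are made up.

```python
from collections import namedtuple

import pandas as pd

Prediction = namedtuple("Prediction", ["uid", "iid", "est"])


class StandInModel:
    """Hypothetical placeholder for a fitted recommender; returns made-up estimates."""

    _scores = {2: 3.1, 29: 4.7, 32: 4.2}

    def predict(self, uid, iid):
        return Prediction(uid, iid, self._scores.get(iid, 3.0))


model = StandInModel()

movies = pd.DataFrame({
    "movieId": [2, 29, 32],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure", "Drama", "Sci-Fi"],
})
user_1 = movies.copy().reset_index()

# Apply an unnamed (lambda) function to every movie ID to get user 1's estimate.
user_1["estimated_score"] = user_1["movieId"].apply(lambda x: model.predict(1, x).est)

# axis=1 means "drop columns"; axis=0 would drop rows.
user_1 = user_1.drop(["index", "movieId", "genres"], axis=1)

# Highest estimates first.
user_1 = user_1.sort_values("estimated_score", ascending=False)
print(user_1.head(10))
```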
[03:53:17] So then we'll print the first 10 ratings
[03:53:20] which is the highest 10 ratings for our
[03:53:22] user one. So it will also contain the
[03:53:24] movie titles. So once you run this
[03:53:26] particular code, we'll have all the
[03:53:28] highest rated movies, that is, the movies
[03:53:30] that are predicted to be highest rated
[03:53:33] by our SVD
[03:53:35] algorithm. So after this code finishes
[03:53:37] executing, you'll get all the first 10
[03:53:40] titles or the movie titles which are
[03:53:42] expected or which are predicted to be
[03:53:44] the highest rated by this particular
[03:53:46] user. So the highest rating that
[03:53:48] this particular user is predicted to give is 4.7 to
[03:53:50] these movies. So we'll recommend these
[03:53:52] movies, which this
[03:53:54] user has never seen, to this user based
[03:53:56] on the ratings that have been given by
[03:53:58] other users. So this is how we use the
[03:54:00] SVD algorithm to build a recommendation
[03:54:03] system that will recommend different
[03:54:05] movies to a user based on UBCF, which
[03:54:08] is the user based collaborative
[03:54:10] filtering. Now there are many ways of
[03:54:12] knitting the nodes of a neural network
[03:54:14] together and each way results in a more
[03:54:17] or less complex behavior. Possibly the
[03:54:20] simplest of all topologies is the feed
[03:54:22] forward network. So in a feed-forward
[03:54:25] neural network, signals flow in one
[03:54:27] direction without any loop in the signal
[03:54:29] paths and typically artificial neural
[03:54:32] networks have a layered structure. The
[03:54:34] input layer picks up the input signals
[03:54:37] and passes them on to the next layer
[03:54:39] known as the hidden layer. And there can
[03:54:41] be more than one hidden layer in a
[03:54:43] neural network. And at last comes the
[03:54:45] output layer that delivers the result.
[03:54:47] Now the first question to pop into your
[03:54:49] head would be what is the inspiration
[03:54:51] behind these artificial neural networks.
[03:54:54] Well, the answer to that is the
[03:54:56] biological neural network of our brain.
[03:54:58] So let us first understand the
[03:55:00] architecture of a biological neuron. So
[03:55:03] as you can see in the slide, our
[03:55:05] biological neuron has these three main
[03:55:07] components. So we have the dendrite, the
[03:55:10] cell body and the axon. These dendrites
[03:55:13] receive the signals and the cell body
[03:55:16] processes these signals and the axon
[03:55:18] finally sends out these signals to other
[03:55:21] neurons. So just like the biological
[03:55:23] neuron, the artificial neuron has a
[03:55:26] number of input channels, a processing
[03:55:28] stage and one output that can fan out to
[03:55:32] multiple other artificial neurons. So
[03:55:34] now let's understand these artificial
[03:55:36] neurons in detail. These artificial
[03:55:38] neurons are the most fundamental units
[03:55:40] of a deep neural network. Each one takes an
[03:55:43] input, processes it, passes it through
[03:55:46] an activation function and returns the
[03:55:49] output if the condition is met or else
[03:55:51] it'll process it again until you get the
[03:55:54] correct output. And this type of
[03:55:56] artificial neuron is called a
[03:55:58] perceptron. So they are basically like a
[03:56:01] linear model which is used for binary
[03:56:04] classification. So as the figure shows
[03:56:06] we have x1, x2, x3 and going on till xn
[03:56:10] as inputs in the input layer. To
[03:56:13] these we add the weights and the bias,
[03:56:16] which are randomly initialized. So here we
[03:56:18] have w1, w2, w3 going on till wn as
[03:56:21] weights. So we multiply these weights
[03:56:25] with the corresponding inputs and add
[03:56:28] all the values together and finally we
[03:56:31] add bias to that sum. So this final sum
[03:56:34] is passed through an activation function
[03:56:37] which finally gives us the output. So
[03:56:39] let us see this in detail. So here we
[03:56:42] have three arrows which correspond to
[03:56:44] the three inputs coming into the
[03:56:45] network. Now for these three inputs we
[03:56:48] have corresponding weights associated
[03:56:51] with them. So input one is associated
[03:56:54] with a weight of 0.7. Input two is
[03:56:57] associated with a weight of 0.6 and
[03:56:59] input 3 is associated with a weight of
[03:57:01] 1.4. Now these inputs are multiplied
[03:57:04] with their respective weights and their
[03:57:07] sum is taken. So if the three inputs are
[03:57:10] x1, x2 and x3, the sum would be x1 into
[03:57:15] 0.7 plus x2 into 0.6 plus x3 into 1.4.
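The weighted sum can be written out in a few lines. The weights 0.7, 0.6 and 1.4 come from the example above, while the input values and the bias value are made-up numbers chosen only for illustration.

```python
weights = [0.7, 0.6, 1.4]  # weights from the example


def weighted_sum(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add the products, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


inputs = [1.0, 2.0, 3.0]  # hypothetical values for x1, x2, x3
print(f"Without bias: {weighted_sum(inputs, weights):.2f}")
print(f"With a hypothetical bias of 0.5: {weighted_sum(inputs, weights, bias=0.5):.2f}")
```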
[03:57:20] And to the sum we add an offset which is
[03:57:24] called the bias. So this bias is just a
[03:57:28] constant which is used to shift the
[03:57:30] weighted sum. Now let us understand the
[03:57:33] concept behind these weights. So these
[03:57:35] weights basically determine the relative
[03:57:37] importance of the inputs. So let's say
[03:57:40] we have two inputs humidity and wearing
[03:57:43] a blue shirt. So here we can see that
[03:57:46] wearing a blue shirt has almost no
[03:57:49] correlation with the possibility of
[03:57:51] rainfall. So that is why the weight
[03:57:54] assigned to input X2 would be low in
[03:57:57] order to bring down its importance.
[03:58:00] Now let us see why do we need activation
[03:58:02] functions. So consider the scenario
[03:58:04] where you have two different classes.
[03:58:06] One class is represented with triangles
[03:58:09] and the other class is represented with
[03:58:12] circles. Now let's say I ask you guys to
[03:58:15] draw a linear decision boundary which
[03:58:18] can separate these two classes. So is
[03:58:20] that really possible? Can we draw a
[03:58:23] linear line which can segregate these
[03:58:26] two classes? Well, the answer is
[03:58:28] obviously no, isn't it? So, let me tell
[03:58:30] you guys how can we do this? So, we'll
[03:58:33] have to add a third dimension to create
[03:58:36] a linearly separable model which is easy
[03:58:39] to deal with. So, the logic is when
[03:58:41] you're going from 2D to 3D, you're
[03:58:44] making your equation nonlinear. So, with
[03:58:47] the third dimension, I have introduced
[03:58:50] nonlinearity in our data which helps in
[03:58:53] creating a linearly separable model. And
[03:58:56] in real world situations, you don't
[03:58:58] always get linear problems. So you
[03:59:01] should know how to deal with nonlinear
[03:59:03] problems as well. And this is where
[03:59:05] activation functions help us to convert
[03:59:08] the linear equation to nonlinear form.
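One common way to picture the extra dimension is sketched here. It is an assumption about the kind of data on the slide, which is not visible in the transcript: the points are made up, with one class near the origin and the other in a ring around it. No straight line separates them in 2D, but after adding z = x² + y² a flat plane does.

```python
# Hypothetical 2D points: class 0 sits near the origin, class 1 forms a ring around it.
inner = [(0.0, 0.0), (0.3, -0.2), (-0.4, 0.1), (0.2, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.4)]


def lift(point):
    """Add a third coordinate z = x^2 + y^2, a nonlinear function of the inputs."""
    x, y = point
    return (x, y, x * x + y * y)


THRESHOLD = 1.0  # the flat plane z = 1 in the lifted space


def classify(point_3d):
    """Linear rule in 3D: which side of the plane z = 1 the point lies on."""
    return 1 if point_3d[2] > THRESHOLD else 0


for label, points in ((0, inner), (1, outer)):
    for point in points:
        lifted = lift(point)
        print(f"class {label}: {point} -> z = {lifted[2]:.2f} -> predicted {classify(lifted)}")
```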
[03:59:12] So these activation functions bring in
[03:59:14] nonlinear functional mappings between
[03:59:17] the input and the response variable.
[03:59:20] Their main purpose is to convert an
[03:59:22] input signal of a node in an artificial
[03:59:24] neural network to an output signal. And
[03:59:28] if we do not apply an activation
[03:59:30] function, then the output signal would
[03:59:32] just be a simple linear function. Now
[03:59:35] there are many types of activation
[03:59:37] functions and today we'll be discussing
[03:59:39] some of the widely used ones. So let's
[03:59:42] start with the identity function. So the
[03:59:45] identity function gives out the same
[03:59:47] output as the input. So no matter how
[03:59:50] many layers we have, if all the
[03:59:52] activations are identity functions, then
[03:59:54] the final output of the last layer would
[03:59:57] be just a linear function of the input given to the
[03:59:59] first layer. And the range of the
[04:00:01] identity function goes from minus
[04:00:03] infinity to plus infinity. So after that
[04:00:07] we have the binary step function. So
[04:00:09] this binary step function is usually
[04:00:12] denoted by h or theta and it is a
[04:00:15] discontinuous function. So if the input
[04:00:18] is less than zero then the output would
[04:00:22] be zero and if the input is equal to or
[04:00:25] greater than zero then the output would
[04:00:27] be one and this is why binary step
[04:00:30] function is used to solve a binary
[04:00:31] classification problem. So after that we
[04:00:34] have the sigmoid function. So the
[04:00:36] formula for the sigmoid function is
[04:00:38] given by 1 / (1 + e^(-x)).
[04:00:42] The sigmoid function basically scales
[04:00:44] the values between zero and one. So if
[04:00:48] the input is a large negative number, it
[04:00:50] is scaled towards zero. And similarly,
[04:00:53] if the input is a large positive number,
[04:00:55] it is scaled towards one. Then we have
[04:00:58] the tanh function. It is the hyperbolic
[04:01:00] tangent function, which scales the
[04:01:02] values between -1 and 1. So one
[04:01:05] advantage of tanh function over sigmoid
[04:01:07] is that it can deal more easily with
[04:01:10] negative numbers. And after that we have
[04:01:13] the relu function which stands for
[04:01:15] rectified linear unit. So this function
[04:01:18] will give out zero if input is less than
[04:01:21] zero. And on the other hand if input is
[04:01:24] equal to or greater than zero then it'll
[04:01:27] act as an identity function and give out
[04:01:30] the same value as the input. And this
[04:01:33] relu function is the most widely used
[04:01:35] activation function and is primarily
[04:01:38] implemented on the hidden layers of the
[04:01:40] neural network. Then we have the leaky
[04:01:42] relu which is just a modified version of
[04:01:45] relu. So the leaky relu instead of just
[04:01:48] completely removing the negative part it
[04:01:51] just lowers the magnitude.
[04:01:54] And finally we have the softmax function
[04:01:56] which is ideally used in the output
[04:01:59] layer for classification problems. So
[04:02:02] the softmax function basically gives a
[04:02:04] set of probability values for each class
[04:02:08] of the output and that particular class
[04:02:11] which would have the maximum probability
[04:02:13] will be our output class. So that was
[04:02:16] all about activation functions. Now let
[04:02:19] us learn more about perceptrons.
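Softmax can be sketched in a few lines. The three raw scores are made-up values for a hypothetical three-class output layer, and subtracting the maximum score first is a standard trick for numerical stability that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # class with the highest probability

print([round(p, 3) for p in probs])
print(f"Predicted class index: {predicted_class}")
```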
[04:02:21] So like we were taught how to behave in
[04:02:24] certain conditions perceptrons also
[04:02:27] require training. So they have a
[04:02:29] learning algorithm through which they
[04:02:31] produce the output. By training a
[04:02:32] perceptron, we try to find a line plane
[04:02:36] or some hyper plane which can accurately
[04:02:39] separate these two classes by adjusting
[04:02:42] the weights and biases. So uh consider
[04:02:46] this image where we give the dogs and
[04:02:48] horses as input. So here after the first
[04:02:51] iteration error value is two since the
[04:02:55] horse has been classified as dog and
[04:02:58] there is one dog which is placed in the
[04:03:01] horses class. And in the second
[04:03:03] iteration the error value is reduced to
[04:03:05] one as it is just the dog which is
[04:03:08] classified as a horse. And finally in
[04:03:11] the third iteration we get the correct
[04:03:13] output as the perceptron has been trained
[04:03:16] well with no error. So all the dogs have
[04:03:19] been placed in one class and all the
[04:03:21] horses have been placed in one class.
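The idea of reducing the error over repeated passes can be sketched with the classic perceptron learning rule, which may differ in detail from what the slide shows. The six two-feature samples are made up (label 0 for dogs, label 1 for horses), and the weights start at zero instead of random values so that the run is reproducible.

```python
def predict(weights, bias, features):
    """Step activation applied to the weighted sum plus bias."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total >= 0 else 0


def train_perceptron(samples, labels, lr=1.0, max_epochs=20):
    """Adjust weights and bias after every misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    errors_per_epoch = []
    for _ in range(max_epochs):
        errors = 0
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error != 0:
                errors += 1
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        errors_per_epoch.append(errors)
        if errors == 0:  # every sample classified correctly
            break
    return weights, bias, errors_per_epoch


# Hypothetical features for each animal; 0 = dog, 1 = horse.
samples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 5.0), (5.0, 4.0), (5.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias, errors_per_epoch = train_perceptron(samples, labels)
print("Misclassifications per pass:", errors_per_epoch)
print("Learned weights and bias:", weights, bias)
```

The error count does not have to fall on every pass, but for data that a straight line can separate the rule eventually reaches zero errors.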
[04:03:24] Now let's understand the perceptron
[04:03:26] training algorithm. So this perceptron
[04:03:28] over here receives multiple inputs and
[04:03:32] each input is initialized with a random
[04:03:34] weight. So after this we multiply these
[04:03:37] weights with their corresponding inputs
[04:03:39] and then we get the sum. Now this sum
[04:03:43] is passed through the activation
[04:03:44] function which would give us a nonlinear
[04:03:47] output. So this process until here is
[04:03:50] known as feed forwarding. Now if the
[04:03:53] output which we get is not optimum we
[04:03:55] calculate the error in prediction and
[04:03:58] then go back and then update the weights
[04:04:01] and bias. So this process where we go
[04:04:04] from output to the input layer is known
[04:04:06] as back propagation and we keep on back
[04:04:09] propagating until we get the desired
[04:04:11] output. So that was the perceptron
[04:04:13] training algorithm. Now let's have a
[04:04:15] look at the benefits of using artificial
[04:04:17] neural networks. So the artificial
[04:04:20] neural networks can learn organically.
[04:04:22] This means an artificial neural network's
[04:04:25] outputs aren't limited entirely by
[04:04:27] inputs and results given to them
[04:04:29] initially by an expert system. So
[04:04:32] artificial neural networks have the
[04:04:33] ability to generalize their inputs. This
[04:04:36] ability is valuable for robotics and
[04:04:39] pattern recognition systems. Artificial
[04:04:42] neural networks also help in nonlinear
[04:04:44] data processing. So nonlinear systems
[04:04:47] have the capability of finding shortcuts
[04:04:50] to reach computationally expensive
[04:04:52] solutions. These systems can also infer
[04:04:55] connections between data points rather
[04:04:57] than waiting for records in a data
[04:04:59] source to be explicitly linked. This
[04:05:02] nonlinear shortcut mechanism is fed into
[04:05:06] artificial neural networking which makes
[04:05:08] it valuable in commercial big data
[04:05:11] analysis. Artificial neural networks
[04:05:14] also have high potential for fault
[04:05:16] tolerance. When these networks are
[04:05:18] scaled across multiple machines and
[04:05:21] multiple servers, they are able to route
[04:05:23] around missing data or servers and nodes
[04:05:26] that can't communicate. And these
[04:05:28] artificial neural networks can also
[04:05:30] self-repair. So if they're
[04:05:33] asked to find out specific data that is
[04:05:35] no longer communicating, these
[04:05:37] artificial neural networks can
[04:05:39] regenerate large amounts of data by
[04:05:42] inference and help in determining the
[04:05:44] node that is not working. This trait is
[04:05:48] useful for networks that require
[04:05:50] informing their users about the current
[04:05:53] state of the network and effectively
[04:05:55] results in a self-debugging and
[04:05:57] diagnosing network.
[04:05:58] >> Here's a quiz question for you guys.
[04:06:00] What do you mean by artificial
[04:06:01] intelligence? Your options are the
[04:06:03] ability of a machine to think and learn
[04:06:05] like a human, the use of computers to
[04:06:08] solve complex mathematical problems, a
[04:06:10] computer program that performs
[04:06:12] repetitive tasks automatically, or the
[04:06:15] study of natural intelligence in animals
[04:06:17] and humans. Please mention your answers
[04:06:19] in the comment section.
[04:07:13] >> So why AI? Well AI is actually
[04:07:16] everywhere. It's omnipresent like God
[04:07:20] and AI's applications are present in
[04:07:22] every single industry from banking and
[04:07:25] finance to medical science and also in
[04:07:28] aerospace. Now it is actually a known
[04:07:30] fact that many banks have numerous
[04:07:34] activities on a day-to-day basis which
[04:07:36] need to be done accurately and most of
[04:07:39] these activities take up a lot of time
[04:07:42] and effort from the employees and at
[04:07:45] times there is also a chance of a human
[04:07:48] error in these activities so to speak.
[04:07:50] So some of the works that banks and
[04:07:53] financial institutions handle are
[04:07:56] investing money in stocks, financial
[04:07:58] operations, managing various properties
[04:08:01] and so on. And with the use of AI systems
[04:08:04] in this process, the institutions are
[04:08:07] able to achieve efficient results in a
[04:08:10] quick turnaround time. So the strategic
[04:08:13] implementation of artificial
[04:08:15] intelligence in the bank helps them to
[04:08:18] focus on every customer and provide them
[04:08:21] quick resolution and similarly in
[04:08:24] medical science field as well. It has
[04:08:27] wide applications. So AI has completely
[04:08:31] changed the way medical science was
[04:08:33] perceived just a few years ago. So there
[04:08:36] are numerous areas in medical science
[04:08:39] where AI is used to achieve incredible
[04:08:42] value. So with the help of AI, the
[04:08:44] medical science was able to create a
[04:08:46] virtual personal healthcare assistant.
[04:08:49] So these are used for the research and
[04:08:51] analysis purpose. There are also many
[04:08:54] efficient health care bots introduced in
[04:08:57] the medical field to provide constant
[04:09:00] health support to patients and it is
[04:09:03] also used in the aerospace industry. So
[04:09:06] in aerospace there are a lot of features
[04:09:08] from booking the tickets to the takeoff
[04:09:11] and operation of the flights that AI
[04:09:14] takes care of. AI applications make air
[04:09:17] transport efficient, fast, safe and also
[04:09:20] provide a comfortable journey to the
[04:09:23] passengers and it has also changed the
[04:09:26] face of gaming. So these days we're able
[04:09:29] to play TV and computer games on a
[04:09:31] whole new level all thanks to artificial
[04:09:35] intelligence applications. So these are
[04:09:38] just some of the applications where
[04:09:40] artificial intelligence is used and all
[04:09:42] in all it is used to reinvent the world.
[04:09:45] So scientists are riding on the back of
[04:09:47] AI toward a time when machine intelligence will
[04:09:50] surpass human intelligence.
[04:09:53] Scientists believe that once the AI
[04:09:55] system starts working in its full
[04:09:57] capacity, it will reinvent the world
[04:10:00] that we know today. So think of the
[04:10:02] world where all the menial tasks such as
[04:10:05] garbage disposal, construction, digging
[04:10:08] and so on will be taken care of by AI
[04:10:11] applications. So it'll be a time
[04:10:14] when the hierarchical order no longer dictates the
[04:10:17] limits of a human. It'll be the world
[04:10:19] where no one will be looked down upon
[04:10:22] and every human will be considered
[04:10:24] equal. So this way humans can then focus
[04:10:28] their strengths on higher levels of work
[04:10:31] to accomplish a lot more and always
[04:10:34] taking technology to new heights. So now
[04:10:37] that we've understood the importance of
[04:10:39] artificial intelligence, let's learn
[04:10:42] more about AI. So AI is basically a
[04:10:45] field of computer science which
[04:10:47] emphasizes the creation of
[04:10:50] intelligent machines which can work and
[04:10:53] react like humans. So I am reiterating
[04:10:57] it. So AI is that field of computer
[04:11:01] science where we create machines which
[04:11:04] can work and react like humans. So let's
[04:11:08] keep this definition at the back of our
[04:11:10] head. Using this definition, let's
[04:11:12] actually look at some applications of AI
[04:11:15] which are currently existing in today's
[04:11:17] world. So we have simple chatbots
[04:11:19] like OK Google and Siri, which assist us
[04:11:22] with whatever we want. So, let's
[04:11:25] say I want to know the current time. All
[04:11:27] I need to do is ask Siri tell me what is
[04:11:29] the current time. Similarly, if I want
[04:11:32] to know the distance between India and
[04:11:34] Malaysia, I'll ask Siri tell me what is
[04:11:37] the distance between India and Malaysia.
[04:11:40] Again, let's say I'm just sad and I want
[04:11:43] to listen to a simple joke. I'll ask
[04:11:45] Siri, tell me a simple joke, right? So,
[04:11:48] these are some of the applications of
[04:11:50] artificial intelligence. And then we
[04:11:53] have Sophia. So, Sophia is the first
[04:11:56] humanoid robot. So, she is the first
[04:12:00] humanoid robot who can actually speak to
[04:12:04] us like a natural human. So Sophia can
[04:12:08] show some wide range of emotions
[04:12:11] exhibited by humans but she is actually
[04:12:14] a robot. Another application of
[04:12:17] artificial intelligence is a
[04:12:18] self-driving car. So you have
[04:12:21] self-driving cars by Google and Tesla
[04:12:24] which actually drive by themselves. So
[04:12:27] they do not need any external driver to
[04:12:30] drive. Right? So these cars work by
[04:12:33] themselves. So similar to self-driving
[04:12:35] cars, we also have self-flying drones
[04:12:37] which do not need any human intervention
[04:12:40] and they can navigate by themselves. So
[04:12:43] now let's actually get a bit deeper and
[04:12:45] understand what is intelligence. So
[04:12:48] intelligence can be defined as one's
[04:12:50] capacity for understanding, one's
[04:12:54] capacity for self-awareness, one's
[04:12:56] capacity for learning and one's capacity
[04:12:59] for problem solving. That is how well is
[04:13:03] something or someone able to understand?
[04:13:06] How well is someone able to learn new
[04:13:09] things and how well is someone able to
[04:13:11] solve problems by themselves. So now
[04:13:15] that we know what is intelligence, let's
[04:13:18] understand what is artificial
[04:13:19] intelligence. So when you apply the same
[04:13:23] intelligence to machines, this is known
[04:13:26] as artificial intelligence. Now just
[04:13:29] imagine there's a machine which can
[04:13:32] understand things which are normally
[04:13:35] understood by humans. There is a machine
[04:13:38] which is self-aware and there is a
[04:13:40] machine which can solve problems by
[04:13:43] itself. Now that's just amazing isn't
[04:13:46] it? Right? So this is the artificial
[04:13:49] intelligence which I'm talking about. So
[04:13:52] now that we also know what is
[04:13:54] intelligence I'll ask another question.
[04:13:57] So tell me what is it that makes humans
[04:13:59] intelligent? Well, we as humans can
[04:14:03] reason. We as humans can learn. We can
[04:14:07] perceive. We can solve problems and we
[04:14:10] also have linguistic intelligence. That
[04:14:12] is we can figure out what is someone
[04:14:15] else saying and we can also understand
[04:14:18] the grammatical intricacies of different
[04:14:20] languages. So again my question would be
[04:14:23] what if a machine could exhibit all of
[04:14:26] these factors normally shown by a human.
[04:14:30] Again that's just amazing isn't it? So
[04:14:32] this is what is known as artificial
[04:14:34] intelligence. So a machine which can
[04:14:38] show traits normally shown by a human
[04:14:41] that is known as artificial
[04:14:43] intelligence. All right. So now that
[04:14:46] we're clear with artificial
[04:14:47] intelligence, let's segregate AI, ML and
[04:14:51] DL. So normally most people get confused
[04:14:54] between artificial intelligence, machine
[04:14:56] learning and deep learning. So this is
[04:14:58] where I'm going to help you out in
[04:15:00] understanding the difference between
[04:15:01] these three. So we have AI at the top
[04:15:06] and you can consider machine learning
[04:15:09] and deep learning to be subsets of AI.
[04:15:13] So again, machine learning and deep
[04:15:15] learning are just ways to achieve
[04:15:19] artificial intelligence. So I'll restate
[04:15:22] it again. Machine learning and deep
[04:15:24] learning are just ways to achieve
[04:15:27] artificial intelligence. Now machine
[04:15:30] learning is that part of artificial
[04:15:32] intelligence which aims to teach the
[04:15:35] computers the ability to do tasks with
[04:15:39] data without any explicit programming.
[04:15:42] Right? So we don't need to do any
[04:15:44] explicit programming and the algorithms
[04:15:48] do tasks by themselves. And in ML we
[04:15:51] mostly use numerical and statistical
[04:15:54] approaches to achieve artificial
[04:15:56] intelligence. And then we have deep
[04:15:58] learning which is actually a subset of
[04:16:01] machine learning. So first we have AI
[04:16:03] and then we have ML and then we have DL.
[04:16:06] So deep learning comes in where machine
[04:16:09] learning fails and we apply deep
[04:16:12] learning through something known as
[04:16:14] artificial neural networks about which
[04:16:16] we'll obviously learn later. Right? So
[04:16:20] now let's understand artificial
[04:16:22] intelligence in a bigger set. So as I've
[04:16:24] already told you artificial intelligence
[04:16:27] is the superset under which comes
[04:16:30] machine learning under which comes deep
[04:16:33] learning and then machine learning and
[04:16:35] deep learning are basically ways to
[04:16:38] achieve artificial intelligence. Now
[04:16:41] these are the different areas of
[04:16:43] research of artificial intelligence. So
[04:16:46] you have ML again a part of ML is deep
[04:16:49] learning. Then we have natural language
[04:16:52] processing. So over here we basically
[04:16:54] understand what is spoken or written by
[04:16:58] a human and then we have speech where we
[04:17:00] either translate the speech to text or
[04:17:03] we translate the text to speech. The
[04:17:05] next sub field is robotics and then we
[04:17:08] have autonomous vehicles under robotics.
[04:17:11] So Google self-driving car is an example
[04:17:14] of this over here. So now that we've
[04:17:16] also understood the difference between
[04:17:18] artificial intelligence, machine
[04:17:20] learning and deep learning, let's see
[04:17:22] different examples of machine learning
[04:17:24] around us. So most of you would have
[04:17:26] shopped on Amazon. Now when you go into
[04:17:30] Amazon, you see that there are some
[04:17:32] products recommended to you. Now how do
[04:17:34] you think that would happen? So this is
[04:17:37] something known as recommendation
[04:17:39] engine. And recommendation engine is
[04:17:42] nothing but a component of machine
[04:17:44] learning. So let's say you and your
[04:17:48] friend buy similar products. So your
[04:17:51] friend buys five products and you buy
[04:17:54] three products. Now out of those five,
[04:17:57] the three products you buy are the same
[04:18:00] as what your friend buys. So let's say
[04:18:04] the common products are an iPhone, a
[04:18:07] back cover for the iPhone and a
[04:18:10] Bluetooth headset. Now let's say the
[04:18:12] other two products bought by your friend
[04:18:14] would be a MacBook and a mouse. Now
[04:18:17] since there are three products which are
[04:18:19] the same between you two, this is why the
[04:18:22] products which your friend has also
[04:18:24] bought, those are the products which
[04:18:26] will be recommended to you. So on the
[04:18:29] basis of the commonality between you and
[04:18:32] your friend you will be recommended a
[04:18:35] MacBook and a mouse as well. So this is
[04:18:38] nothing but a concept of machine
[04:18:40] learning. And then we have Amazon Alexa.
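Before moving on to Alexa, the shared-purchase recommendation just described can be sketched in a few lines. This is only a minimal illustration, not how Amazon's engine actually works: the `purchases` dictionary, the `recommend` function and the `min_overlap` threshold are all hypothetical names made up for this example.

```python
# Hypothetical purchase histories; the item names follow the example in the talk.
purchases = {
    "you": {"iPhone", "iPhone back cover", "Bluetooth headset"},
    "friend": {
        "iPhone", "iPhone back cover", "Bluetooth headset", "MacBook", "mouse",
    },
}


def recommend(user, purchases, min_overlap=2):
    """Suggest items bought by users who share at least min_overlap purchases."""
    own = purchases[user]
    suggestions = set()
    for other, items in purchases.items():
        if other == user:
            continue
        # Count the products both users bought (the "commonality").
        if len(own & items) >= min_overlap:
            # Recommend whatever the similar user bought that this user has not.
            suggestions |= items - own
    return sorted(suggestions)


print(recommend("you", purchases))
```

Because the friend shares three purchases with you, the two remaining items in the friend's basket are the ones suggested to you, while the friend gets nothing new.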
[04:18:43] So Amazon Alexa is a
[04:18:46] really good example of speech
[04:18:48] recognition. You know when you say Alexa
[04:18:50] turn on the lights it'll turn on the
[04:18:52] lights. When you say Alexa book a ride
[04:18:54] for me it'll do exactly that. When you
[04:18:57] say Alexa order a cheese pizza, that
[04:19:00] is exactly what Amazon Alexa will do.
[04:19:03] Now Alexa is just a machine right but
[04:19:07] when you say do something order a pizza
[04:19:10] book a cab for me turn on the lights you
[04:19:12] know how is the machine able to
[04:19:13] understand all of this so the idea
[04:19:16] behind this is speech recognition and
[04:19:19] that is again a component of machine
[04:19:21] learning and then we have Netflix's
[04:19:23] movie recommendation so let's say you
[04:19:25] watch two TV series. The first TV series is
[04:19:28] Friends and the next TV series is The Big
[04:19:31] Bang Theory. And since you watch these
[04:19:34] two TV series which belong to the genre
[04:19:37] comedy that is why maybe you'll be
[04:19:40] recommended How I Met Your Mother or you
[04:19:43] can be recommended Silicon Valley or
[04:19:45] some other TV series belonging to the
[04:19:48] comedy genre. So this again is machine
[04:19:51] learning. And then we also have Google
[04:19:53] traffic prediction. Let's just say
[04:19:55] you're traveling in your car and there
[04:19:58] is huge traffic there, and you
[04:20:01] desperately want to get out of the
[04:20:03] traffic. So you turn on Google Maps and
[04:20:06] Google Maps tells you the best direction
[04:20:09] from where the traffic would be the
[04:20:12] least. Now how does Google Maps do this?
[04:20:15] This again is machine learning. So now
[04:20:17] that we've looked at different real
[04:20:18] world applications of machine learning,
[04:20:21] let's actually understand what exactly
[04:20:24] is machine learning. So as I've already
[04:20:27] told you machine learning is a subset of
[04:20:30] artificial intelligence which gives the
[04:20:33] machine ability to learn without being
[04:20:37] explicitly programmed. So over here data
[04:20:40] is the key or in other words you
[04:20:43] basically teach a machine how to learn
[04:20:47] without any explicit programming and the
[04:20:50] machine learns with the help of data.
[04:20:54] Right? So now that we know what exactly
[04:20:56] is machine learning, let's also
[04:20:59] understand how does machine learning
[04:21:01] work. So as I've already told you,
[04:21:04] machine learning depends totally on
[04:21:07] data. So first we take in a data set and
[04:21:11] divide it into two parts. The first part
[04:21:14] would be the training set and the second
[04:21:17] part would be the testing set. And we
[04:21:20] will train the model on top of the
[04:21:23] training set. So now once we train the
[04:21:27] model we will give it new data and check
[04:21:31] for its accuracy on top of that new data
[04:21:36] and if the accuracy of that new data
[04:21:40] comes out to be good enough then we will
[04:21:43] go ahead and use that machine learning
[04:21:45] model. On the other hand, the model
[04:21:48] which you built, if the accuracy of that
[04:21:50] model is not good enough, then we'll go
[04:21:53] ahead and fine-tune that model till we
[04:21:56] get the desired accuracy. This is the
[04:21:59] basic premise behind machine learning.
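The split, train, check accuracy workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy data set, the one-threshold "model", the helper names and the 0.9 target in `DESIRED_ACCURACY` are all made up for this example and are not from the course.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the data set and divide it into a training set and a testing set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def train(rows):
    """Learn a threshold halfway between the mean feature value of each class."""
    mean = {}
    for label in (0, 1):
        values = [x for x, y in rows if y == label]
        mean[label] = sum(values) / len(values)
    return (mean[0] + mean[1]) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' matches the true label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)


# Toy data set (made up): small feature values are class 0, large ones class 1.
data = [(x, 0) for x in range(0, 20)] + [(x, 1) for x in range(30, 50)]

train_rows, test_rows = train_test_split(data)
threshold = train(train_rows)          # train on the training set only
acc = accuracy(threshold, test_rows)   # check accuracy on data the model never saw

DESIRED_ACCURACY = 0.9  # hypothetical target
if acc >= DESIRED_ACCURACY:
    print(f"accuracy {acc:.2f}: good enough, use the model")
else:
    print(f"accuracy {acc:.2f}: not good enough, fine-tune the model")
```

The key design point is that accuracy is measured only on the testing set, so the score reflects how the model behaves on new data rather than on the data it was trained on.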
[04:22:02] Now let's look at the subcategories of
[04:22:04] machine learning. So we have supervised
[04:22:06] learning, unsupervised learning and
[04:22:08] reinforcement learning. So in
[04:22:10] supervised learning, you can consider
[04:22:12] that the learning is guided by a
[04:22:14] teacher. So we have a data set which
[04:22:17] actually acts as a teacher and its role
[04:22:20] is to train the model or the machine. So
[04:22:22] once the model gets trained, it can
[04:22:25] start making a prediction or decision
[04:22:28] when new data is given to it. So let's
[04:22:31] take this example. So over here we are
[04:22:35] training this machine by giving it
[04:22:38] samples of data. So over here the data
[04:22:41] is nothing but different images of apple
[04:22:45] and along with each image of apple we
[04:22:48] are also giving it the label of the
[04:22:51] image right. So this image goes with its
[04:22:55] label which is apple. Again this image
[04:22:58] goes with its label which is apple.
[04:23:00] Again the same with these two. Right? So
[04:23:02] we are teaching this machine that
[04:23:05] whenever it sees an image something like
[04:23:08] this it is nothing but an apple and
[04:23:11] after some time, when we give it new data,
[04:23:14] from whatever learning it has done it
[04:23:18] will predict whether it's an apple or
[04:23:20] not. So on the basis of its learning
[04:23:23] this machine predicts that there is a
[04:23:26] good possibility there is actually 97%
[04:23:28] possibility that the image which has
[04:23:31] been fed to the machine is nothing but
[04:23:33] an apple. So a use case of supervised
[04:23:36] learning could be spam classifier. So
[04:23:39] spam classifier basically decides
[04:23:42] whether the email which we get is
[04:23:45] spam or not, and that is done on the
[04:23:48] basis of different textual parameters.
[04:23:50] So, let's say a genuine email wouldn't
[04:23:53] contain too many exclamation marks. It
[04:23:55] wouldn't contain a catchy headline and
[04:23:58] so on. But on the other hand, if it's a
[04:24:01] spam email, maybe it'll contain a lot of
[04:24:04] exclamation marks with maybe a lot of
[04:24:07] numbers and it'll have statements like,
[04:24:09] "Hey, congrats. You've won a lottery."
[04:24:12] Or, "Hey, could you help me out?" So
[04:24:13] this spam classification is basically an
[04:24:16] example of supervised learning.
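The spam classifier idea above can be sketched as a tiny supervised model. This is only an illustration under assumptions: the four labeled emails are invented, the two textual features (exclamation marks and digits) are the ones mentioned in the talk, and the nearest-average rule is a stand-in chosen for brevity, not the method a real spam filter uses.

```python
def features(text):
    """Two textual parameters: number of exclamation marks and number of digits."""
    return (text.count("!"), sum(ch.isdigit() for ch in text))


# Hypothetical labeled training data: each email comes with its label (the "teacher").
labeled = [
    ("Congrats!!! You've won a lottery of 1000000!!!", "spam"),
    ("WIN 500 now!!! Claim 2 prizes!!", "spam"),
    ("Meeting moved to tomorrow, see you then.", "ham"),
    ("Please review the attached report.", "ham"),
]


def train(examples):
    """Compute the average feature vector for each label."""
    sums, counts = {}, {}
    for text, label in examples:
        marks, digits = features(text)
        total = sums.setdefault(label, [0.0, 0.0])
        total[0] += marks
        total[1] += digits
        counts[label] = counts.get(label, 0) + 1
    return {
        label: (total[0] / counts[label], total[1] / counts[label])
        for label, total in sums.items()
    }


def predict(centroids, text):
    """Pick the label whose average feature vector is closest to this email."""
    f = features(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(f, centroids[label])),
    )


model = train(labeled)
print(predict(model, "Hey!!! You won 9999 dollars!!!"))
print(predict(model, "Lunch at noon?"))
```

The labels in `labeled` are what make this supervised: the model never has to guess what "spam" means, it only learns what labeled spam tends to look like.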
[04:24:20] Then we have unsupervised learning. So
[04:24:22] in unsupervised learning, the model
[04:24:24] learns through observation and finds
[04:24:27] structures in the data. So once the
[04:24:29] model is given a data set, it
[04:24:31] automatically finds patterns and
[04:24:34] relationships in the data set by
[04:24:36] creating clusters in it. So what it
[04:24:38] cannot do is add labels to the cluster
[04:24:42] like it cannot say this is a group of
[04:24:44] apples or mangoes but it will separate
[04:24:47] all the apples from mangoes. So over
[04:24:50] here we have this set of images. Now
[04:24:54] this unsupervised learning model which
[04:24:56] is applied on this it will segregate
[04:24:59] these fruits on the basis of similar
[04:25:02] characteristics. So over here we have
[04:25:05] segregated these four into one cluster,
[04:25:09] these three into second cluster and
[04:25:11] these three into the third cluster. Now
[04:25:14] even though the unsupervised learning
[04:25:17] does not have any labels, it has still
[04:25:20] segregated these three into three
[04:25:23] clusters. Right? So the machine over
[04:25:25] here does not know that these are apples
[04:25:27] or these are oranges or these are
[04:25:29] bananas. Yet it has segregated these
[04:25:33] three on the basis of similarity of
[04:25:36] characteristics. So it found out that
[04:25:39] these four objects are similar to each
[04:25:42] other and there is quite a bit of
[04:25:44] variability when it comes to these four
[04:25:46] objects and these three objects.
[04:25:48] Similarly, this machine was able to
[04:25:50] figure out that these three objects,
[04:25:52] they are quite similar to each other.
[04:25:54] But when compared with these three
[04:25:56] objects, they are very dissimilar. This
[04:25:58] is the underlying concept of
[04:26:00] unsupervised learning and a good example
[04:26:03] of unsupervised learning would be again
[04:26:06] Netflix movie recommendation. So over
[04:26:08] here the movies are segregated on the
[04:26:12] basis of different genres. So over here
[04:26:16] TV series like Friends, How I Met Your
[04:26:19] Mother and Silicon Valley are clustered
[04:26:22] into one group because those come into
[04:26:25] the same category. Similarly, movies
[04:26:28] such as Secret Superstar and Dangal
[04:26:30] could come under the same category
[04:26:32] because they have the same lead actors.
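The clustering idea described above, grouping unlabeled items by similarity, can be sketched with a small k-means routine. This is a minimal sketch under assumptions: the ten two-number feature vectors are invented stand-ins for the fruit images, k-means is one common clustering technique chosen here for illustration, and the farthest-point starting rule is just a simple way to keep the example deterministic.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Start from the first point, then repeatedly add the point farthest away."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=10):
    """Assign each point to its nearest centroid, then move centroids; repeat."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda i: dist2(p, centroids[i])) for p in points
        ]
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Hypothetical, unlabeled feature vectors: four, three and three similar items.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.9, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.8, 0.9),
]

labels = kmeans(points, k=3)
print(labels)
```

Note that the output is only a cluster number per item. As the talk says, the model separates the groups but cannot name them as apples, oranges or bananas.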
[04:26:36] So over here we are segregating the
[04:26:39] movies on the basis of similar
[04:26:41] characteristics even though there are no
[04:26:44] labels in it. And it's finally time for
[04:26:47] the third machine learning type which is
[04:26:49] reinforcement learning. So over here
[04:26:52] there is an agent and there is an
[04:26:54] environment and the agent interacts with
[04:26:58] the environment and finds out what is
[04:27:01] the best outcome for it. So it basically
[04:27:03] follows the concept of the hit-and-trial
[04:27:05] method. The agent is rewarded or
[04:27:08] penalized with a point for a correct or
[04:27:11] a wrong answer and on the basis of
[04:27:14] positive reward points gained the model
[04:27:17] trains itself. So let's take this
[04:27:19] example. So over here this self-driving
[04:27:22] car would be our agent and the road is
[04:27:25] the environment and this car is
[04:27:27] interacting with this environment. So it
[04:27:30] will observe the environment and it has
[04:27:33] two choices over here. So either to go
[04:27:35] straight or turn right. Now let's say
[04:27:39] this agent or the self-driving car
[04:27:42] decides to go straight. Then what
[04:27:44] happens is it goes and bangs straight
[04:27:46] into this barricade. So then it realizes
[04:27:49] that the action taken by it was not in
[04:27:52] its best interest and that is why it is
[04:27:54] penalized. So since it is penalized it
[04:27:58] realizes that the action taken by it was
[04:28:01] wrong and that is why from the next time
[04:28:04] onwards it will do the opposite action.
[04:28:07] So instead of going straight it'll take
[04:28:09] the right turn. And when it takes the
[04:28:11] right turn, it realizes that the road is
[04:28:15] correct and the agent is rewarded. So
[04:28:17] this is how reinforcement learning
[04:28:19] basically works. So the agent interacts
[04:28:22] with the environment, it takes an action
[04:28:25] and if the action turns out to be
[04:28:27] incorrect, it is penalized and if the
[04:28:29] action turns out to be correct, it is
[04:28:32] rewarded. So this cycle goes on and on
[04:28:35] till it completely learns its environment
[04:28:38] properly. And a good use case of
[04:28:41] reinforcement learning is again the
[04:28:43] self-driving car. So companies such as
[04:28:45] Tesla and Google are working on the
[04:28:48] self-driving cars. So just to sum it
[04:28:51] up, these are the three different types
[04:28:53] of machine learning algorithms. So we
[04:28:55] have supervised, unsupervised and
[04:28:57] reinforcement machine learning. So under
[04:29:00] supervised we have regression and
[04:29:02] classification and in unsupervised we
[04:29:06] have clustering techniques, association
[04:29:08] analysis and hidden Markov models. And
[04:29:11] then the third is obviously
[04:29:12] reinforcement learning which works on
[04:29:15] trial and error method. And if you want
[04:29:17] to do some really cool machine learning
[04:29:19] projects, you can check the sites out.
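The reward-and-penalty loop from the self-driving car example can be sketched as a tiny trial-and-error learner. This is a toy sketch under assumptions: the `environment` function, the +1 and -1 reward values, the learning rate and the exploration rate are all hypothetical, and a real self-driving system is vastly more complex.

```python
import random

ACTIONS = ["straight", "right"]


def environment(action):
    """Hypothetical road: going straight hits the barricade, turning right is clear."""
    return 1 if action == "right" else -1


def train(episodes=50, learning_rate=0.5, explore=0.2, seed=0):
    """Learn a value for each action from rewards and penalties."""
    rng = random.Random(seed)
    value = {action: 0.0 for action in ACTIONS}
    for _ in range(episodes):
        if rng.random() < explore:
            action = rng.choice(ACTIONS)            # occasionally try something else
        else:
            action = max(ACTIONS, key=value.get)    # otherwise take the best-known action
        reward = environment(action)                # the agent is rewarded or penalized
        # Nudge the stored value of the chosen action toward the observed reward.
        value[action] += learning_rate * (reward - value[action])
    return value


value = train()
best = max(ACTIONS, key=value.get)
print(value, best)
```

After a penalty for going straight, the stored value for that action drops below the value for turning right, so the agent switches to the right turn from then on, which mirrors the cycle described in the talk.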
[04:29:22] Now let's look at some limitations of
[04:29:24] machine learning. So when it comes to
[04:29:26] machine learning algorithms, they would
[04:29:28] require massive stores of training data.
[04:29:32] So again as I've told you machine
[04:29:35] learning is totally based on the data
[04:29:37] which it has. So only if you have a large amount
[04:29:41] of data will it be able to give
[04:29:44] good accuracy. So let's say you take
[04:29:47] in a very small amount of data; then there's
[04:29:50] a good possibility that the results
[04:29:52] which you're getting are very biased or
[04:29:54] very incorrect and also error diagnosis
[04:29:57] is quite difficult when it comes to
[04:29:59] machine learning because again the
[04:30:01] amount of data is very huge and wherever
[04:30:04] there's a mistake you'd have to go
[04:30:07] through the entire algorithm which
[04:30:09] you've written and then find out that
[04:30:12] particular mistake by yourself which is
[04:30:14] very difficult and also when it comes to
[04:30:16] machine learning algorithms, they're not
[04:30:18] really that creative. So these ML
[04:30:21] algorithms are built only for one
[04:30:24] specific purpose. So let's say I'll
[04:30:27] build a machine learning model which
[04:30:28] will predict whether it'll rain or not
[04:30:30] today. Now if I want to use the same
[04:30:34] model to predict the stock prices,
[04:30:37] that'll not work. Right? So basically
[04:30:40] one model is built only for one
[04:30:43] particular task. So this is the lack of
[04:30:46] creativity that I'm talking about when
[04:30:48] it comes to machine learning and also
[04:30:50] there are a lot of time constraints as
[04:30:53] the model has to learn through a lot of
[04:30:55] historical data. So that was everything
[04:30:58] about machine learning. Now let's start
[04:31:00] off with deep learning. So deep learning
[04:31:04] is a subset of machine learning where it
[04:31:08] learns through data representations as
[04:31:11] opposed to task specific algorithms. So
[04:31:14] we saw that the drawback in machine
[04:31:17] learning models was that the models are
[04:31:20] specific to only one particular task.
[04:31:23] But this is not the case with deep
[04:31:25] learning models as these deep learning
[04:31:28] models are based on the data
[04:31:31] representations
[04:31:32] and these deep learning models are
[04:31:35] mostly built with something known as
[04:31:37] deep neural networks. So this is what a
[04:31:41] deep neural network looks like. So these
[04:31:44] deep neural networks completely learn
[04:31:47] the data which is fed to it. So this is
[04:31:51] the data. So let's say if this image of
[04:31:53] a woman is fed as the data to the deep
[04:31:56] learning model then it'll completely
[04:32:00] extract all the features of this data by
[04:32:03] itself. Again the difference between ML
[04:32:05] and deep learning over here is that the
[04:32:08] extraction the feature extraction in
[04:32:11] machine learning is manual but when it
[04:32:13] comes to deep learning the feature
[04:32:15] extraction is automatic. So the deep
[04:32:18] learning model automatically extracts
[04:32:21] all of the features associated with that
[04:32:24] image and when new images are fed to it,
[04:32:27] it automatically is able to tell whether
[04:32:29] the image is similar to this or not. So
[04:32:32] over here we have a graph which
[04:32:34] tells us how does performance vary with
[04:32:37] respect to the amount of data. So what
[04:32:40] happens in machine learning is that as
[04:32:43] we keep on increasing the data the
[04:32:45] performance increases only up to a
[04:32:48] particular threshold. After that if we
[04:32:50] increase any more data there is no
[04:32:53] increase in performance. So this is
[04:32:56] another problem when it comes to machine
[04:32:58] learning. But on the other hand when it
[04:33:02] comes to deep learning the more amount
[04:33:04] of data you give it the better will be
[04:33:07] its performance. And that again is
[04:33:09] because deep learning is based on
[04:33:11] learning data representations. So the
[04:33:14] more data you give it, it'll
[04:33:16] automatically learn all those features
[04:33:18] of the data by itself and it'll keep
[04:33:21] on increasing its performance gradually.
[04:33:24] Now let's look at some applications of
[04:33:26] deep learning. So speech recognition is
[04:33:29] one application of deep learning. Now
[04:33:32] you need to understand that you cannot
[04:33:34] build speech recognition applications
[04:33:37] with machine learning. So this is where
[04:33:39] machine learning fails and deep learning
[04:33:42] comes in and helps you to build speech
[04:33:45] recognition applications. Also another
[04:33:47] application of deep learning is
[04:33:49] self-driving cars. So we see over here
[04:33:51] that the person is just sitting. He's
[04:33:53] not even touching the steering wheel and
[04:33:55] the car is driving by itself. So just an
[04:33:58] amazing application of deep learning.
[04:34:00] And then we have language translation
[04:34:02] over here. So this again is a power of
[04:34:04] deep learning. So over here we are
[04:34:06] typing something in Spanish and it is
[04:34:09] being automatically converted into
[04:34:11] English. So we also have visual
[04:34:13] translation over here. So over here this
[04:34:16] text or this board is in some random
[04:34:19] language and this app over here which
[04:34:22] uses deep learning automatically
[04:34:25] converts this visual into English. So
[04:34:29] those were some applications of deep
[04:34:31] learning. Now let's actually understand
[04:34:34] how does deep learning work. So most
[04:34:37] deep learning methods use neural network
[04:34:40] architectures and that is why deep
[04:34:42] learning models are often referred to as
[04:34:45] deep neural networks. So a deep neural
[04:34:49] network basically has these three
[04:34:52] layers: an input layer, the hidden
[04:34:54] layers and the output layer. And the
[04:34:57] term deep usually refers to the number
[04:34:59] of hidden layers in the neural network.
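The input layer, hidden layers and output layer described above can be sketched as a single forward pass through a very small network. This is a minimal sketch under assumptions: the layer sizes in `sizes`, the random untrained weights, and the choice of ReLU and sigmoid activations are illustrative picks, and bias terms are left out to keep it short.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """One weight row per neuron in the layer, with n_in weights each (no biases)."""
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]


def forward(x, layers):
    """Each layer uses the output of the previous layer as its input."""
    activation = x
    for index, weights in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) for row in weights]
        if index < len(layers) - 1:
            activation = [max(0.0, v) for v in z]                  # hidden layer: ReLU
        else:
            activation = [1 / (1 + math.exp(-v)) for v in z]       # output layer: sigmoid
    return activation


rng = random.Random(0)
sizes = [3, 4, 4, 1]   # 3 inputs, two hidden layers of 4 neurons, 1 output
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
hidden_layer_count = len(sizes) - 2

output = forward([0.5, -0.2, 0.1], layers)
print(hidden_layer_count, output)
```

Making the network "deeper" in this sketch just means adding more entries in the middle of `sizes`; the weights here are random, so the output is not meaningful until the network is trained.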
[04:35:02] So traditionally neural networks only
[04:35:04] contain two to three hidden layers while
[04:35:07] deep networks can have as many as 150
[04:35:11] hidden layers. Now that's a very huge
[04:35:13] amount, isn't it? So deep learning
[04:35:16] models are trained by using large sets
[04:35:18] of labeled data and neural network
[04:35:21] architectures that learn features
[04:35:24] directly from the data without the need
[04:35:26] for manual feature extraction. So all of
[04:35:31] the input data is given to this input
[04:35:34] layer and this input layer automatically
[04:35:38] extracts the features by itself. Now
[04:35:41] that data is sent to this hidden layer
[04:35:44] which performs all sorts of processing
[04:35:47] tasks and then the final result is given
[04:35:50] out through the output layer. So now
[04:35:52] let's also understand what exactly is a
[04:35:54] neural network. So a neural network is a
[04:35:57] computing model whose layered structure
[04:36:00] resembles the network structure of
[04:36:02] neurons in the brain with layers of
[04:36:05] connected nodes. So it can learn from
[04:36:07] data and can be trained to recognize
[04:36:10] patterns, classify data and forecast
[04:36:13] future events. So the neural network is
[04:36:16] based on the biological neural network
[04:36:19] of our brain. So that is why it is given
[04:36:22] the name neural network. So the layers
[04:36:26] are interconnected via nodes or neurons
[04:36:29] with each layer using the output of the
[04:36:32] previous layer as its input. So its main
[04:36:36] function is to receive a set of inputs,
[04:36:38] perform calculations and then use the
[04:36:41] output to solve the problem. Now as I've
[04:36:44] already said these artificial neural
[04:36:46] networks are based on something known as
[04:36:49] a biological neural network. So our
[04:36:53] biological neural network has dendrites,
[04:36:57] cell body and axon. So dendrites are
[04:37:01] where the input is taken. Cell body is
[04:37:03] where the processing is done and axon is
[04:37:06] where the message is transferred to
[04:37:08] other neurons and the same thing happens
[04:37:11] in artificial neural network as well. So
[04:37:14] first we give in the data that data is
[04:37:18] processed and then the final processed
[04:37:20] result is given out as the output. So
[04:37:24] over here let's say we train the model
[04:37:27] with images of cats and the labels would
[04:37:30] be either cat or not cat. After that we
[04:37:34] give it a new image of a cat and then we
[04:37:39] basically check whether the
[04:37:41] model correctly classifies this as cat
[04:37:44] or not cat and since the model has
[04:37:47] learned the data properly it correctly
[04:37:50] classifies this image as cat. Now to
[04:37:53] implement these artificial neural
[04:37:55] networks you would need the help of a
[04:37:57] deep learning framework. So the first
[04:38:00] question to pop into your head would be
[04:38:01] what are the different deep learning
[04:38:03] frameworks available. So TensorFlow is
[04:38:05] arguably one of the best deep learning
[04:38:08] frameworks that we have today. It is an
[04:38:10] open-source software library developed
[04:38:13] by the researchers and engineers from
[04:38:15] the Google Brain team for high
[04:38:17] performance numerical computation. One
[04:38:19] well-known use case of TensorFlow is
[04:38:21] Google Translate. So Google Translate is
[04:38:24] coupled with capabilities such as
[04:38:26] natural language processing, text
[04:38:28] classification, forecasting and tagging.
[04:38:30] So TensorFlow basically comes with two
[04:38:32] tools, TensorBoard and TensorFlow
[04:38:35] serving. So building massive deep neural
[04:38:37] networks could be complex and confusing.
[04:38:40] This is where we can use TensorBoard to
[04:38:42] visualize our TensorFlow graph and plot
[04:38:45] quantitative metrics. And then we have
[04:38:47] TensorFlow serving which is a flexible
[04:38:49] high performance serving system and can
[04:38:51] be used for rapid deployment of new
[04:38:54] algorithms while retaining the same
[04:38:56] server architecture and APIs. So now
[04:38:59] let's look at the next deep learning
[04:39:01] framework which is Keras. So Keras is
[04:39:04] actually a high-level API which can run on
[04:39:07] top of other deep learning libraries
[04:39:08] such as TensorFlow, Theano or CNTK. And
[04:39:12] with the help of Keras you can implement
[04:39:14] both convolutional neural networks as
[04:39:16] well as recurrent neural networks. And
[04:39:18] the best thing about Keras is that model
[04:39:21] building is extremely easy. It's like
[04:39:23] stacking layers on top of each other. So
[04:39:26] next we have PyTorch which is a
[04:39:29] scientific computing framework developed
[04:39:31] by Facebook. So we can get from the name
[04:39:34] itself that PyTorch is Pythonic in
[04:39:36] nature. That is it can leverage all the
[04:39:39] services and functionalities offered by
[04:39:42] the Python environment and also smoothly
[04:39:44] integrates with the Python data science
[04:39:47] stack. Another great feature of PyTorch
[04:39:49] is that it offers dynamic computational
[04:39:51] graphs which can be changed during
[04:39:53] runtime. This is highly useful when we
[04:39:56] have no idea how much memory will be
[04:39:59] required for creating a neural network
[04:40:01] model. And the next deep learning
[04:40:03] framework is DL4J. So unlike the deep
[04:40:06] learning frameworks which we saw till
[04:40:07] now, which were all based on Python,
[04:40:10] Deeplearning4j is a deep learning
[04:40:13] programming library which is written for
[04:40:15] Java and the Java Virtual Machine. And
[04:40:18] the biggest advantage of DL4J is that it
[04:40:21] includes inbuilt integration with Apache
[04:40:24] Hadoop and Spark. So it helps in getting
[04:40:26] state-of-the-art results on image
[04:40:28] recognition tasks. So it shows matchless
[04:40:32] potential for image recognition, fraud
[04:40:34] detection, text mining, parts of speech
[04:40:37] tagging and also natural language
[04:40:39] processing. And finally we have MXNet.
[04:40:42] So, MXNet is a deep learning framework
[04:40:45] developed by the Apache Software Foundation
[04:40:47] specifically for the purpose of high
[04:40:49] efficiency, productivity and
[04:40:52] flexibility. And the beauty of MXNet is
[04:40:54] that it gives users the ability to code
[04:40:57] in a variety of programming languages
[04:40:59] such as Python, R, Julia, and Scala.
[04:41:03] This means that you can train your deep
[04:41:05] learning models with whichever language
[04:41:07] you're comfortable in without having to
[04:41:09] learn something new from scratch. And
[04:41:12] this deep learning framework is known
[04:41:13] for its capabilities in imaging, speech
[04:41:16] recognition, forecasting and NLP. So
[04:41:18] when you hear the term TensorFlow, the
[04:41:21] first question to pop into your head
[04:41:22] would be what exactly is a tensor? So in
[04:41:25] TensorFlow, data is represented in the
[04:41:28] form of tensors. Simply put, a tensor is
[04:41:31] a multi-dimensional array in which data
[04:41:34] is stored. So you can consider these
[04:41:36] tensors to be the building blocks in
[04:41:38] TensorFlow. Now these very tensors are
[04:41:41] given as the input to the neural
[04:41:43] network. So as I've said a tensor is
[04:41:46] nothing but an n-dimensional array. So
[04:41:49] the number of dimensions used to
[04:41:51] represent the data is known as its rank.
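As a rough illustration of rank, the sketch below treats nested Python lists as stand-ins for tensors and counts their dimensions. The helper names `shape` and `rank` are made up for this example and are not TensorFlow functions.

```python
def shape(tensor):
    """Return the size of each dimension of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:  # an empty list has no inner element to inspect
            break
        tensor = tensor[0]
    return tuple(dims)


def rank(tensor):
    """The rank is simply the number of dimensions."""
    return len(shape(tensor))


scalar = 10                        # a single value
vector = [1, 2, 3]                 # one dimension
matrix = [[1, 2, 3], [4, 5, 6]]    # two dimensions

print(rank(scalar), rank(vector), rank(matrix))  # 0 1 2
print(shape(matrix))                             # (2, 3)
```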
[04:41:53] So if a tensor has just one element, in
[04:41:56] other words, if it has just magnitude
[04:41:58] and no direction then its rank will be
[04:42:01] zero. If a tensor has magnitude and
[04:42:04] direction in one plane then its rank
[04:42:06] will be one. Similarly, if a tensor has
[04:42:09] magnitude and direction in two planes,
[04:42:11] then its rank will be two and this goes
[04:42:13] on higher up the order. Now, TensorFlow
[04:42:16] as the name states is a combination of
[04:42:18] two words tensor and flow. Here the data
[04:42:22] is stored in tensors but the execution
[04:42:25] is done in the form of a graph. So, this
[04:42:28] is not like your traditional programming
[04:42:30] where you just write a bunch of lines
[04:42:32] and everything gets executed in
[04:42:34] sequence. So first you'd have to prepare
[04:42:37] this computational graph and then this
[04:42:40] computational graph is executed inside
[04:42:43] something known as a session. Now in
[04:42:46] this computational graph all the
[04:42:48] mathematical operations are depicted
[04:42:51] inside the nodes and all the tensors are
[04:42:55] represented on the edges. So the entire
[04:42:58] computation process is done in two
[04:43:00] stages. In the first step, the code is
[04:43:03] depicted onto the computational graph
[04:43:06] and in the second step, a new session
[04:43:08] environment is started and the graph is
[04:43:11] executed inside this session. So that
[04:43:14] was all about the computational graph.
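The two stages described above can be sketched in plain Python. This is only a toy stand-in for the idea of "build a graph first, execute it later"; the `Node` and `Session` classes and the helper functions are hypothetical and are not the TensorFlow API.

```python
class Node:
    """A graph node: an operation plus the nodes that feed into it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value  # only set for constants


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def multiply(a, b):
    return Node("mul", (a, b))


class Session:
    """Stage 2: walk the graph and actually compute the values."""

    def run(self, node):
        if node.op == "const":
            return node.value
        left, right = (self.run(n) for n in node.inputs)
        return left + right if node.op == "add" else left * right


# Stage 1: describe the computation. Nothing is calculated yet.
total = add(constant(20), constant(30))
product = multiply(constant(20), constant(30))

# Stage 2: execute the graph inside a session.
sess = Session()
results = [sess.run(total), sess.run(product)]
print(results)  # [50, 600]
```

Before the session runs, `total` is just a description of an addition, which mirrors why printing a graph-mode tensor shows its type rather than its value.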
[04:43:17] Now let's look at the program elements
[04:43:19] in TensorFlow. So we have three program
[04:43:21] elements constant, placeholder and
[04:43:24] variable. So let's start with constants.
[04:43:27] So constants are program elements whose
[04:43:30] value does not change or in other words
[04:43:33] the value is fixed. So let's head on to
[04:43:36] Jupyter notebook and work with these
[04:43:38] constants. Right. So my first task would
[04:43:40] be to import the TensorFlow framework.
[04:43:42] So I'll type import tensorflow as tf.
[04:43:47] I'll click run. So let me just wait till
[04:43:50] the import is done. Right. So we have
[04:43:52] successfully imported the TensorFlow
[04:43:54] framework. Now, as I've said, let's go
[04:43:56] ahead and start working with the
[04:43:58] constants. So, let me just type in
[04:44:00] constants over here. So, I'll create the
[04:44:02] first constant and name this constant as
[04:44:05] con1. Now, this is how we can create
[04:44:08] constants in TensorFlow. So, I will use
[04:44:11] tf, then put in a dot, and then
[04:44:14] type in constant. Now, inside this I
[04:44:17] will give the value of the constant. So,
[04:44:20] let's say the value is 10. So this is an
[04:44:23] integer type constant. Now similarly
[04:44:26] I'll also create a floating type
[04:44:28] constant and I'll store this in con2. So
[04:44:32] I'll type tf.constant and the
[04:44:37] floating value would be 3.14.
[04:44:40] Now after this I'll create a string type
[04:44:43] constant. So again this would be
[04:44:47] tf.constant and the string which I'd be
[04:44:50] giving would be "This is Sparta",
[04:44:54] and finally we have a boolean type
[04:44:56] constant and I'll store this in con4.
[04:44:59] So this will be tf.constant
[04:45:04] and let's say the value is False. Now
[04:45:08] I'll run. Now let me print all of these
[04:45:11] values. So I'll use the print function
[04:45:14] and then I'll go ahead and print all of
[04:45:18] these values: con1, con2, con3, con4. Right, so we see
[04:45:24] that this first constant is a tensor of
[04:45:28] type integer, the second constant is a
[04:45:31] tensor of type float, this third constant
[04:45:34] is a tensor of type string, and this
[04:45:36] fourth constant is a tensor of type
[04:45:38] boolean. Now we see that we only have the
[04:45:41] data types of all of these tensors but
[04:45:43] we don't have their values. This is
[04:45:45] because as I've already told you guys we
[04:45:48] have to create a computational graph and
[04:45:50] then execute that computational graph
[04:45:52] inside a session. But till now we have
[04:45:54] not started our session. So let's go
[04:45:56] ahead and start a session first.
[04:45:59] So I'll type sess = tf.Session()
[04:46:04] and I'll hit run. Now I will run
[04:46:08] all of these inside this session. So I
[04:46:12] will type sess.run and let me go
[04:46:18] ahead and run all of these: con1, con2,
[04:46:21] con3, and con4. So this time we have
[04:46:25] the values of all of these tensors. So
[04:46:27] the value of constant 1 is 10. The
[04:46:29] value of constant 2 is 3.14. The value
[04:46:32] of constant 3 is "This is Sparta", and
[04:46:34] finally the value of constant 4 is
[04:46:37] False. Right? So first we'd have to
[04:46:40] create all of the constants. Then we'd
[04:46:42] have to create a session. And inside the
[04:46:45] session we'd have to run all of these
[04:46:47] constants.
[04:46:48] Now let me go ahead and perform some
[04:46:50] simple operations on all of these
[04:46:52] constants.
[04:46:53] So let me just type in operations over
[04:46:55] here. So I'll do a simple addition
[04:46:57] operation.
[04:46:59] So I'll type addition over here.
[04:47:04] And let's say the value of the first
[04:47:06] constant is 20. I'll put a plus symbol
[04:47:11] and then I'll take in the next constant
[04:47:13] and the value of the second constant
[04:47:16] would be 30. So I am basically adding
[04:47:19] two TensorFlow constants. The value of
[04:47:22] the first constant is 20. The value of
[04:47:23] the second constant is 30. And I'm
[04:47:25] storing that result in addition.
[04:47:27] Similarly, I will multiply these two
[04:47:29] constants now. So multiplication equals
[04:47:34] tf.constant.
[04:47:37] So I'll give the value of 20. Now I'll
[04:47:40] put the asterisk symbol and then this
[04:47:42] would be tf.constant of 30. So this
[04:47:47] time I'm multiplying these two values.
[04:47:49] Right? So I'll hit on run. So now again
[04:47:52] if I'd have to see the resultant
[04:47:54] addition and multiplication values, I'd
[04:47:57] have to run these two inside a session.
[04:48:00] So I'll type sess.run and then put in
[04:48:06] these two values over here addition and
[04:48:10] multiplication
[04:48:12] right so we see that 20 + 30 gives us an
[04:48:16] addition value of 50 and similarly when
[04:48:19] we multiply 20 with 30 we get a result
[04:48:22] of 600. Right, so this was a basic
[04:48:26] operation with scalars, and we already
[04:48:28] know that tensors can have higher
[04:48:30] dimensions. So let's go ahead and
[04:48:32] perform addition and multiplication with
[04:48:35] these higher dimension tensors. Right?
[04:48:39] So again I'll just put in addition over
[04:48:41] here
[04:48:43] and I will take in the first constant
[04:48:46] and inside this I will give in a list of
[04:48:48] values. So let's say I will take in 1 2
[04:48:52] 3 4 and 5 and I will add this list with
[04:48:57] the next constant
[04:48:59] and this time the second constant has
[04:49:02] the values of 5 4 3 2 and 1. Similarly
[04:49:08] I'll also multiply. So multiplication
[04:49:12] equals tf.constant
[04:49:16] and this will have values, let's say the
[04:49:18] same values: 1, 2, 3, 4, and 5. Let me put
[04:49:23] a comma over here. Right, now I'll put
[04:49:26] the asterisk symbol again. I'll type
[04:49:29] tf.constant
[04:49:31] and I will give in the list of the
[04:49:32] values 5, 4, 3, 2, and 1. Now I'll hit run.
[04:49:39] Again I need to run these two inside a
[04:49:41] session. So sess.run,
[04:49:45] let me put in addition over here. After
[04:49:48] that I would also need the
[04:49:50] multiplication value right? So this is
[04:49:53] our result. So when we add 1 + 5 we get
[04:49:57] six. When we add 2 + 4 we get six.
[04:50:00] Similarly when we add each of the
[04:50:02] corresponding elements with these
[04:50:04] elements inside the list we get all of
[04:50:06] the sixes over here. Now let's take this
[04:50:08] multiplication result. So over here when
[04:50:10] we multiply 1 with five we get five.
[04:50:12] When we multiply two with four we get 8.
[04:50:15] 3 times 3 gives us 9. 4 times 2 gives us 8. And
[04:50:18] again 5 times 1 gives us 5. So this was
[04:50:22] addition and multiplication with respect
[04:50:24] to lists.
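The element-by-element behaviour described here can be checked with ordinary Python lists. This sketch uses `zip` as a stand-in for what the `+` and `*` operators do on the two constant tensors; it does not call TensorFlow.

```python
first = [1, 2, 3, 4, 5]
second = [5, 4, 3, 2, 1]

# Pair up corresponding elements, then combine each pair.
addition = [a + b for a, b in zip(first, second)]
multiplication = [a * b for a, b in zip(first, second)]

print(addition)        # [6, 6, 6, 6, 6]
print(multiplication)  # [5, 8, 9, 8, 5]
```

Note that `+` on two plain Python lists would concatenate them instead, which is one reason the tensor versions are worth distinguishing.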
[04:50:28] Now let me also do a simple operation on
[04:50:30] strings. So let me take in the first
[04:50:33] string and name it as str1. So this is a
[04:50:36] constant. So TF dot constant and let's
[04:50:41] say I type over here I love and then I
[04:50:45] give a space. Now I will take in the
[04:50:48] second string which would be str2
[04:50:51] and inside this and again this would be
[04:50:54] a constant. So tf dot constant and the
[04:50:59] value of this constant would be
[04:51:02] tensorflow. Right now I will run this
[04:51:06] and let me execute this inside a
[04:51:08] session. So sess.run of str1 plus
[04:51:13] str2. So the result which you get is I
[04:51:16] love tensorflow. Right? So the first
[04:51:18] string is I love and then there's a
[04:51:19] space and the second string is
[04:51:21] tensorflow. So when I add these two
[04:51:22] strings the resultant is I love
[04:51:24] tensorflow. Right? So that was all about
[04:51:26] constants and then we have placeholders.
[04:51:29] So when it comes to placeholders, we
[04:51:32] don't have to provide an initial value
[04:51:34] and can specify it during the runtime.
[04:51:37] So this allows us to build our
[04:51:39] computational graph without needing the
[04:51:42] data. And this is how we can create
[04:51:44] placeholders. So tf.placeholder
[04:51:47] is the syntax and inside that we just
[04:51:50] give the data type of the variable which
[04:51:52] we will substitute later on during
[04:51:54] execution.
[04:51:57] So let's head back to
[04:52:00] Jupyter and work with these placeholders.
[04:52:01] So let me just type in placeholder over
[04:52:05] here. So let me create my first
[04:52:07] placeholder. So I'll name that as a, and
[04:52:10] this is tf.placeholder,
[04:52:14] and this would be of integer type. So
[04:52:17] tf.int32.
[04:52:21] Now I will create another variable which
[04:52:23] would be b, and the value of b would be
[04:52:26] a * 2. So let me run this.
[04:52:30] Now I will run these two inside a
[04:52:33] session. So sess.run, and I want to
[04:52:37] see the result of b. So I'll put in b
[04:52:40] over here. Now since we know that a
[04:52:42] placeholder takes in a value during
[04:52:45] runtime. So this is when I'll feed the
[04:52:47] value to this placeholder a over here.
[04:52:50] Now to do that I would have to create
[04:52:52] something known as a feed dictionary.
[04:52:55] So feed_dict equals, let me create a
[04:52:59] dictionary over here. So it would be a
[04:53:03] and the value which I'll be giving to a
[04:53:05] would be let's say five. Now let me run
[04:53:08] this and let's see what do we get.
[04:53:10] Right? So during the execution time I
[04:53:12] have assigned a value of five to a and
[04:53:16] when we multiply this five with two over
[04:53:19] here we get the value of b which is 10.
[04:53:21] Right? So all of this is happening
[04:53:23] during runtime because with the help of
[04:53:26] a placeholder we can assign it a value
[04:53:28] during execution. Now similarly let me
[04:53:31] go ahead and give a list of values over
[04:53:33] here. So instead of five let me give the
[04:53:36] list of values 1 2 3 4 and five. Now
[04:53:40] I'll run this. So over here 1 times 2 gives
[04:53:43] us 2. 2 times 2 gives us 4. 3 times 2 is 6. 4 times
[04:53:47] 2 is 8. And 5 times 2 is 10. So you get an
[04:53:50] array of values during runtime.
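The idea of supplying a value only at run time can be sketched without TensorFlow. Below, `Placeholder`, `times_two`, and `run` are hypothetical helpers written for this illustration; only the `feed_dict` name mirrors the argument used in the walkthrough.

```python
class Placeholder:
    """Names an input whose value arrives only when the graph is run."""

    def __init__(self, name):
        self.name = name


def times_two(node):
    # Record the operation instead of computing it immediately.
    return ("mul", node, 2)


def run(expr, feed_dict):
    """Look up the fed value for the placeholder, then apply the operation."""
    _, node, factor = expr
    value = feed_dict[node]
    if isinstance(value, list):
        return [v * factor for v in value]
    return value * factor


a = Placeholder("a")
b = times_two(a)  # no data needed to define b

single = run(b, feed_dict={a: 5})
many = run(b, feed_dict={a: [1, 2, 3, 4, 5]})
print(single)  # 10
print(many)    # [2, 4, 6, 8, 10]
```

The same definition of `b` is reused with two different feeds, which is the point of a placeholder: the graph is built once and the data is swapped in later.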
[04:55:37] Now let
[04:55:40] me also create a placeholder for strings.
[04:55:43] So this will be string
[04:55:46] placeholder.
[04:55:48] Right, so let me create this placeholder.
[04:55:50] I'll name this as str name, and since
[04:55:54] this is a placeholder I need to use
[04:55:56] tf.placeholder, and I will be assigning
[04:56:00] a string to this during execution time.
[04:56:03] Now I will also create another string
[04:56:05] which would be my name, and this would be
[04:56:09] equal to "I am" and there's a space, and I
[04:56:13] will be adding this with str name. So
[04:56:18] let me hit run. Right, now I'll execute
[04:56:20] this inside a session. So for that I'll
[04:56:24] type sess.run, and I want the result
[04:56:28] of my name. So I'll just put in my name
[04:56:32] over here and then I'll provide the feed
[04:56:33] dictionary, feed_dict equal to, let me
[04:56:37] put in the dictionary over here and I
[04:56:38] will assign the values of str name over
[04:56:42] here. Right. So the values of str name
[04:56:47] would be Sam,
[04:56:50] Bob
[04:56:52] and Charlie. Now let me hit run and
[04:56:54] let's see what do we get. Right. So what
[04:56:57] we are basically doing is we are adding
[04:57:00] this with the placeholder value over
[04:57:02] here and we are giving the values during
[04:57:04] the execution time. So I am Sam, I am
[04:57:10] Bob and I am Charlie. Now these three
[04:57:12] values are coming from this feed
[04:57:15] dictionary during the runtime. Right? So
[04:57:18] this is all about placeholders. And
[04:57:20] finally we have variables. So a variable
[04:57:23] is just a program element which allows
[04:57:25] us to add new trainable parameters to
[04:57:28] the graph. And this is the syntax to
[04:57:31] create a variable: tf.Variable. And
[04:57:33] then we give the value or we initialize
[04:57:35] the value. And then we specify the data
[04:57:38] type of that variable. Right? So let's
[04:57:40] head back to Jupyter now. I'll just type
[04:57:42] in variables over here. Right? So let me
[04:57:44] create my first variable and the name of
[04:57:46] that variable would be v1
[04:57:49] and we can create a variable like this
[04:57:51] tf.Variable. So guys, you need to
[04:57:54] keep in mind that this V over here is
[04:57:57] actually capital, right. So now after
[04:58:00] this I would have to assign a value to
[04:58:02] it. So let's say I assign this variable
[04:58:05] a value of 20 and this is of integer
[04:58:09] type. So tf.int32. I'll run this.
[04:58:14] Now another thing to be kept in mind is
[04:58:16] whenever we are declaring variables in
[04:58:19] TensorFlow they have to be initialized.
[04:58:22] So this is how we can initialize all of
[04:58:24] the variables.
[04:58:26] So we have something known as
[04:58:29] tf.global_variables_initializer.
[04:58:33] And when we invoke this function, all of
[04:58:35] the variables which we have declared
[04:58:37] would be initialized. I'll hit run. So
[04:58:40] now let me execute this inside a
[04:58:42] session. So sess.run of init. And I
[04:58:48] have initialized this variable over
[04:58:50] here. Now let me also go ahead and run
[04:58:52] that variable: sess.run of v1. Right?
[04:58:59] So we have the result of v 1 which is
[04:59:01] 20. Now since this is a variable the
[04:59:04] value of a variable can be actually
[04:59:07] updated. So let me go ahead and update
[04:59:10] the value of this.
[04:59:45] So now this is actually a variable. We
[04:59:47] can actually update the values. So let
[04:59:49] me go ahead and do that. So the name of
[04:59:52] the updated variable would be, let's say,
[04:59:54] updated
[04:59:56] v1, and the function for that would be
[04:59:59] tf.assign. And this takes in two
[05:00:02] parameters. The first parameter would be
[05:00:05] the variable which we are supposed to
[05:00:06] update. And the second parameter would
[05:00:08] be the value to which we are updating
[05:00:11] it. So I want to change this from 20 to 25.
[05:00:15] I'll hit run. Right, now let me run this
[05:00:17] inside a session: sess.run, and I need
[05:00:22] to pass in the variable which would be
[05:00:24] updated v1.
[05:00:27] Let me make it small v over here. I'll
[05:00:29] run this. Right. So initially the value
[05:00:31] of V1 was 20 but we have updated it and
[05:00:35] made its value to be 25. Now let's also
[05:00:38] go ahead and create a small linear
[05:00:40] model.
[05:00:42] So let me just type in linear model over
[05:00:44] here. And this is how our linear model
[05:00:47] would look like. WX + B where W and B
[05:00:52] would be variables and X would be a
[05:00:54] placeholder.
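As a quick sanity check of the arithmetic, here is a plain-Python version of the linear model W * x + B. The default values W = 10 and B = 5 and the inputs 1 through 5 are the ones used in the walkthrough; the function name `linear_model` is just a label chosen for this sketch, and no TensorFlow calls are involved.

```python
def linear_model(x_values, w=10, b=5):
    """Apply w * x + b to every input value."""
    return [w * x + b for x in x_values]


outputs = linear_model([1, 2, 3, 4, 5])
print(outputs)  # [15, 25, 35, 45, 55]
```

In the TensorFlow version, W and B are variables, so a training step could later change them, while x stays a placeholder that is fed at run time.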
[05:00:56] Right. So let me start off by
[05:00:58] creating W. So W is a variable. So W
[05:01:02] would be equal to tf.Variable and I
[05:01:06] am initializing it with a value of, let's
[05:01:09] say, 10, and this is of integer type. So
[05:01:12] this would be tf.int32. Now
[05:01:17] similarly I will also assign the value
[05:01:19] for B. So B is also a variable and its
[05:01:24] initial value would be 5, and this is
[05:01:26] also of integer type. And finally we
[05:01:29] have x, which is a placeholder. So x is
[05:01:32] equal to tf.placeholder.
[05:01:36] And a placeholder does not actually
[05:01:38] take an initial value; it just takes a
[05:01:40] data type. So the data type is again
[05:01:43] tf.int32. So I'll run this. And now
[05:01:47] what I'll do is I will multiply W with x
[05:01:51] and add B to it. So the equation would
[05:01:54] be W * x + B, and I will store it
[05:01:58] in a variable and name that variable to
[05:02:00] be linear model. Right? So this is W
[05:02:04] * x + B. I'll run this. Now again if
[05:02:09] I have to execute this, I have to run
[05:02:12] this inside a session. And since also I
[05:02:14] have created two new variables, I'd have
[05:02:16] to initialize them first. So init one
[05:02:20] equals
[05:02:24] tf.global_variables_initializer.
[05:02:26] I'll hit run. So now I will create a
[05:02:29] session. So sess.run, and I will
[05:02:33] execute init one first. So I have
[05:02:35] successfully initialized these two
[05:02:37] variables W and B. Now I can go ahead
[05:02:40] and run this linear model:
[05:02:43] sess.run. Now I would want the result of
[05:02:47] linear model. So linear model and then I
[05:02:50] will use the feed dictionary. Now inside
[05:02:53] this feed dictionary I would have to
[05:02:56] assign a value to this placeholder x. So
[05:03:00] x equal to let's say I give a list of
[05:03:04] values over here and the list of values
[05:03:06] would be 1 2 3
[05:03:09] 4 and 5. Now I'll run this and let's see
[05:03:13] what do we get. Right? So if the value
[05:03:15] of x is 1, we get 15. So this basically
[05:03:19] means that 10 into 1 + 5 which is 15.
[05:03:24] Now after that if the value of x is 2 so
[05:03:26] this would mean 10 into 2 + 5 which is
[05:03:29] 25. Similarly if x is 3 that would mean
[05:03:32] 10 into 3 which is 30 + 5 is 35 right
[05:03:36] and same is the case for 4 and 5. So
[05:03:39] these two commands will update your
[05:03:41] tensorflow version for the CPU and the
[05:03:43] GPU both in your Jupyter notebook. And
[05:03:45] once you have updated your TensorFlow,
[05:03:47] now we'll import the TensorFlow package
[05:03:49] or the library as TF. And if you run
[05:03:52] this command, your TensorFlow package
[05:03:53] will be imported into the Jupyter
[05:03:55] notebook. And once you have imported the
[05:03:57] TensorFlow into Jupyter notebook, now
[05:03:59] we'll verify whether the version of the
[05:04:01] TensorFlow that we're using is 2.0 or
[05:04:03] not. So we'll use the print. So we'll
[05:04:05] print
[05:04:08] tf.__version__.
[05:04:09] So once you print this line, you will
[05:04:11] see that we have the latest version
[05:04:12] which is 2.0.0. And once you are sure
[05:04:15] that you are using the version 2.0. So
[05:04:17] now we'll move ahead and build some
[05:04:18] tensors. So a tensor is just a fancy
[05:04:21] name for an n-dimensional array and you
[05:04:23] can also think of it as a general
[05:04:25] representation of a vector in higher
[05:04:27] dimensions. So we'll start by making a
[05:04:29] constant tensor which means that its
[05:04:31] value cannot be changed later on. So
[05:04:33] we'll use the tf dot constant method. So
[05:04:35] if you run this particular line here and
[05:04:37] if you press shift and tab here so it
[05:04:39] will open this documentation where you
[05:04:41] can check different attributes of this
[05:04:43] particular method. So the first is your
[05:04:45] value. The first attribute is the value
[05:04:47] or the parameter is the value where you
[05:04:49] put the value of your tensor. So here we
[05:04:51] have passed a string hello and you can
[05:04:53] also mention the data type of the value
[05:04:55] that you're passing and also the shape
[05:04:57] that you're passing if it is a
[05:04:58] multi-dimensional tensor and it will
[05:05:01] create a constant tensor for you. So we
[05:05:02] have passed hello. So it will
[05:05:04] automatically create a string tensor for
[05:05:06] us. So once you run this and we have
[05:05:08] stored the tensor in hello. And if you
[05:05:09] check now the type of the tensor so you
[05:05:11] will see that it is not a string. It is
[05:05:13] a tensor object. Before, in
[05:05:16] TensorFlow 1, if you wanted to print the
[05:05:18] value of this particular string you
[05:05:20] would have to create a session, but in
[05:05:21] TensorFlow 2.0 sessions are no longer needed.
[05:05:24] So now if you want to just print the
[05:05:26] value of a tensor, we can directly print
[05:05:28] it using the tf.print function. So this
[05:05:31] function will print the value of any
[05:05:32] string or the tensor that we have built
[05:05:34] so if I use tf.print and I pass my
[05:05:36] tensor to it. So it will print the value
[05:05:39] that is actually contained in that
[05:05:41] particular tensor. So now let's create
[05:05:42] another tensor using tf.constant
[05:05:44] which is a constant tensor and we'll
[05:05:46] pass the string world to it and we'll
[05:05:48] store the tensor in world. And now let's
[05:05:50] print the world tensor using the
[05:05:51] tf.print. So now we have two tensors
[05:05:54] hello and world. And before, if you
[05:05:56] wanted to perform any operations on
[05:05:58] tensors you would have to create a
[05:05:59] session and inside the session you have
[05:06:01] to add or perform different operations.
[05:06:04] So now we can directly perform
[05:06:05] operations on tensors. So we'll perform
[05:06:07] the addition. We will concatenate these
[05:06:09] two strings which are contained in the hello
[05:06:11] and world tensors and we'll store the
[05:06:12] result in result and then we will print
[05:06:14] the result. So you can see that our
[05:06:15] result is hello world. So it means we
[05:06:17] do not have to create a
[05:06:18] session in order to perform different
[05:06:20] operations on these tensors. So now let's
[05:06:22] move ahead and create a tensor which
[05:06:24] contains a numeric value. So we'll
[05:06:26] pass the numeric value as 10 and we'll store
[05:06:28] the tensor in a and then we'll create
[05:06:30] another constant tensor with
[05:06:32] value 20. And now we'll
[05:06:34] perform the addition operation on these
[05:06:36] tensors without using a session. So once
[05:06:38] you perform the addition operation it
[05:06:40] will display you this particular tensor
[05:06:42] object. So it will not actually display
[05:06:44] you the actual value of this particular
[05:06:46] operation. And here you can see if you
[05:06:48] run this operation multiple times. So
[05:06:50] every time the ID here will be
[05:06:51] different. So the tensorflow implicitly
[05:06:53] stores different functions or operations
[05:06:55] as ids. So now if you want to print the
[05:06:57] value of this particular operation on
[05:06:59] these tenses. So you can use tf.print
[05:07:02] again. So once you use tf.print and you
[05:07:04] will pass the operation that you want to
[05:07:05] perform. So it will directly print you
[05:07:07] the value of that particular operation
[05:07:09] performed on these two tenses. So now
[05:07:11] let's build another tensor using the
[05:07:12] fill method. So if you run this command
[05:07:15] first and then you press Shift and Tab,
[05:07:17] this will open the documentation, and
[05:07:19] this fill command or the fill method
[05:07:21] will create a tensor which is filled
[05:07:23] with a scalar value. So you have to pass
[05:07:24] a value and it will create a tensor with
[05:07:27] the following dimensions. So here we
[05:07:29] have passed the dimensions as 5x5. So
[05:07:31] it will create a tensor with five rows
[05:07:33] and five columns which are the dims
[05:07:35] argument and then we have to pass the
[05:07:36] value. So we want a 5x5 tensor which
[05:07:38] will contain the value five. And if you
[05:07:40] want to print this particular tensor so
[05:07:42] you can use the tf.print method. So it
[05:07:44] will print the value of this particular
[05:07:46] tensor and you can also print this value
[05:07:48] as a NumPy array. You just have to
[05:07:51] write the name of your tensor
[05:07:53] and then you can use the numpy method.
[05:07:55] So once you run this it will represent
[05:07:57] your tensor or it will display your
[05:07:58] tensor as a numpy array. So now let's
[05:08:00] build another tensor using the constant
[05:08:02] method. So we'll build a two-dimensional
[05:08:04] tensor. So it will be a tensor with two
[05:08:06] rows and two columns. And once we run
[05:08:08] this, we'll build a tensor. And now if
[05:08:11] you want to know how many rows and
[05:08:12] columns are there in your tensor and
[05:08:14] that is the shape of your tensor. So you
[05:08:15] can use the get_shape method. So if you
[05:08:17] use the get_shape method, it will display
[05:08:19] you the number of rows and the columns
[05:08:20] present in your tensor. Clearly we can
[05:08:22] see that we have two rows and two
[05:08:23] columns in our tensor. And similarly if
[05:08:26] you create this particular tensor which
[05:08:27] has two rows and one column and get
[05:08:29] the shape of this tensor, it will be
[05:08:30] displayed as two rows and one
[05:08:32] column. So now we can also create random
[05:08:35] numbers using tensors. So we'll create a
[05:08:37] tensor which contains
[05:08:38] random numbers drawn from the
[05:08:40] normal distribution. So we'll use the
[05:08:42] tf.random.normal method. So in this
[05:08:44] method we'll pass the dimensions and
[05:08:46] then we'll pass our mean and the
[05:08:47] standard deviation. So if you press
[05:08:49] Shift and Tab here, you can see that first of
[05:08:51] all the first argument will be your
[05:08:53] shape and then your mean of this
[05:08:55] particular normal distribution and your
[05:08:56] standard deviation of the normal
[05:08:58] distribution. So we want a 4x4 tensor
[05:09:01] which will contain four rows and four
[05:09:02] columns and then we want the mean to be
[05:09:04] zero and the standard deviation should
[05:09:06] be one. So it will be a standard normal
[05:09:07] distribution. So once you run this and
[05:09:09] you store the tensor in my tensor and
[05:09:11] now if you print the value of tensor you
[05:09:12] will see that we have four rows and four
[05:09:15] columns in this particular tensor and
[05:09:16] the numbers are from this standard
[05:09:18] normal distribution. So now let's
[05:09:19] discuss what we mean by eager
[05:09:21] execution that is implemented in
[05:09:22] tensorflow 2.0. So if you talk about
[05:09:24] tensorflow 1.0 or 1.x so it requires you
[05:09:28] to manually build a syntax tree which is
[05:09:30] also called the graph using different
[05:09:32] API calls and then when you build a
[05:09:34] graph you have to manually compile
[05:09:35] the graph and you have to pass the set
[05:09:36] of input and output tensors to the
[05:09:39] session.run function. But in
[05:09:40] tensorflow 2.0 we execute eagerly which
[05:09:43] means it executes like a normal python
[05:09:45] programming language that is line by
[05:09:47] line. So you don't have to create
[05:09:48] sessions and your code will be executed
[05:09:50] line by line. So if you want to know
[05:09:52] more about the effective changes that
[05:09:54] have been made to tensorflow 1 in
[05:09:55] tensorflow 2. So you can go to the
[05:09:57] following link and if you go to the
[05:09:58] following link you will find a whole
[05:10:00] documentation that discusses different
[05:10:01] changes that have been made to
[05:10:03] tensorflow 1 to migrate to tensorflow 2.
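Eager execution as described above means an operation returns a concrete value as soon as the line runs. A minimal sketch, assuming TensorFlow 2.x where eager mode is the default; the printed messages are illustrative:

```python
import tensorflow as tf

# tf.executing_eagerly() reports whether eager mode is active.
if tf.executing_eagerly():
    print("Eager execution is enabled")
else:
    print("Eager execution is not enabled")

# With eager execution the sum is available immediately,
# without building a graph or opening a session.
a = tf.constant(10)
b = tf.constant(20)
total = a + b
print(total.numpy())
```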
[05:10:05] So now let's move ahead and after
[05:10:07] performing this we'll first of all check
[05:10:09] whether we are running eager execution
[05:10:11] or not. So in tensorflow 2 this is the
[05:10:13] default execution eager execution and if
[05:10:15] you want to enable eager execution in
[05:10:17] other versions of tensorflow so you can
[05:10:19] use these commands. So first of all
[05:10:20] we'll check whether eager execution
[05:10:22] is enabled or not. So we'll use an if
[05:10:24] statement here. We'll call
[05:10:25] tf.executing_eagerly(), and if this is true
[05:10:28] it means we are running eagerly. So if
[05:10:30] it is true it will print eager execution
[05:10:32] is enabled. And then if it is not true
[05:10:33] then it will print that you are not
[05:10:35] running eager execution. And we'll also
[05:10:37] print how to run and how to enable and
[05:10:39] disable eager execution. So right now
[05:10:40] eager execution is enabled, as you can see
[05:10:43] when you run this code here. And
[05:10:45] if you want
[05:10:47] to disable eager execution, you
[05:10:50] can run the following command. So you
[05:10:51] have to import this particular module
[05:10:53] from this particular package and then
[05:10:55] you can use the disable_eager_execution
[05:10:57] function to disable
[05:10:59] eager execution. And if your eager
[05:11:01] execution is already turned off and if
[05:11:03] you want to turn it on so you can use
[05:11:04] this particular command here. So using
[05:11:06] this command you can enable eager
[05:11:08] execution and if your tensorflow is not
[05:11:10] updated so you can use these two lines
[05:11:12] to update your tensorflow to the latest
[05:11:14] version that is 2.0. So now let's move
[05:11:16] ahead and perform simple operations
[05:11:18] using tensorflow 2. So there are some of
[05:11:20] the common operations that we'll use
[05:11:22] while using tensorflow 2. We'll make
[05:11:24] tensors using tf.constant and
[05:11:28] tf.Variable, and then we'll concatenate two
[05:11:30] tensors and we'll make tensors using zeros
[05:11:32] and ones. And then we'll also learn how
[05:11:34] to reshape tensors. And we'll also learn
[05:11:37] how to cast tensors from one data type
[05:11:39] to another data type. So we'll start by
[05:11:40] making a constant tensor.
[05:11:44] So a constant tensor is a tensor that
[05:11:46] does not change. So we'll use
[05:11:49] tf.constant again. And it is a tensor that
[05:11:51] has three rows and two columns. So the
[05:11:54] shape of this tensor will be three and
[05:11:56] two. And once you run this command, your
[05:11:59] tensor will be stored in a. And if you
[05:12:01] want to print this tensor, there are two
[05:12:03] methods. If you want to print the
[05:12:04] values of this tensor, you
[05:12:06] can use the tf.print method and
[05:12:09] pass your tensor to it. And if you
[05:12:11] want it as a NumPy array, you can use
[05:12:13] the .numpy() method to print the value
[05:12:15] of this tensor as a NumPy array. Next is
[05:12:18] the variable tensor, which mostly represents the values of the
[05:12:19] weights of the inputs in our neural
[05:12:21] network. So we'll use the tf.Variable
[05:12:23] method to create a tensor which is a
[05:12:26] variable tensor. So we'll create a
[05:12:27] tensor which has two rows and two
[05:12:29] columns.
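The constant and variable tensors described above can be sketched like this. The element values are illustrative assumptions; only the shapes (three rows by two columns, and two rows by two columns) come from the walkthrough.

```python
import tensorflow as tf

# A constant tensor with three rows and two columns (values are illustrative).
a = tf.constant([[1, 2], [3, 4], [5, 6]])

tf.print(a)        # print the values directly
print(a.numpy())   # or view the tensor as a NumPy array

# A variable tensor with two rows and two columns; its values can change later.
va = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
print(va.shape)
```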
[05:12:30] And we'll store the tensor in VA which
[05:12:32] is a variable tensor. And then we'll
[05:12:34] create another constant tensor which has
[05:12:36] three rows and two columns. So the
[05:12:38] shape of the tensor is (3, 2). And we'll
[05:12:40] store the result in B. And then if you
[05:12:43] want to print B, so you can use
[05:12:44] TF.print. So it will print your B
[05:12:46] directly without running any session. So
[05:12:49] now let's move ahead and perform the
[05:12:51] concatenation of tensors using
[05:12:53] tf.concat.
[05:12:54] So we'll create a concatenated tensor, AB
[05:12:57] concatenated, and we'll concatenate
[05:12:59] values from the tensors A and B. So if
[05:13:02] you write axis equals 1, the tensors
[05:13:05] will be concatenated according to the
[05:13:07] columns. So the new tensor will contain
[05:13:09] the columns from the two tensors that we
[05:13:11] have concatenated that is A and B. And
[05:13:14] after we concatenate these two values
[05:13:15] and if you press shift and tab here so
[05:13:17] it will open the documentation. And in
[05:13:19] the documentation you can see that we
[05:13:21] have to first pass the values of the
[05:13:22] tensors, and then axis here represents
[05:13:25] row or column fashion. So if the
[05:13:27] axis value is one, our new tensor will
[05:13:30] have concatenated columns, and if the axis
[05:13:33] value is zero, our concatenation will
[05:13:34] happen according to the rows. And after
[05:13:37] this we'll print the value of the
[05:13:38] concatenated tensor. So we'll write this
[05:13:40] f-string. So in Python an f-string
[05:13:43] is started by writing f and then
[05:13:45] writing your string, and it can contain
[05:13:47] Python expressions. So inside the
[05:13:49] string if you want to pass any variable
[05:13:52] so you have to write your string and
[05:13:53] then using these braces or using these
[05:13:55] brackets inside the brackets you have to
[05:13:57] write the name of your variable and once
[05:13:59] you run this command, your tensors will
[05:14:01] be printed. So now you can see we have
[05:14:03] concatenated two tensors, that is A and B.
[05:14:06] The A tensor was
[05:14:09] this particular tensor with three rows
[05:14:11] and two columns, and the B tensor was
[05:14:13] this tensor which also has three
[05:14:16] rows and two columns, and we have
[05:14:18] concatenated these two tensors
[05:14:20] column-wise. So the first two columns are of
[05:14:22] tensor A and the second two columns are
[05:14:24] of tensor B. And now we'll see how to
[05:14:27] concatenate tensors according to rows.
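The column-wise concatenation just described, together with the row-wise variant, can be sketched as follows. The element values and the names `ab_columns` and `ab_rows` are assumptions made for illustration; the shapes follow the three-row, two-column tensors in the walkthrough.

```python
import tensorflow as tf

a = tf.constant([[1, 2], [3, 4], [5, 6]])        # shape (3, 2)
b = tf.constant([[7, 8], [9, 10], [11, 12]])     # shape (3, 2)

# axis=1 joins the tensors side by side: the result has 3 rows and 4 columns.
ab_columns = tf.concat(values=[a, b], axis=1)

# axis=0 stacks the tensors on top of each other: 6 rows and 2 columns.
ab_rows = tf.concat(values=[a, b], axis=0)

print(f"Column-wise:\n{ab_columns.numpy()}")
print(f"Row-wise:\n{ab_rows.numpy()}")
```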
[05:14:29] So inside the tf.concat function you
[05:14:33] just have to write axis equals zero, and
[05:14:36] the values are the tensors A and B. And
[05:14:38] once you run this code and print the
[05:14:40] value of the tensor using the f-string,
[05:14:42] your tensors
[05:14:44] will be concatenated according to the
[05:14:46] rows. So the first three rows represent
[05:14:48] tensor A and the last three
[05:14:51] rows represent tensor B. So now
[05:14:54] let's move ahead and make tensors using
[05:14:56] tf.zeros and tf.ones. We use tf.zeros to
[05:14:59] make tensors that contain zeros of a
[05:15:01] particular shape and tf.ones to create
[05:15:03] tensors containing ones of a
[05:15:05] particular shape. So we'll use tf.zeros
[05:15:08] to make a tensor which is
[05:15:10] filled with zeros. And
[05:15:13] if you run this particular code and then
[05:15:16] press Shift and Tab here,
[05:15:19] inside this function you can see we
[05:15:20] have our shape argument where you can
[05:15:22] mention the shape of your tensor. So the
[05:15:25] first argument will be your row and the
[05:15:27] second argument will be your column and
[05:15:30] you can also mention the data type of
[05:15:32] your tensor. So here we are creating a
[05:15:34] zero tensor a tensor filled with zeros
[05:15:36] with two rows and four columns and the
[05:15:38] data type of those zeros will be integer
[05:15:42] 32.
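A minimal sketch of the zero-filled tensor described above, together with the analogous `tf.ones` call that the walkthrough turns to next. The variable names are assumptions; the shapes and data types are the ones mentioned in the narration.

```python
import tensorflow as tf

# Two rows and four columns of 32-bit integer zeros.
zeros_tensor = tf.zeros(shape=[2, 4], dtype=tf.int32)

# Four rows and five columns of 32-bit float ones.
ones_tensor = tf.ones(shape=[4, 5], dtype=tf.float32)

print(f"Zeros:\n{zeros_tensor.numpy()}")
print(f"Ones:\n{ones_tensor.numpy()}")
```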
[05:15:43] So our tensor of 32-bit integers will have two rows
[05:15:46] and four columns. And once you make this
[05:15:48] tensor and store the value in tensor and
[05:15:50] then you print the value of this
[05:15:51] particular tensor using the f-string
[05:15:53] here, and since we have printed it as a
[05:15:56] NumPy array using the .numpy()
[05:15:59] method, our tensor will look
[05:16:01] like this which has two rows and four
[05:16:03] columns containing zeros. So now let's
[05:16:05] move ahead and make tensors using
[05:16:08] tf.ones. So tf.ones is used to make
[05:16:10] tensors with a particular shape
[05:16:12] containing ones. So we'll use the tf.ones
[05:16:14] method and we'll pass the shape of our
[05:16:16] tensor which is four rows and five
[05:16:18] columns here and then we have also
[05:16:20] passed the data type of our tensor which
[05:16:22] is float32. So once you print this
[05:16:24] tensor using the f-string, you'll see
[05:16:26] that we have a tensor which contains
[05:16:28] four rows and five columns containing
[05:16:30] the float32 type of number. So now
[05:16:33] let's move ahead and use tf.reshape to
[05:16:36] reshape our tensors. So later on in this
[05:16:38] course you'll see that a lot of times
[05:16:39] we'll have to reshape different tensors
[05:16:41] so that we can pass those to our neural
[05:16:43] networks as features. So we'll first of
[05:16:45] all create a tensor for reshaping. So
[05:16:47] this tensor is a constant tensor and it
[05:16:51] contains four rows and three
[05:16:53] columns and first of all we will build
[05:16:55] this tensor and after making this tensor
[05:16:57] we'll reshape it. So we can reshape it
[05:17:00] in the following way. So we'll use the
[05:17:03] tf.reshape method and then we
[05:17:05] have to pass the tensor
[05:17:07] that we want to reshape. And if you press
[05:17:08] Shift and Tab here, it will open this
[05:17:11] documentation where you can see the
[05:17:12] first value is your tensor and then the
[05:17:14] shape that you want to reshape it into.
[05:17:16] So firstly our tensor is this
[05:17:18] particular tensor, and then the shape
[05:17:20] that we want to reshape it into is one
[05:17:22] row and 12 columns, since we have a total
[05:17:25] of 12 elements in our particular tensor.
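The reshape just described, and the `tf.cast` step covered next, can be sketched together. The element values and the name `tensor_as_int` are illustrative assumptions; the shapes and data types follow the narration.

```python
import tensorflow as tf

# A constant tensor with four rows and three columns of 32-bit floats.
tensor = tf.constant(
    [[1.2, 2.5, 3.7],
     [4.1, 5.9, 6.3],
     [7.4, 8.8, 9.6],
     [10.0, 11.5, 12.9]],
    dtype=tf.float32,
)

# 4 x 3 = 12 elements, so the tensor can be reshaped to one row of 12 columns.
reshaped = tf.reshape(tensor, [1, 12])

# Casting float32 to int32 drops the fractional part; it does not round.
tensor_as_int = tf.cast(tensor, tf.int32)

print(f"Before reshaping:\n{tensor.numpy()}")
print(f"After reshaping:\n{reshaped.numpy()}")
print(f"After casting:\n{tensor_as_int.numpy()}")
```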
[05:17:27] So according to that we would have one
[05:17:29] row and 12 columns. So once you run this
[05:17:32] code your tensor will be reshaped and
[05:17:33] then we'll print both the tensors,
[05:17:35] that is, before reshaping and after
[05:17:38] reshaping, using the f-strings. So now
[05:17:40] here you can see that we have our tensor
[05:17:42] which was before reshaping and now this
[05:17:44] is our tensor which contains one row and
[05:17:46] 12 columns after using the tf.reshape
[05:17:49] function. So now let's move ahead and
[05:17:51] use the tf.cast method to cast one type of
[05:17:54] tensor into another type of tensor. So
[05:17:57] first of all we'll create a tensor which
[05:17:58] will contain values of 32-bit float type
[05:18:01] and then we will convert it into
[05:18:04] 32-bit integer values. So first of all let's
[05:18:05] create a tensor using tf.constant,
[05:18:08] which is a constant tensor. So this
[05:18:10] tensor contains four rows and three
[05:18:13] columns and all of these numbers are
[05:18:15] floating point numbers which are 32-bit
[05:18:17] numbers. Then we'll use the
[05:18:18] tf.cast method, and if you press
[05:18:21] Shift and Tab here, the first
[05:18:24] argument is the tensor that you want to
[05:18:27] cast and then the data type
[05:18:28] that you want to cast it into. So
[05:18:31] we'll cast our tensor, which is tensor
[05:18:33] here, and we'll cast it into a
[05:18:35] tensor which will
[05:18:37] contain values of 32-bit integers, and
[05:18:40] we'll store the result as tensor as int.
[05:18:43] And once we have this tensor we
[05:18:45] can use the f-string to print both the
[05:18:47] tensors, that is, before converting and
[05:18:49] after converting or casting. So we have
[05:18:51] just removed the decimal part; we have
[05:18:53] not rounded the number. So here you can
[05:18:55] see that 8.8 is converted to 8 because
[05:18:58] we have just removed the fractional
[05:19:00] part without actually rounding these
[05:19:02] numbers. So now let's move forward and
[05:19:04] perform some linear algebra operations
[05:19:07] on tensors. So we'll start by first of
[05:19:09] all transposing a tensor. So we'll first
[05:19:12] of all create a tensor using
[05:19:14] tf.constant. So this tensor contains two
[05:19:16] rows and three columns with different
[05:19:18] values. So when we transpose a tensor,
[05:19:21] we'll convert our rows to columns and
[05:19:23] columns to rows. So our first row will
[05:19:25] be converted to first column and our
[05:19:27] second row will be converted to second
[05:19:29] column. So we'll use the tf.transpose
[05:19:32] method and we'll pass our tensor value
[05:19:34] that is a. So once we print this value
[05:19:36] using the f-string, we'll get our
[05:19:38] transposed tensor, with
[05:19:41] the first row of the original matrix,
[05:19:43] or the first row of the original tensor,
[05:19:45] as the first column and the second row
[05:19:47] of the original tensor as the second
[05:19:48] column. Now let's move ahead and
[05:19:50] implement matrix multiplication which
[05:19:52] we'll use throughout the course and most
[05:19:54] of the optimization algorithms also use
[05:19:56] matrix multiplication. So first of all
[05:19:59] we will create two tensors:
[05:20:01] we will create a matrix A and a
[05:20:04] vector v. So both of these are tensors.
[05:20:07] So our first tensor is a constant tensor
[05:20:09] that has two rows and two columns, and
[05:20:12] our second tensor is a vector which has
[05:20:13] two rows and only one column. So
[05:20:16] we'll multiply these two matrices or
[05:20:18] these two tensors using matrix
[05:20:20] multiplication, and if you know the rules
[05:20:22] of matrix multiplication you'll
[05:20:24] get this particular result. We'll use the
[05:20:26] tf.matmul function. So we'll pass our
[05:20:29] two tensors, and this order is also very
[05:20:31] important. So our first tensor is 2x2
[05:20:35] and our second tensor is 2x1. So
[05:20:38] our resulting tensor will be 2x1
[05:20:40] according to matrix multiplication. So
[05:20:42] once we print the result which we have
[05:20:44] stored in AV using the f-string, and we
[05:20:46] have passed our expression AV inside the
[05:20:49] f-string, we'll get this particular
[05:20:50] tensor which has two rows and one column
[05:20:53] that is the result of the matrix
[05:20:54] multiplication between A and V. And now
[05:20:56] we'll perform element-wise
[05:20:58] multiplication and we'll see how they
[05:20:59] are different. So we'll compare both
[05:21:01] matrix multiplication and element-wise
[05:21:02] multiplication and we can see both are
[05:21:05] different because the rules of matrix
[05:21:06] multiplication are different from
[05:21:08] element-wise multiplication. So we use
[05:21:11] the same two matrices or tensors, that
[05:21:13] is A and v, and we'll use
[05:21:16] tf.multiply, which is used to
[05:21:19] multiply two tensors element-wise. So
[05:21:21] we'll store the result in a and then
[05:21:23] we'll print the result using the
[05:21:24] f-string, and our a is the expression that
[05:21:27] we have passed inside the f-string. So
[05:21:29] once we print this we'll get this
[05:21:31] result, which is totally different from
[05:21:32] what we got for matrix multiplication.
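The linear algebra operations covered so far (transpose, matrix multiplication, and element-wise multiplication) can be compared side by side. This is a sketch with illustrative values; the names `a_transposed`, `av_matmul`, and `av_elementwise` are assumptions, and the shapes (a 2x2 matrix and a 2x1 vector) follow the narration.

```python
import tensorflow as tf

A = tf.constant([[1, 2], [3, 4]])   # 2 x 2 matrix
v = tf.constant([[5], [6]])         # 2 x 1 column vector

# Transpose swaps rows and columns.
a_transposed = tf.transpose(A)

# Matrix multiplication: (2 x 2) times (2 x 1) gives a 2 x 1 result.
av_matmul = tf.matmul(A, v)

# Element-wise multiplication: v is broadcast across the columns of A,
# so the result keeps the 2 x 2 shape.
av_elementwise = tf.multiply(A, v)

print(f"Transpose:\n{a_transposed.numpy()}")
print(f"Matrix product:\n{av_matmul.numpy()}")
print(f"Element-wise product:\n{av_elementwise.numpy()}")
```

In this example the matrix product is `[[17], [39]]` while the element-wise product is `[[5, 10], [18, 24]]`, which shows how the two operations differ.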
[05:21:35] This is our element-wise multiplication
[05:21:37] of A and v. And here the order of
[05:21:40] these tensors matters in the way we pass
[05:21:42] them into the matmul function, although not
[05:21:45] for the multiply function. So now let's move ahead
[05:21:47] and see how to calculate or how to
[05:21:49] compute the identity matrix of a
[05:21:51] particular tensor. So the identity matrix of
[05:21:54] a matrix is a matrix which, when
[05:21:57] multiplied with a particular matrix,
[05:22:00] will result in the same matrix that we
[05:22:02] have multiplied it with. So if you
[05:22:04] have a matrix A and you multiply it,
[05:22:06] through matrix multiplication, by the
[05:22:08] identity matrix of that particular
[05:22:09] matrix, then we'll get the result as the
[05:22:11] same matrix A. So here is our matrix A,
[05:22:14] which is a tensor we have created using
[05:22:16] tf.constant, and it has two rows
[05:22:17] and two columns. And now we'll calculate
[05:22:20] the number of rows and the number of
[05:22:21] columns that we have in our tensor using
[05:22:23] the shape attribute. So we'll use the
[05:22:25] A.shape attribute of our tensor and
[05:22:27] we'll assign the two values, that is, two,
[05:22:29] which is the number of rows, and two,
[05:22:31] which is the number of columns, and we'll
[05:22:33] print them using the f-string. And now we
[05:22:35] know the number of rows and number of
[05:22:36] columns in our matrix. So now we'll
[05:22:37] create an identity matrix using the tf.eye
[05:22:41] method. So in this method we can pass
[05:22:43] different arguments. So firstly we'll
[05:22:45] pass the number of rows argument as rows
[05:22:47] here which is two and the number of
[05:22:49] columns argument as columns which is two
[05:22:51] and then we'll pass the data type of our
[05:22:54] values that will be contained in our
[05:22:55] identity matrix, that is, 32-bit integer,
[05:22:58] and then we'll store the result in A
[05:23:00] identity and then we'll print the value
[05:23:01] of A identity as a NumPy array. So
[05:23:05] it will be represented
[05:23:07] as a NumPy array, and here you can see we
[05:23:09] have an identity matrix: 1 0, 0 1. So an
[05:23:13] identity matrix contains ones on the
[05:23:15] main diagonal and all other elements are
[05:23:18] zero. So now let's confirm whether
[05:23:20] this is an identity matrix or not. So
[05:23:22] we'll first of all multiply both the
[05:23:24] matrices that is A and the identity
[05:23:26] matrix of A and if the resulting matrix
[05:23:29] is A then we can say that we have
[05:23:31] successfully computed the identity
[05:23:33] matrix for a particular tensor that is A
[05:23:35] here. So we'll use the same matmul
[05:23:37] method. We'll pass both our tensors,
[05:23:39] that is A and A identity, and then we'll
[05:23:41] store the result in a a and then we'll
[05:23:44] print it using the f-string. So we can
[05:23:46] see we get the same result which is the
[05:23:48] same matrix a or the tensor a that we
[05:23:51] created above. So now let's move ahead
[05:23:53] and see how to create a variable in
[05:23:55] TensorFlow 2.0. So in TensorFlow 1.x
[05:23:58] you know that the variables are first
[05:24:00] instantiated, and then if you want to use
[05:24:02] those variables you have to
[05:24:04] initialize those variables using the
[05:24:07] global initializers. So in TensorFlow
[05:24:10] 2.0 we don't need to initialize
[05:24:11] variables. So to use the value of a
[05:24:13] variable in TensorFlow 2.0 graph we'll
[05:24:16] simply treat it like a simple tensor
[05:24:19] without actually initializing using the
[05:24:21] global initializers. So we'll firstly
[05:24:23] see how to create variables. So we'll
[05:24:25] create variables using
[05:24:26] tf.Variable. And here we have created a
[05:24:28] variable with tf.zeros. So a variable is a
[05:24:31] tensor whose values can be changed. And
[05:24:33] these mostly represent the weights of
[05:24:35] the inputs in our neural networks.
[05:24:38] So we'll first of all create a tensor
[05:24:41] variable. So if you press shift and tab
[05:24:43] here you'll get different arguments and
[05:24:44] you can check the different arguments
[05:24:46] that you can use here. So we'll pass the
[05:24:48] values as a
[05:24:50] tf.zeros tensor, and
[05:24:55] here we want to make a zero tensor which
[05:24:58] will have two elements, with each element
[05:25:00] having three rows and four columns,
[05:25:03] and we'll store the result in my
[05:25:05] variable. And if you print my variable
[05:25:07] using the tf.print method,
[05:25:08] we'll see we have two
[05:25:11] zero tensors, with each tensor
[05:25:15] having three rows and four columns. So
[05:25:17] this particular code here represents
[05:25:18] that we have two tensors, with each tensor
[05:25:21] having three rows and four columns.
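The variable just described can be sketched as below, assuming TensorFlow 2.x. The name `my_variable` mirrors the narration; no initializer call is needed before the variable is used.

```python
import tensorflow as tf

# A variable initialized from a zero tensor of shape (2, 3, 4):
# two blocks, each with three rows and four columns.
my_variable = tf.Variable(tf.zeros([2, 3, 4]))

# The value can be printed right away, without a global initializer.
tf.print(my_variable)
print(my_variable.shape)
```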
[05:25:26] And now we have different methods that
[05:25:28] we can use on a variable. So we'll first
[05:25:31] of all create a variable which will
[05:25:32] contain a floating-point number zero. And
[05:25:34] then if you want to add anything to our
[05:25:36] variable. So our variable here is V. And
[05:25:38] if you want to add one. So here W is a
[05:25:41] tensor which is computed based on the
[05:25:43] value of V. So here W is our new tensor
[05:25:45] which will be computed based on the
[05:25:47] value of V. So anytime a variable is
[05:25:50] used in an expression, it gets
[05:25:51] automatically converted to a tensor
[05:25:53] representing its value. So here we have
[05:25:56] passed one. So it will be automatically
[05:25:58] converted to a tensor and then the value
[05:26:00] of one will be added to the value of v
[05:26:02] and then we will get the result. So our
[05:26:04] result here should be one because our w
[05:26:07] is v + 1. So this is unlike tensor flow
[05:26:10] 1.x where you would have to initialize
[05:26:13] those variables before we actually use
[05:26:14] them. So here we don't have to
[05:26:16] initialize we can just use them as
[05:26:17] normal tensors. So now let's print both v
[05:26:20] and w. So we'll write tf.print and then
[05:26:23] we'll pass our v and then w. So you can
[05:26:26] see our V was zero which we had declared
[05:26:29] and then we added one to it. So we got
[05:26:31] one. Now let's see another command which
[05:26:33] is assign. So we can assign using the
[05:26:35] assign method to different tensors or
[05:26:38] different variables. So first of
[05:26:39] all we'll create a variable tensor which
[05:26:41] has a value of 0.0 floating point and
[05:26:44] then we'll assign it a value of two
[05:26:45] using the assign method. So we'll write
[05:26:47] V dot assign and we'll assign it a value
[05:26:49] of two. So now it will be having a value
[05:26:51] of two. So now it is a TensorFlow
[05:26:54] object. So if you want to print it as a
[05:26:56] NumPy array, you can just write numpy
[05:26:59] here, and the value will be
[05:27:01] represented as a NumPy array. And
[05:27:02] you can also use tf.print, so that
[05:27:05] the value of this particular tensor will
[05:27:06] be printed directly. So now let's see how to use
[05:27:08] the assign_add method to perform the add
[05:27:11] operation on a tensor variable. So we
[05:27:13] have created a variable v with the
[05:27:15] floating-point number zero as its value.
[05:27:17] And now we'll use the assign_add method
[05:27:19] to add one to the variable v. And if you
[05:27:21] run this you'll get a variable object,
[05:27:25] which is a TensorFlow variable object.
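The variable operations walked through above (using a variable in an expression, `assign`, and `assign_add`) can be sketched as follows. The second variable name `u` is an assumption, used here so that each step starts from zero as in the narration.

```python
import tensorflow as tf

# Using a variable in an expression yields a regular tensor.
v = tf.Variable(0.0)
w = v + 1
tf.print(v, w)

# assign overwrites the stored value in place.
v.assign(2.0)
print(v.numpy())

# assign_add adds to the stored value in place and returns the variable object.
u = tf.Variable(0.0)
u.assign_add(1.0)
tf.print(u)
```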
[05:27:27] And if you want to print the value of v
[05:27:28] after the assignment you can use the
[05:27:30] tf.print method to print the actual
[05:27:32] value. So after printing the value you
[05:27:34] can see that the value is one since we
[05:27:36] added one to our variable v using the
[05:27:39] assign_add method. So now let's see how
[05:27:41] to use operations in tensorflow 2.0. So
[05:27:44] in TensorFlow 1.x we used to define
[05:27:47] sessions, and inside of the sessions we
[05:27:49] used to declare our operations using
[05:27:51] placeholders but in TensorFlow 2.0 we
[05:27:54] can just directly declare functions
[05:27:56] instead of declaring a session and a
[05:27:58] placeholder. So here we will first of
[05:28:00] all define an operation which is our add
[05:28:02] operation. So first of all I'll create
[05:28:05] a function, which is add op, with two
[05:28:08] inputs that is a and b and it will
[05:28:10] return the addition of those two inputs
[05:28:12] a and b. So now we'll print the sum of
[05:28:16] 10 and 20 using the add operation. So
[05:28:18] we'll call the add operation and it will
[05:28:20] sum the two numbers 10 and 20 and it
[05:28:22] will return the answer as 30. Now let's
[05:28:24] build a linear model using the
[05:28:26] TensorFlow 2.0. So first of all we will
[05:28:28] create a variable W and a variable B
[05:28:31] which are the parameters of an equation
[05:28:33] of a line which is WX + B. So before we
[05:28:37] used to create X as a placeholder. Now
[05:28:39] we'll just pass the value of x directly
[05:28:42] to our function that is our linear model
[05:28:43] function. So first of all we'll create
[05:28:45] two variables. The first variable is w.
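The add operation and the linear model being built here can be sketched as plain Python functions. The names `add_op` and `linear_model` follow the narration loosely and the explicit conversion of `x` to an int32 tensor is an assumption of this sketch; the values of W, b, and x are the ones used in the walkthrough.

```python
import tensorflow as tf

def add_op(a, b):
    """Return the sum of two inputs as a tensor."""
    return tf.add(a, b)

print(add_op(10, 20).numpy())

# Parameters of the line y = W * x + b, stored as 32-bit integer variables.
W = tf.Variable(10, dtype=tf.int32)
b = tf.Variable(5, dtype=tf.int32)

def linear_model(x):
    """Evaluate W * x + b; x is passed directly instead of via a placeholder."""
    x = tf.constant(x, dtype=tf.int32)  # explicit conversion (an assumption here)
    return W * x + b

y = linear_model([5, 10, 15, 20])
print(y.numpy())
```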
[05:28:47] So the value of this variable is 10 and
[05:28:51] it is of 32-bit integer type, and the
[05:28:53] next variable is five, which is also of
[05:28:56] 32-bit integer type, and the value of this
[05:28:59] variable is assigned to b. So now let's
[05:29:01] define a function which is a linear
[05:29:02] model function which will take one
[05:29:04] argument, that is x. So once we
[05:29:06] plug in x we'll get the equation of a line,
[05:29:08] and the w and the b values are global
[05:29:09] variables. So we don't have to
[05:29:10] initialize them now in TensorFlow 2.0. So
[05:29:14] we'll directly use w and b
[05:29:16] inside this function.
[05:29:18] And once we define this function, it
[05:29:19] will return the value of the line for
[05:29:21] different values of x. So now we'll call
[05:29:24] this function using different values of
[05:29:26] x. So we have passed a list which
[05:29:28] contains 5 10 15 and 20 as four numbers
[05:29:31] as different values of x. And once you
[05:29:33] run this particular line, so you'll get
[05:29:35] this particular object. So right now
[05:29:37] this is just a tensor object. So if you
[05:29:39] want to get the values as an array. So
[05:29:41] you can use the numpy method to get the
[05:29:44] values as a numpy array. So now we have
[05:29:46] got four values here. So if you plug x
[05:29:50] as five in this particular equation, so
[05:29:52] your y value will be 55. And if you plug
[05:29:55] x = 10 in this particular equation with
[05:29:57] the respective values of w and b as 10
[05:29:59] and 5. So your value will be 105. And if
[05:30:02] you plug 15 as x and w as 10 and b as 5.
[05:30:06] So you'll get 155. And similarly for 20
[05:30:09] you'll get 205 for this particular
[05:30:12] equation using w and b as 10 and 5. So
[05:30:16] let's understand the limitations of a
[05:30:17] single layer perceptron. So a single
[05:30:20] layer perceptron can only learn linearly
[05:30:23] separable problems. So if a nonlinearly
[05:30:26] separable problem is given to a single
[05:30:28] layer perceptron then it would not be
[05:30:30] able to come up with a solution. So we
[05:30:32] have the AND/OR problems and the XOR
[05:30:35] problem over here. So as it is stated
[05:30:37] over here, AND/OR problems are linearly
[05:30:40] separable while an XOR problem is
[05:30:43] nonlinearly separable. So over here if
[05:30:45] you look at this AND/OR problem we are
[05:30:48] actually supposed to divide these
[05:30:50] balls into two separate groups. So there
[05:30:53] are green colored balls and there is a
[05:30:55] blue colored ball and we are supposed to
[05:30:57] divide these into two separate groups
[05:30:59] using a single layer perceptron. And
[05:31:01] that is quite easy for a single layer
[05:31:03] perceptron. So all we have to do is draw
[05:31:05] a straight line that places all of the
[05:31:08] green balls on the left side of the
[05:31:10] line and the blue
[05:31:11] colored ball on the right side of the
[05:31:13] line. So this is a case where the
[05:31:16] single layer perceptron works properly.
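The separability claim above can be checked with a few lines of code. This is a minimal sketch, not code from the course: `train_perceptron` and `accuracy` are hypothetical helper names, and the classic perceptron update rule with 20 epochs is an assumed choice. It trains a single-layer perceptron on the AND, OR and XOR truth tables.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Train a single-layer perceptron with a step activation (illustrative helper)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            update = lr * (target - pred)   # zero when the prediction is already right
            w += update * xi
            b += update
    return w, b

def accuracy(X, y, w, b):
    """Fraction of rows the trained perceptron classifies correctly."""
    preds = (X @ w + b > 0).astype(int)
    return float(np.mean(preds == y))

# The four input combinations shared by all three truth tables.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

results = {}
for name, y in targets.items():
    w, b = train_perceptron(X, y)
    results[name] = accuracy(X, y, w, b)
    print(name, results[name])
```

AND and OR reach an accuracy of 1.0 because one straight line separates their classes. XOR stays below 1.0 however long the loop runs, since no single line can split its four points correctly.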
[05:31:18] But then again let's take the case of an
[05:31:20] XOR problem. So what happens in an XOR
[05:31:23] problem is it is nonlinearly separable.
[05:31:26] So we have a blue colored ball over here
[05:31:28] and then we have two green colored balls
[05:31:30] and then again there is a blue colored
[05:31:32] ball over here. So if you're asked to
[05:31:35] separate these four balls into two
[05:31:37] groups, then it wouldn't be possible
[05:31:40] with a single straight line, right? So we
[05:31:43] can't create two separate groups with
[05:31:45] just a single straight line. So this is
[05:31:47] where a single layer perceptron fails
[05:31:50] and it is very clear over here. So it is
[05:31:52] just not possible to draw a single line
[05:31:54] which would be able to separate the
[05:31:56] green colored balls and the blue colored
[05:31:58] balls into two separate groups. So this
[05:32:00] is where we'll have to use the help of
[05:32:03] multi-layer perceptrons. So now again
[05:32:05] there are a lot of complex problems out
[05:32:07] there which cannot be solved with
[05:32:09] single layer perceptrons. One example is image
[05:32:11] classification. Now what happens in
[05:32:13] image classification is there are a lot
[05:32:15] of dimensions and there is a lot of
[05:32:17] complexity associated with it and there
[05:32:19] are a lot of factors which come into the
[05:32:21] equation. So taking care of all of these
[05:32:24] parameters, all of these dimensions that
[05:32:27] is just not possible with the help of a
[05:32:29] single layer perceptron. And to find a
[05:32:31] solution for all of these nonlinearly
[05:32:33] separable problems, we'd have to take
[05:32:35] the help of a multi-layer perceptron. So
[05:32:38] before we go ahead and find the solution
[05:32:40] for this, let's actually go through a
[05:32:42] use case. So consider that you own an
[05:32:44] e-commerce firm and you want to increase
[05:32:47] the traffic on your site. So you decide
[05:32:49] to give special discounts and you
[05:32:51] basically start an end season sale and
[05:32:53] for this end season sale to work
[05:32:55] properly you need to have a very good
[05:32:57] marketing strategy and you have
[05:32:59] different options available with you. So
[05:33:01] to market this end season sale you have
[05:33:03] different platforms such as the Google
[05:33:05] ads, personal emails, YouTube ads,
[05:33:07] LinkedIn and so on. Now you can either
[05:33:10] use a single platform to do all of your
[05:33:13] marketing or you can use a combination
[05:33:15] of all of these platforms for your
[05:33:17] marketing. So as an owner your priority
[05:33:20] should be to invest as little as
[05:33:23] possible and earn the maximum
[05:33:25] profit. And for a human to do all of
[05:33:28] this analysis that could be a bit too
[05:33:30] complex and if the owner decides to
[05:33:32] solve it with the help of a single layer
[05:33:34] perceptron even that wouldn't be
[05:33:36] possible. So let's see what can be done
[05:33:38] over here. So these are all of the
[05:33:40] different marketing platforms which are
[05:33:42] available and each of the marketing
[05:33:44] platform has its own advantages and
[05:33:47] disadvantages. Now all of these factors
[05:33:50] have to be considered properly before we
[05:33:53] come up with the optimal solution and
[05:33:55] that is just not possible with the help
[05:33:57] of a single layer perceptron. So again
[05:34:00] we have the same thing over here. We
[05:34:02] have all of these different marketing
[05:34:04] platforms and finding out the right
[05:34:06] combination of all of these
[05:34:08] marketing platforms to come up with the
[05:34:11] perfect solution wouldn't be possible
[05:34:13] with a single layer perceptron. What I'm doing is
[05:34:15] I'm creating three numpy arrays x1, x2
[05:34:18] and x3. So x1 would basically be a numpy
[05:34:21] array which consists of 10 values in the
[05:34:24] range of 3 to 12. And then we have x2
[05:34:27] which is again a numpy array which would
[05:34:29] again have 10 random values in the range
[05:34:30] of 9 and 18. And then we have x3 which
[05:34:33] is another numpy array which again
[05:34:35] consists of 10 random values in the
[05:34:38] range of 12 to 20. So we have x1 x2 and
[05:34:41] x3. And similarly we'll also create y
[05:34:44] which is our actual output which would
[05:34:46] be 7 × x1 + 5 × x2 + 4 × x3.
[05:34:51] So we have our input values and we also
[05:34:53] have the actual output with
[05:34:55] us. I'll click on run. Right now let me
[05:34:58] just print out the values of x1, x2 and
[05:35:00] x3. So this is x1, x2 and x3. So x1 is
[05:35:04] the numpy array which consists of 10
[05:35:06] values between 3 and 12. x2 is the numpy
[05:35:08] array which consists of 10 values
[05:35:10] between 9 and 18. And x3 is the numpy
[05:35:13] array which consists of 10 values
[05:35:14] between 12 and 20. Now let me go ahead
[05:35:18] and find out the shape of this numpy
[05:35:20] array.
[05:35:21] Right? So the shape of this numpy array
[05:35:23] x1 comes out to be 10. Now let me also
[05:35:26] have a glance at the output which is y.
[05:35:29] So this is my actual output which is 176
[05:35:33] 156 146 and so on. So I have my input
[05:35:37] values with me and I also have the
[05:35:39] actual output which I want. Now what
[05:35:42] I'll do is I'm going to reshape this. So
[05:35:45] I'll use the function y.reshape and
[05:35:48] I will pass in the shape over here. So
[05:35:50] the original y
[05:35:53] is actually a numpy array where all of
[05:35:55] the values are present in one single
[05:35:57] row. But instead of having all of them
[05:35:59] in one single row, I want to have them
[05:36:02] in 10 separate rows. So I'll use the
[05:36:05] y.reshape function and pass in the
[05:36:07] shape over here. So the
[05:36:09] shape basically consists of the number
[05:36:11] of rows and then I just want one column.
[05:36:14] So it is the number of rows, comma, one. So this tells me
[05:36:16] that y comprises 10 values and all of
[05:36:20] those values are present in different
[05:36:22] rows. So I have my inputs and the output
[05:36:24] value with me. So now these inputs are
[05:36:27] actually individual values which are x1,
[05:36:30] x2 and x3. So I'll combine all three of
[05:36:32] them into a single numpy array. So I'll
[05:36:35] use the function np.array and pass in
[05:36:38] x1, x2 and x3 and I'll store that in
[05:36:40] capital X and I'll print it out. Right?
[05:36:43] So this becomes my new input. Now again
[05:36:45] I'll transpose this. So X.T would
[05:36:48] help me in transposing this numpy array
[05:36:50] and I'll store it back to capital x.
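The data preparation described above can be sketched as follows. The exact calls used on screen are not visible in the narration, so the seeded generator `rng` and the use of `rng.integers` are assumptions; only the value ranges, the formula for `y`, the reshape and the transpose come from the walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)        # assumed seed, only for repeatable output

# Three input arrays of 10 integer values each (upper bounds are exclusive).
x1 = rng.integers(3, 13, size=10)     # values between 3 and 12
x2 = rng.integers(9, 19, size=10)     # values between 9 and 18
x3 = rng.integers(12, 21, size=10)    # values between 12 and 20

# Actual output: y = 7*x1 + 5*x2 + 4*x3
y = 7 * x1 + 5 * x2 + 4 * x3
print(x1.shape)                       # (10,)

# Turn y from one row of 10 values into 10 rows of one column.
y = y.reshape(y.shape[0], 1)

# Stack the inputs into one array (3 rows, 10 columns), then transpose it.
X = np.array([x1, x2, x3])
X = X.T                               # 10 rows, 3 columns
print(X.shape, y.shape)               # (10, 3) (10, 1)
```

After the transpose each row of `X` holds one sample and each column one input feature, which is the layout a matrix product with a 3 × 1 weight matrix expects.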
[05:36:52] I'll click on run again. So this numpy
[05:36:54] array which initially had three rows and
[05:36:57] 10 columns. When we transpose it, it
[05:37:00] becomes a numpy array with 10 rows and
[05:37:02] three columns. Right? So now that we
[05:37:05] have our inputs and the actual output
[05:37:06] value, this is where we go ahead and set
[05:37:08] the learning rate and also initialize
[05:37:10] the random weight values. So over here I
[05:37:13] am setting the learning rate to be
[05:37:14] 0.00001
[05:37:16] and after learning rate I'd have to
[05:37:18] initialize the weight values. So I am
[05:37:20] basically creating a numpy array with
[05:37:22] three values in it and those three
[05:37:24] values would be generated randomly. And
[05:37:26] to generate random values I'll just use
[05:37:28] this np.random.rand
[05:37:31] and I'll pass in one. So one basically
[05:37:33] means that it'll generate one random
[05:37:34] value. I'll click on run. So this is our
[05:37:38] weight matrix which is basically a 3 × 1
[05:37:40] matrix and it comprises these values.
[05:37:43] So now we have the input values the
[05:37:45] actual output the learning rate and the
[05:37:47] initial weight values. Now I'll go ahead
[05:37:50] and start the feed forward propagation.
[05:37:52] So again this time the number of epochs
[05:37:54] is 50 and to get the predicted values
[05:37:57] I'd have to use the matmul
[05:37:59] function. What I'm doing is basically
[05:38:01] multiplying the input matrix with the
[05:38:03] weight matrix. So the input matrix is
[05:38:05] present in capital X and the weight
[05:38:07] matrix is present in w and I can
[05:38:10] multiply these two with the help of
[05:38:12] np.matmul and I'll store the result in
[05:38:15] y_pred. So again as I've already stated
[05:38:17] the initial predicted values wouldn't be
[05:38:19] equal to the actual values and if you
[05:38:22] have to find out the error in prediction
[05:38:24] we'll just subtract y_pred from the actual
[05:38:26] y values. So error would be equal to y
[05:38:29] minus y_pred and since you want the
[05:38:31] squared error I'll take the square
[05:38:33] of it. So we've done the forward
[05:38:36] propagation and we've found out the
[05:38:38] error in prediction. So now that we have
[05:38:40] the error in prediction we'd have to find
[05:38:42] out the change in error with respect to
[05:38:45] the weights. So now that the forward
[05:38:46] propagation is done and we've also
[05:38:48] calculated the error in prediction, this
[05:38:50] is where we go back, that is, this is where
[05:38:52] we back propagate and find out the
[05:38:55] change in error with respect to each of
[05:38:57] the weights. So we have to calculate the
[05:38:59] change in error with respect to weight
[05:39:00] one change in error with respect to
[05:39:02] weight two and change in error with
[05:39:04] respect to weight three. So here we are
[05:39:06] finding out the change in error with
[05:39:08] respect to weight one. So when we take
[05:39:10] the partial derivative of the squared error with
[05:39:12] respect to weight one this is what we
[05:39:13] get. So two times np.matmul of
[05:39:17] what we have over here, y - y_pred,
[05:39:19] of which we take a transpose, and
[05:39:22] we'd have to multiply this with minus X
[05:39:25] of zero, or the first input. So we are
[05:39:28] basically multiplying this error term
[05:39:31] with the first input.
[05:39:33] Similarly when it comes to the change in
[05:39:34] error with respect to w2, we'll multiply
[05:39:37] this with the second input.
[05:39:39] And when it comes to change in error
[05:39:40] with respect to w3, we'll multiply it
[05:39:42] with the third input over here. Now
[05:39:44] since we have calculated the gradients
[05:39:46] with respect to w1, w2 and w3, we can go
[05:39:50] ahead and update these three weights. So
[05:39:53] the first weight which is present in w
[05:39:55] 0. So we are basically updating the
[05:39:57] weight matrix over here. So in the
[05:39:58] weight matrix, we are supposed to update
[05:40:00] the first weight value, and to update
[05:40:04] the weight value we just have to use
[05:40:05] the simple gradient descent algorithm.
[05:40:07] So we'll take the old weight value
[05:40:10] minus the learning rate times the
[05:40:12] gradient of the error with respect to
[05:40:14] weight one. Similarly when we want to
[05:40:17] update the second weight this will be
[05:40:18] the old weight minus the learning rate
[05:40:21] into gradient of error with respect to
[05:40:24] w2. Similarly for the third weight it
[05:40:26] will be the old third weight minus the
[05:40:28] learning rate into change in error with
[05:40:31] respect to w3. So now the back
[05:40:33] propagation is done and the
[05:40:35] weights have also been updated. So what
[05:40:37] I'll do is I'll keep on adding all of
[05:40:40] the errors calculated through each of
[05:40:42] the back propagation iterations and I'll
[05:40:46] also print the value of w1 the value of
[05:40:48] w2 the value of w3 and error at each
[05:40:52] back propagation or error at each epoch.
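The loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the code shown on screen: the data is regenerated with an assumed seeded generator `rng`, `rng.random` is swapped in for `np.random.rand`, and one summed error per epoch is stored in `errors`. The learning rate, the 50 epochs, the `np.matmul` forward pass and the per-weight gradient expression follow the narration.

```python
import numpy as np

rng = np.random.default_rng(0)                  # assumed seed, for repeatability
x1 = rng.integers(3, 13, size=10)
x2 = rng.integers(9, 19, size=10)
x3 = rng.integers(12, 21, size=10)
X = np.array([x1, x2, x3]).T.astype(float)      # 10 rows, 3 columns
y = (7 * x1 + 5 * x2 + 4 * x3).reshape(-1, 1).astype(float)

learning_rate = 0.00001
epochs = 50
w = rng.random((3, 1))                          # three random initial weights
errors = []

for epoch in range(epochs):
    # Forward propagation: predictions and squared error per sample.
    y_pred = np.matmul(X, w)
    residual = y - y_pred
    error = residual ** 2

    # Back propagation: d(error)/d(w_k) = 2 * (y - y_pred)^T . (-x_k)
    grad_w1 = 2 * np.matmul(residual.T, -X[:, 0])
    grad_w2 = 2 * np.matmul(residual.T, -X[:, 1])
    grad_w3 = 2 * np.matmul(residual.T, -X[:, 2])

    # Gradient descent: new weight = old weight - learning rate * gradient
    w[0] = w[0] - learning_rate * grad_w1
    w[1] = w[1] - learning_rate * grad_w2
    w[2] = w[2] - learning_rate * grad_w3

    errors.append(float(error.sum()))
    print(epoch, w.ravel(), errors[-1])
```

With a learning rate this small the summed error shrinks on every epoch. The exact numbers depend on the random data and initial weights, so they will not match the figures quoted in the video.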
[05:40:56] Now let me print this out and let's see
[05:40:57] what we get. So this is the value of w1,
[05:41:00] the value of w2, the value of w3, and this over
[05:41:03] here what we have is the error matrix
[05:41:06] right. So initially the values predicted
[05:41:07] had an error of 39,176
[05:41:11] 30,733
[05:41:13] 26,158
[05:41:15] and so on for those 10 values. So this
[05:41:18] was the case of the first iteration. So
[05:41:19] in the second iteration the error
[05:41:22] became 2.62e+02. You can
[05:41:25] read this as 2.62 × 10^2, and 2.62
[05:41:28] × 10^2 is 262. So from 39,176
[05:41:34] the error came down to 262. Similarly
[05:41:37] over here the error was 30,733.
[05:41:41] So from 30,733
[05:41:44] the error came down to 200. Right? So
[05:41:47] this is how the error, which started at
[05:41:49] 39,176,
[05:41:51] kept on decreasing and finally
[05:41:54] came down to two. Right? So in 50
[05:41:57] iterations over here or in 50 epochs the
[05:42:01] error has reduced by a very great
[05:42:03] margin. Now if we also want to have a
[05:42:05] look at this in a graphical way I can
[05:42:07] just plot the error list which we had
[05:42:09] obtained through back propagation. So
[05:42:11] I'll just plot it out. So we see that as
[05:42:13] the number of iterations increases the
[05:42:16] error decreases significantly. So this
[05:42:18] would be maybe the third iteration or
[05:42:20] the fourth iteration. So here as we saw
[05:42:22] the mean error it was somewhere around
[05:42:24] 40,000. So from 40,000 it came crashing
[05:42:28] down to around 10 in the second or the
[05:42:30] third iteration. So to get the global
[05:42:33] minimum error and to get the optimal
[05:42:35] weights value we'd have to increase the
[05:42:37] number of epochs a bit more and that'll
[05:42:39] give us the best result. Now again let's
[05:42:41] understand back propagation in a better
[05:42:43] way. Now let's say we have an image of a
[05:42:45] man over here and we give it as an input
[05:42:47] to this neural network. So this neural
[05:42:50] network will process it and then give us
[05:42:53] an output. But what happens with this
[05:42:55] neural network in the feed forward
[05:42:57] propagation mode is it'll take the input
[05:43:00] it'll do a bit of processing and it'll
[05:43:02] give us an output. But the output tells
[05:43:05] us that this is a gorilla instead of a
[05:43:08] smiling man and that is actually an
[05:43:10] incorrect output. So to get the right
[05:43:12] result we have to back propagate and
[05:43:14] update all of these weights over here.
[05:43:17] So we give the input, we feed forward it
[05:43:20] and then we get an output and the output
[05:43:22] given is actually a gorilla. So what we
[05:43:24] do is we back propagate, update all of
[05:43:27] these weights on all of these links
[05:43:29] over here and when we get the optimal
[05:43:31] set of all of these weights we'll get
[05:43:34] the final correct result which tells us
[05:43:36] that the input is actually of a smiling
[05:43:38] human who's wearing glasses. Now let's
[05:43:41] again take another example. So over here
[05:43:44] we have the input which is 0 1 and 2 and
[05:43:47] our desired output would be 0 2 and 4.
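As a minimal sketch of this setup, the model here is assumed to be a single weight multiplied by the input, and `squared_errors` is a hypothetical helper name. The candidate weights 3, 4 and 2 are the ones tried in the walkthrough that follows.

```python
inputs = [0, 1, 2]
desired = [0, 2, 4]

def squared_errors(weight):
    """Squared error per input for the one-weight model output = weight * input."""
    return [(d - weight * x) ** 2 for x, d in zip(inputs, desired)]

for weight in (3, 4, 2):
    outputs = [weight * x for x in inputs]
    print(weight, outputs, squared_errors(weight))
```

Since 2 × 1 = 2 and 2 × 2 = 4, a weight of 2 reproduces the desired outputs exactly. Moving the weight up from 3 to 4 raises the squared error, and moving it down towards 2 lowers it.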
[05:43:51] So let's say initially we set the
[05:43:54] weight value to three over here. So
[05:43:56] when we set the weight value to
[05:43:58] three, the output which we get is 0 for
[05:44:01] the first input three for the second
[05:44:03] input and six for the third input. But
[05:44:06] then again these are not our desired
[05:44:07] outputs. So let's actually find out the
[05:44:10] error in prediction. So for the first
[05:44:12] input the error is actually zero because
[05:44:14] the desired output is zero and the
[05:44:16] actual output is also zero. The second
[05:44:18] input the desired output is two and the
[05:44:21] output given by the model is three. So
[05:44:24] this time the absolute error is one and
[05:44:27] again for the third input the desired
[05:44:29] output is four and the output given by
[05:44:31] the model is six. So this time the error
[05:44:33] in prediction is two. Now instead of
[05:44:36] getting the absolute error we'd actually
[05:44:38] have to work with a square error. So
[05:44:40] we'll square this up. So 0 is 0. Square
[05:44:42] of 1 is again one and square of two
[05:44:44] becomes four. Now what we'll do is
[05:44:46] instead of keeping the weight as three,
[05:44:49] we'll update the weight and change it to
[05:44:51] four. Now when we update the weight
[05:44:54] value to four, we'll get these outputs.
[05:44:57] So for input zero, the output is zero.
[05:44:59] For input one, the output is four. And
[05:45:02] for the input two, the output is 8. So
[05:45:05] this time the squared errors are very
[05:45:06] large. This time the squared errors are
[05:45:08] 0 4 and 16. So we see that as we
[05:45:12] increase the weights the error also
[05:45:15] increases. So initially when the weight
[05:45:17] was equal to 3 the squared error was 0 1
[05:45:20] and 4 and then when we updated the
[05:45:22] weight and increase it to four the
[05:45:24] squared error became 0 4 and 16. So now
[05:45:28] what we'll do is instead of increasing
[05:45:29] the weight we'll actually decrease the
[05:45:31] value of weight. So from four we'll
[05:45:34] bring it down to two. And when we bring
[05:45:36] the weight down to two, these are the final
[05:45:38] outputs which we get. For input zero,
[05:45:41] the output is zero. For input one, the
[05:45:43] output is two. And for input two, the
[05:45:45] output is four. So this time we see that
[05:45:48] the squared errors are 0, 0, and 0. And
[05:45:52] the squared error has actually
[05:45:54] decreased. So for this case, what we've
[05:45:56] seen is as we increase the weight, the
[05:45:59] error also increased. And as we decrease
[05:46:02] the weight the error also decreased. So
[05:46:05] this is how the back propagation
[05:46:07] algorithm works. Now let's go through
[05:46:09] this graph to understand back
[05:46:11] propagation in a better way. So we have
[05:46:13] this graph over here and our aim is to
[05:46:17] minimize the error and to minimize the
[05:46:19] error we can either increase the weight
[05:46:21] or decrease the weight. So let's take
[05:46:23] this side of the graph. So if you take
[05:46:25] this side of the graph then the tangent
[05:46:28] which is passing over here that would
[05:46:31] make a positive slope with this axis and
[05:46:34] since it makes a positive slope with
[05:46:35] this axis over here. So if we decrease
[05:46:38] the weight then the error would decrease
[05:46:40] and if we increase the weight the error
[05:46:42] would also increase. So similarly if you
[05:46:45] take this side over here then the
[05:46:47] tangent with this graph over here that
[05:46:50] would make a negative slope and since
[05:46:52] the tangent would give us a negative
[05:46:54] slope that would mean it is inversely
[05:46:56] proportional. So if we start from this
[05:46:58] side then if we increase the weight then
[05:47:00] the error would decrease but if we
[05:47:03] decrease the weight then the error
[05:47:05] would increase. So basically the task of
[05:47:07] the back propagation algorithm is to
[05:47:08] reach this global loss minimum. And to
[05:47:11] reach the global loss minimum, the back
[05:47:13] propagation algorithm could either
[05:47:14] decrease the weight or increase the
[05:47:16] weight. And if it's on the right side
[05:47:18] over here, then it has to decrease the
[05:47:19] weight because it is directly
[05:47:20] proportional. And if it is on the left
[05:47:22] side over here, then it has to increase
[05:47:24] the weight because it is inversely
[05:47:26] proportional. So this is how the back
[05:47:28] propagation algorithm works. So now that
[05:47:30] we've understood the theory behind back
[05:47:32] propagation, let's understand the math
[05:47:34] behind it. So again we have this neural
[05:47:36] network over here and these are the
[05:47:38] inputs. So for the given inputs 0.05 and 0.10
[05:47:42] we'd have to get the outputs 0.01 and 0.99.
[05:47:46] And to get this result we'd have to go
[05:47:48] through two steps. First would be the
[05:47:50] forward pass where we'll go from the
[05:47:52] input layer to the output layer and the
[05:47:54] second would be the backward pass where
[05:47:56] we'll go from the output layer to the
[05:47:59] input layer and update all of these
[05:48:01] initially random weights. So let's just
[05:48:04] jump into all of the math. Now in the
[05:48:06] forward pass, what we have to do is go
[05:48:08] from the first layer to the second
[05:48:11] layer. So over here we just have to
[05:48:13] multiply all of the weights with the
[05:48:15] inputs. So W1 is multiplied with I1, W2
[05:48:19] is multiplied with I2 and the bias is
[05:48:21] multiplied with one and you then add
[05:48:23] them up. So after multiplying all of the
[05:48:26] weights with the corresponding inputs
[05:48:28] that is W1 into I1 plus W2 into I2 and
[05:48:31] then we have bias. So multiplying this
[05:48:33] bias with one we get this. So we have a
[05:48:36] result of 0.3775.
[05:48:38] Now this is just a linear value. Now we
[05:48:42] have to pass this linear value inside an
[05:48:45] activation function. This activation
[05:48:47] function which you see this is the
[05:48:48] sigmoid function. So 1 upon (1 + e to the power
[05:48:51] minus net of h1). So this is the net of
[05:48:53] h1 which we've calculated. So h1 is the
[05:48:56] first node in the hidden layer and we've
[05:48:59] calculated this value and we've passed
[05:49:01] this inside this sigmoid function. Now
[05:49:04] this activation function will give us a
[05:49:06] value of 0.59.
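The calculation for the first hidden node can be sketched as below. The narration does not state the weights, so `w1 = 0.15`, `w2 = 0.20` and `b1 = 0.35` are assumptions, chosen because they reproduce the quoted net value of 0.3775 when the inputs are read as 0.05 and 0.10.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10               # inputs
w1, w2, b1 = 0.15, 0.20, 0.35     # assumed values, not stated in the narration

# Linear part: weights times inputs, plus the bias multiplied by one.
net_h1 = w1 * i1 + w2 * i2 + b1 * 1

# Pass the linear value through the activation function.
out_h1 = sigmoid(net_h1)

print(round(net_h1, 4), round(out_h1, 2))   # 0.3775 0.59
```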
[05:49:08] So similarly we'll do the same procedure
[05:49:10] for h2. We'll calculate the linear value
[05:49:13] first and then we'll pass that through
[05:49:15] the sigmoid activation function and then
[05:49:17] we'll get the value for h1 and h2. So
[05:49:20] we've done the calculation and then
[05:49:22] we've got the values for h1 and h2. So
[05:49:25] net H1 comes out to be 0.3775,
[05:49:28] which gives an output of 0.59 for H1, and the output of H2 is obtained in the same way.
[05:49:31] So once this is done, we'd have to find
[05:49:34] out the value for O1 and O2. So this
[05:49:37] time what we'll do is again multiply the
[05:49:39] corresponding weights with the
[05:49:41] corresponding outputs of H1 and H2. So
[05:49:44] what we'll do is multiply W5
[05:49:46] with the output of H1 and W6 with the output of H2 and also multiply
[05:49:49] the bias with one. This is how we can
[05:49:52] get the value for O1. And we'll do
[05:49:53] the same procedure for O2 as well.
[05:49:58] So W5 into output of H1 plus W6 into
[05:50:01] output of H2 plus B2 into 1. This gives
[05:50:04] us a value. And again this value is
[05:50:06] passed through an activation function
[05:50:08] which is basically the sigmoid function.
[05:50:10] And this gives us a value between 0 and
[05:50:12] 1 which is 0.75.
[05:50:14] So we've got the final output for O1.
[05:50:17] After this we'd also have to find out
[05:50:19] the final value for O2. So we'll do the
[05:50:21] same thing multiply the corresponding
[05:50:23] weights with the corresponding inputs
[05:50:25] and we'll pass down that linear equation
[05:50:27] through the activation function and
[05:50:29] we'll get an output value of 0.77.
[05:50:33] So we've got the value for output of O1
[05:50:35] and output of O2. Now we'd have to find
[05:50:38] out the error in prediction and this is
[05:50:41] the formula to calculate the error in
[05:50:43] prediction. So we'll just
[05:50:52] subtract the output from the target,
[05:50:54] square it up and take half of it. So this was our
[05:50:55] target value which was 0.01 and this is
[05:50:58] the predicted value which is 0.75. So
[05:51:01] we'll subtract the 0.75 from the 0.01,
[05:51:05] square it up and halve it, and we'll
[05:51:07] get a result of 0.27.
[05:51:10] So this is the error in prediction for
[05:51:12] the O1 node. So the error in prediction
[05:51:14] for the O1 node is 0.27. Similarly, we do
[05:51:18] the same process and get the error in
[05:51:19] prediction for O2. So the error in
[05:51:21] prediction for O2 comes out to be
[05:51:23] 0.0235.
[05:51:25] Now to calculate the total error, we'd
[05:51:29] have to add up the error for O1 and the
[05:51:31] error for O2. And when we add up these
[05:51:34] two, we get a total error of 0.29.
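The whole forward pass and the error calculation can be sketched in NumPy as below. Only the inputs, the targets and the initial value of W5 (0.4) are stated in the walkthrough. Every other weight and both biases are assumptions, picked because they lead to figures close to the quoted ones.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

inputs = np.array([0.05, 0.10])
targets = np.array([0.01, 0.99])

# Assumed initial parameters (only w5 = 0.40 is given in the narration).
W_hidden = np.array([[0.15, 0.20],     # w1, w2 feed h1
                     [0.25, 0.30]])    # w3, w4 feed h2
b1 = 0.35
W_output = np.array([[0.40, 0.45],     # w5, w6 feed o1
                     [0.50, 0.55]])    # w7, w8 feed o2
b2 = 0.60

# Forward pass: input layer -> hidden layer -> output layer.
out_h = sigmoid(W_hidden @ inputs + b1)
out_o = sigmoid(W_output @ out_h + b2)

# Error per output node: half of (target - output) squared, then the total.
errors = 0.5 * (targets - out_o) ** 2
total_error = float(errors.sum())

print(np.round(out_o, 4), np.round(errors, 4), round(total_error, 4))
```

Under these assumptions the outputs come to roughly 0.7514 and 0.7729, the two errors to roughly 0.2748 and 0.0236, and the total to roughly 0.2984. The figures quoted in the narration appear to be truncated rather than rounded in places.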
[05:51:38] So now we're done with feed forwarding
[05:51:40] the input and then we've predicted the
[05:51:42] result. And we've also calculated the
[05:51:44] error in prediction. So this is where
[05:51:46] the step two starts. So step two is
[05:51:49] basically backward propagation where we
[05:51:51] start from the output layer and we keep
[05:51:53] heading back and keep on updating the
[05:51:56] values. So now we'll see how can we
[05:51:58] update the weight five. So as I've
[05:52:01] already stated to update the weights
[05:52:03] we'd have to find out the change in
[05:52:06] error with respect to the change in
[05:52:08] weight. So we'd have to take the partial
[05:52:10] differentiation of total error with
[05:52:13] respect to W5. So this can be written as
[05:52:17] partial differentiation of E total with
[05:52:19] respect to output O1 into partial
[05:52:22] differentiation of output O1 with
[05:52:24] respect to net O1 into partial
[05:52:26] differentiation of net O1 with respect
[05:52:28] to W5 that is this entire process. So
[05:52:33] partial differentiation of E total with
[05:52:36] respect to output O1, partial
[05:52:38] differentiation of output O1 with
[05:52:40] respect to net O1, and partial
[05:52:42] differentiation of net O1 with respect to
[05:52:45] W5, and when we multiply these we'll get the
[05:52:48] final result. So we'll start with this
[05:52:50] formula over here. So we have to find
[05:52:53] out the partial differentiation of E
[05:52:56] total with respect to output O1. So E
[05:52:59] total is basically this value over here
[05:53:02] which is basically half into (target O1
[05:53:04] minus output O1) squared plus half into
[05:53:08] (target O2 minus output O2) squared. So
[05:53:11] over here since we are differentiating
[05:53:14] it with respect to output O1, the entire second term
[05:53:17] would be a constant and would turn
[05:53:19] to zero. And when we differentiate the first term,
[05:53:21] it would become minus of (target O1
[05:53:24] minus output O1). And we'll substitute
[05:53:27] the values for target O1 and output O1
[05:53:30] and we'll get the value of 0.74.
[05:53:33] So the result for this would be 0.74.
[05:53:37] And then we'll come to the second term
[05:53:38] over here which is partial
[05:53:40] differentiation of output O1 with
[05:53:41] respect to net O1. So we know that output
[05:53:44] O1 is 1 upon (1 + e to the power minus net O1).
[05:53:48] And this is nothing but the sigmoid
[05:53:50] function. And when you differentiate the
[05:53:52] sigmoid function you basically get the
[05:53:54] same value multiplied by 1 minus the same
[05:53:58] value. So over here this output O1 after
[05:54:02] differentiating would become output O1
[05:54:05] into 1 minus output O1. So this is 0.75
[05:54:10] into 1 - 0.75
[05:54:12] and this gives us a value of 0.18. So
[05:54:16] we've got the value for the first term
[05:54:17] which is 0.74. We also got the value for
[05:54:19] the second term which is 0.18. And then
[05:54:22] we'll head on to the third term which is
[05:54:24] differentiation of net O1 with respect to
[05:54:27] w5. And we have net O1 over here. Net O1
[05:54:30] is w5 into output h1 plus w6 into output
[05:54:34] h2 + b2 into 1. So over here since we
[05:54:38] are differentiating this with respect to
[05:54:40] w5, the other terms would be constant and
[05:54:43] would turn to zero. And when we
[05:54:45] differentiate the first term, it would become
[05:54:47] output of h1 into 1. So this
[05:54:51] basically is output of h1, and for the output of
[05:54:54] H1 we already know the value which is
[05:54:56] basically 0.59.
[05:54:59] So we have the values for all three
[05:55:01] terms and all we have to do is multiply
[05:55:04] these three terms then we'll get the
[05:55:06] partial differentiation of e total with
[05:55:09] respect to w5. So when we multiply these
[05:55:11] three terms we get a value of 0.08.
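The three chain-rule terms can be multiplied out in a few lines. This sketch uses only the rounded values quoted above (target 0.01, output of O1 0.75, output of H1 0.59); the variable names are illustrative.

```python
target_o1 = 0.01
out_o1 = 0.75     # output of O1 from the forward pass, as quoted
out_h1 = 0.59     # output of H1 from the forward pass, as quoted

# Term 1: d(E_total)/d(out_o1) = -(target - output)
dE_dout = -(target_o1 - out_o1)

# Term 2: d(out_o1)/d(net_o1), the sigmoid derivative out * (1 - out)
dout_dnet = out_o1 * (1 - out_o1)

# Term 3: d(net_o1)/d(w5) = output of h1
dnet_dw5 = out_h1

grad_w5 = dE_dout * dout_dnet * dnet_dw5
print(round(dE_dout, 2), round(grad_w5, 2))   # 0.74 0.08
```

With these rounded inputs the second term evaluates to 0.1875, which the narration quotes as 0.18, and the product is about 0.0819.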
[05:55:15] Now we've got the gradient.
[05:55:22] After we get the gradient, we'd have to pass this
[05:55:24] through the gradient descent formula. So
[05:55:26] what you see here is the gradient
[05:55:28] descent formula. So this is the new
[05:55:29] weight. This is the old weight. This η (eta)
[05:55:31] which you see is nothing but the
[05:55:33] learning rate and this is the gradient
[05:55:35] which you've just calculated. So new
[05:55:37] weight is equal to old weight minus the
[05:55:40] learning rate into the gradient.
[05:55:42] Since we've already got the gradient,
[05:55:43] which is 0.08, what we'll do is
[05:55:46] we'll just multiply this gradient with
[05:55:48] the learning rate which is 0.5. Again
[05:55:50] this learning rate is an arbitrary
[05:55:52] value, so you'd have to use trial and
[05:55:54] error to get the best value for
[05:55:56] it. So it is 0.5 into this gradient, and you will
[05:55:58] subtract this from the initial weight
[05:56:01] value of W5. The initial weight of W5 is
[05:56:04] 0.4. So when you subtract this value
[05:56:06] from 0.4 you'll get a value of 0.35.
[05:56:10] So initially W5 was 0.4 and after back
[05:56:15] propagating its weight came down to
[05:56:17] 0.35.
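The update step itself is a one-line formula. In this sketch `gradient_descent_step` is a hypothetical helper name; the old weight (0.4), the learning rate (0.5) and the rounded gradient (0.08) are the values quoted above.

```python
def gradient_descent_step(old_weight, gradient, learning_rate):
    """new weight = old weight - learning rate * gradient"""
    return old_weight - learning_rate * gradient

w5_new = gradient_descent_step(old_weight=0.4, gradient=0.08, learning_rate=0.5)
print(round(w5_new, 2))
```

With the gradient rounded to 0.08 the result is 0.36. Keeping the gradient unrounded gives about 0.359, which the narration quotes as 0.35.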
[05:56:18] So we'll do the same procedure for the
[05:56:21] rest of the weights. Now when we apply
[05:56:23] the same procedure,
[05:56:31] the new W6 value comes to 0.40, the new W7
[05:56:34] value comes to 0.51 and the new W8 value
[05:56:37] comes to 0.56.
[05:56:39] So we are done with the first pass of
[05:56:42] backward propagation. After this we have
[05:56:44] to go through the second pass and
[05:56:46] calculate W1, W2, W3 and W4. That is,
[05:56:50] in the first pass of gradient
[05:56:52] descent we calculated these four values.
[05:56:56] Now we again have to go through another
[05:56:58] pass and then we'd have to update these
[05:57:01] four values over here which are W1, W2,
[05:57:03] W3 and W4. So this time if we have to
[05:57:06] update W1 then we'd have to find out the
[05:57:09] change in total error with respect to
[05:57:11] the change in W1. That is we'd have to
[05:57:14] get the partial derivative of E total
[05:57:16] with respect to W1. And this can be
[05:57:18] represented as partial differentiation
[05:57:20] of E total with respect to output H1
[05:57:23] into partial derivative of out H1 with
[05:57:26] respect to net H1 into partial
[05:57:28] derivative of net H1 with respect to W1.
[05:57:32] So this is how it can be understood. So
[05:57:34] let's start with the first term over
[05:57:35] here. So if we take this first term over
[05:57:37] here then H1 over here it contributes to
[05:57:41] both the output of O1 and O2 and that is
[05:57:44] why it can be split into two parts. So
[05:57:46] partial derivative of E total with
[05:57:48] respect to output of H1 becomes partial
[05:57:51] derivative of EO1 with respect to out H1
[05:57:53] plus partial derivative of EO2 with
[05:57:56] respect to out H1. Right? So this is
[05:57:59] what we are doing over here. So if you
[05:58:02] want to find out the partial derivative
[05:58:04] of e total with respect to w1, we can
[05:58:07] get it by doing the partial derivative
[05:58:10] of e total with respect to out h1 and
[05:58:13] then we'd have to get the partial
[05:58:15] derivative of out h1 with respect to net
[05:58:18] h1 and then we'd have to get the partial
[05:58:20] derivative of net h1 with respect to w1.
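Written out in symbols, the chain rule described here looks as follows. The subscripted names are only a notation for the quantities named in the video (E total, out H1, net H1, EO1, EO2), not something shown in the source.

```latex
\frac{\partial E_{total}}{\partial w_1}
  = \frac{\partial E_{total}}{\partial out_{h1}}
    \cdot \frac{\partial out_{h1}}{\partial net_{h1}}
    \cdot \frac{\partial net_{h1}}{\partial w_1},
\qquad
\frac{\partial E_{total}}{\partial out_{h1}}
  = \frac{\partial E_{o1}}{\partial out_{h1}}
  + \frac{\partial E_{o2}}{\partial out_{h1}}
```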
[05:58:23] Now when it comes to the first term, E
[05:58:26] total can be represented as EO1 + EO2.
[05:58:30] And this term over here can be broken
[05:58:32] down as partial derivative of EO1 with
[05:58:35] respect to out H1 plus partial
[05:58:37] derivative of EO2 with respect to out
[05:58:40] H1. Right? Now let's understand the math
[05:58:43] over here. So this is the first term
[05:58:45] which we have over here. Partial
[05:58:47] derivative of E total with respect to
[05:58:49] out H1. And as you already saw this can
[05:58:52] be represented as partial derivative of
[05:58:53] EO1 with respect to out H1 plus partial
[05:58:56] derivative of EO2 with respect to out H1.
[05:58:59] Now with respect to this equation we'll
[05:59:01] take the first term. So this first term
[05:59:04] over here can be represented like this.
[05:59:07] So partial derivative of EO1 with
[05:59:09] respect to out H1 can be represented as
[05:59:11] partial derivative of EO1 with respect
[05:59:14] to net O1 into partial derivative of
[05:59:17] net O1 with respect to out H1. So these
[05:59:20] two would be canceled out and we'll get
[05:59:22] what is there on the LHS. Now again
[05:59:25] we'll take the first term over here and
[05:59:27] this first term can again be represented
[05:59:29] like this. So partial derivative of EO1
[05:59:31] with respect to net O1 can be
[05:59:32] represented as partial derivative of EO1
[05:59:35] with respect to out O1 into partial
[05:59:38] derivative of out O1 with respect to net
[05:59:40] O1. So these two would be canceled out
[05:59:42] and we'll be getting what is there on
[05:59:44] the LHS. So we have already calculated
[05:59:46] these two values in the first backward
[05:59:49] pass. So the value for this would be
[05:59:50] 0.74 and the value for this would be
[05:59:52] 0.18. And when we multiply these two
[05:59:55] we'll get a value of 0.13.
[05:59:58] So we've got the value for the first
[05:59:59] term which is 0.13. Now we'd have to get
[06:00:02] the value for the second term. The
[06:00:04] second term is partial differentiation
[06:00:06] of net O1 with respect to out H1. And we
[06:00:09] already know what net O1 is. So net O1
[06:00:12] is w5 into out h1 plus w6 into out h2 +
[06:00:15] b2 into 1. And since we want this
[06:00:18] partial derivative with respect to out h1,
[06:00:20] these two would be constants. And when
[06:00:22] we differentiate this, we'll just be
[06:00:24] left with w5. And the value of w5 is
[06:00:27] 0.40.
[06:00:29] So the value for this term comes out to
[06:00:32] be 0.40. So we've calculated both the
[06:00:34] values for this. So the value for
[06:00:36] the first term is 0.13 and the value for
[06:00:39] the second term is 0.4. And when we
[06:00:42] multiply these two values, we'll get the
[06:00:44] partial differentiation of EO1 with
[06:00:48] respect to out H1, which comes out to
[06:00:50] 0.055.
[06:00:53] And if you repeat the same process for
[06:00:55] EO2 as well, we'll get the partial
[06:00:57] differentiation of EO2 with respect to
[06:01:01] out H1. And this comes out to
[06:01:03] -0.019.
[06:01:06] So we've got partial differentiation of
[06:01:08] EO1 with respect to out H1 and partial
[06:01:11] differentiation of EO2 with respect to
[06:01:13] out H1. And when we add these two up,
[06:01:16] we'll get a final value of 0.036.
[06:01:19] And this is nothing but the partial
[06:01:21] differentiation of E total with respect
[06:01:24] to out H1. So again this is not the
[06:01:27] final result. What we want is partial
[06:01:29] differentiation of E total with respect
[06:01:32] to W1. So we've just got the result for
[06:01:35] the first term. There are two more terms
[06:01:37] over here. The second term is partial
[06:01:39] differentiation of out h1 with respect
[06:01:41] to net h1 and the third is partial differentiation of
[06:01:45] net h1 with respect to w1. So let's
[06:01:48] start with the second term. The second
[06:01:49] term over here is partial
[06:01:50] differentiation of out h1 with respect
[06:01:53] to net h1. And we already know what out
[06:01:55] h1 is. It is this over here which is
[06:01:57] basically the sigmoid function 1 upon 1
[06:02:00] + e power minus net h1. And we know that
[06:02:03] when we differentiate a sigmoid function
[06:02:06] we get the same value into 1 minus the
[06:02:09] same value. So over here this becomes
[06:02:11] output of h1 into 1 minus output of h1.
[06:02:15] So out h1 is 0.59, and this becomes 0.59 into 1 - 0.59. So this
[06:02:20] gives us the value for the second term
[06:02:22] which is 0.24.
[06:02:24] And then finally we have the third term.
[06:02:26] So third term is partial differentiation
[06:02:28] of net h1 with respect to w1. And we
[06:02:31] already know what net h1 is. So net h1
[06:02:33] is w1 into i1 + w3 into i2 + b1 into 1.
[06:02:38] And since we are differentiating this
[06:02:40] with respect to w1, these two terms
[06:02:42] become zero. And all we are left with is
[06:02:45] i1 and i1's value is 0.05.
[06:02:49] So we've got the values for all of the
[06:02:51] three terms. So the value for the first
[06:02:54] term or the value for partial
[06:02:56] differentiation of e total with respect
[06:02:58] to out h1 is 0.036, and the value
[06:03:01] for the partial differentiation of out
[06:03:03] H1 with respect to net H1 is 0.24 and
[06:03:07] the value for partial differentiation of
[06:03:09] net H1 with respect to W1 is 0.05
[06:03:13] and then finally all we have to do is
[06:03:15] multiply these three values to get the
[06:03:18] result for partial differentiation of E
[06:03:20] total with respect to W1 and that comes
[06:03:23] to 0.0004.
[06:03:26] So we've got the gradient with respect
[06:03:28] to W1. And since we want the new weight
[06:03:30] for W1, we have to pass this down to the
[06:03:32] gradient descent formula which is new
[06:03:35] weight is equal to old weight minus the
[06:03:37] learning rate value into the gradient
[06:03:40] with respect to W1. So over here the
[06:03:43] gradient is 0.0004.
[06:03:46] The learning rate is 0.5 and the initial
[06:03:48] value of W1 is 0.15.
[06:03:51] So when we feed in this value over here,
[06:03:54] W1 becomes 0.1497.
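The three-term product for W1 can be checked with a few lines of arithmetic. This is a sketch that plugs in the rounded values quoted in the walkthrough, so the last digits may differ slightly from the figures on the slide; the variable names are made up for illustration.

```python
# Rounded values quoted in the walkthrough.
dEo1_dout_h1 = 0.055    # contribution through output O1
dEo2_dout_h1 = -0.019   # contribution through output O2
out_h1 = 0.59           # sigmoid output of hidden neuron H1
i1 = 0.05               # first input
w1_old = 0.15
learning_rate = 0.5

# Term 1: H1 feeds both outputs, so the two contributions add up.
dEtotal_dout_h1 = dEo1_dout_h1 + dEo2_dout_h1

# Term 2: the derivative of the sigmoid is out * (1 - out).
dout_h1_dnet_h1 = out_h1 * (1 - out_h1)

# Term 3: net_h1 is linear in w1, so its derivative is the input i1.
dnet_h1_dw1 = i1

grad_w1 = dEtotal_dout_h1 * dout_h1_dnet_h1 * dnet_h1_dw1
w1_new = w1_old - learning_rate * grad_w1

print(f"gradient = {grad_w1:.6f}, new w1 = {w1_new:.4f}")
```

The gradient is tiny because the input 0.05 scales the whole product, which is why W1 barely moves from 0.15 in a single pass.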
[06:03:58] And in the same way, we'll update the
[06:04:00] values for W2, W3 and W4 as well. So
[06:04:04] this is the entire back propagation
[06:04:07] algorithm with which we were able to
[06:04:09] update all of the weights starting from
[06:04:11] W1 to W8. So initially when the inputs
[06:04:15] were 0.05 and 0.1 respectively and we
[06:04:19] forward propagated these inputs we had
[06:04:21] an error of 0.298
[06:04:24] but after applying the back propagation
[06:04:27] once the error came down to 0.29102
[06:04:32] So even though we see that the
[06:04:33] difference is not much, when we
[06:04:35] repeat this back propagation
[06:04:37] algorithm around 10,000 times the error
[06:04:40] plummets to around 0.0000351.
[06:04:45] So at this time when we feed forward
[06:04:47] 0.05 and 0.1 the output neurons generate
[06:04:51] 0.015
[06:04:53] for the first output and 0.98 for the
[06:04:57] second output which are very very close
[06:05:00] to the actual outputs. So we see that we
[06:05:03] keep on back propagating until we find
[06:05:06] the lowest error and also find out the
[06:05:09] optimal weights and the optimal bias. So
[06:05:12] back propagation is basically used to
[06:05:15] optimize our neural network and we have
[06:05:17] many such optimization algorithms with
[06:05:19] us and through all of these examples the
[06:05:21] optimization algorithm which we used was
[06:05:24] the gradient descent algorithm. So let's
[06:05:27] understand this gradient descent
[06:05:28] algorithm in a better way. So a
[06:05:30] gradient basically
[06:05:32] measures how much the output of a
[06:05:34] function changes if you change the input
[06:05:37] a little bit. So for all of the examples
[06:05:39] which you saw, you can consider the
[06:05:40] weight to be the input and the error to
[06:05:42] be the output. So you want to know how
[06:05:44] much does the error change with respect
[06:05:46] to a small change in the input. So the
[06:05:49] higher the gradient, the steeper the
[06:05:51] slope and faster the model learns. Or in
[06:05:53] other words, if the change in error is
[06:05:56] very high when there is just a small
[06:05:58] change in weight, then this means that
[06:06:00] the model is learning very quickly. So
[06:06:02] this is the concept behind gradient
[06:06:04] descent. So now let's understand how
[06:06:05] does gradient descent actually work. So
[06:06:08] this is the formula for gradient
[06:06:10] descent. B is the new weight. A is the
[06:06:12] current weight. This what you see is the
[06:06:14] learning rate and this is the gradient.
[06:06:17] So new weight or the next value is equal
[06:06:20] to current value minus the learning rate
[06:06:23] multiplied by the gradient. So let's say
[06:06:27] we have this graph over here and these
[06:06:29] are our initial values of weight and
[06:06:32] bias. So the task of the gradient
[06:06:34] descent function would be to find the
[06:06:35] optimal values of W and B and to
[06:06:38] minimize the error. So let's say we
[06:06:41] start from somewhere over here and we
[06:06:44] have random values of W and B. So
[06:06:47] let's say there's a ball and that ball
[06:06:50] keeps on rolling to the bottom of the
[06:06:53] curve. So the gradient descent algorithm
[06:06:56] has to make sure that it reaches to this
[06:06:59] bottom of the curve or the bottom of the
[06:07:02] hill in the fastest manner in the right
[06:07:05] direction possible. So restating it we
[06:07:08] have random values of W and B and the
[06:07:10] task of the gradient descent algorithm
[06:07:12] would be to find out the optimal values
[06:07:14] by reducing the error and the reduction
[06:07:17] in error is done by going in the right
[06:07:20] direction of the slope. So let's say if
[06:07:23] there's a ball which starts from over
[06:07:25] here, it keeps on rolling down and it
[06:07:28] has to reach over here and this value
[06:07:30] would be nothing but the global minimum.
[06:07:32] So when it reaches over here, it'll get
[06:07:34] the global minimum which is nothing but
[06:07:36] the lowest error and this is where we'll
[06:07:39] find out the optimal values for W and B.
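The update rule and the rolling-ball picture can be tried on a toy curve. The sketch below assumes a made-up error function, error(w) = w**2, whose lowest point is at w = 0; it is not the network from the video. Running it with a few different learning rates shows how the step size changes the path the "ball" takes.

```python
def descend(start, learning_rate, steps):
    """Run gradient descent on error(w) = w**2, whose gradient is 2*w."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient   # new = current - rate * gradient
        path.append(w)
    return path


moderate = descend(start=4.0, learning_rate=0.1, steps=50)
too_large = descend(start=4.0, learning_rate=1.0, steps=50)
too_small = descend(start=4.0, learning_rate=0.0001, steps=50)

print("moderate :", moderate[-1])     # close to the minimum at 0
print("too large:", too_large[-3:])   # bounces between the two sides
print("too small:", too_small[-1])    # has barely moved from the start
```

On this particular curve a learning rate of 1.0 flips the weight from one side to the other forever, while 0.0001 needs far more than 50 steps to get anywhere near the bottom.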
[06:07:42] Right now let's understand the
[06:07:44] importance of learning rate. So when the
[06:07:46] ball is rolling down the hill, the speed
[06:07:48] of that ball is determined by the
[06:07:50] learning rate. So let's say if the
[06:07:52] learning rate is very high or if it's
[06:07:54] very large then what happens is it might
[06:07:57] overshoot the global minimum. So let's
[06:07:59] say if it starts over here then from
[06:08:02] here it'll not come over here but it'll
[06:08:04] shoot over here. From here it'll come to
[06:08:06] the left side and from here it'll reach
[06:08:08] over here. So what is happening is since
[06:08:10] the learning rate is very high it'll
[06:08:12] never reach the global minimum and it'll
[06:08:15] keep on oscillating between the two
[06:08:17] walls of this curve. So this is what
[06:08:19] happens when the learning rate is very
[06:08:21] high. Now let's understand what happens
[06:08:23] when the learning rate is small. So when
[06:08:25] the learning rate is small, let's say if
[06:08:26] the ball starts from over here, then the
[06:08:29] speed of the ball to come from over here
[06:08:32] to the bottom of the hill would be very
[06:08:34] very slow. So it'll start over here,
[06:08:36] then it'll come over here, then over
[06:08:38] here, over here, over here, and over
[06:08:40] here. So from here to here, it'll take
[06:08:43] an eternity, and it'll consume a lot of
[06:08:46] resources and a lot of time for this
[06:08:49] neural network to find the optimal
[06:08:51] values for the weight. So this is why we
[06:08:53] have to make sure that the learning rate
[06:08:55] is neither too big nor too small. So we
[06:08:58] also have to find out the optimal
[06:09:00] learning rate to get the optimal values
[06:09:03] of weight and bias. Now there are other
[06:09:06] interesting terms when it comes to
[06:09:07] gradient descent. So you have something
[06:09:09] known as an epoch. So one epoch is
[06:09:11] basically when we pass the entire data
[06:09:14] set forward and backward through the
[06:09:17] neural network once. So I'm again
[06:09:20] reiterating it. So when we pass the data
[06:09:22] set forward and backward through the
[06:09:25] neural network once that is known as one
[06:09:28] epoch. Now when it comes to neural
[06:09:30] network it is very important that the
[06:09:33] data set is passed through more than one
[06:09:35] epoch because when we pass the data set
[06:09:38] through the neural network through only
[06:09:39] one epoch then it leads to underfitting
[06:09:42] or in other words it fails to learn all
[06:09:45] of the features associated with the
[06:09:48] data. Now when we keep on feeding this
[06:09:50] data to the neural network,
[06:09:52] there'll come a stage where the curve
[06:09:54] would be the optimum for the data
[06:09:56] provided. But then again if we keep on
[06:09:58] increasing the number of epochs and
[06:10:00] never stop what happens is it'll lead to
[06:10:03] overfitting. So when it leads to
[06:10:05] overfitting this would be good for that
[06:10:07] particular problem but it is very bad at
[06:10:10] general problems. So when it's given
[06:10:13] some other data set then it'll fail
[06:10:15] miserably. So this is why epoch is also
[06:10:18] important. So we also have to find out
[06:10:20] the right number of epochs for our model
[06:10:23] to be accurate enough. So after epoch we
[06:10:26] have something known as batch size and
[06:10:27] iterations. So batch size is basically
[06:10:30] the number of training examples present
[06:10:32] in a single batch. Now we'll head on to
[06:10:35] iterations. So iterations is the number
[06:10:38] of batches needed to complete one epoch.
[06:10:40] Now let's understand these three terms
[06:10:43] through an example. So let's say there
[06:10:45] are 2,000 entries in the data. And if we
[06:10:48] divide these 2,000 entries into batches
[06:10:52] of 500, we'll have four batches in
[06:10:55] total. And it'll take four iterations to
[06:10:58] complete one epoch. So I'm reiterating
[06:11:00] it. Let's say there's a data set which
[06:11:03] contains 2,000 entries or 2,000 records
[06:11:05] in total. And if we divide this data set
[06:11:08] of 2,000 examples into batches of 500,
[06:11:11] we'd have four batches in total. So we'd
[06:11:13] have to iterate through all of these
[06:11:15] four batches to complete one epoch.
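The relationship between data set size, batch size and iterations can be checked directly. This is a small sketch using the 2,000-entry example from above; the helper names `iterations_per_epoch` and `make_batches` are made up for illustration.

```python
import math


def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def make_batches(data, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


dataset = list(range(2000))            # stand-in for 2,000 records
batches = make_batches(dataset, 500)

print(len(batches), iterations_per_epoch(len(dataset), 500))
```

Using `math.ceil` means a data set that does not divide evenly still gets a final, smaller batch counted as one iteration.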
[06:11:18] Right? So now as I already stated during
[06:11:20] the agenda, the gradient descent comes
[06:11:22] with a lot of variants. So we have the
[06:11:24] vanilla gradient descent which is
[06:11:26] nothing but the batch gradient descent
[06:11:29] and then we have the stochastic gradient
[06:11:30] descent and another variation is the
[06:11:32] mini batch gradient descent. So let's
[06:11:35] start with the batch gradient descent.
[06:11:37] So we'll understand all of these
[06:11:38] variants through this example. So let's
[06:11:40] say we have a data set which comprises
[06:11:42] of these six images of cars. So let's
[06:11:46] start with batch gradient descent. So
[06:11:48] the batch gradient descent algorithm
[06:11:50] takes all of these images at one go. So
[06:11:52] it takes all of these six images of a
[06:11:54] car at one go and it calculates the loss
[06:11:58] for all of these six images at a single
[06:12:00] time. So it'll take all of these six
[06:12:02] images, it'll back propagate and it'll
[06:12:05] calculate the loss or the error in
[06:12:08] prediction. Now once it calculates the
[06:12:10] loss for each of these six cars then
[06:12:13] it'll go ahead and update the
[06:12:15] gradient. So it'll actually calculate the
[06:12:18] average gradient with respect to
[06:12:21] the loss of all of these six cars. Now
[06:12:24] keeping this in mind we'll head on to
[06:12:26] the next variant which is stochastic
[06:12:27] gradient descent. So in stochastic
[06:12:29] gradient descent what happens is it
[06:12:31] takes one record at a time or for this
[06:12:34] particular example it takes one image at
[06:12:36] a time. So the first image is sent to
[06:12:38] the neural network. It reads the image.
[06:12:41] It back propagates, calculates the error
[06:12:44] and after calculating the error, it
[06:12:47] immediately updates the gradient. Now
[06:12:49] once it updates the gradient, it reads
[06:12:51] the second record. Now after reading the
[06:12:54] second record, it back propagates and
[06:12:56] calculates the error with the updated
[06:12:58] gradient and again it'll update the
[06:13:00] gradient for this record as well. and
[06:13:02] then it'll head on to the third record.
[06:13:04] Calculate the loss, update the gradient.
[06:13:07] Similarly, it'll do that for each of the
[06:13:09] individual records. So this is what is
[06:13:12] done in stoastic gradient descent. And
[06:13:14] then finally we have the mini batch
[06:13:16] gradient descent where the data is read
[06:13:18] in batches or mini batches. So over here
[06:13:21] let's say with respect to this example,
[06:13:23] the mini batch size is two. So in the
[06:13:26] first mini batch two images are sent to
[06:13:29] this neural network over here. So it
[06:13:31] back propagates, calculates the loss and
[06:13:34] calculates the average gradient
[06:13:36] for these two records. And with the
[06:13:38] previously found gradient, it'll
[06:13:40] again back propagate and find out the
[06:13:43] updated gradient. And then we'll
[06:13:45] head on to the next batch and then
[06:13:47] update the gradient again. So
[06:13:49] this is what happens with respect to
[06:13:51] vanilla gradient descent, stochastic
[06:13:53] gradient descent and mini batch gradient
[06:13:55] descent. So now that we've understood
[06:13:57] the gradient descent algorithm properly,
[06:13:59] let's understand how can we work better
[06:14:01] with this gradient descent algorithm. So
[06:14:04] to find out that optimal value of weight
[06:14:06] and bias, we can plot the cost versus
[06:14:10] time graph. So time is plotted on the
[06:14:12] x-axis and cost is plotted on the
[06:14:14] y-axis. And the aim of the gradient
[06:14:17] descent algorithm should be to reduce
[06:14:19] the cost as the time progresses. So
[06:14:22] we'll most probably get a curve which
[06:14:24] will start at the far end of the y-axis
[06:14:27] and it'll come down to the bottom of the
[06:14:29] y-axis and then it'll stay constant. So
[06:14:33] basically as time progresses the cost
[06:14:34] has to decrease or in other words the
[06:14:36] error has to decrease. The next tip when
[06:14:39] working with gradient descent algorithms
[06:14:41] would be to modify the learning rates.
[06:14:44] So again we are never sure what learning
[06:14:46] rate would work with what data. So that
[06:14:49] is why we'd have to start with small
[06:14:51] learning rates. So we'll start with
[06:14:52] these small learning rates which are
[06:14:54] specified over here and then we can go
[06:14:56] ahead and randomly play with these
[06:14:58] learning rates so that we get the
[06:15:00] optimal value. And we should also make
[06:15:02] sure that we rescale the inputs. So
[06:15:04] sometimes it happens that the data which
[06:15:06] we feed into the neural network is
[06:15:08] distorted or it is skewed. Now when this
[06:15:11] happens this neural network will not
[06:15:13] work properly. So that is why we'd have
[06:15:16] to scale this data into either the
[06:15:18] range of 0 to 1 or -1 to 1 so that we
[06:15:22] get the optimal weight values and we can
[06:15:24] reduce the error as soon as possible.
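Rescaling into a fixed range can be done with min-max scaling. The sketch below is one common way to do it, not necessarily the method used in the course; the function name `min_max_scale` and the sample values are made up for illustration.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the smallest becomes low and the largest high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values are identical, so there is no spread to scale.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


hours = [5, 10, 25, 35]   # made-up sample values

zero_to_one = min_max_scale(hours)
minus_one_to_one = min_max_scale(hours, -1.0, 1.0)

print(zero_to_one)
print(minus_one_to_one)
```

In practice the minimum and maximum are usually taken from the training data and then reused for any new data, so that both are scaled the same way.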
[06:15:27] Right? So that was gradient descent. Now
[06:15:29] apart from the gradient descent
[06:15:31] algorithm we also have something known
[06:15:32] as the Adam optimization algorithm. So
[06:15:35] the Adam optimization algorithm
[06:15:37] basically stands for adaptive moment
[06:15:40] estimation and it is a combination of
[06:15:42] gradient descent with momentum and RMS
[06:15:45] prop algorithms. Now the problem with
[06:15:48] gradient descent algorithm is let's say
[06:15:50] if we take the example of that curve and
[06:15:53] we roll the ball down through the hill.
[06:15:56] Now what happens is if there are a lot
[06:15:59] of crests and troughs in that hill then
[06:16:01] there'll be oscillation along the
[06:16:03] vertical axis as well as the horizontal
[06:16:06] axis. Now when there is oscillation
[06:16:08] along the vertical axis this will reduce
[06:16:11] the speed of the ball and this will also
[06:16:14] increase the time of reaching the global
[06:16:16] minimum. So that is why when we add
[06:16:18] momentum along with the gradient descent
[06:16:21] algorithm the speed of reaching the
[06:16:23] global minimum would become faster. And
[06:16:25] also another change when it comes to the
[06:16:27] Adam optimization algorithm is the
[06:16:29] learning rate is variable. So the
[06:16:31] learning rate is not constant. So as and
[06:16:33] when the momentum changes the learning
[06:16:36] rate also changes to adapt with that
[06:16:39] momentum. So a feed forward neural
[06:16:41] network just contains multiple nodes
[06:16:43] which are arranged in multiple layers
[06:16:45] and simply put we have an input layer a
[06:16:48] hidden layer and the output layer over
[06:16:51] here. And a feed forward neural network
[06:16:53] can be of two kinds. So there
[06:16:55] could be a monolayer perceptron which does not
[06:16:57] contain any hidden layers and then we
[06:16:59] have our multi-layer perceptrons where
[06:17:01] there is one input layer and there
[06:17:04] exists at least one hidden layer and
[06:17:07] there is one output layer. So all of the
[06:17:10] inputs are taken through the input
[06:17:12] layer. The processing is done in the
[06:17:14] hidden layer and the final output is
[06:17:16] received through the output layer. All
[06:17:18] right. So now let's actually discuss the
[06:17:20] solution to our e-commerce firm problem
[06:17:23] with the help of multi-layer perceptron.
[06:17:26] So we have this input layer. All of the
[06:17:28] inputs are sent over here and the
[06:17:30] processing is done over here in these
[06:17:32] hidden layers. Now each of these hidden
[06:17:35] layers does a part of the processing
[06:17:37] work. So let's say stage one of the
[06:17:40] processing is done in the first hidden
[06:17:42] layer. Stage two of the processing is
[06:17:44] done in the second hidden layer. Stage
[06:17:46] three of the processing is done in the
[06:17:47] third hidden layer. and so on and then
[06:17:49] we'll have a final linear equation. Now
[06:17:53] this final linear equation is sent
[06:17:55] through an activation function and that
[06:17:58] activation function converts this linear
[06:18:00] equation into a nonlinear solution and
[06:18:03] sends that out to the output layer.
[06:18:06] Right? So over here in this multi-layer
[06:18:08] perceptron over here we have these
[06:18:10] inputs, we have this hidden layer and we
[06:18:12] have the output layer over here. So
[06:18:14] let's start off with the input layer. So
[06:18:16] in the input layer we have two external
[06:18:18] inputs X1 and X2 and along with these
[06:18:21] external inputs we also have a bias. Now
[06:18:24] the bias is generally one. Now these
[06:18:27] inputs are sent to the hidden layer and
[06:18:29] these links between the input layer and
[06:18:31] the hidden layer these links are
[06:18:33] basically associated with weights. Now
[06:18:36] once these inputs are sent to the hidden
[06:18:38] layer all of the processing is done over
[06:18:40] here. So what happens in the hidden
[06:18:42] layer is the inputs are multiplied with
[06:18:45] these weights associated. So x1 is
[06:18:48] multiplied with w1, x2 is multiplied
[06:18:51] with w2 and 1 over here or the bias over
[06:18:53] here is multiplied with the weight w. So
[06:18:56] in the hidden layer we get this equation
[06:18:58] w into 1 + w1 into x1 + w2 into x2. And
[06:19:04] this is our linear equation which we
[06:19:05] want. But then again, we'd have to
[06:19:07] convert this linear equation into a
[06:19:09] nonlinear solution. And to do that, we'd
[06:19:12] require an activation function. We'll
[06:19:14] basically pass this linear equation
[06:19:16] through an activation function. And then
[06:19:18] we'll get a result. And that result is
[06:19:20] finally sent out through the output
[06:19:23] layer over here. Now, keeping these
[06:19:25] points in mind of how an input layer
[06:19:26] works, of how the hidden layer works,
[06:19:28] and how the output layer works, let's
[06:19:30] solve our use case. All right. So again
[06:19:33] we have the input layer, the hidden
[06:19:34] layer and the output layer. So over here
[06:19:37] all of the different marketing platforms
[06:19:39] which we have they are sent as the
[06:19:41] individual inputs. So marketing via
[06:19:43] emails that would be our first input.
[06:19:46] Marketing via referral programs would be
[06:19:48] a second input. Patch is our third input
[06:19:50] and then we have direct marketing,
[06:19:52] social media marketing and then we also
[06:19:54] have the organic search. So all of these
[06:19:57] platforms together combined become our
[06:19:59] inputs and these inputs are sent to the
[06:20:01] hidden layer and all of these links
[06:20:03] which you see over here all of these
[06:20:05] links have certain weights associated
[06:20:07] with them and those weights are
[06:20:09] multiplied with the corresponding
[06:20:11] inputs over here and we get a linear
[06:20:13] equation and that linear equation is
[06:20:16] passed through an activation function
[06:20:18] and that provides us a solution which is
[06:20:20] finally sent out through the output
[06:20:22] layer and that is how we'll get the
[06:20:24] result where we'll get to know what
[06:20:26] combination of all of these marketing
[06:20:29] platforms would be the best idea for us.
[06:20:32] Right now let's actually go through
[06:20:34] another use case to understand
[06:20:35] multi-layer perceptrons better. So we have
[06:20:38] this data set over here which comprises
[06:20:40] these columns. So this column tells
[06:20:42] us the number of hours studied by the
[06:20:43] student. This column tells us the
[06:20:45] midterm marks scored by the student and
[06:20:47] this column tells us whether the student
[06:20:49] would pass in the exam or not. So one
[06:20:51] over here tells us that the student has
[06:20:53] passed in the exam and zero over here
[06:20:56] tells us that the student has failed in
[06:20:58] the exam. So these two would be our
[06:21:00] inputs and this would be a final result.
[06:21:03] Now suppose we want to predict whether a
[06:21:05] student studying 25 hours and having 70
[06:21:08] marks in the exam will pass the final
[06:21:10] term or not. So the input for number of
[06:21:14] hours studied is 25 and the input for
[06:21:16] number of marks obtained is 70. and we'd
[06:21:20] have to find out or we'd have to predict
[06:21:22] the final result. So these are the three
[06:21:25] parameters which we have over here and
[06:21:27] this is basically a binary
[06:21:28] classification problem. So it's either
[06:21:30] one or zero and we'll solve this binary
[06:21:33] classification problem through a
[06:21:35] multi-layer perceptron. Right? Now we've
[06:21:38] understood what feed forward propagation
[06:21:40] is. So in feed forward propagation we
[06:21:43] start from the input layer and these
[06:21:45] inputs are sent to the hidden layer
[06:21:47] where all of the processing is done and
[06:21:50] that processed output is finally sent
[06:21:53] out through the output layer. But then
[06:21:55] again that wouldn't be our final
[06:21:57] solution. Now to get the final solution
[06:22:00] the multi-layer perceptron uses
[06:22:02] something known as the back propagation
[06:22:03] algorithm and using this back
[06:22:05] propagation algorithm it'll come out
[06:22:08] with the optimal solution. So we'll
[06:22:10] understand about back propagation in the
[06:22:12] coming slides. Right? So coming to our
[06:22:15] use case, we have the input layer, the
[06:22:17] hidden layer and the output layer. And
[06:22:19] these are our inputs. So number of hours
[06:22:21] studied by the student are 35 and the
[06:22:23] marks scored by the student are 67. And
[06:22:26] these are given as the inputs to the
[06:22:27] input layer. And there is also a bias
[06:22:29] over here. So these are the initial
[06:22:31] random weights which are W1, W2 and W3.
[06:22:35] Now what happens in the forward
[06:22:37] propagation is we have predicted and the
[06:22:40] prediction for this node over here it
[06:22:43] comes out to be 0.4 and the prediction
[06:22:45] for this node over here it comes out to
[06:22:47] be 0.6 but the actual results are
[06:22:50] different. So the actual result for this
[06:22:53] output is one and the actual result for
[06:22:55] this output is zero. So there's a
[06:22:57] difference of 0.6 here and a
[06:22:59] difference of 0.6 here as well. So there is
[06:23:02] an error in prediction and normally for
[06:23:04] the final output layer we'll be using
[06:23:06] the softmax function as the activation
[06:23:09] function. So the softmax function
[06:23:11] basically gives us probabilities and
[06:23:13] when you add up all of those
[06:23:15] probabilities the sum ends up to be one.
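To make the softmax step concrete, here is a minimal Python sketch. The two raw scores (`logits`) are hypothetical values chosen only so that the resulting probabilities match the 0.4 and 0.6 used in this example; they are not taken from the course material.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    # Subtracting the largest score avoids overflow in exp() and
    # does not change the resulting probabilities.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the two output nodes (pass, fail).
logits = [0.0, math.log(1.5)]
probs = softmax(logits)

print(probs)       # two probabilities, roughly 0.4 and 0.6
print(sum(probs))  # adds up to 1, within floating-point rounding
```

Whatever raw scores go in, the outputs are positive and sum to one, which is why softmax is a common choice for the final layer of a classifier.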
[06:23:17] So as over here we see that this
[06:23:19] probability is 0.4 and this probability
[06:23:21] is 0.6 and when you add these two up you
[06:23:24] get a total probability of one. And to
[06:23:27] get this probability in this way, we'll
[06:23:28] have to use the softmax activation
[06:23:31] function. And we have the same thing
[06:23:32] over here. So these are the inputs.
[06:23:34] Number of hours studied and the midterm
[06:23:35] marks. The number of hours studied is
[06:23:37] 35 and the midterm marks are 67. Now
[06:23:40] these inputs are sent to the hidden
[06:23:42] layer where the processing is done. So
[06:23:44] W2 is multiplied with 35 and W3 is
[06:23:48] multiplied with 67 and then W1 is also
[06:23:51] multiplied with the bias which is one.
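The weighted sum described here can be sketched in a few lines of Python. The inputs (bias 1, 35 hours, 67 marks) come from the example; the weight values and the choice of a sigmoid activation are assumptions made purely for illustration, since the walkthrough does not state them.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Inputs from the example: a bias input of 1, hours studied, midterm marks.
bias_input = 1.0
hours_studied = 35.0
midterm_marks = 67.0

# Hypothetical starting weights (in practice these start out random).
w1, w2, w3 = 0.5, 0.01, -0.005

# Linear part: 1 * w1 + 35 * w2 + 67 * w3
z = bias_input * w1 + hours_studied * w2 + midterm_marks * w3

# The linear result is then passed through an activation function.
activation = sigmoid(z)

print(z, activation)
```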
[06:23:53] So in total the linear equation becomes
[06:23:55] 1 into w1 + 35 into w2 + 67 into w3 and
[06:24:00] this linear equation is passed through
[06:24:02] an activation function. Now once this
[06:24:05] processing is done and it is sent to the
[06:24:07] activation function we'll get an output
[06:24:09] and that output is sent through the
[06:24:11] output layer. So again as we see over
[06:24:13] here the probability of passing is given
[06:24:15] out as 0.4 and the probability of
[06:24:17] failing is given as 0.6, but in
[06:24:21] actuality this is wrong. So the
[06:24:23] probability of passing is 1 and the
[06:24:25] probability of failing is zero. So
[06:24:27] there's an error of 0.6 for the first
[06:24:29] output node and there's an error of 0.4
[06:24:32] for the second output node. Now this has
[06:24:35] happened because the weights and the
[06:24:36] bias were initially randomly assigned.
[06:24:39] So when you give in random weights
[06:24:41] you'll also get a random result. So to
[06:24:44] correct this, we'd have to go back and
[06:24:47] optimize these weights and bias. So this
[06:24:50] is where back propagation comes in. So
[06:24:53] what we'll do is once we get the errors
[06:24:55] over here, we will go back and using an
[06:24:58] optimization algorithm, we'll update
[06:25:01] these weights. So we'd have to update
[06:25:03] W1, W2, and W3. And when we get the
[06:25:06] optimal weights for W1, W2 and W3, we'll
[06:25:10] get the optimal result which would be
[06:25:12] one for the first node and zero for the
[06:25:14] second node. And this is how the back
[06:25:16] propagation algorithm works. So what
[06:25:19] we'll do is we'll find out the change in
[06:25:23] error with respect to the change in
[06:25:25] weight. So we'll have to find out the
[06:25:27] change in Y with respect to the change
[06:25:29] in W4, W5 and W6. And when we get that
[06:25:34] we will go back that is we'll back
[06:25:36] propagate and then we'll update the
[06:25:37] weights for W4 W5 and W6. That is how
[06:25:41] we'll get the optimal weights. So once
[06:25:43] the back propagation is done and we've
[06:25:45] updated these weights W4, W5 and W6, we
[06:25:48] see that the result has also changed. So
[06:25:51] we see that after using the back
[06:25:52] propagation algorithm, the probability
[06:25:54] of pass that is the probability for the
[06:25:56] first node has changed to 0.8 and the
[06:26:00] probability of failing or the
[06:26:01] probability for the second node has
[06:26:03] changed to 0.2 and this time the error
[06:26:06] in prediction for the first node is 0.2
[06:26:08] and similarly the error in prediction
[06:26:10] for the second node is also 0.2. So
[06:26:13] after back propagating and updating the
[06:26:16] weights we find out that the error has
[06:26:18] reduced. So this is how back propagation
[06:26:21] works to optimize the weights. So I'll
[06:26:23] start off by loading the required
[06:26:25] packages. We would need NumPy, Matplotlib,
[06:26:27] and pandas. So I'll import numpy as
[06:26:30] np. I'll import the Matplotlib package
[06:26:32] as plt and I'll import pandas as pd.
[06:26:34] I'll click on run. So let's say there is
[06:26:36] a hypothetical neural network and to
[06:26:38] that hypothetical neural network I'm
[06:26:40] giving two inputs. The value of the
[06:26:41] first input which is x1 is 2 and the
[06:26:43] value of the second input which is x2 is
[06:26:46] five. And for these two inputs I would
[06:26:48] want the output value which is y to be
[06:26:51] equal to 31. So I'm basically
[06:26:54] initializing the values over here. I'll
[06:26:56] click on run again. And this is where
[06:26:58] I'll do all of the back propagation and
[06:27:00] gradient descent work. So starting off
[06:27:02] I'll give in the value for learning
[06:27:04] rate. So I'll set the learning rate to
[06:27:05] be equal to 0.01.
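A minimal sketch of the training loop this part of the walkthrough builds, written in plain Python. The inputs (2 and 5), the target (31), the learning rate (0.01), the starting weights (3 and 7) and the 50 epochs are the values the speaker uses; the variable names and the `history` list are assumptions, and the exact numbers printed in the original notebook may differ depending on details not shown in the transcript.

```python
# Inputs and the desired output.
x1, x2 = 2.0, 5.0
y = 31.0

learning_rate = 0.01
w1, w2 = 3.0, 7.0   # starting weights

history = []        # (w1, w2, squared error) after each epoch

for epoch in range(50):
    # Forward pass: weighted sum of the inputs.
    y_pred = w1 * x1 + w2 * x2

    # Squared error between target and prediction.
    error = (y - y_pred) ** 2

    # Backward pass: partial derivatives of the error.
    # d/dw1 (y - y_pred)^2 = 2 * (y - y_pred) * (-x1), same pattern for w2.
    grad_w1 = 2 * (y - y_pred) * (-x1)
    grad_w2 = 2 * (y - y_pred) * (-x2)

    # Gradient descent update: new weight = old weight - learning rate * gradient.
    w1 = w1 - learning_rate * grad_w1
    w2 = w2 - learning_rate * grad_w2

    history.append((w1, w2, error))
    print(epoch, w1, w2, error)

final_prediction = w1 * x1 + w2 * x2
print(final_prediction)
```

Because there is a single training example and two weights, many weight pairs give a prediction of 31; gradient descent simply lands on the one reachable from the starting point.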
[06:27:08] So after learning rate I'd have to
[06:27:10] assign some random weight values. So
[06:27:12] I'll have W1 and I'll assign it a random
[06:27:15] weight value of three. And then I have
[06:27:17] W2 and I'll again assign it a random
[06:27:19] value of seven. So now through back
[06:27:22] propagation my aim would be to find
[06:27:25] those optimal values of w1 and w2 which
[06:27:30] would give me the correct value of y
[06:27:32] which is 31. So what I've done is in
[06:27:35] this for loop I've basically taken 50
[06:27:38] epochs that is I will pass in these
[06:27:40] inputs through the neural network 50
[06:27:42] times and this is the forward pass over
[06:27:45] here where I'll go ahead and calculate
[06:27:47] the prediction. So that is
[06:27:49] basically w1 into x1 + w2 into x2. So
[06:27:53] I've passed these inputs into the input
[06:27:56] layer. Now in the input layer, what is
[06:27:58] happening is the weights are multiplied
[06:28:00] with the inputs. So w1 into x1 + w2 into
[06:28:05] x2 and I'm storing that into y
[06:28:07] predicted. Now we obviously know that
[06:28:09] the predicted values would not be equal
[06:28:12] to the actual values in the first step
[06:28:14] itself. Then I'll go ahead and calculate
[06:28:16] the error in prediction. So error in
[06:28:18] prediction would be y which is this
[06:28:20] value over here minus y pred which is
[06:28:23] basically over here. So y minus y pred
[06:28:27] and I'll square this. So I need the
[06:28:29] squared error. So once I calculate the
[06:28:31] error this is when I'll back propagate
[06:28:34] and find out the change in error with
[06:28:36] respect to individual weights. So
[06:28:38] initially I want the change in error
[06:28:40] with respect to weight one. So when I
[06:28:43] differentiate this, it becomes 2 * (y
[06:28:46] - y pred), and by partial differentiation,
[06:28:49] since this is with respect to w1, this
[06:28:51] would be a constant, this would become
[06:28:53] zero, and then this will become minus x1.
[06:28:56] Similarly, when I partially differentiate
[06:28:58] this with respect to w2, I get 2 * (y -
[06:29:02] y pred) into minus x2 over here. So I've back
[06:29:06] propagated and I've got the partial
[06:29:09] differentiation of error with respect to
[06:29:11] W1 and W2. Now after this is done, I'd
[06:29:14] have to use the gradient descent formula
[06:29:17] and update the weight values. So the new
[06:29:20] value of W1 would be W1 minus learning
[06:29:24] rate into the gradient with respect to
[06:29:26] W1. Similarly, new value of W2 would be
[06:29:30] old value of W2 minus learning rate
[06:29:33] into differentiation of error with
[06:29:35] respect to W2. Now, finally, I'll print
[06:29:38] out W1, W2, and the error. So, I'll run
[06:29:42] this. So, let's see what is happening
[06:29:44] over here. So, initial value of W1 is
[06:29:47] 2.6. Initial value of W2 is 6.2. And the
[06:29:50] error in prediction is 100. So, in
[06:29:53] second iteration, the error comes down
[06:29:54] to 17. In the third iteration the error
[06:29:57] comes down to three. So we see that we
[06:29:59] are going in the right direction and the
[06:30:01] error falls down very very quickly. So
[06:30:04] after around 50 iterations the error is
[06:30:07] zero. Now since the error is zero let me
[06:30:10] find out the weight values now. So
[06:30:12] initially the weight values were 2.6 and
[06:30:14] 6.2. Now finally we have the weight
[06:30:18] values to be 2.3 and 5.27. And when we
[06:30:22] pass in these weight values into this
[06:30:24] formula, we get a value of 31.0
[06:30:27] which is same as this actual output
[06:30:30] value. So the actual output value and
[06:30:32] the predicted output value both of them
[06:30:35] are same. So initially we started with
[06:30:37] an error of 100. Now we kept on back
[06:30:40] propagating 50 times and we reduced this
[06:30:43] 100 to zero and we got the final correct
[06:30:47] value. So there are a lot of deep
[06:30:49] learning frameworks available today. So
[06:30:51] now my question to you guys would be why
[06:30:53] should we use Keras out of all of these
[06:30:55] deep learning libraries. Well let's
[06:30:57] understand. So the number one reason to
[06:31:00] use Keras would be that it prioritizes
[06:31:02] developer experience. So Keras is a
[06:31:04] framework which is developed for humans
[06:31:06] and not machines. That is, it is very
[06:31:09] easy to code with Keras. You just have to
[06:31:12] keep on adding layers which you can
[06:31:14] invoke with functions and you can keep
[06:31:16] on building neural networks. So it is
[06:31:18] that easy to work with Keras. And then
[06:31:21] Keras is also broadly adopted in the
[06:31:23] industry and also among the research
[06:31:26] community. So all of the PhD scientists
[06:31:29] over there and all of the data
[06:31:30] scientists their most widely preferred
[06:31:33] deep learning framework is Keras. And
[06:31:35] the next reason would be it is very easy
[06:31:37] to turn all of these Keras models into end
[06:31:40] to end products. So let's say if you
[06:31:43] develop a simple prototype with Keras and
[06:31:46] if you want to launch it on some
[06:31:47] platform then you can easily do it. So
[06:31:50] you build a model and then you can
[06:31:52] easily launch it on let's say Android,
[06:31:54] iOS or any other operating system as an
[06:31:57] end-to-end product. And Keras also supports
[06:31:59] multiple backend engines and does not
[06:32:01] lock you into one ecosystem. Now what do
[06:32:03] I exactly mean when I say that Keras
[06:32:06] supports multiple backend engines? Well,
[06:32:08] Keras is basically a high-level API and this
[06:32:11] high-level API can run on a lot of low-level
[06:32:14] APIs such as TensorFlow, CNTK and
[06:32:18] Theano. So now Keras is what works at
[06:32:21] the front end and at the back end you
[06:32:23] have either TensorFlow, Theano or CNTK
[06:32:26] running. Right? Now Keras also has a
[06:32:29] strong multi-GPU support. So when I say
[06:32:32] Keras has a strong multi-GPU support what
[06:32:35] I mean is you can basically divide the
[06:32:38] data or you can basically train the data
[06:32:40] on multiple GPUs. So let's say you have
[06:32:43] an input data which comprises 100
[06:32:45] records and you divide it into five mini
[06:32:48] batches. Now you can train each of these
[06:32:51] individual mini batches on separate
[06:32:54] GPUs. So let's say you have a model and
[06:32:58] you have the input data. Now you'll be
[06:33:00] making copies of this model and
[06:33:02] each individual copy would run on its own
[06:33:05] GPU with its own mini batch, and each GPU would
[06:33:08] give you an individual result which is
[06:33:10] aggregated and you'll get the final
[06:33:12] result. So this basically speeds up the
[06:33:15] model building process. And Keras
[06:33:17] development is also backed by all of the
[06:33:19] major companies out there such as
[06:33:20] Google, Amazon and Nvidia. Right? So now
[06:33:24] that we've understood why should we use
[06:33:26] Keras, let's actually understand what
[06:33:29] Keras is. So as I've already told you guys,
[06:33:32] Keras is basically a high-level API and it
[06:33:35] is written in Python and this high-level API
[06:33:38] can run on top of TensorFlow or
[06:33:41] CNTK and it is very easy to work with
[06:33:44] Keras. So you have individual modules and
[06:33:47] you can invoke each of these individual
[06:33:50] modules to keep adding layers on the
[06:33:53] neural network. So now that we
[06:33:54] understand what is keras now let's have
[06:33:56] a look at the different models in keras.
[06:33:58] So there are basically two types of
[06:34:00] models available in keras which are
[06:34:02] sequential model and functional model.
[06:34:04] So let's start with the sequential
[06:34:06] model. So simply put sequential model is
[06:34:09] just a linear stack of layers. So you
[06:34:12] have one layer on top of that one layer
[06:34:15] you add another layer on top of the
[06:34:16] second layer you add the third layer on
[06:34:18] top of the third layer you add the
[06:34:20] fourth layer. So it is basically a
[06:34:22] sequence of layers. So as we see over
[06:34:25] here this input layer would be our first
[06:34:27] layer and on top of this input layer
[06:34:30] we'll add the first hidden layer and on
[06:34:32] top of this first hidden layer we'll add
[06:34:35] the second hidden layer. And finally on
[06:34:37] top of the second hidden layer we'll add
[06:34:40] the output layer. So simply put a
[06:34:42] sequential model is just a linear stack
[06:34:45] of layers which processes the data and
[06:34:48] gives out the final output through the
[06:34:50] output layer. Right? And this is how we
[06:34:53] can invoke a sequential model through
[06:34:55] Keras. So first we'd have to import
[06:34:58] Sequential from keras.models and then
[06:35:00] we'd have to import whatever layers that
[06:35:02] we require. So the first step in
[06:35:04] creating a sequential model would be to
[06:35:07] create an instance of this. So we have
[06:35:09] to use the sequential method and we'll
[06:35:11] create an instance of it. So now that
[06:35:13] we've created a model. So with this
[06:35:15] model we can keep on adding layers. So
[06:35:18] what we are basically doing is for this
[06:35:20] instance I am adding the first layer. So
[06:35:23] this first layer is basically a dense
[06:35:25] layer and it comprises 32 nodes and the
[06:35:28] dimension of the input which this first
[06:35:30] layer takes is 784. And we are also
[06:35:33] adding an activation function to this
[06:35:35] first layer. It is as simple as that.
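A minimal sketch of the sequential model being described, assuming Keras is installed. The 32 nodes and the input size of 784 come from the walkthrough; the `relu` activation is an assumption (the transcript does not name the activation), and declaring the input shape with a separate `Input` object rather than as an argument to the `Dense` layer is a choice made here because it works across recent Keras versions.

```python
from keras.layers import Dense, Input
from keras.models import Sequential

# Step 1: create an empty sequential model.
model = Sequential()

# Step 2: declare the shape of one input record (784 values).
model.add(Input(shape=(784,)))

# Step 3: add a dense layer with 32 nodes and an activation function.
# "relu" is an assumed choice for this sketch.
model.add(Dense(32, activation="relu"))

model.summary()
```

Further layers are stacked the same way, with one `model.add(...)` call per layer.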
[06:35:37] First you create a model and then
[06:35:39] whatever layers you want to add you can
[06:35:41] just add it with the help of the add
[06:35:43] method. So this is how we can create
[06:35:44] sequential models in Keras. Now we'll head
[06:35:47] on to the second type of models which
[06:35:49] are functional models. So functional
[06:35:51] models help us to create complex models.
[06:35:54] Now the problem with sequential models
[06:35:56] is you can give inputs only at the
[06:35:58] beginning stage over here. So let's say
[06:36:00] if I want to add new inputs for the
[06:36:03] second layer that wouldn't be possible.
[06:36:06] So once if you give the inputs you
[06:36:08] cannot add anything else in between. So
[06:36:10] this is where functional models differ.
[06:36:12] So in functional models it is not
[06:36:14] necessary to follow the same sequence.
[06:36:17] So when it comes to functional model any
[06:36:19] layer can be connected to any other
[06:36:22] layer. And these three steps which you
[06:36:24] see these are the three steps which have
[06:36:26] to be followed to create a functional
[06:36:27] model. So we start off by defining the
[06:36:30] input and once we define the input we
[06:36:33] start off by building a set of layers
[06:36:36] and these layers can be connected
[06:36:38] in any way. So I can connect the first layer
[06:36:41] with the fourth layer or the second
[06:36:43] layer with the 10th layer. So I start
[06:36:45] off by defining the inputs and then I'll
[06:36:47] create the layers and then I'll connect
[06:36:49] those layers. Now finally I will build
[06:36:52] the model. So these are the three steps
[06:36:54] involved. First define the input.
[06:36:56] Second, build the layers and connect the
[06:36:58] layers. And then finally, build the
[06:37:00] model. Right? So let's understand each
[06:37:03] of these steps properly. So when it
[06:37:05] comes to a functional model, we actually
[06:37:07] have to create and define a standalone
[06:37:10] input layer that specifies the shape of
[06:37:12] the input data. So as you see over here,
[06:37:15] we are importing the Input method. So from
[06:37:18] keras.layers, we are importing Input and
[06:37:21] we are defining the shape of the input
[06:37:23] data. And this is extremely important
[06:37:26] when it comes to functional models. So
[06:37:28] first we have to start off by creating
[06:37:30] the input layer. And once we create the
[06:37:32] input layer, we can keep on adding other
[06:37:35] layers. So as we see over here, first we
[06:37:37] import input method. And after that we
[06:37:39] are importing the next layer which is
[06:37:41] basically a dense layer. So as we see
[06:37:43] over here, first we create the input
[06:37:45] layer using the input method and the
[06:37:47] shape of the input is two and I'm
[06:37:50] storing it in visible. After that I'll
[06:37:52] add another layer. So this is basically
[06:37:54] a dense layer. Now this dense layer has
[06:37:57] two nodes in it and it is connected with
[06:37:59] the visible layer over here. Right? So
[06:38:01] we give out the name of the previous
[06:38:03] layer after we create the first layer
[06:38:04] over here. Or in other words, layers in
[06:38:07] functional model are basically connected
[06:38:09] pair-wise. So this is the dense layer
[06:38:12] and this dense layer is connected with
[06:38:14] the first input layer which is visible.
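A minimal sketch of the functional pattern being described, assuming Keras is installed: an input of shape 2 stored in `visible`, a dense layer with two nodes connected to it, and the model-building step that the walkthrough covers next. The variable name `hidden` matches the way the speaker refers to the output layer; everything else follows the standard Keras functional API.

```python
from keras.layers import Dense, Input
from keras.models import Model

# Step 1: define a standalone input layer with the shape of one record.
visible = Input(shape=(2,))

# Step 2: create a dense layer with two nodes and connect it to the
# input by calling the layer on the previous layer's output.
hidden = Dense(2)(visible)

# Step 3: build the model by naming its input and its output.
model = Model(inputs=visible, outputs=hidden)

model.summary()
```

Because each layer is called on the output of whichever layer should feed it, the same pattern extends to branches, multiple inputs, and skip connections.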
[06:38:16] Right? So we are done with the first two
[06:38:18] steps. So first we created the input
[06:38:20] layer and then we connected the rest of
[06:38:22] the layers. Now after that we have to
[06:38:25] create the model. So defining the model
[06:38:27] is extremely simple. So first we have to
[06:38:29] import the Model method from keras.models,
[06:38:32] and using this Model method all
[06:38:35] we have to do is pass in the input and
[06:38:37] the outputs. So as we have already seen
[06:38:39] this before input is the visible layer
[06:38:42] and output is the hidden layer. So we
[06:38:44] are passing these two layers in this
[06:38:46] method and we are storing this in the
[06:38:48] model instance. So this is how we can
[06:38:50] create a functional model. And Keras
[06:38:52] already has a number of predefined
[06:38:54] layers. So these are some of the
[06:38:56] predefined layers listed over here. So
[06:38:57] Keras has core layers, convolutional
[06:39:00] layers, pooling layers, recurrent
[06:39:02] layers, noise layers, embedding layers
[06:39:04] and so on. Right? So now that we've
[06:39:06] understood the different models
[06:39:08] available in Keras now let's understand
[06:39:10] some problems when we are building the
[06:39:12] model on top of input data. So the aim
[06:39:15] of model building should be to gain the
[06:39:17] maximum accuracy but then again it is
[06:39:20] not that easy to build that perfect
[06:39:22] model. Many a times what happens is our
[06:39:24] model is not able to learn the right
[06:39:26] amount of data. So let's take this case
[06:39:29] over here. So let's say we build a
[06:39:31] neural network model on top of this data
[06:39:34] and our aim is to classify this data
[06:39:38] into two classes. So the first class
[06:39:40] would be this cross and the second class
[06:39:42] is the circles over here. Now the
[06:39:44] problem with this model is it has not
[06:39:47] learned all of the features of the data
[06:39:49] properly. And since it has not learned
[06:39:51] all of the features of the data
[06:39:52] properly, the misclassification over here
[06:39:55] is quite high. So these three circles
[06:39:58] which should actually be circles have
[06:40:00] been classified as cross and these
[06:40:03] crosses which you see they have been
[06:40:04] misclassified as circles and this
[06:40:07] misclassification is due to the
[06:40:08] underfitting of the model. So I again
[06:40:11] restate it. So underfitting of the model
[06:40:13] basically means that it is not able to
[06:40:16] learn all of the features of the data
[06:40:17] properly and when it doesn't learn all
[06:40:20] the features of the data properly, it is
[06:40:22] not able to predict or it is not able to
[06:40:25] classify the problem in a correct way or
[06:40:28] it doesn't give us the perfect accuracy.
[06:40:30] So let's say if this model is built on
[06:40:32] the train set then the train set
[06:40:35] accuracy would be somewhere around 65%
[06:40:38] and the test set accuracy would be
[06:40:40] somewhere around 63%. Which is not
[06:40:42] really that good right? So this is
[06:40:44] basically the problem with underfitting.
[06:40:47] Now instead of drawing a straight line
[06:40:49] what if I make the model a bit more
[06:40:51] complicated. So what I'll do is I'll
[06:40:54] build a polynomial curve over here. So
[06:40:56] this is a complicated model. So this
[06:40:59] complicated model is better than this
[06:41:01] model over here. So this model gives
[06:41:03] just about the right fit. So over here
[06:41:06] all of the circles have been classified
[06:41:08] correctly as circles and just these two
[06:41:10] crosses have been misclassified as
[06:41:12] circles. So let's say this gives an
[06:41:14] accuracy of around 95% on the train data
[06:41:18] and an accuracy of around 93% on the
[06:41:21] test data which is actually good enough.
[06:41:23] Now what if I get overenthusiastic and
[06:41:25] build a very very complex model with lot
[06:41:28] of polynomial variables in it. So this
[06:41:31] would give me a curve something like
[06:41:33] this. So over here all of the crosses
[06:41:36] have been classified into one pool and
[06:41:38] all of the circles have been classified
[06:41:40] into another pool. So this basically
[06:41:43] gives us 100% accuracy on top of the
[06:41:45] train set. But the problem is it fails
[06:41:48] miserably on the test set. So the
[06:41:51] problem actually is that it fails to
[06:41:53] generalize. So when it is given the
[06:41:55] train set, it learns all of the features
[06:41:58] of this train set perfectly and when it
[06:42:02] comes to the test set, it has not seen
[06:42:04] the test set at all. And that is why it
[06:42:06] fails to perform properly on the test
[06:42:09] set. So when it comes to the train set,
[06:42:11] it gives out a 100% accuracy and when it
[06:42:14] comes to the test set, it might give
[06:42:16] only around 75% or around 80% accuracy.
[06:42:19] So that is why underfitting and
[06:42:21] overfitting should be avoided and we
[06:42:23] need to find that appropriate fitting or
[06:42:25] just the right fitting to get that ideal
[06:42:27] accuracy on top of the test set. Right?
[06:42:30] So this graph over here explains this in
[06:42:32] a better way. So we have error on the
[06:42:34] y-axis and model complexity on the
[06:42:36] x-axis. As we see over here as we keep
[06:42:39] on increasing the model complexity the
[06:42:43] error it decreases for the training set.
[06:42:45] But what happens with the test set is as
[06:42:48] we increase the complexity the error
[06:42:50] first decreases but then again the error
[06:42:54] increases as we keep on increasing the
[06:42:56] model complexity. Right? So to get that
[06:42:59] perfect accuracy on top of the test set
[06:43:01] we need to find out that optimal model
[06:43:04] complexity. Only when we find out that
[06:43:06] optimal model complexity that is when
[06:43:09] we'll get the maximum accuracy. Right?
[06:43:11] So now that we've understood why
[06:43:13] overfitting is harmful for a model,
[06:43:15] let's understand the solution to it. So
[06:43:17] the solution to overfitting is basically
[06:43:20] regularization. So there are some
[06:43:22] regularization techniques which help to
[06:43:24] reduce overfitting. So let's actually
[06:43:26] take this example over here. So here we
[06:43:28] see that we have built a very complex
[06:43:29] model and it learns all of the training
[06:43:31] data perfectly. But then again since it
[06:43:34] learns all of the training data
[06:43:35] perfectly, this leads to overfitting.
[06:43:37] Now the solution to this would be
[06:43:39] regularization. So when we apply a
[06:43:42] regularization technique on top of this
[06:43:44] data, what happens is it penalizes the
[06:43:48] weight matrices. So when it penalizes
[06:43:50] the weight matrices, so the values of
[06:43:53] those weight matrices and bias matrices
[06:43:55] would change a bit and this would lead
[06:43:57] to a less complicated curve. Now also we
[06:44:00] need to make sure that we give the
[06:44:02] appropriate value for the regularization
[06:44:04] coefficient. So if we keep the
[06:44:06] regularization coefficient to be very
[06:44:08] high, then what might happen is some
[06:44:11] of the weight matrices would become zero
[06:44:13] and when some of the weight matrices
[06:44:15] become zero it would basically lead to a
[06:44:17] model which is underfitting the data.
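A small sketch of how a weight penalty behaves, reusing the toy problem from the gradient descent demo earlier (inputs 2 and 5, target 31). The walkthrough does not say which penalty is used, so the L2 penalty (the sum of squared weights times a coefficient), the function name `train`, and the learning rate and epoch count below are all assumptions chosen for illustration.

```python
def train(reg_coeff, learning_rate=0.001, epochs=2000):
    """Gradient descent on (y - y_pred)^2 + reg_coeff * (w1^2 + w2^2)."""
    x1, x2, y = 2.0, 5.0, 31.0
    w1, w2 = 3.0, 7.0
    for _ in range(epochs):
        y_pred = w1 * x1 + w2 * x2
        # Gradient of the squared error plus gradient of the L2 penalty.
        grad_w1 = 2 * (y - y_pred) * (-x1) + 2 * reg_coeff * w1
        grad_w2 = 2 * (y - y_pred) * (-x2) + 2 * reg_coeff * w2
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return w1, w2, w1 * x1 + w2 * x2

plain = train(reg_coeff=0.0)     # no regularization
strong = train(reg_coeff=100.0)  # very large coefficient

print("no penalty:   ", plain)
print("large penalty:", strong)
```

With no penalty the prediction reaches the target of 31. With a very large coefficient the weights are pulled towards zero and the prediction falls far short of the target, which is the underfitting the speaker warns about.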
[06:44:19] Right? So again when we are applying
[06:44:21] this regularization technique or when we
[06:44:23] are tuning this regularization
[06:44:24] coefficient we have to make sure that we
[06:44:27] don't alter the weights too much and if
[06:44:30] we alter the weights too much then we
[06:44:31] might end up actually underfitting the
[06:44:33] model and that is something we have to
[06:44:35] avoid. So that is why we have to find
[06:44:38] out that right value of regularization
[06:44:40] coefficient which would give us this
[06:44:42] appropriate fitting. So we don't want a
[06:44:44] model which is too complex or too
[06:44:47] simple. We just want that right fitted
[06:44:49] model which is able to give us the
[06:44:52] perfect accuracy on top of the train set
[06:44:54] as well as the test set. Right now let's
[06:44:57] actually see some regularization
[06:44:59] techniques. So one regularization
[06:45:01] technique would be to add dropouts in our
[06:45:03] neural network model. So now the
[06:45:05] question in your head would be what
[06:45:06] exactly does a dropout do? The dropout
[06:45:09] basically means that we randomly remove
[06:45:11] or cancel some nodes during each
[06:45:13] iteration. So as we see over here in the
[06:45:16] first layer these two have been dropped
[06:45:18] out or these two have been removed.
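The random removal of nodes can be sketched in plain Python. This is an illustration of the idea rather than Keras's implementation: the function name `dropout`, the sample activations, and the rescaling of surviving values by 1 / (1 - p) (often called inverted dropout) are assumptions that are not stated in the walkthrough.

```python
import random

def dropout(activations, p, rng):
    """Zero each activation with probability p and rescale the survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in the range [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)                # fixed seed so runs are repeatable
layer_output = [0.2, 0.9, 0.5, 0.7]   # hypothetical outputs of one layer

# A fresh random set of nodes is dropped on every iteration.
for iteration in range(3):
    print(iteration, dropout(layer_output, p=0.5, rng=rng))
```

Note that with a dropout probability of 0.5 each node is dropped independently, so half of the nodes are removed on average rather than exactly half every time.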
[06:45:20] Similarly in the second layer these
[06:45:22] three have been removed and in the third
[06:45:24] layer these two have been removed. So
[06:45:25] this again is an arbitrary choice and we
[06:45:28] can give any random value to each of
[06:45:31] these layers. So now that we know what
[06:45:32] exactly dropouts are let's understand
[06:45:34] how they are helpful. So when we cancel
[06:45:36] out these nodes during every iteration
[06:45:39] they do not memorize the data and when
[06:45:41] they don't memorize the data this
[06:45:43] actually in turn helps not to
[06:45:45] overfit the data. So let's actually
[06:45:47] understand this through an example. So
[06:45:49] let's say we have this training data and
[06:45:51] we are building this model on top of the
[06:45:53] training data and this training data
[06:45:55] goes through three epochs or three
[06:45:57] iterations. So in the first iteration in
[06:46:01] this layer these two are canceled out.
[06:46:03] In this layer these three are canceled
[06:46:05] out and in the third layer these two are
[06:46:08] canceled out. Now this is only with
[06:46:10] respect to the first iteration. Now in
[06:46:12] the second iteration again a random set
[06:46:15] of two nodes would be canceled out. And
[06:46:17] it's not necessary that these would be
[06:46:19] those two nodes. So those two nodes
[06:46:21] could be either this one and this one or
[06:46:23] this one and this one. So it could be
[06:46:25] any random combination of two nodes
[06:46:27] which are canceled out from this first
[06:46:29] layer. Similarly again from this second
[06:46:31] layer a random combination of three
[06:46:33] nodes is canceled out. Again in this
[06:46:35] third layer a random combination of two
[06:46:37] nodes is canceled out. Again we'll come
[06:46:39] to the third iteration. Again the same
[06:46:42] thing happens in the third iteration. So
[06:46:44] basically adding a dropout layer works
[06:46:46] as an ensemble model. So every iteration
[06:46:49] gives slightly different model. So in
[06:46:51] the first iteration we have different
[06:46:52] set of nodes which are giving the
[06:46:54] output. In the second iteration we have
[06:46:56] another set of nodes which are giving
[06:46:57] the output. Similarly in the third
[06:46:59] iteration we have totally different set
[06:47:00] of nodes which are giving the output
[06:47:02] right. So a dropout layer basically
[06:47:04] works as ensemble models through each
[06:47:07] iteration. So we can basically set a
[06:47:09] probability of number of nodes to be
[06:47:11] dropped out. So over here P is equal to
[06:47:12] 0.5. So this means that the dropout
[06:47:15] percentage would be 50%. That is we'll
[06:47:18] be keeping two nodes and we'll be
[06:47:19] removing two nodes. And it's extremely
[06:47:22] easy to add dropouts with Keras. So first
[06:47:24] we'd have to import the Dropout method
[06:47:26] from keras.layers.core. So
[06:47:28] once we do that we'll build a model and
[06:47:31] in this model we are adding a layer over
[06:47:32] here. Now for this layer for this dense
[06:47:35] layer over here we are adding a dropout
[06:47:38] and the probability of the dropout we
[06:47:40] are setting it to be 0.25. It's as
[06:47:42] simple as that. So all we have to do is
[06:47:44] use this Dropout method and set the
[06:47:47] probability of dropout. So adding a
[06:47:48] dropout layer was one regularization
[06:47:50] technique. Now we also have another
[06:47:53] regularization technique which is data
[06:47:55] augmentation. So the simplest way to
[06:47:57] reduce overfitting is to increase the
[06:47:59] size of the training data. Now how can
[06:48:02] we increase the size of the training
[06:48:04] data? Well, we can basically modify the
[06:48:06] data to increase the size of it. So
[06:48:09] let's say we are dealing with images or
[06:48:12] images of these numbers over here. Now
[06:48:14] in this case, what we can do is we can
[06:48:16] augment the data. So let's say I have
[06:48:19] the input data which is basically image
[06:48:21] of the number two. Now if I want to
[06:48:23] create a duplicate of this image, what I
[06:48:25] can do is I can add some sort of
[06:48:27] transformation on top of it. Let's say
[06:48:29] the initial image which had the number
[06:48:31] two was over here at the center. So
[06:48:33] instead of keeping it at the center,
[06:48:35] what I'll do is I'll move it up.
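The kind of transformation being described can be sketched on a tiny image stored as a list of rows. The helper names (`shift_up`, `shift_down`, `flip_horizontal`) and the 3x3 sample image are hypothetical, written for this illustration; they are not Keras functions.

```python
def shift_up(image, rows=1):
    """Move the picture up, filling the vacated bottom rows with zeros."""
    width = len(image[0])
    blank = [[0] * width for _ in range(rows)]
    return [row[:] for row in image[rows:]] + blank

def shift_down(image, rows=1):
    """Move the picture down, filling the vacated top rows with zeros."""
    width = len(image[0])
    blank = [[0] * width for _ in range(rows)]
    return blank + [row[:] for row in image[:len(image) - rows]]

def flip_horizontal(image):
    """Mirror the picture left to right."""
    return [row[::-1] for row in image]

# A tiny stand-in for a digit image: the "ink" sits in the middle row.
image = [
    [0, 0, 0],
    [1, 2, 3],
    [0, 0, 0],
]

augmented = [shift_up(image), shift_down(image), flip_horizontal(image)]
for variant in augmented:
    print(variant)
```

Each variant is a new training record that is similar to the original, so one image becomes several.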
[06:48:37] Similarly, let's say if I had the number
[06:48:39] three over here in the center, I'll move
[06:48:41] it down. Similarly, over here the number
[06:48:43] three, I'm pushing it to the right. And
[06:48:45] not just moving them to the top, bottom,
[06:48:48] left. We can also flip the image, we can
[06:48:50] also scale the image. So let's say this
[06:48:52] is my input image seven. So instead of
[06:48:54] keeping it like this, what I'll do is
[06:48:55] I'll flip it inverted like this. So I'll
[06:48:58] flip it and it'll be inverted and also I
[06:49:01] can scale it up or in other words I can
[06:49:03] also increase or decrease its size. So
[06:49:06] these are different transformations
[06:49:07] which I can apply on top of the input
[06:49:10] data so that I can create more records
[06:49:13] or records which are similar to the
[06:49:15] original records. So let's say if my
[06:49:17] input data consists of images of the
[06:49:19] first 100 natural numbers. Now if I want
[06:49:22] a thousand records instead of 100 records,
[06:49:24] all I have to do is add some sort of
[06:49:26] transformations on those original
[06:49:28] records and have records which are
[06:49:29] pretty much similar to the original
[06:49:31] records. So this is how we can augment
[06:49:33] the data. Now what is the importance of
[06:49:35] data augmentation? Well when we have
[06:49:37] more data it is harder
[06:49:40] for the model to simply memorize all of the
[06:49:41] training examples. So when we have less
[06:49:43] training data then it is very easy for
[06:49:46] the model to overfit it and when we
[06:49:47] have a lot of data the model will
[06:49:49] obviously take a lot of time to learn
[06:49:51] all the features and it is difficult for
[06:49:53] the model to overfit it. So this is how
[06:49:55] data augmentation works and when it
[06:49:57] comes to Keras we can perform all of these
[06:50:00] data augmentation strategies using the
[06:50:02] ImageDataGenerator class. So we'll
[06:50:05] have to import ImageDataGenerator
[06:50:08] from keras.preprocessing.image
[06:50:10] and it has a big list of arguments.
[06:50:13] Over here we are setting
[06:50:15] horizontal_flip to be equal to True. So the MNIST data
[06:50:17] set comprises handwritten images of
[06:50:20] numbers from 0 to 9. So this is
[06:50:23] basically a classification problem. So
[06:50:25] we take in the input data and we'd have
[06:50:27] to classify them belonging to this
[06:50:30] particular number. So let's say if I
[06:50:31] take in the image of number one then I
[06:50:34] have to classify to which of the numbers
[06:50:36] starting from 0 to 9 does it belong to.
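The shift and flip transformations described for data augmentation can be sketched in plain Python. This is only an illustration on a tiny nested-list "image"; the helper names `shift_up`, `shift_right`, and `flip_horizontal` are made up for this sketch and are not part of Keras, where `ImageDataGenerator` would normally do this work.

```python
def shift_up(image, rows=1, fill=0):
    """Move the content up by `rows`, padding the bottom with `fill`."""
    width = len(image[0])
    return [row[:] for row in image[rows:]] + [[fill] * width for _ in range(rows)]

def shift_right(image, cols=1, fill=0):
    """Move the content right by `cols`, padding the left with `fill`."""
    return [[fill] * cols + row[:len(row) - cols] for row in image]

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

# A 3x3 stand-in for a digit image with two bright pixels in the middle row.
original = [
    [0, 0, 0],
    [5, 9, 0],
    [0, 0, 0],
]

# Each transformation yields a new, slightly different training record.
augmented = [shift_up(original), shift_right(original), flip_horizontal(original)]
```

Every transformed copy keeps the same label as the original, which is how 100 records can be turned into many more.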
[06:50:39] So this is basically the data set which
[06:50:41] we are going to be working with. So we'll
[06:50:43] start off by importing all of the
[06:50:45] packages that we need. So from keras
[06:50:47] we'll import models and we require the
[06:50:49] dropout layer as well as the dense layer
[06:50:52] and we have to convert some of the
[06:50:54] values in the data set to categorical.
[06:50:56] So we'll be doing one hot encoding and
[06:50:58] to do one hot encoding we'll be
[06:51:00] importing the to_categorical method
[06:51:02] from keras.utils and as we are working on
[06:51:05] top of the MNIST data set we'd have to
[06:51:07] import this from keras.datasets and we'll also
[06:51:10] be visualizing how the model works. So
[06:51:13] to visualize that we'll be importing
[06:51:15] model_to_dot from
[06:51:18] keras.utils.vis_utils and we'll also be importing SVG
[06:51:21] from IPython.display. So let me run this.
[06:51:23] Now I'll also import numpy. After that
[06:51:26] I'll import livelossplot. So
[06:51:28] livelossplot helps us to have a glance at
[06:51:31] the live training of the model. So we
[06:51:33] can actually visualize how the error is
[06:51:35] decreasing and how the accuracy is
[06:51:37] increasing for each of the epoch. So for
[06:51:41] that purpose we'll be importing
[06:51:42] livelossplot. And there are some initial
[06:51:44] variables which we are creating. So
[06:51:45] we'll set the number of rows to be 28,
[06:51:47] number of columns to be 28, the number
[06:51:49] of classes to be 10, the batch size to
[06:51:51] be 128, and the number of epochs to be
[06:51:54] 10. And then we'll go ahead and create a
[06:51:57] simple method which we can call again
[06:51:58] and again. So this method takes in four
[06:52:01] parameters: X train, Y train, X test and Y
[06:52:04] test. So X train basically contains the
[06:52:07] train images. Y train contains the train
[06:52:10] labels. X test contains the test images
[06:52:14] and Y test contains the test labels and
[06:52:17] in this function we are basically
[06:52:18] printing all of these things. So X_train.shape,
[06:52:21] that is, we are basically
[06:52:23] printing out the shape of the training
[06:52:26] set images. After that, we'll print out
[06:52:28] the shape of the labels from the train
[06:52:31] set. And then we'll print out the shape
[06:52:33] of the images of the test set. And then
[06:52:36] we'll print out the shape of the labels
[06:52:38] of the test set. And finally, we'll just
[06:52:40] print out the train labels and the test
[06:52:42] labels. So I'll click on run. Right. So
[06:52:45] we've created the method. Now I will
[06:52:47] load up the data from MNIST data set and
[06:52:51] I'll store all of the data inside X train,
[06:52:55] Y train, X test and Y test. Right? And
[06:52:58] then I will pass in these four values
[06:53:01] into this method which I've just
[06:53:02] created. So I'm reiterating it from the
[06:53:05] MNIST data set. I'm loading all of the
[06:53:07] data and I'm storing that data in X
[06:53:10] train, Y train, X test and Y test. and
[06:53:13] I'll pass in these four values into this
[06:53:15] method which I've just created. So I'll
[06:53:17] click on run. Right? So this method
[06:53:20] basically gives me these things. So we
[06:53:22] see that the training images shape is
[06:53:24] 60,000
[06:53:26] cross 28 cross 28 and the labels shape is 60,000. That
[06:53:30] is, this basically means that there are
[06:53:32] 60,000 entries in the training set and
[06:53:36] this is basically the size of each image
[06:53:38] which is 28 cross 28. And then again this is
[06:53:42] the test images. So the shape is
[06:53:44] 10,000 cross 28 cross 28.
[06:53:46] This means that there are 10,000 images
[06:53:48] in the test set and the shape of these
[06:53:51] images is 28 cross 28. And these are the
[06:53:54] train labels and the test labels. So
[06:53:56] these are basically labelings of all of
[06:53:58] the images. So the first image in the
[06:54:01] training set is of the number five. The
[06:54:04] second image in the training set is of
[06:54:06] the number zero. Similarly the third
[06:54:08] image in the training set is of the
[06:54:09] number four. So you basically have to
[06:54:11] build a model on top of the train set so
[06:54:13] that it learns all of the features
[06:54:14] properly and then we'll check for its
[06:54:16] accuracy on top of the test set. So
[06:54:19] before we go ahead and build the model
[06:54:21] we actually have to reshape it. Now the
[06:54:24] train image shape it cannot be 28 cross 28. So
[06:54:27] it has to be of a single dimension. So
[06:54:29] what we'll do is we'll multiply this. So
[06:54:31] we have already stored the value of num
[06:54:33] rows and num cols over here. So num
[06:54:35] rows is 28 and num cols is 28. So this
[06:54:38] basically means that number of rows are
[06:54:40] 28 and number of columns are 28 and
[06:54:42] we'll multiply this and we'll convert
[06:54:44] this two-dimensional data into a single
[06:54:46] dimensional data. So we are basically
[06:54:48] reshaping this training data. Similarly
[06:54:51] we'll take in this test data which is of
[06:54:53] the shape 28, 28 and again we'll convert
[06:54:56] this two-dimensional data into a single
[06:54:58] dimensional data by multiplying those
[06:55:00] two. So over here I am passing this
[06:55:03] X_test.shape and I am reshaping this
[06:55:06] with just one dimension. So it'll be num
[06:55:08] rows into num cols which is basically
[06:55:10] 28 cross 28. Right? So I've reshaped the
[06:55:13] image data. Now similarly I'd have to
[06:55:16] reshape the labels as well or in other
[06:55:18] words I'd have to perform one hot
[06:55:20] encoding on top of this. So I just
[06:55:23] cannot have these values 5, 0, 4
[06:55:25] individually. So what I'll do is if
[06:55:27] there is five I will do one hot encoding
[06:55:31] and represent it as a numpy array of 10
[06:55:35] values. So I am taking this Y train and
[06:55:38] I am encoding it into categorical
[06:55:42] values. So I'm passing the original Y
[06:55:44] train value and the number of classes
[06:55:47] over here I have set it to be 10. So
[06:55:49] I'll just pass in the parameter which is
[06:55:51] num classes over here. Similarly I'll
[06:55:53] take in Y test and I will pass this
[06:55:56] inside this to_categorical method and
[06:55:59] I'll set the num classes again. So this
[06:56:01] basically gives in 10 classes. I'll
[06:56:03] click run now. Now again I will pass in
[06:56:06] all of these modified data. So X train,
[06:56:09] Y train, X test, Y test all of these
[06:56:12] have been modified. I'll pass these
[06:56:14] inside data summary method and let's see
[06:56:17] how these have been changed. Right? So
[06:56:19] initially train images shape was 60,000 cross 28 cross 28
[06:56:23] and that has been changed to 60,000 cross 784.
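The change from 28 cross 28 to 784 is just a row-by-row flattening of each image. A minimal pure-Python sketch follows; the helper name `flatten_image` is hypothetical, and with NumPy arrays the same step would be a single `reshape` call.

```python
NUM_ROWS, NUM_COLS = 28, 28

def flatten_image(image):
    """Concatenate the rows of a 2-D image into one flat list of pixels."""
    return [pixel for row in image for pixel in row]

# A dummy 28x28 image whose pixel value encodes its own position.
image = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]
flat = flatten_image(image)

print(len(flat))  # 784, i.e. NUM_ROWS * NUM_COLS
```

No pixel values change; only the layout does, so a dense layer can take each image as one vector of 784 inputs.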
[06:56:26] Similarly the train label shape was
[06:56:28] 60,000 and I've changed that to 60,000 cross 10
[06:56:32] and these have been changed like this.
[06:56:33] And also for the labels we have
[06:56:35] performed one hot encoding. So over here
[06:56:37] if we have a glance at these test
[06:56:39] labels. So this is 7 2 and one right. So
[06:56:43] over here seven has been represented
[06:56:45] like this. Two has been represented like
[06:56:47] this and one has been represented like
[06:56:49] this.
[06:56:55] So this one represents the place of that
[06:56:58] value. So if you have one at the seventh
[06:57:00] index position, the label is seven. Right? So now we have our data
[06:57:02] ready and since we have our data ready
[06:57:04] we can go ahead and build a sequential
[06:57:06] model. So using this models module which I've
[06:57:10] imported I will call the
[06:57:12] Sequential class. So
[06:57:14] models.Sequential.
[06:57:16] So I'll start off by creating an
[06:57:17] instance of a sequential model. So I'll
[06:57:20] call models.Sequential and I'll store
[06:57:22] that in this model object. So now that
[06:57:25] I've created my model I can finally go
[06:57:27] ahead and add all of the layers. So what
[06:57:29] I'll do is I'll add the first layer. So
[06:57:32] in this first layer, this is nothing but
[06:57:33] a dense layer. So this is a densely
[06:57:35] connected layer and this layer has 512
[06:57:38] nodes. And the activation function which
[06:57:40] I'm using is ReLU. And since this is the
[06:57:43] first layer, it needs to have the input
[06:57:44] shape. So input shape is nothing but
[06:57:46] number of rows into number of columns.
[06:57:48] So this comes to exactly 784.
[06:57:51] So this is the first layer and after the
[06:57:53] first layer we are adding a dropout with
[06:57:56] a probability of 0.5. After that we'll
[06:57:59] add the second layer and the second
[06:58:01] dense layer has 256 nodes in it and
[06:58:03] again the activation function is relu
[06:58:05] and this time the dropout probability is
[06:58:07] 0.25 and then we finally have the output
[06:58:10] layer and the output layer comprises
[06:58:13] 10 nodes and we have 10 nodes because we
[06:58:16] have 10 classes in total or these 10
[06:58:19] classes represent the numbers from 0 to
[06:58:22] 9. So we basically have to classify
[06:58:24] whether this is an image corresponding
[06:58:26] to number zero, number one, number two
[06:58:28] and so on. And the activation function
[06:58:30] which we've used for the final output
[06:58:31] layer is softmax. So softmax function
[06:58:33] basically gives us probabilities for
[06:58:35] each of these nodes in the output layer.
[06:58:38] And when you add up all of these
[06:58:39] probabilities, it comes up to one right.
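To make the softmax claim concrete, here is a small sketch in plain Python. The ten scores are made up for illustration, and the function is a hand-written stand-in for what the `softmax` activation computes on the output layer.

```python
import math

def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Ten made-up scores, one per digit class 0 to 9.
scores = [1.0, 0.5, 3.0, 0.1, 0.0, 0.2, 0.3, 2.0, 0.4, 0.6]
probs = softmax(scores)

# The predicted digit is the class with the highest probability.
predicted_digit = probs.index(max(probs))
```

Here the largest score sits at index 2, so the network's prediction would be the digit two.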
[06:58:42] So I'll run this now. So we have built
[06:58:45] the model. So now that we've built the
[06:58:47] model, we have to tune it a bit. And to
[06:58:48] tune the model, we have to use
[06:58:50] model.compile. So this is where we'll be
[06:58:53] using three parameters. So first
[06:58:55] parameter is the optimizer which we'll
[06:58:57] be using. Second parameter is the loss
[06:58:59] function which we'd have to reduce and
[06:59:01] the third parameter is the metric which
[06:59:04] we'd have to calculate. So over here the
[06:59:06] optimizer or the optimization algorithm
[06:59:09] which I'm using is RMSprop and the loss
[06:59:11] which we'd have to reduce is categorical
[06:59:14] cross entropy and the metric which we
[06:59:16] are getting is accuracy of the model.
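Putting the layers and the compile step together, the model described so far could look roughly like the sketch below. It assumes the `tensorflow.keras` packaging rather than the standalone `keras` imports used in the video, and the function name `build_sequential_model` and the constants are names chosen for this sketch.

```python
INPUT_DIM = 28 * 28                         # 784 flattened pixels
HIDDEN_LAYERS = [(512, 0.5), (256, 0.25)]   # (nodes, dropout probability)
NUM_CLASSES = 10                            # digits 0 to 9

def build_sequential_model():
    """Build and compile the dense network described in the walkthrough."""
    # Imported here so the settings above can be read without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.Input(shape=(INPUT_DIM,)))
    for nodes, drop_prob in HIDDEN_LAYERS:
        model.add(keras.layers.Dense(nodes, activation="relu"))
        model.add(keras.layers.Dropout(drop_prob))
    model.add(keras.layers.Dense(NUM_CLASSES, activation="softmax"))

    # Optimizer, loss to reduce, and metric to report.
    model.compile(
        optimizer="rmsprop",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_sequential_model()` returns a compiled model that is ready to be fitted on the flattened images and one-hot labels.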
[06:59:18] Right? So this is basically the tuning
[06:59:20] part and after that we have to fit the
[06:59:23] model on top of the training set and to
[06:59:25] do that we'll be using model.fit and
[06:59:28] this takes in all of these parameters.
[06:59:29] So we'll pass in x train y train or in
[06:59:32] other words we are basically fitting
[06:59:34] this model on top of the train set and
[06:59:36] we are validating this on top of the
[06:59:39] test set. So we have x test and y test
[06:59:41] over here and we also set the batch size
[06:59:44] and number of epochs over here. So the
[06:59:47] number of epochs are 10 and the batch
[06:59:49] size is 128. So I'll run this now. Now
[06:59:53] this is our first epoch. So our first
[06:59:55] epoch has just started. So we see that
[06:59:59] the loss it is reducing. So this is our
[07:00:02] graph over here. So we have two graphs
[07:00:04] over here. So the first graph this tells
[07:00:07] us how the error reduces with each epoch
[07:00:10] and the second graph tells us how the
[07:00:12] accuracy increases with each epoch. So
[07:00:14] over here this orange line it is for the
[07:00:17] validation data and this blue line it is
[07:00:20] for the training data. Similarly this is
[07:00:22] what we have over here. So we see that
[07:00:24] the error it reduces considerably for
[07:00:27] the training data as well as the
[07:00:29] validation data as the number of epochs
[07:00:32] increases. And similarly over here as
[07:00:34] the number of epochs increases the
[07:00:36] accuracy also increases. So the accuracy
[07:00:39] for the validation data it increases
[07:00:42] till over here till the sixth epoch. Now
[07:00:44] after the sixth epoch for the seventh
[07:00:46] epoch the accuracy actually comes down
[07:00:48] and then it goes back up over here.
[07:00:51] Right? So we were able to visualize this
[07:00:53] because of the live loss plot which we
[07:00:55] had imported earlier and we are done
[07:00:57] with all of the 10 epochs and after the
[07:01:00] 10th epoch we have the log loss and the
[07:01:03] accuracy over here. So we see that after
[07:01:06] completing the 10 epochs the error it
[07:01:10] comes down to 0.072
[07:01:12] for the validation data and the accuracy
[07:01:15] for the validation data is 0.983.
[07:01:19] Right? So we have successfully built the
[07:01:21] model. Now let's also have a glance at
[07:01:23] the summary of the model. So I'll use
[07:01:25] model.summary and I'll click on
[07:01:26] run. So this gives us a brief summary.
[07:01:29] So this is the first dense layer. This
[07:01:31] is the second dense layer and this is
[07:01:33] the third dense layer over here. And
[07:01:35] this tells us the number of nodes with
[07:01:38] respect to each dense layer. So the
[07:01:40] first dense layer has 512 nodes and then
[07:01:43] we have a dropout layer corresponding to
[07:01:45] the first dense layer. After that we
[07:01:46] have the second dense layer and the
[07:01:48] second dense layer comprises 256
[07:01:51] nodes. And then again we have a dropout
[07:01:53] layer for the second dense layer. And
[07:01:54] then we finally have the output layer
[07:01:56] which is again a dense layer. And this
[07:01:58] dense layer comprises 10 nodes over
[07:02:00] here. So in total the number of
[07:02:03] parameters are 535,818
[07:02:07] and all of these are trainable
[07:02:08] parameters over here. So now that we've
[07:02:10] seen the summary let's just actually
[07:02:13] visualize how our model has progressed.
[07:02:16] So for that I will use the SVG method
[07:02:18] and inside that I will pass in my model
[07:02:21] inside model_to_dot and I'll create a
[07:02:24] dot plot. So I'll click on run.
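The total of 535,818 parameters reported in the summary can be checked by hand: each dense layer has one weight per connection plus one bias per node, and dropout layers add nothing. A small sketch, with `dense_param_count` as a hypothetical helper name:

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # one weight per connection, one bias per node
    return total

# 784 inputs -> 512 -> 256 -> 10 outputs.
print(dense_param_count([784, 512]))            # 401920 for the first dense layer
print(dense_param_count([512, 256]))            # 131328 for the second dense layer
print(dense_param_count([256, 10]))             # 2570 for the output layer
print(dense_param_count([784, 512, 256, 10]))   # 535818 in total
```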
[07:02:29] So this is what we have over here. So I
[07:02:31] have taken the input and I have passed
[07:02:34] that input through the first dense layer
[07:02:36] which is nothing but our hidden layer
[07:02:38] which comprises 512 nodes and after
[07:02:42] that I passed the result through the
[07:02:44] first dropout layer and this result was
[07:02:47] then sent to the second hidden layer and
[07:02:49] then this again was sent through a
[07:02:50] dropout layer and then finally the
[07:02:53] result from this dropout layer was sent
[07:02:55] to the output layer. So this was the
[07:02:58] entire process which we performed in
[07:03:00] this Keras sequential model. So now we'll
[07:03:02] work with the second type of model which
[07:03:04] is the functional model and again we'll
[07:03:06] be applying the functional model on top
[07:03:08] of the MNIST data set. So again we'll
[07:03:11] start off by loading all of the required
[07:03:12] packages. So we have to import Keras and
[07:03:15] then we have to import Model from Keras
[07:03:17] and I'll import all of the optimizers
[07:03:19] from Keras. I'll click on run.
[07:03:27] After this, I'll load in the training
[07:03:28] set and the testing set of the MNIST.
[07:03:31] So, pd.read_csv helps me to load
[07:03:34] both of these CSV files. So, the training
[07:03:36] CSV has the train set of the MNIST data
[07:03:38] and test.csv has the test set of the
[07:03:41] MNIST data. Again, I'll click on run.
[07:03:48] Right? So we've stored the train set in
[07:03:50] df_train and we've stored the test set
[07:03:53] in df_test. Now let me have a glance at
[07:03:56] the training set first. I'll click on
[07:03:58] run. So df_train.head(). So this gives
[07:04:01] me the first five rows of the training
[07:04:03] set. So this is our training set which
[07:04:05] comprises 785 columns. So this is what
[07:04:08] you see pixel 0, pixel 1, pixel 2. So we
[07:04:11] have 784
[07:04:13] pixels in total and these are the
[07:04:16] corresponding labels for each of these
[07:04:18] images. So the first image it is
[07:04:21] basically the image of number one. The
[07:04:23] second image it is the image of number
[07:04:25] zero. The third image it is the image of
[07:04:27] number one. Right? So we have the labels
[07:04:30] over here and then we have the 784
[07:04:32] pixels which combine to make up this
[07:04:35] image. Now similarly let me have a
[07:04:37] glance at the test set. So,
[07:04:39] df_test.head(). I'll click on run. Now, this
[07:04:42] is a data set which comprises 784
[07:04:45] columns. Now that we've had a glance at the
[07:04:47] original training and testing sets, what
[07:04:49] I'll do is I'll divide the training set
[07:04:51] into features and labels. So, from the
[07:04:54] entire data set, I'll select the columns
[07:04:57] starting from column number two to the
[07:04:59] final column that is starting from pixel
[07:05:01] 0 till pixel 783 and I will store it in
[07:05:06] df_features. So these basically represent
[07:05:09] all of my independent variables or these
[07:05:11] basically represent all of my features.
[07:05:13] And similarly from the training set I'll
[07:05:15] just select this first column and I will
[07:05:18] store it in df_label. So I have my
[07:05:21] features and labels ready with me. And
[07:05:23] similarly I'll also create the test set.
[07:05:26] So from the test set I'll extract all of
[07:05:28] the rows and all of the columns and I'll
[07:05:30] store that in x test. Now let me print
[07:05:33] out the shape of x test. I'll click on
[07:05:36] run. So we see that there are 28,000
[07:05:38] entries in the test set or in other
[07:05:40] words there are 28,000 images in the
[07:05:43] test set. So now that we have our features
[07:05:45] and labels ready with us, we have to
[07:05:47] divide these features and labels into
[07:05:49] validation data and training data. And
[07:05:51] for this purpose we'll be importing
[07:05:52] train_test_split from sklearn.model_selection
[07:05:55] and I'll be passing both the
[07:05:58] features and labels inside this. And the
[07:06:00] test size I'll be setting it to 0.2,
[07:06:03] that is 20% of the data points would be
[07:06:06] present in the validation data set and
[07:06:08] 80% of the data points would be present
[07:06:10] in the training set and I'm setting a
[07:06:13] random seed of 1212 so that I can use
[07:06:15] the same result again whenever I want.
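The idea behind an 80/20 split with a fixed seed can be sketched without scikit-learn. This is not how `train_test_split` is implemented internally; it only illustrates shuffling with a seed and cutting off a validation share. The helper name `split_indices` and the seed value are arbitrary, and the row count of 42,000 is inferred from the 33,600 and 8,400 rows that the walkthrough reports after splitting.

```python
import random

def split_indices(n_samples, test_size=0.2, seed=1212):
    """Shuffle row indices with a fixed seed and cut off a validation share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * test_size))
    return indices[n_val:], indices[:n_val]  # (training rows, validation rows)

train_idx, val_idx = split_indices(42000)
print(len(train_idx), len(val_idx))  # 33600 8400

# The same seed always reproduces the same split.
assert split_indices(42000) == (train_idx, val_idx)
```

Fixing the seed is what makes the experiment repeatable: every run trains and validates on exactly the same rows.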
[07:06:17] Now I will store the result in X train, XCV, Y
[07:06:22] train and YCV. So this X train comprises
[07:06:25] the training data of the features.
[07:06:27] This XCV comprises the validation data
[07:06:30] of the features and we have Y train
[07:06:32] which comprises all of the labels of
[07:06:33] the training set and then we have YCV
[07:06:36] which comprises all of the labels for
[07:06:38] the validation data. Now after this
[07:06:41] again we have to reshape this. So we are
[07:06:44] going to reshape the X train and XCV. So
[07:06:47] we're going to reshape this X train and
[07:06:49] XCV into a matrix. And we are going to
[07:06:52] set the number of rows to be 33,600 and
[07:06:54] the number of columns to be 784 for X
[07:06:57] train. And for validation, I'm going to
[07:06:59] have 8,400 rows and 784 columns. And
[07:07:04] finally, we have the X test. So I'll use
[07:07:06] X_test.as_matrix(). I'm converting this
[07:07:09] into a matrix and then I'm reshaping it
[07:07:11] to the same thing which is 28,000 cross 784.
[07:07:15] Right now I'll run this. Right now let's
[07:07:17] have a glance at the minimum and maximum
[07:07:19] values of the pixels present in the
[07:07:22] training set. So I'll click on run. So
[07:07:24] you see that the minimum pixel value is
[07:07:25] zero and the maximum pixel value is 255.
[07:07:29] So these are basically grayscale images
[07:07:32] and the values of these pixels of the
[07:07:34] grayscale images they range from 0 to 255.
[07:07:37] So now that I've reshaped the data I
[07:07:40] also have to normalize this data. And to
[07:07:42] normalize this data, I'll be dividing X
[07:07:45] train, XCV and X test with 255. So first
[07:07:49] I will convert them to float. And after
[07:07:52] I convert them to float, I'll just
[07:07:54] divide them with 255. So I'll say X
[07:07:56] train by 255 and I'll store it back to X
[07:07:59] train. Similarly, I'll say XCV divided
[07:08:01] by 255 and I'll store it back to XCV.
[07:08:04] And then again, I'll divide X test with
[07:08:06] 255 and I'll store it back to X test. So
[07:08:08] we have also normalized our training
[07:08:10] data. Now after this we have to perform
[07:08:13] one hot encoding of the labels and to do
[07:08:15] this we'll be using
[07:08:16] keras.utils.to_categorical
[07:08:19] and this takes in two parameters. First
[07:08:21] is the object on which we'd have to
[07:08:22] perform one hot encoding and next is the
[07:08:24] value which represents the number of
[07:08:26] classes this object has to be divided
[07:08:29] into. So over here we have given the
[07:08:31] value of 10 to num_digits. So basically
[07:08:34] we want to represent this y train in the
[07:08:37] form of a numpy array where there are 10
[07:08:40] values and each of these individual
[07:08:42] numpy arrays would actually represent the
[07:08:44] label of the image. Right? So we'll do the
[07:08:47] same thing for y train and y cv that is
[07:08:50] for the training set as well as the
[07:08:52] validation set I'll click on run
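The two preprocessing steps, scaling the pixels and one-hot encoding the labels, can be sketched in plain Python. The helper names `normalize` and `one_hot` are made up for this sketch; in the walkthrough the same work is done by dividing the arrays by 255 and by calling `to_categorical`, which returns floating-point arrays rather than the integer lists shown here.

```python
NUM_DIGITS = 10

def normalize(pixels):
    """Scale 0-255 grayscale values into the 0.0-1.0 range."""
    return [float(p) / 255 for p in pixels]

def one_hot(label, num_classes=NUM_DIGITS):
    """Return a list with a 1 at index `label` and 0 everywhere else."""
    encoded = [0] * num_classes
    encoded[label] = 1
    return encoded

print(normalize([0, 51, 255]))  # [0.0, 0.2, 1.0]
print(one_hot(2))               # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(7))               # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```

Because indexing starts at zero, the label two puts its 1 at index 2 and the label seven puts its 1 at index 7.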
[07:08:55] so we have normalized the data and also
[07:08:57] we have performed one hot encoding on
[07:08:59] top of the labels so now let me actually
[07:09:02] have a glance at the labels after
[07:09:04] performing one hot encoding. Right? So
[07:09:06] you see that the first image in the
[07:09:08] training set its label was actually two.
[07:09:12] So now this two has been represented
[07:09:14] like this. So 0 0 1 and then we have
[07:09:17] zeros following it. So this basically
[07:09:20] means that wherever we have one over
[07:09:22] here this represents that index
[07:09:24] position. So this is the zeroth index,
[07:09:27] first index, second index. And since we
[07:09:29] have one at the second index, this
[07:09:31] basically denotes that the label of this
[07:09:33] image is two. And then we have the third
[07:09:35] image from the training set. So over
[07:09:37] here the label is actually seven. And it
[07:09:40] has been represented like this. So we
[07:09:42] have seven zeros
[07:09:44] and then a 1. And then we have two zeros
[07:09:46] following it. So we have one present at
[07:09:48] the seventh index position. And since it
[07:09:51] is present at the seventh index
[07:09:52] position, this basically means that the
[07:09:54] label of this corresponding data is
[07:09:57] seven. Right. So now before we go ahead
[07:10:00] and build our functional models, we'll
[07:10:02] set in all of our parameters first. So
[07:10:05] first I will set the number of features
[07:10:07] that is the number of features are 784
[07:10:09] and I'll store it in an input object.
[07:10:12] And then I'll add the number of nodes to
[07:10:14] be present in each hidden layer. So the
[07:10:16] first hidden layer has 300 nodes. The
[07:10:18] second hidden layer has 100 nodes. Third
[07:10:21] hidden layer again has 100 nodes. And
[07:10:23] the fourth hidden layer has 200 nodes.
[07:10:25] And after this I will have num_digits
[07:10:28] which is basically 10. So this basically
[07:10:30] tells me that the number of digits are
[07:10:32] 10. Right? So now I have set all of my
[07:10:35] parameters. I'll click on run. And this
[07:10:38] is finally time to build my model. So
[07:10:40] we'll be building a bunch of models. So
[07:10:43] we are going to build at least three to
[07:10:44] four models, and we're going to change
[07:10:47] the hyperparameters a bit and we'll
[07:10:49] compare the accuracy of all of these
[07:10:51] models and we'll see which model gives
[07:10:53] us the best accuracy. So we'll go ahead
[07:10:56] and build our first model. So our first
[07:10:59] model over here as we already saw during
[07:11:01] the theory part. First we have to create
[07:11:03] the input layer. So I'll be using the
[07:11:06] input method and I'll pass in the shape
[07:11:08] of it. So the shape is 784.
[07:11:11] So this basically means that there are
[07:11:13] 784 inputs and after that I will create
[07:11:17] my first hidden layer. So over here I
[07:11:20] will use the Dense layer and I will
[07:11:22] pass in n_hidden_1. This basically
[07:11:24] means that there are 300 nodes in the
[07:11:26] first hidden layer and the activation
[07:11:28] function which I'm using for the first
[07:11:30] hidden layer is ReLU and I'll name this
[07:11:33] as hidden layer 1. Now I will give the input layer
[07:11:37] over here. So this basically means that
[07:11:40] this hidden layer one is connected with
[07:11:43] the input layer or input layer is my
[07:11:47] first layer. After that it is followed
[07:11:49] with the hidden layer one. And then I'll
[07:11:52] go ahead and create my second hidden
[07:11:54] layer. Again I'll use the Dense layer.
[07:11:56] I'll pass in n_hidden_2 which is
[07:11:58] basically the number of nodes. So in the
[07:12:00] second hidden layer the number of nodes
[07:12:02] are 100 and the activation function used
[07:12:04] by me is relu. Now after this I'll name
[07:12:07] this as hidden layer 2 and I'll connect
[07:12:10] it with X again. So I am connecting this
[07:12:14] second hidden layer with the first
[07:12:16] hidden layer. Again I'll create the
[07:12:18] third hidden layer. I'll do the same
[07:12:20] things. I'll use the activation function
[07:12:22] as ReLU. I'll name it and then I'll
[07:12:24] connect this with the previous layer
[07:12:25] over here. And then finally I'll create
[07:12:27] the fourth hidden layer and I'll pass in
[07:12:29] the number of nodes. After that I'll set
[07:12:32] the activation function and then I'll
[07:12:34] name it. And then again I will connect
[07:12:36] this fourth hidden layer with the third
[07:12:38] hidden layer. And finally I'll set the
[07:12:41] output layer. So over here output layer
[07:12:43] num_digits, this is basically 10. So this
[07:12:46] basically means that there are 10 nodes
[07:12:47] in the output layer. And the activation
[07:12:49] function used by me is softmax. So
[07:12:52] again I'm using softmax because it'll
[07:12:54] give me a bunch of probabilities for
[07:12:56] each of these nodes. And when I add up
[07:12:58] all of these probabilities this will
[07:13:01] amount to one. And I'm naming this as
[07:13:03] output layer. Again I am connecting this
[07:13:06] output layer to the fourth hidden layer.
[07:13:08] So what I've basically done is first
[07:13:10] I've created the input layer. After that
[07:13:12] I have created four hidden layers and
[07:13:15] I've connected them with each other. And
[07:13:17] finally I've created an output layer and
[07:13:20] connected that output layer with the
[07:13:22] fourth hidden layer over here. So now
[07:13:24] that the connection of all of the hidden
[07:13:26] layers is done, we have to pass the
[07:13:28] input and the output layer inside the
[07:13:30] Model class. So we'll do that and I
[07:13:33] will store that in model and let me have
[07:13:35] a glance at the summary of the model
[07:13:37] which I've just created. Right? So this
[07:13:40] is what I have. I have the input layer
[07:13:42] and then I have four hidden layers and
[07:13:44] then I finally have the output layer.
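The chain of layers just summarized could be written with the functional API roughly as follows. This sketch assumes the `tensorflow.keras` packaging; the function name `build_functional_model`, the constants, and the layer name strings are choices made for illustration and may differ from the notebook shown in the video.

```python
N_INPUT = 784
HIDDEN_UNITS = [300, 100, 100, 200]   # nodes in hidden layers 1 to 4
NUM_DIGITS = 10

def build_functional_model():
    """Chain Input -> four Dense(relu) layers -> Dense(softmax)."""
    # Imported here so the settings above can be read without TensorFlow installed.
    from tensorflow import keras

    inp = keras.Input(shape=(N_INPUT,))
    x = inp
    for i, units in enumerate(HIDDEN_UNITS, start=1):
        # Calling the layer on `x` is what connects it to the previous layer.
        x = keras.layers.Dense(units, activation="relu", name=f"hidden_layer_{i}")(x)
    output = keras.layers.Dense(NUM_DIGITS, activation="softmax", name="output_layer")(x)

    # The model is defined by its first and last tensors.
    return keras.Model(inputs=inp, outputs=output)
```

Unlike the sequential model, each layer is explicitly called on the output of the one before it, which is what makes branching or multi-input architectures possible with the same API.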
[07:13:46] This is the shape of the first input
[07:13:48] layer. So there are 784 nodes or 784
[07:13:52] inputs for the first input layer. And
[07:13:54] then we have four hidden layers. Number
[07:13:56] of nodes in first hidden layer is 300.
[07:13:59] Number of nodes in second hidden layer
[07:14:00] is 100. Number of nodes in third hidden
[07:14:03] layer is 100. Number of nodes in fourth
[07:14:04] hidden layer is 200. And then the output
[07:14:07] layer comprises 10 nodes. Right? I'll
[07:14:09] click on run. So this is where I'll set
[07:14:11] all of the hyperparameters. So I'll
[07:14:13] start off with the learning rate. And
[07:14:15] I'll set the learning rate to be 0.1 and
[07:14:18] I'll keep the number of epochs to be 20
[07:14:20] and the batch size to be 100. And the
[07:14:23] optimizer which I'll be using is SGD or
[07:14:26] in other words I'm using the stochastic
[07:14:28] gradient descent as my optimization
[07:14:30] algorithm. And inside this I'll pass the
[07:14:33] learning rate over here. Right now I'll
[07:14:36] go ahead and compile my model. This is
[07:14:39] where I'm basically tuning my model. So
[07:14:41] this takes in three parameters. First is
[07:14:43] the loss which I have to reduce. Next is
[07:14:46] the optimizer. Next is the metric which
[07:14:48] I have to calculate. So the loss which I
[07:14:51] have to reduce is categorical cross
[07:14:53] entropy and the optimizer used by me is
[07:14:56] SGD as we saw over here. And then the
[07:14:58] metrics which I want to calculate is
[07:15:00] accuracy. So I'm building a model and I
[07:15:03] want to find out the accuracy of the
[07:15:05] model which I built. Right? I'll click
[07:15:08] on run. Now I've built the model. I've also
[07:15:11] compiled the model. So this is when I can
[07:15:13] go ahead and fit the model on top of the
[07:15:15] train set. So this takes in all of these
[07:15:17] parameters. So I'm basically fitting
[07:15:18] this model on top of X train and Y
[07:15:20] train. I'll give the batch size and
[07:15:23] number of epochs over here and I want to
[07:15:26] validate this on top of XCV and YCV
[07:15:30] which is basically my validation data.
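The build, compile, and fit steps described above can be sketched with the Keras functional API. This is a minimal sketch and not the notebook from the video: the random arrays only stand in for the flattened MNIST data, the variable names (`x_train`, `y_train`, `x_cv`, `y_cv`) are assumptions, and the ReLU and softmax activations are taken from what is said later in the walkthrough.

```python
import numpy as np
from tensorflow import keras

# Placeholder data standing in for the flattened 28x28 MNIST images (assumption).
rng = np.random.default_rng(0)
x_train = rng.random((500, 784)).astype("float32")
y_train = keras.utils.to_categorical(rng.integers(0, 10, 500), num_classes=10)
x_cv = rng.random((100, 784)).astype("float32")
y_cv = keras.utils.to_categorical(rng.integers(0, 10, 100), num_classes=10)

# Input layer, four hidden layers (300, 100, 100, 200 nodes), 10-node output layer.
inputs = keras.Input(shape=(784,))
h = keras.layers.Dense(300, activation="relu")(inputs)
h = keras.layers.Dense(100, activation="relu")(h)
h = keras.layers.Dense(100, activation="relu")(h)
h = keras.layers.Dense(200, activation="relu")(h)
outputs = keras.layers.Dense(10, activation="softmax")(h)

# Passing the input and output layers to Model builds the network.
model = keras.Model(inputs=inputs, outputs=outputs)
model.summary()

# Hyperparameters as stated in the walkthrough.
learning_rate = 0.1
n_epochs = 20
batch_size = 100
sgd = keras.optimizers.SGD(learning_rate=learning_rate)

# compile() takes the loss, the optimizer, and the metrics to report.
model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["accuracy"])

# fit() trains on the training set and evaluates on the validation set each epoch.
history = model.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=n_epochs,
    validation_data=(x_cv, y_cv),
    verbose=0,
)
print("final validation accuracy:", history.history["val_accuracy"][-1])
```

On random placeholder data the accuracy stays near chance; the figures quoted in the video come from training on the real dataset.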
[07:15:32] I'll click on run. So my first epoch has
[07:15:35] started. So in my first epoch I see that
[07:15:38] the validation accuracy is 76%.
[07:15:41] Similarly in the second epoch I see that
[07:15:42] the validation accuracy is 87% and this
[07:15:45] goes on. So as the number of epochs
[07:15:47] increases, we see that the accuracy
[07:15:49] increases too. At the 12th epoch the
[07:15:52] accuracy is 94%. At the 14th epoch the
[07:15:56] accuracy is 94.69%.
[07:15:59] So at the end of 20th epoch we see that
[07:16:02] the validation accuracy is 95.82%.
[07:16:06] So we have built our first model and our
[07:16:09] first model gave us an accuracy of
[07:16:10] 95.82%.
[07:16:13] Now I'll go ahead and build another
[07:16:15] model but this time what I'll do is I'll
[07:16:18] not use the stochastic gradient descent
[07:16:20] optimization. So instead of stochastic
[07:16:22] gradient descent I'll be using Adam and
[07:16:25] let's see what difference it gives me in
[07:16:27] the accuracy. So we'll keep this in
[07:16:28] mind. So stochastic gradient descent
[07:16:30] gave us an accuracy of 95.82%.
[07:16:33] So again for the second model I'll give
[07:16:35] the input layer. I'll give all of these
[07:16:37] dense layers and then I'll set the
[07:16:39] output layer. So all of these are same.
[07:16:41] So the input shape that's same 784 the
[07:16:44] number of nodes for these hidden layers
[07:16:46] is same the activation function is also
[07:16:48] same. So the only thing which I'm
[07:16:50] changing in the second model is the
[07:16:53] optimization algorithm. So over here
[07:16:56] I'll be using Adam from keras.optimizers
[07:16:59] and I'll be passing the learning rate
[07:17:00] and I'll store this in Adam and I'll go
[07:17:04] ahead and compile this. So while compiling
[07:17:06] I'll set the loss to be categorical
[07:17:09] cross entropy and the one change which
[07:17:12] I'm making is the optimizer, which is
[07:17:15] now Adam. Now I'll click on run
[07:17:18] right and I'll go ahead and fit this
[07:17:20] model on top of the train set and I want
[07:17:23] to validate it on top of XCV and YCV.
[07:17:26] I'll click on run. Right. So the first
[07:17:29] epoch has started. Right. So we're done
[07:17:31] with the 20th epoch and at the end of
[07:17:33] the 20th epoch we get a validation
[07:17:36] accuracy of 97.92%.
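Swapping the optimizer while keeping everything else fixed could look like the sketch below. The helper `build_mlp` and the name `model_2` are hypothetical, introduced only for illustration; the layer sizes and the learning rate of 0.1 are the ones stated in the walkthrough.

```python
from tensorflow import keras


def build_mlp():
    """Rebuild the same architecture: 784 -> 300 -> 100 -> 100 -> 200 -> 10."""
    inputs = keras.Input(shape=(784,))
    h = inputs
    for units in (300, 100, 100, 200):
        h = keras.layers.Dense(units, activation="relu")(h)
    outputs = keras.layers.Dense(10, activation="softmax")(h)
    return keras.Model(inputs=inputs, outputs=outputs)


# Only the optimizer changes: Adam instead of SGD, same learning rate as before.
learning_rate = 0.1
adam = keras.optimizers.Adam(learning_rate=learning_rate)

model_2 = build_mlp()
model_2.compile(loss="categorical_crossentropy", optimizer=adam, metrics=["accuracy"])
```

For reference, the Keras default learning rate for Adam is 0.001; the value 0.1 used here is simply the one quoted in the video.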
[07:17:38] So this is a considerable increase. So
[07:17:41] when we used stochastic gradient
[07:17:43] descent as the optimization algorithm we
[07:17:45] got an accuracy of 95.82%,
[07:17:48] and when we used Adam instead of SGD
[07:17:51] the accuracy increased to 97.92%. So
[07:17:54] now we've seen that, for this model,
[07:17:56] Adam works better than SGD. Now we'll
[07:17:59] go ahead and build some other models
[07:18:00] where we'll change something else. So
[07:18:02] this time what we're going to do is
[07:18:04] we're going to keep the same
[07:18:05] optimization algorithm which is Adam
[07:18:08] but we'll change the learning rate. So
[07:18:10] initially the learning rate was 0.1. Now
[07:18:13] instead of keeping it 0.1 I'm changing
[07:18:15] it to 0.01. Rest everything will be the
[07:18:18] same. So I'll click on run over here
[07:18:22] and then I'll go ahead and fit this
[07:18:24] model on top of X train and Y train and I'll
[07:18:26] validate it on top of XCV and YCV. So
[07:18:29] the training has started as we can see
[07:18:31] over here. Right? So we're done with the
[07:18:33] 20th epoch and we get an accuracy of
[07:18:35] 97.60%.
[07:18:37] So let's compare it with the previous
[07:18:38] model. So in the previous model when the
[07:18:41] learning rate was 0.1 the accuracy was
[07:18:43] 97.92%
[07:18:45] and this time it's 97.60%.
[07:18:48] There's not much of a difference but we
[07:18:49] see that when the learning rate is 0.1
[07:18:52] we get a slightly better accuracy than when
[07:18:54] the learning rate is 0.01.
[07:18:57] Now similarly again let's try it out
[07:18:59] with a different learning rate. So this
[07:19:00] time I am setting the learning rate to
[07:19:02] be 0.5.
[07:19:04] Again rest everything would be the same.
[07:19:06] So we have the input we have the same
[07:19:08] hidden layers which comprise of the same
[07:19:10] number of nodes and then we have the
[07:19:12] output layer over here and also the
[07:19:14] optimization algorithm is also the same.
[07:19:16] I'll click on run. Right now I'll fit
[07:19:18] the model on top of X train and Y train
[07:19:21] and validate it on top of XCV and YCV.
[07:19:23] So we're done with the 20th epoch and we
[07:19:25] get an accuracy of 97.7%. So when we
[07:19:28] compare these three models we see that
[07:19:30] we get the highest accuracy when the
[07:19:32] learning rate is 0.1. Right now what
[07:19:35] we'll do is we'll add another layer. So
[07:19:38] till now we just had four hidden layers.
[07:19:40] So I'll add another hidden layer which
[07:19:42] comprises 200 nodes. So again the
[07:19:44] input layer comprises 784 inputs.
[07:19:46] We have the first hidden layer
[07:19:48] comprising of 300 nodes. Second hidden
[07:19:50] layer comprising of 100 nodes. Third
[07:19:52] hidden layer comprising of 100 nodes
[07:19:53] again. Fourth again 100 nodes. Fifth
[07:19:56] would contain 200 nodes. And then we
[07:19:58] finally have the output layer which
[07:20:00] comprises 10 nodes. Right? I'll click
[07:20:03] on run. And again for all of these five
[07:20:05] hidden layers the activation function is
[07:20:07] ReLU. And each of these is connected
[07:20:09] with each other. So input is connected
[07:20:12] to first hidden layer. This is connected
[07:20:13] to second. Second is connected to third.
[07:20:15] Third is connected to fourth. Fourth is
[07:20:17] connected to fifth. And this fifth
[07:20:19] hidden layer is connected to the output
[07:20:21] layer over here. And the activation
[07:20:23] function used for the output layer is
[07:20:24] softmax. Right? I'll click on run.
[07:20:29] So we have created the input data and we
[07:20:31] have also connected the layers. Now it's
[07:20:32] time to build the model. And to build
[07:20:34] the model all we have to do is pass in
[07:20:36] the input layer and the output layer.
[07:20:38] And then I'll have a glance of the
[07:20:40] summary of the model which I've built.
[07:20:42] So I'll click on run. Right? So I have
[07:20:44] the input layer and then I have five
[07:20:46] hidden layers and then I have the output
[07:20:48] layer and these are the number of nodes
[07:20:50] corresponding to each of these hidden
[07:20:52] layers. So now again it's time to
[07:20:54] compile our model. So I'll be using
[07:20:56] the same Adam algorithm and I'll set
[07:20:58] the learning rate to be 0.01. And inside
[07:21:01] this compile method I'll set these
[07:21:03] three arguments. So the loss which I have
[07:21:05] to reduce is categorical cross
[07:21:07] entropy, the optimizer is Adam, and the
[07:21:10] metric which I have to calculate is
[07:21:12] accuracy. All right and then I'll fit
[07:21:14] this model on top of training set and
[07:21:16] validate this on the validation set.
[07:21:19] Right? So we're done with all of the 20
[07:21:20] epochs and this time we get a validation
[07:21:22] accuracy of 97.31.
[07:21:25] So we see that actually this is the
[07:21:27] lowest accuracy of all of the models
[07:21:28] which we've built. So this means that
[07:21:30] adding another hidden layer did not
[07:21:32] really improve the accuracy of our
[07:21:34] model. So now we'll do something
[07:21:35] different. So what we'll do is we'll add
[07:21:37] dropouts after each hidden layer. So
[07:21:40] again we have the input layer over here
[07:21:42] and we have four hidden layers. These
[07:21:44] are the number of nodes corresponding to
[07:21:46] each of these hidden layers. And then
[07:21:48] this is the number of nodes for the
[07:21:49] output layer. I'll click on run. Right.
[07:21:52] So over here I'll start off by creating
[07:21:54] the input layer. I'll set the shape and
[07:21:56] the shape is 784. And then I'll create
[07:21:59] the first hidden layer. So the first
[07:22:01] hidden layer the number of nodes is
[07:22:02] 300. The activation function is ReLU.
[07:22:04] And after this I will set a dropout
[07:22:08] layer. So the dropout rate of this
[07:22:09] layer is 0.3. And then I'll
[07:22:12] create the second hidden layer. And I'll
[07:22:14] add a dropout to the second hidden
[07:22:16] layer. And then I'll create the third
[07:22:18] hidden layer. Again I'll add a dropout
[07:22:20] to the third hidden layer. And then I'll
[07:22:23] create the final hidden layer. And after
[07:22:25] the final hidden layer, I'll create the
[07:22:27] output layer. And again, this output
[07:22:30] layer would comprise of 10 nodes. And
[07:22:31] the activation function used is softmax.
[07:22:34] And this output layer is connected to
[07:22:36] the final fourth hidden layer over here.
[07:22:38] Right? So now that we've established the
[07:22:40] input layer and all of these hidden
[07:22:42] layers, let's go ahead and create a
[07:22:44] model. So all I have to do is pass in
[07:22:45] the input and output layers inside the
[07:22:48] model method. And I'll store this in
[07:22:50] model 4 instance. And I'll have a glance
[07:22:53] at the summary. Right? So we have the
[07:22:55] input layer, hidden layer and a dropout
[07:22:58] layer after each of the hidden layers.
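A network with dropout layers like the one just described could be sketched as follows. The hidden-layer sizes are an assumption (the video only says "these are the number of nodes"), reused here from the first model; the dropout rate of 0.3 after the first three hidden layers and the name `model_4` follow the narration.

```python
from tensorflow import keras

inputs = keras.Input(shape=(784,))

# Each of the first three hidden layers is followed by a Dropout layer (rate 0.3).
h = keras.layers.Dense(300, activation="relu")(inputs)
h = keras.layers.Dropout(0.3)(h)
h = keras.layers.Dense(100, activation="relu")(h)
h = keras.layers.Dropout(0.3)(h)
h = keras.layers.Dense(100, activation="relu")(h)
h = keras.layers.Dropout(0.3)(h)

# Final hidden layer feeds the 10-node softmax output directly.
h = keras.layers.Dense(200, activation="relu")(h)
outputs = keras.layers.Dense(10, activation="softmax")(h)

model_4 = keras.Model(inputs=inputs, outputs=outputs)
model_4.summary()
```

Dropout layers add no trainable parameters, so the parameter count in the summary matches the model without dropout.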
[07:23:00] So now that we've built the model, let's
[07:23:02] again compile it. So the loss which we
[07:23:05] have to reduce is categorical cross
[07:23:07] entropy, the optimizer is Adam and the
[07:23:09] metric which we've used is accuracy and
[07:23:11] let me fit the model on top of training
[07:23:13] set and validate it on top of XCV and
[07:23:16] YCV. I'll again click on run. Right? So
[07:23:18] we're done with all of the 20 epochs and
[07:23:20] this time we get a validation accuracy
[07:23:22] of 97.63%. Right? So what we've done
[07:23:25] is we've basically built multiple models
[07:23:27] and we were just trying to understand
[07:23:29] how changing some of these
[07:23:31] hyperparameters changes the accuracy that we
[07:23:33] get. So now what we'll do is we'll use
[07:23:35] this final model that is the model 4
[07:23:38] which we built and we are going to
[07:23:40] predict the values on top of the test
[07:23:42] set with this model 4 which we have
[07:23:45] created. So I'm going to use the predict
[07:23:47] method from this model 4 instance. So
[07:23:49] I'll just type in model 4.predict
[07:23:51] and over here this takes in two
[07:23:53] parameters. So the first parameter is
[07:23:55] the data on which we want to predict the
[07:23:57] results which is X test and then we'll
[07:23:59] set the batch size which is 200. So
[07:24:01] we'll basically predict the values on
[07:24:03] top of the test set and I'm storing it
[07:24:05] in test pred. All right. So now let's
[07:24:08] actually have a glance at the predicted
[07:24:10] results. So I'll have a glance at the
[07:24:11] head of this.
[07:24:14] So this is the image ID and this is the
[07:24:16] label which we have predicted. So the
[07:24:18] first image we predicted that this image
[07:24:20] represents number two. Similarly the
[07:24:22] second image we predicted that this
[07:24:23] image represents number zero. The third
[07:24:25] image represents number nine. Fourth
[07:24:27] image represents number nine again. And
[07:24:29] fifth image represents number three.
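A softmax model's `predict` call returns one row of ten class probabilities per image, so turning that into the image ID and label table shown here takes an argmax step. The sketch below illustrates that step plus the CSV export that follows. The probability values are made-up placeholders arranged so that the labels match the five read out in the video; the column names `ImageId` and `Label`, the IDs starting at 1, and the file name are assumptions.

```python
import numpy as np
import pandas as pd

# Placeholder for the output of predict(): shape (n_images, 10), rows sum to 1.
labels_read_out_in_video = [2, 0, 9, 9, 3]
probabilities = np.full((5, 10), 0.01)
probabilities[np.arange(5), labels_read_out_in_video] = 0.91

# The predicted digit is the index of the largest probability in each row.
predicted_labels = probabilities.argmax(axis=1)

# Assemble the table of image IDs and predicted labels.
test_pred = pd.DataFrame({
    "ImageId": np.arange(1, len(predicted_labels) + 1),
    "Label": predicted_labels,
})
print(test_pred.head())

# Write the table to a CSV file without the DataFrame index column.
test_pred.to_csv("mnist_submission.csv", index=False)
```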
[07:24:31] Right? So now what I'll do is I'll take
[07:24:34] these predicted results and I'll create
[07:24:36] a CSV file out of this. So test_pred.to_csv.
[07:24:40] This helps me to create a CSV file
[07:24:43] and I'll name that file MNIST
[07:24:46] submission.csv. So now we'll start by
[07:24:49] understanding how our computer actually
[07:24:51] reads the images that we provide to it.
[07:24:54] So let's say we have this image of three
[07:24:56] really cute dogs. So we as humans when
[07:24:58] we see this image, we understand that
[07:25:00] there are three really cute dogs in this
[07:25:02] image and there is grass beneath these
[07:25:04] dogs, right? But then again, if I feed
[07:25:07] this image to a computer, what will it
[07:25:09] exactly see? Well, so there'll be three
[07:25:11] channels, which together are known as the
[07:25:13] RGB channels. So there is this red
[07:25:16] channel, green channel and the blue
[07:25:18] channel and each of these three channels
[07:25:21] would have its own corresponding pixel
[07:25:23] values. Now when I say that the image
[07:25:26] resolution is 32 × 32 × 3, I basically
[07:25:30] mean that there are 32 rows, 32 columns
[07:25:34] and three channels. That is the red
[07:25:36] channel would have its own 32 × 32
[07:25:40] matrix. The green channel would have its
[07:25:42] own 32 × 32 matrix and the blue
[07:25:44] channel would have its own 32 ×
[07:25:47] 32 matrix and this is what a computer
[07:25:50] basically sees. So it is basically a
[07:25:52] list of pixel values. So these are
[07:25:54] basically numbers which are fed into the
[07:25:57] computer and this is what the computer
[07:25:59] reads. Now we'll go ahead and understand
[07:26:02] what exactly is the problem with a fully
[07:26:04] connected network. So let's say I have
[07:26:06] this image whose dimensions are 28 × 28
[07:26:10] × 3. Now when I feed in this image to a
[07:26:13] fully connected network, each neuron in the first
[07:26:15] hidden layer would basically have 28 ×
[07:26:18] 28 × 3 weights. That would amount to
[07:26:21] 2,352
[07:26:23] weights. Now that is a whole lot of
[07:26:25] weights, isn't it? So dealing with all
[07:26:27] of these weights would be sort of a
[07:26:29] difficult task for this network. And to
[07:26:31] be honest, this resolution is actually
[07:26:33] low. So 28 × 28 is actually a small
[07:26:36] resolution image. Now what I'll do is
[07:26:39] instead of this 28 × 28 image, I'll feed
[07:26:42] in an image whose dimensions are 200 ×
[07:26:45] 200 × 3. And when I multiply 200 by
[07:26:48] 200 by 3, I'll get 120,000 weights per neuron.
[07:26:53] And those are a whole lot of weights,
[07:26:55] aren't they? And dealing with all of
[07:26:57] these weights would be a really Herculean
[07:27:00] task for this fully connected network.
[07:27:02] And this is exactly where the fully
[07:27:04] connected network fails, because we'll have
[07:27:06] several such layers of neurons leading
[07:27:08] to a very large number of parameters. So this full
[07:27:11] connectivity would be wasteful, as the
[07:27:13] huge number of parameters would actually
[07:27:15] lead to overfitting and we do not want
[07:27:18] any sort of overfitting for our model.
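The weight counts quoted above are simply the number of values in the flattened image, since a fully connected neuron needs one weight per input value. A small NumPy sketch of that arithmetic follows; the helper name `weights_per_neuron` and the 300-neuron layer used for the total are hypothetical, chosen only to show how the count scales.

```python
import numpy as np


def weights_per_neuron(height, width, channels):
    """One weight per input value: flatten the image and count its entries."""
    image = np.zeros((height, width, channels))
    return image.reshape(-1).size


small = weights_per_neuron(28, 28, 3)    # 28 * 28 * 3
large = weights_per_neuron(200, 200, 3)  # 200 * 200 * 3
print(small, large)

# Hypothetical first hidden layer with 300 neurons, ignoring biases.
hidden_neurons = 300
print(large * hidden_neurons)
```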
[07:27:20] So now that we understand what exactly
[07:27:22] is the problem with a fully connected
[07:27:23] network, let's see how can a
[07:27:26] convolutional neural network solve this.
[07:27:28] All right. So CNNs are basically like
[07:27:31] your normal neural networks and are made
[07:27:33] up of neurons with weights and biases
[07:27:36] and they take inputs and those inputs
[07:27:39] are passed through a weighted sum and then
[07:27:41] passed through an activation
[07:27:42] function and then we finally get an
[07:27:45] output. But then again the difference
[07:27:47] between a CNN and a normal fully
[07:27:50] connected network is this. So in a CNN
[07:27:52] let's say if we take a neuron in one
[07:27:54] particular layer then this neuron will
[07:27:57] not be connected with all of the inputs
[07:28:00] of the previous layer. So let me
[07:28:02] actually go back over here and let's
[07:28:04] assume that this is a convolutional
[07:28:05] neural network. Then this neuron would
[07:28:07] only be associated with let's say the
[07:28:10] first and the second neuron. Similarly
[07:28:12] if I take the second neuron over here
[07:28:14] and then let's say this would be
[07:28:15] associated with this fourth and the
[07:28:17] fifth neuron. Similarly, if I take this
[07:28:19] neuron over here, then this could be
[07:28:21] associated with the first and the third
[07:28:23] neurons from the first layer. Right? So,
[07:28:25] this is what basically happens in a
[07:28:27] convolutional neural network. So, a
[07:28:29] neuron in one layer is not connected to
[07:28:32] all of the neurons in the previous
[07:28:34] layer, right? And rest of the things are
[07:28:37] actually pretty much same when it comes
[07:28:38] to the convolutional neural networks. So
[07:28:41] another difference when it comes to a
[07:28:42] convolutional neural network is that a
[07:28:45] convolutional net actually perceives
[07:28:47] these images as volumes that is
[07:28:49] three-dimensional objects rather than as
[07:28:52] flat canvases to be measured only by
[07:28:54] width and height. Right? So now that we
[07:28:58] understand what exactly is a
[07:28:59] convolutional neural network, let's see
[07:29:01] what convolution means. So in simple
[07:29:04] terms, convolution basically means to
[07:29:07] roll together or to combine together.
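For reference, the mathematical definition that the explanation goes on to describe in words can be written out as below. The first line is the standard continuous convolution of two functions f and g. The second line is the discrete two-dimensional form that image filters use, where the symbols I (image), K (filter or kernel), and S (output map) are notation introduced here for illustration and are not used in the video. Deep learning layers typically compute this second form without flipping the kernel, which is strictly a cross-correlation.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau

S(i, j) = \sum_{m} \sum_{n} I(i + m,\; j + n)\, K(m, n)
```

In both forms the operation is "multiply, then add up over the overlap", which is a slightly fuller picture than multiplication alone.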
[07:29:09] And when it comes to the mathematical
[07:29:11] perspective, let's say we have two
[07:29:13] functions f and g. Now when we pass
[07:29:17] one of these functions over the other,
[07:29:20] convolution basically measures how much one
[07:29:22] function overlaps with the other
[07:29:25] function. So let's take this image over
[07:29:27] here. So this is our first function
[07:29:29] which is f and this is our second
[07:29:30] function which is g. Now when I pass g
[07:29:32] over f, this is where the
[07:29:35] overlapping happens and this is what we
[07:29:37] exactly mean by convolution. So
[07:29:40] convolution is the integral
[07:29:42] that measures
[07:29:44] how much two functions overlap as one
[07:29:47] passes over the other. And again in
[07:29:48] mathematical terms you can consider
[07:29:50] convolution to be mixing two functions
[07:29:53] by multiplying them. So we have f and g
[07:29:56] and we are mixing them by multiplying f
[07:29:58] with g. So this is what convolution
[07:30:00] basically means. So now we'll actually
[07:30:02] have a brief understanding about CNN and
[07:30:05] then go through a use case. So what
[07:30:07] actually happens in a convolutional
[07:30:09] neural network is there are many filters
[07:30:12] and each such filter is passed over the
[07:30:15] entire image. So let's say we have this
[07:30:18] image of a car and we want to identify
[07:30:21] or classify this as a car among many
[07:30:23] such classes. So these are the different
[07:30:25] classes available to us. So there is
[07:30:27] horse, ship, airplane, truck and car.
[07:30:30] and we want to label this properly as a
[07:30:33] car. Now when we feed in this image of a
[07:30:35] car to this convolutional neural
[07:30:36] network, what basically happens is there
[07:30:38] are different filters which are passed
[07:30:41] through this image. So let's say there
[07:30:43] is one filter which identifies or maps
[07:30:46] all of the horizontal lines and then
[07:30:48] there is a second filter which
[07:30:50] identifies all of the vertical lines and
[07:30:52] then similarly there is a third filter
[07:30:54] which identifies all of the left
[07:30:56] diagonals and there is another filter
[07:30:58] which identifies all of the right
[07:31:00] diagonals. Now what happens is a
[07:31:02] convolutional neural network takes all
[07:31:04] of these filters and a slice of the
[07:31:07] image's feature space and maps them one by
[07:31:10] one. That is, it basically creates a map
[07:31:12] of each place where the feature occurs.
[07:31:15] So we take this filter, map it on top of
[07:31:17] the image's feature space and basically
[07:31:20] understand if there is a match between
[07:31:22] the filter and the image's feature space
[07:31:24] or not. And a convolutional neural
[07:31:26] network basically consists of these
[07:31:29] four layers. So there is a convolution
[07:31:31] layer, there is a ReLU layer, there is a
[07:31:33] pooling layer and the final layer is the
[07:31:35] fully connected layer. All right. Now
[07:31:37] we'll actually go through a use case to
[07:31:39] understand CNNs.
[07:31:43] So let's say our task is to identify an image. So
[07:31:46] let's say we have this image of X and an
[07:31:48] image of O and we want to identify what
[07:31:50] exactly is this. So when we pass this
[07:31:52] image of X through the CNN, we want to
[07:31:55] identify or label this as X. And
[07:31:57] similarly when we pass this image of O
[07:31:59] through the CNN, we want to identify or
[07:32:01] label this as O. Now
[07:32:04] this might actually seem simple. So if
[07:32:06] there is just one X or one O, it would
[07:32:08] be quite simple. You would just have to
[07:32:10] pass in this image of X through the CNN
[07:32:13] and it will easily identify that this is
[07:32:15] X. And it would be the same case with
[07:32:18] letter O as well. But then again let's
[07:32:20] say we are dealing with sort of a tricky
[07:32:22] situation. Now let's say we have all of
[07:32:25] these images of X. So this is our
[07:32:27] original image of X. So here what we are
[07:32:30] doing is we are thickening it up and
[07:32:32] over here we are actually shifting it up to
[07:32:33] the top left corner and over here we are
[07:32:36] distorting this a bit but then again all
[07:32:39] of these are nothing but images of X and
[07:32:42] when we feed all of these images to a
[07:32:44] neural network we want that neural
[07:32:46] network to properly identify these as X
[07:32:49] right? And similar is the case with O as
[07:32:54] well. So all of these are distorted
[07:32:56] images of O so when I pass these images
[07:32:58] of O to a neural network. I want that
[07:33:01] neural network to properly identify
[07:33:03] these images as the letter O. So now to
[07:33:06] do this, let's actually understand how a
[07:33:08] computer reads an image. So as we've
[07:33:11] already seen, a computer basically reads
[07:33:13] an image like a two-dimensional array of
[07:33:16] pixels with a number in each position.
[07:33:19] So this is what a computer basically
[07:33:21] sees, right? So let's say when we feed
[07:33:23] in this image of letter X to the
[07:33:25] computer, this is what it basically
[07:33:26] sees, right? So these are all of the
[07:33:28] numbers which are present in these
[07:33:30] pixels. So we have minus ones all over
[07:33:32] here in the blue spaces and ones over
[07:33:35] here in the white spaces, right? So let's
[07:33:37] say this is our original image of X and
[07:33:39] this is our distorted image of X and
[07:33:41] it's pretty much the same. Right? So
[07:33:42] wherever we have the white pixels,
[07:33:46] the value in the pixel is +1
[07:33:49] and wherever we have blue pixels,
[07:33:51] we have the value of -1. Now let's
[07:33:54] say if we use the normal comparison and
[07:33:56] compare each of these pixel values,
[07:33:58] right? So let's say we compare this
[07:34:00] pixel value with this pixel value.
[07:34:02] Similarly, this pixel value with this
[07:34:04] pixel value and this pixel value over
[07:34:06] here with this pixel value. But what
[07:34:07] happens with that sort of comparison is
[07:34:09] we would get a lot of pixels that do not match,
[07:34:11] which actually means that this is not
[07:34:13] the optimal way of image classification
[07:34:15] since it requires exactly the same
[07:34:17] images to classify. Right? So this is
[07:34:19] the original image. This is a distorted
[07:34:21] image and these orange pixels which you
[07:34:23] see so these are basically all of the
[07:34:24] values which are not there in the
[07:34:26] original image and this is not really
[07:34:28] the right way to classify our image. So
[07:34:31] this is where we can use a convolutional
[07:34:33] neural network. So let's understand how
[07:34:34] a CNN solves this problem. So what
[07:34:36] actually happens in a CNN is we compare
[07:34:38] these images patch by patch, or in
[07:34:41] other words we compare them with respect
[07:34:43] to these pieces or features. So earlier
[07:34:46] I had talked of something known as a
[07:34:47] filter. So these boxes which you see
[07:34:50] these are those filters over here. So
[07:34:52] what I'll do is I will start off by
[07:34:54] creating a random filter right now and
[07:34:57] this random filter is nothing but an
[07:34:59] image of a left diagonal. Now what I'll
[07:35:02] do is I will map this filter on top of
[07:35:05] this original image and find out if
[07:35:07] there is any match between this filter
[07:35:09] and this new image. Similarly this is my
[07:35:12] second filter. Now this second filter is
[07:35:14] basically a cross. Now what I'll do is I
[07:35:16] will take this filter and map this on
[07:35:19] top of this input image and again I'll
[07:35:21] see if there is any sort of match or
[07:35:23] not. Again I'll do the same thing. I'll
[07:35:25] take this third filter which is
[07:35:27] basically the image of a right diagonal.
[07:35:29] And I will map this on top of this input
[07:35:31] image and I'll again see if there is any
[07:35:33] sort of match or not. So this is what is
[07:35:35] known as feature matching. And what I'm
[07:35:37] actually doing is creating some random
[07:35:40] filters and mapping those random filters
[07:35:42] on top of this new image to see if there
[07:35:45] is any match with respect to that new
[07:35:47] image or not. And each of these features
[07:35:49] is sort of like a mini image. You can
[07:35:51] consider this to be a small
[07:35:52] two-dimensional array of values. And as
[07:35:55] I had already said, we'll be using these
[07:35:57] three filters. So this is our first
[07:35:58] filter which basically represents a left
[07:36:00] diagonal. So all of these white values
[07:36:03] they have been represented with positive
[07:36:05] one and all of these black values they
[07:36:07] have been represented with minus one.
[07:36:09] And then we have the second filter which
[07:36:11] basically represents a cross. So we have
[07:36:13] this left diagonal and right diagonal
[07:36:15] which are basically the white spaces and
[07:36:17] those white spaces have been represented
[07:36:18] with positive one and the rest of the
[07:36:20] black spaces have been represented with
[07:36:22] minus one. And then finally we have this
[07:36:25] third filter over here which actually
[07:36:26] represents the right diagonal. So over
[07:36:29] here the values in the right diagonal or
[07:36:31] the white spaces have been represented
[07:36:33] with + one and the rest of the values
[07:36:35] have been represented with minus one. So
[07:36:38] again what we'll do is pass these
[07:36:40] filters on top of this new image and see
[07:36:43] if there's a match or not. So now we'll
[07:36:45] start off by understanding each of these
[07:36:47] layers one by one and we'll start with a
[07:36:50] convolutional layer. So as I had already
[07:36:52] said when presented with a new image the
[07:36:55] CNN doesn't really know where exactly
[07:36:57] these features will match. So what it
[07:36:59] does is it tries them everywhere in
[07:37:01] every possible position. So in
[07:37:04] calculating the match of a feature
[07:37:05] across the whole image, the features act as
[07:37:08] filters, and the math which is used to
[07:37:10] perform this is called convolution,
[07:37:12] which we've already seen, right? So
[07:37:14] basically you can consider this to be a
[07:37:16] multiplication of two functions. So
[07:37:19] let's say you have a function f and you
[07:37:20] have a function g and when you multiply
[07:37:23] these two functions you'll have the
[07:37:25] overlapping part and this overlapping
[07:37:27] part is what is known as the
[07:37:28] convolution. So there are four steps
[07:37:30] involved in this convolutional layer. So
[07:37:32] in the first step what we do is we line
[07:37:34] up the feature or the filter on top of
[07:37:36] the image and then we multiply each
[07:37:39] image pixel by the corresponding feature
[07:37:42] pixel. Now once we do that we'll add all
[07:37:45] of the values and find the sum and
[07:37:47] finally we'll divide the sum by the
[07:37:49] total number of pixels in the feature.
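The four steps just listed (line up, multiply, add, divide) can be sketched for a single image patch with NumPy. The function name `match_score` is hypothetical, and the 3 × 3 left-diagonal feature uses +1 for the diagonal and -1 elsewhere, as described for the filters earlier.

```python
import numpy as np


def match_score(patch, feature):
    """Score how well a feature matches one image patch of the same shape."""
    # Step 1 (line up) is implied by passing a patch with the feature's shape.
    products = patch * feature     # Step 2: multiply corresponding pixels.
    total = products.sum()         # Step 3: add up all the products.
    return total / feature.size    # Step 4: divide by the number of feature pixels.


left_diagonal = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

# A patch identical to the feature is a perfect match.
perfect = match_score(left_diagonal, left_diagonal)
print(perfect)
```

Because every pixel is either +1 or -1, each matching pixel contributes +1 and each mismatching pixel contributes -1, so the score ranges from -1 to 1.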
[07:37:51] So let's understand these four steps
[07:37:53] with the same example over here. So
[07:37:56] we've got this filter over here and this
[07:37:58] is the filter of the left diagonal. Now
[07:38:00] what I'll do is I will place or map this
[07:38:04] filter on top of this image over here.
[07:38:06] Now once the mapping is done, what I
[07:38:08] have to do is multiply these
[07:38:10] corresponding pixel values. Now let's
[07:38:13] say if I take this pixel value at the
[07:38:14] bottom right hand corner. So what I have
[07:38:17] to do is just multiply these two pixel
[07:38:19] values and I'll get a positive one. Now
[07:38:22] this is just for this pixel value. So
[07:38:24] I'll do this for all of the pixel
[07:38:26] values. So this is what basically
[07:38:28] happens. I have my filter over here or
[07:38:30] my feature over here and I am placing
[07:38:32] this on top of this image space. So
[07:38:35] obviously I can't place this small
[07:38:37] filter on top of the entire image. I can
[07:38:40] place this only on one region of this
[07:38:42] image. So I'm taking this filter and I'm
[07:38:45] placing this over this region. Now this
[07:38:48] is the first step. In the second step
[07:38:50] what we do is we multiply the
[07:38:52] corresponding values. So when we do that
[07:38:54] what we get is 1 × 1 which is 1 and then
[07:38:56] we'll do -1 × -1 which is again
[07:38:59] +1. Again it's -1 × -1 which is +1 again
[07:39:02] and then we'll do -1 × -1 which
[07:39:04] is +1 again. So we are basically
[07:39:06] multiplying all of these corresponding
[07:39:08] values and we'll get ones over here. And
[07:39:11] after this what we'll do is we'll divide
[07:39:12] the sum by the number of pixels in this
[07:39:15] cell and that is 9. So when we add this
[07:39:17] up this is 9 and then we divide it by 9
[07:39:20] again. So we'll get a final result of
[07:39:22] one. So going ahead what we'll do is
[07:39:24] we'll take this one and place this at
[07:39:27] the center of this frame. Now similarly
[07:39:30] I'll take the same filter and place this
[07:39:33] over this region of the image. So this
[07:39:35] is the fourth step and again in the
[07:39:37] second step what we'll do is multiply
[07:39:38] the corresponding values. So 1 + 1 is 1
[07:39:41] again -1 + -1 is + one. Now over here
[07:39:45] when you multiply minus1 with + one what
[07:39:48] you'll get is a minus1 over here. Right?
[07:39:51] And similarly when you take this value
[07:39:52] over here it is minus1 and it is + one
[07:39:55] over here. So when you multiply one with
[07:39:57] minus1 you'll again get minus1 over
[07:40:00] here. So in the summation process you
[07:40:02] basically have two -1's and seven
[07:40:05] +1's, which add up to 5. So when you divide
[07:40:07] this sum by the 9 pixels what you get is a
[07:40:09] final value of roughly 0.55.
[07:40:12] So again you'll take this value of 0.55
[07:40:15] and place this over here at the center
[07:40:17] of the frame. And you'll repeat that
[07:40:18] process for all of the regions over
[07:40:20] here. Right? So you'll take this filter,
[07:40:23] place it over here, multiply the
[07:40:24] corresponding pixel values, sum it up
[07:40:27] and divide it by the number of pixels
[07:40:29] and place the final value at the center.
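
The multiply, add and divide steps described above can be sketched in a few lines of plain Python. This is only an illustration: the 3 × 3 filter values, the two sample patches and the function names `match_score` and `slide_filter` are assumptions made for the example, not taken from the slides. Dividing by the pixel count follows the lecture; convolution layers in common deep learning libraries usually skip that normalisation.

```python
# Illustrative 3x3 "left diagonal" filter; the exact pixel values are assumed.
DIAGONAL_FILTER = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

def match_score(patch, kernel):
    """Multiply corresponding pixels, add the products, divide by the pixel count."""
    products = [p * k
                for patch_row, kernel_row in zip(patch, kernel)
                for p, k in zip(patch_row, kernel_row)]
    return sum(products) / len(products)

def slide_filter(image, kernel):
    """Score every position where the kernel fits entirely inside the image."""
    size = len(kernel)
    scores = []
    for top in range(len(image) - size + 1):
        row = []
        for left in range(len(image[0]) - size + 1):
            patch = [r[left:left + size] for r in image[top:top + size]]
            row.append(match_score(patch, kernel))
        scores.append(row)
    return scores

# A region identical to the filter: nine products of +1.
exact_patch = [row[:] for row in DIAGONAL_FILTER]
# A region that disagrees with the filter in two pixels: seven +1 and two -1.
near_patch = [
    [ 1, -1, -1],
    [-1,  1,  1],
    [-1,  1,  1],
]

print(match_score(exact_patch, DIAGONAL_FILTER))
print(match_score(near_patch, DIAGONAL_FILTER))
```

The first score is 9/9 and the second is 5/9, which is the value the lecture quotes as 0.55.
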
[07:40:31] Similarly, you'll take this filter,
[07:40:33] place it over the second region, take
[07:40:34] this filter again, place it over some
[07:40:36] other region. Take this filter again and
[07:40:38] place it in yet another region. Right?
[07:40:40] And finally, you'll have this set of
[07:40:42] values after passing this filter through
[07:40:44] the entire image over here. Right? But
[07:40:47] then again, that was just the first
[07:40:49] filter. When you notice we actually had
[07:40:51] three filters. So there was a filter
[07:40:53] which basically denoted the left
[07:40:54] diagonal. There was a second filter
[07:40:56] which basically denoted the cross. And
[07:40:58] there was a third filter which basically
[07:41:00] denoted the right diagonal. Now we were
[07:41:03] just done with the first filter. That is
[07:41:05] we passed the first filter over the
[07:41:08] entire image and we got this
[07:41:09] two-dimensional array. Now similarly
[07:41:12] we'll pass the second filter on top of
[07:41:14] this entire image and we'll get this
[07:41:16] modified two-dimensional array. Again
[07:41:18] similarly we'll pass this third filter
[07:41:20] on top of this image and again we'll get
[07:41:23] this modified two-dimensional array. So
[07:41:25] this is where the convolutional layer
[07:41:26] ends. So again I'm repeating whatever
[07:41:28] happens in convolutional layer. So we'll
[07:41:30] start off by finding some random
[07:41:32] filters. So let's say there are three
[07:41:34] random filters. Now we will pass all of
[07:41:37] those three random filters on top of
[07:41:40] this entire image and multiply those
[07:41:43] corresponding pixels. After multiplying
[07:41:45] those corresponding pixels, we'll add
[07:41:47] them up, find the sum and divide it with
[07:41:50] the number of pixels present in the
[07:41:52] entire feature space. Right? So this is
[07:41:54] the end of the convolutional layer. Now
[07:41:56] let's understand what happens in the
[07:41:58] ReLU layer which is basically the second
[07:42:00] layer over here. So in ReLU layer what
[07:42:01] we actually do is pass in an activation
[07:42:04] function over the result which we
[07:42:06] obtained in the first layer. So this
[07:42:08] function basically activates a node only
[07:42:10] if the input is above a certain
[07:42:12] threshold. So when the input is below
[07:42:15] zero over here then the output is given
[07:42:17] as zero. But when the input rises above
[07:42:19] a certain threshold then it'll have a
[07:42:22] linear relationship with the dependent
[07:42:24] variable and let's actually understand
[07:42:26] that with this table over here. So let's
[07:42:28] say when the input value is minus2 then
[07:42:30] the output would be zero because
[07:42:32] whenever the input is negative the
[07:42:34] output would be zero. Similarly, we have
[07:42:36] this input value of minus 6. Again, this
[07:42:38] minus 6 is fed through the activation
[07:42:40] function and we get a value of zero.
[07:42:43] Now, when we have this value of
[07:42:44] positive 2, this ReLU function actually
[07:42:47] acts like the identity function and
[07:42:49] would give the same value which is two
[07:42:51] again. And again, similarly, if we pass
[07:42:53] in positive 6 or plus 6, then again the
[07:42:56] ReLU activation function would act as
[07:42:58] the identity function and again give the
[07:43:00] same value which is six again. And this
[07:43:02] is the symbol for ReLU which basically
[07:43:04] denotes that if you have any negative
[07:43:06] values all of those would be represented
[07:43:08] as zero. And if you have any positive
[07:43:11] values then this function acts like the
[07:43:13] identity function. Right? Now we have
[07:43:15] the same two-dimensional array which we
[07:43:17] got after passing in our first filter
[07:43:20] over the image. All right. So now what
[07:43:23] we'll do is pass in this ReLU layer over
[07:43:25] this two-dimensional array. And when we
[07:43:28] pass in this ReLU layer, what it
[07:43:29] basically does is converts this negative
[07:43:32] value into zero. Right? And that is what
[07:43:34] we basically saw in the previous slide.
[07:43:36] So whenever there is a negative input,
[07:43:38] that negative input is turned to zero.
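
As a small sketch of the rule just described, ReLU can be written as a one-line function and applied to every cell of a two-dimensional array. The names `relu` and `relu_layer` and the sample feature map are assumptions for illustration; only the inputs -2, -6, 2, 6 and the value -0.11 come from the lecture.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return x if x > 0 else 0

def relu_layer(feature_map):
    """Apply ReLU to every value of a two-dimensional array."""
    return [[relu(value) for value in row] for row in feature_map]

# The four inputs from the lecture's table.
print([relu(value) for value in (-2, -6, 2, 6)])

# A tiny, made-up feature map containing the -0.11 value mentioned in the lecture.
sample_map = [[1.0, -0.11], [-0.11, 0.33]]
print(relu_layer(sample_map))
```
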
[07:43:41] And since the value over here is
[07:43:42] basically -0.11, it'll be turned to
[07:43:45] zero. And similarly, wherever there are
[07:43:47] negative values, all of those negative
[07:43:50] values would be turned to zero with the
[07:43:52] help of this ReLU layer. And that again
[07:43:54] was just the two-dimensional array that
[07:43:56] we got from our first filter. And again
[07:43:58] we'd have to do the same process for all
[07:44:00] of the three filters. So these are the
[07:44:02] three two-dimensional arrays which we
[07:44:04] got after passing through those three
[07:44:06] filters. And now what we'll do is we
[07:44:08] will take in these three modified
[07:44:10] two-dimensional arrays and pass this
[07:44:12] through the ReLU layer. And what we'll
[07:44:14] get is again modified two-dimensional
[07:44:16] arrays. So this time these
[07:44:18] two-dimensional arrays have no negative
[07:44:20] values. Right? So we have this
[07:44:22] two-dimensional arrays over here and
[07:44:24] these two-dimensional arrays have no
[07:44:26] negative values in them. Now after we
[07:44:28] pass these two-dimensional arrays
[07:44:30] through the ReLU layer, what we have to
[07:44:32] do is we have to pass them through
[07:44:34] something known as the pooling layer. So
[07:44:35] let's understand what exactly is a
[07:44:37] pooling layer. So pooling is basically a
[07:44:40] way where we take large images and
[07:44:42] shrink them down while preserving the
[07:44:44] most important information in them. So
[07:44:46] I'm repeating it again. So what we do in
[07:44:48] the pooling layer is we take in really
[07:44:51] large images pass them through the
[07:44:53] pooling layer and shrink them so that we
[07:44:56] can keep only the important information
[07:44:59] and discard the insignificant
[07:45:01] information. Now this pooling layer
[07:45:03] uses something known as a window
[07:45:05] and we'll take only the maximum value
[07:45:08] which is part of the window. So normally
[07:45:10] the window size is either 2 × 2 or 3 × 3
[07:45:13] and what you see here is the symbol for
[07:45:14] pooling. This is our two-dimensional
[07:45:16] array after passing it through the ReLU
[07:45:18] layer. Now it's time to again pass this
[07:45:20] image through the pooling layer. So as
[07:45:22] we already know what happens in the
[07:45:24] pooling layer is we pass this image
[07:45:26] through a window and we'll take the
[07:45:28] maximum value through that window. So
[07:45:30] over here the window size which I'm
[07:45:31] taking is 2 × 2, right? So I will pass
[07:45:34] this window over these values and what I
[07:45:37] see is the maximum pixel value is one
[07:45:40] and that is what I'll take over here.
[07:45:42] Similarly, I'll take a step size of two
[07:45:44] and again pass this window through this
[07:45:47] region over here and this region would
[07:45:49] give me a maximum value of 0.33.
[07:45:52] Similarly, I'll pass this window to
[07:45:54] these four values and from here I'll get
[07:45:56] a maximum value of 0.55. So now if you
[07:45:59] actually note this properly, we actually
[07:46:01] started with a 7 × 7 matrix and we ended
[07:46:04] up with a 4 × 4 matrix. So this is how
[07:46:07] pooling layer actually helps us in
[07:46:09] reducing the dimensions of the image.
[07:46:11] Right? So this was the image after the
[07:46:13] second stage. We passed this through the
[07:46:15] pooling layer and we got a 4 × 4 matrix.
[07:46:18] And again, this was just the result from
[07:46:20] the first filter. We'd have to do the
[07:46:22] same thing for the rest of the two
[07:46:23] filters as well. Right? So we'll pass in
[07:46:25] all of these three two-dimensional
[07:46:27] arrays to this pooling layer. And we'll
[07:46:30] get a modified array again. And these
[07:46:33] modified arrays are of 4 × 4 dimensions.
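
A minimal sketch of max pooling with a 2 × 2 window and a step size of two is shown here. To get from 7 × 7 to 4 × 4 as in the lecture, it assumes the windows on the last row and column are allowed to be partial (they simply cover whatever cells remain); that edge handling, the function name `max_pool` and the sample grid are assumptions, not something the lecture spells out.

```python
def max_pool(feature_map, window=2, stride=2):
    """Keep the maximum of each window; edge windows may be smaller than the rest."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows, stride):
        pooled_row = []
        for left in range(0, cols, stride):
            values = [feature_map[i][j]
                      for i in range(top, min(top + window, rows))
                      for j in range(left, min(left + window, cols))]
            pooled_row.append(max(values))
        pooled.append(pooled_row)
    return pooled

# Made-up 7x7 input whose cell value is simply row * 7 + column.
grid = [[r * 7 + c for c in range(7)] for r in range(7)]
pooled = max_pool(grid)
print(len(pooled), len(pooled[0]))
```

With a 7 × 7 input the window starts at rows and columns 0, 2, 4 and 6, which is where the four rows and four columns of the result come from.
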
[07:46:36] Right? Now we'll go ahead and combine
[07:46:38] whatever we did through the
[07:46:40] convolutional neural network. So we had
[07:46:42] this initial image of X. We passed this
[07:46:46] in through the convolutional layer where
[07:46:47] we had basically mapped the filters on
[07:46:49] top of the original image and then
[07:46:52] multiplied the corresponding pixel
[07:46:53] values, added them up and then divided
[07:46:56] it by the number of pixels present in
[07:46:58] the frame. Now after that we had passed
[07:47:01] the modified two-dimensional array to
[07:47:03] the ReLU layer which removed out all of
[07:47:05] the negative values. Now again that
[07:47:07] modified two-dimensional array was
[07:47:09] passed through the pooling layer where
[07:47:11] we got a subset of the original array
[07:47:13] and this is what we get after passing
[07:47:15] the original image through these three
[07:47:17] layers. Now what we'll do is instead of
[07:47:19] having just one set of these three
[07:47:21] layers we'll pass the image through
[07:47:23] multiple convolutional, ReLU and
[07:47:26] pooling layers. So I have this image
[07:47:28] I'll start off by passing this through a
[07:47:30] convolutional layer and a ReLU layer.
[07:47:32] Now again I'll pass this through a
[07:47:34] convolutional layer, a ReLU layer and a
[07:47:36] pooling layer. And again I'll do the
[07:47:38] same thing. I'll pass this through the
[07:47:40] convolutional layer, the ReLU layer and
[07:47:42] the pooling layer. So now after this
[07:47:45] process the 4 × 4 matrix has reduced to
[07:47:48] a 2 × 2 matrix. And this is the result
[07:47:50] after the final pooling layer. So we
[07:47:52] have something known as the fully
[07:47:54] connected layer where the actual
[07:47:56] classification happens. So these were
[07:47:58] our two-dimensional matrices. Now what
[07:48:00] happens in the fully connected layer is
[07:48:02] we'll not actually treat these inputs as
[07:48:04] a two-dimensional array. So they are
[07:48:06] actually treated as a single list and
[07:48:09] each of these values is treated
[07:48:11] identically. So we start off with these
[07:48:13] two-dimensional arrays and we convert
[07:48:15] them into this list. So I'll take this
[07:48:17] one place it over here. I'll take this
[07:48:19] 0.55 place it over here. Similarly I'll
[07:48:22] take this 0.55 place it over here. I'll
[07:48:24] take this one place it over here. And
[07:48:26] I'll do the same thing for these two
[07:48:28] two-dimensional arrays as well. So I'm
[07:48:31] restating it. I had this two-dimensional
[07:48:33] array to start with. But then again I am
[07:48:36] converting this into a list. Right? And
[07:48:38] these are the set of values which I'll
[07:48:40] get for the image X. And similarly when
[07:48:42] I pass in the image O through the
[07:48:44] convolutional neural network, I'll again
[07:48:47] have a different set of values
[07:48:48] associated with it. Right? So now let's
[07:48:50] actually take a look at this vector
[07:48:52] closely. What you see is a set of values
[07:48:54] which basically consists of 1 and 0.55
[07:48:57] and we'll be only focusing on the one
[07:48:59] values that is only the values which are
[07:49:02] higher that is the first value fourth
[07:49:05] fifth 10th and 11th value over here. So
[07:49:08] wherever we have ones we'll take only
[07:49:10] those. Now how the classification
[07:49:12] basically happens is let's say we have a
[07:49:15] new image. So from that new image what
[07:49:17] we'll do is we'll compare the value
[07:49:19] which is present at the first position,
[07:49:21] fourth position, fifth position and
[07:49:23] similarly these two positions. So if in
[07:49:25] the new image those corresponding values
[07:49:28] are one, then there is a greater
[07:49:30] probability for that new input image to
[07:49:32] be X right and that is what happens with
[07:49:35] O as well. So over here for this list
[07:49:38] we'll only focus on the high values. So
[07:49:40] over here the second, 9th and the 12th
[07:49:43] values are high. So whenever a new image
[07:49:45] comes in, we'll only compare these
[07:49:48] values over here. So wherever values are
[07:49:49] high, we'll compare only those values.
[07:49:51] And if those corresponding values are
[07:49:53] high in the new input image as well,
[07:49:55] then we can go ahead and say that there
[07:49:57] is a good probability for that new image
[07:49:59] to be equal to O. Right? So let's say we
[07:50:02] have this new image over here. It
[07:50:04] comprises of all of these values. Now I
[07:50:06] will go ahead and sum up all of those
[07:50:08] high values in the vector for X in the
[07:50:10] input image. So in the vector for x this
[07:50:13] is 1, 1, 1, 1 and 1. So I basically
[07:50:16] have five ones which basically
[07:50:18] add up to five. Now I will add up
[07:50:20] the same corresponding values in the
[07:50:22] input image as well. So I have one over
[07:50:24] here, I'll take 0.9. Similarly I have one,
[07:50:27] I'll take 0.87. I have 1, I'll take
[07:50:30] 0.96. I have one over here, I'll take
[07:50:32] 0.89. Again I have 1, I'll take 0.94. So
[07:50:36] what I'll do is I will add up these
[07:50:38] values. So that is basically 0.9 +
[07:50:40] 0.87 + 0.96 + 0.89 + 0.94 and that
[07:50:45] gives me a final value of 4.56. So when
[07:50:48] I sum up the values from input image I
[07:50:50] get a value of 4.56 and from this vector
[07:50:53] for X I get a value of five. And when I
[07:50:56] divide these two values I get a
[07:50:58] probability of 0.91.
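
The comparison just worked through for X can be sketched as follows. This is a simplified illustration of the lecture's voting idea, not how a trained fully connected layer is actually implemented. The function name `match_probability`, the 0.9 threshold for "high" values, the 0.55 filler values in the reference vector and the 0.0 placeholders in the new image are all assumptions; only the five positions and the five values 0.9, 0.87, 0.96, 0.89 and 0.94 come from the lecture.

```python
def match_probability(reference, candidate, threshold=0.9):
    """Sum the candidate's values at the reference's high positions,
    then divide by the reference's own sum at those positions."""
    high_positions = [i for i, value in enumerate(reference) if value >= threshold]
    reference_total = sum(reference[i] for i in high_positions)
    candidate_total = sum(candidate[i] for i in high_positions)
    return candidate_total / reference_total

# Reference vector for X: ones at the 1st, 4th, 5th, 10th and 11th positions.
x_reference = [1, 0.55, 0.55, 1, 1, 0.55, 0.55, 0.55, 0.55, 1, 1, 0.55]
# New image: 0.0 marks positions whose values the lecture does not use here.
new_image = [0.9, 0.0, 0.0, 0.87, 0.96, 0.0, 0.0, 0.0, 0.0, 0.89, 0.94, 0.0]

print(round(match_probability(x_reference, new_image), 3))
```

Here the candidate sum is 4.56 and the reference sum is 5, so the ratio is 0.912, matching the 0.91 quoted in the lecture.
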
[07:51:01] So there is basically 91% probability
[07:51:03] that this new image denotes the alphabet
[07:51:06] X. Similarly, let's say if I want to
[07:51:08] compare this input image with a vector
[07:51:10] for O. So again, I'll do the same thing.
[07:51:12] I will take all of the high values. So I
[07:51:14] have 1, 1, 1 and 1 which basically when I
[07:51:17] combine it adds up to four. Now I'll
[07:51:19] take the same corresponding values. So
[07:51:21] over here it is 0.65 0.45 0.44 and 0.53.
[07:51:27] Now when I add up these values, it comes
[07:51:29] up to 2.07. And when I divide 2.07
[07:51:32] by four I get a final probability
[07:51:35] value of 0.51, which is much lower. Now
[07:51:38] let's actually look at the summary of
[07:51:40] this. So we had this input image X. Now
[07:51:44] we took this input image and pass this
[07:51:46] through this convolutional neural
[07:51:48] network and this convolutional neural
[07:51:50] network again had four layers. So the
[07:51:52] first layer was the convolutional layer
[07:51:55] where we map the filters on top of the
[07:51:58] input space and we got a modified
[07:52:00] two-dimensional array. Now we took that
[07:52:01] modified two-dimensional array, passed
[07:52:03] it through the ReLU layer and removed
[07:52:05] out all of the negative values. Now
[07:52:08] again we took this modified
[07:52:09] two-dimensional array and passed it
[07:52:11] through the pooling layer where we got a
[07:52:13] subset of the original image. And then
[07:52:15] we finally passed it through the fully
[07:52:17] connected network which basically gave
[07:52:19] us the classification probability. So
[07:52:22] over here we see that the probability
[07:52:24] for this image to be X is 0.91 and the
[07:52:27] probability for this image to be O
[07:52:29] is 0.51. So this obviously means that
[07:52:32] the image basically denotes the alphabet
[07:52:35] X and this is the entire theory behind
[07:52:38] convolutional neural networks. So let's
[07:52:39] understand the issues with feed forward
[07:52:41] network. So let's say this is our feed
[07:52:43] forward network over here and we are
[07:52:46] trying to solve a simple classification
[07:52:48] problem. So let's say it is given an image
[07:52:50] of a food item and it needs to label it
[07:52:53] properly. So what I'll do is first I
[07:52:56] will give this feed forward network an
[07:52:58] image of chicken. So when I give in an
[07:53:01] image of a chicken to this feed forward
[07:53:03] network, it correctly labels it as
[07:53:06] chicken. So let's say I give this input
[07:53:08] at time stamp t. Now again after some
[07:53:11] time, let's say at time stamp t + 1, I'll
[07:53:14] give it an image of pizza. So it takes in
[07:53:17] this image of pizza and then it again
[07:53:20] labels it correctly as pizza. So till here
[07:53:23] all of this is fine. So first I give
[07:53:25] this an image of chicken and it is able
[07:53:27] to correctly label it as chicken. After
[07:53:29] a while at time stamp t + 1 I'll give it
[07:53:32] the image of pizza and it is able to
[07:53:34] correctly label it as pizza. Now the
[07:53:37] problem is there is no relation between
[07:53:40] these two outputs. So first we get the
[07:53:43] output which basically states that the
[07:53:44] label associated with this image is
[07:53:46] chicken and second we get that the label
[07:53:48] associated with this food item is pizza.
[07:53:51] Now the problem is there is no sequence
[07:53:54] or there is no relation between these
[07:53:56] two entities. So this is where feed
[07:53:59] forward networks fail. So they can't
[07:54:01] really memorize previous inputs. Now
[07:54:03] there are a lot of cases where we
[07:54:05] actually want the network to remember
[07:54:08] all of the previous inputs and this is
[07:54:10] not really possible with the feed
[07:54:11] forward network. So let's actually take
[07:54:13] another example. So let's say I have
[07:54:16] this sentence over here which states
[07:54:18] recurrent neural. So this is the input
[07:54:21] and I want my feed forward network to
[07:54:24] predict the next word in this sentence.
[07:54:26] Now my question to you guys would be
[07:54:28] would this feed forward network be able
[07:54:30] to predict the next word? Well, that
[07:54:32] really is not possible because as we
[07:54:34] already saw, a feed forward network is
[07:54:37] not really able to memorize the previous
[07:54:39] inputs. So the answer for us humans is
[07:54:43] pretty obvious. We know that recurrent
[07:54:45] neural should be followed with network
[07:54:48] because we have already learned the
[07:54:50] previous words. So we know that after
[07:54:53] recurrent the next word would be neural
[07:54:55] and after neural the next word would be
[07:54:58] network. and we are able to predict the
[07:55:00] next word because we are able to
[07:55:02] memorize the previous words and that is
[07:55:04] how brain functions but then again this
[07:55:07] is not really possible with the feed
[07:55:08] forward network. So this is basically
[07:55:11] where the feed forward network fails. So
[07:55:13] now let's understand how can we solve
[07:55:15] this with the recurrent neural network.
[07:55:17] So over here let's take another example.
[07:55:19] So let's say there's this chef who only
[07:55:22] cooks three items and one item on each
[07:55:26] day. So let's say he cooks chicken on
[07:55:28] day one, pizza on day two and noodles on
[07:55:31] day three. And again he'll continue the
[07:55:33] same thing. That is on day four he'll
[07:55:36] again cook chicken on day five he'll
[07:55:38] cook pizza and on day six he'll again
[07:55:40] cook noodles. So it's as simple as that.
[07:55:42] So basically he has a set pattern and
[07:55:44] he'll not break that pattern. So let's
[07:55:47] consider that this is the output at time
[07:55:49] stamp t. Now since we know that the
[07:55:52] output at time stamp t is chicken and
[07:55:55] there is already a sequence which has to
[07:55:57] be followed we can say that output at t
[07:56:00] + 1 will be pizza. And similarly since
[07:56:03] we know what the output is at t + 1 we
[07:56:05] can also predict what the output is at t
[07:56:08] + 2. So the output at t +2 would be
[07:56:11] noodles. So guys keep this in mind and
[07:56:13] this is how basically a recurrent neural
[07:56:15] network works. So a recurrent neural
[07:56:18] network is basically able to memorize
[07:56:20] the sequence between all of the outputs.
[07:56:23] Right? So now let's understand how a
[07:56:25] recurrent neural network works. So let's
[07:56:28] say on day one or on time stamp t1 I'll
[07:56:31] give this some input and with respect to
[07:56:34] that input this recurrent neural network
[07:56:37] would predict that the food to be
[07:56:39] prepared on day one is chicken. Now this
[07:56:43] output at time stamp t1 is not just
[07:56:45] discarded. So this output over here is
[07:56:47] sent to the next neuron at time stamp
[07:56:49] t2. Now for time stamp t2 we have an
[07:56:53] input and along with that input we also
[07:56:57] have the previous time stamp's output. So
[07:57:00] it'll take these two into consideration
[07:57:02] and since it already knows what was
[07:57:04] prepared at time stamp t1 it is easily
[07:57:06] able to gauge that the output should be
[07:57:09] pizza at time stamp t2. Now again this
[07:57:12] output is remembered as well. So this
[07:57:14] output which we've received at time
[07:57:16] stamp t2 is again sent to the third
[07:57:19] neuron over here. Now this third neuron
[07:57:22] has some input and along with this input
[07:57:25] it also takes in the output from the
[07:57:28] previous time stamp. So along with this
[07:57:30] output and the input it is able to
[07:57:32] predict that the output should be
[07:57:34] noodles. So this is how a recurrent
[07:57:36] neural network works. So basically the
[07:57:38] outputs at all of these timestamps are
[07:57:41] dependent on each other and this
[07:57:43] recurrent neural network is able to
[07:57:45] handle sequential data. So these
[07:57:48] neurons you see are really able to
[07:57:50] memorize the previous inputs. So I'm
[07:57:52] reiterating again this is time stamp one
[07:57:54] and at time stamp one this RNN predicts
[07:57:57] that the output should be chicken. Now
[07:57:59] this output at time stamp one is sent to
[07:58:01] the neuron at time stamp two. Now with
[07:58:04] the help of this output from the
[07:58:06] previous time stamp, it is easily able
[07:58:09] to predict that the output at time stamp
[07:58:11] 2 should be pizza. Now again this output
[07:58:14] at time stamp 2 is sent to this neuron
[07:58:17] at time stamp 3. And with the help of
[07:58:19] this output, it is able to predict that
[07:58:21] the food item which has to be cooked is
[07:58:23] noodles. Right? So now that we've
[07:58:25] understood this example properly, let's
[07:58:28] understand RNNs in a better way. So as we
[07:58:30] saw in the example, RNNs basically have
[07:58:33] something known as a memory which
[07:58:35] captures information from the previous
[07:58:38] timestamps. And RNNs are called
[07:58:40] recurrent because they basically perform
[07:58:42] the same task for every element of a
[07:58:45] sequence. So let's say this is our input
[07:58:49] at t minus one. Now this is sent to the
[07:58:52] neuron over here and this will give us
[07:58:54] an output at t minus one. Now whatever
[07:58:57] information that we get from time stamp
[07:59:00] t minus one that is sent to the neuron
[07:59:02] at time stamp t. Now along with the
[07:59:05] input at time stamp t and along with the
[07:59:08] information from time stamp t minus one
[07:59:10] it is able to give an output over here.
[07:59:13] Now similarly this information from time
[07:59:15] stamp t is sent to this neuron over here
[07:59:18] and along with this information there is
[07:59:19] also an input at t + 1. So combining
[07:59:22] these two this will give us an output at
[07:59:24] t + 1. So this is how RNNs work and
[07:59:27] if you roll up this entire sequence what we
[07:59:30] get is basically a loop. So we are
[07:59:32] performing the same thing at multiple
[07:59:35] timestamps and that is why RNNs are
[07:59:37] called recurrent. So we provide the
[07:59:39] input the operation is done and we get
[07:59:42] the output at let's say time stamp t
[07:59:44] minus one. Now again the same operation
[07:59:47] is done at time stamp t. Again the same
[07:59:49] operation is done at time stamp t + 1.
[07:59:52] And this goes on till we get our final
[07:59:54] optimal result. Right? So now that we've
[07:59:57] understood the theoretical part of RNN,
[07:59:59] let's also understand the math behind
[08:00:01] it. So again we have this RNN.
[08:00:30] And we've unfolded this RNN. And when we
[08:00:32] unfold it, we have something like this.
[08:00:34] So we have different time stamps. So we
[08:00:36] have three time stamps basically t minus
[08:00:38] one, t and t + one. So let's actually
[08:00:41] consider this. So x_t over here is
[08:00:43] basically the input at time stamp t. Now
[08:00:47] I'll send this input to this neuron over
[08:00:50] here. And the way we calculate the value
[08:00:53] of this node is pretty simple. So it's
[08:00:54] what we've always been doing, which is
[08:00:56] basically the input times the weight
[08:00:58] associated with this link over here. So
[08:01:00] the input is x_t along with the weight over
[08:01:03] here which is U. So this becomes x_t × U.
[08:01:06] Now this is just the input. So this
[08:01:09] is time stamp t and we've already seen
[08:01:12] that the current time stamp takes input
[08:01:15] from the previous time stamp. So over
[08:01:17] here we have information from the
[08:01:18] previous time stamp and that information
[08:01:20] is basically W × s_{t-1}. So
[08:01:24] this s_t that you see is basically
[08:01:27] what is known as the memory of the
[08:01:29] network. So for this node over here at
[08:01:32] time stamp t we'll get this input and
[08:01:35] this information from the previous time
[08:01:36] stamp. So in total it becomes x_t × U
[08:01:41] + s_{t-1} × W. But this is just a
[08:01:44] linear equation. Now we have to pass
[08:01:46] this linear equation through an
[08:01:48] activation function. So that activation
[08:01:50] function could either be tanh, ReLU or
[08:01:53] any other activation function that
[08:01:55] we've seen till now. And that is how we
[08:01:57] get the value of s_t. So again I'm
[08:01:59] reiterating it. So to get the value of
[08:02:02] s_t it is basically this input and the
[08:02:05] information from the previous output. So
[08:02:07] that is x_t × U + W × s_{t-1}
[08:02:10] and we'll pass it through an
[08:02:12] activation function, let's say ReLU, and
[08:02:14] that is when we'll get the output which
[08:02:15] is s_t. Now again we have to pass this s_t
[08:02:19] through another final activation
[08:02:20] function which is mostly softmax because
[08:02:23] softmax would give us a set of
[08:02:25] probabilities and when you add up all of
[08:02:27] these probabilities they would
[08:02:29] sum up to one. Right? So we'll pass
[08:02:31] this s_t through the softmax function and
[08:02:34] we'll get the output at time stamp t.
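
One forward step of this computation can be sketched in plain Python as below. It is an illustration under stated assumptions: tanh is chosen as the activation (the lecture allows tanh or ReLU), `V` is assumed to be the weight applied to the state before softmax (the lecture names V only later, when discussing back propagation), and the weight values, the one-hot inputs and the helper names `matvec`, `softmax` and `rnn_step` are made up for the example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(values):
    """Turn scores into probabilities that sum to one."""
    shift = max(values)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, s_prev, U, W, V):
    """s_t = tanh(U x_t + W s_{t-1}); output_t = softmax(V s_t)."""
    pre_activation = [a + b for a, b in zip(matvec(U, x_t), matvec(W, s_prev))]
    s_t = [math.tanh(p) for p in pre_activation]
    output_t = softmax(matvec(V, s_t))
    return s_t, output_t

# Hypothetical, untrained weights: 3 inputs, 2 state values, 3 output classes.
U = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
W = [[0.1, 0.2], [-0.2, 0.3]]
V = [[0.3, -0.1], [0.0, 0.5], [-0.4, 0.2]]

state = [0.0, 0.0]  # empty memory before the first time stamp
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):  # one-hot inputs at t-1, t, t+1
    state, probabilities = rnn_step(x, state, U, W, V)
    print(probabilities)
```

The same `rnn_step` function is called at every time stamp with the same U, W and V, and only `state` carries information forward, which is the "memory" the lecture refers to.
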
[08:02:36] Similarly, let's consider this node at
[08:02:38] time stamp t + 1. So over here to
[08:02:40] calculate s_{t+1}, we'd have to take
[08:02:43] this input and this information from the
[08:02:45] previous output. So it'll be x_{t+1} ×
[08:02:49] U + W × s_t and again we have to
[08:02:52] pass this through an activation function
[08:02:54] and we'll get the value of s_{t+1}. Now
[08:02:57] if you want the final output at time
[08:03:00] stamp t + 1, we have to again pass this
[08:03:02] through the softmax function and the
[08:03:04] softmax function would give us a range
[08:03:06] of probabilities between 0 and 1. So
[08:03:08] this is how we get the outputs at
[08:03:10] different timestamps. This was the part
[08:03:12] where we did the forward propagation and
[08:03:15] in that forward propagation we had
[08:03:17] outputs for different timestamps. Now
[08:03:19] again after forward propagation we would
[08:03:22] also have to train the network and while
[08:03:24] training the network what we see is the
[08:03:27] actual values and the predicted values
[08:03:29] are not really similar and there would
[08:03:31] be quite a bit of error between the
[08:03:33] actual values and the predicted values.
[08:03:35] So what we'll do is once we're done with
[08:03:37] the forward propagation we will
[08:03:39] calculate the error for each time stamp.
[08:03:42] So here let's consider this case. So
[08:03:43] let's say there are four time stamps. So
[08:03:45] we'll start with t4 or time stamp 4 over
[08:03:48] here. So over here let's say the actual
[08:03:51] value is O4 and the predicted value is
[08:03:54] P4. Now if we want to know the error in
[08:03:57] prediction, we have to subtract P4 from
[08:03:59] O4 and that'll give us an error over
[08:04:01] here. So this is the error for time
[08:04:04] stamp t4. Similarly we also have to
[08:04:06] calculate the error in prediction for
[08:04:08] time stamp t3. So this time the error in
[08:04:11] prediction would be O3 minus P3. And
[08:04:14] then over here we have to calculate the
[08:04:16] error in prediction for time stamp t2.
[08:04:18] So this will be O2 minus P2. And for
[08:04:21] time stamp T1 it'll be O1 minus P1. And
[08:04:24] for time stamp T0 it'll be O0 minus P0.
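
The per-timestamp errors described above, and the total obtained by adding them up, can be sketched like this. The actual and predicted values are invented for illustration, and `prediction_errors` is a hypothetical helper name. The plain difference follows the lecture's description; in practice training usually uses a loss such as squared error or cross-entropy, because signed differences can cancel each other out.

```python
def prediction_errors(actual, predicted):
    """Return the error O_t - P_t for each time stamp and their sum."""
    errors = [o - p for o, p in zip(actual, predicted)]
    return errors, sum(errors)

# Made-up actual (O0..O4) and predicted (P0..P4) values.
actual = [1.0, 0.0, 1.0, 1.0, 0.0]
predicted = [0.8, 0.1, 0.6, 0.9, 0.3]

errors, e_total = prediction_errors(actual, predicted)
print(errors)
print(e_total)
```
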
[08:04:28] So now we have errors for each of the
[08:04:31] time stamps. So E 0 is the error at time
[08:04:34] stamp T 0. E1 is the error at time stamp
[08:04:37] T1 and so on. So now we have the
[08:04:40] individual errors for individual time
[08:04:43] stamps. we'd have to calculate the total
[08:04:46] error. And to calculate the total error,
[08:04:48] all we have to do is just sum up all of
[08:04:50] these errors. And when we sum up all of
[08:04:52] these errors, we'll get E total. So this
[08:04:55] is the total error which we've got while
[08:04:58] forward propagating. So now that we know
[08:05:00] the error in prediction, what we have to
[08:05:03] do is back propagate and fine-tune our
[08:05:06] weights and biases so that we get the
[08:05:09] optimal values. And we've already seen
[08:05:11] how this can be done through back
[08:05:13] propagation and stochastic gradient
[08:05:14] descent. So in back propagation what we
[08:05:17] saw is we have to calculate the change
[08:05:19] in error with respect to the weight. But
[08:05:22] that was simple back propagation. But
[08:05:24] now we're dealing with RNN. So in RNN we
[08:05:28] have modules or nodes for each time
[08:05:31] stamp. So this time we have to calculate
[08:05:33] the change in error with respect to all
[08:05:36] of the time stamps. So now let me again
[08:05:38] go back and see how can we back
[08:05:40] propagate over here. So now while back
[08:05:43] propagating we have to update three
[08:05:46] parameters over here. So those three
[08:05:48] parameters are U, V and W. So what we
[08:05:52] have to do is we have to find out the
[08:05:55] change in total error with respect to U,
[08:05:59] V and W. Now again the gradient of total
[08:06:02] error with respect to these three
[08:06:04] parameters is nothing but gradient of E0
[08:06:07] with respect to those three parameters
[08:06:09] gradient of E1 with respect to UVW.
[08:06:11] Gradient of E2 with respect to UVW and
[08:06:14] so on. So what we'll do is we'll find
[08:06:17] out the change of these individual
[08:06:19] errors at each of these timestamps with
[08:06:22] respect to each of the parameters. So for
[08:06:25] this link the weight associated was V.
[08:06:27] For this link the weight associated was
[08:06:29] U. And for this link the weight
[08:06:31] associated was w. So let's go ahead and
[08:06:34] back propagate and find out the optimal
[08:06:36] value for v. And if you have to do that
[08:06:39] then we have to do it for each of the
[08:06:41] time stamps. And what we have to do is
[08:06:43] we'd have to find out the gradient of E4
[08:06:46] with respect to V. After that we'll go back a time
[08:06:48] stamp and then we'd have to find out the
[08:06:51] change of E3 with respect to V. Again
[08:06:53] we'll go back a time stamp and this time
[08:06:55] we have to find out the change of E2
[08:06:58] with respect to V. Again we'll go back
[08:07:00] and this time we have to find out the
[08:07:02] change of E1 with respect to V. So now
[08:07:05] once we have the individual changes of
[08:07:08] these errors with respect to the
[08:07:10] parameter V then all we have to do is
[08:07:13] add them up. So that will be gradient of
[08:07:15] E0 with respect to V plus gradient of
[08:07:18] E1 with respect to V, gradient of E2
[08:07:21] with respect to V and gradient of E3
[08:07:24] with respect to V. And when we add all
[08:07:26] of that up, we'll get the final change
[08:07:28] in error with respect to V. Now,
[08:07:32] similarly, we'll do the same thing with
[08:07:34] respect to W over here. So, first we'd
[08:07:36] have to find out the individual change
[08:07:39] in errors with respect to W. So, that'll
[08:07:42] be gradient of E4 with respect to W,
[08:07:44] gradient of E3 with respect to W,
[08:07:47] gradient of E2 with respect to W, and so
[08:07:49] on. And then we'll have to add all of
[08:07:51] that up. And finally we'll get gradient
[08:07:54] of E total with respect to W. And then
[08:07:56] we'll do the same thing for U over here.
[08:07:59] So that'll be gradient of E4 with
[08:08:01] respect to U, gradient of E3 with
[08:08:04] respect to U, gradient of E2 with
[08:08:06] respect to U and so on. And again we'll
[08:08:08] add all of that up and we'll get the
[08:08:10] gradient of E total with respect to U.
[08:08:13] So this is how we've calculated the
[08:08:15] change in error with respect to U, V,
[08:08:18] and W. So we'll use this in the stochastic
[08:08:21] gradient descent formula and we'll
[08:08:23] update the weights of U, V and W and
[08:08:26] then we'll get our final optimal result.
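The steps above can be sketched in plain Python. This is a minimal illustration, not the code from the video. It assumes a scalar RNN in which U is the input weight, W the recurrent weight and V the output weight (the video does not say which link is which), and it assumes a squared error 0.5 * (O_t - P_t)^2 at each time stamp so that the gradients are well defined. The names `forward`, `total_error`, `bptt` and `sgd_step` are hypothetical.

```python
import math

def forward(xs, U, V, W):
    """Forward propagation: return hidden states and predictions for every time stamp."""
    hs, ps = [], []
    h_prev = 0.0
    for x in xs:
        h = math.tanh(U * x + W * h_prev)  # hidden state at this time stamp
        hs.append(h)
        ps.append(V * h)                   # prediction P_t
        h_prev = h
    return hs, ps

def total_error(xs, targets, U, V, W):
    """E_total: the sum of the per-time-stamp errors E_t (squared error assumed)."""
    _, ps = forward(xs, U, V, W)
    return sum(0.5 * (o - p) ** 2 for o, p in zip(targets, ps))

def bptt(xs, targets, U, V, W):
    """Add up the gradient of every E_t with respect to U, V and W."""
    hs, ps = forward(xs, U, V, W)
    dU = dV = dW = 0.0
    for t in range(len(xs)):
        dp = ps[t] - targets[t]        # dE_t / dP_t
        dV += dp * hs[t]               # V only touches the output at time stamp t
        dh = dp * V                    # dE_t / dh_t
        for k in range(t, -1, -1):     # walk back through the earlier time stamps
            da = dh * (1.0 - hs[k] ** 2)          # through the tanh
            h_before = hs[k - 1] if k > 0 else 0.0
            dU += da * xs[k]
            dW += da * h_before
            dh = da * W                # pass the gradient one time stamp back
    return dU, dV, dW

def sgd_step(xs, targets, U, V, W, lr=0.1):
    """One gradient descent update of the three shared parameters."""
    dU, dV, dW = bptt(xs, targets, U, V, W)
    return U - lr * dU, V - lr * dV, W - lr * dW

xs = [0.1, 0.2, 0.3, 0.4]
targets = [0.2, 0.3, 0.4, 0.5]
U, V, W = 0.5, 0.8, 0.3
before = total_error(xs, targets, U, V, W)
U, V, W = sgd_step(xs, targets, U, V, W)
after = total_error(xs, targets, U, V, W)
print(before, after)
```

The inner loop is what makes this back propagation through time: the error at time stamp t depends on U and W through every earlier hidden state, so each E_t contributes one term per time stamp it passes through.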
[08:08:28] So this entire procedure that we saw
[08:08:30] over here is what is known as back
[08:08:32] propagation through time. Next, the types of RNN: we have one
[08:08:35] to one, one to many, many to one and
[08:08:38] many to many. So we'll start off with
[08:08:40] the first type of RNN which is basically
[08:08:42] one to one. So this is what is known as the
[08:08:45] vanilla RNN and this is the simplest
[08:08:47] form of recurrent neural network. So
[08:08:49] this basically takes in a single input
[08:08:52] and gives out a single output. So an
[08:08:56] example of this could be let's say we
[08:08:57] give it one image or a single word and we
[08:09:01] want to classify this into a single
[08:09:03] class. So let's say I give it an image of
[08:09:06] a bird and the output which I want is
[08:09:09] yes or no, that is, whether this is a
[08:09:11] bird or not. Or let's say I give it a value
[08:09:15] 30, an integer, and I want to
[08:09:17] classify whether this is a positive
[08:09:19] number or a negative number so again yes
[08:09:22] or no so basically what I'm trying to
[08:09:24] say is this takes in a single input and
[08:09:26] gives out a single output so this is the
[08:09:28] simplest form of RNN and then after that
[08:09:31] we have something known as one to many
[08:09:33] RNN so in this we have a single input
[08:09:37] but that single input gives us multiple
[08:09:40] outputs. So an example of this one could
[08:09:43] be, let's say, if I give it the first word
[08:09:45] of a song then this RNN has to predict
[08:09:48] the rest of the lyrics of the song. So
[08:09:52] let's say there's a song hello darkness
[08:09:53] my old friend and if I give the input
[08:09:56] which is basically the first word hello
[08:09:58] then what it does is this RNN it starts
[08:10:01] off by predicting the second word which
[08:10:03] would be darkness. Now again this
[08:10:06] RNN remembers these two words, which are
[08:10:08] hello darkness. Now these two words are
[08:10:11] sent as the input to the third time
[08:10:14] stamp which is t3 and then it basically
[08:10:16] predicts the third word which is
[08:10:18] my. And this again goes on and the
[08:10:21] next word is predicted as old, and then it
[08:10:23] goes on to the next time stamp again
[08:10:25] and it'll predict the next word is friend.
[08:10:27] So we just started off with one word
[08:10:29] which was hello and it went ahead and
[08:10:32] predicted the next four words which
[08:10:34] were darkness my old friend. Right? So
[08:10:37] this is how a one to many RNN works and
[08:10:40] then next we have what is known as many
[08:10:42] to one RNN so this many to one RNN
[08:10:46] basically takes in a sequence of inputs
[08:10:48] and gives out one final output so an
[08:10:52] example of this could be sentiment
[08:10:54] analysis so let's say I feed this RNN a
[08:10:57] sentence Cristiano Ronaldo is the best
[08:10:59] footballer in the world. Now what this
[08:11:02] will do is it'll take in this entire
[08:11:05] sentence and this will tell me the final
[08:11:08] sentiment of the sentence which is
[08:11:10] either positive, negative or neutral. So
[08:11:14] I feed this the first word which is
[08:11:15] basically Cristiano and this will give
[08:11:18] me the output which is either positive,
[08:11:20] negative or neutral. And this
[08:11:22] information from the first output is
[08:11:24] sent to the node at time stamp t2. And
[08:11:27] there is also an input which is Ronaldo.
[08:11:29] So Cristiano Ronaldo and we also have
[08:11:31] the output which is basically the
[08:11:33] sentiment. And again for this time stamp
[08:11:35] we'll get another output which is
[08:11:38] basically the sentiment which is either
[08:11:39] positive, negative or neutral. So for
[08:11:42] the entire sentence, Cristiano Ronaldo
[08:11:44] is the best footballer in the world.
[08:11:47] We'll have a sequence of outputs which
[08:11:50] are basically positive, negative or
[08:11:52] neutral. So at the final time stamp,
[08:11:55] what we'll do is we'll aggregate all of
[08:11:58] these final outputs and find out the
[08:12:01] final sentiment. So if the final
[08:12:03] sentiment is positive, then we'll
[08:12:05] predict that it's a positive sentence.
[08:12:08] And similarly, if the final sentiment is
[08:12:10] neutral, we'll predict it's a neutral
[08:12:12] sentence. And if it's negative, we'll
[08:12:14] predict that it's negative. And finally,
[08:12:16] we have what is known as a many to many
[08:12:18] RNN, which is actually the most
[08:12:20] frequently used RNN. So an example for
[08:12:23] this could be predicting the next word
[08:12:25] in a sentence. So let's take this
[08:12:27] sentence. I'm going to be the next
[08:12:29] president of United States of America.
[08:12:31] So this will take in the first word,
[08:12:33] which is I, and then it'll predict the
[08:12:35] second word, which is am. And this is
[08:12:37] sent to the node at time stamp 2. And
[08:12:39] this will predict the next word which
[08:12:41] will be going. So we have three words
[08:12:43] till now which is I am going. And this
[08:12:45] again over here is sent to the node at
[08:12:48] next time stamp. And this will predict
[08:12:50] to. So this basically becomes I am going
[08:12:52] to. And this goes on and we have a set
[08:12:55] of nodes for different timestamps and
[08:12:57] this predicts the entire sentence which
[08:12:59] is basically I'm going to be the next
[08:13:01] president of United States of America.
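The four input and output patterns differ only in how one recurrent step is wired up. A minimal sketch, assuming a scalar RNN with made-up weights U, W and V; the function names are hypothetical, and the many to one version simply keeps the output at the last time stamp rather than aggregating every intermediate output as described above:

```python
import math

U, W, V = 0.7, 0.4, 1.2  # hypothetical input, recurrent and output weights

def step(x, h_prev):
    """One time stamp: update the hidden state and emit an output."""
    h = math.tanh(U * x + W * h_prev)
    return h, V * h

def one_to_one(x):
    """Single input, single output."""
    _, y = step(x, 0.0)
    return y

def one_to_many(x, n_outputs):
    """Single input; each output is fed back in as the next input."""
    ys, h = [], 0.0
    for _ in range(n_outputs):
        h, x = step(x, h)
        ys.append(x)
    return ys

def many_to_one(xs):
    """Sequence in; only the output at the final time stamp is returned."""
    h, y = 0.0, 0.0
    for x in xs:
        h, y = step(x, h)
    return y

def many_to_many(xs):
    """Sequence in; one output per time stamp."""
    ys, h = [], 0.0
    for x in xs:
        h, y = step(x, h)
        ys.append(y)
    return ys

sequence = [0.1, 0.5, 0.9, 0.3]
print(one_to_one(0.5))
print(one_to_many(0.5, 3))
print(many_to_one(sequence))
print(many_to_many(sequence))
```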
[08:13:04] So now we'll see the issues with
[08:13:06] recurrent neural networks. So it's not
[08:13:08] that these recurrent neural networks are
[08:13:10] perfect. There are issues with these as
[08:13:12] well. So again we'll take the same
[08:13:14] example. So we'll pass in this sentence
[08:13:17] recurrent neural to this network over
[08:13:19] here. Now we have to predict the next
[08:13:22] word with the help of this RNN. So this
[08:13:24] is pretty easy. So this RNN it memorizes
[08:13:27] the sequence recurrent and neural and it
[08:13:30] is easily able to predict the next word
[08:13:32] which is network and it is able to do so
[08:13:35] because this RNN does not need any
[08:13:37] further context and since it does not
[08:13:39] need any further context so it is easily
[08:13:41] able to predict the next word. But now
[08:13:43] let's take this second case over here.
[08:13:45] So now I'll pass a really long sentence
[08:13:47] to this RNN. So let's say the sentence
[08:13:49] is I've been staying in Spain for the
[08:13:51] last 10 years. I can speak fluent [blank].
[08:13:55] So now for this sentence, it's not
[08:13:57] really that easy for RNN to predict the
[08:13:59] next word and that is because this needs
[08:14:02] the context from previous words. So this
[08:14:05] has to go back a lot in the sentence. So
[08:14:08] this RNN has to go back till over here
[08:14:11] and it needs the context of this word
[08:14:14] Spain and only when it realizes the
[08:14:17] context of this word Spain, it can
[08:14:19] predict the next word over here is
[08:14:21] Spanish. Right? So the entire sentence
[08:14:23] is actually I've been staying in Spain
[08:14:24] for the last 10 years. I can speak
[08:14:27] fluent Spanish. But then again this is
[08:14:30] exactly the problem with normal RNN. And
[08:14:33] this is known as long range
[08:14:34] dependencies. So this gap which you see
[08:14:37] between the word which you'd have to
[08:14:39] actually predict and the word from which
[08:14:41] you'd have to find out the context that
[08:14:42] is really long and this is known as a
[08:14:45] long range dependency. So what happens
[08:14:47] on long-term dependency is while back
[08:14:50] propagating the chain rule becomes
[08:14:52] really really long. So let's say we are
[08:14:54] done with the forward propagation for
[08:14:56] this and we start with the back
[08:14:57] propagation. Now over here since the gap
[08:15:00] between this word and over here is
[08:15:02] really long and there are lot of
[08:15:04] timestamps between all of these words.
[08:15:06] Now, if there are a lot of timestamps,
[08:15:09] this basically creates a really long
[08:15:11] chain rule. And when there is a really
[08:15:13] long chain rule, it's not really that
[08:15:15] easy to memorize the previous words for
[08:15:18] this RNN. And this is exactly what is
[08:15:21] known as the vanishing gradient
[08:15:23] problem. Now, if there is a really long
[08:15:25] dependency, there is actually a good
[08:15:27] probability that one of the gradients
[08:15:29] might approach zero. And this would lead
[08:15:31] to all the gradients rushing to zero
[08:15:33] exponentially fast due to
[08:15:35] multiplication. So let's say we start
[08:15:37] back propagating from over here. And
[08:15:39] when we start back propagating the
[08:15:42] gradient it slowly starts to diminish.
[08:15:44] So let's say we start back propagating
[08:15:46] over here and the gradient somewhere
[08:15:48] over at time stamp 4 or time stamp 5 it
[08:15:50] gets to zero. And when the gradient at
[08:15:52] one time stamp it gets to zero. All of
[08:15:55] the gradients for all of the time stamps
[08:15:57] will be zero because all we're doing is
[08:15:59] multiplication. And when we multiply
[08:16:02] zero with any other number, it is
[08:16:04] basically zero. So such states would no
[08:16:06] longer help the network to learn
[08:16:08] anything. And this is what is known as
[08:16:10] vanishing gradient problem. So this
[08:16:13] vanishing gradient problem basically
[08:16:15] arises due to long-term dependencies. So
[08:16:17] let's say if we start from over here
[08:16:19] time stamp 1 and we go back till time
[08:16:21] stamp 7. So what happens is there is a
[08:16:24] really long-term dependency over here
[08:16:26] and the gradient starts to diminish as
[08:16:28] we traverse back. So this is where we
[08:16:30] have to use a modified version of RNN
[08:16:32] which are basically known as long
[08:16:34] short-term memory (LSTM) networks. So LSTM
[08:16:37] networks are a special kind of RNN which
[08:16:39] are explicitly designed to avoid the
[08:16:42] long-term dependency problem. So what
[08:16:44] you see over here this is a standard
[08:16:46] RNN. So normally the recurrent neural
[08:16:49] networks have a form of a chain with
[08:16:52] repeating modules. So all of these are
[08:16:54] the repeating modules over here and each
[08:16:57] module is pretty much the same. So
[08:16:59] normally in standard RNN they have a
[08:17:02] pretty similar structure. So all of
[08:17:04] these repeating modules would have just
[08:17:06] a single tanh layer or containing just a
[08:17:09] single activation function over here.
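Chaining many of these tanh modules is what produces the vanishing gradient described earlier: every time stamp multiplies the gradient by one more chain rule factor, and with tanh that factor is usually below one. A minimal sketch, assuming a scalar RNN with hypothetical weights `w` (recurrent) and `u` (input) and a constant input `x`:

```python
import math

def gradient_through_time(steps, w=0.5, u=1.0, x=1.0):
    """Return |d h_T / d h_0| for a scalar tanh RNN unrolled over `steps` time stamps."""
    h = 0.0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule factor added by this time stamp
    return abs(grad)

for steps in (1, 5, 10, 50):
    print(steps, gradient_through_time(steps))
```

With these values each factor is at most 0.5, so the product shrinks at least as fast as 0.5 to the power of the number of time stamps. That is why a word many time stamps back, like Spain in the example, has almost no influence on the weight update.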
[08:17:11] Right? So this is a standard RNN where
[08:17:14] there'll be just one neural network for
[08:17:17] each of the time stamps. So as we see
[08:17:19] over here we have three time stamps t
[08:17:21] minus one, t and t + one. And all of
[08:17:24] these are just recurrent or all of these
[08:17:26] are just repeating. So for time stamp t
[08:17:28] minus one we have the same activation
[08:17:31] function which is tanh. For time stamp t
[08:17:33] we have the same activation function
[08:17:35] which is tanh again. And this would be
[08:17:37] the same for time stamp t + 1 again. Now
[08:17:40] this is the standard RNN. So this is
[08:17:42] where the difference comes with LSTMs.
[08:17:45] So again LSTMs also have a chain-like
[08:17:47] structure but the repeating module which
[08:17:50] you see over here this is actually
[08:17:52] different. So instead of having just a
[08:17:54] single neural network layer there are
[08:17:57] four layers interacting in a very
[08:18:00] special way. So these are the four
[08:18:02] neural networks over here. So what you
[08:18:04] saw in a standard RNN was there is just
[08:18:07] one neural network but when it comes to
[08:18:10] an LSTM there'll be four neural
[08:18:12] networks. So let's actually understand
[08:18:14] the core idea behind LSTMs. So this line
[08:18:18] which you see over here this is what is
[08:18:20] known as the cell state. So this cell
[08:18:23] state you can consider this to be a
[08:18:25] conveyor belt and the cell state is
[08:18:27] where all of the information is stored.
[08:18:30] Now the idea behind LSTMs is to add or
[08:18:33] remove some information to the cell
[08:18:35] state so that we avoid the long-term
[08:18:38] dependency problem. So when it comes to
[08:18:40] long-term dependency problem, there
[08:18:42] might be some cases where we don't want
[08:18:44] to remember some words. So there are
[08:18:46] some words which don't really have an
[08:18:48] impact on the next prediction. So you
[08:18:50] know those are some things which you'd
[08:18:52] have to forget. And again there are some
[08:18:54] cases where we'd have to add some new
[08:18:56] information to the RNN. So all of this
[08:18:59] can be done with the help of the cell
[08:19:00] state over here. So again I'm
[08:19:02] reiterating it with the help of this
[08:19:04] cell state we can either add or remove
[08:19:07] information from the LSTM and these
[08:19:10] gates are what help us to optionally let
[08:19:13] information through. So this gate
[08:19:14] basically comprises of a sigmoid neural
[08:19:17] net layer and a pointwise multiplication
[08:19:20] operation. So let's again go back to
[08:19:22] this image over here. So what we see is
[08:19:24] we have three gates over here. So this
[08:19:26] is gate number one. This is gate number
[08:19:28] two. And this is gate number three. Now
[08:19:31] since this is a sigmoid layer, this
[08:19:33] basically helps us to pass information
[08:19:35] which is in the range of 0 to one. So a
[08:19:38] value of zero basically means that we
[08:19:41] don't let anything through this gate.
[08:19:43] And the value of one means that we let
[08:19:45] everything through the gate. So this is
[08:19:48] again a range of 0 to one. And this is
[08:19:50] how we can calculate the amount of
[08:19:52] information which has to be sent through
[08:19:54] one particular gate. So if the value is
[08:19:57] closer to zero then very little or
[08:19:59] almost no information is sent through
[08:20:01] this gate and if the value is closer to
[08:20:03] one then almost the entire information
[08:20:06] is sent through the gate. So now let's
[08:20:08] actually understand the working of this
[08:20:10] LSTMs. So let's start with our first
[08:20:12] gate over here. So this gate is known as
[08:20:15] the forget gate layer and this is where the
[08:20:18] LSTM decides which information are we
[08:20:21] going to throw away from the cell state.
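The forget gate is the first of the four steps that the walkthrough covers from here on (forget, add new information, update the cell state, output). As a reference while following along, here is a minimal sketch of one LSTM time stamp, assuming scalar inputs and states. The weight names in the dictionary `p` are hypothetical, and each gate gets its own weights for x_t and h_(t-1) plus a bias, which is the common textbook form rather than anything shown in the video.

```python
import math

def sigmoid(z):
    """Squash z into (0, 1): near 0 lets nothing through a gate, near 1 lets everything through."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time stamp for scalar values; p holds hypothetical weights and biases."""
    # Step 1, forget gate: how much of the old cell state to keep
    f_t = sigmoid(p["wf_x"] * x_t + p["wf_h"] * h_prev + p["bf"])
    # Step 2, input gate and candidate value: what new information to add
    i_t = sigmoid(p["wi_x"] * x_t + p["wi_h"] * h_prev + p["bi"])
    c_tilde = math.tanh(p["wc_x"] * x_t + p["wc_h"] * h_prev + p["bc"])
    # Step 3, new cell state: scaled old state plus scaled candidate
    c_t = f_t * c_prev + i_t * c_tilde
    # Step 4, output gate: which part of the cell state becomes the output h_t
    o_t = sigmoid(p["wo_x"] * x_t + p["wo_h"] * h_prev + p["bo"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

params = {
    "wf_x": 0.5, "wf_h": 0.5, "bf": 0.0,
    "wi_x": 0.5, "wi_h": 0.5, "bi": 0.0,
    "wc_x": 0.5, "wc_h": 0.5, "bc": 0.0,
    "wo_x": 0.5, "wo_h": 0.5, "bo": 0.0,
}

h, c = 0.0, 0.0
for x in [1.0, 2.0, 3.0]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated by a multiplication with f_t and an addition, a forget gate close to one carries the old state forward almost unchanged, which is the conveyor belt behaviour described above.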
[08:20:23] Right? So this is the forget gate layer
[08:20:26] over here. So what it does is it takes
[08:20:29] information from the input and also from
[08:20:32] the previous output over here. So the
[08:20:34] input to this forget gate would be XT
[08:20:37] into W plus HT minus one into W. And then again
[08:20:40] we'll pass this through the sigmoid
[08:20:42] layer over here. And this will give us
[08:20:44] what is known as FT which is nothing but
[08:20:46] a value between zero and one. And this
[08:20:49] is where we decide which information are
[08:20:51] we going to remove away from the cell
[08:20:53] state. So let's take the same example of
[08:20:55] predicting the next word in the
[08:20:57] sentence. So let's say this LSTM might
[08:21:00] keep the gender of the current subject
[08:21:02] so that it can predict the correct
[08:21:04] pronoun for the subject. But then again,
[08:21:06] when it encounters a new subject, you
[08:21:09] want the cell state to forget the old
[08:21:11] subject's gender and only remember the
[08:21:14] current subject's gender. Right? So this
[08:21:16] is basically where we forget the
[08:21:18] information. So this is done with the
[08:21:20] help of the first gate. So now that we
[08:21:22] decide which information is to be thrown
[08:21:24] away, the second step is where we decide
[08:21:27] what new information we are going to
[08:21:29] store in the cell state over here. So
[08:21:32] this has two parts. First is the sigmoid
[08:21:35] layer which is called as the input gate
[08:21:37] layer. So this input gate layer decides
[08:21:40] which values we'd have to update. After
[08:21:43] that there is this tanh layer which
[08:21:45] basically creates a vector of new
[08:21:47] candidate values that could be added to
[08:21:50] the state. And with respect to the
[08:21:52] language model this is where we can add
[08:21:54] the information of the new subject's
[08:21:56] gender and remove the information of the
[08:21:59] old subject's gender. So we have two
[08:22:01] parts over here. So first is where we
[08:22:04] update the information. So this is
[08:22:06] basically where we take information from
[08:22:07] XT and HT minus one and pass it through
[08:22:10] the sigmoid layer. And after that again
[08:22:13] we'll take information from XT and HT
[08:22:15] minus one and pass it through the tanh
[08:22:18] value. So what we are doing over here is
[08:22:20] we are basically creating new set of
[08:22:22] candidate values. Right? So we have i_t
[08:22:25] and C̃_t over here. So C̃_t is basically the
[08:22:28] new candidate values which we want. So
[08:22:30] we're done with the first two steps. In
[08:22:32] the first step, we decided what is the
[08:22:34] information which you want to forget. In
[08:22:36] the second step, we decided what is the
[08:22:38] new information which we're going to add
[08:22:40] into the cell state. Now let's head on
[08:22:43] to the third step. So in the third
[08:22:45] step, we basically have to update the
[08:22:47] old cell state CT minus one into the new
[08:22:50] cell state CT. So what we do is we
[08:22:54] multiply the old cell state CT minus one by
[08:22:58] f_t. So this means that this is where we
[08:23:01] are forgetting the old things or we are
[08:23:04] removing the old information. So CT
[08:23:07] minus1 into FT. So FT is basically that
[08:23:10] information which we got from the first
[08:23:12] gate and first gate is nothing but the
[08:23:15] forget gate layer. So when we multiply
[08:23:18] FT with CT minus one, what we are doing
[08:23:21] is we are forgetting the old
[08:23:23] information. Now we'll add this to i_t
[08:23:27] into C̃_t. So this term basically
[08:23:30] means what is the new information which
[08:23:32] we want to add to this layer and by how
[08:23:35] much value do we need to scale this. So
[08:23:37] the first term is where we decide how
[08:23:40] much information are we forgetting and
[08:23:42] the second term is where we decide how
[08:23:43] much new information are we adding to
[08:23:46] the cell state and we then just add it
[08:23:48] up and get the new cell state which is
[08:23:50] CT. So from CT minus one we are going to
[08:23:53] CT. So I'm again reiterating it. So this
[08:23:56] consists of two parts. So in the first
[08:23:59] part we basically forget all of the
[08:24:01] information which is not necessary or
[08:24:04] not significant and in the second part
[08:24:07] we update or add the new information. So
[08:24:10] from CT minus one we reach till CT and
[08:24:14] then we've reached the final part. So
[08:24:16] this is where we'll decide what part of
[08:24:18] the cell state that we're going to
[08:24:20] output. It's not necessary that the
[08:24:22] entire output is useful to us. So only a
[08:24:26] range or only a part of the information
[08:24:28] or the output would be necessary. So
[08:24:31] this equation which you see. So this is
[08:24:32] where we decide what part of the cell
[08:24:34] state we're going to output. So now
[08:24:36] after that what we'll do is we'll put
[08:24:38] the cell state through the tanh and
[08:24:42] multiply it with the output of the
[08:24:44] sigmoid gate. And this gives us HT which
[08:24:46] is basically our final output. So first
[08:24:49] we decide which part of the cell state
[08:24:51] we are going to output. And then we'll
[08:24:54] pass this cell state through the tanh
[08:24:56] function. This tanh function gives us a
[08:24:58] range which is between minus1 to 1. And
[08:25:01] we'll multiply this with the output
[08:25:03] which is basically our final result. So
[08:25:05] these are the four steps which are
[08:25:07] involved in an LSTM. So first we forget
[08:25:10] all of the insignificant information and
[08:25:13] then the second step we add the
[08:25:15] information which is needed. Now in the
[08:25:17] third step we combine the first two
[08:25:19] steps and finally in the fourth step we
[08:25:22] only output the information which is
[08:25:24] necessary from the cell state. So this
[08:25:27] is the entire working of long short-term
[08:25:30] memory networks. So let's say we have five
[08:25:32] numbers like this, five consecutive
[08:25:33] numbers actually 1 2 3 4 and five. And
[08:25:37] our task is to predict the next number
[08:25:39] which would be six. So this is our task
[08:25:42] and we'll be implementing this with the
[08:25:43] help of LSTM. So we'll go ahead and
[08:25:45] import all of the required packages. So
[08:25:47] we'll import Sequential from keras.models.
[08:25:49] We'll import Dense and LSTM from
[08:25:51] keras.layers and we'd have to split our
[08:25:53] data. So we'll import train_test_split
[08:25:55] from sklearn.model_selection. And we
[08:25:58] also require the numpy library and
[08:26:00] matplotlib library. So I'll go ahead
[08:26:03] and click on run and I'll just wait till
[08:26:05] all of these packages load up. So now
[08:26:07] it's time to create my input data. So
[08:26:09] this code over here it helps me to
[08:26:10] create 100 vectors of five consecutive
[08:26:13] values. So I'll store this in data and
[08:26:16] let me have a glance at the first five
[08:26:17] vectors. So I'll click on run. So this
[08:26:19] is what I have over here. So this is the
[08:26:21] first vector where the consecutive
[08:26:23] numbers are 0 1 2 3 and four. And over
[08:26:26] here I have 1 2 3 4 and five. This is 2
[08:26:28] 3 4 5 6. So these are basically the
[08:26:30] first five vectors and these comprise of
[08:26:33] consecutive values. Right? So this is
[08:26:35] basically my input data. Now I would
[08:26:37] also require my target data which is
[08:26:39] basically the next number. So as I
[08:26:42] had already told you guys what we want
[08:26:44] to do with respect to this model is we
[08:26:47] want to predict the next value for a
[08:26:49] consecutive sequence of inputs. So over
[08:26:51] here this is our input which is
[08:26:52] basically 1 2 3 4 and 5. And we'd want
[08:26:55] to predict the next value with the help of
[08:26:57] our RNN which will be six. So what I'm
[08:27:00] doing is I'm creating the next number
[08:27:02] for each of these arrays. So over here
[08:27:04] it'll be five. For this it'll be six.
[08:27:06] For this it'll be seven and so on. So
[08:27:08] I'll click on run. Right? So this is
[08:27:10] what I have. So for the first vector the
[08:27:13] target is five. For the second vector
[08:27:15] the target is six. For the third vector
[08:27:17] the target is seven and so on. Now I'll
[08:27:19] go ahead and convert this data and
[08:27:21] target into numpy arrays. So for that
[08:27:24] purpose I'll use np.array and pass in
[08:27:27] the data object inside this and also
[08:27:30] similarly pass in the target object
[08:27:32] inside this and store it in data with a
[08:27:34] small d and similarly store it in target
[08:27:36] with a small d. I'll click on run.
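The notebook code itself is not shown in the transcript, so the following is a reconstruction under assumptions: 100 vectors of five consecutive values starting at 0, a target equal to the next number, and capitalised names `Data` and `Target` for the Python lists before they are converted to the lower-case NumPy arrays. Each value is wrapped in its own list so that the array comes out with one feature per time stamp, which matches the input shape used later in the model.

```python
import numpy as np

# 100 input vectors of five consecutive values: [0..4], [1..5], [2..6], ...
Data = [[[i + j] for j in range(5)] for i in range(100)]
# The target for each vector is the next consecutive number: 5, 6, 7, ...
Target = [i + 5 for i in range(100)]

# Convert the Python lists into NumPy arrays
data = np.array(Data, dtype=float)
target = np.array(Target, dtype=float)

print(data[:2, :, 0])  # first two input vectors
print(target[:2])      # their targets
print(data.shape)      # (100, 5, 1)
print(target.shape)    # (100,)
```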
[08:27:38] Right? So now I have my numpy arrays
[08:27:41] with me. Now I'll also have a glance at
[08:27:43] the shape of data and target. So this is
[08:27:45] the shape of data and this is the shape
[08:27:47] of target. Now this is basically helpful
[08:27:49] to me when I am giving these as
[08:27:50] parameters when I'm building the model.
[08:27:53] So now since I have the data it's time
[08:27:55] to finally divide this data into
[08:27:57] training and testing set. So for that
[08:27:59] purpose I'll be using this train test
[08:28:02] split method and it takes in these
[08:28:04] parameters. So first is basically the
[08:28:06] data or the features. Next is a target
[08:28:09] or the labels and I'm setting the test
[08:28:11] size to be 0.2. So this means that 80%
[08:28:14] of the records should be in the training
[08:28:16] set and the rest 20% of the records
[08:28:18] should be in the test set and I'll just
[08:28:20] give out a random state value of four so
[08:28:22] that I can get the same result when I
[08:28:24] want to run this model the next time.
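A sketch of the split just described, using `train_test_split` from scikit-learn with the same `test_size` and `random_state` values. The data arrays are rebuilt here under the same assumptions as before so the snippet runs on its own, and the variable names `x_train`, `x_test`, `y_train` and `y_test` are a guess at what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same toy data as before: five consecutive values in, the next value as target
data = np.array([[[i + j] for j in range(5)] for i in range(100)], dtype=float)
target = np.array([i + 5 for i in range(100)], dtype=float)

# 80% of the records go to training, 20% to testing; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=4
)

print(x_train.shape, x_test.shape)  # (80, 5, 1) (20, 5, 1)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```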
[08:28:26] Right? So I'll click on run and I have
[08:28:30] successfully divided this data into
[08:28:31] X train, X test, Y train and Y test. So X
[08:28:35] train over here this basically contains
[08:28:38] all the training values for the
[08:28:40] features. X test over here contains all
[08:28:43] the test values for the features.
[08:28:45] Similarly, Y train over here consists of
[08:28:48] all the train values for the target and
[08:28:50] Y test over here consists of all the
[08:28:52] test values for the target. Right? So
[08:28:55] now we've divided the data into training
[08:28:58] and testing. Now it's time to create a
[08:29:00] model and we are basically creating a
[08:29:02] sequential model over here. So I'll use
[08:29:04] the sequential method and I'll create an
[08:29:06] instance and store it in this model
[08:29:08] object. Right? So now it's time to add
[08:29:11] the layer and the layer which we'll be
[08:29:13] adding over here would be LSTM. So I'll
[08:29:15] use model.add.
[08:29:18] So I'll say model dot add and the layer
[08:29:20] which I'm adding is LSTM. So this takes
[08:29:22] in a parameter over here and this is
[08:29:24] basically the output size. So I want the
[08:29:26] output size to be equal to one which is
[08:29:29] basically I'm predicting a single number
[08:29:31] or a single value over here. And then we
[08:29:34] have the batch input shape. So the batch
[08:29:36] input shape basically takes in three
[08:29:38] values. So the first value is the number
[08:29:40] of inputs, that is, the batch size. Second is the number of time stamps in the input sequence
[08:29:43] and third is the number of features at each
[08:29:45] time stamp. So over here if you don't know
[08:29:47] the number of inputs which you have you
[08:29:48] can set it as None. And since the input
[08:29:50] sequence has five time stamps, I set it as five and
[08:29:53] each time stamp holds a single value,
[08:29:55] so I've given the last one as one. And then we
[08:29:57] have another parameter over here which
[08:29:59] is basically return sequences. So this
[08:30:02] return sequences basically means that do
[08:30:04] I want the output for each time stamp or
[08:30:07] do I just want the final output. So if I
[08:30:10] set it to be true, I'll get the output
[08:30:12] for each time stamp. And if I set it to
[08:30:14] be false, I'll just get a final result.
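A sketch of the layer described here, assuming a recent Keras install. The video passes `batch_input_shape=(None, 5, 1)` to the LSTM layer; this sketch swaps in an explicit `Input(shape=(5, 1))`, which describes the same five time stamps with one feature each and leaves the batch size open. The second model exists only to show what `return_sequences=True` changes.

```python
from keras.models import Sequential
from keras.layers import LSTM, Input

# One LSTM unit, final output only: one predicted value per input sequence
model = Sequential()
model.add(Input(shape=(5, 1)))              # 5 time stamps, 1 feature each
model.add(LSTM(1, return_sequences=False))

# Same layer, but returning the output at every time stamp
seq_model = Sequential()
seq_model.add(Input(shape=(5, 1)))
seq_model.add(LSTM(1, return_sequences=True))

print(model.output_shape)      # (None, 1)
print(seq_model.output_shape)  # (None, 5, 1)
```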
[08:30:16] So over here I have set this to be
[08:30:18] false. Right? So now I have created the
[08:30:21] model. Now it's time to configure it for training.
[08:30:23] And to do that, I'll use the
[08:30:25] model.compile method. And it comprises
[08:30:28] of these three things. So first I'll set
[08:30:30] the loss which I want to minimize. So I
[08:30:32] want to minimize the mean absolute
[08:30:34] error. And then I'll set the
[08:30:36] optimization algorithm. So the
[08:30:37] optimization algorithm which I'm using
[08:30:39] is Adam. And then basically the
[08:30:41] evaluation metric which I want to use.
[08:30:43] So the evaluation metric which I'm using
[08:30:45] is accuracy. I'll click on run. Right.
[08:30:48] So now I've also compiled my model. Now
[08:30:50] let me just have a glance at the summary
[08:30:52] of this. I'll click on run now. Right.
[08:30:55] So we see that this is my LSTM model and
[08:30:57] I have my output shape over here. Now
[08:31:00] since I've also compiled my model, it's
[08:31:02] finally time to fit the model on top of
[08:31:04] the train set. So for this I'll use
[08:31:06] model.fit and I'm basically fitting
[08:31:08] this model on top of X train and Y
[08:31:11] train. And initially I'll set the number
[08:31:13] of epochs to be equal to 50, which is
[08:31:15] actually quite low. And I want this to be
[08:31:17] evaluated on top of the validation data.
[08:31:20] So validation data is basically X test
[08:31:22] and Y test. So I'm reiterating it. I'm
[08:31:25] fitting the data on top of X train and Y
[08:31:27] train and I'm validating it on top of X
[08:31:30] test and Y test. I'll click on run. So
[08:31:33] the model fitting is done. And this is
[08:31:35] what we have the final result over here.
[08:31:37] Now that we fit the model, let me also
[08:31:39] go ahead and predict the values on top
[08:31:41] of the test set. So I'll use
[08:31:43] model.predict and I'll pass in X test as
[08:31:46] the parameter inside this. I'll click on
[08:31:48] run. Right. So I have predicted the
[08:31:51] values and I have stored the result in
[08:31:52] this results object. Now that is done
[08:31:55] what I'll do is I will actually make a
[08:31:57] scatter plot which shows the
[08:32:00] actual values and the predicted values.
[08:32:03] So I'm representing the actual values
[08:32:06] which are present in y test with green
[08:32:08] colored dots and the predicted results
[08:32:10] which are present in results with red
[08:32:12] dots and I'll plot it out. Right? So
[08:32:15] these green dots which you see these are
[08:32:17] the actual values and these red dots
[08:32:20] which you see these are the predicted
[08:32:21] values. So with the help of this graph
[08:32:23] we understand that the prediction is
[08:32:25] totally wrong and our model is very bad.
[08:32:27] Now along with this we'll also make a
[08:32:29] plot of the loss. So plt.plot and I'll
[08:32:32] make a loss plot. I'll click on run.
[08:32:35] Right? So you see that the loss of the
[08:32:36] model does not decrease at all. Right?
[08:32:39] It stays constant and that is why this
[08:32:41] is a very very bad model. Now this is
[08:32:44] because we have not normalized the data.
[08:32:46] So whenever we don't normalize the data
[08:32:48] there is a chance of exploding
[08:32:50] gradients, and the model turns out
[08:32:52] to be really bad. So that is why we have
[08:32:54] to make sure that our data is
[08:32:56] normalized. So I'll go ahead and do
[08:32:57] that. So I will basically divide the
[08:33:00] input data by 100. Similarly I'll also
[08:33:02] divide the target data by 100. So what
[08:33:05] we are doing is normalizing the data.
[08:33:07] I'll click on run again. I'll click on
[08:33:09] run. Right. So now we have the
[08:33:11] normalized input and output data with
[08:33:13] us. Again I'll go ahead and convert
[08:33:15] these into numpy arrays. Again I'll
[08:33:17] split this into training and testing
[08:33:19] set. Build the model. Add this. Compile
[08:33:23] this. Now I'll go ahead and fit the
[08:33:26] model. I'll click on run.
[08:33:30] Right. So all of the 50 epochs are done
[08:33:32] and I'll predict it on top of the X
[08:33:34] test. Now again I'll go ahead and plot
[08:33:36] the actual values and the predicted
[08:33:38] values. I'll click on run. Right. So
[08:33:40] this time you see that there's a bit of
[08:33:42] improvement but then again this is not
[08:33:44] much. So initially all the red dots were
[08:33:46] scattered over here. Now it's just that
[08:33:49] they have moved up a bit, but then
[08:33:51] again there is no correct prediction at
[08:33:53] all. Again let me print out this loss
[08:33:56] plot. So this time it's okayish. It is
[08:33:58] reducing but then again the loss is not
[08:34:00] decreasing much. Now what I'll do is I
[08:34:03] will go ahead and increase the number of
[08:34:06] epochs. So the number of epochs was 50.
[08:34:08] I'll change the number of epochs to 500.
[08:34:11] And let's see if there's a difference in
[08:34:12] the prediction or not. I'll click on
[08:34:14] run.
[08:34:16] All right. So the model is being fit.
[08:34:18] Right. So all of the 500 epochs are
[08:34:20] done. Now I'll go ahead and predict this
[08:34:22] on top of the test set and again make a
[08:34:24] scatter plot of actual values and
[08:34:26] predicted values. So this time the
[08:34:28] prediction is much much better. But then
[08:34:30] again this is not 100% accurate. So now
[08:34:33] let me print out the loss plot again.
[08:34:35] Right? So this time the loss reduces in
[08:34:37] a better way. Now again what I'll do is
[08:34:39] I'll increase the number of epochs again
[08:34:41] and this time I'll set the number of
[08:34:43] epochs to be equal to 1,000 and let's
[08:34:46] see what happens. I'll click on run.
[08:34:52] So now we've fit the model. I'll predict
[08:34:53] this on top of the test set and again
[08:34:56] I'll make the scatter plot. I'll click
[08:34:58] on run. Right. So this time we see that
[08:35:00] the accuracy is very very good. The
[08:35:03] predicted values and the actual values
[08:35:05] are very close. So the red dots are the
[08:35:08] predicted values and the green dots are
[08:35:10] the actual values and we see that most
[08:35:13] of them are coinciding right and this is
[08:35:15] a very good result for us. So basically
[08:35:17] what we had to do was we had to
[08:35:19] normalize the data and after that we had
[08:35:21] to increase the number of epochs and
[08:35:23] this is how we could make correct
[08:35:25] predictions and reduce the loss and just
[08:35:27] to be sure let me again plot this loss
[08:35:30] right. So now we see that the loss is
[08:35:32] actually decreasing very well and after
[08:35:34] that it stays constant. Right? So this
[08:35:36] is the sort of loss plot we wanted. So
[08:35:38] this is how we could build an LSTM
[08:35:41] network which was able to predict the
[08:35:43] next number for a consecutive sequence
[08:35:45] of numbers.
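The data preparation that made the difference can be sketched with NumPy. The toy sequences below are an assumption (the walkthrough only says that each input has five values and the model predicts the next number of a consecutive sequence), and the variable names are hypothetical; the division by 100 is the scaling step described above.

```python
import numpy as np

# Assumed toy data: each input is 5 consecutive integers, the target is the next one.
starts = np.arange(0, 95)  # sequence start values 0..94
X = np.array([np.arange(s, s + 5) for s in starts], dtype="float32")
y = (starts + 5).astype("float32")

# Scale inputs and targets by 100 so every value falls between 0 and 1.
X_scaled = X / 100.0
y_scaled = y / 100.0

# Reshape to (samples, time steps, features), the layout an LSTM layer expects.
X_scaled = X_scaled.reshape(-1, 5, 1)

print(X_scaled.shape, y_scaled.shape)  # (95, 5, 1) (95,)
print(X_scaled[0].ravel(), y_scaled[0])
```

A prediction made on data scaled this way has to be multiplied by 100 to get back to the original range.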
[08:35:47] Here's a quiz question for you guys.
[08:35:49] What is tokenization in text processing?
[08:35:52] Your options are converting text to
[08:35:54] lower case, removing punctuation and
[08:35:56] special characters, breaking text into
[08:35:59] individual words or tokens or
[08:36:01] translating text from one language to
[08:36:03] another. Please mention your answers in
[08:36:05] the comment section. So tokenization is the process
[08:36:06] of turning a string or a text paragraph
[08:36:09] into smaller chunks and those chunks are
[08:36:11] also called tokens. Tokens can be
[08:36:14] sentences, they can be words, they can
[08:36:16] be even phrases or even symbols. When we
[08:36:19] extract smaller chunks or tokens from a
[08:36:22] huge text document, that process is
[08:36:24] called tokenization. We have some useful
[08:36:27] tokenization methods that we'll discuss
[08:36:29] here. So the first one that we'll
[08:36:31] discuss is called sentence tokenization
[08:36:34] which we will use to extract all the
[08:36:36] sentences that are present in a text
[08:36:38] document and then we'll move on to
[08:36:40] understand how we can extract individual
[08:36:43] words from a huge text document and then
[08:36:46] after that we'll use regular expressions
[08:36:48] to match patterns in our text document
[08:36:50] and only extract those words which match
[08:36:53] that regular expression. And then we
[08:36:55] have our blank line
[08:36:57] tokenizer, which splits text only where
[08:37:00] there is a blank line. Sentences that contain
[08:37:02] spaces or single line breaks are not broken up:
[08:37:04] it will ignore those line breaks and it will
[08:37:06] extract the whole block of text. So let's
[08:37:08] begin with our sentence tokenization. It
[08:37:11] is used to split a body of text into
[08:37:13] different sentences. To perform
[08:37:15] tokenization for sentences, we have to
[08:37:18] import the sent_tokenize function from
[08:37:21] the nltk.tokenize subpackage. After
[08:37:25] importing, we'll generate a sample text
[08:37:28] that we'll use for tokenization. I've
[08:37:31] used this text here: that Google will
[08:37:33] shut down its job application tracking
[08:37:34] system Google Hire that was launched
[08:37:36] just 2 years ago. The company had built
[08:37:39] Hire with Diane Greene, a former Alphabet
[08:37:41] board member. This is a sample text and
[08:37:44] using this text and using the sentence
[08:37:46] tokenizer we'll extract the different
[08:37:49] sentences present in this sample. For
[08:37:51] doing that I have initialized a variable
[08:37:53] called tokenized_text and I've set it
[08:37:56] equal to the output of the sent_tokenize function. So
[08:37:58] our sent_tokenize function will take the
[08:38:00] argument which is the sample text that we
[08:38:02] have generated here. So it
[08:38:04] will first of all extract all the
[08:38:06] sentences that are present in our sample
[08:38:09] and then we have stored all those
[08:38:10] sentences in the tokenized_text
[08:38:12] variable. So it will be a list of all
[08:38:14] the sentences that are present in our
[08:38:16] sample here and then we can print the
[08:38:19] list using the print function. So when
[08:38:21] we print we'll get all the sentences
[08:38:23] that are present in our sample text.
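The steps just described (import, sample text, tokenize, print) can be sketched as below. The sample text here is a made-up stand-in rather than the article text from the slide. `sent_tokenize` relies on NLTK's pretrained Punkt data; as an assumption for this sketch only, the code falls back to an untrained `PunktSentenceTokenizer`, which needs no download, if that data is not installed.

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical sample text with three simple sentences.
sample = (
    "Tokenization splits text into pieces. "
    "Each piece is called a token. "
    "Tokens can be sentences or words."
)

try:
    # Uses the pretrained Punkt model (requires the NLTK punkt data).
    tokenized_text = sent_tokenize(sample)
except LookupError:
    # Fallback for this sketch: an untrained Punkt tokenizer, no download needed.
    tokenized_text = PunktSentenceTokenizer().tokenize(sample)

# The result is a plain Python list with one string per sentence.
for sentence in tokenized_text:
    print(sentence)
print(len(tokenized_text))  # 3
```

On real-world text the pretrained model is the better choice, because it knows common abbreviations that an untrained tokenizer would treat as sentence ends.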
[08:38:25] You can see this is the first
[08:38:27] sentence that starts from here and
[08:38:29] it ends here and the second sentence
[08:38:31] starts from here and it ends here and
[08:38:34] the third one will start from here and
[08:38:36] end here. So we have three sentences
[08:38:39] present in our sample text. So now let's
[08:38:41] go to the Jupyter notebook and we'll
[08:38:43] implement the sentence tokenization on a
[08:38:46] sample text. Now let's implement it. First
[08:38:48] of all I'll import the sentence
[08:38:50] tokenizer from the nltk.tokenize package.
[08:38:55] So after importing I have to run this
[08:38:57] line. So I'll run this line. So my sent_tokenize
[08:39:00] function will be imported from the
[08:39:03] nltk.tokenize package. So now after that
[08:39:05] we'll generate a sample text. I've
[08:39:08] copied a text from a Google article. So
[08:39:11] I'll use this text and I'll extract all
[08:39:13] the sentences that are present in this
[08:39:16] text. So I'll just copy this and I'll
[08:39:18] generate a sample variable which will
[08:39:20] contain this text.
[08:39:23] So I'll paste my text here. So now we'll
[08:39:26] use this text to extract all the
[08:39:28] sentences that are present in this text.
[08:39:31] So I'll run this line to generate my
[08:39:33] sample variable. Now my sample text is
[08:39:36] generated. I will use sent_tokenize
[08:39:39] to tokenize all the sentences that are
[08:39:41] present in this document. So I'll write
[08:39:43] sample because my text is stored in the
[08:39:46] sample variable. This is our text
[08:39:48] document and I'll pass my text document
[08:39:50] to the sent_tokenize function. So it will
[08:39:52] tokenize all the sentences present in
[08:39:54] the sample and I'll store the result in
[08:39:57] sent_tokens.
[08:40:00] So when I run this line all of the
[08:40:01] sentences will be stored in
[08:40:04] sent_tokens. So now if I want to print the
[08:40:06] first five sentences that are present in
[08:40:09] sent_tokens, I'll write this line.
[08:40:13] So it will print the first five
[08:40:14] sentences that are present in
[08:40:16] sent_tokens.
[08:40:28] So if I run this, we'll get the first
[08:40:30] five sentences that are present in our
[08:40:32] sent_tokens after the sentence
[08:40:34] tokenization. The first sentence starts
[08:40:36] from here and it ends here. And this is
[08:40:39] our second sentence. The apps were
[08:40:41] removed by Google soon after. And this
[08:40:43] is our third sentence. So it starts from
[08:40:45] here and it ends here. And the fourth
[08:40:47] one is here. And this is our last
[08:40:49] sentence that we have in our sent_tokens
[08:40:52] list. So if I want to know how many
[08:40:54] sentences there are in my sent_tokens, I
[08:40:57] can just use the len function and
[08:41:00] pass my list, that is sent_tokens.
[08:41:03] And if I press Ctrl+Enter, I'll
[08:41:05] see that there are nine sentences
[08:41:07] present in this sample text. After this
[08:41:10] we will use the Gutenberg corpus
[08:41:13] from the NLTK corpora and we'll extract
[08:41:16] the sentences present in the files that
[08:41:18] we have in the Gutenberg corpus. First of
[08:41:20] all we have to import the Gutenberg
[08:41:22] corpus from nltk.corpus. So we'll
[08:41:24] write from nltk.corpus import
[08:41:27] gutenberg and after that I will import
[08:41:30] any one of the files that are present in
[08:41:32] the Gutenberg corpus. For importing a raw
[08:41:35] text file that contains all the text,
[08:41:37] we have to use the
[08:41:40] .raw method of gutenberg. So I'll write
[08:41:43] gutenberg.raw and then inside the
[08:41:45] parentheses we have to
[08:41:47] mention the name of the file that we
[08:41:49] want to import. We have a file
[08:41:51] called bryant-stories.txt in the Gutenberg
[08:41:54] corpus. So we'll import the raw text
[08:41:57] file. So this file will contain only
[08:41:59] raw text and I'll store it in sample
[08:42:02] as a sample text and then we will use
[08:42:04] this sample text to tokenize all the
[08:42:06] sentences using the sent_tokenize
[08:42:09] function and I'll store all the sentences
[08:42:11] present in this file in tokenized_text
[08:42:14] and then I have printed the sentences
[08:42:16] from the sixth sentence to
[08:42:19] the 14th sentence.
[08:42:22] These are our sentences from 6 to 14. So
[08:42:24] now let's go to the Jupyter notebook to
[08:42:27] implement this. We'll start by first of
[08:42:29] all importing the gutenberg corpus from
[08:42:31] the NLTK corpora. For doing that I'll
[08:42:33] write from nltk
[08:42:36] .corpus import gutenberg.
[08:42:40] Once I press Ctrl+Enter,
[08:42:42] my gutenberg corpus will be
[08:42:44] imported. Now I want to import a file
[08:42:48] from the gutenberg corpus that
[08:42:51] we'll use as a sample text for
[08:42:53] tokenization. So I'll store the text in
[08:42:56] a variable called sample and then I'll
[08:42:58] use the gutenberg corpus and the raw
[08:43:02] method of the gutenberg corpus that will
[08:43:04] extract all the raw text, all the text
[08:43:06] that is present in a file, and inside the
[08:43:08] parentheses I have to mention the name
[08:43:10] of the file. The file I'm importing is
[08:43:13] called bible-kjv.txt. So this will import
[08:43:16] this raw text file, all the text that is
[08:43:18] present in this file, and the text will
[08:43:20] be stored in sample. So once I run
[08:43:22] this line my sample text is generated.
[08:43:24] So now if I look at the sample text,
[08:43:27] you will see we have all the
[08:43:29] text that is present in our text
[08:43:31] file.
[08:43:33] So now I want to extract only
[08:43:36] sentences from this whole
[08:43:38] text file. I'll use sent_tokenize to
[08:43:41] extract all the sentences.
[08:43:44] So I'll store all the sentences in
[08:43:46] sent_tokens.
[08:43:48] Then I'll use the sent_tokenize function
[08:43:50] to extract the sentences
[08:43:53] and inside the parentheses I have to
[08:43:55] mention the text from which I want to
[08:43:57] extract all my sentences. So that is our
[08:43:59] sample. So now when I press Ctrl+Enter,
[08:44:02] all the sentences that are present
[08:44:05] in this text file will be stored
[08:44:07] in sent_tokens. And now if I want to
[08:44:09] print the first 10 sentences from this list
[08:44:12] here, I can just slice sent_tokens
[08:44:17] and it will print the first 10 sentences
[08:44:20] that are present in our sent_tokens.
[08:44:22] So this is the first sentence. It starts
[08:44:23] from here and it ends here and the
[08:44:25] second is this one. And similarly
[08:44:27] the 10th sentence is this
[08:44:29] one. So these are the first 10 sentences
[08:44:32] that are present in this text file. So
[08:44:34] now the next function that we'll use is
[08:44:36] word_tokenize that will extract all the
[08:44:39] individual words present in the whole
[08:44:41] text. For that I would have to import
[08:44:43] the word_tokenize function from the tokenize
[08:44:46] subpackage in the NLTK package and then
[08:44:48] I will take a sample text from which I
[08:44:51] will extract all the words and then I
[08:44:53] will use the word_tokenize function and
[08:44:56] I'll pass my sample text and all these
[08:44:58] words will be stored in a
[08:45:02] variable. So it will be a list of all
[08:45:04] the words. And now if I want to print
[08:45:06] the first 20 words, I'll use this print
[08:45:08] statement. And the first 20 words that
[08:45:11] are present in this text will be
[08:45:13] extracted and they are present in the form
[08:45:15] of a list. And similarly, we can use a
[08:45:17] corpus. Here we have used the genesis
[08:45:19] corpus that is also present in the NLTK
[08:45:21] corpora folder. So if you go to the NLTK
[08:45:24] corpora folder, you'll find a corpus
[08:45:25] named genesis which has a text file
[08:45:28] named english-kjv.txt. So we have
[08:45:31] imported first of all the genesis corpus
[08:45:33] and the word_tokenize function and we
[08:45:35] have stored all the text that is present
[08:45:37] in this file using the .raw method of the
[08:45:40] genesis corpus and I stored all the text
[08:45:43] in sample. So our sample contains all
[08:45:45] the text and then we have used the word_tokenize
[08:45:47] function on this sample and all
[08:45:49] the words that are present in this text
[08:45:51] file will be stored in a variable and
[08:45:54] then we have printed the first 30
[08:45:56] words. In the first 30 words you can see
[08:45:58] in the beginning God
[08:45:59] created. So all these words are
[08:46:01] separated and they are stored in the form of
[08:46:03] a list. So let's go to Jupyter notebook
[08:46:06] and we'll perform it on a corpus. So we
[08:46:09] have the same corpus file, that is bible-kjv.txt. So
[08:46:12] I'll use this corpus and I'll use the
[08:46:14] sample text that we have and I'll
[08:46:16] extract all the words that are present
[08:46:18] in this sample.
[08:46:21] So now first of all I'll import the
[08:46:24] word_tokenize function from the NLTK package.
[08:46:30] So after I run this line my
[08:46:32] word_tokenize function will be imported. Now
[08:46:35] I'll use the sample text and I'll
[08:46:38] extract all the words that are present
[08:46:39] in the sample text. So I'll store the
[08:46:42] words in a word_tokens list and I'll set
[08:46:45] it equal to the output of the function
[08:46:47] word_tokenize
[08:46:49] and I'll pass my sample that I've
[08:46:51] created above.
[08:46:54] Now when I run this line all the words
[08:46:56] that are present in the sample text will
[08:46:59] be stored in word_tokens. So after that
[08:47:02] if I want to print the first 50 words
[08:47:05] I'll just slice word_tokens.
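Word tokenization can be sketched on a single verse instead of the whole corpus file, which avoids downloading the corpus. The sample sentence is the opening of the King James text quoted earlier. `word_tokenize` needs the NLTK Punkt data; as an assumption for this sketch only, the code falls back to `TreebankWordTokenizer`, which needs no download, when that data is missing.

```python
from nltk.tokenize import TreebankWordTokenizer, word_tokenize

sample = "In the beginning God created the heaven and the earth."

try:
    # Standard route: needs the NLTK punkt data to be installed.
    word_tokens = word_tokenize(sample)
except LookupError:
    # Fallback for this sketch: a rule-based word tokenizer with no data files.
    word_tokens = TreebankWordTokenizer().tokenize(sample)

print(word_tokens[:5])   # ['In', 'the', 'beginning', 'God', 'created']
print(word_tokens[-1])   # '.'  (punctuation becomes its own token)
```

Slicing the list, as in `word_tokens[:5]`, is how the first few tokens are picked out.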
[08:47:09] Now if I run this line, the first 50 words
[08:47:13] of the sample text will be extracted and
[08:47:16] stored in the form of a list. You can
[08:47:18] see we have this also as a
[08:47:20] single token and this also as a single
[08:47:22] token. So even if there is
[08:47:24] a symbol or punctuation mark, that will
[08:47:27] also be stored as a token. Now the third
[08:47:29] function that we'll use for tokenization
[08:47:31] is regular expression tokenization, which
[08:47:33] is RegexpTokenizer. This is the
[08:47:36] function. So in this function we
[08:47:38] actually pass a regular expression and
[08:47:40] it will match all the words that are
[08:47:43] present in our text with that regular
[08:47:45] expression and it will extract all those
[08:47:47] words. So for example first of all we
[08:47:49] have to import RegexpTokenizer from
[08:47:52] nltk.tokenize. By running this line, you
[08:47:55] will import the regular expression
[08:47:56] tokenizer and then we have a sample text
[08:47:59] here which we'll use for extraction. And
[08:48:02] then we have to define a regular
[08:48:04] expression which will be used to match
[08:48:07] every single word that is present in
[08:48:09] this text document. And according to the
[08:48:12] regular expression if the words match
[08:48:14] the regular expression those those words
[08:48:16] will be extracted and stored in our capword
[08:48:18] tokens as a list. First of all we
[08:48:20] have to import the class. So this is the
[08:48:22] regular expression tokenizer class. So
[08:48:24] we have to initialize the class. So this
[08:48:27] is an object of the class. We have named
[08:48:29] the object capword_tokenizer and in
[08:48:33] the class arguments, when we initialize
[08:48:35] the class, we'll pass our regular
[08:48:37] expression. This is the regular
[08:48:38] expression that we will use to extract
[08:48:41] all the words which begin with a capital
[08:48:43] letter. This is a character range. So it
[08:48:45] will look at the start of a word
[08:48:48] and it will find any letter that lies in
[08:48:51] this range. So if a word begins with any
[08:48:53] of these letters from capital A to Z,
[08:48:55] it will take that word. And then the backslash is
[08:48:58] an escape character for w, and \w+
[08:49:01] means one or more word characters after the capital letter.
[08:49:04] So if there is any word which starts
[08:49:05] with a capital letter and there is any
[08:49:07] other letter after it,
[08:49:10] it will pick all those words and
[08:49:12] return them. So now we have to use
[08:49:15] the .tokenize method of the capword_tokenizer
[08:49:18] object to
[08:49:21] tokenize this text and we'll write
[08:49:23] capword_tokenizer.tokenize, then we'll
[08:49:25] pass our sample text and then when we
[08:49:28] run this line all those words which
[08:49:30] match this regular expression will be
[08:49:32] extracted and will be displayed. So
[08:49:34] let's go to the Jupyter notebook and
[08:49:37] extract different words according to a
[08:49:39] regular expression. First of all, I will
[08:49:41] start with importing the regular
[08:49:43] expression tokenizer class from
[08:49:46] nltk.tokenize. I'll write from nltk
[08:49:49] .tokenize.
[08:49:54] I've imported the regular expression
[08:49:56] tokenizer class. Now I'll create an object of
[08:50:00] this class that we'll use for performing
[08:50:02] regular expression tokenization. So I'll
[08:50:04] name the object
[08:50:08] regex_tokens
[08:50:10] and then I'll initialize the class
[08:50:17] and now I have to inside the parenthesis
[08:50:19] I have to mention a regular expression
[08:50:21] that will be used to match all the words
[08:50:23] that are present in our text document to
[08:50:25] extract those words. So I have to write
[08:50:28] inverted commas and inside them
[08:50:30] I'll mention my regular expression. So
[08:50:32] if I want to
[08:50:35] extract the numbers
[08:50:38] from my
[08:50:40] text document,
[08:50:42] I'll pass a range of digits. So
[08:50:45] it will search for all these numbers,
[08:50:47] and even if there is a word after the
[08:50:49] number it will also extract those tokens.
[08:50:52] So my regular expression tokenizer will
[08:50:54] extract all the numbers and even if
[08:50:55] there is any word after the number it
[08:50:58] will also extract those tokens. So I'll
[08:51:00] initialize this class. Now we'll use the
[08:51:03] object instance of this class to extract
[08:51:07] tokens which match this criterion from
[08:51:09] our sample that we have used above. So
[08:51:12] this is our sample, that
[08:51:14] is the bible-kjv.txt sample from the gutenberg
[08:51:17] corpus. I'll use the object
[08:51:20] that is regex_tokens
[08:51:22] and there's a method called tokenize. So
[08:51:25] I'll use a method tokenize
[08:51:28] and I'll pass a sample that I have
[08:51:31] generated above. So once I run this
[08:51:34] line, you'll see all these numbers that
[08:51:37] are present in our text document are
[08:51:39] extracted. You can see that after the number
[08:51:41] there is a comma and an inverted comma. So that
[08:51:43] is also shown in the list.
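Both patterns discussed here can be tried on a short sentence. This is a minimal sketch: the sample sentence is made up, `[A-Z]\w+` is the capital-letter pattern described on the slide, and `[0-9]+\w*` is an assumed reconstruction of the number pattern, since the exact expression typed in the notebook is not spelled out.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample sentence.
sample = "In Genesis, Chapter 1 has 31 verses and the 2nd chapter has 25."

# Words that start with a capital letter followed by one or more word characters.
capword_tokenizer = RegexpTokenizer(r"[A-Z]\w+")
print(capword_tokenizer.tokenize(sample))  # ['In', 'Genesis', 'Chapter']

# Assumed pattern: a run of digits, plus any word characters directly after it.
number_tokenizer = RegexpTokenizer(r"[0-9]+\w*")
print(number_tokenizer.tokenize(sample))   # ['1', '31', '2nd', '25']
```

Only the text that matches the pattern is returned; everything else in the sample is dropped.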
[08:51:46] So among all these numbers there is no number
[08:51:48] which has any word after it. So all
[08:51:50] these numbers are extracted from our
[08:51:53] text document. So the next method that
[08:51:55] we are going to use is called the blank
[08:51:57] line tokenizer. The blank line tokenizer
[08:52:00] splits text only at blank lines, so a sentence is preserved even if
[08:52:03] there is a line break in the middle of it.
[08:52:05] So if we have a sentence and in between
[08:52:07] the words of that sentence there's a new
[08:52:10] line character, which is \n, that
[08:52:13] sentence will be preserved and a new token
[08:52:15] won't be created. So in the background
[08:52:17] this class actually uses a regular
[08:52:20] expression to find the blank lines
[08:52:22] that separate the tokens. We'll start off by
[08:52:26] importing the BlanklineTokenizer class
[08:52:28] from nltk.tokenize. So after
[08:52:31] importing we'll create a sample text.
[08:52:33] This is our text, where the text
[08:52:35] says that good muffins cost $3.88 in New
[08:52:38] York. And you can see there is a new
[08:52:40] line character in between the words. And
[08:52:43] here also there are new line characters.
[08:52:45] So once we initialize the class
[08:52:48] BlanklineTokenizer and then we use the
[08:52:50] tokenize method of this class and
[08:52:53] pass our sample text,
[08:52:57] our two sentences will be extracted
[08:52:59] and you can see the first sentence
[08:53:00] starts from here and ends here. So it
[08:53:03] did not treat it as a new sentence.
[08:53:06] It preserved all the line breaks that are
[08:53:08] present in the sentence. And the second
[08:53:10] example is also a text where we have
[08:53:13] this as the first sentence. So there was
[08:53:15] a very loud beep is the first sentence
[08:53:18] and the next sentence starts from here
[08:53:19] and there is a new line character
[08:53:22] in between and here also there are new
[08:53:24] line characters. So once we tokenize
[08:53:26] this using the BlanklineTokenizer
[08:53:28] class, we'll see that we have
[08:53:30] sentences but the new line
[08:53:32] characters inside them are
[08:53:34] preserved. Now we'll go to Jupyter
[08:53:36] notebook and we'll implement the blank
[08:53:39] line tokenizer. First of all, we'll start
[08:53:40] by importing the blank line tokenizer
[08:53:43] class from the NLTK dot tokenize.
[08:54:02] Once my class is imported, now we can
[08:54:06] define a sample text which we'll use for
[08:54:09] blank line tokenization. So I'll write a
[08:54:12] sample here.
[08:54:14] So this is my sample. These are the words
[08:54:16] that we are using. So good muffins cost
[08:54:20] $3.88 and there is a new line
[08:54:22] character and there is a space also. So
[08:54:25] it will preserve these new line characters and
[08:54:27] spaces as well. So I'll initialize the
[08:54:30] BlanklineTokenizer class
[08:54:34] and I'll use the tokenize method of this
[08:54:37] class to tokenize the
[08:54:40] blank lines and sentences.
[08:54:44] I'll pass my sample text to this
[08:54:46] function. And then when I run this,
[08:54:54] So after importing the BlanklineTokenizer
[08:54:56] class, now we'll define a sample
[08:54:58] text that we'll use for tokenization. So
[08:55:01] I'll just write sample. I've copied the
[08:55:04] text. So we'll use the same text to see how
[08:55:08] the blank spaces and new line
[08:55:11] characters that are present in the text are handled.
[08:55:13] So I'll initialize my blank line
[08:55:15] tokenizer. So I'll firstly run this line
[08:55:18] to create my sample variable. Now my
[08:55:20] sample text is created. Now I'll
[08:55:22] initialize the BlanklineTokenizer class
[08:55:26] and I'll use the .tokenize method to
[08:55:31] tokenize the text. So I'll pass my
[08:55:34] sample text into this method. And so
[08:55:36] now I'll initialize the BlanklineTokenizer
[08:55:38] class and I'll use the
[08:55:41] .tokenize method of that class to
[08:55:44] tokenize my text.
[08:55:53] So after that now I'll initialize my
[08:55:55] blank line tokenizer class.
[08:55:59] So I've initialized the class and then
[08:56:01] I'll use the tokenize method of this
[08:56:03] class to tokenize the sample text.
[08:56:08] So I'll pass the sample text into this
[08:56:11] method. And then when I run it, I'll get
[08:56:14] two tokens, because this first block
[08:56:17] contained new line characters and
[08:56:19] spaces as well. So it has preserved this
[08:56:22] whole block of two sentences. Then we
[08:56:25] have the other sentence that is thanks.
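A minimal sketch of this step follows. The exact wording of the sample string is an assumption modelled on the text described (muffins, single line breaks, a blank line, then "Thanks."); `BlanklineTokenizer` needs no extra NLTK data.

```python
from nltk.tokenize import BlanklineTokenizer

# Single newlines inside the first block, one blank line before "Thanks."
sample = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

blocks = BlanklineTokenizer().tokenize(sample)

# repr() makes the preserved \n characters visible.
for block in blocks:
    print(repr(block))
print(len(blocks))  # 2
```

The first token keeps its internal line breaks and double space; the split happens only at the blank line.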
[08:56:28] That is how we use the blank line tokenizer
[08:56:30] to split text at blank lines while preserving the new lines or spaces present
[08:56:34] inside each block. So far we have tokenized into sentences, words, and regular
[08:56:37] expression matches. Now
[08:56:39] we'll understand what frequency
[08:56:41] distribution is and how we can leverage
[08:56:44] frequency distribution. So after
[08:56:46] performing tokenization using sent_tokenize,
[08:56:48] word_tokenize and regular
[08:56:50] expressions, now we'll understand what
[08:56:52] frequency distribution is and how we can
[08:56:55] use frequency distribution to get more
[08:56:57] insights from the text. A frequency
[08:57:00] distribution basically represents
[08:57:02] the word frequency in a text. So it will
[08:57:04] count all the word frequencies and it
[08:57:07] will display a list of tuples where the
[08:57:09] first element of the tuple would be the
[08:57:11] word itself and the next element of the
[08:57:14] tuple would be the number of occurrences
[08:57:16] of that word in that text. So we'll use
[08:57:18] this to gain insights from the data and
[08:57:20] we'll see which are the most common
[08:57:22] words in our text data and then we can
[08:57:25] use that information to extract
[08:57:28] different information from the text and
[08:57:30] we can analyze the text further. So for
[08:57:32] doing that first of all we will import
[08:57:34] word_tokenize to tokenize the words.
[08:57:36] That is the first step and then we are
[08:57:38] using the gutenberg corpus from the NLTK
[08:57:41] corpora as a sample text and then we have
[08:57:44] to import the FreqDist
[08:57:46] frequency distribution class that
[08:57:48] contains the functions to perform
[08:57:50] frequency distribution. So after that
[08:57:52] we'll create a sample text which is from
[08:57:54] the carroll-alice.txt file which is present
[08:57:57] in the Gutenberg corpus. We'll import
[08:57:59] the raw text file from this corpus and
[08:58:02] we'll store it in sample and after that
[08:58:04] we'll tokenize all the words. We'll
[08:58:06] extract all the words that are present
[08:58:08] in this corpus and we'll store the words
[08:58:10] in tokenized_words. So after we have all
[08:58:13] the words that are present in this
[08:58:15] corpus, we'll use the frequency
[08:58:17] distribution class and we'll pass
[08:58:19] tokenized words as an argument. So it
[08:58:21] will count the frequency distribution of
[08:58:24] every single word and it will store that
[08:58:26] in fdist. This is the object that we
[08:58:28] have created, fdist, of this class.
[08:58:30] This object contains the information
[08:58:32] on the word frequencies of all the words
[08:58:35] present in the text document. After
[08:58:37] that, if you want to find which words
[08:58:40] occur the most in our text document,
[08:58:43] we'll use the most_common method of the fdist
[08:58:46] object. So we'll write fdist.most_common
[08:58:49] with 20. This will print the 20
[08:58:52] most common words that occur in the
[08:58:54] text. As you can see here, the comma
[08:58:57] occurred 2418 times in the text file
[08:59:01] and "the" occurred 1516 times. So you can
[08:59:06] see we have a lot of punctuation marks
[08:59:07] that don't mean anything. If you look
[08:59:10] for important words, there's
[08:59:12] only one important word that you can see
[08:59:14] here, Alice, which occurs 394 times. So
[08:59:18] from here you can actually get the
[08:59:20] information that this text has something
[08:59:22] to do with Alice and we can do the same
[08:59:24] thing by plotting the frequency
[08:59:27] distribution. We use the same object
[08:59:28] that is fdist, and use the plot
[08:59:31] method of this
[08:59:34] object. When we pass 20, it will
[08:59:37] plot a graph for
[08:59:40] the 20 most common words that occur in
[08:59:42] the text. Here you can see that the only
[08:59:45] word that makes sense is Alice. Other
[08:59:46] words are not as useful. So to implement
[08:59:50] frequency distribution in Python, we
[08:59:53] have to import the FreqDist
[08:59:55] class from nltk.probability.
[08:59:59] The first step is to create a
[09:00:02] sample text or a sample document from
[09:00:04] which we can extract the frequent words
[09:00:07] and then plot their frequency
[09:00:09] distribution. Firstly, we'll create a
[09:00:12] sample and after creating the sample,
[09:00:15] we'll tokenize all the words that are
[09:00:16] present in the sample. And after
[09:00:18] tokenizing all the words that are
[09:00:20] present in the sample, we'll create a
[09:00:22] frequency distribution using the
[09:00:24] frequency distribution class. So, I have
[09:00:26] imported the word_tokenize function from
[09:00:29] nltk.tokenize, and then we have imported
[09:00:31] the gutenberg corpus from nltk.corpus, and then,
[09:00:36] for the frequency distribution, the
[09:00:38] FreqDist class from
[09:00:40] nltk.probability, and
[09:00:41] after that I have imported the raw file,
[09:00:44] that is carroll-alice.txt, from the Gutenberg
[09:00:47] corpus and I have stored the text in
[09:00:49] sample. So after creating the
[09:00:52] sample now we'll use the word tokenize
[09:00:54] function to tokenize to extract all the
[09:00:57] words that are present in this sample.
[09:00:59] So for doing that I have used the word
[09:01:01] tokenize function and I've passed my
[09:01:03] sample into it and it will extract all
[09:01:05] the words that are present in the sample
[09:01:07] and it will store the words in tokenized
[09:01:09] words. So after we have the list of all
[09:01:12] the tokenized words we will pass our
[09:01:14] tokenized words to the frequency
[09:01:17] distribution class and we will create an
[09:01:19] instance of that class that is named
[09:01:21] here as fdist. Then if we print fdist
[09:01:24] it will show us how many samples
[09:01:26] exist, or if we just write fdist, then
[09:01:29] it will show us a dictionary
[09:01:31] which will contain the words as the keys
[09:01:34] and the values for the keys as
[09:01:35] the word counts. So let's see how we can
[09:01:38] implement this in Python using the
[09:01:39] Jupyter notebook. So we'll start off by
[09:01:41] importing all the required libraries.
[09:01:44] First of all I'll import the word_tokenize
[09:01:46] function to tokenize my words
[09:01:48] from nltk.tokenize. Now I'll import
[09:01:52] the Gutenberg corpus, from which I'll pick
[09:01:55] a text file.
[09:02:01] After that, we have to import our
[09:02:04] frequency distribution class, which we'll
[09:02:06] be using for creating a frequency
[09:02:09] distribution of the words. So I have to
[09:02:11] import it from nltk.probability.
[09:02:18] So I'll run all this code to import all
[09:02:20] the required libraries. Now after
[09:02:22] importing I'll create a sample text so
[09:02:25] that I can tokenize all the words from
[09:02:27] it. For doing that, I'll store my text
[09:02:29] in sample and I'll use the Gutenberg
[09:02:32] corpus,
[09:02:34] and I'll use the raw method to import a
[09:02:36] raw text file, which returns all
[09:02:38] the text present in that file, and I'll
[09:02:40] use carroll-alice.txt.
[09:02:45] Once I run this line with Ctrl+Enter,
[09:02:48] my sample will be generated
[09:02:50] and all the text that is present in the
[09:02:52] carroll-alice.txt file will be stored in the
[09:02:54] sample variable. So now, after we have
[09:02:57] all the text now the first step is to
[09:02:59] tokenize the text which means to extract
[09:03:02] all the words present in the text. So
[09:03:04] I'll use word tokenize for extracting
[09:03:06] all the words from this text. So I'll
[09:03:09] store my words in word tokens.
[09:03:13] I'll use the word tokenize function and
[09:03:15] I'll pass my sample into it. So once I
[09:03:18] press Ctrl+Enter, all the words
[09:03:20] that are present in the
[09:03:22] sample will be stored in word_tokens.
[09:03:23] So now if you want to know what
[09:03:26] the frequency distribution of every
[09:03:29] single word that is present in the
[09:03:31] sample, or word_tokens, is, we'll use the
[09:03:33] frequency distribution class. So we'll
[09:03:35] create an object of this class. Let's
[09:03:38] call it freq,
[09:03:40] and now we'll initialize the class.
[09:03:43] This class takes one argument, that
[09:03:46] is, all the words that
[09:03:48] we want to count frequencies for. So I'll
[09:03:51] pass my word tokens into it.
[09:03:54] Now if I press Ctrl+Enter, all the
[09:03:56] words along with their frequencies will
[09:03:58] be stored in the freq object.
[09:04:02] So now if I print freq, you will see
[09:04:06] we get a dictionary. The keys of this
[09:04:08] dictionary are the individual words that
[09:04:11] are present in this text document and
[09:04:13] the values are the word
[09:04:15] counts. So how many times a word
[09:04:18] appears in this text document
[09:04:20] is printed as the value for its key.
[09:04:23] So our frequency object is freq. So now
[09:04:26] we'll use this object to print the most
[09:04:29] common words that are present in our
[09:04:30] text document. We'll add the object
[09:04:32] name. Then we'll use the most common
[09:04:34] method of this object.
[09:04:37] Then we can mention how many words we
[09:04:40] want. So if I want 30 most common words
[09:04:42] that occur in my document. So I'll add
[09:04:44] 30. And when I press control enter, the
[09:04:47] 30 words that occur most in my document
[09:04:49] will be displayed. And out of all these
[09:04:51] words you can see the only word that
[09:04:53] conveys some information to us is Alice.
[09:04:55] So we can think of this
[09:04:57] document as having something to do with
[09:05:00] Alice. And if we want to plot this
[09:05:03] frequency distribution instead of
[09:05:05] printing it, we can use the plot
[09:05:07] method of this object.
[09:05:09] So I'll plot the 30 most
[09:05:12] common words that are present in the
[09:05:13] text document. Now you can see
[09:05:15] that these are the words that occur in
[09:05:17] my text document. "the" is occurring
[09:05:19] around 1500 times and the word Alice is
[09:05:22] occurring around 400 times.
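
The frequency-distribution steps walked through above can be sketched as follows. This is a minimal stand-in using only the standard library, not the notebook's exact code: the short `sample` string and the regex tokenizer are assumptions that replace `gutenberg.raw("carroll-alice.txt")` and `word_tokenize`, and `collections.Counter` is used because NLTK's `FreqDist` is a subclass of it, so `most_common` behaves the same way.

```python
import re
from collections import Counter

# Assumed stand-in text; the video loads gutenberg.raw("carroll-alice.txt").
sample = (
    "Alice was beginning to get very tired of sitting by her sister, "
    "and Alice had nothing to do."
)

# Rough substitute for nltk.tokenize.word_tokenize: words and punctuation
# marks become separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

# nltk.probability.FreqDist subclasses Counter, so FreqDist(word_tokens)
# would expose the same most_common() method.
freq = Counter(word_tokens)

# List of (token, count) tuples, highest counts first.
top_tokens = freq.most_common(3)
print(top_tokens)
```

With NLTK installed and the Gutenberg corpus downloaded, the same flow would be `FreqDist(word_tokenize(gutenberg.raw("carroll-alice.txt")))`; the `plot` method shown in the video additionally needs matplotlib.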
[09:05:25] >> Let's have a quick quiz question guys.
[09:05:27] Your question is what does natural
[09:05:28] language processing or NLP refer to?
[09:05:31] Your options are a method for
[09:05:33] identifying plant species in the wild, a
[09:05:35] sub field of computer science that deals
[09:05:37] with human computer communication, a
[09:05:39] branch of linguistics focused on
[09:05:41] studying regional dialects, or a
[09:05:44] technique for extracting information and
[09:05:46] meaning from human language. Please
[09:05:48] mention your answers in the comment
[09:05:49] section.
[09:05:51] Now we'll understand what stop words are and
[09:05:54] why we need to remove these words from
[09:05:55] our text document in order to gain more
[09:05:58] and clearer insights from our
[09:05:59] text data. Stop words are basically
[09:06:02] the useless words that we have already
[09:06:04] seen in the frequency distribution that
[09:06:06] don't convey any information to us.
[09:06:08] Instead of keeping these words, we'll
[09:06:11] remove these words. So English has some
[09:06:14] stop words and the stop words are
[09:06:16] actually present in the NLTK corpora. If
[09:06:18] you go to the NLTK corpora folder, you'll
[09:06:21] find a corpus named stopwords. And in
[09:06:23] the stopwords corpus, we have different text
[09:06:25] files which contain stop words for
[09:06:27] different languages, 16 languages.
[09:06:30] I will import the English version,
[09:06:32] the English stop words. For importing
[09:06:34] stop words, we just have to import
[09:06:37] stopwords from nltk.corpus.
[09:06:39] And then we'll store these stop words in
[09:06:42] stop_words. And we'll wrap them in the set
[09:06:45] function. Converting to a set does not
[09:06:47] sort the words; it removes duplicates and
[09:06:49] makes membership checks fast. All the
[09:06:51] words that are present in the English
[09:06:53] stop words text file will be stored in
[09:06:55] stop words. And once we print these stop
[09:06:58] words, we'll get all these stop words
[09:07:00] which are like words that occur most of
[09:07:02] the times in your document and they
[09:07:04] don't convey any information to us.
[09:07:06] After importing these stop words, we
[09:07:08] want to remove the stop words from our
[09:07:10] text document. Here's the method that
[09:07:12] we have used. First of all, I imported
[09:07:15] the sent_tokenize and word_tokenize
[09:07:17] functions from nltk.tokenize. Then I've
[09:07:19] created an empty list which is called
[09:07:22] filtered_words. And then I have imported
[09:07:25] the carroll-alice.txt file from the
[09:07:27] Gutenberg corpus and I've stored all the
[09:07:29] text in sample. So now firstly we'll
[09:07:31] tokenize all the words that are present
[09:07:34] in the sample using the word tokenize.
[09:07:37] All the words will be stored in tokenize
[09:07:39] word.
[09:07:41] If we want to remove the stop words from
[09:07:44] this tokenized words uh we can use this
[09:07:47] line this code here.
[09:07:50] We have used a for loop. So it will
[09:07:52] iterate through every word that is
[09:07:54] present in tokenized word and it will
[09:07:57] check if that word is in stop words or
[09:07:59] not. So if the word that is present in
[09:08:01] tokenized word is not in stop words then
[09:08:04] it will continue and it will check
[09:08:06] whether the length of the word is
[09:08:08] greater than three or not. If the length
[09:08:10] is not greater than three, the word
[09:08:12] is skipped, and if the
[09:08:15] length is greater than three, then it
[09:08:16] will append that word to the filtered
[09:08:19] words. So in the filtered words we'll
[09:08:21] only have words that are not present in
[09:08:23] the stop words and which
[09:08:26] have a length greater than three
[09:08:28] characters. Because we require more than three, it
[09:08:31] will remove all the punctuation marks as
[09:08:33] well. This is a handy method to remove
[09:08:36] punctuation marks. So at the end we'll
[09:08:38] get a filtered word list which will
[09:08:41] contain only the remaining words. So after we have
[09:08:44] our filtered word list, we
[09:08:46] create a frequency distribution of all
[09:08:48] the words present in our filtered words
[09:08:50] using the frequency distribution class,
[09:08:52] and now we have 2,733
[09:08:55] samples. So you can see a lot of
[09:08:57] words have been removed from this
[09:08:59] sample. And now if we print the 20 most
[09:09:01] common words, you can see I
[09:09:04] get a lot of information from the words
[09:09:06] because the punctuation marks and the
[09:09:08] stop words were removed. So you can see
[09:09:10] we have Alice and we have Queen, then we
[09:09:13] have King, then we have Turtle, Mock,
[09:09:16] Hatter, Gryphon. All these words actually
[09:09:18] convey more information to us than
[09:09:21] the words we had before. This is
[09:09:23] one step in pre-processing: we have
[09:09:25] removed all the stop words and all the
[09:09:27] punctuation marks. And if we now plot the
[09:09:30] frequency distribution of all the words
[09:09:33] after removing the stop
[09:09:35] words, you can see the word that
[09:09:37] occurs most of the time is Alice. So now
[09:09:41] we are pretty sure that this text has
[09:09:43] something to do with Alice. So this is
[09:09:45] how we make our machine or computers
[09:09:48] understand the text using
[09:09:49] pre-processing. So now let's go to the
[09:09:51] Jupyter notebook and we'll implement how
[09:09:54] to remove stop words from a text
[09:09:56] document. So I'll start by importing the
[09:09:58] stopwords corpus from the NLTK corpora.
[09:10:04] So my stopwords corpus is now imported.
[09:10:06] Now I will store
[09:10:08] all the English stop words in a list
[09:10:11] called stop_words.
[09:10:16] Now when I mention the name of my
[09:10:18] corpus, I can use the words method.
[09:10:21] It will extract all the words
[09:10:23] directly from a corpus. Now I have to
[09:10:26] mention which language I want. I want
[09:10:29] the English one, so I'll write english.
[09:10:31] And then I use the set function, which
[09:10:34] does not sort the words but removes
[09:10:37] duplicates and speeds up lookups. So now if I add
[09:10:40] the set function and run this, my
[09:10:42] stop words are stored in this set. Now
[09:10:45] if I print my stop words.
[09:10:49] So now you can see these are all the English
[09:10:51] stop words that are imported. You
[09:10:54] can see there is "no", "once", "other", "which", "d",
[09:10:56] "have", "until". These words don't convey
[09:10:59] any useful information to us. So we
[09:11:02] remove them before performing any text
[09:11:04] analysis. So after importing stop words
[09:11:07] let us import a corpus or a sample text
[09:11:09] from which we'll remove these stop
[09:11:11] words. If I go to my NLTK corpora
[09:11:15] folder, you'll find there is a corpus
[09:11:17] called genesis, and in this genesis corpus we
[09:11:20] have different text files. So I'll
[09:11:22] import the english-web.txt file from
[09:11:24] here. This is one document in our
[09:11:27] corpus. There are several documents in a
[09:11:29] corpus. And once I open this you can see
[09:11:32] we have a lot of text here. So now for
[09:11:35] importing this, first of all I have to
[09:11:36] import the genesis corpus from
[09:11:39] nltk.corpus.
[09:11:43] After importing, I import the raw text
[09:11:46] file, that is english-web.txt, from this corpus
[09:11:49] and I'll store it in the sample variable.
[09:11:53] So this will import the raw text file,
[09:11:55] and I'll write english-web.txt.
[09:11:59] So now all the text that is
[09:12:02] present in english-web.txt will be
[09:12:04] stored in the sample. So now the next step
[09:12:06] is to tokenize all the words that are
[09:12:09] present in this sample. For doing that,
[09:12:11] I'll store the words in word tokens list
[09:12:16] and I'll pass my sample to word_tokenize. So now
[09:12:20] all the words that are
[09:12:22] present in english-web.txt will be
[09:12:24] stored in word_tokens. In
[09:12:26] these tokens there will be stop words and punctuation
[09:12:28] marks; everything will be present in
[09:12:30] these words. Now we want to remove the
[09:12:32] stop words and single characters as
[09:12:34] well. So we'll create a new list
[09:12:37] that we'll call filtered_words.
[09:12:40] And then we can define which words we
[09:12:43] want. We can define it using a for
[09:12:45] loop in multiple lines, and we can also
[09:12:47] define it using a single line. So I'll
[09:12:49] write the single line version here. So
[09:12:52] it will take all the words that are
[09:12:54] present in word tokens.
[09:13:02] Now it will take each word w here,
[09:13:06] for w in word_tokens. So it will iterate
[09:13:08] through all the words that are present
[09:13:09] in word tokens and it will check if the
[09:13:12] word is not in stop words. So our stop
[09:13:14] words is this list which contains all
[09:13:16] our stop words. So if the word that is
[09:13:19] present in word tokens is not present in
[09:13:21] stop words then only it will take those
[09:13:23] words and it will also check this
[09:13:25] condition which is it will take only the
[09:13:27] words which have a length of more than
[09:13:29] three and it will store all these words
[09:13:31] in this filtered word list. So once I
[09:13:34] run this, I'll get a list, filtered_words,
[09:13:37] with all the stop words removed and
[09:13:39] single characters and punctuation marks also
[09:13:41] removed. So now if you want to
[09:13:44] print the frequency distribution of all
[09:13:45] the words, I'll create a frequency
[09:13:48] distribution. I'll store it in freq and
[09:13:51] I'll use the frequency distribution
[09:13:52] class, and it takes the words. So our
[09:13:56] words will be filtered_words as an
[09:13:59] argument.
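
The filtering step described above can be sketched without any NLTK data. The `stop_words` set and the `word_tokens` list below are small hand-written assumptions; in the notebook they come from `stopwords.words("english")` and `word_tokenize(sample)`. Both the multi-line loop and the one-line version are shown so they can be compared.

```python
from collections import Counter

# Assumed miniature stop-word set; the video uses
# set(stopwords.words("english")) from nltk.corpus.
stop_words = {"the", "and", "was", "with", "that", "there", "to", "of"}

# Assumed token list, standing in for word_tokenize(sample).
word_tokens = ["Jacob", "went", "with", "Joseph", ",", "and", "there",
               "Jacob", "was", "blessed", "."]

# Loop version: keep tokens that are not stop words and are longer
# than three characters (this also drops punctuation tokens).
filtered_loop = []
for w in word_tokens:
    if w not in stop_words and len(w) > 3:
        filtered_loop.append(w)

# Equivalent one-line list comprehension.
filtered_words = [w for w in word_tokens
                  if w not in stop_words and len(w) > 3]

# Frequency distribution of what is left.
freq = Counter(filtered_words)
print(freq.most_common(2))
```

One thing to keep in mind: NLTK's English stop words are all lowercase, so a capitalized token such as "There" would pass the `w not in stop_words` check unless it is compared as `w.lower()`. The length filter also discards short real words, not only punctuation.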
[09:14:02] So now the
[09:14:05] frequency distribution of all the words
[09:14:06] that are present in filtered_words is
[09:14:08] created. Now if I want to know what
[09:14:11] are the most common words that occur in
[09:14:13] this whole text, I'll use the object
[09:14:17] and I'll use the method called
[09:14:20] most_common,
[09:14:21] and I'll print the 20 most common words.
[09:14:24] So you can see these are the 20 most
[09:14:26] common words. We have Jacob, Yahweh,
[09:14:29] Joseph, Abraham, Pharaoh, Isaac. So
[09:14:32] you can clearly see it is
[09:14:33] something related to the Bible. And if
[09:14:36] you want to plot these words, we'll
[09:14:39] just use freq.plot.
[09:14:42] So we plot the 20 most common words. And
[09:14:45] here also you can see that Jacob is
[09:14:48] occurring the most number of times. Then
[09:14:49] there is Yahweh, then Joseph and
[09:14:51] Abraham, Pharaoh. This is how you remove
[09:14:55] all the stop words. This is a one-liner.
[09:14:57] So you can remove them using multiple lines
[09:14:59] or you can use a simple one-liner, and it
[09:15:02] will remove all the words which are
[09:15:04] called stop words. So after removing the
[09:15:06] stop words now we'll move on and
[09:15:08] understand what bigrams, trigrams
[09:15:10] and n-grams are. Bigrams are two consecutive
[09:15:13] words that occur in a text. We use them when we
[09:15:15] have a text where the words are arranged in
[09:15:18] such a way that two words
[09:15:20] convey more information to us than a
[09:15:22] single word. Suppose we have an
[09:15:24] article about Harry Potter, and we want
[09:15:27] to know, without actually reading the
[09:15:29] article, what it is all about. If we
[09:15:31] just tokenize single individual words,
[09:15:33] it might not convey meaningful
[09:15:36] information to us. Instead, if we
[09:15:38] take two words at once,
[09:15:41] the words Harry and Potter will definitely be
[09:15:43] together. That is why we use
[09:15:45] bigrams, or trigrams, which are three
[09:15:47] words, and we can use longer runs of four
[09:15:50] and five words to check whether we
[09:15:52] get more insight from the text data or
[09:15:54] not. So for implementing bigrams we'll
[09:15:57] start by importing the webtext corpus
[09:16:00] from our NLTK corpora and we'll also
[09:16:02] import the stop words. Then we have to import
[09:16:05] the bigrams function from the NLTK
[09:16:08] package. And then, first of all, when
[09:16:10] we import a text for the first time we
[09:16:13] can also pre-process that text using
[09:16:15] the w.lower() method. So if we have
[09:16:18] Harry Potter occurring at the
[09:16:20] beginning of a sentence with capital letters,
[09:16:23] and elsewhere there are lowercase or uppercase versions
[09:16:25] of the same word, then while creating bigrams
[09:16:28] those versions will be treated
[09:16:29] differently. But if we put every occurrence
[09:16:32] in lowercase, that word will
[09:16:35] be counted as a single word. So if we have
[09:16:37] the same word in two different cases,
[09:16:40] that will be counted as two words. If
[09:16:42] we lowercase the words then they will be
[09:16:44] counted as one word. So that is also a
[09:16:46] pre-processing step for text analysis.
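
A small sketch of why lowercasing matters before counting. The token list is an assumed example, not taken from the video's corpus.

```python
from collections import Counter

# Assumed tokens: the same word appears with different capitalization.
tokens = ["Harry", "harry", "HARRY", "Potter", "potter"]

# Without normalization, each spelling is counted separately.
raw_counts = Counter(tokens)

# With w.lower(), the variants collapse into one entry per word.
lower_counts = Counter(w.lower() for w in tokens)

print(len(raw_counts), len(lower_counts))
```

Here five distinct spellings shrink to two distinct words once case is normalized, which is the effect the speaker relies on before building bigrams.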
[09:16:48] So we'll store the words from the
[09:16:51] webtext corpus. webtext.raw
[09:16:54] will import the raw text file, all the
[09:16:56] text that is present in this file. And
[09:16:59] if you use the words method, it will
[09:17:02] import all the individual words,
[09:17:04] but this won't be possible for every
[09:17:06] text document. It is possible only for an
[09:17:08] NLTK corpus. For other text documents we
[09:17:11] have to use tokenization. So here we have
[09:17:13] imported all the words that are present
[09:17:16] in the pirates.txt file using
[09:17:19] webtext.words, and with lower all the words
[09:17:21] will be in lowercase. And then if you
[09:17:23] want to remove the stop words you can
[09:17:25] use a single line like this or you can
[09:17:27] use the multiple lines. So you will
[09:17:27] create an empty list. First of all we
[09:17:31] will store all the stop words in stop
[09:17:34] words list and then we'll create an
[09:17:36] empty list filtered word and then we'll
[09:17:39] use the for loop which will iterate
[09:17:40] through every single word in our list.
[09:17:43] It will check if the word is present in
[09:17:44] the stop word or not and it will take
[09:17:46] only the words which have length greater
[09:17:49] than three characters and all those
[09:17:51] words will be appended in the filtered
[09:17:54] word list. So
[09:17:57] after creating the filtered word list,
[09:18:00] if you want to create bigrams we'll
[09:18:01] use the bigrams function and we'll pass
[09:18:04] our filtered words list, and
[09:18:06] we'll store the result in a. So here a
[09:18:09] is a sequence of two-word pairs. To
[09:18:12] see the frequency distribution, first of
[09:18:14] all we create a frequency distribution
[09:18:16] using the frequency distribution class and
[09:18:19] we'll store it in fdist, and then we'll
[09:18:21] use the most_common method to print the
[09:18:23] 20 most common bigrams that occur in the
[09:18:25] text. So you can see here we get the 20
[09:18:28] most common bigrams that occur in the
[09:18:30] text. We have Jack Sparrow, Elizabeth,
[09:18:32] Davy Jones, Black Pearl, Flying Dutchman.
[09:18:35] This gives us the most frequent bigrams that are
[09:18:37] present in our text. So you can clearly
[09:18:39] see it is something about Pirates of the
[09:18:41] Caribbean. Before moving to trigrams,
[09:18:44] let's go to Jupyter notebook and
[09:18:46] implement this code. So I'll start by
[09:18:48] importing the bigrams function from the NLTK
[09:18:50] package.
[09:18:52] And after that I'll import a sample text
[09:18:55] from the webtext corpus in the NLTK
[09:18:58] corpora.
[09:19:03] So now I've imported the webtext corpus
[09:19:06] and the bigrams function. So now I'll
[09:19:08] take all the words that are present in
[09:19:10] the pirates document pirates text file
[09:19:13] in our web text. So I'll create a list.
[09:19:15] I'll directly take the words instead of
[09:19:17] taking the whole text. So I'll create an
[09:19:20] empty list. Now I'll store all the words
[09:19:22] in this list. I want to convert
[09:19:24] all the words to lowercase, and it
[09:19:28] will take all the words that are present
[09:19:30] in webtext.words.
[09:19:37] It will take all the words that are
[09:19:39] present in the pirates.txt
[09:19:42] file in the webtext corpus. And if we
[09:19:45] want to remove the stop words we can
[09:19:47] directly write in one line.
[09:19:50] We have already created the stop words
[09:19:53] list above. So we'll use that list. So
[09:19:56] if you're running this for the first
[09:19:57] time, so you have to import the stop
[09:19:59] words and store it in a stop words list.
[09:20:01] And then you can remove these words from
[09:20:04] your text document. And we'll mention
[09:20:06] another condition that take only the
[09:20:08] words which have length greater than
[09:20:10] three characters.
[09:20:12] So now if I run this, all the words
[09:20:15] which are present in the pirates.txt
[09:20:18] text file and which are not stop
[09:20:21] words, that is, after removing the stop words, and
[09:20:23] which have a length of more than
[09:20:25] three characters will be stored in the words
[09:20:27] list. So now if I create a frequency
[09:20:30] distribution of these words.
[09:20:34] So after creating the
[09:20:36] frequency distribution, if I print the
[09:20:38] 20 most common words,
[09:20:42] you can see we have the 20
[09:20:45] most common words, but Jack is
[09:20:47] separate, Will is separate, Sparrow is
[09:20:48] separate, Elizabeth is separate. That is
[09:20:50] why we use bigrams and trigrams to
[09:20:52] actually gain more insights from the
[09:20:54] data. Now I'll create bigrams from these
[09:20:57] words. For doing that I'll create a list
[09:20:59] called bigram, and my function is bigrams.
[09:21:04] And I'll pass all my words into this. So
[09:21:08] now my bigrams will be created. And now
[09:21:10] I'll create the frequency distribution
[09:21:11] for the bigrams.
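
Bigram creation can be sketched as pairing each word with its successor. The helper `make_bigrams` below is a hypothetical standard-library equivalent of what `nltk.bigrams` yields (tuples of consecutive items), and the word list is a made-up stand-in for the filtered, lowercased words from pirates.txt.

```python
from collections import Counter


def make_bigrams(words):
    """Pair every word with the one that follows it."""
    return list(zip(words, words[1:]))


# Assumed, already lowercased and filtered word list.
words = ["jack", "sparrow", "meets", "will", "turner",
         "jack", "sparrow", "sails"]

bigram_list = make_bigrams(words)

# Counting tuples works the same way as counting single words.
bigram_freq = Counter(bigram_list)
print(bigram_freq.most_common(1))
```

Note that `nltk.bigrams` returns a generator, so wrapping it in `list(...)` is useful when the pairs need to be reused after counting.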
[09:21:14] The list name is bigram. So now, after
[09:21:17] creating the frequency distribution, I'll
[09:21:19] print the 20 most common bigrams. And now
[09:21:21] you will see we have Jack Sparrow, Will
[09:21:24] Turner, Elizabeth Swann, Davy Jones, Black
[09:21:26] Pearl. So this completely changes the
[09:21:28] information that we get, just by using
[09:21:30] bigrams. So after understanding what
[09:21:32] bigrams, trigrams and n-grams are, now
[09:21:35] we'll move on to stemming and we'll
[09:21:37] understand what stemming is and why and
[09:21:40] how we use stemming in text
[09:21:42] pre-processing for text normalization.
[09:21:45] Stemming is a process for text
[09:21:46] normalization which involves
[09:21:48] reducing words to their root or base
[09:21:51] forms. Suppose we have two or
[09:21:55] more sentences which contain a
[09:21:57] word with different derivational affixes. For
[09:22:00] example, we take a sentence which has the
[09:22:02] word like and we have another sentence
[09:22:04] which has the word likely. When we
[09:22:07] stem these words, they will be
[09:22:10] trimmed down to their base form, that is
[09:22:12] like. In stemming we don't have any
[09:22:15] morphological analysis. So you will find
[09:22:17] that a lot of these words, after stemming,
[09:22:20] don't actually belong to the dictionary;
[09:22:22] they don't have a meaning. But still the
[09:22:24] words which are stemmed will have a
[09:22:27] base form, and then we can use that base
[09:22:29] form to create words by just adding the
[09:22:32] derivational affixes, that is, -ing or -ed,
[09:22:34] the tense endings. The
[09:22:37] derivational affixes mean tense endings here.
[09:22:40] So to perform stemming we have a class
[09:22:42] named PorterStemmer in NLTK. We'll
[09:22:46] import the PorterStemmer class from nltk.stem,
[09:22:49] and then we'll initialize the class
[09:22:52] and define an object, or an
[09:22:55] instance of that class, that is ps. And
[09:22:57] then I've imported sent_tokenize and
[09:22:59] word_tokenize in case we want to
[09:23:01] tokenize sentences or words from a text
[09:23:04] document. Then I've defined some
[09:23:06] example words. You can see the words are
[09:23:08] like, then likely, liking, liked, then
[09:23:11] unlike. Then I perform stemming
[09:23:14] on these words and I print the stem of each
[09:23:16] of the words that is present in the example
[09:23:18] words list. So for the first four
[09:23:21] words, like, likely, liking and liked,
[09:23:24] you get the same stem, that is like, and
[09:23:26] the word unlike has the stem unlik. This
[09:23:30] last one is not a vocabulary word,
[09:23:32] but in stemming we don't consult a
[09:23:34] vocabulary. If there's a word such as
[09:23:36] unlikingly or unliked, it would have the
[09:23:39] same stem. And we can also perform
[09:23:42] stemming on a complete sentence. So I've
[09:23:44] imported the PorterStemmer class from
[09:23:46] nltk.stem and I have instantiated an
[09:23:49] object, ps, and
[09:23:52] then this is my example text: driving
[09:23:54] a self-driving car might not be fun for
[09:23:57] a professional driver but for a lazy
[09:23:59] driver a self-driving car saves a lot of
[09:24:02] driving. We first tokenize the
[09:24:05] words from this sentence. After
[09:24:07] tokenizing we'll store the words in the
[09:24:09] words list and then we'll iterate
[09:24:12] through the list and after iterating
[09:24:13] through the list we'll print the stemmed
[09:24:16] version of each single word that is
[09:24:18] present in this list. For driving you'll
[09:24:21] see that we have the stemmed version,
[09:24:23] that is drive, and for this driving we
[09:24:26] also have the same drive, and then for
[09:24:28] professional we have profession, for
[09:24:30] driver we have the same driver, and similarly
[09:24:33] you can see the other words:
[09:24:35] the stemmed version of those words
[09:24:37] will be printed. Let's implement this in
[09:24:41] Jupyter notebook so for performing
[09:24:42] stemming first of all we'll import the
[09:24:45] PorterStemmer class from the nltk.stem
[09:24:48] subpackage.
[09:24:52] Once we have imported this, I'll
[09:24:54] import word_tokenize so that I can
[09:24:56] tokenize the words. Then the next
[09:24:59] step will be to find the stemmed version
[09:25:01] of those words. So I'll import it from
[09:25:04] nltk.tokenize.
[09:25:08] After that I'll store all the
[09:25:11] words that are present in this file.
[09:25:13] I'll store them in a words list, and
[09:25:16] I'll use the word_tokenize function.
[09:25:19] But first of all I have to import the
[09:25:23] state_union corpus.
[09:25:27] Now I've imported the state_union corpus.
[09:25:29] From the state_union corpus I will
[09:25:31] import the 1961-Kennedy.txt file. I'll import
[09:25:35] the raw text file. So for that I'll
[09:25:37] write state_union,
[09:25:41] and to import the raw text,
[09:25:43] all the text that I want from this
[09:25:45] file, I'll call raw with the name of the text
[09:25:48] file. So when I run this, all
[09:25:51] the words from this file will be stored
[09:25:53] in words. So now if you want to
[09:25:56] print the stemmed version of all those
[09:25:58] words present in the words list we'll
[09:26:00] first of all instantiate a porter
[09:26:02] stemmer class object which we'll use for
[09:26:05] stemming. And the name of the object is
[09:26:07] PS.
[09:26:09] Now I have instantiated an object of
[09:26:12] this class. Now we'll use this object
[09:26:14] print the stemmed version of the
[09:26:16] different words that are present in this
[09:26:18] words list. So I'll iterate through
[09:26:20] every word in the words list and I'll
[09:26:24] print uh the stemmed version of these
[09:26:26] words.
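The stemming steps described so far can be sketched as follows. This is a minimal sketch, assuming NLTK is installed. It runs on the example sentence from the slide rather than on the corpus file, and it uses `wordpunct_tokenize` (a regex tokenizer that needs no downloaded data) instead of `word_tokenize`, which is a substitution made only to keep the sketch self-contained.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Example sentence from the walkthrough.
text = ("Driving a self-driving car might not be fun for a professional "
        "driver but for a lazy driver a self-driving car saves a lot of driving")

# Instantiate the stemmer once and reuse it for every token.
ps = PorterStemmer()

# Assumption: wordpunct_tokenize stands in for word_tokenize here, because
# word_tokenize needs the "punkt" resource to be downloaded first.
words = wordpunct_tokenize(text)

# Pair every token with its stem and print the pairs.
stems = [(word, ps.stem(word)) for word in words]
for word, stem in stems:
    print(f"{word} -> {stem}")

# For the corpus version shown in the notebook, the text would instead come
# from something like state_union.raw("1961-Kennedy.txt"), after running
# nltk.download("state_union").
```

Because the Porter stemmer only strips suffixes by rule, some stems printed this way are not dictionary words.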
[09:26:28] If I run this now, it will print the
[09:26:31] stemmed version of every single word
[09:26:32] that is present in the words list. So
[09:26:34] you can see the first word was
[09:26:35] president. So it has stemmed it to
[09:26:38] presid. So even if this doesn't make
[09:26:40] sense, stemming does not
[09:26:43] include any vocabulary or morphological
[09:26:45] analysis. So it will still write
[09:26:48] presid, and for Kennedy it has written
[09:26:50] kennedi here. And for message it has
[09:26:53] stemmed the word message to messag, and
[09:26:56] similarly for president it has stemmed the
[09:26:59] word to presid.
[09:27:01] Similarly, you can see there are a lot
[09:27:03] of words here in this file, and
[09:27:05] the same words will have the same
[09:27:07] stemmed version. Lemmatization is a
[09:27:09] process that involves reducing words to
[09:27:12] their base forms. But this time we'll
[09:27:14] use vocabulary and morphological
[09:27:17] analysis so that after reducing words to
[09:27:19] their base forms, they belong to our
[09:27:21] vocabulary and they have a meaning. And
[09:27:24] different morphological analysis
[09:27:26] will also be performed. So tenses
[09:27:28] will be changed and you will get the
[09:27:31] base tense of every single word that we
[09:27:34] lemmatize. So it will bring context to
[09:27:36] the words, which is not done in stemming. So
[09:27:39] to perform lemmatization using NLTK we
[09:27:42] have a class in nltk.stem which is
[09:27:45] WordNetLemmatizer. So we'll import this class
[09:27:48] first from nltk.stem and then we'll
[09:27:51] instantiate this class as an
[09:27:53] object, lemmatizer, and then we'll use this
[09:27:56] object's lemmatize method, and
[09:27:58] we'll pass our text or our strings that
[09:28:01] we have or our words that we have, and
[09:28:03] every single word will be lemmatized to
[09:28:06] its base form. So you can see we have
[09:28:08] corpora, which is the plural of corpus. So
[09:28:12] it has been lemmatized to corpus. And then
[09:28:15] we have cacti, which is the plural of
[09:28:16] cactus. So it has been lemmatized to
[09:28:19] cactus. Lions has been lemmatized to
[09:28:21] lion. And then we have rocks, which is
[09:28:23] lemmatized to the base form, that is rock.
[09:28:26] And then we have geese, which is the plural
[09:28:28] of goose. So every single word has a
[09:28:32] vocabulary associated with it. And we
[09:28:34] can also mention the part of speech of
[09:28:36] that word. So if the word is an
[09:28:38] adjective we can write a. If the word is a
[09:28:40] verb, we can write v. According to
[09:28:43] the part of speech, those words will be
[09:28:45] lemmatized. So if we lemmatize the word
[09:28:47] better, giving the argument that it is an
[09:28:50] adjective, it will be converted to
[09:28:52] good. And if we lemmatize the
[09:28:55] word best as an adjective it
[09:28:57] will still be kept as best, and for runs
[09:29:00] it will keep run. And if we mention
[09:29:04] the part of speech as verb, it will
[09:29:06] also keep the same lemmatized version, that
[09:29:09] is run. So for similar words we have the
[09:29:11] same lemmatized and stemmed versions. So now
[09:29:15] let's go to the Jupyter notebook and
[09:29:17] implement lemmatization using
[09:29:19] WordNetLemmatizer. So we'll start by
[09:29:21] importing the WordNetLemmatizer class
[09:29:25] from nltk.stem.
[09:29:29] Once I import this class, I can use
[09:29:32] it. First of all I'll
[09:29:35] instantiate this class, creating an object
[09:29:38] of this class. So you can use that
[09:29:39] object to perform lemmatization.
[09:29:46] So I've created an object ps. Now I'll
[09:29:49] define a words list. It can be
[09:29:53] any words. So I've copied one list. So
[09:29:56] we have cats, feet, bats and corpora. So
[09:30:00] let's see what are the
[09:30:03] lemmatized versions of these words.
[09:30:08] We have to use the .lemmatize method of
[09:30:11] this object to lemmatize the words.
[09:30:15] So for cats the
[09:30:18] lemmatized version will be cat, and for
[09:30:20] feet it will be foot, and for bats it
[09:30:23] will be bat, and for corpora it will be
[09:30:25] corpus. So later on in this course, when
[09:30:27] we use machine learning algorithms to
[09:30:30] perform text classification, we'll
[09:30:32] use the lemmatization technique to
[09:30:34] reduce words to their base forms. So
[09:30:37] when we prepare data for our machine
[09:30:39] learning algorithms, our data will be in a
[09:30:41] cleaned form. So this is a
[09:30:43] pre-processing step that we will use
[09:30:44] later on. So after understanding
[09:30:46] lemmatization let's move on to part of
[09:30:48] speech tagging, which is used
[09:30:50] for categorizing words into their
[09:30:53] grammatical groups. So in part of speech
[09:30:55] tagging the task is to label each word
[09:30:58] in a text document with
[09:31:00] its grammatical group. So we'll label a
[09:31:04] word as a noun or a
[09:31:06] pronoun or a verb or an adjective or an
[09:31:09] adverb. So we'll use this for
[09:31:11] categorizing and understanding the
[09:31:13] syntax of different words that are
[09:31:15] present in our text document. So these
[09:31:17] are some of the tags that will be
[09:31:18] attached to the words that are present
[09:31:21] in our text. If a word has the tag CC,
[09:31:24] it means it is a coordinating
[09:31:26] conjunction, and if it has a tag of NNP it
[09:31:29] means it is a singular proper noun, and
[09:31:32] then JJ is adjective, and
[09:31:35] there are a lot of other parts of speech.
[09:31:38] So for those there will be a
[09:31:40] different tag attached to every single
[09:31:42] word in the text document. So let's
[09:31:45] see how we'll do part of speech tagging
[09:31:48] using the NLTK package. So here I have
[09:31:51] imported the NLTK package and then I have
[09:31:54] imported word_tokenize to tokenize
[09:31:56] all the words present in the text
[09:31:57] document. So this is our text document.
[09:32:00] So I have stored it in sample_text. So
[09:32:02] Google has actively been purging
[09:32:04] malware apps from Play Store, and this is
[09:32:07] our text document. Now the first step is
[09:32:10] always to tokenize the words. So I'll
[09:32:13] run word_tokenize on this sample
[09:32:16] text and I'll store all the words from
[09:32:20] this text in tokenized_text. So when I
[09:32:22] print this, you'll see that all the
[09:32:25] words that are present in this whole
[09:32:27] text document are displayed, and the next
[09:32:30] step now is the part of speech tagging.
[09:32:33] So we have a function called pos_tag
[09:32:35] in NLTK. So once you import NLTK, then
[09:32:38] we can use this function. So I'll write
[09:32:40] nltk.pos_tag
[09:32:43] and then I'll pass the words that
[09:32:46] are present in the text document. So my
[09:32:48] words are stored in tokenized_text. So
[09:32:51] I'll pass tokenized_text to NLTK's
[09:32:55] part of speech tagging. So every word
[09:32:58] that is present there will have a label
[09:33:01] attached to it. So you can see
[09:33:03] Google has been classified as a proper
[09:33:05] noun, and has is a verb, a present-tense form of the
[09:33:09] verb, and RB is adverb, and VB is also
[09:33:14] a verb, and then malware. So
[09:33:16] this is tagged as an
[09:33:19] adjective, and apps has been tagged as
[09:33:21] noun, and from is tagged as IN, a preposition,
[09:33:25] and the is tagged as DT. So those are
[09:33:27] called determiners. So DT is determiner,
[09:33:30] and similarly for every single
[09:33:32] word that is present in our text a tag has
[09:33:35] been attached, and it is returned as a
[09:33:37] list of tuples. The first element of the
[09:33:39] tuple is our word and the second is the
[09:33:42] part of speech tag. So now let's go to
[09:33:44] the Jupyter notebook and implement
[09:33:46] this part of speech tagging on a corpus.
[09:33:49] I'll start off by importing the pos_tag
[09:33:52] function from the NLTK package.
[09:33:55] So
[09:33:57] once my function is imported, let's
[09:33:59] go to the State of the Union addresses.
[09:34:02] So we'll take the 1996 Clinton address.
[09:34:06] So we'll take this
[09:34:08] document and we'll
[09:34:12] tokenize all the words present in this
[09:34:14] document, and then we'll attach a part of
[09:34:16] speech tag to each word present in this
[09:34:19] document. First let us extract all
[09:34:23] the words using word tokenization.
[09:34:27] So I have already imported the state_union
[09:34:29] corpus. So I'll import the raw
[09:34:31] text file, and we have the 1996 Clinton
[09:34:36] text. So all my words will now be stored
[09:34:38] in words. So now let's print the POS
[09:34:42] tag of each single word that is present
[09:34:45] in words.
[09:34:46] So once I run this, I have all the
[09:34:48] words as a list of tuples,
[09:34:51] and the first element of the tuple is
[09:34:54] the word itself and the next element is
[09:34:57] the POS tag associated with that word.
[09:34:59] So president has been classified as
[09:35:01] a proper noun, and then we
[09:35:04] have IN: before is tagged as a
[09:35:08] preposition, and we have mostly
[09:35:10] proper nouns, and there is a
[09:35:12] preposition too. So for every single word
[09:35:15] there will be a tag attached to it.
[09:35:17] So later on we'll use these tags to
[09:35:19] understand the syntactical structure of
[09:35:22] sentences and how we form sentences and
[09:35:25] how we make computers understand how to
[09:35:28] make sentences. So we have used the part
[09:35:30] of speech tagging to tag different parts
[09:35:32] of speech that are present in our text.
[09:35:34] So after that we'll use named entity
[09:35:37] recognition to identify
[09:35:39] different entities present in our text.
[09:35:42] So if there are any people, places,
[09:35:45] organizations or locations that are
[09:35:47] present in our text, we'll use
[09:35:48] named entity recognition to identify
[09:35:51] those words. So named entity recognition
[09:35:54] is used to identify important named
[09:35:56] entities in text such as people, places,
[09:35:59] organizations, locations, even dates or
[09:36:02] works of art. So these are the different tags
[09:36:05] that named entity recognition will
[09:36:07] assign to all the words that fall in
[09:36:10] these categories. So if there is an
[09:36:12] organization, it will be tagged with
[09:36:14] organization. And then we have location.
[09:36:17] So any particular location will be
[09:36:18] tagged with location. Then we have
[09:36:21] geopolitical entity, which is any country
[09:36:23] or a state or a province. And we have
[09:36:26] date, time, person. So any person will be
[09:36:29] tagged with the named entity
[09:36:32] person. So to perform named entity
[09:36:34] recognition we have to use the ne_chunk
[09:36:38] function. So first of all
[09:36:40] we will import the 2005 George W.
[09:36:43] Bush State of the Union address from
[09:36:47] the state_union corpus. I have imported
[09:36:49] word_tokenize and the state_union
[09:36:51] corpus and I've imported the raw text
[09:36:53] file, that is the 2005 George W. Bush
[09:36:57] address, and then I've tokenized all the
[09:36:59] words from this text file, and these are
[09:37:02] all the words that are present in this
[09:37:04] text file. So the next step in named
[09:37:07] entity recognition after tokenization is
[09:37:10] attaching a part of speech tag. We
[09:37:12] attach a part of speech tag using the
[09:37:14] pos_tag function from NLTK. So
[09:37:17] we'll pass our tokenized words to the pos_tag
[09:37:20] function, and then it will attach a
[09:37:23] part of speech tag to every single word
[09:37:25] that is present in the tokenized text
[09:37:27] list, and we'll store the result in
[09:37:29] pos_words. So after that, to perform
[09:37:33] named entity recognition using NLTK
[09:37:36] we will import the ne_chunk (named entity
[09:37:39] chunk) function from NLTK, which will
[09:37:42] chunk all the named entities present in
[09:37:44] our text. So we'll first of all import it
[09:37:47] and then we'll store the named
[09:37:49] entities in ne_words. So we use the
[09:37:52] function ne_chunk and we'll pass the
[09:37:55] part of speech tags. So these will be
[09:37:57] the words along with their part of
[09:37:59] speech tags, and then all those words
[09:38:01] which belong to any category of named
[09:38:04] entities, all the named entities, will be
[09:38:07] assigned a category according to
[09:38:09] these categories. Once we have
[09:38:12] processed this, we'll store the result
[09:38:14] in ne_words. So when I print
[09:38:16] ne_words, you can see that George has been
[09:38:19] classified as a person and address has
[09:38:21] been classified as an
[09:38:23] organization, and joint is also tagged as
[09:38:26] an organization, and we have Congress as
[09:38:29] an organization. So a lot of times
[09:38:31] you'll find wrong classifications
[09:38:33] also. So it is still improving. For some
[09:38:36] words it will have the correct tags. For
[09:38:38] some words it won't have the correct
[09:38:40] tags. So now let's go to the Jupyter
[09:38:42] notebook and implement named entity
[09:38:44] recognition on a corpus. So after
[09:38:46] importing the Gutenberg corpus, I'll
[09:38:49] extract all the words and then assign a
[09:38:52] part of speech tag. So I'll store it in
[09:38:55] pos_words, and first of
[09:38:57] all I'll tokenize all the words,
[09:39:00] and I want from the Gutenberg corpus
[09:39:03] the raw file,
[09:39:05] which is edgeworth-parents.txt. So all
[09:39:08] the words from this file will be
[09:39:09] tokenized and then I want to add a part
[09:39:12] of speech tag to every single word. So
[09:39:14] I'll wrap this whole thing in the
[09:39:17] pos_tag function.
[09:39:21] This will extract all the words
[09:39:23] that are present in this file and it
[09:39:25] will tokenize them, and then it will attach a
[09:39:27] part of speech tag, and we'll store this
[09:39:29] in the pos_words list. So once I run this
[09:39:32] my part of speech tags will be
[09:39:35] attached. Now I want to find all
[09:39:38] the named entities present in these
[09:39:40] words. So I'll import the ne_chunk
[09:39:42] function from nltk.
[09:39:46] I'll import the ne_chunk function, which
[09:39:48] is named entity chunk. So I'll use this
[09:39:50] function to extract all the named
[09:39:53] entities present in
[09:39:55] this file.
[09:39:57] So you can see if I print the length
[09:40:00] of my pos_words, there are
[09:40:01] around 229,120 words. So it will take a
[09:40:05] lot of time to process all these words
[09:40:07] and find named entities. So I'll take
[09:40:10] the first thousand words and find
[09:40:12] all the named entities present in that.
[09:40:14] So for that I'll store my named
[09:40:16] entity words in ne_words, and then I
[09:40:20] will use the ne_chunk function to chunk,
[09:40:23] or to find, all the named entities.
[09:40:26] So I'll pass here my pos_words, after
[09:40:30] attaching part of speech, and
[09:40:32] I want only the first thousand words.
[09:40:35] So it will chunk the first thousand
[09:40:37] words and then we'll print all the
[09:40:40] words along with their named entity
[09:40:43] tags.
[09:40:45] So you can see that for the first thousand words
[09:40:47] it has attached a tag of
[09:40:50] organization to parent and organization
[09:40:52] to assistant, and then person to Maria,
[09:40:55] organization to the, and then
[09:40:58] Rosemore is also
[09:41:00] classified as a geopolitical entity, any
[09:41:03] place, and then we have Ireland, which is
[09:41:06] also tagged as a
[09:41:08] geopolitical entity. So these are our
[09:41:10] part of speech tags, which is a proper
[09:41:12] noun, and this is our named entity
[09:41:14] recognition tag, which is
[09:41:16] geopolitical entity. So after learning
[09:41:18] about different pre-processing
[09:41:19] techniques such as tokenization,
[09:41:21] lemmatization, stop word removal, part of
[09:41:24] speech tagging and named entity
[09:41:26] recognition, now we'll move on and we'll
[09:41:28] implement all of those concepts using a
[09:41:31] more advanced package called spaCy.
[09:41:33] spaCy is a free and open-source library and it
[09:41:36] is extensively used for natural language
[09:41:39] processing. It is written in Cython,
[09:41:41] which is a C extension of Python,
[09:41:43] and that is why it gives us performance
[09:41:46] like C in Python. It is fast and it
[09:41:49] provides us concise application
[09:41:52] programming interfaces to access its
[09:41:54] methods and properties. So we can
[09:41:56] perform all of the operations that we
[09:41:58] have performed with NLTK using spaCy, and
[09:42:01] you will see how easy it is to
[09:42:03] perform such operations using spaCy.
[09:42:06] So we can perform tokenization,
[09:42:08] lemmatization, part of speech tagging,
[09:42:10] dependency parsing, named entity
[09:42:12] recognition and some other concepts such
[09:42:14] as sentence boundary detection,
[09:42:16] similarity and text classification.
[09:42:19] Before using spaCy, we have to install
[09:42:22] it. For installing spaCy in Anaconda, you
[09:42:25] have to go to your Anaconda folder in
[09:42:27] the start menu and you have to open the
[09:42:29] Anaconda prompt. After opening the
[09:42:32] Anaconda prompt, you have to run the
[09:42:34] following code, and once you press enter,
[09:42:37] you'll be asked to proceed with Y or
[09:42:40] N. So just type Y and press enter. So
[09:42:44] once spaCy is installed, you
[09:42:47] have to install a spaCy model. spaCy has
[09:42:50] different models in different languages.
[09:42:52] So these models contain
[09:42:54] vocabularies, pre-trained vectors,
[09:42:56] syntaxes. We can train our own models.
[09:42:59] But for these tutorials, we'll use a
[09:43:02] pre-trained model, which is the English
[09:43:05] model. So for downloading the English
[09:43:07] model, you have to run the following
[09:43:08] code. You have to run the
[09:43:11] following code for downloading the
[09:43:13] model in the
[09:43:15] Anaconda prompt. So your
[09:43:17] model will be downloaded. So after these
[09:43:20] two steps we are ready to work with the
[09:43:22] spaCy module. So our first step is to
[09:43:25] perform tokenization with spaCy. So
[09:43:27] we'll see how to perform tokenization.
[09:43:29] So the first step would be to import the
[09:43:31] spaCy package, and once you import the
[09:43:34] spaCy package, the next step is to
[09:43:36] load the model that you want to use. So
[09:43:39] we have downloaded the model, that is the
[09:43:41] en_core_web
[09:43:43] English core model. So we'll load it
[09:43:45] using spacy.load. After loading the
[09:43:48] package, we'll store it in the nlp
[09:43:50] object. So now we'll use this object to
[09:43:52] perform different functions like
[09:43:54] tokenization, lemmatization. So firstly
[09:43:57] we'll create, using this object, a
[09:44:00] pre-processed variable or another object.
[09:44:04] Once we create a spaCy object, the
[09:44:06] next step is to pass the string or the
[09:44:08] text that we want to pre-process. So
[09:44:11] you write nlp, which is this
[09:44:13] object, and then you pass your text
[09:44:16] document inside of it. Once you pass it,
[09:44:18] the spaCy model will automatically
[09:44:20] pre-process this text, and it will
[09:44:23] extract all the entities and it will
[09:44:26] perform lemmatization and named entity
[09:44:29] recognition. Once we create a processed
[09:44:31] object we'll store it in doc. Once our
[09:44:34] processed object is created, the next step
[09:44:36] is we just have to see how to print
[09:44:38] different tokens, lemmas or named
[09:44:40] entities.
[09:44:42] To perform tokenization, we just have to
[09:44:45] write token.text. So the tokens are
[09:44:47] stored in the text attribute for this
[09:44:50] document. To print all the tokens that
[09:44:52] are present in our text document, we'll
[09:44:54] just use a for loop to iterate through
[09:44:56] all the tokens. So we have written for
[09:44:58] token in doc. doc is the
[09:45:01] processed object of the spaCy model, and
[09:45:04] then we'll iterate through every single
[09:45:06] token, every single
[09:45:08] token that is present in our doc. For
[09:45:10] that we have written for token in doc,
[09:45:12] print token.text. So this text is the
[09:45:15] token that we want to print. We'll see
[09:45:17] that all the word tokens that are
[09:45:19] present in our text document are
[09:45:21] printed. So the next step that we'll
[09:45:23] perform is lemmatization with spaCy.
[09:45:26] We'll create another processed object.
[09:45:28] So we'll use the nlp object that we have
[09:45:30] created using spacy.load and we
[09:45:33] pass our text document into it, and then
[09:45:35] we'll store it in doc, and after
[09:45:38] creating this processed
[09:45:40] object, to print lemmas we just have to
[09:45:43] iterate and use .text and
[09:45:46] .lemma_. So the lemmas are stored in the
[09:45:48] .lemma_ attribute of the tokens in our document. So we'll
[09:45:52] iterate through every token. So we'll
[09:45:53] print the text, which is the actual word
[09:45:55] token, and we'll print along with it the
[09:45:58] lemmatized version of those words. We'll
[09:46:00] iterate using a for loop: for lemma
[09:46:03] in doc, we'll print lemma.text. So this
[09:46:05] will print the token, which is Google, and
[09:46:08] next it will print the lemma. So it will
[09:46:10] print the lemmatized version of those
[09:46:12] tokens. So you can see
[09:46:15] the second words are the lemmatized
[09:46:17] versions. So locations is changed to
[09:46:19] location and uploading is changed to
[09:46:21] upload. Images is changed to image. And
[09:46:24] here we have some more: deemed is
[09:46:26] changed to deem. Scariest is changed to
[09:46:29] scary. Scary is the lemmatized version,
[09:46:31] and creepy is also the lemmatized version.
[09:46:33] And include is also the lemmatized
[09:46:35] version. See how easy it is to perform
[09:46:37] lemmatization with spaCy. You just have
[09:46:39] to load a document and use the lemma_
[09:46:43] attribute to print the lemmas that are
[09:46:45] present in our document. So the next
[09:46:47] step is to perform part of speech
[09:46:50] tagging with spaCy. We have already
[09:46:52] loaded the same document, which is this
[09:46:54] one. So it is stored in doc. So to print
[09:46:57] all the part of speech tags, we'll just
[09:47:00] iterate through all the tokens present
[09:47:02] in the doc. For printing the token, we
[09:47:05] use token.text.
[09:47:07] And then I have added these three lines
[09:47:10] here. And then for the part of speech tag we
[09:47:12] have to use the pos_ attribute. So
[09:47:17] it will print the part of speech
[09:47:19] associated with that token. And if you
[09:47:21] want to know the part of speech
[09:47:24] as a tag, we'll write .tag_,
[09:47:26] with an underscore. So this will print the spaCy
[09:47:29] version, and this is the NLTK version. So
[09:47:31] Google is tagged as a proper noun,
[09:47:35] NNP, and maps is also a proper noun. And
[09:47:38] then we have similarly verbs, nouns,
[09:47:40] prepositions, determiners, verbs,
[09:47:43] determiners. This is a preposition.
[09:47:45] This is how we perform part of
[09:47:47] speech tagging using spaCy. So we have
[09:47:50] to use the pos_ attribute of each token. And to
[09:47:54] print the tags, we use the .tag_
[09:47:57] attribute. So spaCy also provides us
[09:48:00] different visualizers to visualize these
[09:48:03] part of speech tags. The package that we
[09:48:06] need to visualize is called displacy. So
[09:48:09] we'll import displacy from spacy. And
[09:48:12] this is our model that we are using, that
[09:48:13] is the English model. And then we load an
[09:48:16] object of this model and we pass a
[09:48:19] string. When we pass a string, we'll get
[09:48:21] a processed object, which we have stored
[09:48:23] in doc. So to visualize this, we have to
[09:48:26] use displacy.render. So this is the
[09:48:29] function that we'll use in the Jupyter
[09:48:31] notebook. Once you write displacy.render,
[09:48:34] we have to pass the document
[09:48:36] that we have processed, and then the
[09:48:38] style argument will tell it which style
[09:48:40] we want. Here it is dep, which is
[09:48:43] called dependency parsing. So it will
[09:48:45] print this image,
[09:48:49] where it will show which words depend
[09:48:51] on each other. Is is our root here,
[09:48:54] which is a verb. And once you run this
[09:48:56] line you'll get this graph here. So is
[09:48:59] is the root word, and is depends on
[09:49:02] processing, and processing depends on
[09:49:04] language, and then language depends on
[09:49:06] natural. This is tagged as a
[09:49:08] proper noun and this is also a proper
[09:49:10] noun, and is is a verb. So our root
[09:49:13] word is is, and then we have an attribute,
[09:49:15] fun, and our subject is
[09:49:17] processing. So these are two
[09:49:19] dependent tags that depend on processing,
[09:49:22] and all of these depend on is.
[09:49:25] Next we'll see how we perform named
[09:49:28] entity recognition with spaCy. We'll
[09:49:30] load the spaCy model and then we create
[09:49:33] a processed object using the model's
[09:49:37] object. So once we pass our text document
[09:49:40] in this, we'll store the processed object
[09:49:42] in doc, and then to print named
[09:49:45] entities we have to use the .ents attribute
[09:49:49] of our document. So for printing all the
[09:49:52] named entities we have used a for loop
[09:49:54] to iterate through every single entity.
[09:49:56] We'll use for ent in doc.ents, and we'll
[09:50:00] print ent.text. So it will print
[09:50:02] the text of that entity, and then to
[09:50:04] print the label, which tag is associated,
[09:50:07] which entity is associated with which
[09:50:09] tag or which category, for that we'll use
[09:50:12] ent.label_. So this text
[09:50:16] is our entity, and this is the label that is
[09:50:19] associated with it, the
[09:50:21] category into which it has been classified.
[09:50:23] So Tokyo in this text has been
[09:50:26] classified as a geopolitical entity,
[09:50:28] which is a country or a city. And then we
[09:50:31] have more than 38 million. So it is
[09:50:33] tagged as CARDINAL. So
[09:50:36] cardinal means a number. And we have
[09:50:38] Japanese. So it is classified as NORP. So
[09:50:40] that is nationalities or religious or
[09:50:43] political groups. And then we have a
[09:50:45] geopolitical entity, that is Osaka, which
[09:50:48] is also a city. And then we have 20.5
[09:50:51] million, which is classified as a number,
[09:50:53] so a numerical entity. We can also use
[09:50:57] visualizers, using displacy,
[09:51:00] for named entity recognition. So this is
[09:51:02] the same text that we used before. So
[09:51:04] this is our text document, which we have
[09:51:06] passed to the nlp object, and then it
[09:51:09] will be processed and will be stored in
[09:51:10] doc. So we'll use displacy.render. So
[09:51:13] we'll pass our document here. And
[09:51:15] now we want style to be ent, which means it
[09:51:19] will show all the named entities that
[09:51:20] are present in our text. Once you run
[09:51:23] this code you will get this graph here.
[09:51:25] So every named entity will be
[09:51:28] marked with a color, along with
[09:51:30] its label. So now let's go to the
[09:51:33] Jupyter notebook and implement all of
[09:51:36] these concepts using spaCy. So I'll
[09:51:37] start by installing spaCy first of all.
[09:51:39] So I have pasted this code
[09:51:42] here, the conda install for spaCy. So once I
[09:51:45] run this code, spaCy will start
[09:51:47] downloading.
[09:51:53] Now you have to press Y to proceed. So
[09:51:56] you enter Y and press enter. So now the
[09:51:59] spaCy libraries and other packages
[09:52:02] will be installed.
[09:52:04] It is 91.9 MB. So it will take some time
[09:52:08] to install. So once it is installed,
[09:52:10] we'll move on to install our English
[09:52:13] model, the English core model.
[09:52:17] So once spaCy is downloaded, you
[09:52:20] can move on and download the English
[09:52:23] model just using
[09:52:26] this code here.
[09:52:28] Now it will download our English model.
[09:52:33] Now spaCy and our English
[09:52:36] model are both downloaded. Now let's go
[09:52:39] to the Jupyter notebook and
[09:52:42] start working with spaCy.
[09:52:44] So we'll begin by importing first of all
[09:52:46] the spaCy package.
[09:52:49] So after importing the spaCy package, now
[09:52:51] we'll load our English model. So I'll
[09:52:54] create an object called sp and then load
[09:52:57] my English model using spacy.load.
[09:53:01] And the name of our English model is en.
[09:53:05] Now our English model is loaded. Now
[09:53:09] we'll create a processed object in
[09:53:12] which we'll pass our text document. I'll
[09:53:15] store it in doc, and we'll use our object
[09:53:18] sp, and inside of this we have to pass
[09:53:21] our text.
[09:53:24] We have copied this text, where Google on
[09:53:27] Friday celebrated its 21st birthday.
[09:53:30] So this is our text. So we'll start with
[09:53:33] tokenization. So we'll first of all find
[09:53:35] all the word tokens present in this
[09:53:37] text. For that, we iterate through all the
[09:53:40] word tokens.
[09:53:44] Word tokens are stored in the text attribute.
[09:53:47] So when I press enter you'll see we'll
[09:53:50] get all the word tokens present in our
[09:53:52] text document.
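A minimal sketch of the tokenization loop, assuming spaCy is installed. The notebook loads the trained English model into sp; this sketch instead uses `spacy.blank("en")`, a pipeline that contains only the English tokenizer, which is a substitution made so that no model download is needed for this step.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to show
# tokenization; the walkthrough uses a loaded model for the same loop.
nlp = spacy.blank("en")

# First sentence of the text used in the notebook.
doc = nlp("Google on Friday celebrated its 21st birthday.")

# Each element of doc is a Token; its text attribute holds the token string.
tokens = [token.text for token in doc]
for token in doc:
    print(token.text)
```

Punctuation is split off as its own token, which is why the final period appears separately in the output.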
[09:53:55] Now after this we'll print all the
[09:53:57] lemmatized versions of the words present
[09:54:00] in the text document. So for that we
[09:54:02] just have to use another for loop.
[09:54:09] So it will print the original word, and
[09:54:12] then if you want to print the lemmatized
[09:54:14] version of those words
[09:54:16] we have to use the attribute lemma_.
[09:54:20] Once I run this, you
[09:54:22] will see we have the original words and
[09:54:26] the lemmatized versions. So for
[09:54:28] celebrated it is celebrate,
[09:54:31] and for showcases it has showcase,
[09:54:34] and for biggest it is big, for companies it
[09:54:38] is company. So these are all the
[09:54:40] lemmatized versions of the words present
[09:54:43] in our document.
[09:54:45] After lemmatization, let's implement
[09:54:48] part of speech tagging, and we'll print
[09:54:50] the words along with
[09:54:53] their part of speech tags. For that
[09:54:55] we'll just use another for loop.
[09:55:04] If I run this, we get every
[09:55:06] single word along with its part of
[09:55:07] speech and the tag associated with
[09:55:10] that part of speech. So Google is a
[09:55:12] proper noun here, NNP, and similarly for
[09:55:14] every single word present in our text
[09:55:16] document we have a part of speech label
[09:55:18] and a part of speech tag. Now,
[09:55:20] using displaCy, we'll make a
[09:55:23] dependency tree. First of all, we will
[09:55:26] import displacy from
[09:55:29] spaCy.
[09:55:32] After importing displacy, we'll use
[09:55:34] the displacy.render function to show
[09:55:37] our part of speech tags as a
[09:55:39] dependency tree.
[09:55:42] Inside this function we have to pass
[09:55:44] our processed spaCy
[09:55:47] document, that is doc, and then we have to
[09:55:49] mention what type of plot or graph we
[09:55:52] want. We want dependency parsing using
[09:55:54] the part of speech tags, so we'll write
[09:55:56] style.
[09:55:58] For dependency parsing it is dep. Once
[09:56:02] you run this, you'll get all the words
[09:56:05] along with their part of speech category
[09:56:08] and the dependency parse tree, or the
[09:56:10] parse object.
[09:56:13] It will show us which words depend on
[09:56:15] each other, which is the determiner,
[09:56:17] which is a compound noun and which is the subject. So
[09:56:19] we can see it all through this graph.
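The token attributes used in this part (`lemma_`, `pos_`, `tag_`, `dep_`) and the `displacy.render` call can be seen side by side in a small sketch. To keep it runnable without the English model, the `Doc` below is annotated by hand; those annotations are assumptions for illustration, not model output. With a loaded model, calling `sp(text)` fills in the same attributes automatically.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Blank pipeline: supplies the vocabulary only, no trained components.
nlp = spacy.blank("en")

# Hand-written annotations (illustrative); a trained model would predict these.
doc = Doc(
    nlp.vocab,
    words=["Google", "celebrated", "its", "21st", "birthday"],
    lemmas=["Google", "celebrate", "its", "21st", "birthday"],
    pos=["PROPN", "VERB", "PRON", "ADJ", "NOUN"],
    tags=["NNP", "VBD", "PRP$", "JJ", "NN"],
    heads=[1, 1, 4, 4, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "poss", "amod", "dobj"],
)

# Word, lemma, coarse part of speech, fine-grained tag, dependency label, head word.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text)

# style="dep" draws the dependency tree; outside Jupyter the markup is returned as a string.
svg = displacy.render(doc, style="dep", jupyter=False)
```

In a notebook, leaving out `jupyter=False` lets displaCy display the tree inline instead of returning the SVG markup.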
[09:56:23] That was part of speech
[09:56:25] tagging. Now let's see how we can
[09:56:28] perform named entity recognition on the
[09:56:31] same text that we mentioned above
[09:56:33] using spaCy. You just have to use
[09:56:36] another for loop over doc.ents,
[09:56:39] since named entities are stored in ents.
[09:56:46] This will iterate through all
[09:56:49] the entities present in our document
[09:56:51] and print them.
[09:56:57] Here you can see that, among all the
[09:56:59] words, spaCy has recognized
[09:57:02] the following
[09:57:04] entities. Google has been
[09:57:06] identified as an organization. Friday is
[09:57:08] a date. 21st is an ordinal, so it is a
[09:57:11] number, and Google is again an organization.
[09:57:13] September 27, 1998 is a
[09:57:16] date. And Larry Page and Sergey Brin are
[09:57:19] identified as persons.
[09:57:21] Now let's use the displaCy
[09:57:23] visualizer to make a graph
[09:57:26] that will represent all the named
[09:57:28] entities present in the document.
[09:57:30] I've already imported
[09:57:33] displacy, so now we'll just use
[09:57:35] displacy.render.
[09:57:38] To this we'll pass our processed
[09:57:41] document, that is doc, and
[09:57:44] the style will now be equal to ent, which
[09:57:46] will show all the named entities
[09:57:48] present in the document.
[09:57:51] Once I run this, you'll see we get
[09:57:54] all the named entities along with the
[09:57:56] labels that are present in our text.
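The entity loop and the `style="ent"` rendering can be sketched as follows. As an assumption for illustration, the entity labels are written by hand in IOB form so that the example runs without the trained model; they mirror the labels read out above. With the model loaded, `doc.ents` is filled in by the pipeline.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["Google", "on", "Friday", "celebrated", "its", "21st", "birthday", "."]
# Hand-written IOB entity tags (illustrative): B- starts an entity, O is outside.
ents = ["B-ORG", "O", "B-DATE", "O", "O", "B-ORDINAL", "O", "O"]
doc = Doc(nlp.vocab, words=words, ents=ents)

# Named entities live in doc.ents; each has a text span and a label.
for ent in doc.ents:
    print(ent.text, ent.label_)

# style="ent" highlights the entities; outside Jupyter the HTML is returned as a string.
html = displacy.render(doc, style="ent", jupyter=False)
```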
[09:57:59] Here's a quiz question for you guys.
[09:58:01] What is word embedding? Your options are
[09:58:04] a method of printing words on paper, a
[09:58:06] technique for converting words into
[09:58:08] numerical vectors, a way to encrypt text
[09:58:11] for security purposes, or a style of
[09:58:13] formatting text in a document. Please
[09:58:15] mention your answers in the comment
[09:58:17] section. Syntax can be defined as a
[09:58:19] set of rules that define
[09:58:22] the grammatical structure of words and
[09:58:24] phrases used to create
[09:58:26] coherent sentences. In simple terms,
[09:58:29] you can say that the format in which
[09:58:31] words and phrases are arranged in a
[09:58:32] sentence is called syntax. We have
[09:58:35] different words such as nouns, pronouns,
[09:58:38] verbs, adverbs. How these words and
[09:58:40] phrases are arranged, those rules are
[09:58:43] defined by the syntax. A sentence
[09:58:46] typically follows a hierarchical structure
[09:58:48] and contains the following components. A
[09:58:50] sentence is formed firstly from
[09:58:52] words, and then the individual words are
[09:58:55] combined to form phrases. In a phrase
[09:58:58] we cannot have both a subject and a
[09:59:01] verb, so we'll have either a verb or
[09:59:02] a subject. And then we
[09:59:04] have clauses. A sentence can have
[09:59:06] both clauses and phrases. In
[09:59:09] clauses we have both a subject and a
[09:59:11] verb, and all of these components are
[09:59:14] combined to form a sentence. We have
[09:59:16] different components, or syntactic
[09:59:19] categories, that we'll use to define the
[09:59:21] syntax of a sentence. We'll represent
[09:59:24] our sentence as S, and
[09:59:27] the first thing in our sentence is a noun
[09:59:29] phrase. A noun phrase is represented by
[09:59:31] NP. These are some of the syntactic
[09:59:34] categories that we'll use to represent
[09:59:36] different words and phrases in our
[09:59:38] sentences. Our sentence we'll denote
[09:59:41] by S. Then we have determiners,
[09:59:44] which are words that determine
[09:59:46] something. Words like a, an, the, every:
[09:59:49] these are called
[09:59:50] determiners. Then we have nouns, that is
[09:59:53] the name of a place, person or thing,
[09:59:55] which are represented
[09:59:57] by N. Then we have verbs. Any words
[10:00:00] that show an action will be
[10:00:02] represented by V. Then we have
[10:00:05] prepositions. These are the words that
[10:00:07] are used to link different words and
[10:00:09] phrases in a sentence, such as on,
[10:00:11] for, with. These words will be
[10:00:14] represented by P. Then we have our
[10:00:16] noun phrases. A noun phrase is a
[10:00:19] group of words or a single noun, where a
[10:00:22] noun is mandatory. There should be a
[10:00:23] noun. Then we have an optional modifier. The
[10:00:25] modifier can be an adjective or an
[10:00:28] adverb. And then we have optional
[10:00:30] objects and also optional determiners.
[10:00:32] So a noun phrase can contain a determiner,
[10:00:35] which is optional, and it can also
[10:00:37] contain an optional modifier, which is an
[10:00:40] adjective or an adverb, but a noun phrase
[10:00:42] must contain a noun. Then
[10:00:44] we have verb phrases. In a verb phrase
[10:00:47] there is a main verb, which is
[10:00:48] preceded by a helping verb. Along
[10:00:51] with these two we have an optional
[10:00:53] modifier, which can be an adjective or an
[10:00:55] adverb. And then we have our
[10:00:57] prepositional phrases. In a
[10:00:59] prepositional phrase, we have a
[10:01:01] preposition along with its object and an
[10:01:04] optional modifier. Now, after
[10:01:06] understanding what syntax is and what
[10:01:08] the different syntactic categories
[10:01:10] of a sentence are, we'll move on and
[10:01:12] understand what syntax trees are. A
[10:01:14] syntax tree, which is also called a parse
[10:01:17] tree, is a tree that represents the different
[10:01:19] syntactic categories of a sentence. This
[10:01:21] is a simple definition. Consider
[10:01:23] the following sentence: I drove a
[10:01:26] car to my college. You can see
[10:01:28] we have tokenized it. Every word is a
[10:01:31] token. Now we want to draw a syntax
[10:01:34] tree for this sentence, which will
[10:01:36] represent
[10:01:38] which category every word in
[10:01:40] this sentence belongs to. This will be
[10:01:42] our syntax tree for the sentence, where
[10:01:45] you can see that S is our sentence, which
[10:01:47] is a complete sentence. Our sentence is
[10:01:50] composed of a noun phrase and
[10:01:52] a verb phrase. The noun phrase here
[10:01:55] consists only of a single noun,
[10:01:58] or rather a pronoun, that is I. Next we have the
[10:02:00] verb phrase, which consists of a verb and
[10:02:02] a noun phrase. Here our verb is drove,
[10:02:05] and our noun phrase consists of a
[10:02:08] determiner, a noun and a prepositional
[10:02:10] phrase. Our determiner here is a, and
[10:02:13] our noun is car. Then we
[10:02:16] have a prepositional phrase, which will
[10:02:18] contain a preposition and a noun phrase.
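The tree being described can also be written down in bracket notation. This is a small sketch using NLTK's `Tree` class (not shown at this point in the video); the labels S, NP, VP, PP, Det, N, V and P follow the categories introduced above, and the bracketing is transcribed by hand from the narration.

```python
from nltk import Tree

# Bracketed form of the syntax tree for "I drove a car to my college".
tree = Tree.fromstring("""
(S
  (NP (N I))
  (VP (V drove)
      (NP (Det a) (N car)
          (PP (P to)
              (NP (Det my) (N college))))))
""")

# Print an ASCII drawing of the tree.
tree.pretty_print()

# The leaves are the tokens; each subtree is one syntactic category.
words = tree.leaves()
noun_phrases = [" ".join(st.leaves()) for st in tree.subtrees(lambda t: t.label() == "NP")]
print(words)
print(noun_phrases)
```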
[10:02:21] Our preposition is to, and our noun
[10:02:23] phrase now consists of two elements, that
[10:02:25] is a determiner and a noun. Our
[10:02:27] determiner is my, and the noun is
[10:02:31] college. So this is how we have
[10:02:33] represented different parts of this
[10:02:35] sentence according to their syntactic
[10:02:37] categories using a parse tree, or
[10:02:39] syntax tree. We use a concept called chunking,
[10:02:42] where we'll chunk different words and
[10:02:45] understand what the sentence is
[10:02:47] talking about. It is an NLP technique
[10:02:50] which is used to group words or tokens
[10:02:52] into phrases in order to analyze the
[10:02:55] structure and meaning of a sentence.
[10:02:57] The grouping of the words is based on
[10:02:59] the POS tags, and we can also group
[10:03:02] different phrases from a sentence. Say
[10:03:05] you have a sentence that has different
[10:03:06] words, and these boxes represent the
[10:03:08] different words. You now know
[10:03:10] how to tokenize, how to extract these
[10:03:12] individual words. After
[10:03:15] that we'll understand how to extract
[10:03:18] groups of words that convey
[10:03:20] meaningful information to us. So we have
[10:03:22] a noun phrase, which can contain a noun,
[10:03:25] an adjective and a determiner. The
[10:03:27] adjective and the determiner are optional.
[10:03:29] It can contain a noun or a pronoun.
[10:03:31] Then we can have verb phrases in our
[10:03:33] sentences, which contain a main verb, a
[10:03:36] helping verb or any modifier, which can
[10:03:38] be an adverb. We can have
[10:03:40] multiple noun
[10:03:42] and verb phrases. In a noun phrase we
[10:03:44] actually have the noun, so we can know
[10:03:46] what our sentence is talking about. We
[10:03:49] can chunk different phrases from a
[10:03:51] sentence. First we'll see
[10:03:53] how to chunk noun phrases, and then
[10:03:56] we'll see how to chunk verb phrases,
[10:03:58] adjective phrases that contain an
[10:04:00] adjective, and prepositional phrases
[10:04:02] that contain a preposition and an
[10:04:04] optional object and its modifier. We
[10:04:06] can chunk all of these phrases from a
[10:04:09] long sentence to understand the meaning
[10:04:11] and the syntactic structure
[10:04:13] of the sentence. To start with, we
[10:04:15] will first understand how to chunk
[10:04:17] noun phrases from a text. The first step
[10:04:20] is to import word_tokenize
[10:04:23] so that we can tokenize every single word.
[10:04:25] After tokenizing, we'll attach
[10:04:27] a part of speech tag to every word,
[10:04:29] and using those part of speech tags
[10:04:31] we'll chunk a group of words as a noun
[10:04:34] phrase from our sentence. Here
[10:04:36] I've imported the word_tokenize function
[10:04:38] from nltk.tokenize, and this is the
[10:04:41] sample text that we'll chunk our words
[10:04:43] from. We'll tokenize the words and then
[10:04:45] chunk different phrases from this
[10:04:47] sentence. After calling
[10:04:49] word_tokenize for tokenization, we have
[10:04:52] the different words that are present in our
[10:04:54] text. The next step is to attach a part
[10:04:57] of speech tag to each word present in
[10:05:00] the text so that we can chunk on the tags.
[10:05:02] After using the pos_tag function, we'll
[10:05:05] have tagged each and every word with its
[10:05:08] respective part of speech, and when we
[10:05:10] print, every word has a part
[10:05:12] of speech. The here is a determiner, and
[10:05:15] JJ means an adjective, so crazy
[10:05:18] is an adjective. Brown is
[10:05:21] classified as a noun, and dog is also a
[10:05:24] noun. Then we have the VB tags, which mean
[10:05:26] verbs in different tenses. And we
[10:05:29] have a preposition (IN), a determiner and a noun.
[10:05:31] Based on these part of speech tags,
[10:05:34] we'll now see how we chunk
[10:05:36] noun phrases from our sentences.
[10:05:38] To perform chunking, we'll use both part
[10:05:41] of speech tags and regular expressions.
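One way to see how tags and regular expressions combine is to do it by hand with the standard library. The sketch below is not NLTK's implementation; it only illustrates the idea of matching a pattern against the sequence of tags rather than against the words. The tags are written by hand to match the ones read out in the video, and `chunk_by_tag_pattern` is a hypothetical helper name.

```python
import re

# Hand-tagged tokens (assumed tags, following the narration).
tagged = [("The", "DT"), ("crazy", "JJ"), ("brown", "NN"), ("dog", "NN"),
          ("went", "VBD"), ("running", "VBG"), ("through", "IN"),
          ("the", "DT"), ("mud", "NN")]

def chunk_by_tag_pattern(tagged_words, pattern):
    """Return the word groups whose tag sequence matches `pattern`."""
    # Encode the tags as one string, e.g. "<DT><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_words)
    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Convert character offsets back to token positions by counting "<".
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged_words[start:end]))
    return chunks

# Optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+"
noun_phrases = chunk_by_tag_pattern(tagged, NP_PATTERN)
print(noun_phrases)
```

NLTK's `RegexpParser`, used in the rest of this part, wraps the same idea in a grammar syntax and returns a tree instead of a list.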
[10:05:43] Using regular expressions, we will
[10:05:45] extract different chunks, that is, we'll define
[10:05:48] different patterns that suit a
[10:05:50] particular chunk. We have defined a
[10:05:52] regular expression that will be used to
[10:05:54] chunk all the noun phrases from our
[10:05:56] sentence. This regular expression will chunk
[10:05:58] the NP. We can write any name here;
[10:06:01] I've written NP, which is a noun
[10:06:03] phrase. Then there's a colon. After the
[10:06:05] colon we have to write our regular
[10:06:06] expression inside these braces. The
[10:06:09] first word that it will chunk should be
[10:06:11] a determiner. We have to put it inside
[10:06:13] these angle brackets, and DT is for determiner.
[10:06:16] So the is a determiner, and then we have
[10:06:19] a question mark, which means
[10:06:21] either zero or one. In a noun phrase
[10:06:23] there can be either
[10:06:25] no determiner or only one determiner in
[10:06:28] the beginning. The next word
[10:06:30] can be an adjective. This asterisk means zero or
[10:06:33] more: in a noun phrase
[10:06:36] there can be zero or more
[10:06:39] adjectives. And the third word is a
[10:06:41] noun. There can be zero or more
[10:06:43] nouns. So if there's any noun, it will
[10:06:46] be extracted. If there is a noun
[10:06:48] and in front of it there is an
[10:06:50] adjective, both words will be
[10:06:52] extracted. And if there is a noun and in
[10:06:54] front of it there's an adjective and
[10:06:55] a determiner, all three words
[10:06:58] will be extracted based on this regular
[10:07:00] expression. When we use a regular
[10:07:02] expression, we have to import the RegexpParser
[10:07:05] class from NLTK. So I've already written
[10:07:08] from nltk import RegexpParser, or if you
[10:07:11] have imported nltk already, you can
[10:07:13] use it as nltk.RegexpParser. You have to
[10:07:16] pass your regular expression, which I
[10:07:17] have stored in grammar. When you
[10:07:19] pass your regular expression to RegexpParser,
[10:07:22] it creates an object of this
[10:07:24] class, that is chunk_parser. After
[10:07:26] creating the object of this class, we'll
[10:07:28] use the .parse method of this object
[10:07:31] to parse our sentence. To parse our
[10:07:34] sentence, in chunk_parser.parse
[10:07:36] we have to pass our POS tags, that is
[10:07:39] all the words along with
[10:07:41] their POS tags. When we pass our
[10:07:43] words along with their POS tags to
[10:07:46] chunk_parser.parse, all the words
[10:07:48] will be chunked and stored in
[10:07:50] tree. Once we print the tree,
[10:07:52] you can see this is a sentence, and the first
[10:07:54] noun phrase, NP,
[10:07:57] contains a determiner, the, and an
[10:08:00] adjective, that is crazy, and then
[10:08:03] two nouns, that is brown and dog. Apart
[10:08:05] from that, these are verbs. These are
[10:08:07] not chunked because we have not
[10:08:09] chunked verbs so far. Then we
[10:08:11] have another verb, that is running, and
[10:08:13] then we have a preposition, that is
[10:08:14] through, and after that we have
[10:08:16] another noun phrase, denoted
[10:08:19] by NP, which contains a determiner, that is
[10:08:22] the, and a noun. In this noun phrase there
[10:08:24] is no adjective, so there is only one
[10:08:26] determiner and one noun. We can
[10:08:29] represent this chunking using a tree,
[10:08:32] which makes it visually easy to
[10:08:35] understand. So we'll draw the parse
[10:08:37] tree that we have seen before, which is
[10:08:39] also called a syntax tree. For doing
[10:08:41] that you just have to use this variable,
[10:08:43] that is tree, and the draw method of this
[10:08:46] variable. Once you run this line,
[10:08:49] a new window will automatically
[10:08:50] pop up that will show your syntax
[10:08:53] tree. You can see this was our
[10:08:55] sentence, the crazy brown dog went
[10:08:58] running through the mud, and we have two
[10:09:00] noun phrases in this sentence. The
[10:09:02] first noun phrase contains a
[10:09:04] determiner, the, an adjective, that is
[10:09:06] crazy, and two nouns, that is brown and
[10:09:08] dog, and then we have another noun phrase
[10:09:10] that contains a determiner and a noun. So
[10:09:13] we have two noun phrases in this
[10:09:15] sentence, and we have successfully
[10:09:17] chunked these two noun phrases from our
[10:09:19] sentence. Now we'll chunk another set
[10:09:22] of phrases using another regular
[10:09:24] expression. I've defined my sample
[10:09:26] text as: the term 5G refers to the fifth
[10:09:29] generation of mobile technology, which
[10:09:31] promises faster browsing, streaming
[10:09:33] and download speeds as well as better
[10:09:35] connectivity. The first step is to
[10:09:37] tokenize all the words present in this
[10:09:39] sentence. For that I have used the
[10:09:42] word_tokenize function on the sample, and
[10:09:44] all the words are extracted as tokens
[10:09:47] and stored in this list. Now
[10:09:50] our next step is to attach a part of
[10:09:52] speech tag to every single word.
[10:09:54] We'll use these part of speech tags to
[10:09:56] create chunks and extract them from our
[10:09:58] sentence. To attach part of speech tags,
[10:10:01] we'll use the pos_tag function and
[10:10:04] pass our tokenized words. Once we
[10:10:07] pass our tokenized words, all the words
[10:10:09] will be labeled with a part of speech
[10:10:12] tag. You can see that the is a
[10:10:14] determiner, and then we have nouns,
[10:10:17] a cardinal, which is a number, and more
[10:10:20] nouns, determiners and adjectives. Now
[10:10:23] we'll use another regular
[10:10:25] expression to chunk different
[10:10:27] phrases from this sentence. Here we
[10:10:30] have defined a regular expression that
[10:10:32] will chunk two kinds of phrases from a
[10:10:34] sentence. We have written the name of
[10:10:36] the chunk as NP. It will first of all
[10:10:39] check if there is a determiner or a
[10:10:41] possessive pronoun, whose tag is written
[10:10:43] PP$. The dollar sign is part of
[10:10:45] the tag name, escaped with a backslash;
[10:10:47] it does not mean end of the line here.
[10:10:49] So the first word is a determiner or a possessive.
[10:10:51] And then we have added a question
[10:10:53] mark, which means zero or one.
[10:10:56] So this part can occur zero times or once.
[10:10:58] Then we are searching for an
[10:11:00] adjective. The asterisk means
[10:11:03] zero or more, so zero or more
[10:11:04] adjectives. Then one noun is
[10:11:06] mandatory, that is NN. Then it will
[10:11:08] also search for this part, which is NNP,
[10:11:12] that is a proper noun. So it will
[10:11:14] search for NNP. And the plus sign means
[10:11:17] that it will search for one or more. So
[10:11:20] there must be at least one NNP, or
[10:11:22] more NNPs, that is proper nouns. So the first rule
[10:11:25] will chunk a determiner or possessive,
[10:11:27] adjectives and a noun, and the second will
[10:11:29] chunk sequences of proper nouns. Once we
[10:11:32] define this grammar, or this regular
[10:11:34] expression, we'll pass this regular
[10:11:37] expression to our RegexpParser, that is,
[10:11:39] we pass it to the RegexpParser class.
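A two-rule grammar of this shape can be sketched with `nltk.RegexpParser` as below. Two assumptions are made for illustration: the tokens are tagged by hand instead of with `pos_tag`, and the possessive tag is written `PRP$`, the Penn Treebank tag that NLTK's default tagger emits, rather than the `PP$` read out above. The sentence is a made-up example, not the 5G text from the video.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed Penn Treebank tags).
tagged = [("Larry", "NNP"), ("Page", "NNP"), ("founded", "VBD"),
          ("his", "PRP$"), ("first", "JJ"), ("company", "NN")]

# Two rules under one chunk name; "\$" is a literal dollar sign in the tag.
grammar = r"""
NP: {<DT|PRP\$>?<JJ>*<NN>}   # determiner or possessive, adjectives, noun
    {<NNP>+}                 # sequence of proper nouns
"""

chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(tagged)
print(tree)

# Collect the words of every NP chunk.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Note that `<NN>` matches only the exact tag NN, which is why proper nouns need their own `<NNP>+` rule.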
[10:11:43] So our object is created. We have
[10:11:45] named the object
[10:11:47] chunk_parser. Now we'll use the .parse
[10:11:50] method of this object to parse our
[10:11:52] text. What we need to pass
[10:11:54] is the list that contains the
[10:11:57] words along with their part of speech
[10:11:59] tags, because the tags are matched against this
[10:12:01] regular expression, and all the
[10:12:03] phrases or words that match these
[10:12:05] criteria will be chunked. We
[10:12:06] have stored the result in the tree variable. Once
[10:12:08] we print the tree, you will see that the
[10:12:10] first NP, which is our chunk, has one
[10:12:13] determiner and then one noun. After
[10:12:16] that we have another noun phrase, that
[10:12:18] is determiner, then adjective and then
[10:12:21] noun. Then we have another noun
[10:12:23] phrase, which is only a proper noun, that is
[10:12:25] mobile, according to this regular
[10:12:27] expression. Then we have another noun
[10:12:29] phrase that has only technology, which is
[10:12:31] a noun; according to this expression
[10:12:33] there is only one noun in there, and we don't
[10:12:35] have any adjectives, determiners or
[10:12:38] possessives. Then we have
[10:12:40] another noun phrase here and another
[10:12:42] noun phrase. This is how we'll chunk
[10:12:45] based on the POS tag values from our
[10:12:48] sentences. Now, to draw the parse
[10:12:51] tree or the syntax tree for the same
[10:12:52] sentence, for the same chunks that we
[10:12:54] have extracted here, we'll just have
[10:12:56] to call tree.draw, that is
[10:12:59] the draw method of our tree. Once we run this,
[10:13:02] all the NPs, which is the name
[10:13:05] of our chunk, will be set apart from the
[10:13:07] rest of the sentence. Now we'll see
[10:13:09] how to chunk verb phrases from a
[10:13:12] sentence. I've defined a sentence,
[10:13:14] sample, which says: he should wait before
[10:13:17] going swimming. So this is my sentence.
[10:13:19] The next two steps are to tokenize the words
[10:13:22] and then to attach part of speech tags
[10:13:24] to every word. I've done that in one
[10:13:26] step: nltk.word_tokenize on
[10:13:29] sample, and then attaching the part of
[10:13:31] speech tags to every single word,
[10:13:33] with the tagged words stored in pos_tags.
[10:13:35] Once we print this, we'll get
[10:13:37] all the words along with their part of
[10:13:39] speech tags. Then we have to define the
[10:13:41] regular expression that we'll use for
[10:13:44] matching the words and chunking the
[10:13:46] words. We define the regular expression
[10:13:48] here, which will chunk all the verb
[10:13:50] phrases from the sentence. The first
[10:13:53] word of the verb phrase should be a
[10:13:55] personal pronoun, and this question
[10:13:58] mark means it can occur zero
[10:14:00] or one times. The next word
[10:14:02] is a verb, so it can be any one
[10:14:05] of these verb tags. VB is the base
[10:14:07] form of a verb. Then we have VBD,
[10:14:09] which is the past tense form of a verb.
[10:14:12] Or it can be VBZ, which is the
[10:14:15] third person singular present
[10:14:18] form, that is, third person
[10:14:20] singular. And then we have VBG, which is
[10:14:22] the present participle form of a verb. So
[10:14:24] it can be any one of these four, and
[10:14:27] the asterisk means there can be
[10:14:29] zero or more of them. The
[10:14:32] third word can be an adverb. RB means
[10:14:35] an adverb. Or it can be an adverb
[10:14:37] in comparative form,
[10:14:39] a comparative form such as better,
[10:14:41] greater, taller. So it can be an adverb.
[10:14:44] The question mark means it
[10:14:46] can occur zero or one times. We'll chunk
[10:14:48] all these words as verb phrases from our
[10:14:51] sentence. After defining our regular
[10:14:53] expression, we'll use the RegexpParser
[10:14:56] class to create an object. This
[10:14:58] class takes our regular expression
[10:15:00] as an argument. The object that we
[10:15:02] created is chunk_parser. After we
[10:15:04] create the chunk_parser object, we will
[10:15:06] use the parse method of this object to
[10:15:09] create chunks from our sentence. We
[10:15:12] have to pass the pos_tags list, which
[10:15:14] contains the words along with their POS
[10:15:16] tags. We have stored the result in
[10:15:18] tree. Once we have our tree variable,
[10:15:21] we'll print tree. Once we print the
[10:15:24] tree, our first verb phrase is this one,
[10:15:26] he. He is a personal pronoun, and
[10:15:29] it is the only word in this phrase, so
[10:15:32] it will be chunked. The next
[10:15:34] verb phrase that will be chunked is wait,
[10:15:36] which is the base form of a verb. Then
[10:15:38] we have going, which is the present
[10:15:39] participle, so it will also be chunked.
[10:15:42] Then we draw the parse tree for the same
[10:15:44] sentence. This is our sentence, and
[10:15:46] only the verb phrases are chunked.
[10:15:49] The first verb phrase is he. The next
[10:15:51] verb phrase is wait, and the third verb
[10:15:54] phrase is going, which is the present
[10:15:56] participle form. Let's go to the Jupyter
[10:15:58] notebook now and implement chunking, and
[10:16:00] we'll see how we chunk different phrases
[10:16:03] from our sentences. First we'll chunk noun phrases
[10:16:05] from this sentence. I've written the
[10:16:07] sentence: the dark cloud covered the sky.
[10:16:10] The first step is to tokenize all
[10:16:12] the words and attach a part of speech
[10:16:15] tag to every single word. I'll store
[10:16:17] that in the POS list.
[10:16:22] Once I run this line, all my words will
[10:16:25] be tokenized and a part of
[10:16:27] speech tag will be attached to every
[10:16:29] single word. If I print the POS list now, you
[10:16:32] can see that every word has been
[10:16:34] given a part of speech tag. Now,
[10:16:36] based on these tags, we'll extract all
[10:16:38] the noun phrases that are present in this
[10:16:41] sentence. For that I'll define a
[10:16:43] regular expression. I'll store it in
[10:16:46] regex, and the r prefix marks
[10:16:50] the string as a raw string, the usual way to write a regular
[10:16:51] expression. The
[10:16:54] name of the chunk that
[10:16:56] I'll keep is NP.
[10:16:58] Then we have to define the tags we
[10:17:01] want. The first word can be a determiner,
[10:17:04] and there can be either zero or one
[10:17:06] determiner; for that I'll write a question
[10:17:08] mark. The next word can be an
[10:17:12] adjective, and there can be zero or more,
[10:17:15] so for zero or more I'll write an asterisk. And
[10:17:17] the third word should be a noun, and there
[10:17:20] can be zero or more nouns, so we can
[10:17:22] write this. After adding this, when I
[10:17:24] run this, my regular expression is
[10:17:26] created. Now I'll use the RegexpParser
[10:17:30] class to parse with this regular expression
[10:17:32] and extract all the noun phrases
[10:17:34] from the sentence. For that I'll first
[10:17:37] define an instance of the RegexpParser
[10:17:40] class,
[10:17:42] and I have to pass my regular expression
[10:17:44] as an argument. Once my object is
[10:17:47] created, I'll use this object to
[10:17:50] identify all the noun phrases in my
[10:17:52] sentence. So I'll just call its
[10:17:55] .parse method and pass my POS
[10:17:59] words list, and I'll store the result in tree.
[10:18:02] Once I run this, I'll print my tree
[10:18:05] to see how many noun phrases there are.
[10:18:08] You can see this is our sentence, and
[10:18:10] we have the first noun phrase, the dark
[10:18:11] cloud, and then we have another noun
[10:18:13] phrase, that is the sky. Now let's draw
[10:18:16] the parse tree for this.
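The notebook steps for this sentence can be sketched roughly as below. The variable names (`pos_list`, `regex`, `chunk_parser`) are guesses at the names used in the video. As further assumptions, the tokens are tagged by hand so the sketch runs without downloading NLTK's tokenizer and tagger data, and the noun is written `<NN>+` (one or more) rather than the zero-or-more form in the narration, so that every chunk contains at least one noun.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed tags); in the video these come from
# word_tokenize followed by pos_tag.
pos_list = [("the", "DT"), ("dark", "JJ"), ("cloud", "NN"),
            ("covered", "VBD"), ("the", "DT"), ("sky", "NN")]

# Optional determiner, any number of adjectives, then one or more nouns.
regex = r"NP: {<DT>?<JJ>*<NN>+}"

chunk_parser = RegexpParser(regex)
tree = chunk_parser.parse(pos_list)
print(tree)

# Pull the text of each NP chunk out of the tree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)

# tree.draw()  # opens the parse tree in a separate window (needs a display)
```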
[10:18:19] Now let us extract different phrases
[10:18:21] from the sentence given here, that is
[10:18:23] Mary saw a cat sitting on a mat. The
[10:18:26] first step will be to get all the words with their part
[10:18:28] of speech tags. I will just use
[10:18:32] these two functions.
[10:18:35] Once I run this, my words along
[10:18:38] with their part of speech tags will be
[10:18:40] stored in the POS list. Now I can create
[10:18:43] a regular expression.
[10:18:45] This is the regular expression that I
[10:18:46] have created. Here we are chunking
[10:18:49] all of these
[10:18:51] phrases. First we'll chunk the noun
[10:18:53] phrases. We have a
[10:18:55] determiner zero or one times, then an
[10:18:59] adjective zero or more times, and then a
[10:19:01] noun. The noun can be in any
[10:19:04] form, such as NNP or NNS, so we write NN followed by a dot for any
[10:19:07] character and an asterisk for zero or more times. If
[10:19:10] we just write NN, only singular
[10:19:12] nouns will be extracted. If you write
[10:19:15] this one, all of the nouns will be
[10:19:17] extracted. Then we have defined how
[10:19:19] to extract prepositions, which is just IN,
[10:19:22] because the POS tag for a preposition is IN. Then,
[10:19:25] to extract verbs, if you want to
[10:19:27] extract VB, VBZ, VBG, you have to
[10:19:30] write this code, because after V the
[10:19:32] dot means any character, and the asterisk means
[10:19:34] zero or more times. So it will extract
[10:19:36] all the verbs. Then, to extract
[10:19:38] prepositional phrases, we have used PP,
[10:19:41] that is a preposition, which is
[10:19:43] defined here, and a noun phrase, which
[10:19:46] is defined here. Then,
[10:19:48] to extract verb phrases, we have used VP,
[10:19:51] that will be any verb, which is defined
[10:19:53] here as V, and after that there can be
[10:19:55] a noun phrase or a prepositional phrase
[10:19:58] zero or more times, so that part is optional.
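The layered grammar described here can be sketched as below. The exact grammar text in the video is not visible in the transcript, so this is a reconstruction from the narration, with the tokens tagged by hand as an assumption. Each named rule runs as its own stage, so later rules (PP, VP) can refer to chunks built by earlier ones (NP, P, V).

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed tags) for "Mary saw a cat sitting on a mat".
pos_list = [("Mary", "NNP"), ("saw", "VBD"), ("a", "DT"), ("cat", "NN"),
            ("sitting", "VBG"), ("on", "IN"), ("a", "DT"), ("mat", "NN")]

grammar = r"""
NP: {<DT>?<JJ>*<NN.*>}   # optional determiner, adjectives, any noun tag
P:  {<IN>}               # preposition
V:  {<V.*>}              # any verb tag: VB, VBD, VBG, VBZ, ...
PP: {<P><NP>}            # preposition followed by a noun phrase
VP: {<V><NP|PP>*}        # verb followed by optional NPs or PPs
"""

parser = RegexpParser(grammar)
tree = parser.parse(pos_list)
print(tree)

# Labels of the top-level chunks, and the words inside the prepositional phrase.
top_labels = [child.label() for child in tree]
pp_text = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "PP")
]
print(top_labels, pp_text)
```

Because `<NN.*>` matches NN, NNS and NNP alike, the proper noun Mary is chunked as a noun phrase by the first rule.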
[10:20:01] Once I run this code so my regular
[10:20:03] expression will be created. Now I'll use
[10:20:06] parser to pass this regular expression.
[10:20:11] So once I run this code, now my parser
[10:20:14] is ready. Now I'll use the dot pass
[10:20:16] method of this parser to chunk all the
[10:20:19] phrases that are that I have defined in
[10:20:21] in my regular expression. I'll write
[10:20:25] and then I'll pass my poss words list.
[10:20:28] I'll store it in tree. So we can print
[10:20:30] it.
[10:20:32] So after that I'll print uh the tree.
[10:20:36] So now you can see that from our
[10:20:39] sentence Mary which is proper noun has
[10:20:42] been chunked as noun phrase. Then we
[10:20:43] have verb phrase then another noun
[10:20:45] phrase and then verb phrase and a
[10:20:47] prepositional phrase and then a noun
[10:20:49] phrase. It is better if you visualize
[10:20:51] this to see how it is chunking. Just
[10:20:54] write tree.draw() to draw the parse
[10:20:57] tree. To exclude specific phrases
[10:21:00] from our chunks, we have a concept
[10:21:02] called chinking. Chinking is used to
[10:21:04] exclude a specific part from the whole
[10:21:06] chunk. We can define a chink to be a
[10:21:08] sequence of tokens that we don't want to
[10:21:10] be included in our chunk. For
[10:21:12] example, we have noun phrases and verb
[10:21:14] phrases that we have chunked. If we
[10:21:16] don't want verb phrases or any
[10:21:18] specific type of verbs, then we
[10:21:21] can define how to exclude those verbs
[10:21:23] by using chinking. To implement
[10:21:26] chinking, we'll first import
[10:21:29] word_tokenize to tokenize all the words. And
[10:21:31] then we have defined a sample text that
[10:21:33] we'll use for chinking. Then the
[10:21:35] sample text is tokenized and
[10:21:39] part-of-speech tags are assigned to each and
[10:21:41] every word that is present in the text.
[10:21:43] And we have printed it. These are all
[10:21:45] the words along with their
[10:21:47] part-of-speech tags. So now we'll define which
[10:21:50] words we do not want in our chunk.
[10:21:52] For that we have defined a regular
[10:21:54] expression. So here is our regular
[10:21:56] expression. The name of the chunk will be
[10:21:58] chunk. First of all we'll chunk
[10:22:00] everything using the braces. The dot
[10:22:02] followed by a star means any tag, that is, zero or more of any
[10:22:05] character. Then the plus means we want one or more of any
[10:22:08] tag. So it means it will chunk every
[10:22:09] word that is present in the sentence.
[10:22:11] And then to perform chinking, to exclude
[10:22:14] specific tokens or specific words from
[10:22:17] our chunk, we just have to invert the
[10:22:19] braces here. So if you just invert the
[10:22:21] braces and mention the tokens that you
[10:22:23] do not want in your chunk, those
[10:22:25] particular tokens will be chinked. So you
[10:22:27] just have to mention all the tokens that
[10:22:29] you do not want to be chunked. So I have
[10:22:32] mentioned the first token to be a verb.
[10:22:34] It can be any verb, any form of verb. It
[10:22:37] will be VB and also VBZ, VBD, VBG,
[10:22:41] because I've inserted a dot. So it means
[10:22:43] that even if any character comes after VB, that
[10:22:46] verb will also not be chunked. And then
[10:22:50] the question mark means zero or one.
[10:22:53] Then, separated by or, we have four tokens. So if any
[10:22:56] one of these tokens is found in the
[10:22:58] sentence, it will not be chunked and the
[10:23:00] rest of the sentence will be chunked.
[10:23:01] And then we have a preposition and a
[10:23:04] determiner and a TO word. So if
[10:23:06] there's a to, as in going to, that will
[10:23:08] be a TO word. And we want one or more
[10:23:10] of these, so there will be at least one word
[10:23:13] that will not be chunked. It
[10:23:15] can be any one of them. And
[10:23:17] then after defining the regular
[10:23:18] expression we can use the
[10:23:20] RegexpParser class to parse
[10:23:24] our grammar, that is, our regular expression.
[10:23:26] And then once we create an object
[10:23:28] of this class we can use this object to
[10:23:30] parse our words and to find chunks
[10:23:33] in our sentence. So we'll use the
[10:23:36] object and the .parse method of that
[10:23:38] object, and we'll pass our text that has
[10:23:41] words along with their part-of-speech tags,
[10:23:42] and we'll store the result in tree. Once we
[10:23:45] print the tree we'll get the sentence, with
[10:23:47] the words that are not
[10:23:49] chunked left outside, and the chunks
[10:23:52] holding only the words that we want to be
[10:23:55] chunked. So the determiner, which was
[10:23:56] mentioned in the regular
[10:23:58] expression, is not chunked, and
[10:24:01] our verbs, as you can see, and our
[10:24:03] preposition are not chunked, and whatever is
[10:24:06] left after these words, those words are
[10:24:09] all chunked. So let's see a parse tree
[10:24:11] of this to understand it better. Here
[10:24:14] you can see that this was the whole
[10:24:15] sentence. Only the words which are not
[10:24:18] determiners, verbs or
[10:24:20] prepositions are chunked, so those
[10:24:22] words are chunked and these words are
[10:24:23] not chunked. Now let's go to the Jupyter
[10:24:26] notebook and implement chinking on a
[10:24:28] sample text. So we have a sentence here.
[10:24:31] The little yellow dog barked at the cat.
[10:24:33] So we'll see how we can exclude certain
[10:24:36] phrases from these words to extract only
[10:24:38] the phrases that we require. If I only
[10:24:40] want the noun phrases from this
[10:24:42] sentence, either I can define a noun
[10:24:45] phrase. But if my sentences are huge and
[10:24:48] if I know there are certain words that I
[10:24:49] don't want from the sentence, then I can
[10:24:51] directly exclude those words using
[10:24:53] chinking so that I only get the noun
[10:24:55] phrases. First of all, we'll tokenize
[10:24:58] the words from this text and then we'll
[10:25:00] attach a part of speech tag to every
[10:25:02] single word and I'll store it in pos.
[10:25:08] So after running this, all of
[10:25:11] my words are stored along with
[10:25:14] their part-of-speech tags. So if I only
[10:25:16] want noun phrases, I can either define a
[10:25:19] noun phrase or I can chink certain phrases
[10:25:22] so that I'm only left with noun phrases.
[10:25:25] This is my grammar that I'm going to
[10:25:27] use. I'll store it in regex.
[10:25:29] We start the string with r, and then I'll
[10:25:31] define the chunk that I want. First of
[10:25:34] all we'll chunk everything using this
[10:25:36] expression. And now, once we chunk
[10:25:38] everything, if I want to exclude the
[10:25:41] verbs and the prepositions from this
[10:25:44] text, I'll invert the braces. After
[10:25:47] inverting the braces, I'll write the
[10:25:50] pattern that I want to exclude.
[10:25:52] So I want VBD, the verb form which is the
[10:25:56] past form of the verb. And then either it
[10:25:59] will be this one, or a
[10:26:01] preposition, and one or more of them. So either
[10:26:03] one of these, or more than one of
[10:26:05] these, will be chinked from the whole
[10:26:07] sentence. Once I run this, my regular
[10:26:09] expression will be created. Now I'll
[10:26:11] parse my regular expression using the
[10:26:13] RegexpParser. I'll store it in a
[10:26:16] parser object
[10:26:19] and I'll pass my regular expression to
[10:26:21] it, which is regex. After this I can
[10:26:25] parse my sentences and I can chunk and chink
[10:26:27] everything that I want. So I'll
[10:26:29] store the chunks and chinks in
[10:26:32] tree, and I'll use the parser object
[10:26:35] along with the parse method, and I'll
[10:26:38] pass my tokens, which already have
[10:26:42] part-of-speech tags attached to them. So I'll
[10:26:44] just write pos. So now if I print my
[10:26:47] tree,
[10:26:49] you'll see that VBD, which is the past
[10:26:51] form of the verb, and the preposition have been
[10:26:53] chinked. They're not included in the
[10:26:55] chunk, and the rest of the words are included
[10:26:57] in the chunks, which are actually the noun
[10:26:59] phrases. So let's draw the tree.
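
The chinking steps just walked through can be sketched as below. It is a minimal version under stated assumptions: the POS tags are written by hand (in the notebook they come from the tagger, which needs an NLTK data download), and the variable names `pos`, `regex`, `parser` and `tree` are guesses based on the narration.

```python
import nltk

# Hand-assigned tags for the sample sentence (assumed; pos_tag would normally produce them).
pos = [("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
       ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# First chunk every token, then chink (carve out) past-tense verbs and prepositions.
regex = r"""
  chunk:
    {<.*>+}
    }<VBD|IN>+{
"""

parser = nltk.RegexpParser(regex)
tree = parser.parse(pos)
print(tree)

# Collect the words of each remaining chunk.
chunks = [[word for word, tag in subtree.leaves()]
          for subtree in tree if isinstance(subtree, nltk.Tree)]
print(chunks)
```

The inverted braces `}...{` are what mark the second rule as a chink rather than a chunk. What is left inside the chunks here are the two noun phrases.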
[10:27:03] So here's the tree that we got.
[10:27:05] We have chinked these two words from the
[10:27:08] whole sentence. First of all we
[10:27:10] chunked all of the words, and then we
[10:27:12] chinked these two tokens, and we are left with
[10:27:16] the noun phrases. A context-free
[10:27:18] grammar is used to describe the syntax
[10:27:20] of a natural language by defining things
[10:27:22] recursively, and a language that is
[10:27:25] described by a context-free grammar is
[10:27:27] defined as all the possible derivations
[10:27:29] within the rules that are defined by that
[10:27:32] context-free grammar. So a language can
[10:27:34] contain a lot of strings, and all those
[10:27:37] strings are formed by the rules that are
[10:27:39] described by a context-free grammar, and
[10:27:41] each possible derivation, or each
[10:27:43] possible string that a language has, has
[10:27:46] at least one corresponding syntax
[10:27:49] tree. So any string that is
[10:27:52] formed by a language that uses a
[10:27:55] context-free grammar has a
[10:27:57] corresponding syntax tree. If you
[10:27:59] cannot parse a string into a syntax
[10:28:02] tree, we cannot say that the string
[10:28:05] belongs to that natural language, or the
[10:28:07] language that is defined by the context-free
[10:28:09] grammar. A string is not valid in
[10:28:11] that language if you cannot parse it
[10:28:13] into a tree. So now let's understand
[10:28:15] context-free grammar. A context-free
[10:28:17] grammar basically consists of a four-tuple,
[10:28:20] or four elements. The first is
[10:28:23] our set of non-terminal symbols. Those
[10:28:26] can be your noun phrases or verb phrases
[10:28:29] or determiners, nouns, verbs, adverbs,
[10:28:32] etc. And then we have a start symbol,
[10:28:34] which is the starting string or the
[10:28:36] sentence from where we begin. That is
[10:28:38] denoted by S, which is also a
[10:28:40] non-terminal symbol. And then we have a set
[10:28:42] of terminal symbols, which are the actual
[10:28:45] words that we get after replacing all
[10:28:48] the non-terminal symbols. At the end
[10:28:50] we'll get a string which consists only of
[10:28:53] terminal symbols. Then we have a set
[10:28:56] of production rules that are used to
[10:28:58] define things recursively. Here you can
[10:29:00] see that our sentence S has two parts,
[10:29:03] that is, a noun phrase and a verb phrase,
[10:29:05] and a noun phrase is then recursively
[10:29:07] defined as a determiner and a noun, and
[10:29:10] a verb phrase is defined as a verb and
[10:29:13] an adverb. So we'll keep replacing the
[10:29:16] things that are on the right hand side
[10:29:17] until we reach a point where we have only
[10:29:21] terminal symbols. That will be
[10:29:23] our final string that we'll generate
[10:29:25] using a context-free grammar. We'll understand it
[10:29:29] by implementing it in a Jupyter notebook.
[10:29:32] Let's see how it is implemented using
[10:29:34] NLTK.
[10:29:35] There is a class called CFG in NLTK
[10:29:38] that is used to define context-free
[10:29:40] grammars. In that class we have a
[10:29:42] method called fromstring. In that
[10:29:45] method we can write all the
[10:29:47] production rules and we can mention all
[10:29:49] the terminal and non-terminal symbols.
[10:29:51] So here you can see what I have written.
[10:29:54] Firstly I have imported the NLTK package,
[10:29:56] and then I have defined a grammar which
[10:29:58] I've named grammar1. The function
[10:30:00] that I'm using is CFG.fromstring,
[10:30:03] which is from the NLTK package, and inside
[10:30:05] that function, within the parentheses,
[10:30:08] inside the inverted
[10:30:10] commas, we have to mention our production
[10:30:13] rules. You can see our first
[10:30:15] production rule is that a sentence
[10:30:17] has two components, that is, a noun phrase
[10:30:19] and a verb phrase. Then we'll define
[10:30:22] a verb phrase, which can be either a verb and
[10:30:24] a noun phrase, or a verb, a noun phrase and a
[10:30:27] prepositional phrase, in that order.
[10:30:29] And then we'll define a prepositional
[10:30:31] phrase. It can be a preposition and a
[10:30:33] noun phrase. So these are all our
[10:30:35] non-terminal symbols. Whatever is
[10:30:38] on the left hand side is a
[10:30:40] non-terminal symbol. And all the words that
[10:30:42] are present here, these are the terminal
[10:30:43] symbols. So in order to form a sentence,
[10:30:46] we have to recursively replace the
[10:30:49] non-terminal symbols until we reach a point
[10:30:52] where we only have terminal symbols.
[10:30:54] That will be a valid string in that
[10:30:56] language. So this is a language that is
[10:30:58] defined by our context-free grammar. The
[10:31:01] set of all the strings
[10:31:03] that can be generated from this context-free
[10:31:05] grammar makes up
[10:31:07] a language. So here we have a
[10:31:10] sentence Bob saw a man with a telescope
[10:31:12] in the park. So we'll see if this
[10:31:14] sentence is a valid sentence according
[10:31:16] to this context free grammar or not. To
[10:31:19] do that we'll parse the sentence using
[10:31:21] the context free grammar. So first of
[10:31:23] all we'll tokenize all the words present
[10:31:25] in this sentence and then we'll parse
[10:31:28] our grammar. This is our grammar which
[10:31:30] is stored in grammar1. So we'll parse
[10:31:32] our grammar using the recursive descent
[10:31:35] parser, which will parse our grammar.
[10:31:37] This is another class that we'll use
[10:31:38] to parse the grammars, and then after
[10:31:41] creating the object of this class, that
[10:31:43] is, RecursiveDescentParser, we'll pass
[10:31:47] all the word tokens that we have
[10:31:48] generated. So if we are able to
[10:31:50] generate a parse tree for this sentence
[10:31:53] using this grammar, because we have
[10:31:55] created our object using this
[10:31:57] grammar, that is, if we are able to parse this
[10:31:59] sentence using this grammar, this
[10:32:00] sentence will be a valid sentence of
[10:32:02] the language that is defined by
[10:32:04] this grammar. So here you can see I'll
[10:32:07] iteratively pass through every single
[10:32:10] word that is present in our tokenized words,
[10:32:12] and we'll see how our context-free
[10:32:15] grammar parses this sentence. So
[10:32:17] you can see it starts from here. Then we
[10:32:19] have a noun phrase, which is Bob. And then
[10:32:21] we have a verb phrase that contains a
[10:32:23] verb and a noun phrase. The verb is saw, and the
[10:32:25] noun phrase also contains a determiner
[10:32:28] and a noun and a prepositional phrase.
[10:32:30] Let's look at this structure in a tree
[10:32:33] form so that we can understand it
[10:32:35] better. So we'll just draw
[10:32:38] this tree, and then you can see that our
[10:32:40] whole sentence is parsed using different
[10:32:43] phrases by this context-free grammar. So
[10:32:46] we can easily say that this sentence is
[10:32:47] a part of the language which is defined
[10:32:49] by this context free grammar. So we have
[10:32:51] our sentence here. So we have a noun
[10:32:53] phrase called Bob and then we have verb
[10:32:55] phrase. So the verb phrase is broken
[10:32:57] down into three parts that is verb, noun
[10:33:00] phrase and prepositional phrase. So
[10:33:02] these are all the production rules that
[10:33:04] were defined by our grammar. So if you
[10:33:06] look here, we have the verb phrase:
[10:33:08] either that verb phrase can be a verb
[10:33:09] and a noun phrase, or it can be a verb,
[10:33:12] a noun phrase and a prepositional phrase.
[10:33:14] So here we have a verb, a noun phrase and
[10:33:16] a prepositional phrase. The verb is
[10:33:19] saw, and then in the noun phrase we have a
[10:33:21] determiner, a noun and a
[10:33:24] prepositional phrase, and here in this
[10:33:26] prepositional phrase we have a
[10:33:28] preposition and a noun phrase. So here
[10:33:30] all the words belong to one of the
[10:33:33] categories. Since we can draw a parse tree for
[10:33:36] this string, we can say that this
[10:33:38] string belongs to the language that is
[10:33:40] generated by, or defined by,
[10:33:43] this context-free grammar. We can also
[10:33:45] print the production rules of the
[10:33:47] grammar using the print function. If you
[10:33:49] just write the print and the grammar
[10:33:50] that you have defined, you'll see that
[10:33:52] our grammar has 25 production rules and
[10:33:55] it starts from here which is the start
[10:33:57] symbol, which is sentence. Sentence will
[10:33:59] be broken down into noun phrase and verb
[10:34:01] phrase, and verb phrase will be broken
[10:34:03] down into verb and noun phrase, and then
[10:34:05] verb phrase can also
[10:34:07] be equal to this: verb, noun phrase
[10:34:10] and a prepositional phrase. These are
[10:34:11] all terminal symbols, and whatever is on
[10:34:14] the left hand side are all
[10:34:15] non-terminal symbols. Then, after defining
[10:34:18] a grammar, we can use that grammar to
[10:34:20] generate sentences. For doing that we
[10:34:23] have to import the generate function
[10:34:25] from nltk.parse.generate.
[10:34:27] So we import the generate function,
[10:34:29] and this is our context-free grammar class
[10:34:31] that we have to import from the NLTK
[10:34:33] package. Then after importing this we'll
[10:34:36] iterate through
[10:34:37] every sentence that is generated by this
[10:34:39] grammar. So we'll write for sent in
[10:34:42] generate. So this function will generate
[10:34:44] all the sentences and we'll iterate
[10:34:46] through every single sentence that is
[10:34:48] generated by this grammar. So we'll
[10:34:50] write inside the generate function the
[10:34:52] grammar that we have defined, grammar1.
[10:34:54] Here is our grammar1. And then
[10:34:56] we'll write n equals 10, so it will
[10:34:58] print 10 sentences. Otherwise, if
[10:35:00] you don't write n equals 10, it will
[10:35:02] go on and print all the sentences. And
[10:35:04] then we have used the print function.
[10:35:07] We'll join all the words
[10:35:09] that our grammar will generate.
[10:35:11] This is the space that goes between the
[10:35:13] words, and it will join all the words and
[10:35:15] print each sentence line by
[10:35:18] line. You can see the first sentences are
[10:35:19] John saw John, John saw Mary, John
[10:35:22] saw Bob, and John saw a cat, John saw a
[10:35:25] man, John saw a cat. So all these sentences
[10:35:27] are actually the strings that are
[10:35:29] contained in that language. You can see
[10:35:30] there are some sentences, like John saw an man,
[10:35:32] which don't make any sense, which
[10:35:34] are grammatically not correct. So in
[10:35:36] context-free grammars, the reason that
[10:35:38] they are called context-free is because
[10:35:40] the placement of the words, the
[10:35:42] placement of the non-terminal symbols,
[10:35:44] is not according to a context. It is
[10:35:47] according to the production rules. So
[10:35:48] even if the production rules
[10:35:51] are generating a sentence that has no
[10:35:54] meaningful grammatical meaning or
[10:35:56] meaningful grammatical sequence, it will
[10:35:58] still be printed, because it is a
[10:36:00] context-free language. It is not a
[10:36:02] context-sensitive language. So now let's
[10:36:04] go to the Jupyter notebook and we'll
[10:36:06] define a context-free grammar and then
[10:36:09] we'll generate some sentences using that
[10:36:11] context-free grammar. So after
[10:36:13] importing CFG, we'll define a grammar
[10:36:15] that we'll use to generate and parse
[10:36:18] sentences. I'll write the name of the
[10:36:20] grammar as grammar,
[10:36:22] and then we have to use the function
[10:36:24] fromstring,
[10:36:27] which will create a grammar from the
[10:36:28] string that we'll pass in the
[10:36:30] parentheses.
[10:36:34] This is our grammar that we have
[10:36:36] defined. These are the production rules.
[10:36:38] On the left hand side you can see these
[10:36:40] are our non- terminal symbols.
[10:36:43] Start symbol is s which is the sentence
[10:36:46] or the string from where we begin. And
[10:36:49] then a string contains a noun phrase and
[10:36:52] a verb phrase. And then we define what a
[10:36:54] verb phrase is using another production
[10:36:56] rule. Verb phrase can either be a verb
[10:36:59] and a noun phrase or it can be a verb, a
[10:37:02] noun phrase and a prepositional phrase.
[10:37:04] Then we'll define another production
[10:37:06] rule about prepositional phrase. So it
[10:37:08] can be a preposition followed by a noun
[10:37:11] phrase. And then we have defined a verb,
[10:37:14] which is either ate, saw or worked. So
[10:37:16] these three words are called
[10:37:19] terminal symbols. All the words that
[10:37:20] you see on the right hand side are
[10:37:22] terminal symbols, and on
[10:37:24] the left hand side we have all the
[10:37:26] non-terminal symbols. These are all
[10:37:28] production rules, and S is our start
[10:37:30] symbol. So these are the four elements of the tuple
[10:37:32] that define a context-free grammar.
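
A grammar of this shape can be written with `CFG.fromstring` roughly as follows. The rule structure (S, VP, PP) follows the narration; the word lists are an assumption modelled on the well-known NLTK book grammar and chosen so the rule count matches the 25 productions mentioned in the walkthrough. In particular the third verb is assumed to be "walked" where the transcript has "worked".

```python
from nltk import CFG

# Production rules: left-hand sides are non-terminals, quoted words are terminals.
# The vocabulary below is assumed for illustration, not copied from the notebook.
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "ate" | "saw" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

print(grammar.start())             # the start symbol
print(len(grammar.productions()))  # each alternative counts as one production
print(grammar)                     # lists every production rule
```

Unless told otherwise, `fromstring` takes the left-hand side of the first rule as the start symbol, which is why S comes first.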
[10:37:34] Once I run this, now our context free
[10:37:37] grammar is generated. So after that
[10:37:40] we'll take a sentence and we'll see if
[10:37:42] we can draw a parse tree or we can parse
[10:37:45] that sentence using this grammar or not.
[10:37:48] And if we are able to do so we'll say
[10:37:50] that that sentence falls in the language
[10:37:52] that is generated by this grammar. So
[10:37:54] let us write a sentence. I'll store it
[10:37:56] in sent. And let us add a sentence that
[10:38:02] John saw Mary with a cat in the park. So
[10:38:06] we'll parse this sentence using this
[10:38:08] grammar. And if we are able to parse
[10:38:10] this sentence and draw a parse tree for
[10:38:13] it, then we can say that this
[10:38:15] sentence is a part of the language that is
[10:38:17] generated by this grammar. So for doing
[10:38:20] that we'll import the RecursiveDescentParser,
[10:38:22] which is used to parse with a
[10:38:25] context-free grammar, from NLTK.
[10:38:32] So once we have imported the
[10:38:34] RecursiveDescentParser, we'll create an object
[10:38:36] of this class that we'll use to parse the
[10:38:38] sentences.
[10:38:42] So the argument is the grammar that we
[10:38:44] have defined so that it will parse the
[10:38:47] grammar and then we can use the object
[10:38:49] to parse sentences.
[10:38:52] So now, after parsing the grammar, we'll
[10:38:56] parse sentences using this parser.
[10:38:59] We'll firstly check if the sentence that
[10:39:01] we have mentioned, John saw Mary with a
[10:39:04] cat in the park, is actually a valid
[10:39:06] sentence in this grammar or not. We will
[10:39:09] set the name as tree
[10:39:11] for our iterator.
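
The parsing step can be sketched as below. The grammar's vocabulary is again an assumption (modelled on the NLTK book grammar), and a plain whitespace split stands in for the tokenizer so that no NLTK data download is needed.

```python
from nltk import CFG
from nltk.parse import RecursiveDescentParser

# Same rule structure as described in the walkthrough; word lists are assumed.
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "ate" | "saw" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

sent = "John saw Mary with a cat in the park"
tokens = sent.split()  # whitespace split used here in place of a tokenizer

parser = RecursiveDescentParser(grammar)

# parse() yields one tree per valid derivation; no trees means the
# sentence is not in the language defined by the grammar.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

If a token is missing from the grammar's terminals, `parse` raises a `ValueError` instead of returning an empty result, so an empty list specifically means the words are known but no derivation fits.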
[10:39:14] Once I run this code, you can see that
[10:39:16] first we have tokenized the sentence
[10:39:19] here. All the words will be given to
[10:39:21] parse, and this parse is a method of
[10:39:24] the parser which we have declared here,
[10:39:26] this parser object, and it will parse this
[10:39:28] sentence. If it is able to parse the
[10:39:30] whole sentence, then we can say this
[10:39:31] sentence is a valid sentence according
[10:39:33] to this grammar. So now you can see that
[10:39:36] our sentence has a noun phrase that is
[10:39:38] John and we have a verb phrase which
[10:39:40] contains a verb, a noun phrase and a
[10:39:42] prepositional phrase. So we'll visualize
[10:39:44] it using a tree so that it is clear.
[10:39:47] We'll just write tree.draw(). Now, what
[10:39:49] are the production rules that our
[10:39:51] grammar is using? We'll just write print
[10:39:53] and the name of the grammar. You can
[10:39:56] see we have 25 production rules, the
[10:39:58] start state is S, and these are the
[10:40:01] production rules. On the
[10:40:03] left hand side we have all the
[10:40:04] non-terminals, and on the right hand side,
[10:40:06] wherever there are words, those words are
[10:40:08] the terminal symbols. So now let's
[10:40:11] generate some sentences using this
[10:40:13] grammar. For that I'll have to import the
[10:40:15] generate function from nltk.parse.generate.
[10:40:21] So once we import the generate function,
[10:40:24] we'll generate some sentences using
[10:40:26] the grammar above. I'll write for sent in generate,
[10:40:30] and I have to pass my grammar, that is
[10:40:32] grammar, and we will print 10
[10:40:34] sentences, so I'll write n equals 10, and
[10:40:37] then I'll print.
[10:40:39] If I join without spaces, it will
[10:40:42] print without any spaces. So I'll use a
[10:40:45] space to print sentences with spaces in
[10:40:48] between the words. I'll add join and
[10:40:50] then sent. So now you can see we have the
[10:40:54] first 10 sentences that our grammar has
[10:40:57] generated: John ate John, John ate
[10:40:59] Mary and John at dog John. Most of
[10:41:01] these sentences don't make sense,
[10:41:03] because this is a context-free grammar
[10:41:04] and the words are placed according to
[10:41:06] the production rules, where there is
[10:41:08] no context and no particular
[10:41:11] order in which the words should be
[10:41:13] placed. So some of the sentences will
[10:41:14] make sense and some of the sentences
[10:41:16] won't.
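
Sentence generation from the grammar can be sketched like this. As before, the vocabulary is an assumption modelled on the NLTK book grammar, so the exact sentences printed may differ from those in the video.

```python
from nltk import CFG
from nltk.parse.generate import generate

# Same rule structure as described in the walkthrough; word lists are assumed.
grammar = CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "ate" | "saw" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

# generate() yields each sentence as a list of words; n caps how many are produced.
sentences = [" ".join(sent) for sent in generate(grammar, n=10)]
for sentence in sentences:
    print(sentence)
```

The generator expands rules in the order they are written, so the first results reuse the first subject and verb and only vary the object. Combinations such as a determiner paired with the wrong noun appear because the rules constrain structure only, not agreement or meaning.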
[10:41:18] >> Here's a quiz question for you guys.
[10:41:20] What is the primary goal of text
[10:41:22] classification in natural language
[10:41:24] processing? Your options are converting
[10:41:26] text into speech, predicting the
[10:41:28] sentiment of a text, summarizing a
[10:41:30] lengthy document or translating text
[10:41:33] from one language to another. Please
[10:41:35] mention your answers in the comment
[10:41:36] section.
[10:41:39] So if you talk about bag of words, so
[10:41:40] bag of words is a method for extracting
[10:41:43] features from a text document and then
[10:41:45] those features are used to train
[10:41:47] different machine learning algorithms.
[10:41:49] If we talk about text classification
[10:41:51] using machine learning or deep learning,
[10:41:53] most of the machine learning and deep
[10:41:55] learning techniques require a
[10:41:57] numerical input, not a textual input,
[10:42:00] because they work on mathematical
[10:42:02] equations. So the input is always a
[10:42:04] numeric input, but if you want to
[10:42:06] classify text, our input
[10:42:09] is present in a textual form. That
[10:42:11] problem is solved using
[10:42:13] different techniques which we use to
[10:42:15] represent text in the form of numbers that
[10:42:18] we can use to train our different
[10:42:20] machine learning models. One such
[10:42:22] technique is called the bag of words model.
[10:42:24] So in bag of words model we represent
[10:42:26] text data in a format that is suitable
[10:42:29] for machine learning and then we do this
[10:42:31] by creating vocabulary of all the unique
[10:42:33] words. So if we have different text
[10:42:35] documents, so we'll create a list or a
[10:42:38] vocabulary which will contain all the
[10:42:40] unique words that are present in that
[10:42:42] text document or in the text corpus and
[10:42:45] then we'll use that vocabulary to build
[10:42:48] word count vectors. Once we build word
[10:42:50] count vectors which will represent a
[10:42:52] text document or a sentence in form of
[10:42:55] word counts, word frequencies, then we
[10:42:58] can use those as an input to our machine
[10:42:59] learning models. These are some of the
[10:43:01] steps that are involved in building a
[10:43:03] bag of words model. First of all, we
[10:43:05] have our text documents, which contain
[10:43:07] textual data. And then the next step is
[10:43:10] pre-processing. In pre-processing,
[10:43:12] we'll remove stop words, punctuation
[10:43:14] marks and other irregularities from our
[10:43:17] text data, using all the techniques
[10:43:18] that we have already learned and that we'll
[10:43:20] also learn in this module and in
[10:43:22] the next module as well. After
[10:43:24] pre-processing, our next step is
[10:43:26] tokenization. We'll tokenize all the
[10:43:28] words present in our text documents. And
[10:43:31] after tokenization, the third step is
[10:43:34] building a vocabulary. It will
[10:43:36] contain all the unique words present in
[10:43:38] our text documents. And then after
[10:43:40] creating a vocabulary, we will create
[10:43:42] word count vectors, or feature vectors.
[10:43:44] We use them as an input to our machine
[10:43:47] learning algorithms. A bag of words
[10:43:49] model is used in a lot of fields, such as
[10:43:50] natural language processing, in
[10:43:53] order to understand and interpret human
[10:43:55] spoken languages. Then, to extract
[10:43:57] information from our text documents, we
[10:43:59] can also use the bag of words model. The
[10:44:02] third one is document classification:
[10:44:04] in order to classify documents into
[10:44:06] different categories we can also use the
[10:44:08] bag of words model. So to implement the
[10:44:10] bag of words model in Python we require some
[10:44:14] of the libraries. The first library is
[10:44:16] the NLTK which we have been using till
[10:44:19] now. We'll import NLTK and then we'll
[10:44:21] import numpy to create arrays that we'll
[10:44:24] use later. And then the word_tokenize
[10:44:26] function is to tokenize all the words,
[10:44:28] and then I will import the stop words
[10:44:31] from the NLTK corpus, so we can remove the
[10:44:33] stop words in the pre-processing step.
[10:44:35] After loading all these libraries, I
[10:44:37] will define three functions. These three
[10:44:40] functions will implement the bag of words
[10:44:42] model from scratch in Python. We
[10:44:44] have inbuilt functions that we can also
[10:44:47] use to build a bag of words model. But
[10:44:49] first we'll learn how
[10:44:51] this concept is actually approached. We
[10:44:54] have defined a function that will return
[10:44:56] us a clean text: the tokenized text in a
[10:44:59] cleaned format. The name of the function
[10:45:01] is extract_words, and this function takes
[10:45:04] a sentence as an argument. It will
[10:45:07] first of all tokenize all the words
[10:45:08] present in the sentence and then it will
[10:45:10] clean all the words. By cleaning I mean
[10:45:13] it will lowercase all the words and then
[10:45:15] it will remove all the stop words using
[10:45:17] this one-liner, and after removing them it will
[10:45:20] return us a list of all the words,
[10:45:22] lowercased and with stop words removed. The
[10:45:25] next step is to build our vocabulary.
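
The cleaning function just described, together with the vocabulary step that the narration walks through next, can be sketched as follows. This is not the notebook's code: to keep it runnable without NLTK data downloads, a small hard-coded stop-word set and a regular-expression tokenizer stand in for NLTK's stop-word corpus and `word_tokenize`. The names `STOP_WORDS`, `extract_words` and `tokenize_sentences` are assumed spellings of what is said aloud.

```python
import re

# Tiny stand-in stop-word list (assumption); the course uses NLTK's English stop words.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "and", "of", "to"}


def extract_words(sentence):
    """Tokenize a sentence, lowercase every word, and drop stop words."""
    # A simple regex tokenizer stands in for nltk.word_tokenize (assumption).
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    return [w.lower() for w in tokens if w.lower() not in STOP_WORDS]


def tokenize_sentences(sentences):
    """Build the vocabulary: the sorted unique cleaned words of a corpus."""
    words = []
    for sentence in sentences:
        words.extend(extract_words(sentence))  # add each cleaned word individually
    return sorted(set(words))                  # set() removes duplicates


corpus = ["The cat sat on the mat", "The dog ate the cat"]
print(extract_words(corpus[0]))
print(tokenize_sentences(corpus))
```

Sorting the vocabulary gives every word a stable position, which matters later when each position becomes one entry of a word count vector.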
[10:45:28] We'll iterate through every single
[10:45:29] sentence present in our corpus and then
[10:45:32] we'll take all the words and after
[10:45:34] taking all the words we'll only keep
[10:45:35] unique words. If a word is occurring 10
[10:45:37] times we'll only keep it once. For that
[10:45:40] we have defined a function called
[10:45:41] tokenize sentences. The argument of this
[10:45:44] function is sentences. So it will take
[10:45:46] multiple sentences which is also called
[10:45:48] corus. We have defined an empty list
[10:45:51] called words. So we'll iterate through
[10:45:54] every single sentence present in our
[10:45:56] sentences multiple sentences or in
[10:45:59] corpus and then we'll call our extract
[10:46:01] words function this function. So it will
[10:46:04] extract all the words by removing the
[10:46:06] stop words and by lowering all the words
[10:46:08] and it will store all those words in W
[10:46:11] and then we'll add or we'll extend this
[10:46:14] W to our words list. So when we use
[10:46:16] extend, every single
[10:46:18] element of this list will be added as an
[10:46:21] element to this words list. So after
[10:46:24] this loop is over our words list will
[10:46:26] contain all the words that are present
[10:46:28] in our whole corpus. And in the last
[10:46:31] part we have used the set function which
[10:46:33] will remove the duplicates and it will
[10:46:36] only keep distinct elements from our
[10:46:38] words list. And then we have converted
[10:46:41] this object to a list using the list
[10:46:44] function. And then we have used the
[10:46:45] sorted function. So it will sort it
[10:46:47] alphabetically from A to Z. And we'll
[10:46:50] again store it to words. This function
[10:46:53] will return us the words list which is
[10:46:55] actually the vocabulary which contains
[10:46:57] the unique words present in all of the
[10:46:59] text documents in our corpus. That was
[10:47:02] two steps. The first step was
[10:47:03] pre-processing where we tokenize the
[10:47:05] words and remove the stop words. Next
[10:47:07] step is building a vocabulary. And now
[10:47:09] the third step is to build a bag of
[10:47:12] words model, that is, create word count
[10:47:14] vectors. We have defined a function
[10:47:16] called bag of words which will take two
[10:47:18] arguments. The first argument is the
[10:47:20] sentence
[10:47:22] for which we will
[10:47:25] create a word count vector and the next
[10:47:27] argument is words which means
[10:47:29] vocabulary. The first task it will
[10:47:32] perform is to extract words. So it will
[10:47:35] call the function which is extract words
[10:47:37] and it will extract all the words
[10:47:39] present in this sentence by firstly
[10:47:42] lowering the words and then removing the
[10:47:44] stop words and it will store all those
[10:47:46] words in sentence words. And the next
[10:47:48] piece of code
[10:47:51] we have written here is np.zeros.
[10:47:55] This will create numpy array of zeros
[10:47:58] and the length of the array will be
[10:48:00] equal to the length of the words which
[10:48:02] is our vocabulary. For each word count
[10:48:05] vector the length will be the same because
[10:48:07] the length of every word count vector
[10:48:08] equals the length of the vocabulary. When we set the length to
[10:48:11] the length of words, it will be equal to the length
[10:48:12] of the vocabulary and it will create
[10:48:15] an array. So if we say that our vocabulary
[10:48:18] contains 50 words, the length of this
[10:48:20] array will be 50. In this array, there
[10:48:23] will be 50 zeros because it is np.zeros.
[10:48:26] It will create an array of zeros and
[10:48:28] we'll store that array in bag. Now we
[10:48:30] have our words in sentence words and an
[10:48:34] array which is stored in bag. Now we'll
[10:48:36] iterate through every word in this
[10:48:38] sentence words list and we'll check if
[10:48:41] that word is occurring in our vocabulary
[10:48:43] and how many times it is occurring.
[10:48:45] We'll use two loops here. So firstly
[10:48:47] we'll iterate through every single word
[10:48:49] that is present in our sentence words
[10:48:51] and after that we'll use another loop to
[10:48:53] check whether the word we are currently
[10:48:56] iterating over is actually
[10:48:59] present in the vocabulary or not. So if
[10:49:00] it is not present in the vocabulary this
[10:49:03] bag will be zero and if it is present we
[10:49:06] have said that for i. So this will be
[10:49:08] for each individual word starting from
[10:49:10] index zero to the last word in this
[10:49:13] words and then the word which is present
[10:49:16] in enumerate words. So when we use the
[10:49:18] enumerate function we'll add a counter
[10:49:20] to our words list. This words is
[10:49:23] actually our vocabulary. It will start
[10:49:25] with zero and it will go to the last
[10:49:27] word in our vocabulary. It will check
[10:49:29] whether the word in the vocabulary is
[10:49:31] equal to the word present in this list
[10:49:33] or not. And if it is true then it will
[10:49:35] increase the count of that particular
[10:49:37] word from 0 to one by adding one. So if
[10:49:40] this word occurs 10 times in this
[10:49:43] sentence words the value of that
[10:49:46] particular word in this array will be
[10:49:48] 10. We'll return that bag array as a
[10:49:50] numpy array back. So once we use the
[10:49:53] enumerate object this object will also
[10:49:56] be an enumerate. So to convert it back
[10:49:58] to an array, we'll use np.array and we'll
[10:50:01] pass our bag object to this function and
[10:50:04] then it will return us a numpy array of
[10:50:06] word count frequencies. So after that
[10:50:08] let's see an example to understand it
[10:50:10] better. So we have a corpus which
[10:50:13] contains five sentences. This one
[10:50:16] sentence is also called a document.
[10:50:18] There are five documents in this corpus.
[10:50:20] Based on this corpus we'll generate
[10:50:22] vocabulary. For generating a
[10:50:25] vocabulary we have defined a function
[10:50:27] tokenize sentences which takes
[10:50:30] sentences. Sentences means the corpus. There
[10:50:32] are multiple sentences, which is the corpus.
[10:50:35] We'll pass our corpus into it and then
[10:50:36] we'll store it as vocabulary. So once we
[10:50:38] print vocabulary we have this list which
[10:50:41] contains all the unique words present in
[10:50:44] our corpus, and all the stop words are
[10:50:47] removed. Now these words will be used to
[10:50:50] create word count vectors for each
[10:50:52] sentence that is present in our corpus.
[10:50:55] Now here it is the way that we use to
[10:50:57] create word count vectors for each
[10:50:59] sentence. So here is a vocabulary that
[10:51:01] our function has made. These are all the
[10:51:03] unique words present in our text corpus
[10:51:06] and then we use the bag of words
[10:51:08] function which will return a numpy array
[10:51:10] of word count vectors. So if you want to
[10:51:13] calculate the word count vector for this
[10:51:16] sentence, Max and Rob took the bus.
[10:51:18] We'll pass it to our bag of words
[10:51:20] function which takes two arguments that
[10:51:21] is sentence and a vocabulary. This is
[10:51:25] our vocabulary that we have used. We
[10:51:26] have stored it in vocabulary variable.
[10:51:29] Once we pass our sentence and a
[10:51:32] vocabulary, this bag of words function
[10:51:34] will return an array of the word count
[10:51:36] vectors. This is a word count vector for
[10:51:38] this particular sentence. So we can see
[10:51:40] that the word arrived. This is how we
[10:51:42] check. We look at the vocabulary and
[10:51:44] then we find each word if it occurs in
[10:51:47] our text or not. We'll check first of
[10:51:49] all the arrived. So if arrived appears
[10:51:51] in this text then we'll increase the
[10:51:53] count by one. If it doesn't appear in
[10:51:55] this text then
[10:51:57] we'll keep it as zero and then we'll move to
[10:51:58] the next element. Arrived doesn't appear
[10:52:01] so we'll keep it zero. Then we'll check
[10:52:02] the bus. So bus appears once here. So
[10:52:05] we'll keep it one. Then similarly for
[10:52:07] early I, John and late and then we see
[10:52:10] that only max will appear. So this is
[10:52:13] eighth word. Here you can see the eighth
[10:52:16] element is one. And similarly we'll
[10:52:18] check for all the elements. So if these
[10:52:20] words are present in the sentence. So
[10:52:22] for that particular index the value will
[10:52:24] be increased by one. Here you can see in
[10:52:26] the next sentence we have arrived. We'll
[10:52:28] check by seeing the first word that is
[10:52:30] arrived. So arrived appears once here.
[10:52:32] So we'll keep it one and then we'll
[10:52:34] check for bus. So bus appears twice
[10:52:36] here, once here and once here. So we'll
[10:52:39] keep it two. The good thing about the
[10:52:42] word count vectors is that no matter how
[10:52:44] short or long your sentences are,
[10:52:47] the word count vectors will have the
[10:52:49] same length because the length of a
[10:52:51] word count vector is equal to the length of the
[10:52:53] vocabulary that we have built. Now let's
[10:52:55] go to the Jupyter notebook to implement
[10:52:58] bag of words from scratch using Python.
[10:53:00] We'll implement it on a corpus and
[10:53:02] create word count vectors for it. So
[10:53:04] we'll start by importing all the
[10:53:06] required libraries import nltk first
[10:53:10] then we'll import numpy because we need
[10:53:13] numpy arrays import numpy as np this is
[10:53:18] a common alias used for numpy now we'll
[10:53:21] import our stop words from the NLTK corpus
[10:53:24] so we can remove stop words.
[10:53:30] After that, I'll create a stop words
[10:53:32] list which will contain all the unique
[10:53:34] stop words.
[10:53:37] Use set function which will only take
[10:53:39] the distinct unique words from our stop
[10:53:41] words.
[10:53:45] We only want the English stop words.
[10:53:47] I'll store my stop words in stop
[10:53:50] words list. Now we'll create our first
[10:53:52] function which will return us the clean
[10:53:55] text by first of all tokenizing and then
[10:53:57] removing all the stop words.
[10:54:01] The name of the function I've set is
[10:54:03] extract words which will take a single
[10:54:06] sentence as an argument and then it will
[10:54:08] return us the clean text. So I'll write
[10:54:10] clean text list. I'll define it in one
[10:54:13] line. First of all we'll lower the
[10:54:16] words.
[10:54:19] We'll tokenize the word. It will lower
[10:54:22] all the words after tokenizing.
[10:54:25] And now we'll remove all the stop words.
[10:54:30] It will return a list of all the words
[10:54:33] present in our sentence by removing
[10:54:35] first of all the stop words and then lowering,
[10:54:37] that is, making all the words lowercase, and
[10:54:39] then it will return us the list. This
[10:54:42] function will return us the clean text
[10:54:44] list.
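The cleaning function just described could be sketched roughly as follows. The video relies on NLTK's English stop-word list and an NLTK tokenizer; so that this runs without downloading NLTK data, the sketch swaps in a small hand-written stop-word set and a regex tokenizer. Both are assumptions made for illustration only.

```python
import re

# Assumption: a tiny hand-written set standing in for NLTK's English stop words.
STOP_WORDS = {"the", "and", "at", "for", "was", "i"}


def extract_words(sentence):
    """Tokenize a sentence, lowercase it, and drop stop words."""
    # Assumption: a simple regex tokenizer instead of an NLTK tokenizer.
    tokens = re.findall(r"\w+", sentence.lower())
    return [w for w in tokens if w not in STOP_WORDS]


print(extract_words("Max and Rob took the bus"))
# ['max', 'rob', 'took', 'bus']
```

Lowercasing before the stop-word check means "The" and "the" are treated the same way.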
[10:54:46] Now let us define a second
[10:54:48] function which will create a vocabulary
[10:54:50] for us. I'll set the name of the
[10:54:52] function as vocab. This function will
[10:54:54] take a whole corpus as an argument which
[10:54:57] will be a list of sentences or lists of
[10:55:00] documents. The first step will be to
[10:55:03] extract all the words that are present
[10:55:04] in our corpus. Let's create a list,
[10:55:07] first of all an empty
[10:55:11] list, which will contain all the unique
[10:55:13] words. So let's first create an empty
[10:55:15] list
[10:55:29] iterate through all the sentences
[10:55:31] present in our corpus and we'll call the
[10:55:34] function extract words. So it will only
[10:55:36] return the list of words which are
[10:55:38] cleaned. And once we have the list of
[10:55:40] words which are cleaned, we'll use the
[10:55:42] dot extend method of our list vocabulary
[10:55:44] to add that list of words as single
[10:55:47] words to our vocabulary list. Now after
[10:55:49] this loop is finished, we'll have all
[10:55:52] the words that are present in clean text
[10:55:54] in all the sentences. Now we have to
[10:55:56] find all the unique words. For that
[10:55:59] we'll use the set function. So first of
[10:56:02] all we'll find all the unique words from
[10:56:04] the above vocabulary that we have
[10:56:06] generated and then we'll create a list
[10:56:08] out of it. So it will be converted to a
[10:56:10] list and then we want to sort the
[10:56:12] elements alphabetically. So we'll use
[10:56:14] the sorted function.
[10:56:16] At the last our function will return us
[10:56:19] the vocabulary.
[10:56:21] In this function first of all we will
[10:56:22] declare an empty list which will contain
[10:56:24] all the unique words from all the
[10:56:26] text documents. And then it will call
[10:56:28] this function which will return the
[10:56:30] clean text
[10:56:32] for each document in corpus. And then
[10:56:34] we'll add that element or add those
[10:56:37] words to our vocabulary list. And then
[10:56:39] we use the set function to find the
[10:56:40] distinct elements, list to
[10:56:43] convert it into a list, and then sorted
[10:56:45] to sort the elements alphabetically. And
[10:56:47] then finally we return the list using
[10:56:50] this function. After creating
[10:56:51] vocabulary, our third function will be
[10:56:54] to create a bag of words model or to use
[10:56:56] this vocabulary to create word count
[10:56:58] vectors for each sentence or each
[10:57:00] document present in our corpus. Let's
[10:57:03] define the bag of words function. So I'll
[10:57:05] write it as bow. And this function takes
[10:57:07] two arguments. First one will be a
[10:57:09] sentence for which we want to create a
[10:57:12] word vector. Second argument will be the
[10:57:13] vocabulary which it will use to create
[10:57:15] the vector.
[10:57:19] Here first of all we'll take each
[10:57:21] sentence and tokenize it and find all
[10:57:23] the words present in the sentence or
[10:57:25] find all the clean words present in the
[10:57:27] sentence using this function. Let's call
[10:57:30] it words. So it will be we'll call the
[10:57:32] function extract words
[10:57:35] and we'll pass our sentence into it. So
[10:57:38] it will return us a list of all the
[10:57:39] clean words. Stop words will be removed
[10:57:41] and all the words will be in lower case.
[10:57:43] Now we'll create an array of numpy zeros
[10:57:46] that we'll use to create word count
[10:57:49] vectors for each text document in our
[10:57:51] corpus. Let's create a numpy array of
[10:57:54] zeros. So we'll store it in bag. So
[10:57:57] we'll call np.zeros and
[10:58:00] the length of this array will be
[10:58:02] equal to the length of our vocabulary.
[10:58:08] After creating a NumPy array of zeros, now
[10:58:10] we'll use a for loop for each sentence
[10:58:14] to make a word count vector using this
[10:58:17] numpy array of zeros. We'll iterate
[10:58:19] through every word that is present
[10:58:21] in our words list.
[10:58:23] This is a words list which contains all
[10:58:25] the words and then we'll iterate through
[10:58:28] every single word that is present in
[10:58:30] this list. Then we'll use another for
[10:58:32] loop to check whether the word is in
[10:58:34] vocabulary or not.
[10:58:38] This enumerate function will add a
[10:58:41] counter to our vocabulary which we'll
[10:58:43] use to iterate through every single
[10:58:45] element in the vocabulary and we'll
[10:58:46] check whether that element is equal to
[10:58:48] the word present in our list or not.
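The nested-loop check described here, together with the increment and return steps, could look roughly like the sketch below. The function name `bow`, the hand-written stop-word set, and the regex tokenizer are assumptions for illustration; the vocabulary is written out by hand so the block runs on its own.

```python
import re

import numpy as np

# Assumption: small stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "and", "at", "for", "was", "i"}


def extract_words(sentence):
    """Tokenize, lowercase, and remove stop words (regex tokenizer is an assumption)."""
    return [w for w in re.findall(r"\w+", sentence.lower()) if w not in STOP_WORDS]


def bow(sentence, vocabulary):
    """Return the word count vector of one sentence against a fixed vocabulary."""
    sentence_words = extract_words(sentence)
    bag = np.zeros(len(vocabulary))  # one slot per vocabulary word, all zeros
    for w in sentence_words:
        for i, word in enumerate(vocabulary):  # i is the position in the vocabulary
            if word == w:
                bag[i] += 1  # count one more occurrence of this word
    return np.array(bag)


vocabulary = ["bus", "john", "late", "looked", "max",
              "rob", "saw", "station", "took", "train"]
print(bow("Max and Rob took the bus", vocabulary))
```

Since `bag` is created by `np.zeros`, it is already a NumPy array; `enumerate` only yields index and word pairs and does not change its type, so the final `np.array(bag)` simply returns a copy.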
[10:58:51] We'll check if the word in vocabulary is
[10:58:54] equal to the word in our words list and
[10:58:57] if it is equal then we'll increase
[10:58:58] the zero corresponding to that word in
[10:59:01] this numpy array by one. For that
[10:59:04] particular word, we'll increase the
[10:59:06] numpy array of zeros by one. And at the
[10:59:08] end we will return the numpy array. So
[10:59:12] we have to convert our bag
[10:59:14] to an array. When
[10:59:17] we use the enumerate function, it
[10:59:19] will create an enumerate object. Then we
[10:59:21] have to convert that enumerate object to
[10:59:24] our numpy array. So the
[10:59:25] input for this
[10:59:27] function is our sentence and our
[10:59:28] vocabulary and it will return a word count
[10:59:31] vector for that particular sentence
[10:59:33] using this vocabulary. Now we have
[10:59:35] declared these three functions. Now
[10:59:37] let's declare one corpus to use these
[10:59:41] functions. I'll store my corpus in
[10:59:43] corpus. So I have copied these four
[10:59:46] sentences. So in actual we'll have four
[10:59:48] documents. So this first document can
[10:59:51] contain thousands of sentences and
[10:59:52] similarly for your other documents. We
[10:59:54] can pass these sentences in a corpus as a
[10:59:56] list and then we can use these functions
[10:59:59] to create word count vectors for each
[11:00:02] sentence that is present in our list.
[11:00:04] for each document. Now let's create this
[11:00:06] corpus and after that we'll create a
[11:00:08] vocabulary for this particular corpus.
[11:00:11] So I'll store the vocabulary in
[11:00:13] vocabulary and our function that creates
[11:00:15] vocabulary is this vocab function. I'll
[11:00:17] write vocab and then I'll pass this
[11:00:20] corpus and it will use this corpus to
[11:00:22] create and return a vocabulary
[11:00:27] once we run this line. Now let's print
[11:00:29] the vocabulary.
[11:00:32] So you can see these are the words which
[11:00:34] are treated as our vocabulary which we'll
[11:00:36] use to create word count vectors for each
[11:00:38] sentence. So you can see there is both,
[11:00:40] football, games, John, like, likes, Mary,
[11:00:43] movies and others. So you can see that
[11:00:45] this is one like this is another likes.
[11:00:48] So later on when we perform text
[11:00:50] classification we'll also use the
[11:00:51] stemming and lemmatization. We'll reduce
[11:00:53] these words to a single word, that is
[11:00:55] like, so that we only have one word because
[11:00:57] they mean the same thing. Now after
[11:00:59] creating a vocabulary now we'll use this
[11:01:00] vocabulary to create a word count vector
[11:01:03] for each of these sentences. Our word
[11:01:06] count vector function is bow which will
[11:01:08] take a sentence and vocabulary and
[11:01:10] return a word count vector. So let's use
[11:01:12] this function. So I'll call this
[11:01:13] function and then I'll pass the
[11:01:15] sentence. So let's create word count
[11:01:17] vector for this sentence
[11:01:20] and then the second argument will be our
[11:01:23] vocabulary which is stored in vocabulary
[11:01:28] once I run this so I'll get word count
[11:01:30] vector for this particular sentence so
[11:01:32] in our vocabulary, both does not
[11:01:35] occur in this so it will be zero there
[11:01:37] is football you can see it occurs two
[11:01:40] times the word count value will be two
[11:01:42] word frequency is two and games occurs
[11:01:44] one time The word frequency is one and
[11:01:47] similarly for others. This array
[11:01:49] represents the word frequency according
[11:01:51] to our vocabulary. We can print this for
[11:01:53] each sentence that is present in our
[11:01:55] text documents. And once we have these
[11:01:57] arrays, then we can use these arrays
[11:01:59] as an input to our machine learning
[11:02:01] models for text classification. Bag of
[11:02:03] words model in Python. The scikit-learn library
[11:02:06] provides lots of inbuilt functions for
[11:02:08] implementing machine learning models and
[11:02:10] for text preprocessing or data
[11:02:12] preprocessing and also to calculate
[11:02:14] different metrics about our model. It
[11:02:16] includes our bag of words functions as
[11:02:19] well. We have a class known as
[11:02:22] CountVectorizer which we'll use to implement
[11:02:24] bag of words model. It works on term
[11:02:26] frequency, which is also called the
[11:02:28] word count frequency. It will count the
[11:02:31] occurrences of words or tokens and it will
[11:02:33] make a sparse matrix. So it will make a
[11:02:35] matrix of zeros, together with the
[11:02:38] word counts of every single
[11:02:41] word that occurs in our document or in
[11:02:43] our corpus. We'll use the CountVectorizer
[11:02:46] class to implement bag of words
[11:02:47] model and we'll import this class from
[11:02:50] the feature extraction module
[11:02:53] of the scikit-learn
[11:02:54] library and then we'll initialize this
[11:02:56] class and we'll train the object of this
[11:02:58] class to create word count vectors or
[11:03:01] term frequencies of all the documents
[11:03:03] present in our corpus. Instead of making
[11:03:06] different functions that we did earlier,
[11:03:08] we'll not make functions now. We'll just
[11:03:10] use the CountVectorizer class and we
[11:03:12] just have to pass our corpus into it and
[11:03:15] it will automatically create a
[11:03:17] vocabulary and create word count vectors
[11:03:19] for us. So to implement bag of words using
[11:03:22] CountVectorizer, first of all we'll
[11:03:24] import numpy as np and then we will
[11:03:26] import our CountVectorizer class which
[11:03:28] will create word count vectors. So we
[11:03:31] will import it from
[11:03:33] sklearn.feature_extraction.text and
[11:03:35] then we have imported pandas so we can
[11:03:37] visualize the results as a data frame
[11:03:40] and then I have declared a corpus this
[11:03:42] time we have to declare it as an array
[11:03:45] so that you can feed it to our
[11:03:47] CountVectorizer. We have a corpus which contains
[11:03:50] five text documents which is John saw
[11:03:53] the train the train was late Max and Rob
[11:03:56] took the bus I looked for Max and Rob at
[11:03:58] the bus station and the fifth sentence
[11:04:00] these are the five sentences then we'll
[11:04:02] use our count vector izer class to
[11:04:06] instantiate first of all an object. Once
[11:04:08] we create an object of this class then
[11:04:10] we can use this object to pass our
[11:04:13] corpus and it will create the vocabulary
[11:04:15] and everything. So it will represent
[11:04:17] directly the word count vectors for each
[11:04:20] text or each sentence or each document
[11:04:22] present in our corpus. Once we
[11:04:24] instantiate an object of the CountVectorizer
[11:04:26] class, after instantiating we'll
[11:04:29] use the fit_transform method. So it will
[11:04:32] first of all fit our data to our object
[11:04:34] then it will transform our data in word
[11:04:37] count vectors. So we'll store it in bag
[11:04:39] of words variable. Once we have a bag of
[11:04:42] words variable so it will be a sparse
[11:04:44] matrix which means a matrix of zeros. So
[11:04:46] we'll convert it into an array
[11:04:49] using the toarray
[11:04:52] method. So you can see we have our
[11:04:54] array. So this array contains the word
[11:04:57] count vectors, which we also call
[11:04:59] count vectors, for each sentence that
[11:05:01] is present in our text. The first vector
[11:05:04] the first element of this array is word
[11:05:07] count vector for the first sentence that
[11:05:09] is John saw the train. So this time we
[11:05:11] don't have to manually build a
[11:05:14] vocabulary or manually define a function
[11:05:16] which will find a vocabulary for us.
[11:05:17] So it will
[11:05:19] be automatically done by the CountVectorizer
[11:05:21] class. So you just
[11:05:23] have to use the fit_transform method and
[11:05:25] pass the text data. So once we
[11:05:27] pass our text data, we get a feature matrix
[11:05:29] which will contain the word count
[11:05:31] vectors for every text document or
[11:05:33] sentence present in our corpus. So now
[11:05:36] if you want to get the feature names
[11:05:38] which is also the vocabulary. So here
[11:05:40] this is our vocabulary which is stored
[11:05:42] in feature names. To get
[11:05:44] feature names we just have to use this
[11:05:47] object and the get_feature_names method
[11:05:50] of this object. Once we write this
[11:05:52] we'll store the feature names in feature
[11:05:53] names. When we print it, this is our
[11:05:55] vocabulary, which are called features in
[11:05:58] the CountVectorizer class. These are the
[11:06:00] features that our CountVectorizer
[11:06:02] class uses to make these word count
[11:06:04] vectors. Now if you want to represent
[11:06:06] every sentence with a word count vector
[11:06:08] as a data frame. So we'll just use
[11:06:10] pandas DataFrame. So pd.DataFrame,
[11:06:13] then we'll write the bag of words toarray(),
[11:06:15] this array which contains all the
[11:06:18] word count vectors for each sentence and
[11:06:20] then the columns are the feature names.
[11:06:22] This is our vocabulary and according to
[11:06:24] this vocabulary which are columns each
[11:06:26] sentence is represented. So this is our
[11:06:29] first sentence. This is the second,
[11:06:30] third, fourth and fifth sentence. This
[11:06:32] is the word count vector for the first
[11:06:34] sentence. This is for the second
[11:06:35] sentence. This is for third, fourth and
[11:06:37] fifth sentence. So this is printed as a
[11:06:40] data frame. Now let's go to the Jupyter
[11:06:42] notebook and implement bag of words
[11:06:44] model using the CountVectorizer class. We
[11:06:46] will start by importing the
[11:06:48] CountVectorizer class from
[11:06:51] sklearn.feature_extraction.text,
[11:06:52] from sklearn.
[11:07:06] After this I'll import both NumPy and
[11:07:09] pandas.
[11:07:14] After importing these three
[11:07:17] libraries, now let us define corpus that
[11:07:20] we'll use for creating count vectors.
[11:07:24] I'll take this corpus from here.
[11:07:31] Now we have a corpus. So our next step
[11:07:33] is to create a CountVectorizer object.
[11:07:36] So we'll store it in count vectorizer:
[11:07:40] CountVectorizer.
[11:07:45] Once we create this object now we have
[11:07:47] to use the fit transform method of this
[11:07:50] object on our corpus. So we get our word
[11:07:53] count vectors or our count vectors. I'll
[11:07:56] store my count vectors in bag of words.
[11:08:01] We have to use the object that we have
[11:08:03] created. It is count vectorizer and
[11:08:05] inside you have to use the fit transform
[11:08:08] method
[11:08:11] of this object and then we have to pass
[11:08:13] our corpus. So word count vectors will
[11:08:16] be created and stored in bag of words.
[11:08:18] Now after that we have to print our word
[11:08:21] count vectors as an array.
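The fit and transform steps above might be sketched like this. The notebook's own four-sentence corpus is not quoted in full in the transcript, so the sketch reuses the four sentences quoted from the slides; the variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: the four sentences quoted from the slides, not the notebook's corpus.
corpus = [
    "John saw the train",
    "The train was late",
    "Max and Rob took the bus",
    "I looked for Max and Rob at the bus station",
]

count_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a SciPy sparse matrix of counts.
bag_of_words = count_vectorizer.fit_transform(corpus)

# toarray() converts the sparse matrix into a dense NumPy array, one row per document.
print(bag_of_words.toarray())
```

The matrix is sparse in the sense that only the non-zero counts are stored; most entries of a word count matrix are zero, which is why this format saves memory.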
[11:08:25] Once we print this you can see
[11:08:27] we have four sentences in our corpus. So
[11:08:29] for each sentence it has printed a word
[11:08:32] count vector. Now let's look at
[11:08:34] the vocabulary, which is called
[11:08:36] feature names in CountVectorizer. Let's
[11:08:40] extract the feature names from our count
[11:08:43] vectorizer object.
[11:08:46] We have to use the get_feature_names
[11:08:49] method of this object.
[11:08:53] Now we'll have our feature names. Now
[11:08:55] let's print our feature names.
[11:08:59] If I print my feature names, this is the
[11:09:01] vocabulary, as we call it in the bag of words
[11:09:03] model; here we call it feature names. So
[11:09:05] you can see we have and, both, but, does,
[11:09:07] football, games, like, likes, Mary; there
[11:09:10] are more feature names. So if you
[11:09:11] compare it to our vocabulary you can see
[11:09:13] that and is not there, because and is
[11:09:16] a stop word and we have not removed any
[11:09:18] stop words here. In order to remove the
[11:09:20] stop words we can use the CountVectorizer
[11:09:22] class, which has an argument called
[11:09:25] stop words. So we'll set it equal to the
[11:09:28] stop words list that we have created
[11:09:30] here. I'll set it equal to the stop
[11:09:32] words. It will remove the stop words.
[11:09:36] Once we remove the stop words. Now you
[11:09:37] can see our array has changed a little
[11:09:39] bit. If I print my feature names, and and
[11:09:41] both, which are both stop words, are
[11:09:43] removed. Now this is our vocabulary
[11:09:44] which is called feature names. And
[11:09:46] according to these feature names we have
[11:09:48] the word count vectors of each text or
[11:09:51] each sentence that is present in our
[11:09:53] corpus. Now let's print our array of
[11:09:57] word count vectors as a data frame. We
[11:10:00] use pandas data frame.
[11:10:03] Here we have to pass our array that we
[11:10:05] want to view as a data frame. And then
[11:10:08] the columns of our data frame will be
[11:10:10] our feature names.
[11:10:12] So if I print this now you can see these
[11:10:15] are our feature names and these are the
[11:10:16] word count vectors for each sentence
[11:10:18] that is present in our text corpus. So
[11:10:22] this is our corpus and these are the four
[11:10:24] sentences. This is the word count vector
[11:10:26] for the first sentence word count vector
[11:10:28] for the second and third and fourth.
[11:10:31] This represents the word count of each
[11:10:34] word that is present in the vocabulary.
[11:10:35] So football occurs two times in this
[11:10:38] third sentence and games occurs one time
[11:10:41] in the third sentence. So here in count
[11:10:43] vectorizer we have used the word count
[11:10:46] vector. So we'll represent each sentence
[11:10:48] with all the words as their
[11:10:51] word counts. This is the word count
[11:10:53] representation of our text.
[11:10:55] >> Keeping up with the AI advancements can
[11:10:57] be exhausting. One day it's a new
[11:10:58] chatbot that can write poetry and the
[11:11:00] next it's a model predicting stock
[11:11:02] trends. According to a report from the
[11:11:04] World Economic Forum, 85 million jobs
[11:11:07] are going to be replaced by AI by 2025.
[11:11:09] But at the same time, 97 million new
[11:11:12] roles will be created in areas like AI
[11:11:14] development, data science, and human AI
[11:11:16] collaboration. So, how do you stay ahead
[11:11:18] in this rapidly changing data-driven
[11:11:20] world? By building real hands-on AI
[11:11:23] projects. Working on projects not only
[11:11:25] helps you understand AI concepts better,
[11:11:27] but also prepares you for real world
[11:11:29] challenges. It's the best way to learn,
[11:11:31] grow, and make your skills future proof.
[11:11:33] I'm sharing 10 AI project ideas, five
[11:11:35] for beginners and five for advanced.
[11:11:37] Designed to give you the practical
[11:11:38] experience you need. Whether you're just
[11:11:41] starting or ready to tackle the big
[11:11:42] stuff, these projects will help you
[11:11:44] build something meaningful while also
[11:11:46] making your resume stand out in this
[11:11:47] competitive field. So the very first
[11:11:50] project idea that we'll be discussing is
[11:11:52] the product recommendation system. This
[11:11:54] is a system used by companies like
[11:11:56] Amazon, Netflix, and Spotify to suggest
[11:11:58] products, movies or songs based on what
[11:12:00] users like. A product recommendation
[11:12:02] system is an algorithm that analyzes
[11:12:04] user behavior to suggest products that
[11:12:07] they might like. For example, the system
[11:12:09] or the algorithm might recommend phone
[11:12:10] cases or screen protectors if someone
[11:12:12] buys a phone. This makes shopping more
[11:12:15] personalized and engaging. An
[11:12:16] interesting fact before you dive into
[11:12:18] this project: 35% of Amazon's sales come
[11:12:21] through the product recommendation engine
[11:12:23] itself. This tells you how valuable this
[11:12:26] project would be if you were to make a
[11:12:27] career in any e-commerce company. The
[11:12:29] first step in this project is to gather
[11:12:31] and pre-process data such as user
[11:12:33] ratings, browser history, or product
[11:12:35] descriptions. You can find a sample data
[11:12:37] at Kaggle or the UCI repository. You'll then
[11:12:40] use machine learning techniques to
[11:12:42] develop the recommendation model.
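A minimal sketch of what such a recommendation model can look like, assuming a tiny hand-made ratings table. The users, items, and the `recommend` helper are hypothetical and not from the video. It scores items a user has not rated yet by the ratings of the most similar users, which is the idea behind collaborative filtering. It is written in plain Python to show the mechanics; in a real project a library such as scikit-learn would typically replace the hand-written similarity code.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"phone": 5, "phone case": 4, "screen protector": 5},
    "bob": {"phone": 5, "phone case": 5},
    "carol": {"novel": 5, "bookmark": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, k=2, top_n=3):
    """Suggest items the k most similar users rated that `user` has not."""
    target = ratings[user]
    neighbours = sorted(
        ((cosine_similarity(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, name in neighbours:
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in ratings[name].items():
            if item not in target:
                # weight each neighbour's rating by how similar they are
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("bob", ratings))
```

Here Bob's ratings overlap only with Alice's, so the one item she rated that he has not is the suggestion. Carol shares no items with anyone, so she gets no recommendations, which is the usual cold-start limitation of this approach.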
[11:12:44] Popular methods would include
[11:12:45] collaborative filtering where you
[11:12:46] recommend items based on what similar
[11:12:49] users have liked or content based
[11:12:51] filtering where you recommend items
[11:12:52] similar to what a user has shown
[11:12:54] interest in before. Now the tech stack
[11:12:57] for this project would include Python as
[11:12:59] the main programming language with
[11:13:00] libraries like pandas for data handling,
[11:13:02] scikit-learn for building machine learning
[11:13:04] models, KNN for collaborative filtering
[11:13:06] or decision trees for content-based
[11:13:08] filtering. You can also use Flask for
[11:13:11] building a simple web app to display the
[11:13:13] recommendations to users. Completing
[11:13:15] this project will give you hands-on
[11:13:17] experience in machine learning, data
[11:13:18] manipulation, and web development. From
[11:13:21] a résumé perspective, this project is a
[11:13:23] great addition because product
[11:13:24] recommendation systems are widely used
[11:13:26] in many industries like e-commerce,
[11:13:28] entertainment like Netflix or Spotify,
[11:13:30] and online services. Now, the second AI
[11:13:33] project idea is cancer disease
[11:13:35] detection. This project uses artificial
[11:13:37] intelligence to help detect cancer at an
[11:13:39] early stage which can be crucial for
[11:13:41] saving lives. Now, in order to get
[11:13:43] started, you will need a data set that
[11:13:44] contains labeled medical images of
[11:13:46] cancerous and non-cancerous cells. There
[11:13:49] are several data sets available online
[11:13:51] like The Cancer Imaging Archive (TCIA) that
[11:13:54] can help you with this. Once you have
[11:13:56] your data, the next step is to
[11:13:57] pre-process it. For medical images, this
[11:13:59] might involve converting them into
[11:14:01] numerical arrays or normalizing them to
[11:14:04] ensure consistency. For the tech stack,
[11:14:06] you will be using a combination of
[11:14:08] Python and deep learning libraries. The
[11:14:10] most popular deep learning framework is
[11:14:11] TensorFlow or PyTorch. Now, these
[11:14:14] libraries are great for building and
[11:14:15] training machine learning models. You
[11:14:17] will also work with CNNs which are
[11:14:19] specifically designed for image
[11:14:21] recognition tasks like this one. CNNs
[11:14:23] analyze patterns in the medical images
[11:14:25] and then classify them as either
[11:14:27] cancerous or not. Now to handle the data
[11:14:29] you can use pandas and NumPy. You can
[11:14:32] also use Flask or Streamlit to create
[11:14:33] a simple web app for your algorithm. This
[11:14:35] way doctors can upload images and your
[11:14:37] AI system will analyze them and give
[11:14:39] predictions on whether the image shows
[11:14:41] signs of cancer or not. Now what's the
[11:14:43] benefit of working on this project? By
[11:14:46] working with the real world medical data
[11:14:48] and applying deep learning techniques,
[11:14:49] you'll develop technical skills in AI,
[11:14:51] machine learning and data processing.
[11:14:53] Now, this can prove to be highly
[11:14:55] valuable for your resume. Completing
[11:14:57] this project will demonstrate that you
[11:14:59] can work with large data sets, develop
[11:15:00] models, and even deploy AI systems,
[11:15:02] which are skills in high demand. Now,
[11:15:04] moving ahead to the third project idea.
[11:15:07] Businesses and consumers rely heavily on
[11:15:09] online reviews to make decisions.
[11:15:11] However, reading through thousands of
[11:15:13] reviews to understand customer opinions
[11:15:15] can be time-consuming. This is where
[11:15:17] sentiment analysis comes in. It's a
[11:15:19] technique in artificial intelligence
[11:15:21] that can automatically determine whether
[11:15:23] a review is positive, negative, or
[11:15:25] neutral. Through this, businesses are
[11:15:27] able to quickly gain valuable insights
[11:15:29] into customer feedback. The project will
[11:15:32] begin with collecting product review
[11:15:34] data which is publicly available from
[11:15:36] sources like Amazon or Yelp and then
[11:15:38] pre-process it. For the model, we will
[11:15:40] use machine learning techniques,
[11:15:42] specifically natural language processing
[11:15:44] to understand and analyze the text.
[11:15:47] Initially, a simple model using
[11:15:49] techniques like bag of words or TF-IDF
[11:15:51] combined with a classifier like logistic
[11:15:53] regression can be implemented. Later,
[11:15:56] you can enhance the model using deep
[11:15:57] learning with RNNs or LSTM, which is
[11:16:00] long short-term memory for better
[11:16:03] accuracy, especially for complex
[11:16:05] sentences.
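The simple starting model described above can be sketched as follows: a bag-of-words representation fed into logistic regression. The six toy reviews, the learning rate, and the helper names are assumptions made for illustration. The classifier is written in plain Python to show the mechanics; in practice scikit-learn's vectorizers and `LogisticRegression` would usually be used instead.

```python
import math
from collections import Counter

# Hypothetical toy reviews; 1 = positive, 0 = negative
reviews = [
    ("great phone love it", 1),
    ("excellent quality great value", 1),
    ("love the battery excellent", 1),
    ("terrible phone hate it", 0),
    ("poor quality terrible value", 0),
    ("hate the battery poor", 0),
]

# Vocabulary: every distinct word seen in the training reviews
vocab = sorted({word for text, _ in reviews for word in text.split()})


def bag_of_words(text):
    """Map a review to a vector of word counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def predict_proba(weights, bias, features):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, label in data:
            x = bag_of_words(text)
            error = predict_proba(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


weights, bias = train(reviews)


def classify(text):
    """Label a review as positive or negative (no neutral class here)."""
    p = predict_proba(weights, bias, bag_of_words(text))
    return "positive" if p >= 0.5 else "negative"


print(classify("great quality love it"))
print(classify("terrible battery hate it"))
```

Words that were never seen in training are simply ignored, and word order is lost, which is one reason sequence models such as RNNs or LSTMs can do better on complex sentences.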
[11:16:06] On completing this project, you will not
[11:16:08] only boost your understanding of NLP and
[11:16:11] machine learning techniques, but also
[11:16:12] make your resume stand out. This project
[11:16:14] will demonstrate your ability to handle
[11:16:16] both data preprocessing and model
[11:16:18] building, skills that are highly sought
[11:16:20] after in AI and data science roles.
[11:16:22] Additionally, it's a great way to get
[11:16:24] hands-on experience with Python and
[11:16:25] machine learning frameworks, both of
[11:16:27] which are valuable in the tech industry.
[11:16:29] Moving on to the fourth project idea,
[11:16:31] job seekers today face a major
[11:16:33] challenge: getting their résumés noticed
[11:16:35] by recruiters. Many companies use an
[11:16:38] applicant tracking system or ATS to
[11:16:40] filter résumés before they even reach a
[11:16:42] human recruiter. These systems scan résumés
[11:16:45] for specific keywords, job titles,
[11:16:47] skills, and qualifications. If a résumé
[11:16:50] doesn't match these criteria, it may
[11:16:51] never get seen by a hiring manager. Now,
[11:16:54] the goal of this project is to build a
[11:16:56] system that can automatically parse and
[11:16:58] extract important information from
[11:16:59] résumés. This makes it easier for ATS to
[11:17:02] process résumés and for recruiters to
[11:17:04] find the most qualified candidates. The
[11:17:06] résumé parser should be able to extract
[11:17:09] details such as the candidate's name,
[11:17:11] contact information, skills, education,
[11:17:13] work experience, and certifications from
[11:17:15] various résumé formats like PDF, DOCX, and
[11:17:18] plain text. It will also analyze and
[11:17:20] categorize this information, helping job
[11:17:22] seekers optimize their resumes for ATS.
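As a rough sketch of the extraction step, assuming the résumé has already been converted to plain text: the regular expressions, the `KNOWN_SKILLS` list, and the `parse_resume` helper below are illustrative assumptions, not a production parser. It only pulls out an email address, a phone number, and known skills. Names, education, and work experience generally need an entity recognition model rather than simple patterns.

```python
import re

# Hypothetical skill list; a real parser would use a much larger vocabulary
KNOWN_SKILLS = {"python", "sql", "machine learning", "nlp", "docker"}

# Deliberately simple patterns; real-world contact formats vary a lot
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def parse_resume(text):
    """Extract contact details and known skills from plain résumé text."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    lowered = text.lower()
    # Match each skill as a whole word or phrase, case-insensitively
    skills = sorted(
        skill for skill in KNOWN_SKILLS
        if re.search(r"\b" + re.escape(skill) + r"\b", lowered)
    )
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
        "skills": skills,
    }


sample = """Jane Doe
jane.doe@example.com | +1 555 010 9999
Skills: Python, SQL, Machine Learning
"""
print(parse_resume(sample))
```

Returning a plain dictionary keeps the output easy to store or compare against the keywords in a job description later on.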
[11:17:26] In order to get started with this
[11:17:27] project, we'll use natural language
[11:17:29] processing or NLP to extract and
[11:17:32] understand the content of résumés. For
[11:17:34] parsing different formats, you can go
[11:17:36] ahead and use libraries like PDFMiner
[11:17:38] or python-docx in Python. You can then employ
[11:17:41] machine learning models for keyword
[11:17:42] extraction, entity recognition and
[11:17:45] classification to identify and organize
[11:17:47] this information. There are also
[11:17:49] libraries such as spaCy or NLTK that can
[11:17:52] be used for text processing. Once that
[11:17:54] is done, you might also want to
[11:17:56] integrate a simple scoring system that
[11:17:58] ranks résumés based on the match to a job
[11:18:00] description. Completing this project
[11:18:02] adds immense value to your resume and
[11:18:04] skill set. You'll gain hands-on
[11:18:06] experience in working with text data
[11:18:08] using libraries and tools commonly used
[11:18:10] in industry for NLP tasks. You'll also
[11:18:13] understand how ATS systems work and
[11:18:15] learn about their role in modern hiring
[11:18:17] processes. This project will strengthen
[11:18:19] your proficiency in Python, machine
[11:18:21] learning, and NLP, making you a more
[11:18:24] attractive candidate for roles in data
[11:18:26] science, AI, and software development.
[11:18:28] Plus, it will give you a practical
[11:18:29] project to showcase, demonstrating your
[11:18:31] ability to solve real world problems and
[11:18:33] adding a highly relevant skill to your
[11:18:35] resume. Moving on to the fifth project
[11:18:37] idea. This project focuses on a world
[11:18:40] where machines can see and interpret the
[11:18:42] environment just like humans. Real-time
[11:18:44] object detection is all about teaching
[11:18:46] computers to identify and track objects
[11:18:48] in live video feeds such as recognizing a
[11:18:51] car, a pedestrian, or even a stray
[11:18:53] animal crossing the road. The problem
[11:18:55] this project addresses is simple yet
[11:18:57] impactful. In many industries, from
[11:18:59] autonomous driving to security systems,
[11:19:01] there's a need for technology that can
[11:19:03] detect and respond to objects in real
[11:19:05] time. Without it, processes remain slow
[11:19:08] and manual. And in some cases, this
[11:19:10] delay could even lead to accidents or
[11:19:11] inefficiencies. This project allows you
[11:19:14] to bridge that gap by creating a system
[11:19:16] capable of detecting objects instantly,
[11:19:18] enhancing safety and efficiency. You'll
[11:19:21] begin by collecting a data set of images
[11:19:23] with labeled objects such as COCO or
[11:19:26] Pascal VOC. These labels help train your
[11:19:28] model to recognize different items. Then
[11:19:31] you can go ahead and use a pre-trained
[11:19:33] deep learning model like YOLO (You Only
[11:19:35] Look Once) or SSD (Single Shot Detector),
[11:19:38] which are designed for speed and
[11:19:40] accuracy. With frameworks like
[11:19:42] TensorFlow or PyTorch, you'll fine-tune
[11:19:44] the model, feeding it data and teaching
[11:19:46] it to detect objects in real-time video
[11:19:48] streams from a webcam or CCTV camera.
[11:19:52] Finally, you'll integrate OpenCV for
[11:19:54] video processing and visualization.
[11:19:57] The tech stack includes Python for coding,
[11:19:59] TensorFlow or PyTorch for building the
[11:20:01] AI model, OpenCV for handling
[11:20:03] video feeds, and a GPU to process the
[11:20:06] data quickly. You'll also use libraries
[11:20:08] like NumPy and pandas to manage the data.
[11:20:11] For your resume, this project will
[11:20:12] showcase your expertise in computer
[11:20:14] vision, machine learning, and handling
[11:20:16] real-time data, which are sought after
[11:20:18] skills in the AI job market. On the
[11:20:20] other hand, for organizations, the
[11:20:22] benefit is enormous, whether it's
[11:20:24] enhancing surveillance systems,
[11:20:25] improving vehicle automation, or
[11:20:28] optimizing industrial workflows. Moving
[11:20:30] on to the sixth project idea. For a
[11:20:32] business that receives customer queries
[11:20:34] day and night, hiring people to answer
[11:20:36] these questions 24/7 can prove to be
[11:20:39] expensive and sometimes inefficient.
[11:20:41] This is where chatbots can help. A
[11:20:43] chatbot is an AI powered assistant that
[11:20:46] can understand customer questions and
[11:20:48] provide instant answers, saving time and
[11:20:50] money while improving customer
[11:20:52] satisfaction. To build this project,
[11:20:54] you'll start by defining the purpose of
[11:20:55] your chatbot. Will it assist with
[11:20:57] customer support, book appointments, or
[11:20:59] maybe help users navigate a website?
[11:21:02] Once you figure out the main category of
[11:21:04] support, you'll need to gather or create
[11:21:06] a data set of common questions and
[11:21:08] answers related to your topic. Next, you
[11:21:10] use tools like Python and frameworks
[11:21:12] like Flask or FastAPI to create a
[11:21:14] conversational interface. For the AI
[11:21:17] brain, you can use pre-trained models
[11:21:19] from libraries like Hugging Face, which
[11:21:21] help the chatbot understand and respond
[11:21:23] to text. The back end might use an NLP
[11:21:26] model like BERT or GPT. These models
[11:21:28] analyze the input text, understand its
[11:21:30] meaning, and generate appropriate
[11:21:32] replies. If you're making a more
[11:21:34] advanced chatbot, you could connect it
[11:21:36] to a database or APIs to retrieve
[11:21:39] specific information like product
[11:21:40] availability or order status. Once the
[11:21:43] chatbot is built, you can deploy
[11:21:45] it on a website or messaging platform
[11:21:48] using tools like Heroku or AWS. This
[11:21:50] project is a fantastic addition to your
[11:21:52] resume. It shows your ability to solve
[11:21:54] real world problems using AI to automate
[11:21:57] processes and build systems that improve
[11:21:59] user experience. By completing this
[11:22:01] project, you'll demonstrate your skills
[11:22:03] in Python, NLP, and deploying AI systems
[11:22:06] which are in high demand. Plus, it gives
[11:22:08] you a practical example to discuss
[11:22:10] during interviews and sets you apart
[11:22:12] as a problem solver ready to tackle real
[11:22:14] business challenges. Moving on to the
[11:22:15] seventh project idea. Now, if you're a
[11:22:18] developer working on a large project
[11:22:20] with multiple team members,
[11:22:22] understanding someone else's code is
[11:22:23] like solving a puzzle, especially when
[11:22:25] the code lacks clear documentation.
[11:22:28] Writing documentation is tedious and
[11:22:30] time-consuming, which is why most
[11:22:32] coders neglect it. This is a common
[11:22:34] problem often leading to
[11:22:36] miscommunication, bugs, and slower
[11:22:38] onboarding of new developers. A code
[11:22:40] documentation generator is a tool
[11:22:42] designed to bridge this gap. This
[11:22:44] project aims to automate the process
[11:22:47] of creating detailed and accurate code
[11:22:49] documentation. You will build an
[11:22:51] algorithm that will analyze the
[11:22:52] structure, functions, and logic of the
[11:22:54] codebase and generate human-readable
[11:22:57] explanations for each part. To build
[11:22:59] this project, you can use technologies
[11:23:01] like Python and Hugging Face
[11:23:03] Transformers. A model like GPT-4 can be
[11:23:06] fine-tuned to understand code and
[11:23:07] generate meaningful descriptions. You'll
[11:23:10] also work with frameworks like
[11:23:12] LangChain for chaining tasks and FastAPI to
[11:23:14] deploy the tool as a web service. You
[11:23:17] can also go ahead and add a front-end
[11:23:19] interface using Streamlit that will
[11:23:21] make the algorithm user-friendly and
[11:23:23] visually appealing. By completing this
[11:23:25] project, you'll gain hands-on experience
[11:23:27] with advanced AI concepts like working
[11:23:29] with large language models and deploying
[11:23:31] them in real world applications while
[11:23:33] also highlighting your understanding of
[11:23:35] generative AI. The next project idea
[11:23:37] that I'll be talking about has been
[11:23:39] inspired by Google's revolutionary
[11:23:41] cotton project pest monitoring
[11:23:44] application. This is basically an AI
[11:23:46] application that helps farmers protect
[11:23:48] their crops by guiding them on aspects
[11:23:50] like the best time to spray pesticides.
[11:23:53] Pests are a huge challenge for farmers.
[11:23:55] They damage crops, lower yields, and
[11:23:57] increase the cost of farming due to the
[11:23:59] need for pesticides. Many farmers don't
[11:24:01] realize there's an infestation until
[11:24:03] it's too late, leading to significant
[11:24:06] financial losses. This is where
[11:24:08] technology, especially AI, can step
[11:24:10] up and make a big difference. The
[11:24:12] application will use image recognition
[11:24:14] to identify and track pest infestations
[11:24:17] in real time. Farmers can upload photos
[11:24:20] of their crops, and the app will analyze
[11:24:22] the images to detect pests, classify
[11:24:24] them, and provide actionable
[11:24:26] recommendations such as whether to use
[11:24:28] pesticides or take other measures. To
[11:24:30] get started, you'll need a data set of
[11:24:31] pest images and healthy crop photos.
[11:24:34] Then you will need to train a machine
[11:24:36] learning model like a CNN using frameworks
[11:24:39] like TensorFlow or PyTorch. Then you can
[11:24:41] go ahead and add a user-friendly
[11:24:43] interface, perhaps a mobile app where
[11:24:45] farmers can upload their images and
[11:24:47] receive instant feedback. You can even
[11:24:49] incorporate weather data or other
[11:24:51] regional trends to make the app more
[11:24:53] accurate. For this project, you'll use
[11:24:56] tools like TensorFlow or PyTorch for the
[11:24:58] AI model, OpenCV for image processing,
[11:25:00] and a cloud platform like AWS or Google
[11:25:03] Cloud for deployment. The mobile app can
[11:25:05] be built using Flutter or React Native
[11:25:08] to ensure it works on multiple devices.
[11:25:10] You could also add APIs for weather
[11:25:12] forecasting to give farmers extra
[11:25:14] insights. This project is a great
[11:25:16] addition to your resume. Employers value
[11:25:18] candidates who can demonstrate technical
[11:25:20] skills with practical applications.
[11:25:22] Something that brings a change in the
[11:25:24] everyday lives of the people. Now moving
[11:25:26] on to the ninth project idea. Consider a
[11:25:28] situation where a marketing manager
[11:25:30] needs to come up with engaging content
[11:25:32] for products in an organization. Every
[11:25:35] campaign is going to require hours of
[11:25:37] brainstorming, research, and editing to
[11:25:39] create emails, blog posts, or
[11:25:41] social media captions. This is where an
[11:25:43] AI powered marketing content generator
[11:25:46] can make life easier. Repetitive content
[11:25:48] creation takes up time and resources
[11:25:50] that could be better spent on creative
[11:25:52] ideas and connecting with customers. By
[11:25:54] building an AI marketing content
[11:25:56] creator, you can automate the task,
[11:25:58] making it faster and more efficient
[11:26:00] while still maintaining quality and
[11:26:02] personalization. To develop this
[11:26:04] project, you would start by training a
[11:26:06] generative language model like OpenAI's
[11:26:08] GPT or Hugging Face Transformers to
[11:26:10] create different types of marketing
[11:26:12] content. You can feed it examples of
[11:26:14] blogs, email campaigns, or social media
[11:26:16] posts to fine-tune the model for specific
[11:26:19] industries. Adding features like
[11:26:21] customizable tone, keywords, or target
[11:26:23] audience will make the tool more
[11:26:24] versatile. You can also integrate a
[11:26:27] user-friendly interface using tools like
[11:26:29] Streamlit or FastAPI, so marketers can
[11:26:31] simply input a prompt and get
[11:26:33] ready-to-use content instantly. The tech stack will
[11:26:36] include Python as the primary
[11:26:37] programming language, libraries like
[11:26:39] TensorFlow or PyTorch for deep learning,
[11:26:41] and LangChain for chaining tasks like
[11:26:43] summarization and editing. Hugging Face
[11:26:45] can provide pre-trained models while
[11:26:47] Pinecone or a similar vector database
[11:26:49] can store embeddings for fast text
[11:26:52] retrieval. To deploy the app, you can use
[11:26:54] cloud platforms like AWS or Google
[11:26:56] Cloud. Completing this project will add
[11:26:59] immense value to your resume. It
[11:27:00] showcases your expertise in generative
[11:27:02] AI, NLP, and creating end-to-end
[11:27:04] solutions for real world problems. By
[11:27:07] building a marketing content generator,
[11:27:09] you will not just be solving a technical
[11:27:11] problem, but you'll be creating
[11:27:12] something that directly drives business
[11:27:14] growth, improves efficiency, and
[11:27:16] demonstrates the power of AI in everyday
[11:27:18] tasks. Here comes the last project idea
[11:27:20] for this particular video. Managing
[11:27:22] finances today can be complex and
[11:27:24] overwhelming for many people. Financial
[11:27:27] advisers are often expensive and
[11:27:29] individuals with limited resources may
[11:27:31] not have access to professional
[11:27:32] guidance. This is where a generative AI
[11:27:35] powered chatbot can bridge the gap by
[11:27:37] offering personalized, real-time financial
[11:27:39] advice. To build this project, you would
[11:27:42] create a conversational AI model capable
[11:27:44] of understanding user inputs such as
[11:27:46] financial goals or challenges and
[11:27:48] providing tailored recommendations. The
[11:27:50] chatbot would use natural language
[11:27:52] processing to understand queries,
[11:27:54] retrieve the relevant data, and then
[11:27:56] generate meaningful responses. For
[11:27:58] example, if someone asks, "How can I
[11:28:00] save for retirement?" The chatbot could
[11:28:02] analyze the user's income, expenses, and
[11:28:05] investment options to provide advice.
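To make the retirement example concrete, here is a minimal sketch of the kind of calculation such a chatbot could run behind its reply. The function names, the fixed annual return, and the sample figures are all assumptions for illustration. It uses the standard future value formula for equal monthly deposits, FV = P * ((1 + r)^n - 1) / r, where P is the monthly saving, r the monthly rate, and n the number of months.

```python
def monthly_surplus(income, expenses):
    """Money left to save each month after expenses (never negative)."""
    return max(income - expenses, 0.0)


def future_value(monthly_saving, annual_return, years):
    """Future value of equal end-of-month deposits with monthly compounding."""
    months = years * 12
    rate = annual_return / 12
    if rate == 0:
        # No growth: the total is just the sum of the deposits
        return monthly_saving * months
    return monthly_saving * ((1 + rate) ** months - 1) / rate


def retirement_reply(income, expenses, annual_return, years):
    """Turn the numbers into a sentence a chatbot could send back."""
    saving = monthly_surplus(income, expenses)
    total = future_value(saving, annual_return, years)
    return (f"Saving {saving:,.0f} a month for {years} years at "
            f"{annual_return:.0%} a year could grow to about {total:,.0f}.")


# Hypothetical user: 4,000 monthly income, 3,500 expenses, 6% assumed return
print(retirement_reply(income=4000, expenses=3500, annual_return=0.06, years=30))
```

In a full system, the language model would handle the conversation and extract these inputs from the user's messages, while deterministic code like this does the arithmetic.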
[11:28:07] The tech stack for this project includes
[11:28:09] Hugging Face Transformers to build and
[11:28:11] fine-tune the conversational model, then
[11:28:13] LangChain for integrating multiple
[11:28:15] knowledge sources and Pinecone or Weaviate
[11:28:18] for storing and retrieving user data
[11:28:20] efficiently. Tools like FastAPI can be
[11:28:23] used to deploy the chatbot and a
[11:28:25] user-friendly interface can be created
[11:28:27] using Streamlit or React. Security
[11:28:30] protocols like OAuth can be added to ensure
[11:28:33] user data privacy. Completing this
[11:28:35] project will be a big win for both the
[11:28:37] individual and the organization. For
[11:28:39] you, it would demonstrate your ability
[11:28:41] to work on real world AI problems using
[11:28:43] advanced tools and showcase your skills
[11:28:45] in generative AI and multimodal data
[11:28:48] integration. It's a standout addition to
[11:28:50] your resume, especially for fintech. So
[11:28:52] this is all for this video. I hope you
[11:28:54] guys were able to get some insight out
[11:28:56] of it. If you guys liked it, hit the
[11:28:58] like button and subscribe to
[11:28:59] Intellipaat's YouTube channel. Thank you
[11:29:00] and see you in the next video.
[11:29:02] >> Hello everyone. Intellipaat offers
[11:29:04] executive post-graduate certification in
[11:29:06] data science and artificial intelligence
[11:29:09] in collaboration with iHub, IIT Roorkee.
[11:29:12] Through this particular course, you'll
[11:29:13] get to learn multiple tools like Python,
[11:29:17] PySpark, SciPy, NumPy, pandas,
[11:29:21] Matplotlib, TensorFlow, Git, etc. You are going
[11:29:24] to learn multiple skills like data
[11:29:26] science, natural language processing,
[11:29:29] deep learning, fundamentals of
[11:29:31] generative AI, prompt engineering and
[11:29:33] application based generative AI as well
[11:29:35] as recent trends like agentic AI. This
[11:29:39] course is designed to get you ready for
[11:29:41] the AI world. So do check out the link
[11:29:43] available in the description. Also
[11:29:45] through this course we have already
[11:29:47] helped thousands of learners take a
[11:29:49] positive step in their career. You can
[11:29:51] check out their testimonials on our
[11:29:53] achievers channel.
[11:30:02] [Music]