Full Transcript
https://www.youtube.com/watch?v=UC-3uRGiNBY
[00:09] Hello everyone. The concept that we are going to have a look at today is called discretization. What is discretization, why do we do it, and how do we do it: that is what we are going to look at today.
[00:23] Before you understand what discretization is, you need to understand the difference between continuous and discrete data.
[00:30] Looking at the definition of continuous data, these are variables that can take an infinite number of possible values within a given range. What I mean is that they are not restricted to integers or whole numbers; theoretically speaking, they can take an infinite number of values.
[00:53] For example, think about something like height. Someone can be 160 cm, or 160.5 cm if you are very precise with the measuring, or 161.6 cm, or whatever. I believe you get the point: this is a feature that can, practically speaking, take an infinite number of values.
[01:22] The same goes for weight: depending on how you are measuring it, it can also take an infinite number of values. Temperature and time are also examples of continuous data.
[01:38] So if that is the case, what do you think discrete data is?
[01:43] Going by the definition of continuous data, discrete data are variables that can take only a finite set of values. Think of the number that comes up when you roll a die: it can only be 1, 2, 3, 4, 5 or 6.
[02:08] If you are, for example, categorizing someone as a minor, an adult or a senior citizen, that is also me limiting the options that this variable can take.
[02:26] So discrete data is limited in the options that it can take. That is the difference between continuous and discrete data.
[02:41] Discretization is the process of converting continuous data into discrete data.
[03:03] Right, why do we do this?
[03:10] There is a range of reasons why discretization is important.
[03:13] There are certain algorithms which work better with discrete data than with continuous data.
[03:20] For example, if you have learned about decision trees, you know that a tree makes splits based on certain conditions.
[03:30] Suppose it is making splits based on a continuous variable, say age, with values 22, 23, 24, 25 and so on.
[03:46] In order to figure out the best split, my decision tree algorithm will have to try a split at each of these values and compare the information gain, Gini index, or whatever the splitting criterion is.
[04:06] On the other hand, if I had converted this into discrete data, it could be something like minors, adults and senior citizens.
[04:19] In this case the number of splits, and the computational complexity of the algorithm, is greatly reduced: it only needs to make splits based on these discrete categories.
[04:34] So that is one reason: certain algorithms, such as decision trees, work better with discrete data than with continuous data.
[04:46] That results in faster computation as well.
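To make the split-count argument concrete, here is a minimal sketch in plain Python. It is not how any particular decision tree library works internally. The helper `candidate_thresholds`, the age range, and the cut-offs of 18 and 60 are all assumptions made for illustration, and the binned feature is treated as an ordered code (0, 1, 2).

```python
def candidate_thresholds(values):
    """Return the midpoints between consecutive distinct values.

    Each midpoint is one split a tree would have to evaluate
    for a numeric (ordered) feature.
    """
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def age_code(age):
    """Map an age to an ordered category code (assumed cut-offs: 18 and 60)."""
    if age < 18:
        return 0  # minor
    if age < 60:
        return 1  # adult
    return 2      # senior citizen


# Hypothetical data: one person of every age from 5 to 84.
ages = list(range(5, 85))

raw_splits = candidate_thresholds(ages)
binned_splits = candidate_thresholds([age_code(a) for a in ages])

print("splits to evaluate on raw ages:   ", len(raw_splits))
print("splits to evaluate on binned ages:", len(binned_splits))
```

With 80 distinct ages there are 79 candidate thresholds, while the three ordered categories leave only 2.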
[04:49] And also better interpretability.
[04:52] Say I have an age feature, just like before, with a set of ages. If I categorize them into these kinds of categories and look at the frequencies, I might see that I have about 10 adults, 5 minors and 10 senior citizens, or whatever. This also improves the interpretability of the data.
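The frequency view described above can be sketched with the standard library. The ages are made up, and the cut-offs (18 and 60) and the helper name `age_group` are illustrative assumptions.

```python
from bisect import bisect_right
from collections import Counter

CUTOFFS = [18, 60]                       # assumed category boundaries
LABELS = ["minor", "adult", "senior citizen"]


def age_group(age):
    """Return the label of the category that the age falls into."""
    # bisect_right counts how many cut-offs are <= age,
    # which is exactly the index of the matching label.
    return LABELS[bisect_right(CUTOFFS, age)]


# Made-up sample of ages.
ages = [12, 15, 17, 22, 25, 31, 38, 44, 52, 59, 60, 63, 70, 81]

counts = Counter(age_group(a) for a in ages)
for label in LABELS:
    print(f"{label}: {counts[label]}")
```

Three counts are easier to read at a glance than fourteen raw numbers, which is the interpretability point being made.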
[05:25] Another thing is dealing with noise or outliers. The moment you convert continuous data into categorical data, it greatly reduces the effect of outliers.
[05:37] For example, you might have temperatures such as 0, 10 and 15, and for some reason you have an outlier of, let's say, 60 °C.
[05:54] If you convert this kind of feature to a categorical, or discrete, feature, you might have categories like low temperature, medium temperature and high temperature.
[06:06] Then you might say that 0 and 10 come into the low category, 15 comes into the medium category, and 60 and anything above comes into the high category.
[06:20] Had you been treating this as continuous data, the effect of this outlier would have been quite evident; it could have actually created biases in your machine learning algorithms as well.
[06:39] That problem is dealt with quite nicely when you convert the feature to a category.
[06:44] There are many more reasons as well; depending on the context and the problem you are facing, discretization might help you in this manner.
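A small sketch of the outlier effect, using the four temperatures from the example. The exact category edges (15 and 60) are assumptions chosen to match the grouping described, and `temp_level` is a hypothetical helper.

```python
from statistics import mean


def temp_level(celsius):
    """Map a temperature to a coarse level (assumed edges: 15 and 60)."""
    if celsius < 15:
        return "low"
    if celsius < 60:
        return "medium"
    return "high"


temps = [0, 10, 15, 60]  # 60 is the outlier

# As a continuous feature, the single outlier drags the mean upwards.
print("mean without outlier:", mean(temps[:3]))
print("mean with outlier:   ", mean(temps))

# As a discrete feature, the outlier is just one more "high" value,
# and a far more extreme reading would land in the same category.
print([temp_level(t) for t in temps])
print(temp_level(60) == temp_level(600))
```

Once binned, the magnitude of the outlier no longer matters: 60 and 600 are both simply "high".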
[06:53] So how do we do discretization?
[06:56] This is not an exhaustive list, but these are a few of the most common or popular discretization methods, starting with something called equal width binning.
[07:07] Let me put up some sample data of ages. I have someone who is 20, 21, 22, 25; I have someone who is 30 or 40; then I have a couple of 50, 55 and 60; and then I have these people as well.
[07:31] What equal width binning does, as the name suggests, is create bins of equal width. What I mean by that is, let's say my first bin starts from 20 and goes all the way to 30.
[07:57] Let's say 31 to 40 is another bin, 41 to 50 is another bin, then 51 to 60, 61 to 70, 71 to 80 and 81 to 90.
[08:23] You can also alter the width. You can create a width of 20, for example: I can say 20 to 40, 41 to 60, 61 to 80 and 81 to 100. So I have these four bins.
[08:51] You can alter the width of the bins according to your requirement. If you want a more granular view of the data, you might want to increase the number of bins.
[09:05] But let's say these are the bins that we are working with. Anyone who is from 20 to 40 years of age is going to be put into this bin: 1, 2, 3, 4, 5 and 6, so we have six members in it.
[09:28] In the 41 to 60 range, actually, I have only three, not four. In 61 to 80 I have just one, and in 81 to 100 I have three.
[09:49] This is what equal width binning looks like: you create bins of equal width, and then you assign each of the continuous values to a bin.
[10:06] So that is one way of creating a discrete feature out of your continuous feature.
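Equal width binning can be sketched in a few lines of plain Python. This version derives the bin edges from the minimum and maximum of the data rather than using hand-picked round edges such as 20 to 40, so the bins differ slightly from the ones drawn in the video. The helper name `equal_width_bins` is hypothetical, and the age list is an assumption: not every value on the board is audible in the transcript.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width.

    Returns (bin_indices, width). The intervals span min(values)..max(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0] * len(values), 0.0
    indices = []
    for v in values:
        # The maximum value would land on index n_bins, so clamp it
        # into the last bin.
        indices.append(min(int((v - lo) / width), n_bins - 1))
    return indices, width


# Assumed sample ages (illustrative, not the exact list from the video).
ages = [20, 21, 22, 25, 30, 40, 50, 55, 66, 90, 95, 97]

indices, width = equal_width_bins(ages, n_bins=4)
counts = [indices.count(b) for b in range(4)]

print("bin width:", width)
print("members per bin:", counts)
```

Every bin covers the same span of ages, but the number of members per bin is uneven, which is the issue the next method addresses.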
[10:12] Another method is called equal frequency binning. One issue that you can see with equal width binning is that the number of members in each bin is quite different: we have six here, three here, one and three. Equal frequency binning is focused on one thing: creating bins which have an equal number of members. Let's say I have 1, 2, 3, and so on up to 12: I have 12 members here, and I want to create four bins. I will first arrange the entire thing in ascending or descending order.
[11:14] Since this is 12, 12 divided by 4 gives you 3, so each of these bins is supposed to have three members.
[11:19] The first three members are 20, 21 and 22, so this particular bin spans 20 all the way to 22.
[11:35] The next three members are 25, 30 and 40, so this bin spans 25 to 40.
[11:49] Then 50, 55 and 66, so this bin spans 50 to 66, and finally 90 to 97 is going to be the last bin.
[12:06] So here we have achieved one thing: each of the bins has an equal number of members. But you can see that the widths of the bins are no longer equal. Previously we had equal width; in this case we have equal frequency.
[12:24] This is also one of the methods of doing discretization.
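The sort-then-chunk procedure for equal frequency binning can be sketched as follows. The helper name `equal_frequency_bins` is hypothetical, the age list is an assumption (one value in the last bin is not audible in the transcript), and this simple version does not try to keep tied values in the same bin.

```python
def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (almost) equal size."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # If the count does not divide evenly, the first `remainder`
        # bins each take one extra member.
        end = start + size + (1 if i < remainder else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Assumed sample ages (illustrative, not the exact list from the video).
ages = [20, 21, 22, 25, 30, 40, 50, 55, 66, 90, 95, 97]

bins = equal_frequency_bins(ages, n_bins=4)
for members in bins:
    print(f"{members[0]} to {members[-1]}: {members}")
```

Each bin holds three members, while the spans (20 to 22, 25 to 40, 50 to 66, 90 to 97) have very different widths.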
[12:30] Then you have clustering-based binning as well, which uses clustering algorithms from machine learning to figure out which category each continuous value is supposed to fall into, and then creates the bins out of that.
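The video does not name a specific clustering algorithm, so the following is only one possible sketch: a tiny one-dimensional k-means written in plain Python. The function name `kmeans_1d`, the choice of k = 3, the deterministic initialisation and the age list are all assumptions made for illustration.

```python
def kmeans_1d(values, k, n_iter=100):
    """Group 1-D values into k clusters with a basic k-means loop.

    Returns (clusters, centres), where clusters is a list of k lists.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: pick k values spread evenly through the sorted data.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # Update step: each centre moves to the mean of its members
        # (an empty cluster keeps its old centre).
        new_centres = [
            sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # assignments can no longer change
        centres = new_centres
    return clusters, centres


# Assumed sample ages (illustrative, not the exact list from the video).
ages = [20, 21, 22, 25, 30, 40, 50, 55, 66, 90, 95, 97]

clusters, centres = kmeans_1d(ages, k=3)
for members in clusters:
    print(f"bin {members[0]} to {members[-1]}: {members}")
```

Here the bin boundaries come from where the data naturally groups together rather than from a fixed width or a fixed member count.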
[12:47] We also have the concept of custom binning. Custom binning refers to the situation where you, as the researcher or the data scientist, owing to your domain knowledge, your domain expertise and your business knowledge, are able to figure out what the bin boundaries are supposed to be.
[13:10] So rather than relying upon equal width or equal frequency, you create your own bins based on your domain knowledge. For example, I can say that anyone who is 0 to 18 should be considered a minor, anyone between the ages of 18 and 60 is an adult, and anyone 60 and over is a senior citizen. In this case the widths of the bins are something that I have defined myself. So that is also one of the methods.
[13:17] Now mind that this is not an exhaustive list; these are not the only binning or discretization methods that you have in hand. There are more, and we will be covering those as well going forward. So stay tuned, and see you in the next video. Thank you.