# SQL Data Warehouse from Scratch | Full Hands-On Data Engineering Project

https://www.youtube.com/watch?v=9GVqKuTVANE

[00:00] hey friends so today we are diving into
[00:02] something very exciting building
[00:04] together a modern SQL data warehouse
[00:06] project but this one is not just any project
[00:09] this one is a special one not only you
[00:11] will learn how to build a modern data
[00:13] warehouse from scratch but also you
[00:15] will learn how I implement this kind of
[00:17] projects in real-world companies I'm Baraa
[00:19] Salkini and I have built more than five
[00:22] successful data warehouse projects in
[00:24] different companies and right now I'm
[00:26] leading big data and BI projects at
[00:28] Mercedes-Benz so that's me I'm sharing
[00:30] with you real skills real Knowledge from
[00:33] complex projects and here's what you
[00:35] will get out of this project as a data
[00:37] architect we will be designing a modern
[00:39] data architecture following the best
[00:41] practices and as a data engineer you
[00:43] will be writing your code to clean
[00:46] transform load and prepare the data for
[00:48] analysis and as a data modeler you will
[00:51] learn the basics of data modeling and we
[00:54] will be creating from scratch a new
[00:56] data model for analysis and my friends
[00:58] by the end of this project you will have
[01:00] a professional portfolio project to
[01:03] Showcase your new skills for example on
[01:05] LinkedIn so feel free to take the
[01:07] project modify it and as well share it
[01:09] with others but it's going to mean the
[01:11] world to me if you share my content and
[01:14] guess what everything is for free so
[01:16] there are no hidden costs at all and in
[01:18] this project we will be using SQL Server
[01:21] but if you prefer other databases like
[01:23] MySQL or Postgres don't worry you can follow
[01:25] along just fine
[01:31] all right my friends so now if you want
[01:32] to do data analytics projects using SQL
[01:35] we have three different types the first
[01:36] type of projects you can do data
[01:38] warehousing it's all about how to
[01:40] organize structure and prepare your data
[01:43] for data analysis it is the foundation
[01:45] of any data analytics project and in
[01:48] the next step you can do exploratory
[01:50] data analysis EDA and all you have
[01:52] to do is to understand and uncover
[01:54] insights about our datasets in this
[01:56] kind of project you can learn how to ask
[01:58] the right questions and how to find the
[02:01] answer using SQL by just using basic SQL
[02:04] skills now moving on to the last stage
[02:06] where you can do advanced analytics
[02:08] projects where you're going to use advanced
[02:10] SQL techniques in order to answer
[02:12] business questions like finding Trends
[02:14] over time comparing the performance
[02:17] segmenting your data into different
[02:18] sections and as well generate reports
[02:21] for your stakeholders so here you will
[02:22] be solving real business questions using
[02:25] Advanced SQL techniques now what we're
[02:27] going to do we're going to start with
[02:28] the first type of projects SQL data
[02:30] warehousing where you will gain the
[02:31] following skills so first you will learn
[02:33] how to do ETL/ELT processing using SQL
[02:36] in order to prepare the data you will
[02:38] learn as well how to build a data
[02:39] architecture how to do data integration
[02:42] where we can merge multiple sources
[02:43] together and as well how to do data load
[02:45] and data modeling so if I got you
[02:47] interested grab your coffee and let's
[02:49] jump to the
[02:53] projects all right my friends so now
[02:55] before we deep dive into the tools and
[02:57] the cool stuff we have first to have a
[02:59] good understanding of what exactly
[03:01] a data warehouse is and why companies try
[03:04] to build such a data management system
[03:06] so now the question is what is a data
[03:08] warehouse I will just use the definition
[03:10] of the father of the data warehouse Bill
[03:12] Inmon a data warehouse is a subject-oriented
[03:14] integrated time-variant and
[03:17] non-volatile collection of data designed
[03:20] to support management's
[03:21] decision-making process okay I know
[03:23] that might be confusing subject-oriented
[03:25] it means the data warehouse is always
[03:27] focused on a business area like the
[03:29] sales customers finance and so on
[03:32] integrated because it goes and integrates
[03:34] multiple source systems usually you
[03:36] build a warehouse not only for one
[03:38] source but for multiple sources
[03:40] time-variant it means you can keep
[03:42] historical data inside the data
[03:43] warehouse non-volatile it means once the
[03:46] data enters the data warehouse it is not
[03:48] deleted or modified so this is how Bill
[03:51] Inmon defined the data warehouse okay so
[03:53] now I'm going to show you the scenario
[03:54] where your company doesn't have real
[03:56] data management so now let's say that
[03:58] you have one system and you have like
[04:00] one data analyst who has to go to this
[04:02] system and start collecting and
[04:03] extracting the data and then he's going to
[04:05] spend days and sometimes weeks
[04:07] transforming the raw data into something
[04:10] meaningful then once they have the
[04:12] report they're going to go and share it
[04:14] and this data analyst is sharing the
[04:15] report using an Excel and then you have
[04:18] like another source of data and you have
[04:20] another data analyst that she is doing
[04:22] maybe the same steps collecting the data
[04:24] spending a lot of time transforming the
[04:26] data and then share at the end like a
[04:28] report and this time she is sharing the
[04:30] data using PowerPoint and a third system
[04:32] and the same story but this time he is
[04:34] sharing the data using maybe Power BI so
[04:37] now if the company works like this then
[04:39] there is a lot of issues first this
[04:41] process takes way too long I saw a lot
[04:44] of scenarios where sometimes it takes
[04:46] weeks and even months until the employees
[04:48] manually generate those reports and of
[04:50] course what's going to happen for the
[04:51] users they are consuming multiple
[04:54] reports with multiple states of the data
[04:56] one report is 40 days old another one 10
[04:59] days and a third one is like 5 days so
[05:02] it's going to be really hard to make a
[05:03] real decision based on this structure a
[05:06] manual process is always slow and
[05:08] stressful and the more employees you
[05:10] involve in the process the more you
[05:11] open the door for human errors and
[05:14] errors of course in reports lead to bad
[05:16] decisions and another issue of course is
[05:19] handling big data if one of your
[05:21] sources is generating a massive amount
[05:23] of data then the data analyst is going to
[05:25] struggle collecting the data and maybe
[05:27] in some scenarios it will not be any
[05:29] more possible to get the data so the
[05:31] whole process can break and you cannot
[05:33] generate any more fresh data for
[05:35] specific reports and one last very big
[05:38] issue with that if one of your
[05:40] stakeholders asks for an integrated report
[05:43] from multiple sources well good luck
[05:45] with that because merging all that data
[05:47] manually is very chaotic time-consuming
[05:50] and full of risk so this is just a
[05:52] picture if a company is working without
[05:54] proper data management without a data
[05:57] lake data warehouse or data lakehouse so
[05:59] in order to make real and good decisions
[06:02] you need data management so now let's
[06:04] talk about the scenario of a data
[06:06] warehouse so the first thing that can
[06:08] happen is that you will not have your
[06:10] data team collecting manually the data
[06:12] you're going to have a very important
[06:14] component called ETL ETL stands for
[06:17] extract transform and load it is a
[06:20] process that you do in order to extract
[06:22] the data from the sources and then apply
[06:24] multiple Transformations on those
[06:25] sources and at the end it loads the data
[06:28] to the data warehouse and this one is going
[06:30] to be the single point of truth for
[06:32] analysis and reporting and it is called
[06:35] the data warehouse so now what can happen
[06:37] all your reports are going to be consuming
[06:40] this single point of truth so with that
[06:42] you create your multiple reports and as
[06:44] well you can create integrated reports
[06:47] from multiple sources not only from one
[06:49] single source so now by looking to the
[06:51] right side it looks already organized
[06:53] right and the whole process is
[06:55] completely automated there is no more
[06:57] manual steps which of course reduces
[07:00] human error and as well it is pretty
[07:02] fast so usually you can load the data
[07:04] from the sources to the reports in a
[07:06] matter of hours or sometimes in minutes
[07:09] so there is no need to wait like weeks
[07:12] and months in order to refresh anything
[07:14] and of course the big Advantage is that
[07:16] the data warehouse itself is
[07:18] completely integrated so that means it
[07:20] goes and brings all those sources
[07:22] together in one place which makes it
[07:24] really easy for reporting and not only
[07:27] integration you can build in the data
[07:29] warehouse as well history so we have now
[07:31] the possibility to access historical
[07:33] data and what is also amazing is that all
[07:36] those reports have the same data
[07:38] status so all those reports can have the
[07:40] same status maybe sometimes one day old
[07:43] or something and of course if you have a
[07:44] modern Data Warehouse in Cloud platforms
[07:47] you can really easily handle any big
[07:49] data sources so no need to panic if one
[07:52] of your sources is delivering massive
[07:54] amount of data and of course in order to
[07:56] build the data warehouse you need
[07:57] different types of Developers so usually
[08:00] the one that builds the ETL component
[08:02] and the data warehouse is the data
[08:04] engineer so they are the one that is
[08:07] accessing the sources scripting the ETLs
[08:09] and building the database for the data
[08:11] warehouse and now for the other part the
[08:14] one that is responsible for that is the
[08:16] data analyst they are the one that is
[08:18] consuming the data warehouse building
[08:20] different data models and reports and
[08:22] sharing it with the stakeholders so
[08:25] they are usually contacting the
[08:26] stakeholders understanding the requirements
[08:28] and building multiple reports based on
[08:31] the data warehouse so now if you have a
[08:32] look to those two scenarios this is
[08:35] exactly why we need data management your
[08:37] data team is not wasting time and
[08:40] fighting with the data they are now more
[08:42] organized and more focused and with a
[08:45] data warehouse you are delivering
[08:47] professional and fresh reports that your
[08:50] company can count on in order to make
[08:52] good and fast decisions so this is why
[08:55] you need a data management like a data
[08:57] warehouse think about data warehouse as
[08:59] a busy restaurant every day different
[09:01] suppliers bring in fresh ingredients
[09:04] vegetables spices meat you name it they
[09:06] don't just use it immediately and throw
[09:08] everything in one pot right they clean
[09:10] it chop it and organize everything and
[09:13] store each ingredient in the right
[09:15] place fridge or freezer so this is the
[09:18] preparation phase and when the order comes
[09:20] in they quickly grab the prepared
[09:22] ingredients and create a perfect dish
[09:25] and then serve it to the customers of
[09:26] the restaurant and this process is
[09:28] exactly like the data warehouse process
[09:30] it is like the kitchen where the raw
[09:32] ingredients your data are cleaned sorted
[09:35] and stored and when you need a report or
[09:37] analysis it is ready to serve up exactly
[09:40] like what you
[09:44] need okay so now we're going to zoom in
[09:47] and focus on the component ETL if you
[09:49] are building such a project you're going
[09:50] to spend almost 90% just building this
[09:53] component the ETL so it is the core
[09:56] element of the data warehouse and I want
[09:58] you to have a clear understanding what
[10:00] is exactly an ETL so our data exists in a
[10:04] source system and now what we want to do
[10:05] is to get our data from the source
[10:08] and move it to the target source and
[10:10] target could be like database tables so
[10:12] now the first step that we have to do is
[10:14] to specify which data we have to load
[10:17] from the source of course we can say
[10:19] that we want to load everything but
[10:20] let's say that we are doing incremental
[10:22] loads so we're going to go and specify a
[10:24] subset of the data from the source in
[10:26] order to prepare it and load it later to
[10:28] the target so this step in the ETL
[10:30] process we call it extract we are just
[10:32] identifying the data that we need we
[10:35] pull it out and we don't change anything
[10:37] it's going to be like one to one like
[10:39] the source system so the extract has
[10:41] only one task to identify the data that
[10:43] you have to pull out from the source and
[10:46] to not change anything so we will not
[10:48] manipulate the data at all it can stay
[10:50] as it is so this is the first step in
[10:52] the ETL process the extract now moving
[10:55] on to the stage number two we're going
[10:57] to take this extracted data and we will do
[10:59] some manipulations
[11:01] transformations and we're going to
[11:02] change the shape of this data and this
[11:05] process is really heavy work we can
[11:07] do a lot of stuff like data cleansing
[11:09] data integration and a lot of formatting
[11:12] and data normalizations so a lot of
[11:14] stuff we can do in this step so this is
[11:16] the second step in the ETL process the
[11:18] transformation we're going to take the
[11:20] original data and reshape it transform it
[11:23] into exactly the format that we need
[11:26] into a new format and shape that we
[11:28] need for analysis and reporting now finally
[11:30] we get to the last step in the ETL
[11:32] process we have the load so in this step
[11:35] we're going to take this new data and
[11:37] we're going to insert it into the
[11:38] target so it is very simple we're going
[11:40] to take this prepared data from the
[11:42] transformation step and we're going to
[11:44] move it into its final destination the
[11:46] target like for example the data warehouse
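
The three steps described above (extract, transform, load) can be sketched in a few lines. This is only a minimal illustration, not the project's implementation: it uses Python with an in-memory SQLite database instead of SQL Server, and the table names `src_customers` and `dwh_customers`, the columns, and the country mapping are all hypothetical.

```python
import sqlite3


def extract(conn):
    """Extract: pull the rows out of the source one to one, changing nothing."""
    return conn.execute(
        "SELECT id, name, country_code FROM src_customers"
    ).fetchall()


def transform(rows):
    """Transform: reshape the raw rows into the format needed for analysis."""
    country_names = {"DE": "Germany", "US": "United States"}  # hypothetical mapping
    prepared = []
    for cust_id, name, code in rows:
        clean_name = name.strip().title()          # remove spaces, unify casing
        country = country_names.get(code, "n/a")   # map codes to friendly values
        prepared.append((cust_id, clean_name, country))
    return prepared


def load(conn, rows):
    """Load: insert the prepared rows into the target table."""
    conn.executemany("INSERT INTO dwh_customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    # Hypothetical source and target living in the same in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_customers (id INTEGER, name TEXT, country_code TEXT)")
    conn.execute("CREATE TABLE dwh_customers (id INTEGER, name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO src_customers VALUES (?, ?, ?)",
        [(1, "  anna schmidt ", "DE"), (2, "JOHN DOE", "US"), (3, "li wei", "CN")],
    )
    load(conn, transform(extract(conn)))
    return conn.execute(
        "SELECT id, name, country FROM dwh_customers ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Keeping the three steps as separate functions mirrors the point made above: the extract does not manipulate the data, and all reshaping happens in the transform step.
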
[11:49] so that's ETL in a nutshell first
[11:51] extract the raw data then transform it
[11:53] into something meaningful and finally
[11:55] load it to a target where it's going to
[11:57] make a difference so that's it
[11:59] this is what we mean with the ETL
[12:01] process now in real projects we don't
[12:03] have like only source and target our
[12:06] data architecture is going to have like
[12:07] multiple layers depending on your design
[12:10] whether you are building a warehouse or
[12:11] a data lake or a data lakehouse and
[12:13] usually there are like different ways on
[12:15] how to load the data between all those
[12:17] layers and in order now to load the data
[12:19] from one layer to another one there are
[12:21] like multiple ways on how to use the ETL
[12:24] process so usually if you are loading
[12:26] the data from the source to the layer
[12:27] number one you only extract the data from the
[12:30] source and load it directly to the layer
[12:32] number one without doing any
[12:33] Transformations because I want to see
[12:35] the data as it is in the first layer and
[12:38] now between the layer number one and the
[12:40] layer number two you might go and use
[12:42] the full ETL so we're going to extract
[12:44] from the layer one transform it and then
[12:46] load it to the layer number two so with
[12:49] that we are using the whole process the
[12:50] ETL and now between layer two and layer
[12:53] three we can do only transformation and
[12:55] then load so we don't have to deal with
[12:57] how to extract the data because it is
[12:59] maybe using the same technology and we
[13:01] are taking all data from Layer Two to
[13:03] layer three so we transform the whole
[13:05] layer two and then load it to layer
[13:07] three and now between three and four you
[13:10] can use only the L so maybe it's
[13:12] something like duplicating and
[13:13] replicating the data and then you are
[13:16] doing the transformation so you load to
[13:18] the new layer and then transform it of
[13:20] course this is not a real scenario I'm
[13:22] just showing you that in order to move
[13:24] from source to a Target you don't have
[13:26] always to use a complete ETL depending on
[13:29] the design of your data architecture you
[13:31] might use only few components from the
[13:33] ETL okay so this is how ETL looks like
[13:36] in real projects okay so now I would
[13:38] like to show you an overview of the
[13:40] different techniques and methods in the
[13:42] ETLs we have a wide range of possibilities
[13:45] where you have to make decisions on
[13:46] which one you want to apply to your
[13:48] projects so let's start first with the
[13:50] extraction the first thing that I want
[13:52] to show you is we have different methods
[13:54] of extraction either you are going to
[13:56] The Source system and pulling the data
[13:58] from the source or the source system is
[14:00] pushing the data to the data warehouse
[14:02] so those are the two main methods on how
[14:04] to extract data and then we have in the
[14:06] extraction two types we have a full
[14:09] extraction everything all the records
[14:11] from tables and every day we load all
[14:13] the data to the data warehouse or we
[14:15] make a smarter one where we say we're
[14:17] going to do an incremental extraction
[14:19] where every day we're going to identify
[14:21] only the new or changed data so we don't
[14:23] have to load the whole thing only the
[14:25] new data we go extract it and then load
[14:27] it to the data warehouse and in data
[14:29] extraction we have different techniques
[14:31] the first one is like manually where
[14:33] someone has to access a source system
[14:35] and extract the data manually or we
[14:37] connect ourself to a database and we
[14:39] have then a query in order to extract
[14:41] the data or we have a file that we have
[14:43] to parse into the data warehouse or
[14:45] another technique is to connect ourselves
[14:47] to an API and do calls there in order to
[14:50] extract the data or if the data is
[14:52] available in streaming like in Kafka we
[14:54] can do event-based streaming in order to
[14:57] extract the data another way is to use
[14:59] change data capture CDC it is as well
[15:02] something very similar to streaming or
[15:04] another way is by using web scraping
[15:06] where you have code that is going to run
[15:08] and extract all the information from
[15:10] the web so those are the different
[15:11] techniques and types that we have in the
[15:14] extraction now if you are talking about the
[15:16] transformation there is a wide range of
[15:18] different Transformations that we can do
[15:20] on our data like for example doing data
[15:23] enrichment where we add values to our
[15:25] data sets or we do a data integration
[15:27] where we have multiple sources and we
[15:29] bring everything to one data model or we
[15:31] derive new columns based on already
[15:34] existing ones another type of data
[15:36] transformations we have the data
[15:37] normalization so the sources have values
[15:40] that are like a code and you go and map
[15:42] it to more friendly values for the
[15:44] analysis which is easier to
[15:47] understand and to use another
[15:48] transformation we have the business
[15:50] rules and logic depending on the business
[15:52] you can define different criteria in
[15:54] order to build like new columns and what
[15:56] belongs to Transformations is the data
[15:59] aggregation so here we aggregate the
[16:00] data to a different granularity and then
[16:03] we have type of transformation called
[16:05] Data cleansing there are many different
[16:07] ways on how to clean our data for
[16:09] example removing the duplicates doing
[16:11] data filtering handling the missing data
[16:14] handling invalid values or removing
[16:16] unwanted spaces casting the data types
[16:19] and detecting the outliers and many more
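
A few of the cleansing steps just listed (removing duplicates, filtering, handling missing and invalid values, trimming spaces, casting data types) can be sketched as one small function. This is an illustrative sketch in plain Python rather than SQL; the field names `id`, `name`, `age`, the `"n/a"` default, and the 0 to 120 age range are assumptions made up for the example.

```python
def clean_rows(raw_rows):
    """Apply a handful of basic cleansing rules to a list of dict records."""
    seen_ids = set()
    cleaned = []
    for row in raw_rows:
        raw_id = row.get("id")
        if raw_id is None:
            continue                      # data filtering: drop rows without a key
        cust_id = int(raw_id)             # casting the data type: text -> integer
        if cust_id in seen_ids:
            continue                      # removing duplicates: keep the first row per id
        seen_ids.add(cust_id)

        # removing unwanted spaces and handling missing data with a default
        name = (row.get("name") or "").strip() or "n/a"

        # handling invalid values: non-numeric or out-of-range ages become None
        try:
            age = int(row.get("age"))
        except (TypeError, ValueError):
            age = None
        if age is not None and not 0 <= age <= 120:
            age = None

        cleaned.append({"id": cust_id, "name": name, "age": age})
    return cleaned


if __name__ == "__main__":
    sample = [
        {"id": "1", "name": "  Anna ", "age": "34"},
        {"id": "1", "name": "Anna", "age": "34"},     # duplicate id
        {"id": "2", "name": None, "age": "-5"},       # missing name, invalid age
        {"id": None, "name": "Ghost", "age": "20"},   # no key at all
        {"id": "3", "name": "Li", "age": "abc"},      # age is not a number
    ]
    for record in clean_rows(sample):
        print(record)
```

The range check here is only a simple validity rule; real outlier detection would usually rely on the distribution of the data.
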
[16:22] so we have different types of data
[16:24] cleansing that we can do in our data
[16:26] warehouse and this is very important
[16:28] transformation so as you can see we have
[16:30] different types of Transformations that
[16:32] we can do in our data warehouse now
[16:34] moving on to the load so what do we have
[16:36] over here we have different processing
[16:39] types so either we are doing batch
[16:41] processing or stream processing batch
[16:43] processing means we are loading the data
[16:45] warehouse in one big batch of data
[16:48] that's going to run and load the data
[16:50] warehouse so it is only a one-time job in
[16:52] order to refresh the content of the data
[16:54] warehouse and as well the reports so
[16:56] that means we are scheduling the data
[16:58] warehouse in order to load it in the day
[17:00] once or twice and the other type we have
[17:02] the stream processing so this means if
[17:04] there is like a change in the source
[17:06] system we're going to process this change
[17:08] as soon as possible so we're going to
[17:10] process it through all the layers of the
[17:11] data warehouse once something changes
[17:14] from the source system so we are
[17:15] streaming the data in order to have a real-time
[17:18] data warehouse which is a very
[17:20] challenging thing to do in data
[17:21] warehousing and if you are talking about
[17:23] the loads we have two methods either we
[17:26] are doing a full load or incremental
[17:28] load it's the same thing as extraction
[17:30] right so for the full load in databases
[17:32] there are like different methods on how
[17:33] to do it like for example we truncate and
[17:36] then insert that means we make the table
[17:38] completely empty and then we insert
[17:40] everything from scratch or another
[17:42] one you are doing an update insert we
[17:44] call it upsert so we can go and update
[17:46] all the records and then insert the new
[17:49] ones and another way is to drop create and
[17:51] insert so that means we drop the whole
[17:53] table and then we create it from scratch
[17:55] and then we insert the data it is very
[17:57] similar to the truncate but here we are
[17:59] as well removing and dropping the whole
[18:01] table so those are the different methods
[18:03] of full loads the incremental load we
[18:05] can use as well the upserts so update
[18:07] and inserts so we're going to do an
[18:09] update or insert statements to our
[18:11] tables or if the source is something
[18:13] like a log we can do only inserts so we
[18:16] can go and append the data always to the
[18:18] table without having to update anything
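
Two of the load methods mentioned so far, truncate and insert for a full load and upsert for an incremental load, can be sketched as follows. This is a hedged illustration using Python with an in-memory SQLite database, not the SQL Server scripts of the project: the `customers` table is hypothetical, and because SQLite has no `TRUNCATE` statement the sketch empties the table with `DELETE FROM` (on SQL Server this step would be `TRUNCATE TABLE`).

```python
import sqlite3


def full_load_truncate_insert(conn, rows):
    """Full load: make the table completely empty, then insert everything again."""
    conn.execute("DELETE FROM customers")  # stands in for TRUNCATE TABLE
    conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
    conn.commit()


def incremental_load_upsert(conn, rows):
    """Incremental load: update records that exist, insert the ones that do not."""
    for cust_id, name in rows:
        cur = conn.execute(
            "UPDATE customers SET name = ? WHERE id = ?", (name, cust_id)
        )
        if cur.rowcount == 0:  # nothing was updated, so this is a new record
            conn.execute(
                "INSERT INTO customers (id, name) VALUES (?, ?)", (cust_id, name)
            )
    conn.commit()


def fetch_all(conn):
    return conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

    # Two full loads in a row: the second one replaces the whole content.
    full_load_truncate_insert(conn, [(1, "Anna"), (2, "John")])
    full_load_truncate_insert(conn, [(1, "Anna"), (2, "John"), (3, "Li")])
    after_full = fetch_all(conn)

    # An incremental load carrying one changed record and one new record.
    incremental_load_upsert(conn, [(2, "Jon"), (4, "Mia")])
    after_upsert = fetch_all(conn)
    return after_full, after_upsert


if __name__ == "__main__":
    full, upserted = demo()
    print(full)
    print(upserted)
```

For a log-like source, the append-only variant is simply the `INSERT` on its own with no `UPDATE` step.
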
[18:20] another way to do incremental load is to
[18:22] do a merge and here it is very similar
[18:24] to the upsert but as well with a delete
[18:26] so update insert delete so those are the
[18:29] different methods on how to load the
[18:30] data to your tables and one more thing
[18:32] in data warehousing we have something
[18:34] called slowly changing Dimensions so
[18:36] here it's all about the historization of your
[18:39] table and there are many different ways
[18:41] on how to handle the history in your table
[18:44] the first type is SCD 0 we say there is no
[18:46] historization and nothing should be
[18:48] changed at all so that means you are not
[18:50] going to update anything the second one
[18:52] which is more famous it is the SCD 1 you
[18:55] are doing an overwrite so that means you
[18:58] are updating the records with the new
[19:00] information from the source system by
[19:02] overwriting the old value so we are
[19:04] doing something like the upsert so
[19:05] update and insert but you are losing of
[19:08] course history another one we have the
[19:09] SCD 2 and here you want to add
[19:11] historization to your table so what we
[19:13] do each change that we get
[19:16] from the source system means we are
[19:18] inserting new records and we are not
[19:20] going to overwrite or delete the old
[19:22] data we are just going to make it
[19:24] inactive and the new record is going to be
[19:26] the active one so there are different
[19:28] methods on how to do historization as
[19:30] well while you are loading the data to
[19:33] the data warehouse all right so those
[19:34] are the different types and techniques
[19:36] that you might encounter in data
[19:38] management projects so now what I'm
[19:39] going to show you quickly which of those
[19:41] types we will be using in our projects
[19:43] so now if we are talking about the
[19:44] extraction over here we will be doing a
[19:46] pull extraction and about the full or
[19:49] incremental it's going to be a full
[19:51] extraction and about the technique we
[19:53] are going to be parsing files to the
[19:55] data warehouse and now about the data
[19:57] transformation well this one we will
[20:00] cover everything all those types of
[20:02] Transformations that I'm showing you now
[20:04] is going to be part of the project
[20:06] because I believe in each data project
[20:08] you will be facing those Transformations
[20:10] now if we have a look to the load our
[20:12] project is going to be batch processing and
[20:14] about the load methods we will be doing
[20:16] a full load since we have full
[20:18] extraction and it's going to be truncate
[20:20] and insert and now about the
[20:22] historization we will be doing the SCD 1
[20:26] so that means we will be updating the
[20:28] content of the data warehouse so those
[20:29] are the different techniques and types
[20:31] that we will be using in our ETL process
[20:34] for this project all right so with that
[20:36] we have now clear understanding what is
[20:37] a data warehouse and we are done with
[20:39] the theory parts so now the next step
[20:42] we're going to start with the projects
[20:43] the first thing that you have to do is
[20:45] to prepare our environment to develop
[20:47] the projects so let's start with
[20:52] that all right so now we go to the link
[20:55] in the description and from there we're
[20:57] going to go to the downloads and and
[20:58] here you can find all the materials of
[21:00] all courses and projects but the one
[21:02] that we need now is the SQL data
[21:03] warehouse projects so let's go to the
[21:05] link and here we have bunch of links
[21:07] that we need for the projects but the
[21:09] most important one to get all data and
[21:11] files is this one download all project
[21:14] files so let's go and do that and after
[21:16] you do that you're going to get a zip
[21:18] file where you have there a lot of stuff
[21:20] so let's go and extract it and now
[21:22] inside it if you go over here you will
[21:24] find the repository structure from Git and
[21:26] the most important one here is the
[21:28] datasets so you have two sources the CRM
[21:31] and the ERP and in each one of them
[21:33] there are three CSV files so those are
[21:36] the datasets for the project for the
[21:38] other stuff don't worry about it we
[21:40] will be explaining that during the
[21:41] project so go and get the data and put
[21:44] it somewhere at your PC where you don't
[21:45] lose it okay so now what else do we have
[21:47] we have here a link to the Git
[21:49] repository so this is the link to my
[21:51] repository that I have created through
[21:53] the projects so you can go and access it
[21:55] but don't worry about it we're going to
[21:56] explain the whole structure during the
[21:58] project and you will be creating your
[22:00] own repository and as well we have the
[22:02] link to Notion here we are doing the
[22:04] project management here you're going to
[22:06] find the main steps the main phases of the
[22:08] SQL project that we will do and as well
[22:10] all the tasks that we will be doing
[22:12] together during the projects and now we
[22:15] have links to the project tools so if
[22:17] you don't have it already go and
[22:18] download the SQL Server Express so it's
[22:21] like a server that is going to run locally
[22:22] on your PC where your database is going to
[22:24] live another one that you have to
[22:26] download is the SQL Server Management
[22:27] Studio it is just a client in order to
[22:30] interact with the database and there
[22:32] we're going to run all our queries and
[22:34] then a link to GitHub and as well a link
[22:36] to draw.io if you don't have it
[22:38] already go and download it it is free
[22:40] and amazing tool in order to draw
[22:42] diagrams so through the project we will
[22:44] be drawing data models the data
[22:46] architecture a data lineage so a lot of
[22:49] stuff we'll be doing using this tool so
[22:51] go and download it and the last thing it
[22:53] is nice to have you have a link to the
[22:55] Notion where you can go and create of
[22:57] course a free account if you want
[22:59] to build the project plan and as well
[23:01] Follow Me by creating the project steps
[23:04] and the project tasks okay so that's all
[23:06] those are all the links for the projects
[23:08] so go and download all those stuff
[23:10] create the accounts and once you are
[23:12] ready then we continue with the
[23:17] projects all right so now I hope that
[23:19] you have downloaded all the tools and
[23:21] created the accounts now it's time to
[23:23] move to a very important step that
[23:25] almost all people skip while doing
[23:27] projects and that is creating
[23:30] the project plan and for that we will be
[23:32] using the tool Notion Notion is of
[23:34] course a free tool and it can help you to
[23:36] organize your ideas your plans and
[23:39] resources all in one place I use it very
[23:42] intensively for my private projects like
[23:44] for example creating this course and I
[23:46] can tell you creating a project plan is
[23:47] the key to success creating a data
[23:49] warehouse project is usually very
[23:51] complex and according to Gartner reports
[23:54] over 50% of data warehouse projects fail
[23:57] and my opinion about any complex project
[24:00] the key to success is to have a clear
[24:02] project plan so now at this phase of the
[24:04] project we're going to go and create a
[24:06] rough project plan because at the moment
[24:09] we don't have yet clear understanding
[24:11] about the data architecture so let's go
[24:13] okay so now let's create a new page and
[24:15] let's call it data warehouse projects
[24:16] the first thing is that we have to go
[24:18] and create the main phases and stages of
[24:21] the projects and for that we need a
[24:23] table so in order to do that hit slash
[24:25] and then type database inline and then
[24:28] let's go and call it something like data
[24:31] warehouse epics and we're going to go and
[24:33] hide it because I don't like it and then
[24:35] on the table we can go and rename it
[24:37] like for example project epics something
[24:40] like that and now what we're going to do
[24:42] we're going to go and list all the big
[24:43] tasks of the project so an epic is
[24:45] usually like a large task that needs a
[24:47] lot of effort in order to solve it so
[24:49] you can call it epics stages phases of
[24:52] the project whatever you want so we're
[24:53] going to go and list our project steps
[24:56] so it starts with the requirements
[24:58] analysis and then designing the data
[25:02] architecture and another one we have the
[25:05] project
[25:07] initialization so those are the three
[25:09] big tasks in the project first and now
[25:11] what do we need we need another table
[25:13] for the small chunks of the tasks the
[25:15] subtasks and we're going to do the same
[25:16] thing so we're going to go and hit slash
[25:18] and we're going to search for the table
[25:20] inline and we're going to do the same
[25:21] thing so first we're going to call it
[25:23] data warehouse tasks and then we're
[25:25] going to hide it and over here we're
[25:27] going to rename it and say this is the
[25:30] project tasks so now what we're going to
[25:32] do we're going to go to the plus icon
[25:33] over here and then search for relation
[25:36] this one over here with the arrow and
[25:38] now we're going to search for the name
[25:39] of the first table so we called it data
[25:42] warehouse epics so let's go and click it
[25:45] and we're going to say as well two-way
[25:46] relation so let's go and add the
[25:49] relation so with that we got a field in the
[25:51] new table called data warehouse epics this
[25:53] comes from this table and as well we
[25:55] have here data warehouse tasks that
[25:57] comes from the below table so as
[25:59] you can see we have linked them together
[26:01] now what I'm going to do I'm going to
[26:02] take this to the left side and then what
[26:04] we're going to do we're going to go and
[26:05] select one of those epics like for
[26:07] example let's take design the data
[26:09] architecture and now what we're going to
[26:11] do we're going to go and break down this
[26:12] Epic into multiple tasks like for
[26:15] example choose data management approach
[26:19] and then we have another task what we're
[26:20] going to do we're going to go and select
[26:22] as well the same epic so maybe the next
[26:24] step is brainstorm and design the layers
[26:29] and then let's go to another epic for
[26:31] example the project initialization and
[26:33] we say over here for example create Git
[26:36] repo and prepare the structure we can go and
[26:39] make another one in the same epic let's
[26:42] say we're going to go and create the
[26:43] database and the schemas so as you can
[26:46] see I'm just defining the subtasks of
[26:48] those epics so now what we're going to
[26:50] do we're going to go and add a checkbox
[26:51] in order to understand whether we have
[26:53] done the task or not so we go to the
[26:55] plus and search for check we need the
[26:57] check box and what we're going to do
[26:59] we're going to make it really small like
[27:02] this and with that each time we are done
[27:04] with the task we're going to go and
[27:05] click on it just to make sure that we
[27:07] have done the task now there is one more
[27:09] thing that is not working really nicely
[27:11] and that is here we're going to have
[27:12] like a long list of tasks and it's
[27:14] really annoying so what we're going to
[27:16] do we're going to go to the plus over
[27:17] here and let's search for roll up so
[27:20] let's go and select it so now what we're
[27:21] going to do we have to go and select the
[27:23] relationship it's going to be that data
[27:24] warehouse task and after that we're
[27:26] going to go to the property and make it
[27:27] as the check box so now as you can see
[27:29] in the first table we are seeing how
[27:31] many tasks are closed but I don't want to
[27:33] show it like this what we're going to do
[27:35] we're going to go to the calculation and
[27:36] to the percent and then percent checked
[27:39] and with that we can see the progress of
[27:41] our project and now instead of the
[27:43] numbers we can have a really nice bar
[27:45] great so as well we can go and give it a
[27:47] name like progress so that's it and we
[27:49] can go and hide the data warehouse tasks
[27:52] and now with that we have a really nice
[27:53] progress bar for each epic and if we
[27:55] close all the tasks of this epic we can
[27:57] see that we have reached 100% so this is
[28:00] the main structure now we can go and add
[28:01] some cosmetics and rename stuff in order
[28:04] to make things look nicer like for
[28:06] example if I go to the tasks over here I
[28:08] can go and call it tasks and as well go
[28:11] and change the icon to something like
[28:13] this and if you'd like to have an icon
[28:15] for all those epics what we're going to do
[28:17] we're going to go to the Epic for
[28:18] example design data architecture and
[28:20] then if you hover on top of the title
[28:22] you can see add an icon and you can go
[28:24] and pick any icon that you want so for
[28:27] example this one and now as you can
[28:28] see we have defined it here in the top
[28:30] and the icon is going to be as well in the
[28:32] below table okay so now one more thing
[28:34] that we can do for the project tasks is
[28:36] that we can go and group them by the
[28:38] epics so if you go to the three dots and
[28:40] then we go to groups and then we can
[28:42] group by the epics and as you can see
[28:44] now we have like a section for each epic
[28:47] and you can go and sort the epics if you
[28:49] want if you go over here sort then
[28:51] manual and you can go over here and
[28:53] start sorting the epics as you want and
[28:56] with that you can expand and minimize
[28:58] each task if you don't want to see
[29:00] always all tasks in one go so this is a
[29:02] really nice way in order to build like
[29:04] data management for your projects of
[29:06] course in companies we use professional
[29:08] tools in order to do projects like for
[29:10] example Jira but for private personal
[29:12] projects that I do I always do it like
[29:15] this and I really recommend you to do it
[29:17] not only for this project for any
[29:18] project that you are doing because if you see
[29:20] the whole project in one go you can see
[29:22] the big picture and closing tasks and
[29:24] doing it like this these small things
[29:26] can make you really satisfied and keep
[29:28] you motivated to finish the whole
[29:30] project and make you proud okay friends
[29:33] so now I just went and added a few icons
[29:35] renamed stuff and as well added more tasks for
[29:38] each epic and this is going to be our
[29:40] starting point in the project and once
[29:42] we have more information we're going to
[29:43] go and add more details on how exactly
[29:46] we're going to build the data warehouse
[29:48] so at the start we're going to go and
[29:49] analyze and understand the requirements
[29:51] and only after that we're going to start
[29:53] designing the data architecture and here
[29:55] we have three tasks first we have to to
[29:58] choose the data management approach and
[30:00] after that we're going to do
[30:01] brainstorming and designing the layers
[30:03] of the data warehouse and at the end
[30:05] we're going to go and draw a data
[30:07] architecture so with that we have a clear
[30:10] understanding of how the data architecture
[30:11] looks like and after that we're going to
[30:13] go to the next epic where we're going to
[30:15] start preparing our projects so once we
[30:17] have a clear understanding of the data
[30:18] architecture the first task here is to
[30:20] go and create detailed project tasks so
[30:23] we're going to go and add more epics and
[30:25] more tasks and once we are done then
[30:27] we're going to go and create the naming
[30:29] conventions for the project just to make
[30:31] sure that we have rules and standards in
[30:33] the whole project and next we're going
[30:34] to go and create a repository in Git
[30:37] and we can prepare as well the
[30:38] structure of the repository so that we
[30:40] always commit our work there and then we
[30:42] can start with the first script where we
[30:44] can create a database and schemas so my
[30:47] friends this is the initial plan for the
[30:49] project now let's start with the first
[30:51] epic we have the requirements analysis
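
The plan built above is two linked tables, epics and tasks, with a checkbox on each task and a roll-up that shows the percentage of checked tasks per epic. As a rough illustration of that structure only (this is not how Notion works internally, and the class and method names are my own assumptions), the same idea can be sketched in Python:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    done: bool = False  # plays the role of the checkbox column


@dataclass
class Epic:
    name: str
    tasks: list[Task] = field(default_factory=list)  # the relation to tasks

    def progress(self) -> float:
        """Percent of checked tasks, similar to a roll-up set to percent checked."""
        if not self.tasks:
            return 0.0
        checked = sum(task.done for task in self.tasks)
        return 100.0 * checked / len(self.tasks)


# Epics and tasks as named in the walkthrough.
plan = [
    Epic("Requirements analysis"),
    Epic("Design data architecture", [
        Task("Choose data management approach"),
        Task("Brainstorm and design the layers"),
        Task("Draw the data architecture"),
    ]),
    Epic("Project initialization", [
        Task("Create detailed project tasks"),
        Task("Create naming conventions"),
        Task("Create Git repo and prepare the structure"),
        Task("Create the database and schemas"),
    ]),
]

# Closing one task moves the progress of its epic.
plan[1].tasks[0].done = True
for epic in plan:
    print(f"{epic.name}: {epic.progress():.0f}%")
```

Treating an epic without tasks as 0% is a choice made for this sketch; a tracker could just as well show it as empty.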
[30:58] now analyzing the requirements is very
[31:00] important to understand which type of
[31:02] data warehouse you're going to go and build
[31:03] because there is not only one
[31:05] standard on how to build it and if you
[31:07] go blindly implementing the data
[31:09] warehouse you might be doing a lot of
[31:11] stuff that is totally unnecessary and
[31:13] you will be burning a lot of time so
[31:15] that's why you have to sit with the
[31:17] stakeholders with the department and
[31:19] understand what exactly we have to build
[31:21] and depending on the requirements you
[31:23] design the shape of the data warehouse
[31:26] so now let's go and analyze the
[31:27] requirements of this project now the
[31:28] whole project is split into two main
[31:31] sections the first section we have to go
[31:33] and build a data warehouse so this is a
[31:35] data engineering task and we will go and
[31:38] develop ETLs and data warehouse and once
[31:41] we have done that we have to go and
[31:43] build analytics and reporting business
[31:45] intelligence so we're going to do data
[31:47] analysis but now first we will be
[31:49] focusing on the first part building the
[31:51] data warehouse so what do we have here
[31:53] the statement is very simple it says
[31:56] develop a modern data warehouse using
[31:58] SQL Server to consolidate sales data
[32:01] enabling analytical reporting and
[32:04] informed decision making so this is the
[32:06] main statement and then we have
[32:08] specifications the first one is about
[32:10] the data sources it says import data
[32:12] from two source systems ERP and CRM and
[32:15] they are provided as CSV files and now
[32:18] the second task is talking about the
[32:20] data quality we have to clean and fix
[32:22] data quality issues before we do the
[32:25] data analysis because let's be real
[32:27] there is no raw data that is perfect it is
[32:29] always messy and we have to clean that
[32:31] up now the next task is talking about
[32:33] the integration so it says we have to go
[32:35] and combine both of the sources into one
[32:38] single user-friendly data model that is
[32:41] designed for analytics and Reporting so
[32:44] that means we have to go and merge those
[32:45] two sources into one single data model
[32:48] and now we have here another
[32:49] specification it says focus on the
[32:51] latest data sets so there is no need for
[32:54] historization so that means we don't
[32:56] have to go and build histories in the
[32:57] database and the final requirement
[32:59] is talking about the documentation so it
[33:01] says provide clear documentation of the
[33:03] data model so that means the last
[33:05] product of the data warehouse to support
[33:08] the business users and the analytical
[33:10] teams so that means we have to generate
[33:12] a manual that's going to help the users
[33:14] that makes lives easier for the
[33:16] consumers of our data so as you can see
[33:18] maybe these are very generic requirements
[33:20] but it has a lot of information already
[33:22] for you so it's saying that we have to
[33:24] use the platform SQL Server we have two
[33:26] source systems using the CSV files
[33:29] and it sounds like we really have bad
[33:31] data quality in the sources and as well
[33:33] it wants us to focus on building a
[33:35] completely new data model that is
[33:37] designed for reporting and it says we
[33:40] don't have to do historization and it is
[33:42] expected from us to generate
[33:44] documentation of the system so these
[33:46] are the requirements for the data
[33:48] engineering part where we're going to go
[33:49] and build a data warehouse that fulfills
[33:52] these requirements all right so with
[33:54] that we have analyzed the requirements
[33:56] and as well we have closed the first
[33:58] and easiest epic so we are done with this
[34:00] let's go and close it and now let's open
[34:02] another one here we have to design the
[34:05] data architecture and the first task is
[34:07] to choose data management approach so
[34:10] let's
[34:13] go now designing the data architecture
[34:16] it is exactly like building a house so
[34:19] before construction starts an architect
[34:21] is going to go and design a plan a
[34:23] blueprint for the house how the rooms
[34:25] will be connected how to make the house
[34:27] functional safe and wonderful and
[34:30] without this blueprint from the
[34:31] architect the builders might create
[34:33] something unstable inefficient or maybe
[34:35] unlivable the same goes for data
[34:37] projects a data architect is like a
[34:39] house architect they design how your
[34:41] data will flow integrate and be accessed
[34:44] so as data Architects we make sure that
[34:46] the data warehouse is not only
[34:48] functioning but also scalable and easy
[34:50] to maintain and this is exactly what we
[34:52] will do now we will play the role of the
[34:54] data architect and we will start
[34:56] brainstorming and designing the
[34:58] architecture of the data warehouse so
[35:00] now I'm going to show you a sketch in
[35:01] order to understand what are the
[35:03] different approaches in order to design
[35:05] a data architecture and this phase of
[35:07] the projects usually is very exciting
[35:09] for me because this is my main role in
[35:11] data projects I am a data architect and
[35:14] I discuss a lot of different projects
[35:16] where we try to find out the best design
[35:18] for the projects all right so now let's
[35:23] go now the first step of building a data
[35:26] architecture is to make a very important
[35:28] decision to choose between four major
[35:30] types the first approach is to build a
[35:33] data warehouse it is very suitable if
[35:35] you have only structured data and your
[35:37] business wants to build solid foundations
[35:39] for reporting and business intelligence
[35:42] and another approach is to build a data
[35:44] lake this one is way more flexible than
[35:47] a data warehouse where you can store not
[35:49] only structured data but as well semi
[35:51] and unstructured data we usually use
[35:54] this approach if you have mixed types of
[35:56] data like database tables logs images
[35:58] videos and your business wants to focus
[36:00] not only on reporting but as well on
[36:03] Advanced analytics or machine learning
[36:05] but it's not as organized as a data
[36:07] warehouse and data lakes if they are too
[36:09] unorganized can turn into a data
[36:12] swamp and this is where we need the next
[36:14] approach so the next one we can go and
[36:16] build a data lakehouse so it is like a
[36:18] mix between data warehouse and data lake
[36:21] you get the flexibility of having
[36:23] different types of data from the data
[36:25] Lake but you still want to structure and
[36:27] organize your data like we do in the data
[36:29] warehouse so you mix those two worlds
[36:31] into one and this is a very modern way
[36:33] on how to build data architectures and this
[36:35] is currently my favorite way of building
[36:37] data management systems now the last and
[36:39] very recent approach is to build a data
[36:41] mesh so this is a little bit different
[36:43] instead of having a centralized data
[36:45] management system the idea now in the
[36:47] data mesh is to make it decentralized
[36:49] you cannot have like one centralized
[36:51] data management system because always if
[36:53] you say centralized then it means
[36:55] bottleneck so instead you have multiple
[36:57] departments and multiple domains where
[36:59] each one of them is building a data
[37:01] product and sharing it with others so
[37:03] now you have to go and pick one of those
[37:05] approaches and in this project we will
[37:07] be focusing on the data warehouse so now
[37:09] the question is how to build the data
[37:11] warehouse well there are as well four
[37:13] different approaches on how to build it
[37:15] the first one is the Inmon approach so
[37:17] again you have your sources and the
[37:19] first layer you start with the staging
[37:21] where the raw data is landing and then
[37:23] the next layer you organize your data in
[37:25] something called Enterprise data
[37:27] Warehouse where you go and model the
[37:29] data using the third normal form it's
[37:32] about like how to structure and
[37:34] normalize your tables so you are
[37:36] building a new integrated data model
[37:38] from the multiple sources and then we go
[37:40] to the third layer it's called the data
[37:42] marts where you go and take like a small
[37:44] subset of the data warehouse and you
[37:46] design it in a way that is ready to be
[37:49] consumed by reporting and it focuses on
[37:51] only one topic like for example the
[37:53] customers sales or products and after
[37:56] that you go and connect your bi tool
[37:58] like Power BI or Tableau to the data marts
[38:00] so with that you have three layers to
[38:02] prepare the data before reporting now
[38:04] moving on to the next one we have the
[38:06] Kimball approach he says you know what
[38:08] building this Enterprise data warehouse
[38:10] it is wasting a lot of time so what we
[38:13] can do we can jump immediately from the
[38:15] stage layer to the final data marts
[38:18] because building this Enterprise data
[38:19] warehouse it is a big struggle and
[38:21] usually wastes a lot of time so he always
[38:23] wants you to focus on building the data
[38:26] marts as quickly as possible so it is a
[38:28] faster approach than Inmon but with
[38:30] time you might get chaos in the data
[38:32] marts because you are not always focusing
[38:34] on the big picture and you might be
[38:35] repeating the same transformations and
[38:37] integrations in different data marts so
[38:40] there is like a trade-off between
[38:42] speed and a consistent data warehouse now
[38:44] moving on to the third approach we have
[38:46] the Data Vault so we still have the
[38:48] stage and the data marts but it says we
[38:50] still need this central data warehouse
[38:53] in the middle but to this middle layer
[38:55] we're going to bring more standards and
[38:56] rules so it tells you to split this
[38:59] middle layer into two layers the raw
[39:01] vault and the business vault in the raw
[39:04] vault you have the original data but in
[39:06] the business Vault you have all the
[39:07] business rules and Transformations that
[39:09] prepare the data for the data marts so
[39:12] Data Vault is very similar to Inmon
[39:13] but it brings more standards and
[39:16] rules to the middle layer now I'm going
[39:18] to go and add a fourth one that I'm
[39:20] going to call Medallion architecture
[39:23] and this one is my favorite one because
[39:25] it is very easy to understand and to
[39:27] build so it says you're going to go and
[39:29] build three layers bronze silver and
[39:31] gold the bronze layer it is very similar
[39:33] to the stage but we have understood with
[39:35] time that the stage layer is very
[39:37] important because having the original
[39:39] data as it is is going to help a lot with
[39:42] traceability and finding issues then the
[39:44] next layer we have the silver layer it
[39:46] is where we do Transformations data
[39:48] cleansing but we don't apply yet any
[39:50] business rules now moving on to the last
[39:52] layer the gold layer it is as well very
[39:54] similar to the data marts but there we
[39:56] can build different types of objects
[39:58] not only for reporting but as well for
[40:00] machine learning for AI and for many
[40:03] different purposes so they are like
[40:05] business ready objects that you want to
[40:07] share as a data product so those are the
[40:10] four approaches that you can use in
[40:12] order to build a data warehouse so again
[40:14] if you are building a data architecture
[40:16] you have to specify which approach you
[40:18] want to follow so at the start we said
[40:20] we want to build a data warehouse and
[40:22] then we have to decide between those
[40:23] four approaches on how to build the data
[40:25] warehouse and in this project we will be
[40:27] using the Medallion architecture
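
To keep the four approaches apart, it can help to write down the layers each one puts between the sources and the consumers. The following is only a compact summary of what was described above, as a small Python sketch; the dictionary, the function name and the "sources" and "consumers" labels are my own assumptions for illustration:

```python
# Layers of each approach in load order, as described in the walkthrough.
APPROACHES = {
    "Inmon": ["staging", "enterprise data warehouse (3NF)", "data marts"],
    "Kimball": ["staging", "data marts"],
    "Data Vault": ["staging", "raw vault", "business vault", "data marts"],
    "Medallion": ["bronze", "silver", "gold"],
}


def describe(approach: str) -> str:
    """Return the data flow of one approach as a readable chain."""
    return " -> ".join(["sources", *APPROACHES[approach], "consumers"])


for name in APPROACHES:
    print(f"{name}: {describe(name)}")
```

Seen side by side, Kimball drops the middle layer that Inmon has, while Data Vault splits it in two.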
[40:29] so this is a very important question
[40:30] that you have to answer as the first
[40:32] step of building a data architecture all
[40:34] right so with that we have decided on
[40:36] the approach so we can go and Mark it as
[40:39] done the next step we're going to go and
[40:40] design the layers of the data
[40:46] warehouse now there is like not 100%
[40:50] standard way and rules for each layer
[40:52] what you have to do as a data architect
[40:54] you have to Define exactly what is the
[40:57] purpose of each layer so we start with
[40:59] the bronze layer so we say it is going to
[41:01] store raw and unprocessed data as it is
[41:04] from the sources and why we are doing
[41:06] that it is for traceability and debugging
[41:08] if you have a layer where you are
[41:10] keeping the raw data it is very
[41:11] important to have the data as it is from
[41:13] the sources because we can always go
[41:15] back to the bronze layer and investigate
[41:18] the data of a specific source if something
[41:20] goes wrong so the main objective is to
[41:22] have raw untouched data that's going to
[41:25] help you as a data engineer in
[41:27] analyzing the root cause of issues now
[41:29] moving on to the silver layer it is the
[41:31] layer where we're going to store clean
[41:33] and standardized data and this is the
[41:35] place where we're going to do basic
[41:37] transformations in order to prepare the
[41:39] data for the final layer now for the
[41:41] gold layer it is going to contain business
[41:43] ready data so the main goal here is to
[41:45] provide data that could be consumed by
[41:48] business users and analysts in order to
[41:50] build reporting and analytics so with
[41:52] that we have defined the main goal for
[41:54] each layer now next what I would like to
[41:56] do is to to define the object types and
[41:59] since we are talking about a data
[42:00] warehouse in database we have here
[42:02] generally two types either a table or a
[42:04] view so we are going for the bronze
[42:06] layer and the silver layer with tables
[42:08] but for the gold layer we are going with
[42:10] the views so the best practice says for
[42:12] the last layer in your data warehouse
[42:14] make it virtual using views it is going to
[42:17] give you a lot of flexibility and of course
[42:19] speed in order to build it since we
[42:20] don't have to make a load process for it
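
To make the difference between the object types concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, and every table, view and column name in it is a made-up assumption, not something taken from the project. The point is only that a view stores nothing, so it picks up new rows from the table underneath without any load step:

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bronze and silver are physical tables (hypothetical names and columns).
    CREATE TABLE bronze_sales (order_id INTEGER, amount TEXT);
    CREATE TABLE silver_sales (order_id INTEGER, amount REAL);

    -- Gold is a view: it is only a stored query, so nothing is loaded into it.
    CREATE VIEW gold_sales_summary AS
    SELECT COUNT(*) AS orders, SUM(amount) AS total_amount
    FROM silver_sales;
""")

conn.executemany("INSERT INTO silver_sales VALUES (?, ?)", [(1, 10.0), (2, 5.5)])
first = conn.execute("SELECT orders, total_amount FROM gold_sales_summary").fetchone()

# A new row in the silver table shows up in the gold view immediately.
conn.execute("INSERT INTO silver_sales VALUES (3, 4.5)")
second = conn.execute("SELECT orders, total_amount FROM gold_sales_summary").fetchone()

print(first, second)
```

The trade-off is that the query behind the view runs every time someone reads it, which is usually fine for small data sets like the ones in this project.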
[42:22] and now the next step is that we're
[42:24] going to go and Define the load method
[42:25] so in this project I have decided to go
[42:27] with the full load using the method of
[42:29] truncating and inserting it is just faster
[42:32] and way easier so we're going to say for
[42:33] the bronze layer we're going to go with
[42:34] the full load and you have to specify as
[42:36] well for the silver layer as well we're
[42:38] going to go with the full load and of
[42:39] course for the views we don't need any
[42:41] load process so each time you decide to
[42:43] go with tables you have to define the
[42:45] load method full load incremental
[42:47] load and so on now we come to the very
[42:49] interesting part the data
[42:51] transformations now for the bronze layer
[42:53] it is the easiest one about this topic
[42:55] because we don't have any
[42:56] transformations we have to commit
[42:58] ourselves to not touch the data do not
[43:01] manipulate it don't change anything so
[43:03] it's going to stay as it is if it comes
[43:05] bad it's going to stay bad in the bronze
[43:06] layer and now we come to the silver
[43:08] layer where we have the heavy lifting as
[43:10] we committed in the objective we have to
[43:12] make clean and standardized data and for
[43:14] that we have different types of
[43:16] Transformations so we have to do data
[43:18] cleansing data standardizations data
[43:20] normalizations we have to go and derive
[43:23] new columns and data enrichment so there
[43:25] are like a bunch of transformations
[43:27] that we have to do in order to prepare
[43:29] the data our Focus here is to transform
[43:32] the data to make it clean and following
[43:34] standards and try to push all business
[43:36] transformations to the next layer so
[43:38] that means in the gold layer we will be
[43:40] focusing on business transformations
[43:43] that are needed for the consumers for the
[43:44] use cases so what we do here we do data
[43:47] integration between source systems we do
[43:49] data aggregations we apply a lot of
[43:51] business logic and rules and we build a
[43:54] data model that is ready for for example
[43:56] business intelligence so here we do a lot of
[43:58] business Transformations and in the
[44:00] silver layer we do basic data
[44:02] Transformations so it is really here
[44:04] very important to make the fine
[44:06] decisions on what type of transformations
[44:09] to be done in each layer and make sure
[44:11] that you commit to those rules now the
[44:13] next aspect is about the data modeling
[44:15] in the bronze layer and the silver layer
[44:17] we will not break the data model that
[44:19] comes from the source system so if the
[44:21] source system delivers five tables we're
[44:23] going to have here like five tables and
[44:24] as well in the silver layer we will not
[44:26] go and denormalize or normalize or like
[44:29] make something new we're going to leave
[44:30] it exactly like it comes from the source
[44:32] system because what we're going to do
[44:34] we're going to build the data model in
[44:36] the gold layer and here you have to
[44:37] Define which data model you want to
[44:39] follow are you following the star schema
[44:41] the snowflake or are you just making
[44:43] aggregated objects so you have to go and
[44:45] make a list of all data model types
[44:47] that you're going to follow in the gold
[44:49] layer and at the end what you can
[44:50] specify in each layer is the target
[44:52] audience and this is of course a very
[44:54] important decision in the bronze layer
[44:55] you don't want to give access access to
[44:57] any end user it is really important to
[44:59] make sure that only data Engineers
[45:01] access the bronze layer it makes no
[45:03] sense for data analysts or data
[45:05] scientists to go to the bad data because
[45:08] you have a better version for that in
[45:10] the silver layer so in the silver layer
[45:12] of course the data Engineers have to
[45:13] have access to it and as well the
[45:15] data analysts and the data scientists and
[45:17] so on but still you don't give it to any
[45:19] business user that can't deal with the
[45:22] raw data model from the sources because
[45:24] for the business users you're going to
[45:26] have a better layer for them and that is the
[45:28] gold layer so the gold layer it is
[45:30] suitable for the data analysts and as
[45:33] well the business users because usually
[45:35] the business users don't have a deep
[45:36] knowledge on the technicality of the
[45:38] silver layer so if you are designing
[45:40] multiple layers you have to discuss all
[45:42] those topics and make a clear decision for
[45:45] each layer all right my friends so now
[45:47] before we proceed with the design I want
[45:49] to tell you a secret principle a concept
[45:51] that each data architect must know and
[45:54] that is the separation of concerns so
[45:57] what is that as you are designing an
[45:58] architecture you have to make sure to
[46:00] break down the complex system into
[46:03] smaller independent parts and each part
[46:05] is responsible for a specific task and
[46:08] here comes the magic the components of
[46:10] your architecture must not be duplicated
[46:13] so you cannot have two parts doing
[46:15] the same thing so the idea here is to
[46:17] not mix everything and this is one of
[46:20] the biggest mistakes in any big projects
[46:22] and I have seen that almost everywhere
[46:25] so a good data architect follows this
[46:27] concept this principle so for example if
[46:30] you are looking at our data architecture
[46:32] we have already done that so we have
[46:34] defined unique set of tasks for each
[46:36] layer so for example we have said in the
[46:38] silver layer we do data cleansing but in
[46:41] the gold layer we do business
[46:43] Transformations and with that you will
[46:45] not be allowed to do any business
[46:47] transformations in the silver layer and
[46:49] the same thing goes for the gold layer
[46:50] you don't do in the gold layer any data
[46:52] cleansing so each layer has its own
[46:54] unique tasks and the same thing goes for
[46:57] the bronze layer and the silver layer you
[46:59] are not allowed to load data from the
[47:01] Source systems directly to the silver
[47:03] layer because we have decided the
[47:05] landing layer the first layer is the
[47:07] bronze layer otherwise you will have like
[47:09] a set of source systems that are loaded
[47:11] first to the bronze layer and another set
[47:14] is skipping the layer and going to the
[47:16] silver and with that we have overlapping
[47:18] you are doing data ingestion in two different
[47:20] layers so my friends if you have this
[47:23] mindset separation of concerns I
[47:25] promise you you're going to be a data
[47:27] architect so think about it all right my
[47:29] friends so with that we have designed
[47:31] the layers of the data warehouse we can
[47:33] go and close it the next step we're
[47:34] going to go to Draw.io and start drawing
[47:37] the data
[47:41] architecture so there is like no one
[47:43] standard on how to build a data
[47:45] architecture you can add your style and
[47:47] the way that you want so now the first
[47:49] thing that we have to show in data
[47:50] architecture is the different layers
[47:52] that we have the first layer is the
[47:54] source system layer so let's go and take
[47:56] a box like this and make it a little bit
[47:58] bigger and I'm just going to go and make
[48:00] the design so I'm going to remove the
[48:01] fill and make the line dotted one and
[48:04] after that I'm going to go and change
[48:05] maybe the color to something like this
[48:07] gray so now we have like a container for
[48:10] the first layer and then we have to go
[48:11] and add like a text on top of it so what
[48:13] I'm going to do I'm going to take
[48:15] another box let's go and type inside it
[48:17] sources and I'm going to go and style it
[48:19] so I'm going to go to the text and make
[48:20] it maybe 24 and then remove the lines
[48:24] like this make it a little bit smaller
[48:26] and put it on top so this is the first
[48:29] layer this is where the data comes from
[48:31] and then the data is going to go inside a
[48:33] data warehouse so I'm just going to go
[48:34] and duplicate this one this one is the
[48:37] data
[48:41] warehouse all right so now the third
[48:43] layer what is going to be it's going to
[48:45] be the consumers who will be consuming
[48:47] this data warehouse so I'm going to put
[48:50] another box and say this is the consume
[48:53] layer okay so those are the three
[48:55] containers now inside the data warehouse
[48:57] we have decided to build it using the
[48:59] Medallion architecture so we're going to
[49:01] have three layers inside the warehouse
[49:03] so I'm going to take again another box
[49:06] I'm going to call this one this is the
[49:08] bronze layer and now we have to go and
[49:11] put a design for it so I'm going to go
[49:12] with this color over here and then the
[49:14] text and maybe something like 20 and
[49:17] then make it a little bit smaller and
[49:19] just put it here and beneath that we're
[49:22] going to have the component so this is
[49:24] just a title of a container so I'm going
[49:26] to have it like this this remove the
[49:27] text from inside it and remove the
[49:30] filling so this container is for the
[49:33] bronze layer let's go and duplicate it
[49:35] for the next one so this one going to be
[49:38] the silver
[49:39] layer and of course we can go and change
[49:41] the coloring to gray because it is
[49:43] silver and as well the lines and remove
[49:46] the filling great and now maybe I'm
[49:48] going to make the font as bold all right
[49:51] now the third layer going to be the gold
[49:54] layer and we have to go and pick a
[49:56] color for that so style and here we have
[49:59] like something like yellow the same
[50:01] thing for the container I remove the
[50:03] filling so with that we are showing now
[50:05] the different layers inside our data
[50:07] warehouse now those containers are empty
[50:09] what we're going to do we're going to go
[50:10] inside each one of them and start adding
[50:11] contents so now in the sources it is
[50:14] very important to make it clear what are
[50:16] the different types of source system
[50:18] that you are connecting to the data
[50:19] warehouse because in real project there
[50:21] are like multiple types you might have a
[50:23] database API files Kafka and here it's
[50:26] important to show those different types
[50:28] in our projects we have folders and
[50:29] inside those folders We have CSV files
[50:32] so now what you have to do we have to
[50:33] make it clear in this layer that the
[50:35] input for our project is CSV file so it
[50:38] really depend how you want to show that
[50:40] I'm going to go over here and say maybe
[50:42] folder and then I'm going to go and take
[50:43] the folder and put it here inside and
[50:45] then maybe search for file more results
[50:48] and go pick one of those icons for
[50:50] example I'm going to go with this one
[50:51] over here so I'm going to make it
[50:53] smaller and add it on top of the folder
[50:55] so with that we make it clear for
[50:57] everyone seeing the architecture that
[50:59] the sources is not a database is not an
[51:02] API it is a file inside the folder so
[51:05] now very important here to show is the
[51:07] source systems what are the sources that
[51:09] is involved in the project so here what
[51:10] we're going to do we're going to go and
[51:11] give it a name for example we have one
[51:13] source called CRM B like this and maybe
[51:16] make the icon and we have another source
[51:18] called ERP so we're going to go and
[51:20] duplicate it put it over here and then
[51:22] rename it ERP so now it is clear for
[51:25] everyone we have two sources for this
[51:26] project and the technology used is
[51:28] simply a file so now what we can do as
[51:30] well we can go and add some descriptions
[51:32] inside this box to make it more clear so
[51:34] what I'm going to do I'm going to take a
[51:35] line because I want to split the
[51:37] description from the icons something
[51:39] like this and make it gray and then
[51:41] below it we're going to go and add some
[51:43] text and we're going to say is CSV file
[51:47] and the next point and we can say the
[51:49] interface is simply files in folder and
[51:53] of course you can go and add any
[51:55] specifications and explanation about the
[51:57] sources if it is a database you can see
[51:59] the type of the database and so on so
[52:01] with that we made it in the data
[52:02] architecture clear what are the sources
[52:04] of our data warehouse and now the next
[52:06] step what we're going to do we're going
[52:07] to go and design the content of the
[52:09] bronze silver and gold so I'm going to
[52:11] start by adding like an icon in each
[52:13] container to show that we
[52:15] are talking about a database so what we're
[52:17] going to do we're going to go and search
[52:18] for database and then more result more
[52:22] results I'm going to go with this icon
[52:24] over here so let's go and make it
[52:27] bigger something like this maybe change
[52:29] the color of that so we're going to have
[52:31] the bronze and as well here the silver
[52:34] and the gold so now what we're going to
[52:35] do we're going to go and add some arrows
[52:37] between those layers so we're going to
[52:39] go over here so we can go and search for
[52:41] Arrow and maybe go and pick one of those
[52:43] let's go and put it here and we can go
[52:45] and pick a color for that maybe
[52:47] something like this and adjust it so now
[52:50] we can have this nice Arrow between all
[52:52] the layers just to explain the direction
[52:54] of our architecture right so we can read
[52:56] this from left to right and as well
[52:58] between the gold layer and the consume
[53:01] okay so now what I'm going to do next
[53:02] we're going to go and add one statement
[53:05] about each layer the main objective so
[53:07] let's go and grab a text and put it
[53:09] beneath the database and we're going to
[53:11] say for example for the bronze layer it's
[53:12] going to be the raw data maybe make the
[53:15] text bigger so here we have the raw data and
[53:18] then the next one in the silver we have
[53:21] cleaned standardized data and then the last
[53:25] one for the gold we can say
[53:27] business
[53:28] ready data so with that we make the
[53:31] objective clear for each layer now below
[53:33] all those icons what we going to do
[53:34] we're going to have a separator again
[53:36] like this make it like colored and
[53:39] beneath it we're going to add the most
[53:40] important specifications of this layer
[53:43] so let's go and add those separators in
[53:45] each layer okay so now we need a text
[53:48] below it let's take this one here so
[53:50] what is the object type of the bronze
[53:53] layer it's going to be a table and we
[53:55] can go and add the load methods we say
[53:58] this is batch processing since we are
[54:01] not doing streaming we can say it is a
[54:03] full load we are not doing incremental
[54:05] load so we can say here
[54:08] truncate and insert and then we add one more
[54:12] section maybe about the Transformations
[54:14] so we can say no
[54:16] Transformations and one more about the
[54:18] data model we're going to say none as is
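The bronze-layer specification just listed (batch processing, full load, truncate and insert, no transformations) could look roughly like this in T-SQL. This is only a sketch under assumptions: the table name `bronze.crm_customer_info` and the file path are hypothetical placeholders, and the source file is assumed to be comma-separated with a single header row.

```sql
-- Full load, step 1: empty the target table so no old rows remain
TRUNCATE TABLE bronze.crm_customer_info;   -- hypothetical table name

-- Full load, step 2: reload everything from the source file, as is
BULK INSERT bronze.crm_customer_info
FROM 'C:\path\to\source_crm\customer_info.csv'   -- hypothetical path, read by the SQL Server machine
WITH (
    FIRSTROW = 2,            -- assumes the first row holds column headers
    FIELDTERMINATOR = ',',   -- assumes a comma-separated file
    TABLOCK                  -- lock the whole table for the duration of the load
);

-- Row count after the load, to compare against the source file
SELECT COUNT(*) AS row_count
FROM bronze.crm_customer_info;
```

Because the table is emptied before every load, rerunning the script does not create duplicates, which is what makes a full load simpler than an incremental one.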
[54:22] and now what I'm going to do I'm going
[54:23] to go and add those specifications as
[54:24] well for the silver and gold so here
[54:26] what we have discussed the object type
[54:28] the load process the
[54:29] Transformations and whether we are
[54:31] breaking the data model or not the same
[54:34] thing for the gold layer so I can say
[54:35] with that we have really nice layering
[54:38] of the data warehouse and what we are
[54:39] left is with the consumers over here you
[54:42] can go and add the different use cases
[54:43] and tools that can access your data
[54:45] warehouse like for example I'm adding
[54:47] here business intelligence and Reporting
[54:49] maybe using Power BI or Tableau or you can say
[54:52] you can access my data warehouse in
[54:53] order to do ad hoc analyses using SQL
[54:56] queries and this is what we're going to
[54:58] focus on in the project after we build the
[55:00] data warehouse and as well you can offer
[55:02] it for machine learning purposes and of
[55:04] course it is really nice to add some
[55:05] icons in your architecture and usually I
[55:07] use this nice website called Flaticon
[55:10] it has really amazing icons that you can
[55:12] go and use it in your architecture now
[55:14] of course we can go and keep adding
[55:15] icons and stuff to explain the data
[55:17] architecture and as well the system like
[55:19] for example it is very important here to
[55:21] say which tools you are using in order
[55:23] to build this data warehouse is it in
[55:25] the cloud are you using Azure
[55:27] Databricks or maybe Snowflake so we're going
[55:29] to go and add for our project the icon
[55:32] of SQL Server since we are building this
[55:34] data warehouse completely in the SQL
[55:36] Server so for now I'm really happy about
[55:38] it as you can see we have now a plan
[55:39] right all right guys so with that we
[55:41] have designed the data architecture
[55:43] using draw.io and with that we have
[55:45] done the last step in this epic and now
[55:47] with that we have a design for the data
[55:49] architecture and we can say we have
[55:51] closed this epic now let's go to the
[55:53] next one we will start doing the first
[55:55] step to prepare our projects and the
[55:57] first task here is to create a detailed
[55:59] project
[56:03] plan all right my friends so now it's
[56:05] clear for us that we have three layers
[56:07] and we have to go and build them so that
[56:09] means our big epic is going to be after
[56:11] the layers so here I have added three
[56:14] more epics so we have build bronze layer
[56:16] build silver layer and gold layer and
[56:19] after that I went and started defining all
[56:22] the different tasks that we have to
[56:24] follow in the project so at the start
[56:26] we will be analyzing then coding and after
[56:29] that we're going to go and do testing
[56:30] and once everything is ready we're going
[56:32] to go and document stuff and at the end
[56:34] we have to commit our work in the git
[56:36] repo all those epics are following the
[56:39] same like pattern in the tasks so as you
[56:41] can see now we have a very detailed
[56:43] project structure and now things are
[56:45] clearer for us how we're going to
[56:47] build the data warehouse so with that we
[56:49] are done with this task and now the next
[56:52] task we have to go and Define the naming
[56:54] Convention of the projects
[57:00] all right so now at this phase of the
[57:01] projects we usually Define the naming
[57:03] conventions so what is that it is a set of
[57:06] rules that you define for naming
[57:09] everything in the project whether it is
[57:10] a database schema tables stored
[57:13] procedures folders anything and if you
[57:16] don't do that at the early phase of the
[57:18] project I promise you chaos can happen
[57:21] because what going to happen you will
[57:22] have different developers in your
[57:23] projects and each of those developers
[57:25] have their own style of course so one
[57:27] developer might name a table dimension_customers
[57:30] where everything is lowercase
[57:32] with an underscore between the words and you have
[57:34] another developer creating another table
[57:36] called DimensionProducts but using the
[57:39] camel case so there is no separation
[57:41] between the words and the first
[57:42] character is capitalized and maybe
[57:44] another one using some prefixes like
[57:46] dim_categories so we have here like a
[57:49] shortcut of the dimension so as you can
[57:51] see there are different designs and
[57:53] styles and if you leave the door open
[57:55] what can happen in the middle of the
[57:56] projects you will notice okay everything
[57:58] looks inconsistent and you then have to define a
[58:01] big task to go and rename everything
[58:04] following a specific rule so instead of
[58:06] wasting all this time at this phase you
[58:09] go and Define the naming conventions and
[58:11] let's go and do that so we will start
[58:13] with a very important decision and that
[58:15] is which naming convention we going to
[58:17] follow in the whole project so you have
[58:19] different cases like the camel case the
[58:22] Pascal case the Kebab case and the snake
[58:25] case and for this project we're going to
[58:27] go with the snake case where all the
[58:30] letters of a word are going to be lowercase
[58:33] and the separation between words is going
[58:35] to be an underscore for example a table
[58:37] name called customer_info customer is
[58:40] lowercased info is as well lowercased
[58:42] and between them an underscore so this
[58:44] is always the first thing that you have
[58:46] to decide for your data project the
[58:48] second thing is to decide the language
[58:51] so for example I work in Germany and
[58:52] there is always like a decision that we
[58:54] have to make whether we use German or
[58:56] English so we have to decide for our
[58:58] project which language we're going to
[59:00] use and a very important general rule is
[59:03] to avoid reserved words so don't use a
[59:06] SQL reserved word as an object name
[59:08] like for example table don't name a
[59:10] table table so those are the
[59:13] general principles so those are the
[59:15] general rules that you have to follow in
[59:17] the whole project this applies for
[59:19] everything for tables columns stored
[59:21] procedures any names that you are giving
[59:23] in your scripts now moving on we have
[59:26] specifications for the table names and
[59:28] here we have different set of rules for
[59:30] each layer so here the rule says Source
[59:33] system underscore entity so we are saying
[59:35] all the tables in the bronze layer
[59:37] should start first with the source
[59:39] system name like for example CRM or ERP
[59:42] and after that we have an underscore and
[59:44] then at the end we have the entity name
[59:47] or the table name so for example we have
[59:49] this table name crm_customer_info so that means
[59:53] this table comes from the source system
[59:54] CRM and then we have the table name the
[59:57] entity name customer info so this is the
[59:59] rule that we're going to follow in
[01:00:00] naming all tables in the bronze layer then
[01:00:03] moving on to the silver layer it is
[01:00:05] exactly like the bronze because we are
[01:00:07] not going to rename anything we are not
[01:00:09] going to build any new data model so the
[01:00:12] naming going to be one to one like the
[01:00:14] bronze so it is exactly the same rules
[01:00:17] as the bronze but if we go to the gold
[01:00:19] here since we are building new data
[01:00:21] model we have to go and rename things
[01:00:24] and since as well we are integrating
[01:00:25] multiple sources together we will not be
[01:00:28] using the source system name in the
[01:00:30] tables because inside one table you
[01:00:32] could have multiple sources so the rule
[01:00:34] says all the names must be meaningful
[01:00:36] business aligned names for the tables
[01:00:39] starting with the category prefix so
[01:00:41] here the rule says it starts with
[01:00:43] category then underscore and then entity
[01:00:46] now what is category we have in the gold
[01:00:48] layer different types of tables so we
[01:00:51] could build a table called a fact table
[01:00:53] another one could be a dimension a third
[01:00:55] type could be an aggregation or report
[01:00:58] so we have different types of tables and
[01:01:01] we can specify those types as a prefix
[01:01:03] at the start so for example we are
[01:01:05] seeing here fact_sales so the
[01:01:09] category is fact and the table name
[01:01:11] is sales and here I just made like a
[01:01:13] table with different types of patterns so
[01:01:15] we could have a dimension so we say it
[01:01:18] starts with dim_ for example
[01:01:20] dim_customers or dim_products and then we
[01:01:23] have another type called fact table so
[01:01:25] it starts with fact_ or an
[01:01:27] aggregated table where we have the first
[01:01:29] three characters like aggregating the
[01:01:31] customers or the sales monthly so as you
[01:01:34] can see as you are creating a naming
[01:01:35] convention you have first to make it
[01:01:37] clear what is the rule describe each
[01:01:40] part of the rule and start giving
[01:01:42] examples so with that we make it clear
[01:01:44] for the whole team which names they
[01:01:46] should follow so we talked here about
[01:01:48] the table naming convention then you can
[01:01:50] as well go and make naming convention
[01:01:52] for the columns like for example in the
[01:01:54] gold layer we're going to go and have
[01:01:56] surrogate keys so we can define it like
[01:01:58] this the surrogate key should start with the
[01:02:00] table name and then underscore key
[01:02:02] like for example we can call it
[01:02:04] customer_key it is a surrogate key in
[01:02:07] the dimension customers the same thing
[01:02:09] for technical columns as a data engineer
[01:02:11] we might add our own columns to the
[01:02:13] tables that don't come from the source
[01:02:15] system and those columns are the
[01:02:17] technical columns or sometimes we call
[01:02:19] them metadata columns now in order to
[01:02:21] separate them from the original columns
[01:02:24] that comes from the source system
[01:02:26] we can have like a prefix for that like
[01:02:28] for example the rule says if you are
[01:02:30] building any technical or metadata
[01:02:32] columns the column should start with
[01:02:35] dw_ and then the column name for
[01:02:37] example if you want the metadata load
[01:02:39] date we can have
[01:02:41] dw_load_date so with that if anyone
[01:02:44] sees that column starts with DW we
[01:02:47] understand this data comes from a data
[01:02:49] engineer and we can keep adding rules
[01:02:51] like for example the stored procedure over
[01:02:53] here if you are making an ETL script
[01:02:55] then it should start with the
[01:02:56] prefix load_ and then the layer
[01:02:59] for example the stored procedure that is
[01:03:01] responsible for loading the bronze is going
[01:03:03] to be called load_bronze and for
[01:03:06] the silver load_silver so those
[01:03:08] are currently the rules for the stored
[01:03:11] procedure so this is how I do it usually
[01:03:12] in my projects all right my friends so
[01:03:14] with that we have solid naming
[01:03:16] conventions for our project so this is
[01:03:18] done and now next we're
[01:03:19] going to go to git and you will create a
[01:03:21] brand new repository and we're going to
[01:03:24] prepare its structure so let's go
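The naming rules described above can be illustrated with a few T-SQL objects. This is a minimal sketch, not the project's actual scripts: it assumes the `bronze` and `gold` schemas already exist, the column lists are invented for illustration, and the gold object is shown as a plain table only to keep the example short. The final query is one possible way to spot table names that break the snake_case rule.

```sql
-- Bronze (and silver) rule: <sourcesystem>_<entity>, all snake_case
CREATE TABLE bronze.crm_customer_info (
    customer_id   INT,                          -- illustrative source column
    first_name    NVARCHAR(50),                 -- illustrative source column
    dw_load_date  DATETIME2 DEFAULT GETDATE()   -- technical column: dw_<column_name>
);
GO

-- Gold rule: <category>_<entity>, e.g. dim_ for dimensions, fact_ for facts
CREATE TABLE gold.dim_customers (
    customer_key  INT,                          -- surrogate key: <table_name>_key
    customer_id   INT,
    first_name    NVARCHAR(50)
);
GO

-- Stored procedure rule: load_<layer>
CREATE PROCEDURE bronze.load_bronze AS
BEGIN
    PRINT 'Loading bronze layer';               -- placeholder body
END;
GO

-- Check: list tables whose names contain anything other than
-- lowercase letters, digits, or underscores (binary collation makes
-- the a-z range case-sensitive)
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME COLLATE Latin1_General_BIN LIKE '%[^a-z0-9_]%';
```

Running the last query periodically is a cheap way to catch names that drift from the convention before they spread through the project.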
[01:03:29] all right so now we come to another
[01:03:31] important step in any project and
[01:03:33] that's creating the git repository so
[01:03:35] if you are new to git don't worry about
[01:03:37] it it is simpler than it sounds so it's
[01:03:39] all about to have a safe place where you
[01:03:41] can put your codes that you are
[01:03:43] developing and you will have the
[01:03:44] possibility to track everything happen
[01:03:46] to the codes and as well you can use it
[01:03:48] in order to collaborate with your team
[01:03:50] and if something goes wrong you can
[01:03:52] always roll back and the best part here
[01:03:54] once you are done with the project you
[01:03:55] can share your repository as a part of
[01:03:57] your portfolio and it is really amazing
[01:03:59] thing if you are applying for a job by
[01:04:01] showcasing your skills that you have
[01:04:03] built a data warehouse by using a well
[01:04:05] documented git repository so now let's go
[01:04:07] and create the repository of the project
[01:04:10] now we are at the overview of our
[01:04:11] account so the first thing that you have
[01:04:13] to do is to go to the repositories over
[01:04:15] here and then we're going to go to this
[01:04:17] green button and click on new the first
[01:04:19] thing that we have to do is to give
[01:04:21] the repository name so let's call it SQL data
[01:04:24] warehouse project and then here we can
[01:04:27] go and give it a description so for
[01:04:29] example I'm saying building a modern
[01:04:31] data warehouse with SQL Server now the
[01:04:33] next option whether you want to make it
[01:04:35] public or private I'm going to leave it
[01:04:37] as public and then let's go and add
[01:04:39] here a readme file and then here about
[01:04:42] the license we can go over here and
[01:04:43] select the MIT MIT license gives
[01:04:46] everyone the freedom of using and
[01:04:48] modifying your code okay so I think I'm
[01:04:51] happy with the setup let's go and create
[01:04:53] the repository and with that we have
[01:04:55] our brand new repository now the next step
[01:04:58] that I usually do is to create the
[01:05:00] structure of the repository and usually I
[01:05:02] always follow the same patterns in any
[01:05:04] projects so here we need few folders in
[01:05:07] order to put our files right so what I
[01:05:09] usually do I go over here to add file
[01:05:11] create a new file and I start creating
[01:05:14] the structure over here so the first
[01:05:15] thing is that we need data sets then
[01:05:18] slash and with that the repository can
[01:05:20] understand this is a folder not a file
[01:05:22] and then you can go and add anything
[01:05:24] like here placeholder just an empty file
[01:05:28] this is just to help me to create the
[01:05:30] folders so let's go and commit so commit
[01:05:32] the changes and now if you go back to
[01:05:34] the main projects you can see now we
[01:05:36] have a folder called data sets so I'm
[01:05:38] going to go and keep creating stuff so I
[01:05:41] will go and create the documents
[01:05:44] placeholder commit the changes and then
[01:05:46] I'm going to go and create the scripts
[01:05:49] Place
[01:05:51] holder and the final one what I usually
[01:05:54] add is the the
[01:05:56] tests something like this so with that
[01:06:00] as you can see now we have the main
[01:06:01] folders of our repository now what I
[01:06:04] usually do next is that I'm going
[01:06:05] to go and edit the main readme so you
[01:06:08] can see it over here as well so what
[01:06:09] we're going to do we're going to go
[01:06:10] inside the read me and then we're going
[01:06:12] to go to the edit button here and we're
[01:06:14] going to start writing the main
[01:06:16] information about our project this is
[01:06:18] really depends on your style so you can
[01:06:19] go and add whatever you want this is the
[01:06:22] main page of your repository and now as
[01:06:25] you can see the file name here ends with .md it
[01:06:28] stands for markdown it is just an easy
[01:06:31] and friendly format in order to write a
[01:06:33] text so if you have like documentations
[01:06:35] you are writing a text it is a really
[01:06:37] nice format in order to organize it
[01:06:39] structure it and it is very friendly so
[01:06:41] what I'm going to do at the start I'm
[01:06:43] going to give a short description about
[01:06:45] the project so we have the main title
[01:06:47] and then we have like a welcome message
[01:06:49] and what this repository is about and in
[01:06:51] the next section maybe we can start with
[01:06:53] the project requirements and then maybe
[01:06:55] at the end you can say few words about
[01:06:58] the licensing and few words about you so
[01:07:01] as you can see it's like the homepage of
[01:07:02] the project and the repository so once
[01:07:04] you are done we're going to go and
[01:07:06] commit the changes and now if you go to
[01:07:08] the main page of the repository you can
[01:07:10] see always the folder and files at the
[01:07:13] start and then below it we're going to
[01:07:15] see the information from the readme so
[01:07:17] again here we have the welcome statement
[01:07:19] and then the projects requirements and
[01:07:22] at the end we have the licensing and
[01:07:23] about me so my friends that's it
[01:07:25] we have now a repository and we have
[01:07:28] now the main structure of the projects
[01:07:30] and through the projects as we are
[01:07:31] building the data warehouse we're going
[01:07:33] to go and commit all our work in this
[01:07:35] repository nice right all right so with
[01:07:38] that we have now your repository ready
[01:07:41] and as we go in the projects we will be
[01:07:43] adding stuff to it so this step is done
[01:07:45] and now the last step finally we're
[01:07:47] going to go to the SQL server and we're
[01:07:49] going to write our first scripts where
[01:07:51] we're going to create a database and
[01:07:53] schemas
[01:07:58] all right now the first step is we have
[01:07:59] to go and create brand new database so
[01:08:02] now in order to do that first we have to
[01:08:04] switch to the database master so you can
[01:08:06] do it like this use master and semicolon
[01:08:09] and if you go and execute it now we are
[01:08:11] switched to the master database it is a
[01:08:13] system database in SQL Server where you
[01:08:15] can go and create other databases and
[01:08:17] you can see from the toolbar that we are
[01:08:19] now logged into the master database now
[01:08:22] the next step we have to go and create
[01:08:24] our new database so we're going to say
[01:08:25] create database and you can call it
[01:08:28] whatever you want so I'm going to go
[01:08:30] with data warehouse semicolon let's go
[01:08:33] and execute it and with that we have
[01:08:35] created our database let's go and check
[01:08:37] it from the object Explorer let's go and
[01:08:39] refresh and you can see our new data
[01:08:41] warehouse this is our new database
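The two statements narrated above can be written as the following short script. The transcript does not spell out the database name, so the identifier `DataWarehouse` is an assumption here; any valid name works.

```sql
-- Switch to the system database from which other databases are created
USE master;
GO

-- Create the new database (the name DataWarehouse is assumed)
CREATE DATABASE DataWarehouse;
GO

-- Optional check: the new database should appear in the catalog
SELECT name
FROM sys.databases
WHERE name = 'DataWarehouse';
```

The final query is an alternative to refreshing Object Explorer when you want to confirm the database exists from a script.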
[01:08:43] awesome right now to the next step we're
[01:08:45] going to go and switch to the new
[01:08:47] database so we're going to say use data
[01:08:50] warehouse and semicolon so let's go and
[01:08:53] switch to it and you can see now now we
[01:08:55] are logged into the data warehouse
[01:08:57] database and now we can go and start
[01:08:59] building stuff inside this data
[01:09:01] warehouse so now the first step that I
[01:09:03] usually do is I go and start creating
[01:09:05] the schemas so what is the schema think
[01:09:07] about it it's like a folder or a
[01:09:09] container that helps you to keep things
[01:09:12] organized so now as we decided in the
[01:09:14] architecture we have three layers bronze
[01:09:16] silver gold and now we're going to go
[01:09:17] and create for each layer a schema so
[01:09:20] let's go and do that we're going to
[01:09:21] start with the first one create schema
[01:09:24] and the first one is bronze so let's do
[01:09:27] it like this and a semicolon let's go
[01:09:29] and create the first schema nice so we
[01:09:32] have new schema let's go to our database
[01:09:34] and then in order to check the schemas
[01:09:36] we go to the security and then to the
[01:09:38] schemas over here and as you can see we
[01:09:40] have the bronze and if you don't find it
[01:09:42] you have to go and refresh the whole
[01:09:44] schemas and then you will find the new
[01:09:46] schema great so now we have the first
[01:09:48] schema now what we're going to do we're
[01:09:49] going to go and create the other two so
[01:09:51] I'm just going to go and duplicate it so
[01:09:53] the next one is going to be silver and
[01:09:55] the third one is going to be gold so
[01:09:57] let's go and execute those two together
[01:10:00] we will get an error and that's because
[01:10:02] we don't have the GO in between so
[01:10:05] after each command let's have a GO and
[01:10:07] now if I highlight the silver and gold
[01:10:10] and then execute it will work the
[01:10:12] GO in SQL Server is a batch separator so it
[01:10:15] tells SQL Server to first execute completely the
[01:10:18] first command before going to the next one
[01:10:20] so it is just a separator now let's go to
[01:10:22] our schemas refresh and now we can see
[01:10:25] as well we have the gold and the silver
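Put together, the schema step looks roughly like the sketch below (the database name `DataWarehouse` is again an assumption). `CREATE SCHEMA` has to be the only statement in its batch, which is why running two of them together fails without a separator. `GO` itself is not a T-SQL statement: it is a batch separator recognized by client tools such as SSMS and sqlcmd.

```sql
-- Work inside the new database (name assumed)
USE DataWarehouse;
GO

-- One schema per layer; each CREATE SCHEMA runs in its own batch
CREATE SCHEMA bronze;
GO

CREATE SCHEMA silver;
GO

CREATE SCHEMA gold;
GO

-- Optional check instead of refreshing Object Explorer
SELECT name
FROM sys.schemas
WHERE name IN ('bronze', 'silver', 'gold');
```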
[01:10:27] so with this we have now a database we
[01:10:29] have the three layers and we can start
[01:10:31] developing each layer
[01:10:37] individually okay so now let's go and
[01:10:39] commit our work in the git so now since
[01:10:41] it is a script and code we're going to
[01:10:43] go to the folder scripts over here and
[01:10:45] then we're going to go and add a new
[01:10:46] file let's call it init_database.sql and
[01:10:50] now we're going to go and paste our code
[01:10:52] over here so now I have done few
[01:10:54] modifications like for example before we
[01:10:56] create the database we have to check
[01:10:59] whether the database exists this is an
[01:11:01] important step if you are recreating the
[01:11:03] database otherwise if you don't do that
[01:11:05] you will get an error where it's going
[01:11:06] to say the database already exists
[01:11:09] so first it is checking whether the
[01:11:11] database exists and if so it drops it I have
[01:11:13] added few comments like here we are
[01:11:15] saying creating the data warehouse
[01:11:17] creating the schemas and now we have a
[01:11:19] very important step we have to go and
[01:11:21] add a header comment at the start of
[01:11:23] each script to be honest after 3 months
[01:11:26] from now you will not be remembering all
[01:11:28] the details of these scripts and adding
[01:11:30] a comment like this it is like a sticky
[01:11:32] note for you later once you visit this
[01:11:35] script again and it is as well very
[01:11:36] important for the other developers in
[01:11:38] the team because each time you open a
[01:11:40] scripts the first question going to be
[01:11:42] what is the purpose of this script
[01:11:44] because if you or anyone in the team
[01:11:46] open the file the first question going
[01:11:48] to be what is the purpose of these
[01:11:50] scripts why we are doing these stuff so
[01:11:53] as you can see here we have a comment
[01:11:54] saying this script creates a new data
[01:11:56] warehouse after checking if it already
[01:11:59] exists if the database exists it's going
[01:12:00] to drop it and recreate it and
[01:12:02] additionally it's going to go and create
[01:12:04] three schemas bronze silver gold so that
[01:12:07] it gives Clarity what this script is
[01:12:09] about and it makes everyone's life easier
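A version of the script with the existence check and a header comment might look like the sketch below. It is a reconstruction under assumptions, not the author's exact file: the database name `DataWarehouse` is assumed, the header wording is paraphrased from the narration, and the `SET SINGLE_USER WITH ROLLBACK IMMEDIATE` line is an addition that closes open connections so the drop does not fail. Since the script drops the database, the header also carries a warning.

```sql
/*
=============================================================
Create Database and Schemas
=============================================================
Script Purpose:
    Creates a new database named 'DataWarehouse' after checking
    whether it already exists. If it exists, it is dropped and
    recreated. The script then creates three schemas:
    bronze, silver, and gold.

WARNING:
    Running this script drops the entire 'DataWarehouse' database.
    All data in it is permanently deleted. Make sure you have a
    backup before running it.
*/

USE master;
GO

-- Drop the database if it already exists
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = 'DataWarehouse')
BEGIN
    -- Assumption: force other sessions off so the drop can proceed
    ALTER DATABASE DataWarehouse SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE DataWarehouse;
END;
GO

-- Create the data warehouse database
CREATE DATABASE DataWarehouse;
GO

USE DataWarehouse;
GO

-- Create one schema per layer
CREATE SCHEMA bronze;
GO

CREATE SCHEMA silver;
GO

CREATE SCHEMA gold;
GO
```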
[01:12:12] now the second reason why this is very
[01:12:14] important to add is that you can add
[01:12:16] warnings and especially for this script
[01:12:19] it is very important to add these notes
[01:12:21] because if you run these scripts what's
[01:12:22] going to happen it's going to go and
[01:12:24] destroy the whole database imagine
[01:12:26] someone open the script and run it
[01:12:28] imagine an admin open the script and run
[01:12:31] it in your database everything going to
[01:12:33] be destroyed and all the data will be
[01:12:35] lost and this going to be a disaster if
[01:12:37] you don't have any backup so with that
[01:12:39] we have a nice header comment and we have
[01:12:41] added few comments in our codes and now
[01:12:43] we are ready to commit our codes so
[01:12:46] let's go and commit it and now we have
[01:12:49] our scripts in the git as well and of
[01:12:51] course if you are doing any
[01:12:52] modifications make sure to update the
[01:12:54] changes in git okay my friends so
[01:12:57] with that we have an empty database and
[01:12:59] schemas and we are done with this task
[01:13:01] and as well we are done with the whole
[01:13:03] epic so we have completed the project
[01:13:05] initialization and now we're going to go
[01:13:07] to the interesting stuff we will go and
[01:13:09] build the bronze layer so now the first
[01:13:11] task is to analyze the source systems so
[01:13:14] let's
[01:13:17] go all right so now the big question is
[01:13:20] how to build the bronze layer so first
[01:13:22] thing first we do analyzing as you are
[01:13:24] developing anything you don't
[01:13:26] immediately start writing a code so
[01:13:28] before we start coding the bronze layer
[01:13:30] what we usually do is we have to
[01:13:32] understand the source system so what I
[01:13:34] usually do I make an interview with the
[01:13:36] source system experts and ask them many
[01:13:38] many questions in order to understand
[01:13:41] the nature of the source system that I'm
[01:13:43] connecting to the data warehouse and
[01:13:45] once you know the source systems then we
[01:13:47] can start coding and the main focus here
[01:13:49] is to do the data ingestion so that
[01:13:52] means we have to find a way on how to
[01:13:54] load the data from The Source into the
[01:13:56] data warehouse so it's like we are
[01:13:58] building a bridge between the source and
[01:14:01] our Target system the data warehouse and
[01:14:02] once we have the code ready the next
[01:14:04] step is we have to do data validation so
[01:14:07] here comes the quality control it is
[01:14:09] very important in the bronze layer to
[01:14:10] check the data completeness so that
[01:14:12] means we have to compare the number of
[01:14:15] Records between the source system and
[01:14:17] the bronze layer just to make sure we
[01:14:19] are not losing any data in between and
[01:14:21] another check that we will be doing is
[01:14:23] the schema checks and that's to make
[01:14:24] sure that the data is placed on the
[01:14:26] right position and finally we don't have
[01:14:28] to forget about documentation and
[01:14:31] committing our work in Git so this
[01:14:33] is the process that we're going to
[01:14:34] follow to build the bronze
[01:14:39] layer all right my friends so now before
[01:14:42] connecting any Source systems to our
[01:14:43] data warehouse we have to take a very
[01:14:45] important step which is to understand the
[01:14:48] sources so how I usually do it I set up
[01:14:50] a meeting with the source systems
[01:14:52] experts in order to interview them to
[01:14:54] ask them a lot of stuff about the source
[01:14:56] and gaining this knowledge is very
[01:14:57] important because asking the right
[01:14:59] question will help you to design the
[01:15:02] correct scripts in order to extract the
[01:15:03] data and to avoid a lot of mistakes and
[01:15:06] challenges and now I'm going to show you
[01:15:08] the most common questions that I usually
[01:15:10] ask before connecting anything okay so
[01:15:12] we start first by understanding the
[01:15:14] business context and the ownership so I
[01:15:16] would like to understand the story
[01:15:17] behind the data I would like to
[01:15:19] understand who is responsible for the
[01:15:20] data which IT department and so on and
[01:15:23] then it's nice to understand as well
[01:15:25] what business process it supports does
[01:15:27] it support the customer transactions the
[01:15:29] supply chain Logistics or maybe Finance
[01:15:32] reporting so with that you're going to
[01:15:34] understand the importance of your data
[01:15:35] and then I ask about the system and data
[01:15:38] documentation so having documentation
[01:15:40] from the source is your learning
[01:15:41] material about your data and it is going
[01:15:43] to save you a lot of time later when
[01:15:46] you are working and designing maybe new
[01:15:48] data models and as well I would like
[01:15:50] always to understand the data model for
[01:15:52] the source system and if they have like
[01:15:54] descriptions of the columns and the tables
[01:15:56] it's going to be nice to have the data
[01:15:58] catalog this can help me a lot in the
[01:15:59] data warehouse how I'm going to go and
[01:16:01] join the tables together so with that
[01:16:03] you get a solid foundations about the
[01:16:05] business context the processes and the
[01:16:07] ownership of the data and now in The
[01:16:09] Next Step we're going to start talking
[01:16:11] about the technicality so I would like
[01:16:13] to understand the architecture and as
[01:16:14] well the technology stack so the first
[01:16:16] question that I usually ask is how the
[01:16:19] source system is storing the data do we
[01:16:21] have the data on-prem like SQL
[01:16:23] Server Oracle or is it in the cloud like
[01:16:26] Azure AWS and so on and then once we
[01:16:29] understand that then we can discuss what
[01:16:31] are the integration capabilities like
[01:16:33] how I'm going to go and get the data do
[01:16:35] the source system offer apis maybe CFA
[01:16:38] or they have only like file extractions
[01:16:40] or they're going to give you like a
[01:16:42] direct connection to the database so
[01:16:44] once you understand the technology that
[01:16:46] you're going to use in order to extract
[01:16:47] the data then we're going to Deep dive
[01:16:49] into more technical questions and here
[01:16:51] we can understand how to extract the
[01:16:53] data from The Source system and and then
[01:16:55] load it into the data warehouse so the
[01:16:57] first things that we have to discuss
[01:16:58] with the experts can we do an
[01:17:00] incremental load or a full load and then
[01:17:03] after that we're going to discuss the
[01:17:04] data scope the historization do we need
[01:17:07] all data do we need only maybe 10 years
[01:17:09] of the data is the history already
[01:17:11] in the source system or should we build
[01:17:13] it in the data warehouse and so on and
[01:17:16] then we're going to go and discuss what
[01:17:18] is the expected size of the extracts are
[01:17:20] we talking here about megabytes
[01:17:22] gigabytes terabytes and this is very
[01:17:24] important to understand whether we have
[01:17:26] the right tools and platform to connect
[01:17:29] the source system and then I try to
[01:17:31] understand whether there are any data
[01:17:32] volume limitations like if you have some
[01:17:34] Old Source systems they might struggle a
[01:17:37] lot with performance and so on so if you
[01:17:39] have like an ETL that is extracting a large
[01:17:41] amount of data you might bring the
[01:17:43] performance down of the source system so
[01:17:45] that's why you have to try to understand
[01:17:47] whether there are any limitations for
[01:17:49] your extracts and as well other aspects
[01:17:51] that might impact the performance of The
[01:17:53] Source system this is very important if
[01:17:55] they give you an access to the database
[01:17:57] you have to be responsible that you are
[01:17:59] not bringing the performance of the
[01:18:01] database down and of course very
[01:18:03] important question is to ask about the
[01:18:05] authentication and the authorization
[01:18:07] like how you are going to go and access the
[01:18:08] data in the source system do you need
[01:18:10] any tokens keys passwords and so on so
[01:18:13] those are the questions that you have to
[01:18:14] ask if you are connecting a new source
[01:18:17] system to the data warehouse and once
[01:18:19] you have the answers for those questions
[01:18:21] you can proceed with the next steps to
[01:18:23] connect the sources to the data
[01:18:24] warehouse all right my friends so with
[01:18:26] that you have learned how to analyze
[01:18:28] new source systems that you want to
[01:18:30] connect to your data warehouse so this
[01:18:32] step is done and now we're going to go
[01:18:34] back to coding where we're going to
[01:18:35] write scripts in order to do the data
[01:18:37] ingestion from the CSV files to the bronze
[01:18:43] layer and let's have quick look again to
[01:18:46] our bronze layer specifications so we
[01:18:48] just have to load the data from the
[01:18:50] sources to the data warehouse we're
[01:18:52] going to build tables in the bronze
[01:18:53] layer we are doing a full load so that
[01:18:56] means we are truncating and then inserting
[01:18:58] the data there will be no data
[01:18:59] Transformations at all in the bronze
[01:19:01] layer and as well we will not be
[01:19:03] creating any data model so this is the
[01:19:05] specifications of the bronze layer all
[01:19:08] right now in order to create the DDL
[01:19:09] script for the bronze layer creating the
[01:19:11] tables of the bronze we have to
[01:19:13] understand the metadata the structure
[01:19:15] the schema of the incoming data and here
[01:19:18] either you ask the technical experts
[01:19:20] from the source system about this
[01:19:21] information or you can go and explore
[01:19:24] the incoming data and try to define the
[01:19:26] structure of your tables so now what
[01:19:28] we're going to do we're going to start
[01:19:29] with the First Source system the CRM so
[01:19:32] let's go inside it and we're going to
[01:19:33] start with the first table that customer
[01:19:35] info now if you open the file and check
[01:19:37] the data inside it you see we have a
[01:19:39] Header information and that is very good
[01:19:41] because now we have the names of the
[01:19:43] columns that are coming from the source
[01:19:45] and from the content you can Define of
[01:19:47] course the data types so let's go and do
[01:19:49] that first we're going to say create
[01:19:51] table and then we have to define the
[01:19:53] layer it's going to be the bronze and
[01:19:55] now very important we have to follow the
[01:19:56] naming convention so we start with the
[01:19:58] name of the source system it is the CRM
[01:20:01] underscore and then after that the table
[01:20:03] name from The Source system so it's
[01:20:05] going to be the cust_info so this is
[01:20:08] the name of our first table in the
[01:20:10] bronze layer then the next step we have
[01:20:11] to go and Define of course the columns
[01:20:14] and here again the column names in the
[01:20:15] bronze layer going to be one to one
[01:20:18] exactly like the source system so the
[01:20:20] first one going to be the ID and I will
[01:20:22] go with the data type integer then the
[01:20:24] next one going to be the key NVARCHAR
[01:20:27] and the length I will go with
[01:20:31] [Music]
[01:20:35] 50 and the last one going to be the
[01:20:38] create date it's going to be DATE so
[01:20:41] with that we have covered all the
[01:20:43] columns available from The Source system
[01:20:45] so let's go and check and yes the last
[01:20:47] one is the create date so that's it for
[01:20:49] the first table now semicolon of course
[01:20:51] at the end let's go and execute it and
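
As a rough sketch, the DDL dictated above might look like this. The table name follows the naming convention from the video (source system, underscore, source table name). The column names are hypothetical placeholders: the video only reads out an ID, a key and a create date, and later mentions first name, last name and a status, so the exact names, any further columns and their lengths are assumptions.

```sql
-- Sketch only: column names are assumed, not taken from the video.
-- In the bronze layer the columns should mirror the header row of the source CSV.
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,            -- customer ID (integer in the source)
    cst_key         NVARCHAR(50),   -- customer key
    cst_firstname   NVARCHAR(50),   -- assumed column
    cst_lastname    NVARCHAR(50),   -- assumed column
    cst_status      NVARCHAR(50),   -- assumed column
    cst_create_date DATE            -- last column in the file
);
```
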
[01:20:53] now we're going to go to the object
[01:20:54] Explorer over here refresh and we can
[01:20:57] see the first table inside our data
[01:20:59] warehouse amazing right so now next what
[01:21:01] you have to do is to go and create a DDL
[01:21:04] statement for each file for those two
[01:21:07] systems so for the CRM we need three
[01:21:10] DDLs and as well for the other system
[01:21:12] the ERP we have as well to create three
[01:21:15] DDLs for the three files so at the end
[01:21:17] we're going to have in the bronze layer
[01:21:19] six tables six DDLs so now pause the
[01:21:22] video go create those DDLs I will be
[01:21:24] doing the same as well and we will see
[01:21:26] you
[01:21:31] soon all right so now I hope you have
[01:21:33] created all those DDLs I'm going to
[01:21:34] show you what I have just created so the
[01:21:36] second table in the source CRM we have
[01:21:39] the product information and the third
[01:21:41] one is the sales details then we go to
[01:21:44] the second system and here we make sure
[01:21:46] that we are following the naming
[01:21:47] convention so first The Source system
[01:21:49] ERP and then the table name so the
[01:21:52] second system was really easy you can
[01:21:54] see we have only here like two columns
[01:21:55] and for the customers like only three
[01:21:58] and for the categories only four columns
[01:22:00] all right so after defining all of that
[01:22:02] of course we have to go and execute them
[01:22:04] so let's go and do that and then we go
[01:22:06] to the object Explorer over here refresh
[01:22:08] the tables and with that you can see we
[01:22:11] have six empty tables in the bronze
[01:22:13] layer and with that we have all the
[01:22:15] tables from the two Source systems
[01:22:17] inside our database but still we don't
[01:22:19] have any data and you can see our naming
[01:22:21] convention is really nice you see the
[01:22:23] first three tables come from the CRM
[01:22:26] source system and then the other three
[01:22:28] come from the ERP so we can see in the
[01:22:30] bronze layer the things are really
[01:22:31] split nicely and you can identify
[01:22:34] quickly which table belongs to which
[01:22:36] source system now there is something
[01:22:38] else that I usually add to the ddl
[01:22:40] script is to check whether the table
[01:22:42] exists before creating so for example
[01:22:45] let's say that you are renaming or you
[01:22:46] would like to change the data type of
[01:22:48] specific field if you just go and run
[01:22:51] this query you will get an error
[01:22:52] because the database is going to say we
[01:22:54] have already this table so in other
[01:22:56] databases you can say create or replace
[01:22:58] table but in SQL Server you have to
[01:23:00] go and build T-SQL logic so it is very
[01:23:03] simple first we have to go and check
[01:23:04] whether the object exists in the database
[01:23:06] so we say IF OBJECT_ID and then we have
[01:23:10] to go and specify the table name so
[01:23:12] let's go and copy the whole thing over
[01:23:15] here and make sure you get exactly the
[01:23:17] same name as the table name so there is
[01:23:19] like a space I'm just going to go and
[01:23:21] remove it and then we're going to go and
[01:23:22] Define the object type so going to be
[01:23:24] the U it stands for user it is the user
[01:23:27] defined tables so if this table is not
[01:23:30] null so this means the database did find
[01:23:32] this object in the database so what can
[01:23:35] happen we say go and drop the table so
[01:23:39] the whole thing again and semicolon so
[01:23:42] again if the table exists in the database
[01:23:44] is not null then go and drop the table
[01:23:47] and after that go and create it so now if
[01:23:49] you go and highlight the whole thing and
[01:23:52] then execute it it will be working so
[01:23:54] first drop the table if it exists then go
[01:23:57] and create the table from scratch now
[01:23:59] what you have to do is to go and add
[01:24:01] this check before creating any table
[01:24:04] inside our database so it's going to be
[01:24:06] the same thing for the next table and so
[01:24:08] on I went and added all those checks for
[01:24:11] each table and what happens if I go
[01:24:14] and execute the whole thing it is going to
[01:24:16] work so with that I'm recreating all the
[01:24:18] tables in the bronze layer from the
[01:24:20] scratch
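
The drop-and-recreate check described above can be sketched as follows. The table name comes from the video; the column list is shortened and its names are assumptions.

```sql
-- OBJECT_ID returns NULL when no object with that name and type exists.
-- 'U' restricts the lookup to user-defined tables.
IF OBJECT_ID('bronze.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE bronze.crm_cust_info;

-- Recreate the table from scratch (column names are hypothetical).
CREATE TABLE bronze.crm_cust_info (
    cst_id          INT,
    cst_key         NVARCHAR(50),
    cst_create_date DATE
);
```

On SQL Server 2016 and later, `DROP TABLE IF EXISTS bronze.crm_cust_info;` is a shorter alternative to the explicit `OBJECT_ID` check; the video uses the explicit form.
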
[01:24:25] now the method that we're going to use
[01:24:26] in order to load the data from the
[01:24:28] source to the data warehouse is the bulk
[01:24:30] insert bulk insert is a method of
[01:24:33] loading massive amounts of data very
[01:24:35] quickly from files like CSV files or
[01:24:38] maybe a text file directly into a
[01:24:41] database it is not like the classical
[01:24:43] normal inserts where it's going to go
[01:24:45] and insert the data row by row but
[01:24:47] instead the bulk insert is one operation
[01:24:50] that's going to load all the data in one
[01:24:52] go into the database and that's what
[01:24:54] makes it very fast so let's go and use
[01:24:56] this method okay so now let's start
[01:24:58] writing the script in order to load the
[01:25:00] first table in the source CRM so we're
[01:25:02] going to go and load the table customer
[01:25:04] info from the CSV file to the database
[01:25:07] table so the syntax is very simple we're
[01:25:09] going to start by saying BULK INSERT so
[01:25:12] with that SQL understands we are doing
[01:25:14] not a normal insert we are doing a bulk
[01:25:16] insert and then we have to go and
[01:25:17] specify the table name so it is
[01:25:21] bronze.crm_cust_info so now we have to
[01:25:24] specify the full location of the file
[01:25:27] that we are trying to load in this table
[01:25:29] so now what we have to do is to go and
[01:25:31] get the path where the file is stored so
[01:25:34] I'm going to go and copy the whole path
[01:25:36] and then add it to the bulk insert exactly
[01:25:38] like where the data exists so for me it
[01:25:41] is in csql data warehouse project data
[01:25:44] set in the source CRM and then I have to
[01:25:47] specify the file name so it's going to
[01:25:49] be the cust_info.csv you have to get
[01:25:53] it exactly like the path of your
[01:25:55] files otherwise it will not be working
[01:25:57] so after the path now we come to the
[01:25:59] WITH clause now we have to tell SQL
[01:26:01] Server how to handle our file so here
[01:26:04] comes the specifications there is a lot
[01:26:06] of stuff that we can Define so let's
[01:26:08] start with the very important one is the
[01:26:11] row header now if you check the content
[01:26:13] of our files you can see always the
[01:26:15] first row includes the Header
[01:26:17] information of the file so that
[01:26:19] information is actually not the data
[01:26:22] it's just the column names the actual data
[01:26:24] starts from the second row and we have
[01:26:27] to tell the database about this
[01:26:29] information so we're going to say first
[01:26:31] row is actually the second row so with
[01:26:34] that we are telling SQL to skip the
[01:26:37] first row in the file we don't need to
[01:26:39] load that information because we have
[01:26:40] already defined the structure of our
[01:26:43] table so this is the first
[01:26:44] specification the next one which is as
[01:26:47] well very important in loading any CSV
[01:26:49] file is the separator between fields the
[01:26:52] delimiter between fields so it really
[01:26:54] depends on the file structure that you
[01:26:55] are getting from the source as you can
[01:26:57] see all those values are split with a
[01:27:00] comma and we call this comma a file
[01:27:03] separator or a delimiter and I saw a lot
[01:27:05] of different CSVs like sometimes they use
[01:27:07] a semicolon or a pipe or a special
[01:27:09] character like a hash and so on so you
[01:27:11] have to understand how the values are
[01:27:13] split and in this file they are split
[01:27:15] by the comma and we have to tell SQL
[01:27:18] about this info it's very important so
[01:27:19] we are going to say FIELDTERMINATOR and then
[01:27:22] we're going to say it is the comma and
[01:27:25] basically those two pieces of information are
[01:27:26] very important for SQL in order to be
[01:27:28] able to read your CSV file now there are
[01:27:31] like many different options that you can
[01:27:33] go and add for example TABLOCK it is
[01:27:36] an option in order to improve the
[01:27:38] performance where you are locking the
[01:27:39] entire table while loading it so as SQL
[01:27:43] is loading the data to this table it is
[01:27:45] going to go and lock the whole table so
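
Put together, the statement described so far might look like the sketch below. The three options are the ones named in the video. The file path is a hypothetical placeholder and has to be replaced with the full path to your own copy of the file, as seen from the machine running SQL Server.

```sql
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\datasets\source_crm\cust_info.csv'  -- hypothetical path
WITH (
    FIRSTROW = 2,            -- skip the header row, data starts on row 2
    FIELDTERMINATOR = ',',   -- values in this file are separated by commas
    TABLOCK                  -- lock the whole table for the duration of the load
);
```
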
[01:27:48] that's it for now I'm just going to go
[01:27:49] and add the semicolon and let's go and
[01:27:51] insert the data from the file inside our
[01:27:53] bronze table let's execute it and now you
[01:27:55] can see SQL did insert around 18,000
[01:27:58] rows inside our table so it is working
[01:28:00] we just loaded the file into our
[01:28:02] database but now it is not enough to just
[01:28:04] write the script you have to test the
[01:28:06] quality of your bronze table especially
[01:28:09] if you are working with files so let's
[01:28:10] go and just do a simple select so from
[01:28:13] our new
[01:28:15] table and let's run it so now the first
[01:28:19] thing that I check is do we have data
[01:28:21] like in each column well yes as you can
[01:28:23] see we have data and the second thing is
[01:28:26] do we have the data in the correct
[01:28:28] column this is very critical as you are
[01:28:30] loading the data from a file to a
[01:28:32] database do we have the data in the
[01:28:33] correct column so for example here we
[01:28:35] have the first name which of course
[01:28:37] makes sense and here we have the last
[01:28:38] name but what could happen and this
[01:28:40] mistake happens a lot is that you find
[01:28:43] the first name information inside the
[01:28:45] key and as well you see the last name
[01:28:47] inside the first name and the status
[01:28:50] inside the last name so there is like
[01:28:51] shifting of the data and this data
[01:28:54] engineering mistake is very common if
[01:28:55] you are working with CSV files and there
[01:28:58] are like different reasons why it
[01:28:59] happens maybe the definition of your
[01:29:01] table is wrong or the field separator
[01:29:03] is wrong maybe it's not a comma it's
[01:29:05] something else or the separator is a bad
[01:29:08] separator because sometimes maybe in the
[01:29:10] keys or in the first name there is a
[01:29:12] comma and the SQL is not able to split
[01:29:15] the data correctly so the quality of the
[01:29:17] CSV file is not really good and there
[01:29:19] are many different reasons why you are
[01:29:21] not getting the data in the correct
[01:29:23] column but for now everything looks fine
[01:29:25] for us and the next step is that I go
[01:29:27] and count the rows inside this table so
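
The two manual checks described here (eyeballing the columns, then counting rows) can be sketched as two simple queries. The `TOP (100)` limit and the `row_count` alias are additions for convenience, not something shown in the video.

```sql
-- 1) Look at a sample: is every column populated, and does each value
--    sit in the column it belongs to (no shifted fields)?
SELECT TOP (100) *
FROM bronze.crm_cust_info;

-- 2) Completeness: compare this number with the number of lines
--    in the CSV file minus one (the header row is not loaded).
SELECT COUNT(*) AS row_count
FROM bronze.crm_cust_info;
```
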
[01:29:31] let's go and select that so we can see
[01:29:33] we have
[01:29:35] 18,490 and now what we can do we can go
[01:29:37] to our CSV file and check how many rows
[01:29:39] do we have inside this file and as you
[01:29:41] can see we have
[01:29:44] 18,490 we are almost there there is like
[01:29:46] one extra row inside the file and that's
[01:29:49] because of the header the first Header
[01:29:51] information is not loaded inside our
[01:29:53] table and that's why always in our
[01:29:55] tables we're going to have one less row
[01:29:57] than the original files so everything
[01:30:00] looks nice and we have done this step
[01:30:01] correctly now if I go and run it again
[01:30:04] what's going to happen we will get duplicates
[01:30:07] inside the bronze layer so now we have
[01:30:09] loaded the file like twice inside the
[01:30:11] same table which is not really correct
[01:30:14] the method that we have discussed is
[01:30:16] first to make the table empty and then
[01:30:18] load truncate and then insert in order to
[01:30:21] do that before the bulk insert what
[01:30:24] we're going to do we're going to say
[01:30:25] truncate table and then we're going to
[01:30:27] have our
[01:30:29] table and that's it with a semicolon so
[01:30:32] now what we are doing is first we are
[01:30:34] making the table empty and then we start
[01:30:37] loading from the scratch we are loading
[01:30:39] the whole content of the file inside the
[01:30:42] table and this is what we call full load
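
A minimal sketch of this full-load pattern for one table, again with a hypothetical file path:

```sql
-- Step 1: empty the table so a re-run does not create duplicates.
TRUNCATE TABLE bronze.crm_cust_info;

-- Step 2: reload the complete file.
BULK INSERT bronze.crm_cust_info
FROM 'C:\path\to\datasets\source_crm\cust_info.csv'  -- hypothetical path
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    TABLOCK
);
```

Running both statements together any number of times should leave the table with the same row count, which is the behavior checked next in the video.
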
[01:30:44] so now let's go and Mark everything
[01:30:46] together and execute and again if you go
[01:30:48] and check the content of the table you
[01:30:50] can see we have only 18,000 rows let's
[01:30:53] go and run it again the count of the
[01:30:56] bronze layer you can see we still have
[01:30:58] the 18,000 so each time you run this
[01:31:00] script now we are refreshing the table
[01:31:03] customer info from the file into the
[01:31:05] database table so we are refreshing the
[01:31:07] bronze layer table so that means if
[01:31:09] there are now any changes in the
[01:31:11] file they will be loaded to the table so
[01:31:14] this is how you do a full load in the
[01:31:16] bronze layer by truncating the table and
[01:31:19] then doing the inserts and now of course
[01:31:21] what we have to do is to pause the video
[01:31:23] and go and write the same script for
[01:31:25] all six files so let's go and do
[01:31:30] [Music]
[01:31:33] that okay back so I hope that you have
[01:31:35] as well written all those scripts so I
[01:31:37] have the three tables in order to load
[01:31:39] the First Source system and then three
[01:31:41] sections in order to load the Second
[01:31:43] Source system and as I'm writing those
[01:31:45] scripts make sure to have the correct
[01:31:47] path so for the Second Source system you
[01:31:49] have to go and change the path for the
[01:31:50] other folder and as well don't forget
[01:31:52] the table name on the bronze layer is
[01:31:54] different from the file name because we
[01:31:56] start always with the source system name
[01:31:59] with the files we don't have that so now
[01:32:00] I think everything is ready so
[01:32:03] let's go and execute the whole thing
[01:32:05] perfect awesome so everything is working
[01:32:08] let me check the messages so we can see
[01:32:10] from the message how many rows are
[01:32:12] inserted in each table and now of course
[01:32:14] the task is to go through each table and
[01:32:17] check the
[01:32:21] content so that means now we have a really
[01:32:23] nice script in order to load the bronze
[01:32:26] layer and we will use this script on a
[01:32:29] daily basis every day we have to run it
[01:32:31] in order to get a new content to the
[01:32:33] data warehouse and as you learned before
[01:32:35] if you have like a script of SQL that is
[01:32:38] frequently used what we can do we can go
[01:32:40] and create a stored procedure from those
[01:32:43] scripts so let's go and do that it's
[01:32:45] going to be very simple we're going to
[01:32:46] go over here and say create or alter
[01:32:49] procedure and now we have to define the
[01:32:52] name of the stored procedure I'm going to
[01:32:53] go and put it in the schema bronze
[01:32:55] because it belongs to the bronze layer
[01:32:58] so then we're going to go and follow the
[01:32:59] naming convention the stored procedure starts
[01:33:02] with load underscore and then the bronze
[01:33:04] layer so that's it about the name and
[01:33:06] then very important we have to define
[01:33:07] the begin and as well the end of our SQL
[01:33:10] statements so here is the beginning and
[01:33:13] let's go to the end and say this is the
[01:33:16] end and then let's go highlight
[01:33:18] everything in between and give it one
[01:33:20] push with tab so with that it is easier
[01:33:22] to read so now next what we're going to
[01:33:24] do we're going to go and execute it so
[01:33:25] let's go and create this stored procedure
[01:33:27] and now if you want to go and check your
[01:33:28] stored procedure you go to the database and
[01:33:31] then we have here a folder called
[01:33:32] programmability and then inside we have
[01:33:34] stored procedures so if you go and refresh
[01:33:36] you will see our new stored procedure
[01:33:38] let's go and test it so I'm going to go
[01:33:40] and have a new query and what we're going
[01:33:42] to do we're going to say EXECUTE
[01:33:45] bronze.load_bronze so let's go and execute it
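
Wrapping the load script in a stored procedure, as described above, might look like this sketch. Only one of the six tables is shown, and the file path is a hypothetical placeholder.

```sql
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    -- Full load of the first CRM table.
    TRUNCATE TABLE bronze.crm_cust_info;

    BULK INSERT bronze.crm_cust_info
    FROM 'C:\path\to\datasets\source_crm\cust_info.csv'  -- hypothetical path
    WITH (
        FIRSTROW = 2,
        FIELDTERMINATOR = ',',
        TABLOCK
    );

    -- ... repeat TRUNCATE + BULK INSERT for the other five bronze tables
END;
GO

-- Run the whole bronze load with a single call.
EXECUTE bronze.load_bronze;
```

`GO` is the batch separator used by SSMS and sqlcmd; it is needed here because `CREATE OR ALTER PROCEDURE` has to be the first statement in its batch.
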
[01:33:48] and with that we have just loaded
[01:33:50] completely the bronze layer so as you can
[01:33:53] see SQL did go and insert all the data
[01:33:55] from the files to the bronze layer it is
[01:33:57] way easier than each time running those
[01:34:00] scripts of course all right so now the
[01:34:01] next step is that as you can see the
[01:34:03] output message it is really not having a
[01:34:06] lot of information the message of your
[01:34:08] ETL with the stored procedure will not be
[01:34:10] really clear so that's why if you are
[01:34:12] writing an ETL script always take care
[01:34:15] of the messaging of your code so let me
[01:34:17] show you a nice design let's go back to
[01:34:19] our stored procedure so now what we can do
[01:34:21] we can go and divide the messages based
[01:34:24] on our code so now we can start with a
[01:34:26] message for example over here let's say
[01:34:27] print and we say what you are doing with
[01:34:30] this stored procedure we are loading the
[01:34:32] bronze layer so this is the main message
[01:34:35] the most important one and we can go and
[01:34:37] play with the separators like this so we
[01:34:39] can say print and now we can go and add
[01:34:41] some nice separators like for example
[01:34:43] the equals at the start and at the end
[01:34:46] just to have like a section so this is
[01:34:48] just a nice message at the start so now
[01:34:50] by looking to our code we can see that
[01:34:52] our code is split into two sections
[01:34:54] the first section we are loading all the
[01:34:56] tables from The Source system CRM and
[01:34:59] the second section is loading the tables
[01:35:01] from the Erp so we can split the prints
[01:35:04] by The Source system so let's go and do
[01:35:05] that so we're going to say print and
[01:35:08] we're going to say loading CRM tables
[01:35:12] this is for the first section and then
[01:35:13] we can go and add some nice separators
[01:35:16] like the one let's take the minus and of
[01:35:19] course don't forget to add semicolons
[01:35:21] like me so we are going to have a semicolon
[01:35:24] for each print same thing over here I
[01:35:27] will go and copy the whole thing because
[01:35:29] we're going to have it at the start and
[01:35:30] as well at the end let's go copy the
[01:35:32] whole thing for the second section so
[01:35:34] for the Erp it starts over here and
[01:35:37] we're going to have it like this and
[01:35:39] we're going to call it loading Erp so
[01:35:41] with that in the output we can see nice
[01:35:43] separation between loading each Source
[01:35:45] system now we go to the next step where
[01:35:47] we go and add like a print for each
[01:35:50] action so for example here we are
[01:35:53] truncating the table so we say print and
[01:35:55] now what we can do we can go and add two
[01:35:57] arrows and we say what we are doing so
[01:35:59] we are truncating the table and then we can
[01:36:02] go and add the table name in the message
[01:36:04] as well so this is the first action that
[01:36:06] we are doing and we can go and add
[01:36:08] another print for inserting the data so
[01:36:10] we can say inserting data into and then
[01:36:15] we have the table name so with that in
[01:36:17] the output we can understand what SQL is
[01:36:19] doing so let's go and repeat this for
[01:36:21] all other tables Okay so I just added
[01:36:24] all those prints and don't forget the
[01:36:25] semicolon at the end so I would say
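
The messaging layout described above (a main banner, one header per source system, one line per action) could be sketched like this. The exact message wording and separator lengths are assumptions, and only one table is shown.

```sql
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    PRINT '================================================';
    PRINT 'Loading Bronze Layer';
    PRINT '================================================';

    PRINT '------------------------------------------------';
    PRINT 'Loading CRM Tables';
    PRINT '------------------------------------------------';

    PRINT '>> Truncating Table: bronze.crm_cust_info';
    TRUNCATE TABLE bronze.crm_cust_info;

    PRINT '>> Inserting Data Into: bronze.crm_cust_info';
    BULK INSERT bronze.crm_cust_info
    FROM 'C:\path\to\datasets\source_crm\cust_info.csv'  -- hypothetical path
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

    -- ... same two PRINT lines around each remaining CRM table

    PRINT '------------------------------------------------';
    PRINT 'Loading ERP Tables';
    PRINT '------------------------------------------------';
    -- ... same pattern for the three ERP tables
END;
```
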
[01:36:28] let's go and execute it and check the
[01:36:30] output so let's go and do that and then
[01:36:32] maybe at the start just to have quick
[01:36:34] output execute our stored procedure like
[01:36:37] this so let's see now if you check the
[01:36:40] output you can see things are more
[01:36:42] organized than before so at the start we
[01:36:44] are reading okay we are loading the
[01:36:45] bronze layer now first we are loading
[01:36:48] the source system CRM and then the
[01:36:50] second section is for the Erp and we can
[01:36:52] see the actions so we are truncating inserting
[01:36:54] truncating inserting for each table and as
[01:36:57] well the same thing for the Second
[01:36:58] Source so as you can see it is nice and
[01:37:01] cosmetic but it's very important as you
[01:37:03] are debugging any errors and speaking of
[01:37:05] Errors we have to go and handle the
[01:37:07] errors in our stored procedure so let's go
[01:37:10] and do that it's going to be the first
[01:37:12] thing that we do we say begin try and
[01:37:14] then we go to the end of our scripts and
[01:37:17] we say before the last end we say end
[01:37:20] try and then the next thing we have to
[01:37:22] add the catch so we're going to say
[01:37:24] begin catch and end catch so now first
[01:37:28] let's go and organize our code I'm going
[01:37:30] to take the whole codes and give it one
[01:37:34] more push and as well the begin try so
[01:37:37] it is more organized and as you know the
[01:37:39] try and catch is going to go and execute
[01:37:41] the try and if there are any errors
[01:37:44] while executing this script the second
[01:37:47] section is going to be executed so the
[01:37:49] catch will be executed only if SQL
[01:37:51] fails to run that try so now what we
[01:37:54] have to do is to go and Define for SQL
[01:37:56] what to do if there's like an error in
[01:37:58] your code and here we can do multiple
[01:38:00] things like maybe creating a logging
[01:38:02] table and adding the messages inside this
[01:38:05] table or we can go and add some nice
[01:38:07] messaging to the output like for
[01:38:09] example we can go and add like a section
[01:38:11] again over here so again some equals signs and
[01:38:14] we can go and repeat it over here and
[01:38:17] then add some content in between so we
[01:38:19] can start with something like
[01:38:21] error occurred
[01:38:24] during loading bronze layer and then we
[01:38:27] can go and add many things like for
[01:38:29] example we can go and add the error
[01:38:33] message and here we can go and call the
[01:38:35] function
[01:38:36] ERROR_MESSAGE and we can go and add as
[01:38:40] well for example the error number so
[01:38:42] ERROR_NUMBER and of course the output of
[01:38:45] this is going to be a number but the error
[01:38:47] message here is text so we have to go
[01:38:49] and change the data type so we're going
[01:38:51] to do a cast as NVARCHAR like this
[01:38:55] and then there are many functions
[01:38:57] that you can add to the output like for
[01:38:59] example the error state and so on so
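A minimal sketch of the error handling described here, assuming SQL Server. The forced divide-by-zero is only a stand-in for a failing load step, and the separator width and message wording are illustrative rather than the exact text typed in the video.

```sql
BEGIN TRY
    PRINT 'Loading Bronze Layer';
    -- Stand-in for a failing load step: force an error so the CATCH block runs.
    SELECT 1 / 0 AS forced_error;
END TRY
BEGIN CATCH
    PRINT '==========================================';
    PRINT 'ERROR OCCURRED DURING LOADING BRONZE LAYER';
    -- ERROR_MESSAGE() already returns text, so it can be concatenated directly.
    PRINT 'Error Message: ' + ERROR_MESSAGE();
    -- ERROR_NUMBER() and ERROR_STATE() return integers, so cast them first.
    PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(10));
    PRINT 'Error State: ' + CAST(ERROR_STATE() AS NVARCHAR(10));
    PRINT '==========================================';
END CATCH;
```

Inside the stored procedure the same TRY and CATCH pair would wrap all the load statements, so any failing table load jumps straight to the messages in the CATCH block.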
[01:39:02] you can design what can happen if there
[01:39:03] is an error in the ETL now what else is
[01:39:06] very important in each ETL process is to
[01:39:09] add the duration of each like step so
[01:39:12] for example I would like to understand
[01:39:13] how long it takes to load this table
[01:39:16] over here but looking at the output I
[01:39:18] don't have any information about how long it is
[01:39:20] taking to load my tables and this is
[01:39:22] very important because as you
[01:39:24] are building like a big data warehouse
[01:39:26] the ETL process is going to take a long
[01:39:28] time and you would like to understand
[01:39:30] where is the issue where is the
[01:39:31] bottleneck which table is consuming a
[01:39:33] lot of time to be loaded so that's why
[01:39:35] we have to add that information as
[01:39:37] well to the output or even maybe
[01:39:39] log it in a table so let's go and
[01:39:41] add as well this step so we're going to
[01:39:43] go to the start and now in order to
[01:39:45] calculate the duration you need the
[01:39:47] starting time and the end time so we
[01:39:49] have to understand when we started
[01:39:51] loading and when we finished loading the
[01:39:53] table so now the first thing is we have
[01:39:55] to go and declare the variables so we're
[01:39:58] going to say declare and then let's make
[01:40:00] one called start time and the data type
[01:40:02] of this is going to be DATETIME I need
[01:40:04] exactly the second when it started and
[01:40:07] then another one for the end time so
[01:40:10] another variable end time and as well
[01:40:12] the same thing DATETIME so with that we
[01:40:14] have declared the variables and the next
[01:40:16] step is to go and use them so now let's
[01:40:18] go to the first table to the customer
[01:40:20] info and at the start we're going to say
[01:40:23] set
[01:40:23] start
[01:40:24] time equal to GETDATE so we will get
[01:40:28] the exact time when we start loading
[01:40:31] this table and then let's go and copy
[01:40:32] the whole thing and go to the end of
[01:40:34] loading over here so we're going to say
[01:40:37] set this time the end time equal as well
[01:40:40] to GETDATE so with that now we
[01:40:42] have the values of when we start loading
[01:40:45] this table and when we completed loading
[01:40:47] the table and now the next step is we
[01:40:49] have to go and print the duration
[01:40:52] information so over here we can go and
[01:40:54] say print and we can go and have
[01:40:56] again the same design so two arrows and
[01:40:58] we can say very simply load duration and
[01:41:01] then a colon and a space and now
[01:41:04] what we have to do is to calculate the
[01:41:06] duration and we can do that using the
[01:41:08] date and time function DATEDIFF in
[01:41:10] order to find the interval between two
[01:41:13] dates so we're going to say plus over
[01:41:15] here and then use DATEDIFF and here we
[01:41:17] have to define three arguments first one
[01:41:19] is the unit so you can define second
[01:41:21] minute hour and so on so we're going to
[01:41:23] go with second and then we're going to
[01:41:24] define the start of the interval it's
[01:41:26] going to be the start time and then the
[01:41:28] last argument is going to be the end of
[01:41:30] the boundary it's going to be the end
[01:41:32] time and now of course the output of
[01:41:34] this is going to be a number that's why we
[01:41:35] have to go and cast it so we're going to
[01:41:37] say cast as NVARCHAR and then we're
[01:41:40] going to close it like this and maybe at
[01:41:42] the end we're going to say plus space
[01:41:46] seconds in order to have a nice message
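The timing steps above can be sketched as follows, assuming SQL Server. The temporary table `#crm_cust_info`, its columns, and the sample rows are hypothetical stand-ins so the snippet runs on its own. In the procedure from the video the load step is a bulk insert from a CSV file into a permanent bronze table.

```sql
-- Variables that capture when the load of one table starts and ends.
DECLARE @start_time DATETIME, @end_time DATETIME;

-- Hypothetical stand-in for one bronze table.
CREATE TABLE #crm_cust_info (cst_id INT, cst_key NVARCHAR(50));

SET @start_time = GETDATE();

PRINT '>> Truncating Table: #crm_cust_info';
TRUNCATE TABLE #crm_cust_info;

PRINT '>> Inserting Data Into: #crm_cust_info';
-- A literal INSERT replaces the bulk insert so no file is needed here.
INSERT INTO #crm_cust_info (cst_id, cst_key)
VALUES (1, N'CUST-1'), (2, N'CUST-2');

SET @end_time = GETDATE();

-- DATEDIFF returns an integer, so cast it before concatenating.
PRINT '>> Load Duration: '
    + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR(10))
    + ' seconds';

DROP TABLE #crm_cust_info;
```

Because `DATEDIFF(second, ...)` counts whole second boundaries, a small local load like this one is expected to report a very low number.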
[01:41:48] so again what we have done we have
[01:41:49] declared the two variables and we are
[01:41:51] using them at the start we are
[01:41:53] getting the current date and time and at
[01:41:56] the end of loading the table we are
[01:41:57] getting the current date and time and
[01:41:59] then we are finding the differences
[01:42:01] between them in order to get the load
[01:42:03] duration and in this case we are just
[01:42:05] printing this information and now we can
[01:42:07] go of course and add some nice separator
[01:42:09] between each table so I'm going to go
[01:42:11] and do it like this just a few dashes not
[01:42:14] a lot of stuff so now what we have to do
[01:42:16] is to go and add this mechanism for each
[01:42:19] table in order to measure the speed of
[01:42:22] the ETL for each one of
[01:42:28] them okay so now I have added all those
[01:42:31] configurations for each table and let's
[01:42:34] go and run the whole thing now so let's
[01:42:37] go and update the stored procedure and
[01:42:40] we're going to go and run it so let's go
[01:42:42] and execute so now as you can see we
[01:42:44] have here one more info about the load
[01:42:46] durations and it is everywhere I can see
[01:42:49] we have zero seconds and that's because
[01:42:49] it is super fast at loading that
[01:42:50] information we are doing everything
[01:42:53] locally on the PC so loading the data from
[01:42:55] files to the database is going to be mega fast
[01:42:57] but of course in real projects you have
[01:43:00] like different servers and networking
[01:43:02] between them and you have millions of
[01:43:03] rows in the tables so of course the
[01:43:05] duration is not going to be like 0 seconds
[01:43:07] things are going to be slower and now you
[01:43:11] can see easily how long it takes to load
[01:43:14] each of your tables and now of course
[01:43:16] what is very interesting is to
[01:43:18] understand how long it takes to load the
[01:43:20] whole bronze layer so now your task is
[01:43:23] as well to print at the end
[01:43:25] information about the whole batch how
[01:43:27] long it took to load the bronze
[01:43:34] layer okay I hope we are done now I have
[01:43:37] done it like this we have to Define two
[01:43:40] new variables so the start time of the
[01:43:42] batch and the end time of the batch and
[01:43:44] the first step in the stored procedure is
[01:43:46] to get the date and time information
[01:43:49] for the first variable and exactly at
[01:43:51] the end the last thing that we do in the
[01:43:53] stored procedure we're going to go and
[01:43:55] get the date and time information for
[01:43:58] the end time so we say again set GETDATE
[01:44:01] for the batch end time and then all
[01:44:03] you have to do is to go and print a
[01:44:05] message so we are saying loading bronze
[01:44:07] layer is completed and then we are
[01:44:09] printing total load duration and the
[01:44:11] same thing with DATEDIFF
[01:44:13] between the batch start time and the end
[01:44:15] time and we are calculating the seconds
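A sketch of the batch-level timing, assuming SQL Server. The variable names `@batch_start_time` and `@batch_end_time` and the exact message layout are assumptions, and the per-table loads are left as a comment.

```sql
-- Variables for the duration of the whole bronze load.
DECLARE @batch_start_time DATETIME, @batch_end_time DATETIME;

-- First step of the procedure: remember when the batch started.
SET @batch_start_time = GETDATE();

-- ... the truncate and insert steps for every bronze table would run here ...

-- Last step of the procedure: remember when the batch ended.
SET @batch_end_time = GETDATE();

PRINT 'Loading Bronze Layer is Completed';
PRINT '>> Total Load Duration: '
    + CAST(DATEDIFF(second, @batch_start_time, @batch_end_time) AS NVARCHAR(10))
    + ' seconds';
```

Keeping the batch variables separate from the per-table ones means the per-table values can be overwritten for each table without losing the overall start time.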
[01:44:17] and so on so now what you have to do is
[01:44:18] to go and execute the whole thing so
[01:44:21] let's go and refresh the definition of
[01:44:23] the stored procedure and then let's go and
[01:44:26] execute it so in the output we have to
[01:44:28] go to the last message and we can see
[01:44:30] loading bronze layer is completed and the
[01:44:32] total load duration is as well 0 seconds
[01:44:35] because the execution time is less than
[01:44:38] 1 second so with that you are getting
[01:44:40] now a feeling about how to build an ETL
[01:44:42] process so as you can see the data
[01:44:44] engineering is not all about how to load
[01:44:47] the data it's how to engineer the whole
[01:44:49] pipeline how to measure the speed of
[01:44:51] loading the data what can happen
[01:44:53] if there's like an error and to print
[01:44:55] each step in your ETL process and make
[01:44:58] everything organized and clear in the
[01:45:00] output and maybe in the logging just to
[01:45:02] make debugging and optimizing the
[01:45:04] performance way easier and there are
[01:45:06] a lot of things that we can add we can
[01:45:08] add quality measures and so on so we
[01:45:11] can add many things to our ETL scripts to
[01:45:13] make our data warehouse professional all
[01:45:16] right my friends so with that we have
[01:45:17] developed the code in order to load the
[01:45:19] bronze layer and we have tested that as
[01:45:21] well and now in the next step we're
[01:45:23] going to go back to draw because we want
[01:45:24] to draw a diagram about the data flow so
[01:45:27] let's
[01:45:31] go so now what is a data flow diagram
[01:45:34] we're going to draw a simple visual in
[01:45:35] order to map the flow of your data where
[01:45:38] it comes from and where it ends up so we
[01:45:41] want just to make clear how the data
[01:45:42] flows through different layers of your
[01:45:45] project and that helps us to create
[01:45:47] something called the data lineage and
[01:45:49] this is really nice especially if you
[01:45:51] are analyzing an issue so if you have
[01:45:53] like multiple layers and you don't have
[01:45:55] a real data lineage or flow it's going
[01:45:57] to be really hard to analyze the scripts
[01:45:59] in order to understand the origin of the
[01:46:01] data and having this diagram is going to
[01:46:03] improve the process of finding issues so
[01:46:06] now let's go and create one okay so now
[01:46:08] back to draw and we're going to go and
[01:46:10] build the flow diagram so we're going to
[01:46:11] start first with the source system so
[01:46:14] let's build the layer I'm going to go
[01:46:16] and remove the fill dotted and then
[01:46:19] we're going to go and add like a box
[01:46:21] saying sources and we're going to put it
[01:46:23] over here increase the size 24 and as
[01:46:27] well without any lines now what do we
[01:46:30] have inside the sources we have like
[01:46:32] folder and files so let's go and search
[01:46:35] for a folder icon I'm going to go and
[01:46:37] take this one over here and say you are
[01:46:39] the CRM and we can as well increase the
[01:46:42] size and we have another source we have
[01:46:45] the
[01:46:46] Erp okay so this is the first layer
[01:46:49] let's go and now have the bronze layer
[01:46:52] so we're going to go and grab another
[01:46:53] box and we're going to go and make the
[01:46:56] coloring like this and instead of Auto
[01:46:58] maybe take the hatch maybe something
[01:47:00] like this whatever you know so rounded
[01:47:03] and then we can go and put on top of it
[01:47:06] like the title so we can say you are the
[01:47:09] bronze layer and increase as well the
[01:47:12] size of the font so now what you're
[01:47:14] going to do we're going to go and add
[01:47:15] boxes for each table that we have in the
[01:47:18] bronze layer so for example we have the
[01:47:20] sales details we can go and make it
[01:47:22] little bit smaller so maybe 16 and not
[01:47:25] bold and we have other two tables from
[01:47:28] the CRM we have the customer info and as
[01:47:32] well the product info so those are the
[01:47:35] three tables that come from the CRM and
[01:47:37] now what we're going to do we're going
[01:47:38] to go and connect now the source CRM
[01:47:42] with all three tables so what we going
[01:47:44] to do we're going to go to the folder
[01:47:45] and start making arrows from the folder
[01:47:48] to the bronze layer like this and now we
[01:47:51] have to do the same thing for the Erp
[01:47:54] source so as you can see the data flow
[01:47:56] diagram shows us in one picture the data
[01:47:59] lineage between the two layers so here
[01:48:01] we can see easily those three tables
[01:48:03] actually come from the CRM and as well
[01:48:05] those three tables in the bronze layer
[01:48:07] are coming from the ERP I understand if
[01:48:09] we have like a lot of tables it's going
[01:48:11] to be a huge mess but if you have a
[01:48:13] small or medium data warehouse building
[01:48:16] those diagrams is going to make things
[01:48:17] much easier to understand how
[01:48:19] everything is Flowing from the sources
[01:48:22] into the different layers in your data
[01:48:24] warehouse all right so with that we have
[01:48:26] the first version of the data flow so
[01:48:28] this step is done and the final step is
[01:48:30] to commit our code in the Git
[01:48:36] repo okay so now let's go and commit our
[01:48:38] work since it is scripts we're going to
[01:48:40] go to the folder scripts and here we're
[01:48:42] going to have like scripts for the
[01:48:43] bronze silver and gold that's why maybe
[01:48:45] it makes sense to create a folder for
[01:48:47] each layer so let's go and start
[01:48:49] creating the bronze folder so I'm going
[01:48:51] to go and create a new file and then I'm
[01:48:53] going to say bronze slash and then we can
[01:48:55] have the DDL script of the bronze layer dot
[01:48:59] SQL so now I'm going to go and paste the
[01:49:01] DDL code that we have created so those
[01:49:03] six tables and as usual at the start we
[01:49:06] have a comment where we are explaining
[01:49:08] the purpose of this script so we are
[01:49:09] saying this script creates tables in
[01:49:11] the bronze schema and by running the
[01:49:13] script you are redefining the DDL
[01:49:15] structure of the bronze tables so let's
[01:49:18] have it like that and I'm going to go
[01:49:20] and commit the changes all right so now
[01:49:22] as you can see inside the scripts we
[01:49:24] have a folder called bronze and inside
[01:49:27] it we have the ddl script for the bronze
[01:49:29] layer and as well in the bronze layer
[01:49:31] we're going to go and put our stored
[01:49:32] procedure so we're going to go and
[01:49:34] create a new file let's call it proc
[01:49:36] load bronze.sql and then let's go and
[01:49:40] paste our script and as usual I have
[01:49:43] put at the start an explanation about
[01:49:45] the stored procedure so we are saying this
[01:49:47] stored procedure is going to go and load the
[01:49:48] data from the CSV files into the bronze
[01:49:51] schema so it is going to go and truncate first
[01:49:53] the tables and then do a bulk insert
[01:49:56] and about the parameters this stored
[01:49:58] procedure does not accept any parameters
[01:50:00] or return any values and here is a quick
[01:50:02] example how to execute it all right so I
[01:50:04] think I'm happy with that so let's go
[01:50:07] and commit it all right my friends so
[01:50:10] with that we have committed our code
[01:50:12] into Git and with that we are done
[01:50:14] building the bronze layer so the whole thing is
[01:50:17] done now we're going to go to the next
[01:50:18] one this one going to be more advanced
[01:50:21] than the bronze layer because there
[01:50:22] will be a lot of struggle with cleaning
[01:50:24] the data and so on so we're going to
[01:50:25] start with the first task where we're
[01:50:26] going to analyze and explore the data in
[01:50:29] the source systems so let's
[01:50:34] go okay so now we're going to start with
[01:50:36] the big question how to build the silver
[01:50:38] layer what is the process okay as usual
[01:50:41] first things first we have to analyze
[01:50:43] and now the task before building
[01:50:45] anything in the silver layer we have to
[01:50:46] go and explore the data in order to
[01:50:49] understand the content of our sources
[01:50:51] once we have it what we're going to do
[01:50:52] we will be starting coding and here the
[01:50:54] transformation that we're going to do is
[01:50:56] data cleansing this is usually a process
[01:50:58] that takes a really long time and I usually
[01:51:00] do it in three steps the first step is
[01:51:03] to check first the data quality issues
[01:51:05] that we have in the bronze layer so before
[01:51:07] writing any data Transformations first
[01:51:09] we have to understand what are the
[01:51:10] issues and only then I start writing
[01:51:13] data transformations in order to fix all
[01:51:16] those quality issues that we have in the
[01:51:17] bronze and the last step once I have
[01:51:20] clean results what we're going to do
[01:51:21] we're going to go and insert them into the
[01:51:23] silver layer and those are the three
[01:51:25] phases that we will be doing as we are
[01:51:27] writing the code for the silver layer
[01:51:29] and the third step once we have all the
[01:51:31] data in the silver layer we have to make
[01:51:32] sure that the data is now correct and we
[01:51:35] don't have any quality issues anymore
[01:51:37] and if you find any issues of course
[01:51:38] what we're going to do is we're going to go
[01:51:39] back to coding we're going to do the
[01:51:41] data cleansing and again check so it is
[01:51:43] like a cycle between validating and
[01:51:45] coding once the quality of the silver
[01:51:47] layer is good we cannot skip the last
[01:51:50] phase where we going to document and
[01:51:51] commit our work in Git and here we're
[01:51:53] going to have two new documentations
[01:51:55] we're going to build the data flow
[01:51:57] diagram and as well the data integration
[01:51:59] diagram after we understood the
[01:52:01] relationship between the sources from
[01:52:03] the first step so this is the process
[01:52:05] and this is how we're going to build the
[01:52:07] silver
[01:52:11] layer all right so now exploring the
[01:52:13] data in the bronze layer so why is it very
[01:52:15] important because understanding the data
[01:52:18] is the key to making smart decisions in
[01:52:20] the silver layer it was not the focus in
[01:52:22] the bronze layer to understand the content
[01:52:24] of the data at all we focused only how
[01:52:26] to get the data to the data warehouse so
[01:52:29] that's why we have now to take a moment
[01:52:31] in order to explore and understand the
[01:52:33] tables and as well how to connect them
[01:52:36] what are the relationships between these
[01:52:37] tables and it is very important as you
[01:52:39] are learning about a new source system
[01:52:42] is to create like some kind of
[01:52:43] documentation so now let's go and
[01:52:45] explore the sources okay so now let's go
[01:52:47] and explore them one by one we can start
[01:52:49] with the first one from the CRM we have
[01:52:51] the customer info so right click on it
[01:52:53] and say select top thousand rows and
[01:52:56] this is of course important if you have
[01:52:57] like a lot of data don't go and explore
[01:52:59] millions of rows always limit your
[01:53:01] queries so for example here we are using
[01:53:02] the top thousand just to make sure that
[01:53:04] you are not impacting the system with
[01:53:06] your queries so now let's have a look at
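The row-limited exploration query mentioned here can be sketched like this, assuming SQL Server. The temporary table `#crm_cust_info`, its two columns, and the sample rows are hypothetical stand-ins for the CRM customer table in the bronze schema.

```sql
-- Hypothetical stand-in for the CRM customer table.
CREATE TABLE #crm_cust_info (cst_id INT, cst_key NVARCHAR(50));

INSERT INTO #crm_cust_info (cst_id, cst_key)
VALUES (1, N'CUST-1'), (2, N'CUST-2'), (3, N'CUST-3');

-- TOP caps the number of rows returned, so exploring a large table
-- never pulls millions of rows back to the client.
SELECT TOP (1000) *
FROM #crm_cust_info;

DROP TABLE #crm_cust_info;
```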
[01:53:07] the content of this table so we can see
[01:53:09] that we have here customer information
[01:53:12] so we have an ID we have a key for the
[01:53:14] customer we have first name last name
[01:53:16] marital status gender and the creation date
[01:53:19] of the customer so simply this is a
[01:53:21] table for the customer
[01:53:22] information and a lot of details for the
[01:53:25] customers and here we have like two
[01:53:26] identifiers one it is like technical ID
[01:53:29] and another one it's like the customer
[01:53:32] number so maybe we can use either the ID
[01:53:34] or the key in order to join it with
[01:53:35] other tables so now what I usually do is
[01:53:38] to go and draw like data model or let's
[01:53:41] say integration model just to document
[01:53:43] and visualize what I am understanding
[01:53:45] because if you don't do that you're
[01:53:46] going to forget it after a while so now
[01:53:48] we go and search for a shape let's
[01:53:50] search for table and I'm going to go and
[01:53:51] pick this one over here so here we can
[01:53:54] go and change the style for example we
[01:53:56] can make it rounded or you can go make
[01:53:58] it sketch and so on and we can go and
[01:54:01] change the color so I'm going to make it
[01:54:02] blue then go to the text make sure to
[01:54:04] select the whole thing and let's make it
[01:54:08] bigger 26 and then what I'm going to do
[01:54:10] for those items I'm just going to select
[01:54:12] them and go to arrange and maybe make it
[01:54:15] 40 something like this so now what we're
[01:54:17] going to do we're going to just go and
[01:54:19] put the table name so this is the one
[01:54:21] that we are now learning about and what
[01:54:24] I'm going to do I'm just going to go and
[01:54:25] put here the primary key I will not go
[01:54:27] and list all the informations so the
[01:54:29] primary key was the ID and I will go and
[01:54:32] remove all those stuff I don't need it
[01:54:34] now as you can see the table name is not
[01:54:36] really friendly so I can go and bring a
[01:54:38] text and put it here on top and say this
[01:54:40] is the customer information just to make
[01:54:44] it friendly and do not forget about it
[01:54:46] and as well going to increase the size
[01:54:48] to maybe 20 something like this okay
[01:54:51] with that we have our first table and
[01:54:53] we're going to go and keep exploring so
[01:54:55] let's move to the second one we're going
[01:54:56] to take the product information right
[01:54:59] click on it and select the top thousand
[01:55:01] rows I will just put it below the
[01:55:03] previous query and query it now by looking
[01:55:06] at this table we can see we have product
[01:55:08] information so we have here a primary
[01:55:10] key for the product and then we have
[01:55:12] like key or let's say product number and
[01:55:14] after that we have the full name of the
[01:55:16] product the product costs and then we
[01:55:18] have the product line and then we have
[01:55:20] like start and end
[01:55:22] well this is interesting to understand
[01:55:24] why we have start and ends let's have a
[01:55:26] look for example for those three rows
[01:55:29] all of those three having the same key
[01:55:31] but they have different IDs so it is the
[01:55:34] same product but with different costs so
[01:55:37] for 2011 we have the cost of 12 then
[01:55:40] 2012 we have 14 and for the last year
[01:55:44] 2013 we have 13 so it's like we have
[01:55:47] like a history for the changes so this
[01:55:49] table is not only holding the current
[01:55:51] information of the product but also
[01:55:53] history information of the products and
[01:55:55] that's why we have those two dates start
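A small sketch of how that history shows up in a query, assuming SQL Server. The table variable, the column names, the product key `PROD-A`, and the exact dates are made up for illustration. Only the three yearly costs (12, 14, 13) come from the rows discussed in the video.

```sql
-- Hypothetical stand-in for the product table.
DECLARE @prd_info TABLE (
    prd_id       INT,
    prd_key      NVARCHAR(50),
    prd_cost     INT,
    prd_start_dt DATE,
    prd_end_dt   DATE
);

INSERT INTO @prd_info (prd_id, prd_key, prd_cost, prd_start_dt, prd_end_dt)
VALUES
    (1, N'PROD-A', 12, '2011-01-01', '2011-12-31'),
    (2, N'PROD-A', 14, '2012-01-01', '2012-12-31'),
    (3, N'PROD-A', 13, '2013-01-01', NULL);

-- Keys that appear more than once carry history rows.
SELECT prd_key, COUNT(*) AS version_count
FROM @prd_info
GROUP BY prd_key
HAVING COUNT(*) > 1;

-- Full history of one key, oldest version first.
SELECT prd_id, prd_key, prd_cost, prd_start_dt, prd_end_dt
FROM @prd_info
WHERE prd_key = N'PROD-A'
ORDER BY prd_start_dt;
```

The first query is a quick way to confirm that the ID is unique per row while the key repeats across versions.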
[01:55:58] and end now let's go back and draw this
[01:56:00] information over here so I'm just going
[01:56:02] to go and duplicate it so the name of
[01:56:04] this table is going to be PRD info and
[01:56:06] let's go and give it like a short
[01:56:07] description current and history products
[01:56:12] information something like this just to
[01:56:15] not forget that we have history in this
[01:56:16] table and here we have as well the PRD
[01:56:19] ID and there is like nothing that we can
[01:56:21] use in order to join those two tables we
[01:56:24] don't have like a customer ID here or in
[01:56:26] the other table we don't have any
[01:56:27] product ID okay so that's it for this
[01:56:29] table let's jump to the third table and
[01:56:31] the last one in the CRM so let's go and
[01:56:34] select I just made other queries as well
[01:56:36] short so let's go and execute so what do
[01:56:38] you have over here we have a lot of
[01:56:39] informations about the order the sales
[01:56:42] and a lot of measures order number we
[01:56:44] have the product key so this is
[01:56:46] something that we can use in order to
[01:56:48] join it with the product table we have
[01:56:50] the customer ID we don't have the
[01:56:52] customer key so here we have like ID and
[01:56:55] here we have key so there's like two
[01:56:56] different ways on how to join tables and
[01:56:59] then we have here like dates the order
[01:57:02] dates the shipping date the due date and
[01:57:04] then we have the sales amount the
[01:57:06] quantity and the price so this is like
[01:57:08] an event table it is transactional table
[01:57:11] about the orders and sales and it is
[01:57:13] great table in order to connect the
[01:57:15] customers with the products and as well
[01:57:18] with the orders so let's document this
[01:57:20] new information that we have so the
[01:57:22] table name is the sales details so we
[01:57:25] can go and describe it like this
[01:57:27] transactional records about sales and
[01:57:33] orders and now we have to go and
[01:57:35] describe how we can connect this table
[01:57:37] to the other two so we are not using the
[01:57:39] product ID we are using the product key
[01:57:43] and now we need a new column over here
[01:57:44] so you can hold control and enter or you
[01:57:47] can go over here and add a new row and
[01:57:49] the other row is going to be the
[01:57:50] customer ID so now for the the customer
[01:57:52] ID it is easy we can go and grab an
[01:57:54] arrow in order to connect those two
[01:57:56] tables but for the product key we are
[01:57:58] not using the ID so that's why I'm just
[01:58:01] going to go and remove this one and say
[01:58:03] product key let's have here again a
[01:58:04] check so this is a product key it's not
[01:58:07] a product ID and if we go and check the
[01:58:09] old table the products info you can see
[01:58:11] we are using this key and not the
[01:58:14] primary key so what we're going to do
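The two relationships can be sketched as joins, assuming SQL Server. All table variables, column names, and sample values below are hypothetical. The point is only that sales rows reach customers through the customer ID and products through the product key.

```sql
-- Hypothetical stand-ins for the three CRM tables.
DECLARE @cust_info TABLE (cst_id INT, cst_key NVARCHAR(50));
DECLARE @prd_info TABLE (prd_id INT, prd_key NVARCHAR(50));
DECLARE @sales_details TABLE (
    sls_ord_num NVARCHAR(50),
    sls_prd_key NVARCHAR(50),
    sls_cust_id INT
);

INSERT INTO @cust_info VALUES (1, N'CUST-1');
INSERT INTO @prd_info VALUES (10, N'PROD-A');
INSERT INTO @sales_details VALUES (N'ORD-1', N'PROD-A', 1);

SELECT s.sls_ord_num, c.cst_key, p.prd_id
FROM @sales_details AS s
-- Customers are matched on the technical ID.
LEFT JOIN @cust_info AS c ON c.cst_id = s.sls_cust_id
-- Products are matched on the product key, not on the product ID.
LEFT JOIN @prd_info AS p ON p.prd_key = s.sls_prd_key;
```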
[01:58:15] now we will just go and Link it like
[01:58:18] this and maybe switch those two tables
[01:58:20] so I will put the customer below just
[01:58:23] perfect it looks nice okay so let's keep
[01:58:25] moving let's go now to the other source
[01:58:27] system we have the Erp and the first one
[01:58:29] is ERP cust and we have this cryptic
[01:58:32] name let's go and select the data so now
[01:58:35] here it's a small table and we have only
[01:58:37] three pieces of information so we have here
[01:58:39] something called CID and then we have
[01:58:41] something I think this is the birthday
[01:58:43] and the gender information so we have
[01:58:45] here male female and so on so it looks
[01:58:47] again like the customer informations but
[01:58:49] here we have like extra data about the
[01:58:51] birthday and now if you go and compare
[01:58:53] it to the customer table that we have
[01:58:55] from the other source system let's go
[01:58:57] and query it you can see the new table
[01:58:59] from the ERP doesn't have IDs it has
[01:59:02] actually the customer number or the key
[01:59:05] so we can go and join those two tables
[01:59:07] using the customer key let's go and
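A sketch of that join, assuming SQL Server. The table variables, the column names `bdate` and `gen`, and the sample row are hypothetical. What matters is that the ERP table is matched on the customer key rather than the customer ID.

```sql
-- Hypothetical stand-ins for the CRM customer table and the ERP customer table.
DECLARE @crm_cust_info TABLE (cst_id INT, cst_key NVARCHAR(50));
DECLARE @erp_cust TABLE (cid NVARCHAR(50), bdate DATE, gen NVARCHAR(10));

INSERT INTO @crm_cust_info VALUES (1, N'CUST-1');
INSERT INTO @erp_cust VALUES (N'CUST-1', '1990-05-17', N'Female');

-- The ERP table has no technical ID, so the customer key is the join column.
SELECT c.cst_id, c.cst_key, e.bdate, e.gen
FROM @crm_cust_info AS c
LEFT JOIN @erp_cust AS e ON e.cid = c.cst_key;
```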
[01:59:09] document this information so I will just
[01:59:11] go and copy paste and put it here on the
[01:59:13] right side I will just go and change the
[01:59:15] color now since we are now talking about
[01:59:17] a different source system and here the
[01:59:19] table name is going to be this one and the
[01:59:22] key is called CID now in order to join
[01:59:25] this table with the customer info we
[01:59:27] cannot join it with the customer ID we
[01:59:29] need the customer key that's why here we
[01:59:31] have to go and add a new row so contrl
[01:59:33] enter and we're going to say customer
[01:59:35] key and then we have to go and make a
[01:59:37] nice Arrow between those two keys so
[01:59:40] we're going to go and give it a
[01:59:41] description customer
[01:59:44] information and here we have the birth
[01:59:47] dates okay so now let's keep going we're
[01:59:50] going to go to the next one we have the
[01:59:53] Erp location let's go and query this
[01:59:56] table so what do you have over here we
[01:59:58] have the CID again and as you can see we
[02:00:00] have country informations and this is of
[02:00:02] course again the customer number and we
[02:00:05] have only this information the country
[02:00:07] so let's go and document this information
[02:00:09] this is the customer location table name
[02:00:12] going to be like this and we still have
[02:00:13] the same ID so we have here still the
[02:00:16] customer ID and we can go and join it
[02:00:18] using the customer key and we have to
[02:00:18] give it the description location
[02:00:22] of customers and we can say here the
[02:00:25] country okay so now let's go to the last
[02:00:28] table and explore it we have the Erp PX
[02:00:32] catalog so let's go and query those
[02:00:35] informations so what do we have here we
[02:00:37] have like an ID a category a subcategory
[02:00:40] and the maintenance here we have like
[02:00:43] either yes or no so by looking at this
[02:00:45] table we have all the categories and the
[02:00:47] subcategories of the products and here
[02:00:49] we have like special identifier for
[02:00:52] those informations now the question is
[02:00:54] how to join it so I would like to join
[02:00:56] it actually with the product
[02:00:58] informations so let's go and check those
[02:01:00] two tables together okay so in the
[02:01:01] products we don't have any ID for the
[02:01:03] categories but we have these
[02:01:05] informations actually in the product key
[02:01:08] so the first five characters of the
[02:01:10] product key are actually the category ID
[02:01:13] so we can use this information over here
[02:01:16] in order to join it with the categories
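The prefix-based join can be sketched as follows, assuming SQL Server. The table variables, column names, and sample values are invented so that the first five characters of the product key match the category ID exactly. Real source data may need extra cleanup before the two sides line up.

```sql
-- Hypothetical stand-ins for the product table and the ERP category table.
DECLARE @prd_info TABLE (prd_id INT, prd_key NVARCHAR(50));
DECLARE @px_cat TABLE (
    id          NVARCHAR(50),
    cat         NVARCHAR(50),
    subcat      NVARCHAR(50),
    maintenance NVARCHAR(10)
);

INSERT INTO @prd_info VALUES (1, N'ABCDE-0001');
INSERT INTO @px_cat VALUES (N'ABCDE', N'Category X', N'Subcategory Y', N'Yes');

-- LEFT(prd_key, 5) extracts the category ID embedded in the product key.
SELECT p.prd_key,
       LEFT(p.prd_key, 5) AS cat_id,
       c.cat,
       c.subcat
FROM @prd_info AS p
LEFT JOIN @px_cat AS c ON c.id = LEFT(p.prd_key, 5);
```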
[02:01:18] so we can go and describe this
[02:01:20] information like this and then we have
[02:01:22] to go and give it a name and then here
[02:01:24] we have the ID and the ID could be
[02:01:26] joined using the product key so that
[02:01:29] means for the product information we
[02:01:31] don't need at all the product ID the
[02:01:33] primary key all what we need is the
[02:01:36] product key or the product number and
[02:01:38] what I would like to do is like to group
[02:01:39] those informations in a box so let's go
[02:01:43] grab like any boxes here on the left
[02:01:45] side and make it bigger and then make
[02:01:49] the edges a little bit smaller let's
[02:01:51] remove the fill and the line I will
[02:01:53] make a dotted line and then let's grab
[02:01:56] another box over here and say this is
[02:01:58] the CRM and we can go and increase the
[02:02:01] size maybe something like 40 smaller 35
[02:02:05] bold and change the color to Blue and
[02:02:07] just place it here on top of this box so
[02:02:09] with that we can understand all those
[02:02:11] tables belong to the source system CRM
[02:02:14] and we can do the same stuff for the
[02:02:16] right side as well now of course we have
[02:02:18] to go and add the description here so
[02:02:21] it's going to be the product
[02:02:23] categories all right so with that we
[02:02:25] have now clear understanding how the
[02:02:27] tables are connected to each others we
[02:02:30] understand now the content of each table
[02:02:32] and of course it's going to help us to clean
[02:02:34] up the data in the silver layer in order
[02:02:36] to prepare it so as you can see it is
[02:02:38] very important to take time
[02:02:41] understanding the structure of the
[02:02:42] tables the relationship between them
[02:02:44] before start writing any code all right
[02:02:46] so with that we have now clear
[02:02:47] understanding about the sources and with
[02:02:49] that we have as well created a data
[02:02:52] integration in Draw.io so with that we
[02:02:54] have more understanding about how to
[02:02:56] connect the sources and now in the next
[02:02:58] two task we will go back to SQL where
[02:03:00] we're going to start checking the
[02:03:01] quality and as well doing a lot of data
[02:03:04] Transformations so let's
[02:03:08] go okay so now let's have a quick look
[02:03:11] to the specifications of the silver
[02:03:12] layer so the main objective to have
[02:03:14] clean and standardized data we have to
[02:03:17] prepare the data before going to the
[02:03:19] gold layer and we will be building
[02:03:21] tables inside the silver layer and the
[02:03:23] way of loading the data from the bronze
[02:03:25] to the silver is a full load so that
[02:03:27] means we're going to truncate and then
[02:03:29] insert and here we're going to have a
[02:03:30] lot of data Transformations so we're
[02:03:32] going to clean the data we're going to
[02:03:33] bring normalizations standardizations
[02:03:36] we're going to derive new columns we
[02:03:38] will be doing as well data enrichment so
[02:03:40] a lot of things to be done in the data
[02:03:42] transformation but we will not be
[02:03:44] building any new data model so those are
[02:03:46] the specifications and we have to commit
[02:03:48] ourself to this scope okay so now
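The truncate-and-insert full load described above can be sketched like this. It is only a minimal illustration using Python's built-in `sqlite3` as a stand-in for SQL Server. The table names `bronze_crm_cust_info` and `silver_crm_cust_info`, the columns, and the sample rows are assumptions made up for the example. SQLite has no `TRUNCATE TABLE`, so `DELETE FROM` plays that role here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bronze_crm_cust_info (cst_id INTEGER, cst_firstname TEXT);
    CREATE TABLE silver_crm_cust_info (cst_id INTEGER, cst_firstname TEXT);
    INSERT INTO bronze_crm_cust_info VALUES (1, ' Anna'), (2, 'Ben ');
""")

def full_load(conn):
    """Empty the silver table, then reload it completely from bronze."""
    # SQL Server would use TRUNCATE TABLE here; SQLite only offers DELETE.
    conn.execute("DELETE FROM silver_crm_cust_info")
    conn.execute("""
        INSERT INTO silver_crm_cust_info (cst_id, cst_firstname)
        SELECT cst_id, TRIM(cst_firstname)
        FROM bronze_crm_cust_info
    """)
    conn.commit()

# Running the load twice must not duplicate any rows.
full_load(conn)
full_load(conn)
silver_rows = conn.execute(
    "SELECT cst_id, cst_firstname FROM silver_crm_cust_info ORDER BY cst_id"
).fetchall()
print(silver_rows)
```

Because the target is emptied first, the load can be repeated as often as needed and the silver table always mirrors the current bronze content.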
[02:03:50] building the ddl script for the layer
[02:03:52] going to be way easier than the bronze
[02:03:54] because the definition and the structure
[02:03:56] of each table in the silver going to be
[02:03:58] identical to the bronze layer we are not
[02:04:01] doing anything new so all what you have
[02:04:03] to do is to take the ddl script from the
[02:04:05] bronze layer and just go and search and
[02:04:07] replace for the schema I'm just using
[02:04:09] the notepad++ for the scripts so I'm
[02:04:11] going to go over here and say replace
[02:04:13] the bronze dots with silver dots and I'm
[02:04:16] going to go and replace all so with that
[02:04:19] now all the ddl is targeting the schema
[02:04:22] silver layer which is exactly what we
[02:04:24] need all right now before we execute our
[02:04:26] new ddl script for the silver we have to
[02:04:29] talk about something called the metadata
[02:04:31] columns they are additional columns or
[02:04:33] fields that the data Engineers add to
[02:04:36] each table that don't come directly from
[02:04:38] the source systems but the data
[02:04:40] Engineers use it in order to provide
[02:04:42] extra informations for each record like
[02:04:45] we can add a column called create date
[02:04:47] is when the record was loaded or an
[02:04:50] update date when the the record got
[02:04:52] updated or we can add the source system
[02:04:55] in order to understand the origin of the
[02:04:57] data that we have or sometimes we can
[02:04:59] add the file location in order to
[02:05:02] understand the lineage from which file
[02:05:04] the data come from those are great tool
[02:05:06] if you have data issue in your data
[02:05:08] warehouse if there is like corrupt data
[02:05:10] and so on this can help you to track
[02:05:13] exactly where this issue happens and
[02:05:15] when and as well it is great in order to
[02:05:18] understand whether I have Gap in my data
[02:05:20] especially if you are doing incremental
[02:05:21] load it is like putting labels on
[02:05:23] everything and you will thank yourself
[02:05:25] later when you start using them in hard
[02:05:28] times as you have an issue in your data
[02:05:30] warehouse so now back to our ddl scripts
[02:05:32] and all what you have to do is to go and
[02:05:34] do the following so for example for the
[02:05:35] first table I will go and add at the end
[02:05:38] one more extra column so it start with
[02:05:41] the prefix DW as we have defined in the
[02:05:44] naming convention and then underscore
[02:05:46] let's have the create dates and the data
[02:05:49] tabe going to be date time to and now
[02:05:51] what we can do is we can go and add a
[02:05:53] default value for it I want the database
[02:05:55] to generate these informations
[02:05:57] automatically we don't have to specify
[02:05:59] that in any ETL scripts so which value
[02:06:01] it's going to be the GETDATE so each
[02:06:04] record going to be inserted in this
[02:06:05] table will get automatically a value
[02:06:08] from the current date and time so now as
[02:06:10] you can see the naming convention it is
[02:06:12] very important all those columns comes
[02:06:14] from the source system and only this one
[02:06:16] column comes from the data engineer of
[02:06:18] the data warehouse okay so that's it
[02:06:20] let's go and repeat the same thing for
[02:06:22] all other tables so I will just go and
[02:06:24] add this piece of information for each
[02:06:29] ddl all right so I think that's it all
[02:06:32] what you have to do is now to go and
[02:06:34] execute the whole ddl script for the
[02:06:36] silver layer let's go into that all
[02:06:38] right perfect there's no errors let's go
[02:06:40] and refresh the tables on the object
[02:06:42] Explorer and with that as you can see we
[02:06:44] have six tables for the silver layer it
[02:06:46] is identical to the bronze layer but we
[02:06:48] have one extra column for the metadata
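A small sketch of the metadata column idea, again with Python's built-in `sqlite3` standing in for SQL Server. The column name `dwh_create_date` and the table layout are assumptions for illustration. In SQL Server the column would be declared as `DATETIME2 DEFAULT GETDATE()`; the SQLite counterpart used here is `DEFAULT CURRENT_TIMESTAMP`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE silver_crm_cust_info (
        cst_id          INTEGER,
        cst_firstname   TEXT,
        -- metadata column: filled by the database, not by the ETL script
        dwh_create_date TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# The insert lists only the source columns; the timestamp is generated.
conn.execute(
    "INSERT INTO silver_crm_cust_info (cst_id, cst_firstname) VALUES (1, 'Anna')"
)
row = conn.execute(
    "SELECT cst_id, cst_firstname, dwh_create_date FROM silver_crm_cust_info"
).fetchone()
print(row)
```

The insert statement never mentions `dwh_create_date`, which is the point: the load script stays unchanged and every record still carries its load time.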
[02:06:55] all right so now in the silver layer
[02:06:57] before we start writing any data
[02:06:58] Transformations and cleansing we have
[02:07:01] first to detect the quality issues in
[02:07:03] the bronze without knowing the issues we
[02:07:05] cannot find solution right we will
[02:07:07] explore first the quality issues only
[02:07:09] then we start writing the transformation
[02:07:12] scripts so let's
[02:07:19] go okay so now what we're going to do
[02:07:21] we're going to go through all the tables
[02:07:23] over the bronze layer clean up the data
[02:07:25] and then insert it to the silver layer
[02:07:27] so let's start with the first table the
[02:07:29] first bronze table from The Source CRM
[02:07:32] so we're going to go to the bronze CRM
[02:07:34] customer info so let's go and query the
[02:07:37] data over here now of course before
[02:07:39] writing any data and Transformations we
[02:07:41] have to go and detect and identify the
[02:07:44] quality issues of this table so usually
[02:07:46] I start with the first check where we go
[02:07:48] and check the primary key so we have to
[02:07:51] go and check whether there are nulls
[02:07:52] inside the primary key and whether there
[02:07:54] are duplicates so now in order to detect
[02:07:57] the duplicates in the primary key what
[02:07:58] we have to do is to go and aggregate the
[02:08:01] primary key if we find any value in the
[02:08:03] primary key that exist more than once
[02:08:05] that means it is not unique and we have
[02:08:07] duplicates in the table so let's go and
[02:08:09] write query for that so what we're going
[02:08:11] to do we're going to go with the
[02:08:13] customer ID and then we're going to go
[02:08:14] and count and then we have to group up
[02:08:17] the data so Group by based on the
[02:08:19] primary key and of course we don't need
[02:08:21] all the results we need only where we
[02:08:23] have an issue so we're going to say
[02:08:25] having
[02:08:27] counts higher than one so we are
[02:08:29] interested in the values where the count
[02:08:32] is higher than one so let's go and
[02:08:34] execute it now as you can see we have
[02:08:36] issue in this table we have duplicates
[02:08:38] because all those IDs exist more than
[02:08:41] once in the table which is completely
[02:08:43] wrong we should have the primary key
[02:08:44] unique and you can see as well we have
[02:08:46] three records where the primary key is
[02:08:48] empty which is as well a bad thing now
[02:08:51] there is an issue here if we have only
[02:08:53] one null it will not be here at the
[02:08:55] result so what I'm going to do I'm going
[02:08:56] to go over here and say or the primary
[02:08:59] key is null just in case if we have only
[02:09:03] one null I'm still interested to see the
[02:09:05] results so if I go and run it again
[02:09:07] we'll get the same results so this is
[02:09:09] a quality check that you can do on the
[02:09:11] table and as you can see it is not
[02:09:12] meeting the expectation so that means we
[02:09:15] have to do something about it so let's
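The primary key check from this step, sketched with Python's built-in `sqlite3` as a stand-in for SQL Server. The table name, the column `cst_id`, and the sample rows are made up for illustration; the `GROUP BY ... HAVING` query itself reads the same in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_crm_cust_info (cst_id INTEGER, cst_firstname TEXT)")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?)",
    [(1, "Anna"), (2, "Ben"), (2, "Benjamin"), (None, "Cara"), (3, "Dan")],
)

# Expectation: no rows. Any row returned is a duplicated or missing key.
issues = conn.execute("""
    SELECT cst_id, COUNT(*) AS cnt
    FROM bronze_crm_cust_info
    GROUP BY cst_id
    HAVING COUNT(*) > 1 OR cst_id IS NULL
    ORDER BY cst_id
""").fetchall()
print(issues)
```

The `OR cst_id IS NULL` part matters in this sample: there is only one null key, so `COUNT(*) > 1` alone would not report it.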
[02:09:17] go and create a new query so here what
[02:09:19] we're going to do we're going to start
[02:09:20] writing the query that is doing the data
[02:09:22] transformation and the data cleansing so
[02:09:25] let's start again by selecting the
[02:09:30] data and execute it again so now what I
[02:09:33] usually do I go and focus on the issue
[02:09:36] so for example let's go and take one of
[02:09:38] those values and I focus on it before
[02:09:40] start writing the transformation so
[02:09:42] we're going to say where customer ID
[02:09:44] equal to this value all right so now as
[02:09:47] you can see we have here the issue where
[02:09:48] the ID exist three times but actually we
[02:09:51] are interested only on one of them so
[02:09:53] the question is how to pick one of those
[02:09:56] usually we search for a timestamp or
[02:09:58] date value to help us so if you check
[02:10:00] the creation date over here we can
[02:10:02] understand that this record this one
[02:10:04] over here is the newest one and the
[02:10:07] previous two are older than it so that
[02:10:09] means if I have to go and pick one of
[02:10:11] those values I would like to get the
[02:10:13] latest one because it holds the most
[02:10:16] fresh information so what we have to do
[02:10:18] is we have to go and rank all those
[02:10:20] values based on the create dates and
[02:10:23] only pick the highest one so that means
[02:10:26] we need a ranking function and for that
[02:10:28] in SQL we have the amazing window
[02:10:30] functions so let's go and do that we
[02:10:32] will use the function row number over
[02:10:37] and then Partition by and here we have
[02:10:40] to divide the table by the customer ID
[02:10:42] so we're going to divide it by the
[02:10:44] customer ID and in order now to rank
[02:10:47] those rows we have to sort the data by
[02:10:49] something so order by and as we
[02:10:51] discussed we want to sort the data by
[02:10:53] the creation date so create
[02:10:56] date and we're going to sort it
[02:10:58] descending so the highest first then the
[02:11:00] lowest so let's go and do that and now
[02:11:02] we're going to go and give it the name
[02:11:04] flag last so now let's go and execute it
[02:11:07] now the data is sorted by the creation
[02:11:09] date and you can see over here that this
[02:11:12] record is the number one then the one
[02:11:14] that is older is two and the oldest one
[02:11:16] is three of course we are interested in
[02:11:19] the rank number one now let's go and
[02:11:21] remove the filter and check everything so
[02:11:23] now if you have a look to the table you
[02:11:24] can see that on the flag we have
[02:11:26] everywhere like one and that's because
[02:11:29] those primary keys exist only once
[02:11:32] but sometimes we will not have one we
[02:11:33] will have two three and so on if there's
[02:11:35] like duplicates we can go of course and
[02:11:37] do a double check so let's go over here
[02:11:39] and say select
[02:11:40] star from this query we're going to say
[02:11:43] where flag last is not equal to one so
[02:11:47] let's go and query it and now we can see
[02:11:49] all the data that we don't need because
[02:11:51] they are causing duplicates in the
[02:11:52] primary key and they have like an old
[02:11:54] status so what we're going to do we're
[02:11:56] going to say equal to one and with that
[02:11:58] we guarantee that our primary key is
[02:12:00] unique and each value exist only once so
[02:12:03] if I go and query it like this you will
[02:12:05] see we will not find any duplicate
[02:12:07] inside our table and we can go and check
[02:12:09] that of course so let's go and check
[02:12:11] this primary key and we're going to say
[02:12:13] and customer ID equal to this value and
[02:12:16] you can see it exists now only once and
[02:12:18] we are getting the freshest data from
[02:12:20] this key so with that we have defined
[02:12:23] like transformation in order to remove
[02:12:25] any duplicates okay so now moving on to the
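Here is a minimal sketch of the deduplication just described: rank the rows per customer ID by creation date, newest first, and keep only rank one. It uses Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer) in place of SQL Server; the table, the column names, and the dates are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bronze_crm_cust_info (cst_id INTEGER, cst_create_date TEXT)")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?)",
    [(1, "2025-01-01"), (1, "2025-03-01"), (1, "2025-02-01"), (2, "2025-01-05")],
)

# flag_last = 1 marks the newest record of each customer ID.
latest = conn.execute("""
    SELECT cst_id, cst_create_date
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY cst_id
                   ORDER BY cst_create_date DESC
               ) AS flag_last
        FROM bronze_crm_cust_info
    ) AS ranked
    WHERE flag_last = 1
    ORDER BY cst_id
""").fetchall()
print(latest)
```

Switching the filter to `flag_last != 1` shows exactly the older rows that are being dropped, which is a handy double check before trusting the result.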
[02:12:27] next one as you can see in our table we
[02:12:30] have a lot of values where they are like
[02:12:33] string values now for these string
[02:12:35] values we have to check the unwanted
[02:12:37] spaces so now let's go and write a query
[02:12:39] that's going to detect those unwanted
[02:12:41] spaces so we're going to say
[02:12:43] select this column the first name from
[02:12:47] our table bronze customer information so
[02:12:50] let's go and query it now by just
[02:12:53] looking to the data it's going to be
[02:12:54] really hard to find those unwanted
[02:12:56] spaces especially if they are at the end
[02:12:58] of the word but there is a very easy
[02:13:01] way in order to detect those issues so
[02:13:03] what we're going to do we're going to do
[02:13:04] a filter so now we're going to say the
[02:13:05] first name is not equal to the first
[02:13:09] name after trimming the values so if you
[02:13:11] use the function trim what it going to
[02:13:13] do it's going to go and remove all the
[02:13:15] leading and trailing spaces so the first
[02:13:18] name so if this value is not equal to
[02:13:22] the first name after trimming it then we
[02:13:24] have an issue so it is very simple let's
[02:13:26] go and execute it so now in the result
[02:13:29] we will get the list of all first names
[02:13:31] where we have spaces either at the start
[02:13:34] or at the end so again the expectation
[02:13:36] here is no results and the same thing we
[02:13:40] can go and check something else like for
[02:13:42] example the last name so let's go and do
[02:13:45] that over here and here let's go and
[02:13:48] execute it we see in the result we have
[02:13:50] as well customers where they have like
[02:13:53] space in their last name which is not
[02:13:56] really good and we can go and keep
[02:13:57] checking all the string values that you
[02:14:00] have inside the table so for example the
[02:14:01] gender so let's go and check
[02:14:04] that and execute now as you can see we
[02:14:07] don't have any results that means the
[02:14:09] quality of the gender is better and we
[02:14:11] don't have any unwanted spaces so now we
[02:14:14] have to go and write transformation in
[02:14:16] order to clean up those two columns now
[02:14:18] what I'm going to do I'm just going to
[02:14:19] go and list all the column in the query
[02:14:22] instead of the star all right so now I
[02:14:24] have a list of all the columns that I
[02:14:26] need and now what we have to do is to go
[02:14:27] to those two columns and start removing
[02:14:30] The Unwanted spaces so we'll just use
[02:14:32] the trim it's very
[02:14:34] simple and give it a name of course the
[02:14:37] same name and we will trim as well the
[02:14:40] last name so let's go and query this and
[02:14:43] with that we have cleaned up those two
[02:14:44] columns from any unwanted spaces okay so
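The trim check and the trim fix can be sketched as follows, with Python's built-in `sqlite3` standing in for SQL Server and made-up table, column names, and rows. One caveat worth verifying on your own server: SQLite compares strings exactly, while SQL Server ignores trailing spaces when comparing with `=` or `!=`, so there this comparison mainly catches leading spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_crm_cust_info (
        cst_id INTEGER, cst_firstname TEXT, cst_lastname TEXT
    )
""")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?, ?)",
    [(1, " Jon", "Snow"), (2, "Arya", "Stark "), (3, "Ned", "Stark")],
)

# Quality checks: expectation is no results for either column.
dirty_first = [r[0] for r in conn.execute(
    "SELECT cst_firstname FROM bronze_crm_cust_info "
    "WHERE cst_firstname != TRIM(cst_firstname)"
)]
dirty_last = [r[0] for r in conn.execute(
    "SELECT cst_lastname FROM bronze_crm_cust_info "
    "WHERE cst_lastname != TRIM(cst_lastname)"
)]

# Transformation: trim both columns and keep the original column names.
cleaned = conn.execute("""
    SELECT cst_id,
           TRIM(cst_firstname) AS cst_firstname,
           TRIM(cst_lastname)  AS cst_lastname
    FROM bronze_crm_cust_info
    ORDER BY cst_id
""").fetchall()
print(dirty_first, dirty_last, cleaned)
```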
[02:14:47] now moving on we have those two
[02:14:49] informations we have the marital status
[02:14:51] and as well the gender if you check the
[02:14:53] values inside those two columns as you
[02:14:55] can see we have here low cardinality so
[02:14:57] we have limited numbers of possible
[02:14:59] values that is used inside those two
[02:15:02] columns so what we usually do is to go
[02:15:04] and check the data consistency inside
[02:15:06] those two columns so it's very simple
[02:15:09] what we're going to do we're going to do
[02:15:10] the following we're going to say
[02:15:13] distinct and we're going to check the
[02:15:15] values let's go and do that and now as
[02:15:17] you can see we have only three possible
[02:15:19] values either null F or M which is okay
[02:15:22] we can stay like this of course but we
[02:15:24] can make a rule in our project where we
[02:15:26] can say we will not be working with data
[02:15:29] abbreviations we will go and use only
[02:15:31] friendly full names so instead of having
[02:15:34] an F we're going to have like a full
[02:15:36] word female and instead of M we're going
[02:15:38] to have like male and we make it as a
[02:15:40] rule for the whole project so each time
[02:15:42] we find the gender informations we try
[02:15:44] to give the full name of it so let's go
[02:15:46] and map those two values to a friendly
[02:15:49] one so we're going to go to the gender
[02:15:50] of over here and say case when and we're
[02:15:53] going to say the gender is equal to F
[02:15:57] then make it a
[02:15:59] female and when it
[02:16:02] is equal to
[02:16:05] M then map it to male and now we have to
[02:16:09] make decision about the nulls as you can
[02:16:11] see over here we have nulls so do we
[02:16:13] want to leave it as a null or we want to
[02:16:15] use always the value unknown so with
[02:16:18] that we are replacing the missing values
[02:16:21] with a standard default value or you can
[02:16:23] leave it as a null but let's say in our
[02:16:25] project that we are replacing all the
[02:16:27] missing value with a default value so
[02:16:29] let's go and do that we going to say
[02:16:32] else I'm going to go with the na not
[02:16:35] available or you can go with the unknown
[02:16:37] of course so that's for the gender
[02:16:39] information like this and we can go and
[02:16:41] remove the old one and now there is one
[02:16:43] thing that I usually do in this case
[02:16:46] where sometimes what happens currently
[02:16:47] we are getting the capital F and the
[02:16:49] capital M but maybe over time
[02:16:51] something changed and you will get like
[02:16:53] lower M and lower F so just to make sure
[02:16:55] in those cases we still are able to map
[02:16:58] those values to the correct value what
[02:17:00] we're going to do we're going to just
[02:17:01] use the function upper just to make sure
[02:17:04] that if we get any lowercase values we
[02:17:07] are able to catch it so the same thing
[02:17:10] over here as well and now one more thing
[02:17:13] that you can add as well of course if
[02:17:15] you are not trusting the data because we
[02:17:17] saw some unwanted spaces in the first
[02:17:19] name and the last name you might not
[02:17:20] trust that in the future you will get
[02:17:22] here as well unwanted spaces you can go
[02:17:25] and make sure to trim
[02:17:27] everything just to make sure that you
[02:17:30] are catching all those cases so that's
[02:17:33] it for now let's go and execute now as
[02:17:35] you can see we don't have an m and an F
[02:17:37] we have a full word male and female and
[02:17:41] if we don't have a value we don't have a
[02:17:42] null we are getting here not available
[02:17:45] now we can go and do the same stuff for
[02:17:47] the marital status you can see as well we
[02:17:49] have only three possibilities the S
[02:17:51] null and an M we can go and do the same
[02:17:54] stuff so I will just go and copy
[02:17:56] everything from here and I will go and
[02:17:58] use the marital status I just remove
[02:18:01] this one from here and now what are the
[02:18:03] possible values we have the S so it's
[02:18:05] going to be single we have an M for
[02:18:09] married and we have as well a null and
[02:18:12] with that we are getting the not
[02:18:13] available so with that we are making as
[02:18:15] well data standardizations for this
[02:18:18] column so let's go and execute it now as
[02:18:21] you can see we don't have those short
[02:18:22] values we have a full friendly value for
[02:18:25] the status and as well for the gender
[02:18:27] and at the same time we are handling the
[02:18:29] nulls inside those two columns so with
[02:18:32] that we are done with those two columns
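The two `CASE WHEN` mappings can be sketched like this, using Python's built-in `sqlite3` as a stand-in for SQL Server. The column names `cst_gndr` and `cst_marital_status`, the default label `'n/a'`, and the sample rows are assumptions for illustration. `UPPER` and `TRIM` are wrapped around the column so lowercase or padded codes still map correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_crm_cust_info (
        cst_id INTEGER, cst_gndr TEXT, cst_marital_status TEXT
    )
""")
conn.executemany(
    "INSERT INTO bronze_crm_cust_info VALUES (?, ?, ?)",
    [(1, "F", "S"), (2, "m ", "M"), (3, None, None)],
)

# NULL never equals 'F' or 'M', so missing values fall through to ELSE.
standardized = conn.execute("""
    SELECT cst_id,
           CASE WHEN UPPER(TRIM(cst_gndr)) = 'F' THEN 'Female'
                WHEN UPPER(TRIM(cst_gndr)) = 'M' THEN 'Male'
                ELSE 'n/a'
           END AS cst_gndr,
           CASE WHEN UPPER(TRIM(cst_marital_status)) = 'S' THEN 'Single'
                WHEN UPPER(TRIM(cst_marital_status)) = 'M' THEN 'Married'
                ELSE 'n/a'
           END AS cst_marital_status
    FROM bronze_crm_cust_info
    ORDER BY cst_id
""").fetchall()
print(standardized)
```

Note that the `ELSE` branch also swallows any unexpected code, not just nulls, so it is worth rerunning the `DISTINCT` check on the source from time to time.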
[02:18:33] and now we can go to the last one that
[02:18:35] create date for this type of
[02:18:36] informations we make sure that this
[02:18:39] column is a real date and not as a
[02:18:41] string or varchar and as we defined it in
[02:18:43] the data type it is a date which is
[02:18:45] completely correct so nothing to do with
[02:18:48] this column and now the next step is
[02:18:50] that we're going to go and write the
[02:18:51] insert statement so how we're going to
[02:18:53] do it we're going to go to the start
[02:18:54] over here and say insert into silver do
[02:18:59] CRM customer info now we have to go and
[02:19:02] specify all the columns that should be
[02:19:04] inserted so we're going to go and type
[02:19:06] it so something like this and then we
[02:19:08] have the query over here let's go and
[02:19:11] execute it so let's do that so with that
[02:19:13] we have inserted clean data inside the
[02:19:16] silver table so now what we're going to
[02:19:17] do we're going to go and take all the
[02:19:19] queries that we have used used in order
[02:19:21] to check the quality of the bronze and
[02:19:23] let's go and take it to another query
[02:19:25] and instead of having bronze we're going
[02:19:27] to say silver so this is about the
[02:19:29] primary key let's go and execute it
[02:19:32] perfect we don't have any results so we
[02:19:34] don't have any duplicates the same thing
[02:19:36] for the next one so the silver and it
[02:19:40] was for the first name so let's go and
[02:19:43] check the first name and run it as you
[02:19:46] can see there is no results it is
[02:19:48] perfect we don't have any issues you can
[02:19:50] of course go and check the last
[02:19:53] name and run it again we don't have any
[02:19:56] result over here and now we can go and
[02:19:58] check those low cardinality columns like
[02:20:01] for
[02:20:02] example the gender let's go and execute
[02:20:05] it so as you can see we have the not
[02:20:06] available or the unknown male and female
[02:20:09] so perfect and you can go and have a
[02:20:11] final look to the table to the silver
[02:20:14] customer info let's go and check that so
[02:20:16] now we can have a look to all those
[02:20:18] columns as you can see everything looks
[02:20:19] perfect and you can see it is working
[02:20:22] this metadata information that we have
[02:20:24] added to the table definition now it
[02:20:26] says when we have inserted all those
[02:20:29] records to the table which is really
[02:20:31] amazing information to have a track and
[02:20:33] audit okay so now by looking to the
[02:20:35] script we have done different types of
[02:20:37] data Transformations the first one is
[02:20:39] with the first name and the last name
[02:20:41] here we have done trimming removing
[02:20:43] unwanted spaces this is one of the types
[02:20:45] of data cleansing so we remove
[02:20:47] unnecessary spaces or unwanted
[02:20:49] characters to to ensure data consistency
[02:20:52] now moving on to the next transformation
[02:20:54] we have this CASE WHEN so what we have
[02:20:56] done here is data normalization or we
[02:20:59] call it sometimes data standardization
[02:21:01] so this transformation is type of data
[02:21:03] cleansing where we can map coded values
[02:21:06] to meaningful userfriendly description
[02:21:09] and we have done the same transformation
[02:21:10] as well to the gender another type of
[02:21:13] transformation that we have done as well
[02:21:15] in the same case when is that we have
[02:21:17] handled the missing values so instead of
[02:21:19] nulls we can have not available so
[02:21:22] handling missing data is as well type of
[02:21:24] data cleansing where we are filling the
[02:21:26] blanks by adding for example a default
[02:21:29] value so instead of having an empty
[02:21:31] string or a null we're going to have a
[02:21:33] default value like the not available or
[02:21:35] unknown another type of data and
[02:21:36] Transformations that we have done in
[02:21:38] this script is we have removed the
[02:21:40] duplicates so removing duplicates is as
[02:21:42] well type of data cleansing where we
[02:21:44] ensure only one record for each primary
[02:21:47] key by identifying and retaining only
[02:21:50] the most relevant row to ensure there
[02:21:53] is no duplicates inside our data and as
[02:21:55] we are removing the duplicates of course
[02:21:57] we are doing data filtering so those are
[02:21:59] the different types of data
[02:22:01] Transformations that we have done in
[02:22:03] this
[02:22:06] script all right moving on to the second
[02:22:09] table in the bronze layer from the CRM
[02:22:11] we have the product info and of course
[02:22:13] as usual before we start writing any
[02:22:15] Transformations we have to search for
[02:22:17] data quality issues and we start with
[02:22:19] the first one we have to check the
[02:22:20] primary key so we have to check whether
[02:22:22] we have duplicates or nulls inside this
[02:22:24] key so what you have to do we have to
[02:22:26] group up the data by the primary key or
[02:22:28] check whether we have nulls so let's go
[02:22:30] and execute it so as you can see
[02:22:32] everything is safe we don't have duplicates or
[02:22:34] nulls in the primary key now moving on
[02:22:36] to the next one we have the product key
[02:22:38] here we have in this column a lot of
[02:22:40] informations so now what you have to do
[02:22:41] is to go and split this string into two
[02:22:44] informations so we are deriving new two
[02:22:46] columns so now let's start with the
[02:22:48] first one is the category ID the first
[02:22:51] five characters they are actually the
[02:22:53] category ID and we can go and use the
[02:22:55] substring function in order to extract
[02:22:58] part of a string it needs three
[02:23:00] arguments the first one going to be the
[02:23:02] column that we want to extract from and
[02:23:04] then we have to define the position
[02:23:06] where to extract and since the first
[02:23:08] part is on the left side we going to
[02:23:10] start from the first position and then
[02:23:12] we have to specify the length so how
[02:23:14] many characters we want to extract we
[02:23:16] need five characters so 1 2 3 4 five so
[02:23:20] that's set for the category ID category
[02:23:22] ID let's go and execute it now as you
[02:23:25] can see we have a new column called the
[02:23:27] category ID and it contains the first
[02:23:29] part of the string and in our database
[02:23:32] from the other source system we have as
[02:23:33] well the category ID now we can go and
[02:23:36] double check just in order to make sure
[02:23:38] that we can join data together so we're
[02:23:40] going to go and check the ID from the
[02:23:43] bronze table ERP and this is going to be from the
[02:23:47] category so in this table we have the
[02:23:49] category ID and you can see over here
[02:23:52] those are the IDS of the category and in
[02:23:54] the silver layer we have to go and join those
[02:23:57] two tables but here we still have an
[02:23:58] issue we have here an underscore between
[02:24:01] the category and the subcategory but in
[02:24:04] our table we have actually a minus so we
[02:24:07] have to replace that with an underscore
[02:24:09] in order to have matching informations
[02:24:11] between those two tables otherwise we
[02:24:13] will not be able to join the tables so
[02:24:14] we're going to use the function
[02:24:16] replace and what we are replacing we are
[02:24:19] replacing the minus
[02:24:21] with an underscore something like this
[02:24:24] and if you go now and execute it we will
[02:24:26] get an underscore exactly like the other
[02:24:29] table and of course we can go and check
[02:24:31] whether everything is matching by having
[02:24:33] very simple query where we say this new
[02:24:36] information not in and then we have this
[02:24:40] nice subquery so we are trying to find
[02:24:42] any category ID that is not available in
[02:24:46] the second table so let's go and execute
[02:24:48] it now as you can see we have only one
[02:24:49] category that is not matching we are not
[02:24:52] finding it in this table which is maybe
[02:24:54] correct so if you go over here you will
[02:24:56] not find this category I just make it a
[02:24:59] little bit bigger so we are not finding
[02:25:01] this one category from this table which
[02:25:03] is fine so our check is okay okay so
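A sketch of deriving the category ID and checking it against the category table, with Python's built-in `sqlite3` in place of SQL Server. The table names, column names, and key values are invented for the example. SQLite's `SUBSTR` is used where T-SQL would use `SUBSTRING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bronze_crm_prd_info (prd_key TEXT);
    CREATE TABLE bronze_erp_px_cat (id TEXT);
    INSERT INTO bronze_crm_prd_info VALUES
        ('CO-RF-FR-R92B-58'), ('AC-HE-HL-U509-R'), ('CO-PE-PD-M562');
    INSERT INTO bronze_erp_px_cat VALUES ('CO_RF'), ('AC_HE');
""")

# First five characters, with the minus swapped for an underscore
# so the value matches the ID format of the category table.
cat_ids = [r[0] for r in conn.execute("""
    SELECT REPLACE(SUBSTR(prd_key, 1, 5), '-', '_') AS cat_id
    FROM bronze_crm_prd_info
    ORDER BY prd_key
""")]

# Categories derived from the products that the category table lacks.
unmatched = [r[0] for r in conn.execute("""
    SELECT REPLACE(SUBSTR(prd_key, 1, 5), '-', '_') AS cat_id
    FROM bronze_crm_prd_info
    WHERE REPLACE(SUBSTR(prd_key, 1, 5), '-', '_') NOT IN
          (SELECT id FROM bronze_erp_px_cat)
""")]
print(cat_ids, unmatched)
```

One thing to keep in mind with `NOT IN`: if the subquery returns a null, the check returns no rows at all, so it is only reliable when the ID column on the other side has no nulls.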
[02:25:06] with that we have the first part now we
[02:25:07] have to go and extract the second part
[02:25:10] and we're going to do the same thing so
[02:25:11] we're going to use the substring and the
[02:25:13] three argument the product key but this
[02:25:15] time we will not start cutting from the
[02:25:17] first position we have to be in the
[02:25:19] middle so 1 2 3 4 5 6 7 so we start
[02:25:24] from the position number seven and now
[02:25:26] we have to define the length how many
[02:25:28] characters to be extracted but if you
[02:25:30] look over here you can see that we have
[02:25:32] different length of the product keys it
[02:25:35] is not fixed like the category ID so we
[02:25:37] cannot go and use specified number we
[02:25:39] have to make something Dynamic and there
[02:25:41] is a trick in order to do that we're going to
[02:25:43] go and use the length of the whole
[02:25:45] column with that we make sure that we
[02:25:47] are always getting enough characters to
[02:25:49] be extra Ed and we will not be losing
[02:25:51] any informations so we will make it
[02:25:54] Dynamic like this we will not have it as
[02:25:56] a fixed length and with that we have the
[02:25:58] product key so let's go and execute it
[02:26:02] as you can see we are now extracting the
[02:26:04] second part from this string now why we
[02:26:07] need the product key we need it in order
[02:26:09] to join it with another table called
[02:26:12] sales details so let's go and check the
[02:26:14] sales details so let me just check the
[02:26:17] column name it is SLS product key so
[02:26:22] from bronze
[02:26:24] CRM sales let's go and check the data
[02:26:27] over here and it looks wonderful so
[02:26:31] actually we can go and join those
[02:26:32] informations together but of course we
[02:26:34] can go and check that so we're going to
[02:26:35] say where and we're going to take our
[02:26:37] new column and we're going to say not in
[02:26:40] the subquery just to make sure that we
[02:26:42] are not missing anything so let's go and
[02:26:45] execute so it looks like we have a lot
[02:26:47] of products that don't have any orders
[02:26:51] well I don't have a nice feelings about
[02:26:53] it let's go and try something like this
[02:26:55] one here and we say where sls_prd_key
[02:27:00] like this value over here so I'll just
[02:27:04] cut the last three just to search inside
[02:27:06] this table so we really don't have such
[02:27:09] a keys let me just cut the second one so
[02:27:12] let's go and search for it we don't have
[02:27:15] it as well so anything that starts with
[02:27:17] the FK we don't have any order with the
[02:27:20] product where it starts with the FK
[02:27:22] so let's go and remove it but still we
[02:27:25] are able to join the tables right so if
[02:27:27] I go and say in instead of not in so
[02:27:30] with that you are able to match all
[02:27:32] those products so that means everything
[02:27:34] is fine actually it's just products that
[02:27:37] don't have any orders so with that I'm
[02:27:40] happy with this transformation now
[02:27:42] moving on to the next one we have here
[02:27:44] the name of the product we can go and
[02:27:46] check whether there is unwanted spaces
[02:27:49] so let's go to our quality checks make
[02:27:51] sure to use the same table and we're
[02:27:54] going to use the product name and check
[02:27:56] whether we find any unmatching after
[02:27:59] trimming so let's go and do it well it
[02:28:01] looks really fine so we don't have to
[02:28:03] trim anything this column is safe now
[02:28:06] moving on to the next one we have the
[02:28:08] costs so here we have numbers and we
[02:28:10] have to check the quality of the numbers
[02:28:12] so what we can do we can check whether
[02:28:14] we have nulls or negative numbers so
[02:28:16] negative costs or negative prices which
[02:28:19] is not really realistic depend on the
[02:28:21] business of course so let's say in our
[02:28:22] business we don't have any negative
[02:28:25] costs so it's going to be like this
[02:28:27] let's go and check whether is something
[02:28:29] less than zero or whether we have costs
[02:28:33] that is null so let's go and check those
[02:28:37] informations well as you can see we
[02:28:39] don't have any negative values but we
[02:28:40] have nulls so we can go and handle that
[02:28:43] by replacing the null with a zero of
[02:28:46] course if the business allow that so in
[02:28:48] SQL server in order to replace the null
[02:28:50] with a zero we have a very nice function
[02:28:52] called ISNULL so we are saying if it is
[02:28:56] null then replace this value with a zero
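
As a rough sketch of the cost check and the fix just described (the column name `prd_cost` and the sample values are assumptions for illustration, not the real bronze data):

```sql
-- Quality check: look for negative or missing costs.
-- Sample values are invented; -5 is included only to show the check firing.
SELECT prd_cost
FROM (VALUES (12), (NULL), (-5), (25)) AS s(prd_cost)
WHERE prd_cost < 0 OR prd_cost IS NULL;

-- Fix: replace NULL with 0 using the SQL Server function ISNULL.
-- COALESCE(prd_cost, 0) is the standard SQL equivalent on other databases.
SELECT
    prd_cost,
    ISNULL(prd_cost, 0) AS prd_cost_clean
FROM (VALUES (12), (NULL), (25)) AS s(prd_cost);
```

Whether zero is an acceptable stand-in for a missing cost is a business decision, as mentioned above.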
[02:28:59] it is very simple like this and we give
[02:29:02] it a name of course so let's go and
[02:29:04] execute it and as you can see we don't
[02:29:06] have any more nulls we have zero which
[02:29:09] is better for the calculations if you
[02:29:10] are later doing any aggregate functions
[02:29:13] like the average now moving on to the
[02:29:15] next one we have the product line This
[02:29:17] is again abbreviation of something and
[02:29:19] the cardinality is low so let's go and
[02:29:21] check all possible values inside this
[02:29:24] column so we're just going to use the
[02:29:26] distinct going to be prd_line so let's
[02:29:29] go and execute it and as you can see the
[02:29:31] possible values are null, M, R, S, T and
[02:29:34] again those are abbreviations but in our
[02:29:36] data warehouse we have decided to give
[02:29:39] full nice names so we have to go and
[02:29:41] replace those codes those abbreviations
[02:29:44] with a friendly value and of course in
[02:29:46] order to get those informations I
[02:29:47] usually go and ask the expert from the
[02:29:50] source system or an expert from the
[02:29:52] process so let's start building our case
[02:29:55] when and then let's use the upper and as
[02:29:58] well the trim just to make sure that we
[02:30:00] are having all the cases so the
[02:30:04] prd_line is equal to so let's start with the
[02:30:07] first value the M then we will get the
[02:30:11] friendly value it's going to be Mountain
[02:30:14] then to the next one so I will just copy
[02:30:16] and paste here if it is an R then it is
[02:30:20] Road and another one for let me check
[02:30:24] what do we have here we have M, R and then
[02:30:27] s the S stands for other sales and we
[02:30:32] have the T so let's go and get the T so
[02:30:35] the T stands for
[02:30:37] touring we have at the end an else for
[02:30:40] unknown not available so we don't need
[02:30:43] any nulls so that's it and we're going
[02:30:45] to name it as before so product line so
[02:30:48] let's remove the old one and let's
[02:30:50] execute it and as you can see we don't
[02:30:53] have here anymore those shortcuts and
[02:30:55] the abbreviations we have now full
[02:30:57] friendly value but I will go and have
[02:31:00] here like capital O it looks nicer so
[02:31:03] that we have nice friendly value now by
[02:31:05] looking to this case when as you can see
[02:31:06] it is always like we are mapping one
[02:31:08] value to another value and we are
[02:31:10] repeating all the time upper trim upper trim
[02:31:13] and so on we have here a quick form in
[02:31:15] the case when if it is just a simple
[02:31:17] mapping so the syntax is very simple we
[02:31:19] say case and then we have the column so
[02:31:23] we are evaluating this value over here
[02:31:25] and then we just say when without the
[02:31:28] equal so if it is an M then make it
[02:31:31] Mountain the same thing for the next one
[02:31:33] and so on so with that we have the
[02:31:36] functions only once and we don't have to
[02:31:38] go and keep repeating the same function
[02:31:39] over and over and this one only if you
[02:31:41] are mapping values but if you have
[02:31:43] complex conditions you can do it like
[02:31:45] this but for now I'm going to stay with
[02:31:47] the quick form of the case when it looks
[02:31:49] nicer and shorter so let's go and
[02:31:50] execute it we will get the same results
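
The two CASE forms discussed above can be compared side by side. This is only a sketch: the column name `prd_line`, the sample codes, and the `'n/a'` label for "not available" are assumptions, and `TRIM` needs SQL Server 2017 or later.

```sql
-- Both columns give the same result; the simple form evaluates
-- UPPER(TRIM(prd_line)) once instead of repeating it per branch.
SELECT
    prd_line,
    -- Searched form: full condition in every WHEN
    CASE
        WHEN UPPER(TRIM(prd_line)) = 'M' THEN 'Mountain'
        WHEN UPPER(TRIM(prd_line)) = 'R' THEN 'Road'
        WHEN UPPER(TRIM(prd_line)) = 'S' THEN 'Other Sales'
        WHEN UPPER(TRIM(prd_line)) = 'T' THEN 'Touring'
        ELSE 'n/a'
    END AS prd_line_searched,
    -- Simple ("quick") form: only usable for plain value mapping
    CASE UPPER(TRIM(prd_line))
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Other Sales'
        WHEN 'T' THEN 'Touring'
        ELSE 'n/a'
    END AS prd_line_simple
FROM (VALUES ('M'), ('r '), ('S'), ('T'), (NULL)) AS s(prd_line);
```

The lowercase, padded `'r '` maps to Road in both forms, and the NULL row falls through to the ELSE branch.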
[02:31:53] okay so now back to our table let's go
[02:31:54] to the last two columns we have the
[02:31:56] start and end date so it's like defining
[02:31:59] an interval we have start and end so
[02:32:01] let's go and check the quality of the
[02:32:02] start and end dates we're going to go
[02:32:04] and say select
[02:32:05] star from our bronze table and now we're
[02:32:09] going to go and search it like this we
[02:32:11] are searching for the end date that is
[02:32:14] smaller than the start so prd_start_dt
[02:32:18] so let's go and query this
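
A minimal sketch of this date-order check, run against a few invented rows instead of the bronze table (the column names `prd_start_dt` and `prd_end_dt` and all values are assumptions chosen to resemble the situation described):

```sql
-- Quality check: an end date earlier than its start date is invalid.
-- Rows are made up for illustration.
SELECT prd_key, prd_start_dt, prd_end_dt
FROM (VALUES
    ('HL-U509', CAST('2011-07-01' AS DATE), CAST('2007-12-28' AS DATE)),
    ('HL-U509', CAST('2012-07-01' AS DATE), CAST('2008-12-27' AS DATE)),
    ('HL-U509', CAST('2013-07-01' AS DATE), NULL)
) AS s(prd_key, prd_start_dt, prd_end_dt)
WHERE prd_end_dt < prd_start_dt;
```

The first two rows come back as violations; the row with a NULL end date is not flagged, because a comparison with NULL is never true.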
[02:32:21] so you can see the start is always like
[02:32:23] after the end which makes no sense at
[02:32:26] all so we have here data issue with
[02:32:27] those two dates so now for this kind of
[02:32:29] data Transformations what I usually do
[02:32:31] is I go and grab few examples and put it
[02:32:34] in Excel and try to think about how I'm
[02:32:36] going to go and fix it so here I took
[02:32:38] like two products this one and this one
[02:32:40] over here and for that we have like
[02:32:41] three rows for each one of them and we
[02:32:43] have this situation over here so the
[02:32:45] question now how we going to go and fix
[02:32:47] it I will go and make like a copy of one
[02:32:49] solution where we're going to say it's
[02:32:51] very simple let's go and switch the
[02:32:53] start date with the end date so if I go
[02:32:55] and grab the end dates and put it at the
[02:32:58] starts things going to look way nicer
[02:33:00] right so we have the start is always
[02:33:02] younger than the end but my friends the
[02:33:05] data now makes no sense because we say
[02:33:07] it starts from 2007 and ends by 2011 the
[02:33:11] price was 12 but between 2008 and 2012
[02:33:15] we have 14 which is not really good
[02:33:18] because if you take for example the year
[02:33:20] 2010 for 2010 it was 12 and at the same
[02:33:23] time 14 so it is really bad to have an
[02:33:26] overlapping between those two dates it
[02:33:28] should start from 2007 and end with 11
[02:33:32] and then start maybe from 12 and end with
[02:33:34] something else there should be no
[02:33:36] overlapping between years so it's not
[02:33:38] enough to say the start should be always
[02:33:41] smaller than the end but as well the end
[02:33:44] of the first history should be younger
[02:33:47] than the start of the next records this
[02:33:49] is as well a rule in order to have no
[02:33:52] overlapping this one has no start but
[02:33:54] has already an end which is not really
[02:33:57] okay because we have always to have a
[02:33:59] starts each new record in historization
[02:34:02] has to has a start so for this record
[02:34:04] over here this is as well wrong and of
[02:34:07] course it is okay to have the start
[02:34:09] without an end so in this scenario it's
[02:34:11] fine because this indicate this is the
[02:34:14] current informations about the costs so
[02:34:16] again this solution is not working at
[02:34:19] all so now for the solution two what
[02:34:20] we can say let's go and ignore
[02:34:22] completely the end date and we take only
[02:34:25] the start dates so let's go and paste it
[02:34:27] over here but now we go and rebuild the
[02:34:29] end date completely from the start date
[02:34:32] following the rules that we have defined
[02:34:34] so the rule says the end date of the
[02:34:37] current records comes from the start
[02:34:39] date from the next records so here this
[02:34:42] end date comes from this value over here
[02:34:45] from the next record so that means we
[02:34:47] take the next start date and put it at
[02:34:49] the end date for the previous records so
[02:34:51] with that as you can see it is working
[02:34:53] the end date is higher than the start
[02:34:56] dates and as well we are making sure
[02:34:58] this date is not overlapping with the
[02:35:00] next record but as well in order to make
[02:35:02] it way nicer we can subtract it with one
[02:35:05] so we can take the previous day like
[02:35:08] this so with that we are making sure the
[02:35:10] end date is smaller than the next start
[02:35:13] now for the next record this one over
[02:35:16] here the end date going to come from the
[02:35:18] next start date so we will take this one
[02:35:20] for here and put it as an end date and
[02:35:22] subtract it with one so we will get the
[02:35:26] previous day so now if you compare those
[02:35:28] two you can see it's still higher than
[02:35:30] the start and if you compare it with the
[02:35:32] next record this one over here it is still
[02:35:35] smaller than the next one so there is no
[02:35:37] overlapping and now for the last record
[02:35:39] since we don't have here any
[02:35:40] informations it will be a null which is
[02:35:43] totally fine so as you can see I'm
[02:35:45] really happy with this scenario over
[02:35:47] here of course you can go and validate
[02:35:48] this with an expert from the source system
[02:35:51] let's say I've done that and they
[02:35:52] approved it and now I can go and clean
[02:35:54] up the data using this New Logic so this
[02:35:57] is how I usually brainstorm about fixing
[02:35:59] an issues if I have like a complex stuff
[02:36:01] I go and use Excel and then discuss it
[02:36:03] with the expert using this example it's
[02:36:05] way better than showing a database
[02:36:07] queries and so on it just makes things
[02:36:10] easier to explain and as well to discuss
[02:36:12] so now how I usually do it I usually go
[02:36:14] and make a focus on only the columns
[02:36:16] that I need and take only one two
[02:36:18] scenarios while I'm building the logic
[02:36:20] and once everything is ready I go and
[02:36:22] integrate it in the query so now I'm
[02:36:24] focusing only on these columns and only
[02:36:26] for these products so now let's go and
[02:36:29] build our logic now in SQL if you are at
[02:36:31] specific record and you want to access
[02:36:33] another information from another records
[02:36:36] and for that we have two amazing window
[02:36:38] functions we have the lead and lag in
[02:36:40] this scenario we want to access the next
[02:36:42] records that's why we have to go with
[02:36:44] the function lead so let's go and build
[02:36:46] it lead and then what do we need we need
[02:36:48] the lead for
[02:36:50] the
[02:36:50] start date so we want the start date of
[02:36:53] the next records and then we say over
[02:36:56] and we have to partition the data so the
[02:36:59] window going to be focusing on only one
[02:37:02] product which is the product key and not
[02:37:04] the product ID so we are dividing the
[02:37:06] data by product key and of course we
[02:37:08] have to go and sort the data so order by
[02:37:10] and we are sorting the data by the start
[02:37:13] dates and ascending so from the lowest
[02:37:16] to the highest and let's go and give it
[02:37:18] another name so as let's say test for
[02:37:21] example just to test the data so let's
[02:37:24] go and execute and I think I missed
[02:37:26] something here it say Partition by so
[02:37:28] let's go and execute again and now let's
[02:37:30] go and check the results for the first
[02:37:32] partition over here so the start is 2011
[02:37:35] and the end is 2012 and this information
[02:37:38] came from the next record so this data
[02:37:41] is moved to the previous record over
[02:37:43] here and the same thing for this record
[02:37:45] so the end date comes from the next
[02:37:48] record so our logic is working and the
[02:37:50] last record over here is null because we
[02:37:52] are at the end of the window and there
[02:37:54] is no next data that's why we will get
[02:37:56] null and this is perfect of course so it
[02:37:58] looks really awesome but what is missing
[02:38:00] is we have to go and get the previous
[02:38:03] day and we can do that very simply using
[02:38:05] minus one we are just subtracting one
[02:38:07] day so we have no overlapping between
[02:38:09] those two dates and the same thing for
[02:38:11] those two dates so as you can see we
[02:38:13] have just built a perfect end date which
[02:38:15] is way better than the original data
[02:38:17] that we got from the source system now
[02:38:19] let's take this one over here and put it
[02:38:22] inside our query so we don't need the
[02:38:24] end date we need our new end date we just
[02:38:28] remove that test and execute now it
[02:38:30] looks perfect all right now we are not
[02:38:32] done yet with those two dates actually
[02:38:35] we are saying all time dates because we
[02:38:37] don't have here any informations about
[02:38:39] the time always zero so it makes no
[02:38:41] sense to have these informations inside
[02:38:44] our data so what we can do we can do a
[02:38:46] very simple cast and we make this column
[02:38:49] as a date instead of date time so this
[02:38:52] is for the first one and as well for the
[02:38:54] next one as dates so let's try that out
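
Putting the pieces together, the rebuilt end date could look roughly like the sketch below. Column names and sample rows are assumptions for illustration. The sketch uses `DATEADD(DAY, -1, ...)` rather than the `- 1` shown in the video, since `- 1` only works on DATETIME values while `DATEADD` works on both DATE and DATETIME.

```sql
-- For each product key, the end date is the day before the next
-- record's start date; the last record in each window gets NULL.
-- Sample rows are invented for illustration.
SELECT
    prd_key,
    CAST(prd_start_dt AS DATE) AS prd_start_dt,
    CAST(
        DATEADD(DAY, -1,
            LEAD(prd_start_dt) OVER (
                PARTITION BY prd_key
                ORDER BY prd_start_dt
            )
        ) AS DATE
    ) AS prd_end_dt
FROM (VALUES
    ('HL-U509', CAST('20110701' AS DATETIME)),  -- end date: 2012-06-30
    ('HL-U509', CAST('20120701' AS DATETIME)),  -- end date: 2013-06-30
    ('HL-U509', CAST('20130701' AS DATETIME))   -- end date: NULL (current record)
) AS s(prd_key, prd_start_dt);
```

Because every end date is derived from the following start date, intervals for the same product cannot overlap.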
[02:38:57] and as you can see it is nicer we don't
[02:38:59] have the time informations of course we
[02:39:02] can tell the source systems about all
[02:39:03] those issues but since they don't
[02:39:05] provide the time it makes no sense to
[02:39:07] have date and time okay so it was a long
[02:39:09] run but we have now cleaned product
[02:39:12] informations and this is way nicer than
[02:39:14] the original product information that we
[02:39:16] got from the source CRM so if you grab
[02:39:18] the ddl of the silver table you can see
[02:39:20] that we don't have a category ID so we
[02:39:23] have product ID and product key and as
[02:39:25] well those two columns we just change
[02:39:27] the data type so it's date time here but
[02:39:29] we have changed that to a date so that
[02:39:31] means we have to go and do few
[02:39:33] modifications to the ddl so what we
[02:39:35] going to do we're going to go over here
[02:39:36] and say category ID and I will be using
[02:39:38] the same data type and for the start and
[02:39:41] end this time it's going to be date and
[02:39:43] not date and time so that's it for now
[02:39:45] let's go ahead and execute it in order to
[02:39:47] repair the ddl and this is what happen
[02:39:49] in the silver layer sometimes we have to
[02:39:51] adjust the metadata if the quality of
[02:39:54] the data types and so on is not good or
[02:39:56] we are building new derived informations
[02:39:58] in order later to integrate the data so
[02:40:00] it will be like very close to the bronze
[02:40:02] layer but with few modifications so make
[02:40:05] sure to update your ddl scripts and now
[02:40:08] the next step is that we're going to go
[02:40:09] and insert the data into the table and
[02:40:12] now the next step we're going to go and
[02:40:13] insert the result of this query that is
[02:40:16] cleaning up the bronze table into the
[02:40:18] silver table so as we' done it before
[02:40:20] insert into silver the product info and
[02:40:24] then we have to go and list all the
[02:40:25] columns I've just prepared those columns
[02:40:27] so with that we can go and now run our
[02:40:31] query in order to insert the data so now
[02:40:33] as you can see SQL did insert the data
[02:40:35] and the very important step is now to
[02:40:37] check the quality of the silver table so
[02:40:39] we go back to our data quality checks
[02:40:41] and we go switch to the silver so let's
[02:40:44] check the primary key there is no issues
[02:40:47] and we can go and check for example here
[02:40:49] the trims there is as well no issue
[02:40:51] and now let's go and check the costs it
[02:40:54] should not be negative or null which is
[02:40:56] perfect let's go and check the data
[02:40:59] standardizations as you can see they are
[02:41:01] friendly and we don't have any nulls and
[02:41:03] now very interesting the order of the
[02:41:05] dates so let's go and check that as you
[02:41:07] can see we don't have any issues and
[02:41:10] finally what I do I go and have a final
[02:41:13] look to the silver table and as we can
[02:41:15] see everything is inserted correctly in
[02:41:18] the correct columns so all those
[02:41:20] columns comes from the source system and
[02:41:22] the last one is automatically generated
[02:41:24] from the ddl indicate when we loaded
[02:41:27] this table now let's sit back and have a
[02:41:29] look to our script what are the
[02:41:30] different types of data Transformations
[02:41:32] that we have done here is for example
[02:41:34] over here the category ID and the
[02:41:35] product key we have derived new columns
[02:41:38] so it is when we create a new column
[02:41:41] based on calculations or transformations
[02:41:43] of an existing one so sometimes we need
[02:41:45] columns only for analytics and we cannot
[02:41:48] each time go to the source system and
[02:41:49] ask them to create it so instead of that
[02:41:52] we derive our own columns that we need
[02:41:54] for the analytics another transformation
[02:41:56] we have is the ISNULL over here so we
[02:41:59] are handling here missing information
[02:42:01] instead of null we're going to have a
[02:42:02] zero and one more transformation we have
[02:42:05] over here for the product line we have
[02:42:07] done here data normalization instead of
[02:42:09] having a code value we have a friendly
[02:42:12] value and as well we have handled the
[02:42:14] missing data for example over here
[02:42:16] instead of having a null we're going to
[02:42:17] have not available all right moving on
[02:42:19] to another data transformation we have
[02:42:21] done data type casting so we are
[02:42:23] converting the data type from one to
[02:42:25] another and this considered as well to
[02:42:27] be a data transformation and now moving
[02:42:29] on to the last one we are doing as well
[02:42:31] data type casting but what's more
[02:42:33] important we are doing data enrichment
[02:42:36] this type of transformation it's all
[02:42:37] about adding a value to your data so we
[02:42:40] are adding a new relevant data to our
[02:42:42] data sets so those are the different
[02:42:44] types of data Transformations that we
[02:42:47] have done for this table
[02:42:52] okay so let's keep going we have the
[02:42:53] sales details and this is the last table
[02:42:55] in the CRM so what do you have over here
[02:42:57] we have the order number and this is a
[02:42:59] string of course we can go and check
[02:43:00] whether we have an issue with the
[02:43:02] unwanted spaces so we can search whether
[02:43:04] we're going to find something so we can
[02:43:06] say trim and something like this and
[02:43:09] let's go and execute it so we can see
[02:43:10] that we don't have any unwanted spaces
[02:43:12] that means we don't have to transform
[02:43:14] this column so we can leave it as it is
[02:43:16] now the next two columns they are like
[02:43:18] keys and IDs in order to connect it
[02:43:20] with the other tables as we learned
[02:43:22] before we are using the product key in
[02:43:24] order to connect it with the product
[02:43:26] informations and we are connecting the
[02:43:28] customer ID with the customer ID from
[02:43:30] the customer info so that means we have
[02:43:32] to go and check whether everything is
[02:43:33] working perfectly so we can go and check
[02:43:35] the Integrity of those columns where we
[02:43:37] say the product key not in and then we
[02:43:40] make a subquery and this time we can
[02:43:42] work with the silver layer right so we
[02:43:44] can say the product key from silver dot
[02:43:48] product info so let's go and query this
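
The integrity check described above can be sketched with two small stand-in tables. The CTE names, column names (`sls_prd_key`, `prd_key`) and sample keys are assumptions for illustration; in the project the same pattern would run against the sales details and the silver product table.

```sql
-- Find sales rows whose product key has no match in the product table.
-- An empty result means every sale can be joined to a product.
WITH sales_details AS (
    SELECT sls_prd_key
    FROM (VALUES ('BK-R93R-62'), ('HL-U509'), ('XX-0000')) AS s(sls_prd_key)
),
prd_info AS (
    SELECT prd_key
    FROM (VALUES ('BK-R93R-62'), ('HL-U509')) AS p(prd_key)
)
SELECT sls_prd_key
FROM sales_details
WHERE sls_prd_key NOT IN (SELECT prd_key FROM prd_info);
-- Returns only the invented orphan key 'XX-0000'.
```

One caveat worth knowing: if the subquery can return NULL, `NOT IN` yields no rows at all, so `NOT EXISTS` is the safer form when the key column is nullable.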
[02:43:50] and as you can see we are not getting
[02:43:52] any issue that means all the product
[02:43:53] keys from the sales details can be used
[02:43:56] and connected with the product info the
[02:43:58] same thing we can go and check the
[02:44:00] Integrity of the customer ID and we can
[02:44:02] use not the products we can go to the
[02:44:04] customer info and the name was CST ID so
[02:44:08] let's go and query that and the same
[02:44:10] thing we don't have here any issues so
[02:44:12] that means we can go and connect the
[02:44:13] sales with the customers using the
[02:44:15] customer ID and we don't have to do any
[02:44:17] Transformations for it so things looks
[02:44:19] really nice for those three columns now
[02:44:21] we come to the challenging one we have
[02:44:23] here the dates now those dates are not
[02:44:26] actual dates they are integer so those
[02:44:28] are numbers and we don't want to have it
[02:44:30] like this we would like to clean that up
[02:44:32] we have to change the data type from
[02:44:34] integer to a date now if you want to
[02:44:36] convert an integer to a date we have to
[02:44:38] be careful with the values that we have
[02:44:40] inside each of those columns so now
[02:44:42] let's check the quality for example of
[02:44:44] the order dates let's say where order
[02:44:46] dates is less than zero for example
[02:44:49] something negative well we don't have
[02:44:51] any negative values which is good let's
[02:44:53] go and check whether we have any zeros
[02:44:55] well this is bad so we have here a lot
[02:44:57] of zeros now what we can do we can
[02:44:59] replace those informations with a null
[02:45:01] we can use of course the NULLIF
[02:45:03] function like this we can say null if
[02:45:05] and if it is zero then make it null so
[02:45:08] let's execute it and as you can see now
[02:45:11] all those informations are null now
[02:45:13] let's go and check again the data so now
[02:45:15] this integer has the years information
[02:45:17] at the start then the months and then
[02:45:19] the day so here we have to have like 1 2
[02:45:21] 3 4 5 so the length of each number
[02:45:24] should be eight and if the length is less
[02:45:26] than eight or higher than eight then we
[02:45:28] have an issue let's go and check that so
[02:45:30] we're going to say or length sales order
[02:45:33] is not equal to eight that means less or
[02:45:37] higher let's go and execute it now let's
[02:45:39] go and check the results over here and
[02:45:41] those two informations they don't look
[02:45:43] like dates so we cannot go and make from
[02:45:45] these informations a real dates they are
[02:45:48] just bad data and of course you can go
[02:45:50] and check the boundaries of a date like
[02:45:52] for example it should not be higher than
[02:45:55] for example let's go and get this value
[02:45:57] 2050 and then I need for the month and
[02:45:59] the date so let's go and execute it and
[02:46:01] if we just remove those informations
[02:46:03] just to make sure so we don't have any
[02:46:05] date that is outside of the boundaries
[02:46:07] that you have in your business or you go
[02:46:09] for example and say the boundary should
[02:46:11] be not less than depend when your
[02:46:13] business started maybe something like
[02:46:15] this we are getting of course those
[02:46:17] values because they are less than n but
[02:46:19] if you have values around these dates
[02:46:21] you will get it as well in the query so
[02:46:23] we can go and add the rests so all those
[02:46:25] checks like validate the column that has
[02:46:28] date informations and it has the data
[02:46:30] type integer so again what are the
[02:46:32] issues over here we have zeros and
[02:46:34] sometimes we have like strange numbers
[02:46:37] that cannot be converted to a dates so
[02:46:39] let's go and fix that in our query so we
[02:46:41] can say case when the sales order the
[02:46:44] order date is equal to zero or the length of the
[02:46:47] order date is not equal to 8 then null
[02:46:51] right we don't want to deal with those
[02:46:52] values they are just wrong and they are
[02:46:54] not real dates otherwise we say else
[02:46:57] it's going to be the order dates now
[02:46:59] what we're going to do we're going to go
[02:47:00] and convert this to a date we don't want
[02:47:02] this as an integer so how we can do that
[02:47:04] we can go and cast it first to varchar
[02:47:08] because we cannot cast from integer to
[02:47:10] date in SQL Server first you have to
[02:47:13] convert it to a varchar and then from
[02:47:15] varchar you go to a dates well this is how
[02:47:17] we do it in SQL Server so we cast it
[02:47:19] first to a varchar and then we cast it to
[02:47:23] a date like this that's it so we have
[02:47:25] end and we are using the same column
[02:47:28] name so this is how we transform an
[02:47:31] integer to a date so let's go and query
[02:47:34] this and as you can see the order date
[02:47:36] now is a real date it is not a number so
[02:47:39] we can go and get rid of the old column
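
A rough sketch of the integer-to-date conversion just described, using invented sample values (the column name `sls_order_dt` is an assumption):

```sql
-- Zeros and values that are not exactly 8 digits become NULL;
-- everything else goes INT -> VARCHAR -> DATE, since SQL Server
-- does not allow a direct cast from INT to DATE.
-- Sample values are made up for illustration.
SELECT
    sls_order_dt AS raw_value,
    CASE
        WHEN sls_order_dt = 0 OR LEN(sls_order_dt) != 8 THEN NULL
        ELSE CAST(CAST(sls_order_dt AS VARCHAR) AS DATE)
    END AS sls_order_dt
FROM (VALUES (20101229), (0), (5489), (32154)) AS s(sls_order_dt);
```

An 8-digit value that is still not a valid calendar date (for example month 13) would make `CAST` fail; `TRY_CAST` returns NULL in that case and may be worth considering if such values can occur.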
[02:47:41] now we have to go and do the same stuff
[02:47:43] for the shipping dates so we can go over
[02:47:45] here and replace everything with the
[02:47:48] shipping date and let's go query well as
[02:47:50] you can see the shipping date is perfect
[02:47:52] we don't have any issue with this column
[02:47:54] but still I don't like that we found a
[02:47:55] lot of issues with the order dates so
[02:47:57] what we're going to do just in case this
[02:47:59] happens for the shipping date in the
[02:48:00] future I will go and apply the same
[02:48:03] rules to the shipping dates oh let's
[02:48:05] take the shipping
[02:48:07] date like this and if you don't want to
[02:48:09] apply it now you have always to build
[02:48:11] like quality checks that runs every day
[02:48:14] in order to detect those issues and once
[02:48:16] you detect it then you can go and do the
[02:48:18] Transformations but for now I'm going to
[02:48:20] apply it right away so that is for the
[02:48:22] shipping date now we go to the due date
[02:48:24] and we will do the same
[02:48:26] test let's go and execute it and as well
[02:48:29] it is perfect so still I'm going to
[02:48:32] apply the same rules so let's get the due date
[02:48:34] everywhere here in the query just make
[02:48:36] sure you don't miss anything here so
[02:48:39] let's go and execute now perfect as you
[02:48:41] can see we have the order date shipping
[02:48:43] date and due date and all of them are
[02:48:45] date and don't have any wrong data
[02:48:47] inside those columns now still there is
[02:48:49] one more check that we can do and is
[02:48:51] that the order date should be always
[02:48:53] smaller than the shipping date or the
[02:48:55] due date because it's makes no sense
[02:48:57] right if you are delivering an item
[02:48:59] without an order so first the order
[02:49:01] should happen then we are shipping the
[02:49:03] items so there is like an order of those
[02:49:05] dates and we can go and check that so we
[02:49:07] are checking now for invalid date orders
[02:49:09] where we going to say the order date is
[02:49:12] higher than the shipping date or we are
[02:49:15] searching as well for an order where the
[02:49:18] order date date is higher than the due
[02:49:20] dates so we going to have it like this
[02:49:22] due dates so let's go and check well
[02:49:24] that's really good we don't have such a
[02:49:26] mistake on the data and the quality
[02:49:28] looks good so the order date is always
[02:49:30] smaller than the shipping date or the
[02:49:33] due dates so we don't have to do any
[02:49:35] Transformations or cleanup okay friends
[02:49:37] now moving on to the last three columns
[02:49:39] we have the sales quantity and the price
[02:49:41] all those informations are connected to
[02:49:43] each others so we have a business rule
[02:49:45] or calculation it says the sales must be
[02:49:48] equal to quantity multiplied by the
[02:49:50] price and all sales quantity and price
[02:49:53] informations must be positive numbers so
[02:49:55] it's not allowed to be negative zero or
[02:49:58] null so those are the business rules and
[02:50:00] we have to check the data consistency in
[02:50:02] our table does all those three
[02:50:04] informations following our rules so
[02:50:07] we're going to start first with our rule
[02:50:08] right so we're going to say if the sales
[02:50:11] is not equal to quantity multiplied by
[02:50:15] the price so we are searching where the
[02:50:17] result is not matching our expectation
[02:50:20] and as well we can go and check other
[02:50:22] stuff like the nulls so for example we
[02:50:23] can say or sales is null or quantity is
[02:50:29] null and the last one for the price and
[02:50:33] as well we can go and check whether they
[02:50:35] are negative numbers or zero so we can
[02:50:38] go over here and say less or equal to
[02:50:40] zero and apply it for the other columns
[02:50:42] as well so with that we are checking the
[02:50:45] calculation and as well we are checking
[02:50:47] whether we have null, zero or negative
[02:50:49] numbers let's go and check our
[02:50:51] informations I'm going to have here A
[02:50:52] distinct so let's go and query it and of
[02:50:56] course we have here bad data but we can
[02:50:58] go and sort the data by the sales
[02:51:01] quantity and the price so let's do it
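
The consistency check built up above might look roughly like this. Column names (`sls_sales`, `sls_quantity`, `sls_price`) and all sample rows are assumptions invented to show each kind of violation.

```sql
-- Business rule: sales = quantity * price, and none of the three
-- values may be NULL, zero, or negative.
SELECT DISTINCT
    sls_sales,
    sls_quantity,
    sls_price
FROM (VALUES
    (100,  2, 50),    -- consistent, filtered out
    (2,    1, 50),    -- wrong calculation
    (NULL, 1, 10),    -- missing sales
    (-9,   1, 9),     -- negative sales
    (0,    1, 10),    -- zero sales
    (30,   1, NULL),  -- missing price
    (21,   1, -21)    -- negative price
) AS s(sls_sales, sls_quantity, sls_price)
WHERE sls_sales != sls_quantity * sls_price
   OR sls_sales IS NULL OR sls_quantity IS NULL OR sls_price IS NULL
   OR sls_sales <= 0 OR sls_quantity <= 0 OR sls_price <= 0
ORDER BY sls_sales, sls_quantity, sls_price;
```

The explicit `IS NULL` conditions matter: a NULL on either side of `!=` evaluates to unknown, so those rows would otherwise slip through the check.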
[02:51:05] now by looking to the data we can see in
[02:51:06] the sales we have nulls we have negative
[02:51:09] numbers and zeros so we have all bad
[02:51:12] combinations and as well we have here
[02:51:14] bad calculations so as you can see the
[02:51:16] price here is 50 the quantity is one but
[02:51:18] the sales is two which is not correct
[02:51:20] and here we have as well wrong
[02:51:21] calculations here we have to have a 10
[02:51:23] and here nine or maybe the price is
[02:51:25] wrong and by looking to the quantity now
[02:51:28] you can see we don't have any nulls we
[02:51:30] don't have any zeros or negative numbers
[02:51:32] so the quantity looks better than the
[02:51:33] sales and if you look to the prices we
[02:51:36] have nulls we have negatives and yeah we
[02:51:39] don't have zeros so that means the
[02:51:40] quality of the sales and the price is
[02:51:42] wrong the calculation is not working and
[02:51:44] we have these scenarios now of course
[02:51:46] how I do it here I don't go and try now
[02:51:48] to transform everything on my own I
[02:51:50] usually go and talk to an expert maybe
[02:51:53] someone from the business or from the
[02:51:54] source system and I show those scenarios
[02:51:56] and discuss and usually there is like
[02:51:58] two answers either they going to tell me
[02:52:00] you know what I will fix it in my source
[02:52:02] so I have to live with it there is
[02:52:04] incoming bad data and the bad data can
[02:52:06] be presented in the warehouse until the
[02:52:08] source system clean up those issues and
[02:52:10] the other answer you might get you know
[02:52:12] what we don't have the budget and those
[02:52:14] data are really old and we are not going
[02:52:16] to do anything so here you have to
[02:52:18] decide either you leave it as it is or
[02:52:20] you say you know what let's go and
[02:52:21] improve the quality of the data but here
[02:52:23] you have to ask for the experts to
[02:52:25] support you solving these issues because
[02:52:28] it really depend on their rules
[02:52:29] different rules makes different
[02:52:31] Transformations so now let's say that we
[02:52:33] have the following rules if the sales
[02:52:35] informations are null or negative or
[02:52:37] zero then use the calculation the
[02:52:39] formula by multiplying the quantity with
[02:52:41] the price and now if the prices are
[02:52:44] wrong for example we have here null or
[02:52:46] zero then go and calculate it from the
[02:52:48] sales and a quantity and if you have a
[02:52:50] price that is a minus like minus 21 a
[02:52:54] negative number then you have to go and
[02:52:55] convert it to a 21 so from a negative to
[02:52:59] a positive without any calculations so
[02:53:01] those are the rules and now we're going
[02:53:02] to go and build the Transformations
[02:53:04] based on those rules so let's do it step
[02:53:06] by step I will go over here and we're
[02:53:08] going to start building the new sales so
[02:53:11] what is the rule for sales CASE WHEN of
[02:53:13] course as usual if the
[02:53:15] sales is null or let's say the sales is
[02:53:20] negative number or equal to zero or
[02:53:22] another scenario we have a sales
[02:53:23] information but it is not following the
[02:53:26] calculation so we have wrong information
[02:53:28] in the sales so we're going to say the
[02:53:30] sales is not equal to the quantity
[02:53:33] multiplied by the price but of course we
[02:53:36] will not leave the price like this by
[02:53:38] using the function ABS the absolute it's
[02:53:41] going to go and convert everything from
[02:53:43] negative to a positive then what we have
[02:53:44] to do is to go and use the calculation
[02:53:48] so it's going to be the quantity
[02:53:50] multiplied by the price so that means we
[02:53:53] are not using the value that come from
[02:53:54] the source system we are recalculating
[02:53:57] it now let's say the sales is correct
[02:53:59] and not one of those scenarios so we can
[02:54:01] say else we will go with the sales as it
[02:54:04] is that comes from the source because it
[02:54:06] is correct it's really nice let's go and
[02:54:08] say an end and give it the same name I
[02:54:10] will go and rename the old one here as
[02:54:13] an old value and the same for the price
[02:54:16] the quantity we will not touch it because it is
[02:54:19] correct so like this and now let's go
[02:54:21] and transform the prices so again as
[02:54:24] usual we go with CASE WHEN so what are the
[02:54:27] scenarios the price is null or the price
[02:54:32] is less or equal to zero then what we're
[02:54:34] going to do we're going to do the
[02:54:36] calculation so it going to be the sales
[02:54:39] divided by the quantity the SLS quantity
[02:54:42] but here we have to make sure that we
[02:54:44] are not dividing by zero currently we
[02:54:46] don't have any zeros in the quantity but
[02:54:47] you don't know in the future you might get a
[02:54:49] zero and the whole code going to break
[02:54:51] so what you have to do is to go and say
[02:54:53] if you get any zero replace it with a
[02:54:56] null so NULLIF if it is zero then make
[02:54:59] it null so that's it now if the price is
[02:55:02] not null and the price is not negative
[02:55:04] or equal to zero then everything is fine
[02:55:06] and that's why we're going to have now
[02:55:07] the else it's going to be the price as
[02:55:11] it is from The Source system so that's
[02:55:13] it we're going to say end as price so
[02:55:15] I'm totally happy with that let's go and
[02:55:17] execute it and check of course so those
[02:55:19] are the old informations and those are
[02:55:21] the new transformed cleaned up
[02:55:23] informations so here previously we have
[02:55:24] a null but now we have two so two
[02:55:27] multiply with one we are getting two so
[02:55:29] the sales is here correct now moving on
[02:55:31] to the next one we have in the sales 40
[02:55:34] but the price is two so two multiplied
[02:55:37] with one we should get two so the new
[02:55:39] sales is correct it is two and not 40
[02:55:41] now to the next one over here the old
[02:55:43] sales is zero but if you go and multiply
[02:55:45] the four with the quantity you will get
[02:55:47] four so the sales here is not correct
[02:55:49] that's why in the new sales we have it
[02:55:51] correct as a four and let's go and get a
[02:55:53] minus so in this case we have a minus
[02:55:55] which is not correct so we are getting
[02:55:57] the price multiplied with one we should
[02:55:59] get here a nine and this sales here is
[02:56:02] correct now let's go and get a scenario
[02:56:04] where the price is a null like this here
[02:56:07] so we don't have here price but we
[02:56:08] calculated from the sales and the
[02:56:10] quantity so we divided the 10 by two and
[02:56:13] we have five so the new price is better
[02:56:15] and the same thing for the minuses so we
[02:56:18] have here minus 21 and in the output we
[02:56:20] have 21 which is correct so for now I
[02:56:22] don't see any scenario where the data is
[02:56:25] wrong so everything looks better than
[02:56:27] before and with that we have applied the
[02:56:29] business rules from the experts and we
[02:56:32] have cleaned up the data in the data
[02:56:34] warehouse and this is way better than
[02:56:35] before because we are presenting now
[02:56:37] better data for analyzes and Reporting
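
The two CASE expressions built in this part can be sketched as below. Column names and sample rows are assumptions for illustration, and the expected values in the comments only hold for these integer sample values (integer columns also mean integer division in SQL Server).

```sql
-- Sketch only: column names and sample rows are assumed.
SELECT
    s.sls_sales    AS old_sls_sales,
    s.sls_quantity,
    s.sls_price    AS old_sls_price,
    CASE
        WHEN s.sls_sales IS NULL
          OR s.sls_sales <= 0
          OR s.sls_sales <> s.sls_quantity * ABS(s.sls_price)
        THEN s.sls_quantity * ABS(s.sls_price)        -- recalculate sales
        ELSE s.sls_sales                              -- keep the source value
    END AS sls_sales,
    CASE
        WHEN s.sls_price IS NULL OR s.sls_price <= 0
        THEN s.sls_sales / NULLIF(s.sls_quantity, 0)  -- NULLIF avoids division by zero
        ELSE s.sls_price
    END AS sls_price
FROM (VALUES
        (NULL, 1,  2),    -- new sales 2
        (40,   1,  2),    -- new sales 2
        (0,    1,  4),    -- new sales 4
        (-9,   1,  9),    -- new sales 9
        (10,   2,  NULL), -- new price 5
        (21,   1, -21)    -- new price 21
     ) AS s(sls_sales, sls_quantity, sls_price);
```

Note that the price branch divides the original sales value by the quantity, so a row where both sales and price are bad would still need a decision from the business experts.
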
[02:56:40] but it is challenging and you have
[02:56:42] exactly to understand the business so
[02:56:43] now what we're going to do we're going
[02:56:44] to go and copy those informations and
[02:56:47] integrate it in our query so instead of
[02:56:49] sales we're going to get our new
[02:56:51] calculation and instead of the price we
[02:56:54] will get our correct calculation and
[02:56:56] here I'm missing the end let's go and
[02:56:59] run the whole thing again so with that
[02:57:01] we have as well now cleaned sales
[02:57:04] quantity and price and it is following
[02:57:06] our business rules so with that we are
[02:57:08] done cleaning up the sales details The
[02:57:11] Next Step we're going to go and insert it
[02:57:12] into the sales details but we have to go
[02:57:14] and check again the ddl so now all what
[02:57:17] you have to do is to compare those
[02:57:18] results with the ddl so the first one is
[02:57:20] the order number it's fine the product
[02:57:22] key the customer ID but here we have an
[02:57:24] issue all those informations now are
[02:57:27] date and not an integer so we have to go
[02:57:29] and change the data type and with that
[02:57:31] we have better data type than before
[02:57:33] then the sales quantity price it is
[02:57:35] correct let's go and drop the table and
[02:57:38] create it from scratch again and don't
[02:57:40] forget to update your ddl script so
[02:57:42] that's it for this and we're going to go
[02:57:44] now and insert the results into our
[02:57:46] silver table sales details and we have to
[02:57:49] go and list now all the columns I have
[02:57:51] already prepared the list of all the
[02:57:53] columns so make sure that you have the
[02:57:55] correct order of the columns so let's go
[02:57:57] now and insert the data and with that
[02:58:00] we can see that the SQL
[02:58:02] did insert data to our sales details but
[02:58:04] now very important is to check the
[02:58:06] health of the silver table so what we
[02:58:08] going to do instead here of bronze we're
[02:58:10] going to go and switch it to Silver so
[02:58:12] let's check over here so here always the
[02:58:14] order is smaller than the shipping and
[02:58:17] the due date which is really nice but
[02:58:19] now I'm very interested on the
[02:58:21] calculations so here we're going to
[02:58:23] switch it from bronze to Silver and I'm
[02:58:25] going to go and get rid of all those
[02:58:26] calculations because we don't need
[02:58:29] this and now let's see whether we have
[02:58:31] any issue well perfect our data is
[02:58:34] following the business rules we don't
[02:58:35] have any nulls negative values zeros now
[02:58:38] as usual the last step the final check
[02:58:41] we will just have a final look to the
[02:58:42] table so we have the order number the
[02:58:44] product key the customer ID the three
[02:58:47] dates we have the sales quantity
[02:58:49] and the price and of course we have our
[02:58:52] metadata column everything is perfect so
[02:58:54] now by looking to our code what are the
[02:58:56] different types of data Transformations
[02:58:58] that we are doing so in those three
[02:59:00] columns we are doing the following so at
[02:59:02] the start we are handling invalid data
[02:59:05] and this is as well type of
[02:59:06] transformation and as well at the same
[02:59:08] time we are doing data type casting so
[02:59:11] we are changing it to more correct data
[02:59:13] type and if you are looking to the sales
[02:59:15] over here then what we are doing over
[02:59:17] here is we are handling the missing data
[02:59:19] and as well the invalid data by deriving
[02:59:23] the column from already existing one and
[02:59:26] it is as well very similar for the price
[02:59:28] we are handling as well the invalid data
[02:59:31] by deriving it from specific calculation
[02:59:33] over here so those are the different
[02:59:35] types of data Transformations that you
[02:59:37] have done in these
[02:59:41] scripts all right now let's keep moving
[02:59:43] to the next our system we have the
[02:59:46] customer AZ 12 so here we have
[02:59:48] like only three columns and let's start
[02:59:50] with the ID first so here again we have
[02:59:52] the customers informations and if we go
[02:59:54] and check again our model you can see
[02:59:56] that we can connect this table with the
[02:59:59] CRM table customer info using the
[03:00:02] customer key so that means we have to go
[03:00:03] and make sure that we can go and connect
[03:00:06] those two tables so let's go and check
[03:00:09] the other table we can go and check of
[03:00:10] course the silver layer so let's query
[03:00:13] it and we can query both of the tables
[03:00:16] now we can see there is here like extra
[03:00:18] characters that are not included in the
[03:00:20] customer key from the CRM so let's go
[03:00:23] and search for example for this customer
[03:00:25] over here where C ID like so we are
[03:00:31] searching for customer has similar ID
[03:00:34] now as you can see we are finding this
[03:00:35] customer but the issue is that we have
[03:00:37] those three characters NAS there is no
[03:00:40] specifications or explanation why we
[03:00:42] have the NAS so actually what we have to
[03:00:44] do is to go and remove those
[03:00:45] informations we don't need it so let's
[03:00:48] again check the data so it looks like
[03:00:50] the old data have NAS at the start
[03:00:53] and then afterward we have new data
[03:00:55] without those three characters so we
[03:00:56] have to clean up those IDs in order to
[03:00:58] be able to connect it with other tables
[03:01:01] so we're going to do it like this we're
[03:01:02] going to start with the CASE WHEN since
[03:01:04] we have like two scenarios in our data
[03:01:06] so if the C ID is like the three
[03:01:10] characters NAS so if the ID starts with
[03:01:13] those three characters then we're going
[03:01:15] to go and apply transformation function
[03:01:17] otherwise it's going to stay like
[03:01:19] it is so that's it so now we have to go
[03:01:23] and build the transformation so we're
[03:01:25] going to use substring and then we have
[03:01:28] to define the string it's going to be
[03:01:30] the C ID and then we have to define the
[03:01:32] position where it start cutting or
[03:01:34] extracting so we can say 1 2 3 and then
[03:01:38] four so we have to define the position
[03:01:40] number four and then we have to define
[03:01:42] the length how many characters should be
[03:01:44] extracted I will make it Dynamic so I
[03:01:47] will go with the LEN
[03:01:48] I will not go and count how much so
[03:01:50] we're going to say the C ID so it looks
[03:01:52] good if it's like NAS then go and
[03:01:55] extract from the CID at the position
[03:01:57] number four the rest of the characters
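
A minimal sketch of this prefix clean-up, assuming the column is called `cid` and using made-up sample IDs:

```sql
-- Sketch only: the column name cid and the sample IDs are assumed.
SELECT
    c.cid AS old_cid,
    CASE
        WHEN c.cid LIKE 'NAS%'
        THEN SUBSTRING(c.cid, 4, LEN(c.cid))  -- start at position 4, take the rest
        ELSE c.cid                            -- IDs without the prefix stay as they are
    END AS cid
FROM (VALUES
        ('NASAW00011000'),  -- old format with the NAS prefix
        ('AW00011001')      -- new format without it
     ) AS c(cid);
```

Passing `LEN(c.cid)` as the length asks for more characters than remain after position 4; `SUBSTRING` simply stops at the end of the string, which is what makes the expression dynamic.
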
[03:02:00] so let's go and execute it and I'm
[03:02:02] missing here a comma again where we
[03:02:04] don't have any NAS at the start and if
[03:02:07] you scroll down you can see those as
[03:02:10] well are not affected so with that we
[03:02:13] have now a nice ID to be joined with
[03:02:15] other table of course we can go and test
[03:02:17] it like this where and then we take the
[03:02:19] whole thing the whole transformation and
[03:02:22] say NOT IN we remove of course the alias
[03:02:24] name we don't need it and then we make
[03:02:27] very simple substring select distinct
[03:02:30] CST key the customer key from the silver
[03:02:34] table can be silver CRM cust info so
[03:02:39] that's it let's go and check so as you
[03:02:41] can see it is working fine so we are not
[03:02:43] able to find any unmatching data between
[03:02:46] the customer info from ERP and the CRM
[03:02:48] but of course after the transformation
[03:02:51] if you don't use the transformation so
[03:02:52] if I just remove it like this we will
[03:02:54] find a lot of unmatching data so this
[03:02:57] means our transformation is working
[03:02:59] perfectly and we can go and remove the
[03:03:01] original value so that's it for the
[03:03:03] First Column okay now moving on to the
[03:03:05] next field we have the birthday of their
[03:03:07] customers so the first thing to do is to
[03:03:09] check the data type it is a date so it's
[03:03:11] fine it is not an integer or a string so
[03:03:14] we don't have to convert anything but
[03:03:16] still there is something to check with
[03:03:17] the birth dates so we can check whether
[03:03:19] we have something out of range so for
[03:03:21] example we can go and check whether we
[03:03:23] have really old dates at the birth dates
[03:03:25] so let's take 1900 and let's say 24 and
[03:03:30] we can take the first date of the month
[03:03:32] so let's go and check that well it looks
[03:03:34] like that we have customers that are
[03:03:36] older than 100 years well I don't know
[03:03:38] maybe this is correct but it sounds of
[03:03:41] course strange to bit of the business of
[03:03:43] course this is Creed and he is in charge
[03:03:47] of
[03:03:48] something that is correct say hi to the
[03:03:50] kids hi kids yay and then we can go and
[03:03:53] check the other boundary where it is
[03:03:56] almost impossible to have a customer
[03:03:58] that the birthday is in the future so we
[03:04:01] can say birth date is higher than the
[03:04:04] current dates like this so let's go and
[03:04:07] query this information well it will not
[03:04:09] work because we have to have like an or
[03:04:11] between them and now if we check the
[03:04:12] list over here we have dates that are
[03:04:15] invalid for the birth dates so all those
[03:04:18] dates they are all birthday in the
[03:04:20] future and this is totally unacceptable
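
The range check on the birth dates, together with one possible clean-up that turns future dates into NULL, might look like this. The table variable `@cust`, the column name `bdate` and the sample dates are assumptions for illustration.

```sql
-- Sketch only: @cust, bdate and the sample dates are assumed.
DECLARE @cust TABLE (bdate DATE);
INSERT INTO @cust (bdate)
VALUES ('1916-02-10'),   -- very old, suspicious but possible
       ('1985-07-23'),   -- plausible
       ('2999-01-01');   -- in the future, clearly invalid

-- Check: birth dates outside the expected range
SELECT bdate
FROM @cust
WHERE bdate < '1924-01-01'
   OR bdate > GETDATE();

-- Clean-up: only the future dates are replaced, the very old ones are kept
SELECT
    bdate AS old_bdate,
    CASE WHEN bdate > GETDATE() THEN NULL ELSE bdate END AS bdate
FROM @cust;
```
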
[03:04:23] so this is an indicator for bad data
[03:04:25] quality of course you can go and report
[03:04:26] it to the source system in order to
[03:04:28] correct it so here it's up to you what
[03:04:29] to do with those dates either leave it
[03:04:31] as it is as a bad data or we can go and
[03:04:34] clean that up by replacing all those
[03:04:36] dates with a null or maybe replacing
[03:04:38] only the one that is Extreme where it is
[03:04:41] 100% is incorrect so let's go and write
[03:04:44] the transformation for that as usual
[03:04:46] we're going to start with CASE WHEN birth
[03:04:48] date is larger than the current date
[03:04:51] and time then null otherwise we can have
[03:04:55] an else where we have the birth date as
[03:04:58] it is and then we have an end as birth
[03:05:01] date so let's go and execute it and with
[03:05:04] that we should not get any customer with
[03:05:07] the birthday in the future so that's it
[03:05:09] for the birth dates now let's move to
[03:05:12] the next one we have the gender now
[03:05:13] again the gender information is
[03:05:15] low cardinality so we have to go and check
[03:05:17] all the possible values inside this
[03:05:19] column so in order to check all the
[03:05:21] possible values we're going to use
[03:05:23] select distinct gen from our table so
[03:05:26] let's go and execute it and now the data
[03:05:28] doesn't look really good so we have here
[03:05:30] a null we have an F we have here an
[03:05:33] empty string we have male female and
[03:05:36] again we have the m so this is not
[03:05:38] really good what we going to do we're
[03:05:39] going to go and clean up all those
[03:05:40] informations in order to have only three
[03:05:43] values male female and not available so
[03:05:46] we're going to do it like this we're
[03:05:47] going to say case when and now we're
[03:05:48] going to go and trim the values just to
[03:05:51] make sure there is like no empty spaces
[03:05:53] and as well I'm going to go and use the
[03:05:55] upper function just to make sure that in
[03:05:57] the future if we get any lower cases and
[03:05:59] so on we are covering all the different
[03:06:01] scenarios so CASE WHEN this is in F or let's
[03:06:06] say
[03:06:07] female then make it as female and we can
[03:06:11] go and do the same thing for the male
[03:06:14] like this so if it is an M or a male
[03:06:17] make sure it is capital letters because
[03:06:19] here we are using the upper then it is a
[03:06:21] male otherwise all other scenarios it
[03:06:24] should be not available so whether it is
[03:06:26] an empty string or nulls and so on so we
[03:06:29] have to have an end of course as gen so
[03:06:32] now let's go and test it and check
[03:06:34] whether we have covered everything so
[03:06:35] you can see the m is now male the empty
[03:06:38] is not available the f is female the
[03:06:40] empty string or maybe spaces here is not
[03:06:43] available female going to stay as it is
[03:06:45] and the same for the male so with that
[03:06:47] we are covering all the scenarios and we
[03:06:49] are following our standards in the
[03:06:51] project so I'm going to go and cut this
[03:06:53] and put it in our original query over
[03:06:56] here so let's go and execute the whole
[03:06:58] thing and with that we have cleaned up
[03:07:01] all those three columns now the question
[03:07:03] is did we change anything in the ddl
[03:07:05] well we didn't change anything we didn't
[03:07:07] introduce any new column or change any
[03:07:09] data type so that means the next step is
[03:07:11] we're going to go and insert it in the
[03:07:13] silver layer so as usual we're going to
[03:07:15] say here insert into silver Erp the
[03:07:20] customer and then we're going to go and
[03:07:21] list all the column names so C ID birth
[03:07:24] date and the gender all right so let's go
[03:07:28] and execute it and with that we can see
[03:07:30] it inserted all the data and of course
[03:07:32] the very important step as the next is
[03:07:34] to check that data quality so let's go
[03:07:36] back to our query over here and change
[03:07:38] it from bronze to Silver so let's go and
[03:07:40] check the silver layer well of course we
[03:07:42] are getting those very old customers but
[03:07:46] we didn't change that we only change the
[03:07:48] birthday that is in the future and we
[03:07:50] don't see it here in the results so that
[03:07:52] means everything is clean so for the
[03:07:54] next one let's go and check the
[03:07:55] different genders and as you can see we
[03:07:58] have only those three values and of
[03:08:00] course we can go and take a final look
[03:08:01] to our table so you can see the C ID
[03:08:04] here the birth date the gender and then
[03:08:06] we see our metadata column and
[03:08:08] everything looks amazing so that's it
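
The gender normalization applied to this table can be sketched as follows. The column name `gen`, the sample values and the exact spelling `'n/a'` for "not available" are assumptions; `TRIM` needs SQL Server 2017 or later, and `LTRIM(RTRIM(...))` is the usual substitute on older versions.

```sql
-- Sketch only: the column name gen, the sample values and 'n/a' are assumed.
SELECT DISTINCT
    g.gen AS old_gen,
    CASE
        WHEN UPPER(TRIM(g.gen)) IN ('F', 'FEMALE') THEN 'Female'
        WHEN UPPER(TRIM(g.gen)) IN ('M', 'MALE')   THEN 'Male'
        ELSE 'n/a'   -- NULLs, empty strings, blanks and anything unexpected
    END AS gen
FROM (VALUES
        (NULL), ('F'), (''), ('Male'), ('Female'), ('M'), ('  ')
     ) AS g(gen);
```

Whatever the input mix looks like, the cleaned column can only contain the three target values, which is what the `SELECT DISTINCT` check on the silver table confirms.
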
[03:08:11] what are the different types of data
[03:08:12] Transformations that we have done first
[03:08:14] with the ID what you have done we have
[03:08:16] handled invalid values so we have
[03:08:19] removed this part where it is not needed
[03:08:21] and the same thing goes for the birth
[03:08:23] dates we have handled as well invalid
[03:08:25] values and then for the last one for the
[03:08:27] gender we have done data normalizations
[03:08:30] by mapping the code to more friendly
[03:08:32] value and as well we have handled the
[03:08:34] missing values so those are the types
[03:08:37] that we have done in this
[03:08:41] code okay moving on to the second table
[03:08:44] we have the location informations so we
[03:08:46] have Erp location
[03:08:48] a101 so now here the task is easy
[03:08:50] because we have only two columns and if
[03:08:52] you go and check the integration model
[03:08:54] we can find our table over here so we
[03:08:56] can go and connect it together with the
[03:08:58] customer info from the other system
[03:09:00] using the C ID with the customer key so
[03:09:03] those two informations must be matching
[03:09:05] in order to join the tables so that
[03:09:07] means we have to go and check the data
[03:09:09] so let's go and select the data CST key
[03:09:13] from let's go and get the silver Data
[03:09:15] customer info so let's now if you go and
[03:09:18] check the result you can see over here
[03:09:20] that we have an issue with the C ID
[03:09:23] there is like a minus between the
[03:09:25] characters and the numbers but the
[03:09:26] customer ID the customer number we don't
[03:09:29] have anything that splits the characters
[03:09:31] with the numbers so if you go and join
[03:09:33] those two informations it will not be
[03:09:34] working so what we have to do we have to
[03:09:36] go and get rid of this minus because it
[03:09:38] is totally unnecessary so let's go and
[03:09:41] fix that it's going to be very simple so
[03:09:42] what we're going to do we're going to
[03:09:43] say C ID so we're going to go and search
[03:09:46] for the minus
[03:09:48] and replace it with nothing it's very
[03:09:50] simple like this so let's go and query it
[03:09:52] again and with that things looks very
[03:09:54] similar to each others and as well we
[03:09:56] can go and query it so we're going to
[03:09:58] say where our
[03:09:59] transformation is not in then we can go
[03:10:02] and use this as a subquery like this so
[03:10:05] let's go and execute it and as you can
[03:10:08] see we are not finding any unmatching
[03:10:10] data now so that means our
[03:10:11] transformation is working and with that
[03:10:13] we can go and connect those two tables
[03:10:15] together so if I take the
[03:10:17] transformation away you can see that we
[03:10:19] will find a lot of unmatching data so
[03:10:22] the transformation is okay we're going
[03:10:23] to stay with it and now let's speak
[03:10:25] about the countries now we have here
[03:10:27] multiple values and so on what I'm going
[03:10:29] to do this is low cardinality and we
[03:10:31] have to go and check all possible values
[03:10:34] inside this column so that means we are
[03:10:36] checking whether the data is consistent
[03:10:38] so we can do it like this distinct the
[03:10:42] country from our table I'm just going to
[03:10:45] go and copy it like this and as well I'm
[03:10:46] going to go sort the data by the country so
[03:10:50] let's go and check the informations now
[03:10:52] you can see we have a null we have an
[03:10:54] empty string which is really bad and
[03:10:56] then we have a full name of country and
[03:10:59] then we have as well an abbreviation of
[03:11:01] the countries well this is a mix this is
[03:11:04] not really good because sometimes we
[03:11:05] have the DE and sometimes we have Germany
[03:11:08] and then we have the United Kingdom and
[03:11:10] then for the United States we have like
[03:11:11] three versions of the same information
[03:11:14] which is as well not really good so the
[03:11:16] quality of the data is not really good so
[03:11:19] let's go and work on the transformation
[03:11:20] as usual we're going to start with the
[03:11:22] CASE WHEN if TRIM
[03:11:24] country is equal to DE then we're going
[03:11:29] to transform it to Germany and the next
[03:11:32] one it's going to be about the USA so if
[03:11:34] trim country is in so now let's go and
[03:11:38] get those two values the US and the USA
[03:11:41] so us and USA then it's going to be the
[03:11:46] United States so with that
[03:11:49] we have covered as well those three
[03:11:51] cases now we have to talk about the null
[03:11:53] and the empty string so we're going to
[03:11:55] say when trim country is equal to empty
[03:11:59] string or country is null then it's
[03:12:04] going to be not available otherwise I
[03:12:07] would like to get the country as it is
[03:12:09] so trim country just to make sure that
[03:12:11] we don't have any leading or trailing
[03:12:13] spaces so that's it let's go and say
[03:12:16] this is the country so it is working and
[03:12:19] the country information is transformed
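
Both location clean-ups can be sketched together as below. The column names `cid`, `cntry` and `cst_key`, the sample rows and the `'n/a'` spelling are assumptions for illustration.

```sql
-- Sketch only: column names, sample rows and 'n/a' are assumed.
SELECT
    l.cid                   AS old_cid,
    REPLACE(l.cid, '-', '') AS cid,        -- drop the separator so the keys can join
    l.cntry                 AS old_cntry,
    CASE
        WHEN TRIM(l.cntry) = 'DE'            THEN 'Germany'
        WHEN TRIM(l.cntry) IN ('US', 'USA')  THEN 'United States'
        WHEN TRIM(l.cntry) = '' OR l.cntry IS NULL THEN 'n/a'
        ELSE TRIM(l.cntry)                   -- full names stay, minus stray spaces
    END AS cntry
FROM (VALUES
        ('AW-00011000', 'DE'),
        ('AW-00011001', 'USA'),
        ('AW-00011002', 'United Kingdom'),
        ('AW-00011003', ''),
        ('AW-00011004', NULL)
     ) AS l(cid, cntry);

-- Join check: transformed keys that have no match on the CRM side.
-- With this sample data the query returns no rows.
SELECT REPLACE(l.cid, '-', '') AS cid
FROM (VALUES ('AW-00011000'), ('AW-00011001')) AS l(cid)
WHERE REPLACE(l.cid, '-', '') NOT IN (
        SELECT k.cst_key
        FROM (VALUES ('AW00011000'), ('AW00011001')) AS k(cst_key)
      );
```
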
[03:12:22] and now what I'm going to do I'm going
[03:12:22] to take the whole new transformation and
[03:12:25] compare it to the old one let me just
[03:12:27] call this as old country and let's go
[03:12:31] and query it so now we can check those
[03:12:33] values stay as before so nothing did
[03:12:35] change the DE is now Germany the empty
[03:12:38] string is not available the null the
[03:12:40] same thing and the United Kingdom stays
[03:12:43] as it was before and now we have
[03:12:46] one value for all those information so
[03:12:48] it's only the United States so it looks
[03:12:51] perfect and with that we have cleaned as
[03:12:53] well the second column so with that we
[03:12:55] have now clean results and now the
[03:12:56] question did we change anything in the
[03:12:58] ddl well we haven't changed anything
[03:13:00] both of them are varchar so we can go now
[03:13:03] immediately and insert it into our table
[03:13:06] so insert into silver customer location
[03:13:09] and here we have to specify the columns
[03:13:12] it's very simple the ID and the country
[03:13:14] so let's go and execute it and as you
[03:13:17] can see we got now inserted all those
[03:13:19] values of course as a next we go and
[03:13:21] double check those informations I would
[03:13:23] just go and remove all those stuff as
[03:13:25] well here and instead of bronze let's go
[03:13:28] with the silver so as you can see all
[03:13:31] the values of the country looks good and
[03:13:33] let's have a final look to the table so
[03:13:35] like this so we have the IDS without the
[03:13:38] separator we have the countries and as
[03:13:41] well our metadata information so with
[03:13:43] that we have cleaned up the data for the
[03:13:44] location okay so now what are the
[03:13:46] different types of data transformation
[03:13:48] that we have done here is first we have
[03:13:50] handled invalid values so we have
[03:13:52] replaced the minus with an empty string
[03:13:54] and for the country we have done data
[03:13:57] normalization so we have replaced codes
[03:13:59] with friendly values and as well at the
[03:14:02] same time we have handled missing values
[03:14:04] by replacing the empty string and null
[03:14:07] with not available and one more thing of
[03:14:09] course we have removed the unwanted
[03:14:11] spaces so those are the different types
[03:14:13] of transformation that we have done for
[03:14:15] this table
[03:14:20] okay guys now keep the energy up keep
[03:14:22] the spirit up we have to go and clean up
[03:14:24] the last table in the bronze layer and
[03:14:26] of course we cannot go and Skip anything
[03:14:28] we have to check the quality and to
[03:14:30] detect all the errors so now we have a
[03:14:32] table about the categories for the
[03:14:34] products and here we have like four
[03:14:36] columns let's go and start with the
[03:14:37] first one the ID as you can see in our
[03:14:40] integration model we can connect this
[03:14:42] table together with the product info
[03:14:44] from the CRM using the product key and
[03:14:46] as as you remember in the silver layer
[03:14:48] we have created an extra column for that
[03:14:50] in the product info so if you go and
[03:14:52] select those data you can see we have a
[03:14:55] column called category ID and this one
[03:14:57] is exactly matching the ID that we have
[03:15:00] in this table and we have done the
[03:15:02] testing so this ID is ready to be used
[03:15:05] together with the other table so there
[03:15:07] is nothing to do over here and now for
[03:15:09] the next columns they are string and of
[03:15:11] course we can go and check whether there
[03:15:13] are any unwanted spaces so we are
[03:15:15] checking for the unwanted spaces so
[03:15:17] let's go and check select star from and
[03:15:20] we're going to go and get the same table
[03:15:22] like this here and first we are checking
[03:15:24] the category so the category is not
[03:15:27] equal to the category after trimming The
[03:15:30] Unwanted spaces so let's go and execute
[03:15:33] it and as you can see we don't have any
[03:15:35] results so there are no unwanted spaces
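
A sketch of this unwanted-spaces check, with assumed column names (`id`, `cat`) and made-up sample rows. One caveat worth knowing: in SQL Server, string comparisons with `<>` ignore trailing spaces, so the comparison alone only catches leading spaces. The `DATALENGTH` variant below is one way to catch trailing ones as well; it is an addition for illustration, not something shown in the video.

```sql
-- Sketch only: column names and sample rows are assumed.
SELECT p.id, p.cat
FROM (VALUES
        ('AC_BR', 'Accessories'),  -- clean
        ('BI_MB', ' Bikes'),       -- leading space
        ('CL_JE', 'Clothing ')     -- trailing space
     ) AS p(id, cat)
WHERE p.cat <> TRIM(p.cat)                          -- catches the leading space
   OR DATALENGTH(p.cat) <> DATALENGTH(TRIM(p.cat)); -- also catches the trailing space
```
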
[03:15:38] let's go and check the other column for
[03:15:39] example the subcategory the next one so
[03:15:43] let's get the subcategory and run the
[03:15:45] query as well we don't have anything so
[03:15:47] that means we don't have unwanted spaces
[03:15:50] for the subcategory let's go now and
[03:15:52] check the last column so I will just
[03:15:54] copy and paste now let's get the
[03:15:56] maintenance and let's go and execute and
[03:15:59] as well no results perfect we don't have
[03:16:01] any unwanted spaces inside this table so
[03:16:04] now the next step is that we're going to
[03:16:05] go and check the data standardizations
[03:16:08] because all those columns has low
[03:16:10] cardinality so what we're going to do
[03:16:11] we're going to say
[03:16:13] SELECT DISTINCT let's get the cat
[03:16:16] category from our table I'll just copy
[03:16:19] and paste it and check all values so as
[03:16:21] you can see we have the accessories
[03:16:23] bikes clothing and components everything
[03:16:26] looks perfect we don't have to change
[03:16:27] anything in this column let's go and
[03:16:29] check the subcategory and if you scroll
[03:16:32] down all values are friendly and nice as
[03:16:35] well nothing to change here and let's go
[03:16:38] and check the last column the
[03:16:39] maintenance perfect we have only two
[03:16:41] values yes and no we don't have any
[03:16:43] nulls so my friends that means this
[03:16:46] table has really nice data quality and
[03:16:48] we don't have to clean up anything but
[03:16:50] still we have to follow our process we
[03:16:52] have to go and load it from the bronze
[03:16:54] to the silver even if we didn't
[03:16:56] transform anything so our job is really
[03:16:58] easy here we're going to go and say
[03:17:00] insert into silver dot ERP PX and so on
[03:17:05] and we're going to go and define the
[03:17:07] columns so it's going to be the ID the
[03:17:10] category subcategory maintenance so
[03:17:13] that's it let's go and insert the data
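The standardization check and the plain bronze-to-silver copy described above can be sketched as follows. The names `bronze.erp_px_cat`, `silver.erp_px_cat`, `id`, `cat`, `subcat`, and `maintenance` are assumed spellings, because the narration only abbreviates them.

```sql
-- Assumed table and column names throughout.

-- Consistency check for a low-cardinality column: list every distinct
-- value and confirm by eye that the spellings are uniform.
SELECT DISTINCT cat
FROM bronze.erp_px_cat;

-- No cleanup was needed, so the load is a one-to-one copy of the columns.
INSERT INTO silver.erp_px_cat (id, cat, subcat, maintenance)
SELECT
    id,
    cat,
    subcat,
    maintenance
FROM bronze.erp_px_cat;
```

Listing the columns explicitly on both sides keeps the load from breaking silently if the column order of either table changes.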
[03:17:15] now as usual what we're going to do
[03:17:16] we're going to go and check the data so
[03:17:20] silver ERP PX let's have a look all
[03:17:23] right so we can see the IDs are here the
[03:17:25] categories the subcategories the
[03:17:27] maintenance and we have our metadata column
[03:17:31] so everything is inserted correctly all
[03:17:33] right so now I have all those queries
[03:17:35] and the insert statements for all six
[03:17:38] tables and now what is important before
[03:17:40] inserting any data we have to make sure
[03:17:42] that we are truncating and emptying the
[03:17:45] table because if you run this query twice
[03:17:47] what's going to happen you will be
[03:17:48] inserting duplicates so first truncate
[03:17:51] the data and then do a full load insert
[03:17:53] all data so we're going to have one step
[03:17:56] before it's like the bronze layer we're
[03:17:57] going to say truncate table and then we
[03:17:59] will be truncating the silver customer info
[03:18:02] and only after that we have to go and
[03:18:04] insert the data and of course we can go
[03:18:06] and give this nice information at the
[03:18:08] start so first we are truncating the
[03:18:10] table and then inserting so if I go and
[03:18:13] run the whole thing so let's go and do
[03:18:15] it it will be working so if I can run it
[03:18:17] again we will not have any duplicates so
[03:18:19] we have to go and add this step before
[03:18:21] each insert so let's go and do that all
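A sketch of the truncate-then-insert pattern for a single table, under the assumption that the silver and bronze tables are named `silver.erp_px_cat` and `bronze.erp_px_cat` with columns `id`, `cat`, `subcat`, and `maintenance`. The message texts are illustrative.

```sql
-- Full load: empty the target first so that a rerun cannot insert duplicates.
-- Table and column names are assumed for illustration.
PRINT '>> Truncating Table: silver.erp_px_cat';
TRUNCATE TABLE silver.erp_px_cat;

PRINT '>> Inserting Data Into: silver.erp_px_cat';
INSERT INTO silver.erp_px_cat (id, cat, subcat, maintenance)
SELECT
    id,
    cat,
    subcat,
    maintenance
FROM bronze.erp_px_cat;
```

Running the script twice leaves the same row count as running it once, which is the property the walkthrough is after.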
[03:18:24] right so I'm done with all tables so now
[03:18:27] let's go and run everything so let's go
[03:18:30] and execute it and we can see in the
[03:18:32] messages everything is working perfectly
[03:18:34] so with that we made all tables empty
[03:18:36] and then we inserted the
[03:18:41] data so perfect with that we have a nice
[03:18:43] script that loads the silver layer but
[03:18:46] of course like the bronze layer we're
[03:18:48] going to put everything in one stored
[03:18:50] procedure so let's go and do that we'll
[03:18:52] go to the beginning over here and say
[03:18:54] create or alter procedure and we're
[03:18:58] going to put it in the schema silver and
[03:19:00] using the naming convention load silver
[03:19:02] and we're going to go over here and say
[03:19:03] begin and take the whole code and it is a
[03:19:07] long one and give it one push with a tab
[03:19:09] and then at the end we're going to say
[03:19:11] end perfect so we have our stored procedure
[03:19:14] but we forgot here the AS with that we
[03:19:16] will not have any error let's go and
[03:19:18] execute it so the stored procedure is
[03:19:20] created if you go to the programmability
[03:19:23] and you will find two procedures load
[03:19:25] bronze and load silver so now let's go
[03:19:27] and try it out all you have to do
[03:19:29] is now only to execute the silver load
[03:19:32] silver so let's execute the stored
[03:19:35] procedure and with that we will get the
[03:19:37] same results this stored procedure now is
[03:19:40] responsible for loading the whole silver
[03:19:42] layer now of course the messaging here
[03:19:45] is not really good because we have
[03:19:47] learned in the bronze layer we can go
[03:19:48] and add many stuff like handling the
[03:19:51] error doing nice messaging catching the
[03:19:53] duration time so now your task is to
[03:19:56] pause the video take this stored procedure
[03:19:59] and go and transform it to be very
[03:20:01] similar to the bronze layer with the
[03:20:03] same messaging and all the add-ons that
[03:20:05] we have added so pause the video now I
[03:20:07] will do it as well offline and I will
[03:20:09] see you
[03:20:14] soon okay so I hope you are done and I
[03:20:17] can show you the results it's like the
[03:20:19] bronze layer we have defined at the start
[03:20:21] a few variables in order to catch the
[03:20:23] duration so we have the start time the
[03:20:25] end time batch start time and batch end
[03:20:28] time and then we are printing a lot of
[03:20:30] stuff in order to have like nice
[03:20:31] messaging in the output so at the start
[03:20:33] we are saying loading the silver layer
[03:20:36] and then we start splitting by the
[03:20:37] source system so loading the CRM tables
[03:20:40] and I'm going to show you only one table
[03:20:41] for now so we are setting the timer so
[03:20:44] we are saying start time get the
[03:20:46] date and time information to it then we
[03:20:48] are doing the usual we are truncating
[03:20:50] the table and then we are inserting the
[03:20:52] new information after cleaning it up
[03:20:55] and we have this nice message where we
[03:20:56] say load duration where we are finding
[03:20:59] the differences between the start time
[03:21:00] and the end time using the function dat
[03:21:03] diff and we want to show the result in
[03:21:05] the seconds so we are just printing how
[03:21:08] long it took to load this table and
[03:21:10] we're going to go and repeat this
[03:21:12] process for all the tables and of course
[03:21:14] we are putting everything in TRY and CATCH
[03:21:16] so SQL is going to go and try to
[03:21:18] execute the TRY part and if there are
[03:21:21] any issues SQL is going to go and
[03:21:23] execute the CATCH and here we are just
[03:21:25] printing a few pieces of information like the error
[03:21:27] message the error number and the error
[03:21:29] state and we are following exactly the
[03:21:31] same standard as the bronze layer so
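The structure described above (timing variables, printed messages, `DATEDIFF` in seconds, and a TRY...CATCH wrapper) can be sketched for a single table as follows. The procedure name `silver.load_silver` follows the narration. The table and column names and the exact message texts are assumptions, and the real procedure repeats the middle section for all six tables.

```sql
-- Sketch only: one table shown, names of tables and columns assumed.
CREATE OR ALTER PROCEDURE silver.load_silver AS
BEGIN
    DECLARE @start_time DATETIME, @end_time DATETIME,
            @batch_start_time DATETIME, @batch_end_time DATETIME;

    BEGIN TRY
        SET @batch_start_time = GETDATE();
        PRINT 'Loading Silver Layer';
        PRINT 'Loading ERP Tables';

        -- Time one table load: truncate, insert, then report the duration.
        SET @start_time = GETDATE();
        PRINT '>> Truncating Table: silver.erp_px_cat';
        TRUNCATE TABLE silver.erp_px_cat;

        PRINT '>> Inserting Data Into: silver.erp_px_cat';
        INSERT INTO silver.erp_px_cat (id, cat, subcat, maintenance)
        SELECT id, cat, subcat, maintenance
        FROM bronze.erp_px_cat;

        SET @end_time = GETDATE();
        PRINT '>> Load Duration: '
            + CAST(DATEDIFF(second, @start_time, @end_time) AS NVARCHAR(20))
            + ' seconds';

        -- Duration of the whole batch.
        SET @batch_end_time = GETDATE();
        PRINT 'Total Load Duration: '
            + CAST(DATEDIFF(second, @batch_start_time, @batch_end_time) AS NVARCHAR(20))
            + ' seconds';
    END TRY
    BEGIN CATCH
        -- Report what went wrong instead of failing silently.
        PRINT 'Error Message: ' + ERROR_MESSAGE();
        PRINT 'Error Number: ' + CAST(ERROR_NUMBER() AS NVARCHAR(20));
        PRINT 'Error State: ' + CAST(ERROR_STATE() AS NVARCHAR(20));
    END CATCH
END
GO

EXEC silver.load_silver;
```

`GO` is the batch separator used by SSMS and sqlcmd. It is needed here because `CREATE OR ALTER PROCEDURE` must be the first statement in its batch.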
[03:21:34] let's go and execute the whole thing and
[03:21:37] with that we have updated the definition
[03:21:38] of the stored procedure let's go now and
[03:21:40] execute it so execute silver dot load
[03:21:44] silver so let's go and do that it went
[03:21:47] very fast like less than 1 second again
[03:21:49] because we are working on local machine
[03:21:51] loading the silver layer loading the CRM
[03:21:54] tables and we can see this nice
[03:21:56] messaging so it starts with truncating the
[03:21:58] table inserting the data and we are
[03:22:00] getting the load duration for this table
[03:22:02] and you will see that everything is
[03:22:04] below 1 second but in a
[03:22:06] real project you will get of course more
[03:22:08] than 1 second so at the end we have the load
[03:22:11] duration of the whole silver layer and
[03:22:13] now I have one more thing for you let's
[03:22:15] say that you are changing the design of
[03:22:18] this stored procedure for the silver layer
[03:22:19] you are adding different types of
[03:22:21] messaging or maybe you are creating logs and
[03:22:24] so on so now all those new ideas and
[03:22:26] redesigns that you are doing for the
[03:22:28] silver layer you have always to think
[03:22:30] about bringing the same changes as well
[03:22:32] in the other stored procedure for the
[03:22:34] bronze layer so always try to keep your
[03:22:36] code following the same standards don't
[03:22:39] have like one idea in one stored procedure
[03:22:41] and an old idea in another one always
[03:22:44] try to maintain those scripts and to
[03:22:46] keep them all up to date following the
[03:22:48] same standards otherwise it is going to be
[03:22:50] really hard for other developers to
[03:22:52] understand the code I know that needs a
[03:22:54] lot of work and commitment but this is
[03:22:56] your job to make everything following
[03:22:58] the best practices and following the
[03:23:00] same naming convention and standards
[03:23:02] that you put for your projects so guys
[03:23:04] now we have two very nice ETL scripts
[03:23:07] one that loads the bronze layer and
[03:23:09] another one for the silver layer so now
[03:23:11] our data warehouse is very simple all
[03:23:13] you have to do is to run first the
[03:23:15] bronze layer and with that we are taking
[03:23:17] all the data from the CSV files from the
[03:23:20] source and we put it inside our data
[03:23:22] warehouse in the bronze layer and with
[03:23:24] that we are refreshing the whole bronze
[03:23:26] layer once it's done the next step is to
[03:23:29] run the stored procedure of the silver
[03:23:31] layer so once you execute it you are
[03:23:33] taking now all the data from the bronze
[03:23:35] layer transforming it cleaning it up and
[03:23:38] then loading it to the silver layer and
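The run order described above amounts to two calls, sketched here. The procedure names follow the narration ("load bronze" and "load silver"), and the schema prefixes are assumed from the project's naming convention.

```sql
-- Step 1: refresh the bronze layer from the source CSV files.
EXEC bronze.load_bronze;

-- Step 2: clean and transform bronze data, then load the silver layer.
-- This must run second because it reads what step 1 just loaded.
EXEC silver.load_silver;
```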
[03:23:41] as you can see the concept is very
[03:23:42] simple we are just moving the data from
[03:23:44] one layer to another layer with different
[03:23:47] tasks all right guys so as you can see
[03:23:48] in the silver layer we have done a lot
[03:23:50] of data Transformations and we have
[03:23:52] covered all the types that we have in
[03:23:54] the data cleansing so we remove
[03:23:56] duplicates data filtering handling
[03:23:58] missing data invalid data unwanted
[03:24:00] spaces casting the data types and so on
[03:24:03] and as well we have derived new columns
[03:24:05] we have done data enrichment and we have
[03:24:07] normalized a lot of data so now of
[03:24:09] course what we have not done yet are
[03:24:11] business rules and logic data
[03:24:13] aggregations and data integration this
[03:24:15] is for the next layer all right my
[03:24:17] friends so finally we are done cleaning
[03:24:19] up the data and checking the quality of
[03:24:22] our data so we can go and close those
[03:24:24] two steps and now to the next step we
[03:24:26] have to go and extend the data flow
[03:24:28] diagram so let's
[03:24:32] go okay so now let's go and extend our
[03:24:35] data flow for the silver layer so what
[03:24:38] I'm going to do I'm just going to go and
[03:24:40] copy the whole thing and put it side by
[03:24:43] side to the bronze layer and let's call
[03:24:45] it silver
[03:24:46] layer and the table names going to stay
[03:24:48] as before because we have like one to
[03:24:51] one like the bronze layer but what we're
[03:24:52] going to do we're going to go and change
[03:24:54] the coloring so I'm going to go and Mark
[03:24:55] everything and make it gray like silver
[03:24:59] and of course what is very important is
[03:25:00] to make the lineage so I'm going to go
[03:25:02] now from the bronze and take an arrow
[03:25:05] and put it to the silver table and now
[03:25:08] with that we have like a lineage between
[03:25:09] three layers and if you are checking this
[03:25:11] table the customer info you can
[03:25:13] understand aha this comes from the
[03:25:15] bronze layer from the customer info and
[03:25:18] as well this comes from the source
[03:25:19] system CRM so now you can see the
[03:25:22] lineage between different layers and
[03:25:24] without looking at any scripts and so on
[03:25:26] in one picture you can understand the
[03:25:28] whole project so I don't have to
[03:25:30] explain a lot of stuff by just looking
[03:25:32] at this picture you can understand how
[03:25:34] the data is flowing between sources
[03:25:37] bronze layer silver layer and to the
[03:25:39] gold layer of course later so as you can
[03:25:41] see it looks really nice and clean all
[03:25:44] right so with that we have updated the
[03:25:45] data flow
[03:25:46] next we're going to go and commit our
[03:25:48] work in the Git repo so let's
[03:25:53] go okay so now let's go and commit our
[03:25:56] scripts we're going to go to the folder
[03:25:58] scripts and here we have a silver layer
[03:26:00] if you don't have it of course you can
[03:26:01] go and create it so first we're going to
[03:26:03] go and put the DDL scripts for the
[03:26:05] silver layer so let's go and I will
[03:26:08] paste the code over here and as usual
[03:26:10] we have this comment at the header
[03:26:12] explaining the purpose of this script
[03:26:14] so let's go and commit our work and
[03:26:17] we're going to do the same thing for the
[03:26:18] stored procedure that loads the silver
[03:26:21] layer so I'm going to go over here I
[03:26:23] already have a file for that so let's go
[03:26:25] and paste that so we have here our
[03:26:27] stored procedures and as usual at the
[03:26:29] start we have a header as well so this script is
[03:26:31] doing the ETL process where we load the
[03:26:34] data from bronze into silver so the
[03:26:36] action is to truncate the table first
[03:26:38] and then insert transformed cleansed data
[03:26:41] from bronze to silver there are no
[03:26:43] parameters at all and this is how you
[03:26:45] can use the stored procedure okay so
[03:26:47] we're going to go and commit our work
[03:26:50] and now one more thing that we want to
[03:26:52] commit in our project all those queries
[03:26:54] that you have built to check the quality
[03:26:56] of the silver layer so this time we will
[03:26:58] not put it in the scripts we're going to
[03:27:00] go to the tests and here we're going to
[03:27:01] go and make a new file called quality
[03:27:03] checks silver and inside it we're going
[03:27:06] to go and paste all the queries that we
[03:27:08] have built I just reorganized them here
[03:27:11] by the tables so here we can see all the
[03:27:13] checks that we have done during the
[03:27:16] course and at the header we have here
[03:27:18] nice comments so here we are just saying
[03:27:20] that this script is going to check the
[03:27:21] quality of the silver layer and we are
[03:27:23] checking for nulls duplicates unwanted
[03:27:25] spaces invalid date range and so on so
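Two of the check types named above, sketched as they might appear in the quality-checks file. The table and column names (`silver.crm_cust_info.cst_id`, `silver.crm_prd_info.prd_start_dt` and `prd_end_dt`) are assumed spellings, not taken from the narration.

```sql
-- Expectation for every check: no rows returned.
-- Table and column names are assumed for illustration.

-- Nulls or duplicates in the primary key.
SELECT
    cst_id,
    COUNT(*) AS row_count
FROM silver.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;

-- Invalid date range: the end date must not precede the start date.
SELECT *
FROM silver.crm_prd_info
WHERE prd_end_dt < prd_start_dt;
```

Writing each check so that an empty result means "pass" makes the file easy to rerun after every load.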
[03:27:28] that each time you come up with a new
[03:27:30] quality check I'm going to recommend you
[03:27:32] to share it with the project and with
[03:27:33] other teams in order to make it part of
[03:27:36] multiple checks that you do after
[03:27:38] running the ETLs so that's it I'm going
[03:27:40] to go and put those checks in our repo
[03:27:43] and in case I come up with new check I'm
[03:27:45] going to go and update it perfect so now
[03:27:48] we have our code in our repository all
[03:27:50] right so with that our code is safe and
[03:27:53] we are done with the whole epic so we
[03:27:55] have built the silver layer now let's go
[03:27:58] and minimize it and now we come to my
[03:28:00] favorite layer the gold layer so we're
[03:28:02] going to go and build it the first step
[03:28:04] as usual we have to analyze and this
[03:28:06] time we're going to explore the business
[03:28:07] objects so let's
[03:28:12] go all right so now we come to the big
[03:28:14] question how are we going to build the gold
[03:28:15] layer as usual we start with analyzing
[03:28:18] so now what we're going to do here is to
[03:28:19] explore and understand what are the main
[03:28:22] business objects that are hidden inside
[03:28:24] our source system so as you can see we
[03:28:26] have two sources six files and here we
[03:28:28] have to identify what are the business
[03:28:29] objects once we have this understanding
[03:28:32] then we can start coding and here the
[03:28:33] main transformation that we are doing is
[03:28:35] data integration and here usually I
[03:28:37] split it into three steps the first one
[03:28:40] we're going to go and build those
[03:28:41] business objects that we have identified
[03:28:43] and after we have a business object we
[03:28:45] have to look at it and decide what is
[03:28:48] the type of this table is it a dimension
[03:28:50] is it a fact or is it like maybe a flat
[03:28:52] table so what type of table that we have
[03:28:54] built and the last step is of course we
[03:28:57] have now to rename all the columns into
[03:28:59] something friendly and easy to
[03:29:01] understand so that our consumers don't
[03:29:03] struggle with technical names so once we
[03:29:05] have all those steps what we're going to
[03:29:06] do it's time to validate what we have
[03:29:07] created so what we have to do the new
[03:29:09] data model that we have created it
[03:29:11] should be connectable and we have to
[03:29:13] check that the data integration is done
[03:29:15] correctly and once everything is fine we
[03:29:17] cannot skip the last step we have to
[03:29:19] document and as well commit our work in
[03:29:22] the Git and here we will be introducing
[03:29:24] new types of documentation so we're
[03:29:25] going to have a diagram about the data
[03:29:27] model we're going to build a data
[03:29:29] dictionary where we are going to describe
[03:29:31] the data model and of course we can
[03:29:32] extend the data flow diagram so this is
[03:29:34] our process those are the main steps
[03:29:36] that we will do in order to build the
[03:29:38] gold
[03:29:42] layer okay so what is exactly data
[03:29:45] modeling usually the source
[03:29:46] system is going to deliver for you raw data
[03:29:49] unorganized messy not very useful in
[03:29:52] its current state but now the data
[03:29:54] modeling is the process of taking this
[03:29:56] raw data and then organizing it and
[03:29:59] structuring it in a meaningful way so what
[03:30:01] we are doing we are putting the data in
[03:30:03] a new friendly and easy to understand
[03:30:06] objects like customers orders products
[03:30:09] each one of them is focused on specific
[03:30:11] information and what is very important
[03:30:13] is we're going to describe the
[03:30:15] relationship between those objects so by
[03:30:17] connecting them using lines so what you
[03:30:19] have built on the right side we call it
[03:30:21] logical data model if you compare to the
[03:30:23] left side you can see the data model
[03:30:25] makes it really easy to understand our
[03:30:27] data and the relationship the processes
[03:30:29] behind them now in data modeling we have
[03:30:31] three different stages or let's say
[03:30:32] three different ways on how to draw a
[03:30:34] data model the first stage is the
[03:30:36] conceptual data model here the focus is
[03:30:39] only on the entity so we have customers
[03:30:41] orders products and we don't go in
[03:30:43] details at all so we don't specify any
[03:30:46] columns or attributes inside those boxes
[03:30:48] we just want to focus what are the
[03:30:50] entities that we have and as well the
[03:30:52] relationship between them so the
[03:30:54] conceptual data model doesn't focus at all
[03:30:56] on the details it just gives the big
[03:30:58] picture so the second data model that we
[03:31:00] can build is the logical data model and
[03:31:03] here we start specifying what are the
[03:31:05] different columns that we can find in
[03:31:07] each entity like we have the customer ID
[03:31:09] the first name last name and so on and
[03:31:11] we still draw the relationship between
[03:31:13] those entities and as well we make it
[03:31:15] clear which columns are the primary key
[03:31:17] and so on so as you can see we have here
[03:31:18] more details but one thing we don't
[03:31:20] describe a lot of details for each
[03:31:22] column and we do not worry about how exactly
[03:31:25] we are going to store those tables in the
[03:31:27] database the third and last stage we
[03:31:29] have the physical data model this is
[03:31:31] where everything gets ready before
[03:31:33] creating it in the database so here you
[03:31:35] have to add all the technical details
[03:31:37] like adding for each column the data
[03:31:39] types and the length of each data type
[03:31:42] and many other database techniques and
[03:31:44] details so again if you look at the
[03:31:46] conceptual data model it gives us the
[03:31:48] big picture and in the logical data
[03:31:50] model we dive into details of what data
[03:31:52] we need and the physical data model
[03:31:54] prepares everything for the
[03:31:56] implementation in the database and to be
[03:31:58] honest in my projects I only draw the
[03:32:00] conceptual and the logical data model
[03:32:03] because drawing and building the
[03:32:04] physical data model needs a lot of
[03:32:06] effort and time and there are many
[03:32:08] tools like in Databricks that
[03:32:10] automatically generate those models so
[03:32:12] in this project what we're going to do
[03:32:13] we're going to draw the logical data
[03:32:15] model for the gold
[03:32:20] layer all right so now for analytics and
[03:32:23] especially for data warehousing and
[03:32:24] business intelligence we need a special
[03:32:26] data model that is optimized for
[03:32:28] reporting and analytics and it should be
[03:32:31] flexible scalable and as well easy to
[03:32:33] understand and for that we have two
[03:32:35] special data models the first type of
[03:32:37] data model we have the star schema it
[03:32:39] has a central fact table in the middle
[03:32:41] and surrounded by dimensions the fact
[03:32:43] table contains transactions events and
[03:32:46] the dimensions contain descriptive
[03:32:48] information and the relationship
[03:32:50] between the fact table in the middle and
[03:32:51] the dimensions around it forms like a
[03:32:54] star shape and that's why we call it
[03:32:56] star schema and we have another data
[03:32:58] model called snowflake schema it looks
[03:33:00] very similar to the star schema so we
[03:33:02] have again the fact in the middle and
[03:33:04] surrounded by dimensions but the big
[03:33:06] difference is that we break the
[03:33:08] dimensions into smaller subdimensions
[03:33:11] and the shape of this data model as you
[03:33:13] are extending the dimensions it's going
[03:33:15] to look like a snowflake so now if you
[03:33:17] compare them side by side you can see
[03:33:19] that the star schema looks easier right
[03:33:21] so it is usually easy to understand easy
[03:33:23] to query it is really perfect for
[03:33:25] analysis but it has one issue
[03:33:28] the dimensions might contain duplicates
[03:33:30] and your dimensions get bigger with
[03:33:32] time now if you compare to the snowflake
[03:33:34] you can see the schema is more complex
[03:33:36] so you need a lot of knowledge and
[03:33:38] effort in order to query something from
[03:33:41] the snowflake but the main advantage
[03:33:42] here comes with the normalization as you
[03:33:44] are breaking those redundancies in small
[03:33:47] tables you can optimize the storage but
[03:33:49] to be honest who cares about the storage
[03:33:51] so for this project I have chosen to use
[03:33:53] the star schema because it is very
[03:33:55] commonly used perfect for reporting like
[03:33:57] for example if you're using Power BI
[03:33:59] and we don't have to worry about the
[03:34:01] storage so that's why we are going to adopt
[03:34:03] this model to build our gold
[03:34:08] layer okay so now one more thing about
[03:34:10] those data models is that they contain
[03:34:12] two types of tables fact and dimensions
[03:34:15] so when I say this is a fact table or
[03:34:17] a dimension table well the dimension
[03:34:19] contains descriptive information or
[03:34:21] like categories that give some context
[03:34:23] to your data for example a product info
[03:34:26] you have product name category
[03:34:27] subcategories and so on this is like a
[03:34:29] table that is describing the product and
[03:34:32] this we call a dimension but on
[03:34:34] the other hand we have facts they are events
[03:34:36] like transactions they contain three
[03:34:39] important pieces of information first you have
[03:34:41] multiple IDs from multiple dimensions
[03:34:44] then we have like the information like
[03:34:47] when the transaction or the event did
[03:34:49] happen and the third type of information
[03:34:51] you're going to have like measures and
[03:34:52] numbers so if you see those three types
[03:34:54] of data in one table then this is a fact
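To make the three kinds of fact-table content concrete, here is a sketch against a hypothetical star schema. Every name in it (`gold.fact_sales`, `gold.dim_products`, and all columns) is invented for illustration and is not taken from the walkthrough.

```sql
-- Hypothetical fact table: each row is one sales event.
SELECT
    f.product_key,      -- ID pointing to a dimension
    f.customer_key,     -- ID pointing to another dimension
    f.order_date,       -- when the event happened
    f.sales_amount,     -- measure
    f.quantity          -- measure
FROM gold.fact_sales AS f;

-- Typical use: aggregate the measures from the fact table and describe
-- them with an attribute from a dimension table.
SELECT
    p.category,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.quantity)     AS total_quantity
FROM gold.fact_sales AS f
LEFT JOIN gold.dim_products AS p
    ON p.product_key = f.product_key
GROUP BY p.category;
```

The fact table supplies the numbers and the dimension supplies the descriptive context, which is the division of labor a star schema is built around.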
[03:34:57] so if you have a table that answers how
[03:35:00] much or how many then this is a fact but
[03:35:02] if you have a table that answers who
[03:35:05] what where then this is a dimension
[03:35:07] table so this is what dimension and fact
[03:35:13] tables are all right my friends so so far in
[03:35:15] the bronze layer and in the silver layer
[03:35:18] we didn't discuss anything about the
[03:35:20] business so the bronze and silver were
[03:35:22] very technical we are focusing on data
[03:35:24] ingestion we are focusing on cleaning
[03:35:26] up the data quality of the data but
[03:35:28] still the tables are very oriented to
[03:35:30] the source system now comes the fun part
[03:35:33] in the gold layer where we're going to go
[03:35:34] and break the whole data model of the
[03:35:37] sources so we're going to create
[03:35:38] something completely new to our business
[03:35:41] that is easy to consume for business
[03:35:43] reporting and analysis and here it is
[03:35:45] very very important to have a clear
[03:35:47] understanding of the business and the
[03:35:48] processes and if you don't know it
[03:35:50] already at this phase you have really to
[03:35:52] invest time by meeting maybe process
[03:35:54] experts the domain experts in order to
[03:35:57] have clear understanding what we are
[03:35:59] talking about in the data so now what
[03:36:01] we're going to do we're going to try to
[03:36:02] detect what are the business objects
[03:36:05] that are hidden in the source systems so
[03:36:07] now let's go and explore that all right
[03:36:09] now in order to build a new data model I
[03:36:11] have to understand first the original
[03:36:13] data model what are the main business
[03:36:15] objects that we have how things are
[03:36:17] related to each other and this is a very
[03:36:19] important process in building a new
[03:36:21] model so now what I usually do I start
[03:36:23] giving labels to all those tables so if
[03:36:26] you go to the shapes over here let's go
[03:36:27] and search for label and if you go to
[03:36:29] more icons I'm going to go and take this
[03:36:32] label over here so drag and drop it and
[03:36:34] then I'm going to go and increase maybe
[03:36:36] the size of the font so let's go with 20
[03:36:39] and bold just make it a little bit
[03:36:41] bigger so now by looking at this data
[03:36:43] model we can see that we have product
[03:36:45] information in the CRM and as well
[03:36:47] in the ERP and then we have like
[03:36:49] customer information and a transactional
[03:36:51] table so now let's focus on the product
[03:36:54] so the product information is over here
[03:36:56] we have here the current and the history
[03:36:58] product informations and here we have
[03:37:00] the categories that belong to the
[03:37:02] products so in our data model we have
[03:37:04] something called products so let's go
[03:37:06] and create this label it's going to be
[03:37:07] the products and so let's go and give it
[03:37:10] a color to the style let's pick for
[03:37:13] example the red one now let's go and
[03:37:15] move this label and put it beneath this
[03:37:17] table over here so that I have like a label
[03:37:20] saying this table belongs to the object
[03:37:23] called products now I'm going to do the
[03:37:25] same thing for the other table over here
[03:37:27] so I'm going to go and tag this table to
[03:37:29] the product as well so that I can see
[03:37:31] easily which tables from the sources
[03:37:33] have information about the product
[03:37:36] business object all right now moving on
[03:37:38] we have here a table called customer
[03:37:40] information so we have a lot of
[03:37:41] information about the customer we have
[03:37:43] as well in the ERP customer information
[03:37:45] where we have the birthday and the
[03:37:46] country so those three tables have to do
[03:37:49] with the object customer so that means
[03:37:51] we're going to go and label it like that
[03:37:53] so let's call it customer and I'm going
[03:37:55] to go and pick a different color for that
[03:37:58] let's go with the green so I will tag
[03:38:01] this table like this and the same thing
[03:38:03] for the other tables so copy tag the
[03:38:06] second table and the third table now it
[03:38:09] is very easy for me to see which table
[03:38:11] belongs to which business object and
[03:38:13] now we have the final table over here
[03:38:15] and only one table about the sales and
[03:38:18] orders in the ERP we don't have any
[03:38:20] information about that so this one
[03:38:22] is going to be easy let's call it sales and
[03:38:25] let's move it over here and as well
[03:38:27] maybe change the color of that to for
[03:38:29] example this color over here now this
[03:38:32] step is very important when building any
[03:38:34] data model in the gold layer it gives
[03:38:35] you a big picture about the things that
[03:38:37] you are going to model so now the next
[03:38:39] step with that we're going to go and
[03:38:40] build those objects step by step so
[03:38:42] let's start with the first object with
[03:38:44] our customers so here we have three
[03:38:45] tables and we're going to start with the
[03:38:47] CRM so let's start with this table over
[03:38:49] here all right so with that we know what
[03:38:51] are our business objects and this task
[03:38:54] is done and now in The Next Step we're
[03:38:55] going to go back to SQL and start doing
[03:38:58] data integrations and building a
[03:39:00] completely new data model so let's go
[03:39:02] and do
[03:39:06] that now let's have a quick look at the
[03:39:09] gold layer specifications so this is the
[03:39:11] final stage we're going to provide data
[03:39:12] to be consumed by reporting and
[03:39:14] Analytics and this time we will not be
[03:39:16] building tables we will be using views
[03:39:19] so that means we will not be having like
[03:39:21] stored procedure or any load process to
[03:39:23] the gold layer all you are doing is
[03:39:25] only data transformation and the focus
[03:39:28] of the data transformation going to be
[03:39:29] data integration aggregation business
[03:39:31] logic and so on and this time we're
[03:39:33] going to introduce a new data model we
[03:39:35] will be doing star schema so those are
[03:39:38] the specifications for the gold layer
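Because the gold layer is built from views instead of loaded tables, each gold object is just a stored query over the silver layer. A minimal sketch, in which the view name `gold.customers_example`, the silver table `silver.crm_cust_info`, and all column names are assumptions made for illustration:

```sql
-- A view stores only the query, so there is nothing to truncate or load:
-- every SELECT against it reads the current silver data.
-- View, table, and column names are assumed for illustration.
CREATE OR ALTER VIEW gold.customers_example AS
SELECT
    cst_id        AS customer_id,   -- friendly names for consumers
    cst_firstname AS first_name
FROM silver.crm_cust_info;
GO

SELECT * FROM gold.customers_example;
```

The trade-off is that the transformation runs at query time. That is acceptable at this data volume, though larger projects sometimes materialize the result as a table.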
[03:39:40] and this is our scope so this time we
[03:39:42] make sure that we are selecting data
[03:39:44] from the silver layer
[03:39:45] not from the bronze because the bronze
[03:39:48] has bad data quality and in the silver
[03:39:50] everything is prepared and cleaned up so in
[03:39:52] order to build the gold layer we are going to
[03:39:53] be targeting the silver layer so let's
[03:39:56] start with select star from and we're
[03:39:59] going to go to the silver CRM customer
[03:40:02] info so let's go and hit execute and now
[03:40:04] we're going to go and select the columns
[03:40:06] that we need to be presented in the gold
[03:40:08] layer so let's start selecting The
[03:40:10] Columns that we want we have the ID the
[03:40:13] key the first name
[03:40:19] I will not go and get the metadata
[03:40:21] information this only belongs to the
[03:40:23] silver layer perfect the next step is that I'm
[03:40:25] going to go and give this table an alias
[03:40:27] so let's go and call it CI and I'm going
[03:40:29] to make sure that we are selecting from
[03:40:32] this alias because later we're going to
[03:40:33] go and join this table with other tables
[03:40:36] so something like this so we're going to
[03:40:37] go with those columns now let's move to
[03:40:39] the second table let's go and get the
[03:40:41] birthday information so now we're going
[03:40:43] to jump to the other system and we have
[03:40:45] to join the data by the CID together
[03:40:48] with the customer key so now we have to
[03:40:49] go and join the data with another table
[03:40:52] and here I try to avoid using the inner
[03:40:55] join because if the other table doesn't
[03:40:57] have all the information about the
[03:40:58] customers I might lose customers so
[03:41:01] always start with the master table and
[03:41:03] if you join it with any other table in
[03:41:05] order to get information always try to
[03:41:08] avoid the inner join because the other
[03:41:10] source might not have all the customers
[03:41:12] and if you do an inner join you might lose
[03:41:14] customers so I tend to start from the
[03:41:16] master table and then everything else is
[03:41:18] about the left join so I'm going to say
[03:41:20] left join silver ERP customer AZ12 so
[03:41:24] let's give it the alias CA and now we have
[03:41:26] to join the tables so it's going to be
[03:41:28] by CI from the first table it's going to be
[03:41:31] the customer key equal to CA and we have
[03:41:35] the CID now of course we're going to
[03:41:37] get matching data because we checked the
[03:41:39] silver layer but if we haven't prepared
[03:41:41] the data in the silver layer we have to
[03:41:43] do a preparation step here in order to
[03:41:45] join the tables but we don't have to
[03:41:46] do that because that was a prep step in the
[03:41:49] silver layer so now you can see the
[03:41:50] system that we have in this bronze
[03:41:53] silver gold so now after joining the
[03:41:55] tables we have to go and pick the
[03:41:56] information that we need from the second
[03:41:58] table which is the birth date so bdate
[03:42:02] and as well from this table there is
[03:42:04] another nice piece of information it is the
[03:42:06] gender information so that's all that we
[03:42:09] need from the second table let's go and
[03:42:11] check the third table so the third table
[03:42:14] is about the location information the
[03:42:16] countries and as well we connect the
[03:42:18] tables by the CID with the key so let's
[03:42:20] go and do that we're going to say as
[03:42:22] well left join silver ERP location and
[03:42:26] I'm going to give it the name LA and
[03:42:28] then we have to join by the keys the
[03:42:30] same thing it's going to be CI customer
[03:42:33] key equal to LA CID again we have
[03:42:37] prepared those IDs and keys in the
[03:42:39] silver layer so the join should be
[03:42:41] working now we have to go and pick the
[03:42:43] data from the third table so what do
[03:42:45] we have over here we have the ID the
[03:42:47] country and the metadata information so
[03:42:49] let's go and just get the country
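The join described above can be sketched as follows. This is a minimal illustration, not the exact query from the video: the video works in SQL Server, while this sketch runs equivalent SQL in an in-memory SQLite database from Python so that it is runnable on its own. The table names (with a `silver_` prefix standing in for the `silver` schema), the column names, and all sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_crm_cust_info (cst_id INTEGER, cst_key TEXT, cst_firstname TEXT);
CREATE TABLE silver_erp_cust_az12 (cid TEXT, bdate TEXT, gen TEXT);
CREATE TABLE silver_erp_loc_a101  (cid TEXT, cntry TEXT);

-- Made-up sample rows: customer 3 is missing from the ERP customer table,
-- customer 2 is missing from the ERP location table.
INSERT INTO silver_crm_cust_info VALUES (1, 'AW001', 'Jon'), (2, 'AW002', 'Eva'), (3, 'AW003', 'Max');
INSERT INTO silver_erp_cust_az12 VALUES ('AW001', '1971-10-06', 'Male'), ('AW002', '1976-05-10', 'Female');
INSERT INTO silver_erp_loc_a101  VALUES ('AW001', 'Australia'), ('AW003', 'Germany');
""")

# The CRM table is the master; the ERP tables only enrich it.
JOIN_TEMPLATE = """
SELECT ci.cst_id, ci.cst_key, ci.cst_firstname, ca.bdate, ca.gen, la.cntry
FROM silver_crm_cust_info AS ci
{join} silver_erp_cust_az12 AS ca ON ci.cst_key = ca.cid
{join} silver_erp_loc_a101  AS la ON ci.cst_key = la.cid
ORDER BY ci.cst_id
"""

left_rows = conn.execute(JOIN_TEMPLATE.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(JOIN_TEMPLATE.format(join="INNER JOIN")).fetchall()

print(len(left_rows), len(inner_rows))  # LEFT JOIN keeps every CRM customer
for row in left_rows:
    print(row)
```

With this sample data the left join keeps all three customers and fills the missing ERP columns with NULL, while the inner join keeps only the one customer present in all three tables.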
[03:42:51] perfect so now with that we have joined
[03:42:53] all the three tables and we have picked
[03:42:55] all the columns that we want in this
[03:42:58] object so again by looking over here we
[03:43:00] have joined this table with this one and
[03:43:02] this one so with that we have collected
[03:43:04] all the customer information that we
[03:43:06] have from the two Source systems okay so
[03:43:09] now let's go and query in order to make
[03:43:10] sure that we have everything correct and
[03:43:12] in order to check that your joins
[03:43:14] are correct you have to keep your eye on
[03:43:17] those three columns so if you are seeing
[03:43:19] that you are getting data that means you
[03:43:21] are doing the joins correctly but
[03:43:24] if you are seeing a lot of nulls or no
[03:43:26] data at all that means your joins are
[03:43:29] incorrect but now it looks to me like it is
[03:43:31] working and another check that I do is
[03:43:34] that if your first table has no
[03:43:36] duplicates what could happen is that
[03:43:38] after doing multiple joins you might
[03:43:40] now start getting duplicates because the
[03:43:42] relationship between those tables is not
[03:43:44] clearly one to one you might get a one
[03:43:46] to many relationship or many to many
[03:43:48] relationships so now the check that I
[03:43:50] usually do at this stage is that I have
[03:43:52] to make sure that I don't have
[03:43:54] duplicates in the results so we
[03:43:56] don't have multiple rows for the
[03:43:58] same customer so in order to do that we
[03:44:00] go and do a quick group by so we're
[03:44:03] going to group the data by the
[03:44:05] customer ID and then we do the
[03:44:07] counts from this subquery so this is the
[03:44:11] whole subquery and then after that we're
[03:44:14] going to go and say Group by the
[03:44:17] customer ID and then we say having
[03:44:21] count higher than one so this query
[03:44:25] actually tries to find out whether we have
[03:44:28] any duplicates in the primary key so
[03:44:30] let's go and execute it we don't have any
[03:44:32] duplicates and that means after joining
[03:44:35] all those tables with the customer info
[03:44:38] those tables didn't cause any
[03:44:39] issues and it didn't duplicate my data
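The duplicate check described above can be sketched like this. It is a minimal illustration under assumptions: it runs in SQLite from Python rather than in SQL Server, the table and column names are stand-ins, and the sample rows (including the deliberately duplicated ERP row) are invented to show what the check reports when a join is not one to one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_crm_cust_info (cst_id INTEGER, cst_key TEXT);
CREATE TABLE silver_erp_loc_a101  (cid TEXT, cntry TEXT);
INSERT INTO silver_crm_cust_info VALUES (1, 'AW001'), (2, 'AW002');
INSERT INTO silver_erp_loc_a101  VALUES ('AW001', 'Australia'), ('AW002', 'Germany');
""")

# Wrap the joined query in a subquery, group by the customer ID,
# and keep only IDs that appear more than once.
DUPLICATE_CHECK = """
SELECT cst_id, COUNT(*) AS row_count
FROM (
    SELECT ci.cst_id, la.cntry
    FROM silver_crm_cust_info AS ci
    LEFT JOIN silver_erp_loc_a101 AS la ON ci.cst_key = la.cid
) AS joined
GROUP BY cst_id
HAVING COUNT(*) > 1
"""

clean_result = conn.execute(DUPLICATE_CHECK).fetchall()

# Simulate a one-to-many relationship: a second location row for AW001.
conn.execute("INSERT INTO silver_erp_loc_a101 VALUES ('AW001', 'Austria')")
duplicated_result = conn.execute(DUPLICATE_CHECK).fetchall()

print(clean_result)       # empty list: the join did not multiply rows
print(duplicated_result)  # customer 1 now appears twice after the join
```

An empty result is the expected outcome; any returned row names a customer that the join has multiplied.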
[03:44:42] so this is a very important check to make
[03:44:44] sure that you are on the right track all
[03:44:46] right so that means everything is fine
[03:44:48] about the duplicates we don't have to worry
[03:44:50] about it now we have here an integration
[03:44:53] issue so let's go and execute it again
[03:44:54] and now if you look at the data we have
[03:44:56] two sources for the gender information
[03:44:58] one comes from the CRM and another one
[03:45:01] comes from the ERP so now the question is
[03:45:03] what are we going to do with this well
[03:45:04] we have to do data integration so let me
[03:45:07] show you how I do it first I go and have
[03:45:09] a new query and then I'm going to go and
[03:45:11] remove all other stuff and I'm going to
[03:45:14] leave only those two columns and
[03:45:16] use distinct just to focus on the
[03:45:19] integration and let's go and execute it
[03:45:21] and maybe as well do an order by so
[03:45:23] let's do one and two let's go and
[03:45:25] execute it again so now here we have all
[03:45:27] the scenarios and we can see sometimes
[03:45:30] there is a match so from the first
[03:45:32] table we have female and from the other table
[03:45:33] we have as well female but sometimes we
[03:45:35] have an issue where those two tables are
[03:45:37] giving different information and the
[03:45:39] same thing over here so this is as well
[03:45:41] an issue different information another
[03:45:43] scenario is where we have a value from the first
[03:45:45] table like here we have the female but
[03:45:47] in the other table we have not available
[03:45:50] well this is not a problem so we can get
[03:45:52] it from the first table but we have as
[03:45:54] well the exact opposite scenario where
[03:45:56] from the first table the data is not
[03:45:58] available but it is available from the
[03:46:00] second table and now here you might
[03:46:02] wonder why I'm getting a null over here
[03:46:04] we did handle all the missing data in
[03:46:06] the silver layer and we replaced
[03:46:07] everything with not available so why are we
[03:46:09] still getting a null this null
[03:46:11] doesn't come directly from the tables it
[03:46:14] just comes from joining the tables so
[03:46:17] that means there are customers in the
[03:46:19] CRM table that are not available in the
[03:46:22] ERP table and if there is no match
[03:46:25] what's going to happen we will get a
[03:46:27] null from SQL so this null means there
[03:46:30] was no match and that's why we are
[03:46:32] getting this null it is not coming from
[03:46:34] the content of the tables and this is of
[03:46:36] course an issue but now the big issue
[03:46:38] what can happen for those two scenarios
[03:46:40] here we have the data but they are
[03:46:42] different and here again we have to ask
[03:46:44] the experts about it what is the master
[03:46:47] here is it the CRM system or the ARP and
[03:46:50] let's say from their answer going to say
[03:46:52] the master data for the customer
[03:46:54] information is the CRM so that means the
[03:46:57] CRM informations are more accurate than
[03:47:00] the Erp information and this is only
[03:47:02] about the customers of course so for
[03:47:04] this scenario where we have female and
[03:47:06] male then the correct information is the
[03:47:09] female from the First Source system the
[03:47:10] same goes over here and here we have
[03:47:12] male and female then the correct
[03:47:14] one is the male because this source
[03:47:17] system is the master okay so now let's
[03:47:19] go and build this business rule we're
[03:47:21] going to start as usual with the case when
[03:47:23] so the first very important rule is if
[03:47:25] we have data in the gender information
[03:47:28] from the CRM system from the master then
[03:47:31] go and use it so we're going to go and
[03:47:32] check the gender information from the
[03:47:34] CRM table so customer gender is not
[03:47:38] equal to not available so that means we
[03:47:40] have a value male or female let me just
[03:47:42] have here a comma like this then what's
[03:47:45] going to happen go and use it so we're
[03:47:47] going to use the value from the master
[03:47:50] CRM is the master for gender info now
[03:47:55] otherwise that means it is not available
[03:47:58] from the CRM table then go and use and
[03:48:02] grab the information from the second
[03:48:03] table so we're going to say ca gender
[03:48:07] but now we have to be careful this null
[03:48:09] over here we have to convert it to not
[03:48:11] available as well so we're going to use
[03:48:13] the coalesce
[03:48:14] so if this is a null then go and use the
[03:48:18] not available like this so that's it
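The business rule just described (CRM is the master for gender, fall back to the ERP value, and turn a NULL from an unmatched join into the placeholder) can be sketched as below. This is an illustration under assumptions, not the exact code from the video: it uses SQLite from Python, the table and column names are stand-ins, the placeholder is written as `'n/a'` to represent the "not available" value, and the sample rows are invented to cover each scenario.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_crm_cust_info (cst_id INTEGER, cst_key TEXT, cst_gndr TEXT);
CREATE TABLE silver_erp_cust_az12 (cid TEXT, gen TEXT);

-- One made-up customer per scenario; K4 has no ERP row at all.
INSERT INTO silver_crm_cust_info VALUES
    (1, 'K1', 'Female'),  -- both sources agree
    (2, 'K2', 'Female'),  -- sources disagree, CRM wins
    (3, 'K3', 'n/a'),     -- only the ERP knows the gender
    (4, 'K4', 'n/a'),     -- no match in the ERP, join yields NULL
    (5, 'K5', 'n/a');     -- unknown in both sources
INSERT INTO silver_erp_cust_az12 VALUES
    ('K1', 'Female'), ('K2', 'Male'), ('K3', 'Male'), ('K5', 'n/a');
""")

rows = conn.execute("""
SELECT
    ci.cst_id,
    ci.cst_gndr,
    ca.gen,
    CASE
        WHEN ci.cst_gndr != 'n/a' THEN ci.cst_gndr  -- CRM is the master
        ELSE COALESCE(ca.gen, 'n/a')                -- fall back to ERP, NULL becomes n/a
    END AS new_gen
FROM silver_crm_cust_info AS ci
LEFT JOIN silver_erp_cust_az12 AS ca ON ci.cst_key = ca.cid
ORDER BY ci.cst_id
""").fetchall()

for row in rows:
    print(row)
```

Customer 4 shows why the `COALESCE` is needed: the silver tables contain no NULL, but the left join produces one when the ERP has no matching row.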
[03:48:20] let's have an end let me just push this
[03:48:23] over here so let's go and call it
[03:48:26] new_gen for now let's go and execute it and
[03:48:28] let's go and check the different
[03:48:30] scenarios all those values over here we
[03:48:33] have data from the CRM system and this
[03:48:35] is as well represented in the new column
[03:48:38] but now for the second part we don't
[03:48:40] have data from the first system so we
[03:48:42] are trying to get it from the second
[03:48:44] system so for the first one is not
[03:48:46] available and then we try to get it from
[03:48:48] the Second Source system so now we are
[03:48:50] activating the else well it is null and
[03:48:52] with that the coalesce is activated and we
[03:48:55] are replacing the null with not
[03:48:57] available for the second scenario as
[03:48:59] well the first system doesn't have the
[03:49:02] gender information that's why we are
[03:49:03] grabbing it from the second so with that
[03:49:06] we have a female and then the third one
[03:49:07] the same thing we don't have information
[03:49:09] but we get it from the Second Source
[03:49:11] system we have the male and the last one
[03:49:13] it is not available in both source
[03:49:15] systems that's why we are getting not
[03:49:17] available so with that as you can see we
[03:49:19] have a perfect new column where we are
[03:49:21] integrating two different source systems
[03:49:24] into one and this is exactly what we call
[03:49:26] data integration this piece of
[03:49:28] information is way better than the
[03:49:31] source CRM and as well the source ERP it
[03:49:34] is richer and has more information
[03:49:37] and this is exactly why we try to get
[03:49:39] data from different source systems in
[03:49:40] order to get rich information in the
[03:49:43] data warehouse so we have a nice
[03:49:45] logic and as you can see it's way easier
[03:49:47] to separate it in a separate query in
[03:49:49] order first to build the logic and then
[03:49:51] take it to the original query so what
[03:49:53] I'm going to do I'm just going to go and
[03:49:55] copy everything from here and go back to
[03:49:57] our query I'm going to go and delete
[03:49:59] those columns the gender and I will
[03:50:02] put our new logic over here so a comma
[03:50:05] and let's go and execute so with that we
[03:50:07] have our new nice column now with that
[03:50:09] we have a very nice object we don't have
[03:50:11] duplicates and we have integrated the data
[03:50:13] together so we took three tables
[03:50:15] and we put them in one object now the next
[03:50:17] step is that we're going to go and give
[03:50:19] nice friendly names the rule in the gold
[03:50:22] layer is to use friendly names and not
[03:50:24] to follow the names that we get from the
[03:50:26] source system and we have to make sure
[03:50:28] that we are following the rules of the
[03:50:30] naming conventions so we are following
[03:50:32] the snake case so let's go and do it
[03:50:34] step by step for the first one let's go
[03:50:36] and call it the customer ID and then the
[03:50:39] next one I will get rid of using keys
[03:50:41] and so on I'm going to go and call it
[03:50:43] customer number because those are
[03:50:46] customer numbers then for the next one
[03:50:48] we're going to call it first name
[03:50:51] without using any prefixes and the next
[03:50:54] one last name and we have here marital
[03:50:58] status so I will be using the exact name
[03:51:01] but without the prefix and here we're just
[03:51:04] going to call it gender and this one we're
[03:51:06] going to call create date and this
[03:51:09] one birth date and the last one is going to
[03:51:12] be the country so let's go and execute
[03:51:16] it now as you can see the names are
[03:51:18] really friendly so we have customer ID
[03:51:20] customer number first name last name
[03:51:22] marital status gender so as you can see
[03:51:25] the names are really nice and really
[03:51:27] easy to understand now the next step I'm
[03:51:29] going to think about the order of those
[03:51:30] columns so the first two it makes sense
[03:51:32] to have it together the first name last
[03:51:34] name then I think the country is very
[03:51:36] important information so I'm going to go
[03:51:38] and get it from here and put it exactly
[03:51:40] after the last name it's just nicer so
[03:51:43] let's go and execute it again so the
[03:51:44] first name last name country it's always
[03:51:47] nice to group related columns
[03:51:48] together right so we have here the
[03:51:50] status the gender and so on and then
[03:51:52] we have the create date and the birth date
[03:51:54] I think I'm going to go and switch the
[03:51:56] birth date with the create date it's more
[03:51:58] important than the create date like this
[03:52:01] and here don't forget a comma so execute
[03:52:03] again so it looks wonderful now comes a
[03:52:06] very important decision about this
[03:52:08] object is it a fact table or a
[03:52:10] dimension well as we learned Dimensions
[03:52:12] hold descriptive information about an
[03:52:15] object and as you can see we have here
[03:52:17] descriptions of the customers so all
[03:52:20] those columns are describing the
[03:52:22] customer information and we don't have
[03:52:23] here like transactions and events and we
[03:52:26] don't have like measures and so on so we
[03:52:28] cannot say this object is a fact it is
[03:52:31] clearly a dimension so that's why we're
[03:52:33] going to go and call this object the
[03:52:35] dimension customer now there is one
[03:52:37] thing if you are creating a new
[03:52:39] dimension you always need a primary key
[03:52:41] for the dimension of course we can go
[03:52:43] over here and depend on the primary
[03:52:45] key that we get from The Source system
[03:52:47] but sometimes you can have like
[03:52:49] Dimensions where you don't have like a
[03:52:51] primary key that you can count on so
[03:52:53] what we have to do is to go and generate
[03:52:55] a new primary key in the data warehouse
[03:52:58] and those primary keys we call
[03:52:59] surrogate keys surrogate keys are system
[03:53:02] generated unique identifiers that are
[03:53:05] assigned to each record to make the
[03:53:07] record unique it is not a business key
[03:53:10] it has no meaning and no one in the
[03:53:12] business knows about it we only use it
[03:53:14] in order to connect our data model and
[03:53:17] in this way we have more control on how
[03:53:19] to connect our data model and we don't
[03:53:21] have to depend always on the source
[03:53:23] system and there are different ways on
[03:53:25] how to generate surrogate Keys like
[03:53:27] defining it in the ddl or maybe using
[03:53:30] the window function row number in this
[03:53:32] data warehouse I'm going to go with a
[03:53:33] simple solution where we're going to go
[03:53:35] and use the window function so now in
[03:53:37] order to generate a surrogate key for this
[03:53:40] Dimension what we're going to do it is
[03:53:42] very simple so we're going to say row
[03:53:43] number
[03:53:45] over and here we have to order by
[03:53:48] something you can order by the create
[03:53:51] date or the customer ID or the customer
[03:53:53] number whatever you want but in this
[03:53:55] example I'm going to go and order by the
[03:53:58] customer ID so we have to follow the
[03:54:00] naming convention that all surrogate keys
[03:54:02] come with key at the end as a suffix so
[03:54:05] now let's go and query those
[03:54:06] results and as you can see at the
[03:54:08] start we have a customer key and this is
[03:54:11] a sequence we don't have here of course
[03:54:13] any duplicates and now this surrogate key is
[03:54:15] generated in the data warehouse and we
[03:54:18] going to use this key in order to
[03:54:20] connect the data model so now with that
[03:54:22] our query is ready and the last step is
[03:54:24] that we're going to go and create the
[03:54:26] object and as we decided all the objects
[03:54:28] in the gold layer are going to be virtual
[03:54:30] ones so that means we're going to go and
[03:54:32] create a view so we're going to say
[03:54:34] create view gold.dim so following the naming
[03:54:38] convention dim stands for the dimension and
[03:54:41] we're going to have the customers and
[03:54:42] then after that we have AS so with that
[03:54:45] everything is ready let's go and execute
[03:54:47] it it was successful let's go to the
[03:54:50] Views now and you can see our first
[03:54:52] object so we have the dimension
[03:54:54] customers in the gold layer now as you
[03:54:56] know me in the next step we're going
[03:54:58] to go and check the quality of this new
[03:55:00] object so let's go and have a new query
[03:55:03] so select star from our view dim
[03:55:07] customers and now we have to make sure
[03:55:09] that everything is in the right position
[03:55:11] like this and now we can do different
[03:55:13] checks like the uniqueness and so on but
[03:55:16] I'm worried about the gender information
[03:55:19] so let's go and have a distinct of all
[03:55:21] values so as you can see it is working
[03:55:23] perfectly we have only female male and
[03:55:25] not available so that's it with that we
[03:55:28] have our first new
[03:55:33] dimension okay friends so now let's go
[03:55:36] and build the second object we have the
[03:55:38] products so as you can see product
[03:55:40] information is available in both Source
[03:55:42] systems as usual we're going to start
[03:55:44] with the CRM information and then we're
[03:55:46] going to go and join it with the other
[03:55:48] table in order to get the category
[03:55:50] information so those are the columns
[03:55:52] that we want from this table now we come
[03:55:54] here to a big decision about this
[03:55:56] object this object contains historical
[03:55:58] information and as well the current
[03:56:00] information now of course it depends on the
[03:56:02] requirement whether you have to do
[03:56:03] analysis on the historical information
[03:56:05] but if you don't have such a
[03:56:07] requirement we can go and stay with
[03:56:09] only the current information of the
[03:56:11] products so we don't have to include all
[03:56:12] the history in the object and it is
[03:56:15] anyway as we learned from the model over
[03:56:16] here we are not using the primary key we
[03:56:19] are using the product key so now what we
[03:56:22] have to do is to filter out the
[03:56:24] historical data and to stay only with
[03:56:26] the current data so we're going to have
[03:56:27] here a where condition and now in order to
[03:56:30] select the current data what we're going
[03:56:31] to do we're going to go and target the
[03:56:33] end date if the end date is null that
[03:56:36] means it is current data let's take
[03:56:38] this example over here so you can see
[03:56:40] here we have three records for the same
[03:56:42] product key and for the first two
[03:56:44] records we have here a value in
[03:56:46] the end date because it is historical
[03:56:48] information but the last record over
[03:56:51] here we have it as a null and that's
[03:56:53] because this is the current information
[03:56:55] it is open and it's not closed yet so in
[03:56:58] order to select only the current
[03:56:59] information it is very simple we're
[03:57:01] going to say prd end date is null so if
[03:57:05] you go now and execute it you will get
[03:57:07] only the current products you will not
[03:57:09] have any history and of course we can go
[03:57:11] and add a comment to it filter out all
[03:57:15] historical data and this means of course
[03:57:17] we don't need the end date in our
[03:57:19] selection of course because it is always
[03:57:21] a null so with that we have only the
[03:57:24] current data now the next step is that we
[03:57:26] have to go and join it with the product
[03:57:29] categories from the Erp and we're going
[03:57:31] to use here the ID so as usual the
[03:57:34] master information is the CRM and
[03:57:36] everything else going to be secondary
[03:57:38] that's why I use the left join just to
[03:57:41] make sure I'm not losing I'm not
[03:57:43] filtering any data because with an inner join if there is
[03:57:44] no match then we lose data so let's join
[03:57:48] silver ERP and the category so let's
[03:57:51] call it PC and now what we're going to
[03:57:53] do we're going to go and join it using
[03:57:55] the key so PN from the CRM we have the
[03:57:58] category ID equal to PC ID and now we
[03:58:02] have to go and pick columns from the
[03:58:04] second table so it's going to be the PC
[03:58:06] we have the category very important PC
[03:58:10] we have the
[03:58:11] subcategory and we can go and get the
[03:58:13] maintenance
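The two steps described above, keeping only the current product records and then enriching them with the ERP categories, can be sketched as follows. This is a minimal sketch under assumptions: it uses SQLite from Python instead of SQL Server, and the table names, column names, and sample rows are invented stand-ins for the silver tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_crm_prd_info (
    prd_id INTEGER, cat_id TEXT, prd_key TEXT, prd_nm TEXT,
    prd_start_dt TEXT, prd_end_dt TEXT
);
CREATE TABLE silver_erp_px_cat (id TEXT, cat TEXT, subcat TEXT, maintenance TEXT);

-- Made-up rows: three versions of the same product key, only the last is open.
INSERT INTO silver_crm_prd_info VALUES
    (1, 'AC_HE', 'HL-U509', 'Helmet v1', '2011-07-01', '2012-06-30'),
    (2, 'AC_HE', 'HL-U509', 'Helmet v2', '2012-07-01', '2013-06-30'),
    (3, 'AC_HE', 'HL-U509', 'Helmet v3', '2013-07-01', NULL),
    (4, 'CO_PE', 'PD-M282', 'Pedal',     '2013-07-01', NULL);
-- The category CO_PE is deliberately missing on the ERP side.
INSERT INTO silver_erp_px_cat VALUES ('AC_HE', 'Accessories', 'Helmets', 'Yes');
""")

rows = conn.execute("""
SELECT pn.prd_id, pn.prd_key, pn.prd_nm, pc.cat, pc.subcat, pc.maintenance
FROM silver_crm_prd_info AS pn
LEFT JOIN silver_erp_px_cat AS pc ON pn.cat_id = pc.id
WHERE pn.prd_end_dt IS NULL  -- filter out all historical data
ORDER BY pn.prd_id
""").fetchall()

for row in rows:
    print(row)
```

Only the open record of each product survives the filter, and the product without a matching category is kept with NULL category columns instead of being dropped.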
[03:58:14] so something like this let's go and
[03:58:17] query and with that we have all those
[03:58:19] columns coming from the first table and
[03:58:22] those three coming from the second so
[03:58:24] with that we have collected all the
[03:58:25] product information from the two source
[03:58:28] systems now the next step is we have to
[03:58:30] go and check the quality of these
[03:58:32] results and of course what is very
[03:58:34] important is to check the uniqueness so
[03:58:37] what we're going to do we're going to go
[03:58:38] and have the following query I want to
[03:58:41] make sure that the product key is unique
[03:58:45] because we're going to use it later in
[03:58:47] order to join the table with the sales
[03:58:49] so
[03:58:50] from and then we have to have group by
[03:58:53] product key and we're going to say
[03:58:55] having
[03:58:56] count higher than one so let's go and
[03:59:00] check perfect we don't have any
[03:59:02] duplicates the second table didn't cause
[03:59:04] any duplicates for our join and as well
[03:59:07] this means we don't have historical data
[03:59:09] and each product has only one record and
[03:59:12] we don't have any duplicates so I'm
[03:59:14] really happy about that so let's go and
[03:59:16] query again now of course the next step
[03:59:18] do we have anything to integrate
[03:59:20] together do we have the same information
[03:59:22] twice well we don't have that the next
[03:59:25] step is that we're going to go and group
[03:59:27] up the related information together so
[03:59:29] I'm going to say the product ID then the
[03:59:32] product key and the product name are
[03:59:35] together so all those three columns
[03:59:37] are together and after that we can put
[03:59:39] all the category information together
[03:59:41] so we can have the category ID the
[03:59:43] category itself the subcategory let me
[03:59:46] just query and see the results so we
[03:59:48] have the product ID key name and then we
[03:59:51] have the category ID name and the
[03:59:53] subcategory and then maybe as well to
[03:59:55] put the maintenance after the
[03:59:58] subcategory like this and I think the
[04:00:00] product cost and the line and start date
[04:00:02] could stay at the end so let me just
[04:00:04] check so those four columns are
[04:00:07] about the category and then we have the
[04:00:08] cost line and the start date I'm really
[04:00:11] happy with that the next step we're
[04:00:12] going to go and give nice friendly
[04:00:14] names for those columns so let's start
[04:00:17] with the first one this is the product
[04:00:19] ID the next one going to be the product
[04:00:22] number we need the key for the surrogate
[04:00:25] key later and then we have the product
[04:00:28] name and after that we have the category
[04:00:31] ID and the category and this is the
[04:00:36] subcategory and then the next one going
[04:00:38] to stay as it is I don't have to rename
[04:00:40] it the next one going to be the cost and
[04:00:43] the
[04:00:44] line and the last one will be the start
[04:00:47] date so let's go and execute it now we
[04:00:50] can see very nicely in the output all
[04:00:52] those friendly names for the columns and
[04:00:55] it looks way nicer than before I don't
[04:00:57] have even to describe those informations
[04:00:59] the name describe it so perfect now the
[04:01:01] next big decision is what do we have
[04:01:04] here do we have a effect or Dimension
[04:01:06] what do you think well as you can see
[04:01:07] here again we have a lot of descriptions
[04:01:10] about the products so all those
[04:01:12] columns are describing the business
[04:01:14] object products we don't have here
[04:01:17] transactions events a lot of different
[04:01:19] keys and IDs so we don't really have
[04:01:22] here a fact we have a dimension each
[04:01:24] row is describing exactly one object
[04:01:27] describing one product that's why this
[04:01:29] is a dimension okay so now since this is
[04:01:32] a dimension we have to go and create a
[04:01:34] primary key for it well actually the
[04:01:36] surrogate key and as we have done it for
[04:01:38] the customers we're going to go and use
[04:01:40] the window function row number in order
[04:01:42] to generate it over and then we have to
[04:01:44] sort the data I will go with the start
[04:01:47] date so let's go with the start date
[04:01:49] and as well the product key and we're
[04:01:53] going to give it the name product key like
[04:01:56] this so let's go and execute it with
[04:01:59] that we have now generated a primary key
[04:02:02] for each product and we're going to be
[04:02:05] using it in order to connect our data
[04:02:07] model all right now in the next step
[04:02:08] we're going to go and build the
[04:02:10] view so we're going to say create view
[04:02:13] we're going to say gold
[04:02:14] dot dim products and then AS so
[04:02:18] let's go and create our objects and now
[04:02:20] if you go and refresh the views you will
[04:02:22] see our second object the second
[04:02:25] dimension so we have here in the gold
[04:02:26] layer the dimension products and as
[04:02:29] usual we're going to go and have a look
[04:02:30] at this view just to make sure that
[04:02:33] everything is fine so dim products so
[04:02:37] let's execute it and by looking at the
[04:02:39] data everything looks nice so with that
[04:02:41] we have now two dimensions
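Generating the surrogate key with the `ROW_NUMBER` window function and wrapping the query in a view can be sketched as below. This is an illustration under assumptions rather than the exact statement from the video: it runs in SQLite from Python (window functions need SQLite 3.25 or newer), the view is named `gold_dim_products` because SQLite has no `gold` schema, and the columns and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE silver_crm_prd_info (
    prd_id INTEGER, prd_key TEXT, prd_nm TEXT, prd_start_dt TEXT, prd_end_dt TEXT
);
INSERT INTO silver_crm_prd_info VALUES
    (2, 'HL-U509', 'Helmet (old)', '2012-07-01', '2013-06-30'),
    (3, 'HL-U509', 'Helmet',       '2013-07-01', NULL),
    (4, 'PD-M282', 'Pedal',        '2012-07-01', NULL);

-- The gold object is a view, so there is no load process to run.
CREATE VIEW gold_dim_products AS
SELECT
    ROW_NUMBER() OVER (ORDER BY pn.prd_start_dt, pn.prd_key) AS product_key,  -- surrogate key
    pn.prd_id       AS product_id,
    pn.prd_key      AS product_number,
    pn.prd_nm       AS product_name,
    pn.prd_start_dt AS start_date
FROM silver_crm_prd_info AS pn
WHERE pn.prd_end_dt IS NULL;  -- current records only
""")

dim_rows = conn.execute(
    "SELECT * FROM gold_dim_products ORDER BY product_key"
).fetchall()

for row in dim_rows:
    print(row)
```

One design point to keep in mind: because a view recomputes `ROW_NUMBER` on every read, the key values can shift when the underlying rows change, which is the trade-off of this simple approach compared with a key stored in a table.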
[04:02:47] all right friends so with that we have
[04:02:49] covered a lot of stuff so we have
[04:02:50] covered the customers and the products
[04:02:52] and we are left with only one table
[04:02:55] where we have the transactions the sales
[04:02:57] and for the sales information we have
[04:02:59] only data from the CRM we don't have
[04:03:01] anything from the Erp so let's go and
[04:03:03] build it okay so now I have all those
[04:03:05] columns and now of course we have
[04:03:06] only one table we don't have to do any
[04:03:08] Integrations and so on and now we have
[04:03:10] to answer the big question do we have
[04:03:12] here a dimension or a fact well by
[04:03:14] looking to those details we can see
[04:03:16] transactions we can see events we have a
[04:03:19] lot of date information we have as
[04:03:21] well a lot of measures and metrics and
[04:03:23] as well we have a lot of IDs so it is
[04:03:26] connecting multiple dimensions and this
[04:03:28] is exactly a perfect setup for effects
[04:03:31] so we're going to go and use those
[04:03:32] informations as effects and of course as
[04:03:35] we learned effect is connecting multiple
[04:03:37] Dimensions we have to present in this
[04:03:39] fact the surrogate keys that comes from
[04:03:42] the dimensions so those two informations
[04:03:44] the product key and the customer ID
[04:03:47] those informations comes from the searce
[04:03:49] system and as we learned we want to
[04:03:50] connect our data model using the surate
[04:03:53] keys so what we're going to do we're
[04:03:54] going to replace those two informations
[04:03:56] with the surate keys that we have
[04:03:58] generated and in order to do that we
[04:04:00] have to go and join now the two
[04:04:02] dimensions in order to get the surate
[04:04:05] key and we call this process of course
[04:04:07] data lookup so we are joining the tables
[04:04:10] in order only to get one information so
[04:04:12] let's go and do that we will go with the
[04:04:14] lift joint of course not to lose any
[04:04:16] transaction so first we're going to go
[04:04:18] and join it with the product key now of
[04:04:20] course in the silver layer we don't have
[04:04:22] any surrogate keys we have them in the gold
[04:04:25] layer so that means for the fact table
[04:04:27] we're going to be joining the silver
[04:04:29] layer together with the gold layer so
[04:04:31] gold dot and then the dimension
[04:04:34] products and I'm going to just call it
[04:04:36] PR and we're going to join the SD using
[04:04:39] the product key together with the
[04:04:42] product number
[04:04:44] from the dimension and now the only
[04:04:46] information that we need from the
[04:04:48] dimension is the key the surrogate key so
[04:04:51] we're going to go over here and say
[04:04:53] product key and what I'm going to do I'm
[04:04:56] going to go and remove this information
[04:04:58] from here because we don't need it we
[04:04:59] don't need the original product key from
[04:05:01] the source system we need the surrogate
[04:05:03] key that we have generated on our own in
[04:05:05] this data warehouse so the same thing
[04:05:07] going to happen as well for the customer
[04:05:09] so gold Dimension customer again again
[04:05:13] we are doing here a look up in order to
[04:05:16] get the information on SD so we are
[04:05:19] joining using this ID over here equal to
[04:05:23] the customer ID because this is a
[04:05:26] customer ID and what we're going to do
[04:05:28] the same thing we need the circuit key
[04:05:31] the customer key and we're going to
[04:05:33] delete the ID because we don't need it
[04:05:35] now we have the circuit key so now let's
[04:05:37] go and execute it and now with that we
[04:05:40] have in our fact table the two keys from
[04:05:42] the dimensions and now this can help us
[04:05:45] to connect the data model to connect the
[04:05:47] facts with the dimensions so this is
[04:05:49] a very necessary step in building the fact
[04:05:51] table you have to put the surrogate keys
[04:05:53] from the dimensions in the facts so that
[04:05:55] was actually the hardest part of building
[04:05:57] the facts now in the next step all you
[04:05:59] have to do is to go and give friendly
[04:06:01] names so we're going to go over here and
[04:06:03] say order number then the surrogate keys
[04:06:06] are already friendly so we're going to
[04:06:08] go over here and say this is the order
[04:06:10] date and the next one going to be
[04:06:13] shipping
[04:06:14] date and then the next one due date and
[04:06:18] the sales going to be I'm going to say
[04:06:21] sales
[04:06:22] amount the
[04:06:24] quantity and the final one is the price
[04:06:28] so now let's go and execute it and look
[04:06:30] to the results so now as you can see the
[04:06:32] columns look very friendly and now
[04:06:34] about the order of the columns we use
[04:06:36] the following schema so first in the
[04:06:38] fact table we have all the surrogate
[04:06:40] keys from the dimensions then second we
[04:06:42] have all the dates and at the end you
[04:06:45] group up all the measures and the
[04:06:47] metrics at the end of the facts so
[04:06:49] that's it for the query for the facts
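The lookup described above can be sketched as follows. This is a minimal illustration, not the project's actual script: SQLite (via Python's `sqlite3`) stands in for SQL Server so the query can run anywhere, and every table name, column name, and sample row is an assumption modeled on the narration. The third sales row deliberately references a product that is missing from the dimension, to show that the left join keeps the transaction and leaves its surrogate key empty.

```python
import sqlite3

# SQLite stands in for SQL Server; all table/column names and rows below
# are assumptions for illustration, not the project's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gold_dim_products  (product_key INTEGER, product_number TEXT);
CREATE TABLE gold_dim_customers (customer_key INTEGER, customer_id INTEGER);
CREATE TABLE silver_sales_details (
    sls_ord_num TEXT, sls_prd_key TEXT, sls_cust_id INTEGER,
    sls_order_dt TEXT, sls_ship_dt TEXT, sls_due_dt TEXT,
    sls_sales INTEGER, sls_quantity INTEGER, sls_price INTEGER
);
INSERT INTO gold_dim_products  VALUES (1, 'BK-100'), (2, 'HL-200');
INSERT INTO gold_dim_customers VALUES (1, 11000), (2, 11001);
INSERT INTO silver_sales_details VALUES
    ('SO1', 'BK-100', 11000, '2024-01-05', '2024-01-12', '2024-01-17', 300, 1, 300),
    ('SO2', 'HL-200', 11001, '2024-01-06', '2024-01-13', '2024-01-18',  70, 2,  35),
    ('SO3', 'XX-999', 11001, '2024-01-07', '2024-01-14', '2024-01-19',  10, 1,  10);
""")

FACT_QUERY = """
SELECT
    sd.sls_ord_num  AS order_number,
    pr.product_key,                  -- surrogate key looked up from the dimension
    cu.customer_key,                 -- surrogate key looked up from the dimension
    sd.sls_order_dt AS order_date,
    sd.sls_ship_dt  AS shipping_date,
    sd.sls_due_dt   AS due_date,
    sd.sls_sales    AS sales_amount,
    sd.sls_quantity AS quantity,
    sd.sls_price    AS price
FROM silver_sales_details sd
LEFT JOIN gold_dim_products  pr ON sd.sls_prd_key = pr.product_number
LEFT JOIN gold_dim_customers cu ON sd.sls_cust_id = cu.customer_id
ORDER BY sd.sls_ord_num
"""

fact_rows = conn.execute(FACT_QUERY).fetchall()
for row in fact_rows:
    print(row)
```

The source-system columns (`sls_prd_key`, `sls_cust_id`) are used only in the join conditions and never appear in the output, which is the replacement of source keys by surrogate keys that the narration describes.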
[04:06:51] now we can go and build it so we're
[04:06:53] going to say create a view gold in the
[04:06:57] gold layer and this time we're going to
[04:06:59] use the fact underscore prefix and we're going
[04:07:01] to go and call it sales and then don't
[04:07:03] forget about the AS so that's it let's
[04:07:05] go and create it perfect now we can see
[04:07:08] the facts so with that we have three
[04:07:10] objects in the gold layer we have two
[04:07:12] dimensions and one fact and now of
[04:07:14] course the next step with this we're
[04:07:15] going to go and check the quality of the
[04:07:18] view so let's have a simple
[04:07:21] select fact sales so let's execute it
[04:07:25] now by checking the result you can see
[04:07:27] it is exactly like the result from the
[04:07:29] query and everything looks nice okay so
[04:07:32] now one more trick that I usually do
[04:07:33] after building a fact is to try to connect
[04:07:36] the whole data model in order to find
[04:07:38] any issues so let's go and do that we
[04:07:39] will do just a simple left join with the
[04:07:42] dimensions so gold dimension customers c
[04:07:48] and we will use the
[04:07:50] [Music]
[04:07:51] keys and then we're going to say where
[04:07:54] customer key is null so there is no
[04:07:56] matching so let's go and execute this
[04:07:59] and with that as you can see in the
[04:08:00] results we are not getting anything that
[04:08:02] means everything is matching perfectly
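A runnable sketch of this connectivity check, under the same caveats: SQLite replaces SQL Server, and the table and column names are assumed rather than taken from the project. The query returns fact rows whose customer key finds no partner in the dimension, so an empty result means the model connects cleanly. To show what a failure looks like, the sketch then adds one deliberately broken fact row.

```python
import sqlite3

# Assumed names and sample data; SQLite is a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gold_dim_customers (customer_key INTEGER, customer_id INTEGER);
CREATE TABLE gold_fact_sales (
    order_number TEXT, product_key INTEGER,
    customer_key INTEGER, sales_amount INTEGER
);
INSERT INTO gold_dim_customers VALUES (1, 11000), (2, 11001);
INSERT INTO gold_fact_sales VALUES ('SO1', 1, 1, 300), ('SO2', 2, 2, 70);
""")

# Fact rows that cannot be connected to the customer dimension.
ORPHAN_CHECK = """
SELECT f.order_number, f.customer_key
FROM gold_fact_sales f
LEFT JOIN gold_dim_customers c ON c.customer_key = f.customer_key
WHERE c.customer_key IS NULL
"""

clean_result = conn.execute(ORPHAN_CHECK).fetchall()
print("orphans in clean data:", clean_result)

# Simulate a defect: a fact row pointing at a customer key that does not exist.
conn.execute("INSERT INTO gold_fact_sales VALUES ('SO3', 1, 99, 10)")
broken_result = conn.execute(ORPHAN_CHECK).fetchall()
print("orphans after bad insert:", broken_result)
```

The same pattern applies to any other dimension: join on its surrogate key and filter for rows where the dimension side is null.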
[04:08:05] and we can do as well the same thing
[04:08:07] with the products so left join dim
[04:08:11] products p
[04:08:14] on product key and then we connect it
[04:08:17] with the fact's product key and then we're
[04:08:20] going to go and check the product key
[04:08:22] from the dimension like this so we are
[04:08:24] checking whether we can connect the
[04:08:26] facts together with the dimension
[04:08:28] products let's go and check and as you
[04:08:30] can see as well we are not getting
[04:08:31] anything and this is all right so with
[04:08:33] that we have now SQL code that is
[04:08:35] tested and is as well creating the gold
[04:08:38] layer now in the next step as you know
[04:08:40] in our requirements we have to make
[04:08:42] clear documentation for the end users
[04:08:44] in order to use our data model so let's
[04:08:46] go and draw a data model of the star
[04:08:52] schema so let's go and draw our data
[04:08:54] model let's go and search for a table
[04:08:57] and now what I'm going to do I'm going
[04:08:58] to go and take this one where I can say
[04:09:00] what is the primary key and what is the
[04:09:03] foreign key and I'm going to go and change
[04:09:05] the design a little bit so it's going to
[04:09:06] be rounded and let's say I'm going to go
[04:09:08] and change to this color and maybe go to
[04:09:11] the size make it 16 and then I'm going
[04:09:13] to go and select all the columns and
[04:09:15] make it as well 16 just to increase the
[04:09:18] size and then go to Arrange and we can
[04:09:21] go and increase it to 39 so now let's go
[04:09:24] and zoom in a little bit for the first
[04:09:26] table let's go and call it gold
[04:09:28] Dimension customers and make it a little
[04:09:31] bit bigger like this and now we're going
[04:09:33] to go and Define here the primary key it
[04:09:35] is the customer key and what else we're
[04:09:37] going to do we're going to go and list
[04:09:38] all the columns in the dimension it is
[04:09:40] a little bit annoying but the results are
[04:09:42] going to be awesome so what do we
[04:09:43] have the customer ID we have the
[04:09:46] customer number and then we have the
[04:09:49] first name now in case you want new
[04:09:52] rows you can hold Control and Enter
[04:09:55] and you can go and add the other columns
[04:09:56] so now pause the video and then go and
[04:09:59] create the two Dimensions the customers
[04:10:00] and the products and add all the columns
[04:10:03] that you have built in the
[04:10:04] [Music]
[04:10:08] view welcome back so now I have those
[04:10:11] two dimensions the third one is going
[04:10:13] to be the fact table now for the fact
[04:10:16] table I'm going to go with a different
[04:10:17] color for example the blue and I'm going
[04:10:19] to go and put it in the middle something
[04:10:21] like this so we're going to say gold
[04:10:24] fact sales and here we don't
[04:10:27] have a primary key so we're going to go
[04:10:29] and delete it and I have to go and add
[04:10:31] all the columns of the facts so order
[04:10:33] number product key customer key okay
[04:10:37] all right perfect now what we can do we
[04:10:39] can go and add the foreign key
[04:10:41] information so the product key is a
[04:10:42] foreign key for the products so
[04:10:44] you're going to say FK1 and the customer
[04:10:46] key is going to be the foreign key for the
[04:10:48] customers so FK2 and of course you can
[04:10:50] go and increase the spacing for that
[04:10:53] okay so now after we have the tables the
[04:10:55] next step in data modeling is to go and
[04:10:57] describe the relationship between these
[04:10:59] tables this is of course very important
[04:11:00] for reporting and analytics in order to
[04:11:03] understand how I'm going to go and use
[04:11:05] the data model and we have different
[04:11:06] types of relationships we have one to
[04:11:08] one one to many and in a star schema data
[04:11:10] model the relationship between the
[04:11:12] dimension and the fact is one to many
[04:11:15] and that's because in the table
[04:11:16] customers we have for a specific
[04:11:18] customer only one record describing the
[04:11:20] customer but in the fact table the
[04:11:22] customer might exist in multiple records
[04:11:25] and that's because customers can order
[04:11:27] multiple times so that's why on the fact side it
[04:11:29] is many and on the dimension side it is
[04:11:32] one now in order to see all those
[04:11:33] relationships we're going to go to the
[04:11:35] menu to the left side and as you can see
[04:11:37] we have here entity relations and now
[04:11:39] you have different types of arrows so
[04:11:41] here for example we have zero to many
[04:11:43] one one to many one to one and many
[04:11:45] different types of relations so now
[04:11:47] which one are we going to take we're going
[04:11:49] to go and pick this one so it says
[04:11:50] one mandatory so that means the customer
[04:11:53] must exist in the dimension table to
[04:11:55] many but it is optional so here we have
[04:11:57] three scenarios the customer didn't
[04:11:59] order anything or the customer did order
[04:12:01] only once or the customer did order many
[04:12:04] things so that's why in the fact table
[04:12:06] it is optional so we're going to take
[04:12:08] this one and place it over here so we're
[04:12:10] going to go and connect this part to the
[04:12:14] customer dimension and the many part to
[04:12:16] the facts well actually we have to do it
[04:12:19] on the customers so with that we are
[04:12:21] describing the relationship between the
[04:12:22] dimensions and fact as one to many one
[04:12:25] is mandatory for the customer dimension
[04:12:27] and many is optional for the facts so we
[04:12:30] have the same story as well for the
[04:12:31] products so the many part to the facts
[04:12:35] and the one goes to the products so it's
[04:12:38] going to look like this each time you
[04:12:39] are connecting a new dimension to the fact
[04:12:41] table it is usually a one to many
[04:12:44] relationship so you can go and add
[04:12:45] anything you want to this model like for
[04:12:47] example a text like explaining something
[04:12:50] for example if you have some complicated
[04:12:52] calculations and so on you can go and
[04:12:54] write this information over here so for
[04:12:56] example we can say over here sales
[04:12:58] calculation we can make it a little bit
[04:13:00] smaller so let's go with 18 so we can go
[04:13:03] and write here the formula for that so
[04:13:06] sales equals quantity multiplied by the
[04:13:09] price and make this a little bit bigger
[04:13:13] so it is really nice info that we can
[04:13:15] add to the data model and we can even
[04:13:17] go and link it to the column so we can
[04:13:20] go and take this arrow for example with
[04:13:22] it like this and Link it to the column
[04:13:24] and with that you have as well nice
[04:13:26] explanation about the business rule or
[04:13:28] the calculation so you can go and add
[04:13:30] any descriptions that you want to the
[04:13:32] data model just to make it clear for
[04:13:34] anyone that is using your data model so
[04:13:36] with that you don't have only like three
[04:13:38] tables in the database you have as well
[04:13:40] some kind of documentation and
[04:13:42] explanation at one glance we can see how
[04:13:45] the data model is built and how you can
[04:13:47] connect the tables together it is
[04:13:49] really amazing for all users of your
[04:13:50] data model all right so now with that we
[04:13:52] have really nice data model and now in
[04:13:55] The Next Step we're going to go and
[04:13:56] create quickly a data
[04:14:01] catalog all right great so with that we
[04:14:03] have a data model and we can say we have
[04:14:05] something called a data product and we
[04:14:07] will be sharing this data product with
[04:14:10] different types of users and there's
[04:14:11] something that every data
[04:14:13] product absolutely needs and that is the
[04:14:16] data catalog it is a document that can
[04:14:18] describe everything about your data
[04:14:20] model The Columns the tables maybe the
[04:14:23] relationship between the tables as well
[04:14:25] and with that you make your data product
[04:14:26] clear for everyone and it's going to be
[04:14:28] for them way easier to derive more
[04:14:31] insights and reports from your data
[04:14:33] product and what is the most important
[04:14:34] one it is time-saving because if you
[04:14:37] don't do that what can happen each
[04:14:39] consumer each user of your data product
[04:14:41] will keep asking you the same
[04:14:43] questions about what do you mean with
[04:14:44] this column what is this table how to
[04:14:46] connect the table a with the table B and
[04:14:48] you will keep repeating yourself and
[04:14:50] explaining stuff so instead of that you
[04:14:52] prepare a data catalog a data model and
[04:14:55] you deliver everything together to the
[04:14:57] users and with that you are saving a lot
[04:14:59] of time and stress I know it is annoying
[04:15:01] to create a data catalog but it is
[04:15:03] an investment and a best practice so now
[04:15:05] let's go and create one okay so now in
[04:15:07] order to do that I have created a new
[04:15:08] file called data catalog in the folder
[04:15:11] documents and here what we're going to
[04:15:12] do is very straightforward we're
[04:15:13] going to make a section for each table
[04:15:15] in the gold layer so for example we have
[04:15:17] here the table dimension customers what
[04:15:19] you have to do first is to describe this
[04:15:21] table so we are saying it stores details
[04:15:23] about the customers with the
[04:15:25] demographic and geographic data so you
[04:15:27] give a short description for the table
[04:15:29] and then after that you're going to go
[04:15:31] and list all your columns inside this
[04:15:33] table and maybe as well the data type
[04:15:34] but what is far more important is the
[04:15:36] description for each column so you give
[04:15:38] a very short description like for
[04:15:40] example here the gender of the customer
[04:15:43] now one of the best practices of
[04:15:44] describing a column is to give examples
[04:15:46] because you can understand quickly the
[04:15:49] purpose of the columns by just seeing an
[04:15:50] example right so here we are saying we
[04:15:52] can find inside it male female and not
[04:15:55] available so with that the consumer of
[04:15:56] your table can immediately understand
[04:15:58] aha it will not be an M or an F it's
[04:16:01] going to be a full friendly value
[04:16:02] without them having to go and query the
[04:16:04] content of the table they can understand
[04:16:06] quickly the purpose of the column so
[04:16:08] with that we have a full description for
[04:16:09] all the columns of our Dimension the
[04:16:12] same thing we're going to do for the
[04:16:13] products so again a description for the
[04:16:15] table and as well a description for each
[04:16:17] column and the same thing for the facts
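The catalog layout described here (a section per table, a short purpose line, then each column with its data type and an example-driven description) can be sketched as a small Markdown generator. The video writes the catalog by hand, so this is only one possible way to keep the format consistent; the `render_catalog` helper, the entry structure, and the sample descriptions are all hypothetical.

```python
# Hypothetical catalog entries; table, columns, types, and wording are
# illustrative assumptions, not copied from the project's catalog file.
catalog = {
    "gold.dim_customers": {
        "purpose": "Stores customer details enriched with demographic and geographic data.",
        "columns": [
            ("customer_key", "INT", "Surrogate key identifying each customer record."),
            ("gender", "NVARCHAR(50)",
             "Gender of the customer (e.g., 'Male', 'Female', 'Not Available')."),
        ],
    },
}


def render_catalog(entries):
    """Render one Markdown section per table, with a column table underneath."""
    lines = []
    for table, info in entries.items():
        lines.append(f"### {table}")
        lines.append(f"- Purpose: {info['purpose']}")
        lines.append("")
        lines.append("| Column Name | Data Type | Description |")
        lines.append("|---|---|---|")
        for name, dtype, description in info["columns"]:
            lines.append(f"| {name} | {dtype} | {description} |")
        lines.append("")
    return "\n".join(lines)


catalog_md = render_catalog(catalog)
print(catalog_md)
```

Putting example values directly in the description column follows the advice above: a reader sees that `gender` holds full, friendly values without having to query the table.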
[04:16:20] so that's it with that you have a
[04:16:22] data catalog for your data product at
[04:16:24] the gold layer and with that the
[04:16:26] business user or the data analyst has a
[04:16:28] better and clearer understanding of the
[04:16:30] content of your gold layer all right my
[04:16:32] friends so that's all for the data
[04:16:33] catalog in the next step we're going to
[04:16:35] go back to Draw.io where we're going to
[04:16:37] finalize the data flow diagram so let's
[04:16:40] go
[04:16:44] okay so now we're going to go and extend
[04:16:46] our data flow diagram but this time for
[04:16:48] the gold layer so now let's go and copy
[04:16:51] the whole thing from the silver layer
[04:16:52] and put it over here side by side and of
[04:16:55] course we're going to go and change the
[04:16:56] coloring to the gold and now we're going
[04:16:58] to go and rename stuff so this is the
[04:17:02] gold layer but now of course we cannot
[04:17:04] leave those tables like this we have
[04:17:06] completely new data model so what do we
[04:17:08] have over here we have the fact sales we
[04:17:11] have dimension customers and as well we
[04:17:14] have Dimension products so now what I'm
[04:17:18] going to do I'm going to go and remove
[04:17:19] all that stuff we have only three
[04:17:21] tables and let's go and put those three
[04:17:23] tables somewhere here in the center so
[04:17:25] now what you have to do is to go and
[04:17:26] start connecting that stuff I'm going
[04:17:28] to go with this arrow over here direct
[04:17:31] connection and start connecting stuff so
[04:17:34] the sales details goes to the fact table
[04:17:36] maybe put the fact table over here and
[04:17:38] then we have the dimension customer this
[04:17:40] comes from the CRM customer info and
[04:17:43] we have two tables from the ERP it comes
[04:17:47] from this table as well and the location
[04:17:49] from the ERP now the same thing goes for
[04:17:52] the products it comes from the product
[04:17:55] info and comes from the categories from
[04:17:58] the ERP now as you can see here we have
[04:18:00] crossing arrows so what we're going to do we
[04:18:01] can go and select everything and we can
[04:18:03] say line jumps with a gap and this makes
[04:18:06] it a little bit prettier visually
[04:18:08] for the arrows so now for example if
[04:18:10] someone asks you where the data comes
[04:18:12] from for the dimension products you can
[04:18:14] open this diagram and tell them okay
[04:18:16] this comes from the silver layer we have
[04:18:19] like two tables the product info from
[04:18:21] the CRM and as well the categories from
[04:18:23] the ERP and those silver tables come
[04:18:25] from the bronze layer and you can see the
[04:18:27] product info comes from the CRM and the
[04:18:30] category comes from the ERP so it is
[04:18:32] very simple we have just created a full
[04:18:34] data lineage for our data warehouse from
[04:18:36] the sources into the different layers in
[04:18:38] our data warehouse and data lineage
[04:18:40] is really amazing documentation that's
[04:18:42] going to help not only your users but as
[04:18:44] well the developers all right so with
[04:18:46] that we have a very nice data flow diagram
[04:18:47] and a data lineage all right so we have
[04:18:50] completed the data flow it really feels
[04:18:52] like progress like achievement as we are
[04:18:54] clicking through all those tasks and now
[04:18:56] we come to the last task in building the
[04:18:58] data warehouse where we're going to go
[04:19:00] and commit our work in the Git
[04:19:05] repo okay so now let's put our scripts
[04:19:08] in the project so we're going to go to
[04:19:09] the scripts over here we have here
[04:19:11] bronze silver but we don't have a gold
[04:19:12] so let's go and create a new file we're
[04:19:14] going to have gold/ and then we're going
[04:19:16] to say ddl_gold.sql so now we're going
[04:19:19] to go and paste our views so we have
[04:19:22] here our three views and as usual at the
[04:19:24] start we're going to describe the purpose
[04:19:26] of the views so we are saying create
[04:19:28] gold views this script can go and create
[04:19:30] views for the gold layer and the gold
[04:19:32] layer represents the final dimension and
[04:19:34] fact tables the star schema each view
[04:19:36] performs transformations and combines
[04:19:38] data from the silver layer to produce
[04:19:40] business ready data sets and those views
[04:19:42] can be used for analytics and reporting
[04:19:44] so that's it let's go and commit it okay
[04:19:47] so with that as you can see we have the
[04:19:49] bronze the silver so we have all our ETLs
[04:19:53] and scripts in the repository and now as
[04:19:56] well for the gold layer we're going to
[04:19:57] go and add all those quality checks that
[04:19:59] we have used in order to validate the
[04:20:01] dimensions and facts so we're going to
[04:20:03] go to the tests over here and we're
[04:20:05] going to go and create a new file it's
[04:20:06] going to be quality checks gold and the
[04:20:10] file type is SQL so now let's go and
[04:20:12] paste our quality checks so we have the
[04:20:14] check for the fact the two dimensions
[04:20:17] and as well an explanation about the
[04:20:19] script so we are validating the
[04:20:20] integrity and the accuracy of the gold
[04:20:22] layer and here we are checking the
[04:20:23] uniqueness of the surrogate keys and
[04:20:25] whether we are able to connect the data
[04:20:27] model so let's put that as well in our
[04:20:29] git and commit the changes and in case
[04:20:32] we come up with new quality checks
[04:20:34] we're going to go and add them to our
[04:20:35] script here so those checks are really
[04:20:37] important if you are modifying the ETLs
[04:20:39] or you want to make sure that after each
[04:20:41] ETL those scripts should run and so on
[04:20:43] it is like a quality gate to make sure
[04:20:46] that everything is fine in the gold
[04:20:47] layer perfect so now we have our code in
[04:20:50] our repository okay friends so now what
[04:20:52] you have to do is to go and finalize the
[04:20:54] Git repo so for example all the
[04:20:56] documentation that we have created
[04:20:58] during the project we can go and upload
[04:21:01] them in the docs so for example you can
[04:21:02] see here the data architecture the data
[04:21:04] flow data integration data model and so
[04:21:06] on so with that each time you edit those
[04:21:09] pages you can commit your work and you
[04:21:10] have like a version of that and another
[04:21:12] thing that you can do is that you go to
[04:21:15] the read me like for example over here I
[04:21:17] have added the project overview some
[04:21:19] important links and as well the data
[04:21:21] architecture and a little description of
[04:21:23] the architecture of course and of course
[04:21:25] don't forget to add a few words about
[04:21:27] yourself and important profiles on the
[04:21:29] different social media all right my
[04:21:31] friends so with that we have completed
[04:21:32] our work and as well closed the last
[04:21:35] epic building the gold layer and with
[04:21:37] that we have completed all the phases of
[04:21:40] building a data warehouse everything is
[04:21:42] 100% and this feels really nice all
[04:21:45] right my friends so if you're still here
[04:21:47] and you have built with me the data
[04:21:49] warehouse then I can say I'm really
[04:21:51] proud of you you have built something
[04:21:53] really complex and amazing because
[04:21:55] building a data warehouse is usually a
[04:21:57] very complex data project and with that
[04:21:59] you have not only learned SQL but you
[04:22:01] have learned as well how we do
[04:22:04] complex data projects in the real world so with that
[04:22:06] you have real knowledge and as well
[04:22:09] an amazing portfolio that you can share
[04:22:10] with others if you are applying for a
[04:22:12] job or if you want to showcase that you have
[04:22:14] learned something new and with that you
[04:22:15] have experienced different roles in the
[04:22:17] project what the data architects and the
[04:22:19] data engineers do in complex data
[04:22:21] projects so that was really an amazing
[04:22:23] journey even for me as I'm creating this
[04:22:25] project so now with that
[04:22:27] you have done the first type of data
[04:22:29] analytics projects using SQL the data
[04:22:31] warehousing now in the next step we're
[04:22:33] going to do another type of projects the
[04:22:35] exploratory data analysis EDA where
[04:22:37] we're going to understand and explore
[04:22:39] our data sets if you like this video and
[04:22:41] you want me to create more content like
[04:22:43] this I'm going to really appreciate it
[04:22:45] if you support the channel by
[04:22:47] subscribing liking sharing commenting
[04:22:50] all that stuff is going to help the
[04:22:51] channel with the YouTube algorithm and
[04:22:53] as well my content is going to reach
[04:22:55] others so thank you so much for watching
[04:22:58] and I will see you in the next tutorial
[04:23:00] bye