# The Data Movie | Data Literacy Explained Visually

https://www.youtube.com/watch?v=J2rQTJby8XM

[00:00] Hey friends, if you are new to the world of the data, you might start hearing many terms and buzzwords that might be confusing.
[00:05] I have been explaining those terms using animated sketches on my iPad with Procreate.
[00:11] All of them are handdrawn, so nothing is AI generated.
[00:13] So now I have collected everything in one video.
[00:16] So grab some popcorn, sit back and enjoy the data movie.
[00:20] Let's go.
[00:30] The role of the data analyst.
[00:30] So what is a data analyst?
[00:32] Well, it is very simple.
[00:35] You answer the business questions using data.
[00:37] Think about yourself as a bridge between the raw data and the real business decisions.
[00:42] So now I can say that I have joined like around seven data teams and projects.
[00:46] And now I would like you to understand exactly what data analysts do in real companies.
[00:51] So now it is a story time.
[00:54] Let's go.
[00:54] Okay.
[00:56] So now in companies in business we have managers, stakeholders, project leads
[01:01] and their whole job is to make smart decisions by asking critical questions.
[01:05] So for example, which region is underperforming?
[01:07] Why are profits down?
[01:10] And should we invest here?
[01:12] So there are many challenges in the business and they need quick and smart decisions.
[01:16] So now without using data, they are just guessing.
[01:18] They rely on opinions, gut feelings and many outdated informations left and right.
[01:24] And there is famous quote from William Deming I really like and use a lot without data you are just another person with an opinion.
[01:30] So that means if your decision process is based on opinions this has high chance to lead to confusion wasted resources and making bad decisions for the business.
[01:44] I think that pretty much sums it up.
[01:46] So now instead of guessing they have to decide based on data.
[01:48] So now in companies the data are scattered everywhere.
[01:50] Your data is stored in many different databases.
[01:52] There are a lot of spreadsheets and maybe as well APIs that provides data.
[01:57] So they are everywhere
[02:01] and the managers of course don't have the time or the skills in order to dive into all those places.
[02:07] So that's why they go and hire an expert specialist.
[02:10] They hire the data analyst.
[02:12] So think about it like you are a detective and you start gathering data from different sources.
[02:17] So you go and query the different databases and as well pull the data from the spreadsheets and the different APIs.
[02:24] So you have to go and get the data wherever they lives.
[02:26] So of course all those data are messy and you have to go and clean and structure those data.
[02:31] So maybe you're going to go and put the data everything in one spreadsheet like in Excel and you start doing a lot of stuff on the data cleaning up the data organizing it doing a lot of calculations transformations to get the data ready to answer the business question.
[02:44] Now after you find something meaningful from the data you start turning the data into insights into a visual reports because it is way easier to communicate the result to the stakeholders and managers using visuals.
[02:57] So that means once you are ready you go to the managers and you start presenting the result as a story using the report
[03:04] and visuals that you have prepared.
[03:06] Now the managers got facts and this time they have answers to their questions using data.
[03:12] So now they are more confident and they make better and smarter decisions for the business.
[03:16] So now you can see clearly what the data analyst does.
[03:18] He or she is the bridge between the raw data on the left side and the business on the right side and all what he is doing is answering the critical business questions using the raw data and as well it is very clear which skills this analyst needs.
[03:33] So as we understood first you will be querying the databases in order to pull the data and in order to talk to the data and the databases you use SQL or sometimes we call it SQL and we have understood as well once you collect all the data you will be using spreadsheet like Excel in order to clean sort filter and do calculations on the data so they have to master Excel and after analyzing the data using the spreadsheet you need a data visualization skills using tools like maybe PowerBI or Tableau To create charts and clear visuals to be reported
[04:06] to the managers and of course one very important skill and crucial you must clearly understand the business questions from the managers and communicate the results and the findings very effectively.
[04:16] So you have to master this skill.
[04:18] You have to tell a story behind the data.
[04:20] So you need really very good communication skills.
[04:25] So that means if you master those skills you will be an amazing data analyst that's going to help the company and the managers making the smart decisions.
[04:30] And now my friends we have an issue.
[04:33] This might work for a small company but it will not work for a modern big companies and here's why.
[04:39] Now big companies generates big data and it's going to be really hard for the data analyst to manually extract the data for analyzers because you're going to need a lot of data and your request going to take like hours and even days to get the data and it might crash and it going to be sometimes impossible to get the data.
[04:55] So it's going to be for you as a data analyst really hard to answer any questions and your process going to take weeks until you give any answers and of course managers cannot wait for weeks in
[05:07] order to get like an answer.
[05:09] So my friend this approach will not work.
[05:11] This is simply does not scale.
[05:14] So now we have to go and scale things and hire new people.
[05:16] Now a data architect it's like an architect of a building.
[05:18] You have to design the blueprint of a scalable data system and you have to go and organize the data into multiple layers.
[05:25] Like for example the medelian architecture you have a bronze layer for the raw data and a server layer for a clean data and a gold and very important layer for clean structured and optimized data.
[05:36] So all what the data architect is doing is designing this scalable data system.
[05:40] So now as you can see the data architect is designing this scalable data system but as usual they are not the one that is building it.
[05:49] For that we need another experts and we call them the data engineers.
[05:53] So now as a data engineer you're going to go and build something called data pipelines.
[05:56] Well it's all about to connect to multiple source systems and start automatically moving the data from the sources to the new data system and as well very important
[06:07] The data must move quickly.
[06:09] So the pipeline's going to bring the data to the first layer to the bronze layer as a row data.
[06:13] Nothing fancy going to happen here.
[06:15] Then the data engineer going to move the data to the server layer and there the data get cleaned, structured and prepared for the final layer.
[06:23] Now in the final layer, the data engineer going to go and build a data model which is highly organized and optimized model that is perfectly made for quick analyzes.
[06:34] And we have with that a process that runs automatically from the source systems until the final layer.
[06:39] And this runs every day very fast and in some scenarios as a stream like a real life data.
[06:44] So remember now our data analyst friend.
[06:46] So now, thanks to the data architect and the data engineer, analysts life now is way easier.
[06:50] No more nightmare of pulling the data manually from the source system.
[06:55] No need anymore for the spreadsheets and the Excel lists.
[06:59] Now everything is prepared for the data analyst to BE FAST.
[07:03] HELL YEAH.
[07:03] HEY, COME ON, BABY.
[07:03] COME ON.
[07:07] YES.
[07:07] COME ON.
[07:10] So now again as a data analyst once you get a question from the business you can go quickly and straight away to the code layer where everything is prepared and now using SQL you will be querying the data model in the go layer in order to explore and try to find answers for the question.
[07:24] So once you find it, you will go and build again like a visual report and tell a story for the managers.
[07:30] And with this setup, you only focus on what matters when finding answers for the business.
[07:35] And this can speed the whole process for you and you will be ready to answer quickly the questions.
[07:39] Now I'm really sorry we are not done yet.
[07:43] There is still an issue in this story.
[07:45] Now all you are doing here as a data analyst is that you are answering ad hoc questions and your answers are always like one-time reports.
[07:55] But what can happen for sure is that one or few reports going to deliver really an incredible value and it is not just useful for once and only for the managers.
[08:03] Many people across the company going to be interested in this report.
[08:07] So you're going to have a new user groups that going to ask you one
[08:11] question.
[08:11] I want your report EVERY DAY.
[08:14] NO.
[08:14] GOD PLEASE NO.
[08:14] NO.
[08:14] AND THE nightmare going to come back where you have every day to run the same query and generate the same reports and you're going to daily send this reports to the users and this going to be total waste of your time and going to build huge stress on you and that's why companies going to bring another person to the data team and this time we're going to have the business intelligence developer or BI developer.
[08:39] So once the BI developer joins you as a data analyst going to say hey go and automate this report and make it visible for many users.
[08:49] So first the BI developer going to build an environment like a server where the dashboards and reports going to live.
[08:55] This must be secure and has an access management.
[08:57] Sometimes you cannot give all the data for everyone in the company.
[09:00] And as well this server must be accessible 24/7.
[09:02] So the report and dashboard must be always available.
[09:07] So once the reporting server is there the BI developer going to take everything the report the logic and build from it
[09:13] something called interactive dashboard.
[09:15] and what is very important this dashboard going to be automatically connected to the gold layer to the data model.
[09:22] so that every day and this is important the dashboard and the reports going to get fresh data from the data system from the gold layer.
[09:28] and once everything is live the user is going to go and start requesting access to the reports and start accessing the dashboards and the data.
[09:38] everything is highly automated and scalable.
[09:40] So the data going to flow from the sources into the different layers of our data system.
[09:45] and as well automatically flowing to the reporting server and refreshing the data there.
[09:50] So this is the big differences between the data analyst and the BI developer.
[09:54] The data analyst is the one that is bringing the first report and insights.
[09:58] and once a visual or a report gets very important.
[10:02] The BI developer is the one that is responsible of making this report accessible, scalable and everyday available for many users.
[10:10] Now, so far we have held the business and the managers with questions.
[10:14] that are about the past and the present.
[10:17] What happened?
[10:17] How much did we sell?
[10:19] Why did sale drop?
[10:21] But at some point the managers can ask bigger questions.
[10:23] So, they are advanced questions about the future.
[10:26] What will happen next month?
[10:28] Can we predict which customers might leave us?
[10:30] What can happen if we change our pricing strategy?
[10:34] And to be honest, the data analysts don't have the tools in order to answer them.
[10:36] And for that, we need another expert called the data scientist.
[10:43] This is Sheldor the Conqueror.
[10:45] We are about to enter Axel's fortress.
[10:48] The data scientist is the one that going to use the data in order to build and train models.
[10:52] They going to do a lot of experiments.
[10:54] So at the end the scientists going to use this model in order to give answers for the future.
[10:58] So with that the managers are not just informed they are as well prepared for the future.
[11:02] They are proactively making decisions.
[11:04] They stay ahead of the competition and as well they tackle challenges before they even appear.
[11:08] Now this type of work is very important
[11:14] thing not only for the managers and projects leads this is important for everyone else.
[11:21] Now this model this work from the data scientist must be visible for other users and this is the same thing that happens before right with the BI developer but now we need another type we have the machine learning developer ML developer.
[11:33] So now what they going to do they're going to as well prepare a server and environments to run and deploy the model from the data scientist.
[11:41] So they're going to go and put it in productive environments and everything going to be now automatically secure, scalable and available 24/7.
[11:48] And the result of this model could be presented in like an application to show the result of the productions for everyone or as well sometimes they show it in dashboards and reports.
[11:58] So their job is to make the result of this model available in many services in the company and accessible by other users and build by the ML engineer and with that everyone from the managers to teams across the company benefits from advanced predictions every day and anytime.
[12:13] So now by looking to this
[12:15] everything is highly automated.
[12:17] The data going to flow from the sources to the data system and then flows to the reports and dashboards and as well automatically flow to our model that is trained from the data analyst and built from the ML engineer.
[12:30] So this is what a modern big company built.
[12:33] They build scalable data system.
[12:35] Now if you look to this you can see there is a pattern.
[12:36] You can see the data architect is working closely to the data engineer to build the data system.
[12:43] The data analyst works as well closely to the BI developer in order to build this reporting system and the data scientist together with the ML engineer.
[12:50] They work on building an advanced analytical system.
[12:55] And there is actually another pattern.
[12:56] You can see in those roles there is always somebody is like doing a design like the data architect is like designing the data system and the data analyst is like building the first version of the reports.
[13:09] Same thing for the data scientist is the one that is experimenting and training the first version of the model.
[13:14] So those one are always like trying to discover trying to
[13:17] put the blueprint.
[13:19] And now in the other side we have the other category you have the persons that are building it.
[13:23] bringing it to a productive scalable environment and making things live.
[13:28] Like the data engineer is the one that builds the data pipelines and this data system.
[13:32] The BI developer is bringing the reports and dashboards into scalable and productive reporting system and the machine learning engineer is the one that is bringing the model and deploy it into a productive system and offering the results in different services.
[13:47] So they are the engineers, they are the developers.
[13:51] So this is what a real modern data team looks like.
[13:58] All right.
[14:00] Now the story with the data always start with a problem in the business.
[14:02] So we have a business question like for example how many customers do we have in the last quarter.
[14:08] Now of course it is smart thing to ask the data instead of having gut feelings about the problem.
[14:13] That's why we try to get the answer from the data and for that we hire the data analyst.
[14:16] So now the data
[14:18] and is going to open their tools and start pulling the data from different sources.
[14:21] maybe from the sales system, customers database and product logs and of course they are in between many steps.
[14:27] but at the end the data analyst is building a report a dashboard using BI tools like PowerBI or Tableau and of course using visuals it is always easier to deliver the message the story behind the data for the business users.
[14:42] and now once everything is ready the data analyst going to go and present the result for the business users.
[14:45] but Now this going to scare everyone.
[14:48] The trends is going down.
[14:50] We are losing customers in the last 3 months and everyone starts to panic.
[14:55] And now what can happen?
[14:57] Of course, one of the managers going to ask a very important question.
[14:59] Can we predict how many customers are going to leave in the future?
[15:03] So which customers are likely to leave so that we can take an action maybe in order to prevent that?
[15:09] And now if the managers go and ask the data analyst about the predictions, highly likely you will get an answer like this.
[15:15] Well, I have only PowerBI.
[15:18] I can only show you the data for the
[15:20] current situation or the history, but I can't make something very intelligent in order to predict things in the future.
[15:26] And my friends, exactly for this scenario as we get more complicated and advanced questions.
[15:31] We need a data expert, a data specialist in order to solve this problem.
[15:35] And here comes the data scientist.
[15:37] So that means if the business question is all about the current situation or maybe the past and the history then we need a data analyst in order to do something called descriptive analyszis.
[15:50] But now if the question is looking to the future then we need a data scientist in order to do predictive analyszis and of course this type of analyszis it is way more complex than the descriptive analyzes and for that we need like an intelligence system using tools like PowerBI Tableau SQL will not really help.
[16:06] So now let's see how our friend the data scientist going to solve this problem.
[16:08] Now of course for each problem we need always data.
[16:12] That's why the first step is to collect the data.
[16:16] But now this time the data scientist needs way more data than the data analyst.
[16:18] So anything the company is
[16:22] generating about the customers going to be important in this phase.
[16:26] So the data going to get extracted from databases, logs, spreadsheets.
[16:30] So everything that is related to this issue going to be collected in one place.
[16:34] So now of course all those sources speaks different languages and they structure and store their data differently.
[16:39] So that means the data stinks the data is chaotic and it's time to get hands dirty.
[16:44] So that means in this phase we are doing data preparations or let's say like pre-processing like for example we can go and merge those files together and the sources using joins and we have to check the content of the data.
[16:56] If there are a lot of nulls we can go and replace them with something more meaningful.
[17:01] And another thing that we can do is that we can correct the data types.
[17:05] Maybe remove the duplicates.
[17:07] Maybe they are like columns that are totally useless.
[17:09] Now this tip is really time consuming.
[17:14] Like you spend 70 time of your total projects only doing those stuff only cleaning and preparing the data for the next steps.
[17:18] And of course if you are lucky and have a data engineer like me in your projects, then we're going to go and
[17:23] prepare all those stuff for the data scientist so that the data scientist only focus on the next steps.
[17:29] But you don't have this luxury in each company.
[17:31] That's why mostly you'll end up doing those stuff.
[17:34] All right.
[17:35] So with that, we have now a perfect data set that is clean, structured, and ready for the next steps.
[17:40] And now you might get excited and say yes.
[17:42] Now we're going to go and do some magic, some AI and machine learning, right?
[17:48] Well, I have to stop you there.
[17:49] Before we touch anything about machine learning, we have to do something very important.
[17:51] We have to explore and understand the data.
[17:53] So we have to do something called exploratory data analyszis.
[17:59] So what we're going to do we're going to go and open our notebooks and start asking questions like for example how the data is distributed how we can cluster the customers is there like any outliers what are the relationship between like different measures.
[18:13] So with that you are exploring and understand the content of the data and this is something that we will not go and share with the stakeholders.
[18:19] This step we do it for us the data scientist in order to
[18:25] understand the data.
[18:27] It's like doctors they read the charts before doing any diagnosis.
[18:30] Right?
[18:32] So it's like a thinking phase.
[18:34] You look at the data, ask questions and make notes.
[18:36] So now once we have a good feeling about the data and we feel confident, we can move to the next step.
[18:40] And the next one is very powerful but yet very underrated.
[18:43] Sadly a lot of data scientists skip this.
[18:46] Now the scenario is like this.
[18:48] The data set that we have is yes clean prepared but the data itself it is very row.
[18:53] It is not yet like an information like for example you could have a column called the sign up date of the customer.
[18:59] Now this is very row instead we can make like an extra column.
[19:04] We can derive new column and we calculate the days since the sign up.
[19:10] So that means we are creating new columns.
[19:12] We are deriving informations from the row data and we can create other measures like for example the total spent by customers the average session duration.
[19:21] So we are creating deriving smart columns that
[19:26] don't exist on the original data set and of course they are not easy to create.
[19:30] You need some domain and business knowledge.
[19:33] That's why we have done the ADA phase right and this is very important.
[19:37] I really don't get it why some skip this phase.
[19:39] the quality of the features that you add to the data set gonna decide really the outcomes of your work.
[19:46] So this is what we mean with the feature engineering.
[19:47] We are deriving important informations from our row data set.
[19:50] And now my friends with that we have everything.
[19:53] We are ready and it is magic time.
[19:54] So it is time to do some machine learning.
[19:57] So now the first step is that we're going to go and split our data sets into two sets.
[19:59] The first one is the training set where our model going to learn from.
[20:05] And the second one is a smaller one, the test set.
[20:08] We're going to use it in order to test the performance the output of our model.
[20:17] Because if you train and test at the same data set, basically you are cheating.
[20:19] So we don't do that.
[20:21] We split the data sets.
[20:23] And now the next step is the most important one.
[20:25] We have to
[20:26] choose an algorithm. An algorithm it's
[20:28] like a set of mathematical steps that
[20:31] describes how to learn from data. So
[20:34] basically this is completely mathematics
[20:36] and what we have to do is that to
[20:37] understand those algorithms because we
[20:39] have many different algorithms and we
[20:41] have to pick the right one for the right
[20:43] problem. So now once you picked this
[20:44] algorithm the next step is that to go
[20:46] and apply the training data sets on the
[20:49] algorithm. So now we are combining
[20:50] mathematics with data. So the algorithm
[20:53] going to go through all your data and
[20:55] start learning from the data sets and
[20:58] this is what we call my friend training
[21:00] a model. Now that means the model going
[21:03] to start learning the patterns. It's
[21:04] going to go and find connections and
[21:06] relationships between the data and going
[21:08] to start adjusting itself as well to
[21:11] minimize the errors. So that at the end
[21:13] we get something very powerful called a
[21:15] trained model that we could use in order
[21:17] to do predictions for new data. Now the
[21:20] output of this trained model going to be
[21:23] predictions. So it could be like yes the
[21:25] customer is likely to stay or no this
[21:28] customer might leave soon. So we are
[21:30] labeling each customer or we could have
[21:32] like percentage on how likely the
[21:34] customer going to leave. Now of course
[21:36] we have to share now this knowledge to
[21:38] the business users. What we could do we
[21:40] could import everything the data and the
[21:42] predictions in PowerBI or Tableau and
[21:44] build again like a visual in order to
[21:46] show the final results. So as you can
[21:48] see it's always nice thing to
[21:50] communicate with the business users
[21:51] using PI tools because they are very
[21:53] friendly compared to maybe like visual
[21:56] inside your notebook. So now we go and
[21:58] present the results for the business
[22:00] users and now they have more
[22:02] understanding what could happen in the
[22:04] future. Now next what going to happen
[22:06] they going to plan actions. So maybe
[22:07] they're going to go and launch retention
[22:10] campaign or maybe sends a new offers to
[22:12] the risky customers or maybe the sales
[22:15] team going to go and reach them
[22:16] directly. And this is the most beautiful
[22:18] moment as a data scientist. You see you
[22:21] are adding value to the company. You are
[22:23] not just building like a fancy model
[22:25] because you could. You are really
[22:27] helping the business to see what is
[22:29] coming and to do something about it. Now
[22:31] this sounds like an happy end, right?
[22:33] You have delivered your work. You have
[22:35] improved the business and that's it. But
[22:37] sadly the bad news is that we are not
[22:40] done yet. So far everything that we have
[22:42] done is manually and as well like we can
[22:44] call it prototyping. But now of course
[22:46] this is not one-time activity. We have
[22:49] to continue doing those stuff and you
[22:51] cannot keep doing everything like
[22:52] manually on your notebook. And of course
[22:55] the sources keep generating new data
[22:57] every day and maybe they are useful
[22:59] informations that you have to train the
[23:01] model again. So now we are at a point
[23:03] where we have to automate everything and
[23:05] make deployments. Deployments means we
[23:08] take everything that we have done
[23:09] manually so far and put it in real
[23:12] automated system. [clears throat] This
[23:13] system of course should not run locally
[23:15] at your notebook. We have to run it on
[23:18] servers for example in the cloud. And of
[23:20] course what we can do we can use APIs in
[23:22] order to connect internal applications
[23:25] and system in the company to the model
[23:27] in order to show those scores at the
[23:29] front end. So the whole thing is all
[23:31] about to bring everything that you have
[23:32] done manually at your notebook and to
[23:34] deploy it to professional platform that
[23:37] is fully automated, scalable, highly
[23:40] available and secure and connectable to
[23:42] different applications at your business.
[23:44] And if you are lucky enough and you have
[23:46] ML engineers or ML ops, they have to do
[23:49] those stuff. So as a data scientist,
[23:51] actually this is not the thing that we
[23:53] do. But if you don't have them, then you
[23:55] have to do it in your own. So now my
[23:57] friends everything that you have just
[23:58] seen like collecting the data, preparing
[24:00] the data, training the model, this is
[24:02] what we call classical machine learning
[24:05] process because we are now in 2025 and
[24:08] we have entered the world of pre-trained
[24:11] models especially the LLMs large
[24:14] language models like TBT cloud mist and
[24:17] others they are models that already
[24:19] trained on massive amount of data in the
[24:22] internet on public data like text
[24:24] website documents and They already
[24:26] understand the language, context and
[24:29] reasoning. And this is really crazy
[24:31] because before we have like models and
[24:32] we have always to train the models but
[24:35] now everything is prepared for you. We
[24:36] have models and as well they are
[24:38] pre-trained with massive amount of data.
[24:40] But now you might say yeah okay those
[24:42] pre-trained models are really good but
[24:44] they are very like generic. They trained
[24:46] on the public data. In the companies we
[24:49] have like special data. Well it's fine
[24:51] because most of those models they allow
[24:53] something called finetuning. So that
[24:55] means you can go and pick one of those
[24:57] pre-trained models and train it,
[24:59] fine-tune it with the company's data and
[25:01] with that you make it smart with your
[25:03] domain, with your business. And now you
[25:05] might ask, okay, why we need all those
[25:06] LLMs? Well, think about it like this. If
[25:09] each time the stackholder or manager
[25:11] need like new reports or new
[25:13] informations from the model and you go
[25:15] and jump and get the data from the
[25:16] model, put it in PowerBI and then
[25:18] present it for the stakeholders. This is
[25:21] really slow. So instead what we can do
[25:23] we let the user have a chat with the
[25:25] model and the users could start
[25:27] conversations like for example why are
[25:29] customers leaving a region B summarize
[25:32] all the feedback from the cancelled
[25:34] users so they're going to have like a
[25:36] chat with the model and this is way
[25:38] better than waiting each time for your
[25:40] PowerBI report and for that we could use
[25:42] the help of those pre-trained models the
[25:44] LLMs and now of course comes the scary
[25:47] part where you ask if we have all those
[25:50] pre-trained models why Do we have even
[25:52] data scientists? I understood from you.
[25:54] We need data scientists in order to
[25:56] train models. But if we have pre-trained
[25:58] models, why do we need them? Well, first
[26:00] of all, my friends, everything that I
[26:02] have described is the industrial data
[26:04] scientist. So, someone is hired in the
[26:06] company to do those stuff. But the one
[26:08] that pre-trained those data models, they
[26:10] are as well data scientists, but they
[26:12] are not hired from the industry. They
[26:14] are actually researchers. And of course
[26:16] they work in massive engineering teams
[26:19] in big tech companies like OpenAI,
[26:21] Google, Meta and they do the amazing
[26:23] work of bringing those pre-trained
[26:25] models in the market. And from the other
[26:27] side of course we still need the
[26:29] industrial data science for one very
[26:31] important reason. Well my friends all
[26:33] those LLMs and pre-trained models they
[26:35] are trained on public data and most of
[26:37] the companies they don't bring the data
[26:39] on public they all internal confidential
[26:41] and even secrets. So there will be a lot
[26:43] of business problems that depends on the
[26:45] company's data and as long as the
[26:47] companies protect their data. We're
[26:49] going to end up in the situation where
[26:51] we need to hire data scientist in the
[26:53] industry so that they either fine-tune
[26:55] the pre-trained models or in many
[26:57] scenarios they have to train from the
[27:00] scratch the models using the company's
[27:02] data. And that's why the data scientist
[27:04] in the industry is very relevant. And to
[27:06] be honest, we have now a lot of work to
[27:08] do because we suddenly exposed to all
[27:11] those pre-trained models and we have to
[27:13] fine-tune a lot of new models. So I am
[27:15] very positive and excited about it that
[27:17] this going to accelerate our work. This
[27:19] going to open the door for many use
[27:20] cases that I never thought before. And
[27:23] believe me, we have a lot of work to do.
[27:25] So back to our story. Now we come to the
[27:27] last thing that we know so far. We have
[27:29] something called AI agents. So now we
[27:31] are far beyond having a quick
[27:33] conversation between the business user
[27:35] and your model. Now the user is going to
[27:37] ask for many stuff like for example how
[27:39] many customers did left last year. So
[27:41] what we can do we can have like one AI
[27:43] agent that is using pre-trained model to
[27:46] convert this text into an SQL query that
[27:50] goes directly to the database and grab
[27:51] the data and of course show it at the
[27:53] end as visualizations. And another thing
[27:55] the customer might ask for example where
[27:57] I find the customer's data in which
[28:00] system in which application. So for this
[28:02] scenario we can use another AI agent
[28:04] that could use as well pre-trained model
[28:06] that's specialized on scanning
[28:08] documents. So this time we don't need
[28:10] the company's data we need the
[28:11] documentations of the company. And now
[28:13] if the customers ask something about the
[28:15] future how many customers going to leave
[28:16] next year then you can use an AI agent
[28:19] that connects to our model that we
[28:21] trained from the scratch that we are
[28:23] very proud of. So if you look to this,
[28:24] we're going to have a lot of AI agents
[28:26] that are connecting to different models,
[28:29] connecting to different data and
[28:30] sources. So of course, we have to
[28:32] orchestrate all those stuff and connect
[28:34] everything. And that's why we're going
[28:36] to have like a manager AI agent, the top
[28:38] level agent that going to get the
[28:40] prompts from the users and decide which
[28:43] agents and models are involved and then
[28:45] respond back to the users or do an
[28:47] action like for example sending an
[28:49] email. Well, that might be a lot of
[28:52] informations for some of you. I can keep
[28:54] going and adding stuff to this big
[28:55] picture but it is [music] not fitting
[28:57] anymore
[29:02] to understand [music] what data
[29:03] engineers does. We are not building
[29:05] apps. We are not building softwares. We
[29:08] are just building [music] data
[29:09] pipelines. Pipeline that pulls the data,
[29:12] clean it, transform it, load it and keep
[29:15] running every day. Data engineering is
[29:17] like the engine room behind every
[29:19] datadriven company. You are the
[29:21] engineer. You are the one that going to
[29:23] go and move, transform and store massive
[29:26] amount of data. So my friend that means
[29:28] you are not building a shiny dashboard
[29:30] or you are building front ends. You are
[29:33] as a data engineer behind the scenes but
[29:35] you go and expose the hidden complex
[29:37] important data of the company and you
[29:39] [music] bring it to a suitable platform
[29:41] so that others can do smart things with
[29:43] the data. Okay. So what is a data
[29:46] pipeline? At a very high level a data
[29:48] pipeline is just a flow. So data comes
[29:50] in from the sources. They go through few
[29:53] steps where we prepare them, clean them
[29:55] and enhance them. And once they are
[29:57] ready, we're going to load them into the
[29:59] target table. So it is no rocket
[30:01] science. As a data engineers, we have to
[30:02] build only this the data pipeline. And
[30:05] we use Python in order to do that. And
[30:07] then other people going to start using
[30:08] this prepared data in order to build
[30:11] analytical use cases like a dashboard or
[30:13] an AI and some machine learning stuff.
[30:16] This is the big picture of pipeline. And
[30:18] now we're going to zoom in into each
[30:20] step in order to understand which Python
[30:22] concepts are needed. So now let's start
[30:24] with the first step on the left side.
[30:25] The sources of our data could be in
[30:27] different technologies like databases or
[30:30] stored in files or maybe they are
[30:32] provided as a stream like using Kafka
[30:34] and they could live in APIs. So we have
[30:37] different sources of our data. And now
[30:39] the very first step in our pipeline is
[30:41] extract and load. This is where
[30:43] everything starts. So it is very simple.
[30:45] We have to connect to the sources. start
[30:48] reading the data and then loading it
[30:50] into our system. So we are just taking
[30:53] one copy of our data and putting it in
[30:55] our system. No big smart thing, no
[30:57] transformations. So now in order to do
[30:59] this step in the pipeline, we write a
[31:01] Python script. Okay, so that's all about
[31:02] the first step. We didn't do something
[31:04] smart. We just focus on how to get the
[31:06] data in. Now the next step is all about
[31:08] cleaning up the data and enhancing it.
[31:10] So what we're going to do, we're going
[31:11] to take this raw copy of the data that
[31:13] is missing. We're going to start fixing
[31:15] it, handling missing value, removing
[31:17] duplicate, and doing small enhancements
[31:19] on top of it. And of course, we're going
[31:20] to do this step by writing as well a
[31:22] Python main script. And as well, we're
[31:24] going to have the same add-ons where
[31:25] we're going to have a config file, a
[31:27] logging, a data quality, and so on.
[31:29] Okay. So now, so far what you have done,
[31:30] the first step, we just brought the data
[31:32] in. The second step, we just cleaned up
[31:35] the data. And now moving on to the last
[31:37] step inside our pipeline. Here we can
[31:39] apply the business logic, and the fun
[31:41] starts. So we're going to take the data
[31:43] that is clean and start joining it
[31:45] together doing some data aggregations
[31:47] and start applying the business logic
[31:50] the business data transformations. And
[31:52] of course for this the same thing going
[31:53] to happen. We're going to write main
[31:55] Python scripts in order to do this step
[31:57] and we're going to have on top of it
[31:58] some add-ons like the config files, the
[32:01] logging and the data quality. So as you
[32:02] can see as data engineers our job is not
[32:04] that hard. We just have to build this
[32:06] data pipeline where it has mainly three
[32:08] steps. The first one is extract and
[32:10] load. So we're going to bring the data
[32:11] from the sources into our system without
[32:14] any extra logic or transformations. And
[32:16] now the next step, we're going to take
[32:18] the raw data and start cleaning it up
[32:20] and enhance it. So we're going to fix
[32:21] the data type. We're going to clean the
[32:23] text. We're going to handle the dates
[32:25] and make sure everything is prepared for
[32:27] the last step where we going to
[32:28] transform and apply the business logic.
[32:30] So we're going to join the tables,
[32:32] aggregate the data, apply the rules and
[32:34] prepare a final product for the
[32:36] analytics and as [music] well for AI use
[32:38] cases.
[32:43] In any [music] industry or business,
[32:45] everyone at the top had the same
[32:46] mission. Managers, stakeholders, project
[32:49] leads all need to make smart decisions
[32:52] and they need to make them fast. And
[32:54] they ask question like which region is
[32:56] failing behind, why profits are dropping
[32:59] in the last month and where they should
[33:01] put their money next. And those are
[33:03] really critical business questions and
[33:05] they need answers. So now if they try
[33:07] without a good data, they are going to
[33:10] guess. So they going to use opinions and
[33:13] some gut feelings in order to answer
[33:14] them. Deming said it best. Without data,
[33:17] you are just another person with an
[33:20] opinion. That's why most of the
[33:21] companies they going to try to answer
[33:23] the data using the company's data. And
[33:26] for that the business going to hire a
[33:28] data analyst in order to find the
[33:30] answers from the data. But to do that
[33:32] they need a tool. And if the company
[33:35] doesn't have the right one, they usually
[33:37] go and use spreadsheets like the Excel
[33:39] files. So the analyst going to try to go
[33:42] and collect the data from the sources
[33:44] and put it in the Excel files. They're
[33:46] going to clean up the data, fix mistakes
[33:48] and prepare the data so that at the end
[33:50] they going to present the answers using
[33:53] numbers and charts to the managers and
[33:55] after that the manager going to use them
[33:57] in order to make critical business
[33:59] decisions. So now this sounds like happy
[34:01] ending, right? Everyone is happy and we
[34:03] got the answers. But in real world, this
[34:06] doesn't work like this. If you are using
[34:08] spreadsheet or Excel to do data
[34:09] analyszis, things can start to fall
[34:12] apart. Let me show you what I mean. Now,
[34:14] the entire process is manual. So, you
[34:16] are connecting the sources, exporting
[34:18] files, merging sheets, cleaning rows.
[34:21] People think it is easy, but in reality,
[34:23] this is painful and it takes a lot of
[34:25] time. Sometimes [music] it take weeks
[34:27] and I have seen cases where it takes
[34:29] months. And if it takes a lot of time to
[34:32] present the data, that means the data
[34:34] inside the reports are actually old,
[34:36] which means the managers are making
[34:38] decisions based on old data and the
[34:41] business might have been changed in the
[34:43] last week and no one knows about it
[34:45] because we have old data inside the
[34:47] Excel files. So that means managers
[34:49] makes decisions based on outdated data
[34:52] and this might result to bad decisions
[34:55] which going to cost the company a lot.
[34:57] So the new generated data inside the
[34:59] systems will not be automatically
[35:01] updating the charts or whatever results
[35:04] that you have in Excel. It's going to be
[35:06] like snapshot static. Another issue is
[35:08] that the company's systems are not
[35:10] always like a simple database or files.
[35:13] Now the systems lives in cloud. They
[35:16] provide data in APIs or even they stream
[35:19] their data in platforms like Kafka. and
[35:21] pulling the data from those modern
[35:24] systems into Excel by hand is almost
[35:26] impossible. So you're going to struggle
[35:28] with the modern technology. Another big
[35:30] issue as well. The sources now are
[35:32] generating massive amount of data. So if
[35:35] you try to export everything to Excel
[35:37] and put it in file at some point your
[35:40] Excel file is going to explode and maybe
[35:42] freeze or crashes. They simply can't
[35:45] handle real big data. And another one
[35:47] that can be funny to watch if the
[35:49] company decide to speed up the process
[35:51] by adding more people to the projects.
[35:53] So you're going to have more than one
[35:55] analyst that they going to try to
[35:56] prepare this Excel file and guess what
[35:58] [music] they cannot work inside the same
[36:01] file at the same time. And the solution
[36:03] for that each one going to go and create
[36:05] their own copy. So that means you can
[36:07] have multiple versions of the same file
[36:09] and good luck merging all those stuff
[36:12] back to one file. And another big issue
[36:14] using Excel files is the security. So
[36:16] the analysts or the managers might send
[36:18] the Excel files through emails and those
[36:21] files could be storing sensitive data
[36:23] about the company and it is really easy
[36:25] to hack the Excel files to get access to
[36:28] them and could be easily end up in the
[36:31] wrong hands. So your company's data are
[36:33] totally unprotected if you put it
[36:34] [music] in spreadsheets. And another
[36:36] thing about the security, not all the
[36:38] data inside the Excel should be
[36:40] available for everyone in the company.
[36:42] Sometimes part of the data is really
[36:44] sensitive and only upper management in
[36:46] the company are allowed to see it not
[36:48] everyone in the company. So in Excel you
[36:50] cannot go and apply something called
[36:51] role level security to control which
[36:54] department or which people are allowed
[36:56] to see the data. So you're going to end
[36:57] up creating multiple versions for
[37:00] multiple departments. And now since we
[37:01] are talking about departments we come to
[37:03] the biggest issue the biggest
[37:05] catastrophic. Different departments like
[37:07] sales, finance, marketing have their own
[37:10] analysts. So one analyst is exporting
[37:12] the data, cleaning up and then they
[37:14] build their own formulas and guess what
[37:18] can happen if you leave it like this.
[37:19] First people are spending their mornings
[37:22] doing the same task and now if the
[37:24] managers ask the same question to three
[37:27] different analysts they going to get
[37:29] three different answers. It's not
[37:31] because they are wrong. It is because of
[37:33] this whole system. One file is 2 weeks
[37:35] old. Another one is 1 week old and
[37:38] another just got updated yesterday. And
[37:40] as well the calculation in each file
[37:42] could be slightly different. So that
[37:45] means my friends it is big confusion and
[37:47] there is no more single point of truth
[37:50] for the data. And my friends to be
[37:52] honest I have seen this almost in every
[37:54] company that I have joined. It is total
[37:56] disaster and pure CHAOS
[38:02] and that's why many of the companies
[38:04] they understood they cannot rely on
[38:06] spreadsheets to do data analyzes.
[38:08] Instead they need a standard they need a
[38:10] process platform to do the whole thing
[38:13] in one place and we call this process
[38:16] this system as a business intelligence
[38:18] or as a shortcut BI. So what is that? A
[38:21] business intelligence is the whole
[38:23] process of working with data. So
[38:25] collecting the data from the sources
[38:27] cleaning up the data preparing and
[38:30] organizing it and then turn it into
[38:32] visuals so people can understand what is
[38:35] going on and make better decisions. So
[38:38] the BI the business intelligence is the
[38:40] full workflow from start to finish. A
[38:43] lot of people thinks business
[38:44] intelligence is just like making charts
[38:46] and dashboards. Well, it is way more
[38:49] than that. And of course we need a tools
[38:51] a platforms to do now the business
[38:53] intelligence, right? And for that we
[38:54] have two very famous platforms. We have
[38:57] the Tableau and the Microsoft PowerBI.
[39:00] And that's why you have it in the name
[39:01] PowerBI BI business intelligence. That's
[39:04] why we call PowerBI as a BI tool. Same
[39:08] things goes of course for Tableau. Now
[39:10] quickly what is the history of PowerBI?
[39:12] Because PowerBI didn't start as a giant
[39:15] platform. Everything began inside the
[39:17] Excel. So Microsoft has tools like Power
[39:19] Query and Power Bivot hidden inside
[39:22] Excel. Those tools were very powerful.
[39:25] But the thing is Excel itself was not
[39:27] able to handle the big modern data. As I
[39:30] showed you all the issues and problem of
[39:32] the Excel and at the same time other
[39:34] platforms like Tableau start to becoming
[39:37] very popular because it offered a clear
[39:39] way and easy way to do data
[39:42] visualizations. So that's why Microsoft
[39:44] understood they need something
[39:46] completely new. They cannot keep adding
[39:48] stuff to [music] Excel. They start
[39:50] taking the strong parts of Excel and
[39:52] turn them into new products focused on
[39:55] data analyzers. And that's idea became
[39:58] PowerBI in 2015. And since that point
[40:01] until now, Microsoft keep [music]
[40:03] pushing and keep improving the PowerBI
[40:05] month after month. Better visuals, more
[40:08] connectors, improving the performance,
[40:11] deep integration of PowerBI with the
[40:13] Microsoft ecosystem. And in the last
[40:15] four years, they introduced the
[40:17] Microsoft fabric and copilots in
[40:20] PowerBI. So over the time PowerBI grew
[40:23] into a full business intelligence
[40:25] platform or a data analytics platform.
[40:27] It became one of the Tableau's main
[40:30] competitors. Both of them are now the
[40:32] top tools in the world for BI. If you
[40:34] have a look at Google Trends, you can
[40:36] exactly see the clear shift. Tableau was
[40:38] dominating for a long time. But PowerBI
[40:41] kept growing and growing until we
[40:43] reached the point where PowerBI get
[40:45] served almost as much as Tableau
[40:47] worldwide. So it is interesting where
[40:50] PowerBI and Tableau are going to go over
[40:51] the time and for PowerBI we have another
[40:54] name we call it as data visualization
[40:56] tool. So what is that? But first let's
[40:59] have some cafe rights.
[41:02] All right. So it is the process of
[41:05] turning raw data and boring numbers into
[41:08] visuals and charts like bar line pie
[41:11] charts and heat maps just to make it
[41:14] easier for human to understand the
[41:16] complex raw data instantly because my
[41:18] friends we are visual creatures. If you
[41:21] see a picture of a tree you're going to
[41:22] understand it right away. But if you
[41:24] read the word tree your brain has first
[41:28] to process it and turn it into visual.
[41:30] So that means visuals skips this extra
[41:33] steps which makes it easier for your
[41:36] brain. So that means visuals are
[41:38] processed way faster than reading a text
[41:41] or looking to numbers. And not only
[41:44] that, our brain remember them longer. So
[41:46] we keep small part of what we hear, a
[41:49] bit more of what we read, but most of
[41:52] what we see. So this is the power of
[41:55] visuals and data visualizations are very
[41:58] powerful because you understand the
[41:59] information faster. You see the patterns
[42:02] and problems instantly. You can use them
[42:04] to explain your ideas. You can tell a
[42:07] story and with that the others can make
[42:09] better decisions and everyone going to
[42:12] remember the visuals way longer than
[42:14] numbers. And exactly for that tools like
[42:16] PowerBI focuses on data visualizations.
[42:19] So now the question is what is inside
[42:22] PowerBI and how it actually work. It is
[42:24] [clears throat] divided into five
[42:26] components and each one of them has its
[42:28] own clear job. The first one is power
[42:31] query. This is where you clean up your
[42:33] data. You remove mistakes, duplicates,
[42:36] you fix the formats. So it's all about
[42:39] preparing the data. The second component
[42:41] is the data module. Once everything is
[42:44] cleared, it's time [music] to structure
[42:46] and organize your data. So it's all
[42:48] about how you go and connect your tables
[42:50] to each others. And this is very
[42:52] important to make everything accurate,
[42:54] organized, and as well fast. Now moving
[42:57] on to the next one, we have the DAX.
[42:59] Everyone is afraid of this one. So DAX
[43:02] is all about using calculations and math
[43:04] in order to build the business logic
[43:07] that you're going to need for the next
[43:08] step. So everything so far is like
[43:10] hidden. It's the background of the
[43:12] PowerBI. No one sees it. And now it's
[43:15] the first time you're going to build
[43:16] something that the end user is going to
[43:17] see the visuals. So here you're going to
[43:19] build all those charts and big numbers,
[43:22] tables and those filters and the dynamic
[43:25] of everything just in order at the end
[43:28] to tell a story using the data. Now once
[43:31] everything is ready, you go to the last
[43:33] component in order to share and publish
[43:36] what you have built. So at the end
[43:37] people going to view your report,
[43:39] interact with it and make decisions. So
[43:41] those are the five components and blocks
[43:44] inside PowerBI that you have to use in
[43:47] order to build something in PowerBI. So
[43:49] it looks like a process a simple
[43:51] pipeline. You bring the data in, you
[43:54] clean it, you module it and organize it
[43:57] then visualize it and at the end you
[43:59] share it and you can do the whole thing
[44:01] in just one platform the PowerBI. So now
[44:04] let's talk about the PowerBI ecosystem
[44:06] because PowerBI itself it is not just
[44:09] one thing it is multiple things. So
[44:11] Microsoft offers multiple tools and each
[44:14] one of them has its own job. The first
[44:15] one is the PowerBI desktop. This is the
[44:18] app that you install locally on your PC
[44:21] and I have bad news just to make it
[44:23] clear you cannot install it on Mac
[44:27] because it is Windows only unless you
[44:30] are using some extra workarounds. So the
[44:32] PowerBI desktop is where you spend most
[44:34] of your time building things. It is
[44:37] basically the engine of PowerBI.
[44:39] Everything that you have talked about
[44:41] inside this tool. The Power Query, the
[44:43] data module, the DAX, the visuals, the
[44:46] whole thing, the whole process is inside
[44:48] this one tool. And everything, the data,
[44:51] the modules, the reports going to be
[44:53] stored into one single file locally at
[44:55] your PC. And the extension of this file
[44:58] going to be PVIX. And of course you can
[45:00] go and share this file directly with the
[45:02] others. But we don't want to have
[45:04] another Excel's issue, right? We don't
[45:06] do that. Instead of that, we're going to
[45:08] publish everything that we have built to
[45:09] a new tool to the next part of the
[45:12] ecosystem, the PowerBI service. And at
[45:15] this point, once you publish your report
[45:17] from desktop to the service, your report
[45:20] now going to live there. Now what going
[45:22] to happen? your end users, your
[45:23] consumers can open their browser and
[45:26] start viewing your reports and
[45:28] dashboards and interact with them. So
[45:30] they can start filtering things,
[45:32] checking the number and whatever they
[45:33] need. And of course, if you have some
[45:35] top managers that don't have the time to
[45:37] open their laptops, no problem. This is
[45:39] totally fine because they can go and
[45:41] install the third tool, the PowerBI app
[45:44] on their phone or tablet in order to
[45:47] connect to the same reports that lives
[45:49] in the cloud in the service. So using
[45:52] their devices, they still can interact
[45:54] with your report. Of course, there are a
[45:56] lot of more extra stuff, but those are
[45:58] the three major, let's say, tools that
[46:00] Microsoft offers. A desktop where you
[46:03] build things, the PowerBI service where
[46:05] you're going to host and share your
[46:06] reports, and this nice PowerBI app to
[46:09] interact with your stuff using phone and
[46:11] tablets. Now, you might ask, is PowerBI
[46:14] only made for data analyst? Because we
[46:16] have understood it is their main tool.
[46:18] They use it in order to answer business
[46:20] questions using data and visuals. So if
[46:23] I am not data analyst, if I do something
[46:25] else, should I learn PowerBI? Well, my
[46:27] friend, of course. Like for example, me
[46:30] as a data engineer who go and build data
[46:32] pipelines, I still go and use PowerBI in
[46:35] order to make dashboards to monitor my
[46:38] system to check the data quality and
[46:40] make sure that everything is loaded
[46:42] correctly the way it should. So I use
[46:44] PowerBI to monitor internally the whole
[46:46] system that I'm building as a data
[46:48] engineer. Now as well you can use it as
[46:50] a data scientist. So you go and build
[46:52] models, predictions and machine learning
[46:54] solutions but still as a data scientist
[46:56] you have to go and share your work with
[46:59] non tech and business users and you
[47:01] cannot share with them your code or
[47:03] notebooks. You're going to go and use
[47:05] PowerBI to share your work because it is
[47:07] simple and easy for business users. So
[47:10] it is the best way to explain the
[47:12] complex things sets you do as a data
[47:13] scientist. And of course if you are a
[47:15] business analyst you're going to end up
[47:16] as well using PowerBI. So you're going
[47:18] to use it to answer what happens, why it
[47:21] happened and what should happen next. So
[47:24] you use insights for conversations and
[47:26] discussions. And by the way, not only at
[47:28] work, I know a lot of people that use
[47:31] PowerBI for personal life at home. They
[47:33] use it in order to track their spending,
[47:36] for budgeting, for investments. Anytime
[47:38] you have numbers and you would like to
[47:40] make visuals, you can use PowerBI. So
[47:42] for example, I use it for YouTube stuff
[47:44] to analyze and track things. It just
[47:47] works for everyone. And now my friends,
[47:49] there is like an endless war about which
[47:51] BI [music] tool is better, PowerBI or
[47:53] Tableau. I saw this topic discussed
[47:55] everywhere at YouTube, social media, and
[47:58] at companies. There are endless content
[48:01] and meeting and efforts about it. Now if
[48:03] you ask me which one should I use or
[48:05] which one is better I'm going to tell
[48:07] you both are good and there is one of
[48:09] the project that I have done that I
[48:11] utilized both of them PowerBI and
[48:13] Tableau at the same time. So now I'm
[48:15] going to tell you a few reasons why some
[48:17] teams tend to use PowerBI instead of
[48:19] Tableau. The first and obvious reason is
[48:22] it fits your daily tools. Most of people
[48:25] at work already use Microsoft products
[48:27] like Outlook, Teams, Excel, PowerPoint,
[48:30] SharePoint and PowerBI feel natural
[48:33] because it fits. It looks like Excel. It
[48:36] looks like PowerPoint and it is already
[48:38] connect to all those tools without any
[48:40] extra steps. The second reason, it is
[48:43] very easy to switch from Excel to
[48:46] PowerBI because you already know DAX.
[48:48] The interface looks like an Excel and
[48:50] the charts and dashboards looks similar.
[48:53] you have already the background to
[48:55] switch to PowerBI. Reason number three,
[48:57] it is more affordable. PowerBI cost less
[49:00] than other tools and now my friend of
[49:02] course all the companies are trying to
[49:04] cut costs [music] and one of the major
[49:07] cost in each budget is actually
[49:09] licenses. This is a winning factor why
[49:12] they utilize PowerBI and another reason
[49:14] inside PowerBI you have a lot of things
[49:16] like the data cleanup and preparation
[49:19] using the power query. Many other
[49:21] platforms like Tableau, they offer like
[49:23] a second tool like Tableau prep in order
[49:26] to prepare the data. But with the
[49:27] PowerBI desktop, you have like
[49:29] everything in one place which makes it
[49:31] easier for you not to jump between like
[49:33] tools to do the whole process. And
[49:35] another reason why people tend to use
[49:37] PowerBI is that Microsoft really focuses
[49:40] on this tool. Like there is a new
[49:42] release, new updates every single month,
[49:44] new visuals, new features, new
[49:46] connectors. The tool is really keep
[49:48] growing very fast. And another reason
[49:50] which is my favorite one the data
[49:53] moduling inside PowerBI is really
[49:55] strong. So there are a lot of features
[49:57] and advanced stuff that you can do as
[50:00] you are moduling the data. If you
[50:01] compare to Tableau Tableau is really
[50:03] limited on how you build the data model
[50:05] which [music] is a clear win for PowerBI
[50:08] because if you have a strong data model
[50:09] you can make very flexible and fast
[50:12] insight with the PowerBI. So those are
[50:14] the reasons why we use PowerBI instead
[50:17] of other tools like Tableau. But the
[50:19] major reasons are again it is cheaper
[50:21] than other tools. It is integrated
[50:23] everywhere in the Microsoft ecosystem at
[50:26] work. And the third one, come on, it
[50:28] looks like a powerful Excel. Everyone is
[50:30] familiar and use Excel. So the interface
[50:33] of the PowerBI looks really familiar. So
[50:35] those are my reasons why I think PowerBI
[50:38] is becoming very popular and widely
[50:40] used. Okay. So my friends, PowerBI is
[50:42] great but actually not perfect. And
[50:44] there are some scenarios where I'm going
[50:46] to say PowerBI is limited compared to
[50:49] the other tools like Tableau. Like the
[50:51] biggest limitation with the PowerBI is
[50:53] you cannot use it for very complex
[50:55] visuals. Now if you are building
[50:57] operational reports where you have some
[50:59] basic charts, a bar line or pie charts,
[51:02] some table and filter then PowerBI is
[51:05] more than enough. But the thing is if
[51:07] things get a little bit advanced and
[51:09] complex like if you are a data scientist
[51:11] that is making dot plots or some
[51:14] advanced insights about the data it is
[51:16] way easier to use Tableau and as well
[51:19] faster. It's stills stronger and wins
[51:21] here compared to PowerBI. And another
[51:24] thing where I cannot use PowerBI is that
[51:26] there are limitations. So if you are
[51:27] using scatter charts in PowerBI, you are
[51:30] limited only to show few thousands of
[51:32] data points which is really a disaster
[51:35] if you have a lot of data. You would
[51:37] like even to show hundred of thousands
[51:39] of data points in order to find those
[51:41] outliers. And Tableau does not enforce
[51:44] any hard limitations or restrictions on
[51:46] the number of data points on the visual.
[51:49] Literally I had one visual that is
[51:51] showing millions of data points on my
[51:53] dashboard in Tableau. And another reason
[51:55] which is really annoying, you cannot run
[51:57] PowerBI desktop for Mac users. You need
[52:00] always to use workaround where tools
[52:02] like Tableau you can install it in both
[52:04] Windows and Mac. So those are the major
[52:06] reasons let's say to not use PowerBI.
[52:08] And this is exactly why I have utilized
[52:11] in one project both the tools PowerBI
[52:13] and Tableau. So for basic operation
[52:16] reports and standard reports that goes
[52:19] to managers and leaders I have utilized
[52:22] PowerBI which is more than enough and I
[52:24] can tell you 80% of requirements and
[52:27] reports fits into those two categories
[52:30] and only for complex and advanced
[52:32] analytics that is usually prepared by
[52:34] advanced data analyst and data
[52:37] scientist. We use Tableau because it is
[52:39] faster for big data and as well you can
[52:41] create highly customized charts and of
[52:44] course not everyone in the company did
[52:46] use it. So we give it an access only for
[52:48] like a focused audience that has
[52:50] advanced analytical requirements. So
[52:53] this is my take on why we use PowerBI
[52:55] and when not to use it. All right
[52:57] friends, so that was a whole full
[52:59] introduction about what [music] is
[53:01] PowerBI.
[53:05] So what is exactly [music] SQL?
[53:07] Everything generate data and data is
[53:09] everywhere. Your first name is data,
[53:11] your mobile and everything inside the
[53:13] mobile is data. Car is as well
[53:15] generating a lot of data. Bank, your
[53:17] finance statements, everything is data.
[53:19] And now of course the question is where
[53:20] do we store our data? Personally, we
[53:22] store a lot of our data in like Excels,
[53:25] spreadsheets, in a text file. So you
[53:27] store a lot of your data in different
[53:29] files. Now how about companies? They
[53:31] have a lot of things that generate a lot
[53:33] of data that the products that they
[53:35] produce their customers as well
[53:37] generating a lot of data and sales
[53:39] informations and a lot of things. So
[53:41] companies generate massive amount of
[53:43] data. So now the big question is how
[53:44] they handle the data, how they store it.
[53:46] Of course they cannot go and use like
[53:48] simple files. They need something
[53:50] bigger, stronger and smarter. And here
[53:52] where the database comes in. So think
[53:54] about the database. It's like a
[53:56] container for storing data. But instead
[53:58] of just dumping files into folders, the
[54:01] database organized the data. So it is
[54:03] easy to access, to manage and to search.
[54:06] So a database simply it is a container
[54:08] that stores data. So now you might ask
[54:10] why we are using database. Can't we just
[54:12] use files like I do it personally? Well,
[54:14] let me tell you why we use databases.
[54:16] Imagine that someone asks the following
[54:18] question. Go and find the total spending
[54:20] in your data. So now in order for Mike
[54:22] to find the total spending and the costs
[54:25] he will be opening each of those files
[54:27] one by one searching for the costs
[54:30] trying to combine the data and it's
[54:32] going to be very long and messy process.
[54:34] But now in the other side if your data
[54:36] in database and you want to ask a
[54:38] question it's going to be very easy. So
[54:39] all what you have to do is to talk to
[54:41] the database to ask a question and the
[54:43] database can answer your question with a
[54:45] result. And now comes of course the
[54:47] question how do we talk to a database?
[54:50] Well, we use SQL. SQL is the language
[54:53] that you use in order to talk to the
[54:56] database. It stands for structured query
[54:59] language. SQL. And here you have people
[55:01] that call it SQL like me and others that
[55:04] call it SQL. There is no right and
[55:06] wrong, but if you follow me through the
[55:08] course, I think you will start saying
[55:10] SQL. So, by using SQL, you can ask the
[55:13] database, you can ask your data, and the
[55:15] database going to answer your question
[55:17] by sending you a result. So this process
[55:19] is very easy, simple and fast. And this
[55:22] is way better than having your data
[55:24] stored in different files. Another
[55:25] reason why we use databases is that they
[55:28] can handle really huge amount of data.
[55:30] So sometimes we have like millions of
[55:32] data inside our database. But in the
[55:34] other side, if you are storing your data
[55:35] inside spreadsheets and you have like
[55:37] massive amount of data, what can happen?
[55:39] Your spreadsheets going to just break.
[55:41] They simply can't handle big data. And
[55:44] another reason why we use databases is
[55:46] that it is just secure. It is safer to
[55:48] store important and critical data inside
[55:50] the database than just storing it in
[55:53] spreadsheets and files. So the databases
[55:55] are secure and you can control who is
[55:57] accessing what. So it is just more
[55:59] professional to store the data inside a
[56:01] database. All right my friends so far
[56:03] what we have learned most of the
[56:04] companies stores their data inside a
[56:07] container called a database and for you
[56:09] in order to ask questions and to talk to
[56:12] your database you have to speak the
[56:14] language of SQL.
[56:19] Now I'm going to show you how it looks
[56:20] like usually in companies. So we have
[56:22] our data inside the database and then
[56:24] you will have multiple people with
[56:26] multiple roles that are just writing
[56:28] different SQLs in order to talk to the
[56:30] data. But now not only employees and
[56:32] people interact with the database. You
[56:34] could build a website or an application
[56:36] that as well interacts with the database
[56:38] by sending different SQLs. And of course
[56:41] depend on how many people are
[56:42] interacting with the application and the
[56:44] website it might generate really massive
[56:46] amount of SQLs that sends to the
[56:48] database. And not only that you might
[56:50] has as well tools in order to do data
[56:52] visualizations where you have like a
[56:54] dashboard or reports maybe created using
[56:57] PowerBI or Tableau and it is used by
[57:00] stakeholders and managers in order to
[57:02] make decisions and as well those tools
[57:04] will be connected to the database and
[57:06] creating SQLs. So now as you can see we
[57:09] have a lot of interactions with the
[57:11] database from people applications tools
[57:14] a lot of things are generating SQLs and
[57:16] interacting with the database but the
[57:18] database is just a container and storage
[57:21] right so we need something a software
[57:23] that manage all those requests and
[57:25] that's why we have something called
[57:27] database management system DPMS so it is
[57:30] a software that going to manage all
[57:32] those different requests to our database
[57:34] and it going to make the priority which
[57:36] SQL must be executed First, this
[57:38] software can as well manage the security
[57:40] whether the SQL is allowed to be
[57:42] executed in the first place. So my
[57:44] friends, the DPMS is the software that
[57:47] going to manage the database. And now we
[57:49] are not done yet. There is something
[57:50] missing. So we have our data, we have
[57:52] the software. What is missing here is
[57:54] the hardware. So in real companies, we
[57:56] cannot run that on our PC because first
[57:58] our PC is weak and as well it goes
[58:00] offline. That's why we need a server.
[58:03] server. It is like a very powerful PC
[58:05] and as well it lives 24/7. So it is
[58:08] always available and here we can decide
[58:10] whether we're going to have a server
[58:11] inside the company or we can use cloud
[58:13] services in order to run our database.
[58:16] So my friends so far what we have
[58:18] learned the database it is container to
[58:20] store the data. The SQL it is the
[58:22] language in order to talk to the
[58:24] database. The DPMS it is the manager it
[58:26] manages the database and the server it
[58:29] is the physical machine where the
[58:30] database [music] lives. So this is how
[58:32] it looks like.
[58:37] And now my [music] friends there are
[58:38] different types of databases. So let's
[58:40] see what do we have. The first and the
[58:42] most famous one it is the relational
[58:44] database. It is very simple. It is like
[58:47] spreadsheets call them table where we
[58:49] have columns and rows and then there is
[58:51] like a relationship between those tables
[58:53] to describe how they relate to each
[58:54] other and that's why we call it
[58:56] relational database. So if people hear a
[58:58] database they're going to think about
[59:00] this one. Now we have another type of
[59:01] database is called key value. This time
[59:04] the data is organized completely
[59:05] different where you have pairs of keys
[59:08] and values. Think about it. It's like a
[59:10] big dictionary where you have a word
[59:12] like the key and the definition of the
[59:14] word. This is the value. And now moving
[59:16] on to the next one. This is as well
[59:17] important column based. So now instead
[59:19] of grouping the data by the rows, this
[59:21] type of databases group the data into
[59:24] columns. That's why it's called column
[59:26] based. And this is very advanced
[59:27] database in order to handle huge amount
[59:30] of data where the main purpose is to
[59:32] search for data. Moving on to another
[59:33] database called graph database. The main
[59:36] focus here is the relationship between
[59:38] objects. So the main idea here is how to
[59:40] connect my data points. And now finally
[59:42] we have the document database. The data
[59:45] is stored as entire documents where the
[59:48] structure of the data is not that
[59:49] important. What is more important is to
[59:51] fit everything in one page in one
[59:53] document. And now if you look to those
[59:54] five types we can group the document
[59:56] graph column based key value all those
[59:59] databases called noSQL databases and the
[01:00:02] relational database SQL database and in
[01:00:05] this course we will be focusing of
[01:00:07] course on the relational database and
[01:00:09] I'm sure you have heard about like the
[01:00:10] Microsoft SQL server the MySQL the
[01:00:13] posris SQL all those databases they are
[01:00:16] SQL relational database and for the key
[01:00:19] value you have the radius the Amazon
[01:00:21] Dynamo DB and we have for The column
[01:00:23] based we have the Cassandra and the red
[01:00:25] shift. For the graph database we have
[01:00:27] the Neo4G and the very famous database
[01:00:30] the MongoDB as a document database. Now
[01:00:33] my friends for this course we're going
[01:00:34] to be focusing on the SQL relational
[01:00:36] databases because it is the most famous
[01:00:39] one and the most used one in companies
[01:00:41] and I will be focusing on the Microsoft
[01:00:43] SQL server. So those are the different
[01:00:45] types of databases. [music]
[01:00:51] Now the databases are very structured
[01:00:52] and organized. It has the following
[01:00:54] hierarchy. The starting point is the
[01:00:56] server as we learned it is powerful PC
[01:00:58] and it is where the database lives and
[01:01:01] inside it we can have multiple
[01:01:03] databases. So maybe you have a database
[01:01:05] for the sales and another one for the
[01:01:07] HR. So the server can host multiple
[01:01:10] databases. And as we learned a database
[01:01:12] is a container of your data. Now moving
[01:01:14] on to the next level. In each database
[01:01:16] we can have multiple schemas. A schema
[01:01:18] it is like category or you can call it a
[01:01:20] logical container that we can use it in
[01:01:22] order to group up related objects like
[01:01:25] let's say you have hundred of tables. So
[01:01:27] you can split all the tables that has to
[01:01:29] do with the orders in one schema and
[01:01:31] then another group of tables with the
[01:01:33] schema customers and so on. So it help
[01:01:35] you to organize your tables and your
[01:01:37] objects in the database. And now if you
[01:01:39] go inside schema you can have multiple
[01:01:41] objects like tables. So now of course
[01:01:43] the question is what is a table? It is
[01:01:45] like spreadsheet. It organize your data
[01:01:47] in two columns. The column define the
[01:01:50] data that you store inside it. So you
[01:01:52] have one column about the customer ID,
[01:01:54] another column about the names, the
[01:01:56] scores, the birthday. So each column is
[01:01:58] about one type of data and sometimes we
[01:02:00] call the columns as fields. Now the
[01:02:02] other thing that we have in tables is
[01:02:04] the rows or sometimes we call it
[01:02:05] records. It is where actually the data
[01:02:08] is stored. Now in this example, each
[01:02:10] record represent one customer, one
[01:02:12] person. So we have one record for Maria,
[01:02:15] John and Peter. Thus we call them rows.
[01:02:17] Now in each table there is like one very
[01:02:19] important column called the primary key.
[01:02:22] It is always very important to have like
[01:02:24] one unique identifier for each customer
[01:02:27] for each row. And we use it for
[01:02:29] different purposes in order to combine
[01:02:30] it with another table in order to
[01:02:32] identify quickly one customer. So it is
[01:02:35] unique. It's like fingerprint and there
[01:02:37] is no two customers having the same ID.
[01:02:39] Now the overlapping between the columns
[01:02:41] and the rows we have a single value a
[01:02:43] cell and each value each column stores
[01:02:46] specific data type. A data type it is
[01:02:48] like what kind of data we are storing
[01:02:50] like an integer 1 2 30 or a decimal
[01:02:54] where you have a decimal point 3.14. Now
[01:02:56] if you want to store characters we have
[01:02:58] different data types for that like you
[01:03:00] want to store the name or the
[01:03:02] description. So here we can use the char
[01:03:04] or the vchar. So you store inside them
[01:03:06] like the first name Maria or something.
[01:03:08] Now you might ask what is a char or
[01:03:10] vchar? So the char always a fixed one.
[01:03:13] So if you define it like five characters
[01:03:15] always it's going to go and reserve five
[01:03:17] characters from the space. But if you
[01:03:19] want things more dynamic then you go
[01:03:20] with the vchar. And now moving on we
[01:03:22] have another data types called the date
[01:03:24] and time. So if you want to store a date
[01:03:26] like the birth dates and if you want to
[01:03:28] store the time information you can use
[01:03:30] the time data type. So we call those
[01:03:32] stuff intimol char date time they are
[01:03:35] data types. So my friends as you can see
[01:03:37] SQL databases are very organized and
[01:03:39] structured. [music]
[01:03:44] Okay. So now let's focus more about the
[01:03:46] SQL itself. We have in SQL different
[01:03:48] type of commands. So let's say that we
[01:03:50] have a database and this database is
[01:03:52] empty. So we have nothing inside it. Now
[01:03:54] of course the first thing that you have
[01:03:55] to do is to write an SQL with the
[01:03:57] command create in order to create brand
[01:04:00] new table in the database. So once you
[01:04:02] execute it the database going to go and
[01:04:04] build one. But this table is empty. So
[01:04:06] we have nothing inside it. So now what
[01:04:08] you have done here is you have defined
[01:04:10] something new. Right? And we call this
[01:04:12] type of commands the data definition
[01:04:14] language. The DDL we have create to
[01:04:16] create something new. Alter in order to
[01:04:19] edit something that already exists and
[01:04:21] drop in order to delete something to
[01:04:23] drop for example a table. So this is the
[01:04:25] first family of commands. Now if you
[01:04:27] look at our table it is empty. What do
[01:04:29] we need? We need data. So let's say that
[01:04:31] we have a website or an application. Now
[01:04:33] this application is generating a lot of
[01:04:35] data. Now in order for this application
[01:04:37] to move the data inside our new table,
[01:04:39] it must use the SQL command insert. So
[01:04:42] if you execute insert, you can add a new
[01:04:44] data inside your table. This type of
[01:04:46] commands we call it data manipulation
[01:04:49] language. And here we have three
[01:04:50] commands. Insert in order to insert a
[01:04:52] new data, update in order to update an
[01:04:54] already existing data, and delete in
[01:04:56] order to go and delete data from your
[01:04:58] table. And that's why we call it data
[01:05:00] manipulation language because you are
[01:05:02] manipulating your data. So what do we
[01:05:03] have now? We have table. We have data
[01:05:05] inside the table. Now what we can do? We
[01:05:07] can start asking questions. So let's say
[01:05:09] that you have analytical question about
[01:05:11] your data. Now all what you have to do
[01:05:13] is to write something called SQL query
[01:05:15] and inside it you use the command select
[01:05:18] but the whole thing we call it a query.
[01:05:20] So you send a query to the database. you
[01:05:22] have a question and the database can
[01:05:24] return for you the result the data
[01:05:27] answering your query your question and
[01:05:29] we call this type of activities using
[01:05:31] SQL the data query language and here we
[01:05:34] have only one and it is very famous we
[01:05:36] have the select we can use it in order
[01:05:38] to query our data so those are the three
[01:05:41] different commands in SQL and of course
[01:05:43] we're going to learn all of them but we
[01:05:45] will spend most of our time learning how
[01:05:47] to write the correct query for the
[01:05:50] correct answer.
[01:05:54] And now you might ask me bar why we have
[01:05:56] to learn SQL and if the time goes backs
[01:05:59] are you going to learn SQL again? Well
[01:06:01] for sure of course and here are the top
[01:06:03] three reasons that I have. The first one
[01:06:05] you have to learn it in order to talk to
[01:06:07] the data. You know most of the companies
[01:06:09] stores their data in databases and this
[01:06:11] is a standard way. This is how they do
[01:06:12] it. And if you want to work on the
[01:06:14] company in the data field and you want
[01:06:16] to talk to their data then you have to
[01:06:18] use SQL. It's like you move to another
[01:06:20] country where they speak another
[01:06:21] language and you want to live there for
[01:06:23] a long time, you have to speak the
[01:06:25] language. The same thing here. If you
[01:06:26] want to work with data, you have to
[01:06:28] learn the language in order to speak to
[01:06:30] the database, the SQL. So this is for me
[01:06:32] the most important reason why we have to
[01:06:34] learn SQL and SQL it is in high demand.
[01:06:36] If you go now and check the job
[01:06:38] description of the software developer,
[01:06:40] data analyst, data engineer, data
[01:06:42] scientist, I promise you, you will find
[01:06:43] there that they going to demand for SQL.
[01:06:46] So you will find they're going to ask
[01:06:47] for SQL skills almost in each job
[01:06:50] description. So if you check for any
[01:06:52] data related jobs you will find that
[01:06:54] they're going to ask for SQL skills. Now
[01:06:56] another reason that I have is it is
[01:06:58] industry standard. So if you go and
[01:07:00] check multiple modern data platforms and
[01:07:02] tools like PowerBI, Tableau, Kafka,
[01:07:05] Spark, Synapse, you will understand that
[01:07:08] there will be always a section where you
[01:07:09] have to enter SQL code. So most of those
[01:07:12] vendors adopt SQL because it is the
[01:07:14] standard. It is widely used. It is like
[01:07:16] selling points that their tools are
[01:07:19] easy. So those are my top three reasons
[01:07:21] why SQL is still relevant and why you
[01:07:23] have to learn it. Okay my friends. So
[01:07:25] with that we have now clear
[01:07:26] understanding what is an SQL, why we
[01:07:28] need it, what are databases and their
[01:07:30] different types, why do we have DBMS,
[01:07:33] servers and as well now you have
[01:07:35] understanding how things are very
[01:07:37] organized and structured inside the
[01:07:38] databases. [music] So that's all this is
[01:07:41] SQL.
[01:07:45] Okay. So [music] now let's start with
[01:07:47] basics. What is programming language? So
[01:07:49] now imagine that you want to give your
[01:07:50] computer a task and you say hey computer
[01:07:52] calculate 5 + 5. Well your computer will
[01:07:55] not understand what is exactly you're
[01:07:57] talking about and nothing going to
[01:07:58] happen. So we cannot use the natural
[01:08:00] human language in order to give tasks
[01:08:02] for your PC. So now instead with us we
[01:08:04] have to give the instructions in a
[01:08:06] language where the computer understand
[01:08:08] it. We write a short piece of code like
[01:08:10] for example here in Python print [music]
[01:08:12] 5 + 5. Now the computer going to
[01:08:14] understand the instruction and give us
[01:08:17] the result 10. So a program it is like
[01:08:19] set of instructions that is written in a
[01:08:22] language the computer can follow. But my
[01:08:24] friend not all languages are the same.
[01:08:26] Some are made for people and others are
[01:08:28] made for machines. So now at the top we
[01:08:30] have the natural language like you say
[01:08:32] hey please calculate 5 + 5. And we have
[01:08:35] a lot of natural languages. We have the
[01:08:37] English, Spanish, Hindi. They are easy
[01:08:39] for us but far too complex for computer
[01:08:42] to understand. Now moving on. Then we
[01:08:44] have the highlevel languages. So they
[01:08:46] are programming languages [music] but
[01:08:48] they are made for human because they are
[01:08:50] very simplified, logical and really easy
[01:08:53] to write and to read. And we have a lot
[01:08:55] of languages like Python and JavaScript.
[01:08:58] They are high-level languages. And now
[01:09:00] next we are going lower. We are going to
[01:09:02] the low-level languages. Programming
[01:09:05] languages can talk directly to the
[01:09:07] machines, but they are really hard for
[01:09:08] humans. It's hard to read, to write, and
[01:09:11] as well to understand. [music] And here
[01:09:12] we have examples like the assembly and
[01:09:14] the C language. And now at the very
[01:09:16] bottom, we have the machine language.
[01:09:18] And this one is made of binary code, a
[01:09:21] combination of ones and zeros. [music]
[01:09:23] And this is exactly what your computer
[01:09:25] can understand, but it is impossible for
[01:09:27] humans to read it, to understand it, to
[01:09:29] write [music] it. So now if you look at
[01:09:31] these levels each step down it going to
[01:09:33] bring us closer on how the machine
[01:09:35] thinks and each step up it going to
[01:09:37] bring us closer on how human thinks. So
[01:09:40] now if you're looking at Python the
[01:09:41] highle language it is closer to the
[01:09:43] natural language than the machine
[01:09:45] language. So that means it is really
[01:09:47] easy to learn and to write and it's like
[01:09:49] an abstraction that hides the complexity
[01:09:52] of low-level and machine languages. So
[01:09:54] it's like a bridge between both worlds.
[01:09:56] So this is exactly what are programming
[01:09:58] languages and what [music] is Python.
[01:10:04] Okay. So now I would like you to
[01:10:06] understand [music] exactly how Python
[01:10:08] works. So let's say you open your editor
[01:10:10] and you start writing Python code and
[01:10:12] usually we write the code inside a
[01:10:13] Python file with the file extension of
[01:10:16] py. [music] So here you are writing the
[01:10:18] source code. Once you run your code, the
[01:10:20] computer cannot immediately execute your
[01:10:22] instructions because it is really high
[01:10:24] level and the computer needs a lower
[01:10:26] level of the code. So what happens?
[01:10:28] There is something called compiler in
[01:10:30] Python. It's going to take your Python
[01:10:32] code and translate it to another code.
[01:10:35] It's called the pyite [music] code with
[01:10:37] the extension pyc python compiled. And
[01:10:40] this can happen automatically and you
[01:10:42] will not even notice it. So what
[01:10:43] happened here? the compiler [music] of
[01:10:45] Python going to translate the highlevel
[01:10:48] language the Python code to a low-level
[01:10:51] language [music] the bite code. So as
[01:10:53] you can see it's really hard to
[01:10:54] understand the bite code compared to the
[01:10:56] Python code. So now our code is not yet
[01:10:59] executed right Python [music] just did a
[01:11:01] translation but now before anything to
[01:11:02] be executed what can happen Python might
[01:11:05] as well link some libraries. It is like
[01:11:07] pre-written chunks of codes that helps
[01:11:09] your code to do something specific like
[01:11:12] working with files, handling data and so
[01:11:14] on. Now Python has everything the bite
[01:11:16] code the libraries and now Python can
[01:11:19] run your bite code using something
[01:11:21] called Python virtual machine. So it's
[01:11:23] like a software that can understand the
[01:11:25] bite code of Python and take care of
[01:11:27] running it. So the Python virtual
[01:11:29] machine going to finally converts your
[01:11:31] instructions into machine codes once and
[01:11:33] zero because this is the only thing that
[01:11:35] your computer can really understand and
[01:11:37] once it runs you will see at the results
[01:11:40] whatever your program was designed to
[01:11:42] do. And now all those three steps
[01:11:44] compile, translate and execute we call
[01:11:47] them the Python interpreter. So it's
[01:11:49] like a toolbox that handles everything
[01:11:51] needed to [music] run a Python code. So
[01:11:53] again my friend you write Python code in
[01:11:56] a high language then Python compiles it
[01:11:58] and translate it into the byte code the
[01:12:01] low language and then the virtual
[01:12:02] machine going to take care of everything
[01:12:04] going to take your bite code some
[01:12:06] libraries and then it's going to run it
[01:12:08] at your computer to see the results. So
[01:12:10] this process happens every time you run
[01:12:12] a Python code and everything is of
[01:12:14] course automated [music]
[01:12:15] and behind the scenes.
[01:12:20] Okay. is now let's [music] talk about
[01:12:22] why we should learn Python in the first
[01:12:24] place. There are like too many
[01:12:26] programming languages out there. So why
[01:12:28] millions of people choose Python? So the
[01:12:30] first and most important reason is
[01:12:32] Python is very powerful and as well at
[01:12:34] the same time very simple. You can build
[01:12:36] really serious stuff with few line of
[01:12:38] codes. So if you compare like Java and
[01:12:40] Python in Java or C++ you can write like
[01:12:43] a lot of lines in order to do a simple
[01:12:45] task compared to Python where you need
[01:12:48] only like one or two line. So as you can
[01:12:50] see, Python is very simple compared to
[01:12:52] other programming languages. Okay,
[01:12:53] moving on to the second reason. Python
[01:12:55] is used literally everywhere and you can
[01:12:57] use it in order to build everything. So
[01:12:59] you can use it to build websites, to
[01:13:02] automate tasks, to work with data, to
[01:13:04] build games and even control robots. So
[01:13:07] whatever direction you want to go in
[01:13:09] tech, you will find probably Python.
[01:13:11] Okay, moving on to the next one. One of
[01:13:13] the best things about Python is the
[01:13:15] community. You are my friend, never
[01:13:17] alone. There are thousands of developers
[01:13:20] and experts like me sharing their
[01:13:22] knowledge in tutorials or they are like
[01:13:24] writing blog posts and as well
[01:13:26] developing and sharing free libraries
[01:13:28] opensource projects in GitHub and if you
[01:13:31] are stuck or have any complex task I'm
[01:13:33] sure there is like someone already
[01:13:34] solved it and probably made a YouTube
[01:13:36] video or a GitHub repo about it. So this
[01:13:39] type of community makes the language
[01:13:41] alive and when it comes to AI Python is
[01:13:44] leading the way. Almost everything you
[01:13:46] hear about today like Shajbt, image
[01:13:48] generations, self-driving car models,
[01:13:51] it's built with [music] Python because
[01:13:53] Python has an incredible ecosystem for
[01:13:55] AI and machine learning. So my friends,
[01:13:57] this means the programming language of
[01:13:59] AI is Python. And if you are interested
[01:14:01] in the future, then Python is the right
[01:14:03] language. And now because of all these
[01:14:05] reasons, Python is one of [music] the
[01:14:07] most in demand programming language in
[01:14:09] the world right now. If you check any
[01:14:11] job description in the tech world, you
[01:14:13] will see everyone across all industries,
[01:14:16] finance, healthcare, logistics, car
[01:14:18] manufacturing, everyone is requesting
[01:14:20] Python skill. So my friends, Python is
[01:14:23] really easy to learn, has incredible use
[01:14:25] cases. It is shaping the future with the
[01:14:27] AI. It has huge community and it is
[01:14:30] amazing for your career. So this is
[01:14:32] exactly why millions learns [music] and
[01:14:34] works with Python.
[01:14:39] Okay. So [music] now I hope that I got
[01:14:41] you motivated to learn Python and I will
[01:14:43] not leave you there. I'm going to show
[01:14:44] you now how to learn Python. So Python
[01:14:46] is like a journey and every journey
[01:14:48] needs a road map. So now everyone
[01:14:50] everybody going to start from the same
[01:14:52] place. Every expert, a data engineer,
[01:14:54] AI, web, gaming developers, everyone has
[01:14:57] to start at the same place. Everyone has
[01:14:59] to learn how to write a simple code,
[01:15:02] understand variables, how to work
[01:15:03] [music] with data types, how to control
[01:15:06] the flow of your codes and as well you
[01:15:08] have to learn about the functions and
[01:15:09] how to organize your code into small
[01:15:11] reusable blocks. So this stage builds
[01:15:13] your foundation and now after that
[01:15:15] everyone has to go to the intermediate
[01:15:17] level. Now we're going to level things
[01:15:18] up. You're going to start learning how
[01:15:20] to handle errors and exceptions, how to
[01:15:22] structure your code using the OOP,
[01:15:25] object-oriented programming, how to
[01:15:27] split your projects into modules, and
[01:15:29] how to work with files and many other
[01:15:31] stuff. So at this phase, you start
[01:15:33] feeling how your code is getting more
[01:15:35] professional, more like a real
[01:15:36] developer. So things going to make more
[01:15:38] sense here. And now after that, everyone
[01:15:40] going to start moving to the advanced
[01:15:41] level. And here you can start learning
[01:15:43] advanced techniques. We're going to go
[01:15:45] outside of our code in order to connect
[01:15:47] ourself with APIs in [music] order to
[01:15:49] grab data from the internets and we're
[01:15:51] going to learn stuff like how to test
[01:15:52] your codes and how to scrap website to
[01:15:55] collect data automatically and at this
[01:15:57] level you're going to start doing real
[01:15:59] projects and solving real problems. So
[01:16:01] now as you can see all those three
[01:16:02] levels they are the core of Python and
[01:16:05] everyone must learn these techniques
[01:16:07] because everyone need those techniques
[01:16:09] in whatever you are building in the
[01:16:11] future. And now after that of course we
[01:16:13] cannot keep learning everything. It's
[01:16:15] going to be really impossible. So you
[01:16:16] have to really make thoughts about which
[01:16:18] path you want to follow the direction
[01:16:20] that match your interest and your career
[01:16:22] goals. So you have to pick your path and
[01:16:25] there are like a lot of options like for
[01:16:26] example you could be a data engineer. If
[01:16:29] you'd like to work with pipelines,
[01:16:30] automations, moving data around then you
[01:16:34] have to learn something like sparkl
[01:16:36] processes and automations with python.
[01:16:38] [music] Another path you can use Python
[01:16:41] for data science if you enjoy working
[01:16:43] with data charts insights. If yes then
[01:16:46] you're going to be working with
[01:16:47] libraries like pandas numpy and plotty
[01:16:50] in order to analyze and visual your
[01:16:51] data. If you want to go deeper and build
[01:16:53] smart systems and models then you have
[01:16:56] to work with Python libraries like
[01:16:57] pytorch transformers and tensorflow and
[01:17:00] another option and path you could use
[01:17:02] python for web development. So if you
[01:17:05] want to build websites and web apps then
[01:17:07] you can use Python libraries and
[01:17:09] frameworks like Flask, Django and
[01:17:11] requests. And of course not only those
[01:17:13] four we have a lot of other options like
[01:17:15] using Python for game development. But
[01:17:18] as you can see at the expert level at
[01:17:19] the right side going to be almost
[01:17:21] impossible to learn all those libraries.
[01:17:23] So now if I look to this and tell you
[01:17:25] about my journey. So I learned
[01:17:27] everything in Python about data
[01:17:28] engineering and then with the time I
[01:17:30] start picking few libraries from the
[01:17:32] other options like from the data science
[01:17:34] I learned the plotty and the pandas and
[01:17:36] from the AI and machine learning the
[01:17:38] transformers. So I'm going to recommend
[01:17:39] you to pick first a path make yourself
[01:17:41] an expert there and with the [music]
[01:17:43] time start exploring the other stuff.
[01:17:49] So if you are new [music] to git don't
[01:17:51] worry about it. It is simpler than it
[01:17:52] sounds. So it's all about to have a safe
[01:17:54] place where you can put your codes that
[01:17:56] you are developing and you will have the
[01:17:58] possibility to track everything happens
[01:18:00] to the codes and as well you can use it
[01:18:02] in order to collaborate with your team
[01:18:04] and if [music] something goes wrong you
[01:18:05] can always roll back.
[01:18:11] Now designing [music] the data
[01:18:12] architecture it is exactly like building
[01:18:14] a house. So before construction starts,
[01:18:17] an architect's going to go and design a
[01:18:19] plan, a blueprint for the house. How the
[01:18:21] rooms will be connected, how to make the
[01:18:23] house functional, safe and wonderful.
[01:18:26] And without this blueprint from the
[01:18:28] architects, the builders might create
[01:18:29] something unstable, inefficient or maybe
[01:18:32] unlivable. The same goes for data
[01:18:34] projects. A data architect is like a
[01:18:36] house architecture. They design how your
[01:18:38] data will flow, integrate and be
[01:18:40] accessed. So as data architects we make
[01:18:42] sure that the data warehouse is not only
[01:18:44] functioning but also scalable and easy
[01:18:47] to maintain. And this is exactly what we
[01:18:49] will do now. We will play the role of
[01:18:51] the data architect and we will start
[01:18:53] brainstorming and designing the
[01:18:55] architecture of the data warehouse. So
[01:18:57] now I'm going to show you a sketch in
[01:18:58] order to understand what are the
[01:18:59] different approaches in order to design
[01:19:02] a data architecture. And this phase of
[01:19:04] the projects usually is very exciting
[01:19:05] for me because this is my main role in
[01:19:08] data projects. I am a data architect and
[01:19:10] I discuss a lot of different projects
[01:19:12] where we try to find out the best design
[01:19:14] for the projects. All right. So now
[01:19:16] let's go. Now the first step of building
[01:19:18] a data architecture is to make a very
[01:19:20] important decision to choose between
[01:19:22] four major types. The first approach is
[01:19:24] to build a data warehouse. It is very
[01:19:27] suitable if you have only structured
[01:19:29] data and your business want to build
[01:19:31] solid foundations for reporting and
[01:19:34] business intelligence. And another
[01:19:35] approach is to build a data leak. This
[01:19:38] one is way more flexible than a data
[01:19:40] warehouse where you can store not only
[01:19:42] structured data but as well semi and
[01:19:44] unstructured data. We usually use this
[01:19:46] approach if you have mixed types of data
[01:19:48] like database tables, logs, images,
[01:19:51] videos and your business want to focus
[01:19:53] not only on reporting but as well on
[01:19:55] advanced analytics or machine learning
[01:19:58] but it's not that organized like a data
[01:19:59] warehouse and data leaks. If it's too
[01:20:02] much unorganized and turns into data
[01:20:04] swamp and this is where we need the next
[01:20:06] approach. So the next one we can go and
[01:20:08] build data lake house. So it is like a
[01:20:11] mix between data warehouse and data
[01:20:13] lake. You get the flexibility of having
[01:20:16] different types of data from the data
[01:20:17] lake but you still want to structure and
[01:20:19] organize your data like we do in the
[01:20:21] data warehouse. So you mix those two
[01:20:23] words into one. And this is a very
[01:20:25] modern way on how to build that
[01:20:27] architecture and this is currently my
[01:20:28] favorite way of building data management
[01:20:30] system. Now the last and very recent
[01:20:32] approach is to build data mesh. So this
[01:20:35] is a little bit different. Instead of
[01:20:36] having centralized data management
[01:20:38] system, the idea now in the data mesh is
[01:20:40] to make it decentralized. You cannot
[01:20:42] have like one centralized data
[01:20:44] management system because always if you
[01:20:46] say centralized then it means
[01:20:48] bottleneck. So instead you have multiple
[01:20:50] departments and multiple domains where
[01:20:52] each one of them is building a data
[01:20:54] product and sharing it with others. So
[01:20:56] now you have to go and pick one of those
[01:20:57] approaches. And in this project we will
[01:20:59] be focusing on the data warehouse. So
[01:21:01] now the question is how to build the
[01:21:03] data warehouse. Well there is as well
[01:21:05] four different approaches on how to
[01:21:07] build it. The first one is the inmon
[01:21:09] approach. So again you have your sources
[01:21:11] and the first layer you start with the
[01:21:13] staging where the row data is landing
[01:21:15] and then the next layer you organize
[01:21:17] your data in something called enterprise
[01:21:19] data warehouse where you go and model
[01:21:22] the data using the third normal format.
[01:21:25] It's about like how to structure and
[01:21:27] normalize your tables. So you are
[01:21:28] building a new integrated data model
[01:21:31] from the multiple sources. And then we
[01:21:32] go to the third layer. It's called the
[01:21:34] data marts where you go and take like
[01:21:36] small subset of the data warehouse and
[01:21:39] you design it in a way that is ready to
[01:21:41] be consumed from reporting and it focus
[01:21:44] on only one topic like for example the
[01:21:46] customers sales or products and after
[01:21:49] that you go and connect your BI tool
[01:21:51] like PowerBI or Tableau to the data
[01:21:53] march. So with that you have three
[01:21:54] layers to prepare the data before
[01:21:56] reporting. Now moving on to the next one
[01:21:58] we have the Kimple approach. He says you
[01:22:00] know what building this enterprise data
[01:22:02] warehouse it is wasting a lot of time.
[01:22:05] So what we can do we can jump
[01:22:07] immediately from the stage layer to the
[01:22:09] final data ms because building this
[01:22:11] enterprise data warehouse it is a big
[01:22:13] struggle and usually waste a lot of
[01:22:15] time. So he always want you to focus and
[01:22:17] building the data ms quickly as
[01:22:19] possible. So it is faster approach than
[01:22:22] inmon but with the time you might get
[01:22:24] chaos in the data ms because you are not
[01:22:26] always focusing in the big picture and
[01:22:27] you might be repeating same
[01:22:29] transformations and integrations in
[01:22:31] different data ms. So there is like
[01:22:33] trade-off between the speed and
[01:22:35] consistent data warehouse. Now moving on
[01:22:37] to the third approach we have the data
[01:22:39] vault. So we still have the stage and
[01:22:41] the data marts but it says we still need
[01:22:44] this central data warehouse in the
[01:22:46] middle but this middle layer we're going
[01:22:48] to bring more standards and rules. So it
[01:22:50] tells you to split this middle layer
[01:22:52] into two layers the row vault and the
[01:22:55] business vault. In the row vault you
[01:22:57] have the original data but in the
[01:22:59] business vault you have all the business
[01:23:00] rules and transformations that prepares
[01:23:02] the data for the data march. So data
[01:23:04] vault it is very similar to the inmon
[01:23:06] but it brings more standards and rules
[01:23:09] to the middle layer. Now I'm going to go
[01:23:11] and add a fourth one that I'm going to
[01:23:13] call it medallion architecture and this
[01:23:16] one is my favorite one because it is
[01:23:18] very easy to understand and to build. So
[01:23:20] it says you're going to go and build
[01:23:22] three layers bronze, silver and gold.
[01:23:24] The bronze layer it is very similar to
[01:23:26] the stage but we have understood with
[01:23:28] the time that the stage layer is very
[01:23:30] important because having the original
[01:23:32] data as it is it going to helps a lot by
[01:23:35] traceability and finding issues. Then
[01:23:37] the next layer we have the silver layer.
[01:23:38] It is where we do transformations data
[01:23:41] cleansing but we don't apply yet any
[01:23:43] business rules. Now moving on to the
[01:23:45] last layer the gold layer. It is as well
[01:23:47] very similar to the data marts but there
[01:23:49] we can build different type of objects
[01:23:51] not only for reporting but as well for
[01:23:53] machine learning for AI and for many
[01:23:56] different purposes. So they are like
[01:23:58] business ready objects that you want to
[01:24:00] share as a data products. So those are
[01:24:02] the four approaches that you can use in
[01:24:04] order to build a data warehouse. So
[01:24:07] again if you are building a data
[01:24:08] architecture you have to specify [music]
[01:24:10] which approach you want to follow.
[01:24:16] What is exactly data [music] warehouse?
[01:24:18] Why the companies try to build such a
[01:24:20] data management system? So now the
[01:24:22] question is what is a data warehouse? I
[01:24:23] will just use the definition of the
[01:24:25] father of the data warehouse bill in
[01:24:28] one. A data warehouse is
[01:24:29] subject-oriented integrated time variant
[01:24:32] and nonvolatile collection of data
[01:24:34] designed to support the management's
[01:24:36] decision-m process. Okay, I I know that
[01:24:39] might be confusing. Subject-oriented. It
[01:24:41] means that a warehouse is always focused
[01:24:43] on a business area like the sales,
[01:24:45] customers, finance and so on. Integrated
[01:24:48] because it goes and integrate multiple
[01:24:49] source systems. Usually you build a
[01:24:51] warehouse not only for one source but
[01:24:54] for multiple sources. Time variance it
[01:24:56] means you can keep historical data
[01:24:58] inside the data warehouse. Nonvolatile
[01:25:00] it means once the data enter the data
[01:25:02] warehouse it is not deleted or modified.
[01:25:05] So this is how build and modified data
[01:25:07] warehouse. Okay. Okay, I'm going to show
[01:25:09] you the scenario where your company
[01:25:10] don't have a real data management. So
[01:25:13] now let's say that you have one system
[01:25:14] and you have like one data analyst has
[01:25:16] to go to this system and start
[01:25:18] collecting and extracting the data and
[01:25:20] then he going to spend days and
[01:25:22] sometimes weeks transforming the raw
[01:25:24] data into something meaningful. Then
[01:25:26] once they have the reports they're going
[01:25:28] to go and share it. And this data
[01:25:30] analyst is sharing the report using an
[01:25:32] Excel. And then you have like another
[01:25:33] source of data and you have another data
[01:25:35] analyst that she is doing maybe the same
[01:25:37] steps collecting the data spending a lot
[01:25:40] of time transforming the data and then
[01:25:42] share at the end like a report and this
[01:25:44] time she is sharing the data using
[01:25:46] PowerPoint and a third system and the
[01:25:48] same story but this time he is sharing
[01:25:50] the data using maybe PowerBI. So now if
[01:25:52] the company works like this then there
[01:25:54] is a lot of issues. First this process
[01:25:56] it take two way long. I saw a lot of
[01:25:59] scenarios where sometimes it takes weeks
[01:26:01] and even months until the employee
[01:26:03] manually generating those reports. And
[01:26:05] of course, what can happen for the
[01:26:07] users? They are consuming multiple
[01:26:09] reports with multiple state of the data.
[01:26:12] One report is 40 days old, another one
[01:26:14] 10 days and a third one is like 5 days.
[01:26:17] So it's going to be really hard to make
[01:26:18] a real decision based on this structure.
[01:26:21] A manual process is always slow and
[01:26:23] stressful and the more employees you
[01:26:25] involved in the process the more you
[01:26:27] open the door for human errors and
[01:26:29] errors of course in reports leads to bad
[01:26:32] decisions and another issue of course is
[01:26:34] handling the big data. If one of your
[01:26:36] sources generating like massive amount
[01:26:38] of data then the data analyst going to
[01:26:40] struggle collecting the data and maybe
[01:26:42] in some scenarios it will not be anymore
[01:26:44] possible to get the data. So the whole
[01:26:47] process kind of breaks and you cannot
[01:26:48] generate anymore fresh data for specific
[01:26:51] reports. And one last very big issue
[01:26:53] with that. If one of your stakeholders
[01:26:55] asks for an integrated report from
[01:26:58] multiple sources, well good luck with
[01:27:00] that because merging all those data
[01:27:02] manually is very chaotic, timeconuming
[01:27:05] and full of risk. So this is just a
[01:27:07] picture. If a company is working without
[01:27:09] a proper data management, without a data
[01:27:12] leak, data warehouse, data lake houses.
[01:27:14] So in order to make real and good
[01:27:16] decisions, you need data management. So
[01:27:19] now let's talk about the scenario of a
[01:27:21] data warehouse. So the first thing
[01:27:22] that's going to happen is that you will
[01:27:24] not have your data team collecting
[01:27:26] manually the data. You're going to have
[01:27:28] a very important component called ETL.
[01:27:31] ETL stands for extract, transform, and
[01:27:34] load. It is a process that you do in
[01:27:36] order to extract the data from the
[01:27:38] sources and then apply multiple
[01:27:39] transformations on those sources and at
[01:27:42] the end it loads the data to the data
[01:27:44] warehouse and this one can be the single
[01:27:46] point of truth for analyzes and
[01:27:48] reporting and it is called data
[01:27:50] warehouse. So now what can happen all
[01:27:52] your reports going to be consuming this
[01:27:55] single point of truth. So with that you
[01:27:57] create your multiple reports and as well
[01:28:00] you can create integrated reports from
[01:28:02] multiple sources not only from one
[01:28:04] single source. So now by looking to the
[01:28:06] right side it looks already organized
[01:28:08] right and the whole process is
[01:28:10] completely automated. There is no more
[01:28:12] manual steps which of course it reduces
[01:28:15] the human error and as well it is pretty
[01:28:17] fast. So usually you can load the data
[01:28:19] from the sources until the reports in
[01:28:22] matter of hours or sometimes in minutes.
[01:28:25] So there is no need to wait like weeks
[01:28:27] and months in order to refresh anything.
[01:28:30] And of course the big advantage is that
[01:28:31] the data warehouse itself it is
[01:28:33] completely integrated. So that means it
[01:28:36] goes and bring all those sources
[01:28:37] together in one place which makes it
[01:28:39] really easier for reporting and not only
[01:28:42] integrated you can build in the data
[01:28:44] warehouse as well history. So we have
[01:28:46] now the possibility to access historical
[01:28:48] data and what is also amazing that all
[01:28:51] those reports having the same data
[01:28:53] status. So all those reports can have
[01:28:55] the same status maybe sometimes one day
[01:28:57] old or something. And of course if you
[01:28:59] have a modern data warehouse in cloud
[01:29:01] platforms you can really easily handle
[01:29:04] any big data sources. So no need to
[01:29:06] panic if one of your sources is
[01:29:08] delivering massive amount of data. And
[01:29:10] of course in order to build the data
[01:29:12] warehouse you need different types of
[01:29:14] developers. So usually the one that
[01:29:15] builds the ETL component and the data
[01:29:18] warehouse is the data engineer. So they
[01:29:21] are the one that is accessing the
[01:29:22] sources, scripting the ATLs and building
[01:29:25] the database for the data warehouse. And
[01:29:27] now for the other part, the one that is
[01:29:29] responsible for that is the data
[01:29:31] analyst. They are the one that is
[01:29:33] consuming the data warehouse, building
[01:29:35] different data models and reports and
[01:29:38] sharing it with the stakeholders. So
[01:29:40] they are usually contacting the
[01:29:41] stakeholders, understanding the
[01:29:43] requirements and building multiple
[01:29:45] reports based on the data warehouse. So
[01:29:47] now if you have a look to those two
[01:29:48] scenarios, this is exactly why we need
[01:29:51] data management. [music]
[01:29:52] Your data team is not wasting time and
[01:29:55] fighting with the data. They are now
[01:29:57] more organized and more focused and with
[01:30:00] like a data warehouse and you are
[01:30:02] delivering professional and fresh
[01:30:04] reports that your company can count on
[01:30:06] in order to make good [music] and fast
[01:30:09] decisions. So this is why you need a
[01:30:11] data management like a data warehouse.
[01:30:13] Think about data warehouse as a busy
[01:30:15] restaurant. Every day different
[01:30:16] suppliers bring in fresh ingredients,
[01:30:19] vegetables, spices, meat, you name it.
[01:30:21] They don't just use it immediately and
[01:30:23] throw everything in one pot, right? They
[01:30:25] clean it, shop it, and organize
[01:30:27] everything and store each ingredients in
[01:30:30] the right place, fridge or freezer. So,
[01:30:32] this is the preparing phase. And when
[01:30:34] the order comes in, they quickly grab
[01:30:37] the prepared ingredients and create a
[01:30:39] perfect dish and then serve it to the
[01:30:41] customers of the restaurant. And this
[01:30:43] process is exactly like the data
[01:30:44] warehouse process. It is like the
[01:30:46] kitchen where the raw ingredients your
[01:30:48] data are cleaned, sorted and stored. And
[01:30:51] when you need a report or analyzes, it
[01:30:53] is ready to [music] serve up exactly
[01:30:55] like what you need.
[01:31:01] What is [music] exactly an ETL? So our
[01:31:03] data exist in a source system and now
[01:31:05] what we want to do is is to get our data
[01:31:08] from the source and move it to the
[01:31:10] target. Source and target could be like
[01:31:12] database tables. So now the first step
[01:31:14] that we have to do is to specify which
[01:31:16] data we have to load from the source. Of
[01:31:18] course we can say that we want to load
[01:31:20] everything. But let's say that we are
[01:31:22] doing incremental load. So we're going
[01:31:23] to go and specify a subset of the data
[01:31:26] from the source in order to prepare it
[01:31:28] and load it later to the target. So this
[01:31:30] step in the ATL process we call it
[01:31:32] extract. We are just identifying the
[01:31:34] data that we need. We pull it out and we
[01:31:37] don't change here anything. It's going
[01:31:38] to be like one to one like the source
[01:31:40] system. So the extract has only one task
[01:31:43] to identify the data that we have to
[01:31:45] pull out from the source and to not
[01:31:47] change anything. So we will not
[01:31:48] manipulate the data at all. It can stay
[01:31:51] as it is. So this is the first step in
[01:31:53] the ETL process, the extract. Now moving
[01:31:56] on to the stage number two. We're going
[01:31:57] to take this extract data and we will do
[01:32:00] some manipulations, transformations and
[01:32:03] we're going to change the shape of those
[01:32:05] data. And this process is really heavy
[01:32:07] working. We can do a lot of stuff like
[01:32:09] data cleansing, data integration and a
[01:32:12] lot of formatting and data
[01:32:13] normalizations. So a lot of stuff we can
[01:32:15] do in this step. So this is the second
[01:32:17] step in the ETL process. The
[01:32:19] transformation we're going to take the
[01:32:21] original data and reshape it, transform
[01:32:24] it into exactly the format that we need
[01:32:26] into a new format and shapes that we
[01:32:29] need for analyzes and reporting. Now
[01:32:31] finally we get to the last step in the
[01:32:32] ATL process. We have the load. So in
[01:32:35] this step we're going to take this new
[01:32:37] data and we're going to insert it into
[01:32:39] the target. So it is very simple. We're
[01:32:41] going to take this prepared data from
[01:32:43] the transformation step and we're going
[01:32:44] to move it into its final destination
[01:32:47] the target like for example data
[01:32:49] warehouse. So that's ETL in a nutshell.
[01:32:51] First extract the row data then
[01:32:53] transform it into something meaningful
[01:32:55] and finally load it to a target where
[01:32:58] it's going to make a difference. So
[01:32:59] that's it. This is what we mean with the
[01:33:01] ETL process. Now in real projects we
[01:33:04] don't have like only source and targets
[01:33:06] our data architecture going to have like
[01:33:08] multiple layers depend on your design
[01:33:10] whether you are building a warehouse or
[01:33:12] a data lake or a data warehouse and
[01:33:14] usually there are like different ways on
[01:33:16] how to load the data between all those
[01:33:18] layers and in order now to load the data
[01:33:20] from one layer to another one there are
[01:33:22] like multiple ways on how to use the ATL
[01:33:24] process. So usually if you are loading
[01:33:26] the data from the source to the layer
[01:33:28] number one like only extract the data
[01:33:30] from the source and load it directly to
[01:33:32] the layer number one without doing any
[01:33:34] transformations because I want to see
[01:33:36] the data as it is in the first layer.
[01:33:38] And now between the layer number one and
[01:33:40] the layer number two you might go and
[01:33:42] use the full ETL. So we're going to
[01:33:44] extract from the layer one transform it
[01:33:47] and then load it to the layer number
[01:33:49] two. So with that we are using the whole
[01:33:51] process the ETL. And now between layer
[01:33:53] two and layer three we can do only
[01:33:54] transformation and then load. [music] So
[01:33:56] we don't have to deal with how to
[01:33:58] extract the data because it is maybe
[01:34:00] using the same technology. And we are
[01:34:02] taking all data from layer 2 to layer
[01:34:04] three. So we transform the whole layer 2
[01:34:07] and then load it to layer three. And now
[01:34:09] between three and four you can use only
[01:34:11] the elm. So maybe it's something like
[01:34:13] duplicating and replicating the data and
[01:34:16] then you are doing the transformation.
[01:34:18] So you load to the new layer and then
[01:34:20] transform it. Of course, this is not a
[01:34:22] real scenario. I'm just showing you that
[01:34:24] in order to move from source to a
[01:34:25] target, you don't have always to use a
[01:34:28] complete ETL. Depend on the design of
[01:34:30] your data architecture, you might use
[01:34:32] only few components from the ETL. Okay.
[01:34:35] So, this is how ETL [music] looks like
[01:34:37] in real projects.
[01:34:42] [music]
[01:34:42] So, what is exactly data modeling?
[01:34:44] Usually the source system going to
[01:34:46] deliver for you row data unorganized,
[01:34:48] messy, not very useful in its current
[01:34:51] states. But now the data modeling is the
[01:34:54] process of taking this row data and then
[01:34:56] organize it and structure it in
[01:34:58] meaningful way. So what we are doing we
[01:35:00] are putting the data in new friendly and
[01:35:03] easy to understand objects like
[01:35:06] customers, orders, products. Each one of
[01:35:08] them is focused on specific information
[01:35:10] and what is very important is we're
[01:35:12] going to describe the relationship
[01:35:14] between those objects. So by connecting
[01:35:16] them using lines. So what we have built
[01:35:18] on the right side we call it logical
[01:35:20] data model. If you compare to the left
[01:35:22] side you can see the data model makes it
[01:35:24] really easy to understand our data and
[01:35:26] the relationship the processes behind
[01:35:28] them. Now in data modeling we have three
[01:35:30] different stages or let's say three
[01:35:32] different ways on how to draw a data
[01:35:33] model. The first stage is the conceptual
[01:35:36] data model. Here the focus is only on
[01:35:38] the entity. So we [music] have
[01:35:40] customers, orders, products and we don't
[01:35:42] go in details at all. So we don't
[01:35:44] specify any columns or attributes inside
[01:35:46] those boxes. We just want to focus what
[01:35:48] are the entities that we have and as
[01:35:50] well the relationship between them. So
[01:35:52] the conceptual data model don't focus at
[01:35:55] all on the details. It just gives the
[01:35:57] big picture. So the second data model
[01:35:59] that we can build is the logical data
[01:36:01] model. And here we start specifying what
[01:36:03] are the different columns that we can
[01:36:05] find in each entity like we have the
[01:36:07] customer ID, the first name, last name
[01:36:09] and so on. And we still draw the
[01:36:11] relationship between those entities and
[01:36:13] as well we make it clear which columns
[01:36:15] are the primary key and so on. So as you
[01:36:16] can see we have here more details but
[01:36:18] one thing we don't describe a lot of
[01:36:20] details for each column and we are not
[01:36:22] worry how exactly we going to store
[01:36:25] those tables in the database. The third
[01:36:27] and last stage we have the physical data
[01:36:29] model. This is where everything gets
[01:36:31] ready before creating it in the
[01:36:33] database. So here you have to add all
[01:36:35] the technical details like adding for
[01:36:37] each column the data types and the
[01:36:39] length of each data type and many other
[01:36:42] database techniques and details. So
[01:36:44] again if you look to the conceptual data
[01:36:45] model it gives us the big picture and in
[01:36:48] the logical data model we dive into
[01:36:49] details of what data we need and the
[01:36:52] physical layer model prepares everything
[01:36:54] for the implementation in the database.
[01:36:56] And to be honest in my projects I only
[01:36:59] draw the conceptual and the logical data
[01:37:01] model because drawing and building the
[01:37:03] physical data model needs a lot of
[01:37:05] efforts and time and there are many
[01:37:07] tools like in data bricks they
[01:37:08] automatically generate those models. So
[01:37:11] in this project what we're going to do
[01:37:12] we're going to draw the logical data
[01:37:14] model for the gold layer.
[01:37:20] All right. It's now for analytics and
[01:37:21] especially for data warehousing and
[01:37:23] business intelligence. We need a special
[01:37:25] data model that is optimized for
[01:37:27] reporting and analytics and it should be
[01:37:29] flexible, scalable and as well easy to
[01:37:32] understand. And for that we have two
[01:37:34] special data models. The first type of
[01:37:36] data model we have the star schema. It
[01:37:38] has a central fact table in the middle
[01:37:40] and surrounded by dimensions. The fact
[01:37:42] table contains transactions, events, and
[01:37:45] the dimensions contains descriptive
[01:37:47] informations. And the relationship
[01:37:48] between the fact table in the middle and
[01:37:50] the dimensions around it forms like a
[01:37:53] star shape. And that's why we call it
[01:37:54] star schema. And we have another data
[01:37:56] model called snowflake schema. It looks
[01:37:59] very similar to the star schema. So we
[01:38:01] have again the fact in the middle and
[01:38:03] surrounded by dimensions. But the big
[01:38:05] difference is that we break the
[01:38:07] dimensions into smaller subdimensions.
[01:38:10] And the shape of this data model as you
[01:38:12] are extending the dimensions it's going
[01:38:14] to looks like a snowflake. So now if you
[01:38:16] compare them side by side you can see
[01:38:17] that the star schema looks easier right?
[01:38:20] So it is usually easy to understand easy
[01:38:22] to query. It is really perfect for
[01:38:24] analyzers but it has one issue with
[01:38:26] that. The dimension might contain
[01:38:28] duplicates and your dimensions get
[01:38:30] bigger with the time. Now if you compare
[01:38:32] to the snowflake you can see the schema
[01:38:34] is more complex. You saw you need a lot
[01:38:36] of knowledge and efforts in order to
[01:38:38] query something from the snowflake. But
[01:38:40] the main advantage here comes with the
[01:38:42] normalization as you are breaking those
[01:38:44] redundancies in small tables. You can
[01:38:46] optimize the storage but to be honest
[01:38:48] who care about the storage. So for this
[01:38:50] project I have choose to use the star
[01:38:52] schema because it is very commonly used
[01:38:54] perfect for reporting like for example
[01:38:56] if you're using PowerBI and we don't
[01:38:59] have to worry about the storage. So
[01:39:00] that's why we're going to adopt this
[01:39:02] model to build our gold layer.
[01:39:08] Okay. So now one more thing about those
[01:39:09] data models is that they contain two
[01:39:11] types of tables fact and dimensions.
[01:39:13] [music]
[01:39:14] So when I say this is a fact table or a
[01:39:16] dimension table well the dimension
[01:39:18] contains descriptive informations or
[01:39:20] like categories that [music] gives some
[01:39:22] context to your data. For example a
[01:39:24] product info you have product name
[01:39:26] category subcategories and so on. This
[01:39:28] is like a table that is describing the
[01:39:30] products [music] and this we call it
[01:39:32] dimension. But in the other hand we have
[01:39:34] facts. They are events like
[01:39:36] transactions. They contain three
[01:39:38] important informations. First you have
[01:39:40] multiple ids from multiple dimensions.
[01:39:43] Then we have like date informations like
[01:39:45] when the transaction or the event did
[01:39:48] happen. And the third type of
[01:39:49] information you're going to have like
[01:39:50] measures and numbers. So if you see
[01:39:52] those three types of data in one table,
[01:39:55] then this is effect. So if you have a
[01:39:57] table that answers how much or how many,
[01:40:00] then this is effect. But if you have a
[01:40:02] table that answers who, what, where,
[01:40:04] then this is a dimension table. [music]
[01:40:06] So this is what dimension and fact
[01:40:08] tables.
[01:40:16] [music] All right, my friends. So we're
[01:40:18] going to start with a secret, a little
[01:40:20] trick that I usually do by analyzing any
[01:40:23] data sets. So let's start with a little
[01:40:24] coffee before [music] we start. H, this
[01:40:27] is really hot. Okay. So the secret says
[01:40:30] as I'm looking to any data sets in any
[01:40:32] projects I see the data always divided
[01:40:34] between dimensions and measures.
[01:40:39] >> What truth?
[01:40:40] >> You take the blue pill, you take the red
[01:40:42] pill. All I'm offering is the truth.
[01:40:44] Nothing more.
[01:40:47] If you see your data like me as
[01:40:49] dimensions and measures, you can
[01:40:51] generate like endless amount of insights
[01:40:54] from any projects from any data sets and
[01:40:56] you will find me through the projects
[01:40:58] that I'm always speaking about measures
[01:41:00] and dimensions. So I'm going to show you
[01:41:02] how I usually do it. So now usually by
[01:41:04] looking to any data set in any project.
[01:41:06] So you have like multiple columns and
[01:41:08] rows. Here I see the data always
[01:41:10] splitted into two categories either a
[01:41:12] dimension or a measure. And now of
[01:41:14] course the question is here is my column
[01:41:17] a dimension or a measure? Well in order
[01:41:19] to assign it to one of those categories
[01:41:21] you have to ask the first question is it
[01:41:24] a numeric value? If it's not so you have
[01:41:26] like string or date or any other data
[01:41:28] type then it is a dimension and if it is
[01:41:31] yes a numeric then you have to ask the
[01:41:34] second question does it make sense to
[01:41:36] aggregate it. So if the answer for both
[01:41:39] questions is yes, it is numeric and it
[01:41:41] makes sense to aggregate it then it is a
[01:41:44] measure otherwise it is a dimension. Now
[01:41:46] let's practice and have some examples.
[01:41:48] So now by looking to the values of the
[01:41:50] column category you can see all the
[01:41:52] values are characters so it is not
[01:41:54] numeric that means this column is a
[01:41:56] dimension. So it is very simple. Let's
[01:41:58] take another column. We have the sales
[01:42:00] amount. So now as you can see the values
[01:42:02] are numeric and as well it makes sense
[01:42:04] to aggregate those values. we can get
[01:42:06] the total sales or the average sales and
[01:42:08] so on. So it fulfill both of the
[01:42:10] conditions. It is numeric and it makes
[01:42:12] sense to aggregate it. That's why we say
[01:42:14] sales is a measure. Now if you're
[01:42:16] checking the values of the product name,
[01:42:18] you can see that all of them are
[01:42:20] characters and names. So it is not
[01:42:22] numeric. That means the product is a
[01:42:24] dimension. Moving on to the next one, we
[01:42:26] have the quantity. The values are
[01:42:28] numeric and as well it makes sense to
[01:42:30] aggregate it. Can summarize all those
[01:42:32] values to have the total quantity. So
[01:42:34] quantity is a measure. Now if you are
[01:42:35] looking to the values of the birth dates
[01:42:37] you can see this is a date information
[01:42:39] it is not numeric so that means it is a
[01:42:42] dimension right but if you calculate the
[01:42:44] age from the birth dates age of the
[01:42:47] customer going to be in numeric and it
[01:42:49] makes sense to aggregate it for example
[01:42:51] finding the average age of customers. So
[01:42:54] if we derive a numeric value from a
[01:42:57] dimension then we can use [music] it as
[01:42:59] a measure. So age is a measure and now
[01:43:01] we come to something really tricky. This
[01:43:03] is the ID. So for example, if you are
[01:43:05] checking the customer ID, you can see
[01:43:07] all those values are numeric. So the
[01:43:10] first condition is fulfilled. Now the
[01:43:12] very important question does it make
[01:43:14] sense to aggregate the ids? Well, those
[01:43:17] ids are unique identifier for a customer
[01:43:19] and if you find like the average of that
[01:43:21] it is not like helpful, right? I cannot
[01:43:23] think of one use case of aggregating the
[01:43:26] customer ID like having the average of
[01:43:28] all those ids or summarizing the ids. So
[01:43:32] it makes no sense to aggregate it.
[01:43:34] That's why we can consider the ID of a
[01:43:36] customer as a dimension not as a
[01:43:38] measure. So as you can see it is very
[01:43:40] simple. If it is numeric and it makes
[01:43:42] sense to aggregate then it is measure
[01:43:44] otherwise it is a dimension. And this is
[01:43:47] the foundations of any data analytics.
[01:43:49] If you see your data as dimensions and
[01:43:52] measures you can generate a lot of use
[01:43:54] cases and insights from your data sets.
[01:43:57] Now I totally understand if you are
[01:43:58] still confused about dimensions and
[01:44:00] measures and you might be asking why do
[01:44:02] I need measures and dimensions. Well, if
[01:44:04] you are doing any type of data analysis
[01:44:06] or you are exploring any data sets, you
[01:44:08] will be end up always like grouping up
[01:44:10] the data by something like you are
[01:44:12] grouping the data by countries or
[01:44:14] grouping the data by for example
[01:44:16] products or categories. So we need
[01:44:19] dimensions to group up our data and in
[01:44:21] the other sides you will be asking
[01:44:23] questions like how much, how many, what
[01:44:25] is the total of something. So you always
[01:44:27] need to aggregate or calculate something
[01:44:29] right and for that you need the measure.
[01:44:31] So we need the measures in order to
[01:44:33] answer the question how many and how
[01:44:35] much and we need the dimensions in order
[01:44:38] to group up the data by something. So
[01:44:40] that's why almost in any type of data
[01:44:42] analyzes you need dimensions and
[01:44:44] measures and this can be [music] more
[01:44:45] clear as we progress in the projects.
[01:44:51] I [music] want to tell you a secret
[01:44:53] principle concept that each data
[01:44:55] architect must know and that is the
[01:44:58] separation of concerns. So what is that?
[01:45:00] As you are designing an architecture,
[01:45:02] you have to make sure to break down the
[01:45:04] complex system into smaller independent
[01:45:07] parts and each part is responsible for a
[01:45:10] specific task. And here comes the magic.
[01:45:12] The component of your architecture must
[01:45:14] not be duplicated. So you cannot have
[01:45:17] two parts are doing the same thing. So
[01:45:19] the idea here is to not mix everything
[01:45:22] and this is one of the biggest mistakes
[01:45:24] in any big projects and I have shown
[01:45:26] that almost everywhere. So a good data
[01:45:28] architects follow this concept this
[01:45:31] principle. So for example if you are
[01:45:33] looking to our data architecture we have
[01:45:35] already done that. So we have defined
[01:45:37] unique set of tasks for each layer. So
[01:45:40] for example we have said in the silver
[01:45:42] layer we do data cleansing but in the
[01:45:44] gold layer we do business
[01:45:46] transformations and with that you will
[01:45:48] not be allowing to do any business
[01:45:49] transformations in the silver layer and
[01:45:51] the same thing goes for the gold layer.
[01:45:53] You don't do in the gold layer any data
[01:45:55] cleansing. So each layer has its own
[01:45:57] unique tasks and the same thing goes for
[01:46:00] the bronze layer and the silver layer.
[01:46:01] You do not allow to load data from the
[01:46:04] source systems directly to the silver
[01:46:06] layer because we have decided the
[01:46:08] landing layer. The first layer is the
[01:46:10] bronze layer otherwise you will have
[01:46:12] like set of source systems that are
[01:46:14] loaded first to the bronze layer and
[01:46:16] another set is skipping the layer and
[01:46:18] going to the silver. And with that we
[01:46:20] have overlapping you are doing data
[01:46:22] ingestion in two different layers. So my
[01:46:24] friends if you have this mindsets
[01:46:26] separation of concerns I [music] promise
[01:46:28] you you're going to be a top data
[01:46:30] architect.
[01:46:34] All right, great. So with that we
[01:46:36] [music] have a data model and we can say
[01:46:37] we have something called a data products
[01:46:39] and we will be sharing this data product
[01:46:42] with different type of users and there
[01:46:44] is something that every data products
[01:46:46] absolutely needs and that is [music] the
[01:46:48] data catalog. It is a document that can
[01:46:51] describe everything about your data
[01:46:52] model. columns, the tables, maybe the
[01:46:55] relationship between the tables as well.
[01:46:57] And with that, you make your data
[01:46:58] product clear for everyone. And it's
[01:47:00] going to be for them way easier to
[01:47:02] derive more insights and reports from
[01:47:04] your data product. And what is the most
[01:47:06] important one? It is time-saving because
[01:47:09] if you don't do that, what can happen?
[01:47:11] Each consumer, each user of your data
[01:47:13] product will keep asking you the same
[01:47:15] questions about what do you mean with
[01:47:17] this column? What is this table? How to
[01:47:18] connect the table A with the table P?
[01:47:20] and you will keep repeating yourself and
[01:47:22] explaining stuff. So instead of that you
[01:47:25] prepare a data catalog, a data model and
[01:47:27] you deliver everything together to the
[01:47:29] users and with that you are saving a lot
[01:47:31] of time and stress. I know it is
[01:47:33] annoying to create a data catalog but it
[01:47:35] is investments [music] and best
[01:47:36] practices.
[01:47:41] So let's keep it simple. An AI engineer
[01:47:43] is someone that builds [music]
[01:47:44] systems that use AI to solve real
[01:47:48] business problems. So that means you are
[01:47:50] not building a shad GBT or you are not
[01:47:53] training a data model. You are building
[01:47:56] [music] an AI system. So what is inside
[01:47:58] an AI system? You're going to find the
[01:47:59] following components. You will be
[01:48:01] connecting AI models like the open AI
[01:48:04] models from hugging face. You will be
[01:48:07] connecting the company's data,
[01:48:09] databases, files, documents and you will
[01:48:12] be connecting as well the company's
[01:48:14] tools and apps like emails, some
[01:48:16] internal services and stuff and
[01:48:18] interfaces where the user going to
[01:48:20] interact with the AI system. So as you
[01:48:22] can see you are just connecting stuff in
[01:48:25] one place and your main job is to make
[01:48:28] everything correct, secure, [music]
[01:48:30] fast, scalable and cost efficient.
[01:48:36] >> [music]
[01:48:36] >> Prompt engineering. Most people think
[01:48:38] prompt engineering is just like you are
[01:48:40] typing something into Shajib and hoping
[01:48:43] for a good answer. But there is actually
[01:48:45] a skill behind it. And it's all about
[01:48:48] how you communicate with a model so it
[01:48:51] [music] understand exactly what you want
[01:48:53] and how to get a tailored answer exactly
[01:48:56] how you expect. So that means you're
[01:48:58] going to give a detailed and clear
[01:49:00] instructions and context. [music] You're
[01:49:02] going to tell the model who it is and
[01:49:04] what its role and show examples of the
[01:49:07] results that you want. Each time you get
[01:49:09] an answer from the AI, you're going to
[01:49:12] review it and [music] improve your
[01:49:14] prompts step by step.
[01:49:19] Open AAI API. So the API going to let
[01:49:22] you interact with the same model but
[01:49:24] with one big difference inside your own
[01:49:27] app, inside your website, your products.
[01:49:29] So with that you are building like a
[01:49:31] chatbot assistance inside your tools and
[01:49:34] it is very simple to do. It's just few
[01:49:37] lines of Python. [music] You send a
[01:49:39] prompt, you get a response and you're
[01:49:40] going to display it wherever you want.
[01:49:42] So
[01:49:47] as you are building your AI systems,
[01:49:48] you're going to notice that you cannot
[01:49:50] rely completely on OpenAI because
[01:49:53] actually they are closed source. That
[01:49:55] means you have no control over the
[01:49:57] model. You cannot see how exactly it
[01:49:59] works. You're going to pay a lot for
[01:50:01] using tokens every time and if you are
[01:50:04] not using Azure your company data going
[01:50:06] to leave your environment and this is a
[01:50:08] huge problem for many businesses. So
[01:50:10] this is where the hacking phase comes
[01:50:12] into the picture. It is the biggest
[01:50:15] community library for AI models. There
[01:50:18] are already more than 2 million models
[01:50:20] over there and the best thing most of
[01:50:22] them are for free. That means you can go
[01:50:24] and find a model for almost any problem
[01:50:27] that you might encounter. And my friend,
[01:50:30] this is exactly why I keep repeating the
[01:50:32] same thing. As a data science in
[01:50:34] industry, we don't have anymore to train
[01:50:36] anything from the scratch. All what you
[01:50:38] have to do is just to find the right
[01:50:40] model for your business case and just
[01:50:43] fine-tune it. And the big advantage with
[01:50:45] the hugging face is that you can
[01:50:47] download the models locally at your
[01:50:49] machine and you can deploy it anywhere
[01:50:52] you want. With that you stay in control.
[01:50:54] You can use sensitive data for the model
[01:50:57] because nothing is leaving your
[01:50:59] environment and as well you reduce the
[01:51:01] [music] costs.
[01:51:06] So now let's say that you can talk to AI
[01:51:08] models, you can write really good
[01:51:09] prompts, you can build really cool
[01:51:11] demos, but this is not enough to build
[01:51:14] an AI system because you need to connect
[01:51:17] everything together. And this is where
[01:51:19] Langshain comes in. So you can use
[01:51:22] langin in order to orchestrate the whole
[01:51:24] process to connect all the models that
[01:51:26] you need the tools the memory and as
[01:51:29] well to build your business logic. So
[01:51:31] the AI can take multiple steps in order
[01:51:34] to complete a full task. And this is
[01:51:36] exactly what AI system does. It's not
[01:51:38] just like one prompt and [music] one
[01:51:39] answer.
[01:51:44] [music] We going to learn about rag
[01:51:46] retrieval augmented generation. The
[01:51:48] issue is that all the AI models are
[01:51:50] actually pre-trained using public data.
[01:51:53] [music] But of course, the company's
[01:51:54] data are protected and not available
[01:51:57] publicly, which means AI models has no
[01:52:01] idea about your company's data. So we
[01:52:03] have somehow to connect the company's
[01:52:05] data to the AI model. And here comes the
[01:52:07] concept of rag. So here how it works.
[01:52:10] First you take all the company's data,
[01:52:12] PDFs and files and you want to store
[01:52:15] them into something called Victor
[01:52:17] database. So all what you have to do is
[01:52:19] to turn a text into something fancy
[01:52:21] called embedding. It is just
[01:52:23] representing your text with numbers and
[01:52:26] then load and store all those embeddings
[01:52:28] inside the Victor database. So once a
[01:52:31] user ask a question it going to turn
[01:52:33] into embedding and the system going to
[01:52:35] start comparing and searching for the
[01:52:37] closest match using semantic search. So
[01:52:40] once it finds the right information the
[01:52:43] LLM model going to turn it into
[01:52:45] response. So it's all about adding a
[01:52:47] memory to the LLM model to use your real
[01:52:51] data instead of relying on what they
[01:52:53] were trained on.
[01:52:58] We all used AI [music] chatbots like the
[01:53:00] Shajbet. You write a prompt and it gives
[01:53:03] you back a text. But of course, this is
[01:53:05] not enough. Companies want more than a
[01:53:08] nice answer on the screen. They want an
[01:53:11] AI that actually gets the works done.
[01:53:14] And this is exactly why we have AI
[01:53:16] agents. An agent does first the thinking
[01:53:19] and then it going to take a real action.
[01:53:21] like for example maybe talking to a
[01:53:23] database, updating records, calling an
[01:53:26] API and maybe triggering a workflow and
[01:53:30] as well it is great in order to automate
[01:53:32] a lot of boring task that we do normally
[01:53:34] at the work like reading the incoming
[01:53:36] emails and responding to it creating a
[01:53:39] summary of a meeting and as well
[01:53:42] creating those boring Jira and Service
[01:53:44] Now tickets. So it is way more than just
[01:53:46] a chat with AI. It is an AI that is
[01:53:49] actually [music] doing a work.
[01:53:54] The MCP model context protocol. Now AI
[01:53:58] agents can only take real actions like
[01:54:00] checking emails, querying a database or
[01:54:03] calling an API only if they are
[01:54:05] connected to the external sources. And
[01:54:07] here my friend, there are like two big
[01:54:09] problems. First, if you connect your AI
[01:54:12] agents directly to the productive
[01:54:14] database, this is really risky. And the
[01:54:16] second issue, we have a lot of external
[01:54:18] systems and you're going to end up
[01:54:20] writing and building connectors for each
[01:54:23] tool. And this takes a lot of time and
[01:54:25] efforts in order to create a new
[01:54:27] connectors each time you are connecting
[01:54:30] a new system to the AI agents. [music]
[01:54:32] And this is exactly why we have MCB. It
[01:54:35] fixed those issues. We're going to add a
[01:54:37] safe and standard layer between your AI
[01:54:40] agents and your sources. And this has my
[01:54:43] friend a lot of benefits. First of all,
[01:54:45] you can plug and play any system to your
[01:54:48] AI agents without creating each time a
[01:54:51] new connectors using this layer going to
[01:54:53] give you a full control on how the AI
[01:54:56] going to interact with your sources
[01:54:58] where you can add a lot of policies in
[01:55:00] order to protect your external sources.
[01:55:03] So using MCB servers and protocols, it
[01:55:06] going to makes everything like faster.
[01:55:08] You can connect a lot of things and as
[01:55:10] well you're going to feel safe
[01:55:11] connecting AI agents to your sources. If
[01:55:14] you like this type of content where I'm
[01:55:16] sketching stuff and I'm showing you all
[01:55:18] things behind the scenes, then support
[01:55:19] my work by subscribing, liking, and
[01:55:21] commenting and sharing it with other
[01:55:23] people like you. And you can check my
[01:55:25] website for the other courses like the
[01:55:26] SQL and Tableau. You can follow my
[01:55:29] written content in my newsletter and
[01:55:31] LinkedIn. if you're still here. Thank
[01:55:32] you so much for watching and I will see
[01:55:34] you in [music] the next video. Bye-bye.
[01:55:40] >> [music]
