Data analysis needs a massive amount of information to work on and draw insights from. This data is collected from various structured and unstructured sources. Once it is collected, the challenge is to store it somewhere the organization can easily access and manage. For this purpose, we use Data Lakes or Data Warehouses.
Both of these storage mediums allow users to store large and complex data (Big Data), yet the terms aren't synonymous. There are considerable differences between them in how the data is collected and stored, the type of data stored, and the purpose of storage.
What is a Data Warehouse
Much like a shop warehouse, where the owner keeps items in storage and safekeeping for later use, a data warehouse stores data for an organization. Wikipedia defines data warehouses (DWs) as central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place.
A Data Warehouse follows the policy of defining the goal first and then collecting the related data. The purpose of data collection, the type of data required, and the formats eligible for storage are discussed and decided before the data collection process begins. Data Warehouses store only the necessary, structured data that will be useful for the purpose at hand, which reduces storage consumption and therefore cost.
The stored data is primarily structured, i.e. processed rather than raw. This makes it easier to use without constant filtering and cleansing. It also helps data users understand the data better and make informed use of it.
The main drawback of Data Warehouses is that data manipulation is difficult, because making changes to structured data is more complex and more expensive.
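The "define first, collect later" policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for a real warehouse, and the `sales` table, its columns, and the `load_rows` helper are hypothetical names that do not come from any particular product.

```python
import sqlite3


def create_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in for a warehouse with a schema fixed up front."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the structure is decided before any data arrives.
    conn.execute(
        """
        CREATE TABLE sales (
            order_id INTEGER PRIMARY KEY,
            region   TEXT NOT NULL,
            amount   REAL NOT NULL CHECK (amount >= 0)
        )
        """
    )
    return conn


def load_rows(conn: sqlite3.Connection, rows: list) -> list:
    """Insert rows that satisfy the schema; return the rows that were rejected."""
    rejected = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
                (row.get("order_id"), row.get("region"), row.get("amount")),
            )
        except sqlite3.IntegrityError:
            # Missing region, negative amount, or duplicate order_id.
            rejected.append(row)
    conn.commit()
    return rejected
```

Rows that violate the agreed structure never reach the table, which is why the stored data stays clean. The same rigidity is what makes later changes costly: adding a new kind of field means changing the schema first.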
What is a Data Lake
Data Lakes are storehouses of unstructured, structured, and semi-structured data whose purpose is identified at a later time. In contrast with Data Warehouses, Data Lakes have a policy of collecting data first and defining its purpose later. Wikipedia defines a data lake as a system or repository of data stored in its natural/raw format.
Data is collected without a specific purpose in mind. Its use and purpose are defined as and when requirements arise in the organization. This usually gives the organization more flexibility in using the data. However, it requires more storage space than a Data Warehouse does.
Since the data is raw, unfiltered, and unprocessed, it lends itself to analysis with machine learning algorithms, which can work on data in its original form. However, raw data poses the risk of creating Data Swamps, where unnecessary data accumulates without appropriate data quality and data governance policies in place.
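The "collect first, define the purpose later" policy can be sketched as follows. This is a minimal illustration, not a real data lake: a plain Python list of JSON lines stands in for lake storage, and the `ingest` and `read_orders` helpers, the metadata fields, and the sample payloads are all hypothetical.

```python
import json
from datetime import datetime, timezone


def ingest(lake: list, payload, source: str) -> None:
    """Store a record exactly as received, wrapped with minimal metadata."""
    lake.append(json.dumps({
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,  # raw content, no filtering or reshaping
    }))


def read_orders(lake: list) -> list:
    """Decide the purpose at read time: pick out records that look like orders."""
    orders = []
    for line in lake:
        payload = json.loads(line)["payload"]
        if not (isinstance(payload, dict) and "order_id" in payload and "amount" in payload):
            continue  # not an order; left untouched in the lake
        try:
            amount = float(payload["amount"])
        except (TypeError, ValueError):
            continue  # amount cannot be interpreted as a number
        orders.append({"order_id": payload["order_id"], "amount": amount})
    return orders
```

Everything is accepted on the way in, and structure is imposed only when a question is asked. Recording where each record came from and when it arrived is one small habit that helps keep such a store from turning into a Data Swamp.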
|DATA WAREHOUSE|DATA LAKE|
|---|---|
|Stores structured, processed data.|Stores structured, semi-structured, and unstructured data in raw form.|
|The purpose of data collection is defined prior to collection.|The purpose of the collected data is defined after collection.|
|Compact storage space is required, since only as much data as needed for the defined purpose is stored.|No defined limit on storage space, since data is collected from sources without a specific purpose.|
|Only required data is collected from accurate sources, since the purpose is clearly defined.|Unnecessary data may be stored that might never be used.|
|Data is processed and filtered, and is therefore easier for users to understand.|Data is raw and unfiltered, and therefore carries the risk of creating Data Swamps.|
|Since the data is structured, it is difficult and costly to manipulate and change.|Data manipulation is easy.|
|Can be understood and used by most business professionals in the organization.|Used by Data Scientists who have a clear understanding of unstructured data and analysis tools.|
Data Lakes or Data Warehouses: Which One to Use?
Now that we have a clear picture of what Data Lakes and Data Warehouses are, what they are used for, and their advantages and disadvantages, the question becomes: which one should we use, and when?
We know that neither one of them can be a replacement for the other. Both of these storage mediums are valuable for different scenarios.
When we know we need data for a specific purpose, we want that data to be ready to use, filtered, structured, and accurate so that the analysis can be quick and reliable. In that case, we use Data Warehouses.
For example, data related to a department such as marketing or sales can be stored together in a data warehouse so that marketing-related data is easier to access. This also aids performance analysis by making it convenient to create well-informed dashboards and reports, and it lets employees of the same department access related data efficiently without having to integrate it again and again.
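As a rough sketch of the reporting use case, the snippet below runs the kind of aggregate query a dashboard might issue against departmental data. It assumes an in-memory SQLite database as a stand-in for the warehouse; the `campaign_sales` table, its columns, and the sample figures are invented purely for illustration.

```python
import sqlite3


def build_sample_warehouse() -> sqlite3.Connection:
    """Create a tiny, already-cleaned marketing table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE campaign_sales ("
        "campaign TEXT NOT NULL, region TEXT NOT NULL, revenue REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO campaign_sales VALUES (?, ?, ?)",
        [
            ("spring_email", "EU", 100.0),
            ("spring_email", "US", 80.0),
            ("summer_ads", "EU", 50.0),
        ],
    )
    conn.commit()
    return conn


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """Return total revenue per region, as a report or dashboard tile might."""
    rows = conn.execute(
        "SELECT region, SUM(revenue) FROM campaign_sales "
        "GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)
```

Because the data is already structured and integrated, the report is a single query with no cleaning step in front of it.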
On the other hand, Data Lakes are used when we know that the required data might not always be in a structured format. Data Lakes collect data in bulk, which makes analysis and prediction more flexible. For example, in the education sector or the medical field, where data is usually stored in varying formats such as written documents, databases, and files, we need Data Lakes to collect and organize the data.
Therefore, to make better use of the data at hand and to make it accessible, useful, and beneficial for the organization, we need to choose the right storage option. The choice is crucial because it affects the expenses, the time and effort involved, and the organization's overall analysis and decision-making.
Now that you know about data lakes and data warehouses, explore the basics of Data Governance.