A data warehouse is often the data source for business intelligence and machine learning systems. Here’s what you need to know about it.
Databases are commonly categorized as relational (SQL) or NoSQL, and as transactional (OLTP), analytical (OLAP) or hybrid (HTAP). Specialized databases, e.g., for individual departments, were initially seen as an advance but were later derided as “islands”.
An attempt to create a single database of all corporate data is called a data lake if the data remains in its original form. If the data is converted into a standard format, it is called a data warehouse. Subsets of a data warehouse are called data marts.
A data warehouse is an analytical database created from two or more data sources. Data warehouses typically store historical data and can grow to petabytes in size. They often have large amounts of computational and storage resources to run complex queries and produce reports. They often serve as data sources for business intelligence and machine learning systems.
One of the main reasons for using a data warehouse is that the write throughput required of transactional (OLTP) databases limits the number and type of indexes you can create on them, which slows down analytical queries. Once the data has been copied into the data warehouse, you can index anything you want there to achieve good performance for analytical queries without affecting the write performance of the OLTP database.
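As a minimal sketch of this idea, the following uses Python's built-in sqlite3 module as a stand-in for the warehouse. The `orders` table, its columns, the index name and the sample rows are all hypothetical and chosen only for illustration; a real warehouse would use its own engine and indexing options.

```python
import sqlite3

# Hypothetical stand-in for the warehouse copy of an OLTP "orders" table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, "
    "order_date TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO orders (region, order_date, amount) VALUES (?, ?, ?)",
    [
        ("north", "2023-01-05", 120.0),
        ("south", "2023-01-06", 80.0),
        ("north", "2023-02-01", 200.0),
    ],
)

# An analytical index that would add write overhead on the OLTP side,
# but costs the transactional system nothing when built on the copy.
warehouse.execute(
    "CREATE INDEX idx_orders_region_date ON orders (region, order_date)"
)

query = "SELECT SUM(amount) FROM orders WHERE region = ? AND order_date >= ?"
params = ("north", "2023-01-01")
total = warehouse.execute(query, params).fetchone()[0]

# EXPLAIN QUERY PLAN reports which access path SQLite chose for the query.
plan = warehouse.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)

print(total)
print(plan_text)
```

The query plan text should mention the index name, which is a quick way to check that an analytical index is actually being used.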
Another reason for creating an enterprise data warehouse is the ability to combine data from different sources for analytical purposes. For example, your OLTP sales program does not need weather information at the point of sale, but sales forecasts can use this data. If you add weather data to the data warehouse, you can easily integrate it into historical sales data models.
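The sales and weather example can be sketched as a simple join. This is only an illustration using Python's sqlite3 module; the `daily_sales` and `daily_weather` tables, their columns and the figures in them are invented for the example.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE daily_sales (sale_date TEXT, store_id INTEGER, revenue REAL);
CREATE TABLE daily_weather (obs_date TEXT, store_id INTEGER, rainfall_mm REAL);

-- Invented sample data: one store, three days.
INSERT INTO daily_sales VALUES
  ('2023-06-01', 1, 900.0), ('2023-06-02', 1, 400.0), ('2023-06-03', 1, 1000.0);
INSERT INTO daily_weather VALUES
  ('2023-06-01', 1, 0.0), ('2023-06-02', 1, 12.5), ('2023-06-03', 1, 0.0);
""")

# Combine the two sources: average revenue on dry days versus rainy days.
rows = dw.execute("""
SELECT CASE WHEN w.rainfall_mm > 0 THEN 'rainy' ELSE 'dry' END AS conditions,
       AVG(s.revenue) AS avg_revenue
FROM daily_sales AS s
JOIN daily_weather AS w
  ON w.obs_date = s.sale_date AND w.store_id = s.store_id
GROUP BY conditions
ORDER BY conditions
""").fetchall()

print(rows)  # [('dry', 950.0), ('rainy', 400.0)]
```

The point-of-sale system never needs the weather table; only the analytical copy does.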
Data lakes, which store files in their original format, are essentially “schema on read”: any application that reads data from the lake must impose its own types and relationships on the data. Data warehouses, on the other hand, are “schema on write”: data types, indexes and relationships are applied to the data as it is stored.
Schema on read is suitable for data that may be used in various contexts, and it carries little risk of losing data. The risk is rather that the data will never be used at all. Schema on write is suitable for data that serves a specific purpose and needs to be linked to data from other sources. The risk here is that badly formatted data may be discarded on import because it cannot be converted to the desired data type.
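The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only: the raw records, the `read_amounts` helper and the `sales` table are hypothetical, and sqlite3 stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw records as they might sit in a data lake.
raw_records = ["2023-06-01,19.99", "2023-06-02,n/a", "2023-06-03,5.00"]


def read_amounts(records):
    """Schema on read: the reader decides how to interpret each raw line."""
    amounts = []
    for line in records:
        _, raw_amount = line.split(",")
        try:
            amounts.append(float(raw_amount))
        except ValueError:
            pass  # this reader skips the value; the raw line itself is kept
    return amounts


# Schema on write: types are enforced while loading, so rows that do not
# convert never reach the table and have to be handled at import time.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_date TEXT NOT NULL, amount REAL NOT NULL)")
rejected = []
for line in raw_records:
    sale_date, raw_amount = line.split(",")
    try:
        dw.execute("INSERT INTO sales VALUES (?, ?)", (sale_date, float(raw_amount)))
    except ValueError:
        rejected.append(line)

lake_amounts = read_amounts(raw_records)
stored = dw.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(len(raw_records), lake_amounts)  # the lake still holds all three lines
print(stored, rejected)                # the warehouse holds two rows
```

In the schema-on-read case, the malformed line survives for some future reader; in the schema-on-write case, it is lost unless the import process keeps the rejected rows somewhere.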
Data warehouses contain data for the entire company, while data marts focus on data related to a specific business activity. Data marts can be dependent on the data warehouse, independent of it (i.e., drawn from an operational database or an external source), or a combination of the two.
Data marts are used in part because they are smaller, provide faster results and reduce operational costs. A data mart often contains aggregated and selected data instead of, or in addition to, the detailed data held in the data warehouse.
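A dependent data mart can be sketched as an aggregated, filtered table derived from the warehouse. The example below uses Python's sqlite3 module; the `sales` table, the `toys_mart` name and the sample rows are hypothetical.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
-- Detailed warehouse table (invented sample data).
CREATE TABLE sales (sale_date TEXT, department TEXT, product TEXT, amount REAL);
INSERT INTO sales VALUES
  ('2023-01-10', 'toys', 'robot', 30.0),
  ('2023-01-15', 'toys', 'puzzle', 10.0),
  ('2023-02-02', 'toys', 'robot', 30.0),
  ('2023-01-20', 'garden', 'rake', 25.0);

-- Dependent data mart: monthly totals for a single department.
CREATE TABLE toys_mart AS
SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
FROM sales
WHERE department = 'toys'
GROUP BY month;
""")

mart_rows = dw.execute(
    "SELECT month, total FROM toys_mart ORDER BY month"
).fetchall()
print(mart_rows)  # [('2023-01', 40.0), ('2023-02', 30.0)]
```

The mart holds two summary rows instead of four detail rows, which is the kind of reduction that makes it smaller and faster to query.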
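A data warehouse usually has a layered architecture: source data, a staging layer, ETL or ELT tools, the data storage layer and data presentation tools.
Each layer has a different purpose: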
Source data usually comes from operational databases for sales, marketing and other parts of the business. It can also include social media and external data such as surveys and demographics.
The staging layer stores data extracted from the data sources. If a source is unstructured, such as text from a social network, the schema is imposed here. This layer also performs quality checks to remove poor-quality data and correct common errors.
ETL tools extract the data, perform the required mappings and transformations, and load the data into the data storage layer. ELT tools load the data first and transform it afterwards. If you use ELT tools, you can use a data lake and skip the traditional staging layer.
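The difference in ordering between ETL and ELT can be sketched as follows. This is a simplified illustration with Python's sqlite3 module, not a description of any particular ETL product; the source rows, the `clean` helper and the table names are hypothetical.

```python
import sqlite3

# Hypothetical extracted rows: untidy names, amounts still stored as text.
source_rows = [("  Alice ", "42.50"), ("BOB", "17"), ("carol", "8.25")]


def clean(name, amount):
    """Transformation step: normalize the name and convert the amount."""
    return name.strip().lower(), float(amount)


# ETL: transform in the pipeline, then load only the cleaned rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
etl_db.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [clean(name, amount) for name, amount in source_rows],
)

# ELT: load the raw rows as they are, then transform inside the database.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_purchases (customer TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_purchases VALUES (?, ?)", source_rows)
elt_db.execute("""
CREATE TABLE purchases AS
SELECT lower(trim(customer)) AS customer, CAST(amount AS REAL) AS amount
FROM raw_purchases
""")

select = "SELECT customer, amount FROM purchases ORDER BY customer"
etl_rows = etl_db.execute(select).fetchall()
elt_rows = elt_db.execute(select).fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table. With ELT the raw rows also remain available in `raw_purchases`, which is why a data lake can take over the role of the staging layer.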
The data storage layer of a data warehouse contains cleaned, transformed and ready-to-analyze data. It is often a row-oriented relational store but can also be column-oriented or have inverted indexes for full-text search. Data warehouses often contain many more indexes than operational data stores to speed up analytical queries.
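To illustrate the difference between row-oriented and column-oriented layouts, here is a small sketch in plain Python. It only models the two layouts with lists and dictionaries; the records are invented, and real storage engines add compression, paging and much more.

```python
# Row-oriented layout: each entry is one complete record.
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 200.0},
]

# Column-oriented layout: one list per column, built from the same records.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over the row layout has to visit every complete record.
row_total = sum(row["amount"] for row in rows)

# The same aggregate over the column layout reads a single column.
col_total = sum(columns["amount"])

print(row_total, col_total)  # 400.0 400.0
```

Analytical queries tend to aggregate a few columns across many records, which is why a column-oriented layout can suit them well.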
Data presentation in a data warehouse is usually done by executing SQL queries, which can be built with a graphical tool. The results of the SQL queries are used to produce tables, charts, graphs, dashboards, reports and forecasts, often using business intelligence tools.
Recently, some data warehouse solutions have started to support machine learning to improve the quality of models and forecasts. For example, Google BigQuery has added SQL statements that support linear regression models for forecasting and binary logistic regression models for classification. Some data warehouses can also be linked to deep learning libraries and machine learning tools.
There are two main approaches to designing a data warehouse: top-down and bottom-up. The difference between them lies in the direction of the data flow between the data warehouse and the data marts:
In the top-down design, the data warehouse is the central data repository for the entire enterprise. Data marts are derived from the data warehouse.
In the bottom-up design, the data marts are primary: they are built first and then integrated into the data warehouse.
Companies in the manufacturing and insurance sectors tend to favor the top-down approach. Marketing, for example, tends to prefer the bottom-up approach.
A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-premises.
The capital cost and lack of scalability of on-premises servers in data centers were often a problem. The number of installations grew once vendors launched the first data warehouse appliances. Today, the trend is to move all or part of the data warehouse to the cloud, where scalability and integration with other cloud services play an essential role.
The downside of moving large amounts of data to the cloud is the operational cost, both for cloud data storage and for the compute and memory resources of the cloud data warehouse. The time required to upload very large datasets can also look like a barrier, although the major cloud vendors offer high-capacity, disk-based data transfer services to reduce it.
Ultimately, the right data warehouse solution depends on your objectives, resources and budget. First, ask yourself whether you need a data warehouse at all. If so, the next step is to identify your data sources, their size and growth rate, and analyze how you currently use them. You can then start experimenting with data warehouses, data lakes and data marts to find out what works best for your business.
We recommend setting up a proof of concept with a small subset of your data, either on-premises or on a small cloud instance. Once your design is validated and the business benefits are demonstrated, you can scale up to a full data warehouse installation with full management support.