Data Warehousing Basics

Data warehousing is a computational method which provides the tools to analyse data and reporting specific metrics. As data is becoming a more influent and relevant asset, data warehouses are gaining strenght due to it’s habilities to store historical data and merging data from various data sources.

In order to address this increase of popularity of this method we created this massive resource of combined articles on data warehouses.

This resource describes the basic concepts of a data warehouse in a simple way, in order to share a first and introductionary approach to data warehousing basics.

What is data warehousing?

A data warehouse is a relational database which stores historical data about some defined subject, designed specially from analysis and query of the information. The main difference between data warehousing and operational databases is the transaction nature of there, versus de analysing and querying of the data warehouses.

So, data warehouses are mainly OLAP systems, privileging data analysis, instead of OLTP where the trasactions are de main concern. We define and compare those two approaches in a fantastic article.

Since the DW stores data from several sources, it needs a process to extract the data, process it, normalize some information and, finally, store all the useful information in the database system. For this actions are used ETL processes, merging all the data in one well structured repository.

Basic elements

Data mart is the simple element of a data warehouse, focusing the information of a unique subject. This subject might be customers and usually are merged from a small list of data sources. A data warehouse is composed of multiple data marts, with specific relationships between them.

A fact is a table that stores concrete measures about something. This information is basic, mostly numerical and are the main values and data from a data mart.

In the other hand, a dimension is a describing table of the data presented of the facts and its fields. It has complementary information about the main values. Usually the relationship between fact tables and dimension ones is one to many.

Other elements of data warehousing are the basic database types like: tablespaces, indexes, tables, partitioning, views and constraints. Some of these resources can be explored in this article.

Element relationship

Since a data warehouse is a database, it has a specific structure for it’s data. There are many kinds of structures, based on the elements fact and dimension tables, described before.

The star schema is a organization pattern that each dimension is represented only by one dimension table.  All the dimensions are connected to the central fact table.

In opposition, the snowflake schema has its dimension tables normalized, and linked to one another. There are layers of dimension tables, supported by one fact table.

Those two logical schema structures are fully described in a simple, yet great resource.

Fact constellation schema is a multiple fact table schema. The dimension can be linked to more than one fact table, but are not linked to each other.

Data Warehouse Architecture

The aim of a data warehouse is to gather a lot of data of specific subjects, store and anylise it as efficiently as possible. To accomplish that, we need a model architecture that supports all these tasks, separating each one from the other.

A data warehouse is composed of essentially 3 tiers:  the data level, the application level and the front end.

There must be defined the process that transform the data sources into information that fits in the database. Sometimes it’s used a staging area for these transformations, before inserting the data into the database.

In the application area is where the analysis and reporting is done, using some OLAP.

The front end exposed the information to the users. It can be achived by sending all the data to the end user, or just supplying a specific data mart with the right view to the end user.

For all this layers to work properly, there are managers that supervise the operations. The Load Manager, Warehouse Manager and Query Manager are examples of management entities that ensure the quality and efficiency of the data warehouse.

Other Important Ideas

Besides the obvious database dependencies that are important in a data warehouse, since it’s foundations are databases, there are some advanced subjects that can be discussed further.

The extraction, transformation and loading of data are crucial activities that should be explored in a more detailed way. Another subject of particular importance is data warehouse maintenance: how it is done, best practices and which tools to use.