One of the most pressing issues regarding data modelling and management is its organization and structure. The way we put together and relate our data has great impact on the system’s performance, functionality and scalability.
So the logical schema structure is a key aspect in the development of any system. There are many data structures that can store data in an efficient and reliable way.
In this resource we’ll talk about two very important logical structures in data warehousing: star schema and snowflake schema. These two distinct ways of putting together large quantities of data are the most commonly used in data warehouse structures.
Therefore we will describe them to the detail, providing a lot of useful and fundamental information.
So, enjoy this chunk of fantastic information!
Regarding to the logical organization of your data, the star schema is the easier and most straightforward way to structure all the information.
The architecture of this kind of schemas is quite simple and provides several advantages, as well as some disadvantages, which we describe below.
In the star schema we have one or more centralized fact tables and one or multiple dimensions linked to it. These dimensions are only related to the fact table, so the only structural link they have is to that specific table.
The fact table relates to the dimensions having their primary keys as foreign keys, and other extra attributes relevant to the data warehouse. Therefore, each composition of primary keys identify a unique fact or piece of information. In most cases, the fact tables are in the 3rd normal form and the dimension ones are not normalized.
This way, the core data of the database is separated from its attributes, which characterize that information.
The implementation of this structure provides your data warehouse some benefits. Below we describe the most important ones:
- Simple structure of the data – Easy to understand how elements are connected. Simplifies the reporting of the information.
- Most common – Easy to integrate with another tools.
- Queries more effective – The queries in these systems are usually simpler since the data doesn’t follow some strict rules of normalization. Another reason for this is the lesser number of tables to join.
- Performance enhancements – The performance has substantial gains due to the de-normalized form of the data.
- Optimized for large data sets – Due to the best performance of the system and it’s queries, the star schema is efficient on data warehouses or data marts with huge data sets.
- Rapid aggregational actions – Tasks like sum, average, count, and others are performed quickly on this systems.
- Good for OLAP
The de-normalization of star schema structural data has also some disadvantages. In this chapter we focus the most important ones:
- Poor Data Integrity – Due to the non normalized structure of these tables, information can be replicated, creating several anomalies in the data.
- Long time loading dimension table – When the data integrity is low and replication values high, loading time of the tables increases.
- More disk space
- Additional processing – Usually some controlling processes are added, to avoid the data integrity issue.
- Harder Complex Queries – Since the data schema is built specifically to analyse a set of data, its de-normalized organization makes it harder to develop new complex queries.
- No Many-to-Many – This schema has no many-to-many relationships
As you can see, this structure resembles a star, thus the name of this logical schema.
The central fact table is linked to many dimension that characterize it, and these dimensions only relate to the central table.
This is only a very simple example of what a star schema might be. It can be quite complex with more facts and dimensions.
The snowflake schema is a structure variation of the previous described one, the star schema. Also based on facts and dimensions, this logical schema interpretation enables a different relationship between tables.
On this chapter we’ll describe what is the snowflake schema, its benefits and disadvantages.
The snowflake schema architecture is based on one or more central fact tables. These tables relate themselves to one or multiple dimensions. For its part the dimension tables can be connected to other dimension tables.
This structure is normalized along the multiple dimensions which are related to each other, therefore providing a more normalized logical structure. These connections can appear with multiple levels of depth depending on the level of complexity of the structure.
Each related dimension table can have multiple parent dimension tables resulting in a more complex, and normalized, system.
This schema is usually used to support data warehouse or data marts specific needs, improving their’s capability.
There are obvious differences between the two structures referred on this article. Here we describe the benefits of this snowflaking approach:
- Better Data Quality – The information stored on the dimensions has usually far less anomalies than in star schema.
- Less Storage Used – Due to the optimization on the dimension tables, a lot of the storage space is spared with the significant decrease of dat replication.
- Better Specific Query Performance – Specific views are optimized by this structure since it is built to support them because such queries are optimized.
- Optimized Tools – There are several tools built to work with this kind of data organization.
- More Structured Data – Information is obviously much more organized than in non-normalized structures.
Besides all of the benefits stated above, there are also negative points of the usage of this structure:
- Not Fully Normalized – This structure imposes some normalization to the dimension tables. However this isn’t fully achieved and, to ensure data quality, some extra processing tasks need to be developed.
- High Join Complexity – The structure of the data is far more complex to analyse and to work with than in non-normalized schemas.
- Poor General Performance – Despite the storage and organization benefits, this schema usually performs worse on general queries due to the high number of join operations needed to retrieve non-specific data.
In order to consolidate your knwoledge on the snowflake schema we provide this simple graphical example.
The example shows some relationships between dimension tables, normalizing some of the information stored on the database.
This is a very simple example. A Sales fact should be much more complex.