Data Warehouse Interview Questions and Answers

data warehouse interview questions and answersThe growing interest and utilization of these data storage methods brings more people and organizations to work in the area.

On one side we’ll have more job offers for qualified people but, in the other hand, the number of specialists trying to work with these ideas will increase a lot, too.

Regarding that, this resource was created to help everyone on job interviews, answering to data warehousing interview questions.

Remember to follow the general guidelines when attending a job interview. Some tips on how to behave are specified on this article.

In this article are covered the most important questions, with some simple answers. The whole resource is available in pdf, and you can download it here.

Questions and Answers

The questions specified are concrete about data warehouses and related subjects. All of them were gathered from job interviews and compiled to this list of the most important and frequent ones. The answers we provide are simple and to the point.

So here we go!

What is a data warehouse, and why is it used?

A data warehouse is a repository of data. The pieces of information stored are relevant to each other and support the decision making tree of some corporation or entity. It can incorporate multiple data sources to store all the data connected to the subject. Tipically it is composed by archive or historical data that can be analised. A data warehouse is supported on a database system.

What are the basic stages of a data warehouse?

The first stage to build a data warehouse is the initial data introduction, tipically this can be achived by copying some operational database. This is called and offline operational database. Then, we have to feed new sets of data to the newest created data warehouse. Therefore, this database is updated with large sets of data in a regular time basis (week, month). With this step, we’ve successfully built a offline data warehouse.

To achived a Real-time data warehouse you have to insert the operational data in real time. When this is integrated with the application, reporting on the data, it’s called a Integrated data warehouse.

What is OLAP and OLTP, and which are their main differences?

OLAP performs the analysis on the data, reporting the information. The focus on these kind of systems is the reading of data, thus using the SELECT database statement. OLTP manages the transaction system that collects the data. Actions like INSERT, UPDATE or DELETE are the focus here.

This topic is covered in a much more detailed way in this OLTP and OLAP article, provided by us.

What is a fact table?

The fact table is a concrete measure that is tipically stored as numeric values, they have the core business information.

In detail, the fact table contains two different kinds of information. The foreign keys to the related dimension tables, providing joining relationships, and the measure columns which represent the added data.

And a dimension table?

Dimension tables describe the quantified data on the fact tables, giving context on its fields. They contain descriptive attributes which provide more information related to the fact table.

Fact tables have foreign keys to the dimension ones and the relation is one to many.

Describe the star schema.

In the star schema we have a centralized fact table and multiple dimensions linked to it. These dimensions are only related to the fact table, so the only link they have is to that specific table.

The fact table relates to the dimensions having their primary keys as foreign keys, and other extra attributes relevant to the data warehouse.

Therefore this kind of schema is denormalized and  better for simple querys, which are usually faster.

The next diagram represents a simple star schema implementation.

Star Schema

Describe the snowflake schema.

The Snoflake schema have links and relationships between dimensions, becoming a normalized organization of tables: fact and dimensions. This type of schema is usually more complex because each dimension can be composed of many other dimensions.

This kind of organization is explain in the next schema.

 

Snowflake Schema

What is a OLAP cube?

A OLAP data cube is a representation of data in multiple dimensions, using facts and dimensions. It is characterized by the combination of information according to it’s relationship.

It can consist in a colection of 0 to many dimensions, representing specific data. There are five basic operation to perform on these kind of data cubes:

  • Slicing
  • Dicing
  • Roll-Up
  • Drill-Up and Drill-Down
  • Pivoting

Explain the slicing operation.

The slicing operation on a OLAP Cube establishes a single value for one of the dimensions of the cube, selecting all the data that corresponds to the selected value.

So, by executing a slice on the cube we get all the selected dimension and fact information for the specific value assigned.

Explain the dicing operation

Dicing on OLAP Cubes consists on choosing an interval of values for some of the dimensions representing in the cube, and selecting the data that corresponds to those intervals.

This operation creates a subset of the cube which contains the data between the intervals.

Explain the roll up operation.

The roll-up operation performs some computing rules on the data of a OLAP cube specific dimension, returning the computed information to the end user.

These applied rules can be defined and summarize the information on that specific dimension.

Explain the drill-up/drill-down operation

These operations allow the exploration of information between the levels of data presented on dimensions and facts on the data warehouse.

It can select summarized information or the details that compose that data aggregation.

Explain the pivoting operation

Pivoting allows the rotation of the cube on its dimensions providing the user a different point of view of the explored data.

The cube can be rotated on every face.

Explain the concept of data mart.

Data mart is a specific group of data linked to a subject, which is part of a specific data warehouse. Therefore, a data warehouse have multiple data marts.

Basically a data mart is a small data warehouse with condensed information about a specific subject and it’s relationships. Usually each data mart is related to a department, business unit or something that can function individually within a data warehouse.

Which are the reasons to create a Data Mart?

There are various reasons that lead to a the creation of a data mart. The most important ones are:

  • Create a data specific environment, providing easy access to it
  • Easy to create
  • Data is more relevant to users having only the essential information
  • Lower cost than creating a whole data warehouse

What does Normalization mean?

Normalization is the process in which tables and fields are organized in a database in order to reduce the redundancy of stored data. Therefore many relationships between tables are defined, providing a better organized database system.

The key benefits of normalization are:

  • Low database data redundancy
  • Searching and indexing is faster
  • Fewer null values since data is well distributed
  • Cleaner and easier to mantain

What is a ETL process?

A ETL process consists on getting data from different sources and converting it to enter in a specific data warehouse.

Theses processes transform and normalize the data, providing a common base for all sources to integrate with a data warehouse.

What is aggregation?

Aggregation is the representation of a set of data, joined by some aggregation function.

This functions may be simple or complex depending of the purpose of the selected aggregation data. A simple function is the sum of every value.

Explain what is partitioning.

Partitioning is the process of dividing all data warehouse elements into smaller and distinct sets of data, keeping the relationships between the elements.

The benefits of partitioning are:

  • Easy management
  • Better performance
  • Availability
  • Easier backup and recovery

What types of dimensions do you know?

There are four common kinds of dimensions in a data warehouse:

  • Conformed Dimension
  • Degenerated Dimension
  • Role-Playing Dimension
  • Junk Dimension

Describe a conformed dimension.

A conformed dimension is shared between various subjects in the data warehouse. Therefore it is widely used in different contexts, meaning the same thing in each one of them.

Explain what is a degenerated dimension.

The degenerated dimension is derived from a fact table and doesn’t have it’s own dimension table.

What is a role-playing dimension?

A role-playing dimension has multiple applications within the same Data Warehouse and it is reused for different purposes. One example is an ID. In a data warehouse we can have several kinds of ID’s: client id, product id, etc.

What are junk dimensions?

Junk dimensions are composed by some attributes that don’t fit in another tables and are usually used with rapidly changing dimensions.

What is the difference between metadata and data dictionary?

A data dictionary has all the definitions of a database, the tables and fields, rows, number of rows, and that kind of information.

Metadata describes some kind of information with addicional and important data which is complementary.


Final Considerations

There are a lot of questions that can be asked about this topic, since it is a huge one.  From time to time we’ll be adding more information to this resource.

Data Warehouse interview questions and answers PDF file

This resource (you can download it in the beggining of the article), is a compilation of all the materials on the page. If, in your opinion, this is a useful resource, please subscribe our mailing list in order to received more documents on related subjects.