ETL Data Warehousing and Business Intelligence
Please CLICK HERE to visit our movies section to watch all Telugu, Hindi & Tamil movies online, Daily updated with Lots & Lots of Movies.Thanks for visiting!
Data Warehousing and Business Intelligence
This data warehousing and Business Intelligence resource site aims to help people get a good high-level understanding of what it takes to implement a successful data warehouse project. A lot of the information is from my personal experience as a data warehousing professional, both as a client and as a vendor.
This site is divided into four main areas.
Tools: The selection of business intelligence (BI) tools and the selection of the data warehousing team. The tools covered are :
Database, Hardware.
ETL (Extraction, Transformation, and Loading).
OLAP.
Reporting .
Metadata .
Processes: This selection contains the typical milestones for a data warehousing project, from requirement gathering to production rollout and beyond.
Concepts: This section discusses several concepts particular to the data warehousing field.
Topics include:
Dimensional Data Model
Slowly Changing Dimension
Conceptual, Logical, and Physical Data Model
MOLAP, ROLAP, and HOLAP
Bill Inmon vs. Ralph Kimball
Glossary: A glossary of common data warehousing terms.
Feel free to contact me about additional information regarding data warehousing that I should include.
This site is updated frequently to reflect the latest technology, information, and reader feedback. Please bookmark this site now by pressing Control-D!Data Warehousing Tools
As the old Chinese saying goes, “To accomplish a goal, make sure the proper tools are selected.” This is especially true when the goal is to build a data warehouse. Given the complexity of the data warehousing system and the cross-departmental implications of the project, it is easy to see why the proper selection of tools and personnel is very important. This section will talk about the such selections. They are grouped into the following:General Considerations
Database/Hardware Selection
ETL Tool Selection
OLAP Tool Selection
Reporting Tool Selection
Metadata Tool Selection
Data Warehouse Team Personnel Selection
Data Warehousing - Database/Hardware Selection
Buy vs. Build
The only choices here are what type of hardware and database to purchase, as there is basically no way that one can build hardware/database systems from scratch.
Database/Hardware Selections
In making selection for the database/hardware platform, there are several items that need to be carefully considered:
· Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the approximate amount of data that is to be kept in the data warehouse system once it’s mature, and base any testing numbers from there.
· Parallel Processing Support: The days of multi-million dollar supercomputers with one single CPU are gone, and nowadays the most powerful computers all use multiple CPUs, where each processor can perform a part of the task, all at the same time. When I first started working with massively parallel computers in 1993, I had thought that it would be the best way for any large computations to be done within 5 years. Indeed, parallel computing is gaining popularity now, although a little slower than I had originally thought.· RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there are going to be certain parts of the code that is hardware platform-dependent. As a result, bugs and bug fixes are often hardware dependent.
True Case: One of the projects I have worked on was with a major RDBMS provider paired with a hardware platform that was not so popular (at least not in the data warehousing world). The DBA constantly complained about the bug not being fixed because the support level for the particular type of hardware that client had chosen was Level 3, which basically meant that no one in the RDBMS support organization will fix any bug particular to that hardware platform.
Popular Relational Databases :
Oracle
Microsoft SQL Server
IBM DB2
Teradata
Sybase
MySQL
Popular OS Platforms:
Data Warehousing - ETL Selection
Buy vs. Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This determination largely depends on three things:
Complexity of the data transformation: The more complex the data transformation is, the more suitable it is to purchase an ETL tool.
Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from scratch.
Data volume: Available commercial tools typically have features that can speed up data movement. Therefore, buying a commercial product is a better approach if the volume of data transferred is large.
ETL Tool Functionalities
While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended, but it’s not a must. When you evaluate ETL tools, it pays to look for the following characteristics:
· Functional capability: This includes both the ‘transformation’ piece and the ‘cleansing’ piece. In general, the typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in transformation.
· Ability to read directly from your data source: For each organization, there is a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.· Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select a metadata tool that works with your overall metadata strategy.
Popular Tools
. Data Junction
. Ascential DataStage
. Ab Initio
. Informatica
Data Warehousing - OLAP Tool Selection
OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch.
OLAP Tool Functionalities
Before we speak about OLAP tool selection criterion, we must first distinguish between the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).
· MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When user generates a report request, the MOLAP tool can generate the create quickly because all data is already pre-aggregated within the cube.
· ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a ‘Designer’ piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendor recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that many times it is necessary to drill down to the most detail level information, levels where the traditional cubes do not get to for performance and size reasons.
So what are the criteria for evaluating OLAP vendors? Here they are:
· Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool’s performance, and help loading the data into the cubes as quickly as possible.
· Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.
· Metadata support: Because OLAP tools aggregates the data into the cube and sometimes serves as the front-end tool, it is essential that it works with the metadata strategy/tool you have selected.
Popular Tools
Business Objects
Cognos
Hyperion
Microsoft Analysis Services
Micro Strategy
Data Warehousing - Reporting Tool Selection
There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following:
Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but they also already have report management systems to make maintenance and support functions easier.
Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.
Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.
Reporting Tool Functionalities :
Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.
Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:
· Data source connection capabilities
In general there are two types of data sources, one the relationship database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you might want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, but to connect to the other type of data source, it becomes a difficult exercise in programming.
· Scheduling and distribution capabilities
In a realistic data warehousing usage scenario by senior executives, all they have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say the sales numbers), and that’s how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features.
Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.
· Customization
Every one of us has had the frustration over spending an inordinate amount of time tinkering with some office productivity tool only to make the report/presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts jobs much easier, and the time savings are tremendous.
Popular Tools
Business Objects (Crystal Reports)
Cognos
ActuateData Warehousing - Metadata Tool Selection
Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because doing so requires resources that are intimately familiar with the operational, technical, and business aspects of the data warehouse system, and such resources are difficult to come by. Even when such resources are available, there are often other tasks that can provide more value to the organization than to build a metadata tool from scratch.
In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays an extremely important role in a successful data warehousing implementation, this does not always mean that a tool is needed to keep all the “data about data.” It is possible to, say, keey such information in the repository of other tools used, in a text documentation, or even in a presentation or a spreadsheet.
Having said the above, though, it is author’s believe that having a solid metadata foundation is one of the keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at the beginning of the project, it is essential to have a metadata strategy; that is, how metadata in the data warehousing system will be stored.
Metadata Tool Functionalities
This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to call this a selection of the metadata strategy. Traditionally, people have put the data modeling information into a tool such as ERWin and Oracle Designer, but it is difficult to extract information out of such data modeling tools. For example, one of the goals for your metadata selection is to provide information to the end users. Clearly this is a difficult task with a data modeling tool.
So typically what is likely to happen is that additional efforts are spent to create a layer of metadata that is aimed at the end users. While this allows the end users to gain the required insight into what the data and reports they are looking at means, it is clearly inefficient because all that information already resides somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP tool, or the reporting tool.
There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000, the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen to the data warehousing industry. As of right now, though, the author has not really seen that many tools leveraging this standard, so clearly it has not quite caught on yet.So what does this mean about your metadata efforts? In the absence of everything else, I would recommend that whatever tool you choose for your metadata support supports XML, and that whatever other tool that needs to leverage the metadata also supports XML. Then it is a matter of defining your DTD across your data warehousing system. At the same time, there is no need to worry about criteria that typically is important for the other tools such as performance and support for parallelism because the size of the metadata is typically small relative to the size of the data warehouse.
Data Warehousing Processes
After the tools and team personnel selections are made, the data warehouse project can begin. The following are the typical processes involved in the datawarehousing project cycle.
Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Performance Tuning
Quality Assurance
Rolling out to Production
Production Maintenance
Incremental Enhancements
Each page listed below represents a typical data warehouse phase, and has several sections:
Task Description: This section describes what typically needs to be accomplished during this particular data warehouse phase.
Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.
Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.
Possible Pitfalls: Things to watch out for. Some of them obvious, some of them not so obvious. However, all of them are real.
The Additional Observations section contains my own observations on data warehouse processes not included in any of the step.
Data Warehousing Project - Requirement Gathering
Task Description
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting.
The primary goal of this phase is to identify what constitutes as a success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.
Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBA’s will attest, this can happen very quickly after the project goes live.
Time Requirement
2 - 8 weeks.
Deliverables
· A list of reports / cubes to be delivered to the end users by the end of this current phase.
· An updated project plan that clearly identifies resource loads and milestone delivery dates.
Possible Pitfalls
This phase often turns out to be the most tricky phase of the data warehousing implementation. The reason is that because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness of information sharing. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or could not start in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at the CXO level, she can often exert enough influence to make sure everyone cooperates.
Data Warehousing Project - Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.
It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers.
Having different environments is very important for the following reasons:
All changes can be tested and QA’d first without affecting the production environment.
Development and QA can occur during the time users are accessing the data warehouse.
When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, often data warehousing teams will decide to use only a single database and a single server for the different environments. Environment separation is achieved by either a directory structure or setting up distinct instances of the database. This is problematic for the following reasons:
1. Sometimes it is possible that the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this.
2. There may be interference when having different database environments on a single box. For example, having multiple long queries running on the development database could affect the performance on the production database.
Data Warehousing Project - Data Modeling
Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance.
In data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether they even exist anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this was delayed until the ETL phase, rectifying it will becoming a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
Identification of data sources.
Logical data model.
Physical data model.
Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or can be someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.
Recommended Books
The Data Modeling Handbook : A Best-Practice Approach to Building Quality Data Models
Object-Oriented Data Warehouse Design: A Star Schema
Data Modeling with ERWin
Data Warehousing Project - ETL
Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.
Time Requirement
1 - 6 weeks.
Deliverables
Data Mapping Document
ETL Script / ETL Package in the ETL tool
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove suicidal to the project because end users will usually tolerate less formatting, longer time to run reports, less functionality (slicing and dicing), or fewer delivered reports; one thing that they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing on quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone’s imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.
Recommended Books
Professional SQL Server 2000 DTS (Data Transformation Service)
Data Warehousing Project - OLAP Cube/Report Design
Task Description
Usually the design of the olap cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea on what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that data warehousing is an iterative process - no one can ever meet all the requirements all at once.
Time Requirement
1 - 2 weeks.
Deliverables
Documentation specifying the OLAP cube dimensions and measures.
Actual OLAP cube / report.
Possible Pitfalls
Make sure your OLAP cube-bilding process is optimized. It is common for the data warehouse to be on the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn’t much time remaining for the olap cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.
Recommended Books
OLAP Solutions: Building Multidimensional Information Systems
Microsoft Olap Solutions
MDX Solutions: With Microsoft SQL Server Analysis Services
Data Warehousing Project - Front End Development
Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.
So what are the things to look out for in selecting a front-end deployment methodology? The most important thing is that the reports should need to be delivered over the web, so the only thing that the user needs is the standard browser. These days it is no longer desirable nor feasible to have the IT department doing program installations on end users desktops just so that they can view reports. So, whatever strategy one pursues, make sure the ability to deliver over the web is a must.
The front-end options ranges from an internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to the more higher-level products such as Actuate. In addition, many OLAP vendors offer a front-end on their own. When choosing vendor tools, make sure it can be easily customized to suit the enterprise, especially the possible changes to the reporting requirements of the enterprise. Possible changes include not just the difference in report layout and report content, but also include possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care is that they receives their information in a timely manner and in the way they specified.
Data Warehousing Project - Performance Tuning
Task Description
There are three major areas where a data warehousing system can use a little performance tuning:
ETL - Given that the data load is usually a very time-consuming process (and hence they are typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, that means that the window for data loading is not very long. A data warehousing system that has its ETL process finishing right on-time is going to have a lot of problems simply because often the jobs do not get started on-time due to factors that is beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relationship database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time to tune the query, especially the most popularly ones.
Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than the query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
Performance tuning document - Goal and Result
Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible - Performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.
Recommended Books
Oracle High-Performance SQL Tuning
Microsoft SQL Server 2000 Performance Tuning Technical Reference
DB2 High Performance Design and Tuning
Data Warehousing - Quality Assurance
Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though, the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
QA Test Plan
QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to have to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.
Data Warehousing Project - Rollout To Production
Task Description
Once the QA team gives thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but usually it is not true. Depending on the number of end users, it sometimes take up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going production sometimes as easy as sending out an URL via email.
Time Requirement
1 - 3 days.
Deliverables
Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating to spend several months to develop and QA the data warehousing system, only to have little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.
Data Warehousing Project - Production Maintenance
Task Description
Once the data warehouse goes production, it needs to be maintained. Tasks as such regular backup and crisis management becomes important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of the data warehouse planned, start on that as soon as possible.
Data Warehousing Project - Incremental Enhancements
Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about a new data warehousing phases, but simply small changes that follow the business itself. For example, the original geographical designations may be different, the company may originally have 4 sales regions, but now because sales are going so well, now they have 10 sales regions.
Deliverables
Change management documentation
Actual change to the data warehousing system
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development –> QA –> Production be followed, regardless of how simple the change may seem.
Additional Observations
These are some trends that I have observed from my experience:
Quick Implementation Time
If you add up the total time required to complete the tasks from Requirement Gathering to Rollout to Production, you’ll find it takes about 9 - 29 weeks to complete each phase of the data warehousing efforts. The 9 weeks may sound too quick, but I have been personally involved in a turnkey data warehousing implementation that took 40 business days, so that is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of the data warehousing implementation.
Why is this important? The main reason is that in today’s business world, the business environment changes quickly, which means that what is important now may not be important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue in a rapid pace. Therefore, a time-consuming data warehousing effort will very likely become obsolete by the time it is in production. It is best to finish a project quickly. The focus on quick delivery time does mean, however, that the scope for each phase of the data warehousing project will necessarily be limited. In this case, the 80-20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs. The rest can come later.
Lack of Collaboration with Data Mining Efforts
Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that now information from all over the enterprise is conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise not possible to find.
Unfortunately, this has not quite happened due to the following reasons:
1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are more likely to have isolated data marts. At the data mart level, it is difficult to come up with relationships that cannot be answered by a good OLAP tool.
2. The ROI for data mining companies is inherently lower because by definition, data mining will only be performed by a few users (generally no more than 5) in the entire enterprise. As a result, it is hard to charge a lot of money due to the low number of users. In addition, developing data mining algorithms is an inherently complex process and requires a lot of up front investment. Finally, it is difficult for the vendor to put a value proposition in front of the client because quantifying the returns on a data mining project is next to impossible.
This is not to say, however, that data mining is not being utilized by enterprises. In fact, many enterprises have made excellent discoveries using data mining techniques. What I am saying, though, is that data mining is typically not associated with a data warehousing initiative. It seems like successful data mining projects are usually stand-alone projects.
Industry Consolidation
In the last 2 years, we have seen rapid industry consolidation, as the weaker competitors are gobbled up by stronger players. The list is below (note that the dollar amount quoted is the value of the deal when initially announced):
Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in 2003.
Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.
GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.
Cognos (OLAP) purchased Adaytum (OLAP) for $60 million in 2002.
Business Objects (OLAP) purchase Acta (ETL) for $65 million in 2002.
For the majority of the deals, the purchase represents an effort by the buyer to expand into other areas of data warehousing (Hyperion’s purchase of Brio also falls into this category because, even though both are OLAP vendors, their product lines do not overlap). This clearly shows vendors’ strong push to be the one-stop shop, from reporting, OLAP, to ETL.
There are two levels of one-stop shop. The first level is at the corporate level. In this case, the vendor is essentially still selling two entirely separate products. But instead of dealing with two sets of sales and technology support groups, the customers only interact with one such group. The second level is at the product level. In this case, different products are integrated. In data warehousing, this essentially means that they share the same metadata layer. This is actually a rather difficult task, and therefore not commonly accomplished. When there is metadata integration, the customers not only get the benefit of only having to deal with one vendor instead of two (or more), but the customer will be using a single product, rather than multiple products. This is where the real value of industry consolidation is shown.
How to Measure Success
Given the significant amount of resources usually invested in a data warehousing project, a very important question is how success can be measured. This is a question that many project managers do not think about, and for good reason: Many project managers are brought in to build the data warehousing system, and then turn it over to in-house staff for ongoing maintenance. The job of the project manager is to build the system, not to justify its existence.
Just because this is often not done does not mean this is not important. Just like a data warehousing system aims to measure the pulse of the company, the success of the data warehousing system itself needs to be measured. Without some type of measure on the return on investment (ROI), how does the company know whether it made the right choice? Whether it should continue with the data warehousing investment?
There are a number of papers out there that provide formula on how to calculate the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe the success of the data warehousing system can simply be measured by looking at one criteria:
How often the system is being used.
If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that can be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much are the users accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system is serving its purpose successfully.Data Warehousing Concepts
Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.
Dimensional Data Model: Dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique.
Slowly Changing Dimension: This is a common issue facing data warehousing practioners. This section explains the problem, and describes the three ways of handling this problem with examples.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section explains their differences and lists the steps for constructing each.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they are different from the other, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have a different view of the role between data warehouse and data mart.
Dimensional Data Model
Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.
To understand dimensional data modeling, let’s define some of the terms commonly used in this type of modeling:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes within a hierarchy. For example, one possible hierarchy in the Time dimension is Year –> Quarter –> Month –> Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as “Q1 2001″ or “2001 Q1″).
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.
Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.
Fact Table Granularity
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually goes back to the requirements.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what are the relevant dimensions.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means a complete list for all off-line retailers. A supermarket with a Rewards Card program, where customers provide some personal information in exchange for a rewards card, and the supermarket would offer lower prices for certain items for customers who present a rewards card at checkout, will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension will then be a decision that needs to be made.
What Level Within Each Dimensions To Include
Determining which part of hierarchy the information is stored along each dimension is a bit more tricky. This is where user requirement (both stated and possibly future) plays a major role.
In the above example, will the supermarket wanting to do analysis along at the hourly level? (i.e., looking at how certain products may sell by different hours of the day.) If so, it makes sense to use ‘hour’ as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then ‘day’ can be used as the lowest level of granularity. Since the lower the level of detail, the larger the data amount in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on the industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need of additional details. In such cases, it is prudent for the data warehousing team to design the fact table such that lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge of the “dumping the lowest level of detail into the data warehouse” symptom, and only includes what is practically needed. Sometimes this can be more of an art than science, and prior experience will become invaluable here.
Fact and Fact Table Types
Types of Facts
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table — date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what’s the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.
Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.Slowly Changing Dimensions
The “Slowly Changing Dimension” problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. We give an example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:
Customer Key,Name,State
1001,Christina,Illinois
At a later date, she moved to Los Angeles, California on January, 2003. How should ABC Inc. now modify its customer table to reflect this change? This is the “Slowly Changing Dimension” problem.
There are in general three ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and how the data model and the data looks like for each of them. Finally, we compare and contrast among the three alternatives.
Type 1 Slowly Changing Dimension
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key,Name,State
1001,Christina,Illinois
After Christina moved from Illinois to California, the new information replaces the new record, and we have the following table:
Customer Key,Name,State
1001,Christina,California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
Type 2 Slowly Changing Dimension
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The newe record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key,Name,State
1001,Christina,Illinois
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer Key,Name,State
1001,Christina,Illinois
1005,Christina,California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
Type 3 Slowly Changing Dimension
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key,Name,State
1001,Christina,Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):Customer ,Key Name, Original State, Current State ,Effective Date
1001,Christina,Illinois,California ,15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.
Conceptual, Logical, and Physical Data Models
There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain the difference among the three, the order with which each one is created, and how to go from one level to the other.
Conceptual Data Model
Features of conceptual data model include:
Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.
Logical Data Model
Features of logical data model include:
Includes all entities and relationships among them.
All attributes for each entity are specified.
The primary key for each entity specified.
Foreign keys (keys identifying the relationship between different entities) are specified.
Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).
The steps for designing the logical data model are as follows:
Identify all entities.
Specify primary keys for all entities.
Find the relationships between different entities.
Find all attributes for each entity.
Resolve many-to-many relationships.
Normalization.
Physical Data Model
Features of physical data model include:
Specification all tables and columns.
Foreign keys are used to identify relationships between tables.
Denormalization may occur based on user requirements.
Physical considerations may cause the physical data model to be quite different from the logical data model.
At this level, the data modeler will specify how the logical data model will be realized in the database schema.
The steps for physical data model design are as follows:
Convert entities into tables.
Convert relationships into foreign keys.
Convert attributes into columns.
Modify the physical data model based on physical constraints / requirements.
MOLAP, ROLAP, and HOLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a “WHERE” clause in the SQL statement.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can “drill through” from the cube into the underlying relational data.
Bill Inmon vs. Ralph Kimball
In the data warehousing field, we often hear about discussions on where a person / organization’s philosophy falls into Bill Inmon’s camp or into Ralph Kimball’s camp. We describe below the difference between the two.
Bill Inmon’s paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball’s paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball’s idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.
Glossary
[A-D] [E-Z]
Aggregation: One way of speeding up query performance. Facts are summed up for selected dimensions from the original fact table. The resulting aggregate table will have fewer rows, thus making queries that can use them go faster.
Attribute: Attributes represent a single type of information in a dimension. For example, year is an attribute in the Time dimension.
Conformed Dimension: A dimension that has exactly the same meaning and content when being referred from different fact tables.
Data Mart: Data marts have the same definition as the data warehouse (see below), but data marts have a more limited audience and/or data content.
Data Warehouse: A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process (as defined by Bill Inmon).
Data Warehousing: The process of designing, building, and maintaining a data warehouse system.
Dimension: The same category of information. For example, year, month, day, and week are all part of the Time Dimension.
Dimensional Model: A type of data modeling suited for data warehousing. In a dimensional model, there are two types of tables: dimensional tables and fact tables. Dimensional table records information on each dimension, and fact table records all the “fact”, or measures.
Dimensional Table: Dimension tables store records related to this particular dimension. No facts are stored in a dimensional table.
Drill Across: Data analysis across dimensions.
Drill Down: Data analysis to a child attribute.
Drill Through: Data analysis that goes from an OLAP cube into the relational database.
Drill Up: Data analysis to a parent attribute.
ETL: Stands for Extraction, Transformation, and Loading. The movement of data from one area to another.
Fact Table: A type of table in the dimensional model. A fact table typically includes two types of columns: fact columns and foreign keys to the dimensions.
Hierarchy: A hierarchy defines the navigating path for drilling up and drilling down. All attributes in a hierarchy belong to the same dimension.
Metadata: Data about data. For example, the number of tables in the database is a type of metadata.
Metric: A measured value. For example, total sales is a metric.
MOLAP: Multidimensional OLAP. MOLAP systems store data in the multidimensional cubes.
OLAP: On-Line Analytical Processing. OLAP should be designed to provide end users a quick way of slicing and dicing the data.
ROLAP: Relational OLAP. ROLAP systems store data in the relational database.
Snowflake Schema: A common form of dimensional model. In a snowflake schema, different hierarchies in a dimension can be extended into their own dimensional tables. Therefore, a dimension can have more than a single dimension table.
Star Schema: A common form of dimensional model. In a star schema, each dimension is represented by a single dimension table.
Popularity: 2% [?]



No comments yet.