Data Staging Cubes Are a Bad Idea

A cube is designed for optimised reading of data, not writing. It is the right tool for the job of reporting. Data staging cube is the name given to a cube that is being used for a purpose other than what is optimally designed to do; where it is the wrong tool for the job. A data staging cube will appear in the staging & transformation layer of the data model. It usually appears when the 16 key field limit of a DataStore is less than the full unique key of the DataSet.

“a data staging cube ensures the integrity of the DataSet granularity”

At some point in our data modeling adventures we will get the opportunity to implement a budget or forecast solution where the full unique key of the DataSet has more than 16 fields. The DataSet will have different granularities of records that contains records at different stages of its life cycle. It does not matter if the budget data is fundamentally top-down, bottom-up or both in its design. It does not matter if it is a rough budget, distributed by historical trends, manually modified or adjusted by a journal entry.

Cube (Inactive)

No matter where the budget data is at in its life cycle, the DataSet will most likely have many records at different granularities. For a data modeler, this forces a requirement upon the storage of the DataSet; to be able to handle all possible granularities and guarantee that the records will never overwrite each other causing data to just dis-appear. The simplest conclusion is that every characteristic, time and unit InfoObject must be considered to be part of the DataSets full unique key.

Only the key figure InfoObjects can be considered as data fields. A DataStore with only 16 key fields is great until the DataSet has more than 16 key fields. At this point in a solution discussion the obvious idea to use a cube instead of a DataStore becomes very attractive. When under pressure to get a solution in place by yesterday, a cube will be used and all other data modeling techniques will be ignored.

This is a technically viable solution that will work; as long as you accept the technical debt that goes along with this decision:

  • A cube will force characteristic SIDs to be generated;
  • A cube is slower to load than a DataStore;
  • A cube will never optimise the downstream delta record generation;
  • Databases have been known to have parallel processing issues with row level bit map index locking;
  • You can never implement a ‘read ahead’ transformation to the DataTarget;
  • Using the same InfoObjects in the reporting layer can invalidate query cache;
  • The InfoObjects will flag the need for an attribute change run.

Some SAP ECC extractors force you to use a cube because the extractor developer did not make it delta capable. This forces you to do a full load, every time. Other SAP ECC delta extractors will occasionally fail to extract all true delta changes for fields that were added to the ‘extended extract structure’.┬áThe standard fields available for delta extraction will usually be ok but any additional customised fields that were added to the extract structure would not have been involved in the extract programs process to determine which records have changed.

Once a data staging cube and a nightly load window with a full extract of the budget data has been implemented into the data model, there is an additional processing overhead in all the down stream DataTargets:

  • Risk of duplicate records in reporting if the process chain fails and does not get cleaned up properly;
  • A lot of records going into the database re-do log. Even when deleting data from the cube;
  • Full loads in the staging layer, transformation layer, warehouse layer;
  • Full aggregate drop and rebuild;
  • Full query cache invalidation. Unable to leverage the OLAP engine delta cache update feature.

Hopefully the data model with the staging cube and full load strategy has been optimised to only load ‘this year and future years’. This should have been implemented using dynamic selection filters in the InfoPackage and DTPs allowing for ‘selective deletions’ or the ‘delete overlapping request’ techniques to be implemented. The selective loads per year help to minimise the volume of data flow but also assume that prior years will never change.

If you happen to be un-lucky enough to be forced to extract all the data, all the time then please keep an eye out for the inherent risks associated with this approach. What would happen to the reporting data in BW when the source system archives its historical data? That’s right, it’s gone. Do you really want to be the one who has to go to the project manager the day after their go-live and ask them to put the historically archived data back into the production source system?

This situation will occur when the BW process chains were built with the assumption that all historical data is in the source system, is never archived and then proceed to delete the entire prior full load. This is a BW Administrators nightmare waiting to happen should you be unlucky enough not to have a data model that is using a staging cube and was not built with dynamic selection filters upon extraction and flow through BW. This is quite likely to exist when the historical DataSet is small enough for the nightly load window and the project team was unable to allocate the time to build dynamic selection filters into the solution.

“the data staging cubes technical debt is way more than the effort of a DataStore and concatenated characteristics”

There is only one situation where you will be forced to use a data staging cube; with the SAP ECC business content extractors that have been defined as ‘cube only’. The BW system will honour this definition and physically stop you from creating a transformation from the DataSource to a DataStore DataTarget. The best you can do with these extractors is to ensure the transformation layer is using a DataStore after this staging layer cube; or you could use a different extractor.

All of the above starts to make it really clear that the choice to use a data staging cube comes with a very high technical debt. This debt manifests itself mostly in the nightly load window and query re-caching activities with the risk of historical data just disappearing. Save time, processing and re-do log space by replacing the data staging cube with a DataStore using concatenated characteristics and rogue record detection techniques.

Use the right tool for the right job and enjoy 100% data integrity with true delta record generation in the data flow as early as possible.