|
Building A Data Warehouse
The building of a data warehouse is known as extract,
transformation, load (ETL for short).
First we extract the data from its various data sources.
These will usually be database tables from various systems
within an organisation, but are often supplemented by
other files such as spreadsheets and text files.
In the transformation process we have the opportunity
to massage the data into what we want for the data warehouse.
This is where most of the action takes place. Apart
from adding calculations and data from various tables
where necessary, this is where we check that the data
is consistent. For example dates on different systems
in the organisation are often stored in different ways.
These differences must be resolved in the transformation
process.
In transformation codes must be translated. This is
known as data cleansing.
The load process creates the data warehouse. In a dimensional
star schema data warehouse this means loading the dimension
and fact tables. Often the dimensions do not change
much – for example a price list might only change
annually – so not all dimensions need to be loaded
every day. Even then, the dimensions can often be updated
with changes and additions rather than being completely
rebuilt each time.
Fact tables – for example your daily sales –
need to be updated more often, usually on a daily basis.
This involves checking that all records in the fact
tables link correctly to the dimension tables, as this
is vital for the star schema to join together properly
and have any meaning.
Monitoring the build. In most organisations the data
warehouse build will need to be monitored to ensure
that the data has arrived at its destination and is
correct. This monitoring can be done in a number of
ways (home-made scripts for example) but should involve
some form of automation. A good ETL package will provide
tools for monitoring the load such as log files, a shcheduler
and reject files for rejected input records. A sophisticated
ETL system will be able to send out email alerts when
things go wrong, and in some cases correct the problem
and reinput the data.
|