Data WareHouse:
Data Warehouse is a huge database, stores historical data, the volume is huge, this data is used for business analysis. You can analyze data of certain time period or certain condition or certain region or certain department etc. This data analysis is done to monitor the business and to take business decisions.
You can use reporting tools or other way to analyze the data called Business Intelligence (BI) testing. The BI analysis helps making decision.
The data might come from any databases , the source database could be different than the DW-database, or data may come from flat files, XML files, MainFrame etc. The source could be anything and the data is stored in the DW, which is a traget database. We use ETL to load data to the DW. ETL is explained below as you continue.
There are two types of transactions :
OLTP --- online transactional processing
and
OLAP --- online application processing.
OLTP maintains current data , like credit card transactions. It does not support business analysis. This data is not for Data Warehousing. This helps in monitoring the regular operational data. This is a detailed data. It has more joins, few indexes.
OLAP --- This is for DW purpose, it has historical data, the data is summarized. It has few joins, more indexes.
Some Terminologies:
Data Acquisition: It is data extraction from source databases , apply business or transformation rules and load into DW.
ETL : It is Data Extraction, Transformation and Loading. In this process, extract data from source tables from different databases or Flat files or XML files or MainFrame(source of data could be anything), apply business logic or transformation logic and then load into the target table or target database which is DW. The popular tools for ETL are : Informatica, Ab Initio, Data Stage etc.
Data Merging : It is a Transformation process of ETL. The data are transformed using joining between different source tables or different source databases and stored in DW.
Data Cleansing : It is a Transformation process of ETL. Some example of data cleansing are , combining first name and last name , replacing null with something like zero, data rounding, changing date format, applying UPPER , INITCAP, LOWER in the names etc.
Data Aggregation : It is a Transformation process of ETL. In this the aggregate functions are used such as sum, avg, count, max, min etc.
Data Loading : It is a Loading process of ETL. In this the transformed data is loaded to the target DW. The loading could be Initial(also called Full Loading or historical loading) , Incremental(also called Delta Loading) .
Full Loading is fast and it is done initially, first time loading.
Incremental Loading is loading of new data or transactions to the already loaded database(DW).
ETL Plan : This shows how the source data will be transformed and transferred to the target DW.
It could be a document or graphical representation of data flow.
MetaData : It is data about data. For example : The column headers on a table are meta data. The column headers have name and datatypes.
DataMart : It is a database having subject oriented data from dataware house. Such as : region specific, department specific, product specific etc.
Data Mining: It is a process of getting the required data from different sources, usually large volume of data sources, for business analysis/intellegence porpose. Data mining uses softwares, graphs, statistical methods, algorithms.
The data can be used in business analysis, decision making, risk management, future predictions etc.
Big Data: Every moment huge volume of data transaction are happening in the systems such as Youtube, Facebook, Google, Amazon, social media, shopping app, transportation app etc. The big data is characterized by 5 Vs: Volume, Velocity, Variety, Veracity and Value.
Volume: amount of data
Velocity: the data is generated quickly.
Variety: there are variety of data which could be Structured data, Semi Structured data and UnStructured data
Veracity: Accuracy and Trustworthiness of the data
Value: benefits of the data such as increase in efficiency, reduced costs, better risk management
DW-Schemas:
There are two types of schemas in DW:
Star Schema
Snowflake Schema
Star Schema :In this , there is a fact table at the center , the fact table is surrounded by dimension tables.
Snowflake Schema : In this, there is a fact table at the center, the fact table is surrounded by dimension tables. The denormalized dimensional tables are further split into one or more tables(sub-dimension tables) to normalize the dimension.
Fact Table : It has foreign keys from the dimension tables. It has numerical data called facts. Fact tables have less number of columns but large number of records compared to dimensional table.
Dimension Table : It has primary keys for the fact tables. These are loaded first while loading. It has context of the facts. It has data such as what, who, where which describe the fact table.
Slowly changing dimension(SCD):
To track the changes that is happening over the period of time.
SCD1: It stores only the current data.
SCD2: It stores full history related to the data. For example: a person's address data will have his full history of relocations from the beginning to now.
SCD3: It stores partial history related to the data. For example : a person's current address plus his most recent previous address.
Business Intelligence(BI):
It is a process of extracting the data from DW to analyze, evaluate the business. It could be data of certain period , data of a particular department, data of particular region etc. In IT project, reporting team works in BI. The popular tools in BI are : Business Object, MicroStrategy, Cognos etc.
Data Modelling : This is visual representation of the data. It shows what kind of data is used, what kind of business rule is applied and where the rule has been applied, data dependencies, Primary/Foreign keys, naming conventions, connection between the data objects(tables), any missing data or redundant data etc. Example is E-R modeling in relational database , E-R stands for Entity Relationship.
Data models are of three types:
1. Consecptual
2. Logical
3. Physical
Data Model/modelling: it is a visual representation of data flow, relationship with other data, dependencies, data integrity(Primary/Foreign keys).
Dimensional Modeling:
It is a way of organizing a data in data warehouse. It is a technique or concept of designing data warehouse. The main elements in dimensional modeling are facts, dimensions and attributes. There are different ways of organizing the facts, dimensions and the attributes such as Star, Snowflack, Multidimensional, and MultiFact Star. These are called schemas.
No comments:
Post a Comment