I want to give you a playful, logical learning experience of Business Objects, one that helps every beginner learn and get the complete picture.
SAP Business Objects BI or SAP BO, formerly known as BOBJ, is an enterprise software solution that specializes in reporting and analytical business intelligence (BI). It comprises various reporting applications that help users find data, conduct analytics, and generate reports.
SAP Business Objects also offers other services such as consulting and education. It aims to simplify the process of reporting and analytics, allowing SAP customers to deploy business intelligence projects and perform predictive analytics without support from data analysts.
Furthermore, the SAP BO BI applications make use of drag-and-drop functionality so that users can discover and analyze data from various sources. SAP BO is a front-end-based BI platform which integrates data from different back-end sources instead of storing it at the application level.
So Business Objects is basically a reporting tool, used to perform predictive analysis on reports built from the enterprise data available within the company, so that future decisions can be based on it.
Before understanding Business Objects, you need to understand data warehouse concepts and databases first, and then Business Objects itself.
Let's begin with the data warehouse concept. Every IT business runs on data, so in order to manage and store that data we need data storage, which is organized based on the concept of data warehousing. The key concepts are:
- Data warehouse
- Business Intelligence
- Data Mining
- Data Mart
- OLTP
- OLAP
- Normalization
- Denormalization
- Composite Key, Candidate Key, Unique Key, Surrogate Key
- Dimensions, Facts, Aggregated Facts
- Dimensional Modelling
- Conformed dimensions
- Forms of Dimensional modelling
- Normal Forms
- Data Points
- Slowly changing Dimensions
1. Data warehouse:
Formal Definition:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of the management decision-making process.
It means:
- Subject oriented: The stored data targets a specific subject. E.g. it may store data like sales or number of customers, and not general data on everyday operations.
- Integrated: Data may be distributed across heterogeneous sources which have to be integrated. E.g. sales data may be in an RDBMS while customer data is in flat files.
- Time variant: The stored data may not be current but varies with time, and the data has an element of time. E.g. sales data for the last 5 years.
- Non-volatile: It is separate from the enterprise operational database and is not subject to frequent modifications. Generally only two operations are performed on it: loading of data and access of data.
My understanding:
A data warehouse integrates data from different or multiple data sources for reporting and analysis. It stores a huge amount of data, more historical than current. It is a subject-oriented, integrated, time-variant, non-volatile collection of data to support the decision-making process of a business. It does not require data to be highly accurate. The goal is to execute statistical queries and provide results which can influence the decision-making process for an enterprise.
These systems are thus called Online Analytical Processing (OLAP) systems.
2. Business Intelligence:
Business intelligence (BI) is an umbrella term that includes the applications, infrastructure
and tools, and best practices that enable access to and analysis of information
to improve and optimize decisions and performance.
BI technologies provide historical, current
and predictive views of business operations. Common functions of business
intelligence technologies are reporting, online analytical processing,
analytics, data mining, process mining, complex event processing, business
performance management, benchmarking, text mining, predictive analytics and
prescriptive analytics.
My Understanding:
Business intelligence constitutes a broad category
of applications and technologies for gathering, storing, analyzing, and
providing access to data to help enterprise management or users to make better
business decisions. BI applications include the activities of decision
support systems, query
and reporting, online analytical processing (OLAP),
statistical analysis, forecasting, and data mining.
E.g. you can follow the trend of growth for
your business for the last ten years and based on this data, management can
forecast the business growth and take future decision to help growing the
business or an enterprise.
3. Data Mining:
Data mining is the process that attempts to discover patterns in
large data sets. It is done for predictive analysis. It utilizes methods at the intersection of
artificial intelligence, machine learning, statistics, and database systems. The
overall goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further use.
In other words, it is the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
My Understanding:
Data mining deals with analyzing large data
set to extract some specific information and discover a meaningful pattern to
help grow business or enterprise. It is done for predictive analysis. The
analysis is based on historical data plus current data. There are tools used
for data mining, most commonly used tools are:
- Google Chart API: Google's Web-based tool
allows users to create charts and graphs that can be easily embedded into a
website.
- SAS Analytics: SAS's software includes
tools for data mining and statistical analysis as well as forecasting trends
E.g. a shopping mall offers a season-end sale with some predefined combinations of products to increase sales and ultimately achieve profitable growth. The predefined combinations are decided based on the analysis of a large data set using data mining tools.
4. Data Mart:
A data mart is
the access layer of the data warehouse environment that is used to get data out
to the users. The data mart is a subset of the data warehouse that is usually
oriented to a specific business line or team. In some deployments, each
department or business unit is considered the owner of its data mart including
all the hardware, software and data. This enables each department to use,
manipulate and develop their data any way they see fit; without altering
information inside other data marts or the data warehouse. In other deployments
where conformed dimensions are used, this business unit ownership will not hold
true for shared dimensions like customer, product etc.
My Understanding:
Data mart is a concept of organization of
data into smaller units. A data mart is a subset of data warehouse specific to
a particular business stream.
E.g. a shopping mall has outlets of different brands doing business under a common shelter. Each outlet is interested in viewing its own data for business decisions and not any other outlet's data. To organize the data in an efficient way, data marts are created that contain data relevant to a specific brand outlet for its analysis.
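As a rough illustration (with hypothetical table and column names, not tied to any specific product), a data mart can be carved out of the warehouse as a filtered subset:
-- Hypothetical warehouse tables: Sales(Sale_Id, Outlet_Id, Product, Amount, Sale_Date)
-- and Outlet(Outlet_Id, Brand_Name).
-- A simple data mart for one brand outlet, built as a separate table (a view works too):
CREATE TABLE BrandX_Sales_Mart AS
SELECT s.Sale_Id,
       s.Product,
       s.Amount,
       s.Sale_Date
FROM   Sales  s
JOIN   Outlet o ON o.Outlet_Id = s.Outlet_Id
WHERE  o.Brand_Name = 'Brand X';   -- only this outlet's data is exposed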
5. OLTP: Online transaction processing
Online
transaction processing,
or OLTP, refers to a class of systems that facilitate and manage
transaction-oriented applications, typically for data entry and retrieval transaction
processing.
Transaction
processing is a type of computer processing in which the computer responds
immediately to user requests. Each request is considered to be a transaction.
Automatic teller machines for banks are an example of transaction processing.
OLTP (online
transaction processing) is a class of software programs capable of supporting
transaction-oriented applications on the Internet. Typically, OLTP systems are used for order
entry, financial transactions, customer relationship management (CRM) and
retail sales.
My Understanding:
OLTP is used for transaction-oriented processing. It is basically used for data entry based on each transaction and for retrieval of the same in a transactional manner. OLTP is based on operational data. E.g. ATMs are one of the most common examples of OLTP.
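As a rough sketch of the idea, here is what a single OLTP transaction for an ATM withdrawal might look like; the Accounts and Withdrawals tables and their columns are assumptions for illustration only:
-- One user request = one transaction; it either fully succeeds or is rolled back.
BEGIN TRANSACTION;          -- START TRANSACTION in some databases

UPDATE Accounts
SET    Balance = Balance - 100
WHERE  Account_Id = 1001;   -- debit the customer's account

INSERT INTO Withdrawals (Account_Id, Amount, Withdrawn_At)
VALUES (1001, 100, CURRENT_TIMESTAMP);   -- log this individual transaction

COMMIT;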
6. OLAP: Online Analytical Processing
On-Line Analytical Processing (OLAP) is a
category of software technology that enables analysts, managers and executives
to gain insight into data through fast, consistent, interactive access to a
wide variety of possible views of information that has been transformed from
raw data to reflect the real dimensionality of the enterprise as understood by
the user.
OLAP tools enable
users to interactively analyze multidimensional data from multiple
perspectives. OLAP consists of three basic analytical operations:
Consolidation (roll-up): Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.
Drilldown: In contrast, the drill-down is a
technique that allows users to navigate through the details. For instance,
users can view the sales by individual products that make up a region’s sales.
Slicing and dicing: It is a feature
whereby users can take out (slicing) a specific set of data of the OLAP cube
and view (dicing) the slices from different viewpoints.
Databases
configured for OLAP use a multidimensional data model, allowing for complex
analytical and ad-hoc queries with a rapid execution time. They borrow aspects
of navigational databases, hierarchical databases and relational databases.
My Understanding:
OLAP is commonly
based on analyzing historical data. It helps management of an enterprise or
business to develop insights in an effective, fast and interactive way. OLAP
tools are used to analyze multidimensional data from multiple views or
perspective.
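To illustrate the roll-up operation in plain SQL, the following query aggregates a hypothetical FactSales table from individual stores up to the region level; all table and column names are assumed:
-- Hypothetical roll-up: consolidate store-level sales up to the region level.
SELECT d.Region,
       t.Calendar_Year,
       SUM(f.Sales_Amount) AS Total_Sales
FROM   FactSales f
JOIN   DimStore  d ON d.Store_Key = f.Store_Key
JOIN   DimDate   t ON t.Date_Key  = f.Date_Key
GROUP  BY d.Region, t.Calendar_Year    -- roll-up: individual stores are aggregated away
ORDER  BY d.Region, t.Calendar_Year;
-- Drilling down would simply add d.Store_Name (or a product attribute) back into the GROUP BY.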
7. OLTP vs OLAP:
OLTP systems handle day-to-day transactional data entry and retrieval against current operational data, while OLAP systems analyze largely historical, multidimensional data to support decision making. OLTP transactions are short and frequent, whereas OLAP queries are complex, analytical and ad-hoc, often spanning large volumes of data.
8. Normalization:
Normalization is the process of efficiently organizing data in a
database. There are two goals of the normalization process: eliminating
redundant data (for example, storing the same data in more than one table) and
ensuring data dependencies make sense (only storing related data in a table). Normalization usually involves dividing
large tables into smaller (and less redundant) tables and defining
relationships between them. The objective is to isolate data so that additions,
deletions, and modifications of a field can be made in just one table and then
propagated through the rest of the database via the defined relationships.
My Understanding:
The main goal of the normalization process is organizing data in an effective manner. Normalization removes redundancy so that duplicate data is not stored in different tables in a database, and it ensures that only relevant data is stored in a table.
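As a small sketch of normalization in practice, assume a hypothetical flat Orders table that repeats customer details on every row; splitting it into two related tables removes that redundancy:
-- Before: customer name and city are repeated on every order row.
-- Orders(Order_Id, Customer_Name, Customer_City, Product, Amount)

-- After normalization: customer data lives in exactly one place.
CREATE TABLE Customers (
    Customer_Id   INT          NOT NULL PRIMARY KEY,
    Customer_Name VARCHAR(100) NOT NULL,
    Customer_City VARCHAR(100)
);

CREATE TABLE Orders (
    Order_Id    INT           NOT NULL PRIMARY KEY,
    Customer_Id INT           NOT NULL REFERENCES Customers (Customer_Id),
    Product     VARCHAR(100),
    Amount      DECIMAL(10,2)
);
-- A customer's city is now changed in one row only and propagates via the relationship.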
9. Denormalization:
Denormalization is the process of attempting to optimize
the read performance of a database by adding redundant data or by grouping
data. In some cases, denormalization helps cover up the inefficiencies inherent in relational database software. A valid reason for denormalizing a relational design is to enhance performance.
A true relational DBMS would allow for a fully normalized database at the logical level, while providing physical storage of data that is tuned for high performance. Denormalization is a technique to move from higher to lower normal forms of database modeling in order to speed up database access.
My Understanding:
Denormalization is the concept of adding redundant data or grouping data in order to optimize the read performance of a database.
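Continuing the hypothetical Customers/Orders example above, a minimal sketch of denormalization would copy the customer attributes into a pre-joined reporting table so reads can avoid the join:
-- Denormalized reporting table: customer attributes are duplicated next to each order
-- so that queries can read it without joining back to Customers.
CREATE TABLE Orders_Report AS
SELECT o.Order_Id,
       o.Product,
       o.Amount,
       c.Customer_Name,   -- redundant copy, repeated for every order of the customer
       c.Customer_City    -- redundant copy
FROM   Orders    o
JOIN   Customers c ON c.Customer_Id = o.Customer_Id;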
10. Composite Key:
A Primary Key uniquely identifies each row in a table. It is not always a single-column key; it could be:
- a single-column key
- or a composite key
A primary key can consist of one or more columns of a table. When two or more columns are used as a primary key, they are called a composite key. Each single column's data can be duplicated, but the combination of values of these columns cannot be duplicated.
For example, suppose you have a Student table and a Course table, where one student can select many courses and one course can be selected by many students; this is a many-to-many relationship. So you need to create a third table to define the relationship, say it's called StudentCourse. It is important to note that you only need the StudentID and CourseID in this table, as a composite key. You do not need an extra identity ID column to uniquely identify each row; in fact, an ID column alone is not sufficient, because it cannot prevent the same student selecting the same course from being inserted into this table twice.
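A minimal sketch of the StudentCourse table described above (column types are assumed):
-- The combination (StudentID, CourseID) is the composite primary key:
-- each value may repeat on its own, but the pair may appear only once.
CREATE TABLE StudentCourse (
    StudentID INT NOT NULL REFERENCES Student (StudentID),
    CourseID  INT NOT NULL REFERENCES Course  (CourseID),
    PRIMARY KEY (StudentID, CourseID)   -- composite key prevents duplicate enrolments
);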
Candidate Key:
Candidate key is
a combination of attributes that can be uniquely used to identify a database
record without any extraneous data. Each table may have one or more candidate
keys. One of these candidate keys is selected as the table primary key.
Unique Key:
The UNIQUE constraint uniquely identifies
each record in a database table. The UNIQUE and PRIMARY KEY constraints both
provide a guarantee for uniqueness for a column or set of columns. A PRIMARY
KEY constraint automatically has a UNIQUE constraint defined on it.
Note that you can have many UNIQUE
constraints per table, but only one PRIMARY KEY constraint per table.
e.g. The following SQL creates a UNIQUE
constraint on the "P_Id" column when the "Persons" table is
created:
CREATE TABLE Persons
(
P_Id int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255),
UNIQUE (P_Id)
)
Surrogate key:
A surrogate key is a substitution for the natural primary key. It is just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in the Customer table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users), but not only can these change, indexing on a numerical value is usually better, so you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far as the client is concerned you may display only the AIRPORT_NAME.
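A minimal sketch of the AIRPORT example, assuming SQL Server-style IDENTITY syntax for the surrogate key (other databases would use a sequence instead):
-- AIRPORT_ID is a meaningless, system-generated surrogate key;
-- AIRPORT_NAME is the natural key the business users actually see.
CREATE TABLE Airport (
    AIRPORT_ID   INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key (SQL Server syntax)
    AIRPORT_NAME VARCHAR(100) NOT NULL UNIQUE,   -- natural key, may change over time
    CITY_NAME    VARCHAR(100)
);
-- Other tables reference AIRPORT_ID, so a renamed airport needs no cascading key updates.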
11. Fact Table :
In data warehousing, a fact table consists
of the measurements, metrics or facts of a business process. It is located at
the center of a star schema or a snowflake schema surrounded by dimension
tables.
Fact tables provide the (usually) additive
values that act as independent variables by which dimensional attributes are
analyzed. Fact tables are often defined by their grain. The grain of a fact
table represents the most atomic level by which the facts may be defined. The
grain of a SALES fact table might be stated as "Sales volume by Day by
Product by Store". Each record in this fact table is therefore uniquely
defined by a day, product and store.
A fact table is used in dimensional model in data
warehouse design. A fact table
is found at the center of a star
schema or snowflake schema surrounded by dimension tables.
A fact table consists of
facts of a particular business process e.g., sales revenue by month by product.
Facts are also known as measurements or metrics. A fact table record captures a
measurement or a metric.
My Understanding:
A fact table consists of the measurable, summable, calculable metrics of a business process. Fact tables provide values that act as variables based on which dimensional attributes are analyzed. Fact tables contain fact values plus keys to access the dimension tables.
Dimension table:
In data warehousing, a dimension table is
one of the set of companion tables to a fact table.
The fact table contains business facts or
measures and foreign keys which refer to candidate keys (normally primary keys)
in the dimension tables.
Contrary to fact tables, the dimension
tables contain descriptive attributes (or fields) which are typically textual
fields or discrete numbers that behave like text. These attributes are designed
to serve two critical purposes: query constraining/filtering and query result
set labeling.
Dimension table rows are uniquely identified by a single key field. It is recommended that the key field be a simple integer, because the key value is meaningless and is only used to join the fact and dimension tables.
My Understanding:
Dimension table
consists of different entities that are non-measurable and non-aggregatable.
Dimension table is one of the set of companion tables to a fact table. Dimension
tables contain fields which are typically textual
fields or discrete numbers that behave like text.
A dimension table is a table in a star
schema of a data warehouse. A dimension table stores attributes, or dimensions,
that describe the objects in a fact table.
Dimension tables
are referenced by fact tables using keys. When creating a dimension table in a
data warehouse, a system-generated key is used to uniquely identify a row in
the dimension. This key is also known as a surrogate key. The surrogate key is
used as the primary key in the dimension table. The surrogate key is placed in
the fact table and a foreign key is defined between the two tables. When the
data is joined, it does so just as any other join within the database.
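Tying the fact and dimension tables together, here is a hedged sketch of a tiny star schema at the grain "sales by day, product and store"; all names are hypothetical:
-- Dimension tables: descriptive attributes, surrogate keys as primary keys.
CREATE TABLE DimDate    (Date_Key    INT PRIMARY KEY, Full_Date DATE, Calendar_Year INT, Calendar_Month INT);
CREATE TABLE DimProduct (Product_Key INT PRIMARY KEY, Product_Name VARCHAR(100), Category VARCHAR(50));
CREATE TABLE DimStore   (Store_Key   INT PRIMARY KEY, Store_Name VARCHAR(100), Region VARCHAR(50));

-- Fact table: additive measures plus foreign keys to each dimension.
CREATE TABLE FactSales (
    Date_Key     INT NOT NULL REFERENCES DimDate    (Date_Key),
    Product_Key  INT NOT NULL REFERENCES DimProduct (Product_Key),
    Store_Key    INT NOT NULL REFERENCES DimStore   (Store_Key),
    Sales_Amount DECIMAL(12,2),   -- measure
    Units_Sold   INT,             -- measure
    PRIMARY KEY (Date_Key, Product_Key, Store_Key)   -- one row per day/product/store (the grain)
);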
12. Database modeling:
A database model
is a theory or specification describing how a database is
structured and used. Several such models have been suggested.
Common models include the relational model, the entity-relationship model and the object-relational model.
A data model is not just a way of structuring data: it also
defines a set of operations that can be performed on the data. The relational
model, for example, defines operations such as select (project) and join. Although these operations may not be explicit in a particular query language, they provide the foundation on which a
query language is built.
My Understanding:
Database modeling deals with the
structuring of data in a database. It is a theory describing how a database is
structured and used. Some of the examples are Relational model, Entity-relationship, Object-relational
model.
13. Dimensional modeling:
Dimensional
modeling (DM) is the name of a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship modeling (ER).
Dimensional
Modeling does not necessarily involve a relational database. The same modeling
approach, at the logical level, can be used for any physical form, such as
multidimensional database or even flat files. DM is a design technique for
databases intended to support end-user queries in a data warehouse. It is
oriented around understandability and performance.
Dimensional
modeling always uses the concepts of facts (measures), and dimensions
(context). Facts are typically (but not always) numeric values that can be
aggregated, and dimensions are groups of hierarchies and descriptors that
define the facts. For example, sales amount is a fact; timestamp, product,
register#, store#, etc. are elements of dimensions. Dimensional models are
built by business process area, e.g. store sales, inventory, claims, etc.
Because the different business process areas share some but not all dimensions,
efficiency in design, operation, and consistency, is achieved using conformed dimensions.
My Understanding:
Dimensional modeling
deals with the concept of facts and dimensions. It is a design proposal for
databases intended to support end-user queries in a data warehouse. This
modeling is used in data warehouse design.
For example,
sales amount is a fact; timestamp, product, register#, store#, etc. are
elements of dimensions. Dimensional models are built by business process area,
e.g. store sales, inventory, claims, etc. Because the different business
process areas share some but not all dimensions, efficiency in design, operation,
and consistency, is achieved using conformed
dimensions.
14. Conformed dimension:
A conformed
dimension is a set of data attributes that have been physically implemented in
multiple database tables using the same structure, attributes, domain values,
definitions and concepts in each implementation. A conformed dimension cuts
across many facts.
Conformed
dimensions are either identical or strict mathematical subsets of the most
granular, detailed dimension. Dimension tables are not conformed if the
attributes are labeled differently or contain different values. Conformed
dimensions come in several different flavors. At the most basic level, conformed
dimensions mean exactly the same thing with every possible fact table to which
they are joined. The date dimension table connected to the sales facts is
identical to the date dimension connected to the inventory facts.
My Understanding:
Dimensions are
conformed when they are either exactly the same (including keys) or one is a
perfect subset of the other. Most important, the row headers produced in the
answer sets from two different conformed dimensions must be able to match
perfectly.
15. Forms of Dimensional Model:
Star schema:
Star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.
A star schema
classifies the attributes of an event into facts (measured numeric/time data),
and descriptive dimension attributes (product id, customer name, sale date)
that give the facts a context. A fact record is the nexus between the specific
dimension values and the recorded facts. The facts are stored at a uniform
level of detail (the grain) in the fact table. Dimension attributes are
organized into affinity groups and stored in a minimal number of dimension
tables.
Star schemas are
designed to optimize user ease-of-use and retrieval performance by minimizing
the number of tables to join to materialize a transaction.
A star schema is called such because it resembles a constellation of stars, generally several bright stars (facts) surrounded by dimmer ones (dimensions).
The fact table
holds the metric values recorded for a specific event. Because of the desire to
hold atomic level data, there generally are a very large number of records
(billions). Special care is taken to minimize the number and size of attributes
in order to constrain the overall table size and maintain performance. Fact
tables generally come in 3 flavors - transaction (facts about a specific event
e.g. Sale), snapshot (facts recorded at a point in time e.g. Account details at
month end), and accumulating snapshot tables (e.g. month-to-date sales for a
product).
Dimension tables
usually have few records compared to fact tables, but may have a very large
number of attributes that describe the fact data.
Snowflake schema:
Snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.
The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension represented by a single table.
A complex
snowflake shape emerges when the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and the child tables have
multiple parent tables ("forks in the road"). The
"snowflaking" effect only affects the dimension tables and NOT the
fact tables.
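As a sketch of the difference, assume the hypothetical product dimension from the star schema sketch above: in a snowflake schema the category level is normalized out into its own table instead of being repeated in DimProduct:
-- Snowflaked product dimension: the category level becomes a separate, related table.
CREATE TABLE DimCategory (
    Category_Key  INT PRIMARY KEY,
    Category_Name VARCHAR(50)
);

CREATE TABLE DimProduct (
    Product_Key  INT PRIMARY KEY,
    Product_Name VARCHAR(100),
    Category_Key INT REFERENCES DimCategory (Category_Key)  -- normalized, no repeated text
);
-- In a plain star schema these two tables would be collapsed into one denormalized DimProduct.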
Fact Constellation Schema:
This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts and dimensions. This is also called a multiple star schema.
16. The Normal Forms:
The
database community has developed a series of guidelines for ensuring that
databases are normalized. These are referred to as normal forms and are
numbered from one (the lowest form of normalization, referred to as first normal
form or 1NF) through five
(fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF, along with the occasional 4NF.
First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database:
- Eliminate
duplicative columns from the same table.
- Create
separate tables for each group of related data and identify each row with
a unique column or set of columns (the primary key).
Second Normal Form (2NF)
Second normal form (2NF) further addresses the concept of removing
duplicative data:
- Meet all
the requirements of the first normal form.
- Remove
subsets of data that apply to multiple rows of a table and place them in
separate tables.
- Create
relationships between these new tables and their predecessors through the
use of foreign keys.
Third Normal Form (3NF)
Third normal form (3NF) goes one large step further:
- Meet all
the requirements of the second normal form.
- Remove
columns that are not dependent upon the primary
key.
Boyce-Codd Normal Form (BCNF or 3.5NF)
The Boyce-Codd Normal Form, also referred to as the "third and half
(3.5) normal form", adds one more requirement:
- Meet all
the requirements of the third normal form.
- Every
determinant must be a candidate key.
Fourth Normal Form (4NF)
Finally, fourth normal form (4NF) has one additional requirement:
- Meet all
the requirements of the third normal form.
- A relation
is in 4NF if it has no multi-valued dependencies.
Remember, these normalization guidelines are cumulative. For a database
to be in 2NF, it must first fulfill all the criteria of a 1NF database.
17. Data Points:
A data point is a set of measurements on a single
member of a statistical population, or a subset of
those measurements for a given individual. In addition, in statistical graphics, a "data point" may be an individual item within a statistical display; such points may relate to either a single member of a population or to a summary statistic calculated for a given subpopulation.
18. Slowly changing dimension:
A dimension is a logical grouping of data such as geographical location, customer or product information. With Slowly Changing Dimensions (SCDs), data changes slowly, rather than changing on a time-based, regular schedule.
For
example, you may have a dimension in your database
that tracks the sales records of your company's salespeople. Creating sales
reports seems simple enough, until a salesperson is transferred from one
regional office to another. How do you record such a change in your sales
dimension?
You could calculate the sum or average of each salesperson's sales, but if you use that to compare the performance of salespeople, it might give misleading information. If the salesperson was transferred from a hot market where sales were easy to a market where sales are infrequent, his/her totals will look much stronger than those of the other salespeople in the new region. Or you could create a second salesperson record and treat the transferred person as a new salesperson, but that creates problems.
Dealing
with these issues involves SCD management methodologies referred to as Type 0
through 6. Type 6 SCDs are also sometimes called Hybrid SCDs. The most common
types are I, II, and III.
Type 1:
This
methodology overwrites old with new data, and therefore does not track
historical data. Its common uses are for misspelled names. (Assuming you won't
need to know how it was misspelled in the past.)
Example of a supplier table:
In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, to optimize performance on joins, an integer key is preferred over a character key.
If the supplier relocates its headquarters to Illinois, the updated table would simply overwrite this record:
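A minimal SQL sketch of this Type 1 overwrite, assuming a hypothetical Supplier dimension with columns Supplier_Key, Supplier_Code, Supplier_Name and Supplier_State:
-- Type 1: overwrite in place, no history is kept.
UPDATE Supplier
SET    Supplier_State = 'IL'        -- new headquarters state
WHERE  Supplier_Code  = 'ABC123';   -- hypothetical natural key value
-- The old state is gone; any aggregate built on Supplier_State must be recalculated.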
The disadvantage of this approach is that there is no history in the data warehouse. An advantage is that it is easy to maintain. If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated whenever Supplier_State changes.
Type II:
This method
tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers.
Unlimited history is preserved for each insert.
For example, if the supplier relocates to Illinois, a new row is inserted and the version numbers are incremented sequentially, as sketched below:
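With the same hypothetical Supplier dimension, a Type 2 change inserts a new row with a new surrogate key and version number instead of updating the old one:
-- Type 2: keep the old row, add a new row with a new surrogate key and version.
INSERT INTO Supplier (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, Version)
VALUES (124, 'ABC123', 'Acme Supply Co', 'IL', 2);   -- hypothetical values; the version 1 row is untouched
-- Optionally, Start_Date/End_Date or a Current_Flag column marks which row is current.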
Type III:
This method tracks changes using separate columns and preserves limited history. Unlike Type II, which preserves unlimited history, Type III is limited to the number of columns designated for storing historical data. The original table structure in Type I and Type II is the same, but Type III adds additional columns. In the following example, an additional column has been added to record the supplier's original state; only the previous history is stored.
This record contains a column for the original state and a column for the current state; it cannot track the changes if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent historical change.
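A sketch of a Type 3 change, assuming the same hypothetical Supplier dimension extended with an Original_Supplier_State column:
-- Type 3: an extra column preserves limited history; the row itself is updated in place.
UPDATE Supplier
SET    Original_Supplier_State = Supplier_State,   -- keep the value being replaced
       Supplier_State          = 'IL'              -- overwrite with the new value
WHERE  Supplier_Code = 'ABC123';
-- Only one prior value fits in the extra column, so a second relocation loses history.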
Type 4:
The Type 4 method
is usually referred to as using "history tables", where one table
keeps the current data, and an additional table is used to keep a record of
some or all changes.
For the above example, the original table name is Supplier and the history table is Supplier_History.
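A sketch of the Type 4 approach using the Supplier and Supplier_History names above; the history table's structure is an assumption:
-- Type 4: the current row is overwritten, but the old version is copied to a history table first.
INSERT INTO Supplier_History (Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, Changed_At)
SELECT Supplier_Key, Supplier_Code, Supplier_Name, Supplier_State, CURRENT_TIMESTAMP
FROM   Supplier
WHERE  Supplier_Code = 'ABC123';

UPDATE Supplier
SET    Supplier_State = 'IL'
WHERE  Supplier_Code  = 'ABC123';   -- Supplier always holds only the current data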
My Understanding:
SCDs are dimensions that change over time, e.g. designation, address, contact number. To handle SCDs, three types are most widely used:
Type 1:
The existing record is updated or overwritten, and there is no track of the historical information.
Type 2:
A new record is inserted; multiple records for a given natural key are kept in the dimension table with separate surrogate keys and/or different version numbers, to differentiate each record uniquely.
Type 3:
A new column is added. This method tracks changes using separate columns and preserves limited history. Unlike Type II, which preserves unlimited history, it is limited to the number of columns designated for storing historical data.