Basics and getting started with the `rolap` package
Jose Samos (jsamos@ugr.es)
2023-07-28
Source:vignettes/rolap.Rmd
rolap.Rmd
Introduction
The multidimensional data model was defined in the 1990s with the aim of supporting data analysis. Data in multidimensional systems is obtained from operational systems and is transformed to adapt it to the new structure.
Transformations can be carried out using professional ETL (Extract, Transform and Load) tools. Recently, tools aimed at end users have emerged, which are also aimed at performing transformation operations. All these tools are very useful to carry out the transformation process, they provide a development environment to define the transformation operations in a general way.
Frequently, the operations to be performed aim to transform tables (with data that comes from operational systems) into a ROLAP (Relational On-Line Analytical Processing) star database, made up of fact and dimension tables, which implements a multidimensional system. With the tools mentioned above, this transformation can be carried out, but it requires a lot of work.
The goal of rolap
is to define transformations that
allow you to easily obtain ROLAP star databases, composed by fact and
dimension tables, from operational tables and to be able to export them
in various formats to be used by OLAP query tools.
The rest of this document is structured as follows: First, basic concepts of dimensional modelling and star databases are summarized. The following is an illustrative example of how the package works, to show usage of features it offers. Finally, the document ends with conclusions and bibliography.
Dimensional modelling and star databases
The content of this section is a summary based mainly on Adamson (2010) and Kimball and Ross (2013). It is intended to
present the fundamental concepts of the area that are relevant to
understand the functionality, naming and use of the rolap
package.
Dimensional modelling
Dimensional modelling aims to obtain simple data models. Simplicity is sought for two reasons: so that decision-makers can easily understand the data, and also so that they can be easily queried.
In dimensional modelling, the analysis of a business process is performed modelling how it is measured. The measures are called facts, and the descriptors of the context of the facts are dimensions. Facts are numerical data, contain measures, and decision makers want to see them at various levels of detail, defined by dimensions.
Not all numerical data is a measure in facts (some tools consider it that way). In dimensional modelling the designer has to differentiate between facts and dimensions. Some criteria are considered to distinguish between them, for example:
- If it can be defined at different levels of detail then it is a fact.
- If it is quantitative and takes continuous values, then it is a fact.
- If it provides context then it is a dimension.
Sometimes there are no measures associated with the business process, it is simply recorded that the combination of dimensions has occurred. This situation is often called factless facts, Jensen, Pedersen, and Thomsen (2010) prefer to call it measureless facts. In any case, including when no other measures are available, a measure can be considered that represents the number of times the combination of dimension values occurs.
Dimensions and dimension attributes
Attributes considered by the designer as dimensions can be grouped taking into account the natural affinities between them. In particular, they can be grouped as they describe the “who, what, where, when, how and why” associated with the modelled business process. Two attributes share a natural affinity when they are only related in one context. When their relationships are determined by transactions or activities, they can occur in multiple contexts, if this occurs, they must be located in different dimensions.
In this way, a dimension is made up of a set of naturally related dimension attributes that describe the context of facts. Dimensions are used for two purposes: fact selection and fact grouping with the desired level of detail.
Additionally, in the dimensions hierarchies with levels and descriptors can be defined. More details can be found at Jensen, Pedersen, and Thomsen (2010). These concepts are not used in the current version of the package.
Facts and measures
Facts have a granularity, which is determined by the attributes of the dimensions that are considered at each moment. Thus, a measure in a fact has two components, the numerical property and an formula, frequently the SUM aggregation function, that allows combining several values of this measure to obtain a new value of the same measure with a coarser granularity (Jensen, Pedersen, and Thomsen 2010).
According to their behaviour to obtain a coarser granularity, three types of measures are distinguished: additive, semi-additive and non-additive. For additive measures, SUM is always a valid formula that maintains the meaning of the measure when the granularity changes. For semi-additive measures, there is no point in using SUM when changing the level of detail in any of the dimensions because the meaning of the measure changes, this frequently occurs in dimensions that represents time and measures representing inventory level. For non-additive measures, values cannot be combined across any dimension using SUM because the result obtained has a different meaning from the original measure (generally occurs with ratios, percentages or unit amounts such as unit cost or unit price).
The most useful measures are additive. If we have non-additive measures, they can generally be redefined from other additive measures.
Star databases
Dimensional models implemented in RDBMS (Relational Database Management Systems) are known as ROLAP (Relational On-Line Analytical Processing), if they use a table for each dimension are called ROLAP star databases because of their resemblance to a star-like structure: A fact table in the centre and dimension tables around it. Thus, dimension attributes are columns of the respective dimension tables, and measures are columns of the fact table.
Other possible implementations on RDBMS normalize dimensions and are known as ROLAP snowflake databases. More details can be found at Jensen, Pedersen, and Thomsen (2010). This is not considered in this package.
Dimension tables
Dimension tables contain the context associated with business process measures. Although they can contain any type of data, numerical data is generally not used for dimension attributes because some end user query tools consider any numeric data as a measure.
Dimension attributes with NULL value are a source of problems when querying since DBMS and query tools sometimes handle them inconsistently, the result depends on the product. It is recommended to avoid the use of NULL and replace them with a descriptive text. In the case of dates, it is recommended to replace the NULL values with an arbitrary date in the very far future.
Surrogate keys
A dimension table contains dimension attributes and also a
surrogate key column. This column is a unique identifier that
has no intrinsic meaning: It is generally an integer and is the primary
key for the dimension table. In Adamson
(2010) surrogate keys are easily identifiable by the suffix
“_key” in the column name (and this criterion has also been applied in
rolap
package).
Dimension tables also contain natural key columns that uniquely identify associated entities in an operational system. The separation of surrogate keys and natural keys allows the star schema to store changes in dimensions. Therefore, the use of surrogate keys in dimensions is a solution to the SCD (Slowly Changing Dimensions) problem.
Special dimensions
In some cases, for the sake of simplicity, it is helpful to create a table that contains dimension attributes that have no natural affinities to each other, generally these are low-cardinality flags and indicators. The result is what is known as a junk dimension. They do not require any special support, only the designer’s will to define them.
Sometimes some dimension attributes are left in the fact table, usually transaction identifiers. It is considered as the primary key of a dimension that does not have an associated table, for this reason it is known as a degenerate dimension. In this package, each dimension is represented by a table, for a degenerate dimension this table will have a single attribute (for example, the transaction identifier). If the dimension table is subsequently deleted, we will have a degenerate dimension.
Associated with multiple star databases we have the conformed dimensions that are presented in section Conformed dimensions.
Fact table
At the centre of a star database is the fact table. In addition to containing measures, the fact table includes foreign keys that refer to each of the surrogate keys in the dimension tables.
Primary key
A subset of foreign keys (along with possibly keys of degenerate dimensions) is considered to form the primary key of the fact table.
Grain
The subset of dimensions that forms the primary key defines the level of detail stored in the fact table, which is known as the fact table’s grain. In the design process, it is very important for the designer to clearly define the grain of the fact table (it is usually defined by listing the dimensions whose surrogate keys form its primary key): It is a way to ensure that all the facts are stored at the same level of detail.
At the finest grain, a row in the fact table corresponds to the measures of an event and vice versa, it is not influenced by the possible reports that may be obtained. When two facts have different grains, they should be set on different fact tables.
Constellation
It is frequent the need to have several fact tables for various reasons:
We find measures with different grain.
There are measures that do not occur simultaneously, for example, when one occurs, we have no value for others and vice versa.
In reality it is about different business processes, each one has to have its own fact table but they have dimensions in common. This is known as a constellation which corresponds to the Kimball enterprise data warehouse bus architecture.
Conformed dimensions
When star databases share a set of common dimensions in a constellation, these dimensions are called conformed dimensions.
There are several possibilities to have conformed dimensions, the
most obvious form is that the dimension tables share structure
and content, that is, they are identical dimensions. This is
the one considered in this version of the rolap
package.
An illustrative example
In this section an illustrative example is developed to show how the package works.
Data sets
To illustrate how the package works we will use a small part of the Deaths in 122 U.S. cities - 1962-2016. 122 Cities Mortality Reporting System dataset.
Year | WEEK | Week Ending Date | REGION | State | City | Pneumonia and Influenza Deaths | All Deaths | <1 year (all cause deaths) | 1-24 years (all cause deaths) | 25-44 years | 45-64 years (all cause deaths) | 65+ years (all cause deaths) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1962 | 2 | 01/13/1962 | 1 | MA | Boston | 11 | 270 | 14 | 8 | 11 | 70 | 167 |
1962 | 4 | 01/27/1962 | 1 | MA | Boston | 12 | 285 | 22 | 7 | 8 | 73 | 175 |
1963 | 4 | 01/26/1963 | 1 | MA | Boston | 10 | 276 | 11 | 14 | 17 | 67 | 167 |
1964 | 3 | 01/18/1964 | 1 | MA | Boston | 13 | 325 | 17 | 7 | 24 | 90 | 187 |
1964 | 6 | 02/08/1964 | 1 | MA | Boston | 9 | 244 | 13 | 9 | 14 | 61 | 147 |
1962 | 3 | 01/20/1962 | 1 | CT | Bridgeport | 2 | 40 | 5 | 1 | 3 | 10 | 21 |
1962 | 5 | 02/03/1962 | 1 | CT | Bridgeport | 5 | 46 | 6 | 0 | 3 | 15 | 22 |
1962 | 8 | 02/24/1962 | 1 | CT | Bridgeport | 2 | 45 | 2 | 0 | 2 | 16 | 25 |
1963 | 4 | 01/26/1963 | 1 | CT | Bridgeport | 2 | 46 | 4 | 0 | 3 | 10 | 29 |
1964 | 5 | 02/01/1964 | 1 | CT | Bridgeport | 8 | 45 | 3 | 1 | 2 | 11 | 28 |
1962 | 9 | 03/03/1962 | 1 | MA | Cambridge | 4 | 39 | 1 | 0 | 2 | 7 | 29 |
1964 | 2 | 01/11/1964 | 1 | MA | Cambridge | 7 | 31 | 1 | 0 | 2 | 9 | 19 |
1964 | 5 | 02/01/1964 | 1 | MA | Cambridge | 6 | 27 | 2 | 0 | 0 | 8 | 17 |
1964 | 9 | 02/29/1964 | 1 | MA | Cambridge | 0 | 26 | 0 | 0 | 2 | 8 | 16 |
1962 | 4 | 01/27/1962 | 1 | CT | Hartford | 1 | 47 | 7 | 1 | 0 | 14 | 25 |
1962 | 7 | 02/17/1962 | 1 | CT | Hartford | 4 | 57 | 3 | 1 | 3 | 21 | 29 |
1963 | 3 | 01/19/1963 | 1 | CT | Hartford | 2 | 66 | 7 | 2 | 3 | 18 | 36 |
1963 | 7 | 02/16/1963 | 1 | CT | Hartford | 4 | 77 | 6 | 1 | 7 | 19 | 44 |
1963 | 8 | 02/23/1963 | 1 | CT | Hartford | 6 | 49 | 3 | 2 | 3 | 14 | 27 |
1964 | 2 | 01/11/1964 | 1 | CT | Hartford | 3 | 53 | 7 | 0 | 2 | 16 | 28 |
In the table above, all the rows of the ft
dataset are
shown. For each week and city, mortality figures by age
group and cause, considered separately, are included
(i.e., the combination of age group and cause is not included). In the
cause, only a distinction is made between pneumonia or influenza and
others. Regarding the data types, all the columns are of type character,
they have been imported directly as a tibble
, without
indicating a specific type.
From this initial dataset, two additional datasets,
ft_num
and ft_age
, have been defined with the
transformations shown below.
library(rolap)
ft_num <- ft |>
dplyr::mutate(`Pneumonia and Influenza Deaths` = as.integer(`Pneumonia and Influenza Deaths`)) |>
dplyr::mutate(`All Deaths` = as.integer(`All Deaths`))
ft_age <- ft |>
dplyr::select(-`Pneumonia and Influenza Deaths`,-`All Deaths`) |>
tidyr::gather("Age", "All Deaths", 7:11) |>
dplyr::mutate(`All Deaths` = as.integer(`All Deaths`)) |>
dplyr::mutate(Age = stringr::str_replace(Age, " \\(all cause deaths\\)", ""))
ft_num
has the same structure and data as
ft
, only the data types of the two cause measures change.
For ft_age
, the columns with cause measures are removed and
it is transformed into a tidy table (through
tidyr::gather
), with the only measure of numeric type and
removing redundant descriptions. Below are the first 15 rows of
ft_age
.
Year | WEEK | Week Ending Date | REGION | State | City | Age | All Deaths |
---|---|---|---|---|---|---|---|
1962 | 2 | 01/13/1962 | 1 | MA | Boston | <1 year | 14 |
1962 | 4 | 01/27/1962 | 1 | MA | Boston | <1 year | 22 |
1963 | 4 | 01/26/1963 | 1 | MA | Boston | <1 year | 11 |
1964 | 3 | 01/18/1964 | 1 | MA | Boston | <1 year | 17 |
1964 | 6 | 02/08/1964 | 1 | MA | Boston | <1 year | 13 |
1962 | 3 | 01/20/1962 | 1 | CT | Bridgeport | <1 year | 5 |
1962 | 5 | 02/03/1962 | 1 | CT | Bridgeport | <1 year | 6 |
1962 | 8 | 02/24/1962 | 1 | CT | Bridgeport | <1 year | 2 |
1963 | 4 | 01/26/1963 | 1 | CT | Bridgeport | <1 year | 4 |
1964 | 5 | 02/01/1964 | 1 | CT | Bridgeport | <1 year | 3 |
1962 | 9 | 03/03/1962 | 1 | MA | Cambridge | <1 year | 1 |
1964 | 2 | 01/11/1964 | 1 | MA | Cambridge | <1 year | 1 |
1964 | 5 | 02/01/1964 | 1 | MA | Cambridge | <1 year | 2 |
1964 | 9 | 02/29/1964 | 1 | MA | Cambridge | <1 year | 0 |
1962 | 4 | 01/27/1962 | 1 | CT | Hartford | <1 year | 7 |
ft_num
and ft_age
are the tables that will
be considered as a starting point to obtain star databases from them in
this example.
Dimensional modelling
For each table, the goal is to define the attributes that correspond to measures in facts and those that are in dimensions. For facts, measures and their aggregation functions have to be defined. For dimensions, attributes with natural affinity must be grouped. Each attribute should only appear once in the definition.
To avoid having to write the name of the attributes of the table, with the following function we can have them in the form of a string. Thus, we can copy and paste each name as needed.
dput(colnames(ft_num))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City",
#> "Pneumonia and Influenza Deaths", "All Deaths", "<1 year (all cause deaths)",
#> "1-24 years (all cause deaths)", "25-44 years", "45-64 years (all cause deaths)",
#> "65+ years (all cause deaths)")
dput(colnames(ft_age))
#> c("Year", "WEEK", "Week Ending Date", "REGION", "State", "City",
#> "Age", "All Deaths")
Star Schema
We create an star_schema
object using the
star_schema()
function, as shown below.
s <- star_schema()
We already have an empty schema, we can add facts and dimensions to it, as shown in the sections that follow.
Dimensions
We can define the dimensions through variables, using the
dimension_schema()
function and then the
define_dimension()
function in a star_schema
object, or directly in a star_schema
object , using the
define_dimension()
function. In the example below, the
result in both cases is the same. The advantage is that by using
variables we save defining them several times if the same dimension
appears in several star schemes.
# definition 1
when <- dimension_schema(name = "When",
attributes = c("Year",
"WEEK"))
s_age <- star_schema() |>
define_dimension(when)
# definition 2
s_age <- star_schema() |>
define_dimension(name = "When",
attributes = c("Year",
"WEEK"))
Facts
For completeness, facts can also be defined as variables or directly associated with the star_schema object. As for the dimensions, in the example below, the result in both cases is the same.
# definition 1
mrs_cause <- fact_schema(name = "MRS Cause",
measures = c("Pneumonia and Influenza Deaths",
"All Deaths"))
s_cause <- star_schema() |>
define_facts(mrs_cause)
# definition 2
s_cause <- star_schema() |>
define_facts(name = "MRS Cause",
measures = c("Pneumonia and Influenza Deaths",
"All Deaths"))
Each measure has an aggregation function associated with it, by
default it is SUM. Alternatively, the MAX or
MIN functions can be used. AVG cannot be used because
the average of averages is not the total average. Instead, a field
called nrow_agg
(an alternative name can be specified) is
always added where the number of records added in the result row is
stored. With that value and the result of SUM, the value of the
AVG function can be obtained.
The aggregation functions are needed to get the data at the granularity defined in the star schema, which can be equal to or coarser than the granularity defined in the table.
s_cause <- star_schema() |>
define_facts(name = "MRS Cause",
measures = c("Pneumonia and Influenza Deaths",
"All Deaths"),
agg_functions = c("MAX", "SUM"),
nrow_agg = "Num"
)
In the above example, aggregation functions have been explicitly defined for the measures and the default name of the measure corresponding to the number of records added has been changed.
Star databases
From a star_schema
object and a table
(tibble
) that contains all the fields that appear in it,
using the star_database()
function, we obtain a
star_database
object. As an example we will consider data
on causes.
The definition of the star schema for the data considered for the cause case is shown below.
s_cause <- star_schema() |>
define_facts(name = "MRS Cause",
measures = c("Pneumonia and Influenza Deaths",
"All Deaths"),
nrow_agg = "Num") |>
define_dimension(name = "When",
attributes = c("Year",
"WEEK",
"Week Ending Date")) |>
define_dimension(name = "Where",
attributes = c("REGION",
"State",
"City"))
Next we get the star database.
db_cause <- star_database(s_cause, ft_num)
For now the objective of this database is to export its tables to be
used in OLAP query tools. Two functions are offered for this purpose,
firstly snake_case()
to transform the column names to snake
case and secondly as_tibble_list()
to generate a
tibble
list with the database tables, as shown in the
following example.
l_cause <- db_cause |>
snake_case() |>
as_tibble_list()
The tables of dimensions and facts of the obtained star database are shown below.
when_key | year | week | week_ending_date |
---|---|---|---|
1 | 1962 | 2 | 01/13/1962 |
2 | 1962 | 3 | 01/20/1962 |
3 | 1962 | 4 | 01/27/1962 |
4 | 1962 | 5 | 02/03/1962 |
5 | 1962 | 7 | 02/17/1962 |
6 | 1962 | 8 | 02/24/1962 |
7 | 1962 | 9 | 03/03/1962 |
8 | 1963 | 3 | 01/19/1963 |
9 | 1963 | 4 | 01/26/1963 |
10 | 1963 | 7 | 02/16/1963 |
11 | 1963 | 8 | 02/23/1963 |
12 | 1964 | 2 | 01/11/1964 |
13 | 1964 | 3 | 01/18/1964 |
14 | 1964 | 5 | 02/01/1964 |
15 | 1964 | 6 | 02/08/1964 |
16 | 1964 | 9 | 02/29/1964 |
where_key | region | state | city |
---|---|---|---|
1 | 1 | CT | Bridgeport |
2 | 1 | CT | Hartford |
3 | 1 | MA | Boston |
4 | 1 | MA | Cambridge |
when_key | where_key | pneumonia_and_influenza_deaths | all_deaths | num |
---|---|---|---|---|
1 | 3 | 11 | 270 | 1 |
2 | 1 | 2 | 40 | 1 |
3 | 2 | 1 | 47 | 1 |
3 | 3 | 12 | 285 | 1 |
4 | 1 | 5 | 46 | 1 |
5 | 2 | 4 | 57 | 1 |
6 | 1 | 2 | 45 | 1 |
7 | 4 | 4 | 39 | 1 |
8 | 2 | 2 | 66 | 1 |
9 | 1 | 2 | 46 | 1 |
9 | 3 | 10 | 276 | 1 |
10 | 2 | 4 | 77 | 1 |
11 | 2 | 6 | 49 | 1 |
12 | 2 | 3 | 53 | 1 |
12 | 4 | 7 | 31 | 1 |
13 | 3 | 13 | 325 | 1 |
14 | 1 | 8 | 45 | 1 |
14 | 4 | 6 | 27 | 1 |
15 | 3 | 9 | 244 | 1 |
16 | 4 | 0 | 26 | 1 |
It can be seen that the data in the fact table has the same granularity as the data in the original table (they have the same number of records).
The same can be seen in the num
field in the fact table
which shows the number of records added to get each record, in this case
it is always 1. Dimensions contain only the distinct value combinations
for the attribute sets they contain.
Rename fact measures and dimension attributes
In this case it has not been considered necessary, but if desired, it is possible to rename both the dimension attributes and the fact measures.
To do this, the functions get_attribute_names()
and
get_measure_names()
are available to consult the current
names, and set_attribute_names()
and
set_measure_names()
to change them if necessary.
Constellations
From two or more star databases we can define a constellation.
Definition of star databases
To more effectively demonstrate the functionality, we are going to define the star databases with a coarser granularity than the original tables; additionally we are going to select only some of the rows from the tables.
For the cause case is shown below.
when <- dimension_schema(name = "When",
attributes = c("Year"))
where <- dimension_schema(name = "Where",
attributes = c("State",
"City"))
s_cause <- star_schema() |>
define_facts(name = "MRS Cause",
measures = c("Pneumonia and Influenza Deaths",
"All Deaths")) |>
define_dimension(when) |>
define_dimension(where)
ft_num2 <- ft_num |>
dplyr::filter(Year > "1962") |>
dplyr::filter(City == "Boston" | City == "Bridgeport")
db_cause <- star_database(s_cause, ft_num2) |>
snake_case()
l_cause <- db_cause |>
as_tibble_list()
The tables obtained as a result are shown below.
when_key | year |
---|---|
1 | 1963 |
2 | 1964 |
where_key | state | city |
---|---|---|
1 | CT | Bridgeport |
2 | MA | Boston |
when_key | where_key | pneumonia_and_influenza_deaths | all_deaths | nrow_agg |
---|---|---|---|---|
1 | 1 | 2 | 46 | 1 |
1 | 2 | 10 | 276 | 1 |
2 | 1 | 8 | 45 | 1 |
2 | 2 | 22 | 569 | 2 |
Next, we define the star database and the rest of the operations in the case of age groups.
s_age <- star_schema() |>
define_facts(name = "MRS Age",
measures = c("All Deaths")) |>
define_dimension(when) |>
define_dimension(where) |>
define_dimension(name = "Who",
attributes = c("Age"))
ft_age2 <- ft_age |>
dplyr::filter(Year < "1964") |>
dplyr::filter(City != "Boston" & City != "Bridgeport")
db_age <- star_database(s_age, ft_age2) |>
snake_case()
l_age <- db_age |>
as_tibble_list()
The tables obtained as a result are also shown below.
when_key | year |
---|---|
1 | 1962 |
2 | 1963 |
where_key | state | city |
---|---|---|
1 | CT | Hartford |
2 | MA | Cambridge |
who_key | age |
---|---|
1 | 1-24 years |
2 | 25-44 years |
3 | 45-64 years |
4 | 65+ years |
5 | <1 year |
when_key | where_key | who_key | all_deaths | nrow_agg |
---|---|---|---|---|
1 | 1 | 1 | 2 | 2 |
1 | 1 | 2 | 3 | 2 |
1 | 1 | 3 | 35 | 2 |
1 | 1 | 4 | 54 | 2 |
1 | 1 | 5 | 10 | 2 |
1 | 2 | 1 | 0 | 1 |
1 | 2 | 2 | 2 | 1 |
1 | 2 | 3 | 7 | 1 |
1 | 2 | 4 | 29 | 1 |
1 | 2 | 5 | 1 | 1 |
2 | 1 | 1 | 5 | 3 |
2 | 1 | 2 | 13 | 3 |
2 | 1 | 3 | 51 | 3 |
2 | 1 | 4 | 107 | 3 |
2 | 1 | 5 | 16 | 3 |
As we have filtered the data, it can be seen that the dimension tables only contain the necessary data for each star database but, for common dimensions, they share the same structure.
Constellation definition
Next we define a constellation formed by the two star databases.
ct <- constellation("MRS", db_cause, db_age)
Constellation tables can also be exported as a tibble list.
lc <- ct |>
as_tibble_list()
Below are the tables of the constellation’s star databases. The instances of the dimensions have been integrated so that the tables are common to both databases.
when_key | year |
---|---|
1 | 1962 |
2 | 1963 |
3 | 1964 |
where_key | state | city |
---|---|---|
1 | CT | Bridgeport |
2 | CT | Hartford |
3 | MA | Boston |
4 | MA | Cambridge |
who_key | age |
---|---|
1 | 1-24 years |
2 | 25-44 years |
3 | 45-64 years |
4 | 65+ years |
5 | <1 year |
when_key | where_key | pneumonia_and_influenza_deaths | all_deaths | nrow_agg |
---|---|---|---|---|
2 | 1 | 2 | 46 | 1 |
2 | 3 | 10 | 276 | 1 |
3 | 1 | 8 | 45 | 1 |
3 | 3 | 22 | 569 | 2 |
when_key | where_key | who_key | all_deaths | nrow_agg |
---|---|---|---|---|
1 | 2 | 1 | 2 | 2 |
1 | 2 | 2 | 3 | 2 |
1 | 2 | 3 | 35 | 2 |
1 | 2 | 4 | 54 | 2 |
1 | 2 | 5 | 10 | 2 |
1 | 4 | 1 | 0 | 1 |
1 | 4 | 2 | 2 | 1 |
1 | 4 | 3 | 7 | 1 |
1 | 4 | 4 | 29 | 1 |
1 | 4 | 5 | 1 | 1 |
2 | 2 | 1 | 5 | 3 |
2 | 2 | 2 | 13 | 3 |
2 | 2 | 3 | 51 | 3 |
2 | 2 | 4 | 107 | 3 |
2 | 2 | 5 | 16 | 3 |
These tables can be directly exported in the format required by the OLAP query tool that we are going to use.
Constellations can be defined from star databases and also from other constellations. For example, let’s define a new star database based on the previous examples.
s_age2 <- star_schema() |>
define_facts(name = "MRS Age 2",
measures = c("All Deaths")) |>
define_dimension(when) |>
define_dimension(where) |>
define_dimension(name = "Who",
attributes = c("Age"))
db_age2 <- star_database(s_age2, ft_age) |>
snake_case()
Next we define a constellation, it is made up of the previously defined constellation and the new star database.
ct2 <- constellation("MRS2", ct, db_age2)
Export results
To continue working with the result of the implementation of the
multidimensional database (star database or constellation) it can be
exported as a tibble
list, suitable for example to pass the
data to a spreadsheet, or as an object of the class dm
,
which allows passing the data to a RDBMS.
In some cases, we may be interested in having the data in the form of
a table again, especially if we want to carry out some statistical
analysis with the integrated data. We can export them as a single
tibble
list.
Functions are defined, applicable to objects of both star databases and constellations.
tibble
list
The following shows the results that are obtained for a star database.
db_cause |>
as_tibble_list()
#> $when
#> # A tibble: 2 × 2
#> when_key year
#> <int> <chr>
#> 1 1 1963
#> 2 2 1964
#>
#> $where
#> # A tibble: 2 × 3
#> where_key state city
#> <int> <chr> <chr>
#> 1 1 CT Bridgeport
#> 2 2 MA Boston
#>
#> $mrs_cause
#> # A tibble: 4 × 5
#> when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
#> <int> <int> <int> <int> <int>
#> 1 1 1 2 46 1
#> 2 1 2 10 276 1
#> 3 2 1 8 45 1
#> 4 2 2 22 569 2
Also for a constellation.
ct |>
as_tibble_list()
#> $when
#> # A tibble: 3 × 2
#> when_key year
#> <int> <chr>
#> 1 1 1962
#> 2 2 1963
#> 3 3 1964
#>
#> $where
#> # A tibble: 4 × 3
#> where_key state city
#> <int> <chr> <chr>
#> 1 1 CT Bridgeport
#> 2 2 CT Hartford
#> 3 3 MA Boston
#> 4 4 MA Cambridge
#>
#> $who
#> # A tibble: 5 × 2
#> who_key age
#> <int> <chr>
#> 1 1 1-24 years
#> 2 2 25-44 years
#> 3 3 45-64 years
#> 4 4 65+ years
#> 5 5 <1 year
#>
#> $mrs_cause
#> # A tibble: 4 × 5
#> when_key where_key pneumonia_and_influenza_deaths all_deaths nrow_agg
#> <int> <int> <int> <int> <int>
#> 1 2 1 2 46 1
#> 2 2 3 10 276 1
#> 3 3 1 8 45 1
#> 4 3 3 22 569 2
#>
#> $mrs_age
#> # A tibble: 15 × 5
#> when_key where_key who_key all_deaths nrow_agg
#> <int> <int> <int> <int> <int>
#> 1 1 2 1 2 2
#> 2 1 2 2 3 2
#> 3 1 2 3 35 2
#> 4 1 2 4 54 2
#> 5 1 2 5 10 2
#> 6 1 4 1 0 1
#> 7 1 4 2 2 1
#> 8 1 4 3 7 1
#> 9 1 4 4 29 1
#> 10 1 4 5 1 1
#> 11 2 2 1 5 3
#> 12 2 2 2 13 3
#> 13 2 2 3 51 3
#> 14 2 2 4 107 3
#> 15 2 2 5 16 3
dm
object
The dm
package
allows you to link tables implemented using the tibble
class in R with tables in RDBMS.
The following example shows the transformation of a star database to
an object of class dm
.
# star database
db_cause_dm <- db_cause |>
as_dm_class()
class(db_cause_dm)
#> [1] "dm"
db_cause_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `mrs_cause`
#> Columns: 10
#> Primary keys: 3
#> Foreign keys: 2
Also for a constellation.
# constellation
ct_dm <- ct |>
as_dm_class(pk_facts = TRUE)
class(ct_dm)
#> [1] "dm"
ct_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `who`, `mrs_cause`, `mrs_age`
#> Columns: 17
#> Primary keys: 5
#> Foreign keys: 5
For example, the dm
class object can be used to
represent the constellation tables, as it’s shown in the following (it
can also be used to store them in a DBMS).
ct_dm |>
dm::dm_draw(rankdir = "LR", view_type = "all")
In the example below, a local SQLite database is used to
demonstrate how to store the model in a database. dm
works
with any RDBMS supported by DBI
.
db <- DBI::dbConnect(RSQLite::SQLite())
ct_dm_db <- dm::copy_dm_to(db, ct_dm)
ct_dm_db
#> ── Table source ────────────────────────────────────────────────────────────────
#> src: sqlite 3.44.2 []
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `when`, `where`, `who`, `mrs_cause`, `mrs_age`
#> Columns: 17
#> Primary keys: 5
#> Foreign keys: 5
DBI::dbDisconnect(db)
Single tibble
list
The following shows the results that are obtained for a star database
seen as a single tibble
list. Each single
tibble
is the result of performing an inner join
between the fact table and the dimension tables.
db_cause |>
as_single_tibble_list()
#> $mrs_cause
#> # A tibble: 4 × 6
#> year state city pneumonia_and_influenza_deaths all_deaths nrow_agg
#> <chr> <chr> <chr> <int> <int> <int>
#> 1 1963 CT Bridgeport 2 46 1
#> 2 1963 MA Boston 10 276 1
#> 3 1964 CT Bridgeport 8 45 1
#> 4 1964 MA Boston 22 569 2
Also for a constellation.
ct |>
as_single_tibble_list()
#> $mrs_cause
#> # A tibble: 4 × 6
#> year state city pneumonia_and_influenza_deaths all_deaths nrow_agg
#> <chr> <chr> <chr> <int> <int> <int>
#> 1 1963 CT Bridgeport 2 46 1
#> 2 1963 MA Boston 10 276 1
#> 3 1964 CT Bridgeport 8 45 1
#> 4 1964 MA Boston 22 569 2
#>
#> $mrs_age
#> # A tibble: 15 × 6
#> year state city age all_deaths nrow_agg
#> <chr> <chr> <chr> <chr> <int> <int>
#> 1 1962 CT Hartford 1-24 years 2 2
#> 2 1962 CT Hartford 25-44 years 3 2
#> 3 1962 CT Hartford 45-64 years 35 2
#> 4 1962 CT Hartford 65+ years 54 2
#> 5 1962 CT Hartford <1 year 10 2
#> 6 1962 MA Cambridge 1-24 years 0 1
#> 7 1962 MA Cambridge 25-44 years 2 1
#> 8 1962 MA Cambridge 45-64 years 7 1
#> 9 1962 MA Cambridge 65+ years 29 1
#> 10 1962 MA Cambridge <1 year 1 1
#> 11 1963 CT Hartford 1-24 years 5 3
#> 12 1963 CT Hartford 25-44 years 13 3
#> 13 1963 CT Hartford 45-64 years 51 3
#> 14 1963 CT Hartford 65+ years 107 3
#> 15 1963 CT Hartford <1 year 16 3
Conclusions
rolap
package offers a set of operations that allow us
to transform tables into star databases. From several star databases you
can define a constellation with conformed dimensions.
Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation.
The implementation of the multidimensional database obtained can be exported to work with multidimensional analysis tools on spreadsheets or RDBMS.