starschemar Transformation Operations
Jose Samos (jsamos@ugr.es)
2020-09-24
Source:vignettes/v10_transformation_op.Rmd
v10_transformation_op.Rmd
Introduction
Package starschemar
offers operations to transform flat
tables into star schemas and also to export or exploit them through
queries.
From a flat table, we define a dimensional model classifying its attributes as facts or dimensions (dimensional modelling).
From a flat table and a dimensional model we obtain a star schema that we can transform; from various star schemas we can define a constellation (star schema and constellation definition).
Dimensions contain rows without duplicates, we can apply operations to perform data cleaning and to conform them (cleaning and conforming data).
When new data is obtained, it is necessary to refresh the existing data with them by means of incremental refresh operations (incremental refresh).
The results obtained can be exported to be consulted with other tools (exporting results).
Finally, basic queries can be performed from R (query functions), especially to select the data to export.
In this document, these operations available in the package are briefly presented.
Dimensional modelling
Starting from a flat table, a dimensional model is defined specifying
the attributes that make up each of the dimensions and the measurements
in the facts. The result is a dimensional_model
object. It
is carried out through the following functions:
-
dimensional_model()
: An emptydimensional_model
object is created in which definition of facts and dimensions can be added. Example:
dm <- dimensional_model()
-
define_dimension()
: To define a dimension in adimensional_model
object, we have to define its name and the set of attributes that make it up. Example:
dm <- dimensional_model() |>
define_dimension(name = "When",
attributes = c("Week Ending Date",
"WEEK",
"Year"))
-
define_fact()
: To define facts in adimensional_model
object, the essential data is a name and a set of measurements that can be empty (does not have explicit measurements). Associated with each measurement, an aggregation function is required, which by default is SUM. Examples:
dm <- dimensional_model() |>
define_fact(
name = "mrs_age",
measures = c("Deaths"),
agg_functions = c("SUM"),
nrow_agg = "nrow_agg"
)
dm <- dimensional_model() |>
define_fact(name = "Factless fact")
Star schema and constellation definition
A dimensional model is implemented using a star schema. We can have several related star schemas through common dimensions that together form a fact constellation.
Star schema definition
A star schema is defined from a flat table and a dimensional model definition. Once defined, a star schema can be transformed by defining role playing dimensions, changing the writing style of element names or the type of dimension attributes. These operations are carried out through the following functions:
-
star_schema()
: Creates astar_schema
object from a flat table (implemented by atibble
) and adimensional_model
object. Example:
st <- star_schema(mrs_age, dm_mrs_age)
-
role_playing_dimension()
: Given a list ofstar_schema
dimension names, all with the same structure, a role playing dimension with the indicated name and attributes is generated. The original dimensions become role dimensions defined from the new role playing dimension. Example:
st <- star_schema(mrs_age, dm_mrs_age) |>
role_playing_dimension(
dim_names = c("when", "when_available"),
name = "When Common",
attributes = c("Date", "Week", "Year")
)
-
snake_case()
: Transform fact, dimension, measurement, and attribute names according to the snake case style. Example:
st <- star_schema(mrs_age, dm_mrs_age) |>
snake_case()
-
character_dimensions()
: Transforms numeric type attributes of dimensions into character type. In astar_schema
numerical data are measurements that are situated in the facts. Numerical data in dimensions are usually codes, day, week, month or year numbers. There are tools that consider any numerical data to be a measurement, for this reason it is appropriate to transform the numerical data of dimensions into character data. It also allows indicating the literal to be used in case the numerical value is not defined. Example:
st <- star_schema(mrs_age, dm_mrs_age) |>
character_dimensions()
Star schema rename
Once a star schema is defined, we can rename its elements. It is necessary to be able to rename attributes of dimensions and measures of facts because the definition operations only allowed us to select columns of a flat table. For completeness also dimensions and facts can be renamed. To carry out these operations, the following functions are available:
-
rename_dimension()
: Set new name for a dimension. Example:
st <- st_mrs_age |>
rename_dimension(name = "when", new_name = "when_happened")
-
get_dimension_attribute_names()
: Get the name of attributes in a dimension, so that it is easier to modify them if necessary. Example:
attribute_names <-
st_mrs_age |> get_dimension_attribute_names("when")
-
rename_dimension_attributes()
: Set new names of some attributes in a dimension. Example:
st <-
st_mrs_age |> rename_dimension_attributes(
name = "when",
attributes = c("when_happened_week", "when_happened_year"),
new_names = c("week", "year")
)
-
rename_fact()
: Set new name for facts. Example:
st <- st_mrs_age |> rename_fact("age")
-
get_measure_names()
: Get the name of the measures in fact, so that it is easier to modify them if necessary. Example:
measure_names <-
st_mrs_age |> get_measure_names()
-
rename_measures()
: Set new names of some measures in facts. Example:
st <-
st_mrs_age |> rename_measures(measures = c("n_deaths"),
new_names = c("num_deaths"))
Constellation definition
Based on various star schemas, a constellation can be defined in which star schemas share common dimensions. Dimensions with the same name must be shared. It is defined by the following function:
-
constellation()
: Creates aconstellation
object from a list ofstar_schema
objects. All dimensions with the same name in the star schemas have to be conformable. Example:
ct <- constellation(list(st_mrs_age, st_mrs_cause), name = "mrs")
Cleaning and conforming data
Once star schemas and fact constellations are defined, data cleaning operations can be carried out on dimensions. There are three groups of functions:
One to obtain components of star schemas and constellations.
Another to define data cleaning operations over dimensions.
One more to apply operations to star schemas or constellations.
Obtaining components
We can obtain dimensions from a star schema or conformed dimensions from a fact constellation. Available functions in both cases are similar.
Star schema
-
get_dimension_names()
: Get the names of the dimensions of a star schema. Role playing dimensions are not considered. Example:
dn <- st_mrs_age |>
get_dimension_names()
-
get_dimension()
: Get a dimension of a star schema given its name. Role dimensions can be obtained but not role playing dimensions. Example:
where <- st_mrs_age |>
get_dimension("where")
Constellation
-
get_conformed_dimension_names()
: Get the names of the conformed dimensions of a constellation. Example:
dn <- ct_mrs |>
get_conformed_dimension_names()
-
get_conformed_dimension()
: Get a conformed dimension of a constellation given its name. Example:
when <- ct_mrs |>
get_conformed_dimension("when")
-
get_star_schema_names()
: Get the names of the star schemas of a constellation. Example:
stn <- ct_mrs |>
get_star_schema_names()
-
get_star_schema()
: Get a star schema of a constellation given its name. Example:
age <- ct_mrs |>
get_star_schema("mrs_age")
Definition of updates
Modifications are defined on dimension rows in various ways based exclusively on the values of the dimension fields. Although the surrogate key intervenes in the definition, the result, internally, does not depend on it so that it can be applied more generally in other star schemas.
-
record_update_set()
: Arecord_update_set
object is created. Stores updates on dimension records. Each update is made up of a dimension name, an old value set, and a new value set. Example:
updates <- record_update_set()
-
match_records()
: For a dimension, given the primary key of two records, it adds an update to the set of updates that modifies the combination of values of the rest of attributes of the first record so that they become the same as those of the second. Example:
updates <- record_update_set() |>
match_records(dimension = where,
old = 1,
new = 2)
-
update_record()
: For a dimension, given the primary key of one record, it adds an update to the set of updates that modifies the combination of values of the rest of attributes of the selected record so that they become those given. Example:
updates <- record_update_set() |>
update_record(
dimension = who,
old = 1,
values = c("1: <1 year")
)
-
update_selection()
: For a dimension, given a vector of column names, a vector of old values and a vector of new values, it adds an update to the set of updates that modifies all the records that have the combination of old values in the columns with the new values in those same columns. Example:
updates <- record_update_set() |>
update_selection(
dimension = where,
columns = c("city"),
old_values = c("Bridgepor"),
new_values = c("Bridgeport")
)
-
update_selection_general()
: For a dimension, given a vector of column names, a vector of old values for those columns, another vector column names, and a vector of new values for those columns, it adds an update to the set of updates that modifies all the records that have the combination of old values in the first column vector with the new values in the second column vector. Example:
updates <- record_update_set() |>
update_selection_general(
dimension = where,
columns_old = c("state", "city"),
old_values = c("CT", "Bridgepor"),
columns_new = c("city"),
new_values = c("Bridgeport")
)
Updates application
Defined updates can be applied on a star schema or on the conformed dimension of a fact constellation.
Star schema
-
modify_dimension_records()
: Given a list of dimension record update operations, they are applied on the dimensions of thestar_schema
object. Update operations must be defined with the set of functions available for that purpose. Example:
st <- st_mrs_age |>
modify_dimension_records(updates_st_mrs_age)
Constellation
-
modify_conformed_dimension_records()
: Given a list of dimension record update operations, they are applied on the conformed dimensions of theconstellation
object. Update operations must be defined with the set of functions available for that purpose. Example:
ct <- ct_mrs |>
modify_conformed_dimension_records(updates_st_mrs_age)
Dimension enrichment
To enrich a dimension with new attributes related to others already included in it, first, we export the attributes on which the new ones depend, then we define the new attributes, and import the table with all the attributes to be added to the dimension.
-
enrich_dimension_export()
: Export the selected attributes of a dimension, without repeated combinations, to enrich the dimension. If it is a role dimension they cannot be exported, you have to work with the associated role playing dimension. Example:
tb <-
enrich_dimension_export(st_mrs_age,
name = "when_common",
attributes = c("week", "year"))
-
enrich_dimension_import()
: For a dimension of a star schema atibble
is attached. This contains dimension attributes and new attributes. If values associated with all rows in the dimension are included in thetibble
, the dimension is enriched with the new attributes. Role dimensions cannot be directly enriched. If a role playing dimension is enriched, the new attributes are also added to the associated role dimensions. Example:
tb <-
enrich_dimension_export(st_mrs_age,
name = "when_common",
attributes = c("week", "year"))
# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")
st <- enrich_dimension_import(st_mrs_age, name = "when_common", tb)
-
enrich_dimension_import_test()
: Previous toenrich_dimension_import
, it checks if thetibble
has values for all dimension instances. Returns the dimension instances that do not match the imported data. Example:
tb <-
enrich_dimension_export(st_mrs_age,
name = "when_common",
attributes = c("week", "year"))
# Add new columns with meaningful data (these are not), possibly exporting
# data to a file, populating it and importing it.
tb <- tibble::add_column(tb, x = "x", y = "y", z = "z")[-1, ]
tb2 <- enrich_dimension_import_test(st_mrs_age, name = "when_common", tb)
Incremental refresh
When new data is obtained, an incremental refresh of the data can be carried out, both of the dimensions and of the facts. Incremental refresh can be applied to both star schema and fact constellation, using the following functions.
Star schema
-
incremental_refresh_star_schema()
: Incrementally refresh a star schema with the content of a new one that is integrated into the first. Once the dimensions are integrated, if there are records in the fact table whose keys match the new ones, new ones can be ignored, they can be replaced by new ones, all of them can be grouped using the aggregation functions, or they can be deleted. Therefore, the possible values of theexisting
parameter are: “ignore”, “replace”, “group” or “delete”. Example:
st <- st_mrs_age |>
incremental_refresh_star_schema(st_mrs_age_w10, existing = "replace")
Sometimes the data refresh consists of eliminating data that is no longer necessary, generally because it corresponds to a period that has stopped being analysed but it can also be for other reasons. This data can be selected using the following function:
-
filter_fact_rows()
: Filter fact rows based on dimension conditions in a star schema. Dimensions remain unchanged. Filtered rows can be deleted using theincremental_refresh_star_schema
function. Example:
st <- st_mrs_age |>
filter_fact_rows(name = "when", when_happened_week <= "03") |>
filter_fact_rows(name = "where", city == "Bridgeport")
st2 <- st_mrs_age |>
incremental_refresh_star_schema(st, existing = "delete")
Once the fact data is removed (using the other incremental refresh functions), we can remove the data for the dimensions that are no longer needed using the following function:
-
purge_dimensions_star_schema()
: Delete instances of dimensions not related to facts in a star schema. Example:
st3 <- st2 |>
purge_dimensions_star_schema()
Constellation
-
incremental_refresh_constellation()
: Incrementally refresh a star schema in a constellation with the content of a new star schema that is integrated into the first. Example:
ct <- ct_mrs |>
incremental_refresh_constellation(st_mrs_age_w10, existing = "replace")
-
purge_dimensions_constellation()
: Delete instances of dimensions not related to facts in a star schema. It performs the operation for each of the componentstar_schemas
and also for the conformed dimensions. Example:
ct <- ct_mrs |>
purge_dimensions_constellation()
Exporting results
Once the data has been properly structured and transformed, it can be exported to be consulted with other tools or with R. Various export formats have been defined, both for star schemas and for constellations, using the following functions.
Star schema
-
star_schema_as_flat_table()
: We can again obtain a flat table, implemented using atibble
, from a star schema. Example:
ft <- st_mrs_age |>
star_schema_as_flat_table()
-
star_schema_as_multistar()
: We can obtain amultistar
. Amultistar
only distinguishes between general and conformed dimensions, each dimension has its own data. It can contain multiple fact tables. Example:
ms <- st_mrs_age |>
star_schema_as_multistar()
-
star_schema_as_tibble_list()
: We can obtain atibble
list with them. Role playing dimensions can be optionally included. Example:
tl <- st_mrs_age |>
star_schema_as_tibble_list(include_role_playing = TRUE)
Constellation
-
constellation_as_multistar()
: We can obtain amultistar
. Amultistar
only distinguishes between general and conformed dimensions, each dimension has its own data. It can contain multiple fact tables. Example:
ms <- ct_mrs |>
constellation_as_multistar()
-
constellation_as_tibble_list()
: We can obtain atibble
list with them. Role playing dimensions can be optionally included. Example:
tl <- ct_mrs |>
constellation_as_tibble_list(include_role_playing = TRUE)
multistar
-
multistar_as_flat_table()
: We can obtain a flat table, implemented using atibble
, from amultistar
(which can be the result of a query). If it only has one fact table, it is not necessary to provide its name. Example:
ft <- ms_mrs |>
multistar_as_flat_table(fact = "mrs_age")
Query functions
There are many multidimensional query tools available. The exported
data, once stored in files, can be used directly from them. You can also
perform basic queries from R on data in the multistar
format, mainly for selecting the data to export, using the following
functions:
-
dimensional_query()
: An emptydimensional_query
object is created where you can select fact measures, dimension attributes and filter dimension rows. Example:
ms_mrs <- ct_mrs |>
constellation_as_multistar()
dq <- dimensional_query(ms_mrs)
-
select_fact()
: To define the fact to be consulted, its name is indicated, optionally, a vector of names of selected measures and another of aggregation functions are also indicated. If the name of any measure is not indicated, only the one corresponding to the number of aggregated rows is included, which is always included. If no aggregation function is included, those defined for the measures are considered. Examples:
dq <- dimensional_query(ms_mrs) |>
select_fact(
name = "mrs_age",
measures = c("n_deaths"),
agg_functions = c("MAX")
)
dq <- dimensional_query(ms_mrs) |>
select_fact(name = "mrs_age",
measures = c("n_deaths"))
dq <- dimensional_query(ms_mrs) |>
select_fact(name = "mrs_age")
-
select_dimension()
: To add a dimension in adimensional_query
object, we have to define its name and a subset of the dimension attributes. If only the name of the dimension is indicated, it is considered that all its attributes should be added. Example:
dq <- dimensional_query(ms_mrs) |>
select_dimension(name = "where",
attributes = c("city", "state")) |>
select_dimension(name = "when")
-
filter_dimension()
: Allows you to define selection conditions for dimension rows. Conditions can be defined on any attribute of the dimension (not only on attributes selected in the query for the dimension). The selection is made based on the functiondplyr::filter
. Conditions are defined in exactly the same way as in that function. Example:
dq <- dimensional_query(ms_mrs) |>
filter_dimension(name = "when", when_happened_week <= "03") |>
filter_dimension(name = "where", city == "Boston")
-
run_query()
: Once we have selected the facts, dimensions and defined the conditions on the instances, we can execute the query to obtain the result. As an option, we can indicate if we do not want to unify the facts in the case of having the same grain. Example:
ms <- dimensional_query(ms_mrs) |>
select_dimension(name = "where",
attributes = c("city", "state")) |>
select_dimension(name = "when",
attributes = c("when_happened_year")) |>
select_fact(
name = "mrs_age",
measures = c("n_deaths")
) |>
select_fact(
name = "mrs_cause",
measures = c("pneumonia_and_influenza_deaths", "other_deaths")
) |>
filter_dimension(name = "when", when_happened_week <= "03") |>
filter_dimension(name = "where", city == "Boston") |>
run_query()
Conclusions
starschemar
package offers a set of operations that
allow us to transform flat tables into star schemas. Star schemas
support the definition of role playing and role dimensions. Additional
transformation operations can be applied to each star schema to adapt
the format of the data. From several star schemas you can define fact
constellation with conformed dimensions.
Operations have been designed to be intuitive and easy to use. The result greatly facilitates the data transformation process for the exposed situation. In this document the available operations have been briefly presented.