Transforming a pivot table into a flat table with the flattabler package
Jose Samos (jsamos@ugr.es)
2023-08-08
Source:vignettes/flattabler.Rmd
flattabler.Rmd
Introduction
Pivot tables are generally used to present raw and summary data. They
are generated from spreadsheets and, more recently, also from R (pivottabler
).
If we generate pivot tables from our own data,
flattabler
package is not necessary. But, if we get data in
pivot table format and need to represent or analyse it using another
tool, this package can be very helpful: It can save us several hours of
programming or manual transformation.
flattabler
package offers a set of operations that allow
us to transform one or more pivot tables into a flat table.
The rest of this document is structured as follows: First, an
illustrative example of transforming a pivot table into a flat table is
presented. Then, the operations available in flattabler
package, classified according to their purpose, are presented. Finally,
the document ends with the conclusions section.
An illustrative example
In this example, given a pivot table and the flat table obtained from
it, the transformations performed are presented. Next, a function is
defined that groups these transformations. This function is applied to a
list of pivot tables to obtain a single flat table. Finally, it is shown
how the flat table can be modified using functions from tidyverse
package components.
Pivot table
A pivot table allows to represent information in a structured way, mainly to be analysed by a person or to make a graphical representation of it. In addition to a header and/or footer, it contains label rows and columns, and a matrix of values, usually numeric data.
c1 | c2 | c3 | c4 | c5 | c6 | c7 | |
---|---|---|---|---|---|---|---|
r1 | M4 | E | D | ||||
r2 | e2 | Total e2 | Total general | ||||
r3 | B | A | d3 | d4 | d5 | ||
r4 | b1 | a05 | 70,40 | 1.089,00 | 1.159,40 | 1.159,40 | |
r5 | a09 | 674,31 | 674,31 | 674,31 | |||
r6 | a13 | 421,08 | 1.055,12 | 64,68 | 1.540,88 | 1.540,88 | |
r7 | a17 | 96,00 | 1.347,84 | 545,28 | 1.989,12 | 1.989,12 | |
r8 | Total b1 | 1.261,79 | 3.491,96 | 609,96 | 5.363,71 | 5.363,71 | |
r9 | b2 | a02 | 924,80 | 1.867,02 | 73,50 | 2.865,32 | 2.865,32 |
r10 | a06 | 1.058,40 | 494,19 | 139,65 | 1.692,24 | 1.692,24 | |
r11 | a10 | 791,04 | 121,03 | 912,07 | 912,07 | ||
r12 | a14 | 4.698,00 | 40,96 | 4.738,96 | 4.738,96 | ||
r13 | a18 | 150,00 | 443,52 | 593,52 | 593,52 | ||
r14 | Total b2 | 7.622,24 | 2.966,72 | 213,15 | 10.802,11 | 10.802,11 | |
r15 | b3 | a03 | 658,56 | 203,52 | 148,48 | 1.010,56 | 1.010,56 |
r16 | a07 | 92,00 | 1.466,08 | 1.558,08 | 1.558,08 | ||
r17 | a15 | 2.043,00 | 184,96 | 544,18 | 2.772,14 | 2.772,14 | |
r18 | a19 | 393,96 | 1.056,25 | 1.450,21 | 1.450,21 | ||
r19 | Total b3 | 3.187,52 | 2.910,81 | 692,66 | 6.790,99 | 6.790,99 | |
r20 | b4 | a04 | 263,13 | 204,80 | 489,00 | 956,93 | 956,93 |
r21 | a08 | 69,66 | 1.261,17 | 101,50 | 1.432,33 | 1.432,33 | |
r22 | a12 | 346,00 | 1.008,61 | 124,74 | 1.479,35 | 1.479,35 | |
r23 | a16 | 1.399,68 | 142,08 | 43,36 | 1.585,12 | 1.585,12 | |
r24 | a20 | 34,88 | 261,95 | 83,00 | 379,83 | 379,83 | |
r25 | Total b4 | 2.113,35 | 2.878,61 | 841,60 | 5.833,56 | 5.833,56 | |
r26 | Total general | 14.184,90 | 12.248,10 | 2.357,37 | 28.790,37 | 28.790,37 |
The table above, contained in the df_ex
variable of the
package, has the following parts:
The header is made up of row r1 and the intersection between rows and columns of labels (cells of c1 and c2 with r2 and r3). Let us suppose that the content of the cell (r1, c1) is especially relevant, it is part of the table header and identifies the content of this pivot table with respect to others: It is the identifier of the page.
Columns c1 and c2, also rows r2 and r3, contain labels, except those of the intersection that are part of the header. It is common to try not to repeat the values of the outer labels, it being understood that, if there is no value, the value of that position is the last one shown in the same row or column (this is the case in column c1 and in row r2). In the innermost labels, if there are no values, it is because the corresponding position in the outer row or column corresponds to an aggregate (this is the case in column c2 and in row r3).
The matrix of values is made up of rows and columns after those containing labels (rows from r4 on, and columns from c3 on). Each value of this matrix is characterized by the combination of labels of the corresponding row and column. It is common to find null values in that matrix because the data is not produced or recorded for the combination of labels that define it, that is, the data is usually scattered. In addition to base data, aggregated data can be included in the matrix. Since it was intended for a person, the thousands separator had been used and, both this and the decimal separator, had been used with the Spanish style.
Obtaining a flat table
A flat database or flat-file database is a database that only
contains a single table. A flat table is a generally
denormalized table that is not related to other tables. It is not
necessarily tidy data (in the sense of the tidyverse
package) but in each column all the data must be of the same type, so
that it can be easily stored in a RDBMS (Relational Database
Management System).
A pivot table is not a flat table, but from a pivot table we can
obtain a flat table, that is what we are going to do with the help of
the flattabler
package. Below are the transformations
performed using its functions.
library(flattabler)
ft <- pivot_table(df_ex) |>
set_page(1, 1) |>
remove_top(1) |>
define_labels(n_col = 2, n_row = 2) |>
fill_labels() |>
remove_agg() |>
fill_values() |>
remove_k() |>
replace_dec() |>
unpivot()
Starting from the pivot table in the variable df_ex
:
We get an object using the
pivot_table()
function, the constructor of thepivot_table
class.We define that the value that identifies the pivot table (its page) is in cell (r1, c1), by means of
set_page(1, 1)
.We should leave only the labels and the matrix of values, therefore rows or columns with other information have to be removed. The cells between the rows and columns of labels are ignored (cells of c1 and c2 with r2 and r3). We delete the first row using
remove_top(1)
because it does not contain labels.Then, we define the number of rows and columns containing labels using
define_labels(n_col = 2, n_row = 2)
. There are two columns and two rows of labels.Since there are more than one row or column with labels, the values of the labels of the first row and column have not been repeated. They are filled using
fill_labels()
.The pivot table contains aggregated data. It is removed by
remove_agg()
. It is recognized exclusively because there are no values in the row or column of the labels next to the array of values.The array of values has gaps that, instead of having a numeric value, have an empty string. In R it is more appropriate to have
NA
if the data is not available. This operation is performed throughfill_values()
.The example is a Spanish report that uses thousands and decimal separators in the value matrix. We need to adapt them to the R syntax for numbers. This operation is carried out using
remove_k()
to remove the thousands separator andreplace_dec()
to replace the decimal separator.Finally, it is transformed into a flat table by
unpivot()
: each row corresponds to a value with its combination of labels. By default,NA
values are not considered.
The result obtained can be seen in the following table. An additional label has been added with the value that identifies the pivot table, the pivot table page.
page | col1 | col2 | row1 | row2 | value |
---|---|---|---|---|---|
M4 | b1 | a05 | e2 | d3 | 70.40 |
M4 | b1 | a05 | e2 | d4 | 1089.00 |
M4 | b1 | a09 | e2 | d3 | 674.31 |
M4 | b1 | a13 | e2 | d3 | 421.08 |
M4 | b1 | a13 | e2 | d4 | 1055.12 |
M4 | b1 | a13 | e2 | d5 | 64.68 |
M4 | b1 | a17 | e2 | d3 | 96.00 |
M4 | b1 | a17 | e2 | d4 | 1347.84 |
M4 | b1 | a17 | e2 | d5 | 545.28 |
M4 | b2 | a02 | e2 | d3 | 924.80 |
M4 | b2 | a02 | e2 | d4 | 1867.02 |
M4 | b2 | a02 | e2 | d5 | 73.50 |
M4 | b2 | a06 | e2 | d3 | 1058.40 |
M4 | b2 | a06 | e2 | d4 | 494.19 |
M4 | b2 | a06 | e2 | d5 | 139.65 |
M4 | b2 | a10 | e2 | d3 | 791.04 |
M4 | b2 | a10 | e2 | d4 | 121.03 |
M4 | b2 | a14 | e2 | d3 | 4698.00 |
M4 | b2 | a14 | e2 | d4 | 40.96 |
M4 | b2 | a18 | e2 | d3 | 150.00 |
M4 | b2 | a18 | e2 | d4 | 443.52 |
M4 | b3 | a03 | e2 | d3 | 658.56 |
M4 | b3 | a03 | e2 | d4 | 203.52 |
M4 | b3 | a03 | e2 | d5 | 148.48 |
M4 | b3 | a07 | e2 | d3 | 92.00 |
M4 | b3 | a07 | e2 | d4 | 1466.08 |
M4 | b3 | a15 | e2 | d3 | 2043.00 |
M4 | b3 | a15 | e2 | d4 | 184.96 |
M4 | b3 | a15 | e2 | d5 | 544.18 |
M4 | b3 | a19 | e2 | d3 | 393.96 |
M4 | b3 | a19 | e2 | d4 | 1056.25 |
M4 | b4 | a04 | e2 | d3 | 263.13 |
M4 | b4 | a04 | e2 | d4 | 204.80 |
M4 | b4 | a04 | e2 | d5 | 489.00 |
M4 | b4 | a08 | e2 | d3 | 69.66 |
M4 | b4 | a08 | e2 | d4 | 1261.17 |
M4 | b4 | a08 | e2 | d5 | 101.50 |
M4 | b4 | a12 | e2 | d3 | 346.00 |
M4 | b4 | a12 | e2 | d4 | 1008.61 |
M4 | b4 | a12 | e2 | d5 | 124.74 |
M4 | b4 | a16 | e2 | d3 | 1399.68 |
M4 | b4 | a16 | e2 | d4 | 142.08 |
M4 | b4 | a16 | e2 | d5 | 43.36 |
M4 | b4 | a20 | e2 | d3 | 34.88 |
M4 | b4 | a20 | e2 | d4 | 261.95 |
M4 | b4 | a20 | e2 | d5 | 83.00 |
Since this table is not intended to be analysed directly by a person, aggregated data has been removed as well as data that was not available for tag combinations (of course this is optional). The numerical data has been transformed so that it can be easily processed in R.
The result of the transformations is a tibble
that can
be further transformed using the functions of the tidyverse
package.
Transforming a set of pivot tables
Once we have defined the necessary transformations for a pivot table, we can apply them to any other with the same structure. Candidate tables can have different number of rows or columns, depending on the number of labels, but they must have the same number of rows and columns of labels, and the same number of header or footer rows, so that the transformations are the same for each table.
To easily perform this operation, we define a function f
from the transformations, as shown below.
f <- function(pt) {
pt |>
set_page(1, 1) |>
remove_top(1) |>
define_labels(n_col = 2, n_row = 2) |>
fill_labels() |>
remove_agg() |>
fill_values() |>
remove_k() |>
replace_dec() |>
unpivot()
}
The only difference from the original transformation is that we don’t
need to build a pivot_table
object because the input
functions provided by the package build it automatically.
The package has functions that allow data to be read in either text
format or Excel format, from a single file or from a folder with
multiple files. For example, the following code reads files in CSV
format contained in a package data folder. The result is a list of
pivot_table
objects that can be directly transformed.
folder <- system.file("extdata", "csvfolder", package = "flattabler")
lpt <- read_text_folder(folder)
class(lpt[[1]])
#> [1] "pivot_table"
Given a list of pivot tables, lpt
,
flatten_table_list()
applies the transformation defined by
function f
to each of them, and merges the results into a
flat table.
ftl <- flatten_table_list(lpt, f)
In this case, the full result is not shown in this document because it takes up too much space, but a sample is shown below.
page | col1 | col2 | row1 | row2 | value |
---|---|---|---|---|---|
M1 | b1 | a05 | e2 | d4 | 2.25 |
M1 | b1 | a09 | e1 | d1 | 2.55 |
M1 | b1 | a13 | e1 | d2 | 1.02 |
M1 | b1 | a13 | e2 | d3 | 3.48 |
M1 | b1 | a17 | e2 | d5 | 2.13 |
M1 | b4 | a04 | e2 | d5 | 4.89 |
M1 | b4 | a08 | e1 | d2 | 1.01 |
M1 | b4 | a16 | e1 | d1 | 1.43 |
M1 | b4 | a16 | e2 | d3 | 2.43 |
M2 | b1 | a05 | e2 | d3 | 8 |
M2 | b1 | a05 | e2 | d4 | 22 |
M2 | b1 | a09 | e2 | d3 | 13 |
M2 | b1 | a17 | e2 | d3 | 5 |
M2 | b2 | a02 | e2 | d5 | 5 |
M2 | b2 | a06 | e1 | d1 | 24 |
M2 | b2 | a06 | e2 | d3 | 21 |
M2 | b2 | a10 | e2 | d3 | 16 |
M2 | b2 | a18 | e1 | d2 | 10 |
M2 | b2 | a18 | e2 | d4 | 12 |
M2 | b3 | a15 | e2 | d5 | 13 |
M2 | b4 | a04 | e2 | d3 | 7 |
M2 | b4 | a12 | e2 | d3 | 10 |
M2 | b4 | a16 | e2 | d4 | 8 |
M3 | b1 | a17 | e1 | d2 | 44.62 |
M3 | b2 | a02 | e1 | d1 | 25.60 |
M3 | b2 | a10 | e1 | d2 | 25.11 |
M3 | b4 | a08 | e1 | d1 | 30.42 |
M3 | b4 | a08 | e1 | d2 | 15.15 |
M4 | b1 | a13 | e2 | d3 | 421.08 |
M4 | b1 | a13 | e2 | d4 | 1055.12 |
M4 | b1 | a17 | e2 | d3 | 96.00 |
M4 | b1 | a17 | e2 | d4 | 1347.84 |
M4 | b2 | a06 | e2 | d4 | 494.19 |
M4 | b2 | a10 | e2 | d3 | 791.04 |
M4 | b2 | a14 | e2 | d3 | 4698.00 |
M4 | b2 | a18 | e2 | d3 | 150.00 |
M4 | b3 | a07 | e2 | d3 | 92.00 |
M4 | b3 | a15 | e2 | d3 | 2043.00 |
M4 | b3 | a19 | e2 | d4 | 1056.25 |
M4 | b4 | a12 | e2 | d4 | 1008.61 |
Once we have a flat table, implemented using tibble
, we
can use tidyverse
package components to transform it, as shown below. In this case all
results are displayed.
t <- ftl |>
tidyr::pivot_wider(names_from = page, values_from = value) |>
dplyr::rename(B = col1, A = col2, E = row1, D = row2) |>
dplyr::select(A, B, D, E, M1, M2, M3, M4) |>
dplyr::arrange(A, B, D, E)
A | B | D | E | M1 | M2 | M3 | M4 |
---|---|---|---|---|---|---|---|
a01 | b1 | d4 | e2 | 1.88 | 9 | NA | NA |
a02 | b2 | d1 | e1 | NA | 10 | 25.60 | NA |
a02 | b2 | d2 | e1 | NA | 10 | 45.10 | NA |
a02 | b2 | d3 | e2 | NA | 34 | NA | 924.80 |
a02 | b2 | d4 | e2 | NA | 29 | NA | 1867.02 |
a02 | b2 | d5 | e2 | NA | 5 | NA | 73.50 |
a03 | b3 | d1 | e1 | NA | 4 | 12.96 | NA |
a03 | b3 | d2 | e1 | NA | 10 | 26.70 | NA |
a03 | b3 | d3 | e2 | NA | 14 | NA | 658.56 |
a03 | b3 | d4 | e2 | NA | 8 | NA | 203.52 |
a03 | b3 | d5 | e2 | NA | 8 | NA | 148.48 |
a04 | b4 | d1 | e1 | 1.84 | 9 | 16.56 | NA |
a04 | b4 | d2 | e1 | 1.51 | 16 | 24.16 | NA |
a04 | b4 | d3 | e2 | 5.37 | 7 | NA | 263.13 |
a04 | b4 | d4 | e2 | 3.20 | 8 | NA | 204.80 |
a04 | b4 | d5 | e2 | 4.89 | 10 | NA | 489.00 |
a05 | b1 | d1 | e1 | 1.91 | 41 | 78.31 | NA |
a05 | b1 | d3 | e2 | 1.10 | 8 | NA | 70.40 |
a05 | b1 | d4 | e2 | 2.25 | 22 | NA | 1089.00 |
a06 | b2 | d1 | e1 | NA | 24 | 63.60 | NA |
a06 | b2 | d2 | e1 | NA | 15 | 30.30 | NA |
a06 | b2 | d3 | e2 | NA | 21 | NA | 1058.40 |
a06 | b2 | d4 | e2 | NA | 17 | NA | 494.19 |
a06 | b2 | d5 | e2 | NA | 7 | NA | 139.65 |
a07 | b3 | d1 | e1 | NA | 6 | 13.98 | NA |
a07 | b3 | d3 | e2 | NA | 5 | NA | 92.00 |
a07 | b3 | d4 | e2 | NA | 28 | NA | 1466.08 |
a08 | b4 | d1 | e1 | 1.69 | 18 | 30.42 | NA |
a08 | b4 | d2 | e1 | 1.01 | 15 | 15.15 | NA |
a08 | b4 | d3 | e2 | 0.86 | 9 | NA | 69.66 |
a08 | b4 | d4 | e2 | 1.73 | 27 | NA | 1261.17 |
a08 | b4 | d5 | e2 | 4.06 | 5 | NA | 101.50 |
a09 | b1 | d1 | e1 | 2.55 | 16 | 40.80 | NA |
a09 | b1 | d2 | e1 | 2.74 | 12 | 32.88 | NA |
a09 | b1 | d3 | e2 | 3.99 | 13 | NA | 674.31 |
a10 | b2 | d1 | e1 | NA | 6 | 9.90 | NA |
a10 | b2 | d2 | e1 | NA | 27 | 25.11 | NA |
a10 | b2 | d3 | e2 | NA | 16 | NA | 791.04 |
a10 | b2 | d4 | e2 | NA | 7 | NA | 121.03 |
a11 | b3 | d1 | e1 | NA | 7 | 23.17 | NA |
a11 | b3 | d2 | e1 | NA | 5 | 17.75 | NA |
a11 | b3 | d3 | e2 | NA | 17 | NA | NA |
a11 | b3 | d5 | e2 | NA | 13 | NA | NA |
a12 | b4 | d2 | e1 | 1.47 | 32 | 47.04 | NA |
a12 | b4 | d3 | e2 | 3.46 | 10 | NA | 346.00 |
a12 | b4 | d4 | e2 | 3.49 | 17 | NA | 1008.61 |
a12 | b4 | d5 | e2 | 1.54 | 9 | NA | 124.74 |
a13 | b1 | d1 | e1 | 2.99 | 12 | 35.88 | NA |
a13 | b1 | d2 | e1 | 1.02 | 12 | 12.24 | NA |
a13 | b1 | d3 | e2 | 3.48 | 11 | NA | 421.08 |
a13 | b1 | d4 | e2 | 2.18 | 22 | NA | 1055.12 |
a13 | b1 | d5 | e2 | 1.32 | 7 | NA | 64.68 |
a14 | b2 | d1 | e1 | NA | 32 | 40.32 | NA |
a14 | b2 | d2 | e1 | NA | 8 | 25.68 | NA |
a14 | b2 | d3 | e2 | NA | 45 | NA | 4698.00 |
a14 | b2 | d4 | e2 | NA | 4 | NA | 40.96 |
a15 | b3 | d1 | e1 | NA | 14 | 42.98 | NA |
a15 | b3 | d3 | e2 | NA | 30 | NA | 2043.00 |
a15 | b3 | d4 | e2 | NA | 17 | NA | 184.96 |
a15 | b3 | d5 | e2 | NA | 13 | NA | 544.18 |
a16 | b4 | d1 | e1 | 1.43 | 10 | 14.30 | NA |
a16 | b4 | d2 | e1 | 1.97 | 18 | 35.46 | NA |
a16 | b4 | d3 | e2 | 2.43 | 24 | NA | 1399.68 |
a16 | b4 | d4 | e2 | 2.22 | 8 | NA | 142.08 |
a16 | b4 | d5 | e2 | 2.71 | 4 | NA | 43.36 |
a17 | b1 | d1 | e1 | 3.71 | 9 | 33.39 | NA |
a17 | b1 | d2 | e1 | 1.94 | 23 | 44.62 | NA |
a17 | b1 | d3 | e2 | 3.84 | 5 | NA | 96.00 |
a17 | b1 | d4 | e2 | 2.34 | 24 | NA | 1347.84 |
a17 | b1 | d5 | e2 | 2.13 | 16 | NA | 545.28 |
a18 | b2 | d1 | e1 | NA | 20 | 44.80 | NA |
a18 | b2 | d2 | e1 | NA | 10 | 27.20 | NA |
a18 | b2 | d3 | e2 | NA | 10 | NA | 150.00 |
a18 | b2 | d4 | e2 | NA | 12 | NA | 443.52 |
a19 | b3 | d1 | e1 | NA | 23 | 47.15 | NA |
a19 | b3 | d2 | e1 | NA | 19 | 53.77 | NA |
a19 | b3 | d3 | e2 | NA | 14 | NA | 393.96 |
a19 | b3 | d4 | e2 | NA | 25 | NA | 1056.25 |
a20 | b4 | d1 | e1 | 2.70 | 16 | 43.20 | NA |
a20 | b4 | d3 | e2 | 2.18 | 4 | NA | 34.88 |
a20 | b4 | d4 | e2 | 1.55 | 13 | NA | 261.95 |
a20 | b4 | d5 | e2 | 3.32 | 5 | NA | 83.00 |
Pivot table operations
To transform one or more pivot tables into a flat table, the workflow is as follows:
Pivot table import: Import pivot tables into an object or list of objects. We start from data in text or Excel files, or previously imported data in a data frame, and generate pivot table objects from them.
-
Pivot table definition: Study the structure of the data and define the pivot table. If there are several homogeneous pivot tables, we will focus on one, the definition should be applicable to all of them.
- Define the characteristics of the pivot table: Number of rows and columns with labels, and page value (identifies the pivot table).
- Remove the rows and columns that are not part of the pivot table: It should only contain the rows and columns of labels and an array of values.
Pivot table transformation: Optionally, complete or transform the components (labels and values) of the pivot table.
Flat table generation: Generate the flat table from the definition of the pivot table and the available data. If there are multiple pivot tables, apply the defined operations to all of them and merge the result.
In this section, the operations available in flattabler
package, classified according to this workflow, are presented.
Pivot table import
The objective of import operations is to obtain external data that contains one or more pivot tables to transform them.
Three formats have been considered: text file, Excel file, and data frame.
In the case of working with files, the situation of jointly treating all the files in a folder has also been considered. In the case of Excel, alternatively, all the sheets in a file can be treated together.
The S3 pivot_table
class has been defined in the
package. Transform operations are defined for objects of this class.
Import operations can be classified into two groups: those that return a
pivot_table
object, and those that return a list of
pivot_table
objects. Objects in a list can be transformed
together.
Operations that return an object
-
pivot_table()
: Creates apivot_table
object from a data frame. The data frame is expected to contain one or more pivot tables. Additional information associated with the pivot table can be indicated. The data frame data is converted to character type. Example:
pt <- pivot_table(df_ex)
pt <- pivot_table(df_ex, page = "M4")
-
read_text_file()
: Reads a text file and creates apivot_table
object. The file is expected to contain one or more pivot tables. Each line in the file corresponds to a row in a table; within each row, columns are defined by a separator character. The file name can be included as part of the object attributes. Example:
file <- system.file("extdata", "csv/set_v_ie.csv", package = "flattabler")
pt <- read_text_file(file, define_page = TRUE)
-
read_excel_sheet()
: Reads an Excel file sheet and creates apivot_table
object. The sheet is expected to contain one or more pivot tables. Each line in the sheet corresponds to a row in a table. The file and sheet names can be included as part of the object attributes. Example:
file <- system.file("extdata", "excel/set_v.xlsx", package = "flattabler")
pt <- read_excel_sheet(file, define_page = 3)
Operations that return a list of objects
-
divide()
: Divides a table into tables separated by some empty row or column. Sometimes multiple pivot tables are placed in a text document or Excel sheet, imported as one text table. This operation recursively divides the initial table into tables separated by some empty row or column. Once a division has been made, it tries to divide each part of the result. An object is generated for each indivisible pivot table. Returns a list ofpivot_table
objects. Example:
pt <- pivot_table(df_set_h_v)
lpt <- pt |> divide()
-
read_text_folder()
: Reads all text files in a folder and creates a list ofpivot_table
objects, one from each file. Each file is expected to contain a pivot table. Each line in a file corresponds to a row in a table; within each row, columns are defined by a separator character. File name can be included as part of each object attributes. Example:
folder <- system.file("extdata", "csvfolder", package = "flattabler")
lpt <- read_text_folder(folder)
-
read_excel_folder()
: Reads one sheet from each of the Excel files in a folder and creates a list ofpivot_table
objects, one from each sheet or, which is the same in this case, one from each file. Each sheet is expected to contain a pivot table. Each line in a file corresponds to a row in a table. File and sheet names can be included as part of each object attributes. Example:
folder <- system.file("extdata", "excelfolder", package = "flattabler")
lpt <- read_excel_folder(folder)
-
read_excel_file()
: Reads sheets from an Excel file and creates apivot_table
object list, one from each sheet. Each sheet is expected to contain a pivot table. Each line in a sheet corresponds to a row in a table. The file and sheet names are included as part of each object attributes. Example:
file <- system.file("extdata", "excel/set_sheets.xlsx", package = "flattabler")
lpt <- read_excel_file(file)
Pivot table definition
Once we have a pivot_table
object or list of objects,
pivot tables have to be defined. Each object generated by import
operations contains a text table, it is expected to contain a pivot
table, but may also have more information, generally in the form of a
table header or footer. Through this set of operations we transform the
text table in the object into a pivot table and define its
characteristics.
A pivot_table
object should only contain label
rows and columns, and an matrix of values,
usually numeric data. Additional information can be used to identify the
pivot table relative to other similar tables: can be used to define the
pivot table page.
Page: We consider the page of the pivot table as the literal that identifies it with respect to other homogeneous tables; generally it is the value of an attribute (i.e., 2023, 2022,…) or the name of a variable (i.e., amount, profit,…). When multiple pivot tables are integrated into a flat table, the page is essential to distinguish the origin of the data. It is considered as an additional label.
The workflow is generally as follows:
Explore the table to determine its distribution and characteristics. If we start from a list of
pivot_table
objects, we will explore each one of the tables. In order to transform them together, they should have homogeneous structure. We will use member reference (instead of list slicing) to access the objects in the list. Example:pt <- lpt[[1]]
.In case the text table contains multiple pivot tables, they can be obtained using
divide()
, which returns a list ofpivot_table
objects; therefore, we return to the first step.Define the characteristics of the pivot table: Number of rows and columns with labels, and page value.
Remove the rows and columns that are not part of the labels or matrix of values: It should only contain the rows and columns of labels and a matrix of values.
Therefore, we still need to review the functions for these last two steps.: Functions to define the pivot table characteristics, and to remove the rows and columns that are not part of it.
Define pivot table characteristics
-
get_page()
andset_page()
: The page value is defined when importing data, sometimes it is included in the file or spreadsheet name. Using these functions, you can get the defined values and redefine them. The content of a table cell or string can be defined as a page value. Example:
pt <- pt |> set_page(1, 1)
-
define_labels()
: This function defines the quantity of rows and columns that contain labels. Example:
pt <- pt |> define_labels(n_col = 2, n_row = 2)
Remove rows and columns
remove_empty()
remove_rows()
remove_cols()
remove_top()
remove_bottom()
remove_left()
remove_right()
Remove rows and columns that are not part of the pivot table. The most frequent situation will be having to eliminate the header or footer of the table (top and bottom rows), the rest of the functions are defined to try to contemplate all possible cases. Example:
pt <- pt |> remove_top(1)
Pivot table transformation
Once a pivot_table
object only contains pivot table
data, and its attributes have been defined, it could be transformed into
a flat table. However, we can take advantage of the table structure to
modify and complete it. Therefore, optionally, we can complete and
transform the components of the pivot table: Labels and values.
Transform labels
-
fill_labels()
: Fills missing values in row and column labels for a pivot table. When there is more than one row or column with labels, the first ones usually do not repeat the values. In the illustrative example, this occurs in column c1 and row r2. By default, in columns they are filled down, in rows to the right. Example:
pt <- pt |> fill_labels()
-
remove_agg()
: Removes pivot table rows and columns that contain aggregated data. Aggregated data is recognized exclusively because the label of the row or column closest to the matrix of values is empty. Example:
pt <- pt |> remove_agg()
-
extract_labels()
: Sometimes a table column includes values of multiple label fields, this is generally known as compact table format. Given a column number and a set of labels, it generates a new column with the labels located at the positions they occupied in the original column and removes them from it. Example:
pt <- pivot_table(df_ex_compact) |>
extract_labels(col = 1,
labels = c("b1", "b2", "b3", "b4", "Total general"))
-
get_col_values()
: To facilitate the study of the labels included in the same column of several tables, this function gets the values of the indicated column in a list of tables. It may be useful to use it beforeextract_labels()
. Example:
file <- system.file("extdata", "csv/set_v_compact.csv", package = "flattabler")
pt <- read_text_file(file)
lpt <- pt |> divide()
df <- get_col_values(lpt, start_row = 4)
labels <- sort(unique(df$label))
Transform values
-
fill_values()
: The array of values has gaps that, instead of having a numeric value, have an empty string. This operation fills withNA
missing values in a pivot table value array. Example:
pt <- pt |> fill_values()
-
remove_k()
: Values sometimes include a thousands separator that can be removed using this function. Example:
pt <- pt |> remove_k()
-
replace_dec()
: Values sometimes include a decimal separator different from the one needed; it can be replaced using this function. Example:
pt <- pt |> replace_dec()
Flat table generation
In order to generate a flat table from a pivot_table
object, it is an essential requirement to have properly defined its
attributes and that it only contains the pivot table label rows
and columns, and the matrix of values. Optionally, if the table
has a usual structure, we could have transformed the values and labels,
if necessary.
We can generate a flat table from a pivot table (a
pivot_table
object) or from a list of pivot tables (a list
of pivot_table
objects).
Generation from a pivot table
-
unpivot()
: Transforms a pivot table into a flat table (implemented by atibble
). An additional column with page information can be included.NA
values can be excluded from the array of values. Example:
ft <- pivot_table(df_ex) |>
set_page(1, 1) |>
remove_top(1) |>
define_labels(n_col = 2, n_row = 2) |>
fill_labels() |>
remove_agg() |>
fill_values() |>
remove_k() |>
replace_dec() |>
unpivot()
Generation from a list of pivot tables
-
flatten_table_list()
: Given a list ofpivot_table
objects and a transformation function that flattens apivot_table
object, transforms each table using the function and merges the results into a flat table. Example:
f <- function(pt) {
pt |>
set_page(1, 1) |>
define_labels(n_col = 2, n_row = 2) |>
remove_top(1) |>
fill_labels() |>
remove_agg() |>
fill_values() |>
remove_k() |>
replace_dec() |>
unpivot()
}
ft <- flatten_table_list(lpt, f)
Conclusions
flattabler
package offers a set of operations that allow
us to transform one or more pivot tables into a flat table.
Transformation operations have been designed to be intuitive and easy to
use. With them, it has been possible to properly transform all the pivot
tables found so far by the author.
If an unforeseen situation arises, the proposed operations are also
useful and can be supplemented by operations available in the components
of tidyverse
package.