Introduction
XML files are commonly used to configure various tools and applications. While there are dedicated tools for generating XML files through structured input forms, in many cases, it is beneficial to organize data in tables linked by primary and foreign keys, especially when dealing with large datasets or multiple instances of the same entity.
The tab2xml
package provides a solution by enabling the
generation of XML files based on predefined templates
and structured tabular data stored in Excel (.xlsx) or
OpenDocument Spreadsheet (.ods) files. By leveraging relational
tables, users can efficiently handle and manage large-scale data entry,
ensuring consistency and ease of modification.
This approach simplifies XML creation, making it more accessible and scalable when working with complex data structures.
Content
In addition to this introduction, the document includes:
An example based on the definition of an XML schema for Mondrian, following the Mondrian Documentation.
A breakdown of each schema element, detailing the corresponding tables, their relationships, and the XML templates used for data transformation.
The process of transforming elements to generate the schema.
Example: Mondrian Schema
The starting point for this example is the Mondrian schema, based on a schema example from the Mondrian Documentation, shown below.
<Schema>
<Cube name="Sales">
<Table name="sales_fact_1997" />
<Dimension name="Gender" foreignKey="customer_id">
<Hierarchy name="Gender" hasAll="true" allMemberName="All Genders" primaryKey="customer_id">
<Table name="customer" />
<Level name="Gender" column="gender" type="String" uniqueMembers="true" />
</Hierarchy>
</Dimension>
<Dimension name="Time" foreignKey="time_id">
<Hierarchy name="Time" hasAll="false" primaryKey="time_id">
<Table name="time_by_day" />
<Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" />
<Level name="Quarter" column="quarter" type="Numeric" uniqueMembers="false" />
<Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false" />
</Hierarchy>
</Dimension>
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###" />
<Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##" />
<Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00" />
<CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]">
<CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00" />
</CalculatedMember>
</Cube>
</Schema>
To define this schema, we can either edit the XML file directly or use the Pentaho Schema Workbench tool. Another approach is to break it down into tables and templates, work with the tables, and generate the required versions directly, as shown below.
Scheme decomposition
Next, we will decompose the content of this schema into tables, defined as sheets in the spreadsheets files (.xlsx or .ods) included in the package, and templates.
Root of the schema template: schema_template.xml
Below is the root of the schema template.
The root can contain only one token (in this case,
the {Cube}
token). If multiple tokens or none are found, an
error is raised.
Tokens are defined using a name enclosed in
{
and}
.Token names are case-insensitive.
Tokens reference either sheets in the spreadsheet or columns in the sheet associated with each element.
In the example, a schema consists of one or more cubes.
Starting from the root, each element must have an associated sheet and a template file, where the sheet’s columns or other components are included.
Cube
Only one cube is defined in the schema; therefore, the table contains only one row.
If we need to reference this element from another, we add a primary key column. The name of this column must consist of the sheet name followed by the
_pk
suffix.To reference elements from another table, we add a foreign key column, named after the referenced sheet with the
_fk
suffix.
In this case, a cube has an associated table. However, since other elements can also have associated tables, we add the foreign key to the cube (as well as to each element with an associated table).
cube_pk | name | table_fk |
---|---|---|
1 | Sales | 1 |
Regarding the template, we can include two types of tokens:
- Tokens referencing table columns: these will be
replaced with a value.
- Tokens referencing other elements (referenced by the sheet name): these will be replaced with another template.
When including tokens that reference other sheets, each token must be placed on a separate line, as they will be replaced with templates.
Table
Since the tables are referenced by other elements, we define a primary key with values to allow referencing (as we did from the cube).
table_pk | name |
---|---|
1 | sales_fact_1997 |
2 | customer |
3 | time_by_day |
The template only needs to include the table name using the specified syntax.
Dimension
Each dimension is included in a cube and can also contain other elements that reference it.
dimension_pk | name | foreignKey | cube_fk |
---|---|---|---|
1 | Gender | customer_id | 1 |
2 | Time | time_id | 1 |
In the template, the sheet’s columns and the containing element (which includes one or more hierarchies) are referenced.
Hierarchy
Hierarchies reference both the dimensions that contain them and their associated tables.
hierarchy_pk | name | hasAll | allMemberName | primaryKey | dimension_fk | table_fk |
---|---|---|---|---|---|---|
1 | Gender | true | allMemberName=“All Genders” | customer_id | 1 | 2 |
2 | Time | false | NA | time_id | 2 | 3 |
In the template, we can see that it includes the table and the levels that make up the hierarchy.
Level
Levels do not need to be referenced, so they do not include a primary key. Instead, they reference the hierarchies they belong to.
name | column | type | uniqueMembers | hierarchy_fk |
---|---|---|---|---|
Gender | gender | String | true | 1 |
Year | the_year | Numeric | true | 2 |
Quarter | quarter | Numeric | false | 2 |
Month | month_of_year | Numeric | false | 2 |
The template references the columns of the associated sheet.
Measure
Measures are included in cubes and must reference them. Since they do not need to be referenced, they do not require a primary key.
name | column | aggregator | formatString | cube_fk |
---|---|---|---|---|
Unit Sales | unit_sales | sum | #,### | 1 |
Store Sales | store_sales | sum | #,###.## | 1 |
Store Cost | store_cost | sum | #,###.00 | 1 |
The template includes the columns from the sheet.
CalculatedMember
Calculated members may have components that must be referenceable (they have a primary key) and are included in cubes.
calculatedmember_pk | name | dimension | formula | cube_fk |
---|---|---|---|---|
1 | Profit | Measures | [Measures].[Store Sales] - [Measures].[Store Cost] | 1 |
The template shows the elements they contain.
Generate the Schema
To generate the schema, tokens in each template are replaced with corresponding values from a data source.
It handles both direct replacements from the associated sheet and foreign key relationships.
library(tab2xml)
# Define file paths
source_xml <- system.file("extdata", "schema_template.xml", package = "tab2xml")
source_xlsx <- system.file("extdata", "schema.xlsx", package = "tab2xml")
temp_file <- tempfile(fileext = ".xml")
# Convert spreadsheet to XML
file <- sheet2xml(source_xlsx, source_xml, temp_file)
As shown in the example, we only need to specify:
The spreadsheet file.
The root template. The rest of the templates must be located in the same folder as the root template and are determined from the content of the templates being processed.
The name of the output XML file (if none is specified, a file with the same name as the spreadsheet is created in the same location).
The resulting output is shown below.
library(xml2)
xml_content <- readLines(file, warn = FALSE)
cat("```xml\n", paste(xml_content, collapse = "\n"), "\n```", sep = "")
<Schema>
<Cube name="Sales">
<Table name="sales_fact_1997" />
<Dimension name="Gender" foreignKey="customer_id">
<Hierarchy name="Gender" hasAll="true" allMemberName="All Genders" primaryKey="customer_id">
<Table name="customer" />
<Level name="Gender" column="gender" type="String" uniqueMembers="true" />
</Hierarchy>
</Dimension>
<Dimension name="Time" foreignKey="time_id">
<Hierarchy name="Time" hasAll="false" primaryKey="time_id">
<Table name="time_by_day" />
<Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" />
<Level name="Quarter" column="quarter" type="Numeric" uniqueMembers="false" />
<Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false" />
</Hierarchy>
</Dimension>
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###" />
<Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##" />
<Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00" />
<CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]">
<CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00" />
</CalculatedMember>
</Cube>
</Schema>
Conclusions
The tab2xml
package offers an efficient and scalable
method for generating XML files from tabular data. By using structured
templates and relational tables, the package facilitates the management
of complex datasets while maintaining consistency and accuracy.
This approach is particularly advantageous for scenarios requiring the generation of large XML files or handling multiple entities with defined relationships. It streamlines the transformation process, reduces manual errors, and enhances maintainability by allowing users to modify the underlying data without altering the XML structure.
The provided example, based on the Mondrian schema, demonstrates how to apply this methodology to real-world use cases. This framework can be extended to other domains requiring XML generation from relational data, offering a flexible and robust solution for data-driven XML creation.