Columnar nosql star schema benchmark model and data engineering

Implementation of multidimensional databases in column-oriented NoSQL systems

Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

administrator: staff-oatao@listes-diff.inp-toulouse.fr

Implementation of multidimensional databases in column-oriented NoSQL systems

Keywords: data warehouse design, multidimensional modelling, NoSQL data-
bases, model transformation rules, column-oriented NoSQL model.

1 Introduction

─the computation of the OLAP cube using NoSQL technologies.

2 State of the art

In our approach we consider the conceptual model as well as logical models that allow distributing multidimensional data vertically using a column-oriented model. Finally we take into account hierarchies in our transformation rules by providing transformation rules to manage the pre-computed aggregates structured using an ag-gregate lattice.

A multidimensional schema, namely E, is defined by (FE, DE, StarE) where:

─FE={F1,…, Fn} is a finite set of facts,
─DE={D1,…, Dm} is a finite set of dimensions,
─StarE: FE→2�� is a function that associates facts of FE to sets of dimensions along which it can be analyzed (2��is the power set of DE).

─WeakHi: ParamHi→2�� is a function associating with each parameter possibly one or more weak attributes.	Code de champ modifié

functions are used to summarize the measure data according to these groups. Formal-ly, an OLAP cuboid O is derived from E, O = (FO,DO) such that:

─FO is a fact derived from F (F∈FE) with a subset of measures, MO⊆MF.

The fact FLineOrder is defined by (LineOrder, {SUM(Quantity), SUM(Discount), SUM(Revenue), SUM(Tax)}) and it is analyzed according to four dimensions, each consisting of several hierarchical levels (called detail levels or parameters):

─The Customer dimension (DCustomer) with parameters Customer (along with the weak attribute Name), City, Region and Nation,
─The Part dimension (DPart) with parameters Partkey (with weak attributes Size and Prod_Name), Category, Brand and Type; organized using two hierarchies HBrand and HCateg,
─The Date dimension (DDate) with parameters Date, Month (with a weak attribute, MonthName) and Year,
─The Supplier dimension (DSupplier) with parameters Supplier (with weak attributes Name), City, Region and Nation.

HCust

Fact	CUSTOMER	Customer	City	Region	Nation	All

HTime

Name

	SUPPLIER	Supplier City	Region

– FSSB={FLineOrder}

– DSSB={DCustomer, DPart, DDate, DSupplier},

Column-Oriented NoSQL models provide tables with a flexible schema (untyped columns) where the number of columns may vary between each record (called rows). Each row has a row key and a set of column families. Physical storage is organized according to these column families, hence a “vertical partitioning” of the data. A col-umn family consists of a set of columns, each associated with a qualifier (name) and an atomic value. Every value can be “versioned” using a timestamp. The flexibility of a column-oriented NoSQL database enables managing the absence of some columns between the different table rows. However, in the context of multidimensional data storage, this rarely happens as data is usually highly structured. This implies that the structure of a column family (i.e. the set of columns of the column family) will be the same for all table rows.

The following notations are used for describing a NoSQL model with respect to the definition of conceptual models. In addition to attribute names and values that are also present in the conceptual model, we focus here on the structure of rows.

4.2 Column-oriented models for data warehousing

In column-oriented stores, the data model is determined not only by its attributes and values, but also by the column families that group attributes (i.e. columns). In rela-tional database models, mapping from conceptual to logical structures is more straightforward. In column-oriented stores, there are several candidate approaches, which can differ on the tables and structures used. So far, no logical model has been proven better than another one and no mapping rules are widely accepted.

Table

The formalism that we have defined earlier enables us to define a mapping from the conceptual multidimensional model to each of our three logical models. Let O = (FO, DO) be a cuboid for a multidimensional model E built from the fact F with dimensions in DE.

Table 1 shows how we can map any measure m of FO and any dimension D of DO into all 3 models MLC0, MLC1 and MLC2. Let T be a generic table, TD a table for the dimension D, TF a table for a fact F and cf a generic column family.

─For all dimensions D∈DO, each attribute d∈AD of the dimension D is converted into an attribute (a column) of cf, i.e. RT.cf:d.

Conceptual to MLC1. To instantiate this model from the conceptual model, five rules are applied:

─Given a cuboid O, the fact FO is translated into a table TF with one column family cf and each dimension D∈DO is translated into a table TD with one column family cfD per table.

─Each measure m∈FO is translated into an attribute of the column family cf in the table TF, i.e. ��.cf:m.

5.1

Data: Data is generated using an extended version of SSB to generate raw data spe-cific to our models in normalized and denormalized formats. This is very convenient for our experimental purposes.

The benchmark models a simple product retail example and corresponds to a typi-cal decision support star-schema. It contains one fact table “LineOrder” and 4 dimen-sions “Customer”, “Supplier”, “Part” and “Date” (see Fig. 2 for an excerpt). The dimensions are composed of hierarchies; e.g. Date is organized according to one hier-archy of attributes (d_date, d_month, d_year).

5.2 Experimental results

InTable 2 we summarize data loading times by model and scale factor. We can ob-serve at scale factor SF1, we have 107 lines on each line order table for a 997 MB disk memory usage for MLC2 (3.9GB for both MLC0 and MLC1). At scale factor SF10 and SF100 we have respectively 108 lines and 109 lines and 9.97GB (39GB MLC0 and MLC1) and 97.7GB (390GB MLC0 and MLC1) for of disk memory usage. We observe that memory usage is lower in the MLC2 model. This is explained by the absence of redundancy in the dimensions. For all scale factors, the “dimension” tables “Customers”, “Supplier”, “Part” and “Date” have respectively 50000, 3333, 3333333 and 2556 records.

	MLC0	MLC1	MLC2
			264s / 0.997GB
SF10 (sf=10 108 lines)
	39075s / 390GB	39716s / 390GB

In Fig. 5, we sumarize experimental results concerning the computation of the OLAP cuboids at different levels of the OLAP lattice for SF1 using data from the model MLC0. We report the time needed to compute the cuboid and the number of records it contains.

We observe as expected that the number of records decreases from one level to the lower level. The same is true for computation time. We need between 550 and 642 seconds to compute the cuboids at the first level (using 3 dimensions). We need be-tween 78 seconds and 480 seconds at the second layer (using 2 dimensions). And we only need between 2 and 23 seconds to compute the cuboids at the third and fourth level (using 1 and 0 dimensions).

Dimensions used: C = Customer, S = Supplier, D = Date, P = Part (i.e. Product)

10,000,000 rows

CSP			CSPD
			CSD				CPD	9,790,169 rows			SPD
					556s			573s
62,500 rows	CP	937,475 rows			556s			573s
62,500 rows		937,475 rows	CD	21,250 rows		SP			SD			PD
78s		417s	CD			SP			SD			PD
C	250 rows 17s		S		250 rows	P		3,750 rows		D			85 rows
			S			P				D
			All					Record count
								Computation time (seconds)
								Computation time (seconds)

In this paper, we studied instantiating multidimensional data warehouses using NoSQL column-oriented systems. We proposed three approaches at the column-oriented logical model. Using a simple formalism that separate structures from values, we described mappings from the conceptual level (described using a multidimension-

3 “Reasonable time” for a Big Data environment running on commodity hardware (without an optical fiber network between nodes, i.e. the recommended 10,000 GB/s).

These studies are supported by the ANRT funding under CIFRE-Capgemini part-nership.

8 References

199-212 (2014)
12.Kimball, R., Ross, M., The Data Warehouse Toolkit: The Definitive Guide to Dimen-

sional Modeling, 3rd ed., John Wiley & Sons, Inc., 2013.