Datasets

Description

The tool lets you examine the schemas inferred from five real-world datasets, which are described below.

The following table provides basic statistics about each dataset.
Datasets description

Dataset               GitHub      Twitter     NYTimes     VK          Core
Size                  13.7 GB     21 GB       21.3 GB     5.2 GB      508.7 GB
# objects             1,000,001   9,901,087   1,184,943   3,036,654   123,986,577
Average textual size  14.7 KB     2.2 KB      19.3 KB     1.4 KB      4.4 KB
Average AST* size     495.46      138.67      1,165.17    51.38       54

(*) AST: Abstract Syntax Tree

References

[1] DiScala, Michael and Abadi, Daniel J. Automatic generation of normalized relational schemas from nested key-value data. In: Proceedings of the 2016 International Conference on Management of Data. ACM, 2016, pp. 295-310.

Tool

Introduction

The main goal of this tool is to showcase interactive schema inference for JSON collections. The GUI is divided into two panes: a "dataset" pane for entering and managing a JSON collection, and a "schema visualization" pane for exploring the schema. Users can either supply their own data or explore one of the sample schemas that were inferred from real-world datasets.

Dataset input

A dataset is entered either by loading a JSON Lines file with the "Load a File" button (step 1) or by typing JSON objects manually in the dataset input area (steps 1a and 1b). Once a minimal validation of the data has been performed, the "Infer schema" button is activated (step 2). Clicking this button infers the schema and displays it in the schema area.
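The page does not say what the minimal validation consists of. The sketch below shows one plausible reading, assuming it only checks that every non-blank line of the input parses as a JSON object. The function name validate_jsonlines is hypothetical and is not part of the tool.

```python
import json


def validate_jsonlines(text):
    """Return (objects, errors) for a JSON Lines string.

    Hypothetical helper: it assumes "minimal validation" means that each
    non-blank line must parse as a JSON object. The tool may check more.
    """
    objects, errors = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, exc.msg))  # malformed JSON on this line
            continue
        if not isinstance(value, dict):
            errors.append((lineno, "not a JSON object"))
            continue
        objects.append(value)
    return objects, errors
```

Under this assumption, a front end could enable the "Infer schema" button only when the returned error list is empty.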

Schema exploration

Initially, the k-schema is displayed. To expand the content of a record into its l-equivalent type, click the vertical bar on its left; click the bar again to return to the k-equivalent type. A colour change indicates which parameter was used for inferring the type. As mentioned in the introduction, it is also possible to explore a sample of schemas by clicking on the corresponding button (step 3). A basic exploration scenario is depicted in the screenshots below.
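The page does not define the two parameters. The sketch below assumes that "k" stands for kind equivalence (all records are merged into a single record type, with fields that are sometimes absent marked optional) and "l" for label equivalence (records are merged only when they have exactly the same field names). It handles top-level fields only, and the names kind, k_schema and l_schema are hypothetical, so it should be read as an illustration of the difference in precision rather than as the tool's inference algorithm.

```python
def kind(value):
    """Coarse type name of a JSON value."""
    if value is None:
        return "Null"
    if isinstance(value, bool):  # test bool before int: bool is an int subclass
        return "Bool"
    if isinstance(value, (int, float)):
        return "Num"
    if isinstance(value, str):
        return "Str"
    if isinstance(value, list):
        return "Arr"
    return "Rec"


def k_schema(records):
    """Assumed kind equivalence: merge all records into one record type.

    A field missing from at least one record gets a "?" suffix (optional).
    """
    fields = {}
    for rec in records:
        for key, value in rec.items():
            fields.setdefault(key, set()).add(kind(value))
    return {
        key + ("" if all(key in rec for rec in records) else "?"): sorted(kinds)
        for key, kinds in fields.items()
    }


def l_schema(records):
    """Assumed label equivalence: one record type per distinct set of field names."""
    groups = {}
    for rec in records:
        groups.setdefault(frozenset(rec), []).append(rec)
    return [k_schema(group) for group in groups.values()]


data = [{"id": 1, "name": "a"}, {"id": 2}, {"id": "x", "name": "b"}]
print(k_schema(data))  # one merged record type with an optional "name" field
print(l_schema(data))  # two record types, one per set of field names
```

In this reading, the k-schema is the more compact summary, while expanding a record to its l-equivalent type shows which combinations of fields actually occur together.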

Screenshots

Playground

GUI

Exploring a k-schema

GUI

Exploring an l-schema

GUI