The tool allows one to examine the schemas inferred from five real-world datasets, which are described below.
The following table provides basic statistics about each dataset.
- The GitHub dataset was obtained from [DiScala & Abadi, 2016]. The objects in this dataset correspond to pull-request metadata generated upon each call to the GitHub web service.
This dataset does not use arrays, but it is a good testbed for measuring the effect of different sets of fields being present in different records.
- The Twitter dataset, also obtained from [DiScala & Abadi, 2016], contains tweet records describing metadata about tweets, as well as a few delete records, which correspond to metadata generated upon delete requests. This dataset is interesting both because it mixes two kinds of objects and because it uses arrays quite intensively (a simplified example is shown after this list).
- The NYTimes dataset, crawled using the NYTimes articlesearch API, contains metadata about NYTimes articles.
An interesting feature of this dataset, which distinguishes it from the previous ones, is that the structure attached to the same field varies across instances.
- The VK dataset, obtained from Kaggle, is related to the 2018 Russian election.
Its records describe interactions of users on the VK social networking site and, like the NYTimes dataset, feature structural variations that are not fully described in the API documentation.
- The Core dataset, the largest one (508 GB), was obtained from the Core website. It contains information about research articles aggregated
from many open repositories worldwide. The version used here corresponds to the dump of March 1st, 2018, which describes more than 123 million articles.
The records in this dataset follow a fixed schema that is documented quite precisely on the website,
except that arrays containing records are approximated as arrays of strings; this shows that even in very regular and well-documented datasets,
one can still discover that the actual structure of the data differs from what is documented.
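To make this heterogeneity concrete, the sketch below shows two simplified Twitter-like records in JSONLines form, a tweet record and a delete record; the field names are illustrative and do not reproduce the exact Twitter payload:

```
{"id": 1, "text": "hello", "entities": {"hashtags": ["json"], "urls": []}}
{"delete": {"status": {"id": 2, "user_id": 42}}}
```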
(Table: basic statistics per dataset, with columns for average textual size and average AST* size. (*) AST: Abstract Syntax Tree.)
[DiScala & Abadi, 2016] Michael DiScala and Daniel J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. In: Proceedings of the 2016 International Conference on Management of Data. ACM, 2016, pp. 295-310.
The main goal of this tool is to showcase the interactive schema inference of JSON collections.
The GUI is divided into two panes: a "dataset" pane for inputting and managing a JSON collection, and a "schema visualization" pane for exploring the inferred schema.
The user can either load their own data or explore one of the sample schemas inferred from the real-world datasets described above.
A dataset is input either by providing a JSONLines file through the "Load a File" button (step 1) or by typing JSON objects manually in the dataset input area (steps 1a and 1b).
After a minimal data validation is performed, the "Infer schema" button is activated (step 2). Clicking it infers the schema and displays it in the schema area.
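For instance, the following is a minimal collection that could be typed into the dataset input area, with one JSON object per line; the records are made up for illustration:

```
{"a": 1, "b": "x"}
{"a": 2}
{"a": 3, "b": "y", "c": [true]}
```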
Initially, the k-schema is displayed.
To expand the content of a record into its l-equivalent type, click on the left vertical bar; clicking that bar again returns to the k-equivalent type.
A colour change indicates which equivalence parameter was used to infer the displayed type.
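For the three-record collection above, the two views could look as follows. This is a sketch assuming the kind- and label-equivalence semantics of the underlying inference approach, with ? marking an optional field and + denoting a union; the exact rendering in the tool may differ:

```
k-schema:  { a: Num, b?: Str, c?: [Bool] }
l-schema:  { a: Num, b: Str } + { a: Num } + { a: Num, b: Str, c: [Bool] }
```

The k-schema collapses all records into a single record type with optional fields, whereas the l-schema keeps one record type per distinct set of top-level labels.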
As outlined above, it is possible to explore the sample schemas by clicking on the corresponding button (step 3). A basic exploration scenario is depicted in the two figures below.
(Figure: Exploring a k-schema)
(Figure: Exploring an l-schema)