Parametric Schema Inference for Massive JSON Datasets

The inferred Schemas

Dataset      Kind-based   Label-based   Spark Dataframe
GitHub       schema       schema        schema
Twitter'16   schema       schema        schema
NYTimes      schema       schema        schema
Wikidata     schema       NA            schema
VK           schema       schema        schema
Twitter'18   schema       schema        schema
Core         schema       schema        schema

How to read the Schemas?

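Our schemas are represented in JSON following a simple rule: every type T is mapped to a record R which contains a "__Kind" field naming the kind of T and, for structured types (arrays, records and unions), a "__Content" field holding the representation of its components. Inside a record, optional fields are grouped under an "__Optional" key.

To illustrate this representation, consider the following type expressed in our formal syntax: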
[ { id : Int , abstract : Null + Str , year : Str ? } ]
and its JSON representation:
            
    {
        "__Kind": "Array",
        "__Content": {
            "__Kind": "Record",
            "__Content": {
                "id": {
                    "__Kind": "NumType"
                },
                "abstract": {
                    "__Kind": "Union",
                    "__Content": [
                        {
                            "__Kind": "NullType"
                        },
                        {
                            "__Kind": "StrType"
                        }
                    ]
                },
                "__Optional": {
                    "year": {
                        "__Kind": "StrType"
                    }
                }
            }
        }
    }
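
To make the mapping concrete, the following Python sketch walks a schema in the JSON representation and prints it back in the formal syntax. It is only an illustration, not part of the inference tool: the function name render is hypothetical, only the kinds that appear in the example above are handled (Array, Record, Union and the basic NumType, StrType and NullType), and basic kinds are assumed to be printed by dropping the "Type" suffix.

```python
import json

# The example schema from this section, embedded so the sketch is self-contained.
EXAMPLE = json.loads("""
{
  "__Kind": "Array",
  "__Content": {
    "__Kind": "Record",
    "__Content": {
      "id": {"__Kind": "NumType"},
      "abstract": {
        "__Kind": "Union",
        "__Content": [{"__Kind": "NullType"}, {"__Kind": "StrType"}]
      },
      "__Optional": {
        "year": {"__Kind": "StrType"}
      }
    }
  }
}
""")


def render(schema):
    """Render a schema record as a type expression in the formal syntax."""
    kind = schema["__Kind"]
    if kind == "Array":
        # "__Content" holds the representation of the element type.
        return "[ " + render(schema["__Content"]) + " ]"
    if kind == "Union":
        # "__Content" holds the list of alternatives.
        return " + ".join(render(branch) for branch in schema["__Content"])
    if kind == "Record":
        content = schema["__Content"]
        fields = []
        # Mandatory fields sit directly in "__Content".
        for name, field_type in content.items():
            if name != "__Optional":
                fields.append(f"{name} : {render(field_type)}")
        # Optional fields are grouped under "__Optional" and get a "?" suffix.
        for name, field_type in content.get("__Optional", {}).items():
            fields.append(f"{name} : {render(field_type)} ?")
        return "{ " + " , ".join(fields) + " }"
    # Basic kinds: assumed naming convention "<Name>Type" (e.g. "StrType").
    return kind[:-4] if kind.endswith("Type") else kind


print(render(EXAMPLE))
# prints: [ { id : Num , abstract : Null + Str , year : Str ? } ]
```

The output shows Num where the formal type above writes Int, because the JSON representation records the kind as "NumType".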