Read/Write Parquet with Struct column type

I am trying to write a Dataframe like this to Parquet:

| foo | bar               |
|  1  | {"a": 1, "b": 10} |
|  2  | {"a": 2, "b": 20} |
|  3  | {"a": 3, "b": 30} |

I am doing it with Pandas and Fastparquet:

df = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]

import fastparquet
fastparquet.write('/my/parquet/location/toy-fastparquet.parq', df)

I would like to load Parquet in (py)Spark, and query the data with Spark SQL, like:

df ="/my/parquet/location/")
result = spark.sql("SELECT * FROM my_toy_table WHERE bar.b > 15")

My issue is that, even though fastparquet can read its Parquet file correctly (the bar field is correctly deserialized as a Struct), in Spark, bar is read as a column of type String, that just contains a JSON representation of the original structure:

In [2]: df.head()                                                                                                                                                                                           
Out[2]: Row(foo=1, bar='{"a": 1, "b": 10}')

I tried writing Parquet from PyArrow, but no luck there: ArrowNotImplementedError: Level generation for Struct not supported yet. I have also tried passing file_scheme='hive' to Fastparquet, but I got the same results. Changing Fastparquet serialization to BSON (object_encoding='bson') produced an unreadable binary field.

Is there a way to produce Parquet files that contain nested structures, and that are also readable in Spark? Or is it possible to de-serialize JSON fiels when reading in Spark? Or should I avoid nested structures altogether?

Leave an answer

Your email address will not be published. Required fields are marked *