
spark.sql.files.maxPartitionBytes not limiting max size of written partitions

I'm trying to copy Parquet data from another S3 bucket to my S3 bucket, and I want to limit the size of each written partition file to at most 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after the copy, I see individual files of around 226 MB instead. I was following this post, which suggests setting that config key to limit the maximum partition size: Limiting maximum size of dataframe partition. However, it doesn't seem to work.

This is the definition of that config key:

The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
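For illustration, here is a minimal sketch (the path and session setup are placeholders, not my actual job) of how I understand this setting is meant to apply: as far as I can tell from the definition, it only controls how input files are split into read partitions, not how output files are sized.

import org.apache.spark.sql.SparkSession

// Placeholder session: maxPartitionBytes caps the bytes packed into each
// *input* split when scanning file-based sources like Parquet.
val spark = SparkSession.builder()
  .appName("maxPartitionBytes-check")
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024) // 128 MB
  .getOrCreate()

// Placeholder path: the number of read partitions reflects the 128 MB cap,
// but it says nothing about the size of files written later.
val df = spark.read.parquet("s3://source-bucket/path/")
println(df.rdd.getNumPartitions)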

I'm also a bit confused about how this setting relates to the size of the written Parquet files.

For reference, I'm running a Glue script on Glue version 1.0 with Spark 2.4, and the script is this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import com.amazonaws.services.glue.GlueContext

// Configure Spark to use the Glue Data Catalog as the Hive metastore.
val conf: SparkConf = new SparkConf()
conf.set("spark.sql.catalogImplementation", "hive")
    .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId)
val spark: SparkContext = new SparkContext(conf)

val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession

// Read from the source table and write the result as Parquet to my bucket.
val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'")
sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")
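
The only workaround I can think of so far is to control the output side directly, either by capping the number of records per written file or by repartitioning before the write. A rough sketch of what I mean is below; the record cap and partition count are just placeholder guesses for my data, not values I know to be right.

// Sketch only: spark.sql.files.maxRecordsPerFile (Spark 2.2+) caps rows per
// written file; the right value depends on average row size, so 1,000,000
// here is just a placeholder guess.
sparkSession.conf.set("spark.sql.files.maxRecordsPerFile", 1000000L)

// Alternatively, repartition explicitly before the write; each partition
// produces (at least) one output file, so 20 is again a placeholder.
sqlDF.repartition(20)
  .write.mode(SaveMode.Overwrite)
  .parquet("s3://my-s3-location/")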
