I’m trying to copy parquet data from another s3 bucket to my s3 bucket. I want to limit the size of each partition to a max of 128 MB. I thought by default spark.sql.files.maxPartitionBytes would have been set to 128 MB, but when I look at the partition files in s3 after my copy I see individual partition files around 226 MB instead. I was looking at this post which suggested that I set this spark config key in order to limit the max size of my partitions: Limiting maximum size of dataframe partition but it doesn’t seem to work?
This is the definition of that config key:
The maximum number of bytes to pack into a single partition when
reading files. This configuration is effective only when using
file-based sources such as Parquet, JSON and ORC.
I’m also a bit confused how this relates to size of the written parquet files.
For reference, I am running a glue script on glue version 1.0, spark 2.4 and the script is this:
val conf: SparkConf = new SparkConf() conf.set("spark.sql.catalogImplementation", "hive") .set("spark.hadoop.hive.metastore.glue.catalogid", catalogId) val spark: SparkContext = new SparkContext(sparkConf) val glueContext: GlueContext = new GlueContext(spark) val sparkSession = glueContext.getSparkSession val sqlDF = sparkSession.sql("SELECT * FROM db.table where id='item1'") sqlDF.write.mode(SaveMode.Overwrite).parquet("s3://my-s3-location/")