pyspark writing jdbc times out
- Post author: Full Stack
- Post date: May 22, 2020
So basically I am using pyspark (jdbc format) to read tables from a database and then write that data to an Azure Data Lake. The code that I've written works, except for the very large tables (400k rows, 50 cols), which fail with the following error:

```
Py4JJavaError: An error occurred while calling o94.parquet.
: org.apache.spark.SparkException: Job aborted.
...
most recent failure: Lost task 23.0 in stage 2.0 (TID 25, 2612a419099c, executor driver):
com.microsoft.sqlserver.jdbc.SQLServerException: SQL Server returned an incomplete response.
The connection has been closed.
```

Since this looked like a memory problem, I increased the executor and driver memory to 10g each. However, the problem persisted. Here is my code:

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import year, month

spkConfig = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set(f"fs.azure.account.auth.type.{azStorageAccount}.dfs.core.windows.net", "SharedKey") \
    .set(f"fs.azure.account.key.{azStorageAccount}.dfs.core.windows.net", azStorageKey) \
    .set("spark.executor.memory", "10g") \
    .set("spark.driver.memory", "10g") \
    .set("spark.cores.max", "5")

spkContext = SparkContext(conf=spkConfig)
sqlContext = SQLContext(spkContext)
spark = sqlContext.sparkSession

##
# Read table from DB
##
def readTable(dbQuery, partitionCol, partitionSize, partitionUpperBound):
    jdbcDF = spark.read.format("jdbc") \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", f"jdbc:sqlserver://{dbHost}:{dbPort};databaseName={dbDatabase}") \
        .option("dbtable", dbQuery) \
        .option("user", dbUsername) \
        .option("password", dbPassword) \
        .option("queryTimeout", 10) \
        .option("numPartitions", partitionSize) \
        .option("partitionColumn", partitionCol) \
        .option("lowerBound", 1) \
        .option("upperBound", partitionUpperBound) \
        .load()
    return jdbcDF

##
# Write Dataframe as automatically partitioned parquet files for each month
##
def writeTable(tableName, tableDF, partitionCol):
    tableDF \
        .withColumn("year", year(partitionCol)) \
        .withColumn("month", month(partitionCol)) \
        .write.mode('overwrite') \
        .partitionBy('year', 'month') \
        .parquet(f"abfss://{azStorageContainer}@{azStorageAccount}.dfs.core.windows.net/" + tableName + ".parquet")
```
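For reference, here is a minimal sketch of how helpers like the ones above might be invoked end to end. The table name, the integer `id` partition column, the `created_at` timestamp column, and the `MAX(id)` lookup are illustrative assumptions, not details from the post; the connection variables (`dbHost`, `dbUsername`, etc.) are assumed to be defined as in the snippet above.

```python
# Illustrative usage sketch (assumptions, not from the original post):
# read one large table in parallel JDBC partitions, then write it out
# partitioned by year/month.
tableName = "Transactions"                         # hypothetical table
dbQuery = f"(SELECT * FROM {tableName}) AS src"    # JDBC 'dbtable' accepts a sub-query with an alias

# Upper bound for the partition column: with lowerBound=1, upperBound=maxId and
# numPartitions=N, Spark issues N range-bounded queries against SQL Server.
maxId = spark.read.format("jdbc") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", f"jdbc:sqlserver://{dbHost}:{dbPort};databaseName={dbDatabase}") \
    .option("dbtable", f"(SELECT MAX(id) AS max_id FROM {tableName}) AS b") \
    .option("user", dbUsername) \
    .option("password", dbPassword) \
    .load() \
    .first()["max_id"]

tableDF = readTable(dbQuery, partitionCol="id", partitionSize=8, partitionUpperBound=maxId)
writeTable(tableName, tableDF, partitionCol="created_at")
```

Each of the `numPartitions` range queries runs as its own task, so both the read parallelism and the per-query load on SQL Server scale with `partitionSize`.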