Apache Spark and Apache Hive are two widely used tools in the big data ecosystem: Spark provides fast, distributed data processing, while Hive offers a data warehousing layer built on top of Hadoop. In this blog post, we will explore how to write a table to Hive from Spark without using the warehouse connector in HDP 3.1.
Understanding the Problem
When attempting to write to a Hive table directly from Spark, bypassing the warehouse connector, you may encounter a specific issue. The following code snippet demonstrates the problem:
```
spark-shell --driver-memory 16g --master local[3] \
  --conf spark.hadoop.metastore.catalog.default=hive
```

Then, inside the shell:

```scala
val df = Seq(1, 2, 3, 4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")
```
Executing this code results in the following error:
```
Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional.
```
This error occurs because, in HDP 3.x, Hive creates managed tables as transactional (ACID) by default and enforces strict managed-table checks, while Spark cannot create transactional tables. The table Spark tries to create is therefore marked as managed but lacks the required transactional properties. However, we can overcome this issue by modifying our approach.
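Before restructuring the write entirely, one thing worth trying is to let saveAsTable create the table as unmanaged: when Spark is given an explicit path option, it registers the table as external rather than managed, and external tables are not subject to the transactional requirement. This is a minimal sketch; the location is a placeholder, and depending on your metastore configuration the strict checks may still reject it:

```scala
// Supplying an explicit path makes saveAsTable register an external (unmanaged)
// table instead of a managed one; the location below is a placeholder
df.write
  .option("path", "/tmp/foo/my_table_01")
  .saveAsTable("foo.my_table_01")
```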
An Alternative Approach
To write to a Hive table without using the warehouse connector and still be able to read the data later with Hive, we can take a slightly different approach. Let’s consider the following code snippet:
```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val df = Seq(1, 2, 3, 4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib")
  .mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")
```
In this code, we first add a new column called “part” (a copy of “value”) to our DataFrame and then write the data to the Hive table “foo.my_table_02”. We partition by “part”, set the compression codec to “zlib”, and store the data as ORC for efficient columnar storage. With this approach, the write from Spark succeeds.
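To see what Spark actually created, you can inspect the table metadata; a quick sketch:

```scala
// Lists the location, provider, and table properties of the new table;
// show(100, truncate = false) just keeps the output from being cut off
spark.sql("DESCRIBE FORMATTED foo.my_table_02").show(100, truncate = false)
```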
Querying the Table
Now, let’s move on to querying the table we just created. We can use the following code snippet to query the data from Hive:
```scala
spark.sql("select * from foo.my_table_02").show
```
Executing this code will display the contents of the table “foo.my_table_02” in the Spark console.
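Because the table is partitioned on “part”, a filter on that column is answered by reading only the matching partition directories; a small sketch:

```scala
// Spark prunes to the single matching partition at read time
spark.sql("select value from foo.my_table_02 where part = 2").show
```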
Dealing with Errors
While working with this approach, you may encounter the following error when trying to query the table using Hive or Beeline:
```
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1
```
The likely cause is that Hive treats the table as a transactional (ACID) table and expects ACID bucket metadata in the file layout, which Spark did not write, so the reader derives an invalid bucketId of -1. Unfortunately, there is no straightforward way to make Hive read such a table directly. However, there are some workarounds you can try.
Workarounds and Related Issues
One possible workaround is to create an external table in Hive instead of a managed table. External tables are exempt from the strict managed-table checks because they are neither managed nor transactional, which makes them a suitable alternative when ACID semantics are not required. To implement this workaround, you can use the following code snippet:
```scala
import org.apache.spark.sql.SaveMode

// Assumes df carries a partition column named "part_col",
// built analogously to the "part" column above
val externalPath = "/warehouse/tablespace/external/hive/foo.db/my_table"
df.write.partitionBy("part_col").option("compression", "zlib")
  .mode(SaveMode.Overwrite).orc(externalPath)

// Build the column list for the DDL, excluding the partition column
val columns = df.drop("part_col").schema.fields
  .map(field => s"${field.name} ${field.dataType.simpleString}")
  .mkString(", ")

val ddl =
  s"""
     |CREATE EXTERNAL TABLE foo.my_table ($columns)
     |PARTITIONED BY (part_col string)
     |STORED AS ORC
     |LOCATION '$externalPath'
   """.stripMargin
spark.sql(ddl)

// Register the partitions already present on disk with the metastore
spark.sql("MSCK REPAIR TABLE foo.my_table")
```
By creating an external table and specifying the appropriate column metadata, you can work around the limitations imposed by the managed table.
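To verify the workaround end to end, query the external table from Spark; because the data is plain (non-ACID) ORC, the same statement should now also work from Hive or Beeline:

```scala
// part_col is declared as string in the DDL, so the literal is quoted
spark.sql("select * from foo.my_table where part_col = '2'").show
```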
It’s worth mentioning that there are some related issues and bugs that have been reported by the community. These issues include problems with large data frames, the warehouse connector, and LLAP dependencies. While some proposed solutions and workarounds exist, they may not be suitable for all scenarios. Therefore, it is essential to consider these factors when implementing your solution.
Conclusion
In this blog post, we explored how to write a table to Hive from Spark without using the warehouse connector in HDP 3.1. We discussed the issue of managed tables lacking transactional properties and provided an alternative approach to writing data to Hive using Spark. Additionally, we covered some workarounds and related issues that you may encounter during your implementation.
By understanding the intricacies of Spark and Hive integration and leveraging the appropriate techniques, you can effectively write data to Hive and achieve seamless interoperability between these powerful data processing tools.