Categories
Mastering Development

While Creating/Selecting spark table/view getting error: The root scratch dir: /tmp/hive on HDFS should be writable

While creating/selecting a Spark table/view I get the error below: The root scratch dir: /tmp/hive on HDFS should be writable. Even listing the tables from the catalog raises the same error:

spark.catalog.listTables()
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python2.7/site-packages/pyspark/sql/catalog.py", line 83, in listTables
    iter = self._jcatalog.listTables(dbName).toLocalIterator()
  File "/usr/lib/python2.7/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
  File […]
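The usual remedy (an assumption here, not stated in the excerpt) is to make the scratch directory world-writable, e.g. `hdfs dfs -chmod -R 777 /tmp/hive` when it lives on HDFS. When Hive uses the local filesystem instead, the same idea can be sketched with the standard library alone; the temporary path below is a hypothetical stand-in for /tmp/hive:

```python
import os
import stat
import tempfile

# Hypothetical stand-in for the Hive scratch dir; the real path is /tmp/hive
# (on HDFS you would run: hdfs dfs -chmod -R 777 /tmp/hive).
scratch_dir = os.path.join(tempfile.mkdtemp(), "hive")
os.makedirs(scratch_dir)

# Make the scratch dir writable by everyone, mirroring chmod 777.
os.chmod(scratch_dir, 0o777)

mode = stat.S_IMODE(os.stat(scratch_dir).st_mode)
print(oct(mode))  # 0o777
```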

Categories
Development

How to transform dataframes to rdds in structured streaming?

I read data from Kafka using PySpark structured streaming, and the result is a DataFrame. When I transform the DataFrame to an RDD, it fails:

Traceback (most recent call last):
  File "/home/docs/dp_model/dp_algo_platform/dp_algo_core/test/test.py", line 36, in <module>
    df = df.rdd.map(lambda x: x.value.split(" ")).toDF()
  File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 91, in rdd
  File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/softs/spark-2.4.3-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line […]
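A streaming DataFrame does not support the .rdd conversion; the same transformation is normally expressed with pyspark.sql.functions.split on the value column instead. As a plain-Python sketch of what the failing lambda computes per record (the sample values are hypothetical):

```python
# Plain-Python sketch of what the failing lambda does to each Kafka record's
# value; in Spark itself the streaming-safe equivalent is
# pyspark.sql.functions.split(df.value, " ") rather than df.rdd.map(...).
records = ["hello world", "foo bar baz"]  # hypothetical decoded Kafka values

split_rows = [value.split(" ") for value in records]
print(split_rows)  # [['hello', 'world'], ['foo', 'bar', 'baz']]
```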

Categories
Development Java

VectorAssembler fails with java.util.NoSuchElementException: Param handleInvalid does not exist

When transforming an ML Pipeline that uses VectorAssembler, it fails with a "Param handleInvalid does not exist" error. Why does this happen? Am I missing something? I am new to PySpark. I am using the following code to combine a given list of columns into a single vector column: for categoricalCol in categoricalColumns: […]

Categories
Cache Development Java

Saving spark dataframe from azure databricks’ notebook job to azure blob storage causes java.lang.NoSuchMethodError

I have created a simple notebook job in Azure Databricks. I am trying to save a Spark DataFrame from the notebook to Azure Blob Storage. Sample code attached:

import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# Attached the spark-submit command used:
# spark-submit --master local[1] --packages org.apache.hadoop:hadoop-azure:2.7.2,
# com.microsoft.azure:azure-storage:3.1.0 ./write_to_blob_from_spark.py
[…]

Categories
Development

Azure databricks: KafkaUtils createDirectStream causes Py4JNetworkError(“Answer from Java side is empty”) error

In Azure Databricks, I tried to create a Kafka stream in a notebook and use it in a Spark job. Databricks throws an error at the line KafkaUtils.createDirectStream(). The corresponding code is attached below:

from kazoo.client import KazooClient
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = spark.sparkContext
ssc = StreamingContext(sc, 30)
print('SSC created:: {}'.format(ssc))
[…]

Categories
Development

Getting pyspark.sql.utils.ParseException while running sql query in pyspark

I am running a SQL query in PySpark and getting the error below. Can you please help me?

query = "select DENSE_RANK() OVER(ORDER BY PROD_NM, CNTRY) AS SYSTEM_ID, id AS SOURCE_ID, source_name, prod_nm, CNTRY, source_entity, entity_name from (SELECT distinct id, 'AMPIL' as SOURCE_NAME, prod_nm, 'PROD2' AS Source_Entity, 'PRODUCT' AS ENTITY_NAME, CASE WHEN OPRTNG_CMPNYS = 'Janssen Canada' THEN 'Canada' WHEN OPRTNG_CMPNYS LIKE 'Janssen US%' THEN 'United States' […]
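A common cause of a ParseException with copy-pasted SQL is curly quotes (‘ ’ “ ”) that a CMS or word processor substitutes for straight ASCII quotes; Spark's SQL parser rejects them. This is an assumption about the excerpt above, but a small stdlib helper that normalizes the quotes before calling spark.sql(query) looks like this:

```python
# Hypothetical helper: normalize curly quotes that word processors / CMSs
# substitute into SQL string literals, which Spark's parser rejects.
def normalize_quotes(sql: str) -> str:
    replacements = {
        "\u2018": "'",  # left single quote
        "\u2019": "'",  # right single quote
        "\u201c": '"',  # left double quote
        "\u201d": '"',  # right double quote
    }
    for curly, straight in replacements.items():
        sql = sql.replace(curly, straight)
    return sql

query = "select \u2018AMPIL\u2019 as SOURCE_NAME"
print(normalize_quotes(query))  # select 'AMPIL' as SOURCE_NAME
```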

Categories
Development

PyArrow error running PySpark in combination with Panda UDFS in PyCharm

Following is my code:

from pyspark.sql import SparkSession
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql import types as st

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([[x] for x in "first second third".split()], ['text'])

def foo(text: pd.Series) -> pd.Series:
    return text.transform(lambda x: x[::-1])

foo_udf = sf.pandas_udf(foo, functionType=sf.PandasUDFType.SCALAR, returnType=st.StringType())
ms = pd.Series(["firs", "second", […]
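Errors at this point typically come from the PyArrow layer (e.g. a missing or mismatched pyarrow install between driver and workers), not from the UDF body itself: the function is just a string reversal. A stdlib-only sketch of that core logic, independent of pandas, Spark, and PyArrow:

```python
# The transform inside foo, stripped of pandas/Spark: reverse each string.
def reverse_text(text: str) -> str:
    return text[::-1]

values = ["first", "second", "third"]
print([reverse_text(v) for v in values])  # ['tsrif', 'dnoces', 'driht']
```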

Categories
Development

Hive cannot be queried in Spark

I want to query Hive from PySpark, but I am having problems. Spark and Hive each work fine on their own, but when I try to use them together there is always a driver problem, and I can't find the root cause. I tried putting the SQL driver package in lib, but nothing changed. I tried […]

Categories
Development

fitting training data from decision tree regressor causes crash

I am trying to fit a decision tree regressor on some training data, but when I call fit() I get an error.

(trainingData, testData) = data.randomSplit([0.7, 0.3])
vecAssembler = VectorAssembler(inputCols=["_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_10"], outputCol="features")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="_11")
dt_model = dt.fit(trainingData)

This generates the error:

  File "spark.py", line 100, in <module>
[…]
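For context on the first line: randomSplit assigns each row independently, so the 70/30 proportions are approximate rather than exact. A plain-Python sketch of that behaviour (the row data is hypothetical):

```python
import random

# Plain-Python sketch of what data.randomSplit([0.7, 0.3]) does row-wise:
# each row independently lands in the training set with probability 0.7,
# so the split is approximate, just as in Spark.
random.seed(0)
rows = list(range(1000))
training = [r for r in rows if random.random() < 0.7]
test = [r for r in rows if r not in set(training)]
print(len(training) + len(test))  # 1000
```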

Categories
Ask

PySpark dataframe operation causes OutOfMemoryError

I'm just starting to experiment with PySpark/Spark and have run into an issue where my code is not working. I cannot find the problem, and Spark's error output is not very helpful. I found roughly the same questions on Stack Overflow, but none with a clear answer or solution (at least not for […]