Categories
Development Java

VectorAssembler fails with java.util.NoSuchElementException: Param handleInvalid does not exist

When transforming an ML Pipeline that uses VectorAssembler, it fails with a “Param handleInvalid does not exist” error. Why does this happen? Am I missing something? I am new to PySpark. I am using the following code to combine a given list of columns into a single vector column: for categoricalCol in categoricalColumns: […]

Categories
Cache Development Java

Saving spark dataframe from azure databricks’ notebook job to azure blob storage causes java.lang.NoSuchMethodError

I have created a simple job using a notebook in Azure Databricks. I am trying to save a Spark dataframe from the notebook to Azure Blob Storage. Sample code attached: import traceback from pyspark.sql import SparkSession from pyspark.sql.types import StringType # Attached the spark-submit command used # spark-submit --master local[1] --packages org.apache.hadoop:hadoop-azure:2.7.2, # com.microsoft.azure:azure-storage:3.1.0 ./write_to_blob_from_spark.py […]
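A java.lang.NoSuchMethodError when writing to wasb(s):// almost always means the hadoop-azure/azure-storage jars do not match the Hadoop version Spark was built against; hadoop-azure 2.7.x was compiled against the older azure-storage 2.x line, so pairing it with azure-storage 3.1.0 can hit missing methods at runtime. A hedged sketch of the submit command with aligned versions (the exact versions are illustrative; match them to your cluster's Hadoop build):

```shell
# Align hadoop-azure with the Hadoop bundled into Spark (shown by
# spark-submit --version), and use the azure-storage version that
# hadoop-azure declares as its dependency, e.g. for Hadoop 2.7.x:
spark-submit --master local[1] \
  --packages org.apache.hadoop:hadoop-azure:2.7.2,com.microsoft.azure:azure-storage:2.2.0 \
  ./write_to_blob_from_spark.py
```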

Categories
Development

Pyspark 2.4.3, Read Avro format message from Kafka – Pyspark Structured streaming

I am trying to read Avro messages from Kafka using PySpark 2.4.3. Based on the Stack Overflow link below, I am able to convert into Avro format (to_avro) and that code works as expected, but from_avro is not working and I get the issue below. Are there any other modules that support reading Avro messages streamed from […]

Categories
Development

Azure databricks: KafkaUtils createDirectStream causes Py4JNetworkError("Answer from Java side is empty") error

In Azure Databricks, I tried to create a Kafka stream in a notebook and used it to create a Spark job. Databricks throws an error at the line KafkaUtils.createDirectStream(). The corresponding code is attached below. from kazoo.client import KazooClient from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition sc = spark.sparkContext ssc = StreamingContext(sc, 30) print('SSC created:: {}'.format(ssc)) […]
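Py4JNetworkError("Answer from Java side is empty") means the JVM side crashed; with KafkaUtils.createDirectStream that is typically the missing spark-streaming-kafka-0-8 assembly jar, and the old DStream API is not supported on recent Databricks runtimes anyway. Structured Streaming avoids the extra jar because the Kafka connector ships with the runtime. A sketch (broker address and topic are placeholders):

```python
def kafka_stream(spark, servers, topic):
    # Structured Streaming reader for Kafka; the spark-sql-kafka
    # connector is bundled with Databricks runtimes, unlike the
    # DStream KafkaUtils path, which needs a separate assembly jar.
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", servers)
            .option("subscribe", topic)
            .load())

# df = kafka_stream(spark, "broker:9092", "my-topic")
# df.selectExpr("CAST(value AS STRING)")  # payload arrives as bytes
```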

Categories
Development

pyspark local – AWSCredentialsProviderChain using `~/.aws/credentials`

It seems Spark does not look at ~/.aws/credentials. I wonder if there is a way to make it work without the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, also because I am using IAM roles on AWS. Traceback (most recent call last): File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "/usr/lib/python3.7/runpy.py", line 85, in _run_code exec(code, […]

Categories
Development

PyArrow error running PySpark in combination with Panda UDFS in PyCharm

Following is my code: from pyspark.sql import SparkSession import pandas as pd from pyspark.sql import functions as sf from pyspark.sql import types as st spark = SparkSession.builder.getOrCreate() data = spark.createDataFrame([[x] for x in "first second third".split()], ['text']) def foo(text: pd.Series) -> pd.Series: return text.transform(lambda x: x[::-1]) foo_udf = sf.pandas_udf(foo, functionType=sf.PandasUDFType.SCALAR, returnType=st.StringType()) ms = pd.Series(["firs", "second", […]
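With Spark 2.4.x, pyarrow 0.15 changed the Arrow IPC stream format, so pandas UDFs crash with PyArrow errors even though the UDF itself is fine. The workaround documented in the Spark 2.4 docs is to set ARROW_PRE_0_15_IPC_FORMAT=1 before the JVM starts (or to pin pyarrow below 0.15). The UDF body is plain pandas and can be verified without Spark:

```python
import os
import pandas as pd

# Spark 2.4.x and pyarrow >= 0.15 disagree on the Arrow stream format;
# this env var (from the Spark 2.4 docs) must be set before the JVM
# starts, e.g. at the top of the script or in the run configuration.
# Alternatively: pip install 'pyarrow<0.15'.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"


def foo(text: pd.Series) -> pd.Series:
    # same logic as the pandas UDF body: reverse each string
    return text.transform(lambda x: x[::-1])


print(foo(pd.Series(["first", "second"])).tolist())  # ['tsrif', 'dnoces']
```

Checking the function on a plain pd.Series first, as the question does, is a good way to separate UDF bugs from Arrow serialization problems.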

Categories
Development

pyspark truncate table without overwrite

I need to truncate a table before inserting new data. I have the following code to insert: df.write.jdbc(dbUrl, self._loadDb, "append", self._props['dbProps']) This works great, except I want an empty database. I know about setting the mode to overwrite and adding .option('truncate', True), but this is not what I want. Put another way: how can […]
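Spark's JDBC writer has no truncate-then-append mode, so the usual answer is to issue TRUNCATE TABLE yourself over a plain JDBC connection before the append. This sketch goes through the driver JVM's java.sql.DriverManager so no extra Python DB driver is needed; the table name and credential keys are placeholders:

```python
def truncate_table(spark, db_url, table, user, password):
    # Issue a plain-JDBC TRUNCATE before df.write(..., "append");
    # unlike mode="overwrite", this leaves the table definition,
    # indexes and grants untouched and only removes the rows.
    jvm = spark.sparkContext._jvm
    conn = jvm.java.sql.DriverManager.getConnection(db_url, user, password)
    try:
        conn.createStatement().executeUpdate("TRUNCATE TABLE " + table)
    finally:
        conn.close()

# truncate_table(spark, dbUrl, "my_table", props["user"], props["password"])
# df.write.jdbc(dbUrl, "my_table", mode="append", properties=props)
```

The JDBC driver jar must already be on Spark's classpath (it is, if df.write.jdbc works), since DriverManager resolves the driver from the same JVM.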

Categories
Development

Hive cannot be queried in Spark

I want to query Hive in PySpark, but I have some problems. My Spark and Hive work fine on their own, but when I try to use them together there is always a driver problem, or I can't find the root cause. I tried to put the SQL driver package in lib, but nothing changed. I tried […]

Categories
Ask

PySpark dataframe operation causes OutOfMemoryError

I'm just starting to experiment with PySpark/Spark and have run into the issue that my code is not working. I cannot find the problem, and the error output from Spark is not very helpful. I do find more or less the same questions on Stack Overflow, but none with a clear answer or solution (at least not for […]