Hi,
I'm trying to use PyDeequ, following the installation steps here - https://pydeequ.readthedocs.io/en/latest/README.html#installation
1.
pip install pydeequ
2.
import os
# Set the SPARK_VERSION environment variable
os.environ['SPARK_VERSION'] = '3.3'
3.
from pyspark.sql import SparkSession, Row
import pydeequ
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None)]).toDF()
4.
from pydeequ.analyzers import *
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[53], line 4
1 from pydeequ.analyzers import *
3 analysisResult = AnalysisRunner(spark) \
----> 4 .onData(df) \
5 .addAnalyzer(Size()) \
6 .addAnalyzer(Completeness("b")) \
7 .run()
9 analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
10 analysisResult_df.show()
File ~/cluster-env/trident_env/lib/python3.10/site-packages/pydeequ/analyzers.py:52, in AnalysisRunner.onData(self, df)
46 """
47 Starting point to construct an AnalysisRun.
48 :param dataFrame df: tabular data on which the checks should be verified
49 :return: new AnalysisRunBuilder object
50 """
51 df = ensure_pyspark_df(self._spark_session, df)
---> 52 return AnalysisRunBuilder(self._spark_session, df)
File ~/cluster-env/trident_env/lib/python3.10/site-packages/pydeequ/analyzers.py:124, in AnalysisRunBuilder.__init__(self, spark_session, df)
122 self._jspark_session = spark_session._jsparkSession
123 self._df = df
--> 124 self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBuilder(df._jdf)
TypeError: 'JavaPackage' object is not callable
Did I miss any installation or setup step?
Hi @Anonymous ,
We had the latest version of PyDeequ, and we managed to solve it by adding a Spark property to the environment. See the screenshot below.
Hi @russelp ,
The "'JavaPackage' object is not callable" error usually means that the referenced Java/Scala class could not be found on the JVM classpath. In other words, the Deequ library was not loaded correctly into the Spark session.
There are several things you can check for the problem:
1. Make sure you are using compatible versions of Spark and Deequ.
2. Make sure PyDeequ is correctly installed and up to date. This can be checked with the following command:
   pip show pydeequ
3. PyDeequ can be reinstalled with the following command:
   pip install --upgrade pydeequ
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!