Hi,
I'm trying to use PyDeequ, following the installation steps here: https://pydeequ.readthedocs.io/en/latest/README.html#installation
1. Install the package:

   pip install pydeequ

2. Set the SPARK_VERSION environment variable before importing pydeequ:

   import os
   # Set the SPARK_VERSION environment variable
   os.environ['SPARK_VERSION'] = '3.3'

3. Create the Spark session and a sample DataFrame:

   from pyspark.sql import SparkSession, Row
   import pydeequ

   spark = (SparkSession
       .builder
       .config("spark.jars.packages", pydeequ.deequ_maven_coord)
       .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
       .getOrCreate())

   df = spark.sparkContext.parallelize([
       Row(a="foo", b=1, c=5),
       Row(a="bar", b=2, c=6),
       Row(a="baz", b=3, c=None)]).toDF()

4. Run the analyzers:

   from pydeequ.analyzers import *

   analysisResult = AnalysisRunner(spark) \
       .onData(df) \
       .addAnalyzer(Size()) \
       .addAnalyzer(Completeness("b")) \
       .run()

   analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
   analysisResult_df.show()
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[53], line 4
1 from pydeequ.analyzers import *
3 analysisResult = AnalysisRunner(spark) \
----> 4 .onData(df) \
5 .addAnalyzer(Size()) \
6 .addAnalyzer(Completeness("b")) \
7 .run()
9 analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
10 analysisResult_df.show()
File ~/cluster-env/trident_env/lib/python3.10/site-packages/pydeequ/analyzers.py:52, in AnalysisRunner.onData(self, df)
46 """
47 Starting point to construct an AnalysisRun.
48 :param dataFrame df: tabular data on which the checks should be verified
49 :return: new AnalysisRunBuilder object
50 """
51 df = ensure_pyspark_df(self._spark_session, df)
---> 52 return AnalysisRunBuilder(self._spark_session, df)
File ~/cluster-env/trident_env/lib/python3.10/site-packages/pydeequ/analyzers.py:124, in AnalysisRunBuilder.__init__(self, spark_session, df)
122 self._jspark_session = spark_session._jsparkSession
123 self._df = df
--> 124 self._AnalysisRunBuilder = self._jvm.com.amazon.deequ.analyzers.runners.AnalysisRunBuilder(df._jdf)
TypeError: 'JavaPackage' object is not callable
Did I miss an installation or setup step?
Hi @Anonymous ,
We had the latest version of PyDeequ and managed to solve it by adding a Spark property to the environment (screenshot not included here).
I have installed the PyPI library in my environment; the latest available version of pydeequ is 1.5.0. I added the Spark property com.amazon.deequ:deequ:1.5.0-spark-3.5, but after adding it I am no longer able to connect to the Spark session.
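Deequ jars are published per Spark minor version (the coordinate ends in -spark-X.Y), so the value of spark.jars.packages has to carry the same major.minor version as the runtime. A minimal sketch of building such a coordinate; the helper name and the Deequ version shown are illustrative, not from this thread:

```python
def deequ_maven_coord(deequ_version: str, spark_version: str) -> str:
    """Build the Deequ Maven coordinate for spark.jars.packages.

    Deequ artifacts are published per Spark minor version, so the
    coordinate must carry the runtime's major.minor Spark version.
    """
    spark_minor = ".".join(spark_version.split(".")[:2])  # "3.5.0" -> "3.5"
    return f"com.amazon.deequ:deequ:{deequ_version}-spark-{spark_minor}"

# The coordinate from this thread, rebuilt from its parts:
print(deequ_maven_coord("1.5.0", "3.5.0"))
# com.amazon.deequ:deequ:1.5.0-spark-3.5
```

If the session fails to start after adding the property, check the driver logs for jar resolution failures; a coordinate whose -spark-X.Y suffix does not match the runtime will not resolve to a usable jar.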
Hi @russelp ,
The "'JavaPackage' object is not callable" error usually means that the referenced Java/Scala package was not found, i.e. the Deequ library was not loaded into the Spark session's JVM.
There are several things you can check:
Make sure you are using compatible versions of Spark and Deequ.
Make sure PyDeequ is correctly installed and up to date. This can be checked with the following command:
pip show pydeequ
PyDeequ can be reinstalled with the following command:
pip install --upgrade pydeequ
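One frequent mismatch behind this checklist is the SPARK_VERSION environment variable (which pydeequ reads to pick its Deequ coordinate) disagreeing with the cluster's actual Spark version. A small sketch of that check; the helper name is illustrative:

```python
import os

def spark_versions_match(declared: str, runtime: str) -> bool:
    """Compare two Spark version strings on major.minor only,
    since Deequ jars are published per Spark minor version."""
    minor = lambda v: tuple(v.split(".")[:2])
    return minor(declared) == minor(runtime)

# Declared via the env var pydeequ reads, vs. the runtime version
# (spark.version on a live session); shown here with literals.
os.environ["SPARK_VERSION"] = "3.3"
print(spark_versions_match(os.environ["SPARK_VERSION"], "3.3.2"))  # True
print(spark_versions_match(os.environ["SPARK_VERSION"], "3.5.1"))  # False
```

On a live session you would pass spark.version as the second argument; a False result means the wrong Deequ jar was requested, which typically surfaces as exactly the "'JavaPackage' object is not callable" error.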
Best Regards,
Yang
Community Support Team
If any post helped, please consider accepting it as the solution to help other members find it more quickly.
If I misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!