Hi Amit,
There are a couple of reasons why this might be happening:
- Bucketing and Sorting Columns: In your write.bucketBy call you specified 'state', 'age', and 'salary' as the bucketing columns, but in sortBy you only specified 'bonus'. sortBy controls how rows are sorted within each bucket file, so if you want the data laid out by the bucketing columns and then by 'bonus', list all of those columns in sortBy.
Here's an example that sorts each bucket by the bucketing columns first and then by 'bonus':
df1.write.bucketBy(3, 'state', 'age', 'salary').sortBy('state', 'age', 'salary', 'bonus').mode("overwrite").saveAsTable('df_EmployeeTable_bucketBy')
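For reference, these snippets assume a SparkSession named ss1 and a DataFrame df1 with the columns state, age, salary, and bonus (the names used in the snippets above and below). A minimal setup sketch with made-up sample rows would look like this; adjust it to your actual data:
from pyspark.sql import SparkSession

# Assumed names: ss1 and df1 match the snippets in this answer
ss1 = SparkSession.builder.appName("bucketing-example").getOrCreate()

# Small, made-up employee dataset with the columns referenced above
data = [
    ("NY", 30, 60000, 5000),
    ("CA", 41, 80000, 7000),
    ("NY", 25, 50000, 3000),
]
df1 = ss1.createDataFrame(data, ["state", "age", "salary", "bonus"])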
- Bucketing and Sorting Order: By default, Spark sorts the data within each bucket in ascending order. Note that the writer's sortBy only accepts column names, so you can't pass the asc or desc expressions from the pyspark.sql.functions module to it. If you want a specific ascending or descending order when you read the data back, apply it at query time with orderBy (or an ORDER BY clause in SQL).
Example:
from pyspark.sql.functions import asc, desc
ss1.table('df_EmployeeTable_bucketBy').orderBy(asc('state'), asc('age'), asc('salary'), desc('bonus')).show()
Make sure that 'bonus' is included in the sortBy column list when you write the table. Also keep in mind that bucketing and bucket-level sorting only control how the data is laid out on disk; a plain SELECT does not guarantee any row order, so when querying the table you still need an explicit ORDER BY clause to see the results in the desired order:
ss1.sql("SELECT * FROM df_EmployeeTable_bucketBy ORDER BY state, age, salary, bonus").collect()
This should help you achieve the desired sorting order when querying the table.
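Finally, if you want to confirm that the table was actually saved with the bucketing and sort specification you asked for, you can check the table metadata. A small sketch, assuming the same session variable ss1 and table name used above:
# The detailed output includes Num Buckets, Bucket Columns and Sort Columns
ss1.sql("DESCRIBE TABLE EXTENDED df_EmployeeTable_bucketBy").show(50, truncate=False)
The Bucket Columns and Sort Columns rows in that output should match what you passed to bucketBy and sortBy.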