Where to find table saved with 'saveAsTable' in pyspark

Hi Team,

I ran the below code in Jupyter for pyspark successfully -

df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')

Now where can I see this saved 'df_EmployeeTable_bucketBy'?

In putty I ran ‘hadoop fs -ls’ and it was not listed there. In hive terminal I ran ‘show tables;’ and it was not there also.

Hi Amit,

saveAsTable writes the table under the Hive warehouse directory, so you will find it at "/apps/hive/warehouse/df_employeetable_bucketby" (the table name is stored in lowercase). A plain 'hadoop fs -ls' only lists your HDFS home directory, which is why it was not shown there. You can run the below command to see all the parquet files created by bucketBy:

hadoop fs -ls /apps/hive/warehouse/df_employeetable_bucketby
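
If you want to check from inside Spark instead of from Putty, a quick way (just a sketch, assuming your SparkSession variable is named ss1; use whatever you called yours) is:

# Lists the tables Spark's catalog knows about in the current database.
ss1.sql("SHOW TABLES").show()

# Shows the table's location under the warehouse directory, along with
# its bucketing and sort columns.
ss1.sql("DESCRIBE FORMATTED df_employeetable_bucketby").show(truncate=False)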

Thanks Shubh !!!

One more doubt.

When I delete 'df_EmployeeTable_bucketBy' by running "hadoop fs -rm -R /apps/hive/warehouse/df_employeetable_bucketby" in Putty and then run "df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')" in Jupyter again, I get an error that 'df_EmployeeTable_bucketBy' already exists.

Why is it so when I have already deleted 'df_EmployeeTable_bucketBy'?

This happens because saveAsTable also registers the table in the Hive metastore. Running 'hadoop fs -rm -R' only removes the data files under the warehouse directory; the metastore entry for 'df_employeetable_bucketby' is still there, so Spark reports that the table already exists. Make sure the table itself is dropped, not just the directory, and restart the Jupyter kernel after dropping it if the error persists.
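
The cleaner way is to drop the table through Spark, so the metastore entry and the data files go away together. A minimal sketch, assuming your SparkSession variable is ss1 as in your other snippets:

# Removes the metastore entry and, because saveAsTable created a managed
# table, the files under /apps/hive/warehouse as well.
ss1.sql("DROP TABLE IF EXISTS df_EmployeeTable_bucketBy")

# The next saveAsTable can then recreate the table without the
# "already exists" error.
df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')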

I ran the below -

df1.write.bucketBy(3, 'state', 'age', 'salary').sortBy('bonus').mode("overwrite").saveAsTable('df_EmployeeTable_bucketBy')

And then, to see the data, I ran the below -

ss1.sql("select * from df_employeetable_bucketby").collect()

but the output was not in sorted order (on 'bonus'). Why is it so?

Hi Amit,

There are a couple of reasons why this might be happening:

  1. Bucketing and Sorting Columns:
    In your write.bucketBy call, 'state', 'age', and 'salary' are the bucketing columns and 'bonus' is the sorting column. That combination is valid; the sorting column does not have to be one of the bucketing columns. The catch is that sortBy sorts the rows within each bucket file, not across the whole table. With 3 buckets Spark writes several files, each sorted by 'bonus' on its own, so reading them back with a plain SELECT * returns the files in an arbitrary order and the combined result is not globally sorted.

    Your original write is therefore fine as it is:

    df1.write.bucketBy(3, 'state', 'age', 'salary').sortBy('bonus').mode("overwrite").saveAsTable('df_EmployeeTable_bucketBy')
    
  2. Bucketing and Sorting Order:
    DataFrameWriter.sortBy takes plain column names and sorts ascending within each bucket; it does not accept the asc or desc functions from the pyspark.sql.functions module. Those functions are for query-time sorting, for example with DataFrame.orderBy.

    Example:

    from pyspark.sql.functions import asc

    ss1.table('df_EmployeeTable_bucketBy').orderBy(asc('bonus')).collect()
    

Because the sort from sortBy only holds within each bucket file, use an explicit ORDER BY clause when querying the table to see the results in the desired order:

ss1.sql("SELECT * FROM df_EmployeeTable_bucketBy ORDER BY bonus").collect()

This should help you achieve the desired sorting order when querying the table.
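
If you want to convince yourself that the per-bucket sort did happen, one way (again just a sketch) is to tag each row with the file it came from; within any single file the 'bonus' values should come back in ascending order, even though the overall output has no global order:

from pyspark.sql.functions import input_file_name

# Tag each row with its source parquet file; rows that share a file
# were sorted by 'bonus' when that bucket was written.
ss1.table('df_EmployeeTable_bucketBy') \
    .withColumn('source_file', input_file_name()) \
    .select('source_file', 'bonus') \
    .show(truncate=False)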