Where to find table saved with 'saveAsTable' in pyspark

Hi Team,

I ran the below code in Jupyter for pyspark successfully -

df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')

Now where can I see this saved 'df_EmployeeTable_bucketBy'?

In putty I ran ‘hadoop fs -ls’ and it was not listed there. In hive terminal I ran ‘show tables;’ and it was not there also.

Hi Amit,

saveAsTable writes the table under the Hive warehouse directory, so you will find it at "/apps/hive/warehouse/df_employeetable_bucketby" (the table name is stored in lowercase). A plain 'hadoop fs -ls' only lists your HDFS home directory, which is why it was not shown there. You can run the below command to see all the parquet files created by bucketBy:

hadoop fs -ls /apps/hive/warehouse/df_employeetable_bucketby
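
If you want to check from inside Spark instead of from Putty, a quick way (just a sketch, assuming your SparkSession variable is named ss1; use whatever you called yours) is:

# Lists the tables Spark's catalog knows about in the current database.
ss1.sql("SHOW TABLES").show()

# Shows the table's location under the warehouse directory, along with
# its bucketing and sort columns.
ss1.sql("DESCRIBE FORMATTED df_employeetable_bucketby").show(truncate=False)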

Thanks Shubh !!!

One more doubt.

When I delete 'df_EmployeeTable_bucketBy' by running "hadoop fs -rm -R /apps/hive/warehouse/df_employeetable_bucketby" in Putty and then run "df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')" in Jupyter again, I get an error that 'df_EmployeeTable_bucketBy' already exists.

Why is it so when I have already deleted 'df_EmployeeTable_bucketBy'?

This happens because saveAsTable also registers the table in the Hive metastore. Running 'hadoop fs -rm -R' only removes the data files under the warehouse directory; the metastore entry for 'df_employeetable_bucketby' is still there, so Spark reports that the table already exists. Make sure the table itself is dropped, not just the directory, and restart the Jupyter kernel after dropping it if the error persists.
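
The cleaner way is to drop the table through Spark, so the metastore entry and the data files go away together. A minimal sketch, assuming your SparkSession variable is ss1 as in your other snippets:

# Removes the metastore entry and, because saveAsTable created a managed
# table, the files under /apps/hive/warehouse as well.
ss1.sql("DROP TABLE IF EXISTS df_EmployeeTable_bucketBy")

# The next saveAsTable can then recreate the table without the
# "already exists" error.
df1.write.bucketBy(4, 'state').saveAsTable('df_EmployeeTable_bucketBy')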

I ran the below -

df1.write.bucketBy(3, 'state', 'age', 'salary').sortBy('bonus').mode("overwrite").saveAsTable('df_EmployeeTable_bucketBy')

And then, to see the data, I ran the below -

ss1.sql("select * from df_employeetable_bucketby").collect()

but the output was not in sorted order (on 'bonus'). Why is it so?

Hi Amit,

There are a couple of reasons why this might be happening:

  1. Bucketing and Sorting Columns:
    In your write.bucketBy call, 'state', 'age', and 'salary' are the bucketing columns and 'bonus' is the sorting column. That combination is valid; the sorting column does not have to be one of the bucketing columns. The catch is that sortBy sorts the rows within each bucket file, not across the whole table. With 3 buckets Spark writes several files, each sorted by 'bonus' on its own, so reading them back with a plain SELECT * returns the files in an arbitrary order and the combined result is not globally sorted.

    Your original write is therefore fine as it is:

    df1.write.bucketBy(3, 'state', 'age', 'salary').sortBy('bonus').mode("overwrite").saveAsTable('df_EmployeeTable_bucketBy')
    
  2. Bucketing and Sorting Order:
    DataFrameWriter.sortBy takes plain column names and sorts ascending within each bucket; it does not accept the asc or desc functions from the pyspark.sql.functions module. Those functions are for query-time sorting, for example with DataFrame.orderBy.

    Example:

    from pyspark.sql.functions import asc

    ss1.table('df_EmployeeTable_bucketBy').orderBy(asc('bonus')).collect()
    

Because the sort from sortBy only holds within each bucket file, use an explicit ORDER BY clause when querying the table to see the results in the desired order:

ss1.sql("SELECT * FROM df_EmployeeTable_bucketBy ORDER BY bonus").collect()

This should help you achieve the desired sorting order when querying the table.
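
If you want to convince yourself that the per-bucket sort did happen, one way (again just a sketch) is to tag each row with the file it came from; within any single file the 'bonus' values should come back in ascending order, even though the overall output has no global order:

from pyspark.sql.functions import input_file_name

# Tag each row with its source parquet file; rows that share a file
# were sorted by 'bonus' when that bucket was written.
ss1.table('df_EmployeeTable_bucketBy') \
    .withColumn('source_file', input_file_name()) \
    .select('source_file', 'bonus') \
    .show(truncate=False)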