Scenario: How to copy a DataFrame column value into a variable in PySpark?

Hi,

I am struggling to figure out a way to solve the requirement below in PySpark. Any help would be really appreciated.

Requirement:
Read a date column value from a Hive table and pass that dynamic value as a date extension in the file name while writing out a CSV file.

Ex:
Step 1: Below is the sample SQL against Hive. Assume this will always return exactly one value/cell.
results = spark.sql("Select ETL_FORM_DT From ETL_BATCH Where Btch_status='S'")

Step 2: Then I want to assign the above output to a variable, like
v_etl_frm_dt = results.select("ETL_FORM_DT")
** Here, v_etl_frm_dt is created as a DataFrame, not a scalar value.
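For illustration, a quick type check confirms the variable is a DataFrame (output shown as a comment):

print(type(v_etl_frm_dt))
# <class 'pyspark.sql.dataframe.DataFrame'>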

Step 3:
out_data.coalesce(1).write.mode('overwrite').option("quote", "").option("emptyValue", None).csv("file:///home/pdc19883858/out_test_" + v_etl_frm_dt + ".csv")

** If I use it as above, I get an error: cannot concatenate 'str' and 'DataFrame' objects.

So I guess that if I can read the above query output as a date/int type rather than as a DataFrame, the problem is solved... but I am still figuring out how!

How can I achieve this? Please let me know if my question is not clear.

Thanks in advance.

I somehow figured it out by doing the following:

results = spark.sql("Select ETL_FORM_DT From ETL_BATCH Where Btch_status='S'")

# f is pyspark.sql.functions
from pyspark.sql import functions as f
date_param = results.select(f.max('ETL_FORM_DT')).first()[0]

print(type(date_param))
--> <type 'datetime.date'>
print(date_param)
--> 2020-10-03

and I was able to pass that value into my CSV file name as below (wrapped in str(), since date_param is a datetime.date). It worked as required.
.csv("file:///home/pdc19883858/out_test_" + str(date_param) + ".csv")
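As a small variation (just a sketch), the date can also be formatted explicitly instead of relying on str(), so the file-name format does not depend on the default 'YYYY-MM-DD' representation; the "%Y%m%d" format here is only an example choice:

date_str = date_param.strftime("%Y%m%d")  # e.g. "20201003"
out_data.coalesce(1).write.mode('overwrite').option("quote", "").option("emptyValue", None).csv("file:///home/pdc19883858/out_test_" + date_str + ".csv")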

I would like to know if this is the right approach. Please let me know if you have any better solutions.

Thank you!

Yes. Based on my understanding of your requirement, this seems to be the right approach.

Spark DataFrames and RDDs are lazy: a DataFrame does not hold the values itself, it holds a reference to the computation that produces them. To actually fetch the data, you need to call an action on the DataFrame or RDD, such as take(), collect(), or first().
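For illustration, here is a minimal sketch of the difference, reusing the query result from above:

# transformations are lazy: this only builds a plan, no data is fetched yet
selected = results.select("ETL_FORM_DT")

# actions trigger execution and return local Python objects
row = selected.first()         # a single Row, e.g. Row(ETL_FORM_DT=datetime.date(2020, 10, 3))
rows = selected.take(1)        # a list with at most one Row
all_rows = selected.collect()  # a list of all Rows (be careful with large results)

value = row[0]                 # plain Python value extracted from the Row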

I hope this clears your doubt.

Yes, it is clear now. Thank you!

Also, can you please look into the pandas issue in PySpark? The link is below. I had to use a venv to get it to work, but I want to avoid activating and deactivating it every time I log in. If you can install it on your side, it would be useful for everyone.
Note: I am able to import pandas in a plain Python program, but I get an error while trying to import it in a PySpark program. Screenshots are available in the link below.
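One approach sometimes used to avoid activating the venv manually (a sketch; the venv path below is hypothetical) is to point PySpark at the venv's interpreter via the PYSPARK_PYTHON environment variable before the SparkSession is created:

import os

# Hypothetical venv location; replace with your actual path.
# This must be set before the SparkSession/SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/home/pdc19883858/venv/bin/python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()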

Thank you,
Dileep

It is usually hard to install a library cluster-wide because it impacts all users. Also, I am afraid that pandas might interfere with normal Spark DataFrame usage.

We will evaluate it and do the needful. If you are able to use a venv, I think you are good to go.