Hive tblproperties not working from PySpark


#1

Hi,

I have a Hive table defined with tblproperties("skip.header.line.count"="1").
When I query it (select *) from Hive itself, it shows the proper result, but when I execute the same select from PySpark, the header is not skipped.

Could you please help?

df = sqlContext.sql("select * from sde_db.cust_data_master")
df.show(2)

+-------+----------+----------+---------+-------------+-------+-----+-----+------+------------+------------+--------------------+-------+
|cust_id|    biz_dt|first_name|last_name|      address|   city|state| post|phone1|      phone2|       email|                 web|country|
+-------+----------+----------+---------+-------------+-------+-----+-----+------+------------+------------+--------------------+-------+
|   null|      null|first_name|last_name|      address|country| city|state|  post|      phone1|      phone2|               email|     au|
|      1|2018-01-09|  Rebbecca|    Didio|171 E 24th St|     AU|Leith|   TA|  7315|03-8174-9123|0458-665-290|rebbecca.didio@di...|     au|
+-------+----------+----------+---------+-------------+-------+-----+-----+------+------------+------------+--------------------+-------+


#2

Hi Sayandeep,

Very good observation.

It looks like the problem does not occur when vectorization is turned off: with vectorized execution disabled, the reader is no longer vectorized. You can disable it by calling:

spark.sql("set hive.vectorized.execution.enabled=false;")

You can take a look at this known Hive issue: https://issues.apache.org/jira/browse/HIVE-19943


#3

Thanks Sandeep. But it's still the same.

spark.sql("SET hive.vectorized.execution.enabled=false;")
19/03/06 10:39:32 WARN SetCommand: 'SET hive.vectorized.execution.enabled=false;' might not work, since Spark doesn't support changing the Hive config dynamically. Please pass the Hive-specific config by adding the prefix spark.hadoop (e.g. spark.hadoop.hive.vectorized.execution.enabled) when starting a Spark application. For details, see the link: https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
DataFrame[key: string, value: string]

spark.sql("SET spark.hadoop.hive.vectorized.execution.enabled=false;")
DataFrame[key: string, value: string]
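
The warning above asks for the Hive setting to be passed when the Spark application starts, not through SET in a running session. A minimal sketch of that, assuming a fresh Python process with no session yet: the helper names hadoop_prefixed and build_session and the app name are hypothetical, and nothing in this thread confirms that the setting changes how the header row is handled.

```python
# Sketch only: pass Hive settings at session startup via the spark.hadoop prefix.
# hadoop_prefixed and build_session are hypothetical helper names.

def hadoop_prefixed(hive_conf):
    """Return a copy of hive_conf with 'spark.hadoop.' prepended to each key."""
    return {"spark.hadoop." + key: value for key, value in hive_conf.items()}


def build_session(hive_conf, app_name="skip-header-check"):
    """Create a Hive-enabled SparkSession with the prefixed settings applied."""
    # Imported here so hadoop_prefixed can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in hadoop_prefixed(hive_conf).items():
        builder = builder.config(key, value)
    # getOrCreate() returns any session that already exists, so this should
    # run in a fresh application for the settings to be applied at startup.
    return builder.getOrCreate()
```

Calling build_session({"hive.vectorized.execution.enabled": "false"}) and rerunning the select would show whether the startup setting makes a difference. The same key can also be given on the command line as --conf spark.hadoop.hive.vectorized.execution.enabled=false when launching pyspark or spark-submit.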

spark.sql("select * from sde_db.cust_data_master limit 5").show()
19/03/06 10:39:58 WARN LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------+----------+----------+---------+--------------------+-------+--------+-----+------+------------+------------+--------------------+-------+
|cust_id|    biz_dt|first_name|last_name|             address|   city|   state| post|phone1|      phone2|       email|                 web|country|
+-------+----------+----------+---------+--------------------+-------+--------+-----+------+------------+------------+--------------------+-------+
|   null|      null|first_name|last_name|             address|country|    city|state|  post|      phone1|      phone2|               email|     au|
|      1|2018-01-09|  Rebbecca|    Didio|       171 E 24th St|     AU|   Leith|   TA|  7315|03-8174-9123|0458-665-290|rebbecca.didio@di...|     au|
|      2|2018-01-09|    Stevie|    Hallo|      22222 Acoma St|     AU| Proston|   QL|  4613|07-9997-3366|0497-622-620|stevie.hallo@hotm...|     au|
|      3|2018-01-09|    Mariko|   Stayer|534 Schoenborn St...|     AU|   Hamel|   WA|  6215|08-5558-9019|0427-885-282|mariko_stayer@hot...|     au|
|      4|2018-01-09|   Gerardo|   Woodka|   69206 Jackson Ave|     AU|Talmalmo|   NS|  2640|02-6044-4682|0443-795-912|gerardo_woodka@ho...|     au|
+-------+----------+----------+---------+--------------------+-------+--------+-----+------+------------+------------+--------------------+-------+
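
In the output above, the header line shows up as a data row whose first_name column holds the literal text first_name. As a possible stopgap while the table property is not honored, that row could be filtered out in the query itself. This is only a sketch: the helper names header_filter_query and select_without_header are hypothetical, and it assumes no real record has the column's own name as its value.

```python
# Sketch only: drop the header row by filtering on a column that repeats its own name.
# header_filter_query and select_without_header are hypothetical helper names.

def header_filter_query(table, column):
    """Build a query that skips rows where `column` contains its own name."""
    # <=> is null-safe equality in Spark SQL, so rows with NULL in the
    # column are kept instead of being dropped by the comparison.
    return "select * from {t} where not ({c} <=> '{c}')".format(t=table, c=column)


def select_without_header(spark, table, column):
    """Run the filtered query on an existing SparkSession."""
    return spark.sql(header_filter_query(table, column))
```

For the table in this thread, select_without_header(spark, "sde_db.cust_data_master", "first_name").show(5) would run the filtered select. The filter hides the symptom only; the header line is still read from the underlying file.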