"Cannot resolve column name given input columns" error on a PySpark DataFrame

Hello,

I have created the following DataFrame:

df = spark.read.csv("file:///home/pratik58892973/olist_sellers_dataset.csv", header="True", sep="|")

While I’m able to fetch data:
df.show(5)

and display the schema:
df.printSchema()

root
|-- seller_id,seller_zip_code_prefix,seller_city,seller_state: string (nullable = true)

However, it throws an error when I try to select a single column from the DataFrame:

df.select("seller_city").show()

Error:
AnalysisException: "cannot resolve 'seller_city' given input columns: [seller_id,seller_zip_code_prefix,seller_city,seller_state];;\n'Project ['seller_city]\n+- Relation[seller_id,seller_zip_code_prefix,seller_city,seller_state#10] csv\n"

Can anyone suggest a solution?

My problem was similar: cannot resolve 'csu_5g_base_user_mon.c1249' given input columns. The cause was the '.' character in the column name, which select() treats as a struct-field accessor, so I had to remove or replace the dot (escaping the full name in backticks also works). Hope this helps you resolve your problem.
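For that dot-in-column-name variant, here is a minimal sketch (the column name is taken from the error above; the sample data is made up) showing both the backtick escape and the rename workaround:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical one-row DataFrame whose column name contains a dot
df = spark.createDataFrame([(1,)], ["csu_5g_base_user_mon.c1249"])

# A plain select() fails, because the dot is parsed as struct.field access:
# df.select("csu_5g_base_user_mon.c1249")   # -> AnalysisException

# Option 1: escape the whole name in backticks
df.select("`csu_5g_base_user_mon.c1249`").show()

# Option 2: rename the column so the dot disappears
df.withColumnRenamed("csu_5g_base_user_mon.c1249", "csu_5g_base_user_mon_c1249") \
  .select("csu_5g_base_user_mon_c1249").show()
```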

When you read in the CSV file, make sure to use the right separator; it may be ";" or ",". Your current schema shows all four fields collapsed into a single column, which means the "|" you passed does not match the file. With the correct separator, df.printSchema() should give something like the following (a full read example is sketched below the schema):
root
|-- seller_id: string (nullable = true)
|-- seller_zip_code_prefix: string (nullable = true)
|-- seller_city: string (nullable = true)
|-- seller_state: string (nullable = true)
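As a concrete sketch, assuming the olist file is comma-separated (which is how the public Olist dataset ships), the read and select would look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Path taken from the question; sep must match the file's actual delimiter
df = spark.read.csv(
    "file:///home/pratik58892973/olist_sellers_dataset.csv",
    header=True,
    sep=",",           # "," instead of "|"
    inferSchema=True,  # optional: let Spark guess column types
)

df.printSchema()                  # four separate columns now
df.select("seller_city").show(5)  # resolves without an AnalysisException
```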