Cannot resolve column name in given input columns in pyspark dataframe error

Hello,

I have created the following DataFrame:

df = spark.read.csv("file:///home/pratik58892973/olist_sellers_dataset.csv", header="True", sep="|")

While I’m able to fetch data:
df.show(5)

and display the schema:
df.printSchema()

root
|-- seller_id,seller_zip_code_prefix,seller_city,seller_state: string (nullable = true)

However, selecting a single column from this DataFrame raises an error:

df.select("seller_city").show()

Error:
AnalysisException: "cannot resolve 'seller_city' given input columns: [seller_id,seller_zip_code_prefix,seller_city,seller_state];;\n'Project ['seller_city]\n+- Relation[seller_id,seller_zip_code_prefix,seller_city,seller_state#10] csv\n"

Can anyone suggest a solution?

My problem was similar: cannot resolve 'csu_5g_base_user_mon.c1249' given input columns. In my case the reason was that select could not handle the '.' character in the column name, so I had to remove it or replace it with another character. Hope this is useful for resolving your problem.
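The renaming approach described in this reply could look roughly like the sketch below. The helper names sanitize_column_name and sanitize_columns are made up for illustration, and the choice of "_" as the replacement character is an assumption; only df.columns and df.toDF are real PySpark calls, and they appear in comments so the block runs without a Spark session.

```python
import re


def sanitize_column_name(name: str, replacement: str = "_") -> str:
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^0-9A-Za-z_]", replacement, name)


def sanitize_columns(columns):
    """Apply sanitize_column_name to each name, preserving order."""
    return [sanitize_column_name(c) for c in columns]


# Hypothetical usage with an existing DataFrame `df`:
#   df = df.toDF(*sanitize_columns(df.columns))
#   df.select("csu_5g_base_user_mon_c1249").show()
#
# An alternative that keeps the original name is to quote it with backticks:
#   df.select("`csu_5g_base_user_mon.c1249`").show()
```

Renaming once after the read keeps later select calls free of special quoting, at the cost of the column names no longer matching the source file exactly.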

When you read in the CSV file, make sure to use the right separator; it may be ";" or ",". Then, when you check the schema with df.printSchema(), you should get something like:
root
|-- seller_id: string (nullable = true)
|-- seller_zip_code_prefix: string (nullable = true)
|-- seller_city: string (nullable = true)
|-- seller_state: string (nullable = true)
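If you are not sure which separator the file uses, one rough way to find out is to count candidate characters in the header line before calling spark.read.csv. The sketch below is only a heuristic; guess_delimiter and its candidate list are hypothetical, and the header string is the one shown in the schema of the question.

```python
def guess_delimiter(header_line: str, candidates=(",", ";", "|", "\t")) -> str:
    """Return the candidate character that occurs most often in the header line."""
    counts = {c: header_line.count(c) for c in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("none of the candidate delimiters occur in the header")
    return best


# Header taken from the schema printed in the question.
header = "seller_id,seller_zip_code_prefix,seller_city,seller_state"
sep = guess_delimiter(header)

# Hypothetical usage once the separator is known:
#   df = spark.read.csv(path, header=True, sep=sep)
```

This only looks at one line and will be fooled by delimiters that appear inside quoted fields, so treat the result as a hint and confirm it with df.printSchema().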


Hello,
To solve this, you need to read the CSV file with the correct delimiter so that Spark can identify the individual headers. The column name in your schema contains commas, which means the file is comma-separated, not pipe-separated.
Replace this line:
df = spark.read.csv("file:///home/pratik58892973/olist_sellers_dataset.csv", header="True", sep="|")

with this:

df = spark.read.option("delimiter", ",").option("header", True).csv("/home/pratik58892973/olist_sellers_dataset.csv")

To check whether Spark identified the fields correctly, print the schema:

df.printSchema()

You will get something like this, with one line per column (this sample is from a different dataset):

root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- firstname: string (nullable = true)
|-- zip: string (nullable = true)
|-- city: string (nullable = true)
|-- birthdate: string (nullable = true)
|-- street: string (nullable = true)
|-- housenr: string (nullable = true)
|-- stateCode: string (nullable = true)
|-- state: string (nullable = true)

Once the columns are split correctly, selecting seller_city no longer raises the error:

df.select("seller_city").show()
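As an extra sanity check after reading, you could scan the column names for leftover delimiter characters, which is a typical sign that the header was not split. This is a minimal sketch; unsplit_columns is a hypothetical helper, and the delimiter list is an assumption about what is likely to appear in CSV files.

```python
def unsplit_columns(columns, delimiters=(",", ";", "|", "\t")):
    """Return the column names that still contain a likely delimiter character."""
    return [c for c in columns if any(d in c for d in delimiters)]


# Column list as produced by the read with sep="|" in the question.
bad_read = ["seller_id,seller_zip_code_prefix,seller_city,seller_state"]
# Column list expected after reading with the correct delimiter.
good_read = ["seller_id", "seller_zip_code_prefix", "seller_city", "seller_state"]

suspicious = unsplit_columns(bad_read)   # the single unsplit header
clean = unsplit_columns(good_read)       # empty list

# Hypothetical usage with a DataFrame `df`:
#   if unsplit_columns(df.columns):
#       raise ValueError("header was not split; check the delimiter option")
```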
