Hi Krishnakanth,
Interesting question.
Since we want to use the spark-csv package, the easiest way to do this is with the package's schema option, combined with the dateFormat option:
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Declare the two timestamp columns as DateType so spark-csv converts them while reading
val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("starttstamp", DateType, nullable = true),
  StructField("endtstamp", DateType, nullable = true)))

// dateFormat tells spark-csv how the raw date strings are formatted in the file
val df = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("inferSchema", "false")
  .option("dateFormat", "yyyyMMddhhmm")
  .schema(schema)
  .load("/data/spark/sample.csv")
You can see the result with the .show() function:
scala> df.show()
+-------+-----------+----------+
| name|starttstamp| endtstamp|
+-------+-----------+----------+
|Boyina1| 2016-09-05|2015-09-06|
|Boyina2| 2016-08-05|2015-08-06|
+-------+-----------+----------+
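To double-check that the two timestamp columns really came in as dates rather than strings, printSchema on the same DataFrame should show something like this:

scala> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- starttstamp: date (nullable = true)
 |-- endtstamp: date (nullable = true)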
Alternatively, we could use the map function on the underlying RDD and, inside that map, call a small date-parsing helper such as:
import java.sql.Date

// Parse the raw "yyyyMMddhhmm" string into a java.sql.Date, the type Spark's DateType expects
def parseDate(d: String): Date = {
  val format = new java.text.SimpleDateFormat("yyyyMMddhhmm")
  new Date(format.parse(d).getTime)
}
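For completeness, here is a rough sketch of how that helper could be wired in. It assumes the same file path and the schema value defined above, and it reads the file as plain text instead of going through spark-csv:

import org.apache.spark.sql.Row

// Read the raw lines, split on the pipe delimiter, convert the two timestamp
// columns with parseDate, and rebuild a DataFrame using the schema from above.
val rowRDD = spark.sparkContext
  .textFile("/data/spark/sample.csv")
  .map(_.split("\\|"))
  .map(cols => Row(cols(0), parseDate(cols(1)), parseDate(cols(2))))

val df2 = spark.createDataFrame(rowRDD, schema)
df2.show()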