Hi Krishnakanth,
Interesting question.
Since we want to use the spark-csv package, the easiest way to do this is with the package's schema option, combined with the dateFormat option:
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Declare the two timestamp columns as DateType so spark-csv converts them while reading
val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("starttstamp", DateType, nullable = true),
  StructField("endtstamp", DateType, nullable = true)))

// dateFormat tells spark-csv how the raw date strings are formatted in the file
val df = spark.read.format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("inferSchema", "false")
  .option("dateFormat", "yyyyMMddhhmm")
  .schema(schema)
  .load("/data/spark/sample.csv")
You can see the result with the .show() function:
scala> df.show()
+-------+-----------+----------+
| name|starttstamp| endtstamp|
+-------+-----------+----------+
|Boyina1| 2016-09-05|2015-09-06|
|Boyina2| 2016-08-05|2015-08-06|
+-------+-----------+----------+
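To double-check that the two timestamp columns really came in as dates rather than strings, printSchema on the same DataFrame should show something like this:

scala> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- starttstamp: date (nullable = true)
 |-- endtstamp: date (nullable = true)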
Alternatively, we could use the map function on the underlying RDD and, inside that map, call a small date-parsing helper such as:
import java.sql.Date

// Parse the raw "yyyyMMddhhmm" string into a java.sql.Date, the type Spark's DateType expects
def parseDate(d: String): Date = {
  val format = new java.text.SimpleDateFormat("yyyyMMddhhmm")
  new Date(format.parse(d).getTime)
}
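For completeness, here is a rough sketch of how that helper could be wired in. It assumes the same file path and the schema value defined above, and it reads the file as plain text instead of going through spark-csv:

import org.apache.spark.sql.Row

// Read the raw lines, split on the pipe delimiter, convert the two timestamp
// columns with parseDate, and rebuild a DataFrame using the schema from above.
val rowRDD = spark.sparkContext
  .textFile("/data/spark/sample.csv")
  .map(_.split("\\|"))
  .map(cols => Row(cols(0), parseDate(cols(1)), parseDate(cols(2))))

val df2 = spark.createDataFrame(rowRDD, schema)
df2.show()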