Apache Spark - Interview Question - Find the longest line

You have a text file of say 100 TB in HDFS. You need to print the longest line in this text file. Can you do it efficiently using Apache Spark?

There are multiple ways to do this. Most people try it with Spark DataFrames, in some form like:

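# Assumes a temp view "textfile" with a string column "line"
max_len = spark.sql("select max(length(line)) from textfile").first()[0]
longest = spark.sql(f"select line from textfile where length(line) = {max_len} limit 1").first()[0]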

What’s wrong with it? It runs two actions, each of which has to read the file, so it is going to be slow. Can you think of an approach that needs only one action? Here is my approach using an RDD:

rdd = sc.textFile("/data/mr/wordcount/input")
# Pair every line with its length
rdd1 = rdd.map(lambda line: (len(line), line))
# Single action: keep whichever pair has the larger length
result = rdd1.reduce(lambda x, y: x if x[0] > y[0] else y)
print(result[1])
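
To see why a single reduce is enough, here is a minimal plain-Python sketch (no Spark involved) of roughly what happens: each partition is reduced on its own, and the partial winners are then merged with the same function. The names `longer`, `longest_line` and the sample `partitions` are hypothetical and only there for illustration.

```python
from functools import reduce

def longer(x, y):
    """Keep whichever (length, line) pair has the larger length."""
    return x if x[0] > y[0] else y

def longest_line(partitions):
    """Reduce each partition separately, then merge the partial results."""
    partials = [
        reduce(longer, ((len(line), line) for line in part))
        for part in partitions
        if part  # an empty partition has nothing to reduce
    ]
    return reduce(longer, partials)[1]

# Hypothetical sample data split into three "partitions"
partitions = [
    ["a", "bbb"],
    [],
    ["cc", "ddddd", "e"],
]
print(longest_line(partitions))
```

If several lines share the maximum length, which one comes back depends on the order in which the pairs are combined. That is fine here, since the question only asks for a longest line.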