Spark code to find users having same DNA

abhinav · September 25, 2017, 1:35pm

Problem

Write Spark code to find users who have the same DNA in a file stored in HDFS.

Dataset

The file is located at

/data/mr/dna/dna.txt

Sample Output

The output file lists each DNA sequence shared by more than one user, followed by those users:

ACG     ['User5', 'User3']
ACGT    ['User4', 'User1']

ArUn_M · July 16, 2018, 6:17pm

Please correct me if this needs improvement:
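// Load the DNA file from HDFS
val dnaRdd = sc.textFile("/data/mr/dna/dna.txt")
// Build a (DNA, user) pair: field 0 is taken as the user, field 3 as the DNA
def clean(line: String): (String, String) = {
  val arr = line.split(" ")
  (arr(3).trim, arr(0))
}
val pairs = dnaRdd.map(clean)
// Group the users by DNA sequence
val userdna = pairs.groupByKey()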

Output:
Array((TGCA,CompactBuffer(User2)), (ACG,CompactBuffer(User3, User5)), (ACGT,CompactBuffer(User1, User4)), (AGCT,CompactBuffer(User6)))
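
One possible improvement: the result above still contains DNA sequences that belong to a single user (TGCA and AGCT), while the sample output in the problem lists only sequences shared by several users. Below is a minimal sketch of adding that filter, not a definitive solution. The object name SharedDna and the helper usersSharingDna are made up for illustration, and the in-memory pairs are copied from the output above as a stand-in for reading /data/mr/dna/dna.txt. It also assumes a local master so it can run outside the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object; the name is illustrative only.
object SharedDna {

  // Group users by DNA and keep only sequences with more than one user.
  // Input pairs are (dna, user), the same shape the clean function produces.
  def usersSharingDna(pairs: RDD[(String, String)]): RDD[(String, List[String])] =
    pairs
      .groupByKey()
      .mapValues(_.toList.sorted)                   // stable order for printing
      .filter { case (_, users) => users.size > 1 } // drop single-user sequences

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; on the cluster the master comes from spark-submit.
    val conf = new SparkConf().setAppName("SharedDna").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in data taken from the output posted above, not from the real file.
      val pairs = sc.parallelize(Seq(
        ("ACGT", "User1"), ("TGCA", "User2"), ("ACG", "User3"),
        ("ACGT", "User4"), ("ACG", "User5"), ("AGCT", "User6")
      ))

      usersSharingDna(pairs)
        .collect()
        .sortBy(_._1)
        .foreach { case (dna, users) =>
          println(s"$dna\t${users.mkString("[", ", ", "]")}")
        }
      // Prints:
      // ACG	[User3, User5]
      // ACGT	[User1, User4]
    } finally {
      sc.stop()
    }
  }
}
```

In spark-shell, where sc already exists, the same idea should reduce to applying the filter to the grouped RDD from the post: userdna.filter { case (_, users) => users.size > 1 }. To write an output file as the problem asks, saveAsTextFile on the filtered RDD is one option; the output path would be your own choice.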