Spark code to find anagrams in a text file


#1

Problem

Write Spark code to find the anagrams in a text file stored in HDFS. An anagram is a different arrangement of the letters in a word. The anagrams do not need to be meaningful words.
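To make the definition concrete: two words are anagrams of each other exactly when their letters, once sorted, give the same string. A minimal sketch in plain Python, where the helper name `signature` is made up for illustration and is not part of the exercise:

```python
def signature(word):
    # Sort the letters so that every rearrangement of the same letters maps to one key.
    return "".join(sorted(word.lower()))

# 'elbow' and 'below' use the same letters; 'bowl' does not.
print(signature("elbow") == signature("below"))
print(signature("elbow") == signature("bowl"))
```

Grouping the words of the file by this key is the usual way to collect the anagrams.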

Dataset

The file is located at

/data/mr/wordcount/big.txt

Sample Output

The output file will contain the anagrams found in the text file, one group per line, preceded by the number of words in the group:

3   ['bowel,', 'elbow,', 'below,']
3   ['bore', 'boer', 'robe']
3   ['bears', 'baser', 'saber']

#2

Could you please upload the code to solve the above problem?


#3

Try something along these lines.

import re
def towords(x):
    x = re.sub(r"[^0-9A-Za-z]+", " ", x)
    x = re.sub(r"[ ]+", " ", x)
    return x.lower().split()

lines = sc.textFile("/data/mr/wordcount/big.txt")
words = lines.flatMap(towords)

def sortchars(w):
    return ("".join(sorted(w)), [w])

wordskv = words.map(sortchars)

def mergearrays(x, y):
    # list.extend() changes the list in place and returns None,
    # so return a new concatenated list instead.
    return x + y

duplicatewords = wordskv.reduceByKey(mergearrays)

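def uniqwords(x):
    # x is a (sorted letters, list of words) pair; deduplicate the words, not the key.
    u = list(set(x[1]))
    return (len(u), u)

# Keep only the groups with at least two distinct words, i.e. actual anagrams.
result = duplicatewords.map(uniqwords).filter(lambda t: t[0] > 1)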
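The grouping logic can be checked on a few lines of text without a cluster. The sketch below is a plain-Python stand-in for the flatMap / map / reduceByKey steps, not Spark code; the function name `find_anagrams` and the sample sentences are made up for illustration.

```python
import re
from collections import defaultdict

def towords(line):
    # Same tokenisation idea as above: non-alphanumerics become separators.
    return re.sub(r"[^0-9A-Za-z]+", " ", line).lower().split()

def find_anagrams(lines):
    groups = defaultdict(set)  # sorted letters -> distinct words
    for line in lines:  # stands in for flatMap(towords)
        for w in towords(line):
            # Stands in for map(sortchars) followed by reduceByKey.
            groups["".join(sorted(w))].add(w)
    # Keep only keys with two or more distinct words, largest groups first.
    result = [(len(ws), sorted(ws)) for ws in groups.values() if len(ws) > 1]
    return sorted(result, key=lambda t: (-t[0], t[1]))

sample = [
    "Bears are baser than a saber.",
    "An elbow below the bowel; a bore, a Boer, a robe.",
]
for count, words in find_anagrams(sample):
    print(count, words)
```

Using a set per key removes duplicate words as they arrive, which does the job of the separate deduplication step in the Spark version.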

#5

package com.alok.projects

import org.apache.spark.sql.SparkSession

object entry {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("String Analysis")
      .config("spark.master", "local")
      .getOrCreate()

    val novel = spark.sparkContext.textFile("src/resources/big.txt")
    val stopWords = Set("a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as",
      "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do",
      "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he",
      "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i",
      "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more",
      "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves",
      "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that",
      "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd",
      "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was",
      "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which",
      "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've",
      "your", "yours", "yourself", "yourselves", "may", "no", "not", "now", "will", "must", "can")

    val novel_words_cleaned_tuple = novel.flatMap(x => x.split(" "))
      .map(c => c.replaceAll("[^a-zA-Z0-9]+", ""))
      .map(_.toLowerCase)
      .filter(x => !stopWords.contains(x) && x != "").distinct()
    //.map(x => (x,1))

    novel_words_cleaned_tuple.take(10).foreach(println)
    // Key each word by its sorted letters, concatenate the word lists per key,
    // and keep only the keys that have more than one word.
    novel_words_cleaned_tuple.map(x => (x.split("").sorted.toList, List(x)))
      .reduceByKey(_ ++ _)
      .filter(x => x._2.length > 1)
      .take(10).foreach(println)
  }
}

The output:
(List(a, i, l, r, t),List(trial, trail))
(List(a, e, l, r, r, y),List(larrey, rarely))
(List(e, g, n, o, o, r, r, s, v),List(grosvenor, governors))
(List(d, e, e, f, m, o, p, r, r),List(performed, preformed))
(List(0, 2, 7),List(270, 207))
(List(d, e, i, k, l, n),List(linked, kindle))
(List(a, b, g, r),List(brag, garb, grab))
(List(d, e, e, e, p, s, t),List(deepest, deepset))
(List(a, c, e, h, s, t),List(sacthe, chaste))
(List(a, e, e, g, m, r),List(meagre, meager))


#6

This is good. I am just wondering why you need to remove the stopwords.


#7

Thanks Sandeep! I just wanted to include meaningful words. I agree that the answer is not in line with the question. The stopwords filter can be removed.


#8

Hi Sandeep,
I have tried changing it.

package com.alok.projects

import org.apache.spark.sql.SparkSession

object entry {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("String Analysis")
      .config("spark.master", "local")
      .getOrCreate()

    val novel = spark.sparkContext.textFile("src/resources/big.txt")
    val novel_words_cleaned_tuple = novel.flatMap(x => x.split(" "))
      .map(c => c.replaceAll("[^a-zA-Z0-9]+", ""))
      .map(_.toLowerCase)
      .distinct()

    // Group words by their sorted letters and print the 50 largest anagram groups.
    novel_words_cleaned_tuple.map(x => (x.split("").sorted.toList, List(x))).reduceByKey(_ ++ _)
      .filter(x => x._2.length > 1).sortBy(_._2.length, ascending = false).map(x => x._2).take(50).foreach(println)
  }
}

Output:
List(423, 243, 342, 432, 324, 234)
List(621, 261, 612, 216, 126, 162)
List(352, 532, 523, 253, 325, 235)
List(251, 125, 521, 512, 215, 152)
List(142, 124, 412, 214, 241, 421)
List(trace, caret, cater, carte, crate, react)
List(least, tales, stael, slate, steal, stale)
List(425, 254, 452, 542, 524, 245)
List(531, 513, 351, 315, 153, 135)
List(136, 631, 163, 613, 361, 316)
List(lustre, sutler, rustle, result, ulster, luster)
List(453, 534, 543, 435, 354, 345)
List(154, 145, 541, 451, 415, 514)
List(413, 431, 134, 314, 143, 341)
List(321, 213, 123, 132, 231, 312)
List(live, evil, veil, levi, vile)
List(stop, tops, post, spot, pots)
List(164, 146, 641, 416, 461)
List(645, 564, 465, 456, 546)
List(149, 194, 491, 914, 419)
List(seton, onset, tones, stone, notes)
List(ernest, resent, rentes, sterne, enters)
List(sacre, scare, acres, races, cares)
List(ranged, grande, danger, garden)
List(tis, sit, ist, its)
List(493, 439, 394, 349)
List(voter, votre, overt, trove)
List(158, 185, 581, 518)
List(263, 362, 326, 236)
List(359, 593, 395, 539)
List(tens, sent, nest, nets)
List(374, 347, 437, 473)
List(174, 417, 147, 471)
List(mister, merits, remits, termis)
List(385, 538, 583, 358)
List(482, 428, 284, 248)
List(849, 498, 489, 984)
List(265, 256, 526, 562)
List(457, 547, 574, 475)
List(102, 120, 210, 201)
List(grenade, angered, grandee, enraged)
List(392, 329, 239, 293)
List(relating, integral, triangle, altering)
List(1685, 1856, 1865, 1658)
List(350, 305, 503, 530)
List(singer, reigns, signer, resign)
List(137, 317, 173, 371)
List(secured, seducer, rescued, reduces)
List(leap, plea, pale, peal)
List(limes, slime, smile, miles)

Many of the largest groups are just numbers. We can filter out the numeric values.
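One possible way to do that filtering, sketched in plain Python: drop every token that consists only of digits before the grouping step. The helper name `is_numeric` and the sample tokens are illustrative only. In the PySpark version earlier in the thread, the same predicate could presumably be applied as `words.filter(lambda w: not w.isdigit())`, and an equivalent filter could be added to the Scala pipeline.

```python
def is_numeric(token):
    # str.isdigit() is True only when the token is non-empty and every character is a digit.
    return token.isdigit()

# Sample tokens taken from the output above.
tokens = ["423", "243", "trace", "caret", "1685", "stop", "tops"]
kept = [t for t in tokens if not is_numeric(t)]
print(kept)
```

Note that this keeps mixed tokens that contain both letters and digits; testing with `str.isalpha()` instead would be the stricter choice if only pure words are wanted.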