How to find unique n-grams in a DataFrame email column?

ashwinipadhy899328 · December 2, 2018, 9:32am

I need to create a custom feature transformer in Spark (Scala). I am using MLeap to serialize this transformer. For instance, I have the following DataFrame:

+------------------------+
|email_list              |
+------------------------+
|testmail1115@gmail.com  |
|mavenmaven@mlail.com    |
|dnd.7899334622@gmail.com|
+------------------------+
If I use the NGram transformer, it converts the input array of strings into an array of n-grams, like below:

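import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, split}

// Keep the local part (before "@"), then split it into single characters.
val emailD1F = emailDF
  .withColumn("email_split", split(col("email_list"), "@").getItem(0))
  .withColumn("email_split", split(col("email_split"), ""))
// The input column must be the array column built above ("email_split", not "col1").
val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(emailD1F)
ngramDataFrame.show()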

So the end result should be the unique n-grams present and the total n-grams present.

Can anyone help me with this? This is required to build a custom transformer for an ML model that runs in an MLeap environment.
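To make the desired result concrete, here is a minimal sketch in plain Scala (no Spark or MLeap dependency), assuming the result is two counts per email: the total number of character n-grams in the part before "@", and the number of distinct ones. The names NGramCounts, Counts, localPart, charNGrams and count are hypothetical helpers invented for illustration and are not part of Spark or MLeap. Note that Spark's NGram joins tokens with a space (a bigram looks like "t e"), while this sketch yields "te"; the counts are what is being illustrated.

```scala
// Hypothetical helper object; not part of Spark or MLeap.
object NGramCounts {
  // Total and distinct n-gram counts for one email.
  final case class Counts(total: Int, unique: Int)

  // Everything before the first '@' (the whole string if there is none).
  def localPart(email: String): String = email.takeWhile(_ != '@')

  // Character n-grams of s; empty when s is shorter than n.
  def charNGrams(s: String, n: Int): List[String] =
    if (n <= 0 || s.length < n) Nil else s.sliding(n).toList

  // Count total and distinct n-grams of the local part (bigrams by default).
  def count(email: String, n: Int = 2): Counts = {
    val grams = charNGrams(localPart(email), n)
    Counts(grams.size, grams.distinct.size)
  }
}
```

For example, NGramCounts.count("mavenmaven@mlail.com") gives Counts(9, 5): the nine bigrams are ma, av, ve, en, nm, ma, av, ve, en, of which five are distinct. Keeping this logic free of Spark types is a design choice, on the assumption that the same function would be called from both the Spark side and the MLeap side of a custom transformer.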