How to find unique n-grams from a DataFrame email column?


#1

I have a requirement to create a custom feature transformer in Spark (Scala). I am using MLeap to serialize this feature transformer. For instance, I have the following DataFrame:

+------------------------+
|email_list              |
+------------------------+
|testmail1115@gmail.com  |
|mavenmaven@mlail.com    |
|dnd.7899334622@gmail.com|
+------------------------+
If I use the transformer, it converts the input array of strings into an array of n-grams, like below:

+------------------------+------------------+
|email_list              |ngrams            |
+------------------------+------------------+
|testmail1115@gmail.com  |[t e, e s, s t, t…|
|mavenmaven@mlail.com    |[m a, a v, v e, e…|
|dnd.7899334622@gmail.com|[d n, n d, d…     |
+------------------------+------------------+
How can I get the distinct n-grams present, rather than the per-row array produced by the code below?

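import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, split}
val emailD1F = emailDF
  .withColumn("email_split", split(col("email_list"), "@").getItem(0)) // part before "@"
  .withColumn("email_split", split(col("email_split"), ""))            // split into single characters
// The input column must be the character array built above (the original "col1" is not defined here)
val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(emailD1F)
ngramDataFrame.show()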
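
One possible way to get counts from the `ngrams` column, sketched here rather than confirmed against a real pipeline, is to use `size` for the total and `size(array_distinct(...))` for the unique count. This assumes Spark 2.4 or later (where `array_distinct` and `array_remove` are available) and assumes the counts are wanted per row. The object name `NGramCounts`, the helper `withCounts`, and the output column names `total_ngrams` and `unique_ngrams` are made up for illustration. The `array_remove(..., "")` call is a defensive assumption in case splitting on an empty pattern leaves an empty element in the array.

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, array_remove, col, size, split}

object NGramCounts {
  // Hypothetical helper: adds per-row total and unique bigram counts.
  def withCounts(emailDF: DataFrame): DataFrame = {
    // Take the part before "@", split it into characters, drop empty elements.
    val chars = emailDF.withColumn(
      "email_split",
      array_remove(split(split(col("email_list"), "@").getItem(0), ""), ""))

    val ngram = new NGram().setN(2).setInputCol("email_split").setOutputCol("ngrams")

    ngram.transform(chars)
      .withColumn("total_ngrams", size(col("ngrams")))                  // all bigrams in the row
      .withColumn("unique_ngrams", size(array_distinct(col("ngrams")))) // distinct bigrams in the row
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ngram-counts").getOrCreate()
    import spark.implicits._

    val emailDF = Seq(
      "testmail1115@gmail.com",
      "mavenmaven@mlail.com",
      "dnd.7899334622@gmail.com").toDF("email_list")

    withCounts(emailDF).select("email_list", "total_ngrams", "unique_ngrams").show(false)
    spark.stop()
  }
}
```

If the counts are needed across the whole column instead of per row, the `ngrams` array could be exploded and aggregated, but that is a different shape of result from what this sketch produces.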

So the end result should be:

the unique n-grams present and the total number of n-grams present.

Can anyone help me with this? This is required to build a custom transformer for an ML model that runs in an MLeap environment.
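
Since a custom transformer usually needs its core logic as a plain function, here is a minimal sketch of that logic in plain Scala with no Spark or MLeap dependency. It assumes character bigrams over the part of the address before "@", joined with a space to match the NGram output shown earlier, and it assumes "unique" means distinct across all the emails passed in. The names `EmailNGrams`, `NGramStats`, `ngrams`, and `stats` are hypothetical and not part of any library.

```scala
object EmailNGrams {
  // Hypothetical result type: the distinct n-grams plus both counts.
  final case class NGramStats(unique: Seq[String], uniqueCount: Int, totalCount: Int)

  // Character n-grams of the local part (text before "@"), e.g. "te" becomes "t e".
  def ngrams(email: String, n: Int = 2): Seq[String] = {
    val local = email.takeWhile(_ != '@')
    // sliding(n) would yield one short window for inputs shorter than n, so guard against that.
    if (local.length < n) Vector.empty
    else local.sliding(n).map(_.mkString(" ")).toVector
  }

  // Distinct n-grams and counts over all given emails.
  def stats(emails: Seq[String], n: Int = 2): NGramStats = {
    val all = emails.flatMap(ngrams(_, n))
    val unique = all.distinct
    NGramStats(unique, unique.size, all.size)
  }
}
```

For example, `EmailNGrams.stats(Seq("testmail1115@gmail.com"))` would report 11 bigrams in total, 10 of them distinct, because "1 1" occurs twice. Wiring such a function into an actual MLeap transformer is not shown here.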