Solving near real time user segmentation

I need to provide a way to categorise users into different segments. These could be, for example:
a. New customers
b. Repeat customers
c. Dropped customers
or very specific ones, such as users who purchased this month but whose total purchase was less than 1000.
The user database holds close to 50 million users.
What would be the ideal technology stack?

In what form is your data?

Let me assume that we have a file of transactions. If we could move this file into HDFS, we could use Hive to run queries on top of it.

OK. If the data is in Mongo, what is the best approach? Thanks, Sandeep, for the quick response.

Okay. Interesting!

We could write aggregation pipelines or map-reduce code for the MongoDB collections and group the users into various buckets.

With MongoDB aggregations, you can achieve roughly as much as you can with SQL in a relational database. If you need to go beyond what the aggregation pipeline can do, you might have to write map-reduce in MongoDB.
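To make the aggregation idea concrete, here is a minimal sketch of a pipeline for the "purchased this month but total purchase was less than 1000" segment. The collection name `transactions` and the fields `user_id`, `amount` and `created_at` are my assumptions, since I have not seen your schema, so treat this as an illustration rather than a ready-made query.

```python
from datetime import datetime


def build_low_spend_pipeline(month_start, next_month_start, max_total=1000):
    """Build an aggregation pipeline for users who purchased in the given
    month and whose total spend in that month is below max_total.

    Field names (user_id, amount, created_at) are assumed, not taken
    from a real schema.
    """
    return [
        # Keep only transactions that fall inside the month.
        {"$match": {"created_at": {"$gte": month_start, "$lt": next_month_start}}},
        # One output document per user, with the monthly total and order count.
        {"$group": {"_id": "$user_id", "total": {"$sum": "$amount"}, "orders": {"$sum": 1}}},
        # Keep only users whose monthly total is under the threshold.
        {"$match": {"total": {"$lt": max_total}}},
    ]


pipeline = build_low_spend_pipeline(datetime(2024, 5, 1), datetime(2024, 6, 1))

# With pymongo (not imported here) this could be run as:
#   db.transactions.aggregate(pipeline, allowDiskUse=True)
```

With around 50 million users the `$group` stage may need a lot of memory, which is why the commented call passes `allowDiskUse=True`. An index on the date field should also help the first `$match`.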

The good news is that there seems to be a very nice Spark connector
(https://www.mongodb.com/products/spark-connector), which makes it possible to use Spark with MongoDB. As far as I understand, the connector can push parts of the Spark logic you write, such as filters, down into an aggregation pipeline.

So you can probably use Spark MLlib with MongoDB efficiently without having to write that logic yourself.

Also, for customer segmentation, most of the job can be done easily with SQL, but if your logic grows more complex, such as finding outlier customers, you could try the clustering algorithms in MLlib.
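For the SQL part, here is a rough sketch of what the new / repeat / dropped split could look like. It uses Python's built-in sqlite3 with a tiny in-memory table purely so that it runs anywhere. The table layout, the sample rows, the 90-day style cutoff and the segment definitions themselves are all assumptions on my side, and a similar CASE expression could probably be adapted for Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (user_id TEXT, amount REAL, purchased_on TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("u1", 300.0, "2024-05-03"),  # first ever purchase is in the current month
        ("u2", 500.0, "2024-01-10"),
        ("u2", 700.0, "2024-05-12"),  # earlier purchase plus a recent one
        ("u3", 900.0, "2023-11-20"),  # no purchase since the cutoff date
    ],
)

# Assumed definitions:
#   dropped: last purchase is older than the cutoff date
#   new:     first purchase falls in the current month
#   repeat:  more than one purchase and still active
SEGMENT_SQL = """
SELECT user_id,
       CASE
           WHEN MAX(purchased_on) < :cutoff THEN 'dropped'
           WHEN MIN(purchased_on) >= :month_start THEN 'new'
           WHEN COUNT(*) > 1 THEN 'repeat'
           ELSE 'other'
       END AS segment
FROM transactions
GROUP BY user_id
ORDER BY user_id
"""

params = {"cutoff": "2024-02-15", "month_start": "2024-05-01"}
segments = dict(conn.execute(SEGMENT_SQL, params).fetchall())
print(segments)
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly. In a real warehouse you would use a proper date type and compute the cutoff from the current date.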

If you could share a small sample backup of the MongoDB collection and an exact example of the segmentation, I could give it a shot.
