String Similarity

Binita_Sajit · June 29, 2022, 8:30am

I have an excel file with 11 columns. This is my source file.
I have many target excel files , columns in each vary and their names also vary.

I have to merge the target data with the source data on the basis of column data similarity

For eg, my source column name is Product Description, whereas one if the target column is Line Description. If we manually check we know both these columns mean the same.

How can we automate this process? Which model to use?

sandeepgiri · June 29, 2022, 3:10pm

Is it one time job?

How many records are there? How many files are there?

Can you manually fix the names of columns and load these as Dataframes and merge them?

Do you need to use machine learning?

Binita_Sajit · June 30, 2022, 6:28am

Source File has columns
Country code First line of description Second line of description UPC or EAN code VENDORNAM VENDOR_PART_NBR CURRENCYCODE

Each of the vendor has their own template where the columns don’t match. For eg in one of the vendor sheet , the contents under column Short Specs looks similar to the Source File’s ‘First line of description’

Kind of checking vector similarity. Check for similarity of each column of vendor file with each column of Source file. If found similar, add the content to the source file

Hope I was able to explain

Binita_Sajit · July 1, 2022, 9:20am

Have mailed both source and target file

When done manually

Ref Row #9 of target,

the content appearing under ‘Short Specs’ is split into two
‘Portable/ASUS P14/P1400CMNS-EK353RA/Gray/14.0" FHD’ - added under ‘First Line of Description’ and ‘Celeron N4020/4Go/128Go NVMe SSD/UMA/Win10 Pro Academic/USB-vers-LAN/2 ans ’ added under ’ Second line of description’ in the Source
Content of column ‘EAN’ is added to ‘UPC or EAN Code’ in the Source
Content of column ‘Succède à’ is added to ‘VENDOR_PART_NBR’

This needs to be automated using ML algo

Binita_Sajit · July 4, 2022, 5:45am

Source File :

Binita_Sajit · July 4, 2022, 5:45am

One of the target file :