C# - Compare/match two columns using approximate string matching (fuzzy string matching, Levenshtein)
First of all, let me explain what I'm trying to achieve. The application I'm making should be able to compare two columns of two different tables with each other. Every cell of the column in the first table should be linked to the best matching cell in the column of the second table, like this (image):
This can be achieved using Levenshtein's algorithm. I wrote a test program in C# to see if I could recreate the same results the image shows. I made two arrays, one containing the first column of the image and one containing the second column. Every cell of the first column is compared to every cell of the second column, which means 4 iterations for every cell (16 in total). The best match in the second column (the one with the lowest Levenshtein distance) is linked to the cell of the first column.
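The brute-force comparison described above could look roughly like the following. This is a minimal sketch, not the asker's actual test program: the class name `FuzzyMatch` and the method names are made up for illustration, and ties are resolved by simply keeping the first candidate found.

```csharp
using System;

public static class FuzzyMatch
{
    // Levenshtein distance via dynamic programming, keeping only two rows.
    public static int Levenshtein(string a, string b)
    {
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int insertOrDelete = Math.Min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.Min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // For each cell of the first column, returns the index of the closest
    // cell in the second column (assumption: the first candidate wins on ties).
    // This performs first.Length * second.Length distance computations.
    public static int[] BestMatches(string[] first, string[] second)
    {
        int[] best = new int[first.Length];
        for (int i = 0; i < first.Length; i++)
        {
            int bestDistance = int.MaxValue;
            best[i] = -1; // stays -1 only if the second column is empty
            for (int j = 0; j < second.Length; j++)
            {
                int d = Levenshtein(first[i], second[j]);
                if (d < bestDistance) { bestDistance = d; best[i] = j; }
            }
        }
        return best;
    }
}
```

Calling `FuzzyMatch.BestMatches(firstColumn, secondColumn)` with two 4-element arrays performs the 16 comparisons mentioned above.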
The problem: let's say we have two big columns of 100k rows each. Then there will be serious performance issues. Because every cell of the first column needs to be matched against every cell of the second column to find the best possible match, I would have to iterate 100k * 100k = 10 billion times. I have to come up with something that avoids iterating 10 billion times.
I did some research on how Levenshtein is used and came across this: http://www.slideshare.net/fullscreen/vasiletopac/fuzzy-hash-map/4. I'm wondering if I would be able to create what the guy did in that link?
Some things to consider:
In such big columns there can be multiple matches for a single cell (the user will need to choose the right one). This means I can't exclude already matched cells from the current search in order to bring down the number of iterations.
In the illustration the matching/comparison is done on two columns, but in the future I would like to compare a single column of table 1 with all columns of table 2 (less work for the user). That will be even more time-expensive, as you can expect.
Note: I've been using C# for 4 months now, so I hope you can provide me with a starting point (I would prefer not to get a working answer; I'd rather research it myself so that I learn as well). Thanks for understanding. English is not my native language, so please feel free to edit my post.
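To illustrate the first consideration: instead of keeping a single best match, every candidate that shares the lowest distance can be collected so the user can choose. This is only a sketch under assumptions; `CandidateFinder` and `ClosestCandidates` are hypothetical names, and the distance function is passed in as a parameter (for example a Levenshtein implementation).

```csharp
using System;
using System.Collections.Generic;

public static class CandidateFinder
{
    // Returns the indexes of ALL cells in `candidates` that share the lowest
    // distance to `value`, so that ties can be shown to the user.
    // `distance` is any string metric supplied by the caller.
    public static List<int> ClosestCandidates(
        string value,
        IList<string> candidates,
        Func<string, string, int> distance)
    {
        var result = new List<int>();
        int bestDistance = int.MaxValue;

        for (int j = 0; j < candidates.Count; j++)
        {
            int d = distance(value, candidates[j]);
            if (d < bestDistance)
            {
                // Strictly better: forget the previous candidates.
                bestDistance = d;
                result.Clear();
                result.Add(j);
            }
            else if (d == bestDistance)
            {
                // Tie with the current best: keep both.
                result.Add(j);
            }
        }
        return result;
    }
}
```

Because matched cells are never removed from the second column, the same cell can show up as a candidate for several cells of the first column.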
Try to come up with an assumption that always holds true for a match, so that you can segment the problem into smaller chunks, for example:
The first capital alpha character in table 1 must match the first capital alpha character in table 2.
You may be able to find a valid assumption that allows you to pre-process the values in each column:
firstalpha1   firstalpha2
===========   ===========
P             C
S             F
C             P
F             S
Then a simple sort and join (exact match) on that value will split the problem into smaller chunks, and the expensive fuzzy comparison only has to run within each chunk.
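A rough sketch of this pre-processing step, assuming the "first capital letter" rule really holds for your data. The names `Blocking`, `FirstCapital` and `CandidatesByKey` are invented for this example, and a lookup keyed on the derived value stands in for the sort and join.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Blocking
{
    // Hypothetical key: the first upper-case letter in the value,
    // or '\0' when the value contains none.
    public static char FirstCapital(string value)
    {
        foreach (char c in value)
        {
            if (char.IsUpper(c)) return c;
        }
        return '\0';
    }

    // Groups the second column by key once, then pairs every cell of the
    // first column only with the second-column cells that share its key.
    public static Dictionary<string, List<string>> CandidatesByKey(
        IEnumerable<string> first,
        IEnumerable<string> second)
    {
        ILookup<char, string> byKey = second.ToLookup(s => FirstCapital(s));
        var result = new Dictionary<string, List<string>>();

        foreach (string value in first)
        {
            // A lookup returns an empty sequence for a key it does not contain.
            result[value] = byKey[FirstCapital(value)].ToList();
        }
        return result;
    }
}
```

The Levenshtein comparison would then only be run against the list returned for each value. Keep in mind that this misses a match whenever the typo is in the key character itself, so the assumption has to be chosen with your data in mind.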
c# hashtable levenshtein-distance