I am not able to fully understand your desired outcome, but here is my best shot at the first two steps you mention. (I don't understand your comment about checking the final 10 chars or how it relates to special characters, so I am skipping that part.)
In [80]: g
Out[80]:
FULL_FILE FILENAME
0 c:\ythisFile.pdf thisFile.pdf
1 c:\ythatFile.exe thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip
In [81]: d
Out[81]:
FULL_FILE FILENAME
0 c:\zthis_File.pdf this_File.pdf
1 c:\zthatFile.exe thatFile.exe
2 c:\ztheOtherFile.zip ssss.zip
In [82]: temp1 = pd.merge(
g,
d,
on='FULL_FILE',
how='left',
suffixes=('_g', '_d')
)
In [83]: temp1
Out[83]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\ythisFile.pdf thisFile.pdf NaN
1 c:\ythatFile.exe thatFile.exe NaN
2 c:\ytheOtherFile.zip theOtherFile.zip NaN
In [84]: step2 = d.FULL_FILE.map(
lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
lambda y: y in x
).any()
)
In [85]: step2
Out[85]:
0 False
1 True
2 True
Name: FULL_FILE, dtype: bool
In [86]: temp2 = pandas.merge(
temp1,
d.loc[step2].drop('FULL_FILE', axis=1),
left_index=True,
right_index=True,
how='left'
)
In [87]: temp2
Out[87]:
FULL_FILE FILENAME_g FILENAME_d FILENAME
0 c:\ythisFile.pdf thisFile.pdf NaN NaN
1 c:\ythatFile.exe thatFile.exe NaN thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip NaN ssss.zip
In [88]: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
In [89]:temp2.drop('FILENAME', axis=1)
Out[89]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\ythisFile.pdf thisFile.pdf NaN
1 c:\ythatFile.exe thatFile.exe thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip ssss.zip
Note that this also works for the case when there is actually a match in the first step. For example, if I add such an entry to your example data:
In [135]: def fuzzy_match(g, d):
...: temp1 = pd.merge(
...: g,
...: d,
...: on='FULL_FILE',
...: how='left',
...: suffixes=('_g', '_d')
...: )
...: step2 = d.FULL_FILE.map(
...: lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
...: lambda y: y in x
...: ).any()
...: )
...: temp2 = pd.merge(
...: temp1,
...: d.loc[step2].drop('FULL_FILE', axis=1),
...: left_index=True,
...: right_index=True,
...: how='left'
...: )
...: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
...: return temp2.drop('FILENAME', axis=1)
...:
...:
In [136]: g = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', '
...: theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME']); d = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\z
...: thatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])
In [137]: g
Out[137]:
FULL_FILE FILENAME
0 c:\aFile.txt aFile.txt
1 c:\ythisFile.pdf thisFile.pdf
2 c:\ythatFile.exe thatFile.exe
3 c:\ytheOtherFile.zip theOtherFile.zip
In [138]: d
Out[138]:
FULL_FILE FILENAME
0 c:\aFile.txt aFile.txt
1 c:\zthis_File.pdf this_File.pdf
2 c:\zthatFile.exe thatFile.exe
3 c:\ztheOtherFile.zip ssss.zip
In [139]: fuzzy_match(g, d)
Out[139]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\aFile.txt aFile.txt aFile.txt
1 c:\ythisFile.pdf thisFile.pdf NaN
2 c:\ythatFile.exe thatFile.exe thatFile.exe
3 c:\ytheOtherFile.zip theOtherFile.zip ssss.zip