I have a regex problem that I have tried to keep as general as possible, although my code is written in MATLAB.
INFO:
LipidData is a 68×2 table with a name column and a Short column; the Short values are strings such as LPC, PC, AC4PIM2, SHexCer, SQDG and many more. The LipidData table is not going to change, whereas foundpattern may vary depending on the real input data it comes from.
foundpattern is an N×4 table, where N is 7 in my example. The only relevant column here is the first one, called ISDs, which contains the strings to check (for reproducibility you may copy only that column as a cell array). Here are both MATLAB tables:
INPUT:
>> LipidData
LipidData =
68×2 table
Lipid subclass name Short
___________________________________________________ ___________
{'Diacylated phosphatidylinositol monomannoside' } {'Ac2PIM1' }
{'Diacylated phosphatidylinositol dimannoside' } {'Ac2PIM2' }
{'Triacylated phosphatidylinositol dinomannoside' } {'Ac3PIM2' }
{'Tetraaacylated phosphatidylinositol dimannoside' } {'AC4PIM2' }
{'Anacardic Acid' } {'ACar' }
{'Acetylglucose andrographolide' } {'AcylGlcADG' }
{'Bis[monoacylglycero]phosphates' } {'BMP' }
{'Cholesteryl esters' } {'CE' }
{'Ceramide' } {'Cer' }
{'Ceramide alpha-hydroxy fatty acid-dihydrosphingosine' } {'CerADS' }
{'Ceramide alpha-hydroxy fatty acid-phytospingosine' } {'CerAP' }
{'Ceramide beta-hydroxy fatty acid-sphingosine' } {'CerAS' }
{'Ceramide beta-hydroxy fatty acid-dihydrosphingosine' } {'CerBDS' }
{'Ceramide beta-hydroxy fatty acid-sphingosine' } {'CerBS' }
{'Ceramide Esterified omega-hydroxy fatty acid-dihydrosphingosine'} {'CerEODS' }
{'Ceramide Esterified omega-hydroxy fatty acid-sphingosine' } {'CerEOS' }
{'Ceramide non-hydroxyfatty acid-dihydrosphingosine' } {'CerNDS' }
{'Ceramide non-hydroxyfatty acid-phytospingosine' } {'CerNP' }
{'Ceramide non-hydroxyfatty acid-sphingosine' } {'Cer_NS' }
{'Ceramide phosphate' } {'CerP' }
{'Cholesterol' } {'Cholesterol'}
{'Cardiolipins' } {'CL' }
{'Diacyl/alkylglycerides' } {'DG' }
{'Digalactosyldiacylglycerols' } {'DGDG' }
{'1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine' } {'DGTS' }
{'Ether Oxygenated Phosphatidylcholines' } {'EtherOxPC' }
{'Ether Oxygenated Phosphatidylethanolamines' } {'EtherOxPE' }
{'Ether-linked Phosphatidylcoline' } {'EtherPC' }
{'Ether-linked Phosphatidylethanolamine' } {'EtherPE' }
{'Fatty Acids' } {'FA' }
{'Fatty acid ester of hydroxyl fatty acid' } {'FAHFA' }
{'Glucuronosyldiacylglycerol' } {'GlcADG' }
{'GM3 Ganglioside' } {'GM3' }
{'Hidroxy Bis[monoacylglycero]phosphates' } {'HBMP' }
{'Hexosylceramide alpha-hydroxy fatty acid-phytospingosine' } {'HexCerAP' }
{'Hexosylceramide non-hydroxyfatty acid-dihydrosphingosine' } {'HexCerNDS' }
{'Hexosylceramide non-hydroxyfatty acid-sphingosine' } {'HexCer_NS' }
{'Lyso 1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine' } {'DGTS' }
{'Lyso Phosphatidic acids' } {'LPA' }
{'Lyso Phosphatidylcholines' } {'LPC' }
{'Lyso Phosphatidylethanolamines' } {'LPE' }
{'Lyso Phosphatidylglycerols' } {'LPG' }
{'Lyso Phosphatidylinositols' } {'LPI' }
{'Lyso Phosphatidylserines' } {'LPS' }
{'Monoacyl/alkylglycerides' } {'MG' }
{'Monogalactosyldiacylglycerols' } {'MGDG' }
{'Oxygenated Cardiolipins' } {'OxCL' }
{'Oxygenated Fatty Acids' } {'OxFA' }
{'Oxygenated Phosphatidic acids' } {'OxPA' }
{'Oxygenated Phosphatidylcholines' } {'OxPC' }
{'Oxygenated Phosphatidylethanolamines' } {'OxPE' }
{'Oxygenated Phosphatidylglycerols' } {'OxPG' }
{'Oxygenated Phosphatidylinositols' } {'OxPI' }
{'Oxygenated Phosphatidylserines' } {'OxPS' }
{'Oxygenated Triacyl/alkylglycerides' } {'OxTG' }
{'Phosphatidic acids' } {'PA' }
{'Phosphatidylbutyl alcohol' } {'PBtOH' }
{'Phosphatidylcholines' } {'PC' }
{'Phosphatidylethanolamines' } {'PE' }
{'Phosphatidyletanol' } {'PEtOH' }
{'Phosphatidylglycerols' } {'PG' }
{'Phosphatidylinositols' } {'PI' }
{'Phosphatidylmethanol' } {'PMeOH' }
{'Phosphatidylserines' } {'PS' }
{'Sulfatides hexosyl ceramide' } {'SHexCer' }
{'Sphingomyelines' } {'SM' }
{'Sulfoquinovosyl diacylglycerols' } {'SQDG' }
{'Triacyl/alkylglycerides' } {'TG' }
>> foundpattern
foundpattern =
7×4 table
ISDs tR Standard desv RSD
__________________________ ______ _____________ _______
{'18:1 (d7) MG' } 1.34 0.020418 1.5238
{'18:1(d7) LPC' } 1.5868 0.0056024 0.35305
{'18:1 (d9) SM' } 6.8999 0.08336 1.2081
{'15:0-18:1(d7) PC' } 7.989 0.072533 0.90791
{'15:0-18:1(d7) DG' } 12.085 0.097445 0.80631
{'15:0-18:1 (d7)-15:0 TG'} 17.487 0.029701 0.16984
{'Cholesterol (d7)' } 18.247 0.032275 0.17687
The problem arises when the regular expression built from the LipidData value PC is compared with the foundpattern value {'18:1(d7) LPC'}: it produces a match that I don't know how to avoid. I only need to find Short values that appear exactly, not as part of a longer name, within foundpattern.ISDs. The same problem would hypothetically appear if foundpattern contained a Cer_NS entry, which would match not only its LipidData value Cer_NS but also Cer.
I believed that turning each value into a group (using parentheses in the regex), as you can see in the code, would be a solution, but of course the names are 'slightly modified' versions of one another, hence the repetition. I know I am missing something there, but I don't know what.
Is there any way to avoid the repeated matches? As you can see in the OUTPUT, the Codes cell array should have only 7 entries instead of 8.
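To make the overlap concrete, here is a minimal check using two strings from the description above. It is only an illustration (the variable names idxPC and idxCer are made up for this sketch) of the fact that parentheses group a pattern but do not restrict it to a whole token, so the shorter name is still found inside the longer one:

```matlab
% Parentheses only group; they do not anchor the match to a whole token.
idxPC  = regexp('18:1(d7) LPC', '(PC)');   % "PC" is found inside "LPC"
idxCer = regexp('Cer_NS', '(Cer)');        % "Cer" is found inside "Cer_NS"
disp(idxPC)    % 11
disp(idxCer)   % 1
```

Both calls return a non-empty start index, which is why the check in the loop treats them as matches.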
CODE:
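Codes = {};
for j = 1:size(LipidData,1)
    % Wrap the Short value in a group; it is still matched as a substring
    expression = strcat("(", char(LipidData{j,2}), ")");
    for i = 1:size(foundpattern,1)
        % regexp returns the start indices, or empty when nothing matches
        if ~isempty(regexp(char(foundpattern{i,1}), expression, 'once'))
            disp(foundpattern{i,1})
            disp(LipidData{j,2})
            Codes{end+1} = LipidData{j,2};
        end
    end
end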
OUTPUT:
>> Codes
Codes =
1×8 cell array
Columns 1 through 6
{1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell} {1×1 cell}
Columns 7 through 8
{1×1 cell} {1×1 cell}
>> for i=1:size(Codes,2)
Codes{i}
end
ans =
1×1 cell array
{'Cholesterol'}
ans =
1×1 cell array
{'DG'}
ans =
1×1 cell array
{'LPC'}
ans =
1×1 cell array
{'MG'}
ans =
1×1 cell array
{'PC'}
ans =
1×1 cell array
{'PC'}
ans =
1×1 cell array
{'SM'}
ans =
1×1 cell array
{'TG'}
>>
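One possible way to get whole-name matches, sketched under assumptions rather than offered as the definitive fix: surround each Short value with negative lookarounds so that it may not be directly preceded or followed by a word character (letter, digit or underscore). The cell arrays shorts and isds below are hypothetical stand-ins holding a subset of LipidData.Short and the foundpattern.ISDs column, so the snippet runs without the original tables.

```matlab
% Hypothetical stand-ins for LipidData.Short (subset) and foundpattern.ISDs
shorts = {'Cer'; 'Cer_NS'; 'Cholesterol'; 'DG'; 'LPC'; 'MG'; 'PC'; 'SM'; 'TG'};
isds = {'18:1 (d7) MG'; '18:1(d7) LPC'; '18:1 (d9) SM'; '15:0-18:1(d7) PC'; ...
        '15:0-18:1(d7) DG'; '15:0-18:1 (d7)-15:0 TG'; 'Cholesterol (d7)'};

Codes = {};
for j = 1:numel(shorts)
    % Escape regex metacharacters, then forbid a word character
    % immediately before or after the name.
    expression = ['(?<!\w)', regexptranslate('escape', shorts{j}), '(?!\w)'];
    for i = 1:numel(isds)
        if ~isempty(regexp(isds{i}, expression, 'once'))
            Codes{end+1} = shorts{j}; %#ok<AGROW>
        end
    end
end
disp(numel(Codes))   % 7
```

With these inputs PC no longer matches inside LPC, and Cer would not match inside Cer_NS because the underscore counts as a word character. If a Short value could legitimately be adjacent to letters or digits in the real data, the lookaround class would need adjusting.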
Cer_NS if you are looking for Cer?
Cer matches when finding Cer and Cer_NS when Cer_NS. Same with PC, LPC and all the possible problems.
Cer as a whole word in Cer_NS, you can go to the original answer version.