I have a regex problem that I want to keep as general as possible, although my code is written in MATLAB.

INFO:

LipidData is a 68x2 table that contains a name column and a Short column, holding strings such as LPC, PC, AC4PIM2, SHexCer, SQDG and many more. The LipidData table is not going to change, whereas foundpattern may vary depending on the real input data it comes from.

foundpattern is an N×4 table, where N is 7 in my example. The only relevant column is the first one, called ISDs, which contains the strings to check (for reproducibility, you can copy just that column as a cell array). Both MATLAB tables are shown here:

INPUT:

>> LipidData

LipidData =

 68×2 table

                Lipid subclass name                       Short   
___________________________________________________    ___________

{'Diacylated phosphatidylinositol monomannoside'                  }    {'Ac2PIM1'    }
{'Diacylated phosphatidylinositol dimannoside'                    }    {'Ac2PIM2'    }
{'Triacylated phosphatidylinositol dinomannoside'                 }    {'Ac3PIM2'    }
{'Tetraaacylated phosphatidylinositol dimannoside'                }    {'AC4PIM2'    }
{'Anacardic Acid'                                                 }    {'ACar'       }
{'Acetylglucose andrographolide'                                  }    {'AcylGlcADG' }
{'Bis[monoacylglycero]phosphates'                                 }    {'BMP'        }
{'Cholesteryl esters'                                             }    {'CE'         }
{'Ceramide'                                                       }    {'Cer'        }
{'Ceramide alpha-hydroxy fatty acid-dihydrosphingosine'           }    {'CerADS'     }
{'Ceramide alpha-hydroxy fatty acid-phytospingosine'              }    {'CerAP'      }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerAS'      }
{'Ceramide beta-hydroxy fatty acid-dihydrosphingosine'            }    {'CerBDS'     }
{'Ceramide beta-hydroxy fatty acid-sphingosine'                   }    {'CerBS'      }
{'Ceramide Esterified omega-hydroxy fatty acid-dihydrosphingosine'}    {'CerEODS'    }
{'Ceramide Esterified omega-hydroxy fatty acid-sphingosine'       }    {'CerEOS'     }
{'Ceramide non-hydroxyfatty acid-dihydrosphingosine'              }    {'CerNDS'     }
{'Ceramide non-hydroxyfatty acid-phytospingosine'                 }    {'CerNP'      }
{'Ceramide non-hydroxyfatty acid-sphingosine'                     }    {'Cer_NS'     }
{'Ceramide phosphate'                                             }    {'CerP'       }
{'Cholesterol'                                                    }    {'Cholesterol'}
{'Cardiolipins'                                                   }    {'CL'         }
{'Diacyl/alkylglycerides'                                         }    {'DG'         }
{'Digalactosyldiacylglycerols'                                    }    {'DGDG'       }
{'1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'         }    {'DGTS'       }
{'Ether Oxygenated Phosphatidylcholines'                          }    {'EtherOxPC'  }
{'Ether Oxygenated Phosphatidylethanolamines'                     }    {'EtherOxPE'  }
{'Ether-linked Phosphatidylcoline'                                }    {'EtherPC'    }
{'Ether-linked Phosphatidylethanolamine'                          }    {'EtherPE'    }
{'Fatty Acids'                                                    }    {'FA'         }
{'Fatty acid ester of hydroxyl fatty acid'                        }    {'FAHFA'      }
{'Glucuronosyldiacylglycerol'                                     }    {'GlcADG'     }
{'GM3 Ganglioside'                                                }    {'GM3'        }
{'Hidroxy Bis[monoacylglycero]phosphates'                         }    {'HBMP'       }
{'Hexosylceramide alpha-hydroxy fatty acid-phytospingosine'       }    {'HexCerAP'   }
{'Hexosylceramide non-hydroxyfatty acid-dihydrosphingosine'       }    {'HexCerNDS'  }
{'Hexosylceramide non-hydroxyfatty acid-sphingosine'              }    {'HexCer_NS'  }
{'Lyso 1,2-diacylglyceryl-3-O-4'-(N,N,N-trimethyl)-homoserine'    }    {'DGTS'       }
{'Lyso Phosphatidic acids'                                        }    {'LPA'        }
{'Lyso Phosphatidylcholines'                                      }    {'LPC'        }
{'Lyso Phosphatidylethanolamines'                                 }    {'LPE'        }
{'Lyso Phosphatidylglycerols'                                     }    {'LPG'        }
{'Lyso Phosphatidylinositols'                                     }    {'LPI'        }
{'Lyso Phosphatidylserines'                                       }    {'LPS'        }
{'Monoacyl/alkylglycerides'                                       }    {'MG'         }
{'Monogalactosyldiacylglycerols'                                  }    {'MGDG'       }
{'Oxygenated Cardiolipins'                                        }    {'OxCL'       }
{'Oxygenated Fatty Acids'                                         }    {'OxFA'       }
{'Oxygenated Phosphatidic acids'                                  }    {'OxPA'       }
{'Oxygenated Phosphatidylcholines'                                }    {'OxPC'       }
{'Oxygenated Phosphatidylethanolamines'                           }    {'OxPE'       }
{'Oxygenated Phosphatidylglycerols'                               }    {'OxPG'       }
{'Oxygenated Phosphatidylinositols'                               }    {'OxPI'       }
{'Oxygenated Phosphatidylserines'                                 }    {'OxPS'       }
{'Oxygenated Triacyl/alkylglycerides'                             }    {'OxTG'       }
{'Phosphatidic acids'                                             }    {'PA'         }
{'Phosphatidylbutyl alcohol'                                      }    {'PBtOH'      }
{'Phosphatidylcholines'                                           }    {'PC'         }
{'Phosphatidylethanolamines'                                      }    {'PE'         }
{'Phosphatidyletanol'                                             }    {'PEtOH'      }
{'Phosphatidylglycerols'                                          }    {'PG'         }
{'Phosphatidylinositols'                                          }    {'PI'         }
{'Phosphatidylmethanol'                                           }    {'PMeOH'      }
{'Phosphatidylserines'                                            }    {'PS'         }
{'Sulfatides hexosyl ceramide'                                    }    {'SHexCer'    }
{'Sphingomyelines'                                                }    {'SM'         }
{'Sulfoquinovosyl diacylglycerols'                                }    {'SQDG'       }
{'Triacyl/alkylglycerides'                                        }    {'TG'         }


>> foundpattern

foundpattern =

7×4 table

           ISDs                 tR      Standard desv      RSD  
__________________________    ______    _____________    _______

{'18:1 (d7) MG'          }      1.34       0.020418       1.5238
{'18:1(d7) LPC'          }    1.5868      0.0056024      0.35305
{'18:1 (d9) SM'          }    6.8999        0.08336       1.2081
{'15:0-18:1(d7) PC'      }     7.989       0.072533      0.90791
{'15:0-18:1(d7) DG'      }    12.085       0.097445      0.80631
{'15:0-18:1 (d7)-15:0 TG'}    17.487       0.029701      0.16984
{'Cholesterol (d7)'      }    18.247       0.032275      0.17687

The problem arises when the regular expression built from the LipidData value PC is compared with the foundpattern value {'18:1(d7) LPC'}: it produces a match that I don't know how to avoid. I only need to find the exact Short values within foundpattern.ISDs. The same problem would appear if foundpattern hypothetically contained Cer_NS, which would match not only its LipidData value Cer_NS but also Cer.

I believed that turning each value into a group (using parentheses in the regex), as you can see in the code, would be a solution, but the groups still match 'slightly modified' values, hence the repetition. I know I am missing something there, but I don't know what.

Is there any way to avoid these repeated matches? As you can see in the OUTPUT, the Codes cell array should have only 7 entries instead of 8.

CODE:

Codes={};
for j=1:size(LipidData,1)   % loop over every Short code
  % wrap the Short code in a capture group
  expression=strcat("(",char(LipidData{j,2}),")");
  for i=1:size(foundpattern,1)   % loop over every ISD string
    if regexp(char(foundpattern{i,1}),expression) ~= 0
      disp(foundpattern{i,1})
      disp(LipidData{j,2})
      Codes{end+1}=LipidData{j,2};
    end
  end
end

OUTPUT:

>> Codes

Codes =

1×8 cell array

Columns 1 through 6

{1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}

Columns 7 through 8

{1×1 cell}    {1×1 cell}

>> for i=1:size(Codes,2)
Codes{i}
end

ans =

  1×1 cell array

  {'Cholesterol'}


ans =

  1×1 cell array

  {'DG'}


ans =

  1×1 cell array

  {'LPC'}


ans =

  1×1 cell array

  {'MG'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'PC'}


ans =

  1×1 cell array

  {'SM'}


ans =

  1×1 cell array

  {'TG'}

>> 
  • Just a quick question: do you want to find Cer_NS if you are looking for Cer? Commented Sep 20, 2024 at 8:55
  • No, those would be different cases. I only want Cer to match when looking for Cer, and Cer_NS when looking for Cer_NS. The same goes for PC, LPC and all the other possible conflicts. Commented Sep 20, 2024 at 9:06
  • I see, I was a bit tricked by the wording first. Just in case you ever want to make it work in an opposite way, when you want to find Cer as a whole word in Cer_NS, you can go to the original answer version. Commented Sep 20, 2024 at 9:40

1 Answer

You need

expression=strcat('\<(', regexptranslate('escape', char(LipidData{j,2})),')\>')

The \< part matches the start of a word. The regexptranslate('escape', char(LipidData{j,2})) call escapes any special regex metacharacters in the Short value, so the text is matched literally inside the pattern. And \> matches the end of a word.
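The effect of the two anchors can be checked outside MATLAB as well. The following is only a minimal sketch in Python, where `\b` stands in for both `\<` and `\>` (an assumption that holds for codes that start and end with a word character); the helper names `plain_match` and `whole_word_match` are hypothetical and not part of the original code.

```python
import re


def plain_match(code: str, text: str) -> bool:
    """Unanchored group, like the question's pattern (escaping added for safety)."""
    return re.search(f"({re.escape(code)})", text) is not None


def whole_word_match(code: str, text: str) -> bool:
    """Same group, but a word boundary is required on both sides."""
    return re.search(rf"\b({re.escape(code)})\b", text) is not None


if __name__ == "__main__":
    # PC is found inside LPC without boundaries, but not with them.
    print(plain_match("PC", "18:1(d7) LPC"))        # True
    print(whole_word_match("PC", "18:1(d7) LPC"))   # False
    # The underscore counts as a word character, so Cer is not a whole word in Cer_NS.
    print(whole_word_match("Cer", "Cer_NS"))        # False
    print(whole_word_match("Cer_NS", "Cer_NS"))     # True
```

Because `_` is a word character, the boundary after `Cer` is missing in `Cer_NS`, which is what keeps the shorter code from matching.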

See this regex demo.

3 Comments

So far so good. I hope when the data changes it works as well! Thank you very much!
How can I do the same in Python?
@PARCB Even simpler: expression = fr'\b({re.escape(LipidData)})\b'. But you might need expression = fr'(?!\B\w)({re.escape(LipidData)})(?!\B\w)' in the end.
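As a rough illustration of the two patterns in that comment, here is a small Python sketch run against the ISDs column from the question. The `shorts` list is only a hand-picked subset of the 68 Short codes, and the function names `find_codes` and `adaptive_pattern` are hypothetical; the second pattern is assumed to be useful only when a search term starts or ends with a non-word character, which none of the listed Short codes do.

```python
import re

# ISDs column copied from the question's foundpattern table.
isds = [
    "18:1 (d7) MG",
    "18:1(d7) LPC",
    "18:1 (d9) SM",
    "15:0-18:1(d7) PC",
    "15:0-18:1(d7) DG",
    "15:0-18:1 (d7)-15:0 TG",
    "Cholesterol (d7)",
]

# Subset of the Short column, chosen for illustration only.
shorts = ["Cer", "Cer_NS", "Cholesterol", "DG", "LPC", "MG", "PC", "SM", "TG"]


def find_codes(codes, texts):
    """Collect each code once per text in which it appears as a whole word."""
    found = []
    for code in codes:
        pattern = re.compile(rf"\b({re.escape(code)})\b")
        for text in texts:
            if pattern.search(text):
                found.append(code)
    return found


def adaptive_pattern(term):
    """Require a boundary only on a side where the term has a word character."""
    return re.compile(rf"(?!\B\w)({re.escape(term)})(?!\B\w)")


if __name__ == "__main__":
    print(find_codes(shorts, isds))
    # A term wrapped in punctuation defeats \b but not the adaptive form.
    print(re.search(r"\b(\(d7\))\b", "18:1 (d7) MG"))
    print(adaptive_pattern("(d7)").search("18:1 (d7) MG"))
```

With this subset, `find_codes` yields seven entries and PC is reported only for the PC row, not for the LPC row. The `(d7)` term is purely a made-up example of a search string bounded by non-word characters.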
