I have a malformed csv file which I need to fix:
- The file is supposed to have one record per line, but because of a formatting issue it contains an MS-DOS newline character (^M).
- To make matters worse, the last field of the CSV file is a text field that also contains this MS-DOS newline character, so I can't simply replace every ^M character.
- But the good news is that the first field of the file is a DATE field (MM/DD/YY).
So I tried to replace the pattern (\r\nMM/DD/YY) with (\rMM/DD/YY), but it didn't work. Here is my code snippet:
fixed_content = re.sub(r"""\r\n\d{2})/\d{2}/\d{2}""", r"""\r\1/\2/\3""", malformed_content)
My problems are:
- I don't know how to represent the ^M character as a pattern. I used \r\n.
- I don't know how to refer to previous matches in the new replacing pattern. I used \1 for the first MM pattern, \2 for the next DD pattern and \3 for the last YY pattern.
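Use (...) to create capturing groups, which the replacement can then refer to as backreferences (\1, \2, \3):
r"""\r\n(\d{2})/(\d{2})/(\d{2})"""
But it's much easier to capture the whole date in a single group and just do:
re.sub(r"""\r\n(\d{2}/\d{2}/\d{2})""", r"""\r\1""", malformed_content)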
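To check that both forms behave the same, here is a minimal sketch run against a made-up sample. The sample string and the names `THREE_GROUPS` and `ONE_GROUP` are assumptions for illustration only, not taken from the real file. It also assumes the ^M shown by the editor is a carriage return (\r) and that each record break appears in the data as \r\n.

```python
import re

# Hypothetical sample data: three records, where the text field of the
# second record contains an embedded "\r\n" that must be left alone.
malformed_content = (
    "01/02/20,first note\r\n"
    "01/03/20,second note with\r\nan embedded break\r\n"
    "01/04/20,third note\r\n"
)

# Variant 1: one capturing group per date component, referenced as \1, \2, \3.
THREE_GROUPS = re.compile(r"\r\n(\d{2})/(\d{2})/(\d{2})")
fixed_three = THREE_GROUPS.sub(r"\r\1/\2/\3", malformed_content)

# Variant 2: a single capturing group around the whole date, referenced as \1.
ONE_GROUP = re.compile(r"\r\n(\d{2}/\d{2}/\d{2})")
fixed_content = ONE_GROUP.sub(r"\r\1", malformed_content)

# Only a "\r\n" that is immediately followed by a date is rewritten to "\r";
# the embedded "\r\n" inside the text field and the trailing one are untouched.
print(repr(fixed_content))
```

One thing worth checking if nothing matches on real data: in Python 3, reading a file with open() in the default text mode translates \r\n to \n, so the pattern would never see a \r. Opening the file with newline="" (or in binary mode, with bytes patterns) keeps the original line endings.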