How to replace \"" with \""" in R?

Question

I have a tsv that has some fields with some escaped double quotation marks (DQM). Those are fine, but the problem comes when the DQM delimiter comes after one of those and then when the tsv is opened it doesn't recognize the last DQM and so the following field gets included with the previous.

For example, the line:

"9812" "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 "1968" "\N" "13" "Drama"

when opened in a SS, places the first 3 fields fine. But the fourth field it includes all the way up to 1968 when what it should do is only put in the first "L'homme du \"Picardie\", the second of the same in the next field, and so on. And so the problem appears to be that it's not recognizing the " after the \". I tried clicking different options when opening it in SS programs, but nothing fixes it.

Now I've found out that I can fix this before opening it in a SS program by replacing \"" with \""" in a text editor, but I'd like to be able to do it in R when the file is generated.

I've tried several combinations of strings, but I just can't figure it out and I'm hoping someone can point me in the right direction. The following are just some of what I tried.

tv.Subset <- str_replace(tv.Subset, "\\\"\"", "\\\"\"\"") - one of my first attempts, simply escape each character in the string

tv.Subset <- str_replace(tv.Subset, '\\\"\"', '\\\"\"\"') - I wondered if single quotation marks for delimiters might be the trick

tv.Subset <- str_replace(tv.Subset, "\\\\\\"\\"", "\\\\\\"\\"\\"") - I read that you need to do double backslashes to respect both R and regex

Thanks.

Hard to say without input and output examples, but maybe try cat(str_replace("\"\"", r"("")", r"(""")"), "\n") — Andre Wildberg
– Andre Wildberg, Commented Sep 9, 2024 at 23:52
Thanks. I've read the documentation about cat as I saw others mention it, but I don't really understand it. If it's not obvious already, I'm an amateur. This is just for personal use. Doesn't the str_replace portion still need to have the subset ID in it somewhere though? I edited my post with an example. I noticed the forum got rid of a bunch of my backslashes some of which someone kindly edited back in, but the find and replace ones were still missing. I've added them back now that I know how. — kq76
– kq76, Commented Sep 10, 2024 at 11:24
I don't know what an SS editor is but simply getting this line into R do read.table("file.tsv") if there are really TABs between the fields. It escapes the backslash and reads all fields correctly. — Andre Wildberg
– Andre Wildberg, Commented Sep 10, 2024 at 13:37
How are you generating the file in R? I assume SS stands for "spreadsheet." What program exactly are you using to open the files? It's easier to help you if you include a simple reproducible example with sample input that can be used to test and verify possible solutions. — MrFlick
– MrFlick, Commented Sep 10, 2024 at 13:40
It would seem that read.delim(filename, ...) and playing with the argument quote= "\"" should fix your problem — Carl Witthoft
– Carl Witthoft, Commented Sep 10, 2024 at 20:16

TiredSquirrel · Accepted Answer · 2024-09-17 04:56:34Z

If I am understanding your question correctly, R is generating .tsv output files that you cannot open in spreadsheet programs due to the escaped double quotes in R's output.

If so, here is a reproducible example where the third row is your original data:

df <- data.frame(f1=c(1,27,9812,9999),
                 f2=c('tt12345','tt54321','tt0167609','tt98765'),
                 f3=rep("tvSeries",4),
                 f4=c("Some string","Another string","L'homme du \"Picardie\"","Yet another string"),
                 f5=c("Here is another string","Another string again","L'homme du \"Picardie\"","Also a string"),
                 f6=c(0,0,0,0),
                 f7=c(1966,1967,1968,1969),
                 f8=rep("\\N",4),
                 f9=c(11,12,13,14),
                 f10=rep('Drama',4))
write.table(df,'outfile.tsv')

Viewed in R, your data would look like this:

> print(df,row.names=F)
   f1        f2       f3                    f4                     f5 f6   f7  f8 f9   f10
    1   tt12345 tvSeries           Some string Here is another string  0 1966 \\N 11 Drama
   27   tt54321 tvSeries        Another string   Another string again  0 1967 \\N 12 Drama
 9812 tt0167609 tvSeries L'homme du "Picardie"  L'homme du "Picardie"  0 1968 \\N 13 Drama
 9999   tt98765 tvSeries    Yet another string          Also a string  0 1969 \\N 14 Drama

And the command write.table(df,'outfile.tsv') produces this output file, which some spreadsheet programs would fail to read:

"f1" "f2" "f3" "f4" "f5" "f6" "f7" "f8" "f9" "f10"
"1" 1 "tt12345" "tvSeries" "Some string" "Here is another string" 0 1966 "\N" 11 "Drama"
"2" 27 "tt54321" "tvSeries" "Another string" "Another string again" 0 1967 "\N" 12 "Drama"
"3" 9812 "tt0167609" "tvSeries" "L'homme du \"Picardie\"" "L'homme du \"Picardie\"" 0 1968 "\N" 13 "Drama"
"4" 9999 "tt98765" "tvSeries" "Yet another string" "Also a string" 0 1969 "\N" 14 "Drama"

The solution is not to change \" in R like you've been trying to, but instead to just not have R surround strings with quotes in your output. That way the only quotes in the output are the quotes that are part of a title, and spreadsheet programs will be able to treat them accordingly.

Here is the command you can use to save your data without the escaped quotes: write.table(df,'outfile.tsv',quote=F,sep='\t'). This will do two things:

quote=F will stop R from putting quotes around strings in your output. Since those quotes aren't there, it won't escape the quotes in titles.
sep='\t' will make R separate output columns with a tab character rather than a space character (space is the default, and your example looks like it used space). You need sep='\t' when quote=F so spreadsheet programs can tell spaces within strings are still part of the string and not separators.

Your new output file will look like this:

f1  f2  f3  f4  f5  f6  f7  f8  f9  f10
1   1   tt12345 tvSeries    Some string Here is another string  0   1966    \N  11  Drama
2   27  tt54321 tvSeries    Another string  Another string again    0   1967    \N  12  Drama
3   9812    tt0167609   tvSeries    L'homme du "Picardie"   L'homme du "Picardie"   0   1968    \N  13  Drama
4   9999    tt98765 tvSeries    Yet another string  Also a string   0   1969    \N  14  Drama

As long as the new output file uses tab characters, most spreadsheet programs will have no problem reading it like this.

That worked! Thank you very much. Here I was trying to fix it by adding something when what I should have been doing was getting rid of something. Thanks again. :)

Collectives™ on Stack Overflow

How to replace \"" with \""" in R?

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related