I have input files that are CSV format, created in another program. I have no control over them. I'm finding that with those files, line feed characters within doublequotes are incorrectly treated as new rows.

If I recreate the file myself, it works correctly. I see no difference in Notepad between the two files (the one I created and the one created by the other system). In both cases, the line feed characters are ASCII 10 and the doublequote characters are ASCII 34.

In Notepad, both files look like this:

Field1,"Field2a
Field2b",Field3

In the file I created, when I open it as a CSV in Excel, Cell B1 looks like this. This is correct; the line feed character within the doublequotes is treated as a literal, not as a new row.

[screenshot: cell B1 shows both lines in a single cell]

But in the file from the other system, even though it looks exactly the same in Notepad, with the same ASCII 10 line feed and the same ASCII 34 doublequotes, when I open it as a CSV in Excel, Cell B1 looks like this. The line feed character, though it is within doublequote characters, is incorrectly treated as indicating a new row:

[screenshot: the text after the line feed is split into a new row]

Unfortunately, as far as I understand, I'm not able to upload the files here.

Any suggestions?

I'm using Excel 2021.

UPDATE:

I was asked to view them in Notepad++ (View > Show Symbol > Show All Characters). I downloaded Notepad++ and viewed them like that; they look the same there too:

[two screenshots: both files in Notepad++ with all characters shown, appearing identical]

UPDATE 2:

I installed the Notepad++ Compare plugin. When I used it to compare, it said this:

[screenshot: the Compare plugin warns that the files have different encodings]

I clicked Yes to compare anyway, and it said this:

[screenshot: the Compare plugin reports the files' content is the same]

So it seems there is some difference between the files, but not in their content?

UPDATE 3:

I was asked about the encoding shown in the lower-right corner of the Notepad++ window.

  • The one that works correctly is UTF-8.
  • The one with the problem is UTF-8-BOM.
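The difference can be confirmed programmatically: a UTF-8 BOM is the three bytes EF BB BF at the very start of the file. Here is a minimal Python sketch, with placeholder filenames and sample content standing in for the two real CSVs:

```python
import codecs

def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8  # b"\xef\xbb\xbf"

# Demo with two small sample files mimicking the two CSVs:
content = 'Field1,"Field2a\nField2b",Field3\n'
with open("good.csv", "w", encoding="utf-8") as f:
    f.write(content)
with open("bad.csv", "w", encoding="utf-8-sig") as f:  # utf-8-sig adds the BOM
    f.write(content)

print(has_utf8_bom("good.csv"))  # False
print(has_utf8_bom("bad.csv"))   # True
```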

Any suggestions on what I can do about this? Again I have no control over these input files.

  • If the two files really were absolutely, completely identical (same encoding, exactly the same contents), then Excel would treat them no differently. That would indicate that there is a difference of some kind between the two files. Perhaps try viewing the files in something like Notepad++ where you can View -> Show Symbol -> Show All Characters. Commented Jun 2 at 23:59
  • @Craig Thanks but they look the same in Notepad++ too. I'll update the question with pictures of that. Is there a diff tool that might help identify the difference? Commented Jun 3 at 0:21
  • I suggested that in Notepad++ you should do View -> Show Symbol -> Show All Characters. That will actually reveal the various "hidden" characters that are within the file contents. Update the question with a screenshot of that Commented Jun 3 at 0:32
  • @Craig Whoops, sorry I missed that. After viewing them under View -> Show Symbol -> Show All Characters, they still look the same as each other. Pictures corrected above. Commented Jun 3 at 0:38
  • @Craig I downloaded the Notepad++ Compare plugin and tried it. It says the files have different encodings but same content. Pictures above. Any suggestions where to go from here? Commented Jun 3 at 0:54

2 Answers


Use Power Query.

  1. From Excel, click Data, then From Text/CSV.

  2. Point at your CSV file.

  3. For File Origin, select 65001: Unicode (UTF-8).

  4. Now click Close & Load To and have it load to your worksheet.

This setup should work on both the UTF-8 and the UTF-8 BOM files. It did on my end.
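If Power Query isn't an option, the BOM can also be stripped beforehand so that a plain File > Open parses the file like the good one. A minimal Python sketch; the filenames and sample content are placeholders for the real input file:

```python
# Create a stand-in for the problem file: UTF-8 with a BOM
# ("utf-8-sig" writes the byte order mark automatically).
sample = 'Field1,"Field2a\nField2b",Field3\n'
with open("bad.csv", "w", encoding="utf-8-sig", newline="") as f:
    f.write(sample)

# Re-save without the BOM: "utf-8-sig" on read consumes a leading BOM
# if present, and plain "utf-8" on write does not emit one.
with open("bad.csv", "r", encoding="utf-8-sig", newline="") as src:
    data = src.read()
with open("bad_nobom.csv", "w", encoding="utf-8", newline="") as dst:
    dst.write(data)

with open("bad_nobom.csv", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")  # False: the BOM is gone
```

The rest of the file is copied unchanged, so the quoting and the embedded line feeds are untouched.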


3 Comments

Thanks, but as I said in a comment above, "When I use Data > Get Data on the good file, it shows encoding 1252 Western European Windows. When I open the bad file with Data > Get Data and change it from 65001 "Unicode (UTF-8)" to 1252 Western European Windows, the same problem happens; it splits the field as before." That is, the problem files already have 65001: Unicode (UTF-8) encoding. I don't see that using 'From text/CSV' to take a file encoded as 65001: Unicode (UTF-8) and set it to have 65001: Unicode (UTF-8) encoding would change it. Can you elaborate on "It did [work] on my end"?
I created both text file types, and with that PQ setting, both ended up with the text containing the chr(10) in a single cell, as you would want
This could well be an issue that's impacted by different versions of Excel as well as some other factors. Actually, in my case, in the little bit of testing that I've done myself I can't get Excel to load the "middle" field (with the two-line text) into a single cell... so there's definitely more going on here
Try the script below. It re-writes the CSV with Python's csv module:

import csv

# Read the source file and write a normalized copy.
# newline='' lets the csv module handle line endings itself,
# which is required for embedded newlines in quoted fields.
with open("input.csv", "r", newline='', encoding="utf-8") as infile, \
     open("output.csv", "w", newline='', encoding="utf-8-sig") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)  # quote every field
    for row in reader:
        writer.writerow(row)

This script will:

  • Force all fields to be quoted (QUOTE_ALL)
  • Output CRLF line breaks
  • Use UTF-8 with BOM encoding

