
I obtain multiple CSV files from an API, in which I need to remove newlines embedded in the records and join the broken lines back into one record; consider the data provided below.

My code to remove the newlines:

## Loading necessary libraries
import glob
import os
import shutil
import csv

## Assigning necessary path
source_path = "/home/Desktop/Space/"
dest_path = "/home/Desktop/Output/"
# Assigning file_read path to modify the copied CSV files
file_read_path = "/home/Desktop/Output/*.csv"

## Code to copy .csv files from one folder to another
for csv_file in glob.iglob(os.path.join(source_path, "*.csv"), recursive = True):
    shutil.copy(csv_file, dest_path)

## Code to remove newlines from the fields of all .CSV files
for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding = 'ISO-8859-1') as file:
        reader = list(csv.reader(file , delimiter = ","))
        for i in range(0,len(reader)):
            reader[i] = [row_space.replace("\n", "") for row_space in reader[i]]
    with open(filename, "w") as output:
        writer = csv.writer(output, delimiter = ",", dialect = 'unix')
        for row in reader:
            writer.writerow(row)

I actually copy the CSV files into a new folder and then use the above code to remove any newlines present in the files.

5 Comments
  • What is the problem, actually? What is the current output you are getting? Commented Sep 23, 2019 at 9:38
  • The output is the same as the input; there is hardly any impact on the CSV file. Commented Sep 23, 2019 at 9:40
  • Try newline='' to suppress the default newline translation: with open(filename, "w", newline='') as output: Commented Sep 23, 2019 at 9:42
  • Added it, but there is still no change; the newlines are not removed. Commented Sep 23, 2019 at 10:10
  • It would appear that the newlines are associated with the HTML content in your CSV, so I would focus on the fields/columns containing HTML rather than the whole file. Do bear in mind that a newline is data too; by removing all newlines you are modifying how the input dataset is shaped. Commented Sep 23, 2019 at 10:26
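
A minimal, self-contained sketch of the per-field approach (hypothetical one-record sample; it assumes the embedded newlines sit inside properly quoted fields, as the last comment suggests):

```python
import csv
import io

# Hypothetical one-record sample: the quoted HTML field contains a newline.
raw = 'id;"<span>line1\nline2</span>";value\n'

# csv.reader keeps a quoted newline inside its field, so the broken
# physical lines still come back as a single logical record.
rows = list(csv.reader(io.StringIO(raw), delimiter=';'))

# Replace the embedded newlines field by field.
cleaned = [[field.replace('\n', ' ') for field in row] for row in rows]

out = io.StringIO()
csv.writer(out, delimiter=';').writerows(cleaned)
print(out.getvalue())
```

When writing back to a real file, opening it with newline='' (as suggested in the comments) keeps the csv module in control of the line endings.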

2 Answers


You are fixing the CSV files because they contain stray \n characters; the problem here is knowing whether a line is a continuation of the previous line or not. If all lines start with a specific prefix, as in your example SV_a5d15EwfI8Zk1Zr (or just SV_), you can do something like this:

import glob
# this is the FIX PART
# I have the file ./data.csv (contains your example); the fixed version goes to data.csv.FIXED
file_read_path = "./*.csv"
for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding='ISO-8859-1') as file, open(filename + '.FIXED', "w", encoding='ISO-8859-1') as target:
        previous_line = ''
        for line in file:
            # check if it's a new line or a part of the previous line
            if line.startswith('SV_'):
                if previous_line:
                    target.write(previous_line + '\n')
                previous_line = line[:-1]  # remove \n
            else:
                # concatenate the broken part with previous_line
                previous_line += line[:-1]  # remove \n
        # add last line
        target.write(previous_line + '\n')

Output:

SV_a5d15EwfI8Zk1Zr;QID4;"<span style=""font-size:16px;""><strong>HOUR</strong> Interview completed at:</span>";HOUR;TE;SL;;;true;ValidNumber;0;23.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID6;"<span style=""font-size:16px;""><strong>MINUTE</strong> Interview completed:</span>";MIN;TE;SL;;;true;ValidNumber;0;59.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID8;Number of Refusals - no language<br />For <strong>Zero Refusals - no language</strong> use 0;REFUSAL1;TE;SL;;;true;ValidNumber;0;99.0;0.0;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID10;<strong>DAY OF WEEK:</strong>;WEEKDAY;MC;SACOL;TX;;true;;0;;;882;-873;0
SV_a5d15EwfI8Zk1Zr;QID45;"<span style=""font-size:16px;"">Using points from 0 to 10, how likely would you be recommend Gatwick Airport to a friend or colleague?</span><div> </div>";NPSCORE;MC;NPS;;;true;;0;;;882;-873;
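
The same prefix-join idea can be wrapped in a small helper that is easy to test in isolation (a sketch; join_broken_records is a hypothetical name, and it assumes every record starts with the given prefix):

```python
def join_broken_records(lines, prefix='SV_'):
    """Merge physical lines into logical records that start with `prefix`."""
    records = []
    current = ''
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith(prefix):
            # A new record begins; flush the one we were building.
            if current:
                records.append(current)
            current = line
        else:
            # Continuation of the previous record: glue it on directly,
            # matching the behaviour of the loop above.
            current += line
    if current:
        records.append(current)
    return records

# The broken middle line is joined back onto the first record:
print(join_broken_records(['SV_1;a;b\n', 'c;d\n', 'SV_2;x\n']))
```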

EDITS:

It can be simpler using split too; this version fixes the file itself:

import glob
# this is the FIX PART
# I have the file ./data.csv; the fixed version is written back to the same file
file_read_path = "./*.csv"
# assuming that all lines starts with SV_
STARTING_KEYWORD = 'SV_'
for filename in glob.glob(file_read_path):
    with open(filename, "r", encoding='ISO-8859-1') as file:
        lines = file.read().split(STARTING_KEYWORD)
    with open(filename, 'w', encoding='ISO-8859-1') as file:
        file.write('\n'.join(STARTING_KEYWORD + l.replace('\n', '') for l in lines if l))

5 Comments

I have implemented the code using the split method, but the starting keyword is being appended to the column header too. Is there any way to ignore the first row, i.e. the column names?
But you should not have any column names in the line list? Does the keyword used to split exist only at the beginning of each line, or also in the middle?
The keyword is present in all the rows apart from the header row; the problem is that the keyword is being prepended to the first column name.
Just remove the first keyword from the result of the join (I'm on my phone, sorry about this): add slicing at the end of the join call, write((.......for l in lines if l)[len(STARTING_KEYWORD) - 1:]). This will remove the first STARTING_KEYWORD that is added to the first line.
Remove the - 1; the slice should start from [len(STARTING_KEYWORD):]. The goal is to remove the first characters from the text passed to write.
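
Putting the comment thread together, an in-memory sketch of the slicing fix (hypothetical sample data; it assumes the header row is the only line that does not start with the keyword):

```python
STARTING_KEYWORD = 'SV_'
raw = 'col1;col2\nSV_1;a\nSV_2;b\n'

chunks = raw.split(STARTING_KEYWORD)
joined = '\n'.join(STARTING_KEYWORD + c.replace('\n', '') for c in chunks if c)

# The header did not start with the keyword, so the join prepended a
# spurious 'SV_' to it; slice it off the front of the whole text.
fixed = joined[len(STARTING_KEYWORD):]
print(fixed)
```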

Well, I'm not sure what restrictions you have, but if you can use the pandas library, this is simple.

import pandas as pd

# data_file / target_file are your input and output CSV paths
data_set = pd.read_csv(data_file, skip_blank_lines=True)
data_set.to_csv(target_file, index=False)

This will create a CSV file with all blank lines removed. You can save a lot of time with available libraries.
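
One caveat worth noting: skip_blank_lines only drops fully blank lines; a newline embedded inside a quoted field survives into the data. A sketch of removing those too with pandas (hypothetical in-memory sample):

```python
import io

import pandas as pd

# Hypothetical sample: the quoted field carries an embedded newline.
raw = 'a,b\n1,"x\ny"\n2,z\n'
df = pd.read_csv(io.StringIO(raw), skip_blank_lines=True)

# Replace newlines inside string cells; regex=True applies it per cell.
df = df.replace('\n', ' ', regex=True)

buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```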

