2

I'm learning Python and have run into the following difficulty. The file I want to clean is a .csv file. The words that have to be removed from the .csv file are listed in a .txt file. The .txt file is a list of domain names:

domain.com
domain2.com
domain3.com

The .csv file is a config file just like this:

domain.com;8;Started;C:\inetpub\wwwroot\d\domain.com;"http *:80:www.domain.com"

If the .txt file contains "domain.com", I want the complete line above to be removed. I would be really grateful if some Python ninja could fix this (or in bash?).

3
  • You don't need a Python ninja for this; grep would suffice. See man grep. Commented Feb 23, 2014 at 16:27
  • Where are you having trouble? What have you tried so far? Commented Feb 23, 2014 at 16:28
  • I have tried many Python scripts, but they only allow the file directly in the code; I want to do it from a file. Commented Feb 23, 2014 at 16:30

4 Answers

2

Well, since OP is learning python ...

$ python SCRIPT.py

TXT_file = 'TXT.txt'
CSV_file = 'CSV.csv'
OUT_file = 'OUTPUT.csv'

## From the TXT, create a list of domains you do not want to include in output
with open(TXT_file, 'r') as txt:
    domain_to_be_removed_list = []

    ## for each domain in the TXT
    ## remove the return character at the end of line
    ## and add the domain to the domains-to-be-removed list
    for domain in txt:
        domain = domain.rstrip()
        domain_to_be_removed_list.append(domain)


with open(OUT_file, 'w') as outfile:
    with open(CSV_file, 'r') as csv:

        ## for each line in csv
        ## extract the csv domain
        for line in csv:
            csv_domain = line.split(';')[0]

            ## if csv domain is not in domains-to-be-removed list,
            ## then write that to outfile
            if csv_domain not in domain_to_be_removed_list:
                outfile.write(line)

2

Will this suffice?

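import sys


def main():
    # argv[1]: the ;-separated file to clean, argv[2]: one domain per line
    with open(sys.argv[1]) as fh:
        lines = fh.readlines()
    with open(sys.argv[2]) as fh:
        # strip line endings (including a stray '\r') and skip blank lines
        excludes = set(line.strip() for line in fh if line.strip())

    # keep only the lines whose first field is not an excluded domain;
    # building a new list avoids deleting from a list while iterating over it
    kept = [line for line in lines if line.split(";")[0] not in excludes]

    # overwrite the first file in place with the surviving lines
    with open(sys.argv[1], "w") as fh:
        fh.writelines(kept)


if __name__ == "__main__":
    main()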

execute with:

script.py Domains.txt excludes.txt

2

try:

grep -vf <(sed 's/.*/^&;/' domains.txt) file.csv

@glenn jackman's suggestion is shorter:

grep -wFvf domains.txt file.csv

But with foo.com in domains.txt, it will still match both of the following lines (one of them unwanted):

foo.com;.....
other.foo.com;.....

So...

my domains.txt

dom1.com
dom3.com

my file.csv (only the 1st column needed)

dom1.com;wedwedwe
dom2.com;wedwedwe 2222
dom3.com;wedwedwe 333
dom4.com;wedwedwe 444444

result:

dom2.com;wedwedwe 2222
dom4.com;wedwedwe 444444

If you have a Windows file, where lines end with \r\n rather than just \n, use:

grep -vf <(<domains.txt tr -d '\r' |sed -e 's/.*/^&;/') file.csv
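
Since the question asks about Python, here is a minimal sketch of the same anchoring idea: compare only the first `;`-separated field, exactly, and strip a trailing `\r` so Windows line endings do not break the match. The function name `filter_lines` and the sample data are hypothetical, chosen only for illustration.

```python
def filter_lines(csv_lines, domains):
    """Return the csv_lines whose first ;-separated field is not in domains."""
    # strip() removes the newline and any stray '\r' from Windows files;
    # blank lines in the domain list are skipped.
    excluded = {d.strip() for d in domains if d.strip()}
    return [
        line for line in csv_lines
        if line.split(";", 1)[0].strip() not in excluded
    ]


# Hypothetical sample data with Windows-style line endings.
domains = ["foo.com\r\n"]
csv_lines = ["foo.com;1\r\n", "other.foo.com;2\r\n", "somefoo.com;3\r\n"]

kept = filter_lines(csv_lines, domains)
print(kept)
```

Because the comparison is on the whole first field, other.foo.com and somefoo.com survive, which a plain substring match would not guarantee.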

7 Comments

Does not work, outputs only the content of the csvfile
@HwT It works. See the edit. Maybe your domains.txt contains some other characters... In that case, you should modify the question ;)
This will be faster and safer: grep -Fvf domains.txt file.csv
@glennjackman Not really. :) The fgrep (fixed-string) solution will not work if domains.txt contains foo.com and file.csv contains somefoo.com: that line will be removed too. So it is NOT safer... You need to match the beginning of the line and the semicolon too...
Well, you just need to add -w to the mix of options to only match whole words. I said safer because it's just doing fixed string matching, so any unintended regex chars in the domain list won't inadvertently match
0

This awk one-liner should do the trick:

awk -F';' 'NR == FNR {a[$1]++; next} !($1 in a)' txtfile csvfile
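
For readers following the Python answers, the two-pass logic of this one-liner can be sketched roughly as below. `NR == FNR` is true only while awk reads the first file, so the first loop stands in for that branch. The function name `remove_listed` and the sample lines are hypothetical. Like awk, the sketch strips only `\n`, so a `\r` left over from a Windows-formatted list stays in the key and nothing matches.

```python
def remove_listed(txt_lines, csv_lines):
    """Rough Python rendering of the awk one-liner's two passes."""
    # Pass 1 (awk: NR == FNR {a[$1]++; next}): record the first
    # ;-separated field of every line in the domain list.
    seen = set()
    for line in txt_lines:
        seen.add(line.rstrip("\n").split(";")[0])
    # Pass 2 (awk: !($1 in a)): keep csv lines whose first field is unseen.
    return [l for l in csv_lines if l.rstrip("\n").split(";")[0] not in seen]


# Hypothetical sample data.
csv_lines = ["dom1.com;a\n", "dom2.com;b\n", "dom3.com;c\n"]

unix_result = remove_listed(["dom1.com\n", "dom3.com\n"], csv_lines)
# With Windows line endings the key becomes "dom1.com\r" and never matches.
crlf_result = remove_listed(["dom1.com\r\n", "dom3.com\r\n"], csv_lines)
print(unix_result)
print(crlf_result)
```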

2 Comments

Does not work, outputs only the content of the csvfile
@HwT Your file seems to have Windows formatting. Run it through the dos2unix command first.
