
I have two CSV files. The first looks like this:

(screenshot of the first CSV omitted)

The second contains a list of IPs:

139.15.250.196
139.15.5.176

I'd like to check whether any given IP from the first file is in the second file. This seems to work (please correct me or provide hints if my code is broken), but the issue is that the first file contains many duplicate values, e.g. 10.0.0.1 may appear x times, and I was not able to find a way to remove the duplicates. Could you please assist or guide me?

import csv

# collect the first column of ip2.csv into a list
filename = 'ip2.csv'
with open(filename) as f:
    reader = csv.reader(f)
    ip = []
    for row in reader:
        ip.append(row[0])


# collect the first column of bonk_https.csv; after each row,
# print every IP from ip2.csv found in what has been read so far
filename = 'bonk_https.csv'
with open(filename) as f:
    reader = csv.reader(f)
    ip_ext = []
    for row in reader:
        ip_ext.append(row[0])
        for a in ip:
            if a in ip_ext:
                print(a)
  • Have you looked at the Pandas library? You could import the CSVs into Pandas using the read_csv command, deduplicate the list in Pandas, then execute an inner join in Pandas with the merge command to get the list of matching items (see the sketch after these comments). Commented Dec 10, 2018 at 20:45
  • delete duplicates in Pandas: chrisalbon.com/python/data_wrangling/pandas_delete_duplicates Commented Dec 10, 2018 at 20:47
  • merge/join in Pandas: shanelynn.ie/merge-join-dataframes-python-pandas-index-1 Commented Dec 10, 2018 at 20:49
  • Why don't you create a set of IPs instead of a list? Commented Dec 10, 2018 at 20:50
  • Your code clearly isn't what you're running; it'll die immediately with a NameError (because reader isn't defined). Can you post a minimal reproducible example that can actually run? Commented Dec 10, 2018 at 21:15

2 Answers


You can cast any list into a set with set(list). A set only holds one of each item and supports the same member in set membership test as a list. So just cast your ip list to a set.

with open(filename) as f:
    reader = csv.reader(f)
    ip_ext = []
    for row in reader:
        ip_ext.append(row[0])
        for a in set(ip):
            if a in set(ip_ext):  # well, you don't need a set here unless you also have duplicates in ip_ext
                print(a)

Alternatively, just break/continue once you have found your entry; a quick sketch of that idea follows.
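
A quick sketch of the break idea, assuming the ip and ip_ext lists from the question have already been filled:

# deduplicate the first list up front, then stop scanning as soon as a match is found
for a in set(ip):
    for b in ip_ext:
        if a == b:
            print(a)
            break  # each IP is printed at most once, even if ip_ext contains it several times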


4 Comments

Thank you but with your code I'm still getting duplicates :(
Please give us some example data and your code. I currently can't see how you can get duplicates if you compare each member of a set (which no longer has duplicates) exactly once with the ip_ext list that you made, unless ip_ext itself also has duplicates (see the sketch after these comments).
To be sure, I updated my code. Please try it again. And please tell us more about your data.
Thank you. In fact it works :) The second file contains duplicates but that's OK. Please see the EDIT section of my question. I hope you can help me with it!
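
Following up on these comments, a small sketch that deduplicates both lists, assuming the ip and ip_ext lists from the question:

# intersecting the two sets removes duplicates on both sides before comparing
for a in set(ip) & set(ip_ext):
    print(a)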

I suggest that you normalize all the IPs:

with open(...) as f:
    # a set comprehension of _normalized_ IPs; '%d' % int(n) strips excess leading zeros from each octet
    my_ips = {'.'.join('%d' % int(n) for n in t)
              for t in [x.split(',')[0].split('.') for x in f]}

Next, you check each normalized IP from the second file against the IPs contained in the normalized set (note that, unlike the other answers, here you have a single loop, and that checking whether an item is a member of a set, ip in my_ips, is a highly optimized operation):

with open(...) as f:
    for line in f:
        # normalize this line's IP the same way before the membership test
        ip = '.'.join('%d' % int(n) for n in line.split('.'))
        if ip in my_ips:
            ...
        else:
            ...
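
For illustration, the same normalization applied to a single, made-up address:

line = '139.015.250.196'  # example input with a zero-padded octet
print('.'.join('%d' % int(n) for n in line.split('.')))  # prints 139.15.250.196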

