
I am trying to add a value in the adjacent column of a row in sitemap_bp.csv if that row contains a string from mobilesitemap-browse.csv. However, I can't iterate through all the lines in mobilesitemap-browse.csv: the loop gets stuck on the first line. How do I go about solving this?

import csv

with open('sitemap_bp.csv','r') as csvinput:
    with open('mobilesitemap-browse.csv','r') as csvinput2:
        with open('output.csv', 'w') as csvoutput:
            writer = csv.writer(csvoutput, lineterminator='\n')
            sitemap = csv.reader(csvinput)
            mobilesitemap = csv.reader(csvinput2)

            all = []
            row = next(sitemap)
            row.append('mobile')
            all.append(row)

            for mobilerow in mobilesitemap:
                for row in sitemap:
                    #print row[0]
                    if mobilerow[1] in row[0]:
                        #print row, mobilerow[1]
                        all.append((row[0], mobilerow[1]))
                    else:
                        all.append(row)

            writer.writerows(all)
  • This is an aside, but don't use the nested with expressions. You can chain them with commas, e.g. with open('file1.txt') as file1, open('file2.txt') as file2, ... Commented Mar 11, 2015 at 23:54
  • Thank you for the data. Could you show us the output you're ACTUALLY getting? I think the expected output is clear enough Commented Mar 11, 2015 at 23:57
  • Updated snippet of sitemap_bp.csv, Am currently using \d{4,}_\d{4,}_\d{4,}_\d{4,}_\d{4,}|\d{4,}_\d{4,}_\d{4,}_\d{4,}|\d{4,}_\d{4,}_\d{4,}|\d{4,} to capture new types. Commented Mar 12, 2015 at 18:38
  • That's a silly regex. Just do r"\d{4,}(?:_\d{4,})*" Commented Mar 12, 2015 at 18:40
  • Works for me! regex101.com/r/dH2nM7/1 Commented Mar 12, 2015 at 18:45

1 Answer

Personally, I'd parse the data from sitemap_bp.csv into a dictionary first, then use that dictionary to populate the new file.

import csv
import re

with open('sitemap_bp.csv','r') as csvinput, \
        open('mobilesitemap-browse.csv','r') as csvinput2, \
        open('output.csv', 'w') as csvoutput:
    writer = csv.writer(csvoutput, lineterminator='\n')
    sitemap = csvinput # no reason to pipe this through csv.reader
    mobilesitemap = csv.reader(csvinput2)

    # one or more underscore-separated groups of 4+ digits (pattern from the comments)
    item_number = re.compile(r"\d{4,}(?:_\d{4,})*")

    item_number_mapping = {item_number.search(line).group(): line.strip()
                           for line in sitemap if item_number.search(line)}
    # makes a dictionary {item_number: full_url, ...} for each item in sitemap
    # alternate to the above, consider:
    # # item_number_mapping = {}
    # # for line in sitemap:
    # #     line = line.strip()
    # #     match = item_number.search(line)
    # #     if match:
    # #         item_number_mapping[match.group()] = match.string

    # note: raises KeyError if row[1] has no entry in item_number_mapping
    all = [row + [item_number_mapping[row[1]]] for row in mobilesitemap]

    writer.writerows(all)
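
To make the dictionary approach easier to try out, here is a minimal, self-contained sketch that runs on in-memory data instead of the real files. The sample URLs, the names sitemap_text, mobile_text, mapping and rows, and the use of .get() to leave the new column empty when there is no match are all assumptions for illustration; they are not taken from your data.

```python
import csv
import io
import re

# Hypothetical sample data standing in for the two CSV files.
sitemap_text = (
    "url\n"
    "http://example.com/browse/shoes_12345_67890\n"
    "http://example.com/browse/hats_11111\n"
)
mobile_text = (
    "http://m.example.com/a,12345_67890\n"
    "http://m.example.com/b,99999\n"
)

# Pattern suggested in the comments on the question.
item_number = re.compile(r"\d{4,}(?:_\d{4,})*")

# One pass over the sitemap builds {item_number: full_url}.
mapping = {}
for line in io.StringIO(sitemap_text):
    line = line.strip()
    match = item_number.search(line)
    if match:
        mapping[match.group()] = line

# One pass over the mobile sitemap. .get() leaves the new column empty
# instead of raising KeyError when an item number has no match.
rows = [row + [mapping.get(row[1], "")]
        for row in csv.reader(io.StringIO(mobile_text))]
```

Each file is read exactly once, and every lookup is a dictionary access rather than a scan of the whole sitemap.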

My guess is that after the first time through your outer for loop, it tries to iterate through sitemap again but can't since the file is already exhausted. The minimal change for that would be:

        for mobilerow in mobilesitemap:
            csvinput.seek(0) # seek to the start of the file object
            next(sitemap) # skip the header row
            for row in sitemap:
                #print row[0]
                if mobilerow[1] in row[0]:
                    #print row, mobilerow[1]
                    all.append((row[0], mobilerow[1]))
                else:
                    all.append(row)

But the obvious reason not to do this is that it iterates through your sitemap_bp.csv file once per row in mobilesitemap-browse.csv, rather than just once like my code.
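
If you want to see the exhaustion behaviour in isolation, the following small sketch reproduces it with an in-memory file. The io.StringIO object and its contents are hypothetical stand-ins for sitemap_bp.csv, and I'm assuming CPython's csv.reader, which simply pulls lines from the underlying file object.

```python
import csv
import io

# Hypothetical in-memory file standing in for sitemap_bp.csv.
source = io.StringIO("url\nhttp://example.com/1234\nhttp://example.com/5678\n")
reader = csv.reader(source)

first_pass = list(reader)    # consumes every row
second_pass = list(reader)   # nothing left to read, so this is empty

source.seek(0)               # rewind the underlying file object
third_pass = list(reader)    # the same reader yields the rows again
```

The empty second pass is what happens to your inner for loop on every outer iteration after the first.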

EDIT per question in comments

If you need to get a list of those URLs in sitemap_bp.csv that don't correspond with mobilesitemap-browse.csv, you're probably best-served by making a set for all the items you see as you see them, then using set operations to get the unseen items. This takes a little tinkering, but...

# instead of all = [row + [item number ...

seen = set()
all = []

for row in mobilesitemap:
    item_no = row[1]
    if item_no in item_number_mapping:
        all.append(row + [item_number_mapping[item_no]])
        seen.add(item_no)
# after this for loop, `all` is identical to the list comp version
unmatched_items = [item_number_mapping[item_num] for item_num in
                   set(item_number_mapping.keys()) - seen]
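
As a quick check of the set-difference idea, here is a sketch on made-up data. The mapping contents, mobile_rows, matched and unmatched_urls are hypothetical names and values. It relies on the fact that in Python 3 a dict's keys view supports set operations directly, so the set(...) wrapper is optional there.

```python
# Hypothetical mapping of item numbers to sitemap URLs.
item_number_mapping = {
    "1111": "http://example.com/a_1111",
    "2222": "http://example.com/b_2222",
    "3333": "http://example.com/c_3333",
}
# Hypothetical rows from the mobile sitemap: [mobile_url, item_number].
mobile_rows = [["http://m.example.com/a", "1111"],
               ["http://m.example.com/x", "9999"]]

seen = set()
matched = []
for row in mobile_rows:
    item_no = row[1]
    if item_no in item_number_mapping:
        matched.append(row + [item_number_mapping[item_no]])
        seen.add(item_no)

# keys() - seen is a set difference; sorted() only makes the order stable.
unmatched_urls = sorted(item_number_mapping[key]
                        for key in item_number_mapping.keys() - seen)
```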

3 Comments

TY, one thing i need in 'all = []' is the URLs(from sitemap) that do not have a corresponding match, what is the best way to do this? Do I need to iterate through sitemap?
@EliquidVape you mean all the URLs in sitemap_bp.csv or the URLs in mobilesitemap-browse.csv?
All the URLs in sitemap_bp.csv
