I have a db.sql file that contains many URLs, like the following.

....<td class=\"column-1\"><a href=\"http://geni.us/4Lk5\" rel=nofollow\"><img src=\"http://www.toprateten.com/wp-content/uploads/2016/08/25460A-Panini-Press-Gourmet-Sandwich-Maker.jpg \" alt=\"25460A Panini Press Gourmet Sandwich Maker\" height=\"100\" width=\"100\"></a></td><td class=\"column-2\"><a href=\"http://geni.us/4Lk5\" rel=\"nofollow\">25460A Panini Press Gourmet Sandwich Maker</a></td><td class....

As you can see, the file contains the link http://geni.us/4Lk5.

I also have a product.csv file that contains IDs (like 4Lk5 above) and Amazon product URLs, as follows.

4Lk5    8738    8/16/2016 0:20  https://www.amazon.com/gp/product/B00IWOJRSM/ref=as_li_qf_sp_asin_il_tl?ie=UTF8
Jx9Aj2  8738    8/22/2016 20:16 https://www.amazon.com/gp/product/B007EUSL5U/ref=as_li_qf_sp_asin_il_tl?ie=UTF8
9sl2    8738    8/22/2016 20:18 https://www.amazon.com/gp/product/B00C3GQGVG/ref=as_li_qf_sp_asin_il_tl?ie=UTF8

As you can see, the ID 4Lk5 is paired with an Amazon product URL.

I have already read the CSV file and picked out only the ID and the Amazon product URL with Python.

import csv

def openFile(filename, mode):
    result = []
    with open(filename, mode) as csvfile:
        # delimiter must match the file; change it if the columns are tab-separated
        spamreader = csv.reader(csvfile, delimiter=',')
        for row in spamreader:
            # keep only the ID (first column) and the Amazon URL (fourth column)
            result.append({
                "genu_id": row[0],
                "amazon_url": row[3]
            })
    return result

I need to add code that finds each geni.us URL in db.sql by its genu_id and replaces it with the corresponding amazon_url from the code above.

Please help me.

  • Why would you want to use a regex for this, rather than parsing the cell contents with lxml.html or similar? Commented Jun 6, 2017 at 16:15
  • I'm new to Python, so I'm not sure. I thought I had to use a regex in order to select 'http://' + 'geni.us/4Lk5' in ...**-1\"><a href=\"geni.us/4Lk5\" rel=nofol...** Commented Jun 6, 2017 at 16:18

1 Answer

There is no need for a regex when the structure is this predictable. If all links have the form http://geni.us/<geni_id>, a simple str.replace() will do: read each row of your CSV and replace the matching link in your SQL file. Something like:

import csv

# text mode with newline="" is what Python 3's csv module expects ("rb" only works on Python 2)
with open("product.csv", newline="") as source, open("db.sql", "r+") as target:  # open the files
    sql_contents = target.read()  # read the SQL file contents
    reader = csv.reader(source, delimiter="\t")  # build a CSV reader, tab as a delimiter
    for row in reader:  # read the CSV line by line
        if len(row) < 4:  # skip blank or incomplete rows
            continue
        # replace any match of http://geni.us/<first_column> with the fourth column's value
        sql_contents = sql_contents.replace("http://geni.us/{}".format(row[0]), row[3])
    target.seek(0)  # seek back to the start of your SQL file
    target.truncate()  # truncate the rest
    target.write(sql_contents)  # write back the changed content
    # ...
    # Profit? :D
Of course, if your original CSV file is comma-delimited, change the delimiter in the csv.reader() call; the sample you posted here looks tab-delimited.
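
One caveat with plain str.replace(): if one ID happens to be a prefix of another (say a hypothetical 4L next to 4Lk5), the shorter one can clobber part of the longer link. A minimal sketch of one way around that is below. It is not part of the original answer, and it does fall back on a regex, but only to capture the whole ID before looking it up in a dict. The function names load_mapping and replace_links are made up for illustration, and the code assumes that IDs consist only of letters and digits and that the CSV is tab-delimited as in the sample.

```python
import csv
import io
import re


def load_mapping(csv_text, delimiter="\t"):
    """Map each short ID (column 0) to its Amazon URL (column 3)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    # Rows with fewer than four columns (e.g. blank lines) are skipped.
    return {row[0].strip(): row[3].strip() for row in reader if len(row) > 3}


def replace_links(sql_text, mapping):
    """Swap every http://geni.us/<id> for the mapped URL in a single pass."""
    # Assumption: an ID is a run of ASCII letters and digits, so the match
    # ends at the backslash or quote that follows the link in the SQL dump.
    pattern = re.compile(r"http://geni\.us/([A-Za-z0-9]+)")

    def swap(match):
        # IDs that are not in the CSV keep their original link.
        return mapping.get(match.group(1), match.group(0))

    return pattern.sub(swap, sql_text)
```

To apply it to the real files, read product.csv and db.sql into strings, pass them through these two functions, and write the result back the same way as in the snippet above. Because the whole ID is captured before the lookup, the order of the CSV rows no longer matters.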
