
I hope my question isn't a duplicate of another, but I have searched for three days and haven't found the answer.

Okay, so I have a CSV file with two header rows. The file contains information about hotels: their name, how much they cost (price), their rating and where they are located (Area 1, 2 or 3):

[Image: the imported CSV file]

As you can see, the first row describes the area, while the second row holds the hotel name, price and rating. What I want is to rearrange the file and save it to a new CSV file in the following format:

[Image: the desired output]

So the area information for each hotel has been given its own column. The names in the second row are identical for every area. Is there a way to create this? I am a bit new to these tree-like data structures when they have to be imported. Could it also be done if the tree had more nodes (e.g. if we started with country, moved down to area and then down to hotel name, price and rating)? Can it be done with pandas?

  • Btw I am a total noob in here, so if I have done something wrong, let me know :) Commented Oct 4, 2017 at 16:14

2 Answers


First, could you share the CSV file as text? That makes it much easier to try out a solution; retyping the data from a picture is not productive.

Second, have you tried to achieve this with a script of your own, or with a library? You added the pandas tag but do not mention it in the text. Is there a specific reason it should be pandas?

A solution that works for this one case is simple to write with plain slicing. I guess your format is rather specific and non-standard, so libraries might not help much. pandas, for example, allows multiple rows as a header, but interprets them in a different way; see "pandas dataframe with 2-rows header and export to csv".

A solution idea:

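my_csv_file = "Hotels.csv"  # placeholder path to the semicolon-separated input file

table = []
with open(my_csv_file) as f:
    next(f)  # skip the first header row (area names)
    next(f)  # skip the second header row (Hotel, Price, Rating)
    for line in f:
        # each data row holds three (hotel, price, rating) triples, one per area
        a1, p1, r1, a2, p2, r2, a3, p3, r3 = line.rstrip("\n").split(";")
        table.append([a1, p1, r1, "area1"])
        table.append([a2, p2, r2, "area2"])
        table.append([a3, p3, r3, "area3"])
# ... convert table into dataframe etc.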
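
If the goal is only to write the rearranged rows back to a new CSV file, the standard csv module may already be enough. The following is a minimal sketch under assumptions: the function name write_table, the column names Hotel, Price, Rating and Area, and the three sample rows are made up for illustration and are not part of the original data.

```python
import csv
import io


def write_table(rows, out_file, delimiter=";"):
    """Write the long-format rows, preceded by one header row."""
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    # Assumed column names; adjust them to whatever the output should use.
    writer.writerow(["Hotel", "Price", "Rating", "Area"])
    writer.writerows(rows)


# Illustrative rows in the same shape as `table` above.
table = [
    ["A", "200", "1", "area1"],
    ["D", "350", "3", "area2"],
    ["G", "500", "1", "area3"],
]

# An in-memory buffer stands in for a real output file here.
buffer = io.StringIO()
write_table(table, buffer)
print(buffer.getvalue())
```

To write to disk instead, an open file object (opened with newline="") could be passed in place of the buffer.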

7 Comments

Sure. Is there a specific way to share files on the forum? Otherwise I have created a link: dropbox.com/s/5dpjt5p6p54799o/Hotels.csv?dl=0 The reason it has to be pandas is that I have written other scripts that do further data manipulation later on (not important to the question, just to clarify).
No specific way that I know of, but the best option is a place where the text can be viewed and copied without downloading files that might contain viruses, e.g. a GitHub gist or Pastebin.
Now I have checked my solution with your data and it worked. Just replace the line with open(my_csv_file) as f: with

with io.StringIO("""Area1;;;Area 2;;;Area 3;;
Hotel;Price ;Rating;Hotel;Price ;Rating;Hotel;Price ;Rating
A;200;1;D;350;3;G;500;1
B;500;4;E;400;2;H;200;3
C;300;5;F;500;5;I;700;5
""") as f:

and there you go. Of course this can be generalized. If you do so, maybe use a GitHub gist and share it here, too :)
Can the solution be generalized a bit more? For example, what if there were N different areas and 20 different values describing each hotel (other than the name)? Could your solution be more dynamic? And what if the file had an extra header row describing countries, for example Germany with Area 1 and Area 2 (and a subsequent list of hotel names with corresponding prices and ratings), and France with Area 3, Area 4 and Area 5? Could the solution be generalized to this? Thanks for the help.
Well, I think it is not my task to solve your problem. For me, Stack Overflow is about giving others hints on how they can solve their problems themselves, or pointing out the one line that needs to change to make it finally work. So please come up with an attempt at a generalization yourself, and if you struggle with a certain part, I will help you with that part. But I will not do your homework.

Okay so I created a possible solution to the problem:

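import csv
import pandas as pd

with open("Hotels.csv", newline="") as f:  # file name taken from the shared link
    infile = csv.reader(f, delimiter=';')
    out = []
    counter = 0
    i = 0
    k = 0
    names = []
    temp1 = 0
    for line in infile:
        temp = list(set(line))
        if counter == 0:
            names = line  # first header row: the area names
            counter += 1
        elif counter == 1:
            k = len(list(set(line)))  # number of distinct column names per area
            while i < len(line):
                line.insert(i+k, "Area")  # "Area" is an assumed header for the new column
                i += (k + 1)
            counter += 1
            out.append(line)
        else:
            i = 0
            ind = 0
            while i < len(line):
                line.insert(i+k, names[ind*k])  # area name for this block of columns
                i += (k + 1)
                ind += 1
            out.append(line)
    headers = out.pop(0)
    n = len(set(headers))
    table = pd.DataFrame(out, columns=headers)
    # stack the per-area column blocks on top of each other
    for i in range(0, len(table.columns) // n):
        if i == 0:
            temp1 = table.iloc[:, n*i:n*(i+1)]
        else:
            temp1 = pd.concat([temp1, table.iloc[:, n*i:n*(i+1)]], ignore_index=True)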

I would very much like some input and suggestions to make the solution more elegant or to add extra levels of headers to the file.
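
For comparison, a more compact pandas-only variant might look like the following. This is only a sketch: it assumes every area has the same set of columns, that the area name appears only in the first cell of its block, and that the sample data matches what was posted in the comments above. The names raw, areas, fields, body and tidy are made up for illustration.

```python
import io

import pandas as pd

# Sample data as posted in the comments (semicolon-separated, two header rows).
DATA = """Area1;;;Area 2;;;Area 3;;
Hotel;Price ;Rating;Hotel;Price ;Rating;Hotel;Price ;Rating
A;200;1;D;350;3;G;500;1
B;500;4;E;400;2;H;200;3
C;300;5;F;500;5;I;700;5
"""

# Read everything as plain strings, without treating any row as a header.
raw = pd.read_csv(io.StringIO(DATA), sep=";", header=None, dtype=str)

areas = raw.iloc[0].ffill()       # spread each area name over its empty cells
fields = raw.iloc[1].str.strip()  # "Price " -> "Price"
body = raw.iloc[2:]

frames = []
for area in areas.unique():
    cols = areas[areas == area].index       # column positions of this area
    block = body.loc[:, cols].copy()
    block.columns = fields.loc[cols].tolist()
    block["Area"] = area                    # the new area column
    frames.append(block)

tidy = pd.concat(frames, ignore_index=True)
print(tidy)
# tidy.to_csv("Hotels_long.csv", sep=";", index=False)  # hypothetical output name
```

Because the number of areas is read from the first header row, this should not depend on there being exactly three of them. An extra country row could probably be forward-filled in the same way, but I have not tried that.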

2 Comments

The code is not well separated. Have you thought about splitting it into several functions, for example one for reading the header and one for reading the body? Lines like temp = list(set(line)) look hacky, and comments would help. The variable names seldom explain what they are used for; could you think of better ones? I also suggest using the unittest library to check that the code works under different conditions.
Thanks for the input Mkastner. I will separate it more so it is easier to analyse. I will also rename the variables for easier understanding :)
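
A rough sketch of how the split suggested in the comments could look, using only the standard library. The function names forward_fill, read_header and read_body, the levels parameter and the test class are hypothetical; the sketch assumes every block repeats the same field names and that each grouping label sits in the first cell of its block.

```python
import csv
import io
import unittest

SAMPLE = """Area1;;;Area 2;;;Area 3;;
Hotel;Price ;Rating;Hotel;Price ;Rating;Hotel;Price ;Rating
A;200;1;D;350;3;G;500;1
B;500;4;E;400;2;H;200;3
C;300;5;F;500;5;I;700;5
"""


def forward_fill(cells):
    """Repeat the last non-empty cell so a label covers its whole block."""
    filled, last = [], ""
    for cell in cells:
        last = cell.strip() or last
        filled.append(last)
    return filled


def read_header(reader, levels=1):
    """Read `levels` grouping rows (e.g. area) followed by the field-name row."""
    groups = [forward_fill(next(reader)) for _ in range(levels)]
    fields = [name.strip() for name in next(reader)]
    return groups, fields


def read_body(reader, groups, fields):
    """Turn each wide row into one long row per block, appending the labels."""
    width = len(set(fields))  # assumes every block has the same distinct fields
    rows = []
    for line in reader:
        if not line:  # skip blank lines
            continue
        for start in range(0, len(fields), width):
            labels = [level[start] for level in groups]
            rows.append(line[start:start + width] + labels)
    return rows


class UnpivotTest(unittest.TestCase):
    def test_area_header(self):
        reader = csv.reader(io.StringIO(SAMPLE), delimiter=";")
        groups, fields = read_header(reader)
        rows = read_body(reader, groups, fields)
        self.assertEqual(len(rows), 9)
        self.assertEqual(rows[0], ["A", "200", "1", "Area1"])

    def test_extra_country_row(self):
        text = "Germany;;;;;;France;;\n" + SAMPLE
        reader = csv.reader(io.StringIO(text), delimiter=";")
        groups, fields = read_header(reader, levels=2)
        rows = read_body(reader, groups, fields)
        self.assertEqual(rows[2], ["G", "500", "1", "France", "Area 3"])


reader = csv.reader(io.StringIO(SAMPLE), delimiter=";")
groups, fields = read_header(reader)
rows = read_body(reader, groups, fields)
print(rows[:3])
```

The tests can be run with python -m unittest on the file holding this code. Note that the rows come out in file order (A, D, G, B, ...), not grouped by area; sorting on the last column would group them if that matters.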
