
I have several dataframes pulled from files. The variable that holds all of these dataframes is Data_Tables.

These dataframes all have the same columns, and I want to concatenate them based on the entry in one of the columns. I'll call that column "fruit" for this example.

Within an individual dataframe the fruit is always the same, so if there are 10 dataframes, 3 might be apple, 2 banana, 2 lemon, and 1 each for grape, blueberry, and lime.

So far, my script identifies which fruits appear in more than one dataframe.

In other words, I don't need to worry about grape, blueberry, and lime since there's only one of each, but for apple, banana, and lemon I want to combine the dataframes into three separate tables. So I'll start with 10 tables and end with as many tables as there are fruits (i.e. 6).

In addition, I have a variable that shows how many dataframes contain each fruit that appears more than once, so something like this:

Multiple_Fruit --> [Apple, Banana, Lemon]

Num_Dups_per_Fruit ---> [3, 2, 2]

My problem lies in trying to combine only those fruits. My approach was to do this iteratively, and this is what I have:

for i in range(0, 3):    # b/c 3 fruits w/ multiple dataframes
    for j in range(0, 10):    # b/c 10 total dataframes/files/tables
        if Data_Tables[j].iloc[0, 4] == Multiple_Fruit[i]:    # [0, 4] is specific location where fruit is in large dataframe

Inside the if, I tried to concatenate the tables together as they are found, which didn't work:

Concat_Dup_Fruit[i] = pd.concat([Concat_Dup_Fruit[i], Data_Tables[j]])

Ultimately, the number of files/dataframes and the number of fruits will vary.

  • It may be better to show minimal working code with small example data, and the result you expect. Commented Nov 4 at 17:03
  • What does "didn't work" mean? Did you get an error message? If so, show the FULL error message in the question (not in comments). We can't run your code, we can't see your computer, and we can't read your mind. You have to show all the details. Commented Nov 4 at 17:08
  • I don't know what problem you have, but it may be simpler to collect all matching Data_Tables[j] in a list (e.g. group), and after the for j loop concatenate them with pd.concat(group) and append() the result to Concat_Dup_Fruit. This also works for names that have only one Data_Table. Commented Nov 4 at 17:31

1 Answer

I don't know what problem you have, because you didn't show any error message and you didn't provide minimal working code that we could test.

I created my own minimal working code with small example data, and everything works for me.

In the inner for-loop I collect all tables with the same fruit in a list (group). After that loop I concatenate them with pd.concat(group) and append the result to an initially empty list, results.append(pd.concat(group)), instead of using an index to replace items.

This also makes the code more readable, because it needs neither range() nor indexes.


import pandas as pd

multiple_fruit = ["apple", "banana", "lemon", "grape", "blueberry", "lime"]

data_tables = [
    pd.DataFrame({"A": [1], "B": [1], "C": [1], "D": [1], "Name": ["apple"]}),
    pd.DataFrame({"A": [2], "B": [2], "C": [2], "D": [2], "Name": ["banana"]}),
    pd.DataFrame({"A": [3], "B": [3], "C": [3], "D": [3], "Name": ["lemon"]}),
    pd.DataFrame({"A": [4], "B": [4], "C": [4], "D": [4], "Name": ["grape"]}),
    pd.DataFrame({"A": [5], "B": [5], "C": [5], "D": [5], "Name": ["blueberry"]}),
    pd.DataFrame({"A": [6], "B": [6], "C": [6], "D": [6], "Name": ["lime"]}),
    pd.DataFrame({"A": [7], "B": [7], "C": [7], "D": [7], "Name": ["apple"]}),
    pd.DataFrame({"A": [8], "B": [8], "C": [8], "D": [8], "Name": ["banana"]}),
    pd.DataFrame({"A": [9], "B": [9], "C": [9], "D": [9], "Name": ["lemon"]}),
    pd.DataFrame({"A": [0], "B": [0], "C": [0], "D": [0], "Name": ["apple"]}),
]

# --- grouping and concatenating ---

results = []

for name in multiple_fruit:
    group = []
    print(f"--- grouping: {name} ---")
    for table in data_tables:
        if table.loc[0, "Name"] == name:
            print(table.loc[0, "Name"])
            group.append(table)
    print("number of tables in group:", len(group))
    # results.append(pd.concat(group).reset_index(drop=True))
    results.append(pd.concat(group, ignore_index=True))

# --- display results ---

for item in results:
    print("--- result ---")
    print(item)

Result:

--- grouping: apple ---
apple
apple
apple
number of tables in group: 3
--- grouping: banana ---
banana
banana
number of tables in group: 2
--- grouping: lemon ---
lemon
lemon
number of tables in group: 2
--- grouping: grape ---
grape
number of tables in group: 1
--- grouping: blueberry ---
blueberry
number of tables in group: 1
--- grouping: lime ---
lime
number of tables in group: 1
--- result ---
   A  B  C  D   Name
0  1  1  1  1  apple
1  7  7  7  7  apple
2  0  0  0  0  apple
--- result ---
   A  B  C  D    Name
0  2  2  2  2  banana
1  8  8  8  8  banana
--- result ---
   A  B  C  D   Name
0  3  3  3  3  lemon
1  9  9  9  9  lemon
--- result ---
   A  B  C  D   Name
0  4  4  4  4  grape
--- result ---
   A  B  C  D       Name
0  5  5  5  5  blueberry
--- result ---
   A  B  C  D  Name
0  6  6  6  6  lime

To reduce the number of loop iterations, you could first use one loop to group data_tables in a dict, {"apple": [table, table, ...], "banana": ...}, and then run a second loop to concatenate each group and append it to results.

import pandas as pd

multiple_fruit = ["apple", "banana", "lemon", "grape", "blueberry", "lime"]

data_tables = [
    pd.DataFrame({"A": [1], "B": [1], "C": [1], "D": [1], "Name": ["apple"]}),
    pd.DataFrame({"A": [2], "B": [2], "C": [2], "D": [2], "Name": ["banana"]}),
    pd.DataFrame({"A": [3], "B": [3], "C": [3], "D": [3], "Name": ["lemon"]}),
    pd.DataFrame({"A": [4], "B": [4], "C": [4], "D": [4], "Name": ["grape"]}),
    pd.DataFrame({"A": [5], "B": [5], "C": [5], "D": [5], "Name": ["blueberry"]}),
    pd.DataFrame({"A": [6], "B": [6], "C": [6], "D": [6], "Name": ["lime"]}),
    pd.DataFrame({"A": [7], "B": [7], "C": [7], "D": [7], "Name": ["apple"]}),
    pd.DataFrame({"A": [8], "B": [8], "C": [8], "D": [8], "Name": ["banana"]}),
    pd.DataFrame({"A": [9], "B": [9], "C": [9], "D": [9], "Name": ["lemon"]}),
    pd.DataFrame({"A": [0], "B": [0], "C": [0], "D": [0], "Name": ["apple"]}),
]

# --- grouping ---

all_groups = {}

for table in data_tables:
    name = table.loc[0, "Name"]
    print(name)
    if name not in all_groups:
        all_groups[name] = []
    all_groups[name].append(table)

# --- concatenating ---

results = []

for name in multiple_fruit:
    group = all_groups[name]
    print(f"number of tables in group {name}:", len(group))
    # results.append(pd.concat(group).reset_index(drop=True))
    results.append(pd.concat(group, ignore_index=True))

# --- display results ---

for item in results:
    print("--- result ---")
    print(item)
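
If the list of fruit names is not known in advance (the question says the number of files and fruits will vary), the same dict-grouping idea can probably be written without multiple_fruit at all. This is only a sketch under assumptions: it uses collections.defaultdict instead of the explicit "if name not in all_groups" check, the three sample tables are made up for illustration, and results is a dict keyed by fruit name rather than a list.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data: three one-row tables, two of them "apple".
data_tables = [
    pd.DataFrame({"A": [1], "Name": ["apple"]}),
    pd.DataFrame({"A": [2], "Name": ["banana"]}),
    pd.DataFrame({"A": [7], "Name": ["apple"]}),
]

# --- grouping ---
# defaultdict(list) creates an empty list the first time a name is seen.
all_groups = defaultdict(list)
for table in data_tables:
    name = table["Name"].iloc[0]  # fruit is the same in every row of a table
    all_groups[name].append(table)

# --- concatenating ---
# One concatenated table per fruit, keyed by the fruit name.
results = {
    name: pd.concat(group, ignore_index=True)
    for name, group in all_groups.items()
}

# --- display results ---
for name, frame in results.items():
    print(f"--- result: {name}, {len(frame)} row(s) ---")
    print(frame)
```

With this version the fruit names come from the data itself, so nothing has to be updated when new files or new fruits appear.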

3 Comments

Thank you for your reply and answer, apologies for not being clear enough in my initial post. In short, my main issue was understanding in Python how to do this. The specific line of code I tried didn't work due to some errors (should have included), but I also attacked this from different angles and couldn't figure it out.
Implementing your code was helpful and I got it to work as well, and I understand what most lines/commands mean. I admit I need to familiarize myself with writing for loops without a range. I am new to Python and come from MATLAB, so looping through names like this, rather than numerically, is new to me altogether. The results.append(pd.concat(... is also new to me; I'll be sure to look into the rules for append and make sure I can explain why the calls can be combined like that.
The line results.append(pd.concat(...)) is simply a shorter version of df = pd.concat(...) followed by results.append(df).
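
A small sketch of both points may help. The names fruits, group, by_index, by_item, two_step and one_step are made up for this illustration and are not from the question. The first half compares a MATLAB-style loop over positions with a loop over the items themselves; the second half compares the two-step and one-step forms of append.

```python
import pandas as pd

fruits = ["apple", "banana", "lemon"]

# MATLAB-style: loop over positions, then index into the list.
by_index = []
for i in range(len(fruits)):
    by_index.append(fruits[i])

# Python-style: loop over the items directly, no range() and no index.
by_item = []
for name in fruits:
    by_item.append(name)

group = [pd.DataFrame({"A": [1]}), pd.DataFrame({"A": [7]})]

# Two-step form: store the concatenated table in a variable, then append it.
two_step = []
df = pd.concat(group, ignore_index=True)
two_step.append(df)

# One-step form: the result of pd.concat() is passed straight to append().
one_step = []
one_step.append(pd.concat(group, ignore_index=True))

print(by_index == by_item)                # both loops collect the same items
print(two_step[0].equals(one_step[0]))    # both forms store an equal table
```

Any expression can be used as a function argument this way; Python evaluates pd.concat(...) first and then hands the resulting dataframe to append().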
