I have a large Excel spreadsheet and need to read data from certain rows, columns, and cells, then output it as a dataframe in a different format. How can I capture the data in specific cells in a way that still works when the spreadsheet changes? More columns or rows could be added, but I need to keep capturing this data. Could you provide code using Python and pandas, with loops, to capture this data dynamically? Not all cells will be used; only certain rows and columns are needed. Here is an example.

Logic

For a given quarter and ID, output each column name once per unit of its count (a count of 2 gives two rows, a count of 0 gives none). In this case the quarter is q1.22. I created two new columns called date and TYPE.

Here is the excel spreadsheet:

Data

        q1.22           
ID      type1   OFFICE  nontype1    Customer
NY      1       3       1           2
CA      1       33      1           0
TOTALS  2       36      2           1


data = {
    '0': ['id', 'NY', 'CA', 'TOTALS'],
    'q1.22': ['type1', '1', '1', '2'],
    '0_2': ['OFFICE', '3', '33', '36'],
    '0_3': ['nontype1', '1', '1', '2'],
    '0_4': ['Customer', '2', '0', '1']
}

Desired

ID  date    TYPE
NY  q1.22   type1
NY  q1.22   nontype1
NY  q1.22   Customer
NY  q1.22   Customer
CA  q1.22   type1
CA  q1.22   nontype1

Doing

# Define the row indices for both ranges
start_row, end_row = 0, 3  # Rows 1 to 4 (0-based index)

# Define the column indices for the first range (A to C)
start_col_range1, end_col_range1 = 0, 2  # Columns A to C (0-based index)

# Define the column indices for the second range (E to F)
start_col_range2, end_col_range2 = 4, 5  # Columns E to F (0-based index)

# Create an empty list to store the captured data
captured_data = []

# Loop through rows and columns within the first range (A to C)
for row in range(start_row, end_row + 1):
    row_label = df.iloc[row, 0]  # Assuming the ID column is in the first column
    for col in range(start_col_range1, end_col_range1 + 1):
        col_label = df.columns[col]
        value = df.iloc[row, col]
        captured_data.append({'ID': row_label, 'date': df.iloc[0, 0], 'TYPE': col_label})

# Loop through rows and columns within the second range (E to F)
for row in range(start_row, end_row + 1):
    row_label = df.iloc[row, 0]  # Assuming the ID column is in the first column
    for col in range(start_col_range2, end_col_range2 + 1):
        col_label = df.columns[col]
        value = df.iloc[row, col]
        captured_data.append({'ID': row_label, 'date': df.iloc[0, 0], 'TYPE': col_label})

# Convert the captured data into a DataFrame
output_df = pd.DataFrame(captured_data)

However, this is the output:

ID  date    TYPE
0   id  id  Unnamed: 0
1   id  id  q1.22
2   NY  id  Unnamed: 0
3   NY  id  q1.22
4   CA  id  Unnamed: 0
5   CA  id  q1.22
6   TOTALS  id  Unnamed: 0
7   TOTALS  id  q1.22
8   id  id  Unnamed: 3
9   id  id  Unnamed: 4
10  NY  id  Unnamed: 3
11  NY  id  Unnamed: 4
12  CA  id  Unnamed: 3
13  CA  id  Unnamed: 4
14  TOTALS  id  Unnamed: 3
15  TOTALS  id  Unnamed: 4

Any suggestions are appreciated.
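
For reference, three things in the posted loop explain the output above: `df.iloc[0, 0]` is the cell that holds `id`, so every `date` comes out as `id`; `df.columns[col]` returns the placeholder header names, because the real names (`type1`, `OFFICE`, ...) sit in the first data row; and `value` is read but never used, so the counts have no effect. Below is a minimal loop-based sketch that works from the `data` dict as posted. It assumes the quarter label is the header of the second column, and the `SKIP_IDS` and `SKIP_TYPES` lists are hypothetical names chosen to match the desired output, not something stated in the question.

```python
import pandas as pd

# Input as posted: real column names are in the first data row,
# and the quarter label is the header of the second column.
data = {
    '0': ['id', 'NY', 'CA', 'TOTALS'],
    'q1.22': ['type1', '1', '1', '2'],
    '0_2': ['OFFICE', '3', '33', '36'],
    '0_3': ['nontype1', '1', '1', '2'],
    '0_4': ['Customer', '2', '0', '1'],
}
raw = pd.DataFrame(data)

# Hypothetical filters, inferred from the desired output.
SKIP_IDS = ['TOTALS']
SKIP_TYPES = ['OFFICE']

# Assumption: the quarter label sits above the first count column.
quarter = raw.columns[1]

# Promote the first data row to the header and drop it from the body.
body = raw.iloc[1:].copy()
body.columns = raw.iloc[0].tolist()
body = body.rename(columns={'id': 'ID'})

# Loop over whatever rows and columns exist, so added rows or
# columns are picked up without hard-coded index ranges.
rows = []
for _, rec in body.iterrows():
    if rec['ID'] in SKIP_IDS:
        continue
    for col in body.columns[1:]:
        if col in SKIP_TYPES:
            continue
        # Emit the column name once per unit of its count.
        for _ in range(int(rec[col])):
            rows.append({'ID': rec['ID'], 'date': quarter, 'TYPE': col})

output_df = pd.DataFrame(rows, columns=['ID', 'date', 'TYPE'])
print(output_df)
```

Because the loops iterate over `body.iterrows()` and `body.columns` rather than fixed positions, new IDs or new type columns should flow through unchanged; only the two skip lists would need editing.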

  • Explain the logic of the desired output. Commented Apr 22, 2024 at 3:40
  • Yes, I just updated the question. Commented Apr 22, 2024 at 3:43
  • Please provide the code to generate the input dataframe. Commented Apr 22, 2024 at 3:51
  • You should load your data as a MultiIndex, not with type/office/... as a row. Then this would be a simple stack. Commented Apr 22, 2024 at 4:13
  • OK, but this is how the spreadsheet is set up. That is the dilemma… unless you mean I should first convert it so that the columns become row data? @mozway - I just want to read and extract data from certain rows and columns in this large spreadsheet dynamically. Commented Apr 22, 2024 at 4:15
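
The MultiIndex suggestion in the comments could look roughly like the sketch below. It builds the two-level header in memory so the example runs on its own; reading the real file with something like `pd.read_excel(path, header=[0, 1], index_col=0)` is an assumption about the sheet layout (the quarter label would need to span all of its columns), and `path` is hypothetical. The rows and columns filtered out are likewise inferred from the desired output.

```python
import pandas as pd

# Stand-in for a sheet loaded with a two-level header:
# level 0 = quarter ("date"), level 1 = column name ("TYPE").
columns = pd.MultiIndex.from_product(
    [['q1.22'], ['type1', 'OFFICE', 'nontype1', 'Customer']],
    names=['date', 'TYPE'],
)
wide = pd.DataFrame(
    [[1, 3, 1, 2], [1, 33, 1, 0], [2, 36, 2, 1]],
    index=pd.Index(['NY', 'CA', 'TOTALS'], name='ID'),
    columns=columns,
)

# Stack both header levels into the index: one row per (ID, date, TYPE).
counts = wide.stack(['date', 'TYPE']).rename('count').reset_index()

# Drop the rows and columns that should not be captured
# (hypothetical choices based on the desired output).
counts = counts[(counts['ID'] != 'TOTALS') & (counts['TYPE'] != 'OFFICE')]

# Repeat each row by its count; rows with a count of 0 disappear.
output_df = (
    counts.loc[counts.index.repeat(counts['count']), ['ID', 'date', 'TYPE']]
    .reset_index(drop=True)
)
print(output_df)
```

With this layout, additional quarters would simply be more values in the `date` level, so the same stack should cover them without extra loops. Row order after `stack` may differ between pandas versions, so sort explicitly if order matters.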
