I am trying to find a way to read just one value from a big dataframe in Python. I have 2 data tables in my project.

One looks like this:

Company ID  Company  201512  201511  ...  199402  199401
1234        abc      1.1     0.8     ...  2.1     -0.9
.
.
.
4321        cba      2.1     -0.4    ...  0.3     -0.1

There are about 260 months and 10,000 companies. For each monthly return, I need to check whether there are 36 valid data points behind it, where "valid" means the value is neither 0 nor NaN. If there are 36 valid data points, I need to regress those 36 data points against 7 factors, which are listed in another table.

The other table looks like this:

Month    Factor1     Factor2     ...     Factor6     Factor7  
201512   -0.4        1.1         ...     2.1         1.2
.
.
.
199401   0.1         0.2         ...     0.3         0.4

My problem is that I can't find a way to read just one value at a time from table 1 and loop over the values. Can someone please advise?

  • why is 0 not a valid monthly return? Commented Sep 22, 2017 at 18:58
  • Well, you could use value = df['some_field'].iloc[the_index], but you probably don't want that in a for loop if you can use groupby(...).aggregate() and take a specific value instead. Commented Sep 22, 2017 at 19:01
  • Because 0 is highly likely to be just a missing data point or typo. Commented Sep 22, 2017 at 19:25

2 Answers

You can iterate over the rows with the following code:

for index, row in df.iterrows():
    print(index, row["Company"])  # row is a Series keyed by column name

Here index is the row's index label, and you can access any column by name, for example row["Company"].
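Since the question asks about reading a single value, here is a minimal sketch of scalar access next to the iterrows() loop above. The toy table, its numbers, and the choice of "Company ID" as the index are assumptions made for illustration only.

```python
import pandas as pd

# Toy stand-in for the returns table; the numbers are made up for illustration.
df = pd.DataFrame(
    {
        "Company ID": [1234, 4321],
        "Company": ["abc", "cba"],
        "201512": [1.1, 2.1],
        "201511": [0.8, -0.4],
    }
).set_index("Company ID")

# Label-based scalar lookup: row label 1234, column label "201512".
value_by_label = df.at[1234, "201512"]

# Position-based scalar lookup: first row, second column ("201512").
value_by_position = df.iat[0, 1]

# The same column reached through iterrows(), one row at a time.
values_from_loop = {index: row["201512"] for index, row in df.iterrows()}
```

df.at and df.iat are meant for single-cell reads, so they tend to be a better fit than iterrows() when only one value is needed at a time.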


You don't want a for loop for this.

Assuming 0 is a valid monthly return and that you only have 36 columns after Company, you can easily find all companies with valid monthly return data:

df = df[df.notnull().all(axis=1)]

If, for some reason, you want to get rid of 0s, you can do a replace first:

import numpy as np
df = df[df.replace(0, np.nan).notnull().all(axis=1)]
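To make the two filters concrete, here is a small sketch on a made-up table with three hypothetical companies and three months. The company names and values are illustrative assumptions, not data from the question.

```python
import numpy as np
import pandas as pd

# Made-up returns for three hypothetical companies over three months.
returns = pd.DataFrame(
    {
        "201512": [1.1, 0.0, 2.1],
        "201511": [0.8, 0.5, np.nan],
        "201510": [0.3, 0.7, 0.4],
    },
    index=["abc", "xyz", "cba"],
)

# Keep only rows that contain no NaN at all.
no_nan = returns[returns.notnull().all(axis=1)]

# Treat 0 as missing too, then keep only fully populated rows.
no_nan_no_zero = returns[returns.replace(0, np.nan).notnull().all(axis=1)]
```

In this toy table the first filter drops only "cba" (it has a NaN), while the second also drops "xyz" because of its 0.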

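Edit for the comment:

You could do something like:

cols = df.columns
first_col = get_first_return_col(df)  # placeholder: position of the first return column
for i in range(first_col, len(cols) - 35):
    window = cols[i : i + 36]
    # filter into a new variable so rows dropped here are still checked in later windows
    valid = df[df[window].notnull().all(axis=1)]
    run_regression(valid[window])  # placeholder: regress the 36 returns on the factors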

1 Comment

Thank you for the answer. This helps if I just need one regression for each company, but I actually need to run multiple regressions for each company. It goes like this: I read the 201512 data for company abc, find 36 valid data points after that point, run a regression, and note down the results. Then I check the 201511 data for the same company to see if there are still 36 valid monthly data points. If so, I need to run another regression on those 36 months, which is shifted by just one month from the previous regression.
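The procedure described in this comment could be sketched roughly as follows. This is only an illustration under several assumptions: the returns table has one row per company and one column per month, the factor table is indexed by the same month labels, the window starts at the month being checked, and the regression is ordinary least squares with an intercept fitted via np.linalg.lstsq. The function name rolling_regressions and all data below are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_regressions(returns, factors, window=36):
    """Fit one OLS model per company and per start month whose window is fully valid.

    returns: DataFrame, one row per company, one column per month.
    factors: DataFrame indexed by the same month labels, one column per factor.
    """
    months = list(returns.columns)
    # Treat both NaN and 0 as missing, as the question asks.
    valid = returns.replace(0, np.nan).notna()
    results = []
    for company in returns.index:
        for start in range(len(months) - window + 1):
            span = months[start : start + window]
            if not valid.loc[company, span].all():
                continue  # fewer than `window` valid points: skip this month
            y = returns.loc[company, span].to_numpy(dtype=float)
            X = factors.loc[span].to_numpy(dtype=float)
            X = np.column_stack([np.ones(len(span)), X])  # add an intercept column
            coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
            results.append((company, months[start], coefs))
    return results


# Synthetic data, only to show the call; none of these numbers come from the question.
rng = np.random.default_rng(0)
months = [f"m{i:02d}" for i in range(40)]  # hypothetical month labels
factors = pd.DataFrame(
    rng.normal(size=(40, 7)),
    index=months,
    columns=[f"Factor{i}" for i in range(1, 8)],
)
returns = pd.DataFrame(rng.normal(size=(2, 40)), index=["abc", "cba"], columns=months)
returns.loc["cba", "m38"] = np.nan  # breaks every window that includes m38

fits = rolling_regressions(returns, factors)
```

Each entry in fits holds the company, the month the window starts at, and eight coefficients (intercept plus seven factors). The double loop is written for readability; with roughly 10,000 companies and 260 months it may be slow, and the month labels in both tables need to have the same type for factors.loc[span] to match.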
