I'd like to read a lot of Excel files using pandas (Python). When importing the data, I want ALL my columns to be stored as strings.

The problem is that I don't know the number of columns or even their names in advance (they change from file to file). Is there an easy solution for this?

What I tried to do:

import pandas as pd

converters = {i: str for i in range(0, 99)}  # one str converter per possible column index
df = pd.read_excel('example.xlsx', converters=converters)

But the index sometimes goes out of range, since the Excel files have different numbers of columns.

Ideally I'd like to do:

df = pd.read_excel('example.xlsx', converters = ALL)

However, I haven't found anything that lets me do something like this so far...

Thank you for your help.

  • df = pd.read_excel('example.xlsx').asytpe(str) ? Commented Jan 12, 2017 at 15:49
  • MaxU, I don't think DataFrame objects have asytpe attribute Commented Jan 12, 2017 at 15:53
  • can you share the error from using converters = { i : str for i in range(0,99)} Commented Jan 12, 2017 at 15:54
  • piRSquared, "Index is out of Range". Which makes sense, since the Excel file is different every time: sometimes a file has 99 columns, sometimes it has 10. If the dictionary has more elements than the file has columns, the index will be out of range. Commented Jan 12, 2017 at 16:06
  • @user7410504 yeah, I just replicated that... thinking... Commented Jan 12, 2017 at 16:08

1 Answer

UPDATE: I think we can use the xlrd module (the standard Excel reader for pandas) to get the number of columns, and then reuse the same ExcelFile object for reading the data from the Excel file:

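import pandas as pd

fn = 'example.xlsx'  # path to the Excel file
xl = pd.ExcelFile(fn)
ncols = xl.book.sheet_by_index(0).ncols  # assumes the xlrd engine, so xl.book is an xlrd workbook
df = xl.parse(0, converters={i: str for i in range(ncols)})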

OLD answer:

I think you would first have to get the number of columns:

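from openpyxl import load_workbook

filename = 'example.xlsx'  # path to the Excel file
# read-only mode avoids loading the whole workbook into memory
workbook = load_workbook(filename, read_only=True)
col_num = workbook.worksheets[0].max_column

converters = {i: str for i in range(col_num)}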
...
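
A possibly simpler route, as far as I know: more recent pandas versions let read_excel take a dtype argument, so the column count may not be needed at all. The sketch below assumes such a version is installed; read_excel_all_str is a hypothetical helper name, not something from pandas itself.

```python
import pandas as pd


def read_excel_all_str(path, **kwargs):
    """Read an Excel sheet with every column parsed as a string.

    Hypothetical helper; assumes a pandas version whose read_excel
    supports the `dtype` argument. Extra keyword arguments such as
    `skiprows` are passed through unchanged.
    """
    return pd.read_excel(path, dtype=str, **kwargs)


# Usage (assumes 'example.xlsx' exists):
# df = read_excel_all_str('example.xlsx', skiprows=2)
```

Note that empty cells may still come back as missing values rather than strings, so it is worth checking the result on one of your own files.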

4 Comments

When I try to upvote again... it just takes it away. Which isn't what I want. How do I upvote twice? This is my next meta question.
Thanks MaxU: it works for most cases, but sometimes I have extra columns at the end of the file that are not part of the table I extract (I use skiprows to avoid them). So in your code col_num would be too high and the index would be out of range. A solution I found is to use read_excel twice: the first time to get len(df.columns) (after skipping the rows I don't need), and the second time using converters = {i: str for i in range(len(df.columns))}. Nevertheless, I would like to avoid reading the Excel files twice...
@user7410504 if you want to avoid reading it multiple times, it really should be in a better format. This is the reason we use formats, so we can avoid doing inefficient things.
Yeah I agree. It's just that I'm dealing with a lot of heavy files (that I didn't create myself). Reformatting is a pain :)
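
The two-pass idea from the comments above could be sketched roughly as follows, opening the workbook only once and parsing just the header row on the first pass. This is a sketch under assumptions, not a tested recipe: parse_all_as_str is a hypothetical helper, and it assumes a pandas version whose ExcelFile.parse accepts nrows. The sheet is still parsed twice, so it reduces the repeated work rather than removing it.

```python
def parse_all_as_str(xl, sheet=0, skiprows=None):
    """Parse one sheet of an already-open pd.ExcelFile with all columns as str.

    Hypothetical helper. The header row is parsed first (nrows=0) to count
    the columns that remain after `skiprows`; the sheet is then parsed again
    with one `str` converter per column index.
    """
    header = xl.parse(sheet, skiprows=skiprows, nrows=0)
    converters = {i: str for i in range(len(header.columns))}
    return xl.parse(sheet, skiprows=skiprows, converters=converters)


# Usage (assumes 'example.xlsx' exists; the workbook is opened only once):
# import pandas as pd
# with pd.ExcelFile('example.xlsx') as xl:
#     df = parse_all_as_str(xl, skiprows=3)
```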
