Would it be more space efficient to convert columns with binary values to 'category' or 'int8' data type? I'm working with half a million rows and a couple thousand columns of binary values.

UPDATE: Just for clarification, the individual cells will be just a 0 or a 1, not a combination of them.

  • Even a 0 takes up 1 byte. You might expect 1 bit to be enough, but the smallest NumPy/pandas dtypes are one byte wide. Your best option is to pack 8 binary values into each byte and store the result as an array of unsigned integers: see "Converting Binary Numpy Array into Unsigned Integer". Commented Mar 29, 2018 at 17:15
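
To illustrate the packing idea from the comment above, here is a minimal sketch using NumPy's `np.packbits` and `np.unpackbits`. The array shape and the variable names (`bits`, `packed`, `restored`) are made up for the example, and it assumes the data can live in a plain NumPy array rather than a DataFrame.

```python
import numpy as np

# Hypothetical data: 1000 rows x 20 columns of 0/1 values, one byte per cell.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 20), dtype=np.uint8)

# Pack each row's bits into bytes: 8 columns share one uint8.
# 20 columns need ceil(20 / 8) = 3 bytes per row (the last byte is zero-padded).
packed = np.packbits(bits, axis=1)

# Unpack and drop the padding columns to recover the original matrix.
restored = np.unpackbits(packed, axis=1)[:, :bits.shape[1]]

print(bits.nbytes, "bytes unpacked")
print(packed.nbytes, "bytes packed")
```

The trade-off is that individual columns are no longer directly addressable, so this probably suits storage or transfer better than column-wise analysis.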

1 Answer

You can compare the sizes with sys.getsizeof(). It's not quite as simple as the example below makes it seem, but it could help.

import pandas as pd
import sys

# One-row frames holding the same value under four different dtypes
string = pd.DataFrame({'str': ['010101']}, dtype='str')
cat = pd.DataFrame({'cat': ['010101']}, dtype='category')
int8 = pd.DataFrame({'int': ['010101']}, dtype='int8')
int32 = pd.DataFrame({'int': ['010101']}, dtype='int32')

# sys.getsizeof() reports each frame's size in bytes, including fixed per-object overhead
print(sys.getsizeof(string), string.dtypes)
print()
print(sys.getsizeof(cat), cat.dtypes)
print()
print(sys.getsizeof(int8), int8.dtypes)
print()
print(sys.getsizeof(int32), int32.dtypes)

Output:

181 str    object
dtype: object

262 cat    category
dtype: object

105 int    int8
dtype: object

108 int    int32
dtype: object
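
With a single row, the numbers above are mostly fixed overhead, so they say little about half a million rows. As a rough sketch of a larger comparison, the snippet below measures one column of random 0/1 values with `Series.memory_usage(index=False, deep=True)`. The row count and the names `values`, `candidates`, and `sizes` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

n_rows = 100_000  # hypothetical size; scale up to match the real data
values = np.random.default_rng(0).integers(0, 2, size=n_rows, dtype=np.int64)

# The same 0/1 column stored under several candidate dtypes
candidates = {
    "int64": pd.Series(values, dtype="int64"),
    "int8": pd.Series(values, dtype="int8"),
    "bool": pd.Series(values.astype(bool)),
    "category": pd.Series(values).astype("category"),
}

# deep=True counts the underlying data; index=False leaves out the shared index
sizes = {
    name: series.memory_usage(index=False, deep=True)
    for name, series in candidates.items()
}

for name, size in sizes.items():
    print(f"{name:>8}: {size:>9,} bytes")
```

For a two-valued column, int8 and bool each store one byte per value, while category stores one-byte codes plus a small table of the categories, so it should come out slightly larger than int8 rather than smaller.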