161

I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:

df = pd.DataFrame(index=['pbp'],
                  columns=['contract',
                           'state_and_county_code',
                           'state',
                           'county',
                           'starting_membership',
                           'starting_raw_raf',
                           'enrollment_trend',
                           'projected_membership',
                           'projected_raf'],
                  dtype=['str', 'str', 'str', 'str',
                         'int', 'float', 'float',
                         'int', 'float'])

However, I get the following error,

TypeError: data type not understood

What does this mean?

3
  • I don't think you can specify the dtypes in this manner; you can pass a single type such as str, but not a list of types (a minimal sketch follows these comments). The dtype will be inferred when you assign the column values. I think it should be unnecessary to specify them at all. Commented Apr 6, 2016 at 21:05
  • 10
    @EdChum that's true according to the docs; I wonder, though, why the constructor doesn't allow it... wouldn't it be more efficient to create an empty dataframe with the types from the beginning, for allocation purposes? Commented Jan 31, 2018 at 14:27
  • 3
    This would be very useful when concatenating empty DataFrames. The reason I came to this question is that I found that if you concatenate two DataFrames with the same column names but one of them is empty with no dtypes initialized, all columns of the resulting concatenated DataFrame will have dtype object, which then causes an error when serializing to HDF. TL;DR: initializing dtypes from the DataFrame constructor would, in my opinion, be very useful. Commented Feb 28, 2024 at 10:52
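For illustration, a minimal sketch of the point in the first comment: the constructor does accept a single dtype applied to every column; it is only the list form that fails.

import pandas as pd

# A single dtype works -- every column gets it:
df = pd.DataFrame(index=['pbp'], columns=['a', 'b'], dtype=float)
print(df.dtypes)
# a    float64
# b    float64
# dtype: object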

15 Answers

139

You can use the following:

df = pd.DataFrame({'a': pd.Series(dtype='int'),
                   'b': pd.Series(dtype='str'),
                   'c': pd.Series(dtype='float')})

or more abstractly:

df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})

If you then use df, you have:

>>> df 
Empty DataFrame 
Columns: [a, b, c]
Index: []

and if you check its types:

>>> df.dtypes
a      int32
b     object
c    float64
dtype: object

1 Comment

This answer also applies to non-empty dataframes, which is what I was looking for: df = pd.DataFrame({'x': [1, 2, 4], 'y': pd.Series(['odd', 'even', 'even'], dtype='category')})
41

One way to do it:

import numpy
import pandas

dtypes = numpy.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", numpy.datetime64),
    ]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
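A quick sanity check of the result (a sketch; exact dtype names vary with NumPy/pandas versions and platform, and the zero-length str field typically comes through as object):

print(df.dtypes)
# roughly:
# a            object
# b             int64   (int32 on Windows)
# c           float64
# d    datetime64[ns]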

Comments

28

This really smells like a bug.

Here's another (simpler) solution.

import pandas as pd
import numpy as np

def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64

Comments

28

This is an old question, but I don't see a solid answer (although @eric_g was super close).

You just need to create an empty dataframe from a dictionary of key:value pairs, where the key is your column name and the value is an empty instance of the desired data type (e.g., int() or float()).

So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):

variables = {'contract':'',
             'state_and_county_code':'',
             'state':'',
             'county':'',
             'starting_membership':int(),
             'starting_raw_raf':float(),
             'enrollment_trend':float(),
             'projected_membership':int(),
             'projected_raf':float()}

df = pd.DataFrame(variables, index=[])

In old pandas versions, one may have to do:

df = pd.DataFrame(columns=[variables])
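For the first snippet, a quick check (a sketch; the dtypes are inferred from the scalar placeholders, so the exact integer width may differ by platform):

print(df.dtypes)
# contract                  object
# ...
# starting_membership        int64
# starting_raw_raf         float64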

4 Comments

I do not think that works, because Pandas throws an error saying that dict is an unhashable type (which is understandable). Also, there is no mention of this format in the documentation.
I'm actively using this in my code and it works great. I'm using pandas 0.22.0, how about you?
I also get the same problem as @AnatolyScherbakov. I'm using 0.23.0. This seems like the most direct way, if it would work.
I've updated the above code to work with the most recent version of python and pandas. Hope it helps.
26

My solution (without setting an index) is to initialize a dataframe with the column names and then specify the data types using the astype() method.

schema = {
    'contract' : str, 
    'state_and_county_code': str,
    'state': str,
    'county': str,
    'starting_membership': int,
    'starting_raw_raf': float,
    'enrollment_trend': float,
    'projected_membership': int,
    'projected_raf': float,
}
df = pd.DataFrame(columns=schema).astype(schema)
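If you also need integer columns that can hold missing values, the same pattern extends to pandas' nullable extension dtypes (available since pandas 0.24; a sketch):

nullable_schema = {**schema,
                   'starting_membership': 'Int64',    # nullable integer, accepts pd.NA
                   'projected_membership': 'Int64'}
df = pd.DataFrame(columns=nullable_schema).astype(nullable_schema)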

3 Comments

I came to the same solution. You can define a schema for your data frame using a dict: schema = {'name': str, 'number': float, 'date': np.datetime64}; df = pd.DataFrame(columns=schema.keys()).astype(schema)
@SimonEjsing yours is a more elegant solution, thanks for sharing
Clean solution, and it works for non-empty dataframes too. Great job!
13

This is not a working solution, just a remark.

You can get around the Type Error using np.dtype:

pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))

but you get instead:

NotImplementedError: compound dtypes are not implementedin the DataFrame constructor

5 Comments

This is really the right answer. Even fixing the TypeError, it's still not something that pandas bothered to implement. You can't even copy a dtype from an existing compound-dtype DataFrame to start off a new empty DataFrame, which really seems like a valid use case.
@MikeJarvis if you want to copy the dtypes of an existing frame, you can select 0 rows from that frame and have your empty DF with the same dtypes. For example cpy = df.loc[[False]*len(df)] should do the trick
I don't know what it means for it to be the "right answer" if it doesn't work. I think you're saying something like: "I wish this worked".
This is a misleading "answer", although it carries important information. Maybe it should be rephrased as: "Even though you can get around the type error via .... it would still not be possible because pandas has not implemented it: ..."
@Jan You're right, this is not really an answer. Please feel free to update/rephrase.
5

I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

import numpy as np
import pandas as pd

def make_empty_typed_df(dtype):
    tdict = np.typeDict  # NB: deprecated alias of np.sctypeDict in newer NumPy
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]

Testing this out ...

from itertools import chain

dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.typeDict, set(np.typeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.typeDict.get(t, t) != np.void) and not isinstance(t, int)]

print(make_empty_typed_df(dtype))

Out:

Empty DataFrame

Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []

[0 rows x 146 columns]

And the datatypes ...

print(make_empty_typed_df(dtype).dtypes)

Out:

col0      timedelta64[ns]
col6               uint16
col16              uint64
col23                int8
col24     timedelta64[ns]
col25                bool
col26           complex64
col27               int64
col29             float64
col30                int8
col31             float16
col32              uint64
col33               uint8
col34              object
col35          complex128
col36               int64
col37               int16
col38               int32
col39               int32
col40             float16
col41              object
col42              uint64
col43              object
col44               int16
col45              object
col46               int64
col47               int16
col48              uint32
col49              object
col50              uint64
               ...       
col144              int32
col145               bool
col146            float64
col147     datetime64[ns]
col148             object
col149             object
col150         complex128
col151    timedelta64[ns]
col152              int32
col153              uint8
col154            float64
col156              int64
col157             uint32
col158             object
col159               int8
col160              int32
col161             uint64
col162              int16
col163             uint32
col164             object
col165     datetime64[ns]
col166            float32
col167               bool
col168            float64
col169         complex128
col170            float16
col171             object
col172             uint16
col173          complex64
col174         complex128
dtype: object

Adding an index gets tricky, because there isn't a true missing value for most data types, so they end up being cast to some other type that has a native missing value (e.g., ints are cast to floats or objects). But if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:

df.loc[index, :] = new_row

Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.

Comments

5

Taking the lists columns and dtype from your example, you can do the following:

cdt = {c: t for c, t in zip(columns, dtype)}   # make a column: type dict
pdf = pd.DataFrame(columns=list(cdt))          # create an empty dataframe
pdf = pdf.astype(cdt)                          # set the desired column types

The DataFrame docs say that only a single dtype is allowed in the constructor call.

Comments

3

I found the easiest workaround was to simply concatenate a list of empty Series, one for each column:

import pandas as pd

columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract                 0 non-null object
# state_and_county_code    0 non-null object
# state                    0 non-null object
# county                   0 non-null object
# starting_membership      0 non-null int32
# starting_raw_raf         0 non-null float64
# enrollment_trend         0 non-null float64
# projected_membership     0 non-null int32
# projected_raf            0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes

Comments

2

You can do this by passing a dictionary into the DataFrame constructor:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        'starting_membership' : np.full(1, np.nan, dtype=float),
                        'projected_membership' : np.full(1, np.nan, dtype=int)
                       }
                 )

This will correctly give you a dataframe that looks like:

     contract  projected_membership  starting_membership
pbp        ""  -9223372036854775808                  NaN

With dtypes:

contract                 object
projected_membership      int64
starting_membership     float64

That said, there are two things to note:

1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.

2) Why don't you see NaN under projected_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so np.nan gets cast to an integer (the large negative sentinel above). If you want a different default value, you can change it in the np.full call.
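For example, to get 0 instead of the overflow sentinel, a small variation on the dictionary entry above:

'projected_membership' : np.full(1, 0, dtype=int)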

1 Comment

No need to put a bunch of dummy data in the columns when you could use empty arrays.
2

fast(est) & clear: initialize with numpy ndarrays directly

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': np.ndarray((0,), dtype=int),
     'b': np.ndarray((0,), dtype=str),
     'c': np.ndarray((0,), dtype=float)
     }
)
print(df.dtypes)

yields

a      int64
b     object
c    float64
dtype: object

performance benchmark

This is also the fastest of these approaches, as the following IPython session shows:

Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
   ...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})

183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: 

In [4]: def df_empty(columns, dtypes, index=None):
   ...:     assert len(columns)==len(dtypes)
   ...:     df = pd.DataFrame(index=index)
   ...:     for c,d in zip(columns, dtypes):
   ...:         df[c] = pd.Series(dtype=d)
   ...:     return df
   ...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])

1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: 

In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comments

1

pandas doesn't offer a pure integer column (one that can hold missing values). You can either use a float column and convert it to integer as needed, or treat it like an object. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:

df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str3'], dtype=str)
df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)

    str1 str2 str3 int1 int2  flt1  flt2
pbp  NaN  NaN  NaN  NaN  NaN   NaN   NaN

You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.

 df.dtypes
str1     object
str2     object
str3     object
int1     object
int2     object
flt1    float64
flt2    float64
dtype: object

Note that int is treated as object.
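If you do need real integer columns after the concat, one workaround (a sketch, assuming a fill value such as 0 is acceptable) is to fill and cast column by column:

for col in ['int1', 'int2']:
    df[col] = df[col].fillna(0).astype(int)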

5 Comments

What the heck are you talking about? Of course Pandas supports integer columns.
There does seem to be a problem with passing dtype=int with no data, though.
This absolutely looks like a bug; it is still the behavior in the latest release. Did you submit it?
It's expected behavior; it's listed in the caveats. It's due to there being no NaN for int. You can read more about it in the docs.
@VictorUriarte That doesn't explain why no int columns can be specified in the constructor. If you ask for a int column and later insert a nan, the right behaviour would be to promote the column to float, or raise an exception
1

Create an empty dataframe in Pandas, specifying the column types:

import pandas as pd

c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')

df = pd.concat([c1, c2, c3, c4], axis=1)

df.info(verbose=True)

We create the columns as Series with the correct dtypes, then concat the Series into a DataFrame, and that's it.

In effect, this gives us a DataFrame constructor with dtypes:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      0 non-null      string 
 1   c2      0 non-null      bool   
 2   c3      0 non-null      float64
 3   c4      0 non-null      int32  
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes

Comments

1

One could use a dataclass for easy maintenance, as follows:

import pandas as pd
from dataclasses import dataclass

@dataclass
class Contract:
    contract: str = 'contract'
    state_and_county_code: str = 'zip'
    state: str = 'state'
    county: str = 'county'
    starting_membership: float = 0.0
    starting_raw_raf: float = 0.0
    enrollment_trend: float = 0.0
    projected_membership: int = 0
    projected_raf : float = 0.0

    def empty(self):
        empty_df = pd.DataFrame([self.__dict__]).iloc[0:0]
        return empty_df

To get an empty df, instantiate as follows:

empty_contract_df = Contract().empty()
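A quick check of the result (a sketch; the dtypes follow the dataclass field defaults):

print(Contract().empty().dtypes)
# contract                  object
# ...
# projected_membership       int64
# projected_raf            float64
# dtype: object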

Comments

0

I recommend this:

columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10

df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t) 
                   for c,t in zip(columns, types)})

Advantages

  • supports old pandas versions (e.g. 0.19.2)
  • can initialize both the type and the size
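A quick verification of what this yields (a sketch; as in the other answers, the 'str' column comes through as object):

print(len(df))    # 10
print(df.dtypes)
# a    float32
# b     object
# dtype: object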

Comments
