161

I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:

df = pd.DataFrame(index=['pbp'],
                  columns=['contract',
                           'state_and_county_code',
                           'state',
                           'county',
                           'starting_membership',
                           'starting_raw_raf',
                           'enrollment_trend',
                           'projected_membership',
                           'projected_raf'],
                  dtype=['str', 'str', 'str', 'str',
                         'int', 'float', 'float',
                         'int', 'float'])

However, I get the following error,

TypeError: data type not understood

What does this mean?

3
  • I don't think you can specify the dtypes in this manner; you can pass a single type such as str, but not a list of types (a minimal sketch follows these comments). The dtype will be inferred when you assign the column values. I think it should be unnecessary to specify them at all. Commented Apr 6, 2016 at 21:05
  • 10
    @EdChum that's true according to the docs; I wonder, though, why the constructor doesn't allow it... wouldn't it be more efficient to create an empty dataframe with the types from the beginning, for allocation purposes? Commented Jan 31, 2018 at 14:27
  • 3
    This would be very useful when concatenating empty DataFrames. The reason I came to this question is that I found that if you concatenate two DataFrames with the same column names but one of them is empty with no dtypes initialized, all columns of the resulting concatenated DataFrame will have dtype object, which then causes an error when serializing to HDF. TL;DR: initializing dtypes from the DataFrame constructor would, in my opinion, be very useful. Commented Feb 28, 2024 at 10:52
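For illustration, a minimal sketch of the point in the first comment: the constructor does accept a single dtype applied to every column; it is only the list form that fails.

import pandas as pd

# A single dtype works -- every column gets it:
df = pd.DataFrame(index=['pbp'], columns=['a', 'b'], dtype=float)
print(df.dtypes)
# a    float64
# b    float64
# dtype: object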

15 Answers

139

You can use the following:

df = pd.DataFrame({'a': pd.Series(dtype='int'),
                   'b': pd.Series(dtype='str'),
                   'c': pd.Series(dtype='float')})

or more abstractly:

df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in {'a': 'int', 'b': 'str', 'c': 'float'}.items()})

If you then use df, you have:

>>> df 
Empty DataFrame 
Columns: [a, b, c]
Index: []

and if you check its types:

>>> df.dtypes
a      int32
b     object
c    float64
dtype: object

1 Comment

This answer also applies to non-empty dataframes, which is what I was looking for: df = pd.DataFrame({'x': [1, 2, 4], 'y': pd.Series(['odd', 'even', 'even'], dtype='category')})
41

One way to do it:

import numpy
import pandas

dtypes = numpy.dtype(
    [
        ("a", str),
        ("b", int),
        ("c", float),
        ("d", numpy.datetime64),
    ]
)
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))
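A quick sanity check of the result (a sketch; exact dtype names vary with NumPy/pandas versions and platform, and the zero-length str field typically comes through as object):

print(df.dtypes)
# roughly:
# a            object
# b             int64   (int32 on Windows)
# c           float64
# d    datetime64[ns]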

Comments

28

This really smells like a bug.

Here's another (simpler) solution.

import pandas as pd
import numpy as np

def df_empty(columns, dtypes, index=None):
    assert len(columns)==len(dtypes)
    df = pd.DataFrame(index=index)
    for c,d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df

df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
print(list(df.dtypes)) # int64, int64

Comments

28

This is an old question, but I don't see a solid answer (although @eric_g was super close).

You just need to create an empty dataframe from a dictionary of key:value pairs, where the key is your column name and the value is an empty instance of the desired data type (e.g., int() or float()).

So in your example dataset, it would look as follows (pandas 0.25 and python 3.7):

variables = {'contract':'',
             'state_and_county_code':'',
             'state':'',
             'county':'',
             'starting_membership':int(),
             'starting_raw_raf':float(),
             'enrollment_trend':float(),
             'projected_membership':int(),
             'projected_raf':float()}

df = pd.DataFrame(variables, index=[])

In old pandas versions, one may have to do:

df = pd.DataFrame(columns=[variables])
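For the first snippet, a quick check (a sketch; the dtypes are inferred from the scalar placeholders, so the exact integer width may differ by platform):

print(df.dtypes)
# contract                  object
# ...
# starting_membership        int64
# starting_raw_raf         float64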

4 Comments

I do not think that works, because Pandas throws an error saying that dict is an unhashable type (which is understandable). Also, there is no mention of this format in the documentation.
I'm actively using this in my code and it works great. I'm using pandas 0.22.0, how about you?
I also get the same problem as @AnatolyScherbakov. I'm using 0.23.0. This seems like the most direct way, if it would work.
I've updated the above code to work with the most recent version of python and pandas. Hope it helps.
26

My solution (without setting an index) is to initialize a dataframe with the column names and then specify the data types using the astype() method.

schema = {
    'contract' : str, 
    'state_and_county_code': str,
    'state': str,
    'county': str,
    'starting_membership': int,
    'starting_raw_raf': float,
    'enrollment_trend': float,
    'projected_membership': int,
    'projected_raf': float,
}
df = pd.DataFrame(columns=schema).astype(schema)
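If you also need integer columns that can hold missing values, the same pattern extends to pandas' nullable extension dtypes (available since pandas 0.24; a sketch):

nullable_schema = {**schema,
                   'starting_membership': 'Int64',    # nullable integer, accepts pd.NA
                   'projected_membership': 'Int64'}
df = pd.DataFrame(columns=nullable_schema).astype(nullable_schema)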

3 Comments

I came to the same solution. You can define a schema for your data frame using a dict: schema = {'name': str, 'number': float, 'date': np.datetime64}; df = pd.DataFrame(columns=schema.keys()).astype(schema)
@SimonEjsing yours is a more elegant solution, thanks for sharing
Clean solution, and it works for non-empty dataframes too. Great job!
13

This is not a working solution, just a remark.

You can get around the Type Error using np.dtype:

pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))

but you get instead:

NotImplementedError: compound dtypes are not implementedin the DataFrame constructor

5 Comments

This is really the right answer. Even fixing the TypeError, it's still not something that pandas bothered to implement. You can't even copy a dtype from an existing compound-dtype DataFrame to start off a new empty DataFrame, which really seems like a valid use case.
@MikeJarvis if you want to copy the dtypes of an existing frame, you can select 0 rows from that frame and have your empty DF with the same dtypes. For example cpy = df.loc[[False]*len(df)] should do the trick
I don't know what it means for it to be the "right answer" if it doesn't work. I think you're saying something like: "I wish this worked".
This is a misleading "answer", although it carries important information. Maybe it should be rephrased as: "Even though you can get around the type error via .... it would still not be possible because pandas has not implemented it: ..."
@Jan You're right, this is not really an answer. Please feel free to update/rephrase.
5

I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

import numpy as np
import pandas as pd

def make_empty_typed_df(dtype):
    tdict = np.typeDict  # NB: deprecated alias of np.sctypeDict in newer NumPy
    types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
    if any(t == np.void for t in types):
        raise NotImplementedError('Not Implemented for columns of type "void"')
    return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]

Testing this out ...

from itertools import chain

dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.typeDict, set(np.typeDict.values())))]
dtype = [(c, t) for (c, t) in dtype if (np.typeDict.get(t, t) != np.void) and not isinstance(t, int)]

print(make_empty_typed_df(dtype))

Out:

Empty DataFrame

Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
Index: []

[0 rows x 146 columns]

And the datatypes ...

print(make_empty_typed_df(dtype).dtypes)

Out:

col0      timedelta64[ns]
col6               uint16
col16              uint64
col23                int8
col24     timedelta64[ns]
col25                bool
col26           complex64
col27               int64
col29             float64
col30                int8
col31             float16
col32              uint64
col33               uint8
col34              object
col35          complex128
col36               int64
col37               int16
col38               int32
col39               int32
col40             float16
col41              object
col42              uint64
col43              object
col44               int16
col45              object
col46               int64
col47               int16
col48              uint32
col49              object
col50              uint64
               ...       
col144              int32
col145               bool
col146            float64
col147     datetime64[ns]
col148             object
col149             object
col150         complex128
col151    timedelta64[ns]
col152              int32
col153              uint8
col154            float64
col156              int64
col157             uint32
col158             object
col159               int8
col160              int32
col161             uint64
col162              int16
col163             uint32
col164             object
col165     datetime64[ns]
col166            float32
col167               bool
col168            float64
col169         complex128
col170            float16
col171             object
col172             uint16
col173          complex64
col174         complex128
dtype: object

Adding an index gets tricky, because there isn't a true missing value for most data types, so they end up being cast to some other type that has a native missing value (e.g., ints are cast to floats or objects). But if you have complete data of the types you've specified, then you can always insert rows as needed, and your types will be respected. This can be accomplished with:

df.loc[index, :] = new_row

Again, as @Hun pointed out, this is NOT how Pandas is intended to be used.

Comments

5

Taking the lists columns and dtype from your example, you can do the following:

cdt = {c: t for c, t in zip(columns, dtype)}   # make a column: type dict
pdf = pd.DataFrame(columns=list(cdt))          # create an empty dataframe
pdf = pdf.astype(cdt)                          # set the desired column types

The DataFrame docs say that only a single dtype is allowed in the constructor call.

Comments

3

I found the easiest workaround was to simply concatenate a list of empty Series, one for each column:

import pandas as pd

columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract                 0 non-null object
# state_and_county_code    0 non-null object
# state                    0 non-null object
# county                   0 non-null object
# starting_membership      0 non-null int32
# starting_raw_raf         0 non-null float64
# enrollment_trend         0 non-null float64
# projected_membership     0 non-null int32
# projected_raf            0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes

Comments

2

You can do this by passing a dictionary into the DataFrame constructor:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=['pbp'],
                  data={'contract' : np.full(1, "", dtype=str),
                        'starting_membership' : np.full(1, np.nan, dtype=float),
                        'projected_membership' : np.full(1, np.nan, dtype=int)
                       }
                 )

This will correctly give you a dataframe that looks like:

     contract  projected_membership  starting_membership
pbp        ""  -9223372036854775808                  NaN

With dtypes:

contract                 object
projected_membership      int64
starting_membership     float64

That said, there are two things to note:

1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.

2) Why don't you see NaN under projected_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so np.nan gets cast to an integer (the large negative sentinel above). If you want a different default value, you can change it in the np.full call.
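For example, to get 0 instead of the overflow sentinel, a small variation on the dictionary entry above:

'projected_membership' : np.full(1, 0, dtype=int)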

1 Comment

No need to put a bunch of dummy data in the columns when you could use empty arrays.
2

fast(est) & clear: initialize with numpy ndarrays directly

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': np.ndarray((0,), dtype=int),
     'b': np.ndarray((0,), dtype=str),
     'c': np.ndarray((0,), dtype=float)
     }
)
print(df.dtypes)

yields

a      int64
b     object
c    float64
dtype: object

performance benchmark

This is also the fastest of these approaches, as the following IPython session shows:

Python 3.8.8 (default, Apr 13 2021, 19:58:26) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: %timeit pd.DataFrame({'a': np.ndarray((0,), dtype=int), 'b': np.ndarray(
   ...: (0,), dtype=str), 'c': np.ndarray((0,), dtype=float)})

183 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: 

In [4]: def df_empty(columns, dtypes, index=None):
   ...:     assert len(columns)==len(dtypes)
   ...:     df = pd.DataFrame(index=index)
   ...:     for c,d in zip(columns, dtypes):
   ...:         df[c] = pd.Series(dtype=d)
   ...:     return df
   ...: %timeit df_empty(['a', 'b', 'c'], dtypes=[int, str, float])

1.14 ms ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: 

In [5]: %timeit pd.DataFrame({'a': pd.Series(dtype='int'), 'b': pd.Series(dtype='str'), 'c': pd.Series(dtype='float')})
564 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comments

1

pandas doesn't offer a pure integer column (one that can hold missing values). You can either use a float column and convert it to integer as needed, or treat it like an object. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:

df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str3'], dtype=str)
df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
df = pd.concat([df1, df2, df3], axis=1)

    str1 str2 str3 int1 int2  flt1  flt2
pbp  NaN  NaN  NaN  NaN  NaN   NaN   NaN

You can rearrange the col order as you like. But again, this is not the way pandas was supposed to be used.

 df.dtypes
str1     object
str2     object
str3     object
int1     object
int2     object
flt1    float64
flt2    float64
dtype: object

Note that int is treated as object.
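If you do need real integer columns after the concat, one workaround (a sketch, assuming a fill value such as 0 is acceptable) is to fill and cast column by column:

for col in ['int1', 'int2']:
    df[col] = df[col].fillna(0).astype(int)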

5 Comments

What the heck are you talking about? Of course Pandas supports integer columns.
There does seem to be a problem with passing dtype=int with no data, though.
This absolutely looks like a bug; it is still the behavior in the latest release. Did you submit it?
It's expected behavior; it's listed in the caveats. It's due to there being no NaN for int. You can read more about it in the docs.
@VictorUriarte That doesn't explain why no int columns can be specified in the constructor. If you ask for a int column and later insert a nan, the right behaviour would be to promote the column to float, or raise an exception
1

Create an empty dataframe in Pandas, specifying the column types:

import pandas as pd

c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')

df = pd.concat([c1, c2, c3, c4], axis=1)

df.info(verbose=True)

We create the columns as Series with the correct dtypes, then concat the Series into a DataFrame, and that's it.

In effect, this gives us a DataFrame constructor with dtypes:

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   c1      0 non-null      string 
 1   c2      0 non-null      bool   
 2   c3      0 non-null      float64
 3   c4      0 non-null      int32  
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes

Comments

1

One could use a dataclass for easy maintenance, as follows:

import pandas as pd
from dataclasses import dataclass

@dataclass
class Contract:
    contract: str = 'contract'
    state_and_county_code: str = 'zip'
    state: str = 'state'
    county: str = 'county'
    starting_membership: float = 0.0
    starting_raw_raf: float = 0.0
    enrollment_trend: float = 0.0
    projected_membership: int = 0
    projected_raf : float = 0.0

    def empty(self):
        empty_df = pd.DataFrame([self.__dict__]).iloc[0:0]
        return empty_df

To get an empty df, instantiate as follows:

empty_contract_df = Contract().empty()
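A quick check of the result (a sketch; the dtypes follow the dataclass field defaults):

print(Contract().empty().dtypes)
# contract                  object
# ...
# projected_membership       int64
# projected_raf            float64
# dtype: object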

Comments

0

I recommend this:

columns = ["a", "b"]
types = ['float32', 'str']
predefined_size = 10

df = pd.DataFrame({c: pd.Series(index=range(predefined_size), dtype=t) 
                   for c,t in zip(columns, types)})

Advantages

  • supports old pandas versions (e.g. 0.19.2)
  • can initialize both the type and the size
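A quick verification of what this yields (a sketch; as in the other answers, the 'str' column comes through as object):

print(len(df))    # 10
print(df.dtypes)
# a    float32
# b     object
# dtype: object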

Comments
