Substrings in Python. Copies in memory?

Question

Say I have a string my_string in Python and that I tokenize it according to some_pattern:

match = re.search(some_pattern, my_string)
string_1 = match.group(1)
string_2 = match.group(2)
....

Are string_1 and string_2 ("deep") copies of the substrings in my_string or references to the same location in memory? Do string_1 and string_2 allocate memory for full copies of the characters in my_string?

Please note that I am not asking about the immutability of the strings. If my_string is very long, I would like to know what is the hit in memory that I take by tokenizing my strings.

I don't need to know exactly how much memory is re-used, but it would certainly be useful to know if a tokenization of a string ends up duplicating memory.

Why the downvote? Please note that I am not asking about the immutability of the strings; I know that strings are immutable in Python.
– Amelio Vazquez-Reina, Commented Dec 4, 2012 at 19:26
Keep in mind you're asking about a very specific implementation detail, which may change between Python versions and will have at least subtle differences between Python implementations.
– Bi Rico, Commented Dec 4, 2012 at 19:37

NPE · Accepted Answer · 2012-12-04 19:38:07Z

5

From looking at the Python 2.7.3 source code, taking a slice of a string makes a copy of the character data:

Objects/stringobject.c:

string_slice() calls the following function, PyString_FromStringAndSize():

/* Inline PyObject_NewVar */
op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);
if (op == NULL)
    return PyErr_NoMemory();
PyObject_INIT_VAR(op, &PyString_Type, size);
op->ob_shash = -1;
op->ob_sstate = SSTATE_NOT_INTERNED;
if (str != NULL)
    Py_MEMCPY(op->ob_sval, str, size);
op->ob_sval[size] = '\0';

Here, str is a pointer to the character data, and size is the length. Note the malloc and the memcpy.

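Different Python implementations (and indeed different versions of CPython) might behave differently. For example, Jython probably uses java.lang.String, whose substring() shared the parent string's character array on older JVMs instead of copying it (since Java 7u6, substring() copies as well).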
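The copy can also be observed from Python itself. The following is a minimal sketch, assuming CPython 3 (not the 2.7.3 source quoted above); slice_allocation is a hypothetical helper name, and the exact byte counts will vary by version and platform. It uses tracemalloc to measure how much memory a single slice allocates.

```python
import sys
import tracemalloc

def slice_allocation(text, start, stop):
    """Return (bytes newly allocated while slicing, the slice itself).

    Hypothetical helper for illustration; assumes CPython, where
    tracemalloc tracks object allocations.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    piece = text[start:stop]  # the operation being measured
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, piece

my_string = "x" * 10_000_000
allocated, piece = slice_allocation(my_string, 1, 5_000_001)

# If the slice shared storage with my_string, both numbers would be tiny.
print("allocated while slicing:", allocated)
print("sys.getsizeof(piece):   ", sys.getsizeof(piece))
```

On CPython both numbers should come out at roughly the length of the slice, which suggests the slice owns a full copy of its characters.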


1 Comment

Peter Schneider Over a year ago

Another hint is the Python C API reference: there is no way to construct a view on the data of a string.
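For binary data the situation is different: the built-in memoryview does provide a view without copying. A minimal sketch, assuming the data can be held as a bytearray rather than a str (the variable names are illustrative only):

```python
# A bytearray is used so the shared storage can be made visible by mutation.
buf = bytearray(b"alpha beta gamma")
view = memoryview(buf)

token = view[6:10]            # slicing a memoryview does not copy the bytes
print(bytes(token))           # materialize a copy only when it is needed

buf[6:10] = b"BETA"           # same-length overwrite of the underlying buffer
print(bytes(token))           # the view reflects the change
```

This does not apply to str objects, so text would have to be encoded to bytes first, which is itself a copy.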

BrenBarn · Accepted Answer · 2012-12-04 19:23:02Z

1

Python strings are immutable, so the distinction isn't that meaningful in this case, but they are copies. Nothing you can do to string_1 and string_2 will affect the contents of my_string.


1 Comment

Amelio Vazquez-Reina Over a year ago

Thanks, but I am not asking about the immutability of the strings. I know they are immutable.

Ashwini Chaudhary · Accepted Answer · 2012-12-04 19:47:05Z

1

Strings are immutable in Python, so substrings are nothing but new objects.

In [7]: str="foobar"

In [8]: id(str)
Out[8]: 140976032

In [10]: id(str[:4])
Out[10]: 141060224

The only case where the substring object returned is the same as the original string object is when the substring equals the whole string:

In [16]: foo="foobar"

In [17]: id(foo)
Out[17]: 140976032

In [18]: id(foo[:])
Out[18]: 140976032

In [19]: foo="foobar"*10000   # huge string

In [20]: id(foo)
Out[20]: 141606344

In [21]: id(foo[:])
Out[21]: 141606344

edited Dec 4, 2012 at 19:47

answered Dec 4, 2012 at 19:22


3 Comments

NPE Over a year ago

One doesn't follow from the other. In Java, strings are immutable, yet substrings refer to the original string's storage.

Amelio Vazquez-Reina Over a year ago

As @NPE mentioned, I am not asking about the immutability of the strings. I know they are immutable. Please see my note in the OP.

Ashwini Chaudhary Over a year ago

@user273158 Taking a substring always results in a new object in Python. For some small strings you might see caching done by Python internally, but for that the substring returned must be equal to the whole string.

Jon Clements · Accepted Answer · 2012-12-04 19:40:22Z

0

Not sure it helps much or even answers your question, but you could use finditer and then slice the original string only on demand...

>>> import re
>>> string = 'abcdefhijkl'
>>> matches = list(re.finditer('.' , string))
>>> dir(matches[0])
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
>>> matches[0].span()
(0, 1)

and then go from there...
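One way to go from there, as a minimal sketch: keep only the (start, end) offsets of each match and slice the original string when a token is actually needed. token_spans is a hypothetical helper name, and the pattern and sample text are illustrative.

```python
import re

def token_spans(pattern, text):
    """Return (start, end) offsets for each match instead of the substrings."""
    return [m.span() for m in re.finditer(pattern, text)]

text = "alpha beta gamma"
spans = token_spans(r"\w+", text)
print(spans)             # [(0, 5), (6, 10), (11, 16)]

# Slice only on demand; each slice is still a copy, but a short-lived one.
start, end = spans[1]
print(text[start:end])   # beta
```

Storing the offsets costs two integers per token regardless of how long the token is.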

