5

Say I have a string my_string in Python and I tokenize it according to some_pattern:

import re
match = re.search(some_pattern, my_string)
string_1 = match.group(1)
string_2 = match.group(2)
....

Are string_1 and string_2 ("deep") copies of the substrings in my_string or references to the same location in memory? Do string_1 and string_2 allocate memory for full copies of the characters in my_string?

Please note that I am not asking about the immutability of the strings. If my_string is very long, I would like to know how much extra memory I pay for tokenizing it.

I don't need to know exactly how much memory is re-used, but it would certainly be useful to know if a tokenization of a string ends up duplicating memory.

2
  • 1
    Why the downvote? Please note that I am not asking about the immutability of the strings, I know that strings are immutable in Python. Commented Dec 4, 2012 at 19:26
  • Keep in mind you're asking about a very specific implementation detail, which may change between Python versions and will differ at least subtly between Python implementations. Commented Dec 4, 2012 at 19:37

4 Answers

5

From looking at the Python 2.7.3 source code, taking a slice of a string makes a copy of the character data:

Objects/stringobject.c:

string_slice() calls the following function, PyString_FromStringAndSize():

/* Inline PyObject_NewVar */
op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);
if (op == NULL)
    return PyErr_NoMemory();
PyObject_INIT_VAR(op, &PyString_Type, size);
op->ob_shash = -1;
op->ob_sstate = SSTATE_NOT_INTERNED;
if (str != NULL)
    Py_MEMCPY(op->ob_sval, str, size);
op->ob_sval[size] = '\0';

Here, str is a pointer to the character data, and size is the length. Note the malloc and the memcpy.

Different Python implementations (and indeed different versions of CPython) might behave differently. For example, Jython probably uses java.lang.String, whose substring() shared the parent string's character array in Java releases before 7u6 rather than making a copy.
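
The copy can also be observed from Python itself, without reading the C source. The following is a minimal sketch, assuming CPython 3 (where tracemalloc is available); the input string and the pattern are made up for illustration and are not from the question:

```python
import re
import tracemalloc

# Hypothetical input: two long runs separated by a comma.
my_string = "a" * 1_000_000 + "," + "b" * 1_000_000
match = re.search(r"(a+),(b+)", my_string)

# The match object keeps a reference to the subject string, not a copy.
assert match.string is my_string

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
string_1 = match.group(1)  # the substring object is built here
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Bytes allocated while extracting the group.
grew = after - before
print(f"group(1) has {len(string_1)} chars; traced memory grew by {grew} bytes")
```

If the group shared storage with my_string, the growth would be a small constant; on CPython it is on the order of the substring's length. The exact number depends on the interpreter version and build.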


1 Comment

Another hint is the Python C API reference: there is no way to construct a view on the data of a string.
1

Python strings are immutable, so the distinction isn't that meaningful in this case, but they are copies. Nothing you can do to string_1 and string_2 will affect the contents of my_string.

1 Comment

Thanks, but I am not asking about the immutability of the strings. I know they are immutable.
1

Strings are immutable in Python, so substrings are simply new objects.

In [7]: s="foobar"   # avoid shadowing the built-in str

In [8]: id(s)
Out[8]: 140976032

In [10]: id(s[:4])
Out[10]: 141060224

The only case where the returned substring is the same object as the original string is when the slice covers the whole string:

In [16]: foo="foobar"

In [17]: id(foo)
Out[17]: 140976032

In [18]: id(foo[:])
Out[18]: 140976032

In [19]: foo="foobar"*10000   # huge string

In [20]: id(foo)
Out[20]: 141606344

In [21]: id(foo[:])
Out[21]: 141606344
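
Comparing id() values of temporaries can mislead, because an id may be reused once an object is freed. A slightly more robust check keeps references alive and uses the is operator. This is a sketch assuming CPython; returning the same object for a full slice is an implementation detail, not a language guarantee:

```python
full = "foobar" * 10000      # a large string
part = full[:4]              # proper substring
whole = full[:]              # slice covering the entire string

print(part is full)          # a proper substring is a distinct object
print(whole is full)         # CPython hands back the original object here
print(part == full[:4])      # equal contents, compared by value
```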

3 Comments

One doesn't follow from the other. In Java, strings are immutable, yet substrings refer to the original string's storage.
As @NPE mentioned, I am not asking about the immutability of the strings. I know they are immutable. Please see my note in the OP.
@user273158 Taking a substring always results in a new object in Python. With some small strings you might see caching done by Python internally, but for that the returned substring has to be equal to the whole string.
0

Not sure it helps much or even answers your question, but you could use finditer and then slice the original string only on demand...

>>> import re
>>> string = 'abcdefhijkl'
>>> matches = list(re.finditer('.' , string))
>>> dir(matches[0])
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'end', 'endpos', 'expand', 'group', 'groupdict', 'groups', 'lastgroup', 'lastindex', 'pos', 're', 'regs', 'span', 'start', 'string']
>>> matches[0].span()
(0, 1)

and then go from there...
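
To make the idea concrete: the spans are just pairs of integers, so storing them costs far less than storing the substrings. If the data can be handled as bytes rather than str, a memoryview slice additionally gives a view that does not copy the underlying buffer. A minimal sketch, with an invented comma-separated input and pattern:

```python
import re

# Hypothetical input, held as bytes so that memoryview can be used.
data = b"alpha,beta,gamma"

# Record only (start, end) offsets; no substring objects are created.
spans = [m.span() for m in re.finditer(rb"[^,]+", data)]

# Slicing a memoryview yields another view onto the same buffer.
view = memoryview(data)
tokens = [view[start:end] for start, end in spans]

# Materialize a real bytes object only when one is actually needed.
first = bytes(tokens[0])
print(spans, first)
```

Note that a view keeps the whole original buffer alive, so this trades copying for a longer lifetime of data. There is no equivalent zero-copy view for str objects.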
