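Your solution is reasonable. An element object behaves like a list of its child elements. The .text attribute of an element holds only the text that sits directly inside it before the first child; it does not include anything that belongs to nested elements, and text that follows a child is stored in that child's .tail.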
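To illustrate how the text is split between .text, the children, and .tail, here is a minimal sketch. The sample XML string and the variable names are made up for the example; only the 'text' tag name comes from the code discussed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
root = ET.fromstring(
    '<doc><text>Intro <b>bold</b> middle <i>italic</i> end</text></doc>'
)
text_elem = root.find('text')

# Text directly inside <text>, before the first child element.
direct_text = text_elem.text

# Iterating over an element yields its direct children.
child_tags = [child.tag for child in text_elem]

# Text that follows each child is kept in that child's .tail.
child_tails = [child.tail for child in text_elem]

print(repr(direct_text), child_tags, child_tails)
```

Here direct_text is only 'Intro ', so reading .text alone would miss everything from the first nested element onward.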
There are things that can be improved in your code. In Python, repeatedly concatenating strings in a loop can be expensive, because each concatenation may build a new string. It is better to collect the substrings in a list and join them at the end, like this:
output_lst = []
for child in root.find('text'):
    output_lst.append(ET.tostring(child, encoding="unicode"))
output_text = ''.join(output_lst)
The list can also be built with a Python list comprehension, so the code would change to:
output_lst = [ET.tostring(child, encoding="unicode") for child in root.find('text')]
output_text = ''.join(output_lst)
The .join method can consume any iterable that produces strings, so the list need not be constructed in advance. Instead, a generator expression (the part written inside the [] of the list comprehension) can be passed directly:
output_text = ''.join(ET.tostring(child, encoding="unicode") for child in root.find('text'))
The one-liner can be split across several lines to make it more readable:
output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in root.find('text'))
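For completeness, here is a self-contained sketch of the generator version that can be run as is. The sample XML and the names xml_source, text_elem and full_inner are assumptions made for the example. Note that ET.tostring also serializes the tail of each child, and that any leading text lives in .text of the parent, so it is prepended separately here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
xml_source = '<doc><text>Intro <b>bold</b> middle <i>italic</i> end</text></doc>'
root = ET.fromstring(xml_source)
text_elem = root.find('text')

# Serialize every child (including its tail text) and join the pieces.
output_text = ''.join(ET.tostring(child, encoding="unicode")
                      for child in text_elem)

# The text before the first child is in .text, which may be None.
full_inner = (text_elem.text or '') + output_text

print(output_text)
print(full_inner)
```

Whether the leading .text should be included depends on what the result is used for; the sketch shows both variants.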