1

How can I extract a specific string from a file name using just one line of code? I can do it in two lines (counting only the lines that assign extracted and extracted2), but I can't figure out whether it is possible in one.

I would like to extract "this" from the filename text__text_numberandtext_text_text_this.xlsx

Here is the "2 line" code I have at the moment:

s = "text__text_numberandtext_text_text_this.xlsx"
extracted = '_'.join(s.split('_')[6:7])
extracted2 = '.'.join(extracted.split('.')[:1])
print(extracted2)

4 Answers

5

You can use regex and do something like:

>>> import re
>>> s
'text__text_numberandtext_text_text_this.xlsx'
>>> re.search(r'.*_(\w+)\.xlsx', s).group(1)
'this'

In the regex above, we capture the run of word characters after the last "_" and before the ".xlsx" extension.
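To make the pattern's behavior concrete, here is a minimal sketch that wraps it in a helper and handles the no-match case, since re.search returns None when the pattern is not found. The names PATTERN and last_token are hypothetical and not part of the answer; the pattern itself is the one shown above, written as a raw string.

```python
import re

# The answer's pattern as a raw string, so that \w and \. reach the regex engine as-is.
PATTERN = re.compile(r'.*_(\w+)\.xlsx')

def last_token(filename):
    """Return the word between the last '_' and '.xlsx', or None if there is no match."""
    match = PATTERN.search(filename)
    return match.group(1) if match else None

print(last_token("text__text_numberandtext_text_text_this.xlsx"))  # this
print(last_token("no-underscore.xlsx"))                            # None
```

Calling .group(1) directly, as in the one-liner, would raise AttributeError on a filename that does not match, which may or may not be what you want.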

Don't look for "one line" of code. Think about the cleanest solution instead.

2 Comments

"Don't look for "one line" of code. Think about the cleanest solution instead." +1. Pythonic does not imply one line of code does all the work. Remember the pythonic mantra: "Simple is better than complex, flat is better than nested". @OP Your code is readable and everyone understands it. EDIT: It is called the Zen of Python, not mantra
If you count the added code from importing the re module, this is waaay more than one line.
2
spam = "text__text_numberandtext_text_text_this.xlsx"
eggs = spam.split('_')[-1].split('.')[0]
print(eggs)

output

this

EDIT: it's interesting to benchmark the 3 alternatives.

from timeit import timeit

print(timeit("s.split('_')[-1].split('.')[0]", setup="s='text__text_numberandtext_text_text_this.xlsx'"))
print(timeit(r"re.search(r'.*_(\w+)\.xlsx', s).group(1)", setup="import re; s='text__text_numberandtext_text_text_this.xlsx'"))
print(timeit("s[s.rfind('_')+1:s.rfind('.')]", setup="s='text__text_numberandtext_text_text_this.xlsx'"))

output:

0.8729359760000079
2.0453107610010193
0.6893644140000106

1 Comment

Although all three solutions give desired results, I am accepting this as an answer as personally I find it easiest to understand. Thanks to all!
2
s[s.rfind('_')+1:s.rfind('.')]

output:

'this'

It's not what your code does, but if I'm understanding correctly, it's what your description is asking for. This only works if you know the text you're looking for is immediately between the last underscore and the last period.
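A small sketch of that caveat, assuming the same slice expression wrapped in a function (the name between_last_underscore_and_dot is hypothetical). Because rfind returns -1 when the character is missing, the slice quietly returns something else instead of failing:

```python
def between_last_underscore_and_dot(s):
    # Same slice as in the answer: from just after the last '_' up to the last '.'.
    return s[s.rfind('_') + 1:s.rfind('.')]

print(between_last_underscore_and_dot("text__text_numberandtext_text_text_this.xlsx"))  # this

# No underscore: rfind returns -1, so the slice starts at index 0.
print(between_last_underscore_and_dot("report.xlsx"))  # report

# No period: rfind returns -1, so the slice stops before the last character.
print(between_last_underscore_and_dot("text_this"))  # thi
```

If such inputs are possible, it may be worth checking the rfind results before slicing.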

1

Just to add another perspective: because in the end you are dealing with a path (or filename for that matter), a good idea can be to use pathlib. Using the stem attribute you easily get the name without the extension, and then just need to take the last _ part using rsplit:

from pathlib import Path

s = Path("text__text_numberandtext_text_text_this.xlsx")
print(s.stem.rsplit('_', 1)[-1])

A short explanation on that rsplit:

  • It works like split, except that when you pass a maxsplit argument, it starts splitting from the right end.
  • Because we are only interested in the last part, we use maxsplit=1.
  • rsplit returns a list, in this case with just 2 elements. We then take the last element with [-1].
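The difference described in the list above can be seen in a short comparison. This is only an illustrative sketch; the variable name stem is an assumption and simply holds the filename without its extension.

```python
stem = "text__text_numberandtext_text_text_this"

# With maxsplit=1, split cuts at the first '_' and rsplit at the last one.
print(stem.split('_', 1))   # ['text', '_text_numberandtext_text_text_this']
print(stem.rsplit('_', 1))  # ['text__text_numberandtext_text_text', 'this']

# Without maxsplit, both return the same list.
print(stem.rsplit('_') == stem.split('_'))  # True
```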

A somewhat more efficient version is using rpartition instead of rsplit:

s.stem.rpartition('_')[-1]
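For reference, a minimal sketch of what rpartition returns: always a 3-tuple of (head, separator, tail), split at the last occurrence of the separator. The variable names here are illustrative only.

```python
from pathlib import Path

stem = Path("text__text_numberandtext_text_text_this.xlsx").stem

# rpartition splits at the last '_' and returns exactly three parts.
head, sep, tail = stem.rpartition('_')
print(tail)  # this

# If the separator is missing, the whole string ends up in the last slot,
# so indexing with [-1] still returns something usable.
print("report".rpartition('_'))  # ('', '', 'report')
```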
