1

I am trying to split the string on each line of a CSV file into three substrings that stay on the same line. The second and third substrings should be wrapped in single quotation marks, and each substring should be followed by a comma.

The lines in the CSV are in the following format:

12345678/ABCDE.pdf
12345678/ABCDE.pdf
12345678/ABCDE.pdf

As I am new to Python, I have tried a split on the lines, which returns the first two substrings without the /, but I am not sure how to obtain the final desired output. The split gives:

'12345678', 'ABCDE.pdf'

I would like the output to look like this:

12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',
12345678,'/ABCDE.pdf','ABCDE',

with the final string containing the title of the pdf without the file extension.

Any help would be greatly appreciated.

1
  • So split it again... Commented Aug 15, 2017 at 16:19

2 Answers

1

Using split again, you can easily construct the desired output string without the need for regex.

In [22]: %%timeit
    ...: s = '''12345678/ABCDE.pdf
    ...: 12345678/ABCDE.pdf
    ...: 12345678/ABCDE.pdf'''
    ...: for l in s.splitlines():
    ...:     s_parts = l.split('/')
    ...:     new_s = "{},'/{}','{}',".format(s_parts[0], s_parts[1], s_parts[1].split('.')[0])
    ...:
100000 loops, best of 3: 3.55 µs per loop

Output:

Out[24]: "12345678,'/ABCDE.pdf','ABCDE',"
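As a self-contained variant of the loop above, here is a minimal sketch that wraps the same idea in a function. The names `format_line` and `format_lines` are made up for illustration, and it swaps `split('.')[0]` for `os.path.splitext`, on the assumption that a file name might contain more than one dot and only the final extension should be dropped.

```python
import os


def format_line(line):
    """Turn '12345678/ABCDE.pdf' into "12345678,'/ABCDE.pdf','ABCDE',"."""
    # partition splits on the first '/' only and always returns three parts
    number, _, filename = line.strip().partition('/')
    # splitext removes only the last extension, so 'a.b.pdf' becomes 'a.b'
    title, _ext = os.path.splitext(filename)
    return "{},'/{}','{}',".format(number, filename, title)


def format_lines(lines):
    # Skip blank lines so a trailing newline does not produce an empty row
    return [format_line(line) for line in lines if line.strip()]


sample = """12345678/ABCDE.pdf
12345678/ABCDE.pdf
12345678/ABCDE.pdf"""

for row in format_lines(sample.splitlines()):
    print(row)
```

To process a real file, the same `format_lines` call should accept an open file object in place of `sample.splitlines()`, since iterating over a file yields its lines.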

For comparison, here is the timing of the regex solution from the other answer, which also works. At 11.6 µs versus 3.55 µs per loop it is roughly three times slower on this sample; the absolute difference is small, but it could become a factor with a large number of items to process.

In [25]: %%timeit
    ...: s = ["12345678/ABCDE.pdf",
    ...:       "12345678/ABCDE.pdf",
    ...:       "12345678/ABCDE.pdf"]
    ...: new_s = [[re.findall(r"\d+", i)[0], "/"+i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]
    ...:
100000 loops, best of 3: 11.6 µs per loop
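To repeat this comparison outside IPython, the standard `timeit` module can be used. The following is only a sketch: the function names `with_split` and `with_regex` are invented here, the loop count is arbitrary, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

LINES = ["12345678/ABCDE.pdf"] * 3


def with_split(lines):
    # Plain str.split approach: one formatted string per input line
    out = []
    for line in lines:
        parts = line.split('/')
        out.append("{},'/{}','{}',".format(parts[0], parts[1], parts[1].split('.')[0]))
    return out


def with_regex(lines):
    # Regex approach: one list of three substrings per input line
    return [[re.findall(r"\d+", i)[0], "/" + i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]]
            for i in lines]


# Time each approach over the same number of calls
split_time = timeit.timeit(lambda: with_split(LINES), number=10000)
regex_time = timeit.timeit(lambda: with_regex(LINES), number=10000)

print("split: {:.4f} s for 10000 calls".format(split_time))
print("regex: {:.4f} s for 10000 calls".format(regex_time))
```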

0

You can use str.split() and re.findall():

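import re

s = ["12345678/ABCDE.pdf",
      "12345678/ABCDE.pdf",
      "12345678/ABCDE.pdf"]
# For each line: the digits, the file name with a leading "/", and the first run of capital letters
new_s = [[re.findall(r"\d+", i)[0], "/"+i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]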

Output:

[['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE'], ['12345678', '/ABCDE.pdf', 'ABCDE']]
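This gives nested lists rather than the quoted, comma-terminated lines shown in the question. One possible way to get from one to the other is sketched below; the variable `rows` is introduced here purely for illustration.

```python
import re

s = ["12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf",
     "12345678/ABCDE.pdf"]
new_s = [[re.findall(r"\d+", i)[0], "/" + i.split("/")[-1], re.findall(r"[A-Z]+", i)[0]] for i in s]

# Quote the second and third substrings and end each row with a comma
rows = ["{},'{}','{}',".format(number, path, title) for number, path, title in new_s]

for row in rows:
    print(row)
```

Note that the pattern `[A-Z]+` only matches a run of capital letters, so a title containing lowercase letters, digits, or other characters would be cut short.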
