Splitting python list based on regular expression

Question

I have the following python list:

['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv', 'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']

How do I separate it into 2 lists:

['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv'] and ['daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']

The lists are split based on the words preceding the year, e.g. 2000.

I know I should use a regex in Python, but I'm not sure how to do it. Also, the solution needs to be extensible and not depend on the actual names, e.g. chhattisgarh.

Thanks @RoryDaulton, the elements are strings. I've updated my question to reflect that.
– user308827, Commented Jun 19, 2016 at 22:56
Could you do it based on the text before the first _, e.g. using name.partition("_")[0] to compare titles? This wouldn't work if you had titles like 'foo_bar_2000' vs 'foo_foo_2000', though.
– Tadhg McDonald-Jensen, Commented Jun 19, 2016 at 22:57
That doesn't work, since different list elements can have different numbers of underscores.
– user308827, Commented Jun 19, 2016 at 22:58
Are you sure the year contains the first numeric character in each list element?
– Rory Daulton, Commented Jun 19, 2016 at 22:59
Yes, the year contains the first and only numeric characters in each list element.
– user308827, Commented Jun 19, 2016 at 22:59

Blorgbeard · Accepted Answer · 2016-06-19 23:16:44Z

5

You can use itertools.groupby here:

import itertools
import re

# Named "files" rather than "list" to avoid shadowing the built-in
files = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
         'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
         'daman_and_diu_2002_aa.csv']

# groupby only groups adjacent items, hence the sort; the raw string keeps \d intact
grouped = itertools.groupby(sorted(files), lambda x: re.match(r'(.+)_\d{4}', x).group(1))

for key, values in grouped:
    print(key)
    print(list(values))

The regex (.+)_\d{4} matches a group of at least one character (which is what we group by) followed by an underscore and 4 digits.
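
To get the two lists the question asks for, the grouped iterator can be materialized into a list of lists. The following is a minimal sketch, assuming every filename contains an underscore followed by a four-digit year. The names `prefix_before_year`, `ordered`, and `groups` are illustrative and not part of the answer. It sorts by the extracted prefix rather than by the full string, which is a slightly more defensive choice, since `groupby` only merges adjacent items with equal keys.

```python
import itertools
import re

files = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
         'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
         'daman_and_diu_2002_aa.csv']


def prefix_before_year(name):
    # Hypothetical helper: the text before "_<4 digits>".
    # Assumes the pattern matches; re.match returns None otherwise.
    return re.match(r'(.+)_\d{4}', name).group(1)


# Sort by the grouping key so that equal keys end up adjacent.
ordered = sorted(files, key=prefix_before_year)

# Materialize each group, because groupby yields one-shot iterators.
groups = [list(values)
          for _, values in itertools.groupby(ordered, key=prefix_before_year)]

print(groups)
```

Each inner list holds the filenames that share a prefix, so `groups[0]` and `groups[1]` correspond to the two lists in the question.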


Rory Daulton · Answer · 2016-06-19 23:25:07Z

Here is one way to get a dictionary in which each "name" key maps to a list of the strings starting with that name, keeping the order of the original list. It does not use regex and in fact uses no modules at all. You can easily modify it to make a function, remove the trailing underscore from each name, check for various errors in the data list, get the resulting lists out of the dictionary, and so on.

If you allow other modules, or allow changes in the order, I'm sure there are other ways.

a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']

names_dict = {}
for item in a:
    # Find the index of the first numeric character in the item
    for i, c in enumerate(item):
        if c.isdigit():
            break
    else:
        # No digit found: use the whole string as the name
        i = len(item)
    # Store the string in the dictionary according to its preceding characters
    name = item[:i]
    # setdefault creates the list the first time a name is seen
    names_dict.setdefault(name, []).append(item)

print(names_dict)

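The result of this code (prettified) is shown below. Key order is not guaranteed before Python 3.7; on 3.7 and later the keys appear in insertion order, so 'chhattisgarh_' is printed first.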

{'daman_and_diu_': [
    'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
    'daman_and_diu_2002_aa.csv'],
 'chhattisgarh_': [
    'chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv']
}
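
The modifications mentioned above (wrapping the loop in a function, dropping the trailing underscore, checking for bad data, and pulling the lists out of the dictionary) could look roughly like the sketch below. The function name `group_by_name` and the choice to raise `ValueError` for strings without a digit are assumptions made for illustration. Like the original, it uses no modules.

```python
def group_by_name(items):
    """Group strings by the text before their first digit (hypothetical helper)."""
    groups = {}
    for item in items:
        # Index of the first digit, or None when the string has no digit
        first_digit = next(
            (i for i, c in enumerate(item) if c.isdigit()), None)
        if first_digit is None:
            raise ValueError("no digit found in %r" % item)
        # Drop trailing underscores so the key is just the name
        name = item[:first_digit].rstrip('_')
        groups.setdefault(name, []).append(item)
    return groups


a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']

groups = group_by_name(a)

# The separate lists, in order of first appearance (Python 3.7+)
lists = list(groups.values())
print(lists)
```

Raising on malformed input is only one option; skipping such items or collecting them under a separate key would work as well, depending on how strict the caller needs to be.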

akuiper · Answer · 2016-06-19 23:20:42Z

2

Another option is to use a regular expression combined with a dictionary:

import re
from collections import defaultdict

files = ["chhattisgarh_2015_aa.csv", "chhattisgarh_2016_aa.csv",
         "daman_and_diu_2000_aa.csv", "daman_and_diu_2001_aa.csv",
         "daman_and_diu_2002_aa.csv"]

groupedFiles = defaultdict(list)
for fileName in files:
    # Everything before the 4-digit year is the grouping key
    prefix = re.findall(r"(.*)\d{4}", fileName)[0]
    groupedFiles[prefix].append(fileName)

groupedFiles

{'chhattisgarh_': ['chhattisgarh_2015_aa.csv',
                   'chhattisgarh_2016_aa.csv'],
 'daman_and_diu_': ['daman_and_diu_2000_aa.csv',
                    'daman_and_diu_2001_aa.csv',
                    'daman_and_diu_2002_aa.csv']}
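
One caveat with this approach is that `re.findall(...)[0]` raises `IndexError` for a name that has no four-digit run. The sketch below shows one way to tolerate such names by setting them aside. The helper `group_files`, the constant `YEAR_PREFIX`, and the sample name `readme.txt` are assumptions for illustration. It uses `re.match` in place of `re.findall(...)[0]`; for single-line filenames the captured prefix is the same.

```python
import re
from collections import defaultdict

# Same pattern as in the answer, precompiled once
YEAR_PREFIX = re.compile(r"(.*)\d{4}")


def group_files(file_names):
    """Hypothetical helper: return (groups, unmatched)."""
    groups = defaultdict(list)
    unmatched = []
    for file_name in file_names:
        match = YEAR_PREFIX.match(file_name)
        if match is None:
            # No four-digit run: keep the name aside instead of failing
            unmatched.append(file_name)
        else:
            groups[match.group(1)].append(file_name)
    # Convert to a plain dict so missing keys are not created on lookup
    return dict(groups), unmatched


groups, unmatched = group_files([
    "chhattisgarh_2015_aa.csv", "daman_and_diu_2000_aa.csv",
    "chhattisgarh_2016_aa.csv", "readme.txt",
])
print(groups)
print(unmatched)
```

Because a dictionary is used instead of `itertools.groupby`, the input does not need to be sorted, and files that share a prefix are grouped even when they are not adjacent.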
