I have a pandas df column containing the following strings:

0    Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1    Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')
2    Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')

I would like to extract data from strings and organize it as columns. As you can see, not all rows contain the same data and they are not in the same order. I only need some of the columns; this is the expected output:

     Type      conId symbol  localSymbol
0  Future  462009617    CGB       CGBZ21
1   Stock   80268543   IJPA         IJPA
2   Stock  153454120   EMIM         EMIM

I tried str.extract but couldn't get the result I wanted.

Any ideas on how to achieve it? Thanks

2 Answers

You could try this using string methods. Assuming that the strings are stored in a column named 'main_col':

df["Type"] = df.main_col.str.split("(", expand=True)[0]
df["conId"] = df.main_col.str.partition("conId=")[2].str.partition(",")[0]
# Quoted values keep their surrounding quotes after partition, so strip them
df["symbol"] = df.main_col.str.partition("symbol=")[2].str.partition(",")[0].str.strip("'")
df["localSymbol"] = df.main_col.str.partition("localSymbol=")[2].str.partition(",")[0].str.strip("'")

6 Comments

Thanks for your answer. The first line of code raises the following ValueError: Length of values (2) does not match length of index (10). Do you have any idea why?
Hi, apologies. I missed a parameter: without it, split returns a list for each row, and [0] picked only the list for the first row. Please add the parameter expand = True to the split method and it should work fine. I have made the edit to the answer.
Thanks. Now the first command works fine, but I get an error on the second line: 'Series' object has no attribute 'partition'. I'm interested in your code because it looks easier to read than Nikolaos's one
Hi, I apologize again as I missed another tiny detail. Since I'm doing the partition twice to get the substring, I should have used str again, because the first partition creates an intermediate Series object. I have made the changes. I have tested the code on a sample, so hopefully the edited code will work fine for you.
Great, now it works and it's very easy to read
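
For anyone hitting the same two errors discussed in the comments above, here is a minimal sketch of what each intermediate step returns. It uses the three sample rows from the question and assumes the column is named main_col, as in the answer; the variable names are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question; main_col is the name assumed in the answer.
df = pd.DataFrame({"main_col": [
    "Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', "
    "multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')",
    "Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', "
    "localSymbol='IJPA', tradingClass='IJPA')",
    "Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', "
    "localSymbol='EMIM', tradingClass='EMIM')",
]})

# Without expand=True, split returns a Series of lists, so [0] selects the
# list for row 0 (two pieces) instead of the first piece of every row.
row0_pieces = df.main_col.str.split("(")[0]

# With expand=True, split returns a DataFrame, so [0] is its first column.
type_col = df.main_col.str.split("(", expand=True)[0]

# partition(...)[2] is an ordinary Series, so the .str accessor is needed
# again before calling partition a second time.
after_conid = df.main_col.str.partition("conId=")[2]
conid_col = after_conid.str.partition(",")[0]
```

Assigning row0_pieces to a new column is what triggers the length-mismatch ValueError, and calling partition directly on after_conid is what triggers the "no attribute 'partition'" error.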

One solution using pandas.Series.str.extract (as you tried using it):

>>> df
                                                                                                                                                           col
0  Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')
1  Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', localSymbol='IJPA', tradingClass='IJPA')                                              
2  Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', localSymbol='EMIM', tradingClass='EMIM')

>>> df.col.str.extract(r"^(?P<Type>Future|Stock).*conId=(?P<conId>\d+).*symbol='(?P<symbol>[A-Z]+)'.*localSymbol='(?P<localSymbol>[A-Z0-9]+)'")
     Type      conId symbol localSymbol
0  Future  462009617    CGB      CGBZ21
1   Stock   80268543   IJPA        IJPA
2   Stock  153454120   EMIM        EMIM

In the above, I assume that:

  • Type is either Future or Stock
  • conId consists of digits
  • symbol consists of uppercase letters
  • localSymbol consists of digits and uppercase letters
  • conId, symbol and localSymbol appear in that order in every row

You may want to adapt the pattern to better fit your needs.
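
If the fields can appear in a different order from row to row, one possible adaptation is to extract each field with its own small pattern instead of one combined pattern. This is only a sketch under assumptions: the patterns dict and the result name are hypothetical, the column is assumed to be named col as above, and quoted values are assumed not to contain a single quote.

```python
import pandas as pd

# Sample rows copied from the question; col is the column name used above.
df = pd.DataFrame({"col": [
    "Future(conId=462009617, symbol='CGB', lastTradeDateOrContractMonth='20211220', "
    "multiplier='1000', currency='CAD', localSymbol='CGBZ21', tradingClass='CGB')",
    "Stock(conId=80268543, symbol='IJPA', exchange='AEB', currency='EUR', "
    "localSymbol='IJPA', tradingClass='IJPA')",
    "Stock(conId=153454120, symbol='EMIM', exchange='AEB', currency='EUR', "
    "localSymbol='EMIM', tradingClass='EMIM')",
]})

# One pattern per output column (hypothetical helper dict). Each pattern is
# searched independently, so field order inside the string does not matter.
# \b keeps a field name from matching inside a longer name.
patterns = {
    "Type": r"^(\w+)\(",
    "conId": r"\bconId=(\d+)",
    "symbol": r"\bsymbol='([^']*)'",
    "localSymbol": r"\blocalSymbol='([^']*)'",
}

# expand=False with a single capture group returns a Series per field;
# rows where a field is missing get NaN in that column.
result = pd.DataFrame({
    name: df["col"].str.extract(pattern, expand=False)
    for name, pattern in patterns.items()
})
```

All extracted values are strings, so conId would still need a conversion such as astype(int) if a numeric column is wanted.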
