
I have a pandas DataFrame like the one in the example below. Column 0 contains raw HTML, from which I need to extract all URLs and add them as new columns of the same DataFrame, keeping each URL in the row it came from.

In this case, column 2, row 0 would hold "https://sco...". In practice, a single cell could contain as many as 10 URLs, each of which should go into its own column of the DataFrame. I tried Beautiful Soup, but I couldn't get it to work reliably with a DataFrame like this.

I also tried extracting the URLs with the regex below, but I couldn't work out how to apply it to the DataFrame.

postsOnlyURL = re.findall('"(http.*?)"',all_text,re.IGNORECASE|re.DOTALL)


                                                    0                                                  1
0   src="https://sco ...                               publicado a 23/10/2019Ident...
1   Ativo</div></div><div class="_7jwu">Começou a ...  AtivoComeçou a ser publicado a 23/10/2019Ident...
2   Ativo</div></div><div class="_7jwu">Começou a ...  AtivoComeçou a ser publicado a 23/10/2019Ident...

Is there a way to make this work?

  • If you have trouble plugging the regex result directly into the DataFrame, split the problem into two parts. First, build a list with one element per DataFrame row, where each element is the list of URLs found in that row. That takes care of the extraction. Then add those lists to the DataFrame. Commented Feb 18, 2020 at 14:29
  • Beautiful Soup could be the way to go here. Can you share the code you tried? A sample DataFrame would also help. Commented Feb 18, 2020 at 14:39
  • @AliCirik The sample DataFrame can be accessed here: docs.google.com/spreadsheets/d/…. As for the code I tried, I can't share it: I deleted it because it wasn't making much sense to me. Commented Feb 18, 2020 at 14:46
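
The two-step approach from the first comment could look roughly like the sketch below. The sample HTML strings, the URL_PATTERN name, and the 'urls' column name are made up for illustration; the regex is the one from the question.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real HTML column.
df = pd.DataFrame({
    0: ['<img src="https://example.com/a.png"><a href="http://example.com/b">b</a>',
        '<a href="https://example.com/c">c</a>',
        'no links here'],
})

# Regex from the question: anything starting with http inside double quotes.
URL_PATTERN = re.compile(r'"(http.*?)"', re.IGNORECASE | re.DOTALL)

# Step 1: build one list of URLs per row, in the same order as the rows.
urls_per_row = [URL_PATTERN.findall(text) for text in df[0]]

# Step 2: attach the lists to the DataFrame, aligned on the original index.
df['urls'] = pd.Series(urls_per_row, index=df.index)
```

Because the list is built by iterating over the column itself, element i always belongs to row i, so the row order is preserved by construction.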

2 Answers


I can't access your dataset, but in general this is one way to extract URLs from strings in a DataFrame with a regex and create new columns dynamically, one per URL extracted:

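import pandas as pd

df = pd.DataFrame({'Col1': ['check my blog http://example.com/blah or this is an example of https://google.com or http://facebook.com',
                             'get url from https://facebook.com', 'You can find answers at https://stackoverflow.com/']})

# Match http:// or https:// followed by any run of non-whitespace characters
pattern = r'(https?://[^\s]+)'

# One list of URLs per row
df['urls'] = df['Col1'].str.findall(pattern)
# Flatten each list to a comma-separated string, then split it into one column per URL
# (this assumes the URLs themselves contain no commas)
df['urls'] = [','.join(map(str, l)) for l in df['urls']]
df = pd.concat([df, df['urls'].str.split(',', expand=True)], axis=1)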
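
If some URLs may contain commas, the join/split round trip above would cut them apart. A possible variant, sketched here with made-up sample strings and a hypothetical 'url_' column prefix, builds the extra columns straight from the lists that findall returns:

```python
import pandas as pd

# Hypothetical sample: the first URL contains a comma on purpose.
df = pd.DataFrame({'Col1': [
    'see http://example.com/a,b and https://example.org/',
    'nothing to extract here',
]})

pattern = r'(https?://[^\s]+)'

# findall gives one list per row; a DataFrame built from those lists has
# one column per URL position, with missing values where a row has fewer URLs.
url_lists = df['Col1'].str.findall(pattern)
url_cols = pd.DataFrame(url_lists.tolist(), index=df.index).add_prefix('url_')

df = pd.concat([df, url_cols], axis=1)
```

Passing index=df.index keeps the new columns aligned with the original rows even if the DataFrame does not use a default integer index.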

1 Comment

Works perfectly, exactly what I needed. Great job! The only change I made was to the pattern: df['posts'].str.findall('"(http.*?)"',re.DOTALL)

Here is a potential solution using Beautiful Soup:

import pandas as pd
import re
from bs4 import BeautifulSoup

# Create sample df
a = ["""Ativo</div></div><div class="_7jwu">Começou a ser publicado a <span>23/10/2019</span></div><div class="_8jox"><div aria-describedby="js_m" aria-haspopup="true" class="_4rhp" role="tooltip" tabindex="0">Identificação: 411753089755204</div></div></div><div class="_8k-_"><div class="_3qn7 _61-0 _2fyi _3qng" style="max-width: 120px;"><span data-hover="tooltip"><i class="_3-8_ img sp_-Fn2d835eMD sx_39e484" alt=""></i></span><span data-hover="tooltip"><i class="_3-8_ img sp_-Fn2d835eMD sx_f3b669" alt=""></i></span><span data-hover="tooltip"><i class="img sp_-Fn2d835eMD sx_e31062" alt=""></i></span></div></div></div><div class="_7jwv"><div style="display: inline-block; width: auto;"><button aria-pressed="false" data-testid="SUIAbstractMenu/button" type="button" aria-disabled="false" class="_271k _271l _1o4e _271m _1qjd _7tvm _7tv2 _7tv4" style="width: auto; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 12px; font-weight: bold; font-family: Arial, sans-serif; line-height: 26px; text-align: center; background-color: transparent; border-color: transparent; height: 28px; padding-left: 7px; padding-right: 7px; border-radius: 2px;"><div class="_43rl"><i aria-hidden="true" class="_271o img sp_6UxJZoFesmZ sx_e4448e" alt=""></i><span class="accessible_elem">Abrir menu pendente</span></div></button></div></div></div><div class="_7jwy"><div class="_7jyg _7jyh"><div class="_7k71"><div class="_8nsi _8nqp"><div class="_3qn7 _61-0 _2fyi _3qng" style="width: 100%;"><img alt="imaginBank" class="_8nqq img" src="https://scontent.flis8-1.fna.fbcdn.net/v/t1.6435-9/56757490_843111089374606_3751796641934344192_n.png?_nc_cat=105&amp;_nc_oc=AQn_sfVuUVpGuXh9Xew56gOSFzdktA5s1xfEWBMkYzLNQ6m8zdOZve6xFIzu7IOEJL0&amp;_nc_ht=scontent.flis8-1.fna&amp;oh=ce8b3a3feea1162bc0874c260ab2b308&amp;oe=5EC4B78C"><div class="_3qn7 _61-0 _2fyh _3qnf" style="width: 100%;"><div class="_8nqr _3qn7 _61-3 _2fyi _3qng"><span style="font-family: Arial, sans-serif; font-size: 12px; line-height: 16px; 
letter-spacing: normal; font-weight: bold; overflow-wrap: normal; text-align: left; color: rgb(28, 30, 33);"><a data-hovercard="/ajax/hovercard/hovercard.php?id=197438223941899" target="_blank" href="https://www.facebook.com/imaginBank/">imaginBank</a></span></div><div class="_8nrv"><div class="_4ik4 _4ik5" style="-webkit-line-clamp: 2;"><div><span class="_8jos">Patrocinado</span></div></div></div></div></div></div></div><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"><div>Parking, peajes, impuestos, gasolina... Al final termina siendo una pasta. ¿Te has planteado recortar estos gastos? No, no hablamos de abandonar la conducción. Hablamos de enchufarnos al futuro. Conoce todos los beneficios de tener un coche un eléctrico y lo fácil que es conseguirlo con un Préstamo Auto de imaginBank. #Enchúfate<br> <br> *La concesión de la operación está sujeta al análisis de la solvencia y de la capacidad de devolución del solicitante, en función de las políticas de riesgo de la entidad. 
imaginBank de CaixaBank</div></div></div></div><div maxchangeamount="1" currentselectedindex="0" class="_23n-"><div class="_4u-c"><div index="0" class="_a28"><div class="_a2e"><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-2.fna.fbcdn.net/v/t39.16868-6/s600x600/68872437_623249314832062_3424786237267902464_n.jpg?_nc_cat=107&amp;_nc_oc=AQnCOg6lOVmyYNmKW9TeJMIQqFnp__ENhA6b0IF9n6OOvKhuFdfBFFn5A-i6mv9Qs9A&amp;_nc_ht=scontent.flis8-2.fna&amp;_nc_tp=7&amp;oh=4c316693ef6d41f08a19c047bbef6ff5&amp;oe=5EC0B4DA" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la app</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-1.fna.fbcdn.net/v/t39.16868-6/s600x600/69107399_623249321498728_5143385648069083136_n.jpg?_nc_cat=110&amp;_nc_oc=AQlHwBVTCf9XcxXVP4VH0YnbwivUgg1PXA8uYOxShCkbr9woauh1CiNiQTJbguBYmbc&amp;_nc_ht=scontent.flis8-1.fna&amp;_nc_tp=7&amp;oh=e52db30189225e3525fbef0cec013c31&amp;oe=5EFC734A" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo 
desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la app</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-1.fna.fbcdn.net/v/t39.16868-6/s600x600/68744822_623249324832061_2903488387056926720_n.jpg?_nc_cat=109&amp;_nc_oc=AQkQBwWIk_gZ3WxbsRYe6kyjcJk0HU4XjUHDUQEP1diakZkjkk5Ng8U38gF9L3ZWaTI&amp;_nc_ht=scontent.flis8-1.fna&amp;_nc_tp=7&amp;oh=ef1d597a68349654821b7f2a7c730287&amp;oe=5EF8CBA5" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo 
desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la app</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div></div></div></div><a class="_32rk _32rh _1cy6" href="#"><div direction="forward" class="_10sf _5x5_"><div class="_5x6d"><div class="_3bwv _3bww"><div class="_3bwy"><div class="_3bwx"><i class="_3-8w img sp_JmF3rXGjoQG sx_77d801" alt=""></i></div></div></div></div></div></a></div></div></div><div class="_7kfi"><div class="_7kd5"></div><a class="_7kfh" data-testid="snapshot_footer_link" href="#"><span style="font-family: Arial, sans-serif; font-size: 13px; line-height: 17px; letter-spacing: normal; overflow-wrap: normal; text-align: left; color: rgb(24, 119, 242);">Ver Det""",
    """Ativo</div></div><div class="_7jwu">Começou a ser publicado a <span>23/10/2019</span></div><div class="_8jox"><div aria-describedby="js_p" aria-haspopup="true" class="_4rhp" role="tooltip" tabindex="0">Identificação: 712910935875237</div></div></div><div class="_8k-_"><div class="_3qn7 _61-0 _2fyi _3qng" style="max-width: 120px;"><span data-hover="tooltip"><i class="_3-8_ img sp_-Fn2d835eMD sx_39e484" alt=""></i></span><span data-hover="tooltip"><i class="_3-8_ img sp_-Fn2d835eMD sx_f3b669" alt=""></i></span><span data-hover="tooltip"><i class="img sp_-Fn2d835eMD sx_e31062" alt=""></i></span></div></div></div><div class="_7jwv"><div style="display: inline-block; width: auto;"><button aria-pressed="false" data-testid="SUIAbstractMenu/button" type="button" aria-disabled="false" class="_271k _271l _1o4e _271m _1qjd _7tvm _7tv2 _7tv4" style="width: auto; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 12px; font-weight: bold; font-family: Arial, sans-serif; line-height: 26px; text-align: center; background-color: transparent; border-color: transparent; height: 28px; padding-left: 7px; padding-right: 7px; border-radius: 2px;"><div class="_43rl"><i aria-hidden="true" class="_271o img sp_6UxJZoFesmZ sx_e4448e" alt=""></i><span class="accessible_elem">Abrir menu pendente</span></div></button></div></div></div><div class="_7jwy"><div class="_7jyg _7jyh"><div class="_7k71"><div class="_8nsi _8nqp"><div class="_3qn7 _61-0 _2fyi _3qng" style="width: 100%;"><img alt="imaginBank" class="_8nqq img" src="https://scontent.flis8-1.fna.fbcdn.net/v/t1.6435-9/56757490_843111089374606_3751796641934344192_n.png?_nc_cat=105&amp;_nc_oc=AQn_sfVuUVpGuXh9Xew56gOSFzdktA5s1xfEWBMkYzLNQ6m8zdOZve6xFIzu7IOEJL0&amp;_nc_ht=scontent.flis8-1.fna&amp;oh=ce8b3a3feea1162bc0874c260ab2b308&amp;oe=5EC4B78C"><div class="_3qn7 _61-0 _2fyh _3qnf" style="width: 100%;"><div class="_8nqr _3qn7 _61-3 _2fyi _3qng"><span style="font-family: Arial, sans-serif; font-size: 12px; line-height: 16px; 
letter-spacing: normal; font-weight: bold; overflow-wrap: normal; text-align: left; color: rgb(28, 30, 33);"><a data-hovercard="/ajax/hovercard/hovercard.php?id=197438223941899" target="_blank" href="https://www.facebook.com/imaginBank/">imaginBank</a></span></div><div class="_8nrv"><div class="_4ik4 _4ik5" style="-webkit-line-clamp: 2;"><div><span class="_8jos">Patrocinado</span></div></div></div></div></div></div></div><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"><div>Ya reciclas, vas con tu botella de agua para rellenar y has dejado de usar bolsas de plástico para hacer la compra. ¿Qué sigue? Un coche eléctrico. Entérate por qué molan tanto y #Enchúfate con un Préstamo Auto de imaginBank para hacerte con tu coche eléctrico o híbrido.<br> <br> *La concesión de la operación está sujeta al análisis de la solvencia y de la capacidad de devolución del solicitante, en función de las políticas de riesgo de la entidad. 
imaginBank de CaixaBank</div></div></div></div><div maxchangeamount="1" currentselectedindex="0" class="_23n-"><div class="_4u-c"><div index="0" class="_a28"><div class="_a2e"><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-1.fna.fbcdn.net/v/t39.16868-6/s600x600/69012922_678822735864357_7122692236617187328_n.jpg?_nc_cat=104&amp;_nc_oc=AQketjlSUFzRGTqej50cs1XsD1InX5WgLsjHTd4mL6OWT7-OhrJXFvcz8WyRuRBSsqM&amp;_nc_ht=scontent.flis8-1.fna&amp;_nc_tp=7&amp;oh=cbfe1cf2d19b5c37ede1b2dfdf674276&amp;oe=5EBC43D4" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo desde 3.000€ hasta 30.000€ ¡Solicítalo desde la app!</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; 
padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-2.fna.fbcdn.net/v/t39.16868-6/s600x600/68897058_678822749197689_3660284254794809344_n.jpg?_nc_cat=102&amp;_nc_oc=AQnuGGaDQSJvqp6qWgRPMeQJ5mGectLDp8RrAPgACaUxLzaXjGrN6r0SaQUAWU7Io_g&amp;_nc_ht=scontent.flis8-2.fna&amp;_nc_tp=7&amp;oh=ab5a43134d95ebef545ff39420bf7f9c&amp;oe=5EBEB413" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la app</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div><div class="_2zgz"><div class="_7jy-"><div class="_7jyr"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 16px; max-height: 112px; -webkit-line-clamp: 7;"> </div></div></div><a target="_blank" class="_231w _231z _4yee" href="http://play.google.com/store/apps/details?id=com.imaginbank.app" style="color: rgb(33, 111, 219);"><img class="_7jys _7jyt img" src="https://scontent.flis8-2.fna.fbcdn.net/v/t39.16868-6/s600x600/68874914_678822752531022_5734270725913575424_n.jpg?_nc_cat=108&amp;_nc_oc=AQkv8fSf75LF4KE4JbneYEcjKyFRw7xil-Nq6Q_rhP_qvoe04zH3ZUa4SwRvS8Nq0XU&amp;_nc_ht=scontent.flis8-2.fna&amp;_nc_tp=7&amp;oh=c5ad235fa78dd110248f46bd3913f998&amp;oe=5EF66280" alt=""><div class="_8jgz _8jg_"><div class="_8jh1"><div class="_8jh2"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;">Préstamo 
desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la app</div></div></div><div class="_8jh3"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 14px; max-height: 28px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh4"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div><div class="_8jh5"><div tabindex="0" role="button"><div class="_4ik4 _4ik5" style="line-height: 12px; max-height: 24px; -webkit-line-clamp: 2;"></div></div></div></div><div class="_8jh0"><button type="button" aria-disabled="false" class="_271k _271m _1qjd _3-9a" style="max-width: 80px; letter-spacing: normal; color: rgb(68, 73, 80); font-size: 11px; font-weight: normal; font-family: Arial, sans-serif; line-height: 16px; text-align: center; background-color: rgb(245, 246, 247); border-color: rgb(218, 221, 225); height: 18px; padding-left: 4px; padding-right: 4px; background-clip: padding-box;"><div class="_43rl"><div data-hover="tooltip" data-tooltip-display="overflow" class="_43rm">Use App</div></div></button></div></div></a></div></div></div></div></div><a class="_32rk _32rh _1cy6" href="#"><div direction="forward" class="_10sf _5x5_"><div class="_5x6d"><div class="_3bwv _3bww"><div class="_3bwy"><div class="_3bwx"><i class="_3-8w img sp_JmF3rXGjoQG sx_77d801" alt=""></i></div></div></div></div></div></a></div></div></div><div class="_7kfi"><div class="_7kd5"></div><a class="_7kfh" data-testid="snapshot_footer_link" href="#"><span style="font-family: Arial, sans-serif; font-size: 13px; line-height: 17px; letter-spacing: normal; overflow-wrap: normal; text-align: left; color: rgb(24, 119, 242);">Ver Det"""
    ]

b = [
    """AtivoComeçou a ser publicado a 23/10/2019Identificação: 411753089755204Abrir menu pendenteimaginBankPatrocinadoParking, peajes, impuestos, gasolina... Al final termina siendo una pasta. ¿Te has planteado recortar estos gastos? No, no hablamos de abandonar la conducción. Hablamos de enchufarnos al futuro. Conoce todos los beneficios de tener un coche un eléctrico y lo fácil que es conseguirlo con un Préstamo Auto de imaginBank. #Enchúfate *La concesión de la operación está sujeta al análisis de la solvencia y de la capacidad de devolución del solicitante, en función de las políticas de riesgo de la entidad. imaginBank de CaixaBank Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la appUse App Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la appUse App Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la appUse AppVer Det""",
    """AtivoComeçou a ser publicado a 23/10/2019Identificação: 712910935875237Abrir menu pendenteimaginBankPatrocinadoYa reciclas, vas con tu botella de agua para rellenar y has dejado de usar bolsas de plástico para hacer la compra. ¿Qué sigue? Un coche eléctrico. Entérate por qué molan tanto y #Enchúfate con un Préstamo Auto de imaginBank para hacerte con tu coche eléctrico o híbrido. *La concesión de la operación está sujeta al análisis de la solvencia y de la capacidad de devolución del solicitante, en función de las políticas de riesgo de la entidad. imaginBank de CaixaBank Préstamo desde 3.000€ hasta 30.000€ ¡Solicítalo desde la app!Use App Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la appUse App Préstamo desde 3.000€ hasta 30.000€. Solicita el tuyo
  desde la appUse AppVer Det"""
]

df = pd.DataFrame({0: a, 1: b})


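def get_links(x):
    # Pass the parser name positionally; the constructor's parameter is named `features`, not `parser`
    soup = BeautifulSoup(x, 'html.parser')
    # Keep only <a> tags whose href starts with "http" (covers both http and https)
    links = [i.get('href') for i in soup.find_all('a', attrs={'href': re.compile("^http")})]
    return links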

df[0].apply(get_links)

df[0].apply(get_links) returns

0    [https://www.facebook.com/imaginBank/, http://...
1    [https://www.facebook.com/imaginBank/, http://...
Name: 0, dtype: object

df[1].apply(get_links) returns

0    []
1    []
Name: 1, dtype: object
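
Note that get_links only looks at the href of <a> tags, so image URLs in src attributes are skipped. If those are wanted too, something along these lines might work. This is only a sketch: get_all_urls, the sample HTML, and the choice of the href and src attributes are assumptions, not part of the answer above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def get_all_urls(html):
    """Return every http(s) URL found in an href or src attribute, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for tag in soup.find_all(True):  # True matches every tag
        for attr in ('href', 'src'):
            value = tag.get(attr)
            if isinstance(value, str) and value.startswith('http'):
                urls.append(value)
    return urls


# Hypothetical sample rows standing in for the real HTML column.
sample = pd.DataFrame({0: [
    '<div><img src="https://example.com/logo.png?a=1&amp;b=2">'
    '<a href="http://example.com/page">link</a><a href="#">skip</a></div>',
    '<p>no links</p>',
]})

sample['links_0'] = sample[0].apply(get_all_urls)
```

One difference from the regex approach: the parser decodes HTML entities in attribute values, so &amp;amp; in the raw markup comes back as a plain & in the extracted URL.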

7 Comments

Two questions. First, does re.compile("^http://") also match https links? Second, will this keep all the URLs in their respective rows? Each URL has to stay associated with the row it came from. For example, if the first row of the DataFrame contains 2 URLs, both should stay in that row, just in different columns (they could also all stay in a single column, since I can then split them in Excel or Google Sheets). But thank you for your work so far, it's a great help.
@GustavoPacheco good catch. I updated the answer with re.compile("^http")
This does not save the links in the df. If you want to save these links in the same df, you can do something like df['links_0'] = df[0].apply(get_links) and similar for the other df column.
I also need to pull URLs from more attributes, so I guess the simplest way would be to use a regex instead of specifying each attribute. Do you know of a way to keep, for each extracted URL, an identifier of the row it came from? It's crucial for me to keep the URLs matched to their source rows, because I then need to plug them back into the DataFrame. Could you give me a hand here? I've been scratching my head over this for a while.
If you do df['links_0'] = df[0].apply(get_links), links extracted by the get_links function will be in the new (i.e., links_0) column of your original df. This way you can track which links are coming from which row.
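
If an explicit row identifier next to each URL is needed, one option is to explode the list column into a long table. The sketch below assumes a links_0 column of lists like the one described above; the sample URLs and the names long, url and source_row are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with a column of per-row URL lists.
df = pd.DataFrame({
    'links_0': [['https://example.com/a', 'http://example.com/b'],
                ['https://example.com/c']],
})

# explode() emits one row per list element and repeats the original index,
# so each URL carries the label of the row it came from.
long = (
    df['links_0']
    .explode()
    .rename('url')
    .rename_axis('source_row')
    .reset_index()
)
```

The source_row column can then be used to merge or group the URLs back onto the original DataFrame.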