I'm running the following steps in an attempt to clean up a string obtained using $query_text_lower = file_get_contents(websiteURL).

What I want is to return just words: no JavaScript, no random numbers, no CSS, and no other kinds of scripts.

//remove javascript
$query_text_lower = preg_replace("/<script[^>]*>.*?< *script[^>]*>/i", "", $new_text); 

//remove html tags
$query_text_lower2 = strip_tags($query_text_lower);

//removes any text containing links (may not be best, as some sites link useful words within the text; it does tend to remove a lot of ads, though)
$query_text_lower3 = preg_replace('/<a\s.*?>.*?<\/a>/s', '', $query_text_lower2);

//removes linebreaks
$query_text_lower4 = trim(preg_replace('/\s+/', ' ', $query_text_lower3));

echo $query_text_lower4;
die();

Here is an example of what I am outputting at the moment:

developing a cafe: 13 steps - wikihow /**/ /**/ messages log in log in via log in remember me forgot? create an account explore community dashboardrandom articleabout uscategoriesrecent changes help us communication an articlerequest a new articleanswer a requestmore ideas... edit edit this article home » categories » investment and business » business » buying & forming a business » hospitality businesses articleeditdiscuss wh.mergelang({ 'navlist_collapse': '- collapse','navlist_expand': '+ expand','usernameoremail': 'username or email','password': 'password' }); edit articlehow to start a cafe edited by harri, maluniu, annie, afc8871 and 1 other if you have always dreamt of management a business, then learning how to start a cafe may be the answer. with the right planning beforehand, opening a cafe can become highly profitable. your cafe can easily become a place where staff come to relax, enjoy schedule with friends or family, grab a quick bite to eat, or to work on their latest project. start a cafe business by following the steps below. ad google_ad_customer = "ca-pub-9543332082073187"; /* iframe unit - intro */ if(abtype == 2) google_ad_slot = '6354743772'; else //a or normal google_ad_slot = '8579663774'; if(abtype == 2 || abtype == 3 || abtype == 4 || abtype == 5 || abtype == 6) { google_ad_width = 671; google_ad_height = 120; google_max_num_ads = 2; } else if(abtype >= 7) { google_ad_width = 645; google_ad_height = 60; google_max_num_ads = 1; } else { google_ad_width = 671; google_ad_height = 60; google_max_num_ads = 1; } google_ad_results = 'html'; google_override_format = true; google_ad_channel = "0206790666+7733764704+1640266093+6709519645+8052511407+6822404019+7122150828" + gchans + xchannels; if( fromsearch ) { document.communication(''); } //--> edit steps 1communication your business and marketing plans. these are very important aspects of any business, as they will show your course of action for both management and marketing the business. 
refer to these documents often to make sure you stay on track. without these documents, you may not be able to secure funding. ad google_ad_customer = "ca-pub-9543332082073187"; /* iframe unit - first step */ if(abtype == 2) google_ad_slot = '4878010577'; else //a or normal google_ad_slot = '5205564977'; if(abtype == 2 || abtype == 3 || abtype == 4 || abtype == 5 || abtype == 6) { google_ad_width = 629; google_ad_height = 120; google_max_num_ads = 2; } else if(abtype >= 7) { google_ad_width = 600; google_ad_height = 60; google_max_num_ads = 1; } else { google_ad_width = 629; google_ad_height = 60; google_max_num_ads = 1; } google_ad_results = 'html'; google_override_format = true; google_ad_channel = "2748203808+7733764704+1640266093+6709519645+8052511407+2490795108+6822404019+7122150828" + gchans + xchannels; document.communication(''); //--> 2follow all legalities for starting a cafe business of this nature in your area. make sure you get all the necessary licenses, permits, and insurance required on federal, state, and local levels. 3secure funding for your business. in your business plan, you determined how much funding you need to start a cafe business. contact investors, apply for loans, and use whatever capital you have on hand to start the business. 

2 Answers

Your JavaScript regex is off.

You have:

$query_text_lower = preg_replace("/<script[^>]*>.*?< *script[^>]*>/i", "", $new_text); 

The pattern looks for a second opening <script tag rather than the closing </script>, so it never matches and the JavaScript code itself stays in the page. When you then call strip_tags, only the tags are removed, so the script's contents still appear in your final output. I can't see the site you're pulling this from, though, so I can't be 100% sure.

Let me know if that makes sense. Basically, the way it looks to me is that your first regex isn't actually matching anything.
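
To illustrate the mismatch, here is a minimal sketch. It assumes a made-up HTML snippet in $html (the real page will differ) and compares the pattern from the question with one that escapes the closing tag. The variable names $original and $fixed are only for illustration.

```php
<?php
// Hypothetical sample input; the real page content will differ.
$html = '<p>Hello</p><script type="text/javascript">var x = 1;</script><p>World</p>';

// Pattern from the question: after the opening tag it expects another
// "<script", so "</script>" never satisfies it and nothing is replaced.
$original = preg_replace("/<script[^>]*>.*?< *script[^>]*>/i", "", $html);

// Pattern with an escaped closing tag. The "s" modifier lets "." match
// newlines, which matters for multi-line script blocks.
$fixed = preg_replace('/<script\b[^>]*>.*?<\/script>/is', "", $html);

var_dump($original === $html); // true means the first pattern matched nothing
echo strip_tags($fixed), "\n"; // text left after the script block is removed
```

Because strip_tags only removes the tags themselves, the script block has to be removed before it is called.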

1 Comment

I think you pinpointed the problem. I changed the regex to: $query_text_lower = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', "", $new_text); It now works.

You can't parse that stuff out using regular expressions. The suggestion to parse the HTML using the existing DOM tools is the right way to go.
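
As a rough sketch of the DOM approach, assuming PHP's bundled DOM extension is available: the function name extract_visible_text and the sample markup are hypothetical, and real pages may need extra handling (character encoding, comments, noscript blocks) that is not shown here.

```php
<?php
// Hypothetical helper, not from the original post: returns the text of an
// HTML document with script and style elements removed.
function extract_visible_text(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a live list, so copy it to an array
    // before removing nodes from the tree.
    foreach (['script', 'style'] as $tag) {
        $nodes = iterator_to_array($doc->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    if ($root === null) {
        return '';
    }

    // Collapse runs of whitespace, as the question's last step does.
    return trim(preg_replace('/\s+/', ' ', $root->textContent));
}

// Made-up sample page for illustration only.
$sample = '<html><head><style>p{color:red}</style></head><body>'
        . '<p>Hello there.</p><script>var x = 1;</script>'
        . '<p>General Kenobi.</p></body></html>';

$text = extract_visible_text($sample);
echo $text, "\n";
```

In the question's code, this would replace the script regex and the strip_tags call: pass the result of file_get_contents to the helper instead.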

2 Comments

Yeah, you can try real hard, but you're not gonna get everything.
I will explore using a DOM-traversing tool - thanks.