
Before the code runs: the nested loop iterates at least 143,792,640,000 times and must produce a table of at least 563,760 rows without duplicates. How can I speed this up, or could some form of parallel computing (Hadoop, for example) accelerate the work between PHP and MySQL?

Code below:

MySQL connection

$link=mysql_connect($servername,$username,$password);
mysql_select_db($dbname);
$sql= "INSERT INTO EM (source,target) VALUES ";

The for loop reads the data into MySQL; the check function detects duplicates, and instead of inserting a duplicate it updates count = count + 1:

for ($i = 0; $i < $combine_arr_size; $i++) {
    for ($j = 0; $j < $combine_arr_size; $j++) {

        // check() treats a pair like (a, b) and (b, a) as the same thing
        if (check($combine_words_array[$i], $combine_words_array[$j])) {
            $update_query = "UPDATE EM SET count = count+1 WHERE (source='$combine_words_array[$i]' AND target='$combine_words_array[$j]') OR (source='$combine_words_array[$j]' AND target='$combine_words_array[$i]');";
            mysql_query($update_query);
        } else {
            if (!$link) {
                die("Connection failed: " . mysql_error());
            }

            // otherwise append to the INSERT INTO ... VALUES string and run it
            $sql .= "('$combine_words_array[$i]','$combine_words_array[$j]'),";
            mysql_query(substr($sql, 0, -1)); // strip the trailing comma
            $sql = "INSERT INTO EM (source,target) VALUES ";
        }
    }
}

The loop aligns every element of $combine_words_array[] against every other element of $combine_words_array[].

Below is the check function; it returns 1 if the pair already exists in the table (in either order):

function check($src, $trg) {
    $query = mysql_query("SELECT * FROM EM WHERE (source='$src' AND target='$trg') OR (source='$trg' AND target='$src');");
    if (mysql_num_rows($query) > 0) {
        return 1;
    } else {
        return 0;
    }
}

Table schema:

+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| source | varchar(255) | YES  |     | NULL    |       |
| target | varchar(255) | YES  |     | NULL    |       |
| count  | int(11)      | NO   |     | 0       |       |
| prob   | double       | NO   |     | 0       |       |
+--------+--------------+------+-----+---------+-------+

At the moment the PHP code only affects the source, target and count columns.

  • 143B rows, phew! How long does this take on your production hardware presently? :-) I imagine that a lot of this could be converted to a stored procedure, and so would run a lot faster. Try that first, maybe? Commented Jul 21, 2015 at 7:44
  • Also, can you add to your question an explanation of the pseudocode of this algorithm and what it is doing? Maybe you are doing something really inefficiently, and there is a better/faster way to do it. Commented Jul 21, 2015 at 7:45
  • (Correction: 143B iterations, not rows. Still a lot of work though!) Commented Jul 21, 2015 at 7:51
  • Please also provide your MySQL schema. You are doing lookups on quite a large set, so I do hope you have index fields. You could also consider different queries (like REPLACE) using some KEYS, which would also clean up your code. Also, you treat source and target as interchangeable values in your checks and updates, but not in your inserts. If the fields are indeed interchangeable, you could try inserting values where source and target are automatically assigned the lesser or greater value. (Actually, are you dealing with a graph of some sort?) Commented Jul 21, 2015 at 8:48
  • Oh, and actually before going to the SQL bit, you could preprocess the data directly in PHP. I mean, try to reduce the actual datasets by eliminating duplicates (which you treat as counts) by directly computing the counts and doing the combination afterwards, which you could also simplify, I guess... and knowing about the actual problem space of your data might help reduce the complexity a bit more. Commented Jul 21, 2015 at 9:01
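The preprocessing idea in the comments above can be sketched in plain PHP: canonicalise each pair so (a, b) and (b, a) map to the same key, then count duplicates in memory before touching MySQL at all. This is a minimal sketch, not the asker's code; the sample array stands in for the real data.

```php
<?php
// Sketch only: canonicalise each pair so (a, b) and (b, a) share one
// key, then count occurrences in PHP instead of querying MySQL per pair.
// $combine_words_array is from the question; the sample data is a stand-in.

$combine_words_array = ['a', 'b', 'c'];
$counts = [];

$n = count($combine_words_array);
for ($i = 0; $i < $n; $i++) {
    for ($j = 0; $j < $n; $j++) {
        // order the pair so (a, b) and (b, a) produce the same key
        $src = min($combine_words_array[$i], $combine_words_array[$j]);
        $trg = max($combine_words_array[$i], $combine_words_array[$j]);
        $key = $src . "\x00" . $trg;  // NUL byte cannot appear in a word
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
}

// $counts now holds every distinct pair with its frequency; a handful of
// multi-row INSERTs can then replace billions of single-row queries.
```

The inner loops still run over every combination, but all duplicate detection happens in a PHP hash map rather than via per-pair SELECT/UPDATE round trips to MySQL.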

2 Answers


It is difficult to know exactly what you want to do with duplicate combinations. For example, you are generating every combination of the array, which will produce many duplicates that you will then count twice.

However, I would be tempted to load the words into a table (possibly a temporary table) and then do a cross join of the table against itself to get every combination, and use this to do an INSERT with an ON DUPLICATE KEY clause.

Very crudely, something like this:

<?php

$link = mysql_connect($servername,$username,$password);
mysql_select_db($dbname);

$sql = "CREATE TEMPORARY TABLE words
        (
            word varchar(255),
            PRIMARY KEY (`word`)
        )";
mysql_query($sql);

$sql = "INSERT INTO words (word) VALUES ";
$sql_parm = array();

foreach($combine_words_array AS $combine_word)
{
    $sql_parm[] = "('".mysql_real_escape_string($combine_word)."')";
    if (count($sql_parm) > 500)
    {
        mysql_query($sql.implode(',', $sql_parm));
        $sql_parm = array();
    }
}

if (count($sql_parm) > 0)
{
    mysql_query($sql.implode(',', $sql_parm));
    $sql_parm = array();
}

$sql = "INSERT INTO EM(source, target)
        SELECT w1.word, w2.word
        FROM words w1
        CROSS JOIN words w2
        ON DUPLICATE KEY UPDATE `count` = `count` + 1
        ";

mysql_query($sql);

This does rely on having a unique key covering both the source and target columns.

But whether this is an option depends on the details of the records. For example, with your current code, if there were two words (say A and B) you would find both the combination A / B and the combination B / A, and both combinations would update the same record.
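The A / B versus B / A problem can be handled in SQL itself by canonicalising the pair at insert time, as one of the question comments suggests. A rough sketch, assuming the EM and words tables from this page (the key name uq_pair is arbitrary):

```sql
-- A unique key over the pair makes ON DUPLICATE KEY UPDATE possible.
ALTER TABLE EM ADD UNIQUE KEY uq_pair (source, target);

-- LEAST/GREATEST order each pair, so (a, b) and (b, a) hit the same
-- row and the duplicate-key clause does the counting.
INSERT INTO EM (source, target, `count`)
SELECT LEAST(w1.word, w2.word), GREATEST(w1.word, w2.word), 1
FROM words w1
CROSS JOIN words w2
ON DUPLICATE KEY UPDATE `count` = `count` + 1;
```

With this shape, the per-pair check() SELECT disappears entirely; the unique index and the duplicate-key clause replace it.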




Put a better processor in your server and increase the RAM, then go to your php.ini settings and raise the maximum allocated memory in the various memory- and processor-related configuration options.

This will give the server more headroom and improve running efficiency.

If you cannot find your php.ini file, create a new PHP file with the following contents and open it in the browser:

<?php phpinfo(); ?>

Make sure you delete this file after finding out where php.ini is, as an unwanted user (hacker) could find it, and it would give them detailed information about vulnerabilities in your server configuration.

Once you've found php.ini, do some research online to determine the settings that are not obvious, and increase the memory allocations in the various areas.
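The directives this answer has in mind would look roughly like the following in php.ini. The values are illustrative only, not recommendations; tune them to the machine:

```ini
; Illustrative values only -- tune to the hardware.
memory_limit = 2048M        ; per-script memory ceiling (-1 = unlimited)
max_execution_time = 0      ; 0 = no time limit, sensible for batch jobs
```

For a long-running batch job like this one, running the script from the CLI (where max_execution_time defaults to 0) is usually simpler than raising web-server limits.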

3 Comments

memory_limit has been set to -1, so there is no limit, but it still runs for more than 3 months
"an unwanted user (hacker) could find this file [phpinfo script]" - I don't imagine such a script would be web-accessible. It should be run from the console, since the OP's script would also be console-based.
You'll be surprised how many people leave a phpinfo.php file on their webroot by accident. I haven't run it in the console before, but will take a look at it. Thanks for the advice.
