
Problem: I have a huge number of SQL queries (around 10k-20k) and I want to run them asynchronously in 50 (or more) threads.

I wrote a PowerShell script for this job, but it is very slow (it took about 20 hours to execute them all). The desired result is 3-4 hours max.

Question: How can I optimize this PowerShell script? Should I reconsider and use another technology like Python or C#?

I think it's a PowerShell issue: when I check with sp_WhoIsActive, the queries themselves execute fast. Creating, exiting, and unloading the jobs takes a lot of time, because a separate PowerShell instance is created for each thread.

My code:

$NumberOfParallelThreads = 50;


$Arr_AllQueries = @('Exec [mystoredproc] @param1=1, @param2=2',
                    'Exec [mystoredproc] @param1=11, @param2=22',
                    'Exec [mystoredproc] @param1=111, @param2=222')

#Creating the batches
$counter = [pscustomobject] @{ Value = 0 };
$Batches_AllQueries = $Arr_AllQueries | Group-Object -Property {
    [math]::Floor($counter.Value++ / $NumberOfParallelThreads)
};

forEach ($item in $Batches_AllQueries) {
    $tmpBatch = $item.Group;

    $tmpBatch | % {

        $ScriptBlock = {
            # accept the loop variable across the job-context barrier
            param($query) 
            # Execute a command

            Try 
            {
                Write-Host "[processing '$query']"
                $objConnection = New-Object System.Data.SqlClient.SqlConnection;
                $objConnection.ConnectionString = 'Data Source=...';

                $ObjCmd = New-Object System.Data.SqlClient.SqlCommand;
                $ObjCmd.CommandText = $query;
                $ObjCmd.Connection = $objConnection;
                $ObjCmd.CommandTimeout = 0;

                $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter;
                $objAdapter.SelectCommand = $ObjCmd;
                $objDataTable = New-Object System.Data.DataTable;
                $objAdapter.Fill($objDataTable)  | Out-Null;

                $objConnection.Close();
                $objConnection = $null;
            } 
            Catch 
            { 
                $ErrorMessage = $_.Exception.Message
                $FailedItem = $_.Exception.ItemName
                Write-Host "[Error processing: $($query)]" -BackgroundColor Red;
                Write-Host $ErrorMessage 
            }

        }

        # pass the loop variable across the job-context barrier
        Start-Job $ScriptBlock -ArgumentList $_ | Out-Null
    }

    # Wait for all to complete
    While (Get-Job -State "Running") { Start-Sleep 2 }

    # Display output from all jobs
    Get-Job | Receive-Job | Out-Null

    # Cleanup
    Remove-Job *

}

UPDATE:

Resources: The DB server is on a remote machine with:

  • 24GB RAM,
  • 8 cores,
  • 500GB Storage,
  • SQL Server 2016

We want to use the maximum CPU power.

Framework limitation: The only limitation is not to use SQL Server itself to execute the queries. The requests should come from an outside source like PowerShell, C#, Python, etc.

Comments
  • You would need a RunspacePool to open multiple threads... here is a hint how to. Commented Jul 12, 2019 at 14:16
  • Make certain your queries are hitting DB indexes... if not, jack up your DB to make this happen... if possible, ensure your DB runs entirely in RAM to speed it up. Is your DB server on a remote machine? Is the current slowness possibly due to IO? If so, you could clone the entire remote DB into a local DB server and run your SQL against that local clone. Commented Jul 15, 2019 at 5:13
  • It doesn't really matter about asynchronous threads; you will need to understand SQL query plans, bottlenecks, deadlocks, etc. If you use any third-party application, it won't have any effect on base query performance. It is important to understand deadlocks. The framework limitations have to be dealt with to understand the issue further. Commented Jul 19, 2019 at 14:24
  • I am no PS expert, but looking at your script it seems you are connecting to the DB for every query? If so, there is huge overhead; you should keep the connection alive in the thread, if possible. Commented Jul 19, 2019 at 19:28
  • Try PSThreadJob from the PowerShell Gallery; it runs concurrent jobs based on threads rather than processes (see the sketch below). Commented Jul 23, 2019 at 7:03
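
A minimal sketch of that suggestion, assuming the ThreadJob module has been installed from the gallery; the script block body is abbreviated and would hold the same SQL code as in the question:

Install-Module ThreadJob -Scope CurrentUser   # one-time setup

# Thread jobs run inside the current process, so they avoid the per-job
# PowerShell startup and teardown cost that plain Start-Job pays.
$Arr_AllQueries | ForEach-Object {
    Start-ThreadJob -ThrottleLimit 50 -ArgumentList $_ -ScriptBlock {
        param($query)
        # ... execute $query exactly as in the original script block ...
    }
} | Receive-Job -Wait -AutoRemoveJob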

6 Answers


RunspacePool is the way to go here, try this:

$AllQueries = @( ... )
$MaxThreads = 5

# Each thread keeps its own connection but shares the query queue
$ScriptBlock = {
    Param($WorkQueue)

    $objConnection = New-Object System.Data.SqlClient.SqlConnection
    $objConnection.ConnectionString = 'Data Source=...'

    $objCmd = New-Object System.Data.SqlClient.SqlCommand
    $objCmd.Connection = $objConnection
    $objCmd.CommandTimeout = 0

    $query = ""

    while ($WorkQueue.TryDequeue([ref]$query)) {
        $objCmd.CommandText = $query
        $objAdapter = New-Object System.Data.SqlClient.SqlDataAdapter $objCmd
        $objDataTable = New-Object System.Data.DataTable
        $objAdapter.Fill($objDataTable) | Out-Null
    }

    $objConnection.Close()

}

# create a pool
$pool = [RunspaceFactory]::CreateRunspacePool(1, $MaxThreads)
$pool.ApartmentState  = 'STA'
$pool.Open()

# convert the query array into a concurrent queue
$workQueue = New-Object System.Collections.Concurrent.ConcurrentQueue[object]
$AllQueries | % { $workQueue.Enqueue($_) }

$threads = @()

# Create each powershell thread and add them to the pool
1..$MaxThreads | % {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    $ps.AddScript($ScriptBlock) | Out-Null
    $ps.AddParameter('WorkQueue', $workQueue) | Out-Null
    $threads += [pscustomobject]@{
        Ps = $ps
        Handle = $null
    }
}

# Start all the threads
$threads | % { $_.Handle = $_.Ps.BeginInvoke() }

# Wait for all the threads to complete - errors will still set the IsCompleted flag
while ($threads | ? { !$_.Handle.IsCompleted }) {
    Start-Sleep -Seconds 1
}

# Get any results and display any errors
$threads | % {
    $_.Ps.EndInvoke($_.Handle) | Write-Output
    if ($_.Ps.HadErrors) {
        $_.Ps.Streams.Error.ReadAll() | Write-Error
    }
}

Unlike PowerShell jobs, runspaces in a RunspacePool can share live objects. So there is one concurrent queue holding all the queries, and each thread keeps its own open connection to the database.

As others have said though, unless you're stress-testing your database, you're probably better off reorganising the queries into bulk inserts.




You need to reorganize your script so that you keep a database connection open in each worker thread, using it for all queries performed by that thread. Right now you are opening a new database connection for each query, which adds a large amount of overhead. Eliminating that overhead should speed things up to or beyond your target.
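
A minimal sketch of that pattern, assuming each worker is handed a batch of queries in a hypothetical $myBatchOfQueries variable: open the connection once, reuse it for every command, and close it at the end.

# open one connection per worker and reuse it ('Data Source=...' is a placeholder)
$connection = New-Object System.Data.SqlClient.SqlConnection
$connection.ConnectionString = 'Data Source=...'
$connection.Open()

$command = $connection.CreateCommand()
$command.CommandTimeout = 0

foreach ($query in $myBatchOfQueries) {
    $command.CommandText = $query
    $command.ExecuteNonQuery() | Out-Null   # or fill a DataTable if results are needed
}

$connection.Close()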

3 Comments

Yes, I am aware of this, but the problem is this $ScriptBlock thingy... it has a different scope, and I can't declare and open a connection outside and then use it inside. I don't have access to global variables in there. Good catch, though.
How about dividing the queries into 50 smaller batches before going into the script block? Then pass each smaller batch into the script block instead of passing one query at a time, and have the script block loop through the queries in its batch (sketched below).
Have you tried passing the open connection to the script block via the -ArgumentList parameter? Then refer to it as $args[0].
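
For what it's worth, a live SqlConnection cannot usefully be passed through -ArgumentList, because Start-Job serializes its arguments into a separate process. Here is a sketch of the batching idea instead, reusing the $Batches_AllQueries grouping from the question: create the connection inside the job (as in the sketch above) and loop over the whole batch.

foreach ($item in $Batches_AllQueries) {
    # the leading comma passes the whole batch as a single array argument
    Start-Job -ArgumentList (,$item.Group) -ScriptBlock {
        param($queries)
        $conn = New-Object System.Data.SqlClient.SqlConnection
        $conn.ConnectionString = 'Data Source=...'
        $conn.Open()
        $cmd = $conn.CreateCommand()
        $cmd.CommandTimeout = 0
        foreach ($q in $queries) {
            $cmd.CommandText = $q
            $cmd.ExecuteNonQuery() | Out-Null
        }
        $conn.Close()
    } | Out-Null
}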

Try using SqlCmd.

You can launch multiple processes using Process.Start() and use sqlcmd to run the queries in parallel.

Of course, if you're obligated to do it in threads, this answer will no longer be the solution.
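
A rough PowerShell sketch of this idea, assuming the queries have already been split into per-batch .sql files under a hypothetical C:\batches folder, a hypothetical server name 'myserver', and Windows authentication (-E):

$files = Get-ChildItem 'C:\batches\*.sql'    # hypothetical pre-built batch files
$procs = foreach ($file in $files) {
    # -S server, -E trusted connection, -i input file
    Start-Process -FilePath 'sqlcmd' `
        -ArgumentList '-S', 'myserver', '-E', '-i', $file.FullName `
        -NoNewWindow -PassThru
}
$procs | Wait-Process    # block until every sqlcmd process has finished

Note there is no throttling here, so keep the number of batch files close to the degree of parallelism you actually want.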


  1. Group your queries based on the table and the operations on that table. Using this you can identify how many async SQL queries you can run against your different tables.
  2. Check the size of each table you are going to run against. If a table contains millions of rows and you are doing a join with another table, that will increase the time; and if it is a CUD operation, it might lock the table as well.
  3. Also, choose the number of threads based on your CPU cores, not on assumptions. A CPU core executes one thread at a time, so creating around (number of cores * 2) threads is more efficient (see the sketch below).

So first study your data set and then apply the items above, so that you can easily identify which queries can be run in parallel efficiently.

Hope this gives you some ideas. You could also use a Python script, so that you can easily trigger more than one process and monitor their activities.
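
A minimal sketch of point 3, sizing the thread count from the machine's logical core count instead of hard-coding 50:

# [Environment]::ProcessorCount reports logical cores on the machine running the script
$coreCount = [Environment]::ProcessorCount
$NumberOfParallelThreads = $coreCount * 2    # cores * 2, per the rule of thumb above
Write-Host "Using $NumberOfParallelThreads threads on $coreCount logical cores"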

2 Comments

Point 3 is not exactly right. A CPU core will have a ton of threads scheduled at once, but it can only actively process one at a time. So 50 threads on one CPU will run, but a ton of time will be spent waiting on threads when a thread could be actively running. Thus, your conclusion is correct: don't do that!
Yes, that's why I said to create threads based on the cores. You could create thousands of threads for a CPU, but only one will be executed at a time and the remaining threads will be waiting. My point is that all those waiting threads unnecessarily occupy memory, and the CPU also needs to do context switches. So for better efficiency we should trigger a number of threads based on the CPU cores. It's like thread pooling, with which we can do the work.

Sadly I don't have the time right this instant to answer this fully, but this should help:

First, you aren't going to use the entire CPU for inserting that many records, that's almost guaranteed. But!

Since it appears you are using SQL string commands:

  1. Split the inserts into groups of roughly 100 to 1000 rows and manually build bulk inserts:

Something like this as a POC:

  $query = "INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES "

  for ($alot = 0; $alot -le 10; $alot++){
     for ($i = 65; $i -le 85; $i++) {
       $query += "('" + [char]$i + "', '" + [char]$i + "')"; 
       if ($i -ne 85 -or $alot -ne 10) {$query += ",";}
      }
   }

Once a batch is built, then pass it to SQL for the insert, using effectively your existing code.

The bulk insert would look something like:

INSERT INTO [dbo].[Attributes] ([Name],[PetName]) VALUES ('A', 'A'),('B', 'B'),('C', 'C'),('D', 'D'),('E', 'E'),('F', 'F'),('G', 'G'),('H', 'H'),('I', 'I'),('J', 'J'),('K', 'K'),('L', 'L'),('M', 'M'),('N', 'N'),('O', 'O'),('P', 'P'),('Q', 'Q'),('R', 'R'),('S', 'S')

This alone should speed up your inserts by a ton!

  2. Don't use 50 threads, as previously mentioned, unless you have 25+ logical cores. You will spend most of your SQL insert time waiting on the network and hard drives, NOT the CPU. With that many threads enqueued, most of your CPU time will be spent waiting on the slower parts of the stack.

These two things alone, I'd imagine, can get your inserts down to a matter of minutes (I once did 80k+ in about 90 seconds using basically this approach).

The last part would be refactoring so that each core gets its own SQL connection, which you leave open until you are ready to dispose of all the threads.

2 Comments

But I am not using INSERT statements; I am executing stored procedures. I have no power over the SQL queries and cannot optimize them. I will consider the 2nd suggestion.
@Nyagolova If you can't change / optimize them: Could you add the total number of queries and some examples ?

I don't know much about PowerShell, but I do execute SQL in C# all the time at work.

C#'s async/await keywords make it extremely easy to do what you are talking about. C# will also create a thread pool for you with the optimal number of threads for your machine.

async Task<DataTable> ExecuteQueryAsync(string query)
{
    return await Task.Run(() => ExecuteQuerySync(query));
}

async Task ExecuteAllQueriesAsync(IEnumerable<string> queries)
{
    var queryTasks = new List<Task<DataTable>>();

    foreach (var query in queries)
    {
        queryTasks.Add(ExecuteQueryAsync(query));
    }

    foreach (var task in queryTasks)
    {
        await task;
    }
}

The code above adds all the queries to the thread pool's work queue, then waits for them all to complete. The result is that the maximum level of parallelism will be reached for your SQL.

Hope this helps!

