1

What I have:

//This data set contains columns (second index) having the same value in each row (first index)
double[][] dataSet = new double[][]
{
  new double[] {1, 2, 3, 4},
  new double[] {5, 6, 7, 4},
  new double[] {8, 9, 10, 4},
}; 

What i want to get:

// This data set has no column where the value in each row is the same
double[][] reducedDataSet = new double[][]
{
  new double[] {1, 2, 3},
  new double[] {5, 6, 7},
  new double[] {8, 9, 10},
}; 

In python this can be easily done by:

all_equal_value_indices = numpy.all(data_set == data_set[0, :], axis=0) // Finds the indices of all columns that have equal values in each row
reduced_data_set = data_set[:, ~all_equal_value_indices] // Takes all rows with only those columns where all_equal_value_indices is not 1

In C# I can get an array containing the indices that should be excluded relatively fast, but how can I use these indices as mask to get only those columns not contained in these indices?

What i tried:

var numberOfDeletedColumns = 0;
var reducedDataSet = dataSet;

foreach (var columnToDelete in columnsToDelete)
{
  reducedDataSet = reducedDataSet.RemoveColumn(columnToDelete - numberOfDeletedColumns++);
}

RemoveColumn is an extension provided by Accord.Net and has the following code:

/// <summary>Returns a new matrix without one of its columns.</summary>
public static T[][] RemoveColumn<T>(this T[][] matrix, int index)
{
  T[][] objArray = new T[matrix.Length][];
  for (int index1 = 0; index1 < matrix.Length; ++index1)
  {
    objArray[index1] = new T[matrix[index1].Length - 1];
    for (int index2 = 0; index2 < index; ++index2)
      objArray[index1][index2] = matrix[index1][index2];
    for (int index2 = index + 1; index2 < matrix[index1].Length; ++index2)
      objArray[index1][index2 - 1] = matrix[index1][index2];
  }
  return objArray;
}

But this is much slower than the implementation in python. Could someone suggest a faster method to achieve a reduced data set?

3
  • 1
    sample data plz Commented Dec 15, 2017 at 15:49
  • 1
    I am not entirely sure what you are asking. Can you please provide some sample data to work with? The data provided should be a before and after. Commented Dec 15, 2017 at 15:49
  • Sorry @Steve and Michael Coxon i provided an example array now to make it clearer. I will check the provided answers now. Commented Dec 17, 2017 at 14:25

2 Answers 2

1

Array.Copy helps it run about 2x faster on my computer.

static T[][] FastRemoveColumn<T>(T[][] matrix, int index)
{
    T[][] objArray = new T[matrix.Length][];
    for (int i = 0; i < matrix.Length; i++)
    {
        var line = matrix[i];
        var reducedline = new T[line.Length - 1];
        Array.Copy(line, 0, reducedline, 0, index);
        Array.Copy(line, index + 1, reducedline, index, line.Length - index - 1);
        objArray[i] = reducedline;                
    }
    return objArray;
}

and I also tried multithread. It runs very slow:

static T[][] MultiThreadRemoveColumn<T>(T[][] matrix, int index)
{
    T[][] objArray = new T[matrix.Length][];
    Parallel.For(0, matrix.Length, i =>
    {
        var line = matrix[i];
        var reducedline = new T[line.Length - 1];
        Array.Copy(line, 0, reducedline, 0, index);
        Array.Copy(line, index + 1, reducedline, index, line.Length - index - 1);
        objArray[i] = reducedline;                
    });
    return objArray;
}

Test:

// init
double[][] arr = new double[2000][];
for (int i = 0; i < arr.Length; i++)            
    arr[i] = new double[2000];

double v = 0;
for (int i = 0; i < arr.Length; i++)
{
    for (int j = 0; j < arr[i].Length; j++)
    {
        arr[i][j] = v;
        v++;
    }
}

Stopwatch sw = Stopwatch.StartNew();
var reducedArr = RemoveColumn(arr, 200);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
var reducedArr2 = FastRemoveColumn(arr, 200);    
sw.Stop();        
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
var reducedArr3 = MultiThreadRemoveColumn(arr, 200); 
sw.Stop();     
Console.WriteLine(sw.ElapsedMilliseconds);

// Check the result
for (int i = 0; i < reducedArr.Length; i++)
{
    for (int j = 0; j < reducedArr[i].Length; j++)
    {
        if(reducedArr[i][j] != reducedArr2[i][j]) throw new Exception();
        if(reducedArr[i][j] != reducedArr3[i][j]) throw new Exception();   
    }
}

Update

Solution to remove several columns:

public static T[][] DeleteColumns<T>(T[][] matrix, int[] columns)
{
    if (matrix.Length == 0) return new T[0][];
    bool[] delColumns = new bool[matrix[0].Length];
    foreach (int col in columns) delColumns[col] = true;
    List<int> remainCols = new List<int>();
    for (int i = 0; i < delColumns.Length; i++)
    {
        if (!delColumns[i]) remainCols.Add(i);
    }
    var target = new T[matrix.Length][];
    for (int rowIndex = 0; rowIndex < matrix.Length; rowIndex++)
    {
        T[] sourceRow = matrix[rowIndex];
        T[] targetRow = new T[remainCols.Count];
        for (int i = 0; i < remainCols.Count; i++)
        {
            targetRow[i] = sourceRow[remainCols[i]];
        }
        target[rowIndex] = targetRow;    
    }
    return target;
}

Test on a 2000x2000 matrix. Comparing with Adam Brown's solution, testing removing all columns is absolutely unfair, but my solution is faster even if removing only one column.

Sign up to request clarification or add additional context in comments.

3 Comments

Tank you for your time skyoxZ. Your solution indeed is faster than mine. Adam Brown provided an answer that is slower on a small number of columns to remove but much faster on many columns. That's why i accepted his answer. Still +1 because for some use cases your solution could be interesting for some people
@YannicKlem Check my new solution.
At the "unfair" test with removing all columns your solution is now ~27.5 times faster Than the one provided by Adam Brown. Doing the test with removing every second column your solution still is ~1.5 times faster. Thank you very much. Didn't want to challenge you :)
1

Sometimes, components don't work nicely in aggregate. In this case, your remove column function is reallocating the entire matrix, so the operation is linear in the number of columns you want to remove (ouch). To fix this, remove all the columns in one pass.

class Program
{
    static void Main(string[] args)
    {
        var matrix = new[]
        {
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
            new [] {1, 2, 3, 4, 5},
        };

        var result = matrix.DeleteColums(new [] {0, 2, 4});

        foreach (var row in result)
        {
            foreach (var column in row)
            {
                Console.Write(column);
                Console.Write(" ");
            }
            Console.WriteLine();
        }

        Console.ReadKey();
    }
}

static class MatrixHelper
{
    public static T[][] DeleteColums<T>(this T[][] matrix, int[] columns)
    {
        var sorted = columns.Distinct().OrderBy(e => e).Concat(new [] { int.MaxValue }).ToArray();
        var target = new T[matrix.Length][];

        for (int row = 0; row < matrix.Length; row++)
        {
            var sourceRow = matrix[row];
            var targetRow = new T[sourceRow.Length - columns.Length];
            var sortedIndex = 0;
            for (int i = 0; i < sourceRow.Length; i++)
            {
                if (i == sorted[sortedIndex])
                {
                    sortedIndex++;
                    continue;
                }

                targetRow[i - sortedIndex] = sourceRow[i];
            }
            target[row] = targetRow;
        }

        return target;
    }
}

If this isn't enough, then you'll need to think about whether you need to use arrays. For instance, you could have a data structure for your matrix that dynamically masks columns, instead of an array-of-arrays.

UPDATE:

Given that other solutions in this page have assumed that the matrix, despite being represented by a jagged array, has the same indices per row, I thought I would give another go at making a faster solution. Here are two solutions that beat all previous ones in this thread, under those assumptions, including a faster parallel one.

public static T[][] DeleteColumns<T>(this T[][] matrix, int[] columns)
    {
        if (matrix.Length == 0) return matrix;

        //Previous code assumed matrix could be jagged - new code assumes all columns 
        //present and all rows same length
        var rowLength = matrix[0].Length;

        if (rowLength == 0) return matrix;

        var sorted = columns.Distinct().ToArray();
        var target = new T[matrix.Length][];
        var remainingLength = rowLength - sorted.Length;

        //Allocate the targets all in one go - to avoid doing allocation in parallel.
        for (var row = 0; row < matrix.Length; row++)
        {
            target[row] = new T[remainingLength];
        }

        //Work out remaining columns (previous code assumed these could 
        //be different per row, this assumes all rows have the same
        //contents.
        var remaining = Enumerable.Range(0, rowLength).Except(sorted).ToArray();

        for (int row = 0; row < matrix.Length; row++)
        {
            var sourceRow = matrix[row];
            var targetRow = target[row];
            for (int i = 0; i < targetRow.Length; i++)
            {
                targetRow[i] = sourceRow[remaining[i]];
            }
        }

        return target;
    }

And the faster parallel one (allocation for the parallel one is now about 90% of the total time):

public static T[][] DeleteColumnsParallel<T>(this T[][] matrix, int[] columns)
    {
        if (matrix.Length == 0) return matrix;

        //Previous code assumed matrix could be jagged - new code assumes all columns 
        //present and all rows same length
        var rowLength = matrix[0].Length;

        if (rowLength == 0) return matrix;

        var sorted = columns.Distinct().ToArray();
        var target = new T[matrix.Length][];
        var remainingLength = rowLength - sorted.Length;

        //Allocate the targets all in one go - to avoid doing allocation in parallel.
        for (var row = 0; row < matrix.Length; row++)
        {
            target[row] = new T[remainingLength];
        }

        //Work out remaining columns (previous code assumed these could 
        //be different per row, this assumes all rows have the same
        //contents.
        var remaining = Enumerable.Range(0, rowLength).Except(sorted).ToArray();

        Parallel.For(0, matrix.Length, row =>
        {
            var sourceRow = matrix[row];
            var targetRow = target[row];
            for (int i = 0; i < targetRow.Length; i++)
            {
                targetRow[i] = sourceRow[remaining[i]];
            }
        });

        return target;
    }

Results for 10000x10000 matrix with half the columns randomly removed.

  • My previous attempt: 1300ms
  • Best previous attempt: 450ms
  • My new serial version: 390ms
  • My new parallel version: 310ms
  • Time just to allocate the result matrix (i.e. lower bound on best achievable time): 265ms

But, I think it's important to call out a far, far faster solution. At the moment, in the fastest parallel solution, 90% of the time is spent allocating memory. If, on the other hand, you were to make a Matrix class that had it's own indexer, you would be able to dynamically pretend that certain columns of the underlying data structure didn't exist. Depending on how you're using the matrices, versus how often you're masking rows or columns, this could be dramatically faster.

3 Comments

Thank you for your time to provide a possible solution. The solution of skyoxz seems to be the fastest. Its 2 times faster than the one i tried while yours is about 22 times slower than the one i tried. i will check it with a lots of columns to remove. Because yours should scale better
Hi. That was the point. I assumed from your question - and the fact that you were having performance issues at all - that you were removing multiple (large numbers) of columns at once. This code is pretty much the same running time for removing 1000 columns as 1 column.
Yes your assumption was correct. Skyoxz has updated his answer and now his solution is a bit faster for removing multiple columns.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.