1

I'm looking for a way to return a dynamic column list from a LINQ join of two datatables.

First, this is not a duplicate. I have already studied and discarded:

C# LINQ list select columns dynamically from a joined dataset

Creating a LINQ select from multiple tables

How to do a LINQ join that behaves exactly like a physical database inner join?

(and many others)

Here is my starting point:

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
   DataTable result = ( from dataRows1 in dt1.AsEnumerable()
                        join dataRows2 in dt2.AsEnumerable()
                        on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
                        [...I NEED HELP HERE with the SELECT....]).CopyToDataTable();
   return result;
}

A few notes and requirements:

  1. There is no database engine. The data sources are large CSV files (500K+ records) being read into c# DataTables.
  2. Because the CSVs are large, looping through each record in the join is a bad solution for performance reasons. I've already tried record looping and it's just too slow. I get great performance on the join above, but I can't find a way to have it return just the columns I want (specified by the caller) without looping records.
  3. If I need to loop over columns in the join, that is perfectly fine, I just don't want to loop rows.
  4. I want to be able to pass in an array of column names and return just those columns in the resulting DataTable. If both datatables being passed in happen to have a column named the same, and if that column is in my array of column names, just pass back either column because the data will be the same between the 2 columns in that case.
  5. If I need to pass in 2 arrays (1 for each datatable's desired columns) that's fine, but 1 array of column names would be ideal.
  6. The column list cannot be static and hardcoded into the function. The reason is because my JoinDataTables() is called from many different places in my system in order to join a wide variety of CSVs-turned-datatables, and each CSV file has very different columns.
  7. I don't want all columns returned in the resulting DataTable -- just the columns I specify in the columns array.

So suppose, before calling JoinDataTables(), I have the following 2 datatables:

Table: T1
T1A T1B T1C T1D
==================
10  AA  H1  Foo1
11  AB  H1  Foo2
12  AA  H2  Foo1
13  AB  H2  Foo2

Table: T2
T2A T2X T2Y T2Z
==================
12  N1  O1  Yeah1
17  N2  O2  Yeah2
18  N3  O1  Yeah1
19  N4  O2  Yeah2

Now suppose we join these 2 tables like so: ON T1.T1A = T2.T2A

select * from [join]

and that yields this resultset:

T1A T1B T1C T1D   T2A T2X T2Y T2Z
====================================
12  AA  H2  Foo1  12  N1  O1  Yeah1

Notice that only 1 row is yielded by the join.

Now to the crux of my question. Suppose that for a given use case, I want to return only 4 columns from this join: T1A, T1D, T2A, and T2Y. So my resultset would then look like this:

T1A T1D   T2A  T2Y
==================
12  Foo1  12   O1

I'd like to be able to call my JoinDataTables function like so:

DataTable dt = JoinDataTables(dt1, dt2, "T1A", "T2A", new string[] {"T1A", "T1D", "T2A", "T2Y"});

Keeping in mind performance and the fact that I don't want to loop through records (because it's slow for large sets of data), how can this be accomplished? (The join is already working well, now I just need a correct select segment (whether via new{..} or whatever you think)).

I cannot accept a solution with a hardcoded column list inside the function. I have found examples of that approach all over SO.

Any ideas?

EDIT: I'd be ok getting ALL columns back every time, but every attempt I've made to include all columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join.

6
  • "I don't want all columns returned in the resulting DataTable" -- Why don't you? If I did have a valid reason to exclude columns (I'm not saying you don't), I'd select everything in the join, then write code to grab just the columns I want from result and stuff them into a new DataTable that had just the required rows. I'd look at programmatically generating a LINQ expression to select the columns I wanted, but there's a good chance I'd end up just writing a bunch of regular C# code to create datarows. If you're not comfortable with the first option, the second is dull but routine. Commented Jan 24, 2019 at 22:31
  • Ed, I should have mentioned it, but every attempt I've made to include ALL columns has resulted in some kind of FULL OUTER JOIN or CROSS JOIN, returning orders of magnitude more records than it should. So, I'd be open to getting all columns back, as long as I don't get the cross join. Commented Jan 24, 2019 at 22:57
  • That puts a very different complexion on it. Can you do any join at all without getting a cross join? I think that’s the real issue here and I think it’s unrelated to selecting columns. I’d look hard at what’s in the columns you’re joining on. Have you tried this with small controlled test data files? Commented Jan 25, 2019 at 0:00
  • Yes, I can get a valid inner join if I return ONLY the columns from DT1. But that's not sufficient. For example, if I do this: select dataRows1).Distinct().CopyToDataTable(); -- all is right and good, except I can't find a way to also return columns from DT2 without iterating every record and building a new DT from the ground up. There MUST be a simpler, faster, and more elegant way than that. Commented Jan 25, 2019 at 0:13
  • Good luck with your problem. Commented Jan 25, 2019 at 0:44

1 Answer 1

0

I'm not sure of the performance with 500k records, but here is an attempted solution.

Since you are combining two subsets of DataRows from different tables, there are no easy operations that will create the subset or create a new DataTable from the subsets (though I have an extension method for flattening an IEnumerable<anon> where anon = new { DataRow1, DataRow2, ... } from a join, it would probably be slow for you).

Instead, I pre-create an answer DataTable with the columns requested and then use LINQ to build the value arrays to be added as the rows.

public static DataTable JoinDataTables(DataTable dt1, DataTable dt2, string table1KeyField, string table2KeyField, string[] columns) {
    var rtnCols1 = dt1.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName)).ToList();
    var rc1 = rtnCols1.Select(dc => dc.ColumnName).ToList();
    var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1.Contains(dc.ColumnName)).ToList();
    var rc2 = rtnCols2.Select(dc => dc.ColumnName).ToList();

    var work = from dataRows1 in dt1.AsEnumerable()
               join dataRows2 in dt2.AsEnumerable()
               on dataRows1.Field<string>(table1KeyField) equals dataRows2.Field<string>(table2KeyField)
               select (from c1 in rc1 select dataRows1[c1]).Concat(from c2 in rc2 select dataRows2[c2]).ToArray();

    var result = new DataTable();
    foreach (var rc in rtnCols1)
        result.Columns.Add(rc.ColumnName, rc.DataType);
    foreach (var rc in rtnCols2)
        result.Columns.Add(rc.ColumnName, rc.DataType);

    foreach (var rowVals in work)
        result.Rows.Add(rowVals);

    return result;
}

Since you were using query syntax, I did as well, but normally I would probably do the select like so:

select rc1.Select(c1 => dataRows1[c1]).Concat(rc2.Select(c2 => dataRows2[c2])).ToArray();

Updated: It is probably worthwhile to use the column ordinals instead of the names to index into each DataRow by replacing the definitions of rc1 and rc2:

var rc1 = rtnCols1.Select(dc => dc.Ordinal).ToList();
var rc1Names = rtnCols1.Select(dc => dc.ColumnName).ToHashSet();
var rtnCols2 = dt2.Columns.Cast<DataColumn>().Where(dc => columns.Contains(dc.ColumnName) && !rc1Names.Contains(dc.ColumnName)).ToList();
var rc2 = rtnCols2.Select(dc => dc.Ordinal).ToList();
Sign up to request clarification or add additional context in comments.

11 Comments

Thank you NetMage - I'm leaving on vacation today, in a few hours, so I don't know if I can try this out today or 9 days from now when I'm back, but I'll certainly report back on results at some point. Thanks for your patience.
@HerrimanCoder Gives me time to try some other optimizations. Can you give me a range of percent of join versus non-join rows in the two tables?
I'm not sure I have that info but I'll see if I can find out.
NetMage, I finally was able to try this out against some datasets, and it gives me an OutOfMemory exception.
@HerrimanCoder How large are your datasets - I created some simple 500k datasets without issue using 64-bit (.Net target Any CPU) C# in LINQPad.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.