I have a sparse table that has nested sub-tables in some of the rows, as shown below. How do I represent this structure with Scala collections?

| rowkey  | orderid   | name          | amount    | supplier           | account     |
| rowkey1 | id0: 1001 | id1: "apple"  | id1: 1000 | id3: "fruits, inc" |             |
|         |           | id2: "apple2" | id2: 1200 |                    |             |
| rowkey2 | id4: 1002 | id5: "orange" | id5: 5000 |                    |             |
| rowkey3 | id6: 1003 | id7: "pear"   | id7: 500  |                    | id10: 77777 |
|         |           | id8: "pear2"  | id8: 350  |                    |             |
|         |           | id9: "pear3"  | id9: 500  |                    |             |

Note: id1, id2, id3, ... are unique identifiers for each "group attribute", which is basically the group id for each sub-row. For example, in the first row, id2: "apple2" and id2: 1200 belong to the same group id2 (a sub-row with two attributes, name and amount, under rowkey1).

Another way to look at these 3 rows:

    rowkey1, (orderid, id0, 1001), (name, id1, "apple"), (amount, id1, 1000), (name, id2, "apple2"), (amount, id2,1200), (supplier, id3, "fruit inc.")
    rowkey2, (orderid, id4, 1002), (name, id5, "orange"), (amount, id5,5000)
    rowkey3, (orderid, id6, 1003), (name, id7, "pear"), (amount, id7,500),(name, id8, "pear2"), (amount, id8,350),(name, id9, "pear3"), (amount, id9, 250), (account, id10, 777777)

Edit: note that the table has 2000 columns. Is it possible to create a class (or add attributes to a class) dynamically in Scala, e.g. by loading field names and types from an external file? I know that case classes are limited to 22 fields.

Edit 2: also note that any of the attributes except rowkey can have multiple lines, i.e. orderid, name, amount, supplier, account and the 1995+ other columns, so creating individual "singleLine" classes for all of them is not feasible. I'm looking for the most general solution.

Thanks for the answers. I guess to make it more general I can create these classes:

case class ColumnLine(
  id: Int,
  value: Option[Any]
)
case class Column(
  colname: String,
  coltype: String,
  lines: Option[List[ColumnLine]]
)
case class Row (
  rowkey:String,
  columns:Map[String,Column] //colname -> Column
)
case class Table (
  name:String,
  rows:Map[String,Row] //rowkey -> Row
)

Now I'm trying to figure out how to query this structure, i.e. return the rows where the column with colname == "amount" contains lines where value > 500.
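One possible way to express that query against the classes above is sketched here. The names `QuerySketch`, `anyInt`, `rowsWhereGt` and `sample` are hypothetical helpers introduced only for illustration, and the sketch assumes that "amount" values are stored as `Int` inside the `Option[Any]`.

```scala
case class ColumnLine(id: Int, value: Option[Any])
case class Column(colname: String, coltype: String, lines: Option[List[ColumnLine]])
case class Row(rowkey: String, columns: Map[String, Column])     // colname -> Column
case class Table(name: String, rows: Map[String, Row])           // rowkey -> Row

object QuerySketch {
  // Hypothetical helper: true if any line of the named column holds an Int satisfying p.
  // A missing column, missing lines, or a non-Int value all count as "no match".
  def anyInt(row: Row, colname: String)(p: Int => Boolean): Boolean =
    row.columns.get(colname).exists { col =>
      col.lines.getOrElse(Nil).exists { line =>
        line.value match {
          case Some(i: Int) => p(i)
          case _            => false
        }
      }
    }

  // Rows for which at least one line of `colname` is greater than `limit`.
  def rowsWhereGt(table: Table, colname: String, limit: Int): Iterable[Row] =
    table.rows.values.filter(row => anyInt(row, colname)(_ > limit))

  // Sample data: only the "amount" column of rowkey1 and rowkey3 from the table above.
  val sample: Table = Table("orders", Map(
    "rowkey1" -> Row("rowkey1", Map(
      "amount" -> Column("amount", "Int",
        Some(List(ColumnLine(1, Some(1000)), ColumnLine(2, Some(1200)))))
    )),
    "rowkey3" -> Row("rowkey3", Map(
      "amount" -> Column("amount", "Int",
        Some(List(ColumnLine(7, Some(500)), ColumnLine(8, Some(350)), ColumnLine(9, Some(500)))))
    ))
  ))

  def main(args: Array[String]): Unit =
    println(rowsWhereGt(sample, "amount", 500).map(_.rowkey).toList) // prints List(rowkey1)
}
```

Because the lookup goes through `Map.get` and `Option.exists`, rows that lack the column are skipped rather than causing an exception.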

Edit 3: OK, this is a "quick and dirty" way, but it seems to work. It scans 10M records in ~15 sec on my laptop.

object hello {

  // One value within a column; id is the group id shared by the cells of a sub-row.
  case class Single(id: String, value: Option[Any])

  case class Column(
    colname: String,
    coltype: String,
    lines: List[Single]
  )

  case class Row(
    rowkey: String,
    columns: List[Column]
  )

  def main(args: Array[String]): Unit = {
    val n = 10000000
    def uuid = java.util.UUID.randomUUID.toString
    val row = Row(uuid, List(
      Column("orderid", "String", List(Single("id2", Some(uuid)))),
      Column("name", "String", List(Single("id2", Some("apple")), Single("id3", Some("apple2")))),
      Column("amount", "Int", List(Single("id2", Some(1000)), Single("id3", Some(1200)))),
      Column("supplier", "String", List(Single("id4", Some("fruits.inc")))),
      Column("account", "Int", List(Single("id10", Some(7777))))
    ))
    println(new java.util.Date)
    // The same Row instance is repeated n times, so this only measures scan speed.
    val table: List[Row] = List.fill(n)(row)
    // .par is built in up to Scala 2.12; from 2.13 on it needs the
    // scala-parallel-collections module.
    table.par
      .filter(row => gt(row, "amount", 500))
      .filter(row => eq(row, "supplier", "fruits.inc"))
      .filter(row => eq(row, "account", 7777))
      //.foreach(println)
    println(new java.util.Date)
  }

  // First column with the given name, or None if the row does not have it.
  def getCol(row: Row, colname: String): Option[Column] =
    row.columns.find(_.colname == colname)

  // True if any line of the named column equals colvalue.
  def eq(row: Row, colname: String, colvalue: Any): Boolean =
    getCol(row, colname).exists { col =>
      col.lines.exists(line => line.value.exists(_ == colvalue))
    }

  // True if any line of the named column holds an Int greater than colvalue.
  def gt(row: Row, colname: String, colvalue: Int): Boolean =
    getCol(row, colname).exists { col =>
      col.lines.exists { line =>
        line.value match {
          case Some(i: Int) => i > colvalue
          case _            => false
        }
      }
    }

}
  • Scala 2.11 has removed the 22-field restriction. Commented Dec 12, 2014 at 1:48
  • We're planning to use Spark, and I believe it works only on Scala 2.10. Commented Dec 16, 2014 at 21:04

2 Answers

There are many ways. For example, you could define the following:

case class OrderLine(
  name:String,
  amount:Int,
  supplier:Option[String],
  account:Option[String]
)

case class Order(
  rowkey:String,
  orderid:String,
  orders:Seq[OrderLine]
)

and then (this is just to create the example above; with 2000 rows read from a file it would, of course, be different, but you get the idea):

   val myOrders: Seq[Order] =
     Seq(
       Order("rowkey1", "1001", Seq(
         OrderLine("apple", 1000, Some("fruits, inc"), None),
         OrderLine("apple2", 1200, None, None)
       )),
       Order("rowkey2", "1002", Seq(
         OrderLine("orange", 5000, None, None)
       )),
       Order("rowkey3", "1003", Seq(
         OrderLine("pear", 500, None, Some("77777")),
         OrderLine("pear2", 350, None, None),
         OrderLine("pear3", 500, None, None)
       ))
     )

The code to load the data from an external file would depend on how the external file is structured. Basically, I would create a function to read an OrderLine from the file, and a function to read an Order (which, in turn, uses the function to read the OrderLine). These would be your basic building blocks to assemble the 2000 lines into an in-memory data structure.
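As a rough sketch of those two building blocks, the example below assumes a purely hypothetical file layout: one line per order line, with the fields `rowkey;orderid;name;amount;supplier;account` separated by semicolons and empty fields meaning "absent". The names `LoadSketch`, `readOrderLine`, `readOrder` and `readOrders` are illustrative only, and the input is passed in as a sequence of strings so that the example is self-contained; in practice the lines could come from something like `scala.io.Source.fromFile(path).getLines()`.

```scala
case class OrderLine(name: String, amount: Int, supplier: Option[String], account: Option[String])
case class Order(rowkey: String, orderid: String, orders: Seq[OrderLine])

object LoadSketch {
  // Empty (or blank) fields become None.
  private def opt(s: String): Option[String] =
    if (s.trim.isEmpty) None else Some(s.trim)

  // Builds one OrderLine from the split fields of a single input line.
  def readOrderLine(fields: Array[String]): OrderLine =
    OrderLine(fields(2).trim, fields(3).trim.toInt, opt(fields(4)), opt(fields(5)))

  // Builds one Order from all lines sharing a rowkey; the orderid is taken from the first line.
  def readOrder(rowkey: String, group: Seq[Array[String]]): Order =
    Order(rowkey, group.head(1), group.map(readOrderLine))

  // Splits every line, then assembles one Order per distinct rowkey,
  // keeping the order in which rowkeys first appear.
  def readOrders(lines: Seq[String]): Seq[Order] = {
    val split = lines.map(_.split(";", -1)) // limit -1 keeps trailing empty fields
    val keys  = split.map(_(0)).distinct
    keys.map(key => readOrder(key, split.filter(_(0) == key)))
  }

  val sampleLines: Seq[String] = Seq(
    "rowkey1;1001;apple;1000;fruits, inc;",
    "rowkey1;1001;apple2;1200;;",
    "rowkey2;1002;orange;5000;;"
  )
}
```

The repeated `filter` per key is quadratic in the worst case; it is kept here only for brevity, and a single pass that accumulates groups would be preferable for large inputs.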


The most natural way to represent this in Scala, assuming that the column structure can be treated as fixed, would be something like

case class Single(name: String, amount: Int)

case class SingleEntry(
  orderid: Int,
  name: String,
  amount: Int,
  supplier: Option[String],
  account: Option[Long]
)

case class Entry(
  orderid: Int,
  items: List[Single],
  supplier: Option[String],
  account: Option[Long]
) {
  // Flatten the items that satisfy p into standalone entries,
  // copying the order-level fields onto each one.
  def singly(p: Single => Boolean): List[SingleEntry] =
    items.filter(p).map { case Single(name, amount) =>
      SingleEntry(orderid, name, amount, supplier, account)
    }
}

And then, to pull out the items you want, you would write

table.
  filter(_.supplier.exists(_ == "fruits.inc")).
  flatMap(_.singly(_.amount > 500))

But there are many ways you could represent this data structure, including with maps (nested or otherwise); I wouldn't take any particular answer as canonical.
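As one illustration of the map-based alternative, here is a minimal sketch in which a row is simply a map from column name to its cells, so 2000 columns need no dedicated class. Everything in it (`MapSketch`, `Value`, `IntValue`, `StrValue`, `Cell`, `has`, `intGt`, `query`) is a hypothetical name chosen for the example, and the small `Value` hierarchy is just one assumed way to avoid `Any`.

```scala
object MapSketch {
  // Assumed value types; a real table would likely need more cases.
  sealed trait Value
  case class IntValue(v: Long) extends Value
  case class StrValue(v: String) extends Value

  // One cell; the group id ties together the cells of the same sub-row.
  case class Cell(groupId: String, value: Value)

  type Row   = Map[String, List[Cell]] // column name -> cells; absent columns have no key
  type Table = Map[String, Row]        // rowkey -> row

  val table: Table = Map(
    "rowkey1" -> Map(
      "orderid"  -> List(Cell("id0", IntValue(1001))),
      "name"     -> List(Cell("id1", StrValue("apple")), Cell("id2", StrValue("apple2"))),
      "amount"   -> List(Cell("id1", IntValue(1000)), Cell("id2", IntValue(1200))),
      "supplier" -> List(Cell("id3", StrValue("fruits, inc")))
    ),
    "rowkey2" -> Map(
      "orderid" -> List(Cell("id4", IntValue(1002))),
      "name"    -> List(Cell("id5", StrValue("orange"))),
      "amount"  -> List(Cell("id5", IntValue(5000)))
    )
  )

  // True if any cell of the named column satisfies p; a missing column never matches.
  def has(row: Row, col: String)(p: Value => Boolean): Boolean =
    row.getOrElse(col, Nil).exists(cell => p(cell.value))

  // Predicate: an integer value strictly greater than limit.
  def intGt(limit: Long): Value => Boolean = {
    case IntValue(v) => v > limit
    case _           => false
  }

  // Row keys that have some amount above `limit` and the given supplier.
  def query(limit: Long, supplier: String): List[String] =
    table.filter { case (_, row) =>
      has(row, "amount")(intGt(limit)) && has(row, "supplier")(_ == StrValue(supplier))
    }.keys.toList
}
```

With the sample data above, `MapSketch.query(500, "fruits, inc")` returns only `rowkey1`, since `rowkey2` has no supplier column. Column names and types could be loaded from an external file into such maps at runtime, at the cost of losing compile-time checking of field names.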

4 Comments

The table has 2000 columns. Is it possible to create a class programmatically, i.e. load attribute names and types from an external file?
@alex - If you have two thousand columns, you probably just want a map from column names to their values.
@RexKerr May I propose you editing your answer to delete the word 'is' at the end of the first paragraph? It is a minor correction and thus I cannot edit your answer (at least 6 chars must be changed).
@MikelUrkia - Thanks for the catch. I have fixed the sentence.
