2

I have a requirement which is build a question survey system. Simply say, it need question, predefine answer and user's answers record.

  • Question need a question id, question text
  • Answer need a answer id, answer text
  • User's answer record need a record id, user id, question id, answer id,date,os,ip,browser info, is live

For user record, I need to keep all history, that's why I need a "is live" column. So only the latest answer for each user is true. When user answer a same question again, all exist answer record for this user will be history( is live = false ).

Seems simple structure. But When I got more than 100,000 question, more than 1 million users, and each user for each question have more than 20 answer record, then the records are more than 100,000 * 1,000,000 * 20 = 2,000,000,000,000 records. Then it become a big issue.

I also need to describe how I need to use this data. I need to provide another system, which can use user's record to target a group of users by define a question answer criteria. For example:

  1. (Q1=A1 && Q2=A3 && Q3=A5 && (Q4=A8 || Q5=A9)) criteria 1
  2. (Q1!=A1 && Q2=A3) criteria 2
  3. (Q4=A8 || Q5!=A9) criteria 3

After I define the criteria:

  1. I need to provide one api to get all user ids who match a criteria (api1)
  2. I need to provide one api to get all criterias for a user (api2)

The api need fast and live. And api will be called frequently.

So just imagine when there are 200,000,000,000 records in one table. The api call will be very slow or even kill the db.

So, I have some solution which is not good, I just list here so we can discuss:

  1. Each question got a single table to save all user record for this question.
  2. Each user got a single table to save all question record for this user.
  3. Both 1 and 2

But I can see there solution is not very good and efficient. So want to discuss about it here. Doesn't matter what kind of technology (sql, nosql, hadoop etc...)

Please put you thoughts here.

1 Answer 1

2

I would try with mongoDB using only one "user" collection storing answers in arrays:

{userId: 1, 
 name: "nick",
 ...,
 "answers": [
    { questionId:1,
      answerId: 1,
      date: Date(...),
      ...,
      isLive: 1},
    { questionId:1
      answerId: 2,
      date: Date(...),
      ...,
      isLive: 0}
 ]
}

Then I would use a Multikey Index on the property "answers.isLive" to ensure high speed access to "live" answers.

Another multikey index on "answers.questionId" and "answers.answerId" should ensure high performances retrieving data with your criteria.

With number like yours I would take into consideration sharding your collection from the start.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.