9

For example, if I have a column called companyId and many other columns I want to keep, and in companyId I have values like 100, 101, 102, ..., basically a list of Ids and each Id appear different number of times. How do I randomly sample data based on the companyId column so that it's according to the proportion of each Id?

eg: If I have 500 rows and 100 companyA, 100 companyB and 300 companyC and I want to sample 100 rows from this table. How do I make my data have 20 companyA, 20 companyB and 60 companyC?

1

1 Answer 1

17

You should use the TABLESAMPLE clause available since PostgreSQL 9.5, described in the SELECT documentation.

There are two built-in table sampling methods: SYSTEM and BERNOULLI. Both try to get a random sample, but the SYSTEM method picks table blocks and uses the data therein, while BERNOULLI uses a random sample from the whole table. SYSTEM is faster, but BERNOULLI offers a more evenly distributed sample.

You have to specify the percentage of the rows that you want to get.
If you want a fifth of your table, you would use:

SELECT * FROM mytable TABLESAMPLE BERNOULLI (20);

If your requirement is to get a certain number of result rows, you might consider using the tsm_system_rows contrib module, which offers the SYSTEM_ROWS table sample method that works similar to SYSTEM, but you can specify the number of result rows.

Sign up to request clarification or add additional context in comments.

4 Comments

does this answer the question? The OP wants to know how to sample "20 companyA, 20 companyB and 60 companyC?"
You are right, it is an approximation. I assumed that these numbers were just for illustration and represent an estimate rather than hard requirements (note that the numbers are proportional to the frequency with which the values occur in the table).
thanks for that note, it does help understanding how tablesample works.
You can also use LIMIT n if you only want to receive N rows back. Also note that TABLESAMPLE SYSTEM, on semi-sized data sets (234k rows) sometimes returns zero rows for small sample values e.g. 0.1 or 0.01 or even 0.8

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.