I have a peculiar scenario in which the samples obtained in two consecutive runs are not consistent, even though I've provided a seed value. I'm using the following code, which was the outcome of an earlier discussion:

var conversionSample = sortedConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((conversionCount * sampleSize).toInt) 

var nonConversionSample = sortedNonConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((nonConversionCount * sampleSize).toInt) 

Here:

  1. 'sampleSize' is a constant fraction less than 0.8
  2. 'x' is a constant int that represents the xth iteration of a for loop
  3. 'conversionCount' and 'nonConversionCount' are int values representing the number of rows in each subset

The observation is that the generated sample differs between two successive runs, which is not the expected behavior.

sortedConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_0|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|674e2337-aec5-434e-b56e-8c2efcc42894_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|7797aba3-3eea-4556-856e-753812b4b551_1|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|9b606693-4ffa-44a5-bd7c-cc6974ce3e83_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|e7dc7fd9-32df-46a1-b3bd-793bbda09f6f_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_1|1         |
+--------------------------------------+----------+


sortedNonConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|0ed2e621-9ba4-46f0-8793-a84d32538c39_0|0         |
|0f9bcf42-e7fa-49a0-9d75-6c9bbc38b4d5_0|0         |
|108c5478-abc0-44d9-968b-47f81c4f5a37_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|13e3d779-026b-4d12-8619-aa5fe6ca99ed_0|0         |
|14497295-eebd-44aa-9f26-fc5e4810fb54_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|1911caf0-a470-4898-9b62-57c604422727_0|0         |
|1b91b8dc-09b8-47e2-b892-f5c14b650019_0|0         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|1e48e346-4ada-4a8d-896b-7658cc2499cd_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|2d4538a5-20e6-4742-ae46-aad0a5ed3fff_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|34442a3e-0437-4c41-86fb-1ac55062993a_0|0         |
|35151629-2f86-4917-90d2-42daa5ae4f5c_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4c3f11fa-e3ba-4eb1-977a-06f034bf8a54_0|0         |
|4ee484f4-e877-44c3-9390-c4e4072c5dee_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|57b47c74-b071-4278-89c9-f7b4cb1225d1_0|0         |
|58305773-f944-4039-8452-f5eb8d62f0cf_0|0         |
|58dfa9dd-43cf-4eb7-ade6-7235004a9815_0|0         |
|5b146218-9bb6-46f0-8c83-df131d78f591_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|64822b8c-009e-48ab-b6ca-1a7ece1106fa_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|73203f58-8be2-4716-b8f0-79c64400c57b_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8029c542-d933-43fb-b359-f2438dcd5660_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
|8fb43dff-260d-4ece-85e2-3bc2cb636ac1_0|0         |
|90f8a4cb-1956-43c4-ac7d-8c6514cd023a_0|0         |
|916f2e2a-6135-4004-8d54-d80b822ce394_0|0         |
|968a7ca3-1649-4586-9e60-b7e8565e708a_0|0         |
|a32782cc-8c4c-403b-aa83-09f1cec45fdb_0|0         |
|a63f44d5-a4d5-45a0-8a4b-cebf05df810b_0|0         |
|a6f958bc-e050-4216-b981-d51f1c0ff60d_0|0         |
|a7dba1bb-d7ff-44e6-9c4c-997ae59a2337_1|0         |
|ac33d675-d9cc-43b5-94fb-7d412773db14_0|0         |
|b1227816-9bf2-474f-8e82-5739acf6c895_0|0         |
|b1c27a2e-6efc-4869-880b-9ce0a4962edc_0|0         |
|b4ff6d43-cf0a-4f1d-9431-1edcb8ee1fb6_0|0         |
|b9e477ab-2065-42bb-832b-5d0e98ee05c7_0|0         |
|ba8c4efe-e71c-468c-b1bf-37efff596907_0|0         |
|c21eefc8-43d0-4be0-a252-b9fc4dbb7ad0_0|0         |
|c3785311-87c8-43bc-99a8-01d64f5eaa87_0|0         |
|c543bde7-deb8-4484-b0be-353c44baf6eb_1|0         |
|ca31e550-9d28-4628-bfe8-53648a2007f7_0|0         |
|cbc33697-20cb-4f8b-accd-0a6396a4ea41_0|0         |
|cc7810aa-08fc-44e7-acdc-ac948a28f9b9_0|0         |
|d1efdc7c-afb0-4995-bbbd-a76f731d2492_0|0         |
|d6a4b928-e576-41d7-9628-18709765199d_0|0         |
|d7311ec7-6c50-448d-8a6e-f690c3070d57_1|0         |
|d86b09f9-70a0-4101-a13b-129fe3a37b86_0|0         |
|d911be5b-aceb-45c8-a79e-73ccfa1b96f0_0|0         |
|db0c7b10-80f7-4071-aa53-fe0e2dc5ebce_0|0         |
|dce14c51-fa57-4e98-987d-708e2a9aa293_0|0         |
|dd026fb8-f818-4d1e-aaa4-4c9b3fd24994_0|0         |
|dfa9c55c-1e75-4010-be86-a6b1eb723672_0|0         |
|ea29f600-9e85-40f4-9f88-dcef46beb0c1_0|0         |
|eb5e58fc-eaac-4059-8ebc-1fab1ccf3555_1|0         |
|eb7568ab-83ac-45a7-bf4b-3b048d6c7c53_0|0         |
|f5b1cfc4-e397-4699-adab-0af6ee0e1b76_0|0         |
|facbfc8c-d477-4b27-bf15-52a56c26cbf6_0|0         |
|ffd03bca-ef40-4fa4-913e-73c002f29796_0|0         |
+--------------------------------------+----------+

1st Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

2nd Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

The two samples being different could be a blocker for me, so I'd like to know how I can make them consistent.

  • In general, Spark doesn't provide stable sorting, so if you sort by a column that is not unique, the values you sample may be different in each run. Commented Feb 27, 2017 at 4:39
  • @zero323 As per stackoverflow.com/questions/32229941/…, I feel there is a mismatch in how sampling should behave with a seed. Going by your comment here, is it safe to assume that we cannot guarantee the exact same sample is picked even after providing a seed value? Commented Feb 27, 2017 at 9:13
  • If the upstream structure has a non-deterministic order, then you simply aren't sampling the same structure. You can try to confirm that by using a DataFrame which has an explicit order (like a parallelized local collection). Commented Feb 27, 2017 at 14:27
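
Following the comments above, here is a minimal sketch of one way to test the idea. It is not a confirmed fix for the code in the question. It assumes the subset is small enough to fit in a single partition, and it sorts on every column so that ties cannot change the row order. The object name, the `stableSample` helper, the toy rows, the fraction, and the seed are all hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReproducibleSample {

  // Hypothetical helper: give the seeded sampler the same row sequence on every run.
  // repartition(1) puts all rows in one partition (an assumption that only suits small data),
  // and sortWithinPartitions on all columns fixes the order inside that partition.
  def stableSample(df: DataFrame, fraction: Double, seed: Long, maxRows: Int): DataFrame = {
    val ordered = df
      .repartition(1)
      .sortWithinPartitions(df.columns.head, df.columns.tail: _*)

    // Same call shape as in the question: sample with replacement, then cap the row count.
    ordered.sample(withReplacement = true, fraction, seed).limit(maxRows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // local master is an assumption for the demo
      .appName("reproducible-sample")
      .getOrCreate()
    import spark.implicits._

    // Toy rows with made-up ids, standing in for sortedConversionSubset.
    val df = Seq(
      ("id-c_0", 1), ("id-a_1", 1), ("id-e_0", 1),
      ("id-b_0", 1), ("id-d_1", 1), ("id-a_0", 1)
    ).toDF("clientid", "Conversion")

    val first  = stableSample(df, fraction = 0.55, seed = 3L, maxRows = 3).collect().toSeq
    val second = stableSample(df, fraction = 0.55, seed = 3L, maxRows = 3).collect().toSeq

    // Compare the two draws; with a fixed order and seed they are expected to match.
    println(s"samples identical: ${first == second}")

    spark.stop()
  }
}
```

The design choice here is to trade parallelism for a fixed row order: a seeded sampler can only repeat its picks if each partition holds the same rows in the same order. For larger data, a single partition is likely too expensive, and you would need another way to make partition contents and ordering stable before sampling.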
