
I am trying to create a Spark DataFrame where I want to convert a list into a column.

Code:

import random
import string

def create_id(n):
    # build a random id of length n from lowercase letters and digits
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for _ in range(100)]
list_b = [create_id(25) for _ in range(100)]

df = sc.parallelize([["a", list_a], ["b", list_b]]).toDF()

This results in

    _1                                                _2
0   a   [dv2vtdl3sobadlw1svs39emp2n9ogwzzek8b6gvug7xkp...
1   b   [kdv6b9ehqx1t8kbxd77ha8435bhduyxp0ilv6e09wpejx..
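
For context, a quick sketch of why this attempt yields only two rows (assuming the code above has already run): each inner Python list is inferred as a single array-typed cell.

df.printSchema()
# root
#  |-- _1: string (nullable = true)
#  |-- _2: array (nullable = true)
#  |    |-- element: string (containsNull = true)
print(df.count())  # 2 -- one row per list, not one row per id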

This will create 100 columns, not 100 rows:

df = sc.parallelize([list_a, list_b]).toDF()

Does anyone know how I can create a DataFrame with two columns and 100 rows?

  • Does this answer your question? Manually create a pyspark dataframe Commented Jun 30, 2021 at 12:02
  • I have already seen this, but it is not suitable for me, since it uses tuples and the index of the tuple determines which column a value ends up in. Commented Jun 30, 2021 at 12:10
  • Then you did not understand how it works, because that's exactly how it should be done (see the sketch below these comments). Commented Jun 30, 2021 at 12:11
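
For reference, a minimal sketch of the linked approach using named Row objects instead of positional tuples, assuming an active SparkSession named spark:

from pyspark.sql import Row

# each Row carries its field names, so column assignment does not rely on tuple position
rows = [Row(a=x, b=y) for x, y in zip(list_a, list_b)]
df = spark.createDataFrame(rows)  # columns a and b, 100 rows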

1 Answer


Using the approach from the post Manually create a pyspark dataframe:

def create_id(n):
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for _ in range(100)]
list_b = [create_id(25) for _ in range(100)]

# zip pairs the i-th element of each list, giving 100 (a, b) tuples -> 100 rows
df = spark.createDataFrame(zip(list_a, list_b), ['a', 'b'])

# OR build the 100 tuples directly
list_a_b = [(create_id(25), create_id(25)) for _ in range(100)]
df = spark.createDataFrame(list_a_b, ['a', 'b'])

df.show()
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|68blfnltq9fh81c4y...|3fl1wb5h2euy3sgd7...|
|ac37fb7qif71zzjpr...|xbqzzgiq9s6t5jiqm...|
|72rk28znzr6jjsi69...|5wvl528eg5y3p1lsk...|
|fioqnla3ijvl5769s...|1xvs2592uaxadv1o4...|
|7der8ld8fd6vl6g9d...|lrup85xitjz1uhsfl...|
|gycdap4hodaxxggw8...|h2oz370tzo6fnpke3...|
|ccvqcyzeynuks63pq...|iut82y2k1irfdvep1...|
|ngq29fnq2usghspgh...|z6j4mibrrjznoc9s8...|
|3qb6xyk5c1kbg0xq1...|l10ouv4w24d66e0ak...|
|u6dcvzede90xa7zz2...|hnh571t9szy0pwjrp...|
|3122g38k47jm24t7f...|tzbxlua574l88qtw1...|
|6pnva6ow83yxexqp1...|0nfj3v59b8jh0qv1g...|
|kl7xyftax3z32ot8o...|0sf6iyiyxpyvyd5kj...|
|36qwiiifgbzba4n8c...|xt4lpkjle8qynnlpo...|
|owsgb02rnov8qrhvw...|1zu4oisit25y2g14i...|
|bcmg0flh4d9tnvnjc...|7lfwx9kf7qens70p8...|
|6sdy1e8i3y1w0rtpr...|gw79bsrx8jlse6ixu...|
|83h5iq10clte1gcpr...|kblufuhlwabu7sv3u...|
|7g20ga0m756f0qsj7...|1fzo40vwtrp0kud8j...|
|07tw66i7dpcphczz1...|9a8c9ditp9dzomxh4...|
+--------------------+--------------------+
only showing top 20 rows
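
If pandas is available on the driver (an assumption), another sketch is to build the two columns as a pandas DataFrame first and convert it:

import pandas as pd

pdf = pd.DataFrame({'a': list_a, 'b': list_b})  # 100 rows, 2 columns
df = spark.createDataFrame(pdf)
df.show(3, truncate=10)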


1 Comment

Thank you, zip is great for this use case :-)
