
Questions tagged [apache-spark]

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

-2 votes
1 answer
93 views

I have two dataframes, Budget and Forecast. For those dataframes, I'm trying to create a snapshot record by joining with a temp table snapshot_to_collect in a for loop. I'm ...
asked by user282159
1 vote
1 answer
170 views

I am working out some PySpark exercises and trying to write efficient, best-practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code structured ...
asked by Jongert
1 vote
1 answer
91 views

I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code ...
asked by Jongert
1 vote
1 answer
217 views

I have a dataframe loaded from a CSV file which contains data from the Olympic Games. The goal is to find the athletes with the minimum age who won a gold medal. I have managed to come up with the following code. Is ...
asked by Bagira
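For illustration, a minimal PySpark sketch of the task this question describes (the file path and the Age/Medal column names are assumptions, since the actual schema isn't shown):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: one row per athlete and event, with Age and Medal columns.
athletes = spark.read.csv("olympics.csv", header=True, inferSchema=True)

gold = athletes.filter(F.col("Medal") == "Gold")
min_age = gold.agg(F.min("Age")).first()[0]          # youngest age among gold medallists
youngest_gold = gold.filter(F.col("Age") == min_age)  # keep only athletes at that age
youngest_gold.show()
```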
0 votes
1 answer
147 views

I am trying to implement the algorithm described in the image from this link. Simply put, any two adjacent nodes must not share tuples, and if they do, they are colored with the same color. If there's a ...
asked by John Campbell
3 votes
2 answers
966 views

I am using PySpark in Azure Databricks to try to create an SCD Type 1. I would like to know whether this is an efficient way of doing it. Here is my SQL table: ...
asked by AGB
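As a rough illustration of what an SCD Type 1 merge can look like in plain PySpark (the table names, the customer_id key, and the shared schema are all assumptions; the question's own Databricks code isn't shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and key names; both frames are assumed to share a schema.
existing = spark.table("dim_customer")   # current dimension table
updates = spark.table("stg_customer")    # incoming changes

# SCD Type 1: rows whose key matches an update are replaced, everything else is kept.
unchanged = existing.join(updates, on="customer_id", how="left_anti")
merged = unchanged.unionByName(updates)

merged.write.mode("overwrite").saveAsTable("dim_customer_merged")
```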
1 vote
1 answer
176 views

The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve. Given is a DataFrame with events, each with a user ID and a timestamp. ...
asked by Tobias Hermann
4 votes
1 answer
219 views

I need help rewriting my code to be less repetitive. I am used to procedural rather than object-oriented coding. My Scala program is for Databricks. How would you combine cmd 3 and 5? Does ...
asked by Dung Tran
2 votes
0 answers
1k views

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using Spark SQL's rlike ...
asked by Wiki_91
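A hedged sketch of one way to join a dataset against a table of regular expressions with rlike; the table names and the payload/pattern column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: "patterns.pattern" holds the regexes,
# "events.payload" is the text they are matched against.
patterns = spark.table("patterns")
events = spark.table("events")

# Broadcast the (relatively small) regex table and keep only matching pairs.
matched = (events
           .crossJoin(F.broadcast(patterns))
           .where(F.expr("payload rlike pattern")))
```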
2 votes
2 answers
823 views

This is the first application, or really any Scala, I have ever written. So far it functions as I would hope. I just found this community and would love some peer review of possible ...
asked by GenericDisplayName
2 votes
1 answer
72 views

I am trying to improve my programming skills at work (analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up all of an individual's account information into a single row ...
asked by Rob
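A minimal sketch of the kind of roll-up the question describes, assuming a hypothetical accounts table keyed by person_id; the real columns and aggregations are not shown in the excerpt:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table, key, and aggregations.
accounts = spark.table("accounts")

one_row_per_person = (accounts
                      .groupBy("person_id")
                      .agg(F.sum("balance").alias("total_balance"),
                           F.collect_list("account_type").alias("account_types")))
```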
2 votes
0 answers
281 views

I am trying to find quantiles for each column of the table for various firms using Spark 1.6. I have around 5000 entries in firm_list and 300 entries in attr_lst ...
asked by Vishwanath560
3 votes
0 answers
57 views

I am new to Spark. I have two dataframes, df1 and df2. df1 has three rows; df2 has more than a few million rows. I want to check whether all items in df2 are in a transaction of df1 and, if so, sum up the costs. ...
asked by priya
4 votes
3 answers
1k views

I am trying to find out whether a column is binary or not. If a column contains only 1 or 0, I flag it as binary; else ...
asked by Shankar Panda
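A small sketch of one way to flag binary columns (the table name is a placeholder): a column counts as binary when its distinct non-null values are a subset of {0, 1}.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input table.
df = spark.table("some_table")

def is_binary(frame, col_name):
    # Collect the distinct values of one column and test them against {0, 1}.
    values = {row[0] for row in frame.select(col_name).distinct().collect()}
    values.discard(None)
    return values.issubset({0, 1})

flags = {c: is_binary(df, c) for c in df.columns}
```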
1 vote
1 answer
180 views

I was successfully able to write a small script using PySpark to retrieve and organize data from a large .xml file. Being new to using PySpark, I am wondering if there is any better way to write the ...
asked by Wilson
2 votes
0 answers
609 views

I have written a function that takes two PySpark dataframes and creates a line-by-line diff. I am struggling to get it to scale to hundreds of columns. I get around the for ...
asked by shannona2013
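One hedged way to diff two dataframes row by row is exceptAll (available from Spark 2.4); the table names below are placeholders and both frames are assumed to share the same schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs with identical schemas.
df_a = spark.table("snapshot_a")
df_b = spark.table("snapshot_b")

# exceptAll keeps duplicates, so each side of the diff is exact.
only_in_a = df_a.exceptAll(df_b)
only_in_b = df_b.exceptAll(df_a)
```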
3 votes
1 answer
593 views

I have a CSV file with more than 700,000,000 records in this structure: ...
asked by phoebe
1 vote
1 answer
5k views

I am joining two data frames in Spark using Scala. My code looks very ugly because of the multiple when conditions. Can somebody please help me simplify my code? Here is my existing code. ...
asked by Sudarshan kumar
3 votes
0 answers
2k views

I have some use cases where I have small parquet files in Hadoop, say, 10-100 MB. I would like to compact them so as to have files of at least, say, 100 MB or 200 MB. The logic of my code is to: * find a ...
asked by javadev
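A sketch of the compaction idea under stated assumptions: the paths and the 128 MB target are placeholders, and the input size is read through the Hadoop FileSystem API via py4j internals (spark._jvm / spark._jsc), which is an implementation detail rather than a public PySpark API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "hdfs:///data/small_files/"   # hypothetical input directory
dst = "hdfs:///data/compacted/"     # hypothetical output directory
target_bytes = 128 * 1024 * 1024    # assumed target file size (~128 MB)

df = spark.read.parquet(src)

# Measure the total input size and derive how many output files are needed.
jpath = spark._jvm.org.apache.hadoop.fs.Path(src)
fs = jpath.getFileSystem(spark._jsc.hadoopConfiguration())
total_bytes = fs.getContentSummary(jpath).getLength()
num_files = max(1, int(total_bytes // target_bytes))

df.coalesce(num_files).write.mode("overwrite").parquet(dst)
```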
3 votes
0 answers
2k views

I have a dataframe df, which contains the data below: ...
asked by Varun Chadha
2 votes
0 answers
186 views

The algorithm needs to reduce an RDD[GPSRecord] based on the distance between several points, e.g. "give me only GPS records when the distance between them exceeds ...
asked by MiguelAraCo
6 votes
0 answers
100 views

I've written a PySpark program that completely solves a tiered board game (no loops; each game position is a member of only one tier) and writes each tier to a file. It also determines the ...
asked by Michael
3 votes
1 answer
133 views

I am new to Spark and Scala, and I have solved the following problem. I have a table in a database with the following structure: ...
asked by Shams Tabraiz Alam
0 votes
1 answer
6k views

Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data and passes it to a transformation, then makes assertions on the ...
asked by wrschneider
4 votes
0 answers
600 views

I'm new to Spark and dataframes and I'm looking for feedback on what bad or inefficient processes might be in my code so I can improve and learn. My program reads in a parquet file that contains ...
asked by flybonzai
1 vote
1 answer
879 views

I need to parse logs and have the following code. I can see two problems: map().filter() may induce some performance penalties, and there is a copy-pasted block. parser.py: ...
asked by Loom
3 votes
1 answer
3k views

I'm trying to implement a dot product using PySpark in order to learn PySpark's syntax. I've currently implemented the dot product like so: ...
asked by Thunder Shiviah
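For reference, a minimal RDD-based dot product sketch with made-up toy vectors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Toy vectors for illustration; zip pairs the elements, map multiplies them, sum reduces.
a = sc.parallelize([1.0, 2.0, 3.0])
b = sc.parallelize([4.0, 5.0, 6.0])

dot = a.zip(b).map(lambda xy: xy[0] * xy[1]).sum()   # 32.0
```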
2 votes
0 answers
386 views

This is my first Spark application. I am using ALS.train for training the model (matrix factorization). The total time the application takes is approximately 45 minutes. Note: I think takeOrdered is the ...
asked by Jaspinder
15 votes
1 answer
36k views

Maybe I totally reinvented the wheel, or maybe I've invented something new and useful. Can one of you tell me if there's a better way of doing this? Here's what I'm trying to do: I want a generic ...
asked by Nathaniel
5 votes
0 answers
739 views

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
asked by Huga
8 votes
2 answers
3k views

Given a list of tuples of the form (a, b, c), is there a more direct or optimized way of calculating the average of all the c's ...
asked by JasonAizkalns
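A short sketch with made-up sample data: project the third element of each tuple and take its mean:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Made-up (a, b, c) tuples; mean() averages the projected c values in one pass.
rdd = sc.parallelize([("x", 1, 10.0), ("y", 2, 20.0), ("z", 3, 30.0)])
avg_c = rdd.map(lambda t: t[2]).mean()   # 20.0
```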
0 votes
1 answer
651 views

I have a simple static class whose purpose is, given an RDD of Point, to find the median of each dimension and return that as a new ...
asked by Aki K
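A rough sketch, assuming a Point is just an (x, y) pair (the real Point type isn't shown): convert the RDD to a DataFrame and use approxQuantile per dimension:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Made-up 2-D points; each tuple becomes a DataFrame row with columns x and y.
points = sc.parallelize([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)])
df = points.toDF(["x", "y"])

# Probability 0.5 gives an approximate median for each listed column.
medians = df.approxQuantile(["x", "y"], [0.5], 0.001)   # [[2.0], [20.0]]
```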
11 votes
2 answers
28k views

I'm currently learning how to use Apache Spark. In order to do so, I implemented a simple word count (not really original, I know). There already exists an example in the documentation providing the ...
asked by merours
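For comparison, the classic word count as a minimal PySpark sketch (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Split each line into words, pair each word with 1, and sum the counts per word.
lines = sc.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/wordcount_output")
```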
5 votes
1 answer
2k views

Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on Spark clusters. The settings are: 5 nodes, each with 8 cores (all the ...
asked by Tim