
Questions tagged [apache-spark]

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

-2 votes
1 answer
93 views

I have two dataframes, Budget and Forecast. For those dataframes, I'm trying to create a snapshot record by joining with a temp table snapshot_to_collect in a for loop. I'm ...
asked by user282159
1 vote
1 answer
170 views

I am working out some PySpark exercises and trying to write efficient, best-practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code structured ...
asked by Jongert
1 vote
1 answer
91 views

I am working out some PySpark exercises and trying to write efficient and best practice code to be ready to work in a production environment. I could use some feedback with respect to: Is the code ...
asked by Jongert
1 vote
1 answer
217 views

I have a dataframe loaded from a CSV file which contains data from the Olympic Games. The goal is to find the athletes with the minimum age who won a gold medal. I have managed to come up with the following code. Is ...
asked by Bagira
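For illustration, a minimal PySpark sketch of the task this question describes (the file path and the Age/Medal column names are assumptions, since the actual schema isn't shown):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: one row per athlete and event, with Age and Medal columns.
athletes = spark.read.csv("olympics.csv", header=True, inferSchema=True)

gold = athletes.filter(F.col("Medal") == "Gold")
min_age = gold.agg(F.min("Age")).first()[0]          # youngest age among gold medallists
youngest_gold = gold.filter(F.col("Age") == min_age)  # keep only athletes at that age
youngest_gold.show()
```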
0 votes
1 answer
147 views

I am trying to implement the algorithm described in the image from this link. Simply put, any two adjacent nodes must not share tuples, and if they do, they are colored with the same color. If there's a ...
asked by John Campbell
3 votes
2 answers
966 views

I am using PySpark in Azure Databricks to try to create an SCD Type 1. I would like to know whether this is an efficient way of doing it. Here is my SQL table: ...
asked by AGB
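As a rough illustration of what an SCD Type 1 merge can look like in plain PySpark (the table names, the customer_id key, and the shared schema are all assumptions; the question's own Databricks code isn't shown here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and key names; both frames are assumed to share a schema.
existing = spark.table("dim_customer")   # current dimension table
updates = spark.table("stg_customer")    # incoming changes

# SCD Type 1: rows whose key matches an update are replaced, everything else is kept.
unchanged = existing.join(updates, on="customer_id", how="left_anti")
merged = unchanged.unionByName(updates)

merged.write.mode("overwrite").saveAsTable("dim_customer_merged")
```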
1 vote
1 answer
176 views

The following is a trimmed-down example of my actual code, but it suffices to show the algorithmic problem I'm trying to solve. Given is a DataFrame with events, each with a user ID and a timestamp. ...
asked by Tobias Hermann
4 votes
1 answer
219 views

I need help rewriting my code to be less repetitive. I am used to procedural rather than object-oriented coding. My Scala program is for Databricks. How would you combine cmd 3 and 5? Does ...
asked by Dung Tran
2 votes
0 answers
1k views

I have a scenario where 10K+ regular expressions are stored in a table along with various other columns, and this needs to be joined against an incoming dataset. Initially I was using Spark SQL's rlike ...
asked by Wiki_91
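A hedged sketch of one way to join a dataset against a table of regular expressions with rlike; the table names and the payload/pattern column names are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables: "patterns.pattern" holds the regexes,
# "events.payload" is the text they are matched against.
patterns = spark.table("patterns")
events = spark.table("events")

# Broadcast the (relatively small) regex table and keep only matching pairs.
matched = (events
           .crossJoin(F.broadcast(patterns))
           .where(F.expr("payload rlike pattern")))
```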
2 votes
2 answers
823 views

This is the first application, or really any Scala, I have ever written. So far it functions as I would hope. I just found this community and would love some peer review of possible ...
asked by GenericDisplayName
2 votes
1 answer
72 views

I am trying to improve my programming skills at work (analyst), and one of the engineering projects I worked on is around ETL. Essentially, we roll up all of an individual's account information into a single row ...
asked by Rob
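A minimal sketch of the kind of roll-up the question describes, assuming a hypothetical accounts table keyed by person_id; the real columns and aggregations are not shown in the excerpt:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table, key, and aggregations.
accounts = spark.table("accounts")

one_row_per_person = (accounts
                      .groupBy("person_id")
                      .agg(F.sum("balance").alias("total_balance"),
                           F.collect_list("account_type").alias("account_types")))
```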
2 votes
0 answers
281 views

I am trying to find quantiles for each column of the table for various firms using Spark 1.6. I have around 5000 entries in firm_list and 300 entries in attr_lst ...
asked by Vishwanath560
3 votes
0 answers
57 views

I am new to Spark. I have two dataframes, df1 and df2. df1 has three rows; df2 has more than a few million rows. I want to check whether all items in df2 are in a transaction of df1 and, if so, sum up the costs. ...
asked by priya
4 votes
3 answers
1k views

I am trying to find out whether a column is binary or not. If a column contains only 1 or 0, I flag it as binary; else ...
asked by Shankar Panda
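A small sketch of one way to flag binary columns (the table name is a placeholder): a column counts as binary when its distinct non-null values are a subset of {0, 1}.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input table.
df = spark.table("some_table")

def is_binary(frame, col_name):
    # Collect the distinct values of one column and test them against {0, 1}.
    values = {row[0] for row in frame.select(col_name).distinct().collect()}
    values.discard(None)
    return values.issubset({0, 1})

flags = {c: is_binary(df, c) for c in df.columns}
```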
1 vote
1 answer
180 views

I was successfully able to write a small script using PySpark to retrieve and organize data from a large .xml file. Being new to using PySpark, I am wondering if there is any better way to write the ...
asked by Wilson
2 votes
0 answers
609 views

I have written a function that takes two PySpark dataframes and creates a line-by-line diff. I am struggling to get it to scale to hundreds of columns. I get around the for ...
asked by shannona2013
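One hedged way to diff two dataframes row by row is exceptAll (available from Spark 2.4); the table names below are placeholders and both frames are assumed to share the same schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs with identical schemas.
df_a = spark.table("snapshot_a")
df_b = spark.table("snapshot_b")

# exceptAll keeps duplicates, so each side of the diff is exact.
only_in_a = df_a.exceptAll(df_b)
only_in_b = df_b.exceptAll(df_a)
```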
3 votes
1 answer
593 views

I have a CSV file with more than 700,000,000 records in this structure: ...
asked by phoebe
1 vote
1 answer
5k views

I am joining two data frames in Spark using Scala. My code looks very ugly because of the multiple when conditions. Can somebody please help me simplify my code? Here is my existing code. ...
asked by Sudarshan kumar
3 votes
0 answers
2k views

I have some use cases where I have small parquet files in Hadoop, say, 10-100 MB. I would like to compact them so as to have files of at least, say, 100 MB or 200 MB. The logic of my code is to: * find a ...
asked by javadev
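A sketch of the compaction idea under stated assumptions: the paths and the 128 MB target are placeholders, and the input size is read through the Hadoop FileSystem API via py4j internals (spark._jvm / spark._jsc), which is an implementation detail rather than a public PySpark API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "hdfs:///data/small_files/"   # hypothetical input directory
dst = "hdfs:///data/compacted/"     # hypothetical output directory
target_bytes = 128 * 1024 * 1024    # assumed target file size (~128 MB)

df = spark.read.parquet(src)

# Measure the total input size and derive how many output files are needed.
jpath = spark._jvm.org.apache.hadoop.fs.Path(src)
fs = jpath.getFileSystem(spark._jsc.hadoopConfiguration())
total_bytes = fs.getContentSummary(jpath).getLength()
num_files = max(1, int(total_bytes // target_bytes))

df.coalesce(num_files).write.mode("overwrite").parquet(dst)
```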
3 votes
0 answers
2k views

I have a dataframe df, which contains the data below: ...
asked by Varun Chadha
2 votes
0 answers
186 views

The algorithm needs to reduce an RDD[GPSRecord] based on the distance between several points, e.g. "give me only GPS records when the distance between them exceeds ...
asked by MiguelAraCo
6 votes
0 answers
100 views

I've written a PySpark program that completely solves a tiered board game (no loops; each game position is a member of only one tier) and writes each tier to a file. It also determines the ...
asked by Michael
3 votes
1 answer
133 views

I am new to Spark and Scala, and I have solved the following problem. I have a table in a database with the following structure: ...
asked by Shams Tabraiz Alam
0 votes
1 answer
6k views

Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data and passes it to a transformation, then makes assertions on the ...
asked by wrschneider
4 votes
0 answers
600 views

I'm new to Spark and dataframes and I'm looking for feedback on what bad or inefficient processes might be in my code so I can improve and learn. My program reads in a parquet file that contains ...
asked by flybonzai
1 vote
1 answer
879 views

I need to parse logs and have the following code. I can see two problems: map().filter() may induce some performance penalties, and there is a copy-pasted block. parser.py: ...
asked by Loom
3 votes
1 answer
3k views

I'm trying to implement a dot product using PySpark in order to learn PySpark's syntax. I've currently implemented the dot product like so: ...
asked by Thunder Shiviah
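For reference, a minimal RDD-based dot product sketch with made-up toy vectors:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Toy vectors for illustration; zip pairs the elements, map multiplies them, sum reduces.
a = sc.parallelize([1.0, 2.0, 3.0])
b = sc.parallelize([4.0, 5.0, 6.0])

dot = a.zip(b).map(lambda xy: xy[0] * xy[1]).sum()   # 32.0
```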
2 votes
0 answers
386 views

This is my first Spark application. I am using ALS.train for training the model (matrix factorization). The total time the application takes is approximately 45 minutes. Note: I think takeOrdered is the ...
asked by Jaspinder
15 votes
1 answer
36k views

Maybe I totally reinvented the wheel, or maybe I've invented something new and useful. Can one of you tell me if there's a better way of doing this? Here's what I'm trying to do: I want a generic ...
asked by Nathaniel
5 votes
0 answers
739 views

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
asked by Huga
8 votes
2 answers
3k views

Given a list of tuples of the form (a, b, c), is there a more direct or optimized way of calculating the average of all the c's ...
asked by JasonAizkalns
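A short sketch with made-up sample data: project the third element of each tuple and take its mean:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Made-up (a, b, c) tuples; mean() averages the projected c values in one pass.
rdd = sc.parallelize([("x", 1, 10.0), ("y", 2, 20.0), ("z", 3, 30.0)])
avg_c = rdd.map(lambda t: t[2]).mean()   # 20.0
```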
0 votes
1 answer
651 views

I have a simple static class whose purpose is, given an RDD of Point, to find the median of each dimension and return that as a new ...
asked by Aki K
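A rough sketch, assuming a Point is just an (x, y) pair (the real Point type isn't shown): convert the RDD to a DataFrame and use approxQuantile per dimension:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Made-up 2-D points; each tuple becomes a DataFrame row with columns x and y.
points = sc.parallelize([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)])
df = points.toDF(["x", "y"])

# Probability 0.5 gives an approximate median for each listed column.
medians = df.approxQuantile(["x", "y"], [0.5], 0.001)   # [[2.0], [20.0]]
```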
11 votes
2 answers
28k views

I'm currently learning how to use Apache Spark. In order to do so, I implemented a simple word count (not really original, I know). There already exists an example in the documentation providing the ...
asked by merours
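For comparison, the classic word count as a minimal PySpark sketch (the input and output paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Split each line into words, pair each word with 1, and sum the counts per word.
lines = sc.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/wordcount_output")
```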
5 votes
1 answer
2k views

Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on Spark clusters. The settings are: 5 nodes, each with 8 cores (all the ...
asked by Tim