0 votes
0 answers
110 views

I am trying to migrate Delta Tables to Iceberg using Scala-Spark keeping the data intact, following this - https://iceberg.apache.org/docs/1.4.3/delta-lake-migration/ Here is the sample code (...
asked by Pramit Pakhira
0 votes
1 answer
40 views

We have a use-case wherein we need to cache certain data that has been processed so that Spark does not reprocess the same data in the event of task failures. So say we have a thousand Foo objects for ...
asked by mandar (rep 125)
1 vote
1 answer
64 views

I am trying to register a custom codec (for map) like below: val session: CqlSession = CassandraConnector.apply(spark.sparkContext).openSession() val codecRegistry: MutableCodecRegistry = session....
asked by Shivam Sajwan
0 votes
1 answer
52 views

I want to partition/group rows so that every group has total size <= limit. For example, if I have:
+--------+----------+
|      id|      size|
+--------+----------+
|       1|         3|
|       2|         6| ...
asked by Gary Chan Chi Hang
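For the size-limited grouping question above, the core assignment logic is a greedy fold over the rows. A minimal plain-Scala sketch (the `Row` case class, the `limit` value, and the sample sizes are illustrative, since the question is truncated; in Spark the same logic would run over locally collected or sorted rows):

```scala
// Greedy grouping: keep adding rows to the current group until the
// next row would push the group's total size over the limit, then
// start a new group. Names and sample data are illustrative.
case class Row(id: Int, size: Int)

def assignGroups(rows: Seq[Row], limit: Int): Seq[(Row, Int)] = {
  val (result, _, _) = rows.foldLeft((Vector.empty[(Row, Int)], 0, 0)) {
    case ((acc, group, used), row) =>
      if (used + row.size <= limit)
        (acc :+ (row -> group), group, used + row.size)      // fits: same group
      else
        (acc :+ (row -> (group + 1)), group + 1, row.size)   // overflow: new group
  }
  result
}

val grouped = assignGroups(Seq(Row(1, 3), Row(2, 6), Row(3, 2)), limit = 8)
// rows 2 and 3 share group 1 because 6 + 2 <= 8
```

Note that a single row larger than the limit still gets its own group; how such rows should be handled is not specified in the question.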
0 votes
1 answer
43 views

I have two dataframes that have 300 columns and 1000 rows each. They have the same column names. The values are of mixed datatypes like Struct/List/Timestamp/String/etc. I am trying to compare the ...
asked by Noob (rep 77)
2 votes
1 answer
1k views

Getting the following error while creating a Delta table using Scala Spark. _delta_log is created at the warehouse, but it lands in this error after _delta_log creation: Exception in thread &...
asked by Sarthak Sharma
1 vote
3 answers
1k views

I am trying to use IntelliJ to build Spark applications written in Scala. I get the following error when I execute the Scala program: Exception in thread "main" java.lang....
asked by PRATIK CHAPADGAONKAR
0 votes
0 answers
79 views

I'm a newbie to Scala Spark programming. I have to build a recommendation system for movies in Scala Spark using Google Cloud Platform. The dataset is composed of (movie_id, user_id, rating) ...
asked by Luca Genova
0 votes
1 answer
73 views

I have a dataframe that looks like this:
| Column                                         |
|------------------------------------------------|
| [{a: 2, b: 4}, {a: 2, b: 3}]                   |
|-------...
asked by vZ10 (rep 2,756)
-1 votes
3 answers
111 views

I have the following dataset:
|value|
+-----+
|    1|
|    2|
|    3|
I want to create a new column newValue that takes the value of newValue from the previous row and does something with it. For ...
asked by Kewitschka (rep 1,681)
1 vote
2 answers
136 views

I have a Scala Spark dataframe with the schema: root |-- passengerId: string (nullable = true) |-- travelHist: array (nullable = true) | |-- element: integer (containsNull = true)...
asked by Abishek (rep 857)
-1 votes
2 answers
127 views

I want to divide the quantity value into multiple rows, split by the number of months between the start date and end date columns. Each row should have the start date and end date of its month. I also want ...
asked by isrikanthd
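The month-splitting logic in the question above can be sketched in plain Scala with java.time; the field names, the even split of the quantity, and the sample values are illustrative assumptions, since the question is truncated (in Spark the same function could back a flatMap or a UDF followed by explode):

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Expand one (quantity, start, end) row into one row per calendar
// month, dividing the quantity evenly. Names are illustrative.
case class MonthRow(monthStart: LocalDate, monthEnd: LocalDate, quantity: Double)

def splitByMonth(quantity: Double, start: LocalDate, end: LocalDate): Seq[MonthRow] = {
  // count whole calendar months touched by the [start, end] range
  val months = ChronoUnit.MONTHS.between(
    start.withDayOfMonth(1), end.withDayOfMonth(1)).toInt + 1
  val share = quantity / months
  (0 until months).map { i =>
    val first = start.withDayOfMonth(1).plusMonths(i)
    MonthRow(first, first.plusMonths(1).minusDays(1), share)
  }
}

val rows = splitByMonth(90.0, LocalDate.of(2024, 1, 15), LocalDate.of(2024, 3, 10))
// three months, 30.0 each, with first/last day of each month
```

Whether the first and last rows should be clipped to the actual start/end dates (rather than month boundaries) depends on the requirement the question elides.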
0 votes
1 answer
2k views

For some weird reason I need to get the column names of a dataframe and insert them as the first row (I cannot just import without the header). I tried using a for comprehension to create a dataframe that ...
asked by Jiaming Pei
1 vote
1 answer
609 views

I am working on a spark project and have some performance issue that I am struggling with, any help will be appreciated. I have a column Collection that is an array of struct: root |-- Collection: ...
asked by Yue Wang
0 votes
1 answer
210 views

I have a spark dataframe column (custHeader) in the below format and I want to extract the value of the key - phone into a separate column. trying to use the from_json function, but it is giving me a ...
asked by marc (rep 319)
0 votes
1 answer
222 views

Spark broadcasts the right dataset of a left join, which causes org.apache.spark.sql.execution.OutOfMemorySparkException: Size of broadcasted table far exceeds estimates and exceeds limit of spark.driver....
asked by alsetr (rep 20)
-1 votes
1 answer
69 views

Imagine the data for this question is in a nested JSON structure. I have flattened the data from JSON using explode() and added it into one dataframe with columns project, Task, Task-Evidence, Task-...
asked by Vivek Gowda
1 vote
2 answers
775 views

I am trying to understand if there is any difference in the following approaches, in terms of memory usage, optimisation, parallelism etc. Scenario: CSV files in an S3 bucket. 100 columns, more than ...
asked by Shiv Konar
3 votes
1 answer
3k views

I am trying to run distributed Scala code using spark-submit in cluster mode on minikube. 1. I used this Dockerfile: FROM datamechanics/spark:2.4.6-hadoop-3.1.0-java-8-scala-2.12-python-3.7-dm18 ...
asked by My coding Way
0 votes
3 answers
135 views

I have this Vector[String]:
user_uid,score,value
255938,34096,8
259117,34599,10
253664,28891,7
How can I convert it to a DataFrame? I already tried this: val dataInVectorRow = dataInVectorString .map(...
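Before handing lines like those in the question above to Spark's `toDF`/`createDataFrame`, they need to be split and typed. A plain-Scala sketch of just the parsing step (the `Score` case class name and field types are assumptions; in Spark you would then call `.toDF` or `spark.createDataFrame` on the result):

```scala
// Parse a header line plus CSV-style data lines into typed rows.
// Only the plain-Scala parsing step is shown; converting `parsed`
// to a DataFrame afterwards is a one-liner in a Spark session.
case class Score(userUid: Long, score: Long, value: Int)

val dataInVectorString = Vector(
  "user_uid,score,value",
  "255938,34096,8",
  "259117,34599,10",
  "253664,28891,7")

val parsed = dataInVectorString
  .drop(1)                 // skip the header row
  .map(_.split(","))
  .map { case Array(u, s, v) => Score(u.toLong, s.toLong, v.toInt) }
```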
0 votes
1 answer
100 views

Currently I'm working on exploding a struct array where the pairs of keys are the same. { "A": [{ "AA": { "AB": "21", "AC": &...
asked by instancedeveloper
0 votes
1 answer
630 views

Suppose I have this dataframe:
id value
A 1
A 2
A 3
B 1
B 2
C 1
D 1
D 2
and so on. Basically I want to make sure that, even with a records limit, any given id can only appear in one single file (suppose ...
asked by ForkPork
0 votes
1 answer
70 views

I am running a simple query in Spark 3.2: val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5") val ...
asked by ASR (rep 133)
0 votes
1 answer
63 views

We are upgrading Scala Spark: Spark from 2.4.3 to 3.1.3, scalaVersion from 2.11.8 to 2.12.10, spark-cassandra-connector from 2.4.2 to 3.1.0, Cassandra version 3.2, and all the subsequent dependencies. ...
asked by user21166408
1 vote
1 answer
358 views

Context: Working on a message processing application which processes millions of messages every day. The application is built using scala, spark, and uses Kafka, Cassandra DB. Multiple DB queries are ...
asked by Niranjana Datta
1 vote
2 answers
119 views

I have a dataframe with a lot of columns, but for this example we can use this one: val dfIn = sqlContext.createDataFrame(Seq(("r0", 0, 2, 3, "a"), ("r1", 1, 0, 0, "...
asked by isaga (rep 11)
0 votes
1 answer
166 views

I am new to Scala and am currently studying datasets for Scala and Spark. Based on my input dataset below, I am trying to create a new dataset (see below). In the new dataset, I aim to have a new ...
asked by AIBball (rep 331)
1 vote
2 answers
1k views

I am running a simple query in two versions of Spark, 2.3 & 3.2. The code is as below: spark-shell --master yarn --deploy-mode client val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF(&...
asked by ASR (rep 133)
0 votes
0 answers
205 views

I have been looking online for a while, but have found nothing, thus this question. I would like to be able to debug my Apache Spark code (written in Scala) remotely on Databricks, similar to the way ...
asked by MrMuppet (rep 757)
0 votes
1 answer
173 views

I have a Dataframe that has a column "grades" containing a list of Grade objects that have 2 fields: name (String) and value (Double). I would like to add the word PASS to the list of tags ...
asked by xard4sTR
0 votes
1 answer
67 views

I have two DataFrames - the first one with the columns model, cnd, age, tags (this is a repeatable field - String list/array), min, max and the second one with the main_model column. I would like to ...
asked by xard4sTR
0 votes
1 answer
756 views

I have a dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: ...
asked by Nikhil Padole
0 votes
1 answer
2k views

In Scala Spark we can filter rows where column A is not equal to column B on the same dataframe with df.filter(col("A") =!= col("B")). How can we do the same in PySpark? I have tried ...
asked by Arslan Ali
0 votes
0 answers
131 views

I have created a dataframe by reading data from DB2, and the dataframe looks like below. df1.show()
Table_Name | Source_count | Target_Count
----------------------------------------
Test_tab   | 12750 ...
asked by phani437
1 vote
1 answer
109 views

I am trying to run a simple hello-world Spark application. This is my code: package com.sd.proj.executables import org.apache.spark.sql.functions.lit import org.apache.spark.sql.{DataFrame, ...
asked by dsam05 (rep 31)
0 votes
1 answer
56 views

Why do we use val instead of var for accumulators? If it's like a counter that's shared across multiple executor nodes which update/change it, then doesn't that mean reassigning a val? val accum = sc....
asked by user23062
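The val-vs-var question above comes down to the difference between a reference and the object it points to: `val` fixes the reference, but `add` mutates the accumulator's internal state, so the variable is never reassigned. A plain-Scala analogue with a hypothetical `Counter` class (no Spark involved):

```scala
// A val holds a fixed reference to a mutable object: add() changes
// state inside the object without ever reassigning the variable,
// which is how Spark accumulators are updated as well.
class Counter {
  private var total: Long = 0L          // mutable state lives here
  def add(n: Long): Unit = total += n   // mutation, not reassignment
  def value: Long = total
}

val accum = new Counter   // the reference `accum` never changes
accum.add(3L)
accum.add(4L)
```

Declaring a Spark accumulator as `var` would compile, but `val` documents that only the accumulator's contents change, never which accumulator the name points to.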
0 votes
2 answers
370 views

I am trying to translate a PySpark job which dynamically coalesces the columns from two datasets with additional filters/conditions. conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("&...
asked by sinsom (rep 29)
0 votes
2 answers
8k views

This is my first post, so let me know if I need to give more details. I am trying to create a boolean column, "immediate", that is true when at least one of the columns has some data in it. ...
asked by jackdotdi
0 votes
0 answers
67 views

I am trying to translate the below code from PySpark to Scala. I am able to successfully create the dataframes from the input data. from pyspark.sql.functions import col, array, when, array_remove, lit, size, ...
asked by sinsom (rep 29)
0 votes
1 answer
635 views

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function: def broadcast[T](df: Dataset[T]): Dataset[T] = { Dataset[T](...
asked by sho (rep 234)
2 votes
2 answers
13k views

When running my job, I am getting the following exception: Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 32 in stage 2.0 failed 4 times, most recent ...
asked by jamesbascle (rep 1,013)
0 votes
0 answers
54 views

I want to join a large (1 TB) data RDD with a medium (10 GB) data RDD. An earlier job processing the large data alone was completing in 8 hours. I then joined the medium-sized data to get an info ...
asked by user0712
0 votes
1 answer
495 views

I want to convert a string date column to a date or timestamp (YYYY-MM-DD). How can I do it in Scala Spark SQL?
Input: D1
Apr 24 2022|
Jul 08 2021|
Jan 16 2022|
Expected: D2
2022-04-24|
2021-07-08| ...
asked by Namrata
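In Spark SQL the conversion above is typically `to_date(col("D1"), "MMM dd yyyy")`. The same "MMM dd yyyy" pattern can be checked in plain Scala with java.time, since Spark 3.x uses java.time patterns under the hood:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// "Apr 24 2022" -> "2022-04-24": parse with the month-name pattern,
// then LocalDate.toString emits ISO yyyy-MM-dd.
val inFmt = DateTimeFormatter.ofPattern("MMM dd yyyy", Locale.ENGLISH)

def toIso(s: String): String = LocalDate.parse(s.trim, inFmt).toString

val converted = Seq("Apr 24 2022", "Jul 08 2021", "Jan 16 2022").map(toIso)
```

Pinning Locale.ENGLISH matters: month abbreviations like "Apr" fail to parse under locales with different month names.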
0 votes
1 answer
417 views

I need to add quotes around all values in a Spark dataframe. Input: val someDF = Seq( ("user1", "math","algebra-1","90"), ("user1", "physics&...
asked by rajasekar k
2 votes
0 answers
2k views

I am reading and writing events from Event Hub in Spark, trying to aggregate based on a few keys like this: val df1 = df0 .groupBy( colKey, colTimestamp ) .agg(...
asked by rick (rep 21)
0 votes
1 answer
360 views

This code works only if I make directory="s3://bucket/folder/2022/10/18/4/*" from pyspark.sql.functions import from_json from pyspark.streaming import StreamingContext ssc = ...
asked by Salsa Steve
1 vote
2 answers
1k views

I have a question I was unable to solve when working with Scala Spark (or PySpark): how can we merge two fields that are arrays of structs with different fields? For example, if I have a schema like so: ...
asked by Gligorijevic
0 votes
1 answer
806 views

I'm looking for a way to install external packages on the spylon kernel. I already tried initializing spark-shell with the --packages option inside spylon, but it just creates another instance. I tried %%...
asked by João Paulo Andrade
3 votes
2 answers
3k views

I want to move all files under a directory in my s3 bucket to another directory within the same bucket, using scala. Here is what I have: def copyFromInputFilesToArchive(spark: SparkSession) : Unit = {...
asked by Dylan Sanderson
1 vote
1 answer
479 views

I have a CSV file with data as below:
id,name,comp_name
1,raj,"rajeswari,motors"
2,shiva,amber kings
My requirement is to read this file into a Spark RDD, then do a map split with a comma delimiter. ...
asked by Roy John
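A plain split(",") on the file above would break "rajeswari,motors" into two fields. A quote-aware split can be sketched in plain Scala (a hand-rolled illustration; in practice `spark.read.option("quote", "\"").csv(...)` or a CSV library is the robust route, and this sketch ignores escaped quotes):

```scala
// Split a CSV line on commas, but not on commas inside double
// quotes. Minimal sketch: tracks whether we are inside a quoted
// section and treats commas there as ordinary characters.
def splitCsv(line: String): Vector[String] = {
  val out = Vector.newBuilder[String]
  val cur = new StringBuilder
  var inQuotes = false
  line.foreach {
    case '"'              => inQuotes = !inQuotes          // toggle quoted section
    case ',' if !inQuotes => out += cur.toString; cur.clear() // field boundary
    case c                => cur.append(c)
  }
  out += cur.toString   // flush the final field
  out.result()
}

val fields = splitCsv("1,raj,\"rajeswari,motors\"")
// Vector("1", "raj", "rajeswari,motors")
```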