184 questions
2
votes
1
answer
150
views
Spark memory error in thread spark-listener-group-eventLog
I have a pyspark application which is using Graphframes to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
2
votes
0
answers
61
views
How to run large scale processing of multiple shortest path finding calls?
I'm trying to incorporate a solution that takes start and end coordinates, alongside timestamps, to find the shortest path between them. This uses the UK road network pulled from OSM, the start and ...
2
votes
1
answer
81
views
How to clean up a graph removing redundant relationships
Hi, I'm trying to process a large edges DataFrame of a network. The problem is that each pair of connected nodes has two relationships between them. Since loading two edges into a graph would technically be ...
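One way to picture the cleanup asked about here is to treat each edge as an unordered pair and keep only the first occurrence. The sketch below is plain Python over a list of tuples, not GraphFrames code; the function name `drop_reverse_duplicates` and the sample edges are hypothetical. On a PySpark DataFrame the same idea could probably be expressed by building a canonical key with `least`/`greatest` and calling `dropDuplicates`, but that is an assumption about the asker's data, not something the excerpt confirms.

```python
def drop_reverse_duplicates(edges):
    """Keep one edge per unordered node pair, preserving first-seen order."""
    seen = set()
    kept = []
    for src, dst in edges:
        # Canonical key: the same pair in either direction maps to one tuple.
        key = (src, dst) if src <= dst else (dst, src)
        if key not in seen:
            seen.add(key)
            kept.append((src, dst))
    return kept


# Hypothetical sample: every relationship appears once in each direction.
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
deduped = drop_reverse_duplicates(edges)
```

This only makes sense if direction carries no meaning in the graph; for a truly directed graph the reverse edge is not redundant.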
0
votes
1
answer
86
views
Graphframes Pyspark route compaction
In PySpark, given a directed graph structure represented by nodes and edges DataFrames, how can I compact routes?
Given certain nodes that can be sources and certain nodes that can be a route ...
1
vote
1
answer
420
views
How do I run GraphFrame in AWS Glue 3.0?
How do I use GraphFrame in AWS Glue 3.0? I see that only the Spark 2.x version has a Python wheel package, but other versions of Spark do not have it. I am getting a class loading exception
py4j.protocol....
1
vote
3
answers
655
views
How to detect a cycle in a Spark GraphFrame?
Here is a Spark GraphFrames df representing a directed graph; there may be some cycles in this graph. How can I detect the cycles in a GraphFrame?
For example, here is a graph
| src | dst |
| --- | --- |...
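For reference, the underlying check on a `src`/`dst` edge list can be sketched as a depth-first search with three node states. This is a minimal single-machine illustration in plain Python, assuming the edge list fits in memory and the graph is shallow enough for recursion; it is not a GraphFrames API, and the name `has_cycle` is hypothetical.

```python
from collections import defaultdict


def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = defaultdict(int)

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                # Back edge to a node on the current path: cycle found.
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    # Snapshot the keys, since visiting sink nodes adds entries to the dict.
    return any(color[n] == WHITE and visit(n) for n in list(graph))
```

For example, `has_cycle([(1, 2), (2, 3), (3, 1)])` reports a cycle, while the same edges with `(3, 1)` replaced by `(1, 3)` do not.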
1
vote
1
answer
450
views
GraphFrames for pyspark in Azure Synapse
I'm trying to run the basic graphframes python sample on Azure Synapse. This works fine when I upload the correct .jar file from here and write the code in scala. But the same .jar file doesn't get ...
1
vote
0
answers
619
views
GraphFrames and connected components
I have a graph that consists of vertices and edges and I am using the graphframes library to find connected components of that graph.
import GraphFrames as gf
connected_components = gf.GraphFrame(...
1
vote
0
answers
242
views
compute connectedcomponents using spark and graphframes on a very large number of vertices
I am working with a very large graph of approximately 100 million vertices and I am using graphframes.connectedcomponents with spark to resolve the graph. The output of the solution is a forest like ...
0
votes
1
answer
539
views
Use of Graphframes library in palantir-foundry
I want to use the GraphFrames package with PySpark in my Foundry code repository.
As mentioned here:
https://www.palantir.com/docs/foundry/transforms-python/environment-troubleshooting/#packages-which-...
0
votes
1
answer
321
views
GraphFrames Pregel doesn't converge
I have a relatively shallow, directed, acyclic graph represented in GraphFrames (a large number of nodes, mainly on disjunct subgraphs). I want to propagate the id of the root nodes (nodes without ...
1
vote
0
answers
375
views
I had the error Py4JJavaError: An error occurred while calling o65.showString in pyspark
I am trying to implement this code using:
python 3.9
spark-3.3.1-bin-hadoop3 included pyspark
java 1.8.0_171
The paths are alright and I am running other code on Jupyter, but I didn't find any answer ...
3
votes
0
answers
2k
views
PySpark GraphFrame and networkx for graphs with hierarchy
I need to create a graph like this which has two relationships, continent-country and country-city. I have 3 columns: city, country, continent, but I'm not sure how to get them into this graph.
Below is an ...
0
votes
1
answer
364
views
How to use GraphFrames on EMR serverless
Summary of steps executed:
Uploaded the python script to S3.
Created a virtualenv that installs graphframes and uploaded it to S3.
Added a VPC to my EMR application.
Added graphframes package to ...
0
votes
1
answer
2k
views
Error when running graphframes in google colab
I am using Google Colab and I cannot seem to use graphframes.
This is what I do:
!pip install pyspark
Which gives:
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/...
0
votes
0
answers
459
views
How to implement custom graph clustering algorithm on Spark using GraphFrame?
I have a very large, weighted graph on Azure COSMOS DB. Number of vertices and edges are in billions and size of DB is several TBs. I am trying to cluster the graph on Spark using some custom ...
1
vote
0
answers
438
views
Graphframes connectedComponents is not working if I run my spark jobs via databricks connect
GraphFrames connectedComponents is throwing exceptions when I try to run my Spark job from Databricks Connect. Here are the configurations I am using for the Spark session:
spark = (
SparkSession
....
1
vote
0
answers
206
views
What are the use cases for using Graphframes' connectedComponents various algorithms?
As a background: I am a python coder using Graphframes and pyspark through Databricks. I've been using Graphframes to deduplicate records in the context of record-linkage. Below is some pseudo-code ...
0
votes
1
answer
151
views
Define edge rules in pyspark graphframes
I am using graphframes to represent a graph in pyspark from a similar dataframe:
data = [
("1990", "1995"),
("1980", "1996"),
("1993", "...
0
votes
1
answer
508
views
Unique ID (UID) generation using pyspark across different data sources
We are working on a use case to generate a unique ID (UID) for the Customers spanning across different systems/data sources. The unique ID will be generated using PII information such as email & ...
2
votes
0
answers
258
views
How to create citation network of articles using graphframes?
I have a corpus of 44940 articles; each article has an id, a title and a list of references (other articles that it cited). The schema of the corpus looks something like this:
+---+-----+----------+
| id|...
2
votes
1
answer
825
views
How to get list of graph nodes after using connectedComponents of pyspark
I am learning PySpark in Python. If I use the below line of code to get components from my graph, then one column would be added to my GraphDataFrame with the component (random number). But I am ...
0
votes
1
answer
637
views
Unable to run analytics using GraphFrames and PySpark on Jupyter Notebook
I've been trying to install GraphFrames on my environment. I am using Jupyter Notebook and I've successfully installed Spark. In order to install GraphFrames, I did !pip install graphframes directly ...
1
vote
1
answer
362
views
using a modules method in Pyspark map
I have heard that it is possible to call a method of another Python module to bring in some calculations that are not implemented in Spark, and of course it is inefficient to do that.
I need a method ...
1
vote
0
answers
38
views
How to find a diamond in a graph with Spark GraphX
I'm using GraphFrame in Spark GraphX. I tried to find a diamond in my graph. My graph is as follows:
nodeA->nodeB->nodeD->nodeF
nodeA->nodeE->nodeD->nodeG
so we can know there is ...
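Reading "diamond" as two distinct two-hop paths from the same start node to the same end node (here nodeA->nodeB->nodeD and nodeA->nodeE->nodeD), the search can be sketched in plain Python as below. This interpretation is an assumption, since the excerpt is cut off, and `find_diamonds` is a hypothetical helper rather than a GraphFrames or GraphX function.

```python
from collections import defaultdict
from itertools import combinations


def find_diamonds(edges):
    """Return (start, mid1, mid2, end) tuples where start->mid1->end and start->mid2->end."""
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    # two_hop[(a, d)] collects every intermediate node m with a->m->d.
    two_hop = defaultdict(set)
    for a, mids in list(successors.items()):
        for m in mids:
            for d in successors.get(m, ()):
                two_hop[(a, d)].add(m)

    diamonds = []
    for (a, d), mids in two_hop.items():
        # Any two distinct intermediates between the same endpoints form a diamond.
        for b, e in combinations(sorted(mids), 2):
            diamonds.append((a, b, e, d))
    return diamonds


# Edges taken from the two paths shown in the question.
edges = [
    ("nodeA", "nodeB"), ("nodeB", "nodeD"), ("nodeD", "nodeF"),
    ("nodeA", "nodeE"), ("nodeE", "nodeD"), ("nodeD", "nodeG"),
]
diamonds = find_diamonds(edges)
```

In GraphFrames itself, a motif query with four edge patterns and a filter that the two middle vertices differ would be the natural counterpart, though the exact pattern depends on what the asker counts as a diamond.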
0
votes
1
answer
253
views
Group the related values in one group
Trying to group the column values based on related records:
partColumns = (["partnumber","colVal1","colVal2", "colVal3","colVal4","colVal5"])
...
1
vote
1
answer
523
views
How to load GraphFrame/Pyspark DataFrame into Pytorch Geometric (InMemory)Dataset?
Has anybody ever done a custom pytorch.data.InMemoryDataset for a Spark GraphFrame (or rather PySpark DataFrames)? I looked for people that have done it already but didn't find anything on GitHub/...
1
vote
1
answer
2k
views
Py4JJavaError: An error occurred while calling o65.createGraph
I wanted to install graphframes for spark following the instructions on the spark website, but the command:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
did not work for me.
I ...
0
votes
1
answer
567
views
Graphframes and BFS
I'm having some trouble understanding BFS on GraphFrames. I'm trying to get the "father of all", the one that has no parent in the graph.
See, I have this Dataframe:
val df = sqlContext....
5
votes
1
answer
3k
views
Install package Graphframes using spark-shell
I am trying to install the PySpark package Graphframes using spark-shell:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
However, there is an error like this in the terminal:
root@...
0
votes
1
answer
142
views
How to run PySpark with installed packages?
Normally, when I run pyspark with graphframes I have to use this command:
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
The first time I run this, it will install the packages ...
0
votes
1
answer
2k
views
Cannot set checkpoint dir when running Connected Component example
This is the Connected Components example by graphframe:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends() # Get example graph
result = g.connectedComponents()
result.select("...
0
votes
1
answer
1k
views
ModuleNotFoundError: No module named 'graphframes'
I want to run graphframes with pyspark.
I found this answer and followed its instructions, but it doesn't work.
This is my code hello_spark.py:
import pyspark
conf = pyspark.SparkConf().set("spark....
0
votes
1
answer
1k
views
Reduce and Lambda on pyspark dataframe
Below is an example from https://graphframes.github.io/graphframes/docs/_site/user-guide.html
The only thing I am confused about is the purpose of "lit(0)" in the condition function
if this "...
11
votes
4
answers
13k
views
PySpark packages installation on kubernetes with Spark-Submit: ivy-cache file not found error
I have been fighting this the whole day. I am able to install and use a package (graphframes) with spark shell or a connected Jupyter notebook, but I would like to move it to the kubernetes based spark ...
0
votes
1
answer
2k
views
graphframes for pySpark v3.0.1
I'm trying to use the graphframes library with pySpark v3.0.1. (I'm using vscode on debian but trying to import the package from pyspark shell didn't work either)
According to the documentation, using ...
0
votes
1
answer
565
views
Update vertices values in GraphFrame
I wonder if there is any way to update vertex (or edge) values after constructing a graph with GraphFrame? I have a graph and its vertices have these ['id', 'name', 'age'] columns. I've written some code ...
1
vote
1
answer
2k
views
Pyspark + Graphframes: "recursive" message aggregation
I've created the following graph:
spark = SparkSession.builder.appName('aggregate').getOrCreate()
vertices = spark.createDataFrame([('1', 'foo', 99),
('2', 'bar', 10)...
1
vote
1
answer
929
views
Pyspark and Graphframes: Aggregate messages power mean
Given the following graph:
Where A has a value of 20, B has a value of 5 and C has a value of 10, I would like to use pyspark/graphframes to compute the power mean. That is,
In this case n is the ...
1
vote
0
answers
378
views
Iterative GraphFrames AggregateMessages hitting memory limits
I'm using GraphFrame's aggregateMessages capability to build a custom clustering algorithm. I tested this algorithm on a small sample dataset (~100 items) and verified that it works. But when I run ...
3
votes
2
answers
2k
views
How to create edge list from spark data frame in Pyspark?
I am using graphframes in pyspark for some graph type of analytics and wondering what would be the best way to create the edge list data frame from a vertices data frame.
For example, below is my ...
1
vote
1
answer
6k
views
How to Get Connected Component with Graphframes in Pyspark and Raw Data in Spark Dataframe?
I have a spark data frame which looks like below:
+--+-----+---------+
|id|phone| address|
+--+-----+---------+
| 0| 123| james st|
| 1| 177|avenue st|
| 2| 123|spring st|
| 3| 999|avenue st|
| 4|...
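To make the grouping concrete: if two records are taken to be linked whenever they share a phone or an address (an assumption, since the excerpt does not state the linking rule), the components can be sketched with a small union-find in plain Python. Only the four fully visible rows are used; `connected_components` here is a hypothetical helper, not the GraphFrames method of the same purpose.

```python
def connected_components(rows):
    """Map each record id to a component label (the smallest id in its group)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Attach the larger root under the smaller so labels are stable.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # (field, value) -> first record id carrying that value
    for rec_id, phone, address in rows:
        find(rec_id)
        for key in (("phone", phone), ("address", address)):
            if key in first_seen:
                union(rec_id, first_seen[key])
            else:
                first_seen[key] = rec_id

    return {rec_id: find(rec_id) for rec_id, _, _ in rows}


# The four complete rows from the table above.
rows = [
    (0, 123, "james st"),
    (1, 177, "avenue st"),
    (2, 123, "spring st"),
    (3, 999, "avenue st"),
]
components = connected_components(rows)
```

With these rows, ids 0 and 2 end up together through the shared phone, and ids 1 and 3 through the shared address. In GraphFrames the equivalent step would be building an edges DataFrame from such shared values before calling `connectedComponents`.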
3
votes
1
answer
2k
views
How to start graphframes on spark on pyspark on jupyter on docker?
Been playing with pyspark on jupyter all day with no issues. Just by simply using the docker image jupyter/pyspark-notebook, 90% of everything I need is packaged (YAY!)
I would like to start exploring ...
0
votes
1
answer
398
views
PySpark: remove rows which derive from others
I have the following dataframe, which contains all the paths within a tree after going through all nodes. For each jump between nodes, a row will be created where "dist" is the number of ...
2
votes
1
answer
293
views
Using GraphFrames (Scala) to compute hierarchy
I have a dataframe below:
employee_id|employee_name|manager_employee_id|
----------------------------------------------
1 eric (ceo) 1
2 edward 1
3 ...
2
votes
1
answer
922
views
Computing PageRank on a digraph with edge weights using GraphFrames
Assume I use GraphFrames to construct a digraph g with edge weights from the positive real numbers. I would then like to compute the PageRank with taking the edge weights into account. I don't see how ...
0
votes
1
answer
1k
views
How to build a parent-child relationship in PySpark or Python?
I have numbers as key,value pairs like (1,2),(3,4),(5,6),(7,8),(9,10),(2,11),(4,12),(6,13),(8,14),(14,19)
My input is (1,2),(3,4),(5,6),(7,8),(9,10),(2,11),(4,12),(6,13),(8,14)
Here I need to create a relation ...
5
votes
1
answer
2k
views
GraphFrames: Merge edge nodes with similar column values
tl;dr: How do you simplify a graph, removing edge nodes with identical name values?
I have a graph defined as follows:
import graphframes
from pyspark.sql import SparkSession
spark = SparkSession....
0
votes
1
answer
791
views
Getting shortestPaths in GraphFrames with Java
I am new to Spark and GraphFrames.
When I wanted to learn about the shortestPaths method in GraphFrame, the GraphFrames documentation gave me sample code in Scala, but not in Java.
In their document, they ...
1
vote
1
answer
2k
views
RDD Warning: Not enough space to cache rdd in memory
I am trying to run the PageRank algorithm on a GraphFrame using PySpark. However, when I execute it, the program keeps running endlessly and I get the following warnings:
The code is as follows:
vertices = sc....