
I'm scratching my head over this weird PySpark behavior and couldn't find anything about it online.

We have a MySQL cluster in a private network, which we can access through an SSH tunnel. I'm trying to read data from this database using PySpark and SSHTunnelForwarder, and I'm doing it like this:

  1. Creating the tunnel:
server = SSHTunnelForwarder(
        (target_tunnel_ip_address, 22),
        ssh_username=tunnel_username,
        ssh_private_key=private_key_filepath,
        remote_bind_address=(mysql_address, 3306)
        )

server.start()
  2. Creating a JDBC URL using the database information like so:
hostname = "localhost" # Because I forwarded the remote port to my localhost
port = server.local_bind_port #To access which port the SSHTunnelForwarder used
username = my_username
password = my_password
database = my_database
jdbcUrl = "jdbc:mysql://{}:{}/{}?user={}&password={}".format(hostname, port, database, username, password)
  3. Reading the data from the database:
data = spark.read \
                  .format("jdbc") \
                  .option("url", jdbcUrl) \
                  .option("driver", "com.mysql.cj.jdbc.Driver") \
                  .option("query", query) \
                  .load()

So far so good. This seems to work, and I can see the table columns (output of the data variable: https://i.sstatic.net/YJhCC.png):

DataFrame[id: int, company_id: int, life_id: int, type_id: int, cep: string, address: string, number: int, complement: string, neighborhood: string, city: string, state: string, origin: string, created_at: timestamp, updated_at: timestamp, active: boolean]

But as soon as I call any method that actually reads the data, like .head(), .collect() or other variations, I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7629.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7629.0 (TID 11996, XX.XXX.XXX.XXX, executor 0): com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

Does anyone have any idea why this is happening and how to fix it?


1 Answer

That code is executed on the driver, but the tasks run on the executors. When you reference "localhost" in the JDBC URL, each executor interprets it as itself and fails to connect.
Instead, get the hostname of the driver (e.g. with socket.gethostname()) and use that in the JDBC URL.
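
A minimal sketch of that idea, reusing the placeholder names from the question (target_tunnel_ip_address, mysql_address, etc.). The local_bind_address setting and the assumption that the executors can reach the driver's hostname over the network are additions on my part, not something stated in the question:

import socket
from sshtunnel import SSHTunnelForwarder

# Bind the tunnel to 0.0.0.0 so it is reachable from machines other than
# the driver itself (assumption: firewalls allow executors to connect).
server = SSHTunnelForwarder(
    (target_tunnel_ip_address, 22),
    ssh_username=tunnel_username,
    ssh_private_key=private_key_filepath,
    remote_bind_address=(mysql_address, 3306),
    local_bind_address=("0.0.0.0", 0),  # port 0 = pick any free local port
)
server.start()

# Use the driver's hostname instead of "localhost" so the executors
# resolve the address to the driver machine, where the tunnel listens.
driver_host = socket.gethostname()
port = server.local_bind_port

jdbcUrl = "jdbc:mysql://{}:{}/{}?user={}&password={}".format(
    driver_host, port, database, username, password
)

data = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("query", query)
    .load()
)
data.head()  # the actual read should now go through the driver's tunnel

With this setup the schema inference and the actual row fetches both connect to the same address, so actions like .head() or .collect() should no longer trigger the communications link failure, assuming the driver is network-reachable from the executors.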
