How to query Apache-Pulsar?

Question

Apache Pulsar offers a very interesting architecture with its tiered storage offloaders.

I wonder how I could do interactive queries from another application? I mean direct queries to the key-value system/"database" and not using pulsar-sql, which uses Presto underneath.

@sijieg posted the following diagram on Twitter:

It looks like I can access the State-store (or even the Segment-reader) and read the data directly from the bookies (and maybe from tiered storage, according to a metastore?). How can we access the State-store/Segment-reader and read the data the way Flink-Pulsar or Spark-Pulsar do?

David Kjerrumgaard · Accepted Answer · 2020-05-07 16:08:20Z

1

I am not sure why you are opposed to using pulsar-sql, which uses Presto. That is the preferred method for performing complex SQL-based queries against data stored in the BookKeeper storage layers (both on bookie disk and in tiered storage). Presto parses the SQL and produces the AST and query plan to return the data, so it does provide a lot of value in that regard.

However, if you are interested in accessing the data directly on BK, you can use the older DLog API or the new table (key/value) service that is embedded in the bookies.



3 Comments

user10077504 Over a year ago

The real use case is to enrich data from Apache Flink that is "stored" in Apache Pulsar. Pulsar SQL would not respond fast enough, and direct access to the data would be more interesting. What I am really looking for is a service/module/anything that I can query with a key, without knowing whether the data is in a bookie (and which one) or on S3. As I understand it, that is what Pulsar SQL abstracts away. This table (k/v) service would be useful to me only if I know which bookie holds my data, if it is in a bookie at all, right?

David Kjerrumgaard Over a year ago

FWIW, the table service overlays the entire BookKeeper cluster, so you don't need to know which bookie has the actual data. The table service will take care of locating and returning the data for you automatically.

feder Over a year ago

Made the same observation as David when testing PulsarSQL. And yes, if you look into historic queries, in other words pull queries, PulsarSQL/Trino is an extremely fast solution and can be scaled horizontally as you like. And since Pulsar balances the load more uniformly than Kafka, I see no faster solution. For push queries, i.e. real-time queries, use Apache Flink (for low latency) or Apache Spark (for high throughput).

Sergii Zhevzhyk · Accepted Answer · 2020-04-11 19:08:44Z

0

The short answer is "You don't directly query Apache Pulsar". But let's take a closer look.

Apache Pulsar is not an RDBMS where SQL queries are the main way to work with data. If your system needs SQL queries and the load is not too extreme, just use a traditional RDBMS or a NoSQL database of your choice.

Why is it hard to query Apache Pulsar? The main reason is that Apache Pulsar is a distributed pub-sub messaging system in which data is treated as unbounded streams, which makes it hard to run traditional SQL queries in a performant way. The solution in this case is a stream processing engine (Pulsar Functions, Apache Flink, Apache Spark), where data can be selected, transformed and written somewhere.

If you still need to run queries against some data stored in Pulsar, it is possible to forward this data to an RDBMS or a NoSQL database using the built-in sink connectors.
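As a rough illustration of that idea, the Python sketch below replays a topic into a key/value table and then looks values up by key. It is hand-rolled code, not the built-in sink connectors (those are configured rather than written), and several things in it are assumptions: SQLite stands in for the target database, the table name `kv` and the helper names `load_topic_into_sqlite`, `lookup` and `open_reader` are made up for the example, and messages are assumed to carry a partition key. The reader only needs the `has_message_available()` / `read_next()` methods of the Python client's `Reader`.

```python
import sqlite3


def load_topic_into_sqlite(reader, conn):
    """Replay every currently available message into a key/value table.

    `reader` is expected to behave like pulsar.Reader: it provides
    has_message_available() and read_next(), and each message provides
    partition_key() and data(). Returns the number of messages read.
    """
    # Hypothetical schema: one row per message key, holding the raw payload.
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
    count = 0
    while reader.has_message_available():
        msg = reader.read_next()
        # A later message with the same key overwrites the earlier row.
        conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
            (msg.partition_key(), msg.data()),
        )
        count += 1
    conn.commit()
    return count


def lookup(conn, key):
    """Return the latest payload stored for `key`, or None if it is unknown."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return None if row is None else row[0]


def open_reader(service_url, topic):
    """Open a reader positioned at the earliest available message.

    Needs the pulsar-client package; the import is local so the helpers
    above can be tried out without a broker.
    """
    import pulsar

    client = pulsar.Client(service_url)
    reader = client.create_reader(topic, pulsar.MessageId.earliest)
    return client, reader
```

Against a real cluster you would call something like `open_reader("pulsar://localhost:6650", "my-topic")` (both values are placeholders), pass the reader to `load_topic_into_sqlite`, and close the client when done. Keeping only the latest row per key mirrors what a compacted topic gives you; if you need the full history, insert instead of replacing.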

And for analytics, it can be enough to use pulsar-sql, which communicates directly with the storage layer (the bookies).


2 Comments

user10077504 Over a year ago

Hi @Sergii Zhevzhyk, I have edited my question with more details, but I do mean accessing the data exactly like pulsar-sql does. I know that Apache Pulsar is not an RDBMS. Pulsar committers seem to say that Apache Pulsar is a perfect fit as a state backend for Apache Flink, and I cannot imagine it would use pulsar-sql.

Sergii Zhevzhyk Over a year ago

Apache BookKeeper is an essential part of Apache Pulsar and it can be accessed directly, but first you need to work out how to make sense of the data you receive by including metadata from ZooKeeper. The process becomes too complicated. AFAIK the State Store shown in the picture keeps the state of Flink functions and operators, but nothing else. Again, it doesn't make sense to query this State Store directly. You can always use Flink to query/process your stream and publish the data to a topic or somewhere else.
