2

Apache Pulsar offers a very interesting architecture with its tiered storage offloaders.

How could I run interactive queries from another application? I mean direct queries to the key-value system/"database", without using pulsar-sql, which uses Presto underneath.

@sijieg posted the following diagram on Twitter (image not available).

It looks like I can go through the State Store (or even the Segment Reader) to read the data directly from the bookies (and maybe from tiered storage, according to a metastore?). How can we use the State Store or Segment Reader to access the data the way Flink-Pulsar or Spark-Pulsar would?

2 Answers

1

I am not sure why you are opposed to using pulsar-sql, which uses Presto. That is the preferred method for performing complex SQL-based queries against data that is stored inside the BookKeeper storage layers (both on bookie disk and in tiered storage). Presto parses the SQL and produces the AST and query plan to return the data, so it does provide a lot of value in that regard.

However, if you are interested in accessing the data directly on BK, you can use either the older DLog (DistributedLog) API or the newer table (key/value) service that is embedded in the bookies.


3 Comments

The real use case is to enrich, from Apache Flink, data that is "stored" in Apache Pulsar. Pulsar SQL would not respond fast enough, so direct access to the data would be more interesting. What I am really looking for is a service/module/anything that I can query with a key, without knowing whether the data is on a bookie (and which one) or on S3. As I understand it, that is what Pulsar SQL abstracts away. This table (k/v) service would be useful to me only if I knew which bookie holds my data, and whether it is on a bookie at all; right?
FWIW, the table service overlays the entire BookKeeper cluster, so you don't need to know which bookie has the actual data. The table service will take care of locating and returning the data for you automatically.
I made the same observation as David when testing Pulsar SQL. And yes, if you are looking at historical queries, in other words pull queries, Pulsar SQL/Trino is an extremely fast solution and can be scaled horizontally as much as you like. And since Pulsar balances the load more uniformly than Kafka, I see no faster solution. For push queries, i.e. real-time queries, use Apache Flink (for low latency) or Apache Spark (for high throughput).
0

The short answer is "You don't directly query Apache Pulsar". But let's take a deeper look.

Apache Pulsar is not an RDBMS, where SQL queries are the main way to work with data. If your system needs SQL queries and the load is not too extreme, just use a traditional RDBMS or a NoSQL database of your choice.

Why is it hard to query Apache Pulsar? The main reason is that Apache Pulsar is a distributed pub-sub messaging system in which data is treated as unbounded streams, which makes it hard to run traditional SQL queries in a performant way. The solution in this case is a stream processing engine (Pulsar Functions, Apache Flink, Apache Spark), where data can be selected, transformed and written somewhere.

If you still need to run queries against some data stored in Pulsar, it is possible to forward that data to an RDBMS or a NoSQL database using the built-in sink connectors.
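
As a rough illustration of that sink pattern, the sketch below upserts keyed messages into a database and then answers point lookups from it. This is not Pulsar connector code: the `messages` list stands in for a topic, SQLite stands in for the target database, and the names `materialize` and `lookup` are hypothetical.

```python
import sqlite3


def materialize(conn, messages):
    """Upsert each (key, value) pair so the table keeps the latest value per key."""
    conn.execute("CREATE TABLE IF NOT EXISTS latest (k TEXT PRIMARY KEY, v TEXT)")
    for key, value in messages:
        conn.execute(
            "INSERT OR REPLACE INTO latest (k, v) VALUES (?, ?)", (key, value)
        )
    conn.commit()


def lookup(conn, key):
    """Return the latest value stored for key, or None if the key is unknown."""
    row = conn.execute("SELECT v FROM latest WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


# Assumption: a plain list stands in for messages read from a topic, in publish order.
messages = [("user-1", "Paris"), ("user-2", "Berlin"), ("user-1", "Lyon")]

conn = sqlite3.connect(":memory:")
materialize(conn, messages)
print(lookup(conn, "user-1"))
```

Once the data sits in a store that is indexed by key, another application can do interactive lookups against that store instead of scanning the stream.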

And for analytics, it can be enough to use pulsar-sql communicating directly with the storage layer (Bookies).

2 Comments

Hi @Sergii Zhevzhyk, I have edited my question with more details, but I do mean accessing the data exactly like pulsar-sql does. I do know that Apache Pulsar is not an RDBMS. Pulsar committers seem to say that Apache Pulsar is a perfect fit as a state backend for Apache Flink, and I cannot imagine it would use pulsar-sql for that.
Apache BookKeeper is an essential part of Apache Pulsar and it can be accessed directly, but first you need to work out how to make sense of the data you receive by including metadata from ZooKeeper. The process becomes too complicated. AFAIK the State Store shown in the picture keeps the state of Flink functions and operators, but nothing else. Again, it doesn't make sense to query this State Store directly. You can always use Flink to query/process your stream and publish the data to a topic or somewhere else.
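
To make the metadata point concrete, here is a toy model of the lookup an external reader would have to do on its own: check per-ledger metadata to see whether a ledger is still on the bookies or has been offloaded, then read from the right place. Everything in it is hypothetical (the `LedgerInfo` class, the dictionaries standing in for the metadata store, the bookies and the object store, and the `read_entry` function); it does not use the BookKeeper or Pulsar client APIs.

```python
from dataclasses import dataclass


@dataclass
class LedgerInfo:
    ledger_id: int
    offloaded: bool  # True once the ledger has been moved to tiered storage


# Hypothetical stand-ins for the metadata store, bookie storage and object storage.
metadata = {1: LedgerInfo(1, offloaded=True), 2: LedgerInfo(2, offloaded=False)}
bookies = {(2, 0): b"recent-entry"}
object_store = {(1, 0): b"old-entry"}


def read_entry(ledger_id, entry_id):
    """Pick the storage tier from the metadata, then fetch the entry from it."""
    info = metadata[ledger_id]
    source = object_store if info.offloaded else bookies
    return source[(ledger_id, entry_id)]


print(read_entry(1, 0))
print(read_entry(2, 0))
```

The caller never says where the data lives; the metadata decides. Doing this by hand against a real cluster is the part that becomes complicated.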
