
We have a simple quickfix app that translates FIX messages into our own internal format and vice versa. It consists of two threads, one of which is the FIX::Application and FIX::MessageCracker, and the other thread handles the input from the internal network side.

The internal handler thread also holds a separate MYSQL connection to the same MariaDB-Server instance as the quickfix uses for the MySQLStore. If the DB is under too much stress, my expectation is that each individual execution would take progressively longer but should still finish.

Now, when we run a load test and flood the FIX app with NewOrderSingle messages, at some point one of the two threads, sometimes both, will just hang in the mysql_real_query() method (see stack trace below).

There is no error code or signal thrown.

The test sends as many NewOrderSingle FIX messages as possible without waiting for a response. The FIX app tries to save all messages in its DB store. When the internal process is done, it sends an update back to this process, which translates it back into a FIX response message.

Maybe somebody has encountered this before or can give us some pointers. I can't imagine we are the first to try this, but I couldn't find anything related by googling.

Here is a typical stack trace. As you can see the FIX app attempts to insert a new message into the DB. The other threads aren't doing anything. I redacted some sensitive data from it.

(gdb) info threads
  Id   Target Id         Frame 
  8    Thread 0x7f9689a8e700 (LWP 27627) "tgfix" 0x00007f968b7e09a3 in select () from /lib64/libc.so.6
  7    Thread 0x7f968928d700 (LWP 27630) "tgfix" 0x00007f968b7b085d in nanosleep () from /lib64/libc.so.6
  6    Thread 0x7f9688879700 (LWP 27664) "tgfix" 0x00007f968b7e09a3 in select () from /lib64/libc.so.6
  5    Thread 0x7f967bfff700 (LWP 27665) "tgfix" 0x00007f968d0da75d in read () from /lib64/libpthread.so.0
  4    Thread 0x7f967a5f5700 (LWP 10882) "tgfix" 0x00007f968b7b085d in nanosleep () from /lib64/libc.so.6
  3    Thread 0x7f967b7fe700 (LWP 10883) "tgfix" 0x00007f968b7b085d in nanosleep () from /lib64/libc.so.6
  2    Thread 0x7f967affd700 (LWP 10884) "tgfix" 0x00007f968d0d9b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
* 1    Thread 0x7f968e82f880 (LWP 27626) "tgfix" 0x00007f968d0d9b3b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
(gdb) t 5
[Switching to thread 5 (Thread 0x7f967bfff700 (LWP 27665))]
#0  0x00007f968d0da75d in read () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007f968d0da75d in read () from /lib64/libpthread.so.0
#1  0x00007f968e176d20 in vio_read () from /usr/lib64/mysql/libmysqlclient.so.18
#2  0x00007f968e176da1 in vio_read_buff () from /usr/lib64/mysql/libmysqlclient.so.18
#3  0x00007f968e15af3a in my_real_read(st_net*, unsigned long*) () from /usr/lib64/mysql/libmysqlclient.so.18
#4  0x00007f968e15bdac in my_net_read () from /usr/lib64/mysql/libmysqlclient.so.18
#5  0x00007f968e14e84c in cli_safe_read () from /usr/lib64/mysql/libmysqlclient.so.18
#6  0x00007f968e14fe1b in cli_read_query_result () from /usr/lib64/mysql/libmysqlclient.so.18
#7  0x00007f968e151056 in mysql_real_query () from /usr/lib64/mysql/libmysqlclient.so.18
#8  0x00000000004267d6 in MySql::Database::execute (this=this@entry=0x7fffd9eaacf8, 
    sql="insert into tg_messages_in (beginstring,sendercompid,targetcompid,session_qualifier,msgseqnum,clordid,message) values ('FIX.4.2', '<TARGET>', 'CUSTOMER', '', 1917, 'fix_1757507653635360943_1916', '8="...) at mysql.cpp:155
#9  0x000000000041f8f3 in FixApplication::persistLocally (this=this@entry=0x7fffd9eaace0, message=..., sessionid=...) at fixapplication.cpp:96
#10 0x00000000004209dc in FixApplication::fromApp (this=0x7fffd9eaace0, message=..., sessionid=...) at fixapplication.cpp:57
#11 0x000000000046659f in FIX::Session::verify (this=this@entry=0x1b49d10, msg=..., checkTooHigh=checkTooHigh@entry=true, checkTooLow=checkTooLow@entry=true) at Session.cpp:1159
#12 0x000000000046eaab in FIX::Session::next (this=this@entry=0x1b49d10, message=..., timeStamp=..., queued=queued@entry=false) at Session.cpp:1421
#13 0x000000000046fe8c in FIX::Session::next (this=0x1b49d10, 
    msg="8=FIX.4.2\001\071=187\001\063\065=D\001\063\064=1917\001\064\071=CUSTOMER\001\065\062=20250910-12:34:13.635\001\065\066=<TARGET>\001\061\061=fix_1757507653635360943_1916\001\061\070=G\001\062\061=1\001\063\070=1\001\064\060=D\001\064\064=999\001\064\070=DE0005439004\001\065\064=1\001\065\065=idontknow\001\066\060=20250910-12:34:13\001\061\060\060=TGO"..., timeStamp=..., queued=queued@entry=false)
    at Session.cpp:1339
#14 0x0000000000480bd2 in FIX::SocketConnection::readMessages (this=this@entry=0x7f9674000be0, s=...) at SocketConnection.cpp:224
#15 0x0000000000480dbc in FIX::SocketConnection::read (this=0x7f9674000be0, a=..., s=...) at SocketConnection.cpp:170
#16 0x000000000047d2f1 in FIX::SocketAcceptor::onData (this=0x7fffd9eaafc0, server=..., s=6) at SocketAcceptor.cpp:196
#17 0x00000000004e41d0 in FIX::ServerWrapper::onEvent (this=0x7f967bffed20, monitor=..., socket=6) at SocketServer.cpp:60
#18 0x000000000047f4dc in FIX::SocketMonitor::processReadSet (this=this@entry=0x1baebf0, strategy=..., readSet=...) at SocketMonitor.cpp:260
#19 0x000000000047fa57 in FIX::SocketMonitor::block (this=this@entry=0x1baebf0, strategy=..., poll=poll@entry=false, timeout=timeout@entry=0) at SocketMonitor.cpp:219
#20 0x00000000004e3375 in FIX::SocketServer::block (this=0x1baeb90, strategy=..., poll=poll@entry=false, timeout=timeout@entry=0) at SocketServer.cpp:160
#21 0x000000000047e657 in FIX::SocketAcceptor::onStart (this=0x7fffd9eaafc0) at SocketAcceptor.cpp:113
#22 0x00000000004782fa in FIX::Acceptor::startThread (p=<optimized out>) at Acceptor.cpp:245
#23 0x00007f968d0d3ea5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f968b7e98dd in clone () from /lib64/libc.so.6

The innodb monitoring output is as follows:

| InnoDB |      | 
=====================================
250915 10:10:16 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 12 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 76304 1_second, 76304 sleeps, 7454 10_second, 2248 background, 2248 flush
srv_master_thread log flush and writes: 101400
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 14099, signal count 28376
Mutex spin waits 491840, rounds 658978, OS waits 301
RW-shared spins 23695, rounds 208656, OS waits 1749
RW-excl spins 7403, rounds 429165, OS waits 11905
Spin rounds per wait: 1.34 mutex, 8.81 RW-shared, 57.97 RW-excl
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
157294 OS file reads, 16750514 OS file writes, 15266644 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 276671, node heap has 1 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 7391011889
Log flushed up to   7391011889
Last checkpoint at  7391009579
Max checkpoint age    7782360
Checkpoint age target 7539162
Modified age          2310
Checkpoint age        2310
0 pending log writes, 0 pending chkp writes
15165146 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 137756672; in additional pool allocated 0
Total memory allocated by read views 488
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 2233968     (2213368 + 20600)
    Page hash           139112 (buffer pool 0 only)
    Dictionary cache    639238  (554768 + 84470)
    File system         83536   (82672 + 864)
    Lock system         334752  (332872 + 1880)
    Recovery system     0   (0 + 0)
Dictionary memory allocated 84470
Buffer pool size        8191
Buffer pool size, bytes 134201344
Free buffers            1
Database pages          8189
Old database pages      3002
Modified db pages       60
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 165713, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 157282, created 101129, written 1523878
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 8189, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
0 transactions active inside InnoDB
0 out of 1000 descriptors used
---OLDEST VIEW---
Normal read view
Read view low limit trx n:o 1981E82
Read view up limit trx id 1981E82
Read view low limit trx id 1981E82
Read view individually stored trx ids:
-----------------
Main thread process no. 4985, id 140232077883136, state: flushing log
Number of rows inserted 6202720, updated 8496405, deleted 6104373, read 1884859261
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
------------
TRANSACTIONS
------------
Trx id counter 1981E82
Purge done for trx's n:o < 1981E82 undo n:o < 0
History list length 2241
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 78737, OS thread handle 0x7f8a50df1700, query id 19035242 localhost root
show engine innodb status
---TRANSACTION 1981E80, not started
MySQL thread id 78663, OS thread handle 0x7f8a55755700, query id 19034163 localhost tgfix
---TRANSACTION 1981E7D, not started
MySQL thread id 78662, OS thread handle 0x7f8a5570b700, query id 19034161 localhost tgfix
---TRANSACTION 1980EE3, not started
MySQL thread id 78639, OS thread handle 0x7f8a55755700, query id 19029811 localhost 127.0.0.1 tgfixserver
---TRANSACTION 1980ED9, not started
MySQL thread id 78541, OS thread handle 0x7f8a50df1700, query id 18991418 localhost 127.0.0.1 konfiguration
----------------------------
END OF INNODB MONITOR OUTPUT
============================

My inexperienced eye doesn't see any major red flags here, except for the line "1 read views open inside InnoDB" in the ROW OPERATIONS section, which I can't make sense of. The server version is 5.5.65-MariaDB, so a lot of the performance schema is not yet available.

The information_schema doesn't show any open transactions, locks or waits.

The process list shows all in sleep.

PS: I'm aware that it is probably a bug in my code. I don't assume to have found a bug in libmysql or libquickfix. Yet, I was wondering if the mighty power of the internet might shorten my efforts to find the bug. The comments already gave me some new ideas where to look, especially the one from @Jesper Juhl.

However, I did find a bug report with a similar complaint, which stated that in certain circumstances the server would not answer and thus the client-side read could hang forever. We are investigating this further, and we might just be able to set a timeout and accept that the server won't answer once in a while.
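The hang in the stack trace is a blocking read() on the client socket, so a read timeout turns "wait forever" into "fail after a bounded time". In libmysqlclient the usual knob for this is the MYSQL_OPT_READ_TIMEOUT option passed to mysql_options() before connecting; whether it covers the case from that bug report on this client and server version still needs to be verified. The sketch below only illustrates the underlying mechanism with plain POSIX calls. The names readWithTimeout, ReadStatus and ReadResult are hypothetical and not part of libmysqlclient or QuickFIX.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

// Outcome of a bounded read: data arrived, peer closed, timed out, or error.
enum class ReadStatus { Data, Closed, TimedOut, Error };

struct ReadResult {
    ReadStatus status;
    ssize_t bytes;  // bytes read when status == Data, otherwise 0
};

// Hypothetical helper (illustration only): wait at most `timeout` for `fd`
// to become readable, then read once. A plain read() on a blocking socket
// would instead wait until the peer answers, which may be never.
ReadResult readWithTimeout(int fd, void* buf, std::size_t len,
                           std::chrono::milliseconds timeout) {
    assert(buf != nullptr && len > 0);

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    int ready;
    do {
        // Retrying after a signal restarts the full timeout; that is
        // acceptable for a sketch but lengthens the worst-case wait.
        ready = ::poll(&pfd, 1, static_cast<int>(timeout.count()));
    } while (ready < 0 && errno == EINTR);

    if (ready < 0) return {ReadStatus::Error, 0};
    if (ready == 0) return {ReadStatus::TimedOut, 0};

    const ssize_t n = ::read(fd, buf, len);
    if (n < 0) return {ReadStatus::Error, 0};
    if (n == 0) return {ReadStatus::Closed, 0};
    return {ReadStatus::Data, n};
}
```

After a timeout the state of the connection is unknown (the reply may still arrive later), so the caller should treat the connection as lost and reopen it rather than reuse it.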

PPS: We're using libmysql to access a MariaDB server. If that is not the right combination, I would be happy to hear about a different option.

  • Probably a bug in your code. What is a debugger and how can it help me diagnose problems? Commented Sep 10 at 13:45
  • You should mention if latency matters too... and if you're adding messages very fast or if you do batches (the latter can help reduce issues, but might add latency). But it sounds like you're just putting too much load on the DB Commented Sep 10 at 15:31
  • Have you tried thread-sanitizer yet? What about address-sanitizer and undefined-behaviour-sanitizer? (aka tsan, asan & ubsan). Commented Sep 10 at 16:08
  • 1) mariadb and mysql are two different database products. There is a degree of compatibility between the two, but they are not the same! 2) statement execution may have to wait for something else to complete before proceeding. Check out various timeout settings. Commented Sep 10 at 17:49
  • Which MariaDB server version? Which C/C version? Can you show processlist on the server at the time of the hang? And also show engine innodb status. If the processes are there can you include the show create table info for the queries involved and the query and analyze format=json {query}. If there isn't a server process, like @JesperJuhl gdb attach and thread apply all bt full. Commented Sep 10 at 23:09

1 Answer


Maybe the Quickest worst way to solve it

The executed query is too slow. Get a bigger machine with more CPU/RAM/network/resources. Measure it. Get it down to as small as possible. Ensure it has no downstream procedures, triggers, etc. That's an obvious culprit even if it isn't at this current moment. If it's long, break it up into smaller-cost queries and execute them, getting intermediate results...multi-threaded debugging best practices...

Questions

Honestly, it's impossible to know without any debugging info from the database. I'm sure there are docs online somewhere about that. There's really not enough info here. What's your best thought looking at the MySQL docs?

Does this process run on multiple machines? Can you list things that you can rule out like out of memory, cpu maxed out, locked tables or locked rows?

Best Answer

I've come to learn from experience that the answer to "why does my code deadlock" is almost always "that's the way it was written". In the exceedingly rare off chance that there's a library issue, good luck getting that fixed if you're the only person with the problem. It just won't get prioritized. Unless you submit the fix!
