Format Django QuerySet to output values as an array

Question

I have data in a table that look like this:

src_id, dst_id, params
int   , int   , array

I'm querying the data to extract some values from the array with the following Django query

dataset = query_set.values_list('src_id', 'dst_id', *[e.field for e in settings])

I need to output the data like this:

[
[1,2,[a,b,c,d]],
[3,4,[a,c,d,e]],
...
]

but the values_list returns a list of tuples like (1,2,a,b,c,d)(3,4,a,c,d,e)...

So I've been doing the following to recombine the tuple values into an array:

[[d[0], d[1], d[2:]] for d in dataset]

It works fine, but I have 1M+ rows, and it is slow. It is also using a lot of memory. Ideally I want to format the data properly in the SQL (through Django) and push the output directly to the JSON serializer without having to generate an intermediate array.

Is there a way to optimize this code?

I looked at using a loop, or lambda but that doesn't make much of a difference. I looked at using array but it only takes primitive types, so no much luck with array of arrays.

I am looking for a way to query the DB and output the data directly in the right format, if possible in Django:

I see Django can do slice array lookup like

dataset = query_set.values_list('src_id', 'dst_id', 'data__3_10')

to get the values between indices 3 and 10 for example, but I need to retrieve specific indices (like 1,4,5,6,8,11 for example), not slices.

Any hint on how to format the output of a query like this? is it possible, or just too esoteric for Django? Do I need raw SQL for this?

Thanks

The formatting is unclear that strings in the array should be really without quotes and that no string can contain a comma or "]". If I know, inidividual elements of array (params) can be used as a lookup in .filter(data__3=...), but can not be referenced in values_list. How you achieved it? — hynekcer
– hynekcer, Commented Oct 16, 2017 at 1:33
The question was first unclear. Now I understand from your answer that [1,2,[a,b,c,d]], is not at all a formated output, but a symbolic representation of a nested object, where "a,b,c,d" are values of any identical type. Everything is finally formated by JSON style, that is not a part of any example, because the question is general. (fixed now a typo above) — hynekcer
– hynekcer, Commented Oct 25, 2017 at 17:42

hynekcer · Accepted Answer · 2017-10-16 01:16:54Z

1

If the database is PostgreSQL or Oracle then memory requirements can be substantially decreased in Django 1.11 by iterator() method applied to any queryset, including values() or values_list() queryset. Then a server-side cursor is used by default without doing any caching in Django or in the database driver. Read details in the docs.

Demo

qs = query_set.values_list('src_id', 'dst_id', 'data')
for row in qs.iterator():
    print(row)

Output

(3, 4, ['a', 'c', 'd', 'e'])
...

Complete example

qs = query_set.values_list('src_id', 'dst_id', 'data')
with open(..., 'w') as f:
    f.write('[\n')
    for src_id, dst_id, data in qs.iterator():
        str_sub_array = ','.join(data[x] for x in column_settings)
        f.write('[{},{},[{}]],\n'.format(src_id, dst_id, str_sub_array)
    f.write(']\n')

Memory requirements are negligible.

edited Oct 16, 2017 at 1:16

answered Oct 15, 2017 at 21:36

hynekcer

15.7k6 gold badges66 silver badges102 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

MrE Over a year ago

thanks for the info. that may help although i'm not on 1.11 yet. I can see how iterator would reduce memory req, however the question was more about formatting the data as an array out of the queryset, while picking only necessary values out of the original array. I haven't found a way to do that in Django, only through writing a raw SQL query. Once the data is formatted right, all i need is return it, but without it I do need to iterate over each row, which is slow.

hynekcer Over a year ago

@MrE: Updated answer. Formating is trivial if requirements are clear, memory requirements without named cursor supported by Django 1.11 are a hard problem and raw sql probably can't help much. I think that memory and a fast loop is the necessary basis.

MrE Over a year ago

my goal was to eliminate the formatting loop altogether, by having the SQL return the right format. My current loop takes up to 2min, while the query takes 10sec. The iterator might save memory for a loop, but in the end I need to return the whole dataset, so if the SQL returns it all in one shot, I'll save the memory required to build the output. There doesn't seem to be a Django way for this, so I'll go with raw SQL.

MrE Over a year ago

Thanks for the tip on iterator(). It did help in the end reducing memory footprint, thanks to the json serializer accepting an iterator as input. The main part of the question was about formatting things as an array from SQL in Django though, so I don't feel this should be the accepted answer.

hynekcer Over a year ago

I expected by the title "Format Django QuerySet" that you want to improve it not to avoid it. You did not paste any source about formatting that is slow, to see that it can not be easily improved by recommended best practices. You did not write a ratio: number_of_used_array_field_values / length_of_array_field. One half is absolutely negligible while one hundredths is really very problematic.

|

MrE · Accepted Answer · 2017-10-19 15:51:54Z

1

I was able to format the output directly in Django without resorting to raw SQL, using the extra() feature:

dataset = queryset.objects\
               .extra(select={'data': 'ARRAY[{}]'.format(",".join([fields])})\
               .values_list('src_id','dst_id','data')
return dataset.iterator()

This allows to select the specific indices in the SQL array as I needed.

Returning the iterator (as suggested by @hynekcer), as opposed to the result array itself, actually works better for memory.

Having avoided the loop to reformat the data, the iterator can be sent directly to the JSON serializer which accepts an iterator as an input.

edited Oct 19, 2017 at 15:51

answered Oct 17, 2017 at 17:37

MrE

21.2k15 gold badges92 silver badges113 bronze badges

3 Comments

hynekcer Over a year ago

A normal iterator is not serializable directly by json or simplejson. There are some similar solutions by default or cls parameters or by subclassing JSONEncoder that convert the iterator to list internally. Better solutions for huge data are those that cross (subclass) a type list with iterator type on order to combine low memory requirements and direct serializability. One solution is my SerializableGenerator(iterator). If you serialize to a stream by json.dump not to a string then also output memory requirements are low.

hynekcer Over a year ago

A simple "Query expression function" can be written that doesn't require ".extra()", useful if the query uses more tables and a syntax "relation_field__other_field" is can be compiled by the backend without hard coding table names. Your solution is probably the easiest for a simple case. (Excuse that my answer was not complete. Your question was broad initially and I waited for a feedback after the first step.)

MrE Over a year ago

ujson can take an iterator as an input and works fine. That's what I use. it's also one of the faster serializers out there.

Collectives™ on Stack Overflow

Format Django QuerySet to output values as an array

2 Answers 2

6 Comments

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related