2

I have data in a table that look like this:

src_id, dst_id, params
int   , int   , array

I'm querying the data to extract some values from the array with the following Django query

dataset = query_set.values_list('src_id', 'dst_id', *[e.field for e in settings])

I need to output the data like this:

[
[1,2,[a,b,c,d]],
[3,4,[a,c,d,e]],
...
]

but the values_list returns a list of tuples like (1,2,a,b,c,d)(3,4,a,c,d,e)...

So I've been doing the following to recombine the tuple values into an array:

[[d[0], d[1], d[2:]] for d in dataset]

It works fine, but I have 1M+ rows, and it is slow. It is also using a lot of memory. Ideally I want to format the data properly in the SQL (through Django) and push the output directly to the JSON serializer without having to generate an intermediate array.

Is there a way to optimize this code?

I looked at using a loop, or lambda but that doesn't make much of a difference. I looked at using array but it only takes primitive types, so no much luck with array of arrays.

I am looking for a way to query the DB and output the data directly in the right format, if possible in Django:

I see Django can do slice array lookup like

dataset = query_set.values_list('src_id', 'dst_id', 'data__3_10')

to get the values between indices 3 and 10 for example, but I need to retrieve specific indices (like 1,4,5,6,8,11 for example), not slices.

Any hint on how to format the output of a query like this? is it possible, or just too esoteric for Django? Do I need raw SQL for this?

Thanks

2
  • The formatting is unclear that strings in the array should be really without quotes and that no string can contain a comma or "]". If I know, inidividual elements of array (params) can be used as a lookup in .filter(data__3=...), but can not be referenced in values_list. How you achieved it? Commented Oct 16, 2017 at 1:33
  • The question was first unclear. Now I understand from your answer that [1,2,[a,b,c,d]], is not at all a formated output, but a symbolic representation of a nested object, where "a,b,c,d" are values of any identical type. Everything is finally formated by JSON style, that is not a part of any example, because the question is general. (fixed now a typo above) Commented Oct 25, 2017 at 17:42

2 Answers 2

1

If the database is PostgreSQL or Oracle then memory requirements can be substantially decreased in Django 1.11 by iterator() method applied to any queryset, including values() or values_list() queryset. Then a server-side cursor is used by default without doing any caching in Django or in the database driver. Read details in the docs.

Demo

qs = query_set.values_list('src_id', 'dst_id', 'data')
for row in qs.iterator():
    print(row) 

Output

(3, 4, ['a', 'c', 'd', 'e'])
...

Complete example

qs = query_set.values_list('src_id', 'dst_id', 'data')
with open(..., 'w') as f:
    f.write('[\n')
    for src_id, dst_id, data in qs.iterator():
        str_sub_array = ','.join(data[x] for x in column_settings)
        f.write('[{},{},[{}]],\n'.format(src_id, dst_id, str_sub_array)
    f.write(']\n')

Memory requirements are negligible.

Sign up to request clarification or add additional context in comments.

6 Comments

thanks for the info. that may help although i'm not on 1.11 yet. I can see how iterator would reduce memory req, however the question was more about formatting the data as an array out of the queryset, while picking only necessary values out of the original array. I haven't found a way to do that in Django, only through writing a raw SQL query. Once the data is formatted right, all i need is return it, but without it I do need to iterate over each row, which is slow.
@MrE: Updated answer. Formating is trivial if requirements are clear, memory requirements without named cursor supported by Django 1.11 are a hard problem and raw sql probably can't help much. I think that memory and a fast loop is the necessary basis.
my goal was to eliminate the formatting loop altogether, by having the SQL return the right format. My current loop takes up to 2min, while the query takes 10sec. The iterator might save memory for a loop, but in the end I need to return the whole dataset, so if the SQL returns it all in one shot, I'll save the memory required to build the output. There doesn't seem to be a Django way for this, so I'll go with raw SQL.
Thanks for the tip on iterator(). It did help in the end reducing memory footprint, thanks to the json serializer accepting an iterator as input. The main part of the question was about formatting things as an array from SQL in Django though, so I don't feel this should be the accepted answer.
I expected by the title "Format Django QuerySet" that you want to improve it not to avoid it. You did not paste any source about formatting that is slow, to see that it can not be easily improved by recommended best practices. You did not write a ratio: number_of_used_array_field_values / length_of_array_field. One half is absolutely negligible while one hundredths is really very problematic.
|
1

I was able to format the output directly in Django without resorting to raw SQL, using the extra() feature:

dataset = queryset.objects\
               .extra(select={'data': 'ARRAY[{}]'.format(",".join([fields])})\
               .values_list('src_id','dst_id','data')
return dataset.iterator()

This allows to select the specific indices in the SQL array as I needed.

Returning the iterator (as suggested by @hynekcer), as opposed to the result array itself, actually works better for memory.

Having avoided the loop to reformat the data, the iterator can be sent directly to the JSON serializer which accepts an iterator as an input.

3 Comments

A normal iterator is not serializable directly by json or simplejson. There are some similar solutions by default or cls parameters or by subclassing JSONEncoder that convert the iterator to list internally. Better solutions for huge data are those that cross (subclass) a type list with iterator type on order to combine low memory requirements and direct serializability. One solution is my SerializableGenerator(iterator). If you serialize to a stream by json.dump not to a string then also output memory requirements are low.
A simple "Query expression function" can be written that doesn't require ".extra()", useful if the query uses more tables and a syntax "relation_field__other_field" is can be compiled by the backend without hard coding table names. Your solution is probably the easiest for a simple case. (Excuse that my answer was not complete. Your question was broad initially and I waited for a feedback after the first step.)
ujson can take an iterator as an input and works fine. That's what I use. it's also one of the faster serializers out there.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.