I'm looking for a way to convert dates given in the format YYYYmmdd to an np.array with dtype='datetime64'. The dates are stored in another np.array but with dtype='float64'.

I am looking for a way to achieve this by avoiding Pandas!

I already tried something similar to what is suggested in this answer, but the author states that "[...] if (the date format) was in ISO 8601 you could parse it directly using numpy, [...]".

As the date format in my case is YYYYmmdd, which IS(?) ISO 8601, it should somehow be possible to parse it directly using numpy. But I don't know how, as I am a total beginner in Python and in coding in general.

I really want to avoid Pandas because I don't want to bloat my script when the task can be done with the modules I am already using. I also read that it would decrease the speed here.

2 Answers

3

If no one else comes up with something more built-in, here is a pedestrian method:

>>> dates
array([19700101., 19700102., 19700103., 19700104., 19700105., 19700106.,
       19700107., 19700108., 19700109., 19700110., 19700111., 19700112.,
       19700113., 19700114.])
>>> y, m, d = dates.astype(int) // np.c_[[10000, 100, 1]] % np.c_[[10000, 100, 100]]
>>> y.astype('U4').astype('M8') + (m-1).astype('m8[M]') + (d-1).astype('m8[D]')
array(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04',
       '1970-01-05', '1970-01-06', '1970-01-07', '1970-01-08',
       '1970-01-09', '1970-01-10', '1970-01-11', '1970-01-12',
       '1970-01-13', '1970-01-14'], dtype='datetime64[D]')
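The same idea can be written out step by step, which may be easier to follow for a beginner. This is only a sketch of the technique above, not a replacement for it: the sample values and the variable names (ints, year, month, day, result) are made up for illustration, and it uses plain integer arithmetic instead of np.c_ broadcasting and the string round trip.

```python
import numpy as np

# Assumed sample input: YYYYmmdd dates stored as float64, as in the question.
dates = np.array([19700101., 19840404., 18930201.])

# Work on integers so that // and % behave exactly.
ints = dates.astype(np.int64)

year = ints // 10000        # drop the last four digits
month = ints // 100 % 100   # drop the last two digits, keep the next two
day = ints % 100            # keep the last two digits

# datetime64[Y] counts years since 1970, so shift the year accordingly,
# then add the month and day offsets as timedelta64 values.
result = (
    (year - 1970).astype("datetime64[Y]")
    + (month - 1).astype("timedelta64[M]")
    + (day - 1).astype("timedelta64[D]")
)
print(result)
```

The additions promote the unit from years to months to days, so result ends up with dtype datetime64[D].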
5 Comments

Thank you. Could you please explain those last two lines as I am not familiar with anything following dates.astype(int)?
@zorrolo np.c_[] can be used to create column vectors; due to broadcasting, the result of the floor division // is then a full table of every pair that can be formed between dates and 10000, 100, 1. Thus we get three copies of dates: one with the last 4 digits removed, one with the last 2 digits removed, and one unchanged. % is modulo; here it removes from the left all but 4 digits (which is a no-op at this point), and twice all but 2 digits. As a result, the variables y, m, d hold the year, month and day separately.
... Next we convert the year first to unicode, then to datetime64, and add the month and day, both converted to timedelta64.
Great! Is there also a modification of this procedure to find every date in, for example, March, ignoring years and days? I was looking for a way to filter that array of dates by month but couldn't find any built-in np.datetime64 function to do so.
@zorrolo This seems to work: a[(a.astype('M8[M]') - a.astype('M8[Y]')).view(int) == 2]
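To make the month filter from the last comment easier to read, here is a small sketch that splits it into named steps. The array a and its sample values are assumptions for illustration; the logic is the same subtraction of the year-truncated date from the month-truncated date.

```python
import numpy as np

# Assumed sample data; any datetime64[D] array works the same way.
a = np.array(
    ["1970-03-15", "1984-04-04", "1893-03-01", "2001-12-31"],
    dtype="datetime64[D]",
)

# Truncate each date to its month and to its year; the difference is a
# timedelta64[M] holding the zero-based month index (January == 0).
month_index = (a.astype("datetime64[M]") - a.astype("datetime64[Y]")).astype(np.int64)

# March is month index 2.
march = a[month_index == 2]
print(march)
```

Here .astype(np.int64) is used instead of .view(int) so the integer width does not depend on the platform.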
0

You can go via the Python datetime module.

from datetime import datetime
import numpy as np

datestrings = np.array(["18930201", "19840404"])
dtarray = np.array([datetime.strptime(d, "%Y%m%d") for d in datestrings], dtype="datetime64[D]")
print(dtarray)

# out: ['1893-02-01' '1984-04-04'] datetime64[D]
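The question stores the dates as float64 rather than as strings, so one extra step is needed before parsing. A minimal sketch, assuming every value is a valid 8-digit YYYYmmdd number: format each float as a string with hyphens, so that NumPy can parse it as an ISO 8601 date. The names dates and iso_strings are illustrative only.

```python
import numpy as np

# Assumed sample input: YYYYmmdd dates stored as float64.
dates = np.array([18930201., 19840404.])

iso_strings = []
for value in dates:
    s = f"{int(value):08d}"                          # e.g. "18930201"
    iso_strings.append(f"{s[:4]}-{s[4:6]}-{s[6:]}")  # e.g. "1893-02-01"

# NumPy parses the hyphenated ISO 8601 strings directly.
dtarray = np.array(iso_strings, dtype="datetime64[D]")
print(dtarray)
```

Like the strptime version, this loops in Python, so for very large arrays a purely arithmetic approach may be faster.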

Since the real question seems to be how to get the given strings into the matplotlib datetime format, you can use matplotlib.dates.datestr2num:

import numpy as np
from matplotlib import dates as mdates

datestrings = np.array(["18930201", "19840404"])
mpldates = mdates.datestr2num(datestrings)
print(mpldates)

# out: [691071. 724370.]

3 Comments

I have to explain that I am working with many dates, around 40k, because of daily measurements over approx. 125 years. As far as I understand Python, it is faster to work with NumPy arrays in the case I described. I had concerns about storing the dates as anything other than datetime64. A few days ago I tried something with datestr2num(), but the resulting plot didn't display the dates on the x-axis in a convenient format. So I switched back to what worked for me, as I am running out of time for this project.
Indeed, if you use mpldates from above, you would need to set the locator and formatter on the axis yourself. Concerning speed, it is a bit of a paradox that you want to save a few milliseconds by not using pandas while trying to plot 40k data points on screen. Graphical output is always the bottleneck. Also consider that the effort spent on trying to avoid pandas could have gone into learning pandas. E.g. you could use pandas to subsample the 40k data points into a smaller dataset, which is much faster to plot.
Oh the plot is not the main function of the program! Many calculations are done in advance where numpy and some pieces of scipy are absolutely satisfying. Therefore it would be nonsense to rewrite the whole script to make it work with pandas. But thank you for the additional information.
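For reference, setting the locator and formatter by hand, as mentioned in the comments above, could look roughly like the following. This is a sketch under assumptions: the values array is placeholder data, and the choice of AutoDateLocator and the "%Y-%m-%d" format is just one reasonable option, not something prescribed by the answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import dates as mdates

datestrings = np.array(["18930201", "19840404"])
values = np.array([1.0, 2.0])  # placeholder measurements (assumed)

# Convert the strings to matplotlib's floating-point date numbers.
mpldates = mdates.datestr2num(datestrings)

fig, ax = plt.subplots()
ax.plot(mpldates, values)

# Tell the x-axis to place ticks at sensible dates and to label them as dates.
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()  # rotate the tick labels so they do not overlap
```

In an interactive session you would drop the matplotlib.use("Agg") line and call plt.show() at the end.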
