The insight is that there are only 1000 possible values in the input, and the length of the file will likely be no more than 1000000 * 5, or 5 MB, which is not that large for modern computers.
Rather than being really clever and save memory, I have taken the easy way out. I have coded the contest functionality as a single tcl proc using pure tcl (no extensions).
Open the data file
Read the entire data file into a variable that will be treated as a list
Close the data file
Set the value of variables for the maximum count and number corresponding to the maximum count to -1
Starting with an empty count array, loop through each element of the list read from the input file, for each element:
increment the count array element indexed by the number from the input list, save a copy of the resulting value in a local variable
compare the local variable to the value of the current maximum count variable, if it is greater then update the maximum count variable with the value of the temporary variable and the variable with the corresponding number with the list element value
When the loop completes, return the value of the variable containing the number corresponding to the maximum count
The code for the function itself is:
proc contest filename {
# Open and read the text file containing decimal integers
# in the range 0-999 as a single string. The line terminators
# serve as whitespace between the numbers when the string is treated
# as a tcl list.
set fd [open $filename r]
set list [read -nonewline $fd]
close $fd
# The tcl variable "list" (not to be confused with the tcl command of
# the same name) now contains all of the data for this run.
# Initialize some variables...
set maxcnt -1 ; # A copy of the maximum count seen in the count array
set maxnum -1 ; # The number corresponding to the max count
array set cnt {} ; # Note: This is not really needed
foreach w $list { ; # Run through all numbers in the input string
# Note that incrementing a nonexistent element creates it with val 1
set t [incr cnt($w)]
# compare the count calculated against the maximum we've seen
if {$t > $maxcnt} { ; # new maximum
set maxcnt $t ; # save the maximum
set maxnum $w ; # save the value that corresponds to the maximum
}
} ; # end of loop
# puts stdout "Out of [llength $list] numbers, $maxnum was seen $maxcnt times"
# return the most frequent number ; ties go to the first one seen
return $maxnum
}
# driver code for contest function that prints out a timing
proc test {argv} {
set filename [lindex $argv 0]
if {[file exists $filename]} {
set runtime [time {set m [contest $filename]}]
puts stdout "most frequent number: $m, $runtime"
} else {
puts stderr "Unable to open input file \"$filename\""
}
}
if {($tcl_interactive == 0) && ([llength $argv] > 0)} {
test $argv
}
System I ran it on:
Dell Precision 7820 tower workstation
2 Intel(r) Xeon(r) Silver 4214R CPUs running at 2.4GHz base frequency (total of 24 cores (up to 48 threads) across 2 chips)
96 GiB RAM
Microsoft Windows 11 enterprise
Tcl version 8.6 (BAWT, built with GCC)
From a git bash command line with a locally generated 1,000,000 number data file, the result reported by proc test was 279338 microseconds, and by the "time" command was 1.137s real, and 0 seconds user and sys. I have not had a chance to run it on a Linux system, and I think the user and sys times are not actually measured with git bash on Windows. There is a large discrepancy between the time reported by the code and by the time command in git bash is the AV and other security software that my employer has installed. Removing the data file name from the command line still reports 0.885 seconds for "real"; the difference is 0.282 seconds (282000 microseconds), which is close to the value reported by the printout from the code itself.
I'm certain that this program can be duplicated in python and most likely quite a few other scripting languages. I could also write it in a compiled language, but it would take substantially longer than the 10 minutes it took to write and test this one plus the data generator for the 1,000,000 number file.