Skip to main content
added 465 characters in body
Source Link

From a git bash command line with a locally generatedthe downloaded 1,000,000 number data file, the result reported by proc test was 279338299405 microseconds, and by the "time" command was 1.137s182s real, and 0 seconds user and sys. I have not had a chance to run it on a Linux system, and I think the user and sys times are not actually measured with git bash on Windows. There is a large discrepancy between the time reported by the code and by the time command in git bash is the AV and other security software that my employer has installed. Removing the data file name from the command line still reports 0.885 seconds for "real"; the difference is 0.282297 seconds (282000297000 microseconds), which is close to the value reported by the printout from the code itself.

I'm certain that this program can be duplicated in python and most likely quite a few other scripting languages. I could also write it in a compiled language, but it would take substantially longer than the 10 minutes it took to write and test this one plus the data generator for the 1,000,000 number file that I first tested on.

$ time /c/Tcl8.6/bin/tclsh contest-entry.tcl

real    0m0.885s
user    0m0.000s
sys     0m0.000s

$ time /c/Tcl8.6/bin/tclsh contest-entry.tcl ../Downloads/1M_random_numbers.txt
most frequent number: 142, 299405 microseconds per iteration

real    0m1.182s
user    0m0.000s
sys     0m0.000s

$

[edit: Updated numbers for the run on the downloaded 1M sample file rather than my own locally generated file and to put the final run's output in.]

From a git bash command line with a locally generated 1,000,000 number data file, the result reported by proc test was 279338 microseconds, and by the "time" command was 1.137s real, and 0 seconds user and sys. I have not had a chance to run it on a Linux system, and I think the user and sys times are not actually measured with git bash on Windows. There is a large discrepancy between the time reported by the code and by the time command in git bash is the AV and other security software that my employer has installed. Removing the data file name from the command line still reports 0.885 seconds for "real"; the difference is 0.282 seconds (282000 microseconds), which is close to the value reported by the printout from the code itself.

I'm certain that this program can be duplicated in python and most likely quite a few other scripting languages. I could also write it in a compiled language, but it would take substantially longer than the 10 minutes it took to write and test this one plus the data generator for the 1,000,000 number file.

From a git bash command line with the downloaded 1,000,000 number data file, the result reported by proc test was 299405 microseconds, and by the "time" command was 1.182s real, and 0 seconds user and sys. I have not had a chance to run it on a Linux system, and I think the user and sys times are not actually measured with git bash on Windows. There is a large discrepancy between the time reported by the code and by the time command in git bash is the AV and other security software that my employer has installed. Removing the data file name from the command line still reports 0.885 seconds for "real"; the difference is 0.297 seconds (297000 microseconds), which is close to the value reported by the printout from the code itself.

I'm certain that this program can be duplicated in python and most likely quite a few other scripting languages. I could also write it in a compiled language, but it would take substantially longer than the 10 minutes it took to write and test this one plus the data generator for the 1,000,000 number file that I first tested on.

$ time /c/Tcl8.6/bin/tclsh contest-entry.tcl

real    0m0.885s
user    0m0.000s
sys     0m0.000s

$ time /c/Tcl8.6/bin/tclsh contest-entry.tcl ../Downloads/1M_random_numbers.txt
most frequent number: 142, 299405 microseconds per iteration

real    0m1.182s
user    0m0.000s
sys     0m0.000s

$

[edit: Updated numbers for the run on the downloaded 1M sample file rather than my own locally generated file and to put the final run's output in.]

Source Link

The insight is that there are only 1000 possible values in the input, and the length of the file will likely be no more than 1000000 * 5, or 5 MB, which is not that large for modern computers.

Rather than being really clever and save memory, I have taken the easy way out. I have coded the contest functionality as a single tcl proc using pure tcl (no extensions).

  1. Open the data file

  2. Read the entire data file into a variable that will be treated as a list

  3. Close the data file

  4. Set the value of variables for the maximum count and number corresponding to the maximum count to -1

  5. Starting with an empty count array, loop through each element of the list read from the input file, for each element:

    1. increment the count array element indexed by the number from the input list, save a copy of the resulting value in a local variable

    2. compare the local variable to the value of the current maximum count variable, if it is greater then update the maximum count variable with the value of the temporary variable and the variable with the corresponding number with the list element value

  6. When the loop completes, return the value of the variable containing the number corresponding to the maximum count

The code for the function itself is:

proc contest filename {
   # Open and read the text file containing decimal integers
   # in the range 0-999 as a single string. The line terminators
   # serve as whitespace between the numbers when the string is treated
   # as a tcl list.
   set fd [open $filename r]
   set list [read -nonewline $fd]
   close $fd

   # The tcl variable "list" (not to be confused with the tcl command of
   # the same name) now contains all of the data for this run.
   # Initialize some variables...
   set maxcnt -1 ; # A copy of the maximum count seen in the count array
   set maxnum -1 ; # The number corresponding to the max count
   array set cnt {} ; # Note: This is not really needed
   foreach w $list { ; # Run through all numbers in the input string
      # Note that incrementing a nonexistent element creates it with val 1
      set t [incr cnt($w)]
      # compare the count calculated against the maximum we've seen
      if {$t > $maxcnt} { ; # new maximum
         set maxcnt $t ; # save the maximum
         set maxnum $w ; # save the value that corresponds to the maximum
      }
   } ; # end of loop
   # puts stdout "Out of [llength $list] numbers, $maxnum was seen $maxcnt times"
   # return the most frequent number ; ties go to the first one seen
   return $maxnum
}

# driver code for contest function that prints out a timing
proc test {argv} {
    set filename [lindex $argv 0]
    if {[file exists $filename]} {
       set runtime [time {set m [contest $filename]}]
       puts stdout "most frequent number: $m, $runtime"
    } else {
       puts stderr "Unable to open input file \"$filename\""
    }
}

if {($tcl_interactive == 0) && ([llength $argv] > 0)} {
   test $argv
}

System I ran it on:

  • Dell Precision 7820 tower workstation

  • 2 Intel(r) Xeon(r) Silver 4214R CPUs running at 2.4GHz base frequency (total of 24 cores (up to 48 threads) across 2 chips)

  • 96 GiB RAM

  • Microsoft Windows 11 enterprise

  • Tcl version 8.6 (BAWT, built with GCC)

From a git bash command line with a locally generated 1,000,000 number data file, the result reported by proc test was 279338 microseconds, and by the "time" command was 1.137s real, and 0 seconds user and sys. I have not had a chance to run it on a Linux system, and I think the user and sys times are not actually measured with git bash on Windows. There is a large discrepancy between the time reported by the code and by the time command in git bash is the AV and other security software that my employer has installed. Removing the data file name from the command line still reports 0.885 seconds for "real"; the difference is 0.282 seconds (282000 microseconds), which is close to the value reported by the printout from the code itself.

I'm certain that this program can be duplicated in python and most likely quite a few other scripting languages. I could also write it in a compiled language, but it would take substantially longer than the 10 minutes it took to write and test this one plus the data generator for the 1,000,000 number file.