Finding Occurrences in a String in Assembly x86

Question

I have been working on this program where I have to input a string and then display the character distribution in that string.

For example:
if the input is “minecode” the output should be

C – 1
O – 1
D – 1
E – 2
I – 1
M – 1
N – 1

Here is what I tried to do, but I really don't know how to traverse the loop and check for similar character and then increment the count. The assembler is MASM 615 running on a 32 bit machine.

.686
.MODEL flat, stdcall
.STACK
INCLUDE Irvine32.inc
.DATA
  msg0 BYTE "Enter a string of characters: ",0
  msg1 BYTE "Character Distribution: ",0
  MainArray    dword 10000 dup (?)
  UniqueChar   dword 10000 dup (?)
  CharCount    dword 10000 dup (?)
.CODE
MAIN PROC
    lea edx, msg0
    call WriteString
    call dumpregs        ; 1
    call ReadString
    mov MainArray, eax
    call dumpregs        ; 2
    mov MainArray, ebx
    call dumpregs        ; 3
    call CRLF
    lea edx, msg1
    call WriteString
    call CharDist        ; Calls the procedure
    call dumpregs        ; 5
    exit
MAIN ENDP
CharDist PROC
    mov ecx, lengthof MainArray
    mov esi, OFFSET MainArray
    L1:
; what to do here?? 
Loop L1:
CharDist ENDP
END MAIN

Loop L1: is a syntax error. You only use the trailing : when you define the label, not when you refer to it from elsewhere. (And don't use LOOP in the first place, it's slow and doesn't do anything you can't do just as easily with more common instructions.) — Peter Cordes
– Peter Cordes, Commented Nov 25, 2016 at 4:01

Davislor · Accepted Answer · 2016-11-27 14:59:24Z

5

One possible approach: make an array of 256 counters, store its base address in ebx, and for each byte in the string, increment the counter at that offset from ebx. Afterward, loop over your array of counters and print the nonzero counts.

You never say whether this is a string terminated by a 0 byte (C-style), one preceded by its length (Pascal-style), or one whose length is passed in as a second parameter, but that will determine when you terminate the loop. If you’re looking for a terminating zero, test the byte you just read, and if you’re counting a specific number of bytes, keep the count of bytes remaining in ecx and test that. (There are special instructions to branch conditionally if ecx is not zero, if you feel like using them.)

If you keep your pointer to the string in esi, you can load the next byte into al with the lodsb instruction. Alternatively, you can mov from [esi] and then inc esi. If you zero out eax before storing each byte in al, this will give you an index in eax that you can use with an array of counters.

edited Nov 27, 2016 at 14:59

answered Nov 25, 2016 at 3:48

Davislor

15.6k2 gold badges39 silver badges54 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Peter Cordes Over a year ago

Yup, only 256 bins is small enough to make a standard histogram approach work well with a plain array, not a hash table or other map data structure.

Peter Cordes Over a year ago

However, your suggestion to use the LOOP instruction to implement the loop is not good. On most CPUs, it's so slow you might as well forget it even exists, unless optimizing for code-size. The other thing I'd suggest is using movzx eax, byte [esi] / inc [table + eax] inside the loop. That's maybe easier to thing about than zeroing eax outside the loop and then just modifying the low byte inside the loop (e.g. with LODSB). Both ways work, though.

Davislor Over a year ago

I would definitely advise the OP to get the simple approach working before trying to implement a hash table in assembly language.

Peter Cordes Over a year ago

My point was that a simple array is the right solution for this specific problem size (which is why a beginner was asigned it for homework, I assume :), but might not be for related problems with sparse sets. Using a hash table for this would be slower. If you were going to do anything, you could assume printable ASCII and do inc [table + eax - 0x20] or something, but probably best not to. Allocating memory that's never touched is not really a problem as far as performance.

Peter Cordes Over a year ago

Who's upvoting both our comments? Is someone lurking here hanging on our every word?

|

Ped7g · Accepted Answer · 2016-11-25 12:30:46Z

3

Other possible approach:

Should be easier to understand and actually it is very "human" straightforward (like how would you do it on paper with pencil), so I wonder how you didn't come up with this one ... you should not only try to understand how this works, but also try to figure out why you didn't see it, what is confusing/blocking you.

for all letters of alphabet [A-Z] as "letter" {
    counter = 0;
    for all characters in input string as "char" {
        if (letter ignore_case_equals char) ++counter;
    }
    if (0 < counter) {
        display "letter - counter" and new line.
    }
}

This may be actually faster for English alphabet and short string, like 3-5 letters (containing letters only); than the suggested count sort, as alphabet is 26 chars and count table is 256 bytes, so 256/26 = ~9. But count table will reveal count for any character, including non-alphabet ones and also it will stall less on branching.

Also notice, how you did start with emitting code for prompts/input/etc... things you already could do in Assembly, avoiding the unknown.

I did start with almost-English description of algorithm. Because I don't care, what I know to write in Assembly (I already know, that pretty much anything :) ), first I want to be sure I know what I want the code do for me and what kind of data I want to process. Then I will command the CPU to do it, by finalizing the data structures plan and splitting those English notes into simpler and simpler steps, until they start to resemble instructions, then I fill up the instructions between comments.

Don't skip the native language reasoning phase, it may save you lot of work on the code, if you figure out some elegant way to organize data or cut down algorithm steps by reusing certain part of it. Not every problem has short elegant solution, but many have.

edited Nov 25, 2016 at 12:30

answered Nov 25, 2016 at 12:22

Ped7g

16.7k4 gold badges33 silver badges71 bronze badges

2 Comments

Peter Cordes Over a year ago

Depending on how you implement the count-matches part, that part can be branchless (CMP / sete al / add ecx, eax). The 0 < count branch for conditional printing is also present in the histogram version. And yes, for short strings, especially with the character limited to just alphabetic characters, this could be faster.

Davislor Over a year ago

If you could guarantee that there would only be letters, you could optimize the array of buckets to be faster, too! But this also works, and it’s good to try more than one approach.

Collectives™ on Stack Overflow

Finding Occurrences in a String in Assembly x86

2 Answers 2

10 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

10 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related