I have an array of strings that I read from a file, and I want to compare each line of the file to a specific string. The file is large (about 200 MB of lines).

I have followed this tutorial, https://nyu-cds.github.io/python-numba/05-cuda/, but it doesn't show how to deal with an array of strings/characters.

import numpy as np
from numba import cuda



@cuda.jit
def my_kernel(io_array):

    tx = cuda.threadIdx.x

    ty = cuda.blockIdx.x

    bw = cuda.blockDim.x

    pos = tx + ty * bw
    if pos < io_array.size:  # Check array boundaries
        io_array[pos]   # i want here to compare each line of the string array to a specific line

def main():
    a = open("test.txt", 'r')  # open file in read mode

    print("the file contains:")
    data = country = np.array(a.read())


    # Set the number of threads in a block
    threadsperblock = 32

    # Calculate the number of thread blocks in the grid
    blockspergrid = (data.size + (threadsperblock - 1)) // threadsperblock

    # Now start the kernel
    my_kernel[blockspergrid, threadsperblock](data)


    # Print the result
    print(data)

if __name__ == '__main__':
        main()

I have two problems.

First: how do I pass the sentence (string) that I want to compare each line against to the kernel function (in io_array, without affecting the threads' computation)?

Second: how do I deal with a string array? I get this error when I run the above code:

this error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at test2.py (18)

File "test2.py", line 18:
def my_kernel(io_array):
    <source elided>
    if pos < io_array.size:  # Check array boundaries
        io_array[pos]   # do the computation

P.S. I'm new to CUDA and have just started learning it.

1 Answer

First of all, this:

data = country = np.array(a.read())

doesn't do what you think it does. a.read() returns the whole file as one Python string, so np.array() wraps it in a 0-dimensional array holding that single string. It does not yield a numpy array that you can index like this:

io_array[pos]

If you don't believe me, just try that in ordinary python code with something like:

print(data[0])

and you'll get an error. If you want help with that, just ask your question on the python or numpy tag.
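To see the problem without a GPU, here is a minimal sketch in plain NumPy. The `text` literal is a stand-in for whatever `a.read()` returns, and the names `whole` and `lines` are only for illustration:

```python
import numpy as np

# Stand-in for a.read(): one Python string holding the whole file.
text = "first line\nsecond line\n"

whole = np.array(text)
print(whole.ndim, whole.shape)  # a 0-dimensional array has no axis to index

try:
    whole[0]
except IndexError as exc:
    print("IndexError:", exc)

# Splitting first gives one element per line, but the elements are
# Unicode strings rather than raw bytes.
lines = np.array(text.splitlines())
print(lines.shape, lines.dtype.kind)
```

Even the split version holds string elements rather than numbers, which is why the rest of this answer works on raw bytes instead.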

So we need a different method to load the string data from disk. For simplicity, I chose numpy.fromfile(). The approach below requires that all lines in your file have the same width, which keeps the indexing simple. You would have to describe more about your data if you want to handle lines of varying lengths.
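If your lines are not all the same width, one possible workaround is to pad them before loading. The following is only a sketch: the helper name `pad_lines` is hypothetical, the "." fill byte is an assumption chosen to match the sample file used in this answer, and the check string would need to be padded the same way.

```python
def pad_lines(raw: bytes, fill: bytes = b".") -> bytes:
    """Pad every line in raw to the width of the longest line."""
    lines = raw.splitlines()
    width = max((len(line) for line in lines), default=0)
    # Re-join with a single newline after each padded line.
    return b"".join(line.ljust(width, fill) + b"\n" for line in lines)

raw = b"the quick brown fox\njumped over the lazy dog\nrepeatedly\n"
padded = pad_lines(raw)

# Every record now occupies width + 1 bytes, newline included.
width = padded.index(b"\n")
print(width, len(padded))
```

This reads everything into memory at once, so for a 200 MB file you may prefer to pad while writing the file in the first place.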

If we start out that way, we can load the data as an array of bytes, and use that:

$ cat test.txt
the quick brown fox.............
jumped over the lazy dog........
repeatedly......................
$ cat t43.py
import numpy as np
from numba import cuda

@cuda.jit
def my_kernel(str_array, check_str, length, lines, result):

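    # One thread per character: x indexes the column, y indexes the line
    col,line = cuda.grid(2)
    # Each line occupies length+1 bytes (the +1 is the newline)
    pos = (line*(length+1))+col
    if col < length and line < lines:  # Check array boundaries
        # Any mismatching character clears that line's match flag
        if str_array[pos] != check_str[col]:
            result[line] = 0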

def main():
    a = np.fromfile("test.txt", dtype=np.byte)
    print("the file contains:")
    print(a)
    print("array length is:")
    print(a.shape[0])
    print("the check string is:")
    b = a[33:65]
    print(b)
    # Find the first newline (byte value 10) to get the line width
    i = 0
    while a[i] != 10:
        i=i+1
    line_length = i
    print("line length is:")
    print(line_length)
    print("number of lines is:")
    # Integer division: np.ones() and the launch configuration need ints
    line_count = a.shape[0]//(line_length+1)
    print(line_count)
    res = np.ones(line_count)
    # Set the number of threads in a block
    threadsperblock = (32,32)

    # Calculate the number of thread blocks in the grid
    blocks_x = (line_length//32)+1
    blocks_y = (line_count//32)+1
    blockspergrid = (blocks_x,blocks_y)
    # Now start the kernel
    my_kernel[blockspergrid, threadsperblock](a, b, line_length, line_count, res)


    # Print the result
    print("matching lines (match = 1):")
    print(res)

if __name__ == '__main__':
        main()
$ python t43.py
the file contains:
[116 104 101  32 113 117 105  99 107  32  98 114 111 119 110  32 102 111
 120  46  46  46  46  46  46  46  46  46  46  46  46  46  10 106 117 109
 112 101 100  32 111 118 101 114  32 116 104 101  32 108  97 122 121  32
 100 111 103  46  46  46  46  46  46  46  46  10 114 101 112 101  97 116
 101 100 108 121  46  46  46  46  46  46  46  46  46  46  46  46  46  46
  46  46  46  46  46  46  46  46  10]
array length is:
99
the check string is:
[106 117 109 112 101 100  32 111 118 101 114  32 116 104 101  32 108  97
 122 121  32 100 111 103  46  46  46  46  46  46  46  46]
line length is:
32
number of lines is:
3
matching lines (match = 1):
[ 0.  1.  0.]
$
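If you want to sanity-check the kernel's result without a GPU, a CPU reference using the same flat indexing can help. This is a sketch only: `match_lines` is a hypothetical helper, the data is rebuilt in memory instead of read from test.txt, and the check string starts out as an ordinary Python string to show one way of turning your sentence into bytes.

```python
def match_lines(data: bytes, check: bytes, length: int) -> list:
    """CPU reference for the kernel: 1 where a line equals check, else 0."""
    stride = length + 1  # line bytes plus the trailing newline
    lines = len(data) // stride
    result = [1] * lines
    for line in range(lines):
        for col in range(length):
            # Same flat index that each kernel thread computes.
            if data[line * stride + col] != check[col]:
                result[line] = 0
                break  # a CPU loop can stop at the first mismatch
    return result

length = 32
texts = [b"the quick brown fox", b"jumped over the lazy dog", b"repeatedly"]
data = b"".join(t.ljust(length, b".") + b"\n" for t in texts)

# The sentence as a Python str, encoded and padded to the line width.
check = "jumped over the lazy dog".encode("ascii").ljust(length, b".")

print(match_lines(data, check, length))  # prints [0, 1, 0]
```

To hand a string like `check` to the kernel as its own argument, one option is np.frombuffer(check, dtype=np.byte).copy(), which gives a writable byte array with the same dtype as the file data.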

1 Comment

Perfect answer. The "dtype=np.byte" argument in the np.fromfile() call was the piece I was searching for. Great job, sir.
