
For a given corpus of tokenized texts, I want to perform word weighting with several weighting techniques. To do so, I created the following class:

from collections import Counter
import math

import numpy as np
from scipy.sparse import dok_matrix

class Weighing:
    def __init__(self, input_file, word_weighing):
        self.input_file_ = input_file #List in which each element is a list of tokens
        self.word_weighing_ = word_weighing
        self.num_documents = len(self.input_file_)
        
        #Set with all unique words from the corpus
        self.vocabulary = set()
        for text in self.input_file_:  
            self.vocabulary.update(text)
        self.vocabulary_size = len(self.vocabulary)
        
        #Create dictionary that returns index for a token or token for an index of the corpus' vocabulary
        self.word_to_index = dict()
        self.index_to_word = dict()
        for i, word in enumerate(self.vocabulary):
            self.word_to_index[word] = i
            self.index_to_word[i] = word
        
        #Create sparse Document-Term Matrix (DTM)
        self.sparse_dtm = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
        for doc_index, document in enumerate(self.input_file_):
            document_counter = Counter(document)
            for word in set(document):
                self.sparse_dtm[doc_index, self.word_to_index[word]] = document_counter[word]    # Update element
            
        #Get word count for all documents to calculate sparse_p_ij
        self.sum_words = Counter()
        for doc in self.input_file_:
            self.sum_words.update(Counter(doc))
            
        #Create probability of word i in document j. Format: sparse matrix
        def create_sparse_p_ij (self):
            sparse_p_ij = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
            for j in range(self.num_documents):
                row_counts = self.sparse_dtm.getrow(j).toarray()[0]
                word_index = row_counts.nonzero()[0]
                non_zero_row_counts = row_counts[row_counts != 0]
                
                for i, count in enumerate(non_zero_row_counts):
                    word = self.index_to_word[word_index[i]]
                    prob_ij = count/self.sum_words[word]
                    sparse_p_ij[j,word_index[i]] = prob_ij
            return sparse_p_ij
        
        #Create a binary sparse dtm. Format: sparse matrix
        def create_sparse_binary_dtm(self):    
            binary_sparse_dtm = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
            for doc_index, document in enumerate(self.input_file_):
                document_counter = dict.fromkeys(document, 1)
                for word in set(document):
                    binary_sparse_dtm[doc_index, self.word_to_index[word]] = document_counter[word]    # Update element
            return binary_sparse_dtm
    
        #2) Calculate Global Term weighting (4 methods: entropy, IDF, Probabilistic IDF, Normal)
        def calc_entropy(self):
            sparse_p_ij = self.create_sparse_p_ij()
            summed_word_probabilities = sparse_p_ij.sum(0).tolist()[0]
            return np.array([1+((word_probability * np.log2(word_probability))/np.log2(self.num_documents)) for word_probability in summed_word_probabilities])       
        
        def calc_idf(self):
            summed_words = self.sparse_dtm.sum(0).tolist()[0]
            return np.array([np.log2(self.num_documents/word_count) for word_count in summed_words])
        
        def calc_normal(self):
            summed_words = self.sparse_dtm.sum(0).tolist()[0]
            return np.array([1/(math.sqrt(word_count**2)) for word_count in summed_words])
        
        def calc_probidf (self):
            binary_sparse_dtm = self.create_sparse_binary_dtm()
            summed_binary_words_list = binary_sparse_dtm.sum(0).tolist()[0]
            return np.array([np.log2((self.num_documents - binary_word_count)/binary_word_count) for binary_word_count in summed_binary_words_list])
                
        if self.word_weighing_ ==  1:
            gtw = self.calc_entropy()
        elif self.word_weighing_ == 2:
            gtw = self.calc_idf()
        elif self.word_weighing_ == 3:
            gtw = self.calc_normal()
        elif self.word_weighing_ == 4:
            gtw = self.calc_probidf()

Now, when I run:

model = Weighing(input_file=data_list,
                 word_weighing=1)

where data_list is a list of lists of tokenized words.

I get the following error:

Traceback (most recent call last):

  File "<ipython-input-621-b0a9caec82d4>", line 4, in <module>
    word_weighing = 1)

  File "<ipython-input-617-6f3fdcecd170>", line 90, in __init__
    gtw = self.calc_entropy()

AttributeError: 'Weighing' object has no attribute 'calc_entropy'

I looked at a few other similar SO links [1, 2, 3, 4], but none of them seem applicable here.

What can I do to overcome this error?
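For reference, here is a stripped-down sketch that reproduces the same error (class and function names are hypothetical): a `def` nested inside `__init__` creates a local function, not an instance method, so no attribute is ever set on the object.

```python
class Example:
    def __init__(self):
        # A def nested inside __init__ creates a local function,
        # not an instance method; it vanishes when __init__ returns.
        def helper():
            return 42
        print(helper())             # works: plain local call

e = Example()
print(hasattr(e, "helper"))         # False: no such attribute was ever set
```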


EDIT:

I've updated the code to:

from collections import Counter
import math

import numpy as np
from scipy.sparse import dok_matrix

class Weighing:
    def __init__(self, input_file, word_weighing):
        self.input_file_ = input_file #List in which each element is a list of tokens
        self.word_weighing_ = word_weighing
        self.num_documents = len(self.input_file_)
            
        #Set with all unique words from the corpus
        self.vocabulary = set()
        for text in self.input_file_:  
            self.vocabulary.update(text)
        self.vocabulary_size = len(self.vocabulary)
            
        #Create dictionary that returns index for a token or token for an index of the corpus' vocabulary
        self.word_to_index = dict()
        self.index_to_word = dict()
        for i, word in enumerate(self.vocabulary):
            self.word_to_index[word] = i
            self.index_to_word[i] = word
       
        #Create sparse Document-Term Matrix (DTM)
        self.sparse_dtm = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
        for doc_index, document in enumerate(self.input_file_):
            document_counter = Counter(document)
            for word in set(document):
                self.sparse_dtm[doc_index, self.word_to_index[word]] = document_counter[word]    # Update element
                
      
        if self.word_weighing_ ==  1:
            self.gtw = self.calc_entropy()
        elif self.word_weighing_ == 2:
            self.gtw = self.calc_idf()
        elif self.word_weighing_ == 3:
            self.gtw = self.calc_normal()
        elif self.word_weighing_ == 4:
            self.gtw = self.calc_probidf()
        
        #Get word count for all documents to calculate sparse_p_ij
        self.sum_words = Counter()
        for doc in self.input_file_:
            self.sum_words.update(Counter(doc))
            
    #Create probability of word i in document j. Format: sparse matrix
    def create_sparse_p_ij (self):
        sparse_p_ij = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
        for j in range(self.num_documents):
            row_counts = self.sparse_dtm.getrow(j).toarray()[0]
            word_index = row_counts.nonzero()[0]
            non_zero_row_counts = row_counts[row_counts != 0]
                
            for i, count in enumerate(non_zero_row_counts):
                word = self.index_to_word[word_index[i]]
                prob_ij = count/self.sum_words[word]
                sparse_p_ij[j,word_index[i]] = prob_ij
        return sparse_p_ij
        
    #Create a binary sparse dtm. Format: sparse matrix
    def create_sparse_binary_dtm(self):    
        binary_sparse_dtm = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
        for doc_index, document in enumerate(self.input_file_):
            document_counter = dict.fromkeys(document, 1)
            for word in set(document):
                binary_sparse_dtm[doc_index, self.word_to_index[word]] = document_counter[word]    # Update element
        return binary_sparse_dtm
    
    #2) Calculate Global Term weighting (4 methods: entropy, IDF, Probabilistic IDF, Normal)
    def calc_entropy(self):
        sparse_p_ij = self.create_sparse_p_ij()
        summed_word_probabilities = sparse_p_ij.sum(0).tolist()[0]
        return np.array([1+((word_probability * np.log2(word_probability))/np.log2(self.num_documents)) for word_probability in summed_word_probabilities])       
       
    def calc_idf(self):
        summed_words = self.sparse_dtm.sum(0).tolist()[0]
        return np.array([np.log2(self.num_documents/word_count) for word_count in summed_words])
        
    def calc_normal(self):
        summed_words = self.sparse_dtm.sum(0).tolist()[0]
        return np.array([1/(math.sqrt(word_count**2)) for word_count in summed_words])
        
    def calc_probidf (self):
        binary_sparse_dtm = self.create_sparse_binary_dtm()
        summed_binary_words_list = binary_sparse_dtm.sum(0).tolist()[0]
        return np.array([np.log2((self.num_documents - binary_word_count)/binary_word_count) for binary_word_count in summed_binary_words_list])

However, I still get the error:

AttributeError: 'Weighing' object has no attribute 'calc_entropy'

Now I call a method before I have initialized it. How can I change my code so that calc_entropy is initialized before I assign self.gtw?

  • All your functions calc_entropy, calc_normal, calc_idf, etc. are local to Weighing.__init__, so within Weighing.__init__ they would have to be called as calc_entropy(self) instead of self.calc_entropy(). Also, they are not accessible outside Weighing.__init__. If this is not what you want, you should make them proper methods of Weighing. Commented Mar 24, 2021 at 16:01
  • All the "methods" other than __init__ are indented one level too deep – they are in the scope of __init__, not the class. That makes them functions local to __init__, not methods of the class. Commented Mar 24, 2021 at 16:02
  • I just updated my code. However, I still run into errors. Do you have a suggestion on how to solve this @MisterMiyagi and @Heike? Commented Mar 24, 2021 at 16:57
  • 1
    First off, instance methods don't need to be initialized. They are bound to the class object itself not instances of the class. Secondly, one problem with your code is that self.create_sparse_p_ij uses self.sum_words which isn't initialized until after the call to self.create_sparse_p_ij. Commented Mar 25, 2021 at 6:40
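The ordering problem from the last comment, in miniature (hypothetical names): the method itself always exists on the class, but any attribute it reads must be assigned before the method is called.

```python
class Broken:
    def __init__(self):
        self.value = self.double_total()   # double_total exists, but...
        self.total = 10                    # ...self.total is set too late

    def double_total(self):
        return self.total * 2              # AttributeError at call time

class Fixed:
    def __init__(self):
        self.total = 10                    # data first
        self.value = self.double_total()   # then the call that needs it

    def double_total(self):
        return self.total * 2

print(Fixed().value)                       # 20
try:
    Broken()
except AttributeError as e:
    print(e)                               # no attribute 'total'
```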

1 Answer

It seems to be an indentation problem: you define your methods, like calc_entropy(), inside your __init__() function rather than at class level.

It should be:

class Weighing:
    def __init__(self, input_file, word_weighing):
        # your init

    def calc_entropy(self):
        # your method
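With the methods de-indented, one more thing has to change: __init__ must assign self.sum_words *before* dispatching to self.calc_entropy(), because create_sparse_p_ij reads it. A condensed sketch of the working layout (assuming scipy and numpy are available; create_sparse_p_ij is simplified but keeps your formulas):

```python
from collections import Counter

import numpy as np
from scipy.sparse import dok_matrix

class Weighing:
    def __init__(self, input_file, word_weighing):
        self.input_file_ = input_file
        self.word_weighing_ = word_weighing
        self.num_documents = len(input_file)

        self.vocabulary = sorted({w for doc in input_file for w in doc})
        self.vocabulary_size = len(self.vocabulary)
        self.word_to_index = {w: i for i, w in enumerate(self.vocabulary)}
        self.index_to_word = {i: w for w, i in self.word_to_index.items()}

        self.sparse_dtm = dok_matrix((self.num_documents, self.vocabulary_size),
                                     dtype=np.float32)
        for j, document in enumerate(input_file):
            for word, count in Counter(document).items():
                self.sparse_dtm[j, self.word_to_index[word]] = count

        # Must be set BEFORE the dispatch below: create_sparse_p_ij reads it.
        self.sum_words = Counter()
        for doc in input_file:
            self.sum_words.update(doc)

        if self.word_weighing_ == 1:
            self.gtw = self.calc_entropy()
        elif self.word_weighing_ == 2:
            self.gtw = self.calc_idf()

    # Methods sit at class level, so self.calc_entropy() resolves normally.
    def create_sparse_p_ij(self):
        p = dok_matrix((self.num_documents, self.vocabulary_size), dtype=np.float32)
        for (j, i), count in self.sparse_dtm.items():
            p[j, i] = count / self.sum_words[self.index_to_word[i]]
        return p

    def calc_entropy(self):
        summed = self.create_sparse_p_ij().sum(0).tolist()[0]
        return np.array([1 + (s * np.log2(s)) / np.log2(self.num_documents)
                         for s in summed])

    def calc_idf(self):
        summed = self.sparse_dtm.sum(0).tolist()[0]
        return np.array([np.log2(self.num_documents / c) for c in summed])

docs = [["a", "b"], ["a"]]
print(Weighing(docs, 2).gtw)   # IDF: [0. 1.] for "a" (2 occurrences), "b" (1)
```

The other global-weighting methods can be pasted back in at the same indentation level; the only hard requirement is that every attribute a method touches is assigned earlier in __init__ than the call that triggers it.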

1 Comment

Thanks for your answer, Julian. I have changed the code, but now run into the error that I call a method that isn't initialized yet. Do you know how I can solve this problem?
