Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time computing document frequency in Python. I am trying to compute TF-IDF and then find the cosine scores between the documents and a query, but I am stuck at finding document frequency. This is what I have so far:

#imports
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_IDF = re.findall(r'\w+', open(filename).read().lower())

        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]

        word_IDF = doc_IDF

        #pseudocode!!
        """
        for key in word_IDF:
            if key in word_IDF:
                word_IDF[key] += 1
            else:
                word_IDF[key] = 1

        print word_IDF
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_TF = re.findall(r'\w+', open(filename).read().lower())

        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)

        #weighing the Tf with a log function
        for key in TFvec: 
            TFvec[key] = 1 + math.log10(TFvec[key])


    #placed here so I don't get a command line full of text
    print TFvec 

#Error checker
else:
    print "That path does not exist"

I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am stuck on counting the number of documents each term appears in. My plan was to build one large dictionary holding all of the terms from all of the documents, so it could be consulted later when a query needs those terms. Thank you for any help you can give me.

  • Is there a reason you are trying to implement this yourself rather than using a library? scikit-learn.org/stable/modules/generated/… Commented Feb 4, 2016 at 21:54
  • I read that one, but I have to log the TF and IDF values, and I thought that would be easier if I implemented it myself. Also, I will be reading in a directory that contains around 100 text files, so again I thought it would be easier than using scikit. Commented Feb 4, 2016 at 22:09
  • Also, I will have to compute cosine similarity on the TF-IDF later on. Does scikit have that as a function as well? Commented Feb 4, 2016 at 22:10

1 Answer

DF for a term x is the number of documents in which x appears. To find it, you need to iterate over all documents first; only then can you compute IDF from DF.
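To make the definition concrete, here is a toy illustration (the three mini-documents are invented for this example):

```python
from collections import defaultdict

# Hypothetical toy corpus: three tiny "documents" as lists of words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],  # "sat" repeats, but DF counts it once
    ["the", "cat"],
]

DF = defaultdict(int)
for doc in docs:
    for word in set(doc):  # set() removes within-document repetitions
        DF[word] += 1

print(DF["the"])  # appears in all 3 documents -> 3
print(DF["sat"])  # appears in 2 documents (the repeat doesn't count) -> 2
print(DF["dog"])  # appears in 1 document -> 1
```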

You can use a dictionary for counting DF:

  1. Iterate over all documents.
  2. For each document, retrieve the set of its words (without repetitions).
  3. Increase the DF count for each word from step 2. This way each word's count grows by exactly one per document, regardless of how many times the word appears in it.

Python code could look like this:

from collections import defaultdict
import glob
import math
import os
import re

DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):  # set() drops repetitions within a document
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict replaces your "if key in word_idf: ..." part

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # float() because Python 2 uses integer division
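Regarding the cosine scores the question mentions: once each document and the query are {term: weight} dictionaries (like your TFvec), cosine similarity is just the dot product divided by the product of the vector norms. A minimal sketch (cosine_similarity is my own helper name, not a library function):

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    # Dot product over the terms the two vectors share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; vectors with no shared terms score 0.0.
print(cosine_similarity({"cat": 1.0, "sat": 2.0}, {"cat": 1.0, "sat": 2.0}))  # 1.0
print(cosine_similarity({"cat": 1.0}, {"dog": 1.0}))                          # 0.0
```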

P.S. It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).


1 Comment

Thank you so much, someone suggested defaultdict to me yesterday but I had no idea how to use it.
