
My NLTK data is in ~/nltk_data/corpora/words/ (files: en, en-basic, README).

According to __init__.py inside ~/lib/python2.7/site-packages/nltk/corpus, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

from nltk.corpus import brown
print brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

This __init__.py has

words = LazyCorpusLoader(
    'words', WordListCorpusReader, r'(?!README|\.).*')
  1. So when I write from nltk.corpus import words, am I importing the 'words' function from __init__.py which resides in directory python2.7/site-packages/nltk/corpus?

  2. Also why does this happen:

     import nltk.corpus.words
     ImportError: No module named words
     from nltk.corpus import words
     # WORKS FINE
    
  3. The "brown" corpus resides inside ~/nltk_data/corpora (and not in nltk/corpus). So why does this command work?

    from nltk.corpus import brown
    

    Shouldn't it be this?

    from nltk_data.corpora import brown
    
  • For reference, the prompt from the interpreter was being interpreted as the start of code blocks - I've stripped them out so the blocks work properly. Commented Aug 27, 2013 at 13:18

2 Answers


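Re. point 2: An import statement can name either a module (import module.submodule) or an object inside a module (from module.submodule import variable). A submodule can also be imported with the from form (from module import submodule), because it is bound as a name in the parent module's scope. It doesn't work the other way: every component of the dotted path in a plain import must be a package or module, so import module.submodule.variable fails. That is what happens with import nltk.corpus.words: words is an object defined in nltk/corpus/__init__.py, not a module.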
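The rule can be checked without NLTK installed. The sketch below uses the standard-library json package purely as a stand-in (an assumption made for illustration, not something from the question): json.decoder is a submodule, while json.dumps is a function defined in the package's __init__.py, in the same way that words is an object defined in nltk/corpus/__init__.py.

```python
import importlib

# A dotted import needs every component to be a package or module.
# json.decoder is a submodule, so this succeeds (it loads the same
# module that "import json.decoder" would).
decoder = importlib.import_module("json.decoder")

# dumps is a function defined in json/__init__.py, not a submodule,
# so the equivalent of "import json.dumps" raises ImportError.
try:
    importlib.import_module("json.dumps")
    import_error = None
except ImportError as exc:
    import_error = exc

# The from-form looks the name up on the package itself, so it works
# for plain objects as well as for submodules.
from json import dumps

print(decoder.__name__)
print("ImportError raised:", import_error is not None)
print(dumps([1, 2]))
```

Replacing json with nltk.corpus and dumps with words gives the behaviour described in the question.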

Re. point 3: That depends on what nltk.corpus does. It probably searches for the data under nltk_data and loads it for you automatically.


1 Comment

Thanks viraptor. I also saw a strange piece of code, "nltk.corpus.words.words('en')" (ref. section 3.4 of nltk.org/book/ch03.html). How many dots can one go on applying? Also, this code works only after doing "import nltk, from nltk.corpus import words, nltk.corpus.words.words". Why are we calling words on words again? What is special about the second words? Where is this second words defined, or where is it being called from? I hope you understand my confusion!

1.] Yes. The name you import is the words object that __init__.py creates with LazyCorpusLoader. That class comes from util (nltk.corpus.util), where you can find the following description:

"""
    A proxy object which is used to stand in for a corpus object
    before the corpus is loaded.  This allows NLTK to create an object
    for each corpus, but defer the costs associated with loading those
    corpora until the first time that they're actually accessed.

    The first time this object is accessed in any way, it will load
    the corresponding corpus, and transform itself into that corpus
    (by modifying its own ``__class__`` and ``__dict__`` attributes).

    If the corpus can not be found, then accessing this object will
    raise an exception, displaying installation instructions for the
    NLTK data package.  Once they've properly installed the data
    package (or modified ``nltk.data.path`` to point to its location),
    they can then use the corpus object without restarting python.
    """

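The mechanism that docstring describes can be sketched in a few lines. This is a simplified illustration, not NLTK's actual code: LazyLoader, WordList, and load_words are hypothetical names, and the "corpus" is just an in-memory list.

```python
class WordList:
    """Hypothetical stand-in for a corpus reader object."""

    def __init__(self, items):
        self._items = list(items)

    def words(self):
        # An ordinary method; calling it on an object that is itself
        # named "words" gives the words.words() spelling.
        return list(self._items)


class LazyLoader:
    """Proxy that turns itself into the real object on first use."""

    def __init__(self, factory):
        self._factory = factory

    def _load(self):
        real = self._factory()
        # Take over the real object's state and class, as the
        # docstring above describes.
        self.__dict__ = real.__dict__
        self.__class__ = real.__class__

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name == "_factory":
            raise AttributeError(name)  # guard against recursion
        self._load()
        return getattr(self, name)


load_calls = []

def load_words():
    load_calls.append("loaded")  # record when loading really happens
    return WordList(["The", "Fulton", "County"])

words = LazyLoader(load_words)     # cheap: nothing is loaded yet
before = type(words).__name__      # still the proxy class
result = words.words()             # first access triggers the load
after = type(words).__name__       # now the real class

print(before, len(load_calls), result, after)
```

This also shows why nltk.corpus.words.words('en') has "words" twice: the first is the object bound in the package's __init__.py, and the second is a method of that object.
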
3.] nltk_data is the folder that holds the data. That does not mean the module lives in that folder too: the code stays in the nltk package, and the data is downloaded separately into nltk_data.

As the NLTK documentation puts it: "NLTK has built-in support for dozens of corpora and trained models, as listed below. To use these within NLTK we recommend that you use the NLTK corpus downloader", i.e. nltk.download().
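To make the separation between code and data concrete, here is a rough sketch of a search-path lookup. It is an assumption-laden illustration rather than NLTK's implementation: find_resource is a hypothetical helper, and the directories are created in a temporary folder. In NLTK itself the list of data directories is nltk.data.path, as mentioned in the docstring above.

```python
import os
import tempfile


def find_resource(resource, search_path):
    """Return the first existing location of `resource` under `search_path`.

    Hypothetical helper: it walks a list of data directories in order.
    """
    for base in search_path:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError("Resource %r not found in %r" % (resource, search_path))


with tempfile.TemporaryDirectory() as home:
    # Fake data layout: <home>/nltk_data/corpora/brown
    data_dir = os.path.join(home, "nltk_data")
    os.makedirs(os.path.join(data_dir, "corpora", "brown"))

    # The first directory does not exist, so the second one is used.
    search_path = [os.path.join(home, "missing"), data_dir]
    found = find_resource("corpora/brown", search_path)
    relative = os.path.relpath(found, home)

    try:
        find_resource("corpora/nonexistent", search_path)
        missing_error = None
    except LookupError as exc:
        missing_error = exc

print(relative)
```

The import statement only has to locate the nltk package; the data directory is located later, when the corpus object is first used.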
