I am using Lucene.Net + a custom crawler + IFilter so that I can index the data stored inside blobs.
// Crawl every container (except the "indexes" container, which holds the
// Lucene index itself) and index each blob's content.  Each blob is
// downloaded to local instance storage, handed to the IFilter-based
// indexer, and the temporary copy is then removed.
foreach (var item in containerList)
{
    CloudBlobContainer container = BlobClient.GetContainerReference(item.Name);

    // Skip the container that stores the Lucene index.
    if (container.Name != "indexes")
    {
        IEnumerable<IListBlobItem> blobs = container.ListBlobs();
        foreach (CloudBlob blob in blobs)
        {
            // Path.Combine avoids a missing/duplicated separator between
            // the local folder and the blob name.
            string localFile = System.IO.Path.Combine(path, blob.Name);
            try
            {
                blob.DownloadToFile(localFile);
                indexer.IndexBlobData(path, blob);
            }
            finally
            {
                // Always remove the temporary copy, even if indexing throws,
                // so instance storage does not fill up with orphaned files.
                if (System.IO.File.Exists(localFile))
                {
                    System.IO.File.Delete(localFile);
                }
            }
        }
    }
}
/* Code for crawling, which downloads each file locally to Azure instance storage */
The code below is the indexer function, which uses IFilter:
/// <summary>
/// Extracts the text of a previously downloaded blob with an IFilter and
/// adds it to the Lucene index along with the blob's URI.
/// </summary>
/// <param name="path">Local folder the blob was downloaded into.</param>
/// <param name="blob">The blob whose downloaded content should be indexed.</param>
/// <returns><c>true</c> when the document was added; <c>false</c> on any failure.</returns>
public bool IndexBlobData(string path, CloudBlob blob)
{
    try
    {
        Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();

        // 'using' guarantees the reader (and the IFilter/file resources it
        // wraps) is released even when AddDocument throws; the original
        // code leaked the reader on that path.
        using (TextReader reader = new FilterReader(System.IO.Path.Combine(path, blob.Name)))
        {
            doc.Add(new Lucene.Net.Documents.Field("url", blob.Uri.ToString(),
                Lucene.Net.Documents.Field.Store.YES,
                Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
            // ReadToEnd() already returns a string; the extra ToString() was redundant.
            doc.Add(new Lucene.Net.Documents.Field("content", reader.ReadToEnd(),
                Lucene.Net.Documents.Field.Store.YES,
                Lucene.Net.Documents.Field.Index.ANALYZED));
            indexWriter.AddDocument(doc);
        }
        return true;
    }
    catch (Exception e)
    {
        // Best-effort by design: one bad blob must not stop the crawl.
        // Log instead of silently swallowing so failures can be diagnosed.
        System.Diagnostics.Trace.TraceError(
            "Failed to index blob '{0}': {1}", blob.Name, e);
        return false;
    }
}
Now my issue is that I don't want to download the file to instance storage. I want to pass the file directly to FilterReader, but it takes a "physical" path, and passing an HTTP address doesn't work. Can anybody suggest a workaround? I don't want to download the same file from blob storage again and then index it; instead I would prefer to download it into main memory and use the index filter on it directly.
I am using the IFilter implementation from here.