7

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?

We have a fairly large log file, about 5GB. Each line of the log file contains a URL that a user visited on our site. We want to figure out the 100 most popular URLs visited by our users. How would you do it?


3 Answers

7

If we have more than 10GB of RAM, just do it straightforwardly with a hashmap.

Otherwise, split it into several files using a hash function on the URL, so that every occurrence of a given URL lands in the same file. Then process each file and compute its top 5. With the per-file "top 5" lists it is easy to get the overall top 5, since each URL's complete count lives in exactly one file. A sketch of this is shown below.
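
A minimal C# sketch of this partition-and-merge idea, assuming each bucket file is small enough to count in memory; the class name, bucket file names, and parameters are illustrative and not part of the original answer:

// Sketch only: partition a large log into bucket files by URL hash,
// count each bucket independently, then merge the per-bucket top lists.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class HashPartitionTopUrls
{
    public static List<KeyValuePair<string, long>> Top(string logPath, int buckets, int k)
    {
        // Phase 1: split by hash so identical URLs always share a bucket file.
        var writers = Enumerable.Range(0, buckets)
            .Select(i => new StreamWriter($"bucket_{i}.tmp"))
            .ToArray();
        foreach (var url in File.ReadLines(logPath))
        {
            int b = (url.GetHashCode() & int.MaxValue) % buckets;
            writers[b].WriteLine(url);
        }
        foreach (var w in writers) w.Dispose();

        // Phase 2: count each bucket in memory and keep only its top-k candidates.
        var candidates = new List<KeyValuePair<string, long>>();
        for (int i = 0; i < buckets; i++)
        {
            var counts = new Dictionary<string, long>();
            foreach (var url in File.ReadLines($"bucket_{i}.tmp"))
                counts[url] = counts.TryGetValue(url, out var c) ? c + 1 : 1;
            candidates.AddRange(counts.OrderByDescending(p => p.Value).Take(k));
        }

        // Phase 3: the global top-k must be among the per-bucket top-k lists,
        // because each URL's full count lives in exactly one bucket.
        return candidates.OrderByDescending(p => p.Value).Take(k).ToList();
    }
}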

Another solution is to sort the file using any external sorting method and then scan it once, counting occurrences as you go (identical URLs are adjacent after sorting). During the scan you don't have to keep counts for everything: anything that doesn't make it into the top 5 can safely be thrown away. See the sketch below.
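
A hedged sketch of that sorted-scan idea, assuming the file has already been sorted externally (for example with the platform's sort utility) so identical URLs are adjacent, and assuming .NET 6+ for PriorityQueue; the file name is illustrative:

// Sketch only: scan a pre-sorted URL file once, tracking the current run
// and a min-heap of the best k candidates seen so far.
using System;
using System.Collections.Generic;
using System.IO;

static class SortedScanTopUrls
{
    public static List<(string Url, long Count)> Top(string sortedPath, int k)
    {
        var heap = new PriorityQueue<(string Url, long Count), long>(); // min-heap by count
        string current = null;
        long run = 0;

        void Offer(string url, long count)
        {
            heap.Enqueue((url, count), count);
            if (heap.Count > k) heap.Dequeue(); // discard the weakest candidate
        }

        foreach (var url in File.ReadLines(sortedPath))
        {
            if (url == current) { run++; continue; }
            if (current != null) Offer(current, run);
            current = url;
            run = 1;
        }
        if (current != null) Offer(current, run); // flush the last run

        var result = new List<(string, long)>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse(); // highest count first
        return result;
    }
}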


6 Comments

Your second alternative is the way to go. The divide-and-conquer approach is both easy to implement and you'll get linear time (if you are using hashmaps for counting in the smaller files). Sorting the file is really slow in comparison.
@EmilVikström, I agree with you. I wrote the third one because it was the first that came to mind. But in an interview it is often the case that the interviewer will construct constraints that make your previous solution invalid, to see if you can come up with anything else.
"You can safely throw anything that doesn't make it into the top 5 away." -> No, that's wrong.
@Thomash Can you give an example? Note that it's a sorted list of strings.
I'll give you an example in my answer as I don't have enough room here.
2

Just sort the log file by URL (this needs only constant extra space if you choose an in-place algorithm like heap sort or quicksort) and then count, for each URL, how many times it appears (easy, since lines with the same URL are next to each other).

Overall complexity is O(n log n).

Why splitting into many files and keeping only the top 3 (or top 5, or top N) of each file is wrong:

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 never makes it to the top 3 in the individual files but is the best overall.
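
A small, self-contained C# demonstration of this counterexample, with the counts from the table above hard-coded; it prints each file's top 3 (url7 never appears) and the overall winner (url7):

// Demonstrates the table above: url7 tops the combined counts,
// yet never appears in any single file's top 3.
using System;
using System.Collections.Generic;
using System.Linq;

static class TopNCounterexample
{
    public static void Main()
    {
        var files = new[]
        {
            new Dictionary<string, int> { ["url1"] = 5, ["url3"] = 5, ["url4"] = 5, ["url7"] = 4 },
            new Dictionary<string, int> { ["url2"] = 5, ["url3"] = 5, ["url5"] = 5, ["url7"] = 4 },
            new Dictionary<string, int> { ["url1"] = 5, ["url2"] = 5, ["url6"] = 5, ["url7"] = 4 },
        };

        // Per-file top 3: url7 is always ranked fourth, so it is dropped.
        foreach (var f in files)
            Console.WriteLine("file top 3: " +
                string.Join(", ", f.OrderByDescending(p => p.Value).Take(3).Select(p => p.Key)));

        // Global totals: url7 has 12 visits, more than any other URL.
        var totals = files.SelectMany(f => f)
                          .GroupBy(p => p.Key)
                          .ToDictionary(g => g.Key, g => g.Sum(p => p.Value));
        var best = totals.OrderByDescending(p => p.Value).First();
        Console.WriteLine($"overall best: {best.Key} ({best.Value})");
    }
}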

6 Comments

Constant space is still too much for a 5GB file.
No, 5GB can be stored in main memory without a problem. But if you only have N GB of RAM, you can split the log file into N GB blocks, apply my solution to each block, and then aggregate the results.
Can you explain more about your split and aggregation method?
Split: very simple, the first N bytes (not exactly N, because you don't want to cut a line in half) go to the first file, and you repeat the operation on the rest of the file.
Aggregate: if url1 was visited a times in file1 and b times in file2, aggregate to url1 visited a+b times. A sketch follows below.
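
A brief sketch of that aggregation step, assuming each block's per-URL counts fit in memory as a dictionary; merging just sums the counts per URL (the class and method names are illustrative):

// Sketch of the aggregation step: per-block count dictionaries are merged
// by summing counts per URL ("a" visits in block 1 + "b" in block 2 = a + b).
using System.Collections.Generic;

static class CountMerge
{
    public static Dictionary<string, long> Merge(IEnumerable<Dictionary<string, long>> blockCounts)
    {
        var total = new Dictionary<string, long>();
        foreach (var block in blockCounts)
            foreach (var p in block)
                total[p.Key] = total.TryGetValue(p.Key, out var c) ? c + p.Value : p.Value;
        return total;
    }
}
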
1

Because the log file is fairly large, you should read it with a stream reader rather than loading it all into memory. It is reasonable to expect that the set of distinct links fits in memory while we work through the log file.

// Pseudo
Hashmap map<url,count>
while(log file has nextline){
    url = nextline in logfile
    add url to map and update count
}

List list 
foreach(m in map){
    add m to list         
}

sort the list by count value
take top n from the list

The runtime is O(n) + O(m log m), where n is the number of lines in the log file and m is the number of distinct links found.

Here's a C# implementation of the pseudo-code. An actual file reader and a log file are not provided; a simple emulation that reads the log from an in-memory list is used instead.

The algorithm uses a hashmap to store the links it finds, and a sorting step finds the top 100 links afterwards. A simple data-container class is used for the sorting step.

The memory complexity depends on the number of distinct links: the hashmap must be able to hold all of the distinct links found, otherwise this algorithm won't work.

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
           stopwatch.ElapsedMilliseconds));
    }
}
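
Since an actual file reader is not provided above, here is a minimal hedged variation of the counting loop that streams a real log file line by line instead of using the in-memory list; the path and class name are assumptions:

// Sketch only: the same counting loop, but fed by a streaming reader so the
// 5GB log is never loaded into memory at once. The file path is illustrative.
using System.Collections.Generic;
using System.IO;

static class StreamedCount
{
    public static Dictionary<string, int> Count(string logPath)
    {
        var lookup = new Dictionary<string, int>();
        using (var reader = new StreamReader(logPath))
        {
            string url;
            while ((url = reader.ReadLine()) != null)   // one line per visited URL
            {
                if (lookup.TryGetValue(url, out var count))
                    lookup[url] = count + 1;
                else
                    lookup.Add(url, 1);
            }
        }
        return lookup; // sort and take the top 100 exactly as in the answer above
    }
}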

Edit: This answer has been updated based on the comments

  • added: running-time and memory-complexity analysis
  • added: pseudo-code
  • added: an explanation of how to handle a fairly large log file

5 Comments

This answer may show us that you know C#, but no one cares about that. The tags of the question are algorithm and data-structures, not C#.
It's not meant to demonstrate that I know C#; the intent was to show the algorithm through code.
I believe what @Thomash wants to point out is that you should present your answer concisely. You would prefer reading a short answer to long code, wouldn't you? By the way, the key phrase in the OP's question is "fairly large", but you clearly ignored it.
I have updated my answer. Hope my answer is more clear now, else let me know.
Not the best answer, since sorting all pages is unnecessary; sorting the top 100 is also not asked for. The fastest way to find the k smallest or k largest elements in a list is to use the binary partition step from quicksort and stop when your pivot lands at position k (i.e. quickselect). The average complexity is O(m). See also: stackoverflow.com/questions/251781/…
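
A hedged C# sketch of the quickselect approach mentioned in this comment, operating on (url, count) pairs; after the call, the first k entries hold the k most visited URLs without fully sorting the list:

// Sketch only: quickselect on (url, count) pairs, ordered by count descending.
// After Select(list, k), the first k entries of 'list' hold the k most visited
// URLs (in no particular order), found in O(m) time on average.
using System;
using System.Collections.Generic;

static class QuickSelectTopK
{
    static readonly Random Rand = new Random();

    public static void Select(List<KeyValuePair<string, int>> a, int k)
    {
        int lo = 0, hi = a.Count - 1;
        while (lo < hi)
        {
            int p = Partition(a, lo, hi);
            if (p == k - 1) return;    // the k-th largest sits at index k-1: done
            if (p < k - 1) lo = p + 1; // not enough large elements fixed yet, go right
            else hi = p - 1;           // pivot landed beyond k-1, go left
        }
    }

    // Lomuto-style partition, descending by count; returns the pivot's final index.
    static int Partition(List<KeyValuePair<string, int>> a, int lo, int hi)
    {
        int pivotIndex = Rand.Next(lo, hi + 1);
        (a[pivotIndex], a[hi]) = (a[hi], a[pivotIndex]);
        int pivot = a[hi].Value, store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (a[i].Value > pivot)
            {
                (a[i], a[store]) = (a[store], a[i]);
                store++;
            }
        }
        (a[store], a[hi]) = (a[hi], a[store]);
        return store;
    }
}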
