7

One of my friends was asked the following question in an interview. Can anyone tell me how to solve it?

We have a fairly large log file, about 5GB. Each line of the log file contains a URL that a user visited on our site. We want to figure out the 100 most popular URLs visited by our users. How would you do it?


3 Answers

7

If we have more than 10GB of RAM, just do it straightforwardly with a hashmap.

Otherwise, split it into several files using a hash function on the URL, so that every occurrence of a given URL lands in the same file. Then process each file and compute its top 5. With the per-file "top 5" lists it is easy to get the overall top 5, since each URL's complete count lives in exactly one file. A sketch of this is shown below.
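
A minimal C# sketch of this partition-and-merge idea, assuming each bucket file is small enough to count in memory; the class name, bucket file names, and parameters are illustrative and not part of the original answer:

// Sketch only: partition a large log into bucket files by URL hash,
// count each bucket independently, then merge the per-bucket top lists.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class HashPartitionTopUrls
{
    public static List<KeyValuePair<string, long>> Top(string logPath, int buckets, int k)
    {
        // Phase 1: split by hash so identical URLs always share a bucket file.
        var writers = Enumerable.Range(0, buckets)
            .Select(i => new StreamWriter($"bucket_{i}.tmp"))
            .ToArray();
        foreach (var url in File.ReadLines(logPath))
        {
            int b = (url.GetHashCode() & int.MaxValue) % buckets;
            writers[b].WriteLine(url);
        }
        foreach (var w in writers) w.Dispose();

        // Phase 2: count each bucket in memory and keep only its top-k candidates.
        var candidates = new List<KeyValuePair<string, long>>();
        for (int i = 0; i < buckets; i++)
        {
            var counts = new Dictionary<string, long>();
            foreach (var url in File.ReadLines($"bucket_{i}.tmp"))
                counts[url] = counts.TryGetValue(url, out var c) ? c + 1 : 1;
            candidates.AddRange(counts.OrderByDescending(p => p.Value).Take(k));
        }

        // Phase 3: the global top-k must be among the per-bucket top-k lists,
        // because each URL's full count lives in exactly one bucket.
        return candidates.OrderByDescending(p => p.Value).Take(k).ToList();
    }
}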

Another solution is to sort the file using any external sorting method and then scan it once, counting occurrences as you go (identical URLs are adjacent after sorting). During the scan you don't have to keep counts for everything: anything that doesn't make it into the top 5 can safely be thrown away. See the sketch below.
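
A hedged sketch of that sorted-scan idea, assuming the file has already been sorted externally (for example with the platform's sort utility) so identical URLs are adjacent, and assuming .NET 6+ for PriorityQueue; the file name is illustrative:

// Sketch only: scan a pre-sorted URL file once, tracking the current run
// and a min-heap of the best k candidates seen so far.
using System;
using System.Collections.Generic;
using System.IO;

static class SortedScanTopUrls
{
    public static List<(string Url, long Count)> Top(string sortedPath, int k)
    {
        var heap = new PriorityQueue<(string Url, long Count), long>(); // min-heap by count
        string current = null;
        long run = 0;

        void Offer(string url, long count)
        {
            heap.Enqueue((url, count), count);
            if (heap.Count > k) heap.Dequeue(); // discard the weakest candidate
        }

        foreach (var url in File.ReadLines(sortedPath))
        {
            if (url == current) { run++; continue; }
            if (current != null) Offer(current, run);
            current = url;
            run = 1;
        }
        if (current != null) Offer(current, run); // flush the last run

        var result = new List<(string, long)>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse(); // highest count first
        return result;
    }
}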


6 Comments

Your second alternative is the way to go. The divide-and-conquer approach is both easy to implement and you'll get linear time (if you are using hashmaps for counting in the smaller files). Sorting the file is really slow in comparison.
@EmilVikström, I agree with you. I wrote the third one because it was the first that came to mind. But in an interview it is often the case that the interviewer will construct constraints that make your previous solution invalid, to see if you can come up with anything else.
"You can safely throw anything that doesn't make it into the top 5 away." -> No, that's wrong.
@Thomash Can you give an example? Note that it's a sorted list of strings.
I'll give you an example in my answer as I don't have enough room here.
2

Just sort the log file by URL (this needs only constant extra space if you choose an in-place algorithm like heap sort or quicksort) and then count, for each URL, how many times it appears (easy, since lines with the same URL are next to each other).

Overall complexity is O(n log n).

Why splitting into many files and keeping only the top 3 (or top 5, or top N) of each file is wrong:

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 never makes it to the top 3 in the individual files but is the best overall.
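
A small, self-contained C# demonstration of this counterexample, with the counts from the table above hard-coded; it prints each file's top 3 (url7 never appears) and the overall winner (url7):

// Demonstrates the table above: url7 tops the combined counts,
// yet never appears in any single file's top 3.
using System;
using System.Collections.Generic;
using System.Linq;

static class TopNCounterexample
{
    public static void Main()
    {
        var files = new[]
        {
            new Dictionary<string, int> { ["url1"] = 5, ["url3"] = 5, ["url4"] = 5, ["url7"] = 4 },
            new Dictionary<string, int> { ["url2"] = 5, ["url3"] = 5, ["url5"] = 5, ["url7"] = 4 },
            new Dictionary<string, int> { ["url1"] = 5, ["url2"] = 5, ["url6"] = 5, ["url7"] = 4 },
        };

        // Per-file top 3: url7 is always ranked fourth, so it is dropped.
        foreach (var f in files)
            Console.WriteLine("file top 3: " +
                string.Join(", ", f.OrderByDescending(p => p.Value).Take(3).Select(p => p.Key)));

        // Global totals: url7 has 12 visits, more than any other URL.
        var totals = files.SelectMany(f => f)
                          .GroupBy(p => p.Key)
                          .ToDictionary(g => g.Key, g => g.Sum(p => p.Value));
        var best = totals.OrderByDescending(p => p.Value).First();
        Console.WriteLine($"overall best: {best.Key} ({best.Value})");
    }
}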

6 Comments

Constant space is still too much for a 5GB file.
No, 5GB can be stored in main memory without a problem. But if you only have N GB of RAM, you can split the log file into N GB blocks, apply my solution to each block, and then aggregate the results.
Can you explain more about your split and aggregation method?
Split: very simple, the first N bytes (not exactly N, because you don't want to cut a line in half) go to the first file, and you repeat the operation on the rest of the file.
Aggregate: if url1 was visited a times in file1 and b times in file2, aggregate to url1 visited a+b times. A sketch follows below.
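
A brief sketch of that aggregation step, assuming each block's per-URL counts fit in memory as a dictionary; merging just sums the counts per URL (the class and method names are illustrative):

// Sketch of the aggregation step: per-block count dictionaries are merged
// by summing counts per URL ("a" visits in block 1 + "b" in block 2 = a + b).
using System.Collections.Generic;

static class CountMerge
{
    public static Dictionary<string, long> Merge(IEnumerable<Dictionary<string, long>> blockCounts)
    {
        var total = new Dictionary<string, long>();
        foreach (var block in blockCounts)
            foreach (var p in block)
                total[p.Key] = total.TryGetValue(p.Key, out var c) ? c + p.Value : p.Value;
        return total;
    }
}
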
1

Because the log file is fairly large, you should read it with a stream reader rather than loading it all into memory. It is reasonable to expect that the set of distinct links fits in memory while we work through the log file.

// Pseudo
Hashmap map<url,count>
while(log file has nextline){
    url = nextline in logfile
    add url to map and update count
}

List list 
foreach(m in map){
    add m to list         
}

sort the list by count value
take top n from the list

The runtime is O(n) + O(m log m), where n is the number of lines in the log file and m is the number of distinct links found.

Here's a C# implementation of the pseudo-code. An actual file reader and a log file are not provided; a simple emulation that reads the log from an in-memory list is used instead.

The algorithm uses a hashmap to store the links it finds, and a sorting step finds the top 100 links afterwards. A simple data-container class is used for the sorting step.

The memory complexity depends on the number of distinct links: the hashmap must be able to hold all of the distinct links found, otherwise this algorithm won't work.

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
           stopwatch.ElapsedMilliseconds));
    }
}
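
Since an actual file reader is not provided above, here is a minimal hedged variation of the counting loop that streams a real log file line by line instead of using the in-memory list; the path and class name are assumptions:

// Sketch only: the same counting loop, but fed by a streaming reader so the
// 5GB log is never loaded into memory at once. The file path is illustrative.
using System.Collections.Generic;
using System.IO;

static class StreamedCount
{
    public static Dictionary<string, int> Count(string logPath)
    {
        var lookup = new Dictionary<string, int>();
        using (var reader = new StreamReader(logPath))
        {
            string url;
            while ((url = reader.ReadLine()) != null)   // one line per visited URL
            {
                if (lookup.TryGetValue(url, out var count))
                    lookup[url] = count + 1;
                else
                    lookup.Add(url, 1);
            }
        }
        return lookup; // sort and take the top 100 exactly as in the answer above
    }
}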

Edit: This answer has been updated based on the comments

  • added: running-time and memory-complexity analysis
  • added: pseudo-code
  • added: an explanation of how to handle a fairly large log file

5 Comments

This answer may show us that you know C#, but no one cares about that. The tags of the question are algorithm and data-structures, not C#.
It's not meant to demonstrate that I know C#; the intent was to show the algorithm through code.
I believe what @Thomash wants to point out is that you should present your answer concisely. You would prefer reading a short answer to long code, wouldn't you? By the way, the key phrase in the OP's question is "fairly large", but you clearly ignored it.
I have updated my answer. Hope my answer is more clear now, else let me know.
Not the best answer, since sorting all pages is unnecessary; sorting the top 100 is also not asked for. The fastest way to find the k smallest or k largest elements in a list is to use the binary partition step from quicksort and stop when your pivot lands at position k (i.e. quickselect). The average complexity is O(m). See also: stackoverflow.com/questions/251781/…
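
A hedged C# sketch of the quickselect approach mentioned in this comment, operating on (url, count) pairs; after the call, the first k entries hold the k most visited URLs without fully sorting the list:

// Sketch only: quickselect on (url, count) pairs, ordered by count descending.
// After Select(list, k), the first k entries of 'list' hold the k most visited
// URLs (in no particular order), found in O(m) time on average.
using System;
using System.Collections.Generic;

static class QuickSelectTopK
{
    static readonly Random Rand = new Random();

    public static void Select(List<KeyValuePair<string, int>> a, int k)
    {
        int lo = 0, hi = a.Count - 1;
        while (lo < hi)
        {
            int p = Partition(a, lo, hi);
            if (p == k - 1) return;    // the k-th largest sits at index k-1: done
            if (p < k - 1) lo = p + 1; // not enough large elements fixed yet, go right
            else hi = p - 1;           // pivot landed beyond k-1, go left
        }
    }

    // Lomuto-style partition, descending by count; returns the pivot's final index.
    static int Partition(List<KeyValuePair<string, int>> a, int lo, int hi)
    {
        int pivotIndex = Rand.Next(lo, hi + 1);
        (a[pivotIndex], a[hi]) = (a[hi], a[pivotIndex]);
        int pivot = a[hi].Value, store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (a[i].Value > pivot)
            {
                (a[i], a[store]) = (a[store], a[i]);
                store++;
            }
        }
        (a[store], a[hi]) = (a[hi], a[store]);
        return store;
    }
}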
