Pergunta de entrevista da empresa Microsoft

Propose an algorithm to remove duplicates in very large datset (which cannot be completely stored in memory for processing). Now, give scenarios in which your algorithm fails.

Respostas da entrevista

Sigiloso

25 de mai. de 2013

For i = 0 to 999,999 string currentVal = get the item at index i for j = i+1 to 999,999 if (j - i mod fileChunkSize == 0) load file chunk into array if data[j] == currentVal add currentVal to duplicateList and exit for

1

Sigiloso

25 de mai. de 2013

Divide the file into M chunks, each of which is large enough to be sorted in memory. Sort them in memory. For each set of two chunks, we will then carry out the last step of mergesort on two chunks to make one larger chunk (c_1 + c_2) (c_3 + c_4) .. (c_m-1 + c_m) Point at the first element on c_1 and c_2 on disk, and make a new file (we'll call it c_1+2). if c_1's pointed-to element is a smaller number than c_2's pointed-to element, copy it into c_1+2 and point to the next element of c_1. Otherwise, copy c_2's pointed element into and point to the next element of c_2. Repeat the previous step until both arrays are empty. You only need to use the space in memory needed to hold the two pointed-to numbers. During this process, if you encounter c_1 and c_2's pointed-to elements being equal, you have found a duplicate - you can copy it in twice and increment both pointers. The resulting m/2 arrays can be recursively merged in the same manner- it will take log(m) of these merge steps to generate the correct array. Each number will be compared against each other number in a way that will find the duplicates. Alternately, a quick and dirty solution as alluded to by @Evgeny Kluev is to make a bloom filter which is as large as you can reasonably fit in memory. You can then make a list of the index of each element which fails the bloom filter and loop through the file a second time in order to test these members for duplication.

2