Create an empty HashMap, H;
For each document, D, (i.e. file in an input directory):
    Create a HashMapVector, V, for D;
    For each (non-zero) token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurrence for T in D and add it to the occList in the TokenInfo for T;
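To make the data flow concrete, here is a minimal Java sketch of this indexing pass. It is an illustration, not the actual implementation: the classes DocumentReference, TokenOccurrence, TokenInfo, and Indexer, and their field names, are assumptions mirroring the structures named above, and a document's HashMapVector is approximated by a plain Map<String, Integer> of token counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketches of the structures named above; field names are assumptions.
class DocumentReference {
    String name;          // name of the document (e.g. its file name)
    double length = 0.0;  // vector length, filled in by a later pass
    DocumentReference(String name) { this.name = name; }
}

class TokenOccurrence {
    DocumentReference doc;  // the document the token occurs in
    int count;              // term frequency of the token in that document
    TokenOccurrence(DocumentReference doc, int count) { this.doc = doc; this.count = count; }
}

class TokenInfo {
    double idf = 0.0;                                   // set by the IDF pass
    List<TokenOccurrence> occList = new ArrayList<>();  // occurrence (postings) list
}

class Indexer {
    // H: the global token index built by the loop above.
    Map<String, TokenInfo> H = new HashMap<>();
    List<DocumentReference> docRefs = new ArrayList<>();

    // V stands in for the HashMapVector of one document: token -> count.
    void indexDocument(String docName, Map<String, Integer> V) {
        DocumentReference D = new DocumentReference(docName);
        docRefs.add(D);
        for (Map.Entry<String, Integer> e : V.entrySet()) {
            String T = e.getKey();
            int C = e.getValue();
            if (C == 0) continue;  // only index non-zero tokens
            // If T is not already in H, create an empty TokenInfo for it.
            TokenInfo info = H.computeIfAbsent(T, k -> new TokenInfo());
            // Record the occurrence of T in D on T's occurrence list.
            info.occList.add(new TokenOccurrence(D, C));
        }
    }
}
```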
Compute IDF for all tokens in H:
    Let N be the total number of documents;
    For each token, T, in H:
        Determine the total number of documents, M, in which T occurs (the length of T's occList);
        Set the IDF for T to log(N/M);
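Continuing the same hypothetical Indexer sketch, the IDF pass is a single loop over H. Math.log is the natural logarithm; the choice of base only rescales every weight by the same constant, so it does not affect the final ranking.

```java
// Added to the Indexer sketch above: set the IDF weight of every token in H.
void computeIDF() {
    double N = docRefs.size();           // N: total number of documents
    for (TokenInfo info : H.values()) {
        double M = info.occList.size();  // M: number of documents containing T
        info.idf = Math.log(N / M);      // IDF = log(N/M)
    }
}
```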
Compute vector lengths for all documents in H:
    Assume the lengths of all document vectors (stored in their DocumentReferences) are initialized to 0.0;
    For each token, T, in H:
        Let I be the IDF weight of T;
        For each TokenOccurrence of T in a document D:
            Let C be the count of T in D;
            Increment the length of D by (I*C)^2;
    For each document, D, referenced in H:
        Set the length of D to the square root of the currently stored length;
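The length pass, again as a method on the hypothetical Indexer, accumulates squared TF-IDF weights by walking the occurrence lists, then takes square roots, so each DocumentReference ends up holding the Euclidean length of its document vector.

```java
// Added to the Indexer sketch: accumulate squared weights, then take square roots.
void computeLengths() {
    for (TokenInfo info : H.values()) {
        double I = info.idf;                  // I: IDF weight of T
        for (TokenOccurrence occ : info.occList) {
            double w = I * occ.count;         // TF-IDF weight of T in D: I*C
            occ.doc.length += w * w;          // accumulate (I*C)^2
        }
    }
    for (DocumentReference D : docRefs) {
        D.length = Math.sqrt(D.length);       // Euclidean length of D's vector
    }
}
```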
Create a HashMapVector, Q, for the query;
Create an empty HashMap, R, to store retrieved documents with scores;
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurrences of T from H;
    For each TokenOccurrence, O, in L:
        Let D be the document of O, and C be the count of O (the tf of T in D);
        If D is not already in R (D was not previously retrieved), then add D to R and initialize its score to 0.0;
        Increment D's score by W * I * C; (the product of the weight of T in Q and the weight of T in D)
Compute the length, QL, of the vector Q (the square root of the sum of the squares of its weights);
For each retrieved document, D, in R:
    Let S be the current accumulated score of D; (S is the dot product of Q and D)
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D's final score to S / (QL * Y);
Sort the retrieved documents in R by final score and return the results in an array.
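Putting the retrieval step into the same sketch: the query is again approximated as a Map<String, Integer> of token counts, R is keyed by DocumentReference, and the result is returned as a sorted list rather than an array.

```java
// Added to the Indexer sketch: score documents for a query vector Q
// (token -> count), then normalize by vector lengths to get cosine scores.
List<Map.Entry<DocumentReference, Double>> retrieve(Map<String, Integer> Q) {
    Map<DocumentReference, Double> R = new HashMap<>();
    double qLenSq = 0.0;                  // running sum of squared query weights
    for (Map.Entry<String, Integer> e : Q.entrySet()) {
        TokenInfo info = H.get(e.getKey());
        if (info == null) continue;       // token occurs in no document
        double I = info.idf;
        double W = e.getValue() * I;      // weight of T in Q: W = K * I
        qLenSq += W * W;
        for (TokenOccurrence O : info.occList) {
            // Accumulate the dot product: (weight of T in Q) * (weight of T in D).
            R.merge(O.doc, W * I * O.count, Double::sum);
        }
    }
    double qLen = Math.sqrt(qLenSq);      // QL: length of the query vector
    for (Map.Entry<DocumentReference, Double> e : R.entrySet()) {
        // Cosine normalization: S / (QL * Y).
        e.setValue(e.getValue() / (qLen * e.getKey().length));
    }
    List<Map.Entry<DocumentReference, Double>> results = new ArrayList<>(R.entrySet());
    results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));  // best first
    return results;
}
```

Note that normalization only runs over documents actually placed in R, so the query length is never used as a divisor when no query token matches the index.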