CS2510: Project 2 - MiniGoogle Final Report

Guanxiong Ding (gud7)

12/17/2013

General Workflow

Indexing workflow

The client sends the documents to the mini-google server.
The mini-google server forks a worker process to handle each socket connection.
The worker looks up mappers and reducers on the name server.
The worker splits the file and sends the splits to the mappers.
Each mapper counts the words in its split.
Each mapper partitions its word counts and sends the partitions to the reducers.
Each reducer aggregates the results and appends them to the master index files.

Querying workflow

The client sends keywords to the mini-google server.
The mini-google server forks a worker process to handle each socket connection.
The worker looks up mappers and reducers on the name server.
The worker sends each keyword to a mapper.
Each mapper retrieves the set of (document, occurrence count) pairs for its keyword.
Each mapper sends its result set to the reducer.
The reducer gathers the sets from all mappers and calculates the total occurrences per document.
The reducer sorts the documents by total occurrences and sends the result back to the client.
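
The last two steps of the querying workflow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's code: the function name merge_query_results and the in-memory lists of pairs are hypothetical, whereas the real system passes these sets over sockets.

```python
from collections import Counter

def merge_query_results(mapper_results):
    """Sum per-document occurrences across mappers and rank the documents."""
    totals = Counter()
    for pairs in mapper_results:            # one list of (document, count) per mapper
        for document, count in pairs:
            totals[document] += count
    # Highest total first; ties are broken by document name for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

For example, if one mapper reports [("a.txt", 2), ("b.txt", 5)] and another reports [("a.txt", 4)], a.txt is ranked first with a total of 6, followed by b.txt with 5.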

Client

The client is a program that runs on the user's local machine and is started from the command line with parameters. It first queries the name server for the address of the mini-google server, then opens a socket connection to it, sends an index or query request over that connection, and waits for the response.

Mini-google

The mini-google server first creates a socket and binds it to its address and port number, then listens for connections in an infinite loop. When it receives a request from a client, it creates a new process to handle the request according to its type (index or query). The server then goes back to listening for further requests.
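
The accept-and-fork loop described above could look roughly like the following Python sketch. Several details are assumptions made only for illustration: the port number 5000, a wire format in which the first word of the request is "index" or "query", and the placeholder handlers handle_index and handle_query. It relies on os.fork and therefore runs on Unix-like systems only.

```python
import os
import signal
import socket

def handle_index(conn):
    conn.sendall(b"index request accepted\n")   # placeholder for the indexing worker

def handle_query(conn):
    conn.sendall(b"query request accepted\n")   # placeholder for the querying worker

HANDLERS = {"index": handle_index, "query": handle_query}

def pick_handler(first_line):
    """Map a request starting with 'index' or 'query' to its handler, else None."""
    words = first_line.strip().split(maxsplit=1)
    return HANDLERS.get(words[0].lower()) if words else None

def serve(host="0.0.0.0", port=5000):
    signal.signal(signal.SIGCHLD, signal.SIG_IGN)   # finished children are reaped automatically
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind((host, port))
        listener.listen()
        while True:
            conn, _addr = listener.accept()
            if os.fork() == 0:                      # child: serve this one client, then exit
                listener.close()
                handler = pick_handler(conn.recv(1024).decode(errors="replace"))
                if handler:
                    handler(conn)
                conn.close()
                os._exit(0)
            conn.close()                            # parent: drop its copy and keep accepting
```

The parent closes its copy of the connected socket immediately, so only the child process holds the client connection while the parent returns to accept().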

Worker

The worker is the central component of the system. It holds the connection to the client and sends the result back once all mappers and reducers have finished their work. It splits the file into as many pieces as the number of available mappers it obtained from the name server. After splitting, it sends one split to each mapper together with the following parameters:

split file name
document name
number of mappers (lets each reducer know when to start reducing)
addresses and ports of all reducers

It then waits for the mappers' responses, holding one connection per mapper on separate threads. When all mappers report that their work is done, it sends a message to the client and closes the connection.
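
As an illustration of the splitting step and the parameter set listed above, here is a minimal Python sketch. It assumes the file is split by lines into contiguous chunks of nearly equal size; the report does not state the actual split rule, and the names split_lines and build_tasks as well as the ".partN" file naming are hypothetical.

```python
def split_lines(lines, num_mappers):
    """Divide lines into num_mappers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), num_mappers)
    chunks, start = [], 0
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)   # the first `extra` chunks get one more line
        chunks.append(lines[start:start + size])
        start += size
    return chunks

def build_tasks(document, lines, reducers, num_mappers):
    """Pair each split with the parameters the worker sends to a mapper."""
    return [
        {
            "split_file": f"{document}.part{i}",   # hypothetical naming scheme
            "document": document,
            "num_mappers": num_mappers,            # forwarded so reducers know how many inputs to expect
            "reducers": reducers,                  # list of (address, port) pairs
            "lines": chunk,
        }
        for i, chunk in enumerate(split_lines(lines, num_mappers))
    ]
```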

Mapper helper

When it starts running, and before doing any work, the mapper registers with the name server, giving its address, port, and type, which is “mapper”.

The mapper first counts the words in its split and combines the counts, using a shell script built on tr, sed, awk, uniq, and sort. It then partitions the map result by the number of reducers. Partitioning is alphabetical rather than by line: if there are 2 reducers, the result is partitioned into two files, one with terms starting with [a-m] and one with terms starting with [n-z]. This corresponds to the shuffle step. The mapper then sends one partition to each reducer and waits for their responses, holding one connection per reducer on separate threads. When all reducers report that their work is done, it sends a message to the worker and closes the connection.
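
The counting and alphabetical partitioning can be illustrated with a short Python sketch. Note that the project does this with a shell pipeline, so the code below is a stand-in rather than the actual implementation. The tokenization rule (lowercased runs of ASCII letters) and the function names are assumptions.

```python
import re
import string
from collections import Counter

def count_words(text):
    """Lowercase the text and count runs of ASCII letters."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def partition_index(word, num_reducers):
    """Map a word's first letter to a reducer so the alphabet is cut into contiguous ranges."""
    return string.ascii_lowercase.index(word[0]) * num_reducers // 26

def partition_counts(counts, num_reducers):
    """Split the word counts into one dictionary per reducer."""
    parts = [dict() for _ in range(num_reducers)]
    for word, n in counts.items():
        parts[partition_index(word, num_reducers)][word] = n
    return parts
```

With 2 reducers, partition_index sends terms starting with a to m to reducer 0 and terms starting with n to z to reducer 1, which matches the example in the text.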

Reducer helper

When it starts running, and before doing any work, the reducer registers with the name server, giving its address, port, and type, which is “reducer”.

The reducer first receives the number of mappers, which tells it how many mappers it has to wait for. After it has received the data from all mappers, it aggregates the mapper results using a shell script with awk. It then starts 5 threads to merge the result into the master index. When all threads have finished, the reducer sends a message back to each mapper that its work is done.
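
The "wait for all mappers, then aggregate" behaviour can be sketched in Python as below. The project performs the aggregation with awk, so this is only an illustration; the class name Reducer, the receive method, and the use of in-memory dictionaries in place of socket transfers are assumptions.

```python
from collections import Counter

class Reducer:
    def __init__(self, num_mappers):
        self.num_mappers = num_mappers   # how many partitions to expect
        self.received = []

    def receive(self, partition):
        """Store one mapper's partition; return the totals once every mapper has reported, else None."""
        self.received.append(partition)
        if len(self.received) < self.num_mappers:
            return None                  # still waiting for other mappers
        totals = Counter()
        for part in self.received:
            totals.update(part)          # adds counts for terms seen by several mappers
        return dict(totals)
```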

Name Server

The name server establishes a socket and waits for connections. It supports two types of request: register and lookup. A register request stores an address, port, and server type. A lookup request specifies a server type and a number, and returns a set of (address, port, server type) entries; for example, get 5 mappers or get 3 reducers.
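
The two request types can be modelled with a small in-memory registry. The sketch below leaves out the socket layer, and the class and method names are hypothetical. The report does not say what happens when fewer servers are registered than requested; this sketch assumes the lookup then returns as many as are available.

```python
from collections import defaultdict

class Registry:
    def __init__(self):
        self.servers = defaultdict(list)     # server type -> list of entries

    def register(self, address, port, server_type):
        """Store a server entry, ignoring exact duplicates."""
        entry = (address, port, server_type)
        if entry not in self.servers[server_type]:
            self.servers[server_type].append(entry)

    def lookup(self, server_type, count):
        """Return up to `count` registered servers of the given type."""
        return self.servers.get(server_type, [])[:count]
```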

Master index

In this project, the master index is split alphabetically into 26 files, one for each letter. The advantage is that indexing and querying can be parallelized more easily, with less concern about processes blocking each other on a single index file.
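
A minimal Python sketch of merging aggregated counts into per-letter index files with a pool of 5 threads, as mentioned in the Reducer helper section, is given below. The ".idx" file extension, the tab-separated line format, and the function names are assumptions. Locking between separate reducer processes that append to the same letter file is not covered here.

```python
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def append_letter(index_dir, letter, rows):
    """Append 'term<TAB>document<TAB>count' lines to the file for one letter."""
    path = os.path.join(index_dir, f"{letter}.idx")
    with open(path, "a", encoding="utf-8") as f:
        for term, document, count in rows:
            f.write(f"{term}\t{document}\t{count}\n")

def merge_into_index(index_dir, document, totals, workers=5):
    """Group terms by first letter and append each group to its letter file."""
    by_letter = defaultdict(list)
    for term, count in sorted(totals.items()):
        by_letter[term[0]].append((term, document, count))
    # Each letter goes to exactly one task, so threads in this process never share a file.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(append_letter, index_dir, letter, rows)
                   for letter, rows in by_letter.items()]
        for future in futures:
            future.result()              # re-raise any write error

# Usage: write a small index into a temporary directory.
demo_dir = tempfile.mkdtemp()
merge_into_index(demo_dir, "doc1", {"cat": 3, "cow": 1, "dog": 2})
```

Because terms with different first letters land in different files, a query for one keyword only needs to read the file for that keyword's first letter.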