Given a person's name, we can search for it on Google and get a huge number of results, but these results are usually not all about the same person, because many people share the same name. Identifying which web pages are really about the person you are looking for is tedious and difficult.

The goal of the project is to classify these results into groups using machine learning techniques such as Naive Bayes, nearest-neighbor classification, decision trees, and the subspace method. Groups could include Athlete, Professor, Businessman, Office Worker, Government Employee, Actor, Dancer, Singer, etc.
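
As an illustration of the classification idea, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the project's implementation (the learners are planned to be built with Weka); the class name, the training sentences, and the labels below are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()                       # all words seen in training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training snippets, one per group.
nb = NaiveBayes()
nb.train("professor of computer science at the university, research and teaching", "Professor")
nb.train("actor starred in the film and television drama", "Actor")
nb.train("singer released a new album and concert tour", "Singer")

print(nb.classify("John Smith teaches computer science and publishes research"))
```

In a real run each search result would be labeled with the most probable group, and results sharing a label would be shown together.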

Download Project Proposal
 

1. Google results parsing
2. Category Identification
3. Dataset 1 Extraction
4. Dataset 2 Extraction
5. Vocabulary Identification
6. Learner implementation using Weka
7. Web Interface design
8. Integration and testing

Download Job Assignment and Description

 

The project status is updated every Sunday night.
Sunday April 30, 2006
Project Demo:

http://www.haibozhao.com:8080/mlProject

I am not very happy with the speed, but the accuracy seems fine. A few bugs may remain; please email me if you find any or have any suggestions.

Sunday April 16, 2006
Web Interface design
Framework of web programming part
Weka API learning

Working on vocabulary identification; progress will be updated on Wednesday.

Sunday April 09, 2006
Project launched on March 29, 2006
Project proposal and plan
Google results parsing (parse search result elements)
Conversion from cached web pages to plain text
Tokenization of plain text
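
The tokenization step could be sketched roughly as follows. This is only an assumed approach (lowercasing, splitting on non-alphanumeric characters, dropping a small stopword list); the stopword list and the function names are hypothetical and not taken from the project code.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "at", "to"})


def tokenize(text, stopwords=STOPWORDS):
    """Split plain text into lowercase tokens, dropping stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stopwords]


def term_counts(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


print(tokenize("The Actor starred in the film."))
print(term_counts("film film actor"))
```

Term counts like these are the kind of input a vocabulary identification step could work from, for example by keeping only the most frequent terms per group.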

Data Set 2 Extraction: I have managed to retrieve the Google dataset (1.4). As for the directory, I temporarily retrieve web pages through web queries (HTTP requests). I have tested this on actor, lawyer, and scientist, and it went smoothly.

html2text (http://userpage.fu-berlin.de/~mbayer/tools/html2text.html) is used to extract text from HTML files
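
For readers who want to see what the HTML-to-text step involves, here is a small stand-in written with Python's standard `html.parser` module. It is not the html2text tool and does not reproduce its output format; the class and function names are made up for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep block-level elements separated

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>John Smith is an <b>actor</b>.</p><p>Born 1970.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(page))
```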

 

 

   

Haibo Zhao

zhaohb@uga.edu

Haibao Tang bao@uga.edu
Wanna join the project? Do not hesitate to email us.
   
http://www.haibozhao.com:8080/mlProject
Copyright (c) 2006, All rights reserved.