Analyzing the Titanic Disaster Using Machine Learning Algorithms and the Hadoop Big Data Framework

ABSTRACT:
The Titanic disaster occurred more than 100 years ago, on April 15, 1912, killing about 1500 passengers and crew members. The fateful incident still compels researchers and analysts to understand what led to the survival of some passengers and the demise of others. Using Hadoop big data methods and a dataset consisting of 891 rows in the training set and 418 rows in the test set, this research attempts to determine the correlation between factors such as age, sex, passenger class, and fare and a passenger's chance of survival. These factors may or may not have affected the survival rates of the passengers. In this research paper, Hadoop MapReduce and HDFS, together with a Decision Tree classifier, have been implemented to predict the survival of passengers. In particular, this work compares the approaches on the basis of their percentage accuracy on a test dataset.

EXISTING SYSTEM:

Eric Lam and Tang used the Titanic problem to compare and contrast three algorithms: Naive Bayes, decision tree analysis, and SVM. They concluded that sex was the most dominant feature in accurately predicting survival. They also suggested that choosing important features is essential for obtaining better results. They found no significant differences in accuracy between the three methods.
Another study performed decision tree classification and cluster analysis and suggested that sex is the most important feature, compared to the other features, in determining the likelihood of survival of the passengers.
Its most important conclusion is that using more features in the models does not necessarily improve the results.

Another author analyzed the direct relationship of social norms and sex with survival. He concluded that on the Titanic, the survival rate of women was more than three times higher than that of men, and that people in their prime age died less often than older people. Passengers with high financial standing, traveling in first class, were better able to save themselves, as were passengers in second class compared to those in third class.

A further author suggested that human behavior also determined the survival rates of the passengers. He also mentioned that there were too few lifeboats and that most of them were not filled to capacity.

DISADVANTAGE:

The existing systems classify passengers by gender, but they do not compute the average age of male and female passengers.
Survival accuracy was stated without any final accuracy calculation.
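
For reference, the percentage accuracy mentioned here is the share of test passengers whose predicted survival label matches the actual label. The following is only a minimal sketch in plain Java; the class and method names (AccuracySketch, accuracyPercent) are hypothetical and are not taken from any of the existing systems.

```java
// Illustrative sketch: percentage accuracy of predicted survival labels.
// The class and method names are hypothetical.
class AccuracySketch {
    // Returns the share of positions where the labels match, in the range 0 to 100.
    static double accuracyPercent(int[] actual, int[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException(
                "label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;  // prediction agrees with the actual outcome
            }
        }
        return 100.0 * correct / actual.length;
    }
}
```

Reporting this single figure for every algorithm on the same test set would make the results directly comparable.
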
PROPOSED WORK:

There have been as many inquiries as there have been questions raised, and equally many types of analysis methods applied to arrive at conclusions. This project, however, is not about analyzing why the Titanic sank; it is about analyzing the publicly available data about the Titanic. It uses Hadoop MapReduce to determine:

The average age of the people (both male and female) who died in the tragedy.
The number of people who survived, by traveling class.
Dataset Description: PassengerID, Survived (0 = died, 1 = survived), Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Using that dataset, we will perform some analysis and draw out insights such as the average age of the males and females who died on the Titanic and the number of males and females who died in each compartment (passenger class).
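
The two analyses listed above can be sketched in plain Java as a map step followed by a reduce step. This is a minimal illustration rather than the project's actual Hadoop job: it uses only the standard library instead of the Hadoop Mapper and Reducer classes, and the class and method names (TitanicJobs, averageAgeOfDeceasedBySex, survivorsByClass) are hypothetical. It assumes comma-separated records in the column order of the dataset description, and the Kaggle encoding in which Survived is 0 for a passenger who died and 1 for one who survived.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the two analyses; no Hadoop classes are used.
class TitanicJobs {
    // Assumed 0-based column positions, following the dataset description.
    static final int SURVIVED = 1, PCLASS = 2, SEX = 4, AGE = 5;

    // Splits one CSV record, ignoring commas inside quoted fields such as names.
    static String[] parse(String line) {
        return line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
    }

    // Analysis 1: average age of the passengers who died, keyed by sex.
    static Map<String, Double> averageAgeOfDeceasedBySex(List<String> lines) {
        // "Map" step: emit (sex, age) for each deceased passenger with a known age.
        // The header row is skipped because its Survived field is not "0".
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] f = parse(line);
            if (f.length <= AGE || !f[SURVIVED].equals("0") || f[AGE].isEmpty()) {
                continue;
            }
            grouped.computeIfAbsent(f[SEX], k -> new ArrayList<>())
                   .add(Double.parseDouble(f[AGE]));
        }
        // "Reduce" step: average the ages collected for each key.
        Map<String, Double> result = new TreeMap<>();
        grouped.forEach((sex, ages) -> result.put(sex,
            ages.stream().mapToDouble(Double::doubleValue).average().orElse(0.0)));
        return result;
    }

    // Analysis 2: number of survivors, keyed by passenger class.
    static Map<String, Integer> survivorsByClass(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = parse(line);
            if (f.length <= PCLASS || !f[SURVIVED].equals("1")) {
                continue;
            }
            // The map step emits (class, 1); the reduce step sums the ones.
            counts.merge(f[PCLASS], 1, Integer::sum);
        }
        return counts;
    }
}
```

In a real Hadoop job, the grouping by key would be done by the framework's shuffle phase between the mapper and the reducer; here a TreeMap stands in for it.
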

ADVANTAGE:
More detailed classification is performed to analyze survival in this disaster.
Closer classification is performed to obtain the accuracy, which helps to predict more about this disaster.

SYSTEM ARCHITECTURE:


SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:

System : Intel i3
Hard Disk : 500 GB
Mouse : Logitech
RAM : 4 GB
Operating system : 64-bit

SOFTWARE REQUIREMENTS:

Operating system : Linux
Coding Language : Java
Database : HDFS
Tool : MapReduce

REFERENCES 
Kaggle.com, 'Titanic: Machine Learning from Disaster', [Online]. Available: http://www.kaggle.com/. [Accessed: 10-Feb-2017].