Using Hashtag Graph-Based Topic Model to Connect Semantically-Related Words Without Co-Occurrence in Microblogs (.NET)


ABSTRACT:

Mining topics on Twitter is attracting increasing attention. However, the shortness and informality of tweets lead to extremely sparse vector representations over a large vocabulary, so conventional topic models (e.g., Latent Dirichlet Allocation) often fail to recover high-quality underlying topics. Fortunately, tweets often carry rich user-generated hashtags that act as keywords. In this project, we propose a novel topic model for such semi-structured tweets, the Hashtag Graph-based Topic Model (HGTM). By exploiting the relations between hashtags in a hashtag graph, HGTM establishes semantic relations between words even if they have never co-occurred within a single tweet. In addition, the latent variables (topics) modeled by HGTM strengthen the dependencies among multiple words and hashtags. We show that user-contributed hashtags can serve as weakly supervised information for topic modeling, and that hashtag relations can reveal semantic relations between tweets. Experiments on a real-world Twitter data set show that our model discovers more distinct and coherent topics than state-of-the-art baselines and copes well with the sparseness and noise in tweets.

EXISTING SYSTEM:

Many powerful topic models for document analysis have been proposed, such as LSA, PLSI, and LDA. 
Two kinds of methods have been proposed to tackle the severe sparseness and noise in tweets. One is to aggregate tweets into a single large document.
For example, Hong et al. aggregated tweets that share the same user, the same word, or the same hashtag, and Mehrotra et al. investigated different pooling schemes for LDA.
Hu et al. organized tweets by transforming them into a semantic structure tree using term relationships defined in Wikipedia and WordNet.
Besides content mining, a few works have used semi-structured information (hashtags or labels) for tweet modeling. Labeled LDA was introduced to control the relationship between tweets via manually defined supervision labels.
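
The hashtag-based aggregation mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity and is not taken from the cited works or from the project's code; the function name pool_by_hashtag, the regular expression, and the sample tweets are assumptions made for illustration only.

```python
from collections import defaultdict
import re

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def pool_by_hashtag(tweets):
    """Merge tweets into one pseudo-document per hashtag (hypothetical helper)."""
    pools = defaultdict(list)
    for text in tweets:
        # Use a set so a hashtag repeated within one tweet is counted once.
        tags = {t.lower() for t in HASHTAG.findall(text)}
        for tag in tags:
            pools[tag].append(text)
    # Each pooled document is the concatenation of the tweets sharing that tag.
    return {tag: " ".join(texts) for tag, texts in pools.items()}

# Made-up sample data.
tweets = [
    "Great keynote today #ai",
    "New GPU benchmarks #ai #hardware",
    "Laptop review #hardware",
]
pooled = pool_by_hashtag(tweets)
```

Each pooled pseudo-document is longer than any single tweet, which is why pooling is used to reduce sparseness before running a conventional topic model.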

DISADVANTAGES:

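Existing approaches cannot reliably identify semantically related words for a tweet, in particular words that never co-occur within the same tweet.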

PROPOSED SYSTEM:

We propose a novel topic model for such semi-structured tweets, the Hashtag Graph-based Topic Model (HGTM).
By exploiting the relations between hashtags in a hashtag graph, HGTM establishes semantic relations between words even if they have never co-occurred within a single tweet.
In addition, the latent variables (topics) modeled by HGTM strengthen the dependencies among multiple words and hashtags.
We show that user-contributed hashtags can serve as weakly supervised information for topic modeling, and that hashtag relations can reveal semantic relations between tweets.
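
The idea of linking words through related hashtags can be sketched as follows. This is only an illustration of the hashtag graph intuition, not HGTM's probabilistic model: it assumes that two hashtags are related when they appear in the same tweet, and all function names and sample tweets are hypothetical. Python is used for brevity rather than the project's C#.

```python
from collections import defaultdict
from itertools import combinations
import re

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def parse(tweet):
    """Split a tweet into its plain words and its hashtags (both lowercased)."""
    tags = {t.lower() for t in HASHTAG.findall(tweet)}
    words = {w.lower() for w in HASHTAG.sub(" ", tweet).split()}
    return words, tags

def build_hashtag_graph(tweets):
    """Weighted graph: edge weight = number of tweets containing both hashtags."""
    graph = defaultdict(lambda: defaultdict(int))
    for tweet in tweets:
        _, tags = parse(tweet)
        for a, b in combinations(sorted(tags), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def words_by_hashtag(tweets):
    """Map each hashtag to the set of words seen in tweets carrying it."""
    index = defaultdict(set)
    for tweet in tweets:
        words, tags = parse(tweet)
        for tag in tags:
            index[tag] |= words
    return index

def related_words(word, tweets):
    """Words reachable via neighbouring hashtags that never share a tweet with `word`."""
    graph = build_hashtag_graph(tweets)
    index = words_by_hashtag(tweets)
    co_occurring = set()
    for tweet in tweets:
        words, _ = parse(tweet)
        if word in words:
            co_occurring |= words
    own_tags = {tag for tag, ws in index.items() if word in ws}
    related = set()
    for tag in own_tags:
        for neighbour in graph[tag]:
            related |= index[neighbour]
    return related - co_occurring

# Made-up sample data: "election" and "candidates" never share a tweet.
tweets = [
    "election results tonight #vote",
    "watching live #vote #debate",
    "candidates clash onstage #debate",
]
linked = related_words("election", tweets)
```

In this sample, "election" appears only under #vote and "candidates" only under #debate, yet the two words become connected because #vote and #debate are linked in the hashtag graph by the second tweet.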

ADVANTAGES:

1) Captures semantic relationships between words with the help of hashtags.
2) Provides an effective way to discover more distinct and coherent topics.
3) Copes well with the sparseness and noise in tweets.

SYSTEM ARCHITECTURE:

SYSTEM SPECIFICATION:
Hardware Requirements: 

System                  :   Pentium IV 2.4 GHz
Hard Disk               :   40 GB
Monitor                 :   14-inch Colour Monitor
Mouse                   :   Optical Mouse

Software Requirements: 

Operating System        :   Windows 7
Coding Language         :   ASP.NET with C# (Service Pack 1)
Database                :   SQL Server 2008


