Big Data NLP (MSc Data Science)

Analysis and forecast of Twitter sentiment in Dublin surrounding vaccination for the years 2020 and 2021.

Domain: Distributed big data temporal NLP sentiment analysis.

Description

In this continuous assessment, you are required to identify and carry out an analysis of a large dataset gleaned from the Twitter API. Instructions for accessing the data can be found here:
https://datascienceparichay.com/article/get-data-from-twitter-api-in-python-step-by-step-guide/
https://www.toptal.com/apache/apache-spark-streaming-twitter
Alternatively, you may use the data held here:
https://archive.org/details/twitterstream?sort=-publicdate

You must collect at least one year’s tweets on a topic and store this data as requested below. You are then required to analyse any change in sentiment that occurs over the time period you have selected.
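
As an illustration of tracking sentiment over time, the following is a minimal sketch that scores tweets and averages the scores per ISO week. The lexicon, the `score` and `weekly_sentiment` functions, and the `(timestamp, text)` input shape are all assumptions made for this example; a real submission would use a proper sentiment model and run the aggregation in the distributed environment.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical toy lexicon, for illustration only.
LEXICON = {
    "good": 1.0, "safe": 1.0, "effective": 1.0,
    "bad": -1.0, "scared": -1.0, "dangerous": -1.0,
}


def score(text):
    """Mean lexicon score of the words in a tweet; 0.0 if no word matches."""
    hits = [LEXICON[word] for word in text.lower().split() if word in LEXICON]
    return mean(hits) if hits else 0.0


def weekly_sentiment(tweets):
    """Average tweet scores per (ISO year, ISO week).

    `tweets` is an iterable of (iso_timestamp, text) pairs (assumed shape).
    """
    buckets = defaultdict(list)
    for timestamp, text in tweets:
        iso = datetime.fromisoformat(timestamp).isocalendar()
        buckets[(iso[0], iso[1])].append(score(text))
    return {week: mean(values) for week, values in sorted(buckets.items())}


if __name__ == "__main__":
    sample = [
        ("2021-01-04T10:00:00", "vaccine is safe and effective"),
        ("2021-01-05T12:00:00", "feeling scared"),
        ("2021-01-12T09:00:00", "good news today"),
    ]
    for week, value in weekly_sentiment(sample).items():
        print(week, value)
```

Grouping by ISO week gives evenly spaced points, which makes the series easier to compare across the year and to feed into a forecasting step.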

Following your analysis, you are then required to forecast the sentiment 1 week, 1 month and 3 months ahead. This forecast must be displayed as a dynamic dashboard.
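
To make the three forecast horizons concrete, here is a rough sketch that fits a straight-line trend to a weekly sentiment series and extrapolates it. The linear trend is a deliberately simple stand-in, not a recommended forecasting method, and the names `fit_trend`, `forecast` and `HORIZONS`, the mapping of 1 month to 4 weeks and 3 months to 13 weeks, and the assumption that scores lie in [-1, 1] are all choices made for this example.

```python
# Horizons expressed in weekly steps (assumed: 1 month ~ 4 weeks, 3 months ~ 13 weeks).
HORIZONS = {"1 week": 1, "1 month": 4, "3 months": 13}


def fit_trend(series):
    """Ordinary least-squares line through (0, y0), (1, y1), ...; needs >= 2 points."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def forecast(series, horizons=HORIZONS):
    """Extrapolate the fitted line and clip to the assumed score range [-1, 1]."""
    slope, intercept = fit_trend(series)
    last_index = len(series) - 1
    return {
        label: max(-1.0, min(1.0, intercept + slope * (last_index + steps)))
        for label, steps in horizons.items()
    }


if __name__ == "__main__":
    weekly_scores = [0.0, 0.1, 0.2, 0.3]
    for label, value in forecast(weekly_scores).items():
        print(f"{label}: {value:.2f}")
```

A dashboard would typically plot the historical series together with these three projected points and refresh them as new data arrives.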

You may analyse any topic you wish EXCEPT cryptocurrency and financial or commodity stocks.

Your project must incorporate the following elements:

Utilisation of a distributed data processing environment (e.g., Hadoop MapReduce or Spark) for some part of the analysis.
Source dataset(s) can be stored in an appropriate SQL/NoSQL database(s) prior to processing by MapReduce/Spark (HBase, Hive, Spark SQL, Cassandra, MongoDB, etc.). The data can be loaded into the NoSQL database using an appropriate tool (Hadoop, Spark, etc.).
Post-MapReduce processing dataset(s) can be stored in an appropriate NoSQL database(s) (follow a similar choice as in the previous step).
Store the data and then carry out follow-up analysis on the output data. It can be extracted from the NoSQL database into another format using an appropriate tool, if necessary (e.g. extract to CSV for import into R, Python, etc.).
Devise and implement a test strategy to perform a comparative analysis of the performance of any two databases (MySQL, MongoDB, Cassandra, HBase and CouchDB).
You should record a set of appropriate metrics and perform a quantitative analysis for comparison purposes between the two chosen database systems.
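
One way to approach the database comparison is a small timing harness that runs the same workload against each system and records summary metrics. The sketch below is an assumption-laden outline: `benchmark` and `compare` are hypothetical helpers, and the two lambdas are placeholders standing in for real operations such as a bulk insert or a range query issued through each database's client.

```python
import statistics
import time


def benchmark(operation, repeats=5):
    """Time a zero-argument callable `repeats` times; return metrics in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def compare(workloads, repeats=5):
    """Run the same benchmark for each labelled workload callable."""
    return {name: benchmark(op, repeats) for name, op in workloads.items()}


if __name__ == "__main__":
    # Placeholder workloads; replace with real reads/writes against each database.
    results = compare(
        {
            "database_a": lambda: sum(range(100_000)),
            "database_b": lambda: sorted(range(100_000), reverse=True),
        },
        repeats=5,
    )
    for name, metrics in results.items():
        print(name, metrics)
```

Repeating each operation and reporting the median alongside the mean reduces the influence of one-off effects such as cold caches. The same harness can be reused for different workload types (inserts, reads, updates) and data volumes.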

Source Code

Report

Tacitus