Tweeter Temp Check #2 - Taking Data to the Cleaners
Taking Data to the Cleaners by Ashur Baroutta
As discussed in our previous submission, we are following up on our installation and setup process by starting to receive and clean our data. Because we import 100 to 1,000 tweets at a time, we need to remove unwanted elements from the text to get a better-quality analysis of the tweets' content.
It is important to note why this is necessary. The aim of sentiment analysis is to mine the subjective textual information in a post and use natural language processing and machine learning libraries and functions to rank the post's sentiment. Because we are working with Twitter posts and replies, we clean the mined data of symbols and characters that we don't want affecting our analysis; the hashtag symbol is one example. We use our imported library's sub function to substitute certain characters and symbols with empty space, effectively removing them, so that we can rank sentiment more accurately. The following is the code we wrote to clean our mined posts of data we find unnecessary.
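As an illustration of this kind of cleaning (a sketch, not necessarily the exact code used in the project), the example below assumes the imported library is Python's built-in re module and uses re.sub to strip a few common tweet elements. The function name clean_tweet and the specific patterns (links, @mentions, the hashtag symbol) are assumptions chosen for the example.

```python
import re


def clean_tweet(text: str) -> str:
    """Remove elements we do not want to influence sentiment scoring.

    The patterns below are illustrative assumptions, not a fixed list.
    """
    text = re.sub(r"https?://\S+", "", text)  # drop links
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = re.sub(r"#", "", text)             # drop the '#' symbol, keep the word
    text = re.sub(r"\s+", " ", text)          # collapse leftover whitespace
    return text.strip()


sample = "@user Loving the #weather today! https://t.co/abc"
print(clean_tweet(sample))
```

Removing only the hashtag symbol, rather than the whole tag, keeps the tagged word available for sentiment scoring.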
After cleaning our data, we can start analyzing each post and assessing the outputs. Admittedly, the information we are mining is subjective, and so our results are subjective as well. We would hesitate to say the results absolutely depict the true sentiment or intent of a user's post, though the insights have proven valuable. Our imported libraries provide computational methods that use ready-made algorithms to rank sentiment for us; these were built through a collaboration of scientists in multiple fields. After writing our function to clean out the data we don't want, we want to format our output into something readable. The following is the code written for setting up our dataframe and integrating the subjectivity and polarity methods (how opinion-based a post is, and whether its sentiment is negative, neutral, or positive) from the TextBlob library.
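A minimal sketch of how such a dataframe might be set up, assuming pandas and TextBlob are installed. TextBlob reports polarity in the range -1.0 to 1.0 and subjectivity in the range 0.0 to 1.0. The column names, the function names, and the rule that maps a polarity score to a Negative/Neutral/Positive label are assumptions made for this example, not the project's confirmed choices.

```python
def polarity_label(polarity: float) -> str:
    """Map a polarity score to a readable label (thresholds are an assumption)."""
    if polarity < 0:
        return "Negative"
    if polarity == 0:
        return "Neutral"
    return "Positive"


def build_sentiment_frame(tweets):
    """Return a DataFrame with one row per tweet and its sentiment scores."""
    # Third-party imports are kept local so polarity_label can be used
    # and tested on its own without pandas or TextBlob installed.
    import pandas as pd
    from textblob import TextBlob

    df = pd.DataFrame({"Tweets": list(tweets)})  # hypothetical column name
    df["Subjectivity"] = df["Tweets"].apply(
        lambda t: TextBlob(t).sentiment.subjectivity
    )
    df["Polarity"] = df["Tweets"].apply(
        lambda t: TextBlob(t).sentiment.polarity
    )
    df["Analysis"] = df["Polarity"].apply(polarity_label)
    return df
```

Calling build_sentiment_frame on a list of already-cleaned tweet strings would give a table with the text, both scores, and a label per row, which is easier to scan than raw numbers.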
Now that we have our desired output format, we can begin to analyze different outputs in our next submission.
Ashur,
I love this project and completely understand why you're keeping it organized the way you are, just keep in mind you're on the hook for ~8-9 hours per week. 😉
Thanks. : )