Part 1: Beginner's Guide to Natural Language Processing (NLP) on social media

How to analyze text data in a systematic way

A. Tayyip Saka
6 min read · Jul 10, 2019

In this article, we'll learn about Natural Language Processing (NLP), which makes it much easier to analyze text. We'll see how to explore the data in Python and how NLP tasks are carried out to understand social media posts.

We have a small dataset containing the last 500 posts that TRT World published on Facebook, shown below. The data includes the message, name, description, the number of like, love, wow, haha, sorry and anger reactions, and the created time of each post.
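The snippets below assume the posts are already in a pandas DataFrame named fb_posts. A minimal loading step, assuming a CSV export with the columns described above (the file name here is hypothetical), could look like this:

import pandas as pd

# Hypothetical file name; expected columns: message, name, description,
# like, love, wow, haha, sorry, anger, created_time
fb_posts = pd.read_csv('trt_world_fb_posts.csv')
fb_posts.head()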

Figure 1: Dataset On Facebook

EXPLORATORY DATA ANALYSIS

I first performed exploratory data analysis to understand the dataset. I started with a summary of the data and then checked which columns include NA values. The original data has dimensions 500x10 and contains both text and numeric values. There are also blank rows and NA values in the news data.

fb_posts.shape           # dimensions of the dataset (500 rows x 10 columns)
fb_posts.info()          # column names, types and non-null counts
fb_posts.describe()      # summary statistics of the numeric columns
fb_posts.isnull().sum()  # number of NA values per column

Then I converted the NA values in the reaction columns to 0 and, to reduce dimensionality, gathered the total number of reactions into a single column labeled "Engagement".
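The NA-to-0 conversion itself is not in the snippet below; a minimal version of that step, applied before summing the reactions, could be:

# Treat missing reaction counts as 0
reaction_cols = ['like', 'love', 'wow', 'haha', 'sorry', 'anger']
fb_posts[reaction_cols] = fb_posts[reaction_cols].fillna(0)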

fb_posts['Engagement'] = fb_posts[['like','love','wow','haha','sorry','anger']].sum(axis=1)
fb_posts_new = fb_posts.drop(['like','love','wow','haha','sorry','anger'], axis=1)

Before calculating total engagement, I analyzed the types of engagement. Although there are several kinds of reactions, people clearly prefer clicking the 'like' button. The main reason is that liking is still as easy as ever: you see the 'like' button on every post, but if you tap and hold it, the 'like' expands into a range of emotions: love, haha, wow, sorry and anger. The 'sorry' and 'anger' reactions also indicate which news content is more intense (negative news).
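A chart like the one in Figure 2 could be drawn along these lines (a sketch rather than the original code, assuming the reaction columns listed above):

import matplotlib.pyplot as plt
import seaborn as sns

# Total count of each reaction type across all 500 posts
reaction_totals = fb_posts[['like', 'love', 'wow', 'haha', 'sorry', 'anger']].sum().sort_values(ascending=False)
sns.barplot(x=reaction_totals.index, y=reaction_totals.values).set_title('Distribution of Engagement')
plt.show()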

Figure 2: Distribution of Engagement

I also wondered whether the length of a message has a significant effect on the engagement score. That's why I calculated the word count of every post and used the Pearson correlation to measure the relationship between the two continuous variables.

fb_posts_new['word_count'] = fb_posts_new['message'].apply(lambda x: len(str(x).split(" ")))
fb_posts_new['word_count'].corr(fb_posts_new['Engagement']) * 100
# output: -9.86 %

As can be seen, there is only a weak negative correlation (about -0.10) between message length and engagement.

Apart from that, I looked at which hours people have interacted with our posts more by using “created_time” column.

# Parse the timestamps, then extract the time and hour of each post
fb_posts_new['created_time'] = pd.to_datetime(fb_posts_new['created_time'])
fb_posts_new['Time'] = fb_posts_new['created_time'].dt.time
fb_posts_new['Hour'] = fb_posts_new['created_time'].dt.hour

Then I aggregated engagement by hour and drew a line graph of hour vs. engagement. (Hours are in UTC.)

import matplotlib.pyplot as plt
import seaborn as sns

# Total engagement per hour of day (UTC)
time_postengagement = fb_posts_new.groupby('Hour')['Engagement'].sum()
sns.set()
time_postengagement.plot()
plt.show()
Figure 3: Hour vs Engagement

The line plot indicates that posts published at around 11 A.M. or 5 P.M. tend to receive more engagement.
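To read the peak hours off the data rather than the plot, the aggregated series can simply be sorted (a quick check using the grouped data above):

# Hours with the highest total engagement; these should match the peaks in Figure 3
time_postengagement.nlargest(3)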

PREPROCESSING

1. Removing punctuation, numbers, special characters and short words

I started pre-processing by removing the unnecessary parts of each post. This helps get rid of pieces of the posts that are useless for analysis.

fb_posts_new['message'] = fb_posts_new['message'].str.replace("[^a-zA-Z#]", " ", regex=True)
fb_posts_new['message'] = fb_posts_new['message'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

2. Tokenization

Tokenizing means splitting text into meaningful units called tokens. It is a mandatory step before any further processing, so I tokenized all 500 posts.

tokenized_post = fb_posts_new['message'].apply(lambda x: x.split())
Figure 4: Tokens

3. Removing stopwords

Posts may contain stop words like 'the', 'how', 'some', 'is' and 'are', which can be filtered out of the text before processing. I removed the stopwords from the posts using NLTK (Natural Language Toolkit), a standard Python package.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')
tokenized_list_of_words = []
for i in range(0, 500):
    l1 = tokenized_post.iloc[i]
    tokenized_list_of_words.append([word for word in l1 if word not in stop_words])
tokenized_list_of_words

In addition, you can customize the stopword list by adding your own words if you want, as shown below.
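For example (the added words here are purely illustrative):

# Extend the default English stopword list with custom words
stop_words.extend(['via', 'live'])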

4. Stemming

Stemming reduces inflected words to their root forms. It is slightly different from lemmatization in the approach it uses to produce root forms and in the words it produces. NLTK provides the PorterStemmer algorithm to implement stemming in Python.

!! A stemmer operates on a single word without knowledge of its context, and hence cannot differentiate between words that have different meanings depending on their part of speech.

!! While reducing a word to its root, stemming can create non-existent words, whereas lemmatization generates real dictionary words.

This doesn't necessarily hurt its effectiveness, but there is a risk of "over-stemming", where words like "universe" and "university" are reduced to the same root "univers", as the quick check below shows.
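A quick check makes this concrete (outputs shown as comments):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('universe'))    # 'univers'
print(stemmer.stem('university'))  # 'univers' (both collapse to the same non-word root)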

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemma_list_of_words = []
for i in range(0, 500):
    l1 = tokenized_list_of_words[i]
    stemma_list_of_words.append([stemmer.stem(word) for word in l1])
stemma_list_of_words

5. Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. For example, apple, apples and apple's are all forms of the word 'apple'; therefore 'apple' is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.

Python's NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up the lemmas of words.

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemma_list_of_words = []
for i in range(0, 500):
    l1 = tokenized_list_of_words[i]
    lemma_list_of_words.append([lemmatizer.lemmatize(word) for word in l1])
lemma_list_of_words
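As a quick sanity check on the apple example above, the lemmatizer created in this snippet can be applied to a single word (output shown as a comment):

print(lemmatizer.lemmatize('apples'))  # 'apple'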

Lastly, I joined the tokens within each post back together, and pre-processing was complete.

for i in range(len(tokenized_post)):
    lemma_list_of_words[i] = ' '.join(lemma_list_of_words[i])
lemma_list_of_words

VISUALIZATION

I analyzed which words are most common in the posts and drew a bar chart of the top 10 words.

from collections import Counter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Count word frequencies across all pre-processed posts
# (the original used a custom word_count() helper; collections.Counter does the same job)
all_text = ' '.join(lemma_list_of_words)
words = Counter(all_text.split())
words_freq = pd.DataFrame(list(words.items()), columns=['words', 'freq'])
words_freq = words_freq.sort_values(by='freq', ascending=False)
top10 = words_freq.iloc[0:10]
top10
plt.figure(figsize=(10, 6))
sns.barplot(x="freq", y="words", data=top10, palette="Blues_d").set_title("Top 10 Words")
Figure 5: Frequency of words

You have probably seen a cloud filled with many words in different sizes, which shows the frequency or importance of each word. This is called a word cloud. (When creating the word cloud, I used the lemmatized text instead of the stemmed text.)

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

def word_cloud(wd_list):
    stopwords = set(STOPWORDS)
    all_words = ' '.join([text for text in wd_list])
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        width=1600,
        height=800,
        random_state=21,
        colormap='jet',
        max_words=40,
        max_font_size=200).generate(all_words)
    plt.figure(figsize=(12, 10))
    plt.axis('off')
    plt.imshow(wordcloud, interpolation="bilinear")

word_cloud(lemma_list_of_words)

Now I am going to show the word cloud of these 500 posts. It consists of the 40 most popular words and gives a basic picture of what was going on during the last 5 months.

Figure 6: WordCloud

Finally, after implementing the steps above, the data is ready for further text analysis. Later, we will develop a model that predicts, before publishing, which posts are more likely to attract engagement on Facebook. This will provide a basic framework for our digital content team.

Next story will be coming soon :)

