I am trying to predict the number of likes an article or a post will get using a neural network (NN).
I have a dataframe with ~70,000 rows and 2 columns: "text" (predictor – strings of text) and "likes" (target – a non-negative integer count). I’ve been reading about the approaches taken in NLP problems, but I feel somewhat lost as to what the input to the NN should look like.
Here is what I did so far:
1- Text cleaning: removing HTML tags, stop words, punctuation, etc.
2- Lower-casing the text column
3- Tokenization
4- Lemmatization
5- Stemming
I assigned the results to a new column, so I now have a "clean_text" column with all of the above applied to it. However, I’m not sure how to proceed from here.
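For reference, my preprocessing looks roughly like this (a minimal, dependency-free sketch; in the real pipeline I use NLTK for stop words, lemmatization, and stemming, and the stop-word list below is just a made-up stand-in):

```python
import re

# Hypothetical tiny stop-word list standing in for NLTK's; illustration only.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "or", "to", "of"}

def clean_text(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)       # 1- strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)   # 1- drop punctuation/digits
    tokens = text.lower().split()              # 2- lower-case, 3- tokenize
    # 4- lemmatization and 5- stemming omitted here (done with NLTK in practice)
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("<p>The cat SAT on the mat!</p>"))  # ['cat', 'sat', 'on', 'mat']
```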
In most NLP problems, I have noticed that people use word embeddings, but from what I have understood, it’s a method used when attempting to predict the next word in a text. Learning word embeddings creates vectors for words that are syntactically similar to each other, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.
In addition, when I tried to train a word-embedding model with the Gensim library, the vocabulary came out to more than 50k words, which I think will make it too difficult or even impossible to one-hot encode. Even then, I would have to one-hot encode each row and then pad all the rows to the same length before feeding them to the NN model, but the length of each row in my new "clean_text" column varies significantly, so the result would be very large, mostly redundant one-hot matrices.
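To make concrete what I mean: the only alternative to one-hot encoding I can think of is mapping each token to an integer id, padding all rows to the same length, and then looking those ids up in some embedding matrix. A toy sketch of that idea (the vocabulary, embedding values, and dimensions are all made up):

```python
import numpy as np

# Toy vocabulary and two "rows" of tokenized text; id 0 is reserved for padding.
vocab = {"<pad>": 0, "cat": 1, "sat": 2, "mat": 3, "dog": 4}
rows = [["cat", "sat", "mat"], ["dog", "sat"]]

# Map tokens to integer ids and right-pad every row to the longest length.
max_len = max(len(r) for r in rows)
ids = np.array([[vocab[t] for t in r] + [0] * (max_len - len(r)) for r in rows])
print(ids.shape)  # (2, 3) -- one padded integer sequence per row

# An embedding matrix is just a lookup table: indexing replaces one-hot encoding.
embed_dim = 8
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), embed_dim))  # shape (5, 8)
vectors = embedding[ids]                              # shape (2, 3, 8)
print(vectors.shape)
```

So each row would be a short vector of integers rather than a 50k-wide one-hot matrix, but I am not sure whether this is the right way to feed the text into the NN.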
Am I approaching this completely wrong? What should I do?