I am trying to predict the number of likes an article or a post will get using a NN. I have a dataframe with ~70,000 rows and 2 columns: "text" (predictor - strings of text) and "likes" (target - continuous int variable). I've been reading on the approaches that are taken in NLP problems, but I feel somewhat lost as to what the input for the NN should look like. Here is what I did so far:

- Text cleaning: removing html tags, stop words, punctuation, etc.
- Lower-casing the text column
- Tokenization
- Lemmatization
- Stemming

I assigned the results to a new column, so now I have a "clean_text" column with all the above applied to it. However, I'm not sure how to proceed.

In most NLP problems, I have noticed that people use word embeddings, but from what I have understood, it's a method used when attempting to predict the next word in a text. Learning word embeddings creates vectors for words that are similar to each other, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

In addition, when I tried to generate a word embedding model using the Gensim library, it resulted in more than 50k words, which I think will make it too difficult or even impossible to onehot encode. Even then, I will have to one hot encode each row and then create a padding for all the rows to be of similar length to feed the NN model, but the length of each row in the new "clean_text" column I created varies significantly, so it will result in very big onehot encoded matrices that are kind of redundant.

Am I approaching this completely wrong? And what should I do?