How to use text as input to a neural network for a regression problem: predicting how many likes/claps an article will get

I am trying to predict the number of likes an article or a post will get using a NN.

I have a dataframe with ~70,000 rows and two columns: "text" (predictor – strings of text) and "likes" (target – an integer count). I’ve been reading about the approaches used in NLP problems, but I feel somewhat lost as to what the input for the NN should look like.

Here is what I did so far:

  1. Text cleaning: removing HTML tags, stop words, punctuation, etc.
  2. Lower-casing the text column
  3. Tokenization
  4. Lemmatization
  5. Stemming
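
For reference, the first three steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the exact code I ran: the `clean_text` helper and the tiny `STOP_WORDS` set are made up for the example, and lemmatization and stemming are left out because they would normally come from a library such as NLTK or spaCy.

```python
import html
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}

TAG_RE = re.compile(r"<[^>]+>")           # matches HTML tags such as <p> or </div>
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")  # anything that is not a letter, digit, or space


def clean_text(raw):
    """Return a list of lower-cased tokens with tags, punctuation, and stop words removed."""
    text = TAG_RE.sub(" ", raw)        # 1. strip HTML tags
    text = html.unescape(text)         #    decode entities such as &amp;
    text = text.lower()                # 2. lower-case
    text = NON_WORD_RE.sub(" ", text)  #    drop punctuation
    tokens = text.split()              # 3. whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_text("<p>The Cats are sitting on the mat!</p>"))
```

In a dataframe this would be applied row by row, for example with `df["text"].apply(clean_text)`.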

I assigned the results to a new column, so now I have a "clean_text" column with all of the above applied to it. However, I’m not sure how to proceed.

In most NLP problems, I have noticed that people use word embeddings, but from what I understand, it’s a method used when attempting to predict the next word in a text. Learning word embeddings produces similar vectors for words that are similar to each other syntax-wise, and I fail to see how that can be used to derive the weight/impact of each word on the target variable in my case.

In addition, when I tried to generate a word embedding model using the Gensim library, the vocabulary came to more than 50k words, which I think makes one-hot encoding too difficult or even impossible. Even then, I would have to one-hot encode each row and then pad all the rows to the same length to feed the NN model, but the length of each row in the new "clean_text" column varies significantly, so the result would be very large one-hot encoded matrices that are mostly redundant.
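
To make the size concern concrete, here is a small sketch that compares a dense one-hot input with the more compact alternative of giving each token an integer id and padding the id lists (as far as I can tell, integer ids are what embedding layers in the common deep learning libraries take as input). The helper names, the reserved ids, the vocabulary cap, and the `max_len` of 500 are all assumptions for illustration; only the ~70,000 rows and the ~50k vocabulary come from my data.

```python
from collections import Counter

PAD_ID = 0  # reserved id used to pad short rows
UNK_ID = 1  # reserved id for tokens outside the capped vocabulary


def build_vocab(docs, max_size):
    """Map the max_size most frequent tokens to integer ids starting at 2."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode(doc, vocab, max_len):
    """Turn a token list into exactly max_len integer ids (truncate, then pad)."""
    ids = [vocab.get(tok, UNK_ID) for tok in doc][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


docs = [["great", "article", "great", "read"], ["great", "short", "article"]]
vocab = build_vocab(docs, max_size=2)
encoded = [encode(d, vocab, max_len=5) for d in docs]
print(encoded)

# Rough entry counts for the full dataset; max_len=500 is an assumed cap.
rows, max_len, vocab_size = 70_000, 500, 50_000
print("integer ids:   ", rows * max_len)
print("dense one-hot: ", rows * max_len * vocab_size)
```

Each row becomes a fixed-length list of ids, so the input grows with `rows * max_len` rather than `rows * max_len * vocab_size`.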

Am I approaching this completely wrong? If so, what should I do instead?
