I am trying out Twitter sentiment analysis to practice my skills, and I have reached the “Training models” part of the course. I have a basic question: how can word counts learned from the training set be used on future test sets, which may contain a different set of words? What I did was the following:
- Transformation was a two-step process: (1) clean the text (remove common words, apply stemming); (2) apply a count vectorizer (turns the word counts into columns).
- Fit a model on the resulting columns with the sentiment labels.
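The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact course code: it assumes scikit-learn's `CountVectorizer`, uses its built-in English stop-word list for the "remove common words" step, leaves stemming out, and picks `LogisticRegression` as a stand-in classifier. The tweets and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; 1 = positive sentiment, 0 = negative sentiment.
train_texts = [
    "I love this phone",
    "What a great day",
    "I hate waiting in line",
    "This movie was terrible",
]
train_labels = [1, 1, 0, 0]

# Steps 1 and 2: drop common English words, then turn the remaining
# word counts into columns (stemming is omitted in this sketch).
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)

# Step 3: fit a classifier (an assumed choice) on the counts and labels.
model = LogisticRegression()
model.fit(X_train, train_labels)

# One column per word in the vocabulary learned from the training texts.
vocab_size = len(vectorizer.vocabulary_)
print(X_train.shape, vocab_size)
```

The number of columns in `X_train` equals the size of the vocabulary learned from the training texts, which is why the vocabulary matters for any later data.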
However, when I try the above model on a new test set, the count vectorizer produces a vocabulary of a different size, or a different vocabulary altogether, so the columns no longer match what the model was trained on. How does one control for this?
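For illustration, here is a small sketch of the mismatch, assuming scikit-learn's `CountVectorizer` and made-up texts. It contrasts refitting a vectorizer on the test texts with one commonly used approach: calling `transform` on the test texts with the vectorizer that was already fitted on the training texts, so the columns stay the same and unseen words are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts.
train_texts = ["good movie", "bad movie", "good day"]
test_texts = ["awful movie", "good weather today"]

# Fit on the training texts only: this learns the vocabulary.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer on the test texts: same columns as X_train,
# and words not seen during training ("awful", "weather", "today") are dropped.
X_test = vectorizer.transform(test_texts)

# Refitting on the test texts learns a different vocabulary instead,
# so its columns would not line up with the trained model.
refit = CountVectorizer().fit(test_texts)

print(sorted(vectorizer.vocabulary_))  # training vocabulary
print(sorted(refit.vocabulary_))       # vocabulary from refitting on test texts
print(X_train.shape, X_test.shape)     # column counts match
```

With this setup, `X_test` can be passed to a model trained on `X_train` because both matrices share the training vocabulary as their columns.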