Classify questions into 4 different classes using NLP methods and ML tools for classification. See it on GitHub
The data comes in
.xlsx format, but it is easier to read it from a CSV file.
df = pd.read_csv("Questions.csv", sep=";")
df.head()
| Question | Type |
| --- | --- |
| Is Hirschsprung disease a mendelian or a multifactorial disorder? | summary |
| List signaling molecules (ligands) that interact with the receptor EGFR? | list |
| Is the protein Papilin secreted? | yesno |
| Are long non coding RNAs spliced? | yesno |
| Is RANKL secreted from the cells? | yesno |
For this problem we should use
print(df.Type.unique())
>> ['summary' 'list' 'yesno' 'factoid']
df.shape
>> (2251, 2)
The first step of preprocessing a string is to tokenize it: split it into separate word strings and gather them into a list. I’ve used the
nltk method, which has some additional features; for example, it separates punctuation into distinct tokens.
tokens = nltk.word_tokenize("Is Hirschsprung disease a multifactorial disorder?")
>> ['Is', 'Hirschsprung', 'disease', 'a', 'multifactorial', 'disorder', '?']
Every language has stop words, which carry little information in a sentence, for example: a, the, in, or, …
tokens = [token for token in tokens if token not in stopwords_en]
>> ['Is', 'Hirschsprung', 'disease', 'mendelian', 'multifactorial', 'disorder', '?']
Removing punctuation also helps reduce the number of tokens that don’t necessarily add informative value to a sentence.
tokens = [token for token in tokens if token not in punctuation]
>> ['Is', 'Hirschsprung', 'disease', 'mendelian', 'multifactorial', 'disorder']
# ['123a45n6', 'example!', 'witho0ut', 'non-letters']
tokens = [re.sub(r'[^a-zA-Z]', "", token) for token in tokens]
>> ['an', 'example', 'without', 'nonletters']
# ['LOWER', 'Case']
tokens = [token.lower() for token in tokens]
>> ['lower', 'case']
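The cleaning steps so far can be chained into one function. The following is a minimal sketch using only the standard library, not the code from the repository: the regex tokenizer is a rough stand-in for the nltk tokenizer, and `STOPWORDS_EN` is a tiny hypothetical list. In the outputs above, 'Is' survives stop-word removal, most likely because the stop-word list is lowercase, so this sketch lowercases before filtering.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration; a real list is much longer.
STOPWORDS_EN = {"a", "an", "the", "in", "or", "is", "are", "of"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def clean(text):
    tokens = tokenize(text)
    # Lowercase first so that capitalized stop words such as "Is" are matched.
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS_EN]
    tokens = [t for t in tokens if t not in string.punctuation]
    # Strip every character that is not an ASCII letter.
    tokens = [re.sub(r"[^a-zA-Z]", "", t) for t in tokens]
    # Drop tokens that became empty after stripping.
    return [t for t in tokens if t]

print(clean("Is Hirschsprung disease a mendelian or a multifactorial disorder?"))
```

Whether to lowercase before or after stop-word removal is a design choice; doing it first removes a few more tokens.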
“Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.” (Wiki)
# ['list', 'signaling', 'molecules', 'ligands', 'interact', 'receptor']
tokens = [lemmatize(pair) for pair in pos_tag(tokens)]
>> ['list', 'signal', 'molecule', 'ligands', 'interact', 'receptor']
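The `lemmatize(pair)` helper is not shown here. A rough sketch of how such a helper could work is below, assuming it receives `(token, Penn Treebank tag)` pairs. The `TOY_LEMMAS` dictionary is a made-up stand-in for a real lexical database such as WordNet, and the tags in `tagged` are written by hand rather than produced by a tagger.

```python
# Toy stand-in for a WordNet-style lemmatizer; this dictionary is made up
# for illustration and covers only the example tokens.
TOY_LEMMAS = {
    ("signaling", "v"): "signal",
    ("molecules", "n"): "molecule",
    ("diseases", "n"): "disease",
}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet-style POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun

def lemmatize(pair):
    """Lemmatize a (token, penn_tag) pair; unknown words are returned unchanged."""
    token, tag = pair
    return TOY_LEMMAS.get((token, penn_to_wordnet(tag)), token)

# Hand-written tags (an assumption, not real tagger output).
tagged = [("list", "NN"), ("signaling", "VBG"), ("molecules", "NNS"),
          ("ligands", "NNS"), ("interact", "VBP"), ("receptor", "NN")]
print([lemmatize(pair) for pair in tagged])
```

The part-of-speech tag matters because the same surface form can have different lemmas as a noun and as a verb; words missing from the dictionary, like 'ligands' here, pass through unchanged.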
Stemming has basically the same purpose as lemmatization, but it is performed with regex-like suffix rules, which makes it much faster, and it can sometimes reduce the number of unique tokens in the dataset even after lemmatization.
# ['list', 'signal', 'molecule', 'ligands', 'interact', 'receptor']
tokens = [porter.stem(token) for token in tokens]
>> ['list', 'signal', 'molecul', 'ligand', 'interact', 'receptor']
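To illustrate what rule-based suffix stripping looks like, here is a toy stemmer built from a handful of ordered regex rules. It is only a sketch of the idea and is not the Porter algorithm, which has many more rules and conditions; the rule list and the name `toy_stem` are made up for this example.

```python
import re

# Ordered (pattern, replacement) rules; a toy subset, NOT the Porter algorithm.
RULES = [
    (re.compile(r"sses$"), "ss"),   # classes -> class
    (re.compile(r"ies$"), "i"),     # studies -> studi
    (re.compile(r"(?<!s)s$"), ""),  # ligands -> ligand
    (re.compile(r"ing$"), ""),      # signaling -> signal
    (re.compile(r"e$"), ""),        # molecule -> molecul
]

def toy_stem(token):
    """Apply each suffix rule once, in order."""
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token

tokens = ["list", "signal", "molecule", "ligands", "interact", "receptor"]
print([toy_stem(token) for token in tokens])
```

Because no dictionary lookup is involved, stems such as 'molecul' do not have to be real words, and very short words can be over-stemmed by rules this naive.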
I’ve used sklearn.feature_extraction.text.CountVectorizer to vectorize my data. It accepts an iterable of documents, such as the lines of a text file, and there is a short trick with
StringIO that lets me transform the data into that format.
with StringIO('\n'.join([i for i in questions.values])) as text:
    count_vect = CountVectorizer(analyzer=preprocess_text)
    count_vect.fit_transform(text)
In our dataset there are 3601 unique tokens after preprocessing (more than the number of training examples); we will have to deal with that later.
len(count_vect.vocabulary_)
>> 3601
This is the vocabulary of words. As we can see from the first entries, not all of them are regular English words.
words_sorted_by_index, _ = zip(*sorted(count_vect.vocabulary_.items(), key=itemgetter(1)))
words_sorted_by_index[:5]
>> ('aa', 'aagena', 'abacavir', 'abatacept', 'abc')
This is the final shape of our dataset; time to do the classification.
count_vect.transform([i for i in questions.values]).toarray().shape
>> (2251, 3601)
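To make the shape above concrete, here is a minimal bag-of-words sketch in plain Python showing where the two dimensions come from: one row per question, one column per vocabulary token. The functions `fit_vocabulary` and `transform` and the pre-tokenized `docs` are hypothetical and only mimic the idea of count vectorization, not the library's implementation.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each unique token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Return one row of token counts per document; unknown tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Dict order follows insertion order, so columns follow the indices.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Hypothetical, already preprocessed questions.
docs = [["protein", "papilin", "secret"],
        ["long", "non", "cod", "rna", "splice"],
        ["rankl", "secret", "cell"]]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(len(matrix), len(matrix[0]))
```

With 2251 questions and 3601 unique tokens, the same construction gives the (2251, 3601) matrix, in which most entries are zero.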
I’ve tested 4 different classifiers using grid search over a wide parameter space with cross-validation.
| Classifier | Train accuracy | Validation accuracy | Best params |
| --- | --- | --- | --- |
| Decision Tree | 99.8% | 70.4% | class_weight = 'balanced', presort = False |
| Random Forest | 93.5% | 72.3% | class_weight = 'balanced', max_depth = 30, max_features = 19, n_estimators = 100 |
| K-Nearest Neighbours | 99.8% | 32.1% | n_neighbors = 10, weights = 'distance' |
| Logistic Regression | 85.7% | 74.6% | C = 0.1, multi_class = 'multinomial', solver = 'lbfgs' |
The winner is Logistic Regression 🏆
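A grid search with cross-validation along these lines could be sketched as follows. This is an illustrative setup, not the exact one used for the results above: the eight questions and their labels are made up, the parameter grid is much smaller than a wide search, and `cv=2` is chosen only because the toy dataset is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tiny made-up corpus (hypothetical, not the real dataset).
questions = [
    "Is the protein secreted?", "Are these RNAs spliced?",
    "Is this gene expressed?", "Are the cells dividing?",
    "List the known ligands.", "List signaling molecules.",
    "List related diseases.", "List the involved genes.",
]
labels = ["yesno"] * 4 + ["list"] * 4

# Bag-of-words features with the default analyzer.
X = CountVectorizer().fit_transform(questions)

# Try several regularization strengths; each one is scored by
# 2-fold cross-validated accuracy.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=2, scoring="accuracy")
search.fit(X, labels)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` holds the model refit on all the data with the best parameters.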
- I’ve tested that PCA doesn’t improve the performance of any of the classifiers.
- Using StandardScaler() also wasn’t a good idea due to the binary character of the data.
- My validation metric was accuracy because the classes are evenly distributed.
Testing On Production
I’ve written a short function to classify user-entered questions into one of the 4 classes.
def predict_question(question):
    x = count_vect.transform([question]).toarray()
    return classes[clf.predict(x)]
These are results of my classification:
Do you like to study? yesno
How do you feel rright now? summary
What is your name? summary
List two of your favourie films. list
What is the biggest country in Europe? summary
Where are you? factoid
How old are you? summary
25 Oct 2019 Mateusz Dorobek UPC - Human Language Engineering