Feature Extraction in NLP

5 minute read


What is Feature Extraction?

One of the most important parts of data preprocessing is feature extraction, which is a process of reducing data dimensionality by modifying variables describing data such way, that created set of features (feature vector) describe data model accurately and overall in a direct way. Thanks to feature extraction data is easier to compare thus easier to analyze.

Feature extraction is the transformation of original data to a data set with a reduced number of variables, which contains the most discriminatory information. ~ Transient Electromagnetic-Thermal Nondestructive Testing, 2017

Other Data Preprocessing Techniques

Besides feature extraction data preprocessing includes:

  • feature engineering - transforming raw data (which often is very redundant) into a dataset with more informative format.
  • feature transformation - to improve algorithm performance with normalization, scaling and dealing with missing values.
  • feature selection - removing unnecessary features settle a subset of the original features. Unlike feature extraction which transforms attributes (e.g. creates a linear combination of them etc.)

Why we need to use it?

Clarity of data is important to algorithm as it is to humans. Redundancy doesn’t allow us to focus on the important detail, machines work the same. The more concentrated the information in data is the better. It’s worth emphasizing that information is something different from data. It’s an abstract concept of useful value that data brings to the phenomena that we are analyzing.

Not every data is informative, and the role of every data analyst is to draw useful information from the given data. Many believe that well-performed feature extraction is the key to build an effective model.

What are the most common techniques in NLP?

Natural Language Processing requires creating tabular data from the raw text. It’s done many ways but the most common are:

  • BoW (Bag of Words/Ngrams) - represent text that describes the frequency of each word in a document. New feature vector has a column for each word in a document, and each row has a number of occurrences of this word in a sample (sentence, paragraph, etc.)
  • CBoW (Continuous Bag of Words) - extend BoW, where the model tries to predict learned word by surrounding words disregard the order.
  • CSG (Continuous Skip Gram) - this model uses a current word to predict surrounding words, with higher wages for closer words. CBoW is faster, but skip-gram has better results for rare words.
  • Word2vec - is a group of related models: CBoW (Continuous Bag of Words) and Continuous skip-gram, used to produce word embeddings
  • GloVe (Global Vectors) - is an unsupervised learning model that used to obtain dense word vectors similar to Word2Vec. The basic method of the GloVe model is to first create a huge word-context co-occurence matrix consisting of (word, context) pairs such that each element in this matrix represents how often a word occurs with the context (which is a sequence of words).
  • TF-IDF (Term frequency–inverse document frequency) - is a numerical statistic of the importance of each word in a document. TF divides the number of occurrences of each word in a document by the total number of words in the document. IDF is to downscale weights for words that occur in many documents in the corpus and are less informative than those that occur only in a smaller part of the corpus.
  • Autoencoders - is an artificial neural network that encodes the input in an unsupervised way that output can be preserved. This algorithm basically learns to ignore the noise in input data. Its hidden form can be used as an encoding of the input with preserved all information that the second part of the network uses to recreate input data.

Python Implementation Essentials

  • BOW using CountVectorization - sklearn

    from sklearn.feature_extraction.text import CountVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    >   ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    >   [[0 1 1 1 0 0 1 0 1]
         [0 2 0 1 0 1 1 0 1]
         [1 0 0 1 1 0 1 1 1]
         [0 1 1 1 0 0 1 0 1]]
  • TF-IDF using TfidfTransformer - sklearn

    from sklearn.feature_extraction.text import TfidfTransformer
    tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
    X_train_tf = tf_transformer.transform(X_train_counts)
    > (2257, 35788)
  • CBOW and CSG using *Word2Vec - genism

    from gensim.models import Word2Vec 
    # CBOW model 
    cbow = Word2Vec(data, min_count = 1, size = 100, window = 5) 
    # Skip Gram model 
    sg = Word2Vec(data, min_count = 1, size = 100, window = 5, sg = 1)
  • Glove using EmbeddingTransformer - zeugma

    from zeugma.embeddings import EmbeddingTransformer
    glove = EmbeddingTransformer('glove')
    X_train = glove.transform(corpus)
  • Autoencoder using layers and models - keras

    from keras.layers import Dense, Flatten, Reshape, Input, InputLayer
    from keras.models import Sequential, Model
    def build_autoencoder(img_shape, code_size):
        # The encoder
        encoder = Sequential()
        # The decoder
        decoder = Sequential()
        return encoder, decoder

20 Nov 2019 Mateusz Dorobek UPC - Human Language Engineering