NLP engineer interview question collection

Practice makes perfect

Posted by Chester on November 19, 2018

Question Collection

What is a word embedding, and how do you implement one?

Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points (‘are embedded nearby each other’). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis), and predictive methods (e.g. neural probabilistic language models).

Count-based model

  • LSA

Predictive model
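
To make the Distributional Hypothesis concrete, here is a minimal count-based sketch: each word is represented by the counts of its neighbours, and words that share contexts end up with similar vectors. The function names and the toy sentences are my own illustration, not a library API; a real count-based model such as LSA would additionally reduce the dimensionality.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Represent each word by the counts of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical): "cats" and "dogs" share contexts, "stocks" does not.
toy_sentences = [
    ["i", "like", "cats"],
    ["i", "like", "dogs"],
    ["stocks", "fell", "sharply"],
]
vectors = cooccurrence_vectors(toy_sentences)
```

In this toy corpus "cats" and "dogs" appear in identical contexts, so their vectors point in the same direction, while "cats" and "stocks" share no context at all.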

What does a general text preprocessing process look like?

  1. Metamorphosis by Franz Kafka
  2. Text Cleaning is Task Specific
  3. Manual Tokenization
  4. Tokenization and Cleaning with NLTK
  5. Additional Text Cleaning Considerations
  6. Tips for Cleaning Text for Word Embedding
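
The manual tokenization and cleaning steps in the outline above can be sketched with the standard library alone. This is only one reasonable pipeline (split on whitespace, strip punctuation, drop non-alphabetic tokens, lowercase); `clean_tokens` is a name I made up, and which steps you keep really is task specific.

```python
import string


def clean_tokens(text):
    """Split on whitespace, strip ASCII punctuation, keep alphabetic tokens, lowercase."""
    table = str.maketrans("", "", string.punctuation)
    stripped = [token.translate(table) for token in text.split()]
    return [token.lower() for token in stripped if token.isalpha()]
```

For example, `clean_tokens("Hello, World! It's 2018.")` keeps `hello`, `world` and `its`, and drops the number.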

What is part of speech (POS) tagging? What is the simplest approach to building a POS tagger that you can imagine?

POS tagging is one of the main components of almost any NLP analysis. The task simply means labelling each word with its appropriate part of speech (noun, verb, adjective, adverb, pronoun, …).

This leads to a follow-up question: why do we need part-of-speech tags at all? One really common and popular application that relies on them is the chatbot.

How would you build a POS tagger from scratch given a corpus of annotated sentences? How would you deal with unknown words?

  1. TnT tagger

  2. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). NLTK's text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w’ that appear in the same context, i.e. w1 w’ w2. With a big enough corpus, we can find enough past data to support our analysis. There is also a very interesting tutorial on building a POS tagger: “A Good Part-of-Speech Tagger in about 200 Lines of Python”.
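
As for the simplest tagger I can imagine, a most-frequent-tag baseline is a reasonable starting point: remember the most common tag for every word in the annotated corpus, and fall back to the overall most common tag for unknown words. The sketch below is my own illustration (it is not the TnT tagger), and the function names and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_baseline_tagger(tagged_sentences):
    """Learn the most frequent tag per word, plus a default tag for unknown words."""
    word_tag_counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word.lower()][tag] += 1
            tag_totals[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Tag each word with its most frequent tag; unknown words get the default tag."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


# Toy annotated corpus (hypothetical).
toy_corpus = [
    [("the", "DET"), ("woman", "NOUN"), ("bought", "VERB"), ("the", "DET"), ("book", "NOUN")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
lexicon, default_tag = train_baseline_tagger(toy_corpus)
```

Here the unknown word "pizza" would be tagged as a noun simply because nouns are the most frequent tag in the toy corpus. Better unknown-word handling (suffixes, capitalization) is where taggers like TnT go further.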

How would you train a model that identifies whether the word “Apple” in a sentence belongs to the fruit or the company?

We can apply an HMM to this problem. The easiest way to implement it, though, is to encode each sentence as n-grams and then train a classification model on those features.
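
Here is a minimal sketch of the n-gram plus classifier idea. I picked a Naive Bayes classifier with add-one smoothing because it fits in a few lines; that choice, the function names, and the four training sentences are all my own assumptions, and a real system would need far more labelled data.

```python
import math
from collections import Counter, defaultdict


def ngram_features(sentence, n_max=2):
    """Return all unigrams and bigrams of a lowercased, whitespace-split sentence."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def train_naive_bayes(examples):
    """Count n-gram features per label."""
    label_counts, feature_counts, vocabulary = Counter(), defaultdict(Counter), set()
    for sentence, label in examples:
        features = ngram_features(sentence)
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary


def classify(sentence, model):
    """Pick the label with the highest smoothed log-probability."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())

    def log_score(label):
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        score = math.log(label_counts[label] / total)
        for feature in ngram_features(sentence):
            score += math.log((feature_counts[label][feature] + 1) / denom)
        return score

    return max(label_counts, key=log_score)


# Toy training data (hypothetical).
toy_examples = [
    ("apple released a new iphone today", "company"),
    ("apple shares rose after earnings", "company"),
    ("she ate an apple for lunch", "fruit"),
    ("the apple pie tastes sweet", "fruit"),
]
model = train_naive_bayes(toy_examples)
```

The word "apple" itself is equally likely under both labels, so the decision comes entirely from the surrounding n-grams such as "shares rose" or "ate an".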

How would you find all the occurrences of quoted text in a news article?

A hand-crafted extractor may not be such a bad thing, especially if you need to implement it quickly. Another option is to use an existing rule-processing engine.
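
A hand-crafted extractor can be as small as one regular expression. The sketch below only handles straight double quotes and curly double quotes on a single nesting level; the pattern and the `find_quotes` name are my own, and real news text would need more rules (single quotes, quotes spanning paragraphs, and so on).

```python
import re

# Either "straight" or “curly” double quotes, non-nested.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')


def find_quotes(text):
    """Return the text inside every pair of double quotes, in order of appearance."""
    return [m.group(1) or m.group(2) for m in QUOTE_PATTERN.finditer(text)]
```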

How would you build a system that auto corrects text that has been generated by a speech recognition system?

This is a big problem for which there is no single standard solution today. A language model is one approach. It differs from the common word-level language models, though: here we need to build a character-level language model.
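
To show what a character-level language model looks like, here is a minimal sketch using character bigrams with add-one smoothing, which then ranks candidate corrections by probability. The class, the `best_candidate` helper, and the one-line training text are hypothetical; a real auto-correct system would use a much stronger model and also take the recognizer's own confidence into account.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, text):
        padded = "^" + text + "$"  # ^ and $ mark the start and end of the text
        self.bigrams = Counter(zip(padded, padded[1:]))
        self.unigrams = Counter(padded[:-1])
        self.vocab_size = len(set(padded))

    def log_prob(self, candidate):
        padded = "^" + candidate + "$"
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size))
            for a, b in zip(padded, padded[1:])
        )


def best_candidate(lm, candidates):
    """Pick the candidate string the language model finds most likely."""
    return max(candidates, key=lm.log_prob)


lm = CharBigramLM("the cat sat on the mat")
```

With this tiny training text the model already prefers "the" over the transposition "teh", because the bigrams "th" and "he" were seen and "te" and "eh" were not.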

What is latent semantic indexing and where can it be applied?

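LSI projects the term-document matrix into another, lower-dimensional space. A truncated SVD is the usual way of doing this (LDA is a related, probabilistic alternative for topic modeling). The goal is to get the best result out of the compression: it can be understood as compressed information with minimum information loss.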
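
The "compress with minimum loss" idea can be sketched for the simplest case, a rank-1 approximation of a small term-document matrix. To keep it dependency-free I use power iteration to find the top singular triplet; that is my substitution, and in practice you would call a library SVD and keep k dimensions. The function names are hypothetical, and the sketch assumes a non-zero matrix with non-negative counts.

```python
import math


def top_singular_triplet(matrix, iterations=100):
    """Power iteration on A^T A: returns (u, sigma, v) for the largest singular value."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v


def rank1_approximation(matrix):
    """Best rank-1 reconstruction sigma * u * v^T of the matrix."""
    u, sigma, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Toy term-document counts (rows: terms, columns: documents). This one is exactly rank 1.
toy_matrix = [[1.0, 2.0], [2.0, 4.0]]
approx = rank1_approximation(toy_matrix)
```

Because the toy matrix has rank 1, the one-dimensional latent space reconstructs it without loss; on real data the reconstruction error is what you trade against the number of dimensions kept.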

What are stop words? Describe an application in which stop words should be removed.

Stop words usually refer to words that carry little useful information in a sentence. The exact list varies by use case; however, going by NLTK's stop word list, "me", "I" and "myself" are common examples.
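
Removing stop words is a one-line filter once you have a list. The small set below is a hand-picked stand-in of my own (not NLTK's actual list), so that the sketch runs without downloading any corpus.

```python
# Hand-picked illustrative stop words; NLTK ships a much longer list.
STOP_WORDS = {"i", "me", "my", "myself", "we", "the", "a", "an", "is", "and", "of", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]
```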

How would you design a model to predict whether a movie review was positive or negative?

For an unsupervised approach, lexicon-based sentiment analysis is one simple idea: collect lists of positive and negative words and score each review by the words it contains. This question can also be treated as a very deep problem.
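
The positive and negative word list idea fits in a few lines. The two word sets below are tiny hypothetical examples, and the scoring rule (count positives minus negatives) is the simplest one I could think of; it ignores negation, sarcasm and everything else that makes this a deep problem.

```python
import re

# Tiny illustrative lexicons (hypothetical).
POSITIVE_WORDS = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE_WORDS = {"bad", "terrible", "boring", "hate", "awful"}


def predict_sentiment(review):
    """Label a review by counting positive words minus negative words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```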

How would you implement flexible text matching?

  1. Soundex
  2. Metaphone
  3. Edit Distance
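
Of the three options above, edit distance is the easiest to write from scratch. Below is a sketch of the classic Levenshtein dynamic program with unit costs for insertion, deletion and substitution; the function name is my own.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Two strings can then be treated as a flexible match when their distance is below a threshold chosen for the application.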

Beam search

In an optimization problem, any solution that relies on BFS or DFS tries to exhaustively search all possible solutions. If partial solutions can be scored halfway down the tree, we can set a threshold and prune part of the tree. The pro is obvious: it saves computing and memory resources. The con is that it can't guarantee the best solution.
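
A minimal sketch of the pruning idea: at each level of the tree, keep only the `beam_width` best partial solutions. The `expand` callback and the toy probabilities are hypothetical and chosen so that a width of 1 (greedy search) misses the best path.

```python
import math


def beam_search(expand, steps, beam_width):
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            for token, log_prob in expand(sequence):
                candidates.append((sequence + [token], score + log_prob))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune the rest of the tree
    return beams


def toy_expand(sequence):
    """Toy scorer: 'a' looks best at first, but 'b' leads to the best full path."""
    if not sequence:
        return [("a", math.log(0.6)), ("b", math.log(0.4))]
    if sequence[0] == "a":
        return [("x", math.log(0.5)), ("y", math.log(0.5))]
    return [("x", math.log(0.9)), ("y", math.log(0.1))]
```

With `beam_width=1` the search commits to "a" and ends at a path with probability 0.3, while `beam_width=2` keeps "b" alive and finds "b x" with probability 0.36.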

Language Modeling

  • Online training on word2vec? With gensim, it is possible to add new vocabulary to an existing model (build_vocab with update=True, then train again).

Topic Modeling & Summarization

What are the very basic building blocks of deep learning?

The real answer here is that the list is infinite. However, I don’t believe anyone can actually be good enough to know everything.

  • MLP, Boltzmann machine, CNN, RNN, GAN, LSTM, autoencoder
  • Pooling layer, ReLU layer, fully connected layer
  • Batch normalization, initialization

Activation function

  • Sigmoid, ReLU, step, tanh, softmax
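
For quick review, the activation functions listed above can be written out with the standard library. These are the textbook scalar definitions (softmax takes a list), not the implementation of any particular framework.

```python
import math


def sigmoid(x):
    """Squash x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def step(x):
    """Binary threshold at zero."""
    return 1.0 if x >= 0 else 0.0


def tanh(x):
    """Squash x into (-1, 1)."""
    return math.tanh(x)


def softmax(xs):
    """Turn a list of scores into probabilities; subtracting the max avoids overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```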

Cost function

  • tf.nn.weighted_cross_entropy_with_logits
  • Noise contrastive estimation

Gradient descent

  • Adam, SGD
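
The core update behind all of these optimizers is the same: move the parameter a small step against the gradient. Here is a sketch of plain gradient descent on a one-dimensional toy loss of my own choosing; SGD applies the same update to mini-batch gradients, and Adam adds per-parameter adaptive step sizes on top.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly apply w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Toy loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step on this loss; a learning rate above 1.0 would make it diverge, which is why this hyperparameter matters so much.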

Overfitting

  • Batch normalization, dropout

TensorFlow

  • constant, variable, placeholder, session

Hyperparameters

There are tons of hyperparameters. I am not going to list them all.

  • Learning rate

Building process

TensorFlow is the only library I’ve used so far (I count Keras as part of TensorFlow). In their talks, Google strongly suggests that people use Keras and other high-level tools first. In this post, I mainly hope to write down higher-level ideas for interviews.

Backward propagation

Knowledge supplement :)

While collecting questions and trying to answer them, I reviewed some key tools. I hope they are also helpful for readers.

HMM & POS

The main idea of Hidden Markov models

The output at time $i$ depends on the input at time $i$ and the outputs at previous times $i-1$, $i-2$, …

Here are two examples that use this idea.

  • Language model: $p(w_i \mid w_{i-2}, w_{i-1})$
  • Hidden Markov model: $p(y_i \mid w_i, y_{i-1}, y_{i-2}, \ldots)$
    • e.g., $y_i$ is the part-of-speech tag at time $i$
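
The language model example above, the probability of a word given the two previous words, can be estimated by simple counting. This is a maximum-likelihood sketch without smoothing; the function name and the toy token sequence are my own.

```python
from collections import Counter


def train_trigram_lm(tokens):
    """Return prob(word, prev2, prev1) = count(prev2, prev1, word) / count(prev2, prev1)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    context_counts = Counter()
    for (w1, w2, _), n in trigram_counts.items():
        context_counts[(w1, w2)] += n

    def prob(word, prev2, prev1):
        context = context_counts[(prev2, prev1)]
        return trigram_counts[(prev2, prev1, word)] / context if context else 0.0

    return prob


# Toy token sequence: after "a b" we saw "c" once and "d" once.
prob = train_trigram_lm("a b c a b d".split())
```

An HMM tagger applies the same counting idea to tag sequences, which is why the two appear side by side here.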
  1. Fill in the gaps
  2. Review the old topics
    • SVM
    • Random forest
    • Decision tree
    • A/B test basics
    • Metrics: F1, precision, recall
  3. Language model
    • Transformer-Decoder architecture
  4. RNN family (best graphical explanation)
    • GRU
    • LSTM
  5. Attention model

Speech Recognition

Collecting the most common DL frameworks according to “Speech Recognition Using Deep Neural Networks: A Systematic Review”

Pointer network

Reference

PDF Extraction

Purpose: PDF + image + handwriting

PDFMiner

pip3 install pdfminer.six

Pros: CJK (Asian language) support

Cons:

  • Text-based PDFs only.
  • Need W ,L M

Tesseract OCR

  • Amazon Textract
  • Version 4 using LSTM for
  • Multiple language support
  • Pytesseract
  • OpenCV

    import pytesseract
    text = pytesseract.image_to_data(im, config=config, output_type=pytesseract.Output.DATAFRAME)

Image preprocessing

Retraining

Table Scan

  • Table area

Camelot pdfminer +