Question Collection

What is word embedding and how do you implement it?

Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points (‘are embedded nearby each other’). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis), and predictive methods (e.g. neural probabilistic language models).

count-based model

LSA
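To make the count-based idea concrete, here is a minimal sketch in plain Python. It is not LSA itself (there is no SVD step); it only builds raw co-occurrence count vectors and compares them with cosine similarity. The function names, the window size, and the toy corpus are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (made up): "cat" and "dog" appear in similar contexts.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]
vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
print(sim_cat_dog, sim_cat_milk)
```

On this toy corpus "cat" ends up closer to "dog" than to "milk", which is the Distributional Hypothesis at a very small scale. A real count-based model would reweight the counts and reduce the dimensionality.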
predictive model

General Preprocessing Process?

Metamorphosis by Franz Kafka
Text Cleaning is Task Specific
Manual Tokenization
Tokenization and Cleaning with NLTK
Additional Text Cleaning Considerations
Tips for Cleaning Text for Word Embedding
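The manual tokenization step in the list above can be sketched with the standard library only. This is one possible cleaning recipe under my own assumptions (lowercase, split on whitespace, strip ASCII punctuation, drop non-alphabetic tokens); since text cleaning is task specific, a different task may need different rules. The function name `clean_tokens` is hypothetical.

```python
import string

def clean_tokens(text):
    """Lowercase, split on whitespace, strip ASCII punctuation, keep alphabetic tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().split()
    stripped = (tok.translate(table) for tok in tokens)
    return [tok for tok in stripped if tok.isalpha()]

sample = "Hello, World! It's 2024."
print(clean_tokens(sample))
```

Note that this recipe throws away numbers and merges contractions such as "it's" into "its", which may or may not be what the task wants.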

What is part of speech (POS) tagging? What is the simplest approach to building a POS tagger that you can imagine?

POS tagging is one of the main components of almost any NLP analysis. The task simply means labelling words with their appropriate part of speech (noun, verb, adjective, adverb, pronoun, …).

This leads to another question: why do we need the part of speech? One really common and popular application that relies on NLP is the chatbot.

How would you build a POS tagger from scratch given a corpus of annotated sentences? How would you deal with unknown words?

TnT Tagger
Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w’ that appear in the same context, i.e. w1 w’ w2. If we have a big enough corpus, we can find enough past data to support our analysis.

There is also a very interesting write-up about building a POS tagger: A Good Part-of-Speech Tagger in about 200 Lines of Python.
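As a baseline before anything like TnT, a tagger can simply remember the most frequent tag of each word in the annotated corpus. The sketch below is that baseline, with one simple assumption for unknown words: fall back to the most frequent tag overall. The tag names and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag per word, plus a default tag for unknown words."""
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            tag_totals[tag] += 1
    best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = tag_totals.most_common(1)[0][0]
    return best_tag, default_tag

def tag(words, best_tag, default_tag):
    """Tag each word with its most frequent tag; unknown words get the default tag."""
    return [(w, best_tag.get(w.lower(), default_tag)) for w in words]

# Toy annotated corpus (made up).
toy_corpus = [
    [("the", "DET"), ("woman", "NOUN"), ("bought", "VERB"), ("the", "DET"), ("book", "NOUN")],
    [("the", "DET"), ("dog", "NOUN"), ("jumped", "VERB"), ("over", "ADP"), ("the", "DET"), ("fence", "NOUN")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]
best_tag, default_tag = train_unigram_tagger(toy_corpus)
print(tag(["the", "woman", "jumped", "over", "unicorns"], best_tag, default_tag))
```

The word "unicorns" never appears in the corpus, so it gets the default tag. Looking at the surrounding tags or at word suffixes would be the next step up from this baseline.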

How would you train a model that identifies whether the word “Apple” in a sentence belongs to the fruit or the company?

We can apply an HMM to identify the result. The easiest way to implement this is to encode the sentence as n-grams and then train a classification model on labeled examples.
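Here is a rough sketch of the n-gram plus classifier route. I picked a Naive Bayes classifier with add-one smoothing as the classification model because it fits in a few lines; that choice, the class name, and the four training sentences are my own assumptions, and a real system would need far more labeled data.

```python
from collections import Counter, defaultdict
import math

def ngrams(text, n_max=2):
    """Return all word unigrams and bigrams of a lowercased sentence."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

class NaiveBayes:
    def fit(self, texts, labels):
        self.feature_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for f in ngrams(text):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            # Add-one smoothing over the n-gram vocabulary.
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for f in ngrams(text):
                if f in self.vocab:  # ignore n-grams never seen in training
                    score += math.log((self.feature_counts[label][f] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Toy labeled sentences (made up).
train_texts = [
    "apple released a new iphone today",
    "apple stock rose after the earnings report",
    "she ate a fresh apple for lunch",
    "the apple pie was sweet and juicy",
]
train_labels = ["company", "company", "fruit", "fruit"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("apple stock fell today"))
print(model.predict("a sweet apple pie"))
```

The word "apple" itself is useless as a feature here because it appears in both classes; the decision comes from the context n-grams around it.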

How would you find all the occurrences of quoted text in a news article?

A hand-crafted extractor may not be such a bad thing, especially if you need to implement it quickly. Another option is to use an existing rule-processing engine.
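A hand-crafted extractor can be as small as one regular expression. The sketch below assumes quotes are marked with either straight double quotes or curly double quotes and are not nested; the pattern and the sample article text are made up for illustration.

```python
import re

# One alternative per quote style; each captures the text between the marks.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')

def find_quotes(text):
    """Return the quoted passages in the order they appear."""
    return [straight or curly for straight, curly in QUOTE_PATTERN.findall(text)]

article = 'The mayor said, "We will rebuild." A resident replied, “It is about time.”'
print(find_quotes(article))
```

Single quotes, nested quotes, and quotes that span paragraphs are where a simple pattern like this starts to break, and where a rule engine or a trained model becomes more attractive.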

How would you build a system that auto corrects text that has been generated by a speech recognition system?

This is one of the big problems. A language model is one solution. Common language models differ from one another, and here we need to build a character-level language model.
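To show what a character-level language model does for correction, here is a tiny sketch: a character bigram model with add-one smoothing that scores candidate transcriptions and keeps the most likely one. The class name, the helper `pick_best`, the training text, and the candidates are all assumptions for illustration; a usable system would train on a large corpus and would also need a way to generate the candidates.

```python
from collections import Counter
import math

class CharBigramLM:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, corpus):
        padded = "^" + corpus.lower() + "$"  # ^ and $ mark start and end
        self.bigrams = Counter(zip(padded, padded[1:]))
        self.contexts = Counter(padded[:-1])
        self.vocab_size = len(set(padded))

    def log_prob(self, text):
        padded = "^" + text.lower() + "$"
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.contexts[a] + self.vocab_size))
            for a, b in zip(padded, padded[1:])
        )

def pick_best(candidates, lm):
    """Return the candidate string the language model finds most likely."""
    return max(candidates, key=lm.log_prob)

# Toy training text and candidates (made up).
lm = CharBigramLM("the cat sat on the mat and the dog sat on the log")
best = pick_best(["teh cat sat", "the cat sat"], lm)
print(best)
```

The model prefers "the cat sat" because the character pairs "th", "he", and "e " are common in the training text, while "te", "eh", and "h " never occur.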

What is latent semantic indexing and where can it be applied?

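LSI projects the term-document matrix into another, lower-dimensional space. Truncated SVD is the usual way of doing this (LDA is a different, probabilistic way of finding topics). The goal is to get the best result from compressing: it can be understood as compressed information with minimum information loss. It can be applied to document search and to grouping similar documents.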

What are stop words? Describe an application in which stop words should be removed.

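Stop words usually refer to words that carry little useful information in a sentence. The list varies by use case; however, following NLTK’s stop word list, i, me, and myself are common examples. Keyword search and topic modeling are applications where stop words should be removed, because such frequent words only add noise.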
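Removing stop words is just a filter over the tokens. The sketch below uses a small hand-picked stop word set, which is an assumption for illustration and not the NLTK list; in practice the full list can be loaded with `nltk.corpus.stopwords.words("english")`.

```python
# Small hand-picked subset for illustration only.
STOP_WORDS = {"i", "me", "my", "myself", "the", "a", "an", "is", "of", "and", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in the stop word set."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

print(remove_stop_words("I bought myself a book of poems"))
```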

How would you design a model to predict whether a movie review was positive or negative?

For unsupervised sentiment analysis, one simple idea is to collect lists of positive and negative words and score each review by which kind it contains more of. This question can also be treated as a very deep problem.
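The word-list idea can be sketched in a few lines. The two word sets below are tiny and made up; a real lexicon would be much larger, and this approach ignores negation ("not good") entirely.

```python
# Tiny made-up lexicons for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "wonderful", "enjoyable"}
NEGATIVE = {"bad", "boring", "terrible", "hate", "awful", "dull"}

def lexicon_sentiment(review):
    """Count positive minus negative words and map the sign to a label."""
    tokens = [tok.strip(".,!?;:") for tok in review.lower().split()]
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("A great cast and a wonderful story!"))
print(lexicon_sentiment("Boring plot, terrible acting."))
```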

How to implement flexible text matching

Soundex
Metaphone
Edit Distance
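Of the three techniques listed, edit distance is the easiest to write from scratch. Below is a sketch of the Levenshtein distance (insert, delete, and substitute each cost 1) using the usual dynamic programming table, kept to two rows. The helper `closest` and the candidate list are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest(word, candidates):
    """Return the candidate with the smallest edit distance to word."""
    return min(candidates, key=lambda c: edit_distance(word, c))

print(edit_distance("kitten", "sitting"))
print(closest("aple", ["apple", "maple", "ample", "orange"]))
```

Soundex and Metaphone work differently: they map words to phonetic codes so that words that sound alike match, even when their spelling is far apart.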
What is beam search?

In an optimization problem, a solution that relies on BFS or DFS tries to search all possible solutions exhaustively. If a partial solution can be scored partway down the search tree, we can set a threshold (in beam search, keep only the best k candidates at each level) and remove the rest of the tree. The pro is obvious: it saves computing and memory resources. The con is that it cannot guarantee the best solution.
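Here is a small sketch of beam search over sequences. The function signature, the toy transition probabilities, and the names `expand` and `score` are assumptions made up for this example. The toy table is built so that a beam of width 1 (greedy search) misses the best sequence while a beam of width 2 finds it.

```python
import math

def beam_search(start, expand, score, steps, beam_width):
    """Keep only the `beam_width` best partial sequences at each step.

    expand(seq) returns the possible next tokens; score(seq) returns a
    number where higher is better.
    """
    beam = [start]
    for _ in range(steps):
        candidates = [seq + [tok] for seq in beam for tok in expand(seq)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]  # prune everything else
    return beam

# Toy transition probabilities (made up).
PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}

def expand(seq):
    return list(PROBS.get(seq[-1], {}))

def score(seq):
    """Log-probability of the whole sequence."""
    return sum(math.log(PROBS[p][c]) for p, c in zip(seq, seq[1:]))

greedy = beam_search(["<s>"], expand, score, steps=2, beam_width=1)[0]
wide = beam_search(["<s>"], expand, score, steps=2, beam_width=2)[0]
print(greedy, math.exp(score(greedy)))
print(wide, math.exp(score(wide)))
```

Greedy search commits to "a" because it looks best after one step, and ends with probability 0.3. The wider beam keeps "b" alive and finds the sequence with probability 0.36. A larger search space can fool a beam of width 2 in the same way, which is why there is no guarantee of the best solution.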

Language Modeling

Online training with word2vec? In gensim, it is possible to add new vocabulary to an existing model by calling build_vocab with update=True and then training again.

Topic Modeling & Summarization

Basic types of deep learning?

The real answer here is infinity. However, I don’t believe anyone can actually be good at knowing everything.

MLP, Boltzmann machine, CNN, RNN, GAN, LSTM, autoencoder
Pooling layer, ReLU layer, fully connected layer
BatchNormalization #Initialize

Activation function

Sigmoid, ReLU, Step, Tanh, Softmax
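For reference, the activation functions listed above can be written in plain Python as follows. This is a scalar sketch for reading, not how TensorFlow implements them; the step function's threshold at zero and the max-subtraction trick in softmax are common conventions that I am assuming here.

```python
import math

def sigmoid(x):
    """Squash x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def step(x):
    """Hard threshold at zero."""
    return 1.0 if x >= 0 else 0.0

def tanh(x):
    """Squash x into (-1, 1)."""
    return math.tanh(x)

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0), relu(-2.0), step(-1.0), tanh(0.0))
print(softmax([1.0, 2.0, 3.0]))
```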

Cost function
tf.nn.weighted_cross_entropy_with_logits
Noise Contrastive Estimation

Gradient descent

Adam, SGD
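The core update that Adam and SGD both build on is plain gradient descent: move the parameter a small step against the gradient. The sketch below shows only that basic update on a one-dimensional function; it is neither Adam nor stochastic, and the function name, learning rate, and step count are assumptions for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

SGD applies the same update with a gradient estimated from a mini-batch, and Adam additionally adapts the step size per parameter using running averages of the gradient.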

Overfitting

batch normalization, dropout

tensorflow

constant, variable, placeholder, session

Hyperparameters

There are tons of hyperparameters. I am not going to list them all.
Learning rate

Building process

TensorFlow is the only library I’ve used so far (I count Keras as part of TensorFlow). In their talks, Google strongly suggests that people use Keras and other high-level tools first. In this post, I mainly hope to write down higher-level ideas for interviews.

Backward propagation

Knowledge supplement :)

While collecting questions and trying to answer them, I reviewed some key tools. I hope they will also be helpful for readers.

HMM & POS

The main idea of Hidden Markov models

The output at time $i$ depends on the input at time $i$ and the outputs at previous times $i-1$, $i-2$, …:

Here are two examples that use this idea.

Language model: $p(w_i \mid w_{i-2}, w_{i-1})$
Hidden Markov model: $p(y_i \mid w_i, y_{i-1}, y_{i-2}, \dots)$
- e.g., $y_i$ is the part-of-speech tag at time $i$

Filling in the gaps
Review the old topics
- SVM
- Random forest
- Decision tree
- A/B test basics
- Metrics: F1, precision, recall
Language model
- Transformer-Decoder architecture
RNN family: best graphical explanation
- GRU
- LSTM
Attention model

Speech Recognition

Collecting the most common DL frameworks according to Speech Recognition Using Deep Neural Networks: A Systematic Review

Pointer network

Reference

PDF Extraction

Purpose: PDF + image + handwriting

PDF Miner

pip3 install pdfminer.six

Pros: Asian language (CJK) support

Cons:

Text-based PDFs only.
Need W ,L M

Tesseract OCR

Amazon Textract
Version 4 using LSTM for
Multiple language support
Pytesseract

OpenCV

text = pytesseract.image_to_data(im, config=config, output_type=pytesseract.Output.DATAFRAME)

Image preprocessing

Retraining

Table Scan

Table area

Camelot pdfminer +
