Question Collection
What is word embedding and how is it implemented?
Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points (‘are embedded nearby each other’). VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis), and predictive methods (e.g. neural probabilistic language models).
count-based model
- LSA
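As a rough illustration of the count-based idea, here is a minimal sketch that builds co-occurrence vectors and compares them with cosine similarity. It is not LSA itself, which would additionally factorize the count matrix with SVD. The toy corpus, the window size, and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (made up): "cat" and "dog" appear in similar contexts.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]
vecs = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_milk = cosine(vecs["cat"], vecs["milk"])
```

On this corpus "cat" ends up closer to "dog" than to "milk", which is the Distributional Hypothesis at a very small scale.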
predictive model
General Preprocessing Process?
- Metamorphosis by Franz Kafka
- Text Cleaning is Task Specific
- Manual Tokenization
- Tokenization and Cleaning with NLTK
- Additional Text Cleaning Considerations
- Tips for Cleaning Text for Word Embedding
What is part of speech (POS) tagging? What is the simplest approach to building a POS tagger that you can imagine?
POS tagging is one of the main components of almost any NLP analysis. The task simply means labelling each word with its appropriate part of speech (noun, verb, adjective, adverb, pronoun, …).
This leads to a follow-up question: why do we need the part of speech at all? One really common and popular application that relies on it is the chatbot.
How would you build a POS tagger from scratch given a corpus of annotated sentences? How would you deal with unknown words?
-
Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). NLTK's text.similar() method takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same contexts, i.e. w1 w' w2. With a big enough corpus, we can find enough past data to support this analysis. There is also a very interesting write-up on building a POS tagger: "A Good Part-of-Speech Tagger in about 200 Lines of Python".
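One very simple baseline, sketched here under my own assumptions, is to tag every known word with its most frequent tag in the annotated corpus and to fall back on the word's suffix for unknown words. The tiny training set, the tag names, and the two-letter suffix length are all made up for illustration; this is not the tagger from the article mentioned above.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences, suffix_len=2):
    """Learn the most frequent tag per word and per word suffix."""
    word_tags = defaultdict(Counter)
    suffix_tags = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            w = word.lower()
            word_tags[w][tag] += 1
            suffix_tags[w[-suffix_len:]][tag] += 1
            all_tags[tag] += 1
    return {
        "word": {w: c.most_common(1)[0][0] for w, c in word_tags.items()},
        "suffix": {s: c.most_common(1)[0][0] for s, c in suffix_tags.items()},
        "default": all_tags.most_common(1)[0][0],
        "suffix_len": suffix_len,
    }

def tag(model, words):
    """Tag known words by lookup; unknown words by suffix, then by default tag."""
    tagged = []
    for word in words:
        w = word.lower()
        if w in model["word"]:
            t = model["word"][w]
        else:
            t = model["suffix"].get(w[-model["suffix_len"]:], model["default"])
        tagged.append((word, t))
    return tagged

# Hypothetical annotated corpus.
train = [
    [("the", "DET"), ("woman", "NOUN"), ("bought", "VERB"), ("the", "DET"), ("book", "NOUN")],
    [("a", "DET"), ("dog", "NOUN"), ("walked", "VERB"), ("over", "ADP"), ("the", "DET"), ("bridge", "NOUN")],
    [("she", "PRON"), ("jumped", "VERB"), ("over", "ADP"), ("a", "DET"), ("fence", "NOUN")],
]
model = train_baseline_tagger(train)
result = tag(model, ["the", "woman", "climbed", "over", "the", "fence"])
```

Here "climbed" never appears in training, but its "ed" suffix was only seen on verbs, so the fallback tags it as a verb.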
How would you train a model that identifies whether the word “Apple” in a sentence belongs to the fruit or the company?
We can apply an HMM to this problem. The easiest way to implement it, though, is to encode the context around "Apple" as n-grams and then train a classification model on labelled sentences.
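A minimal sketch of the classification route, assuming unigram (1-gram) context features and a multinomial naive Bayes classifier with add-one smoothing. The classifier choice, the labels, and the six training sentences are assumptions for this example, and it is not an HMM.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (sentence, label). Returns counts for multinomial NB."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for sentence, label in examples:
        tokens = sentence.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, sentence):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in sentence.lower().split():
            score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled sentences.
examples = [
    ("apple released a new iphone today", "company"),
    ("apple stock rose after the earnings report", "company"),
    ("apple hired a new chief executive", "company"),
    ("i ate an apple for lunch", "fruit"),
    ("she baked an apple pie", "fruit"),
    ("the apple fell from the tree", "fruit"),
]
nb = train_nb(examples)
label_1 = predict_nb(nb, "apple announced a new iphone")
label_2 = predict_nb(nb, "he ate an apple pie")
```

With real data you would want far more sentences and probably bigram features as well, since single words lose word order.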
How would you find all the occurrences of quoted text in a news article?
A hand-crafted extractor may not be such a bad thing, especially if you need to implement it quickly. Another option is to use an existing rule-processing engine.
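A hand-crafted extractor could be as small as one regular expression. This sketch assumes quotes are delimited by straight or curly double quotes and are never nested; the sample article text is made up.

```python
import re

# Match text between straight double quotes or between curly double quotes.
QUOTE_PATTERN = re.compile(r'"([^"]+)"|“([^”]+)”')

def find_quotes(text):
    """Return the quoted passages in order of appearance."""
    return [m.group(1) or m.group(2) for m in QUOTE_PATTERN.finditer(text)]

article = 'The mayor said, "We will rebuild." A resident called the plan “long overdue”.'
quotes = find_quotes(article)
```

Real news text also contains single quotes, apostrophes, and quotes that span paragraphs, which is where a rule engine or more careful parsing starts to pay off.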
How would you build a system that auto corrects text that has been generated by a speech recognition system?
This is a big problem that has no standard solution today. A language model is one possible solution. It differs from the common word-level language model, though: here we need a character-level language model.
What is latent semantic indexing and where can it be applied?
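LSI projects the term-document matrix into another, lower-dimensional space, usually with a truncated singular value decomposition (SVD). The goal is to get the best result from compressing: it can be understood as compressing the information with minimum information loss. LDA is a related, probabilistic way of finding latent topics.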
What are stop words? Describe an application in which stop words should be removed.
Stop words usually refer to words that carry little useful information in a sentence. The exact list varies by use case; in the NLTK stop word list, for example, "me", "i", and "myself" are common cases.
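A minimal sketch of stop word removal before keyword extraction. The stop word set below is a small hand-picked subset chosen for this example, not the full NLTK list.

```python
# Small illustrative subset; a real application would load a full list.
STOP_WORDS = {"i", "me", "my", "myself", "the", "a", "an", "is", "of", "and", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in stop_words]

keywords = remove_stop_words("I bought myself a copy of the book")
```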
How would you design a model to predict whether a movie review was positive or negative?
For an unsupervised approach, one simple idea is lexicon-based sentiment analysis: collect lists of positive and negative words and count how many of each appear in the review. Beyond that, this question can be treated as a very deep problem.
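The word-list idea can be sketched in a few lines. The two lexicons here are tiny and made up for the example, and the scoring ignores negation ("not good") entirely.

```python
import re

# Hypothetical mini-lexicons; real ones contain thousands of entries.
POSITIVE = {"good", "great", "excellent", "love", "wonderful", "enjoyable"}
NEGATIVE = {"bad", "boring", "terrible", "hate", "awful", "dull"}

def lexicon_sentiment(review):
    """Count positive minus negative words and map the score to a label."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label_pos = lexicon_sentiment("A wonderful cast and a great story.")
label_neg = lexicon_sentiment("Boring plot, terrible acting.")
```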
How to implement flexible text matching
- Soundex
- Metaphone
- Edit Distance
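Of the three techniques above, edit distance is the easiest to write by hand. This is a minimal sketch of the Levenshtein variant (insert, delete, substitute, each with cost 1), using dynamic programming with two rows; the function name is my own.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]

distance = edit_distance("kitten", "sitting")
```

For flexible matching you would typically accept a candidate when the distance is below some threshold relative to the word length.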
What is beam search?
In an optimization problem, a solution that relies on BFS or DFS tries to search all possible solutions exhaustively. If partial solutions can be scored partway down the search tree, we can keep only the best-scoring candidates at each level (the beam) and prune the rest of the tree. The pro is obvious: it saves computing and memory resources. However, it cannot guarantee the best solution.
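A minimal sketch of beam search over token sequences. The transition table `NEXT_PROBS` is a hypothetical toy model invented for this example, chosen so that a beam of width 1 (greedy search) misses the best sequence while a beam of width 2 finds it.

```python
import math

# Hypothetical toy model: probability of the next token given the previous one.
NEXT_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.9, "b": 0.1},
}

def beam_search(start, steps, beam_width):
    """Keep only the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in NEXT_PROBS[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune everything else
    return beams

greedy_seq, greedy_score = beam_search("<s>", steps=2, beam_width=1)[0]
wide_seq, wide_score = beam_search("<s>", steps=2, beam_width=2)[0]
```

The greedy run commits to "a" first and ends with probability 0.3, while the wider beam keeps "b" alive and finds the sequence with probability 0.36.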
Language Modeling
- Online training with word2vec? With gensim, it is possible to add new vocabulary to an existing model.
Topic Modeling & Summarization
What are the very basic types of deep learning?
The real answer here is that there are infinitely many. However, I don't believe anyone can actually know everything.
- MLP, Boltzmann machine, CNN, RNN, GAN, LSTM, autoencoder
- Pooling layer, ReLU layer, fully connected layer
- BatchNormalization
Initialization
Activation function
Sigmoid, ReLU, Step, Tanh, Softmax
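For reference, here is a minimal pure-Python sketch of the activation functions listed above. Treating the step function as 1 at exactly 0 is a convention I assumed, and the softmax subtracts the maximum input for numerical stability.

```python
import math

def sigmoid(x):
    """Squash x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive values through, clamp negatives to 0."""
    return max(0.0, x)

def step(x):
    """Binary threshold at 0 (assumed convention: step(0) == 1)."""
    return 1.0 if x >= 0 else 0.0

def tanh(x):
    """Squash x into (-1, 1)."""
    return math.tanh(x)

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```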
Cost function
- tf.nn.weighted_cross_entropy_with_logits
- Noise Contrastive Estimation
-
Gradient descent
Adam, SGD, …
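A minimal sketch of plain gradient descent on a one-dimensional function, to show the update rule that SGD and Adam build on. The objective, learning rate, and step count are assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly move x against the gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

SGD applies the same update using a gradient estimated from a mini-batch, and Adam additionally rescales each step using running averages of the gradient and its square.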
Overfitting
Batch normalization, dropout
TensorFlow
Constant, variable, placeholder, session
Hyper parameter
There are tons of hyperparameters, so I am not going to list them all.
- Learning rate
-
Building process
TensorFlow is the only library I've used so far (I count Keras as part of TensorFlow). Following Google's talk, they strongly suggest that people use Keras and other high-level tools first. In this post, I hope to write down higher-level ideas, mainly for interviews.
Backward propagation
Knowledge supplement :)
While collecting questions and trying to answer them, I reviewed some key tools. I hope they'll also be helpful for readers.
HMM & POS
The main idea of Hidden Markov models
The output at time $i$ depends on the input at time $i$ and the outputs at previous times $i-1$, $i-2$, …:
Here are two examples that use the HMM concept.
- Language model: $p(w_i \mid w_{i-2}, w_{i-1})$
- Hidden Markov model: $p(y_i \mid w_i, y_{i-1}, y_{i-2}, \ldots)$
  - e.g., $y_i$ is the part-of-speech tag at time $i$
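The language model case can be sketched by estimating the trigram probability of a word given the two previous words directly from counts (maximum likelihood, no smoothing). The three training sentences and the padding tokens `<s>` and `</s>` are assumptions for this example.

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Count trigrams and their two-word contexts."""
    trigrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            trigrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts

def trigram_prob(model, w2, w1, w):
    """Estimate p(w | w2, w1) as count(w2, w1, w) / count(w2, w1)."""
    trigrams, contexts = model
    context = (w2, w1)
    if contexts[context] == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return trigrams[(w2, w1, w)] / contexts[context]

lm = train_trigram_lm(["the cat sat", "the cat ran", "the dog sat"])
p_sat = trigram_prob(lm, "the", "cat", "sat")
```

A tagger follows the same counting pattern, except that the conditioning context also includes the previous tags.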
- filling the empty
- review the old story
- SVM
- Random forest
- Decision tree
- A/B test basic
- Metrics: F1, precision, recall
- language model
- Transformer-Decoder architecture
- RNN family
Best graphical explanation
- GRU
- LSTM
- Attention model
Speech Recognition
Collecting the most common DL frameworks according to Speech Recognition Using Deep Neural Networks: A Systematic Review
Pointer network
Reference
PDF Extraction
Purpose: PDF + image + handwriting
PDF Miner
pip3 install pdfminer.six
Pros: Asian language (CJK) support
Cons:
- Text based pdf only.
- Need W ,L M
Tesseract OCR
- Amazon Textract
- Version 4 using LSTM for
- Multiple language support
- Pytesseract
-
OpenCV
text = pytesseract.image_to_data(im, config=config, output_type=pytesseract.Output.DATAFRAME)
Image preprocessing
Retraining
Table Scan
- Table area
Camelot pdfminer +