Predictive Text Embedding through Large-scale Heterogeneous Text Networks (PTE) is an extension of the authors' previous network embedding algorithm, LINE. It utilizes both labeled and unlabeled data to learn a text representation with strong predictive power for tasks such as text classification.
- This article is a simple review/note from my report on this paper
- For more detailed and straightforward information with images, refer to the report at the bottom of this page
1 Introduction
Motivation
Evaluation of existing methods
Unsupervised text embedding methods (Skip-gram, Paragraph Vector)
- Advantages
- Simple
- Scalable
- Effective
- Easy to tune and accommodate unlabeled data
- Disadvantages
- Yield inferior results compared to sophisticated deep learning architectures like CNNs
- Because the deep neural networks fully leverage the labeled information available for a task when learning the representations
Reasons for the above
- These text embedding methods learn in an unsupervised way
- They do not leverage the labeled information available for the task
- The low dimensional representations learned are not particularly tuned for any task (but are applicable to many different tasks)
Disadvantages of CNN
- It is computationally intensive
- It assumes that a large amount of labeled examples is available
- It requires exhaustive tuning of many parameters, which is time-consuming for experts and infeasible for non-experts
Problem Definition
- Learn a representation of text that is optimized for a given text classification task, i.e. one with strong predictive power
- The basic idea is to incorporate both the labeled and unlabeled information when learning the text embeddings
2 Related Work
2.1 Distributed Text Embedding
Supervised — only use labeled data
- Based on DNN like
- RNTNs (Recursive neural tensor networks)
- Each word <—> a low dimensional vector
- Apply the same tensor-based composition function over the sub-phrases/words in a parse tree to recursively learn the embeddings of the phrases
- CNNs (Convolutional neural networks)
- Word -> Vector
- Context Windows -> the same Convolutional kernel -> a max-pooling & fully connected layer
- If unlabeled data is to be utilized
- Indirect approaches are used, such as pretraining the word embeddings with unsupervised methods
Unsupervised
- Learn the embedding by utilizing word co-occurrences in the local context (Skip-gram) or at the document level (Paragraph Vector)
2.2 Information Network Embedding
Learning representations through a heterogeneous text network reduces to the problem of network/graph embedding
Classical graph embedding algorithms
- not applicable for embedding large-scale networks (millions of vertices & billions of edges)
Recent attempt to embed very large-scale networks
- Perozzi’s “DeepWalk”
- uses truncated random walks on the network
- is only applicable to networks with binary edges
- The author’s previous model “LINE”
- Both of them are unsupervised and only handle homogeneous networks
PTE extends LINE to deal with heterogeneous networks
3 Our Model — PTE (Predictive Text Embedding)
Characteristics
- semi-supervised
- utilize both labeled & unlabeled data
Process
- Labeled information & different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network
- Then it is embedded into a low dimensional space through a principled & efficient algorithm
- This low dimensional embedding
- not only preserves the semantic closeness of words and documents
- but also has a strong predictive power for the particular task
Heterogeneous Text Network
Three types of bipartite networks
- Word-Word Network
- Word-Document Network
- Word-Label Network
The heterogeneous text network is the combination of the above three bipartite networks (a construction sketch follows the list):
1. Word-Word Network: edge weights are word co-occurrence counts within a local context window
2. Word-Document Network: edge weights are the number of times a word appears in a document
3. Word-Label Network: edge weights are the term frequencies of a word in the documents of each class
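As a rough illustration, here is a minimal sketch of how the three bipartite networks could be assembled from a partially labeled corpus; the function and variable names are my own (not from the paper's code), and the weighting is the simplest possible choice (raw counts):

```python
from collections import defaultdict

def build_heterogeneous_text_network(docs, labels, window=5):
    """Build the three bipartite networks as edge-weight dictionaries.

    docs:   list of tokenized documents (lists of words)
    labels: labels[d] is the class of document d, or None if unlabeled
    """
    ww = defaultdict(float)  # word-word: co-occurrences within a sliding window
    wd = defaultdict(float)  # word-document: term frequency of a word in a document
    wl = defaultdict(float)  # word-label: term frequency of a word in a class (labeled docs only)

    for d, tokens in enumerate(docs):
        for i, w in enumerate(tokens):
            wd[(w, d)] += 1.0
            if labels[d] is not None:
                wl[(w, labels[d])] += 1.0
            # symmetric co-occurrence counts in a local context window
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ww[(w, tokens[j])] += 1.0
    return ww, wd, wl
```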
Model
The conditional probability of $v_i$ ($v_i \in V_A$) being generated by $v_j$ ($v_j \in V_B$) is defined as the softmax of the inner product of their embedding vectors; this gives every vertex $v_j$ a conditional distribution $p(\cdot|v_j)$ over the vertices in $V_A$
The embeddings are learned by making this conditional distribution close to its empirical distribution, weighted by the importance of each vertex:
- $\lambda_j$: the importance of vertex $v_j$, estimated by its degree $deg_j = \sum_i w_{ij}$
- $\hat{p}(v_i|v_j)$: the empirical distribution, estimated by $w_{ij} / deg_j$
Finally, the overall objective function is the sum of the three bipartite networks' objective functions
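Written out (my transcription of the LINE-style formulation the paper builds on; $\vec{u}_i$ is the embedding vector of vertex $v_i$ and $d(\cdot,\cdot)$ is the KL-divergence):

$$p(v_i \mid v_j) = \frac{\exp(\vec{u}_i^{\,T} \cdot \vec{u}_j)}{\sum_{v_{i'} \in V_A} \exp(\vec{u}_{i'}^{\,T} \cdot \vec{u}_j)}$$

$$O = \sum_{v_j \in V_B} \lambda_j \, d\big(\hat{p}(\cdot \mid v_j),\, p(\cdot \mid v_j)\big) = -\sum_{(i,j) \in E} w_{ij} \log p(v_i \mid v_j) + \text{const}$$

$$O_{pte} = O_{ww} + O_{wd} + O_{wl}$$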
Approach
Optimized with SGD, using edge sampling & negative sampling
- Steps (see the sketch after this list)
- Sample an edge with probability proportional to its weight
- Sample K negative edges from a noise distribution $p_n(j)$
- Update the embeddings with a gradient step
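A minimal NumPy sketch of one such update on a single bipartite network; the function, its hyperparameters, and the uniform noise distribution are simplifications of mine (the paper, following LINE, uses alias-table edge sampling and a degree-based noise distribution):

```python
import numpy as np

def pte_sgd_step(U_a, U_b, edges, edge_probs, rng, lr=0.025, K=5):
    """One stochastic update on a single bipartite network.

    U_a, U_b:   embedding matrices of the two vertex sets (one row per vertex)
    edges:      list of (i, j) index pairs, with i indexing U_a and j indexing U_b
    edge_probs: edge weights normalized to sum to 1
    """
    # 1. Sample an edge with probability proportional to its weight.
    e = rng.choice(len(edges), p=edge_probs)
    i, j = edges[e]

    # 2. The sampled edge is the positive pair; draw K negative vertices
    #    from the noise distribution (uniform here for simplicity).
    pairs = [(i, 1.0)] + [(int(rng.integers(len(U_a))), 0.0) for _ in range(K)]

    # 3. Gradient step on the negative-sampling objective:
    #    log sigma(u_i . u_j) + sum over negatives of log sigma(-u_neg . u_j).
    update_j = np.zeros_like(U_b[j])
    for t, label in pairs:
        sigma = 1.0 / (1.0 + np.exp(-U_a[t] @ U_b[j]))
        g = lr * (label - sigma)
        update_j += g * U_a[t]
        U_a[t] += g * U_b[j]
    U_b[j] += update_j
```

In the joint training described in the paper, such updates alternate over the three bipartite networks, sampling an edge from one of them at each step.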
Result
- Compared to recent supervised approaches based on CNNs, PTE is
- comparable, and often more effective
- more efficient
- easier to tune (fewer parameters)
4 Experiment
refer to the report below