
Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Predictive Text Embedding through Large-scale Heterogeneous Text Networks is an extension of the authors' previous network embedding algorithm, LINE. It utilizes both labeled and unlabeled data to learn a representation for text that has strong predictive power for tasks like text classification.

  • This article is just a simple review/note based on my report for this paper
  • For more detailed and straightforward information with images, refer to
    the report at the bottom of this page

1 Introduction

Motivation

Evaluation of existing methods

  • Unsupervised text embedding methods (Skip-gram, Paragraph Vector)

    • Advantages
      • Simple
      • Scalable
      • Effective
      • Easy to tune and accommodate unlabeled data
    • Disadvantages
      • Yield results inferior to sophisticated deep learning architectures like CNNs
        • Because deep neural networks fully leverage the labeled information available for a task when learning the representations
  • Reasons for the above

    • These text embedding methods learn in an unsupervised way
    • They do not leverage the labeled information available for the task
    • The learned low-dimensional representations are not particularly tuned for any task (but are applicable to many different tasks)
  • Disadvantages of CNN

    • It is computationally intensive
    • It assumes that a large amount of labeled examples is available
    • It requires exhaustive tuning of many parameters, which is time-consuming for experts and infeasible for non-experts

Problem Definition

  • Learn a representation of text that is optimized for a given text classification task, i.e., one with strong predictive power
  • The basic idea is to incorporate both the labeled and unlabeled information when learning the text embeddings

2.1 Distributed Text Embedding

Supervised — only use labeled data

  • Based on DNNs like
    • RNTNs (Recursive neural tensor networks)
      • Each word <—> a low dimensional vector
      • Apply the same tensor-based composition function over the sub-phrases/words in a parse tree to recursively learn the embeddings of the phrases
    • CNNs (Convolutional neural networks)
      • Word -> vector
      • Context windows -> the same convolutional kernel -> a max-pooling & fully connected layer (see the sketch after this list)
  • If one wants to utilize unlabeled data
    • Use indirect approaches, like pretraining the word embeddings with unsupervised methods
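As a rough illustration of the CNN pipeline above (my own sketch, not the exact architectures the paper cites; all sizes are arbitrary):

```python
import torch
import torch.nn as nn

# Minimal text-CNN sketch: word ids -> embeddings -> the same conv kernel
# slid over context windows -> max-pooling over time -> fully connected layer.
class TextCNN(nn.Module):
    def __init__(self, vocab=1000, dim=50, kernels=100, width=3, classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.conv = nn.Conv1d(dim, kernels, width)
        self.fc = nn.Linear(kernels, classes)

    def forward(self, ids):                  # ids: (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)    # (batch, dim, seq_len)
        h = torch.relu(self.conv(x))         # conv over context windows
        h = h.max(dim=2).values              # max-pooling over time
        return self.fc(h)

logits = TextCNN()(torch.randint(0, 1000, (4, 20)))
print(logits.shape)  # (batch, classes)
```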

Unsupervised

  • Learn the embedding by utilizing word co-occurrences in the local context (Skip-gram) or at the document level (Paragraph Vectors)
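For intuition, a minimal sketch (mine, not from the paper) of the local-context co-occurrence pairs that Skip-gram style training consumes; the window size is an arbitrary choice:

```python
# Extract (center, context) co-occurrence pairs within a local window,
# which is the training signal for Skip-gram style word embeddings.
def context_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(context_pairs("text embedding with strong predictive power".split()))
```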

2.2 Information Network Embedding

Representations learned through a heterogeneous text network
—> the problem of network/graph embedding

  • Classical graph embedding algorithms

    • not applicable for embedding large-scale networks (millions of vertices & billions of edges)
  • Recent attempt to embed very large-scale networks

    • Perozzi’s “DeepWalk”
      • uses truncated random walks on the network
      • only applicable to networks with binary edges
    • The author’s previous model “LINE”
    • Both of them are unsupervised and only handle homogeneous networks
  • PTE extends LINE to deal with heterogeneous networks

3 Our Model — PTE (Predictive Text Embedding)

Characteristics

  • semi-supervised
  • utilize both labeled & unlabeled data

Process

  • Labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network
  • Then it is embedded into a low-dimensional space through a principled and efficient algorithm
  • This low dimensional embedding
    • not only preserves the semantic closeness of words and documents
    • but also has a strong predictive power for the particular task

Heterogeneous Text Network

  • Three types of bipartite networks

    1. Word-Word Network
    2. Word-Document Network
    3. Word-Label Network
  • The heterogeneous text network is the combination of the three networks above

1. Word-Word Network

(figure: the word-word co-occurrence network)

2. Word-Document Network

(figure: the word-document network)

3. Word-Label Network

(figure: the word-label network)
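To make the three bipartite networks concrete, here is a toy construction sketch (my own, not the paper's code) that accumulates the edge weights $w_{ij}$ from a tiny labeled corpus; the corpus, labels, and window size are all illustrative:

```python
from collections import Counter

# Toy labeled corpus (purely illustrative): (document tokens, label)
docs = [("the cat sat on the mat".split(), "pets"),
        ("dogs chase the cat".split(), "pets"),
        ("stocks fell sharply today".split(), "finance")]

ww = Counter()  # word-word: co-occurrences within a local window
wd = Counter()  # word-document: frequency of word in document
wl = Counter()  # word-label: frequency of word in docs with that label

WINDOW = 2  # context window size, an arbitrary choice
for d, (tokens, label) in enumerate(docs):
    for i, w in enumerate(tokens):
        wd[(w, d)] += 1
        wl[(w, label)] += 1
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                ww[(w, tokens[j])] += 1

print(ww[("the", "cat")])   # a word-word edge weight
print(wd[("cat", 1)])       # a word-document edge weight
print(wl[("cat", "pets")])  # a word-label edge weight
```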

Model

  • The probability of $v_i$ ($v_i \in V_A$) being generated by $v_j$ ($v_j \in V_B$) is defined as the softmax of the inner product of their embedding vectors:

    $$p(v_i|v_j) = \frac{\exp(\vec{u}_i^{\,T} \cdot \vec{u}_j)}{\sum_{i' \in A} \exp(\vec{u}_{i'}^{\,T} \cdot \vec{u}_j)}$$

  • Represent a vertex’s conditional distribution by $p(\cdot|v_j)$

  • Make it close to its empirical distribution by minimizing

    $$O = \sum_{j \in B} \lambda_j \, d\big(\hat{p}(\cdot|v_j),\ p(\cdot|v_j)\big)$$

    where $d(\cdot,\cdot)$ is the KL-divergence between two distributions
    • $\lambda_j$: importance of the vertex $v_j$, estimated by its degree $deg_j = \sum_i w_{ij}$
    • $\hat{p}(v_i|v_j)$: the empirical distribution, estimated by $w_{ij}/deg_j$
  • Finally, the overall objective function is the sum of the objectives of the three bipartite networks:

    $$O_{pte} = O_{ww} + O_{wd} + O_{wl}$$
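A small NumPy sketch (my own, using the definitions above) of the conditional distribution $p(\cdot|v_j)$ and the degree-weighted KL objective for one bipartite network; the dimensions and weights are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                  # embedding dimension (arbitrary)
U_a = rng.normal(size=(5, dim))          # embeddings u_i of vertices in V_A
U_b = rng.normal(size=(3, dim))          # embeddings u_j of vertices in V_B
W = rng.integers(1, 4, size=(5, 3)).astype(float)  # edge weights w_ij

# p(v_i|v_j): softmax over V_A of the inner products u_i . u_j
scores = np.exp(U_a @ U_b.T)             # shape (|V_A|, |V_B|)
p = scores / scores.sum(axis=0, keepdims=True)

deg = W.sum(axis=0)                      # deg_j = sum_i w_ij
p_hat = W / deg                          # empirical distribution p_hat(v_i|v_j)

# O = sum_j lambda_j * KL(p_hat(.|v_j) || p(.|v_j)), with lambda_j = deg_j
kl = (p_hat * np.log(p_hat / p)).sum(axis=0)
O = (deg * kl).sum()
print(O)
```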

Approach

Optimized with SGD, using edge sampling & negative sampling

  • Steps
    • Sample an edge according to its proportion of the total weight
    • Sample K negative edges from a noise distribution $p_n(j)$ (proportional to $deg_j^{3/4}$, as in LINE)
    • Update the embeddings with the gradients of the sampled edges' objective
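A rough reconstruction (mine, not the authors' code) of one such SGD step with edge sampling and K negative samples, in the style of LINE; the uniform negative sampling and fixed learning rate are simplifications:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(U_a, U_b, edges, weights, K=5, lr=0.025):
    """One edge-sampling + negative-sampling update on one bipartite network.
    edges: list of (i, j) vertex-index pairs; weights: edge weights w_ij."""
    probs = np.asarray(weights, float)
    idx = rng.choice(len(edges), p=probs / probs.sum())  # sample edge by weight
    i, j = edges[idx]

    grad_j = np.zeros_like(U_b[j])
    # positive edge: increase log sigmoid(u_i . u_j)
    g = 1.0 - sigmoid(U_a[i] @ U_b[j])
    grad_j += g * U_a[i]
    U_a[i] += lr * g * U_b[j]
    # K negatives, drawn uniformly here (the paper uses a degree-based
    # noise distribution): increase log sigmoid(-u_n . u_j)
    for _ in range(K):
        n = rng.integers(len(U_a))
        g = -sigmoid(U_a[n] @ U_b[j])
        grad_j += g * U_a[n]
        U_a[n] += lr * g * U_b[j]
    U_b[j] += lr * grad_j

# toy usage: a few vertices and weighted edges, all illustrative
U_a = rng.normal(scale=0.1, size=(6, 4))
U_b = rng.normal(scale=0.1, size=(3, 4))
for _ in range(100):
    sgd_step(U_a, U_b, edges=[(0, 0), (1, 0), (2, 1), (3, 2)], weights=[3, 1, 2, 1])
```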

Result

  • Compared to recent supervised approaches based on CNNs, PTE
    • achieves comparable or better accuracy
    • is more efficient to train
    • has fewer parameters to tune

4 Experiment

Refer to the report at the bottom of this page.