File: README.md

package info (click to toggle)
fasttext 0.9.2%2Bds-8
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 4,940 kB
  • sloc: cpp: 5,459; python: 2,427; javascript: 635; sh: 621; makefile: 106; xml: 81; perl: 43
file content (67 lines) | stat: -rw-r--r-- 2,871 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
## Alignment of Word Embeddings

This directory provides code for learning alignments between word embeddings in different languages.

The code is in Python 3 and requires [NumPy](http://www.numpy.org/).

The script `example.sh` shows how to use this code to learn and evaluate a bilingual alignment of word embeddings.

The word embeddings used in [1] can be found on the [fastText project page](https://fasttext.cc) and the supervised bilingual lexicons on the [MUSE project page](https://github.com/facebookresearch/MUSE).

### Supervised alignment

The script `align.py` aligns word embeddings from two languages using a bilingual lexicon as supervision.
The details of this approach can be found in [1].

### Unsupervised alignment

The script `unsup_align.py` aligns word embeddings from two languages without requiring any supervision.
Additionally, the script `unsup_multialign.py` aligns multiple languages to a common space with no supervision.
The details of these approaches can be found in [2] and [3] respectively.

In addition to NumPy, the unsupervised methods require the [Python Optimal Transport](https://pot.readthedocs.io/en/stable/) toolbox.

### Download

Wikipedia fastText embeddings aligned with our method can be found [here](https://fasttext.cc/docs/en/aligned-vectors.html).

### References

If you use the supervised alignment method, please cite:

[1] A. Joulin, P. Bojanowski, T. Mikolov, H. Jegou, E. Grave, [*Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion*](https://arxiv.org/abs/1804.07745)

```
@InProceedings{joulin2018loss,
    title={Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion},
    author={Joulin, Armand and Bojanowski, Piotr and Mikolov, Tomas and J\'egou, Herv\'e and Grave, Edouard},
    year={2018},
    booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
}
```

If you use the unsupervised bilingual alignment method, please cite:

[2] E. Grave, A. Joulin, Q. Berthet, [*Unsupervised Alignment of Embeddings with Wasserstein Procrustes*](https://arxiv.org/abs/1805.11222)

```
@article{grave2018unsupervised,
    title={Unsupervised Alignment of Embeddings with Wasserstein Procrustes},
    author={Grave, Edouard and Joulin, Armand and Berthet, Quentin},
    journal={arXiv preprint arXiv:1805.11222},
    year={2018}
}
```

If you use the unsupervised alignment script `unsup_multialign.py`, please cite:

[3] J. Alaux, E. Grave, M. Cuturi, A. Joulin, [*Unsupervised Hyperalignment for Multilingual Word Embeddings*](https://arxiv.org/abs/1811.01124)

```
@article{alaux2018unsupervised,
  title={Unsupervised hyperalignment for multilingual word embeddings},
  author={Alaux, Jean and Grave, Edouard and Cuturi, Marco and Joulin, Armand},
  journal={arXiv preprint arXiv:1811.01124},
  year={2018}
}
```