SECURITY: api.load() recklessly downloads & runs arbitrary python code

The `api.load()` utility will grab fresh Python code, in the form of an `__init__.py` file inside the GitHub project `gensim-data` download area, and then run it on the user's machine.

This form of dynamic code loading & execution:

* violates user expectations: requesting a dataset should not run arbitrary new code that was never explicitly installed on the user's machine. Further, there's no indication in the `api.load()` docs that this could occur.
* creates a severe security risk: if a bad-faith actor obtains the ability to edit the `gensim-data` GitHub "releases" files, they can cause arbitrary new code to run on gensim users' machines whenever those users call `api.load()`. The users would think, "I'm still using this old, assumed-safe-through-widespread-use gensim-X.Y.Z version", but they'd be running arbitrary all-new code from over the network. It's hard to tell who has `gensim-data` project rights. It's also not clear that anyone would quickly notice edits/changes there.
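
To make the mechanism concrete, here is a minimal sketch of why importing a fetched `__init__.py` amounts to running whatever it contains. It is not gensim's actual implementation: the module source, the `import_downloaded_module` helper, and the `downloaded_shim` name are all hypothetical, and a local string stands in for the network download.

```python
import importlib.util
import tempfile
from pathlib import Path

# Stand-in for a file fetched over the network (hypothetical content).
DOWNLOADED_SOURCE = '''
SIDE_EFFECTS = ["top-level code ran at import time"]

def load_data():
    return "dataset"
'''


def import_downloaded_module(source, name="downloaded_shim"):
    """Write `source` to disk and import it, as a dynamic loader might."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "__init__.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        # Every top-level statement in the file executes here, before the
        # caller has asked for any data.
        spec.loader.exec_module(module)
        return module


module = import_downloaded_module(DOWNLOADED_SOURCE)
```

Whoever controls the served file controls those top-level statements, so the installed package version says nothing about what actually runs.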

Further, these `__init__.py` files in the `gensim-data` releases aren't even in version control – instead, they're uploaded as 'assets' via the GitHub releases interface. (There's no file-size need to do this; the `__init__.py` I reviewed, for `wiki-english-20171001` at <https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py>, is just a tiny shim, and I imagine most other such files are as well. It's code; it should be versioned.)

That they are not in version control makes them hard to review through normal means (such as browsing the GitHub website), and raises the possibility they could be changed to something malicious, and then back again, without anyone noticing or it being visible in any persistent logs.

Recommendations: 

The `api.load()` mechanism should be immediately redesigned to not load any new code over the network – and the developer guidelines for gensim & associated projects should make it clear such dynamic loading of code outside normal package-installation processes (like `pip`) is unacceptable. 

If supporting new datasets requires dataset-specific code, that code should go through normal collaboration/version-control/release procedures, waiting for a new `pip`-installable `gensim` (or other new supporting project) explicit release before running on users' computers. 
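
As a rough illustration of that alternative, the sketch below keeps all dataset-specific code inside the installed package and treats anything fetched over the network as inert bytes. Every name in it (`LOADERS`, `EXPECTED_SHA256`, `load_lines`, `load_dataset`, `example-corpus`) is hypothetical and not part of gensim. The pinned checksum is an extra assumption beyond what is proposed above, and in a real release it would be a literal hex string rather than computed inline.

```python
import hashlib


def load_lines(raw):
    """Dataset-specific parsing that ships with the installed package."""
    return raw.decode("utf-8").splitlines()


# Hypothetical registry: loaders are ordinary, version-controlled functions.
LOADERS = {"example-corpus": load_lines}

# Hypothetical pinned digests (computed inline only to keep the sketch
# self-contained).
EXPECTED_SHA256 = {
    "example-corpus": hashlib.sha256(b"first doc\nsecond doc\n").hexdigest(),
}


def load_dataset(name, raw):
    """Parse downloaded bytes with packaged code; never import or exec them."""
    if name not in LOADERS:
        raise KeyError(f"no packaged loader for dataset {name!r}")
    digest = hashlib.sha256(raw).hexdigest()
    if digest != EXPECTED_SHA256[name]:
        raise ValueError(f"checksum mismatch for dataset {name!r}")
    return LOADERS[name](raw)
```

With this shape, supporting a new dataset means adding a loader and a digest in a reviewed commit and shipping them in a normal `pip` release.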

ATTN: @menshikh-iv @piskvorky @chaitaliSaini 

SECURITY: api.load() recklessly downloads & runs arbitrary python code #2283
