Description
The api.load() utility fetches fresh Python code, in the form of an __init__.py file from the Github project gensim-data download area, and then runs it on the user's machine.
This form of dynamic code loading & execution:
- violates user expectations: requesting a dataset should not run arbitrary new code that was never explicitly installed on the user's machine. Further, there's no indication in the api.load() docs that this could occur.
- creates a severe security risk: if a bad-faith actor obtains the ability to edit the gensim-data Github "releases" files, they can cause arbitrary new code to run on gensim users' machines when those users call api.load(). The users would think, "I'm still using this old, assumed-safe-through-widespread-use gensim-X.Y.Z version" – but they'd be running arbitrary all-new code from over the network. It's hard to tell who has gensim-data project rights. It's also not clear that anyone would quickly notice edits/changes there.
Further, these __init__.py files in the gensim-data releases aren't even in version-control – instead, they're uploaded as 'assets' via the Github releases interface. (There's no file-size need to do this; the __init__.py I reviewed, for wiki-english-20171001 at https://github.com/RaRe-Technologies/gensim-data/releases/download/wiki-english-20171001/__init__.py, is just a tiny shim and I imagine most other such files are, as well. It's code, it should be versioned.)
That they are not in version-control makes them hard to review through normal means (such as browsing the Github website), and raises the possibility they could be changed to something malicious, and then back again, without anyone noticing or it being visible in any persistent logs.
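To make the concern concrete, here is a minimal local simulation of the pattern described above. It is only a sketch, not gensim's actual code: the "download" is faked by writing a file into a temporary directory, and the names `simulate_fetch`, `load_like_a_dynamic_loader`, `demo_dataset_shim` and `DEMO_SIDE_EFFECT` are all made up for illustration. The point is that importing a fetched __init__.py runs whatever is at its top level before any load function is even called.

```python
import importlib
import os
import sys
import tempfile


def simulate_fetch(base_dir, name, source):
    """Hypothetical stand-in for a network download.

    Writes `source` to <base_dir>/<name>/__init__.py, the way a downloader
    might store a fetched shim next to its dataset.
    """
    pkg_dir = os.path.join(base_dir, name)
    os.makedirs(pkg_dir, exist_ok=True)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write(source)


def load_like_a_dynamic_loader(base_dir, name):
    """Import the fetched package and call its load_data() function."""
    sys.path.insert(0, base_dir)
    try:
        importlib.invalidate_caches()
        # Importing executes every top-level statement in __init__.py.
        module = importlib.import_module(name)
    finally:
        sys.path.remove(base_dir)
    return module.load_data()


# Whoever controls this text controls what runs on the user's machine.
SHIM = '''
import os
os.environ["DEMO_SIDE_EFFECT"] = "ran at import time"

def load_data():
    return ["some", "data"]
'''

base = tempfile.mkdtemp()
simulate_fetch(base, "demo_dataset_shim", SHIM)
data = load_like_a_dynamic_loader(base, "demo_dataset_shim")
side_effect = os.environ.get("DEMO_SIDE_EFFECT")
```

Here the side effect is a harmless environment variable, but nothing in the loader restricts what the fetched file may do.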
Recommendations:
The api.load() mechanism should be immediately redesigned to not load any new code over the network – and the developer guidelines for gensim & associated projects should make it clear such dynamic loading of code outside normal package-installation processes (like pip) is unacceptable.
If supporting new datasets requires dataset-specific code, that code should go through normal collaboration/version-control/release procedures, waiting for a new pip-installable gensim (or other new supporting project) explicit release before running on users' computers.
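One possible shape for a code-free loading path, sketched under assumptions: the dataset file is treated strictly as data, its SHA-256 digest is compared against a value pinned in the installed, versioned package, and parsing is done by code that already shipped with that package. The names `load_verified`, `ChecksumMismatch` and the demo file are hypothetical, and JSON stands in for whatever format a real dataset uses. This is an illustration of the idea, not a proposed patch.

```python
import hashlib
import json
import os
import tempfile


class ChecksumMismatch(Exception):
    """Raised when a downloaded file does not match its pinned digest."""


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path, expected_sha256, parser):
    """Parse `path` with an already-installed parser, only if the digest matches.

    `expected_sha256` is assumed to come from the pip-installed package,
    not from the same remote location as the data.
    """
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ChecksumMismatch(f"expected {expected_sha256}, got {actual}")
    with open(path, "rb") as f:
        return parser(f)


# Demo with a local file standing in for a downloaded dataset.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dataset.json")
with open(demo_path, "w") as f:
    json.dump({"tokens": ["some", "data"]}, f)

# In a real release this value would be written into versioned source code.
pinned_digest = sha256_of(demo_path)

dataset = load_verified(demo_path, pinned_digest, json.load)
```

With this arrangement, someone who edits the hosted file can at worst cause a load failure; changing what code runs still requires a new reviewed, pip-installable release.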