
UDA Example #343

Open
jrxk wants to merge 91 commits into asyml:master from jrxk:uda_example

Conversation

@jrxk
Collaborator

@jrxk jrxk commented Dec 19, 2020

This PR fixes #293.

Description of changes
This PR adds an example that uses UDA to train a text classifier on the IMDB text classification dataset. Please see the README for details.

Test Conducted
Performed experiments with/without UDA, under supervised and semi-supervised settings.
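For context on what the example trains, here is a minimal sketch of the UDA objective the PR implements: a supervised cross-entropy loss on labeled data plus a KL-divergence consistency loss that pushes predictions on an unlabeled example and its augmented version (e.g. a back-translation) to agree. This is an illustrative NumPy sketch, not the code in this PR; the function and argument names are my own.

```python
# Illustrative sketch of the UDA training objective (not the PR's code).
# Supervised cross-entropy on labeled data + a consistency term that makes
# predictions on an unlabeled example and its augmented version agree.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uda_loss(sup_logits, labels, unsup_logits_orig, unsup_logits_aug, lam=1.0):
    """Supervised cross-entropy + lam * KL consistency, averaged per batch."""
    probs = softmax(sup_logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # Consistency: KL(p_orig || p_aug); p_orig is treated as a fixed target
    # (the "stop gradient" in the UDA paper).
    p = softmax(unsup_logits_orig)
    q = softmax(unsup_logits_aug)
    kl = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))
    return ce + lam * kl
```

When the augmented predictions match the original ones, the consistency term vanishes and the loss reduces to plain supervised cross-entropy; the semi-supervised setting in the experiments above adds the second term over unlabeled IMDB reviews.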

@jrxk jrxk added the data_aug Features on data augmentation label Dec 19, 2020
@jrxk jrxk requested a review from hunterhector December 19, 2020 20:54
@jrxk jrxk self-assigned this Dec 19, 2020
Member

@hunterhector hunterhector left a comment


The whole thing is still non-Forte, with no clear documentation of how to use UDA.

If a user looks at this example, they wouldn't know how to incorporate UDA. There is no explanation in the README and no comments in main.py. All they can see is 500 lines of code, with no idea where to start or which part is important.

Previous comments on the last PR are not fixed, such as the one about calling subprocess from Python, and there is no use of Forte at all. Why would I need this framework instead of going to Google's code?

Comment thread examples/text_classification/download_imdb.py Outdated
Comment thread examples/text_classification/download_imdb.py Outdated
Comment thread examples/text_classification/utils/imdb_format.py Outdated
Comment thread examples/text_classification/utils/data_utils.py Outdated
@codecov

codecov Bot commented Dec 24, 2020

Codecov Report

❌ Patch coverage is 54.16667% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.78%. Comparing base (900cf93) to head (4c1b6fb).
⚠️ Report is 436 commits behind head on master.

Files with missing lines Patch % Lines
forte/data/readers/largemovie_reader.py 52.17% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
- Coverage   81.87%   81.78%   -0.10%     
==========================================
  Files         187      187              
  Lines       11941    11940       -1     
==========================================
- Hits         9777     9765      -12     
- Misses       2164     2175      +11     

☔ View full report in Codecov by Sentry.

Member

@hunterhector hunterhector left a comment


I think we'd better place all data aug examples in the same folder: https://github.com/asyml/forte/tree/master/examples/data_augmentation

Comment thread examples/text_classification/utils/data_utils.py Outdated
Comment thread examples/text_classification/utils/data_utils.py
Member

@hunterhector hunterhector left a comment


This PR is still of very low quality and cannot be merged in its current state. For example, where is the step to generate the back-translation data? How can users do it on their own?

Here is a list of problems:

  1. There are no variable types.
  2. There are no instructions on how to get the back-translation data.
  3. A lot of code is directly copied from Google.
  4. The docstrings do not follow our standard.
  5. After the suggestion to use the training team's format, there is no change; there is still a lot of ad-hoc code such as InputFeatures and InputExample.

If that's the quality of this work, there's no reason a user would use this instead of Google's implementation.
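For readers following along: the back-translation data the reviewer is asking about is produced by round-tripping each unlabeled sentence through a pivot language (e.g. English → French → English) to get a paraphrase, which then pairs with the original for the consistency loss. A conceptual sketch, with hypothetical stub translators standing in for the real trained models (the actual example would use tensor2tensor checkpoints):

```python
# Conceptual sketch of back-translation augmentation. The two translators
# below are hypothetical stand-ins for trained translation models.
from typing import Callable, List

def back_translate(sentences: List[str],
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> List[str]:
    """Round-trip each sentence through a pivot language to get a paraphrase."""
    return [from_pivot(to_pivot(s)) for s in sentences]

# Toy stand-ins: a real setup samples translations (e.g. with a sampling
# temperature) so the round trip does not return the input verbatim.
def en_to_fr(s: str) -> str:
    return s.replace("movie", "film")

def fr_to_en(s: str) -> str:
    return s.replace("film", "picture")

# Each (original, paraphrase) pair feeds the UDA consistency loss.
augmented = back_translate(["a great movie"], en_to_fr, fr_to_en)
```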

Comment thread examples/data_augmentation/uda/config_data.py Outdated
Comment thread examples/data_augmentation/uda/utils/data_utils.py
Comment thread examples/data_augmentation/uda/utils/data_utils.py Outdated
Comment thread examples/data_augmentation/uda/utils/data_utils.py
Comment thread examples/data_augmentation/uda/main.py
Comment thread examples/data_augmentation/uda/utils/imdb_format.py Outdated
Comment thread examples/data_augmentation/uda/utils/data_utils.py Outdated
Comment thread examples/data_augmentation/uda/main.py Outdated
Member

@hunterhector hunterhector left a comment


I only see the back_trans folder with the requirements file; I guess the user still needs to refer to Google?

Another way is to document the steps to get back-translation data from Google, rather than copying the code over.

Again, most docstrings do not follow our standard.

Comment thread forte/data/readers/largemovie_reader.py Outdated
Comment thread examples/data_augmentation/uda/utils/model_utils.py
Comment thread examples/data_augmentation/uda/utils/model_utils.py Outdated
@jrxk
Collaborator Author

jrxk commented Dec 29, 2020

I only see the back_trans folder with the requirements file; I guess the user still needs to refer to Google?

Another way is to document the steps to get back-translation data from Google, rather than copying the code over.

Again, most docstrings do not follow our standard.

Sure. It's still a work in progress. I will add instructions on how to install tensor2tensor with the right version of TensorFlow and how to use the pre-trained translation models. Thanks!


Labels

data_aug Features on data augmentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a text classifier for the IMDB large movie dataset

3 participants