Skip to content

Conversation

@pegasus-lynx
Copy link

Hello Team,

We have been trying to add the support for ngrams and skipgrams into nlcodec. The PR above does the following:

  • Adds the following schemes : NgramScheme , SkipgramScheme , MWEScheme ( combines both NgramScheme and SkipgramScheme )
  • Added a list of PMI variants to allow merging of tokens on the basis of the metrics other than freq.

The implementation for the above cannot directly be merged in the code base as the code is not properly factored. However, we would like you to review the code and see how does it align with the purpose of the codec.

@thammegowda
Copy link
Member

Hi @pegasus-lynx thanks for the pull request.

I will gladly review it, and look forward to merging it.
This week is a little busy one as the ACL is going on ... I will get back to this after a few days.

Thanks again for submitting the PR! Its looking good.

@thammegowda
Copy link
Member

Hey @pegasus-lynx
Sorry for the long delay.

I tried to test these new features. But I am clueless regarding how to use this API.
I need a simple example of how to use ngram, skipgram and MWE.

A simple test case for creating vocab would do; here is an example:

def test_bpe():
vocab_size = 6000
args = dict(inp=IO.read_as_stream(paths=[en_txt, fr_txt]),
level='bpe',
vocab_size=vocab_size,
min_freq=1,
term_freqs=False,
char_coverage=0.99999,
min_co_ev=2)
with tempfile.TemporaryDirectory() as tmpdir:
model_file = Path(tmpdir) / 'model.tsv'
args['model'] = model_file
table = nlc.learn_vocab(**args)
assert len(table) == vocab_size
table2, meta = nlc.Type.read_vocab(model_file)
assert len(table2) == len(table)
table_str = '\n'.join(x.format() for x in table)
table2_str = '\n'.join(x.format() for x in table2)
assert table_str == table2_str

AND/OR a sample usage e.g. https://github.com/isi-nlp/nlcodec/blob/master/docs/intro.adoc#python-api so we

Thanks.

Once again, I apologize for the delay.

@thammegowda
Copy link
Member

@pegasus-lynx let me know if this is ready for a review!

@pegasus-lynx
Copy link
Author

Hey @thammegowda, sorry for the delay. We have moved in a different direction with this branch. I will close this PR and create a new PR for this. As of now, the SkipScheme has some issues in decoding, so it is still not ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants