Course: Natural Language Processing and Information Retrieval
Instructor: Alessandro Moschitti
Objective: Using Tree Kernels for question classification.
Tasks:
- Separate labels and questions
- Parse (English grammer using Stanford Parser)
- Using Senna for Semantic role labelling and Part of speech tagging
- Make PAS from SRL output
- Features (Parse Tree, Bag of Words, Bag of Parts of Speech, Predicate Argument Structure, and TF-IDF Vector)
- Train Tree kernel on train 5500 with 5-fold cross validation
- Test on TREC 2010
Example Tree from a question
Data: TREC 2012 for testing and training questions from "Experimental Data for Question Classification."
Folders:
- data
- .label (label Question)
- .lbl (only label)
- .q (only questions)
- .portStem (questions after using Porter stemmer)
- .stemmed (questions after using Paice/Husk stemmer)
| Class | Definition |
|---|---|
| ABBREVIATION | abbreviation |
| abb | abbreviation |
| exp | expression abbreviated |
| ENTITY | entities |
| animal | animals |
| body | organs of body |
| color | colors |
| creative | inventions, books and other creative pieces |
| currency | currency names |
| dis.med. | diseases and medicine |
| event | events |
| food | food |
| instrument | musical instrument |
| lang | languages |
| letter | letters like a-z |
| other | other entities |
| plant | plants |
| product | products |
| religion | religions |
| sport | sports |
| substance | elements and substances |
| symbol | symbols and signs |
| technique | techniques and methods |
| term | equivalent terms |
| vehicle | vehicles |
| word | words with a special property |
| DESCRIPTION | description and abstract concepts |
| definition | definition of sth. |
| description | description of sth. |
| manner | manner of an action |
| reason | reasons |
| HUMAN | human beings |
| group | a group or organization of persons |
| ind | an individual |
| title | title of a person |
| description | description of a person |
| LOCATION | locations |
| city | cities |
| country | countries |
| mountain | mountains |
| other | other locations |
| state | states |
| NUMERIC | numeric values |
| code | postcodes or other codes |
| count | number of sth. |
| date | dates |
| distance | linear measures |
| money | prices |
| order | ranks |
| other | other numbers |
| period | the lasting time of sth. |
| percent | fractions |
| speed | speed |
| temp | temperature |
| size | size, area and volume |
| weight | weight |
Languages and tools used
- Python - WinPython
- nltk
- nlpnet
- Java
- Stanford Tokenizer
- Stanford NER
- The GATE Predicate-Argument EXtractor Component
- Stemmers [optional]
- SENNA: A Fast Semantic Role Labeling
- Toolkit for Advanced Discriminative Modeling
- Malt Parser
- Hunpos Tagger
- batch scripting