trailrest.blogg.se

Kaggle competition spelling corrector






**Length information**: Character-level and word-level length.

**Length difference information**: Absolute difference in length and absolute difference in log length (both at character and word level), length ratio, and log length ratio.
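The length features listed above could be computed along the following lines. This is a minimal sketch, not the team's actual code: the function name `length_features`, the whitespace tokenization, the use of `log1p`, the `+1` smoothing, and the shorter-over-longer direction of the ratio are all assumptions made for illustration.

```python
import math


def length_features(q1: str, q2: str) -> dict:
    """Length and length-difference features for one question pair.

    Assumptions (not from the report): whitespace tokenization,
    log1p to avoid log(0), and a ratio defined as shorter / longer.
    """
    lengths = {
        "char": (len(q1), len(q2)),
        "word": (len(q1.split()), len(q2.split())),
    }
    feats = {}
    for level, (a, b) in lengths.items():
        # Length information
        feats[f"{level}_len_q1"] = a
        feats[f"{level}_len_q2"] = b
        # Length difference information
        feats[f"{level}_len_absdiff"] = abs(a - b)
        feats[f"{level}_loglen_absdiff"] = abs(math.log1p(a) - math.log1p(b))
        ratio = (min(a, b) + 1) / (max(a, b) + 1)  # +1 is assumed smoothing
        feats[f"{level}_len_ratio"] = ratio
        feats[f"{level}_len_logratio"] = math.log(ratio)
    return feats


features = length_features("how are you", "how are you doing")
```

Each pair yields twelve numeric features here, six per level, which can be appended to whatever other features a classifier uses.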


**Automatic spelling error correction**: There is no direct evidence as to whether this method is applicable. It may help our models capture sentence meaning better, but it risks destroying data (information loss). The training data also contains mislabeled examples.

The correction works in three steps. First, check whether a word is in the *GloVe* (described below) vocabulary (~100M); if not, the word is considered misspelled. Second, find a list of good replacements for the misspelled word. Finally, choose the best replacement based on *spaCy*'s smoothed log probability of the words. Some special cases are left uncorrected so as not to destroy too much data, for example when the word is a special term recognized by applying NER.
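The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy `LOG_PROB` table stands in for both the *GloVe* vocabulary and *spaCy*'s smoothed log probabilities, the `protected` set stands in for terms recognized by NER, and the candidate generator uses single-edit variants (a common approach; the report does not say how its replacement list is built). All names are hypothetical.

```python
import string

# Hypothetical toy table: word -> log probability. In the described
# pipeline, membership would be checked against the GloVe vocabulary
# and the scores would come from spaCy's smoothed log probabilities.
LOG_PROB = {
    "what": -5.0, "is": -4.0, "the": -3.0,
    "best": -7.0, "beat": -9.0, "language": -8.5,
}


def edits1(word: str) -> set:
    """All strings one edit away (assumed candidate generator)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word: str, protected=frozenset()) -> str:
    # Step 1: in-vocabulary words and protected terms are left alone.
    if word in LOG_PROB or word in protected:
        return word
    # Step 2: collect in-vocabulary replacement candidates.
    candidates = [c for c in edits1(word) if c in LOG_PROB]
    if not candidates:
        return word  # nothing plausible found; keep the original
    # Step 3: pick the candidate with the highest log probability.
    return max(candidates, key=LOG_PROB.get)


def correct_sentence(tokens, protected=frozenset()):
    return [correct_word(t, protected) for t in tokens]


corrected = correct_sentence(["what", "is", "the", "bewt", "langauge"])
```

Returning the original word when no candidate is found keeps the corrector conservative, in line with the concern about information loss.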


# Kaggle Competition: Quora Question Pairs

In the *Quora Question Pairs* competition, we were challenged to tackle a natural language processing (NLP) problem: given pairs of questions, classify whether each pair is a duplicate or not. This report describes our team's solution, which achieves **top 10% (305/3307)** in this competition.

The dataset is released by *Quora*, a well-known platform to gain and share knowledge. Testing data: **2345796** question pairs, no ground truth; predictions need to be evaluated on the *Kaggle* platform.

The evaluation metric in this competition is **log loss** (or cross-entropy loss in binary classification).

**Text cleaning**: Tokenization, convert all tokens to lower case, remove punctuation, special tokens, etc.

**Text normalization**: Restore abbreviations (e.g. "What's" to "What is", "We're" to "We are", etc.) using regular expressions, replace number-like strings with "number", and replace currency symbols with the "USD" abbreviation.

"I am a **Taiwanese** living in **Taiwan**." → "I am a **NATIONALITY** living in **COUNTRY**."
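The text cleaning and text normalization steps described above might look roughly like this with regular expressions. It is a sketch under assumptions: the contraction list is limited to the two examples given, the currency symbol set and the number pattern are illustrative choices, and the order of the steps is not specified in the report.

```python
import re

# Only the two contractions named in the text; a real list would be longer.
CONTRACTIONS = [
    (re.compile(r"\bwhat's\b"), "what is"),
    (re.compile(r"\bwe're\b"), "we are"),
]
CURRENCY = re.compile(r"[$€£¥]")                 # assumed symbol set
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")      # assumed "number-like" pattern
PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(text: str) -> list:
    """Lowercase, normalize, strip punctuation, and tokenize on whitespace."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    text = CURRENCY.sub(" USD ", text)    # currency symbols -> "USD"
    text = NUMBER.sub("number", text)     # number-like strings -> "number"
    text = PUNCTUATION.sub(" ", text)     # drop remaining punctuation
    return text.split()


tokens = normalize("What's the price? $1,000 or €20.5!")
```

Contractions are expanded before punctuation is removed so that the apostrophe is still present when the patterns are matched.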






