User dictionary for Asian Language words
Answer ID 10441   |   Last Review Date 11/06/2018

Can I add a new Asian Language word into the user dictionary?

Environment:

Asian language dictionaries
Analytics, Oracle B2C Service, all versions

Resolution:

The Asian language dictionaries directory contains additional dictionary word lists that are used for part-of-speech (POS) tagging and similar phrase searching.

By adding words to the user dictionary files with special parameters, you can override the default segmentation.

User dictionary files are available only for Chinese and Japanese languages, specifically Mandarin (user.dict_CN.utf8), Cantonese (user.dict_HK.utf8), Taiwanese (user.dict_TW.utf8), and Japanese (user.dict_JP.utf8).

You can create user dictionaries for words specific to an industry or application by adding new words, personal names, and transliterated characters of other alphabets. In addition, you can specify how existing words are segmented. For example, you may want to prevent a product name from being segmented even if it is a compound. The system performs a lookup of more than 500,000 words to determine segmentation. Using the dictionary, alias list, and keywords, you can influence how words are segmented.

If you edit a user dictionary file, you must use a specific format: the word you want to add, followed by the user dictionary part-of-speech tag (listed below) and an optional decomposition pattern (DecompPattern). The DecompPattern is a comma-delimited list of numbers, each specifying how many characters of the word to include in the corresponding component. Use a zero (0) to indicate that a DecompPattern is not needed.

For example, the user dictionary entry AABBCC ORGANIZATION 2,2,2 indicates that AABBCC should be decomposed into three two-character components (AA, BB, and CC).
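
To check entries before submitting them, the decomposition rule can be sketched in Python. This is only an illustration, not part of the product: the function names decompose and parse_entry are hypothetical, the fields are assumed to be separated by whitespace as in the example entry, and the sketch assumes the numbers in a DecompPattern add up to the length of the word.

```python
def decompose(word: str, pattern: str) -> list[str]:
    """Split `word` into components as described by a DecompPattern string."""
    if pattern.strip() == "0":
        # A zero means no decomposition is needed; keep the word whole.
        return [word]

    sizes = [int(n) for n in pattern.split(",")]
    if any(size <= 0 for size in sizes):
        raise ValueError(f"component sizes must be positive: {pattern!r}")

    # Assumption (not stated in the article): the sizes cover the whole word.
    if sum(sizes) != len(word):
        raise ValueError(
            f"pattern {pattern!r} covers {sum(sizes)} characters, "
            f"but {word!r} has {len(word)}"
        )

    components = []
    start = 0
    for size in sizes:
        components.append(word[start:start + size])
        start += size
    return components


def parse_entry(line: str) -> tuple[str, str, list[str]]:
    """Parse 'WORD TAG [DecompPattern]' into (word, tag, components)."""
    # Assumption: fields are whitespace-separated, as in the example entry.
    fields = line.split()
    if len(fields) == 2:
        word, tag = fields
        pattern = "0"  # the DecompPattern is optional
    elif len(fields) == 3:
        word, tag, pattern = fields
    else:
        raise ValueError(f"expected 2 or 3 fields, got {len(fields)}: {line!r}")
    # POS tags are case insensitive, so normalize to upper case.
    return word, tag.upper(), decompose(word, pattern)


print(parse_entry("AABBCC ORGANIZATION 2,2,2"))
# ('AABBCC', 'ORGANIZATION', ['AA', 'BB', 'CC'])
```

Python's len counts Unicode code points, so a typical Chinese or Japanese character counts as one character in this sketch.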

User Dictionary POS Tags for Mandarin, Cantonese, and Taiwanese (case insensitive)

  • NOUN
  • PROPER_NOUN
  • PLACE
  • PERSON
  • ORGANIZATION
  • FOREIGN_PERSON


User Dictionary POS Tags for Japanese (case insensitive)

  • NOUN
  • PROPER_NOUN
  • PLACE
  • PERSON
  • ORGANIZATION
  • GIVEN_NAME
  • SURNAME
  • FOREIGN_PLACE_NAME
  • FOREIGN_GIVEN_NAME
  • FOREIGN_SURNAME
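
Because the permitted tags differ between the Chinese and Japanese dictionaries, it can help to check a tag against the right list before editing a file. The following is a minimal sketch built from the two tag lists and the four file names in this answer; the helper name is_valid_tag is hypothetical and not part of any Oracle tool.

```python
# Tags accepted in the Mandarin, Cantonese, and Taiwanese user dictionaries.
CHINESE_TAGS = frozenset({
    "NOUN", "PROPER_NOUN", "PLACE", "PERSON", "ORGANIZATION", "FOREIGN_PERSON",
})

# Tags accepted in the Japanese user dictionary.
JAPANESE_TAGS = frozenset({
    "NOUN", "PROPER_NOUN", "PLACE", "PERSON", "ORGANIZATION",
    "GIVEN_NAME", "SURNAME",
    "FOREIGN_PLACE_NAME", "FOREIGN_GIVEN_NAME", "FOREIGN_SURNAME",
})

# Map each user dictionary file to the tag list that applies to it.
TAGS_BY_FILE = {
    "user.dict_CN.utf8": CHINESE_TAGS,   # Mandarin
    "user.dict_HK.utf8": CHINESE_TAGS,   # Cantonese
    "user.dict_TW.utf8": CHINESE_TAGS,   # Taiwanese
    "user.dict_JP.utf8": JAPANESE_TAGS,  # Japanese
}


def is_valid_tag(filename: str, tag: str) -> bool:
    """Return True if `tag` is permitted in the given user dictionary file."""
    if filename not in TAGS_BY_FILE:
        raise ValueError(f"unknown user dictionary file: {filename!r}")
    # Tags are case insensitive, so compare in upper case.
    return tag.upper() in TAGS_BY_FILE[filename]


print(is_valid_tag("user.dict_JP.utf8", "surname"))  # True
print(is_valid_tag("user.dict_CN.utf8", "SURNAME"))  # False
```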


Changes to the word list files do not become active until Oracle Cloud Operations runs the Keywordindexer utility. To schedule this, submit an incident on our support site.

For additional information, refer to the 'Asian Language Dictionaries' section in online documentation for the version your site is currently running. To access Oracle B2C Service manuals and documentation online, refer to the Documentation for Oracle B2C Service Products.