Unicopedia Sinica

Set of Unicode utilities related to ideographs, wrapped into one single app.

Unicopedia Sinica is a developer-oriented set of Unicode utilities related to ideographs, wrapped into one single app, built with Electron.

This desktop application works on macOS, Linux and Windows operating systems.

It is a specialized complement to the Unicopedia Plus application.

Unicopedia Sinica social preview

Utilities

The following utilities are currently available:

  • CJK Components
    • Look Up IDS
    • Parse IDS
    • Match IDS
    • Find by Components
  • CJK Local Fonts
  • CJK Sources
  • CJK Variations
  • JavaScript Runner

CJK Components

Look Up IDS

CJK Components - Look Up IDS screenshot

CJK Components - Look Up IDS - Show Graphs screenshot

Parse IDS

  • The Parse IDS feature of the CJK Components utility displays the parsing graph of any well-formed IDS (Ideographic Description Sequence), in accordance with the set of extended IDCs (Ideographic Description Characters) defined in the freely available IDS.TXT data file, maintained by Andrew West.
  • The IDS string can be typed directly, or pasted from the clipboard, into the IDS input field.
  • Optionally, a Unihan character can be used as reference in the Entry input field. Any standardized variant or Ideographic Variation Sequence (IVS) is also accepted, and is displayed in the graph with a distinctive dashed outline.
  • It is possible to input predefined sets of Entry and IDS strings selected from the Samples ▾ pop-up menu.
  • As a convenience, the input fields can be emptied using the Clear button.
  • The IDS graph can be displayed either vertically or horizontally. Use the Display Mode drop-down menu to toggle between the two modes.
  • Notes:
    • This feature is primarily designed for IDS validation (well-formed syntax), and can even be used as a kind of "playground" for experimentation, without the need to provide any correct semantics.
    • The IDS parsing is performed recursively, based on the following set of prefix operators:
      Operator  Name                              Arity
      ⿰        IDC Left to Right                 2
      ⿱        IDC Above to Below                2
      ⿲        IDC Left to Middle and Right      3
      ⿳        IDC Above to Middle and Below     3
      ⿴        IDC Full Surround                 2
      ⿵        IDC Surround from Above           2
      ⿶        IDC Surround from Below           2
      ⿷        IDC Surround from Left            2
      ⿸        IDC Surround from Upper Left      2
      ⿹        IDC Surround from Upper Right     2
      ⿺        IDC Surround from Lower Left      2
      ⿻        IDC Overlaid                      2
      〾        Ideographic Variation Indicator   1
      ↔         Horizontal Mirror Operator        1
      ↷         180° Rotation Operator            1
      ⊖         Subtraction Operator              2
    • Components whose code point belongs to the PUA (Private Use Area) block are displayed by using an embedded copy of the custom font BabelStone Han PUA, created by Andrew West.
    • For best display results, most recent versions of the following fonts should be downloaded and installed at the OS level:

CJK Components - Parse IDS (IVS) screenshot

CJK Components - Parse IDS (Unencoded) screenshot
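
The recursive, prefix-operator parsing described in the Parse IDS notes can be illustrated with a minimal sketch. This is not the application's actual code: the function name `parseIDS`, the shape of the returned tree, and the glyphs chosen for the four extended operators are assumptions made for illustration, and variation selectors or unencoded-component notations are not handled.

```javascript
// Arity of each prefix operator, as listed in the Parse IDS notes.
// The glyphs used for the last four (extended) operators are assumptions.
const OPERATOR_ARITY = new Map([
  ['⿰', 2], ['⿱', 2], ['⿲', 3], ['⿳', 3],
  ['⿴', 2], ['⿵', 2], ['⿶', 2], ['⿷', 2],
  ['⿸', 2], ['⿹', 2], ['⿺', 2], ['⿻', 2],
  ['〾', 1], ['↔', 1], ['↷', 1], ['⊖', 2],
]);

// Parse a well-formed IDS into a nested tree; throw on malformed input.
function parseIDS(ids) {
  const symbols = Array.from(ids); // split by code point, not by UTF-16 unit
  let index = 0;

  function parseNode() {
    if (index >= symbols.length) {
      throw new SyntaxError('Unexpected end of IDS');
    }
    const symbol = symbols[index++];
    const arity = OPERATOR_ARITY.get(symbol);
    if (arity === undefined) {
      return { component: symbol }; // leaf: any non-operator symbol
    }
    const operands = [];
    for (let i = 0; i < arity; i++) {
      operands.push(parseNode()); // recurse once per expected operand
    }
    return { operator: symbol, operands };
  }

  const tree = parseNode();
  if (index < symbols.length) {
    throw new SyntaxError('Unexpected symbols after the end of the IDS');
  }
  return tree;
}

// Example: a nested sequence with a supplementary-plane component.
console.log(JSON.stringify(parseIDS('⿰氵⿱𠂉母'), null, 2));
```

Because only the syntax is checked, any symbol that is not an operator is accepted as a leaf, which matches the "playground" use mentioned in the notes: the sequence needs to be well-formed, not semantically meaningful.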

Match IDS

  • The Match IDS feature of the CJK Components utility displays a list of IDS-matching Unihan characters, including through regular expressions. It makes use of the IDS (Ideographic Description Sequences) defined in the IDS.TXT data file, maintained by Andrew West.
  • After entering a query, click on the Search button to display a list of all relevant matches, if any, ordered by code point value.
  • Click on the Nested Match toggle button to extend the search to IDS-nested characters whose IDS match the query string.
  • Click on the Code Points checkbox to display the code point under each matching Unihan character.
  • You can choose how many characters are shown per page.
  • The search is performed on the set of 92,865 Unihan characters (excluding CJK compatibility ideographs) defined in Unicode 14.0.
  • The results may include the searched component itself when it happens to be a proper Unihan character too.
  • Use the Results ▾ pop-up menu to perform an action among:
    • Copy Results [copy the results as string to the clipboard]
    • Save Results... [save the results as string to a text file]
    • Clear Results [clear the current list of results]
  • Various examples of regular expressions are provided for quick copy-and-paste.
  • Notes:

CJK Components - Match IDS screenshot
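
The matching behavior described for Match IDS can be sketched as follows. This is a minimal illustration, not the application's implementation: the `SAMPLE_IDS` map is a tiny hand-written sample standing in for the IDS.TXT data, `matchIDS` is a hypothetical function name, and the nested match is interpreted here as "also include characters whose IDS contains an already matched character", applied repeatedly.

```javascript
// Tiny illustrative sample; the real data would be loaded from IDS.TXT.
const SAMPLE_IDS = new Map([
  ['海', '⿰氵每'],
  ['每', '⿱𠂉母'],
  ['梅', '⿰木每'],
  ['林', '⿰木木'],
]);

// Return the characters whose IDS matches the query, ordered by code point.
function matchIDS(query, data, nested = false) {
  const regex = new RegExp(query, 'u'); // 'u' flag: match by code point
  const results = new Set();
  for (const [character, ids] of data) {
    if (regex.test(ids)) results.add(character);
  }
  if (nested) {
    // Keep adding characters whose IDS contains an already matched character.
    let changed = true;
    while (changed) {
      changed = false;
      for (const [character, ids] of data) {
        if (results.has(character)) continue;
        if (Array.from(ids).some((symbol) => results.has(symbol))) {
          results.add(character);
          changed = true;
        }
      }
    }
  }
  return [...results].sort((a, b) => a.codePointAt(0) - b.codePointAt(0));
}

console.log(matchIDS('^⿰木', SAMPLE_IDS));
console.log(matchIDS('母', SAMPLE_IDS, true));
```

With the nested option, the query `母` reaches 梅 and 海 indirectly, through 每, which is the kind of extension the Nested Match toggle is described as providing.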

Find by Components

** Under Construction **

CJK Components - Find by Components screenshot

CJK Local Fonts

  • The CJK Local Fonts utility displays all the local font glyphs of a given Unihan character.
  • Any Unihan character can be entered in the Unihan input field either as a character or a code point. Click on the Look Up button to display all the glyphs.
  • Standardized variants and Ideographic Variation Sequences (IVS) are also accepted in input, either directly, e.g., 劍󠄁, or as a combination of two code points: Unihan base character + variation selector (VS1 to VS256), e.g., U+6F22 U+FE00, or U+9F8D U+E0101; the specific format <9F8D,E0107> is also allowed.
  • Previously looked up characters are kept in a history stack; use the Alt+↑ and Alt+↓ keyboard shortcuts to navigate through them up and down inside the input field. Alternatively, you can also use the Lookup History ▾ pop-up menu to automatically look up a specific character.
  • Click on the Compact Layout checkbox to display the local font glyphs in a more compact way: hovering over each glyph frame brings up a tooltip with the local font name.
  • Use the Font Name Filter input field to restrict in real time the display of local font glyphs to the font names matching the text string.
  • Notes:
    • A dashed outline is added to a character frame whenever the glyph of a Unihan character coming with a variation selector is visually different from the glyph of its base character alone; in such a case, alt-clicking (or shift-clicking) inside the character frame momentarily displays the base character glyph; this is especially useful to spot subtle differences between glyph variations.
    • For best coverage of Unicode Variation Sequences, some of the following fonts should be downloaded and installed at the OS level:

CJK Local Fonts screenshot

CJK Local Fonts - Filter screenshot

CJK Local Fonts - Variation Selector screenshot
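
The accepted input formats for variation sequences (a pair of `U+` code points, or the `<base,selector>` form) could be normalized to an actual character sequence along these lines. This is only a sketch under stated assumptions: `parseVariationInput` and `isVariationSelector` are hypothetical names, and the real utility may accept more formats or validate differently.

```javascript
// VS1 to VS16 are U+FE00..U+FE0F; VS17 to VS256 are U+E0100..U+E01EF.
function isVariationSelector(codePoint) {
  return (codePoint >= 0xFE00 && codePoint <= 0xFE0F) ||
         (codePoint >= 0xE0100 && codePoint <= 0xE01EF);
}

// Convert "U+6F22 U+FE00" or "<9F8D,E0107>" to base character + selector.
function parseVariationInput(input) {
  const text = input.trim();
  const match =
    text.match(/^U\+([0-9A-F]{4,6})\s+U\+([0-9A-F]{4,6})$/i) ||
    text.match(/^<([0-9A-F]{4,6}),\s*([0-9A-F]{4,6})>$/i);
  if (!match) {
    return text; // assume the sequence was entered directly as characters
  }
  const base = parseInt(match[1], 16);
  const selector = parseInt(match[2], 16);
  if (!isVariationSelector(selector)) {
    throw new RangeError('Second code point is not a variation selector');
  }
  return String.fromCodePoint(base, selector);
}

// Both formats yield a two-code-point string.
console.log(Array.from(parseVariationInput('U+6F22 U+FE00')).length);
console.log(Array.from(parseVariationInput('<9F8D,E0107>')).length);
```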

CJK Sources

  • The CJK Sources utility displays in a grid fashion the various sources of a given subset of CJK (Chinese/Japanese/Korean) characters, as referenced in their respective Unicode 14.0 code charts:
  • This is especially useful for comparison purposes between relatable character glyphs.
  • These CJK characters belong to the full set of 93,867 Unihan characters defined in Unicode 14.0.
  • For best display results, most recent versions of the following fonts should be downloaded and installed at the OS level:
  • CJK characters can be entered either directly in the "Characters" input field, or using a series of code points in hexadecimal format in the "Code points" input field.
  • It is also possible to input predefined strings of CJK characters selected from the Samples ▾ pop-up menu.
  • As a convenience, the input fields can be emptied using the Clear button. In output, the standard Unicode code point format U+9999 is used, i.e. "U+" directly followed by 4 or 5 hex digits.
  • In input, more hexadecimal formats are allowed, including Unicode escape sequences, such as \u6E7E or \u{21FE7}. Moving out of the field or typing the Enter key converts all valid codes to standard Unicode code point format.
  • Code point and alphanumeric source references of CJK compatibility characters are systematically displayed in italics.
  • Whereas the original code charts use mutually incompatible, block-specific source orderings, this utility always displays the relevant sources in the same order, discarding any empty column for the sake of clarity:
    Prefix  Source       Unihan Property
    G       China        kIRG_GSource
    H       Hong Kong    kIRG_HSource
    M       Macao        kIRG_MSource
    T       Taiwan       kIRG_TSource
    J       Japan        kIRG_JSource
    K       South Korea  kIRG_KSource
    KP      North Korea  kIRG_KPSource
    V       Vietnam      kIRG_VSource
    UTC     UTC          kIRG_USource
    SAT     SAT          kIRG_SSource
    UK      U.K.         kIRG_UKSource
  • UTC stands for Unicode Technical Committee, which is responsible for the development and maintenance of the Unicode Standard.
  • SAT (SAmganikikrtam Taisotripitakam in Sanskrit) represents a machine-readable text database of the Taishō Tripiṭaka.
  • A table of glyph statistics is available for quick reference.
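
The conversion of Unicode escape sequences to the standard code point format, as described for the "Code points" input field, might look like the following sketch. The function names are hypothetical, and only the two escape forms quoted above (`\u6E7E` and `\u{21FE7}`) are handled; the actual field is said to accept more hexadecimal formats.

```javascript
// Format a code point as "U+" followed by at least 4 uppercase hex digits.
function toCodePointFormat(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Replace \uXXXX and \u{X...} escapes with the standard U+XXXX notation.
function normalizeCodePoints(input) {
  const escapePattern = /\\u\{([0-9A-Fa-f]{1,6})\}|\\u([0-9A-Fa-f]{4})/g;
  return input.replace(escapePattern, (whole, braced, plain) => {
    const value = parseInt(braced ?? plain, 16);
    // Leave out-of-range values untouched rather than guessing.
    return value <= 0x10FFFF ? toCodePointFormat(value) : whole;
  });
}

console.log(normalizeCodePoints('\\u6E7E \\u{21FE7}'));
```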

CJK Sources screenshot
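
The fixed source ordering and the removal of empty columns described above can be sketched as follows. The property names come from the table of sources; everything else (the `buildSourceGrid` function, the entry shape, and the placeholder reference strings such as `G-ref`) is hypothetical and only meant to illustrate the idea.

```javascript
// Unihan source properties, in the fixed display order used by the utility.
const SOURCE_ORDER = [
  'kIRG_GSource', 'kIRG_HSource', 'kIRG_MSource', 'kIRG_TSource',
  'kIRG_JSource', 'kIRG_KSource', 'kIRG_KPSource', 'kIRG_VSource',
  'kIRG_USource', 'kIRG_SSource', 'kIRG_UKSource',
];

// Build a grid: keep only the columns used by at least one character,
// always in SOURCE_ORDER, and fill missing cells with an empty string.
function buildSourceGrid(entries) {
  const columns = SOURCE_ORDER.filter((property) =>
    entries.some((entry) => property in entry.sources));
  const rows = entries.map((entry) => ({
    character: entry.character,
    cells: columns.map((property) => entry.sources[property] ?? ''),
  }));
  return { columns, rows };
}

// Placeholder references, not real source identifiers.
const grid = buildSourceGrid([
  { character: '龍', sources: { kIRG_TSource: 'T-ref', kIRG_GSource: 'G-ref' } },
  { character: '竜', sources: { kIRG_JSource: 'J-ref' } },
]);
console.log(grid.columns);
console.log(grid.rows);
```

Filtering the fixed list, rather than sorting whatever properties happen to be present, keeps the column order identical from one set of characters to the next.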

CJK Variations

CJK Variations screenshot

CJK Variations (Unregistered) screenshot

JavaScript Runner

  • The JavaScript Runner utility lets you execute JavaScript code, and comes with several sample scripts related to CJK, IDS, and IVD; it is useful for quick testing/prototyping or data processing.

JavaScript Runner screenshot

Using

You can download the latest release for macOS.

Building

You'll need Node.js (which comes with npm) installed on your computer in order to build this application.

Clone method

# Clone the repository
git clone https://github.com/tonton-pixel/unicopedia-sinica
# Go into the repository
cd unicopedia-sinica
# Install dependencies
npm install
# Run the application
npm start

Note: to use the clone method, the core tool git must also be installed.

Download method

If you don't wish to clone, you can download the source code, unzip it, then directly run the following commands from a Terminal opened at the resulting unicopedia-sinica-master folder location:

# Install dependencies
npm install
# Run the application
npm start

Packaging

Several scripts are also defined in the package.json file to build OS-specific bundles of the application, using the simple yet powerful Electron Packager Node module.
For instance, running the following command (once the dependencies are installed) will create a Unicopedia Sinica.app version for macOS:

# Build macOS (Darwin) application
npm run build-darwin

License

The MIT License (MIT).

Copyright © 2021-2022 Michel Mariani.
