Japanese text encoding standard for MSX?

Page 3/3

By Takamichi

Hero (599)

13-03-2018, 15:54

MSX Kana Filter is an application that converts MSX 1-byte hiragana to 2-byte ones.
http://tatsu.life.coocan.jp/MySoft/index.html#MSX
I can translate the enclosed manual but probably it's unnecessary because most people subscribing here seem to know more Japanese than me :)

And yes, MSX Kanji BASIC cannot handle 1-byte hiragana, so CALL AKCNV, the command that converts 1-byte characters to 2-byte ones, doesn't work with them.
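
As a rough sketch of what such a 1-byte to 2-byte conversion involves (this is not how MSX Kana Filter or CALL AKCNV actually work), a lookup table can map each 1-byte code to its Shift JIS equivalent. The function name and the table `ASSUMED_MSX_HIRAGANA` are hypothetical, and its three code values are placeholders for illustration that should be checked against a real MSX character chart. The sketch also assumes the input contains only 1-byte characters.

```python
# Hypothetical, partial table: MSX 1-byte code -> hiragana.
# These code values are ASSUMPTIONS for illustration, not a verified chart.
ASSUMED_MSX_HIRAGANA = {
    0x91: "あ",
    0x92: "い",
    0x93: "う",
}


def msx_kana_to_shift_jis(data: bytes, table: dict) -> bytes:
    """Replace each 1-byte code found in `table` with its 2-byte Shift JIS form.

    Assumes `data` holds only 1-byte characters; bytes not in the table
    are copied through unchanged.
    """
    out = bytearray()
    for b in data:
        if b in table:
            out += table[b].encode("shift_jis")
        else:
            out.append(b)
    return bytes(out)


# "A" followed by two of the assumed 1-byte hiragana codes.
converted = msx_kana_to_shift_jis(bytes([0x41, 0x91, 0x92]), ASSUMED_MSX_HIRAGANA)
print(converted.hex(" "))
```

Each converted hiragana grows from one byte to two, so the output is longer than the input wherever a table entry matched.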

About the character table in Izumic Ballade, I can guess what most of these characters are. Would anyone like me to write them down? And as is obvious, there are nonstandard characters such as a finger (column 12, row 1) and "Bs" and "Cl" written as single characters (columns 0 and 1, row 3).

By sd_snatcher

Prophet (3642)

14-03-2018, 19:09

Thank you for your tips, Takamichi! And welcome back to the forums. It has been some years since we last saw you around here.

Do you know if the MSX 1-byte hiraganas followed some formally defined standard, or was it custom tailored for this architecture?

By eimaster

Champion (282)

15-03-2018, 01:30

The following site provides documents and tools used for translating game ROMs. It has lots of books and documents about Japanese character sets, which might be informative and of use to you.
http://www.romhacking.net/?page=documents&startpage=1

By wyrdwad

Paladin (934)

15-03-2018, 07:13

Yes, thank you for that very detailed explanation! Regarding Izumic Ballade, it turns out the "encoding" being used is just straight ASCII, with a custom character set loaded into memory via an image file. I've been able to account for this in the MSX-BASIC source code via a substitution table, so that particular hurdle has been passed and the game's translation is very much underway:

https://www.youtube.com/watch?v=iSnNQ3raGq8

(Note that this video uses a completely different font from a completely different image file, but like the small version, it's still more or less just straight ASCII, and I was able to work with it via a substitution table and a little bit of graphic editing support from MP83.)
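
For readers wondering what a substitution table of this kind looks like, here is a minimal sketch. It is not the actual table used for Izumic Ballade: the mapping below is entirely made up, and assumes a custom font in which the glyph for a wanted character happens to sit in the slot of a different ASCII character.

```python
# Hypothetical substitution table: the character the translator wants to see
# on screen -> the ASCII character whose font slot holds that glyph.
# The pairs below are invented for illustration only.
SUBSTITUTIONS = {
    "a": "%",
    "b": "&",
    "c": "#",
}

# str.maketrans builds a per-character translation map from the dict.
TABLE = str.maketrans(SUBSTITUTIONS)


def to_game_text(text: str) -> str:
    """Rewrite translated text so the game's custom font shows the right glyphs."""
    return text.translate(TABLE)


game_text = to_game_text("cab!")
print(game_text)
```

Characters without an entry pass through unchanged, so the table only needs to list the slots where the custom font differs from plain ASCII.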

It's interesting to read such a detailed account of Japanese encoding standards -- I knew they were a bit of a mess, but I didn't realize just how much of one! I work in video game localization at XSEED Games, and I'd say about 75% of the modern Japanese games we work on have their text encoded via Shift-JIS, which I greatly appreciate since Shift-JIS is very easy to parse using simple programming -- any routines you write that expect standard CSV formatting with quotation marks around text fields and commas as field delimiters will work perfectly on Shift-JIS text more or less 100% of the time.
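
One way to check the CSV point is to list every second byte that can occur in a 2-byte Shift-JIS character and see whether the quotation mark, comma or line-break bytes are among them. The sketch below does that with Python's built-in `shift_jis` codec. The lead-byte ranges in `LEAD_BYTES` are an assumption (the commonly quoted 81H-9FH and E0H-FCH), and the codec only covers the characters it knows about, so treat this as a sanity check rather than a proof.

```python
# Bytes that a simple CSV reader treats as special: '"', ',', LF, CR.
DELIMITERS = {0x22, 0x2C, 0x0A, 0x0D}

# Lead-byte ranges assumed for 2-byte Shift-JIS characters.
LEAD_BYTES = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))


def shift_jis_trail_bytes():
    """Collect every second byte that Python's shift_jis codec accepts."""
    trails = set()
    for lead in LEAD_BYTES:
        for trail in range(0x100):
            try:
                decoded = bytes([lead, trail]).decode("shift_jis")
            except UnicodeDecodeError:
                continue  # not a valid 2-byte character
            if len(decoded) == 1:
                trails.add(trail)
    return trails


trails = shift_jis_trail_bytes()
collisions = trails & DELIMITERS
print("delimiter bytes usable as second bytes:", sorted(collisions))
print("backslash (5CH) usable as second byte:", 0x5C in trails)
```

The delimiter set should come out empty, which fits the experience described above. The backslash byte is a different story, since it can appear as the second byte of some characters, so routines that treat backslash as an escape character may need more care.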

When we have games that use UTF-8 encoding, however (which basically accounts for the remaining 25% of instances), simple text parsing flies out the window, since you can't really trust bytes read as quotation marks or line breaks anymore.

Fortunately, the need to parse text manually is somewhat rare, as the Japanese developers typically do that for us -- but on those few occasions when they don't, it's proven to be a headache and a half for sure.

That's why personally, I'm a big fan of Shift-JIS. It's nowhere near as dynamic or flexible as UTF-8, but when you need to do simple text parsing, it's the only way to go!

-Tom

By Grauw

Ascended (10699)

15-03-2018, 09:10

XSEED, that’s a neat place to work :)

By Takamichi

Hero (599)

15-03-2018, 13:33

Yes, I have been away for a long time, though I haven't forgotten MSX.

Several Japanese microcomputers supported 1-byte hiragana as part of their ASCII-based character codes. As far as I know, Basic Master Level 3, PC6001 and PC8801 could display them, but the codes were not standardized. One easy way to tell whether a microcomputer could display hiragana is to look at the keyboard; if hiragana are printed on the keys, it can display them :)
The hiragana codes always occupied the otherwise unused ranges, 80H-9FH and E0H-FFH, as language-specific characters do over here. They conflict with the Shift JIS system, which always treats these codes as part of a full-width (2-byte) character and never as an independent half-width (1-byte) character.
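
To make the conflict concrete, the following sketch counts how many codes in the two 1-byte hiragana ranges named above would be read by a Shift JIS decoder as the first byte of a 2-byte character. The lead-byte test (81H-9FH and E0H-FCH) is an assumption based on the commonly quoted Shift JIS ranges including vendor extensions, and the names are made up for this example.

```python
# 1-byte hiragana ranges as given in the post (inclusive).
HIRAGANA_RANGES = [(0x80, 0x9F), (0xE0, 0xFF)]


def is_shift_jis_lead(b: int) -> bool:
    """True if a Shift JIS decoder would treat `b` as the first of two bytes.

    The ranges are an assumption (common Shift JIS plus vendor extensions).
    """
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


all_codes = [b for lo, hi in HIRAGANA_RANGES for b in range(lo, hi + 1)]
conflicting = [b for b in all_codes if is_shift_jis_lead(b)]

print(len(conflicting), "of", len(all_codes), "codes are Shift JIS lead bytes")
```

Under these assumptions almost every code in those ranges is claimed by Shift JIS, which is why the two schemes cannot be mixed in one byte stream without extra information.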

By Takamichi

Hero (599)

15-03-2018, 13:44

On the other hand, when it came to 2-byte characters, 80s microcomputers universally used the same Shift JIS codes. http://www.kiwi-us.com/~ohta/pc88/kanji/other.htm Modern Shift JIS isn't standardized, but it seemingly was back then, at least as far as these small samples can tell.

By sd_snatcher

Prophet (3642)

17-03-2018, 18:35

I noticed that the Wikipedia article about the MSX character set doesn't have the Japanese character set. Only the international and Brazilian ones. Any knowledgeable candidates to fix that article? :)

By saadatrent1988

Supporter (2)

18-03-2019, 09:29

panel123 wrote:

There are several standard methods to encode Japanese characters for use on a computer, including JIS, Shift-JIS, EUC, and Unicode. While mapping the set of kana is a simple matter, kanji has proven more difficult. Despite efforts, none of the encoding schemes have become the de facto standard, and multiple encoding standards are still in use today.

For example, most Japanese emails are in ISO-2022-JP ("JIS encoding") and web pages in Shift-JIS and yet mobile phones in Japan usually use some form of Extended Unix Code. If a program fails to determine the encoding scheme employed, it can cause mojibake (文字化け, "misconverted garbled/garbage characters", literally "transformed characters") and thus unreadable text on computers.
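
A small illustration of mojibake: the sketch below encodes a Japanese string as Shift-JIS and then decodes the same bytes as Latin-1, standing in for a program that guessed the encoding wrong. Latin-1 is chosen here only because it accepts every byte value; the variable names are made up for this example.

```python
text = "文字化け"

# Bytes as a Shift-JIS aware program would write them.
raw = text.encode("shift_jis")

# A program that wrongly assumes Latin-1 sees one character per byte.
garbled = raw.decode("latin-1")

# Nothing is lost as long as the bytes are kept: decoding them again
# with the right encoding recovers the original text.
restored = garbled.encode("latin-1").decode("shift_jis")

print(repr(garbled))
print(restored)
```

The text is only unreadable, not destroyed, which is why re-opening a file with the correct encoding usually fixes the problem.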

The first encoding to become widely used was JIS X 0201, which is a single-byte encoding that only covers standard 7-bit ASCII characters with half-width katakana extensions. This was widely used in systems that were neither powerful enough nor had the storage to handle kanji (including old embedded equipment such as cash registers). This means that only katakana, not kanji, was supported using this technique. Some embedded displays still have this limitation.

The development of kanji encodings was the beginning of the split. Shift JIS supports kanji and was developed to be completely backward compatible with JIS X 0201, and thus is in much embedded electronic equipment.

However, Shift JIS has the unfortunate property that it often breaks any parser (software that reads the coded text) that is not specifically designed to handle it. For example, a text search method can get false hits if it is not designed for Shift JIS. EUC, on the other hand, is handled much better by parsers that have been written for 7-bit ASCII (and thus EUC encodings are used on UNIX, where much of the file-handling code was historically only written for English encodings). But EUC is not backwards compatible with JIS X 0201, the first main Japanese encoding. Further complications arise because the original Internet e-mail standards only support 7-bit transfer protocols. Thus RFC 1468 ("ISO-2022-JP", often simply called JIS encoding) was developed for sending and receiving e-mails.
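
The false-hit problem can be shown with a single character. In Shift JIS the kanji 表 is encoded as the bytes 95H 5CH, and 5CH is also the ASCII backslash, so a byte-level search for a backslash matches inside the kanji. The sketch below compares that naive search with one that skips over 2-byte characters. The lead-byte ranges are an assumption (81H-9FH and E0H-FCH), and the function names are made up for this example.

```python
def is_lead(b: int) -> bool:
    """Assumed Shift JIS lead-byte ranges (first byte of a 2-byte character)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_ascii_aware(data: bytes, needle: int) -> int:
    """Find a 1-byte character, never matching the second byte of a 2-byte one."""
    i = 0
    while i < len(data):
        if is_lead(data[i]):
            i += 2  # skip the whole 2-byte character
            continue
        if data[i] == needle:
            return i
        i += 1
    return -1


# The kanji followed by a real backslash and some ASCII text.
data = "表".encode("shift_jis") + b"\\path"

naive_hit = data.find(b"\\")                   # matches inside the kanji
aware_hit = find_ascii_aware(data, ord("\\"))  # matches the real backslash

# In EUC-JP every byte of the same kanji has the high bit set,
# so an ASCII search cannot match inside it.
euc_safe = all(b >= 0x80 for b in "表".encode("euc_jp"))

print(naive_hit, aware_hit, euc_safe)
```

The naive search reports a position inside the kanji, while the boundary-aware search finds the actual backslash that follows it.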

In character set standards such as JIS, not all required characters are included, so gaiji (外字 "external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs, where normal characters have been replaced with new characters, or the new characters have been added to unused character positions. However, gaiji are not practical in Internet environments since the font set must be transferred with text to use the gaiji. As a result, such characters are written with similar or simpler characters in place, or the text may need to be encoded using a larger character set (such as Unicode) that supports the required character.

Unicode was intended to solve all encoding problems over all languages. The UTF-8 encoding used to encode Unicode in web pages does not have the disadvantages that Shift-JIS has. Unicode is supported by international software, and it eliminates the need for gaiji. There are still controversies, however. For Japanese, the kanji characters have been unified with Chinese; that is, a character considered to be the same in both Japanese and Chinese is given a single number, even if the appearance is actually somewhat different, with the precise appearance left to the use of a locale-appropriate font. This process, called Han unification, has caused controversy. The previous encodings in Japan, Taiwan Area, Mainland China and Korea have each handled only one language, whereas Unicode should handle all of them. The handling of Kanji/Chinese has, however, been designed by a committee composed of representatives from all four countries/areas. Unicode is slowly growing because it is better supported by software from outside Japan, but still (as of 2011) most web pages in Japanese use Shift-JIS. The Japanese Wikipedia uses Unicode.

thankyouuuuuuuuuuuu
