mirror of
https://github.com/YunoHost-Apps/mediawiki_ynh.git
synced 2024-09-03 19:46:05 +02:00
59 lines
2.4 KiB
Text
59 lines
2.4 KiB
Text
This directory contains some Unicode normalization routines. These routines
|
|
are meant to be reusable in other projects, so I'm not tying them to the
|
|
MediaWiki utility functions.
|
|
|
|
The main function to care about is UtfNormal::toNFC(); this will convert
|
|
a given UTF-8 string to Normalization Form C if it's not already such.
|
|
The function assumes that the input string is already valid UTF-8; if there
|
|
are corrupt characters this may produce erroneous results.
|
|
|
|
To also check for illegal characters, use UtfNormal::cleanUp(). This will
|
|
strip illegal UTF-8 sequences and characters that are illegal in XML, and
|
|
if necessary convert to normalization form C.
|
|
|
|
Performance is kind of stinky in absolute terms, though it should be speedy
|
|
on pure ASCII text. ;) On text that can be determined quickly to already be
|
|
in NFC it's not too awful but it can quickly get uncomfortably slow,
|
|
particularly for Korean text (the hangul decomposition/composition code is
|
|
extra slow).
|
|
|
|
|
|
== Regenerating data tables ==
|
|
|
|
UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
|
|
Character Database by the script UtfNormalGenerate.php. On a *nix system
|
|
'make' should fetch the necessary files and regenerate it if the scripts
|
|
have been changed or you remove it.
|
|
|
|
|
|
== Testing ==
|
|
|
|
'make test' will run the conformance test (UtfNormalTest.php), fetching the
|
|
data from from the net if necessary. If it reports failure, something is
|
|
going wrong!
|
|
|
|
You may have to set up PHPUnit first.
|
|
|
|
$ pear channel-discover pear.phpunit.de
|
|
$ pear install phpunit/PHPUnit
|
|
|
|
== Benchmarks ==
|
|
|
|
Run 'make bench' to download some sample texts from Wikipedia and run some
|
|
cheap benchmarks of some of the functions. Take all numbers with large
|
|
grains of salt.
|
|
|
|
|
|
== PHP module extension ==
|
|
|
|
There's an experimental PHP extension module which wraps the ICU library's
|
|
normalization functions. This is *MUCH* faster than doing this work in pure
|
|
PHP code. This is at https://git.wikimedia.org/summary/mediawiki%2Fextensions%2Fnormal.git.
|
|
It is used by the WMF, which currently runs PHP 5.3.10 on Linux. It hasn't been
|
|
thoroughly tested on other configurations, but may work.
|
|
|
|
If the php_normal.so module is loaded in php.ini, the normalization functions
|
|
will automatically use it. If you can't (or don't want to) load it in php.ini,
|
|
you may be able to load it using the dl() function before the inclusion of
|
|
UtfNormal.php, and it will be picked up.
|
|
|