Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP

programarivm (55) in #utopian-io • 6 years ago (edited)

Repository

https://github.com/programarivm/unicode-ranges

Related Repositories

https://github.com/programarivm/babylon

Have you ever needed to create a random string from Unicode characters, taken from blocks that you pick at will? I did a few months ago, but couldn't find a library that made this easy.

So I decided to write Unicode Ranges, a PHP library that provides you with Unicode ranges -- blocks, if you like -- in a friendly, object-oriented way.

By the way, if you are not very familiar with Unicode, its ranges include Basic Latin, Cyrillic, Hangul Jamo, and many, many others.

Here is an example that creates a random char encoded in any of these three Unicode ranges: BasicLatin, Tibetan and Cherokee.

use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\BasicLatin;
use UnicodeRanges\Range\Tibetan;
use UnicodeRanges\Range\Cherokee;

$char = Randomizer::char([
    new BasicLatin,
    new Tibetan,
    new Cherokee,
]);

echo $char . PHP_EOL;

Output:

Ꮉ

And this is how to create a random string with Arabic, HangulJamo and Phoenician characters:

use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\Arabic;
use UnicodeRanges\Range\HangulJamo;
use UnicodeRanges\Range\Phoenician;

$letters = Randomizer::letters([
    new Arabic,
    new HangulJamo,
    new Phoenician,
], 20);

echo $letters . PHP_EOL;

Output:

ᄺᆺڽ𐤂ᆉᅔᅱ𐤆𐤄ᅰᇼᄓ𐤊𐤄ᄃ𐤋ᆝᆛەᅎ

This is very useful if you want to create random UTF-8 tokens, for example.
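To make the idea concrete without the library, here is a minimal sketch of picking random characters from code point ranges. The randomToken() function and the ranges passed to it are hypothetical and are not how Unicode Ranges is implemented; the sketch assumes PHP 7.2+ with the mbstring extension.

```php
<?php
// Hypothetical helper, NOT part of Unicode Ranges: builds a token from
// inclusive [first, last] code point ranges. Requires ext-mbstring.
function randomToken(array $ranges, int $length): string
{
    $token = '';
    for ($i = 0; $i < $length; $i++) {
        // Pick one of the ranges, then a code point inside it.
        [$first, $last] = $ranges[random_int(0, count($ranges) - 1)];
        $token .= mb_chr(random_int($first, $last), 'UTF-8');
    }

    return $token;
}

// Cherokee letters, Hiragana letters and Basic Latin lowercase letters,
// chosen so that every code point in each interval is an assigned character.
$token = randomToken([[0x13A0, 0x13F4], [0x3041, 0x3096], [0x61, 0x7A]], 20);

echo $token . PHP_EOL;
echo mb_strlen($token, 'UTF-8') . PHP_EOL; // prints 20
```

random_int() is used instead of rand() because it draws from a cryptographically secure source, which matters if the result is meant to be a token.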

I hope these examples will give you the context to follow my explanation -- for further information please read the Documentation.

New Features

Let's now cut to the chase.

Yesterday I added the following feature to Unicode Ranges so that Babylon can compute the ranges' frequencies -- put another way, the number of times that each Unicode range appears in a text.

The ultimate goal is for the language detector to understand alphabets.

This is how the feature is implemented:

Feature/power ranges #1

On the one hand, PowerRanges provides an array containing all 255 Unicode ranges.

Of course, I didn't manually instantiate the 255 classes, which would have been tedious! The PowerRanges array is built dynamically by reading the files stored in the unicode-ranges/src/Range/ folder.

This is possible with PHP's ReflectionClass.

<?php

namespace UnicodeRanges;

class PowerRanges
{
    const RANGES_FOLDER = __DIR__ . '/Range';

    protected $ranges = [];

    public function __construct()
    {
        // List the class files in the folder, skipping the dot entries.
        $files = array_diff(scandir(self::RANGES_FOLDER), ['.', '..']);
        foreach ($files as $file) {
            // Each file name (minus its extension) is also the class name.
            $filename = pathinfo($file, PATHINFO_FILENAME);
            $classname = "\\UnicodeRanges\\Range\\$filename";
            $rangeClass = new \ReflectionClass($classname);
            $rangeObj = $rangeClass->newInstanceArgs();
            $this->ranges[] = $rangeObj;
        }
    }

    // Returns one instance of every range class found in the folder.
    public function ranges()
    {
        return $this->ranges;
    }
}
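The same reflection technique can be tried in isolation. The sketch below is only an illustration under assumptions: the Demo\Range namespace, the stub classes and the instantiate() helper are all hypothetical, and the file names are passed in as a list rather than read from disk. It also shows two guards one might add when a folder could contain something other than concrete class files.

```php
<?php
namespace Demo\Range; // hypothetical namespace standing in for UnicodeRanges\Range

class BasicLatin {}
class Cherokee {}
abstract class AbstractRange {}

/**
 * Hypothetical helper: instantiates every concrete class named by $filenames.
 */
function instantiate(array $filenames): array
{
    $objects = [];
    foreach ($filenames as $file) {
        // Skip anything that is not a PHP source file (a README, for example).
        if (pathinfo($file, PATHINFO_EXTENSION) !== 'php') {
            continue;
        }
        $classname = __NAMESPACE__ . '\\' . pathinfo($file, PATHINFO_FILENAME);
        if (!class_exists($classname)) {
            continue;
        }
        $class = new \ReflectionClass($classname);
        // Abstract classes cannot be instantiated, so leave them out.
        if ($class->isInstantiable()) {
            $objects[] = $class->newInstance();
        }
    }

    return $objects;
}

$objects = instantiate(['BasicLatin.php', 'Cherokee.php', 'AbstractRange.php', 'notes.txt']);
foreach ($objects as $object) {
    echo get_class($object) . PHP_EOL;
}
// Prints:
// Demo\Range\BasicLatin
// Demo\Range\Cherokee
```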

On the other hand, Converter::unicode2range($char) converts any multibyte char into its object-oriented Unicode range counterpart.

Example:

use UnicodeRanges\Converter;

$char = 'a';
$range = Converter::unicode2range($char);

echo "Total: {$range->count()}".PHP_EOL;
echo "Name: {$range->name()}".PHP_EOL;
echo "Range: {$range->range()[0]}-{$range->range()[1]}".PHP_EOL;
echo 'Characters: ' . PHP_EOL;
print_r($range->chars());

Output:

Total: 96
Name: Basic Latin
Range: 0020-007F
Characters:
Array
(
    [0] =>  
    [1] => !
    [2] => "
    [3] => #
    [4] => $
    [5] => %
    [6] => &
    [7] => '
    ...
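Conceptually, mapping a character to its range boils down to comparing its code point against each range's bounds. Here is a minimal sketch of that lookup; the BLOCKS table and the blockOf() function are hypothetical and much smaller than the real library, and Converter may well work differently internally. The Basic Latin bounds are the ones printed in the output above.

```php
<?php
// Hypothetical lookup table: only three of the many Unicode ranges are listed.
const BLOCKS = [
    'Basic Latin' => [0x0020, 0x007F],
    'Cyrillic'    => [0x0400, 0x04FF],
    'Hiragana'    => [0x3040, 0x309F],
];

// Returns the name of the range that contains $char, or null if none does.
function blockOf(string $char): ?string
{
    $codePoint = mb_ord($char, 'UTF-8');
    foreach (BLOCKS as $name => [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }

    return null;
}

echo blockOf('a') . PHP_EOL;  // Basic Latin
echo blockOf('ж') . PHP_EOL;  // Cyrillic
echo blockOf('の') . PHP_EOL; // Hiragana
var_dump(blockOf('€'));       // NULL, because U+20AC is not in the table
```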

This is how Babylon can now analyze the frequency of the Unicode ranges:

/**
 * @test
 */
public function freq()
{
    $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';
    $expected = [
        'Basic Latin' => 25,
        'Cyrillic' => 14,
        'CJK Unified Ideographs' => 12,
        'Arabic' => 9,
        'Hangul Syllables' => 5,
        'Hiragana' => 3,
    ];

    $this->assertEquals($expected, (new UnicodeRangeStats($text))->freq());
}

As you can see, the test instantiates UnicodeRangeStats, which is the class that runs Converter::unicode2range($char) on every character, as shown below.

<?php

namespace Babylon;

use Babylon;
use UnicodeRanges\Converter;

/**
 * Unicode range stats.
 *
 * @author Jordi Bassagañas
 * @link https://programarivm.com
 * @license MIT
 */
class UnicodeRangeStats
{
    const N_FREQ_UNICODE_RANGES = 10;

    /**
     * Text to be analyzed.
     *
     * @var string
     */
    protected $text;

    /**
     * Unicode ranges frequency -- number of times that the unicode ranges appear in the text.
     *
     * Example:
     *
     *      Array
     *      (
     *         [Basic Latin] => 25
     *         [Cyrillic] => 14
     *         [CJK Unified Ideographs] => 12
     *         [Arabic] => 9
     *         [Hangul Syllables] => 5
     *         [Hiragana] => 3
     *          ...
     *      )
     *
     * @var array
     */
    protected $freq;

    /**
     * Constructor.
     *
     * @param string $text
     */
    public function __construct(string $text)
    {
        $this->text = $text;
    }

    /**
     * The most frequent unicode ranges in the text.
     *
     * @return array
     * @throws \InvalidArgumentException
     */
    public function freq(): array
    {
        // Start from a clean slate so that repeated calls don't double-count.
        $this->freq = [];
        $chars = $this->mbStrSplit($this->text);
        foreach ($chars as $char) {
            $name = Converter::unicode2range($char)->name();
            $this->freq[$name] = ($this->freq[$name] ?? 0) + 1;
        }
        arsort($this->freq);

        return array_slice($this->freq, 0, self::N_FREQ_UNICODE_RANGES);
    }

    /**
     * The most frequent unicode range in the text.
     *
     * @return string The name of the most frequent unicode range.
     * @throws \InvalidArgumentException
     */
    public function mostFreq(): string
    {
        return key(array_slice($this->freq(), 0, 1));
    }

    /**
     * Converts a multibyte string into an array of chars.
     *
     * @return array
     */
    private function mbStrSplit(string $text): array
    {
        // Strip all whitespace so that it is not counted as Basic Latin.
        $text = preg_replace('!\s+!', '', $text);

        return preg_split('/(?<!^)(?!$)/u', $text);
    }
}
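For comparison, a related but different technique is to let PCRE do the classification with its \p{...} script properties. The sketch below is not how Babylon works; scriptFreq() is a hypothetical helper, and it only checks the scripts you pass in. Note that scripts are not the same thing as Unicode ranges: punctuation such as the comma belongs to the Common script, so it would not be counted as Latin here, whereas it counts as Basic Latin in the test above.

```php
<?php
// Hypothetical helper, NOT part of Babylon: counts characters per script
// using PCRE script properties. Only the scripts listed in $scripts are checked.
function scriptFreq(string $text, array $scripts): array
{
    $freq = [];
    foreach ($scripts as $script) {
        // preg_match_all() returns the number of matches, or false on error.
        $count = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($count > 0) {
            $freq[$script] = $count;
        }
    }
    arsort($freq);

    return $freq;
}

$freq = scriptFreq('hola que мале この平安朝', ['Latin', 'Cyrillic', 'Han', 'Hiragana', 'Arabic']);
print_r($freq);
// Latin => 7, Cyrillic => 4, Han => 3, Hiragana => 2
```

The first key of the returned array is then the dominant script, which is one simple way to approach alphabet detection.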

That's all for now!

Today I showed you a few applications of the Unicode Ranges library:

Random phrases (tokens) with UTF-8 chars
Alphabet detection
Frequency analysis of Unicode ranges

Could you think of any more to add to this list?

Any ideas are welcome! Thank you for reading today's post and sharing your views with the community.

GitHub Account

https://github.com/programarivm

#development #php #unicode #utf8

amosbastian (72) 6 years ago

Thanks for the contribution, @programarivm! It's always cool to read about people creating something for a specific need that they couldn't find elsewhere!

Some thoughts about the pull request:

Even though there is little code, you could still add some comments (like function declarations, for example).
Commit messages could be better - this is a good reference.

I look forward to seeing more of your contributions!

programarivm (55) 6 years ago

Thanks for the review @amosbastian.

In regards to commenting the code, I believe it is okay not to write comments as long as the code is simple enough, self-explanatory and the names of variables, methods, constants and so on, are meaningful.

Anyway I am reviewing the code already, thank you.

amosbastian (72) 6 years ago

I agree.
