Lists: | pgsql-hackers pgsql-translators |
---|
From: | kerbrose khaled <kerbrose(at)hotmail(dot)com> |
---|---|
To: | "pgsql-translators(at)lists(dot)postgresql(dot)org" <pgsql-translators(at)lists(dot)postgresql(dot)org> |
Subject: | updating unaccent.rules for Arabic letters |
Date: | 2019-11-03 06:02:19 |
Message-ID: | VI1PR06MB5760C7B160B3613739817C2ADA7C0@VI1PR06MB5760.eurprd06.prod.outlook.com |
Hello Folks
I would like to update the unaccent.rules file to support Arabic letters. Could someone help me, or tell me how I could submit such a contribution? I have attached the file with the modifications, which are only the last 4 lines.
thank you
Attachment | Content-Type | Size |
---|---|---|
unaccent.rules | application/octet-stream | 10.2 KB |
From: | kerbrose khaled <kerbrose(at)hotmail(dot)com> |
---|---|
To: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | updating unaccent.rules for Arabic letters |
Date: | 2019-11-03 06:05:25 |
Message-ID: | VI1PR06MB5760293A34B4E091DE682021DA7C0@VI1PR06MB5760.eurprd06.prod.outlook.com |
Hello Folks
I would like to update the unaccent.rules file to support Arabic letters. Could someone help me, or tell me how I could submit such a contribution? I have attached the file with the modifications, which are only the last 4 lines.
thank you
Attachment | Content-Type | Size |
---|---|---|
unaccent.rules | application/octet-stream | 10.2 KB |
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | kerbrose khaled <kerbrose(at)hotmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: updating unaccent.rules for Arabic letters |
Date: | 2019-11-03 16:12:15 |
Message-ID: | 5527.1572797535@sss.pgh.pa.us |
kerbrose khaled <kerbrose(at)hotmail(dot)com> writes:
> I would like to update unaccent.rules file to support Arabic letters. so could someone help me or tell me how could I add such contribution. I attached the file including the modifications, only the last 4 lines.
Hi! I've got no objection to including Arabic in the set of covered
languages, but handing us a new unaccent.rules file isn't the way to
do it, because that's a generated file. The adjacent script
generate_unaccent_rules.py generates it from the official Unicode
source data (see comments in that script). What we need, ultimately,
is a patch to that script so it will emit these additional translations.
Past commits that might be useful sources of inspiration include
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=456e3718e7b72efe4d2639437fcbca2e4ad83099
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=5e8d670c313531c0dca245943fb84c94a477ddc4
https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=ec0a69e49bf41a37b5c2d6f6be66d8abae00ee05
If you're not good with Python, maybe you could just explain to us
how to recognize these characters from Unicode character properties.
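As a starting point for describing such characters by their Unicode properties, here is a minimal sketch using Python's standard unicodedata module. It is not taken from generate_unaccent_rules.py; the helper name describe and the three sample characters are only illustrative assumptions.

```python
import unicodedata


def describe(ch):
    """Collect the Unicode properties a rule generator could key on.

    Hypothetical helper for illustration; not part of the PostgreSQL tree.
    """
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        # General category: "Lo" is a letter, "Mn" is a non-spacing mark.
        "category": unicodedata.category(ch),
        # Canonical decomposition as space-separated hex codepoints, or "".
        "decomposition": unicodedata.decomposition(ch),
        # Canonical combining class; 0 for base characters.
        "combining": unicodedata.combining(ch),
    }


# A plain letter, a letter with a canonical decomposition, and a diacritic.
for sample in ("\u0627", "\u0623", "\u064E"):
    print(describe(sample))
```

A letter that carries a mark shows up with a non-empty decomposition whose first element is the base letter, while a standalone diacritic shows up with general category Mn. Either property could serve as the recognition rule asked for here.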
regards, tom lane
From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "kerbrose khaled" <kerbrose(at)hotmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: updating unaccent.rules for Arabic letters |
Date: | 2019-11-04 17:41:59 |
Message-ID: | c2dfc689-4710-4a73-ad69-12807f36a289@manitou-mail.org |
kerbrose khaled wrote:
> I would like to update unaccent.rules file to support Arabic letters. so
> could someone help me or tell me how could I add such contribution. I
> attached the file including the modifications, only the last 4 lines.
The Arabic letters are found in the Unicode block U+0600 to U+06FF
(https://www.fileformat.info/info/unicode/block/arabic/list.htm)
Until now the unaccent module has had no coverage of this block.
Since Arabic uses several diacritics [1], it would be best to
figure out all the transliterations that should go in and add them in
one go (plus coding that in the Python script).
The canonical way to unaccent is normally to apply a Unicode
transformation: NFC -> NFD and remove the non-spacing marks.
I've tentatively done that for each codepoint in the 0600-06FF block
in SQL, using icu_transform from icu_ext [2]. It produces the
attached result, with 60 (!) entries, along with Unicode names for
readability.
Does that make sense to people who know Arabic?
For the record, here's the query:
WITH block(cp) AS (
  SELECT * FROM generate_series(x'600'::int, x'6ff'::int) AS cp
),
dest AS (
  SELECT cp,
         icu_transform(chr(cp),
           'any-NFD;[:nonspacing mark:] any-remove; any-NFC') AS unaccented
  FROM block
)
SELECT
  chr(cp) AS "src",
  icu_transform(chr(cp), 'Name') AS "srcName",
  dest.unaccented AS "dest",
  icu_transform(dest.unaccented, 'Name') AS "destName"
FROM dest
WHERE chr(cp) <> dest.unaccented;
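Since the rules file is generated by a Python script, the same decompose, strip and recompose steps can be sketched with the standard unicodedata module. This is a rough equivalent of the query above, not the code in generate_unaccent_rules.py. The function names strip_marks and arabic_mappings are hypothetical, and results may differ slightly from ICU depending on the Unicode version bundled with the Python build.

```python
import unicodedata

# The Arabic block discussed above: U+0600 to U+06FF inclusive.
ARABIC_BLOCK = range(0x0600, 0x0700)


def strip_marks(text):
    """Decompose to NFD, drop non-spacing marks (category Mn), recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


def arabic_mappings():
    """Map every codepoint in the block that changes to its stripped form."""
    result = {}
    for cp in ARABIC_BLOCK:
        src = chr(cp)
        dest = strip_marks(src)
        if dest != src:
            result[src] = dest
    return result


if __name__ == "__main__":
    # One line per changed codepoint: codepoint, Unicode name, stripped result.
    for src, dest in arabic_mappings().items():
        name = unicodedata.name(src, "<unnamed>")
        print(f"U+{ord(src):04X}\t{name}\t{dest!r}")
```

Note that a standalone diacritic maps to the empty string here, so a rule generator would have to decide whether such entries belong in the output or should be filtered out.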
[1] https://en.wikipedia.org/wiki/Arabic_diacritics
[2] https://github.com/dverite/icu_ext#icu_transform
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Attachment | Content-Type | Size |
---|---|---|
unaccent-arabic-block.utf8.output | application/octet-stream | 5.1 KB |