Bayanathi Technology: Syllable co-occurrence

I have in hand a respectably large number of syllables in the Myanmar language. What could I do with them? The easiest thing was to construct wordclouds with them, and I have already done that. Maybe it would be interesting to find out how different syllables are associated within a sentence, something like the association of variables, with which I am a little familiar. So I skimmed through the help pages of the Quanteda package, hoping to find a function that would do something related. I found the “fcm()” function, which creates a “feature co-occurrence matrix”:

Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.

Well, I guess fcm() would give something not quite like the association of numeric variables, but rather the syllables that are found together in sentences.
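
To make the idea concrete, here is a minimal base-R sketch of counting co-occurrences within sentences. The toy sentences and all variable names are made up for illustration, and quanteda's own counting rules (in particular what it puts on the diagonal) are not reproduced here: each off-diagonal cell is simply the sum, over sentences, of the product of the two tokens' counts.

```r
# Three toy "sentences", already split into tokens (hypothetical data).
sents <- list(c("a", "b", "c"), c("a", "c"), c("b", "c", "c"))

# Sentence-by-token count matrix: one row per sentence, one column per token.
vocab <- sort(unique(unlist(sents)))
dfm_toy <- t(vapply(
  sents,
  function(s) as.integer(table(factor(s, levels = vocab))),
  integer(length(vocab))
))
colnames(dfm_toy) <- vocab

# Cell (i, j) = sum over sentences of count_i * count_j.
cooc <- crossprod(dfm_toy)
diag(cooc) <- 0               # diagonal zeroed for simplicity (an assumption)
cooc[lower.tri(cooc)] <- 0    # keep only the upper triangle
cooc
```

Here "b" and "c" share two sentences, and "c" appears twice in one of them, so their cell is larger than the cell for "a" and "b".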

The network plot of an “fcm” object in the quanteda tutorial, where edges show co-occurrences of features, is what I would like to produce.

Recalling that I have 300K-plus sentences tokenized into “words” as reported in my earlier post “Cycle 3: Naive word segmentation (that works)”, I tried running “fcm()” on that data, and failed.

library(quanteda)
x100.dfm <- dfm(x100_itNS5.w_paliN)
fcmat_news <- fcm(x100.dfm)

Error in .local(x, y, ...) : 
  Cholmod error 'out of memory' at file ../Core/cholmod_memory.c, line 147

Later I found out that it meant I didn’t have enough memory to run fcm(). No wonder! The Global Environment pane of my RStudio session showed that “x100.dfm” is a “Large dfm (25578584190 elements, 129.1 Mb)”. Still, it is wonderful that Quanteda could run dfm() on “x100_itNS5.w_paliN” on my little machine.
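
A rough back-of-the-envelope check shows why the fcm is so much more demanding than the dfm. The sketch below assumes the dfm has one row for each of the 306290 sentences in the corpus, which is my assumption rather than something RStudio reports. It also assumes dense storage, as an upper bound only, since fcm() actually returns a sparse matrix.

```r
n_elements  <- 25578584190   # element count RStudio reported for x100.dfm
n_sentences <- 306290        # assumption: one dfm row per sentence

# Columns (features) implied by rows * columns = elements.
n_features <- n_elements / n_sentences

# An fcm is features-by-features.
fcm_cells <- n_features^2

# Size if every cell were stored as an 8-byte double (upper bound only).
dense_gb <- fcm_cells * 8 / 1e9

c(features = n_features, cells = fcm_cells, dense_gb = dense_gb)
```

Under these assumptions the feature count comes out as a whole number, and a dense features-by-features matrix would run to tens of gigabytes.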

Then I remembered that I have the first 20K sentences from my Myanmar Wikipedia corpus tokenized into syllables with my own syllabification code. To get a meaningful result out of the fcm exercise, I guess I will have to remove all English characters and punctuation, Myanmar punctuation and numbers, and Myanmar stopwords before running fcm.

So I looked for Myanmar stopwords on the Web. First I found that the authors of Statistical Analyses of Myanmar Corpora had assembled a 1.6-million-sentence Myanmar corpus, from which they had identified about 1216 stopwords. However, I couldn’t find any indication that they were sharing any of their work.

On the other hand swanhtet1992/myanmar-data on GitHub has shared 275 Myanmar stopwords and I like his one-line README:

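အရေးကြီးတာ၊ လိုအပ်တာ သိကြတယ်မှလား။ နောက်လူတွေအတွက် ရှိတာလေးတွေထုတ်ပေးကြတာပေါ့။
(Roughly: “We all know what is important and what is needed, don’t we? So let’s share what little we have for those who come after us.”)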

Kudos to you swanhtet -yay!

Anyway, to continue with my quest for fcm, I thought I would forget about stopwords (in general) for the moment and just concentrate on leaving out the sentence-ending syllable (the one just before the section mark, “။”), because it co-occurs with every other syllable in many sentences and obviously qualifies as a stopword.

To begin with, I have “x20k_syllQS”, which is the syllabified first 20,000 sentences from my Myanmar Wikipedia corpus of 306290 sentences.
(1) Remove the last two syllables from each of the 20,000 sentences, namely the sentence-ending syllable and the section mark “။”.
(2) Create a corpus from the results of (1).
(3) Create tokens.
(4) Remove all tokens containing English characters, Myanmar numbers, or punctuation.
(5) Create a dfm.
(6) Create an fcm.
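
Step (4) amounts to a regular-expression filter on tokens. As a minimal sketch, here it is with base R's grepl() on a handful of made-up tokens. The pattern names Myanmar digits (U+1040 to U+1049), the Myanmar punctuation marks (U+104A, U+104B), general punctuation, and ASCII letters and digits. Note that base R and quanteda use different regex engines, so classes such as [[:punct:]] may not match exactly the same characters in both.

```r
# Tokens to drop: Myanmar digits/punctuation, other punctuation, ASCII letters or digits.
drop_pattern <- "[\u1040-\u1049\u104a-\u104b]|[[:punct:]]|[A-Za-z0-9]"

# Hypothetical tokens: a Myanmar letter, a Latin letter, a bracket,
# a Myanmar digit, the section mark, and a mixed ASCII token.
toks <- c("\u1000", "P", "(", "\u1041", "\u104b", "z9")

keep <- !grepl(drop_pattern, toks)
toks[keep]

# Why [A-Za-z] rather than [A-z]: the range A-z also covers the
# characters between "Z" and "a", such as "_" and "^".
underscore_in_Az   <- grepl("[A-z]", "_")
underscore_in_AZaz <- grepl("[A-Za-z]", "_")
```

Only the Myanmar letter survives the filter; everything else matches at least one alternative in the pattern.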

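Now for step (1), we remove the last two syllables from each of the 20,000 sentences.

# Drop the last two syllables. head(x, -2) also copes with sentences of two
# or fewer syllables, where x[1:(length(x)-2)] would index backwards.
f <- function(x) head(x, -2)
x20ksyll.N2 <- lapply(x20k_syllQS, f)
x20ksyll.N2[c(1,20000)]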

[[1]]
 [1] "ဂူ"               "ဂဲ"               "၏"               "သု"               "ည"              
 [6] "စီ"               "မံ"               "ကိန\u103aး"       "("               "P"              
[11] "r"               "o"               "j"               "e"               "c"              
[16] "t"               "Z"               "e"               "r"               "o"              
[21] ")"               "လေ့"              "လာ"              "ရ\u103eာ"        "ဖ\u103dေ"       
[26] "သူ"               "ဖ\u103cစ\u103a"  "သည့\u103a"        "ဂ\u103bန\u103aး" "ဟ\u103dန\u103aး"
[31] "က"               "က\u103dတ\u103a"  "ကီး"              "မ\u103bား"       "သည\u103a"       
[36] "က\u103cား"       "ခံ"               "မ\u103bား"       "ဖ\u103cစ\u103a"  "သည့\u103a"       
[41] "ဝိုင\u103a"        "ဖိုင\u103a"        "ထောက\u103a"      "ပံ့"               "သ"              
[46] "မ\u103bား"       "က"               "ဖတ\u103a"        "ရ\u103eု"         "နိုင\u103a"       

[[2]]
 [1] "ထို"              "အ"              "ခ\u102b"        "မိ"              "ခင\u103a"      
 [6] "ဖ\u103cစ\u103a" "သူ"              "သည\u103a"       "\""             "မိ"             
[11] "မိ"              "တို့"              "တ\u103dင\u103a" "အ"              "မ\u103dေ"      
[16] "ဆက\u103a"       "ခံ"              "မည့\u103a"       "အ"              "မ\u103dေ"      
[21] "ခံ"              "သား"            "မ"              "ရ\u103eိ"
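
One thing to watch when trimming the last two syllables is very short sentences. The sketch below uses made-up placeholder tokens ("<end>" for the sentence-ending syllable, "<mark>" for the section mark) to compare indexing with 1:(length(x)-2) against head(x, -2). The function names are hypothetical.

```r
# Two ways to drop the last two elements of a vector.
trim_index <- function(x) x[1:(length(x) - 2)]
trim_head  <- function(x) head(x, -2)

# Hypothetical sentences: a normal one, and one with only the two endings.
normal <- c("s1", "s2", "s3", "<end>", "<mark>")
short  <- c("<end>", "<mark>")

trim_index(normal)  # same as trim_head(normal)
trim_head(normal)

# For a two-element sentence, 1:0 counts down to c(1, 0), so the
# index version keeps the first element instead of returning nothing.
short_index <- trim_index(short)
short_head  <- trim_head(short)
```

For sentences longer than two syllables both versions agree; they differ only in the edge cases.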

For steps (2) to (5):

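system.time(
  # NB: as written, the input is x20k_syllQS; use x20ksyll.N2 to apply the trimming from step (1)
  syll20k.dfm <- corpus(t(data.frame(lapply(x20k_syllQS, paste0, collapse = " ")))) %>%
    tokens(., what = "fasterword") %>%
    # drop tokens with Myanmar digits or punctuation, other punctuation, or ASCII letters/digits
    tokens_select(., "[\u1040-\u1049\u104a-\u104b]|[[:punct:]]|[A-Za-z0-9]", selection = "remove", valuetype = "regex") %>%
    dfm(.)
)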

   user  system elapsed 
  58.91    0.06   59.61

dim(syll20k.dfm)

[1] 20000  5153

That resulted in a document-feature matrix of 20,000 documents (sentences) by 5,153 features (syllables) that is 99.4% sparse.
For step (6), we create the fcm, which gives a feature co-occurrence matrix of 5,153 by 5,153 features (syllables).

system.time(
  syll20k.fcm <- fcm(syll20k.dfm)
)

   user  system elapsed 
   3.43    0.42    3.92

dim(syll20k.fcm)

[1] 5153 5153

Next I select the 50 syllables with the highest frequencies:

feat <- names(topfeatures(syll20k.fcm, 50))
syll20k.fcm_select <- fcm_select(syll20k.fcm, pattern = feat)
dim(syll20k.fcm_select)

[1] 50 50
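
The selection step keeps only the rows and columns belonging to the top features. On a plain matrix the same idea can be sketched as below. The toy counts are made up, and ranking features by their total co-occurrence is my assumption for illustration, not necessarily how topfeatures() ranks the entries of an fcm.

```r
# Toy upper-triangular co-occurrence counts for four features (hypothetical).
feats <- c("a", "b", "c", "d")
m <- matrix(c(0, 5, 1, 2,
              0, 0, 4, 1,
              0, 0, 0, 1,
              0, 0, 0, 0),
            nrow = 4, byrow = TRUE, dimnames = list(feats, feats))

# Make it symmetric so each feature's total counts every pair it is part of.
full <- m + t(m)
totals <- colSums(full)

# Keep the two features with the largest totals, then subset rows and columns.
top <- names(sort(totals, decreasing = TRUE))[1:2]
sub <- full[top, top]
sub
```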

I tried to bluff the “textplot_network()” function by adding family = “Pyidaungsu”. Didn’t work.

size <- log(colSums(dfm_select(syll20k.dfm, feat)))
set.seed(93019)
textplot_network(syll20k.fcm_select, min_freq = 0.8, vertex_size = size / max(size) * 3, family = "Pyidaungsu")
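
The vertex_size expression rescales the log frequencies so that the most frequent syllable gets a vertex of size 3 and the rest shrink in proportion. With made-up counts, chosen as powers of ten so the result is easy to read:

```r
# Hypothetical syllable frequencies.
freq <- c(x = 1000, y = 100, z = 10)

# Same transformation as in the plotting call: log, then scale to a maximum of 3.
size <- log(freq)
vertex_size <- size / max(size) * 3
vertex_size
```

Taking logs first compresses the range, so a syllable that is 100 times more frequent is drawn only three times as large.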

Looking hard at the syntax of the textplot_network() function, I see that I need to tell it which font to use to label the vertices of the plot, through the vertex_labelfont argument. I bluffed again:

size <- log(colSums(dfm_select(syll20k.dfm, feat)))
set.seed(93019)
textplot_network(syll20k.fcm_select, min_freq = 0.8, vertex_size = size / max(size) * 3, vertex_labelfont = "Pyidaungsu")

Error in check_font(vertex_labelfont) : 
  Pyidaungsu is not found on your system. Run extrafont::font_import() and extrafont::loadfonts(device = "win") to use custom fonts.

It won’t be fooled! It tells me to use “extrafont” again! As you’ll recall I wasn’t able to import fonts with “extrafont” on my 32-bit, Windows-7 Lenovo laptop.

Desperate, I tried using the showtext package and the Cairo graphic package as suggested by Yixuan Qiu in “showtext: Using System Fonts in R Graphics”. I tried many variations trying to copy him, but none worked out the way it should.

But finally:

The above plot shows the co-occurrence of the 50 syllables with the highest frequencies from 20,000 sentences. The syllable co-occurrence matrix can be viewed like this:

library(kableExtra)
y <- syll20k.fcm_select
kable(y) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 11) %>%
  scroll_box(width = "600px")

'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")

document	သည်	မ	အ	တွင်	ခဲ့	ပ	သို့	ရှိ	ရာ	လည်း	ကို	သော	ပါ	တစ်	အား	ခု	တို့	မျိုး	နှစ်	ရ	မှ	မှာ	မှု	ရန်	လက်	ပြီး	ရေး	ကြီး	စာ	လူ	သာ	သား	လုပ်	တော်	ပြု	လ	မင်း	ရား	မြို့	ဦး	ပင်	ထို	စစ်	ဝင်	ငံ	ဆောင်	စား	မည်	ခြင်း	တွင်း
သည်	10756	12633	57250	14161	10833	6211	6778	10095	9799	5499	18658	12868	6602	6839	4307	6366	8539	4023	6744	11721	6314	4048	4797	3898	3534	5638	7491	5749	3871	3242	4912	4425	4330	6351	3854	3771	2824	2488	3581	4432	3503	4012	2785	3448	4543	2976	2972	2912	5417	2464
မ	0	4882	24385	4834	3990	3582	3638	4672	4586	2786	8752	5577	3732	2731	1972	2411	3870	1534	2396	6676	2661	2274	2342	1807	1825	2250	3097	2711	1830	1368	2814	2097	1891	3589	1865	2313	1526	1486	1209	1877	1558	1522	1371	1476	1568	1322	1433	1777	3949	1011
အ	0	0	63560	23716	17731	13068	11607	19614	19145	9830	34222	25373	14399	12291	8371	11344	15041	8906	11403	23767	11695	8415	11048	8321	7036	10921	18586	10716	7865	5909	8423	8490	10593	10815	9072	6833	3939	4465	5272	8158	6004	6499	4788	7175	8978	7780	6474	5796	12947	5758
တွင်	0	0	0	2337	5567	2885	2393	4497	4496	2015	6655	4926	2840	2795	1583	4088	2849	1504	4370	4522	2425	1239	1960	1691	1554	2728	3480	2478	1873	1230	1700	1698	1810	2566	1603	2143	1096	863	2019	1955	1149	1445	1204	1844	2397	1496	1095	1073	1842	1003
ခဲ့	0	0	0	0	1700	2295	1912	2632	3240	1762	6116	3328	2466	1749	1374	3030	2196	1027	3573	3730	2329	1137	1818	1322	1228	2533	3196	2013	1498	977	1333	1525	1404	2221	1210	1642	880	661	1255	1955	793	1105	1100	1339	1779	1307	848	743	1387	950
ပ	0	0	0	0	0	1505	1127	1959	2349	1038	4260	2999	1719	1161	1005	1467	1825	757	1522	3044	1295	829	1366	996	790	1174	2414	1277	1216	600	1230	972	1290	1585	999	986	482	754	636	1101	497	642	515	782	1136	840	542	567	1702	463
သို့	0	0	0	0	0	0	911	2038	2239	1245	3776	2836	1373	1452	893	1158	1626	722	1164	2606	1444	768	887	965	812	1168	1318	1154	652	582	936	758	1048	1349	825	791	551	550	601	834	797	1000	590	709	968	677	578	847	1355	606
ရှိ	0	0	0	0	0	0	0	1730	3251	1775	5235	4541	2251	2196	1305	1799	2388	1468	1696	4410	1974	1522	1898	1239	1328	1828	2185	1720	1146	1036	1600	1140	1501	1815	1130	958	617	892	1272	1186	1033	1117	732	935	1468	922	867	1049	2265	915
ရာ	0	0	0	0	0	0	0	0	2733	1605	5884	4097	2327	2154	1283	1895	2729	1215	2059	4303	2068	1380	1775	1211	1288	1694	3280	2286	1607	1028	1663	1588	1865	2812	1354	1104	1047	911	1248	1844	1042	1119	788	1172	1396	1517	932	883	1872	808
လည်း	0	0	0	0	0	0	0	0	0	873	3716	1978	1352	1124	667	809	1462	651	909	2140	935	637	839	573	653	728	1463	1020	749	493	956	767	819	1152	696	553	573	481	571	683	595	605	500	575	753	540	515	564	933	407
ကို	0	0	0	0	0	0	0	0	0	0	5161	8195	4627	3774	2517	3377	5495	2221	3560	8246	3603	2115	3100	2890	2408	3781	4908	3380	2601	1834	3131	2877	3335	3884	3269	2136	1606	1818	1530	2706	1832	2421	1670	1704	2363	2048	1928	2195	4964	1280
သော	0	0	0	0	0	0	0	0	0	0	0	3098	2722	3027	2010	2303	3829	1817	2082	5471	2442	1737	2159	1414	1464	1965	2996	2487	1717	1333	2138	1760	2053	2569	1969	1393	1003	1439	1133	1475	1385	1514	898	1398	1643	1077	1212	1162	3147	946
ပါ	0	0	0	0	0	0	0	0	0	0	0	0	1544	1448	1116	1437	1807	964	1316	2955	1386	1491	1275	998	970	1522	2684	1402	1151	715	1262	1107	1212	1652	991	865	574	619	540	1276	689	597	580	1782	992	1056	808	899	1453	574
တစ်	0	0	0	0	0	0	0	0	0	0	0	0	0	1773	830	2491	1408	1021	1559	2173	1443	996	956	818	823	1305	1412	1051	749	668	907	793	967	924	773	654	399	462	620	1414	739	857	469	718	847	579	645	637	1259	479
အား	0	0	0	0	0	0	0	0	0	0	0	0	0	0	631	681	1277	532	664	1716	760	493	931	882	482	826	1456	901	476	506	664	623	770	1046	702	456	398	457	284	703	338	509	508	427	690	645	547	472	1217	375
ခု	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1229	1134	599	4558	2163	1554	838	959	846	820	1311	1775	1254	1099	560	877	699	975	1214	619	1390	404	303	937	1003	491	529	653	814	1200	752	505	485	790	518
တို့	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1633	1223	1238	3327	1588	1179	1365	999	1051	1220	1791	1493	1019	1139	1510	1369	1130	1967	1045	881	930	915	734	1131	906	1087	727	917	1035	775	808	838	2310	584
မျိုး	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1141	679	1443	737	705	640	406	387	606	1041	550	549	1216	698	942	466	522	458	367	172	209	225	552	607	469	267	555	514	347	647	338	784	267
နှစ်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1519	2486	1696	747	944	801	931	1388	1762	1300	1183	627	964	957	842	1562	638	1607	618	398	931	1243	615	602	616	785	1214	841	549	486	797	670
ရ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3526	2511	1739	2269	1518	1514	2140	3095	2320	1853	1201	2383	1746	1935	3198	1725	1621	1264	1523	1165	2036	1095	1353	1087	1129	1550	1275	1197	1651	3867	939
မှ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	754	770	943	780	807	1373	1427	1113	716	615	894	878	922	1119	747	899	479	373	737	964	653	762	536	737	1001	670	596	608	1151	601
မှာ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	496	653	464	545	1006	990	852	559	582	890	615	641	849	436	473	360	360	419	660	702	488	352	434	484	399	464	526	898	364
မှု	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1180	793	688	950	2489	810	620	743	791	633	1363	703	881	543	184	472	275	707	323	432	436	709	950	1099	571	435	1854	498
ရန်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	642	508	721	1619	680	545	446	553	580	900	774	788	489	254	295	543	449	370	450	500	461	681	790	453	347	845	396
လက်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	660	759	1111	784	533	317	586	542	716	790	479	409	548	288	378	506	337	356	388	356	452	499	370	402	790	334
ပြီး	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	473	1492	1015	672	551	800	793	954	1212	719	725	419	416	592	896	492	610	614	731	874	695	538	508	980	487
ရေး	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3433	1908	2111	1127	1188	1984	2283	1800	1093	1029	365	534	641	1711	504	621	1048	1143	2510	2429	794	820	1488	1067
ကြီး	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1266	866	555	893	886	947	2301	565	866	1191	852	1010	1518	618	657	592	656	874	801	507	491	724	523
စာ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1564	436	1034	1105	605	905	589	488	271	247	375	937	312	378	297	569	576	483	430	437	739	317
လူ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	496	555	645	452	470	391	323	174	195	300	706	249	356	235	428	472	391	379	305	627	244
သာ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1045	856	459	1399	678	551	586	571	487	902	508	525	326	513	639	442	450	628	1096	295
သား	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	724	635	1235	563	467	832	406	394	820	330	458	390	520	606	561	482	487	775	355
လုပ်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1621	615	1426	451	186	327	395	621	376	443	309	533	739	1154	562	422	1644	429
တော်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2571	787	1065	1811	1662	1137	1615	570	649	566	664	924	921	600	668	932	608
ပြု	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	484	376	268	370	240	421	361	481	319	400	509	445	448	395	1570	264
လ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	882	356	352	591	627	314	361	416	499	658	486	256	296	736	382
မင်း	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	846	641	476	543	236	299	230	279	166	296	285	337	242	235
ရား	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	601	279	385	214	285	187	290	215	208	179	257	911	172
မြို့	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1039	608	361	294	308	372	448	289	277	194	299	386
ဦး	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	4380	299	371	491	818	871	930	431	407	472	328
ပင်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	552	480	215	301	353	250	292	356	570	246
ထို	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	221	265	299	380	296	264	428	579	293
စစ်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	493	339	435	324	208	244	441	357
ဝင်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	347	710	500	375	279	575	328
ငံ	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1100	850	400	311	627	498
ဆောင်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	671	287	348	735	461
စား	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	454	250	791	221
မည်	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	489	515	174
ခြင်း	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	3085	468
တွင်း	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	213
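
One way to start reading a table like this is to list the off-diagonal cells from largest to smallest. The sketch below does that for just the first three syllables of the table, with the values copied from its top-left corner. It assumes a plain numeric matrix rather than an fcm object, and the variable names are made up.

```r
# First three syllables of the table, written as Unicode escapes.
syl <- c("\u101e\u100a\u103a", "\u1019", "\u1021")

# Top-left 3 x 3 corner of the co-occurrence table.
m <- matrix(c(10756, 12633, 57250,
                  0,  4882, 24385,
                  0,      0, 63560),
            nrow = 3, byrow = TRUE, dimnames = list(syl, syl))

# Row/column positions of the cells above the diagonal.
idx <- which(upper.tri(m), arr.ind = TRUE)

# One row per syllable pair, sorted by co-occurrence count.
pairs <- data.frame(
  first  = rownames(m)[idx[, "row"]],
  second = colnames(m)[idx[, "col"]],
  count  = m[idx]
)
pairs <- pairs[order(-pairs$count), ]
pairs
```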

For now, I am leaving the interpretation of this table and the plot above to you, and I will also let you find out for yourselves the right way to write code to get Myanmar text into the plot.

Monday, October 7, 2019
