Thursday, October 2, 2025

 Feature co-occurrence in Pyu corpus

In NLP, feature co-occurrence analysis is useful for exploring semantic relationships and patterns in text by identifying words or phrases that frequently appear together. It therefore seems worthwhile to study feature (syllable/word/phrase) co-occurrence in the Pyu corpus.

To do so, I took the Pyu corpus I had used for the exercise reported in my previous post and removed all the editorial marks. The editorial marks used there are described in Studies in Pyu Epigraphy, I: State of the Field, Edition and Analysis of the Kan Wet Khaung Mound Inscription, and Inventory of the Corpus. To quote:

2.2 Diplomatic edition
We use bold-cum-italic typeface to highlight the Sanskrit phrases in the
text and indicate the faces (A, b, C, d) over which the lines are spread in
superscript, while we assign numbers to the Pyu glosses, also in superscript,
with the sign #. We use the following editorial conventions:
[ ] uncertain reading
( ) editorial restoration of lost text
〈 〉 editorial addition of omitted text
〈〈 〉〉 scribal insertion
{{ }} scribal deletion
? illegible akṣara
C illegible consonant element of an akṣara
V illegible vowel element of an akṣara
+ lost akṣara
◊ punctuation space

Modifying the corpus

Remove all editorial marks and extra white space:

library(magrittr)
library(stringr)  # provides str_squish()
# Remove all edit marks
patt <- c("\\#","\\[|\\]","\\(|\\)","〈|〉","〈〈|〉〉","\\{\\{|\\}\\}","\\?","C","V","\\+","◊") %>% paste0(., collapse = "|")
pyuDoc <- gsub(patt, "", x1_df.1$combined_values)
# collapse repeated white space and trim both ends
pyuDoc <- str_squish(pyuDoc)
 
# check if all removals done:
grep(patt, pyuDoc)
integer(0)
# view line 73 old and new
x1_df.1$combined_values[73]
[1] "siddha[m·] 2 || ◊ ḅay·ṁḥ kmak· [ḅa]y·ṁḥ toṅ· tṅav· tiṁ psiṁ ◊ ḅay·ṁḥ saḥ ḅay·ṁḥ goṃḥ ◊ °o saḥ ḅay·ṁḥ luṅ· hi[p]· ◊ ḅay·ṁḥ luṅ· ti[n·]ṁ droḥ kdiṃ ◊ ḅay·ṁḥ luṅ· tdav·ṃḥ ◊ daṅ·ṃṁ °oy· tsaṁḥ ḅuddha daṅ·ḥ tim·ṁ [m]l[i]y·ṁ kdaṅ· nhoḥ yaṁ ◊ ||@"
pyuDoc[73]
[1] "siddham· 2 || ḅay·ṁḥ kmak· ḅay·ṁḥ toṅ· tṅav· tiṁ psiṁ ḅay·ṁḥ saḥ ḅay·ṁḥ goṃḥ °o saḥ ḅay·ṁḥ luṅ· hip· ḅay·ṁḥ luṅ· tin·ṁ droḥ kdiṃ ḅay·ṁḥ luṅ· tdav·ṃḥ daṅ·ṃṁ °oy· tsaṁḥ ḅuddha daṅ·ḥ tim·ṁ mliy·ṁ kdaṅ· nhoḥ yaṁ ||@"
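The cleaning step can be tried on a single short string before it is run over the whole corpus. The following is only a minimal sketch in base R: the helper name strip_marks and the shortened toy string are my own illustrations and do not come from the corpus files.

```r
# Hypothetical helper (not part of the original workflow): strip the
# editorial marks quoted above, then collapse runs of white space.
strip_marks <- function(x) {
  marks <- c("#", "\\[", "\\]", "\\(", "\\)", "〈", "〉",
             "\\{\\{", "\\}\\}", "\\?", "C", "V", "\\+", "◊")
  patt <- paste(marks, collapse = "|")
  x <- gsub(patt, "", x)
  # base-R stand-in for stringr::str_squish()
  trimws(gsub("[[:space:]]+", " ", x))
}

# Toy string: a shortened piece of line 73 shown above
toy <- "siddha[m·] 2 || ◊ ḅay·ṁḥ hi[p]· ◊ ||@"
cleaned <- strip_marks(toy)
cleaned
```

Note that the bare C and V alternatives delete every capital C or V, so this pattern only suits a transliteration that uses those capitals for nothing other than illegible elements.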

Run feature co-occurrence analysis for a subset of features

I took the space-separated pieces of text, exactly as they come in the corpus, to be the “features”.
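To see what this choice means in practice, here is a small sketch on toy documents. The two documents are made up for illustration (the ASCII tokens "ta", "gi" and "2" do occur in the corpus, but these lines do not), and tolower = FALSE is my own addition to keep tokens exactly as written.

```r
library(quanteda)

# Toy documents, assumed for illustration only
toy <- c(text1 = "ta gi ta", text2 = "gi 2")

# "fasterword" splits on white space (separator and control characters),
# so every space-separated chunk becomes one token
toy_toks <- tokens(toy, what = "fasterword")

# dfm() lower-cases by default; tolower = FALSE leaves tokens untouched
toy_dfm <- dfm(toy_toks, tolower = FALSE)

# total frequency of each feature across all documents
toy_freq <- featfreq(toy_dfm)
toy_freq
```

Each distinct chunk becomes one column of the document-feature matrix, which is why punctuation-like pieces such as || also show up as features in the real corpus.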

library(quanteda)
library(quanteda.textplots)

# get names of most frequent features 

# create dfm
pyu_dfm <- dfm(tokens(pyuDoc, what = "fasterword"))
head(pyu_dfm)
Document-feature matrix of: 6 documents, 1,893 features (98.94% sparse) and 0 docvars.
       features
docs    @|| ḅay·ṁḥ dak·ṃ viy·ṃṁ tim·ṁ mlik· °o saḥ tgaṃ knon·
  text1   1      2     1      1     1     1  3   1    1     1
  text2   0      0     0      0     0     0  0   0    0     0
  text3   1      0     0      0     0     0  4   0    0     0
  text4   1      0     0      0     0     0  3   0    0     0
  text5   1      0     0      0     0     0  3   0    0     0
  text6   1      0     0      0     0     0  3   0    0     0
[ reached max_nfeat ... 1,883 more features ]
# get feature names sorted by feature frequencies in descending order
names.2 <- dfm_sort(pyu_dfm, margin = "features") %>% featfreq(.) %>% names(.)

# test: margin = "features" is the default of dfm_sort(), so x and y should be identical
x <- dfm_sort(pyu_dfm) %>% featfreq(.)
y <- dfm_sort(pyu_dfm, margin = "features") %>% featfreq(.)
identical(x, y)

# get names of most frequently co-occurring features

# create feature co-occurrence matrix
pyu_fcm <- fcm(pyu_dfm)
# Convert the FCM to a regular matrix to access counts more easily
pyu_fcm_matrix <- as.matrix(pyu_fcm)

# Calculate the sum of co-occurrence counts for each feature (row sums)
feature_counts <- rowSums(pyu_fcm_matrix)

# Sort the feature counts in descending order
sorted_features <- sort(feature_counts, decreasing = TRUE)

# Get the names of the features in the desired order
ordered_feature_names <- names(sorted_features)

# Reorder the FCM based on the sorted feature names
# This involves selecting features in the new order for both rows and columns
sorted_pyu_fcm <- pyu_fcm[ordered_feature_names, ordered_feature_names]

# Display the sorted FCM (optional)
print(sorted_pyu_fcm[1:10,1:10])
Feature co-occurrence matrix of: 10 by 10 features.
        features
features   °o  tiṁ ḅay·ṁḥ   ta tin·ṁ ḅiṁḥ  yaṁ ḅaṁḥ  gi   ḅa
  °o     3953 6186      0 1977  2202 2065 1473 1553 940 1147
  tiṁ       0 3391      0    0  1934  399    0    0 827    0
  ḅay·ṁḥ 2309 2708    628  484   648   39  121  111   6  394
  ta        0 1751      0  357   604  411    0    0 468    0
  tin·ṁ     0    0      0    0   415    0    0    0   0    0
  ḅiṁḥ      0    0      0    0    84 1026    0    0 201    0
  yaṁ       0  540      0  351   280  843  237  577 242    0
  ḅaṁḥ      0  440      0  290    63 1412    0  482 111    0
  gi        0    0      0    0   442    0    0    0 502    0
  ḅa        0 1109      0  340   379  175  132  128 125   93
# extract feature names for analysis

# most frequently co-occurring feature names for analysis
tokeep.1 <- ordered_feature_names[1:30]
toplot.1 <- fcm_keep(sorted_pyu_fcm, tokeep.1)

# most frequent feature names for analysis
tokeep.2 <- names.2[1:30]
toplot.2 <- fcm_keep(sorted_pyu_fcm, tokeep.2)
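One point that may be worth checking when ranking features by row sums: fcm() has a tri argument that defaults to TRUE, in which case only the upper triangle of the co-occurrence matrix is stored. Row sums taken from such a matrix then depend on the order of the features. The sketch below, on made-up toy documents, compares the two settings; it is an illustration under those assumptions, not a correction of the results above.

```r
library(quanteda)

# Toy documents, assumed for illustration only
toy <- c(text1 = "ta gi ta", text2 = "gi 2")
toy_dfm <- dfm(tokens(toy, what = "fasterword"), tolower = FALSE)

# default tri = TRUE: only the upper triangle is kept
fcm_upper <- as.matrix(fcm(toy_dfm))

# tri = FALSE: the full symmetric matrix is returned
fcm_full <- as.matrix(fcm(toy_dfm, tri = FALSE))

# row sums under the two settings, for comparison
rowSums(fcm_upper)
rowSums(fcm_full)
```

If the ranking should not depend on feature order, computing the row sums from the tri = FALSE version is one option.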

Plot a network of feature co-occurrences

An fcm object can be plotted as a network, where edges show co-occurrences of features. Currently the size of the network is limited to 1000 features, because of the computationally intensive nature of network formation for larger matrices. Here, I have opted to use 30 features, since a plot with too many features would also be too crowded to read.
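As a compact alternative for cutting an fcm down to a plottable size, quanteda's topfeatures() and fcm_select() can be combined. This is a sketch on made-up toy documents with a cut-off of 3 features; the object names are hypothetical, and valuetype = "fixed" is my own choice so that feature names such as || are matched literally.

```r
library(quanteda)

# Toy documents, assumed for illustration only
toy <- c(text1 = "ta gi ta 2", text2 = "gi ta ||", text3 = "ta gi 2 2")
toy_dfm <- dfm(tokens(toy, what = "fasterword"), tolower = FALSE)
toy_fcm <- fcm(toy_dfm)

# names of the 3 most frequent features
top3 <- names(topfeatures(toy_dfm, n = 3))

# keep only those features in the co-occurrence matrix
small_fcm <- fcm_select(toy_fcm, pattern = top3,
                        selection = "keep", valuetype = "fixed")
dim(small_fcm)
```

The reduced matrix can then be passed to textplot_network() in the same way as toplot.1 and toplot.2.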

# plot with 30 most frequently co-occurring features
p.1 <- textplot_network(toplot.1)
library(ggplot2)
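# linewidth replaces the size argument, which is deprecated for element_rect() since ggplot2 3.4.0
p.1 + theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))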


# plot with 30 most frequently found features
p.2 <- textplot_network(toplot.2)
p.2 + theme(panel.border = element_rect(color = "black", fill = NA, linewidth = 1))

A plot with 900 features, shown below, is too crowded to be readable:

Plot with 900 features