Bayanathi Technology: Removing all Wikipedia articles written by robots

The title is a little more dramatic than the facts. The robots didn’t actually write the articles; I guess they just translated them from English.

Quite by accident, I discovered that there are articles labelled “အခြားဘာသာစကားမှ မူကြမ်းသဘောမျိုး ဘာသာပြန်ထားသော ဆောင်းပါး” (roughly, “article translated in draft form from another language”). I will try to remove them from the original Wikipedia dump XML file, using the xml2 package for this exercise.

library(xml2)
system.time(
    xdoc <- read_xml("mywiki-20190201-pages-articles.xml", encoding = "UTF-8")
)

   user  system elapsed 
   3.48    1.05    5.18

I need the namespace prefix to use in the XPath expressions that find the nodes, so:

xml_ns(xdoc)

d1  <-> http://www.mediawiki.org/xml/export-0.10/
xsi <-> http://www.w3.org/2001/XMLSchema-instance
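
The reason the prefix matters can be seen on a toy document. This is only a sketch: the small XML string below is made up and merely borrows the namespace URI printed above. xml2 lists the default namespace under the prefix d1, and an XPath without that prefix finds nothing.

```r
library(xml2)

# Tiny stand-in for the dump (hypothetical content) with the same default namespace.
toy <- read_xml(
  '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
     <page><title>Abir</title></page>
   </mediawiki>'
)

xml_ns(toy)                                        # default namespace is listed as d1

n_plain    <- length(xml_find_all(toy, "//page"))     # unprefixed name
n_prefixed <- length(xml_find_all(toy, "//d1:page"))  # prefixed name
c(plain = n_plain, prefixed = n_prefixed)
```

Only the prefixed query matches the page, which is why every XPath in this post starts with d1:.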

Looking for other robot-translated articles

I searched for this phrase: “အခြားဘာသာစကားမှ မူကြမ်းသဘောမျိုး ဘာသာပြန်ထားသော ဆောင်းပါး”.

I put the article titles from the search results into a text file called “botTranlateWiki Apr20_2019.txt”. The results included articles created after the Wikipedia dump XML file was generated, so I dropped those from the list.

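# Read one article title per line, as UTF-8
botTrans <- readLines("botTranlateWiki Apr20_2019.txt", encoding = "UTF-8")
# readLines() already returns a character vector, so no unlist() is needed
cat(botTrans)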

ဘိုးအင်း ၇၇၇ ဘိုးအင်း ၇၃၇ စိန့်ပီတာအမ်ဟတ် Reikersdorf ဗီယက်လူမျိုး မာရိုဝမ် ဖလေးနီ ယို့စ် ဂိပတ် ဘဂ္ဂဒက်၏ ပတ်ပတ်လည်မြို့ ဘိုးအင်း ၇၆၇ အီရန်နိုင်ငံရှိ ရွေးကောက်ပွဲများ Vrahovice  El Chavo del 8 Alaska Thunderfuck  နွားနှင့်ကြက်သား စကူးဘီ ဒူး Edmond Debeaumarché Oxford ရဲ့ဂိုဏ်း Roach Freestyle Script တွမ်နှင့်ဂျယ်ရီပုံပြင်များ Christadelphian တွမ်နှင့်ဂျယ်ရီရှိုးပွဲ (၂၀၁၄ တီဗီစီးရီး) ကရပ် မာဂ တိုက်ခိုက်ရေး ရေဒီယိုစတူဒီယို 54 ကွန်ယက် မီနီ အစ္စရေး Kabbalah Abir  KAI T-50 Golden Eagle အဗန်းဂျားစ် (ကာတွန်း) ဒေးဗစ်မြို့ Channel 5 (UK) နိုင်ငံတကာပထဝီပြည်ထောင်စု AIDC F-CK-1 Ching-kuo ဇို-အင်္ဂလိပ်-အိန္ဒိယ အဘိဓာန် ကလာရာ ရော့ခ်မိုး ဝါးအဲယားဝေး စားသောက်ဆိုင်နိပွန် အီရန်နိုင်ငံ၏ ကျွန်းများစာရင်း ချာကီလိုမီတာ မိုယာ ဆထရန်နာ မိုယာ ဘူဂေးရီးယား ဣသရေလအမျိုး၏ပြိုလဲသည်အမျိုးသားအောက်မေ့ရာခန်းမ နိုင်ငံတကာမြို့ကြီးများစင်တာများအစည်းအရုံး ဒီဇိုင်းပြတိုက်ဒိဘုန် Game Dev Story Ariel Sharon Park ပါကစ္စတန်ရှိလိင်တူချစ်သူများအခွင့်အရေး ငရုတ်ကောင်းဝက် Big bang theory Baskin-Robbins ORACLE (teletext) အုပ်ထိန်းရေး ကောင်စီ အီရန်နိုင်ငံရှိ နိုင်ငံရေးပါတီများ ကယ်လ်ဗင် ဟဲရစ် Chatbot အီရန်ရှိတောင်များစာရင်း STV (TV channel) ဆေတန် ၅ ဒုံးပျံ နယူးယောက်မြို့ မြေအောက်ရထား အစ္စလာမ့်ဗိသုကာ

Removing the robot-translated articles

First we need to find the page nodes for these articles by matching on their titles.

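# One slot per title: the matching <page> nodeset and its text
nodes2rm.1 <- vector("list", length(botTrans))
text2rm.1 <- vector("list", length(botTrans))
system.time(
  for (i in seq_along(botTrans)) {
    # Caution: a title containing a single quote would break this XPath string
    tmp <- xml_find_all(xdoc, paste0("//d1:page[./d1:title = '", botTrans[i], "']"))
    nodes2rm.1[[i]] <- tmp
    text2rm.1[[i]] <- xml_text(tmp)
  }
)
nodes2rm.1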

Unfortunately the previous run did not find all the required nodesets. The missing ones show up as “{xml_nodeset (0)}” in the printed list. I need different code to get the four remaining nodesets.

nodes2rm.2 <-  xml_find_all(xdoc, "//d1:page[./d1:title = 'ဘိုးအင်း ၇၇၇' or ./d1:title = 'Vrahovice' or ./d1:title = 'Alaska Thunderfuck' or ./d1:title = 'Abir']")
nodes2rm.2

{xml_nodeset (4)}
[1] <page>\n  <title>Vrahovice</title>\n  <ns>0</ns>\n  <id>52560</id>\n  <revision>\n    ...
[2] <page>\n  <title>ဘိုးအင\u103aး ၇၇၇</title>\n  <ns>0</ns>\n  <id>54226</id>\n  <revisi ...
[3] <page>\n  <title>Abir</title>\n  <ns>0</ns>\n  <id>68636</id>\n  <revision>\n    <id> ...
[4] <page>\n  <title>Alaska Thunderfuck</title>\n  <ns>0</ns>\n  <id>76114</id>\n  <revis ...
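
Why did the loop miss these four? I can only guess. In the printed title list, Vrahovice, Alaska Thunderfuck and Abir are each followed by a double space, which suggests trailing whitespace in the text file, and the fourth title is the first line of the file, where a byte-order mark could be hiding. The sketch below assumes that explanation; the toy document, the sample titles and the clean_titles() helper are all made up for illustration. It cleans the titles and then compares them in R instead of pasting each one into an XPath string.

```r
library(xml2)

# Toy document standing in for the dump (hypothetical content).
toy <- read_xml(
  '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
     <page><title>Vrahovice</title></page>
     <page><title>Abir</title></page>
     <page><title>Chatbot</title></page>
   </mediawiki>'
)

# Titles as they might come out of readLines(): a leading BOM, a trailing space.
raw_titles <- c("\ufeffVrahovice", "Abir ")

# Hypothetical helper: drop a leading byte-order mark, then trim whitespace.
clean_titles <- function(x) trimws(sub("^\ufeff", "", x))
titles <- clean_titles(raw_titles)

# Fetch every title node once and match the text in R, so quotes or
# stray characters in a title cannot break an XPath expression.
title_nodes <- xml_find_all(toy, "//d1:page/d1:title")
hits <- title_nodes[xml_text(title_nodes) %in% titles]
pages <- xml_parent(hits)
length(pages)
```

With the real data, xdoc and botTrans would take the place of toy and raw_titles. Whether this recovers all four pages depends on the guess about the text file being right.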

These nodesets are removed from the parsed document in two passes. This changes xdoc in memory; the dump file on disk is not touched.
Remove the first nodesets.

lapply(nodes2rm.1, xml_remove)

Remove the second nodesets.

lapply(nodes2rm.2, xml_remove)

Checking to see if it is done

You can check whether the page nodes for the machine-translated articles are really gone.

xml_find_all(xdoc, "//d1:page[./d1:title = 'ဘိုးအင်း ၇၇၇' or ./d1:title = 'Vrahovice' or ./d1:title = 'Alaska Thunderfuck' or ./d1:title = 'Abir']")

{xml_nodeset (0)}

xml_find_all(xdoc, paste0("//d1:page[./d1:title = '",botTrans[2],"']"))

{xml_nodeset (0)}

xml_find_all(xdoc, paste0("//d1:page[./d1:title = '",botTrans[58],"']"))

{xml_nodeset (0)}

Yes! I’m sure they were, though I’m too lazy to check every one of them. You could check them if you like.
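
For anyone less lazy, here is a minimal sketch of checking every title in one go and then saving the result. It runs on a made-up toy document; with the real data, xdoc and botTrans would replace toy and titles. The output file name is my own assumption, since nothing above writes the cleaned dump back to disk.

```r
library(xml2)

# Toy document standing in for the dump (hypothetical content).
toy <- read_xml(
  '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
     <page><title>Abir</title></page>
     <page><title>Chatbot</title></page>
     <page><title>Yangon</title></page>
   </mediawiki>'
)
titles <- c("Abir", "Chatbot")   # stands in for botTrans

# Remove the listed pages from the in-memory document.
all_titles <- xml_find_all(toy, "//d1:page/d1:title")
xml_remove(xml_parent(all_titles[xml_text(all_titles) %in% titles]))

# Re-read the surviving titles and look for any listed title still present.
remaining <- xml_text(xml_find_all(toy, "//d1:page/d1:title"))
leftover <- intersect(titles, remaining)
length(leftover) == 0            # TRUE means every listed page is gone

# Save the cleaned document; the file name here is an assumption.
out_file <- file.path(tempdir(), "mywiki-cleaned.xml")
write_xml(toy, out_file)
```

If leftover is not empty, it names exactly the titles that still need attention.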

Tuesday, April 23, 2019
