How to get Arabic names for Mauritania Moughataa

On HDX (the Humanitarian Data Exchange), you can download and use the administrative boundaries of Mauritania, with one caveat: the names of the administrative divisions are translated from Arabic to English. For some analyses, it can be useful to also have the Arabic names in the same table. In this post, we are going to scrape a table with the Arabic names from a website and then join it to our administrative boundaries data. We will need the rhdx package (not yet on CRAN), along with the tidyverse, sf, rvest, httr, and stringdist packages.
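
The chunk that loads the packages does not appear in the text. Judging from the attached packages in the session info at the end of the post and the functions used later, it was probably close to the following. Treat it as an assumed setup rather than the original chunk.

```r
# Assumed setup, inferred from the session info and the functions used in the post
library(tidyverse)   # dplyr, ggplot2, purrr, ...
library(sf)          # spatial data frames, geom_sf() support
library(rvest)       # html_session(), html_nodes(), html_table()
library(httr)        # user_agent()
library(stringdist)  # amatch()
library(rhdx)        # pull_dataset(); not on CRAN at the time of writing
```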

We can use rhdx::pull_dataset to read the Mauritania administrative boundaries dataset into R. The 4th resource contains the second administrative level, known as Moughataa. In the rest of this post, it is stored in an sf object called mrt_adm2.

The Arabic names are not available in this data. We can visualize the names that are available (admin2Name) using ggplot2 and sf.

mrt_adm2 %>%
  ggplot() +
  geom_sf() +
  geom_sf_label(aes(label = admin2Name)) +
  theme_minimal()
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may
## not give correct results for longitude/latitude data
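
The warning comes from geom_sf_label(), which by default places each label with sf::st_point_on_surface(), a computation that assumes planar coordinates. It is harmless for a quick map, but one way to avoid it is to project the data first. The sketch below uses a toy square and EPSG:32629 (UTM zone 29N). Both are assumptions for illustration, and another projected CRS may suit the real boundaries better.

```r
library(sf)

# Toy polygon in longitude/latitude (EPSG:4326), standing in for one Moughataa
toy <- st_sfc(
  st_polygon(list(rbind(
    c(-12, 18), c(-10, 18), c(-10, 20), c(-12, 20), c(-12, 18)
  ))),
  crs = 4326
)

# Project to a planar CRS (assumed here: UTM zone 29N) before computing label points
toy_utm <- st_transform(toy, 32629)
label_pt <- st_point_on_surface(toy_utm)

# The projected object is no longer in longitude/latitude
st_is_longlat(toy_utm)
```

With the real data, the same idea would be to call st_transform() on mrt_adm2 before passing it to ggplot().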

We need the Arabic names, and the City Population website provides a lot of information on population and native names for administrative areas. You can check the Mauritania page at https://www.citypopulation.de/php/mauritania-admin.php. The pages are dynamic, so in order to scrape the information we need, we have to mimic a web browser session using an appropriate user agent. With the right user agent, we can use the rvest R package to scrape the data.

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
url <- "https://www.citypopulation.de/php/mauritania-admin.php"

arabic_adm2 <- url %>%
  html_session(user_agent(uastring)) %>% ## open a session with a browser-like user agent
  html_nodes("table#tl.data") %>% ## extract the table
  html_table() %>% ## turn it into a list of data frames
  first() %>% ## keep the first (and only) table in the list
  select(admin2NameWeb = Name, admin2NameNative = Native, status = Status) %>% ## keep and rename columns
  filter(status %in% c("Department", "Capital District")) ## keep departments (adm2) and the capital district for Nouakchott
glimpse(arabic_adm2)
## Observations: 47
## Variables: 3
## $ admin2NameWeb    <chr> "Aïoun El Atrouss", "Akjoujt", "Aleg", "Amourj"…
## $ admin2NameNative <chr> "لعيون", "اكجوجت", "الاك", "امرج", "أوجفت", "أط…
## $ status           <chr> "Department", "Department", "Department", "Depa…
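
To see what the selector and the filter do without hitting the network, here is a minimal offline sketch. The HTML is a toy stand-in for the real page: only the table id and class and the column names mirror the code above, the two Arabic names are taken from the output above, and the "Toy Region" row is made up. Note that in rvest 1.0.0 and later, html_session() and html_nodes() are superseded by session() and html_elements().

```r
library(rvest)
library(dplyr)

# Toy page: literal HTML is parsed directly by read_html()
toy_page <- read_html('
<table id="tl" class="data">
  <tr><th>Name</th><th>Status</th><th>Native</th></tr>
  <tr><td>Akjoujt</td><td>Department</td><td>اكجوجت</td></tr>
  <tr><td>Aleg</td><td>Department</td><td>الاك</td></tr>
  <tr><td>Toy Region</td><td>Region</td><td>-</td></tr>
</table>')

toy_adm2 <- toy_page %>%
  html_nodes("table#tl.data") %>%  # table with id "tl" and class "data"
  html_table() %>%                 # list with one data frame per matched table
  first() %>%                      # keep the first data frame
  select(admin2NameWeb = Name, admin2NameNative = Native, status = Status) %>%
  filter(status %in% c("Department", "Capital District"))  # drops the region row

toy_adm2$admin2NameWeb
```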

As you can see, the scraped table contains the Arabic names (admin2NameNative). We now need to join it to our boundaries data. However, because the names are spelled differently in the two tables (admin2NameWeb versus admin2Name), we need approximate matching (stringdist::amatch).

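## for each website name, index of the closest boundary name (NA if none is within 5 edits)
ind <- amatch(arabic_adm2$admin2NameWeb, mrt_adm2$admin2Name, maxDist = 5)
arabic_adm2$admin2Name <- mrt_adm2$admin2Name[ind]

## boundary names that no website name was matched to
non_matched_admin2 <- anti_join(mrt_adm2, arabic_adm2, by = "admin2Name")$admin2Name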
non_matched_admin2
## [1] "Aioun"
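
The behaviour of maxDist can be seen on a toy example. The sketch below uses two website spellings and two boundary spellings, and then shows one possible manual override. It assumes that "Aïoun El Atrouss" on the website is the same Moughataa as "Aioun" in the boundaries data; the vectors and the override are illustrative and not part of the original analysis.

```r
library(stringdist)

web_names <- c("Akjoujt", "Aïoun El Atrouss")  # spellings as on the website
hdx_names <- c("Aioun", "Akjoujt")             # spellings as in the boundaries data

# Index of the closest candidate within maxDist, NA when nothing is close enough
ind <- amatch(web_names, hdx_names, maxDist = 5)
ind

# Distance behind the NA; it is larger than maxDist = 5
stringdist("Aïoun El Atrouss", "Aioun")

# Hypothetical manual override for a name the fuzzy match cannot reach
matched <- hdx_names[ind]
matched[web_names == "Aïoun El Atrouss"] <- "Aioun"
matched
```

Raising maxDist enough to catch this pair would also risk false matches between short names, which is why a targeted override may be the safer choice.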

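Only Aioun is missing, most likely because the website lists it as "Aïoun El Atrouss", which is more than five edits away from "Aioun". Since the other Moughataa are matched, we can join the two tables and check the final result on a map.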

final <- left_join(mrt_adm2, select(arabic_adm2, -status, -admin2NameWeb), by = "admin2Name")

ggplot(final) +
  geom_sf() +
  geom_sf_label(aes(label = admin2NameNative)) +
  theme_minimal()

Session info for this analysis.

devtools::session_info()
##  Session info ──────────────────────────────────────────────────────────
##  setting  value                                      
##  version  R version 3.5.2 Patched (2019-02-05 r76078)
##  os       Arch Linux                                 
##  system   x86_64, linux-gnu                          
##  ui       X11                                        
##  language                                            
##  collate  en_US.UTF-8                                
##  ctype    en_US.UTF-8                                
##  tz       Africa/Dakar                               
##  date     2019-03-04                                 
## 
##  Packages ──────────────────────────────────────────────────────────────
##  package     * version    date       lib source                        
##  assertthat    0.2.0      2017-04-11 [1] CRAN (R 3.5.0)                
##  backports     1.1.3      2018-12-14 [1] CRAN (R 3.5.2)                
##  blogdown      0.10       2019-01-09 [1] CRAN (R 3.5.2)                
##  bookdown      0.9        2018-12-21 [1] CRAN (R 3.5.2)                
##  broom         0.5.1      2018-12-05 [1] CRAN (R 3.5.1)                
##  callr         3.1.1      2018-12-21 [1] CRAN (R 3.5.2)                
##  cellranger    1.1.0      2016-07-27 [1] CRAN (R 3.5.0)                
##  class         7.3-15     2019-01-01 [1] CRAN (R 3.5.2)                
##  classInt      0.3-1      2018-12-18 [1] CRAN (R 3.5.2)                
##  cli           1.0.1      2018-09-25 [1] CRAN (R 3.5.1)                
##  colorspace    1.4-0      2019-01-13 [1] CRAN (R 3.5.2)                
##  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.0)                
##  crul          0.7.0      2019-01-03 [1] Github (ropensci/crul@bb12ce1)
##  curl          3.3        2019-01-10 [1] CRAN (R 3.5.2)                
##  DBI           1.0.0.9001 2019-03-01 [1] Github (r-dbi/DBI@2bbe284)    
##  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.0)                
##  devtools      2.0.1      2018-10-26 [1] CRAN (R 3.5.1)                
##  digest        0.6.18     2018-10-10 [1] CRAN (R 3.5.1)                
##  dplyr       * 0.8.0.1    2019-02-15 [1] CRAN (R 3.5.2)                
##  e1071         1.7-0.1    2019-01-21 [1] CRAN (R 3.5.2)                
##  evaluate      0.13       2019-02-12 [1] CRAN (R 3.5.2)                
##  fansi         0.4.0      2018-12-31 [1] Github (brodieG/fansi@ab11e9c)
##  forcats     * 0.4.0      2019-02-17 [1] CRAN (R 3.5.2)                
##  fs            1.2.6      2018-08-23 [1] CRAN (R 3.5.1)                
##  generics      0.0.2      2018-11-29 [1] CRAN (R 3.5.1)                
##  ggplot2     * 3.1.0      2018-10-25 [1] CRAN (R 3.5.1)                
##  glue          1.3.0      2018-07-17 [1] CRAN (R 3.5.1)                
##  gtable        0.2.0      2016-02-26 [1] CRAN (R 3.5.0)                
##  haven         2.1.0      2019-02-19 [1] CRAN (R 3.5.2)                
##  hms           0.4.2      2018-03-10 [1] CRAN (R 3.5.0)                
##  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.0)                
##  httpcode      0.2.0      2016-11-14 [1] CRAN (R 3.5.0)                
##  httr        * 1.4.0      2018-12-11 [1] CRAN (R 3.5.2)                
##  jsonlite      1.6        2018-12-07 [1] CRAN (R 3.5.2)                
##  knitr         1.21       2018-12-10 [1] CRAN (R 3.5.2)                
##  lattice       0.20-38    2018-11-04 [1] CRAN (R 3.5.2)                
##  lazyeval      0.2.1      2017-10-29 [1] CRAN (R 3.5.0)                
##  lubridate     1.7.4      2018-04-11 [1] CRAN (R 3.5.0)                
##  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.0)                
##  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.0)                
##  modelr        0.1.4      2019-02-18 [1] CRAN (R 3.5.2)                
##  munsell       0.5.0      2018-06-12 [1] CRAN (R 3.5.0)                
##  nlme          3.1-137    2018-04-07 [1] CRAN (R 3.5.2)                
##  pillar        1.3.1.9000 2019-01-22 [1] Github (r-lib/pillar@3a54b8d) 
##  pkgbuild      1.0.2      2018-10-16 [1] CRAN (R 3.5.1)                
##  pkgconfig     2.0.2      2018-08-16 [1] CRAN (R 3.5.1)                
##  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.1)                
##  plyr          1.8.4      2016-06-08 [1] CRAN (R 3.5.0)                
##  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.0)                
##  processx      3.2.1      2018-12-05 [1] CRAN (R 3.5.1)                
##  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.2)                
##  purrr       * 0.3.0      2019-01-27 [1] CRAN (R 3.5.2)                
##  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.2)                
##  Rcpp          1.0.0      2018-11-07 [1] CRAN (R 3.5.1)                
##  readr       * 1.3.1      2018-12-21 [1] CRAN (R 3.5.2)                
##  readxl        1.3.0      2019-02-15 [1] CRAN (R 3.5.2)                
##  remotes       2.0.2      2018-10-30 [1] CRAN (R 3.5.1)                
##  rhdx        * 0.0.1.9000 2019-02-26 [1] local                         
##  rlang         0.3.1      2019-01-08 [1] CRAN (R 3.5.2)                
##  rmarkdown     1.11       2018-12-08 [1] CRAN (R 3.5.2)                
##  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.0)                
##  rstudioapi    0.9.0      2019-01-09 [1] CRAN (R 3.5.2)                
##  rvest       * 0.3.2      2016-06-17 [1] CRAN (R 3.5.0)                
##  scales        1.0.0      2018-08-09 [1] CRAN (R 3.5.1)                
##  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                
##  sf          * 0.7-3      2019-02-21 [1] CRAN (R 3.5.2)                
##  stringdist  * 0.9.5.1    2018-06-08 [1] CRAN (R 3.5.0)                
##  stringi       1.3.1      2019-02-13 [1] CRAN (R 3.5.2)                
##  stringr     * 1.4.0      2019-02-10 [1] CRAN (R 3.5.2)                
##  testthat      2.0.1      2018-10-13 [1] CRAN (R 3.5.1)                
##  tibble      * 2.0.1      2019-01-12 [1] CRAN (R 3.5.2)                
##  tidyr       * 0.8.2      2018-10-28 [1] CRAN (R 3.5.1)                
##  tidyselect    0.2.5      2018-10-11 [1] CRAN (R 3.5.1)                
##  tidyverse   * 1.2.1      2017-11-14 [1] CRAN (R 3.5.0)                
##  triebeard     0.3.0      2016-08-04 [1] CRAN (R 3.5.0)                
##  units         0.6-2      2018-12-05 [1] CRAN (R 3.5.1)                
##  urltools      1.7.2      2019-02-04 [1] CRAN (R 3.5.2)                
##  usethis       1.4.0      2018-08-14 [1] CRAN (R 3.5.1)                
##  utf8          1.1.4      2018-05-24 [1] CRAN (R 3.5.0)                
##  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.0)                
##  xfun          0.5        2019-02-20 [1] CRAN (R 3.5.2)                
##  xml2        * 1.2.0      2018-01-24 [1] CRAN (R 3.5.0)                
##  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)                
## 
## [1] /usr/lib/R/library