Take Home Exercise 3

Author

Yen Yun Hsuan

Published

June 10, 2023

Background

This is Mini-Challenge 3 of VAST Challenge 2023.

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. They have transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.

Task

Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.

Data preparation

Installing and launching R packages

Show the code
pacman::p_load(jsonlite, tidygraph, ggraph, 
               visNetwork, graphlayouts, ggforce, 
               skimr, tidytext, tidyverse,tm,topicmodels,reshape2,ggplot2,wordcloud,pals,Snowballac,Lda,Ldatuning,kableExtra,DT,flextable,remotes,ldatuning,igraph,qgraph,
               topicmodels.utils,LDAvis,tsne, semnet, servr,devtools)

Import data

Show the code
mc3_data <- fromJSON("data/MC3.json")

Extracting edges

Convert data type from list to character by mutate and as.character

Show the code
mc3_edges <- as_tibble(mc3_data$links) %>% 
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
    summarise(weights = n(),.groups = "drop") %>%
  filter(source!=target)

Extracting nodes

  • Same country may have different id, so cannot use distinct function, or some ids may be excluded

  • revenue_omu should be numerical value, so first transform the list into character, then need to transform into numeric

  • Reorder the dataframe column sequence by select, with id coming first

  • Replace NA value in revenue_omu with 0

  • Replace “character[0] in product_services with blank

Show the code
mc3_nodes <- as_tibble(mc3_data$nodes) %>% 
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>% 
  select(id, country, type, revenue_omu, product_services) %>%
mutate(revenue_omu = replace(revenue_omu, is.na(revenue_omu), 0), product_services = replace(product_services, product_services == "character(0)", ""))

Data exploration

Edge dataframe

Display the summary statistics

Show the code
skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

Display interactive table

Show the code
DT::datatable(mc3_edges)

Nodes dataframe

No missing values in mc3_nodes

Show the code
skim(mc3_nodes)
Data summary
Name mc3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 0 1737 18959 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 0 1 402863.8 8583224 0 0 0 0 310612303 ▇▁▁▁▁
Show the code
DT::datatable(mc3_nodes)

Text processing

Perform basic text sensing using tidytext package for product_services in nodes dataframe.

Tokenisation

Split text in product_services field into words with unnest_token() of tidytext.

  • word: output column name that will be created as the text is unnested into it

  • product_services: input column that the text comes from

  • By default, punctuation has been stripped. (Use the to_lower = FALSE argument to turn off this behavior).

  • By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets.

Show the code
token_nodes <- mc3_nodes %>% 
  unnest_tokens(word,product_services)
Show the code
token_nodes %>% 
  count(word,sort =TRUE) %>% 
  top_n(15) %>% 
# reordered according to the values in the n variable
  mutate(word = reorder(word,n)) %>% 
  ggplot(aes(x=word,y=n))+
  geom_col()+xlab(NULL)+
  coord_flip()+
  labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field")

Many of the frequent words are meaningless, such as ‘and’, so need to remove these words.

Removing stop words

Use stop_words in the tidytext package to clean up stop words.

  • Load the stop_words data included with tidytext.

  • Then anti_join() of dplyr package is used to remove all stop words from the analysis, only the rows from token_nodesthat do not have a match in stop_words are retained in the result.

Show the code
stopword_removed <- token_nodes %>% 
  anti_join(stop_words)
Show the code
stopword_removed %>% 
  count(word, sort = TRUE) %>% 
  top_n(30) %>% 
  mutate(word = reorder(word,n)) %>% 
  ggplot(aes(x=word,y=n))+
  geom_col()+xlab(NULL)+
  coord_flip()+
  labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field")

There are still some meaningless words, including “unknown, related, including, range”, so extend the stopword dataframe and remove these words.

Show the code
extended_stopwords <- stop_words %>% 
  bind_rows(data.frame(word = c('unknown','related','including','range','products')))

stopword_removed_2 <- token_nodes %>% 
  anti_join(extended_stopwords)

#remove s at the end of each word
stopword_removed_2$word <- gsub("(.*)s$", '\\1', stopword_removed_2$word)
Show the code
stopword_removed_2 %>% 
  count(word, sort = TRUE) %>% 
  top_n(30) %>% 
  mutate(word = reorder(word,n)) %>% 
  ggplot(aes(x=word,y=n))+
  geom_col()+xlab(NULL)+
  coord_flip()+
  labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field")

Topic modelling

Create a DTM(document term matrix) to capture the frequency of words across dataframe, then filtering terms based on frequency.

Show the code
# compute document term matrix with terms >= minimumFrequency
minimumFrequency <- 5
DTM <- DocumentTermMatrix(stopword_removed_2$word,control = list(bounds =list(global = c(minimumFrequency, Inf))))
dim(DTM)
[1] 43445  1632
Show the code
# exclude empty rows in DTM
row_sums <- slam::row_sums(DTM)>0
DTM <- DTM[row_sums,]

To determine the optimal number of topics, using the “Deveaud2014” metric for topic coherence, select 3 topics that yields a higher value, indicating a stronger degree of similarity between words within each topic.

Show the code
result <- ldatuning::FindTopicsNumber(
  DTM,
  topics = seq(from = 2, to = 15, by =1),
  metrics = "Deveaud2014",
  method = 'Gibbs',
  control = list(seed = 1234),
  verbose = TRUE
)
fit models... done.
calculate metrics:
  Deveaud2014... done.
Show the code
FindTopicsNumber_plot(result)

Build topic model with number of four, with ‘Gibbs’ sampling method, and with verbose = 100 to display the details of iteration progress.

Show the code
k <- 3
set.seed(1234)
topicmodel <- LDA(DTM,k,method = 'Gibbs',control = list(iter = 500, verbose = 100))
K = 3; V = 1632; M = 35285
Sampling 500 iterations!
Iteration 100 ...
Iteration 200 ...
Iteration 300 ...
Iteration 400 ...
Iteration 500 ...
Gibbs sampling completed!

From the word cloud and feature bar plot, we can see the important features within similar company group.

Show the code
# see the probability of each word belonging to each topic
topicmodel_tidy <- tidy(topicmodel,matrix ='beta')

top_terms <- topicmodel_tidy %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  arrange(topic, desc(beta))

# Create a separate word cloud for each topic

for (i in 1:3) {
  topic_words <- subset(top_terms, topic == i)
  wordcloud(
    words = topic_words$term,
    freq = topic_words$beta,
    colors = brewer.pal(5, 'Dark2')
  )
}

Based on the top feature, categorizing the groups of companies into the company type shown as below.

Show the code
# see the probability of each word belonging to each topic
topicmodel_tidy <- tidy(topicmodel,matrix ='beta')

# find top 5 words with maximum probability in each topic
top_terms <- topicmodel_tidy %>% 
  group_by(topic) %>% 
  slice_max(beta, n = 5) %>% 
  arrange(topic, -beta)

# visulize by ggplot
top_terms %>% 
  mutate(term = reorder_within(term,beta,topic)) %>% 
  ggplot(aes(beta, term, fill = factor(topic)))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~topic, scales = "free_y",labeller = labeller(topic = c("1" = "Service", "2" = "Seafood can", "3" = "Frozen seafood"))
             )+
  scale_y_reordered()+
  theme(axis.text.x = element_blank())

Network visual

Build tidygraph

Derive the highest beta for each term in topic modelling dataframe

Show the code
topicmodel <- topicmodel_tidy %>% 
  group_by(term) %>% 
   slice(which.max(beta)) %>%
  ungroup()

Join topic model result with node

Show the code
mc3_nodes_update <- left_join(stopword_removed_2, topicmodel, by = c("word"= "term"))

mc3_nodes_update <- na.omit(mc3_nodes_update)

Selecting the word with the highest beta value for each company

Show the code
mc3_nodes_topic <- mc3_nodes_update %>%
  group_by(id, country, type, revenue_omu) %>%
  slice(which.max(beta)) %>%
  ungroup() %>% 
  distinct()

Topic 1

Show the code
mc3_nodes_topic1 <- mc3_nodes_topic %>% 
  filter(topic ==1)

MC3_edge <- mc3_edges %>% 
  filter(source %in% mc3_nodes_topic1$id | target %in% mc3_nodes_topic1$id) %>% 
  rename(from = source) %>%
  rename(to = target) 

id1 <- MC3_edge %>%
  select(from) %>%
  rename(id = from)
id2 <- MC3_edge %>%
  select(to) %>%
  rename(id = to)

mc3_node <- rbind(id1, id2) %>%
  distinct() %>% 
  left_join(mc3_nodes_topic1, by = 'id')

MC3_graph <- tbl_graph(nodes = mc3_node,
                       edges = MC3_edge,
                       directed =FALSE)%>%
  mutate(betweenness_centrality = centrality_betweenness())

If filter by high betweenness centrality, can see that company features are equipment and service only.

Show the code
MC3_graph %>%
  filter(betweenness_centrality >= 2000) %>%
ggraph(layout = "fr")+
  geom_edge_link(aes(width = weights), alpha = 0.50, edge_color = "grey20") +
  geom_node_point(color = "gray", size = 20, show.legend = FALSE) +
  geom_node_text(aes(label = word),  colour = 'black', size=5,show.legend = FALSE) 

Topic 2

Show the code
mc3_nodes_topic2 <- mc3_nodes_topic %>% 
  filter(topic ==2)

MC3_edge <- mc3_edges %>% 
  filter(source %in% mc3_nodes_topic$id | target %in% mc3_nodes_topic1$id) %>% 
  rename(from = source) %>%
  rename(to = target) 

id1 <- MC3_edge %>%
  select(from) %>%
  rename(id = from)
id2 <- MC3_edge %>%
  select(to) %>%
  rename(id = to)

mc3_node <- rbind(id1, id2) %>%
  distinct() %>% 
  left_join(mc3_nodes_topic2, by = 'id')

MC3_graph <- tbl_graph(nodes = mc3_node,
                       edges = MC3_edge,
                       directed =FALSE)%>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

We can see that some company features with high betweenness centrality look abnormal, including unrelated features to fishing such as shoe and glove.

Show the code
MC3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
ggraph(layout = "fr")+
  geom_edge_link(aes(width = weights), alpha = 0.50, edge_color = "grey20") +
  geom_node_point(color = "gray", size = 10, show.legend = FALSE) +
  geom_node_text(aes(label = word),  colour = 'black', size=4,show.legend = FALSE) 

Topic 3

Show the code
mc3_nodes_topic3 <- mc3_nodes_topic %>% 
  filter(topic ==3)

MC3_edge <- mc3_edges %>% 
  filter(source %in% mc3_nodes_topic3$id | target %in% mc3_nodes_topic1$id) %>% 
  rename(from = source) %>%
  rename(to = target) 

id1 <- MC3_edge %>%
  select(from) %>%
  rename(id = from)
id2 <- MC3_edge %>%
  select(to) %>%
  rename(id = to)

mc3_node <- rbind(id1, id2) %>%
  distinct() %>% 
  left_join(mc3_nodes_topic3, by = 'id')

MC3_graph <- tbl_graph(nodes = mc3_node,
                       edges = MC3_edge,
                       directed =FALSE)%>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

For nodes with high betweenness centrality in topic 3, it is expected to see a majority of them being related to frozen/seafood and fish. However, there are some companies with unrelated features such as pharmaceutical and metal that have unexpectedly high betweenness centrality.

Show the code
MC3_graph %>%
  filter(betweenness_centrality >= 5000) %>%
ggraph(layout = "fr")+
  geom_edge_link(aes(width = weights), alpha = 0.50, edge_color = "grey20") +
  geom_node_point(color = "gray", size = 10, show.legend = FALSE) +
  geom_node_text(aes(label = word),  colour = 'black', size=4,show.legend = FALSE)