Scraping Google News with rvest

This is an example of how to scrape the Google News website with the awesome rvest package.

This work is the result of a question in our great discussion group R Dojo.

AntissistemA, a very active user in our group, came up with this problem and I decided to help him. It was a cool challenge, so why not?

I should point out that a great deal of the basic ideas come from his own code. I just kept going and added a few things to get the code to work.

First, you should take a look at the Google News (or Google Notícias) website (its URL appears in the read_html() call further down), a snapshot of which I reproduce below:

You may notice on the right side of the snapshot that we used the Google Chrome DevTools to identify the HTML nodes. You can open this tool by pressing the F12 key. The node names are then passed, as CSS selectors, to the rvest functions.
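To make the idea concrete, here is a minimal sketch on a small, made-up HTML fragment. The tag layout and the `outlet` class are assumptions for illustration only; the real Google News markup is much more deeply nested. It shows how a selector string passed to html_nodes() picks out nodes and how html_text() returns their text.

```r
library(rvest) # also provides the %>% pipe

# made-up fragment standing in for a downloaded page
page <- read_html('
  <div>
    <article><span class="outlet">Outlet A</span> <time>hoje</time></article>
    <article><span class="outlet">Outlet B</span> <time>ontem</time></article>
  </div>')

# a selector is a path of nested tags, read from the outermost to the innermost
tempo_demo   <- page %>% html_nodes("div article time") %>% html_text()
veiculo_demo <- page %>% html_nodes("article span.outlet") %>% html_text()

tempo_demo   # "hoje" "ontem"
veiculo_demo # "Outlet A" "Outlet B"
```

The longer the selector path, the more it depends on the exact page layout, so a selector copied from the dev tools may stop matching whenever the site is redesigned.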

Basically, the idea is to extract the news outlet (veiculo), the time elapsed since the news was published (tempo), and the main headline (nome).

The code is presented below:

# loading the packages:
library(dplyr) # we want the pipes
library(rvest) # web scraping
library(stringr) # to deal with strings and cleaning our data

# extracting the whole website
google <- read_html("https://news.google.com/?hl=pt-BR&gl=BR&ceid=BR:pt-419")

# extracting the news outlets
# we pass the nodes to html_nodes() and extract the text from the last one
# we use stringr to drop empty strings and the icon labels that do not matter
# (negate = TRUE requires stringr >= 1.4.0)
veiculo_all <- google %>%
  html_nodes("div div div main c-wiz div div div article div div div") %>%
  html_text() %>%
  str_subset("^(more_vert|share|bookmark_border)?$", negate = TRUE)

veiculo_all[1:10] # take a look at the first ten

##  [1] "InfoMoney"       "G1"              "O Cafezinho"    
##  [4] "InfoMoney"       "Gazeta do Povo"  "Estadão"        
##  [7] "G1"              "Estado de Minas" "O Tempo"        
## [10] "UOL"
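
The filtering step can be tried on a small stand-in vector. The values below are made up ("metro" is a hypothetical all-lowercase outlet name). The sketch contrasts a negated character class such as "[^more_vert]", which keeps any string containing at least one character outside that set, with an exact match against the three icon labels.

```r
library(stringr)

# stand-in values; "metro" is a hypothetical all-lowercase outlet name
x <- c("G1", "share", "more_vert", "bookmark_border", "UOL", "metro")

# "[^more_vert]" is a negated character class: it keeps any string that has
# at least one character outside the set m, o, r, e, _, v, t
by_class <- str_subset(x, "[^more_vert]")

# exact-match alternative: drop only the three icon labels
icons <- c("more_vert", "share", "bookmark_border")
by_label <- x[!x %in% icons]

by_class # "G1" "share" "bookmark_border" "UOL"  ("metro" is lost at this step)
by_label # "G1" "UOL" "metro"
```

The character-class version only removes a label when every one of its characters belongs to the set, so it can also discard a legitimate name built from the same letters.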

# extracting the time elapsed
tempo_all <- google %>% html_nodes("div article div div time") %>% html_text()

tempo_all[1:10] # take a look at the first ten

##  [1] "hoje"          "ontem"         "hoje"          "hoje"         
##  [5] "3 horas atrás" "hoje"          "hoje"          "hoje"         
##  [9] "hoje"          "hoje"

# extracting the headlines
# and using stringr for cleaning
# html_text() returns all the text inside each article; it takes no selector argument
nome_all <- google %>% html_nodes("article") %>% html_text() %>%
  str_split("(?<=[a-z0-9áéó!?])(?=[A-Z])") # also considering Portuguese special characters

nome_all <- sapply(nome_all, function(x) x[1]) # keep only the first piece of each split

nome_all[1:10] # take a look at the first ten

##  [1] "Haddad em construção, Bolsonaro forte e Alckmin ameaçado: 3 análises sobre as pesquisas Ibope e MDAPara entender um pouco melhor o significado dos números apresentados pelas recentes pesquisas e quais são as sinalizações para os próximos passos da ...amp"
##  [2] "Pesquisa Ibope: Lula, 37%; Bolsonaro, 18%; Marina, 6%; Ciro, 5%; Alckmin, 5%Alvaro Dias tem 3%. Eymael, Boulos, Meirelles, Amoêdo têm 1% cada. Demais candidatos não atingem 1%. Levantamento foi feito entre os dias 17 e 19 e ...amp"                        
##  [3] "Análise da pesquisa Ibope presidencial no Rio de Janeiro"                                                                                                                                                                                                      
##  [4] "Haddad está virtualmente no 2º turno; Bolsonaro 'joga pelo empate' contra Alckmin, diz pesquisador"                                                                                                                                                            
##  [5] "Ibope presidencial: números pouco evidentes que podem mudar a eleição"                                                                                                                                                                                         
##  [6] "Marcio Lacerda desiste de disputar governo de Minas após pressão do PSBSigla fechou um acordo com o PT que incluía a retirada de ex-prefeito de BH da disputa estadual; Lacerda afirmou que se desvinculará do PSB.Estadãohojebookmark_bordersharemore_vert"   
##  [7] "Marcio Lacerda retira sua candidatura ao governo de Minas e anuncia desfiliação do PSBEx-prefeito de BH registrou candidatura mesmo após partido ter dissolvido a comissão estadual; MDB, que indicou o vice de Lacerda, pode assumir a ...amp"                
##  [8] "Lacerda desiste de ser candidato ao governo de Minas"                                                                                                                                                                                                          
##  [9] "Marcio Lacerda retira candidatura e anuncia desfiliação do PSBSegundo a carta escrita pelo ex-prefeito de Belo Horizonte, o conchavo entre seu partido e o PT foi o responsável pelo seu impedimento de seguir no pleito.amp"                                  
## [10] "PSB rebate Lacerda e chama ex-prefeito de inseguro após fim de candidatura"

In this last case we used a regular expression (regex) to clean the data by separating the actual headline from the text that follows it. The pattern splits wherever a lowercase letter, a digit, or one of "!" and "?" is immediately followed by an uppercase letter. It fails when a phrase ending in uppercase letters, such as "SP", is fused with another phrase that starts with an uppercase letter, such as "A"; several of the entries above (for example, the ones containing "MDAPara" or "PSBSigla") show this. We have to think of a better way to split these cases, but the current result is satisfactory for now.
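
The behaviour of the split can be checked on two made-up strings shaped like the scraped text (both are invented for illustration). The first has a lowercase-to-uppercase boundary and is split; the second fuses "SP" with "A" and is left whole.

```r
library(stringr)

# made-up strings in the shape of the scraped text
fused <- c(
  "Lacerda desiste de ser candidatoEx-prefeito de BH deixa o partido",
  "Protesto em SPA cidade parou"
)

# split where a lowercase letter, digit, "!" or "?" is immediately
# followed by an uppercase letter (zero-width lookbehind + lookahead)
pieces <- str_split(fused, "(?<=[a-z0-9áéó!?])(?=[A-Z])")

# keep only the first piece of each split
headlines <- sapply(pieces, function(x) x[1])

headlines
# "Lacerda desiste de ser candidato" "Protesto em SPA cidade parou"
```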

And we have our final data frame:

# tibble() replaces the deprecated data_frame(); the three vectors must have the same length
df_news <- tibble(veiculo_all, tempo_all, nome_all)

df_news
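
Because the three vectors are scraped independently, they only line up row by row if every article yields exactly one outlet, one time, and one headline. One possible way to keep the fields aligned, sketched here on made-up markup and not the approach used above, is to select the articles first and then call html_node() (singular) on them, which returns one result per article and NA where nothing matches. The `outlet` class and the `h3` tag are assumptions for illustration.

```r
library(rvest)
library(dplyr)

# made-up markup: the second article has no <time> element
page <- read_html('
  <div>
    <article><h3>Headline one</h3><span class="outlet">Outlet A</span><time>hoje</time></article>
    <article><h3>Headline two</h3><span class="outlet">Outlet B</span></article>
  </div>')

articles <- page %>% html_nodes("article")

# html_node() yields exactly one entry per article, so the columns stay aligned
df_sketch <- tibble(
  veiculo = articles %>% html_node("span.outlet") %>% html_text(),
  tempo   = articles %>% html_node("time") %>% html_text(),
  nome    = articles %>% html_node("h3") %>% html_text()
)

df_sketch
```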

Allan Vieira

August 21, 2018