Improving a function in R that fetches stock news data from Google

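I've written a function that fetches and parses Google's news data for a given stock symbol, but I'm sure there are ways to improve it. For starters, my function returns an object in the GMT time zone rather than the user's current time zone, and it fails if you pass it a number greater than 299 (probably because Google only returns 300 stories per stock). This is somewhat in response to my own question on Stack Overflow, and it relies heavily on this blog post. tl;dr: how can I improve this function?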
 getNews <- function(symbol, number){

    # Warn about length
    if (number>300) {
        warning("May only get 300 stories from google")
    }

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);
    require(xts); require(RDSTK)

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    pubDate = strptime(mydf$pubDate,
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    pubDate = with_tz(pubDate, tz = 'America/New_York')
    mydf$pubDate = NULL

    #Parse the description field
    mydf$description <- as.character(mydf$description)
    parseDescription <- function(x) {
        out <- html2text(x)$text
        out <- strsplit(out,'\n|--')[[1]]

        #Find Lead
        TextLength <- sapply(out,nchar)
        Lead <- out[TextLength==max(TextLength)]

        #Find Site
        Site <- out[3]

        #Return cleaned fields
        out <- c(Site,Lead)
        names(out) <- c('Site','Lead')
        out
    }
    description <- lapply(mydf$description,parseDescription)
    description <- do.call(rbind,description)
    mydf <- cbind(mydf,description)

    #Format as XTS object
    mydf = xts(mydf,order.by=pubDate)

    # drop Extra attributes that we don't use yet
    mydf$guid.text = mydf$guid..attrs = mydf$description = mydf$link = NULL
    return(mydf) 

}
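
On the time zone point, here is a minimal sketch in base R. It assumes the feed's pubDate strings look like "Mon, 02 Jan 2012 15:30:00 GMT" and that R is running in an English or C locale, so that %a and %b match the day and month abbreviations. toLocalTime is a hypothetical helper, not part of any package, and the sample timestamp is made up.

```r
# Hypothetical helper (not from the original function): parse GMT
# timestamps and re-express them in a target time zone.
toLocalTime <- function(x, tz = Sys.timezone()) {
    # Sys.timezone() can return NA on some systems; "" means "current zone"
    if (is.na(tz)) tz <- ""
    # Assumes an English/C LC_TIME locale for the %a and %b fields
    parsed <- as.POSIXct(strptime(x, format = "%a, %d %b %Y %H:%M:%S",
                                  tz = "GMT"))
    # Changing the tzone attribute changes only how the instant is printed,
    # not the instant itself
    attr(parsed, "tzone") <- tz
    parsed
}

# Made-up sample timestamp, for illustration only
stamps <- c("Mon, 02 Jan 2012 15:30:00 GMT")
nyTime <- toLocalTime(stamps, tz = "America/New_York")
format(nyTime, "%Y-%m-%d %H:%M")
```

Leaving tz at its default displays the times in the session's own zone instead of a hard-coded one. Time zone names are case-sensitive on many systems, so the exact spelling "America/New_York" matters.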
    
Here is a shorter (and probably more efficient) version of your getNews function:
  getNews2 <- function(symbol, number){

    # load libraries
    require(XML); require(plyr); require(stringr); require(lubridate);  

    # construct url to news feed rss and encode it correctly
    url.b1 = 'http://www.google.com/finance/company_news?q='
    url    = paste(url.b1, symbol, '&output=rss', "&start=", 1,
               "&num=", number, sep = '')
    url    = URLencode(url)

    # parse xml tree, get item nodes, extract data and return data frame
    doc   = xmlTreeParse(url, useInternalNodes = TRUE)
    nodes = getNodeSet(doc, "//item")
    mydf  = ldply(nodes, as.data.frame(xmlToList))

    # clean up names of data frame
    names(mydf) = str_replace_all(names(mydf), "value\\.", "")

    # convert pubDate to date-time object and convert time zone
    mydf$pubDate = strptime(mydf$pubDate,
                     format = '%a, %d %b %Y %H:%M:%S', tz = 'GMT')
    mydf$pubDate = with_tz(mydf$pubDate, tz = 'America/New_York')

    # drop guid.text and guid..attrs
    mydf$guid.text = mydf$guid..attrs = NULL

    return(mydf)    
}
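Also, there may be a bug in your code: I tried it with symbol = 'WMT' and it returned an error. I think getNews2 works for WMT as well. Check it out and let me know whether it works for you. PS: the description column still contains HTML code, but extracting the text from it should be easy. I'll post an update when I have time.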
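
As a rough illustration of pulling plain text out of the description column, here is a sketch that uses only base R regular expressions. stripHtml is a hypothetical helper and the sample string is invented; the real feed's markup may differ, and a proper HTML parser would be more robust than regex tag stripping.

```r
# Hypothetical helper (not from the original post): crude removal of
# HTML tags and two common entities from a character vector.
stripHtml <- function(x) {
    out <- gsub("<[^>]+>", " ", x)                 # drop anything shaped like a tag
    out <- gsub("&nbsp;", " ", out, fixed = TRUE)  # non-breaking space entity
    out <- gsub("&amp;", "&", out, fixed = TRUE)   # decode ampersands last
    out <- gsub("[[:space:]]+", " ", out)          # collapse runs of whitespace
    trimws(out)
}

# Invented sample, for illustration only
example <- "<div><a href=\"http://example.com\">Some headline</a><br>Example&nbsp;Site &amp; Co.</div>"
stripHtml(example)
```

Applied to the data frame, this would be something like mydf$description <- stripHtml(as.character(mydf$description)), assuming description is a character or factor column.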
