The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.
You can install xml2 from CRAN,
install.packages("xml2")
or you can install the development version from GitHub, using devtools:
# install.packages("devtools")
devtools::install_github("hadley/xml2")
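library("xml2")

# Parse an XML string and inspect the resulting document
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x
xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")

# read_html() also accepts incomplete HTML, such as unclosed tags
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
xml_text(h)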
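The HTML snippet above leaves its tags unclosed, yet it still parses. The following is a minimal sketch of how one might inspect the repaired tree; the variable name bold is only illustrative, and the exact tree shape depends on the HTML parser in the installed libxml2.

```r
library(xml2)

# Same incomplete HTML as above: <p> and <b> are never closed
h <- read_html("<html><p>Hi <b>!")

# The parser supplies the missing structure, so the <b> node can be located
bold <- xml_find_first(h, ".//b")
xml_text(bold)

# Walk one step up to see which element the parser nested <b> inside
xml_name(xml_parent(bold))

# Text of the whole document, concatenated across all nodes
xml_text(h)
```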
There are three key classes:
- xml_node: a single node in a document.
- xml_document: the complete document. Acting on a document is usually the same as acting on the root node of the document.
- xml_nodeset: a set of nodes within the document. Operations on xml_nodesets are vectorised, applying the operation over each node in the set.
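As a rough illustration of these classes, the sketch below parses a small made-up document (the shelf and book elements are invented for this example) and checks what each function returns. Note that class() reports the document class as xml_document in current releases.

```r
library(xml2)

doc <- read_xml("<shelf><book id='a'>R</book><book id='b'>XML</book></shelf>")

# A parsed document also behaves as its root node
class(doc)
xml_name(doc)

# xml_find_all() returns an xml_nodeset; xml_find_first() returns a single xml_node
books <- xml_find_all(doc, ".//book")
first <- xml_find_first(doc, ".//book")
class(books)
class(first)

# Operations on a nodeset are vectorised: one result per node
xml_text(books)        # "R" "XML"
xml_attr(books, "id")  # "a" "b"
```

Because the results are ordinary character vectors, they can be passed straight to other R functions without an explicit loop.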
xml2 has similar goals to the XML package. The main differences are:
- xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
- xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.
- More convenient handling of namespaces in XPath expressions: see xml_ns() and xml_ns_strip() to get started.
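To make the namespace point concrete, here is a small sketch using a made-up document with a default namespace (the URI http://example.com/ns and the element names are placeholders). It assumes an xml2 release recent enough to include xml_ns_strip().

```r
library(xml2)

doc <- read_xml(
  '<root xmlns="http://example.com/ns"><item>one</item><item>two</item></root>'
)

# xml_ns() lists the namespaces in use; the default namespace gets the prefix d1
ns <- xml_ns(doc)
ns

# An unprefixed XPath name only matches elements in no namespace, so this finds nothing
unqualified <- xml_find_all(doc, ".//item")
length(unqualified)

# Using the d1 prefix reported by xml_ns() finds the items
qualified <- xml_find_all(doc, ".//d1:item", ns)
xml_text(qualified)

# xml_ns_strip() removes the default namespace from the document in place,
# after which the unprefixed expression works
xml_ns_strip(doc)
stripped <- xml_find_all(doc, ".//item")
xml_text(stripped)
```

Stripping is convenient for quick exploration, while keeping the prefixes is the safer choice when a document mixes several namespaces.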