Writing performance code with RcppCWB

Andreas Blaette (andreas.blaette@uni-due.de)

2024-09-23

Rationale

The RcppCWB package exposes the functionality of the Corpus Workbench (CWB) to R, so that R users can benefit from the performance of the CWB's C code. In most cases, this is both easy to use and fast. But there are scenarios in which the interface between R and C/C++ becomes a bottleneck. In these cases, using CWB functionality in C++ functions exposed to R via Rcpp::cppFunction() or Rcpp::sourceCpp() may resolve performance and memory limitations.

Basics

Writing C++ functions that use CWB functionality requires loading the Rcpp and RcppCWB packages.

library(Rcpp)
library(RcppCWB)

We need to be aware that the default functions for accessing CWB functionality take length-one character vectors, which are used to look up the C representation of a structural or positional attribute of a loaded corpus. It is more efficient to perform this lookup only once. Following this rationale, a set of functions exposes CWB functionality closer to the C logic: they take pointers to attributes that have already been looked up.

This functionality can also be used from R. For instance, we look up the p-attribute “word” of the “REUTERS” corpus as follows.

p_attr_word <- p_attr(
  corpus = "REUTERS",
  p_attribute = "word",
  registry = get_tmp_registry()
)

And we use cpos_to_str() to decode the first words of the corpus.

cpos_to_str(p_attr_word, 0:10)
##  [1] "Diamond"   "Shamrock"  "Corp"      "said"      "that"      "effective"
##  [7] "today"     "it"        "had"       "cut"       "its"

While this may also be useful when writing R code, this lower-level functionality is particularly well-suited for writing high-performance C++ code exposed to R.
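The rationale for handle-based access can be illustrated without the CWB at all. The following standalone C++ sketch is a toy model, not RcppCWB code: `ToyAttribute`, `ToyRegistry`, `find_attribute()`, `token_at()`, `decode_by_name()` and `decode_by_handle()` are hypothetical names invented for this illustration. It assumes that resolving an attribute by name has a cost per call, and it counts how often that step happens in each variant.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a positional attribute: one token per corpus position.
struct ToyAttribute {
  std::vector<std::string> tokens;
};

// Hypothetical stand-in for the loaded corpora, keyed by (corpus, attribute).
struct ToyRegistry {
  std::map<std::pair<std::string, std::string>, ToyAttribute> attributes;
  mutable std::size_t lookups = 0;  // number of name-based lookups performed
};

// Name-based lookup: the step that is worth doing only once.
const ToyAttribute* find_attribute(const ToyRegistry& reg,
                                   const std::string& corpus,
                                   const std::string& p_attribute) {
  ++reg.lookups;
  auto it = reg.attributes.find(std::make_pair(corpus, p_attribute));
  if (it == reg.attributes.end()) {
    throw std::runtime_error("attribute not found");
  }
  return &it->second;
}

// Decode a single position through a handle; no name lookup is involved.
const std::string& token_at(const ToyAttribute& attr, int cpos) {
  assert(cpos >= 0);
  return attr.tokens.at(static_cast<std::size_t>(cpos));
}

// Variant A: resolve the attribute by name for every single position.
std::vector<std::string> decode_by_name(const ToyRegistry& reg,
                                        const std::string& corpus,
                                        const std::string& p_attribute,
                                        const std::vector<int>& cpos) {
  std::vector<std::string> out;
  out.reserve(cpos.size());
  for (int pos : cpos) {
    const ToyAttribute* attr = find_attribute(reg, corpus, p_attribute);
    out.push_back(token_at(*attr, pos));
  }
  return out;
}

// Variant B: the caller resolves the attribute once and passes the handle.
std::vector<std::string> decode_by_handle(const ToyAttribute& attr,
                                          const std::vector<int>& cpos) {
  std::vector<std::string> out;
  out.reserve(cpos.size());
  for (int pos : cpos) {
    out.push_back(token_at(attr, pos));
  }
  return out;
}
```

In this toy model, both variants return the same tokens, but variant A performs one lookup per position, whereas variant B needs a single call to `find_attribute()` regardless of how many positions are decoded.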

Inline C++ functions

We start with a simple first scenario, which uses cppFunction() to compile an inline C++ function and make it available in an R session.

cppFunction(
  'Rcpp::StringVector get_str(SEXP corpus, SEXP p_attribute, SEXP registry, Rcpp::IntegerVector cpos){
     SEXP attr;
     Rcpp::StringVector result;
     attr = RcppCWB::p_attr(corpus, p_attribute, registry);
     result = RcppCWB::cpos_to_str(attr, cpos);
     return(result);
  }',
  depends = "RcppCWB"
)

This is not a very interesting example, but using the function works:

get_str("REUTERS", "word", RcppCWB::get_tmp_registry(), 0:50)
##  [1] "Diamond"      "Shamrock"     "Corp"         "said"         "that"        
##  [6] "effective"    "today"        "it"           "had"          "cut"         
## [11] "its"          "contract"     "prices"       "for"          "crude"       
## [16] "oil"          "by"           "1.50"         "dlrs"         "a"           
## [21] "barrel"       "The"          "reduction"    "brings"       "its"         
## [26] "posted"       "price"        "for"          "West"         "Texas"       
## [31] "Intermediate" "to"           "16.00"        "dlrs"         "a"           
## [36] "barrel"       "the"          "copany"       "said"         "The"         
## [41] "price"        "reduction"    "today"        "was"          "made"        
## [46] "in"           "the"          "light"        "of"           "falling"     
## [51] "oil"

Source C++ function

To provide a more interesting real-life example, we demonstrate a solution to the following scenario: It may be necessary to decode an entire corpus, and to write the tokens of corpus regions to a file in a line-by-line manner. Computing word embeddings may require this input format, for instance.

But if the corpus is really large, decoding the corpus entirely and then writing everything to disk may hit memory limitations. Decoding the tokens of the corpus successively and writing content to the output file on the spot is an obvious solution, but moving data across the R/C++/C interface for every single token is excessively slow. A pure C++ implementation will be much more efficient.
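Before turning to the CWB, the streaming idea can be sketched in plain C++. The example below is a toy under stated assumptions: `ToyRegion` and `write_regions()` are hypothetical names, and the in-memory `tokens` vector merely stands in for a corpus that would be decoded region by region in a real setting. What it shows is the output pattern: each region becomes one line of space-separated tokens, written as soon as it is available rather than collected first.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical region: first and last corpus position, both inclusive.
typedef std::pair<int, int> ToyRegion;

// Write every region as one line of space-separated tokens.
// Output is produced region by region; no full copy of the text is built up.
// Returns the number of lines written.
std::size_t write_regions(const std::vector<std::string>& tokens,
                          const std::vector<ToyRegion>& regions,
                          const std::string& filename) {
  std::ofstream out(filename.c_str());
  if (!out) {
    throw std::runtime_error("file could not be opened: " + filename);
  }

  std::size_t lines = 0;
  for (const ToyRegion& region : regions) {
    assert(region.first >= 0 && region.first <= region.second);
    for (int cpos = region.first; cpos <= region.second; ++cpos) {
      if (cpos > region.first) {
        out << ' ';  // separator between tokens, none after the last one
      }
      out << tokens.at(static_cast<std::size_t>(cpos));
    }
    out << '\n';  // one region per line
    ++lines;
  }
  return lines;  // the stream is flushed and closed when `out` goes out of scope
}
```

For the tokens `a b c d e` and the two regions (0, 1) and (2, 4), this sketch writes the lines `a b` and `c d e`.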

The following C++ file that relies on CWB functions as exposed by RcppCWB addresses the scenario.

// [[Rcpp::depends(RcppCWB)]]
#include <Rcpp.h>
#include <RcppCWB.h>

#include <stdio.h>
#include <iostream>
#include <fstream>
#include <cstdlib>


// [[Rcpp::export]]
int write_token_stream(SEXP corpus, SEXP p_attribute, SEXP s_attribute, SEXP registry, SEXP attribute_type, Rcpp::StringVector filename) {
  
  int i, n, region_size;
  Rcpp::IntegerVector region(2);
  std::ofstream outdata;

  n = RcppCWB::attribute_size(corpus, s_attribute, attribute_type, registry);

  outdata.open(filename[0]);
  if( !outdata ) {
    // Raise an R error instead of calling exit(), which would terminate the R session
    Rcpp::stop("file could not be opened");
  }
  
  for (i = 0; i < n; i++){
    region = RcppCWB::struc2cpos(corpus, s_attribute, registry, i);
    region_size = region[1] - region[0] + 1;

    Rcpp::IntegerVector cpos(region_size);
    cpos = Rcpp::seq(region[0], region[1]);

    Rcpp::StringVector values(region_size);
    values = RcppCWB::cpos2str(corpus, p_attribute, registry, cpos);
    
    int j;
    for (j = 0; j < values.length(); j++){ 
      outdata << values(j);
      if (j < values.length() - 1){
        outdata << " ";
      }
    }
    outdata << "\n"; // newline without the per-line flush that std::endl forces
  }
  outdata.close();
  
  return 0;
}

This code can be sourced, compiled and exposed to R using sourceCpp().

sourceCpp(file = system.file(package = "RcppCWB", "cpp", "fastdecode.cpp"))

We use the (smallish) REUTERS corpus to check that everything works as intended. First, we create the output …

outfile <- tempfile(fileext = ".txt")

write_token_stream(
  corpus = "REUTERS",
  p_attribute = "word", 
  s_attribute = "id",
  attribute_type = "s",
  registry = RcppCWB::get_tmp_registry(),
  filename = outfile
)
## [1] 0

… and read it (showing the content selectively) to convey that the corpus data has been exported as intended.

readLines(outfile) |>
  lapply(substr, 1, 75) |>
  unlist()
##  [1] "Diamond Shamrock Corp said that effective today it had cut its contract pri"
##  [2] "OPEC may be forced to meet before a scheduled June session to readdress its"
##  [3] "Texaco Canada said it lowered the contract price it will pay for crude oil "
##  [4] "Marathon Petroleum Co said it reduced the contract price it will pay for al"
##  [5] "Houston Oil Trust said that independent petroleum engineers completed an an"
##  [6] "Kuwait s Oil Minister in remarks published today said there were no plans f"
##  [7] "Indonesia appears to be nearing a political crossroads over measures to der"
##  [8] "Saudi riyal interbank deposits were steady at yesterday's higher levels in "
##  [9] "The Gulf oil state of Qatar recovering slightly from last year's decline in"
## [10] "Saudi Arabian Oil Minister Hisham Nazer reiterated the kingdom's commitment"
## [11] "Saudi crude oil output last month fell to an average of 3.5 mln barrels per"
## [12] "Deputy oil ministers from six Gulf Arab states will meet in Bahrain today t"
## [13] "Saudi Arabian Oil Minister Hisham Nazer reiterated the kingdom's commitment"
## [14] "Kuwait's oil minister said in a newspaper interview that there were no plan"
## [15] "The port of Philadelphia was closed when a Cypriot oil tanker Seapride II r"
## [16] "A study group said the United States should increase its strategic petroleu"
## [17] "A study group said the United States should increase its strategic petroleu"
## [18] "Unocal Corp's Union Oil Co said it lowered its posted prices for crude oil "
## [19] "The New York Mercantile Exchange set April one for the debut of a new proce"
## [20] "Argentine crude oil production was down 10.8 pct in January 1987 to 12.32 m"

Moving ahead

Writing C++ functions is obviously more demanding than writing R code. But using CWB functionality as exposed by RcppCWB in C++ functions that can be called from R may be a good solution to performance and memory issues. Rcpp brings writing C++ code much closer to what R users are acquainted with, making it much easier to write high-performance C++ code. So we encourage considering this option when pure R solutions are not fast enough.