Brandon LeBeau

February 12, 2026

library(pdfsearch)
file <- system.file("pdf", "LeBeauetal2020-gcq.pdf", package = "pdfsearch")
For the past few years, improving how my pdfsearch package splits multi-column PDF documents has been on my list. I've now made some preliminary improvements that, in my testing, make the splitting of multi-column PDFs more robust and accurate. Additional enhancements improve the scrubbing of features that are often not of interest when searching PDF documents, such as page numbers, headers, and footers. The package is now at version 0.5.0 and available on GitHub; I plan to do a bit more testing and then push to CRAN soon.
The new functionality is experimental and is not the default. The split_method = "coordinates" option uses token coordinates from pdftools::pdf_data() and, in my initial testing, is more robust than whitespace-only splitting. A new argument, column_count, lets users specify the number of columns in the PDF document, which controls how the splitting is handled. The default, "auto", attempts to infer the number of columns, but users can also specify "1" for single-column reading order or "2" for left-column then right-column order.
Use column_count to control how column order is handled:
"auto": infer number of columns."1": force single-column reading order."2": force left-column then right-column order.This is an example of this working on a PDF that comes within the package. The file is a published research report that is multi-column, but has tables that are single column or span both columns of text. The following code chunk set this file and then shows an example of performing keyword search within this document.
# A tibble: 6 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 test theory 1 2 <chr [1]> <list [1]>
2 test theory 3 53 <chr [1]> <list [1]>
3 test theory 9 247 <chr [1]> <list [1]>
4 test theory 14 359 <chr [1]> <list [1]>
5 test theory 17 508 <chr [1]> <list [1]>
6 above-level 1 12 <chr [1]> <list [1]>
Several options are available to reduce non-body text before keyword searching; these include page headers, footers, section headings, and captions. Removing them can be particularly helpful for multi-column documents, where such elements are more prevalent and can affect how well the multi-column PDF is reassembled into a single document. The goal of removing these elements is to better align column text and keep sentence structure and keyword proximity intact.
res_clean <- keyword_search(
file,
keyword = "variance",
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
column_count = "2",
remove_section_headers = TRUE,
remove_page_headers = TRUE,
remove_page_footers = TRUE,
remove_repeated_furniture = TRUE,
repeated_edge_n = 2,
repeated_edge_min_pages = 4,
remove_captions = TRUE,
caption_continuation_max = 2
)
head(res_clean)

# A tibble: 6 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 variance 4 85 <chr [1]> <list [1]>
2 variance 4 87 <chr [1]> <list [1]>
3 variance 4 89 <chr [1]> <list [1]>
4 variance 7 185 <chr [1]> <list [1]>
5 variance 17 471 <chr [1]> <list [1]>
6 variance 18 622 <chr [1]> <list [1]>
keyword_search()

Use table_mode to choose whether table-like blocks are searched:
"keep": include all text (default)."remove": exclude table-like blocks from search."only": search only table-like blocks.Additional options can improve table-only extraction:
- table_include_headers: include nearby table header rows (default TRUE).
- table_header_lookback: number of lines above detected table blocks to inspect for header rows (default 3).
- table_include_notes: include trailing note/source rows.
- table_note_lookahead: number of lines after detected blocks to inspect for notes.
- table_block_max_gap: maximum number of non-table lines allowed before a block is split. Increase this when tables are fragmented.

When specifying table_mode = 'remove', the same cleaning options above are applied to table blocks as well, which can help ensure that only body text is retained for keyword searching. When using table_mode = 'only', the cleaning options are not applied since the focus is on analyzing tables specifically.
res_keep <- keyword_search(
file,
keyword = "0.83",
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
table_mode = "keep",
convert_sentence = FALSE
)
res_remove <- keyword_search(
file,
keyword = "0.83",
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
table_mode = "remove",
convert_sentence = FALSE
)
res_only <- keyword_search(
file,
keyword = "0.83",
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
table_mode = "only",
table_include_headers = TRUE,
table_header_lookback = 3,
table_block_max_gap = 3,
table_include_notes = FALSE,
table_note_lookahead = 2,
convert_sentence = FALSE
)
c(
keep = nrow(res_keep),
remove = nrow(res_remove),
only = nrow(res_only)
)

  keep remove   only
     4      2      2
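Note that the keep count equals the sum of the other two: the two matches that fall inside table-like blocks are excluded by "remove" and are exactly the matches returned by "only".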
extract_tables()

Tables can now be removed entirely or extracted directly. The extract_tables() function now supports coordinate splitting and three output modes:
"parsed": list of parsed table data frames."blocks": metadata plus raw block lines."both": both parsed tables and block metadata.It also supports table-block tuning options, these are discussed in more detail in the next section, but include:
- table_include_headers, table_header_lookback
- table_include_notes, table_note_lookahead
- table_min_numeric_tokens, table_min_digit_ratio, table_min_block_lines, and table_block_max_gap
- merge_across_pages for continuation tables that span adjacent pages

tab_blocks <- extract_tables(
file,
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
column_count = "1",
remove_section_headers = TRUE,
remove_page_headers = TRUE,
remove_page_footers = TRUE,
remove_repeated_furniture = TRUE,
remove_captions = TRUE,
table_include_headers = TRUE,
table_header_lookback = 3,
table_block_max_gap = 3,
table_include_notes = FALSE,
table_note_lookahead = 2,
merge_across_pages = TRUE,
output = "blocks"
)
head(tab_blocks)

# A tibble: 3 × 6
page_num block_id line_start line_end line_text page_end
<int> <int> <dbl> <int> <list> <int>
1 8 1 1 6 <chr [6]> 8
2 9 1 1 12 <chr [12]> 9
3 14 1 1 19 <chr [19]> 14
tab_parsed <- extract_tables(
file,
path = TRUE,
split_pdf = TRUE,
split_method = "coordinates",
column_count = "1",
remove_section_headers = TRUE,
remove_page_headers = TRUE,
remove_page_footers = TRUE,
remove_repeated_furniture = TRUE,
remove_captions = TRUE,
table_include_headers = TRUE,
table_header_lookback = 3,
table_block_max_gap = 3,
table_include_notes = FALSE,
table_note_lookahead = 3,
merge_across_pages = TRUE,
output = "parsed"
)
length(tab_parsed)

[1] 3
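The parsed tables can then be inspected individually; a minimal sketch, assuming the tibble shown below comes from the first element of the list:

head(tab_parsed[[1]])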
# A tibble: 6 × 1
X1
<chr>
1 Table 1. Fit Statistics for Two-Parameter IRT Multigroup Models by Subject.
2 Subject M 2 RMSEA [CI] CFI M 2 RMSEA [CI] CFI
3 English 3930 (2340), p < .01 0.019 [0.018, 0.020] 0.893 2320 (740), p < .01 0…
4 Math 1630 (1283), p < .01 0.012 [0.010, 0.014] 0.949 805 (405), p < .01 0.023…
5 Reading 1895 (1307), p < .01 0.015 [0.014, 0.017] 0.965 902 (405), p < .01 0.…
6 Science 1818 (1146), p < .01 0.018 [0.016, 0.019] 0.936 948 (350), p < .01 0.…
One primary element to test is the number of columns in the PDF. If the body text is in two columns but the tables span both columns, you would want to ensure column_count = "1" is specified when extracting the tables. This keeps each table row intact rather than truncating it to the half that falls within a single text column.
The table detector is controlled by several additional options that can be tuned for better performance on specific documents. The key parameters are:
- table_min_numeric_tokens: minimum number of numeric-looking tokens required for a line to be considered table-like. Larger values are stricter.
- table_min_digit_ratio: minimum proportion of digit characters in a line for table-like classification. Larger values reduce prose false positives.
- table_min_block_lines: minimum number of adjacent table-like lines needed to keep a block.
- table_block_max_gap: maximum number of non-table lines allowed between table-like lines when merging a block. Increase this when tables are split.
- table_include_headers: include nearby table headers and column-label rows.
- table_header_lookback: number of lines above a detected block to inspect for headers.
- table_include_notes: include trailing Note. or Source. rows.
- table_note_lookahead: number of lines after a block to inspect for note lines.
- merge_across_pages: if TRUE, continuation blocks across adjacent pages are merged when they appear to be one table.
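For instance, to make the detector stricter on a noisy document, the thresholds can be raised; a minimal sketch in which the specific values are illustrative assumptions, not recommended defaults:

tab_strict <- extract_tables(
  file,
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "1",
  table_min_numeric_tokens = 4,  # require more numeric-looking tokens per line (assumed value)
  table_min_digit_ratio = 0.3,   # require a higher share of digit characters (assumed value)
  table_min_block_lines = 3,     # drop isolated one- or two-line "tables" (assumed value)
  table_block_max_gap = 1,       # allow fewer non-table interruptions within a block
  output = "blocks"
)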
A practical tuning workflow:

1. If tables come out fragmented, increase table_block_max_gap.
2. If prose lines are misclassified as table rows, raise table_min_numeric_tokens and/or table_min_digit_ratio.
3. If table headers are being missed, set table_include_headers = TRUE and increase table_header_lookback.
4. If a table continues across pages, set merge_across_pages = TRUE.

If desired, sentence conversion can be done after pages are concatenated. This has the benefit of allowing sentence conversion to work across pages, ensuring proper context and better keyword proximity when sentences are split across page breaks.
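A minimal sketch of such a search, assuming the keyword from the output below and the coordinate-splitting arguments used earlier in the post:

res_irt <- keyword_search(
  file,
  keyword = "IRT",  # assumed keyword, based on the results shown
  path = TRUE,
  split_pdf = TRUE,
  split_method = "coordinates",
  column_count = "2",
  convert_sentence = TRUE  # sentence conversion applied after splitting
)
head(res_irt)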
# A tibble: 6 × 5
keyword page_num line_num line_text token_text
<chr> <int> <int> <list> <list>
1 IRT 1 2 <chr [1]> <list [1]>
2 IRT 1 3 <chr [1]> <list [1]>
3 IRT 1 4 <chr [1]> <list [1]>
4 IRT 1 11 <chr [1]> <list [1]>
5 IRT 3 45 <chr [1]> <list [1]>
6 IRT 3 46 <chr [1]> <list [1]>
The new functionality will undergo further testing and will not be the default behavior within pdfsearch. The goal is to give users additional flexibility to more accurately perform keyword searches within the multi-column PDF documents that are common in academic publishing. The new table-block tuning options should also help users better control how tables are handled within the keyword search and table extraction functions.
For dense multi-column journal articles, a practical default is:
split_method = "coordinates"column_count = "2"remove_section_headers = TRUEremove_page_headers = TRUEremove_page_footers = TRUEremove_repeated_furniture = TRUEremove_captions = TRUEtable_mode = "remove" for prose-focused keyword searchUse table_mode = "only" or extract_tables(..., output = "blocks") when the goal is specifically to analyze tables. If table headers are being missed, set table_include_headers = TRUE and increase table_header_lookback. If the table continues across pages, use merge_across_pages = TRUE.