PDF-Tags-raku

Tagged PDF writer

A small DOM-like API for the creation of tagged PDF files.

This module supports manipulation of tagged PDF content, including simple construction, XPath queries and basic XML serialization.

Synopsis

use PDF::Tags;
use PDF::Tags::Elem;

# PDF::API6
use PDF::API6;
use PDF::Annot;
use PDF::XObject::Image;
use PDF::XObject::Form;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
# create the document root
my PDF::Tags::Elem $root = $tags.Document;

my $page = $pdf.add-page;
my $header-font = $page.core-font: :family<Helvetica>, :weight<bold>;
my $body-font = $page.core-font: :family<Helvetica>;

$page.graphics: -> $gfx {

    $root.Header1: $gfx, {
        .say('Marked Level 1 Header',
             :font($header-font),
             :font-size(15),
             :position[50, 120]);
    };

    $root.Paragraph: $gfx, {
        .say('Marked paragraph text', :position[50, 100], :font($body-font), :font-size(12));
    };

    # add a marked image
    my PDF::XObject::Image $img .= open: "t/images/lightbulb.gif";
    $root.Figure: $gfx, $img, :Alt('Incandescent apparatus');

    # add a marked link annotation
    my $destination = $pdf.destination( :page(2), :fit(FitWindow) );
    my PDF::Annot $annot = $pdf.annotation: :$page, :$destination, :rect[71, 717, 190, 734];

    $root.Link: $gfx, $annot;

    # tagged XObject Form
    my PDF::XObject::Form $form = $page.xobject-form: :BBox[0, 0, 200, 50];
    my $form-elem = $root.Form;
    $form.text: {
        my $font-size = 12;
        .text-position = [10, 38];

        $form-elem.Header2: $_, {
            .say: "Tagged XObject header", :font($header-font), :$font-size;
        };

        $form-elem.Paragraph: $_, {
            .say: "Some sample tagged text", :font($body-font), :$font-size;
        };
    }

    # render the form contained in $form-elem
    $form-elem.do: $gfx, :position[150, 70];
}

$pdf.save-as: "/tmp/marked.pdf"

Description

A tagged PDF contains additional markup that describes the logical structure of the document.

PDF tagging may assist PDF readers and other automated tools in reading PDF documents and locating content such as text and images.

This module provides a DOM-like interface for creating and traversing PDF structure and content via tags. It also provides an XPath-like search capability. It is designed for use in conjunction with PDF::Class or PDF::API6.
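
As a rough illustration of the traversal and search side, the sketch below reads an already tagged PDF back in. It assumes the file written by the Synopsis exists at `/tmp/marked.pdf`. The `read`, `kids`, `find` and `xml` calls, and indexing `$tags[0]` to reach the root element, reflect my understanding of the module's interface and do not appear elsewhere in this document, so they should be checked against the PDF::Tags class documentation before being relied on.

```raku
use PDF::Class;
use PDF::Tags;
use PDF::Tags::Elem;

# Assumption: a tagged PDF already exists at this path (e.g. the Synopsis output)
my PDF::Class $pdf .= open: "/tmp/marked.pdf";

# Assumption: `read` loads existing structure tags (the counterpart of `create`)
my PDF::Tags $tags .= read: :$pdf;

# DOM-like traversal: root element first, then its immediate children
my PDF::Tags::Elem $root = $tags[0];
say $root.name;
say .name for $root.kids;

# XPath-like search: paragraph elements directly under the document root
say .name for $tags.find('Document/P');

# basic XML serialization of the structure tree
say $root.xml;
```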

Standard Tags

Elements may be constructed using their Tag name or Mnemonic, as listed below. For example:

$root.P: $gfx, { .say('Marked paragraph text') };

Can also be written as:

$root.Paragraph: $gfx, { .say('Marked paragraph text') };

Or as:

$root.add-kid(:name<P>).mark: $gfx, { .say('Marked paragraph text') };

The documentation in this section is adapted from pdfkit.

“Grouping” elements:

| Tag | Mnemonic | Description |
|---|---|---|
| Document | | whole document; must be used if there are multiple parts or articles |
| Part | | part of a document |
| Art | Article | |
| Sect | Section | may nest |
| Div | Division | generic division |
| BlockQuote | | block quotation |
| Caption | | describing a figure or table |
| TOC | TableOfContents | may be nested, and may be used for lists of figures, tables, etc. |
| TOCI | TableOfContentsItem | table of contents (leaf) item |
| Index | | index (text with accompanying Reference content) |
| NonStruct | NonStructural | non-structural grouping element (element itself not intended to be exported to other formats like HTML, but ‘transparent’ to its content which is processed normally) |
| Private | | content only meaningful to the creator (element and its content not intended to be exported to other formats like HTML) |

“Block” elements:

| Tag | Mnemonic | Description |
|---|---|---|
| H | Heading | heading (first element in a section, etc.) |
| H1 - H6 | Heading1 - Heading6 | heading of a particular level; intended for use only if nesting sections is not possible for some reason |
| P | Paragraph | |
| L | List | should include optional Caption, and list items |
| LI | ListItem | should contain Lbl and/or LBody |
| Lbl | Label | bullet, number, or “dictionary headword” |
| LBody | ListBody | (item text, or “dictionary definition”); may have nested lists or other blocks |
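
The nesting described for list elements can be sketched as follows. This is a minimal example that reuses only the construction patterns shown in the Synopsis: a tag method called without arguments to create a container element, and a tag method called with a graphics object and a block to mark content. The item texts, coordinates and output path are made up for illustration.

```raku
use PDF::API6;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

my $page = $pdf.add-page;
my $font = $page.core-font: :family<Helvetica>;

# illustrative content only
my @items = 'First item', 'Second item';

$page.graphics: -> $gfx {
    # L (List) is a container; each LI (ListItem) holds a Lbl and an LBody
    my PDF::Tags::Elem $list = $doc.List;

    for @items.kv -> $i, $text {
        my $y = 700 - 20 * $i;
        my PDF::Tags::Elem $item = $list.ListItem;

        # Lbl: the item number
        $item.Label: $gfx, {
            .say("{$i + 1}.", :position[50, $y], :$font, :font-size(12));
        };

        # LBody: the item text
        $item.ListBody: $gfx, {
            .say($text, :position[70, $y], :$font, :font-size(12));
        };
    }
}

$pdf.save-as: "/tmp/tagged-list.pdf";
```

The container elements (`List`, `ListItem`) carry no content of their own here; only the `Label` and `ListBody` leaves mark text on the page.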

“Table” elements:

| Tag | Mnemonic | Description |
|---|---|---|
| Table | | table; should either contain TR, or THead, TBody and/or TFoot |
| TR | TableRow | |
| TH | TableHeader | table heading cell |
| TD | TableData | table data cell |
| THead | TableHead | table header row group |
| TBody | TableBody | table body row group; may have more than one per table |
| TFoot | TableFoot | table footer row group |
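
A table structure built from these elements might look like the sketch below. It again relies only on the construction patterns from the Synopsis and uses the mnemonics listed above. The cell texts, column positions and output path are invented for the example, and the code only places tagged text on the page (it does not draw any cell borders).

```raku
use PDF::API6;
use PDF::Tags;
use PDF::Tags::Elem;

my PDF::API6 $pdf .= new;
my PDF::Tags $tags .= create: :$pdf;
my PDF::Tags::Elem $doc = $tags.Document;

my $page = $pdf.add-page;
my $font = $page.core-font: :family<Helvetica>;
my $bold = $page.core-font: :family<Helvetica>, :weight<bold>;

# illustrative content only
my @header = 'Tag', 'Mnemonic';
my @rows = ['TR', 'TableRow'], ['TD', 'TableData'];

$page.graphics: -> $gfx {
    my PDF::Tags::Elem $table = $doc.Table;

    # THead > TR > TH: one header row
    my PDF::Tags::Elem $head-row = $table.TableHead.TableRow;
    for @header.kv -> $col, $text {
        $head-row.TableHeader: $gfx, {
            .say($text, :position[50 + 100 * $col, 700], :font($bold), :font-size(12));
        };
    }

    # TBody > TR > TD: one row per entry in @rows
    my PDF::Tags::Elem $body = $table.TableBody;
    for @rows.kv -> $r, @cells {
        my PDF::Tags::Elem $row = $body.TableRow;
        for @cells.kv -> $col, $text {
            $row.TableData: $gfx, {
                .say($text, :position[50 + 100 * $col, 680 - 20 * $r], :$font, :font-size(12));
            };
        }
    }
}

$pdf.save-as: "/tmp/tagged-table.pdf";
```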

“Inline” elements:

| Tag | Mnemonic | Description |
|---|---|---|
| Span | | generic inline content |
| Quote | | inline quotation |
| Note | | e.g. footnote; may have a Lbl (see “block” elements) |
| Reference | | content in a document that refers to other content (e.g. page number in an index) |
| BibEntry | BibliographyEntry | may have a Lbl (see “block” elements) |
| Code | | code |
| Link | | hyperlink; should contain a link annotation |
| Annot | Annotation | annotation (other than a link) |
| Ruby | | Chinese/Japanese pronunciation/explanation |
| RB | RubyBaseText | Ruby base text |
| RT | RubyText | Ruby annotation text |
| RP | RubyPunctuation | |
| Warichu | | Japanese/Chinese longer description |
| WT | WarichuText | |
| WP | WarichuPunctuation | |

“Illustration” elements (should have Alt and/or ActualText set):

| Tag | Mnemonic | Description |
|---|---|---|
| Figure | | |
| Formula | | |
| Form | | form widget |

Non-structure tags:

| Tag | Mnemonic | Description |
|---|---|---|
| Artifact | | used to mark all content not part of the logical structure |
| ReversedChars | | every string of text has characters in reverse order for technical reasons (due to how fonts work for right-to-left languages); strings may have spaces at the beginning or end to separate words, but may not have spaces in the middle |

Classes in this Distribution

See Also

Further Work

The PDF accessibility standard ISO 14289-1 cannot be distributed and needs to be purchased from ISO.