The Context

PCRE (Perl Compatible Regular Expressions) is a popular C library that implements regular expression pattern matching using the same syntax and semantics as Perl 5. The library's home page is pcre.org.

GNAT already provides Ada packages for pattern matching: the Unix-style GNAT.Regexp and GNAT.Regpat, and the SPITBOL-like GNAT.Spitbol.

As an alternative, interfacing with PCRE will show some techniques for dealing with a C library.

Abstract of header file pcre.h

This page uses version 8.02 of pcre.h. The header file is quite long, but we only use 2 types and 4 operations, so all we need is:

/* Types */
 
struct real_pcre;                 /* declaration; the definition is private  */
typedef struct real_pcre pcre;
 
#ifndef PCRE_SPTR
#define PCRE_SPTR const char *
#endif
 
/* The structure for passing additional data to pcre_exec().  */
 
typedef struct pcre_extra {
 
/* record components we will not access */
} pcre_extra;
 
/* Indirection for store get and free functions */
PCRE_EXP_DECL void  (*pcre_free)(void *);
 
/* Exported PCRE functions */
 
PCRE_EXP_DECL pcre *pcre_compile(const char *, int, const char **, int *,
                  const unsigned char *);
PCRE_EXP_DECL int  pcre_exec(const pcre *, const pcre_extra *, PCRE_SPTR,
                   int, int, int, int *, int);
PCRE_EXP_DECL pcre_extra *pcre_study(const pcre *, int, const char **);

Interface of the thin binding

The objective of the interface is to hide the dependency on the package Interfaces.C. The only types exposed by the interface are Integer, String, Pcre_Type, Pcre_Extra_type, and System.Address.

The C types pcre and pcre_extra are used as opaque pointers and should not be accessible outside the interface, so the corresponding Ada types Pcre_Type and Pcre_Extra_type are private. No operation on the components of pcre_extra is necessary, so both are implemented as types derived from System.Address.

In GNAT, a String is stored as its bounds followed by its content, whereas the C code expects a pointer to char. The function Interfaces.C.Strings.New_String would do the conversion, but it makes a new copy of the data, which is a burden when the data weighs 50 MB. The trick is to pass the address of the first element of the Ada String together with its length. This works because pcre_exec takes an explicit length and does not rely on a terminating NUL.

The Ada interface for PCRE follows the same model as the GNAT regular expression packages: 2 phases (compile / match) instead of the 3 phases of pcre.h (compile / study / exec).
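
The address trick can be tried without PCRE at all. The following is a minimal sketch, assuming a hosted C library that provides memchr; like pcre_exec, memchr takes a pointer and an explicit length, so no NUL terminator and no copy are needed. The procedure name Address_Demo is only illustrative.

```ada
with Ada.Text_IO;             use Ada.Text_IO;
with Interfaces.C;
with System;
with System.Storage_Elements; use System.Storage_Elements;

procedure Address_Demo is

   --  memchr from the C standard library: pointer plus explicit length,
   --  so the buffer does not need a terminating NUL.
   function Memchr
     (S : System.Address;
      C : Interfaces.C.int;
      N : Interfaces.C.size_t) return System.Address;
   pragma Import (C, Memchr, "memchr");

   use type System.Address;

   Subject : constant String := "Z2345A789B123456789AA";

   --  Address of the first character, not of the bounds.
   First : constant System.Address := Subject (Subject'First)'Address;
   Found : System.Address;
begin
   Found := Memchr (First, Character'Pos ('B'), Subject'Length);
   if Found = System.Null_Address then
      Put_Line ("not found");
   else
      --  Address difference gives the zero-based offset: prints 9 here.
      Put_Line ("zero-based offset:" & Storage_Offset'Image (Found - First));
   end if;
end Address_Demo;
```

Using Subject (Subject'First)'Address rather than Subject (1)'Address keeps the code correct for strings and slices whose lower bound is not 1.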

Some Ada extras:

An exception is raised if something goes wrong, and null constants allow checking the validity of any opaque pointer (they could be replaced by a function Is_Valid).
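
As a rough illustration of the Is_Valid alternative, here is a self-contained sketch. The package Handles, the type Handle and the helper Make are hypothetical stand-ins for Pcre and Pcre_Type, so that the example runs without the C library; they are not part of the binding.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with System;

procedure Is_Valid_Demo is

   package Handles is
      type Handle is private;            --  stands in for Pcre_Type
      Null_Handle : constant Handle;
      function Is_Valid (H : Handle) return Boolean;
      --  Demo helper only: builds a handle from an arbitrary address.
      function Make (A : System.Address) return Handle;
   private
      type Handle is new System.Address;
      Null_Handle : constant Handle := Handle (System.Null_Address);
   end Handles;

   package body Handles is
      function Is_Valid (H : Handle) return Boolean is
      begin
         return H /= Null_Handle;
      end Is_Valid;

      function Make (A : System.Address) return Handle is
      begin
         return Handle (A);
      end Make;
   end Handles;

   use Handles;

   Dummy : aliased constant Integer := 0;
begin
   Put_Line (Boolean'Image (Is_Valid (Null_Handle)));            --  FALSE
   Put_Line (Boolean'Image (Is_Valid (Make (Dummy'Address))));   --  TRUE
end Is_Valid_Demo;
```

With such a function the null constants could stay out of the visible part of the package, at the cost of one more subprogram per opaque type.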

pcre.ads

with System; use System;
 
package Pcre is
 
   Pcre_Error : exception;
 
   type Pcre_Type is private;
   type Pcre_Extra_type is private;
 
   Null_Pcre       : constant Pcre_Type;
   Null_Pcre_Extra : constant Pcre_Extra_type;
 
   procedure Compile
     (Pattern       : in String;
      Options       : in Integer;
      Matcher       : out Pcre_Type;
      Matcher_Extra : out Pcre_Extra_type);
 
   procedure Match
     (Matcher             : in Pcre_Type;
      Matcher_Extra       : in Pcre_Extra_type;
      Subject             : System.Address;
      -- Address of the first element of the string to be searched;
      Length, Startoffset : in Integer;
      Options             : in Integer;
      Match_0, Match_1    : out Integer;
      Result              : out Integer);
 
   procedure Free (M : Pcre_Type);
 
   procedure Free (M : Pcre_Extra_type);
 
private
 
   type Pcre_Type is new System.Address;
   type Pcre_Extra_type is new System.Address;
 
   Null_Pcre       : constant Pcre_Type       := Pcre_Type (Null_Address);
   Null_Pcre_Extra : constant Pcre_Extra_type :=
      Pcre_Extra_type (Null_Address);
 
end Pcre;

Implementation of the thin binding

The procedure Compile combines pcre_compile and pcre_study and checks their results. Not a big deal.

The procedure Match deals with the vector of offsets returned by the C code. Ada allocates this vector and the C code fills it in, so a pragma Convention (C) is required, as well as a pragma Volatile so that the Ada compiler does not optimize away accesses to it.

The 2 procedures Free release the memory allocated by the C library. The whole package has been tested for memory leaks with Valgrind and does not leak.

For the sake of simplicity, Compile only raises Pcre_Error; reporting the error message and offset returned by PCRE is left to the reader.
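
One possible way to report the error is sketched below. It is a standalone procedure, not the package's Compile: it imports pcre_compile directly, assumes libpcre is installed, and the name Compile_Error_Demo and the deliberately malformed pattern are chosen for illustration only.

```ada
with Ada.Text_IO;          use Ada.Text_IO;
with Interfaces.C;         use Interfaces.C;
with Interfaces.C.Strings; use Interfaces.C.Strings;
with System;

procedure Compile_Error_Demo is

   pragma Linker_Options ("-lpcre");

   use type System.Address;

   function Pcre_Compile
     (Pattern   : chars_ptr;
      Options   : int;
      Errptr    : access chars_ptr;
      Erroffset : access int;
      Tableptr  : System.Address) return System.Address;
   pragma Import (C, Pcre_Compile, "pcre_compile");

   Error_Ptr    : aliased chars_ptr := Null_Ptr;
   Error_Offset : aliased int       := 0;
   Pat          : chars_ptr         := New_String ("[A-Z");  --  malformed
   Code         : System.Address;
begin
   Code :=
      Pcre_Compile
        (Pat,
         0,
         Error_Ptr'Access,
         Error_Offset'Access,
         System.Null_Address);
   Free (Pat);

   if Code = System.Null_Address then
      --  The message is static text owned by PCRE: read it, never free it.
      Put_Line
        ("compile failed at offset" &
         int'Image (Error_Offset) &
         ": " &
         Value (Error_Ptr));
   end if;
end Compile_Error_Demo;
```

Inside the package body, the same information could be attached to the exception instead of printed, for example with raise Pcre_Error with Value (Error_Ptr), which requires Ada 2005 or later.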

pcre.adb

with Interfaces.C.Strings;     use Interfaces.C.Strings;
with Interfaces.C;             use Interfaces.C;
with Ada.Unchecked_Conversion;
 
package body Pcre is
 
   pragma Linker_Options ("-lpcre");
   pragma Assert (int'Size = Integer'Size); -- always true with Gnat
 
   use Interfaces;
 
   function To_chars_ptr is new Ada.Unchecked_Conversion (
      Address,
      chars_ptr);
 
   function Pcre_Compile
     (pattern   : chars_ptr;
      options   : Integer;
      errptr    : access chars_ptr;
      erroffset : access Integer;
      tableptr  : chars_ptr)
      return      Pcre_Type;
   pragma Import (C, Pcre_Compile, "pcre_compile");
 
   function Pcre_Study
     (code    : Pcre_Type;
      options : Integer;
      errptr  : access chars_ptr)
      return    Pcre_Extra_type;
   pragma Import (C, Pcre_Study, "pcre_study");
 
   function Pcre_Exec
     (code        : Pcre_Type;
      extra       : Pcre_Extra_type;
      subject     : chars_ptr;
      length      : Integer;
      startoffset : Integer;
      options     : Integer;
      ovector     : System.Address;
      ovecsize    : C.int)
      return        Integer;
   pragma Import (C, Pcre_Exec, "pcre_exec");
 
   procedure Compile
     (Pattern       : in String;
      Options       : in Integer;
      Matcher       : out Pcre_Type;
      Matcher_Extra : out Pcre_Extra_type)
   is
      Regexp       : Pcre_Type;
      Regexp_Extra : Pcre_Extra_type;
      Error_Ptr    : aliased chars_ptr;
      Error_Offset : aliased Integer;
      Pat          : chars_ptr := New_String (Pattern);
   begin
      Regexp :=
         Pcre_Compile
           (Pat,
            Options,
            Error_Ptr'Access,
            Error_Offset'Access,
            Null_Ptr);
      Free (Pat);
 
      if Regexp = Null_Pcre then
         raise Pcre_Error;
      end if;
      Matcher      := Regexp;
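      Regexp_Extra := Pcre_Study (Regexp, 0, Error_Ptr'Access);
      --  pcre_study may legitimately return NULL when it finds nothing
      --  useful; a failure is signalled by a non-null error message.
      if Error_Ptr /= Null_Ptr then
         raise Pcre_Error;
      end if;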
      Matcher_Extra := Regexp_Extra;
   end Compile;
 
   procedure Match
     (Matcher             : in Pcre_Type;
      Matcher_Extra       : in Pcre_Extra_type;
      Subject             : System.Address;
      -- Address of the first element of a string;
      Length, Startoffset : in Integer;
      Options             : in Integer;
      Match_0, Match_1    : out Integer;
      Result              : out Integer)
   is
      Vecsize : constant := 3; -- top-level matching
 
      m : array (0 .. Vecsize - 1) of C.int;
      pragma Convention (C, m);
      pragma Volatile (m); -- used by the C library
 
      Start  : constant chars_ptr :=
         To_chars_ptr (Subject);
   begin
 
      Result  :=
         Pcre_Exec
           (Matcher,
            Matcher_Extra,
            Start,
            Length,
            Startoffset,
            Options,
            m (0)'Address,
            C.int (Vecsize));
      Match_0 := Integer (m (0));
      Match_1 := Integer (m (1));
 
   end Match;
 
   type Access_Free is access procedure (Item : System.Address);
   Pcre_Free : Access_Free;
   pragma Import (C, Pcre_Free, "pcre_free");
 
   procedure Free (M : Pcre_Type) is
   begin
      Pcre_Free (System.Address (M));
   end Free;
 
   procedure Free (M : Pcre_Extra_type) is
   begin
      Pcre_Free (System.Address (M));
   end Free;
 
end Pcre;

Test of Pcre binding

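A simple program that compiles a pattern and shows the match positions in the subject string. The values m0 and m1 are the zero-based offsets of the start of the match and of the first character after it. Since Subject is indexed from 1, Subject (m1) is the last character of the match.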

test_pcre.adb

--
-- A simple test to show the values of m0 & m1
--
with Text_IO; use Text_IO;
with Pcre;    use Pcre;
 
procedure Test_Pcre is
 
   Regexp          : Pcre_Type;
   Regexp_Extra    : Pcre_Extra_type;
   Retcode         : Integer;
   Position, Count : Integer         := 0;
   m0, m1          : Integer;
   Subject         : constant String := "Z2345A789B123456789AA";
 
begin
   Compile
     (Pattern       => "[A-Z][0-9]",
      Options       => 0,
      Matcher       => Regexp,
      Matcher_Extra => Regexp_Extra);
 
   loop
      Match
        (Regexp,
         Regexp_Extra,
         Subject (1)'Address,
         Subject'Length,
         Position,
         0,
         m0,
         m1,
         Retcode);
      exit when Retcode < 0;
      Put_Line
        ("m0:=" &
         Integer'Image (m0) &
         " m1:=" &
         Integer'Image (m1) &
         " character => " &
         Subject (m1));
      Count    := Count + 1;
      Position := m1;
   end loop;
   Put_Line ("Count is" & Integer'Image (Count));
 
   Free (Regexp);
   Free (Regexp_Extra);
end Test_Pcre;

Output :

m0:= 0 m1:= 2 character => 2
m0:= 5 m1:= 7 character => 7
m0:= 9 m1:= 11 character => 1
Count is 3
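
To turn an offset pair into the matched text, the zero-based offsets have to be shifted by the lower bound of the Ada string. The sketch below does this for the three pairs printed above; the function name Matched is hypothetical and PCRE itself is not needed to run it.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Slice_Demo is

   Subject : constant String := "Z2345A789B123456789AA";

   --  Convert a PCRE offset pair (zero-based start, zero-based end + 1)
   --  into the corresponding slice of an Ada string.
   function Matched (S : String; M0, M1 : Natural) return String is
   begin
      return S (S'First + M0 .. S'First + M1 - 1);
   end Matched;

begin
   --  Offset pairs taken from the output of Test_Pcre.
   Put_Line (Matched (Subject, 0, 2));    --  Z2
   Put_Line (Matched (Subject, 5, 7));    --  A7
   Put_Line (Matched (Subject, 9, 11));   --  B1
end Slice_Demo;
```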
