The Context

PCRE (Perl Compatible Regular Expressions) is a popular C library that implements regular expression pattern matching using the same syntax and semantics as Perl 5. The home site of the library is pcre.org.

GNAT already provides Ada libraries for regular expressions: the Unix-style GNAT.Regexp and GNAT.Regpat, and the SPITBOL-like GNAT.Spitbol.

As an alternative, interfacing with PCRE illustrates some techniques for dealing with a C library. The package Interfaces.C.Strings contains enough primitives to avoid writing a wrapper in C.

Abstract of header file pcre.h

This abstract is based on version 8.02 of the header file. The header file is quite long, but we only use 2 types and 4 operations, so all we need is:

/* Types */
 
struct real_pcre;                 /* declaration; the definition is private  */
typedef struct real_pcre pcre;
 
#ifndef PCRE_SPTR
#define PCRE_SPTR const char *
#endif
 
/* The structure for passing additional data to pcre_exec().  */
 
typedef struct pcre_extra {
 
/* record components we will not access */
} pcre_extra;
 
/* Indirection for store get and free functions */
PCRE_EXP_DECL void  (*pcre_free)(void *);
 
/* Exported PCRE functions */
 
PCRE_EXP_DECL pcre *pcre_compile(const char *, int, const char **, int *,
                  const unsigned char *);
PCRE_EXP_DECL int  pcre_exec(const pcre *, const pcre_extra *, PCRE_SPTR,
                   int, int, int, int *, int);
PCRE_EXP_DECL pcre_extra *pcre_study(const pcre *, int, const char **);

Interface of the thin binding

The objective of the interface is to hide the dependency on the package Interfaces.C. The types exposed by the interface are: Integer, String, Pcre_Type, Extra_type (and also System.Address in the complete binding).

The types Pcre_Type and Extra_type are opaque pointers and should not be accessible outside the interface, so they are made private. No operation on the components of pcre_extra is necessary, so both types are simply derived from System.Address.

The complete cycle in PCRE is compile/study/exec, whereas GNAT.Regexp has 2 phases (compile/match). The study phase is an optimization of the pattern that outputs an object of type Extra_type. Here we bypass the study phase.

Compile allocates and returns a pointer to the compiled pattern; the pointer is null if an error occurred. In that case, an error message is available, as well as the position of the error in the pattern.

Free is used to deallocate the compiled pattern.

Match takes as inputs the compiled pattern and the subject Ada string to parse. The Length parameter is needed so that only part of the string can be scanned.

The procedure Match outputs a return code (Result) that is negative if there is no match or if an error occurred. For a zero or positive return code, the match array has the same content as the output vector of the C library.
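To make the layout of that output vector concrete, here is a minimal C sketch of how the offsets are read back. It does not call PCRE: copy_capture is a hypothetical helper, not part of the library, and it assumes the PCRE convention that element 2*g holds the offset of the first character of group g and element 2*g + 1 the offset just past its last character.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): copies capture number `group`
   of `subject` into `out` as a nul-terminated C string.
   `pairs` is the number of offset pairs that were filled in, that is,
   a positive return code from pcre_exec.
   Returns the length of the capture, or -1 if it is not available. */
int copy_capture(const char *subject, const int *ovector, int pairs,
                 int group, char *out, size_t out_size)
{
    if (group < 0 || group >= pairs)
        return -1;

    int start = ovector[2 * group];      /* offset of first character  */
    int end   = ovector[2 * group + 1];  /* offset just past the match */

    if (start < 0 || end < start)        /* group did not take part    */
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > out_size)              /* no room for text plus nul  */
        return -1;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

For the subject ";-)I love PATTERN matching!" and a vector whose first pair is (3, 4), group 0 is the one-character string "I". The third of the vector that follows the pairs is workspace for the library and carries no result.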

pcre.ads

-----------------------------------------------------------------------
--  interface to PCRE
-----------------------------------------------------------------------
with System;
with Interfaces;
 
package Pcre is
 
   type Options is new Interfaces.Unsigned_32;
 
   PCRE_CASELESS          : constant Options := 16#00000001#;  --Compile
 
   type Pcre_Type is private;
   type Extra_type is private;
 
   Null_Pcre  : constant Pcre_Type;
   Null_Extra : constant Extra_type;
 
   type Table_Type is private;
   Null_Table : constant Table_Type;
 
 
   -- output strings for error message; normally size of 80 should be enough
   subtype Message is String (1 .. 80);
 
   procedure Compile
     (Matcher      : out Pcre_Type;
      Pattern      : in String;
      Option       : in Options;
      Error_Msg    : out Message;
      Last_Msg     : out Natural;
      Error_Offset : out Integer;
      Table        : in Table_Type := Null_Table);
 
   procedure Free (M : Pcre_Type);
 
   -----------------
   -- Match_Array --
   -----------------
   -- Result of matches : same output as PCRE
   -- size must be a multiple of 3 x (nbr of parentheses + 1)
   -- For top-level, range should be 0 .. 2
   -- For N parentheses, range should be 0 .. 3*(N+1) -1
   -- If the dimension of Match_Array is insufficient, Result of Match is 0.
   --
   type Match_Array is array (Natural range <>) of Natural;
 
   procedure Match
     (Result              : out Integer;
      Match_Vec           : out Match_Array;
      Matcher             : in Pcre_Type;
      Extra               : in Extra_type;
      Subject             : in String;
      Length, Startoffset : in Integer;
      Option              : in Options := 0);
 
private
 
   type Pcre_Type is new System.Address;
   type Extra_type is new System.Address;
 
   Null_Pcre  : constant Pcre_Type  := Pcre_Type (System.Null_Address);
   Null_Extra : constant Extra_type := Extra_type (System.Null_Address);
 
   type Table_Type is new System.Address;
   Null_Table : constant Table_Type := Table_Type (System.Null_Address);
 
end Pcre;

Implementation of the thin binding

In C, a string is a pointer to char whose data is terminated by a nul character. With GNAT, an Ada string is implemented with the 2 bounds first, followed by the content of the string. The package Interfaces.C.Strings provides the conversion function New_String:

   function New_String (Str : String) return chars_ptr;

This function allocates a new copy of the data and adds a terminating nul. The data are therefore duplicated, which can be a burden when the data weigh 50 MB.

Also, to avoid a memory leak, this copy must be freed after use.
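The cost of that conversion can be pictured on the C side. The following sketch shows roughly what any such conversion has to do; dup_with_nul is a hypothetical name used for illustration and is not the actual implementation of New_String.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration (not the GNAT implementation): build a C
   string from `length` bytes that carry no terminating nul.
   The whole content is copied, and the caller owns the result and
   must release it with free(). */
char *dup_with_nul(const char *data, size_t length)
{
    char *copy = malloc(length + 1);   /* one extra byte for the nul */
    if (copy == NULL)
        return NULL;

    memcpy(copy, data, length);        /* full duplication of the data */
    copy[length] = '\0';
    return copy;
}
```

The copy is proportional to the size of the data, which is acceptable for a short pattern but wasteful for a large subject.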

The procedure Match deals with:

1/ passing the content of an Ada string by reference.

Because an Ada string and a C string are laid out differently, the trick is to point to the first element of the Ada string. In this case there is no terminating nul, but since we pass the length of the data, this causes no trouble.

2/ getting back a vector from the C code.

Ada allocates this vector, which is then filled in by the C code. Therefore a pragma Convention (C) is required for the vector, as well as a pragma Volatile so that the Ada compiler does not interfere with it or optimize it away.
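Both points rely on the same calling convention, which can be sketched in C without PCRE itself. The function find_word below is a hypothetical stand-in for pcre_exec, not the real function: it receives a pointer plus a length instead of a nul-terminated string, and it writes its result into a vector owned by the caller.

```c
#include <ctype.h>

/* Hypothetical stand-in for pcre_exec (NOT the real function): finds the
   first run of ASCII letters in subject[startoffset .. length - 1].
   It never reads at or beyond subject[length], so the buffer does not
   need a terminating nul.
   Returns 1 on a match, -1 if there is none, and 0 if the caller's
   vector is too small to hold one pair of offsets. */
int find_word(const char *subject, int length, int startoffset,
              int *ovector, int ovecsize)
{
    if (ovecsize < 2)
        return 0;

    int i = startoffset;
    while (i < length && !isalpha((unsigned char)subject[i]))
        i++;                           /* skip anything that is not a letter */
    if (i >= length)
        return -1;

    int start = i;
    while (i < length && isalpha((unsigned char)subject[i]))
        i++;                           /* consume the word */

    ovector[0] = start;                /* offset of first character  */
    ovector[1] = i;                    /* offset just past the match */
    return 1;
}
```

A caller can hand over any byte buffer, for instance the content of an Ada string starting at its first element, as long as the length it passes is correct.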

The whole package has been tested for memory leaks with Valgrind and does not leak.

pcre.adb

with Interfaces.C.Strings;     use Interfaces.C.Strings;
with Interfaces.C;             use Interfaces.C;
with Ada.Unchecked_Conversion;
with System;                   use System;
 
package body Pcre is
 
   pragma Linker_Options ("-lpcre");
 
   use Interfaces;
 
   function To_chars_ptr is new Ada.Unchecked_Conversion (
      Address,
      chars_ptr);
 
   function Pcre_Compile
     (pattern   : chars_ptr;
      option    : Options;
      errptr    : access chars_ptr;
      erroffset : access Integer;
      tableptr  : Table_Type)
      return      Pcre_Type;
   pragma Import (C, Pcre_Compile, "pcre_compile");
 
   function Pcre_Exec
     (code        : Pcre_Type;
      extra       : Extra_type;
      subject     : chars_ptr;
      length      : Integer;
      startoffset : Integer;
      option      : Options;
      ovector     : System.Address;
      ovecsize    : Integer)
      return        Integer;
   pragma Import (C, Pcre_Exec, "pcre_exec");
 
   procedure Compile
     (Matcher      : out Pcre_Type;
      Pattern      : in String;
      Option       : in Options;
      Error_Msg    : out Message;
      Last_Msg     : out Natural;
      Error_Offset : out Integer;
      Table        : in Table_Type := Null_Table)
   is
      Error_Ptr : aliased chars_ptr;
      ErrOffset : aliased Integer;
      Pat       : chars_ptr := New_String (Pattern);
   begin
      Matcher :=
         Pcre_Compile
           (Pat,
            Option,
            Error_Ptr'Access,
            ErrOffset'Access,
            Table);
      Free (Pat);
 
      if Matcher = Null_Pcre then
         Last_Msg                  := Natural (Strlen (Error_Ptr));
         Error_Msg (1 .. Last_Msg) := Value (Error_Ptr);
         Error_Offset              := ErrOffset;
      else
         Last_Msg     := 0;
         Error_Offset := 0;
      end if;
   end Compile;
 
 
   procedure Match
     (Result              : out Integer;
      Match_Vec           : out Match_Array;
      Matcher             : in Pcre_Type;
      Extra               : in Extra_type;
      Subject             : in String;
      Length, Startoffset : in Integer;
      Option              : in Options := 0)
   is
      Match_Size : constant Natural                     := Match_Vec'Length;
      m          : array (0 .. Match_Size - 1) of C.int := (others => 0);
      pragma Convention (C, m);
      pragma Volatile (m); -- used by the C library
 
      Start : constant chars_ptr :=
         To_chars_ptr (Subject (Subject'First)'Address);
   begin
 
      Result :=
         Pcre_Exec
           (Matcher,
            Extra,
            Start,
            Length,
            Startoffset,
            Option,
            m (0)'Address,
            Match_Size);
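      -- Copy back relative to Match_Vec'First, so that the caller's
      -- array does not have to start at index 0.
      for I in 0 .. Match_Size - 1 loop
         if m (I) > 0 then
            Match_Vec (Match_Vec'First + I) := Integer (m (I));
         else
            Match_Vec (Match_Vec'First + I) := 0;
         end if;
      end loop;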
   end Match;
 
   type Access_Free is access procedure (Item : System.Address);
   Pcre_Free : Access_Free;
   pragma Import (C, Pcre_Free, "pcre_free");
 
   procedure Free (M : Pcre_Type) is
   begin
      Pcre_Free (System.Address (M));
   end Free;
 
end Pcre;

Test of Pcre binding

Example taken from the Regex page at the Rosetta Code site (rosettacode.org).

test_0.adb

--
-- Basic test : splitting a sentence into words
--
with Ada.Text_IO; use Ada.Text_IO;
with Pcre;        use Pcre;
 
procedure Test_0 is
 
   procedure Search_For_Pattern
     (Compiled_Expression : in Pcre.Pcre_Type;
      Search_In           : in String;
      Offset              : in Natural;
      First, Last         : out Positive;
      Found               : out Boolean)
   is
      Result  : Match_Array (0 .. 2);
      Retcode : Integer;
   begin
      Match
        (Retcode,
         Result,
         Compiled_Expression,
         Null_Extra,
         Search_In,
         Search_In'Length,
         Offset);
 
      if Retcode < 0 then
         Found := False;
      else
         Found := True;
         First := Search_In'First + Result (0);
         Last  := Search_In'First + Result (1) - 1;
      end if;
   end Search_For_Pattern;
 
   -- [A-Za-z] rather than [A-z]: the range A-z also covers [ \ ] ^ _ `
   Word_Pattern : constant String := "([A-Za-z]+)";
 
   Subject          : constant String := ";-)I love PATTERN matching!";
   Current_Offset   : Natural         := 0;
   First, Last      : Positive;
   Found            : Boolean;
   Regexp           : Pcre_Type;
   Msg              : Message;
   Last_Msg, ErrPos : Natural         := 0;
 
begin
   Compile (Regexp, Word_Pattern, 0, Msg, Last_Msg, ErrPos);
 
   -- Find all the words in Subject string
   loop
      Search_For_Pattern
        (Regexp,
         Subject,
         Current_Offset,
         First,
         Last,
         Found);
      exit when not Found;
      Put_Line ("<" & Subject (First .. Last) & ">");
      Current_Offset := Last;
   end loop;
 
   Free (Regexp);
end Test_0;

Output :

<I>
<love>
<PATTERN>
<matching>

Complete code of the binding

The complete code of the binding and some examples can be downloaded from sourceforge.net.