Back Home Up Next
| |
Using descriptors with UDFs.
For most people that started using IB when Delphi was
released by Borland and the engine came "as a bonus toy for practicing
SQL", all the known capabilities are circumscribed to the features the DSQL
layer has to offer.
Under the hood, however, there's a procedural and neutral language called BLR
that's the only language that the core engine understands. Previously to IB4,
the main language was the proprietary GDML for preprocessed applications, where
GDML is converted into BLR by GPRE before the host language's compiler (C, Ada,
Pascal, etc.) sees the source files. In that era, DSQL was not a subsystem
integrated in the engine but an external utility, same as ISQL,
QLI and GPRE are
utilities. But in IB4, the story changed: now Borland owned IB and the
recognition that SQL was the winner against other proposed standards and
proprietary languages led the team to integrate DSQL into the engine. However,
for whatever reason (schedule, human resources, priorities, complexity, SQL
being 99% a declarative language as opposed to GDML, etc.) Borland failed to
surface several capabilities of the engine through DSQL... and unfortunately,
UDFs were one of those overlooked areas.
The following declaration exists in ibase.h and enumerates the types of
parameter passing mechanisms that are available for UDFs:
typedef ENUM { FUN_value, FUN_reference, FUN_descriptor, FUN_blob_struct,
FUN_scalar_array } FUN_T;
The following query recovers the equivalent values from the system tables.
You can observe that the symbolic names are not exactly the same:
SQL> select rdb$type || '=' || rdb$type_name
CON> from rdb$types
CON> where rdb$field_name = 'RDB$MECHANISM';
======================================
0=BY_VALUE
1=BY_REFERENCE
2=BY_VMS_DESCRIPTOR
3=BY_ISC_DESCRIPTOR
4=BY_SCALAR_ARRAY_DESCRIPTOR
Here's the explanation for each mechanism. Some mechanisms apply to almost
any data type, whereas others are very specific:
- By value: 0: GDML used to use it for parameter passing, but it was
non-portable across different architectures. Deprecated. Only used optionally
for the return argument. Do not try to use it for input arguments. As basic
measure, the DSQL layer doesn't provide a way to declare input arguments by
value but only the return argument.
- By reference: 1: by pointer. Typical of 99% of the UDFs. You get
the equivalent of a pointer in programming languages or the "var"
declaration for procedure parameters in Pascal. You don't need to declare it
since it's the default except for blobs. Besides, there's no way to declare
it explicitly.
- By VMS descriptor: 2: it will be used when you issue BY DESCRIPTOR in FB.
It's the same old DSC structure used by the engine internally. I defined it
as PARAMDSC in ibase.h because I didn't want potential issues with
redefinitions. You can use it to get a blob, but I wouldn't use it, since you
won't get the pointer to the special callback functions needed to work with
blob's segments or blob streams. It has no much sense to use this mechanism with
blobs.
- By ISC descriptor: 3: used to pass blobs.
Since the way to return a blob is to make it an argument, it's used to return
blobs, too. This is a special descriptor for blobs, that uses the BLOB structure.
It's different than the internal BLB structure used by the engine. I defined it
as BLOBCALLBACK in ibase.h a time ago. This struct provides the external entry
points to get and set blob segments and a seek function to operate on a rare
flavor of blob named stream blob. You'll
find mechanism 3 in any current UDF that handles Blobs. It's the default for
blob passing and you don't need and can't declare it explicitly.
- By scalar array descriptor: 4:
I don't know any UDF that currently uses it. It will fill a scalar array
descriptor (SAD, that I didn't define in ibase.h) and will populate it with
values from the array column. Same than with the BLOB/BLB issue, SAD is an
structure to convey the array to the UDF; the engine uses ADS (array descriptor)
internally.
By value and by reference don't entail more discussion. By isc_descriptor is
discussed (although very scarcely) in the engine's documentation, with the only
caveat that the full model that BLOBCALLBACK provides wasn't shown, since the
seek function wasn't acknowledged and the BLOB structure wasn't available in any
public header, even though it was meant to be public. For now, descriptors for
arrays are left out of the discussion. Let's concentrate on VMS descriptors.
VMS descriptors, used thoroughly inside the engine for data and metadata, are
probably one of the original constructions. Internally, the declaration is
referred as DSC and for FB, the non orthodox BY DESCRIPTOR keyword was chosen
for the DSQL layer to be able to declare this mechanism because blobs use by
default a fake descriptor that's used only to interface with UDFs and its the
default that doesn't need declaration and can't be declared, so no clash here.
Furthermore, the 4th type may be implemented in a more descriptive way, for
example, BY ARRAY DESCRIPTOR in the future. Forcing developers to change system
tables by brute force to match the desired declaration for type 3 is calling for
potential problems, hence the BY DESCRIPTOR syntax. GMDL was always able to
declare this mechanism. Again in ibase.h, the following declaration is found:
typedef struct paramdsc {
unsigned char dsc_dtype;
signed char dsc_scale;
ISC_USHORT dsc_length;
short
dsc_sub_type;
ISC_USHORT dsc_flags;
unsigned char *dsc_address;
} PARAMDSC;
In addition, the following special definition is added:
/* Overload text typing information into the dsc_sub_type field.
See intl.h for definitions of text types */
#define dsc_ttype dsc_sub_type
According to Dave Schnepper, this overloading was done to avoid changing the DSC
structure. It's so critical to the engine, that changing it means an ODS
upgrade. Furthermore, Dave envisioned that some day, that structure should get a
separate data member to measure the logical length of string for international
character sets support. That overloading of dsc_sub_type into dsc_ttype is not
valid for blobs, beware! Let's analyze each structure's member separately.
dsc_dtype
This is -with no doubt- the most important member in the structure. It's the
data type of the value being carried. The following declarations apply:
/* Note that dtype_null actually means that we do not yet know the
dtype for this descriptor. A nice cleanup item would be to globally
change it to dtype_unknown. --chrisj 1999-02-17 */
#define dtype_null 0
#define dtype_text 1
#define dtype_cstring 2
#define dtype_varying 3
#define dtype_packed 6
#define dtype_byte 7
#define dtype_short 8
#define dtype_long 9
#define dtype_quad 10
#define dtype_real 11
#define dtype_double 12
#define dtype_d_float 13
#define dtype_sql_date 14
#define dtype_sql_time 15
#define dtype_timestamp 16
#define dtype_blob 17
#define dtype_array 18
#define dtype_int64 19
#define DTYPE_TYPE_MAX 20
Notice that the engine cannot do anything useful is the type is zero, as
Chris Jewell's comment says at the top of the list. Facing code where the value
should be converted or operated with other values, type zero will cause the
engine to complain and stop the request. Same if the type is greater or equal
than DTYPE_TYPE_MAX. Bear in mind that those values are used in run-time and are
completely different than the ones you get by querying system tables:
SQL> select rdb$type || '=' || rdb$type_name
CON> from rdb$types
CON> where rdb$field_name = 'RDB$FIELD_TYPE';
======================================
14=TEXT
7=SHORT
8=LONG
9=QUAD
10=FLOAT
27=DOUBLE
35=TIMESTAMP
37=VARYING
261=BLOB
40=CSTRING
45=BLOB_ID
12=DATE
13=TIME
16=INT64
In fact, there's an internal table at the DSQL layer and a function in the core
engine that make the translation between the two equivalent type
"scales", same as you can measure temperature in Fahrenheit, Celsius
or Kelvin.
One important point here is that you can't declare a procedure parameter or a
table field to be of type CSTRING, but you can declare a UDF argument of type
CSTRING, that's nothing more than the type of string used by C, namely, an array
of char with a null ASCII terminator, unlike Pascal's length-counted string,
where the length goes hidden at the beginning. Whereas CSTRING is not used for
storage, it can be used for UDF handling. The following table shows the
equivalences:
DSC |
RDB$TYPES |
SQL |
dtype_null 0 |
|
|
dtype_text 1 |
14=TEXT |
char |
dtype_cstring 2 |
40=CSTRING |
cstring (only for UDFs) |
dtype_varying 3 |
37=VARYING |
varchar, char varying |
dtype_packed 6 |
|
|
dtype_byte 7 |
|
|
dtype_short 8 |
7=SHORT |
smallint |
dtype_long 9 |
8=LONG |
int, integer |
dtype_quad 10 |
9=QUAD |
|
dtype_real 11 |
10=FLOAT |
float |
dtype_double 12 |
27=DOUBLE |
double precision, long float |
dtype_d_float 13 |
27=DOUBLE |
double precision, d_float (VMS) |
dtype_sql_date 14 |
12=DATE |
date (dialect 3) |
dtype_sql_time 15 |
13=TIME |
time (dialect 3) |
dtype_timestamp 16 |
35=TIMESTAMP |
date (dialect 1), timestamp |
dtype_blob 17 |
261=BLOB |
blob |
dtype_array 18 |
|
|
dtype_int64 19 |
16=INT64 |
numeric(18, s)
|
| |
45=BLOB_ID |
|
There are some cases that aren't mapped to system tables or to SQL types.
Packed and real types seem obsolete, byte may be the foundation for a future
boolean, blob_id appears unused (any field declared as blob really holds only a
blob_id) and dtype_array is a special stream blob that handles arrays of other
types. To have an easy way to test if a field is of type string (char, varchar,
cstring), the following macro is defined internally in dsc.h and may be useful
to copy it to a UDF:
#define dtype_any_text dtype_varying
and the typical test is
if (descriptor->dsc_dtype <= dtype_any_text)
provided that the dsc_dtype is not zero, of course. This works because the three
string types have the lowest consecutive values after zero. Remember that the
UDF gets the values in the DSC column shown above, not the ones in the rdb$types
system table.
dsc_scale
In the era when IB was born around 1985, economy of resources was a must.
Powerful workstation didn't handle more than 2MB or 4MB RAM and hard disks were
expensive, small and slow. Packing information densely was a strong goal and
dsc_scale is an example. It's unused for string types. However, integral types
are used to hold numeric and decimal SQL types. Here, dsc_scale is the negative
number of places reserved for the scale in numeric and decimal types. A
declaration like
numeric(9, 3) means dsc_scale is -3, precision is 9 and so it's easy to get the
remnant for the integral part, precision plus scale. Notice that the precision
is carried implictly and the engine doesn't enforce it. The maximum precision
depends on the underlying storage used by the engine. The following table shows
the possibilities:
| numeric(p, s) or decimal(p, s) |
Storage in dialect 1 |
Storage in dialect 3 |
| p<5 promoted to 4 |
smallint |
smallint |
| 5<=p<10 promoted to 9 |
int |
int |
| 10<=p<18 promoted to 19 |
double precision |
int64 |
If you try MsSql for example, you will find that numeric(3,1) doesn't let you
put more than 3-1=2 digits in the integral part and one digit in the decimal
part. In IB, using the previous table, the maximum value allowed is determined
by the underlying storage and for such declaration it's smallint that goes up to
32767, so
numeric(3,1) will allow 4-1=3 digits in the integral part and for some values, 4
digits provided that you don't go above 32767. Extra decimal places are ignored.
You can also deduce from the table that dialect 1 uses inexact storage for
precisions above 9 but dialect 3 solves that problem. This happens at
declaration time: if the field was declared with precision 15 from dialect 1, it
will be always double precision, even after backup and restore. It is not
affected by changing the dialect of the database after the field creation. The
only way to get a true int64 storage is to set the database dialect to 3,
connect with a dialect 3 client and define the field.
The second usage for dsc_scale is for blobs: since the other members of the
structure are already filled with information, there's no place to put the
character set of the blob. Being the scale a signed char, it's enough, provided
that there aren't charset identifiers above 127. Currently, the maximum charset
id is 57.
dsc_length
This is the raw length of a data type. It's applied to all data types. For
string char types, it's the length of the declared field; for varchar, it's the
length of the declared field plus the size of the header to hold the effective
length (currently an unsigned short, hence two bytes on 32-bit platforms); for
cstring, it's not clear if the null ASCII terminator is counted but
DSC_make_descriptor, a widely used function inside the engine, doesn't count it.
For integral and double precision types, it's the length of the underlying
storage, typically the result of the sizeof operator in C. A quad is a
combination of a signed and an unsigned long integer, so it's equivalent in
storage to int64; same history for timestamp. Date-only needs a signed long and
time-only needs an unsigned long. For blobs, the length is only the length of
the blob-id field. This is a type of quad, therefore it uses always two long
integers. The field holding the blob only has the blob id. The zone for a blob
inside a page holds the contents.
For the developer, only the length of string fields is important, since it's
variable. For the rest of the fields, they match the sizes used by the compiler
in the client application. One special case is the blob, that provides the
length in the BLOB structure. There are official and extra information on the
commonly used isc_descriptor mechanism for blob passing to UDFs, that includes
the number of segments, the largest segment's size and the total effective
blob's length. The structure is documented as blobcallback in ibase.h due to the
callback functions it makes available to the UDF.
Unlike char and cstring, varchar fields are length-counted strings. In other
words, the engine tracks the exact length of the string, no right padding with
blanks up to the declared length like the char type. This is why the following
helper structure has been declared in ibase.h:
typedef struct paramvary {
ISC_USHORT vary_length;
unsigned char vary_string [1];
} PARAMVARY;
Again, the clumsy name is to avoid any potential clash with the original
structure (named vary) if some change is made. The dsc_length member counts the
header plus the effective length. Hence, the safest calculation for a UDF that
receives a varchar is to take the minimum between dsc_length minus
sizeof(vary_length) and the value of vary_length. In practical terms, dsc_length
and vary_length are unsigned short integers, so you the size of vary_length is
2. The maximum allowed length of char fields is 32767 and this gives 32765 for
varchar fields.
dsc_sub_type
For date/time fields, it's unused. For smallint, int and int64,
it's zero if one of those types was used in the field definition, one for
numeric and two for decimal. This is an improvement over IB5 that made
impossible to distinguish between
numeric(3,0) and smallint, since the latter is the underlying storage of the
former. In IB5, other cases could be distinguished by looking directly at the
scale (that doesn't help if the scale is zero), but now a formal behavior has
been implemented.
For string fields, the aforementioned overloading applies:
#define dsc_ttype dsc_sub_type It means that dsc_ttype is a
synonym for dsc_sub_type and is used to keep the text type. This is a
combination of the character set and the collation and is achieved with the
following macros in intl.h:
#define INTL_ASSIGN_DSC(dsc, cs, coll) \
{ (dsc)->dsc_sub_type = (SSHORT) ((coll) << 8 | (cs)); }
#define INTL_GET_CHARSET(dsc) ((SCHAR)((dsc)->dsc_sub_type & 0x00FF))
#define INTL_GET_COLLATE(dsc) ((SCHAR)((dsc)->dsc_sub_type >> 8))
Basically, those macros get and set the charset and collation for string fields.
It becomes evident that the dsc_sub_type, a signed char, uses its high order
byte to hold the collation and its low order byte to hold the charset. If a UDF
needs the specific charset (for MBCS operations, for example), it should use
functionality that resembles those macros.
For blobs, the sub type is really the blob's sub type. Again, the values are
present in the system tables:
SQL> select rdb$type || '=' || rdb$type_name
CON> from rdb$types
CON> where rdb$field_name='RDB$FIELD_SUB_TYPE';
======================================
1=TEXT
2=BLR
3=ACL
4=RANGES
5=SUMMARY
6=FORMAT
7=TRANSACTION_DESCRIPTION
8=EXTERNAL_FILE_DESCRIPTION
The default sub type for a blob is zero, that's a binary blob (no symbolic
name). When declaring a blob field, the number or the symbolic name (if
available) can be used. Text is the sub type for memos or very large varchar-like
fields. The rest of the named sub types is for internal use. Interestingly, type
format is used to keep an array of DSC structures. Positive numbers are reserved
for the system, but being dsc_sub_type a signed short integer, it allows any
negative value to be used by an application. Remember, the engine doesn't
enforce the sub type of a blob. It's a mere indication for the application. Some
extra checks may be made internally (for example, considering the charset only
for sub_type text) but the engine doesn't check that the format of the
information kept in a blob corresponds to the declared blob's sub type. Also,
the sub type is useful to allow the correct blob filter to be selected and
executed internally. Blob filters are a variety of UDFs that convert from one
blob's sub type into another.
Blobs only have character sets, not collations. Hence, the charset of a blob is
kept into dsc_scale instead, since this a scale doesn't make sense for a blob.
See the previous discussion about dsc_scale.
dsc_flags
Currently, the flags have limited use. These are the accepted values:
/* Note: DSC_null is only reliably set for local variables (blr_variable) */
#define DSC_null
1
#define DSC_no_subtype 2 /* dsc has no sub type specified */
#define DSC_nullable 4 /* not stored. instead, is derived
from metadata primarily to flag
SQLDA (in DSQL) */
The only field that may be important to the UDF is DSC_null, an indication that
there's no value, but NULL in the parameter being examined. This is one of the
advantages of using descriptors instead of pointers to values (although the
reference mechanism could have achieved the same effect by passing null pointers
to signal SQL nulls). Those flags may come bitwise or'ed from the engine, so use
descriptor->dsc_flags & DSC_null
to test for null. Anyway, I've found that the engine is inconsistent to signal
nulls, at least when UDFs are invoked. This may come from the fact that the
engine has a null flag in the internal request itself, so it doesn't rely
absolutely in the descriptor's flags. To make the UDF crash-proof, it needs to
do the following check
if (descriptor->dsc_flags & DSC_null || !descriptor->dsc_length
|| !descriptor->dsc_address)
assume the parameter is null
so flexibility comes at a cost sometimes. To return NULL, you only need to
return a null pointer. If the UDF uses the RETURNS PARAMETER <n> syntax,
you should activate the null flag with
descriptor->dsc_flags |= DSC_null
because the engine doesn't pay any attention to the return value in such case;
instead it examines the input argument that was marked as the return
argument by its position.
dsc_address
Although it's declared as a pointer to unsigned char, it can convey any data
type that the engine knows about, requiring a hard type cast, depending on
dsc_dtype. It could have been declared as a void pointer (un-typed pointer in
Pascal). I can speculate that one reason is that the void pointer wasn't an
original construction in C, so probably not all compilers supported it. In other
constructions, a pointer to integer was used (the blobcallback structure that
appear in ibase.h is an example where void pointers replaced the integer
pointers in the original declaration inside the engine). The advantage of a
pointer to char is that it allows direct pointer arithmetic to access a location
N bytes ahead without any extra cast. Looking at dsc_dtype, the pointer should
be forced to the appropriate pointer to type and then dereferenced to get the
value. The value should come aligned already.
In the explanation about dsc_length, it was said that the varchar type receives
special treatment. This is because it holds the effective length of the string,
without more trailing blanks than the ones the user inserted. Therefore,
dsc_address is not the address of the data, but the address of the header. One
solution is to cast dsc_adress to this declaration already explained:
typedef struct paramvary {
ISC_USHORT vary_length;
unsigned char vary_string [1];
} PARAMVARY;
It was also said that the effective string length is
paramvary* v = reinterpret_cast<paramvary*>(descriptor->dsc_address);
ISC_USHORT len = v->vary_length;
len = min(descriptor->dsc_length - sizeof(ISC_USHORT), len);
unsigned char *str = v->vary_string;
At least this is how the engine guarantees internally that it won't read outside
the valid bounds of the varchar field. Remember that this string isn't null
terminated, since the length is given separately. Same for char that only
requires dsc_length directly. The other solution is to create a function that
works directly on the original field:
ISC_USHORT len = *reinterpret_cast<ISC_USHORT*>(descriptor->dsc_address);
len = min(descriptor->dsc_length - sizeof(ISC_USHORT), len);
unsigned char *str = descriptor->dsc_address + sizeof(ISC_USHORT);
Both ways are equivalent. Choose the one that looks more elegant for you. I
think that there cannot be problems with the short integer members of the DSC
structure between different platforms than handle the most significative byte in
different position (the known problem of little endian v/s big endian whose name
comes from the Gulliver's travels). The reason is that the UDF should be
installed in the same machine than the engine and so they should share the HW
architecture. There's no way to have a remote UDF installed on another machine,
hence no HW mismatch can happen. If you need the UDF for another platform where
the engine runs, you should recompile it and again they use the same platform.
Caveat.
- Unlike most internal declarations where the uppercased name denotes
pointer to the structure of the same name in lowercase, DSC is the structure
itself, so your UDF needs to declare DSC pointers for its arguments. The
declaration in ibase.h follows the mimics declaration.
- There's an hard coded limit of 10 arguments by UDF. This limit includes
the return argument, so in practice you have 9 input argument. It should be
enough for most users; several UDFs use only one or two input arguments.
- There's only one return argument and it's mandatory. If you don't need it,
declare it as smallint or int with the "by value" mechanism and
return zero and one to signal success or failure, for example.
- When the engine realizes that a parameter is using descriptors, it doesn't
pay much attention to the formal parameter declaration (the data type) and
pass whatever it has to the UDF. This is why FBUDF can handle different
integral types with more than one declaration and only two functions. Had
only one declaration existed, both numbers and string would have been passed
to the same UDF and it would be a mess, since returning a string from the
UDF requires special code, same as with traditional UDFs.
- The hack that allows returning a value by descriptor didn't exist in IB
and is only a hack. It doesn't allow you to change the return type. If you
get an int64 in a function expecting int and return it, even if you detect
it properly in the function, it will overflow silently when the engine takes
the address of the returned descriptor and applies the (fixed) declared
return data type. There should be a method to allow the UDF to signal
overflow. One workaround is to set the dsc_dtype to zero so the engine
complains, but if it's standardized to signal overflow, it should complain
with the proper message.
- UDFs that have division by zero will cause Super Server to be terminated.
The reason is that Super Server decides to finish itself when a UDF executes
a failed operation due to the risk for the data. Failed operations include
division by zero, some interrupts, invalid memory access and other operating
system specific exceptions. Why division by zero was considered so dangerous
is a thing I ignore.
|