Saturday, November 23, 2013

Reporting in LedgerSMB 1.4: Part 5, Conclusions

I hope many of you have enjoyed this series.  We've tried hard to avoid inner platform syndrome here by making reporting something that a developer does.

Skills Required to Write Basic Reports


The developer skills required to write reports tend to fall on the database side.  In general one should have:

  1. A good, solid understanding of SQL and PL/PGSQL in a PostgreSQL environment.  This is the single most important skill and it is where most of the reporting effort lies.
  2. Basic understanding of Perl syntax.  Any basic tutorial will do.
  3. A basic understanding of Moose.  A general understanding of the documentation is sufficient, along with existing examples.
  4. A very basic understanding of the LedgerSMB reporting framework as discussed in this series.
These skills are enough for general tabular reports, and they allow you to build reports that can be output in HTML, CSV, ODS, and PDF formats.

Skills Required to Write More Advanced Reports


For more advanced reports, such as new financial statements, government forms, and the like, the following additional skills are typically required.  These are not fully discussed in this series:

  1. An in-depth understanding of our HTML elements abstraction system (this will be discussed in a future post here)
  2. A general proficiency with Template Toolkit, possibly including the LaTeX filter for running portions of the template through LaTeX.

Strengths


The reporting framework here is very database-centric.  That is unavoidable: the data resides in the database, and some knowledge of the database is required to get the data out in a usable form.  We have tried hard to build a system where only minimal knowledge elsewhere is required.  If you have db folks who work with your finance folks, they can write the reports.

Weaknesses


Ad hoc reporting is outside the scope of this framework; it is not well suited to one-off questions.  Additionally, the framework generates reports as documents that can be shared.  Updating the data requires running the report again, and while a report can be re-run from a sharable URL, that is not necessarily ideal for all circumstances.

Other Reporting Options


In cases where this reporting framework is not ideal, there are a number of other options available:

  1. Views can be made which can be pulled in via ODBC into spreadsheets like Excel.
  2. Third party report engines like JasperReports can be used instead, and
  3. One-off SQL queries in psql can be used to generate HTML and (in the most recent versions) LaTeX documents that can be shared (a sketch follows below).
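
For the third option, here is a minimal sketch of what a one-off psql report might look like.  The output file name and the query are purely illustrative (the tables borrowed here appear in the GL report function elsewhere in this series):

\pset format html
\o /tmp/account_activity.html
-- Any query will do; this one summarizes activity per account.
SELECT c.accno, c.description, sum(ac.amount) AS balance
  FROM acc_trans ac
  JOIN account c ON c.id = ac.chart_id
 GROUP BY c.accno, c.description
 ORDER BY c.accno;
\o
-- \pset format latex produces a LaTeX table instead.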

Wednesday, November 20, 2013

Writing Reports in LedgerSMB 1.4: (Mostly-) Declarative Perl Modules

So far we have talked about getting data in and interacting with the database.  Now we will talk about the largest of the modules, and cover workflow scripts as they relate to this stage.

At this point you would have a filter screen, a user-defined function which takes its arguments from that screen's inputs (usually prefixed with 'in_' to avoid column name conflicts), and a tabular data structure you expect to return.

As a note here, all the code I am trashing here is my own, in part because I have learned a lot about how to code with Moose over the course of 1.4 development.

In your workflow script you are likely to need to add the following:

use LedgerSMB::Report::MyNewReport;

and

sub run_my_new_report {
    my ($request) = @_;
    LedgerSMB::Report::MyNewReport->new(%$request)->render($request);
}

That's all you need in the workflow script.

Overall Structure and Preamble


The actual Perl module basically defines a number of parameters for the report, and LedgerSMB::Report provides a general framework to cut down on the amount of code (and knowledge of Perl) required to write a report.  At a minimum, however, we must define the inputs (if any), the output structure, and how to populate that structure.  We can also define buttons for further actions on the workflow script; the same workflow script would then have to handle the button's actions.

Typically a report will start with something like this (of course MyNewReport is the name of the report here):

package LedgerSMB::Report::MyNewReport;
use Moose;
extends 'LedgerSMB::Report';
with 'LedgerSMB::Report::Dates'; # if date inputs used, use standard defs

This preamble sets up the basic reporting framework generally along with all the features discussed below.  If you need to handle numeric input or secondary dates you will want to change:

with 'LedgerSMB::Report::Dates';

to
 
with 'LedgerSMB::Report::Dates', 'LedgerSMB::MooseTypes';

so that you can use type coercions for numeric and/or date fields (for processing localized formats and the like).

Defining Inputs


Inputs are defined as report properties.  Usually you want these properties to be read-only because you want them to correspond to the report actually run.  You can use the full capabilities of Moose to restrict inputs; typically you will restrict by type and possibly coerce as well (at least when using the types defined in LedgerSMB::MooseTypes).

When including the following line you do not have to define the date_from and date_to inputs:

with 'LedgerSMB::Report::Dates';

Typically our convention is to document inputs inline with POD.  While this is (obviously) not necessary for the functioning of the report, it is helpful for future maintenance and highly recommended.  It is also worth noting in the POD how a match is made (where applicable this should be documented on the SQL side too, in a COMMENT ON statement, so that assumptions about the API contract are easy to check).
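
On the SQL side, such a comment might look like the following sketch.  It reuses the report__gl signature shown in the stored-procedure post of this series; the exact wording is illustrative only:

COMMENT ON FUNCTION report__gl
(text, text, char(1), text, text, text, date, date, bool, numeric, numeric, int[])
IS 'Returns GL report lines.  in_reference matches as a prefix, in_source
and in_memo match as substrings, in_description is a full-text match, and
the date and amount arguments are inclusive range bounds.';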

For example, from the GL report:

=item amount_from

The lowest value that can match, amount-wise.

=item amount_to

The highest value that can match, amount-wise.

=cut

has 'amount_from' => (is => 'rw', coerce => 1,
                     isa => 'LedgerSMB::Moose::Number');
has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Those lines demonstrate the full power of Moose in the definition.  One obvious thing that will be fixed in beta is making these attributes read-only (is => 'ro'); they are currently read-write, and there is no reason for that.

In LedgerSMB::Report::PNL you see the following optional string input defined:

=item partnumber

This is the control code of the labor/overhead, service, or part consumed.

=cut

has partnumber => (is => 'ro', isa => 'Str', required => 0);


The POD here would probably be improved by mentioning that the partnumber is an exact match, but this shows how to designate a read-only, optional string input.

If an input is not listed, it won't be passed on to the stored procedure.  It is critical that all inputs are defined whether using standard modular definitions (LedgerSMB::Report::Dates) or explicit ones.  If an input is being ignored this is one of the first places to check.  Additionally note that because of other aspects of the reporting, it is not currently possible to use strict or slurpy constructors in any sane way.  It is likely we will build our own constructor handling in the future, but currently this is a hard limitation.

Input Definition Antipatterns


There are a few things which someone who has not worked with Moose before is likely to do in this area.  Many of these are relatively harmless in the web interface because of a number of failsafes, but if you ever want to re-use the code in a more stateful environment you will have difficulties.  The examples given are, alas, my own code, but I have the benefit of being a newcomer to Moose here, so the lessons are fresh in my mind, or rather my codebase.

The first is the use of read-write inputs.  A report's output is closely bound to its inputs, so read-write inputs allow the application to misrepresent the report.  The example I gave above is:

has 'amount_to' => (is => 'rw', coerce => 1,
                   isa => 'LedgerSMB::Moose::Number');


Now this allows the application to do something like this:

my $report = LedgerSMB::Report::GL->new(%request);
$report->run_report();
$report->amount_to('10000000');
$report->render; 

The above misrepresents the report: it claims to include a bunch of transactions that may, in fact, have been excluded.  This is no good.  On the other hand, if amount_to were read-only (is => 'ro'), the above code would throw an error instead.

The second major anti-pattern is in the use of Maybe[] as an alternative to required => 0.  For example see the following:

has 'approved' => (is => 'rw', isa => 'Maybe[Bool]');

Oh the joys of looking at code I wrote that is in need of rewrite....  Not only do we have a read-write input, but it is maybe boolean (i.e. true, false, or undef).

Now, this appears to work because undef is passed as NULL to the database, and the same is achieved by the more proper:

has approved => (is => 'ro', isa => 'Bool', required => 0);

The difference is that we will not accept as input a case where $request->{approved} = undef has been set.  Our query handlers drop empty inputs so there is no case where this should happen.  Additionally, this prevents unsetting the attribute after running the report and thus decoupling output from purported input.

Defining Report Structure


Report structure is defined using a series of functions which are overridden by actual reports.  Some of these functions are optional and some are not.  The required ones are covered first.

There are three required functions, namely columns, header_lines, and name.  These are expected to return very specific data structures, but function in a largely declarative way.  In other words, the functional interface effectively defines them as pseudo-constant (they are not fully constant because they are expected to return the localized names).

In all cases, LedgerSMB::Report::text() can be used to translate a string into its local equivalent (assuming localized strings in the .po files).

The columns function returns an arrayref of hashrefs, each of which is a column definition for our "dynatable" templates.  The following are required:

  • col_id --- the name of the row field to use
  • type --- the display type of the field (text, href, checkbox, hidden, etc)
  • name --- localized header for the column
The following are conditionally required or optional:
  •  href_base --- the base of the href. To this is appended the row_id (see below).  Only used by href columns, and then required.
  • pwidth --- Used for width factors for PDF reports.
Here's an example of a columns function for a very simple report (which just lists all SIC codes in the system):

sub columns {
    return [
      { col_id => 'code',
          type => 'href',
     href_base => 'am.pl?action=edit_sic&code=',
          name => LedgerSMB::Report::text('Code'), },

      { col_id => 'description',
          type => 'text',
          name => LedgerSMB::Report::text('Description'), }
    ];
}


In most reports, the columns function is much longer.

The header_lines function provides an arrayref of hashrefs, for displaying inputs on the report.  To this, the reporting engine adds the name of the report and the database connected to.  If you want no header lines added, you can just return an empty arrayref:

sub header_lines { return []; }

In many cases however, such inputs should be displayed.  Each hashref has two components:

  • name is the name of the input
  • text is the text label of the input.
Here's a more useful example from LedgerSMB::Report::GL:

 sub header_lines {
    return [{name => 'from_date',
             text => LedgerSMB::Report::text('Start Date')},
            {name => 'to_date',
             text => LedgerSMB::Report::text('End Date')},
            {name => 'accno',
             text => LedgerSMB::Report::text('Account Number')},
            {name => 'reference',
             text => LedgerSMB::Report::text('Reference')},
            {name => 'source',
             text => LedgerSMB::Report::text('Source')}];
}


Finally name() returns the localized name of the report.  This is usually a very simple function:

sub name {
    return LedgerSMB::Report::text('General Ledger Report');
}


Additionally there are two optional functions, buttons and template, which allow additional flexibility.  These are rarely used.

The template function overrides our dynatable-based template as the template to use.  This is used mostly in financial statements but is not used in the trial balance, or other fully tabular reports.

If you use it, just return the path to the template to use.

sub template { return 'Reports/PNL' }

Our PNL reporting module has additional features which are beyond the scope of this post.

Finally buttons returns a list of buttons to be included on the report.  These follow the element_data format of UI/lib/elements.html and are used to add HTML form callbacks to the report.  Here's an example:

sub buttons {
    return  [{
         text => LedgerSMB::Report::text('Add New Tax Form'),
         name => 'action',
         type => 'submit',
         class => 'submit'
    }];
}


How Columns are Selected


The columns to display are dynamically selected according to the following rules:

  • If no column selection criteria are found, then all columns are shown
  • If the render() method is called with a hashref that includes a member named after the column ID prefixed with 'col_', then that column is shown and columns not so selected are hidden.
What this means is that if you want a baseline of columns which always show, you will typically define inputs for those columns in the $request object before passing it through.  Otherwise you will usually allow selection of columns in the filter screen using inputs named as above (i.e. an 'id' field would have a selector named 'col_id').

The run_report() Function


The run_report function populates the rows of the report.  It must exist and must set $self->rows(\@rows) at the end.  This is the only portion where specific knowledge of programming in Perl is particularly helpful.  However, assuming nothing more than a little knowledge, here is a basic template from the SIC listing:

sub run_report{
    my ($self) = @_;
    my @rows = $self->exec_method(funcname => 'sic__list');
    for my $row(@rows){
        $row->{row_id} = $row->{code};
    }
    $self->rows(\@rows);
}


Going through this line by line:

my ($self) = @_;

The first line of the function body is boilerplate here.  It is possible to accept the $request object here as a second parameter but not typical unless you have very specific needs for it.  In that case, you simply:

my ($self, $request) = @_;

Remember that in Perl, @_ is the argument list.

my @rows = $self->exec_method(funcname => 'sic__list');

This line says "run the database function named 'sic__list' and map the report's inputs to the function's arguments."  You would typically just copy this line and change the function name.

for my $row(@rows){
    $row->{row_id} = $row->{code};
}

If you have any hyperlinks in the report, it is necessary to set a row_id so that they can be properly handled.  For each row, the row_id is appended to href_base from the column definition to form the link.  It is possible to override this on a per-column basis, but that's beyond the scope of this introduction.

 $self->rows(\@rows);

This assigns the rows returned to the report's rows().  Currently this is handled as a read/write property of reports, but in the long run this will probably be changed so that programs cannot override it after running the report.

Ending the File


Always end report classes with the following line

__PACKAGE__->meta->make_immutable;

This improves performance and ensures that no more attributes can be added to your report class dynamically.  There are cases where that behavior might be undesirable, but they fall outside the reporting use case.

Others in series:

  1. Introduction
  2. Filter Screens
  3. Best Practices regarding Stored Procedures
  4. (this piece)
  5. Conclusions

Tuesday, November 12, 2013

On CPAN, Community, and P: A Case Study in What Not to Do

I am going to try to do this piece as respectfully as I can.  I understand people put a lot of work into developing things and they submit them, and when they get panned, it can be difficult.  At the same time, community resources are community resources and so a failure to conduct such case studies in things gone amiss can lead to all kinds of bad things.  Failure to get honest feedback can lead to people not improving, but worse, it can leave beginners sometimes mistakenly believing that bad practices are best practices.  There is also a period of time after which bad practices become less easily excused. 

So somewhat reluctantly I am going to undertake such a study here.  This is solely about community interfacing.  I am not going to critique code.  Rather I would hope that this post can be a good one regarding understanding some of the problems regarding community interfaces generally, whether CPAN, PGXN, or others.  The lessons apply regardless of language or environment and the critiques I offer are at a very different level than critiques of code.

So with this, I critique the P CPAN module from a community coding perspective.  This module exports a single function called "P" which acts kind of like printf and sprintf.  It would be an interesting exercise in learning some deep aspects of Perl but from a community resource perspective, it suffers from enough issues to make it a worthy case study.

The gist of this is that community resources require contemplating how something fits into the community and working with the community in mind.  A cool idea, or something one finds personally useful, is not always a candidate for publishing as a community resource, at least not without modifications aimed at carefully thinking through how it fits into more general processes.

Four of my own CPAN modules are actually rewrites of code that I wrote in other contexts (particularly for LedgerSMB), and rewrote specifically for publication on CPAN.  In general there is a huge gulf between writing a module for one project or one developer and writing it for everyone.  I believe, looking at P, that it is just something the developer thought was useful personally and published it as is without thinking through any of these issues.  This is all too common and so going through these I hope will prevent too many from making the same mistakes.

Namespace Issues


The name 'P' is an extraordinarily bad choice for a public module.  Perl uses nested namespaces, and nesting implies a clear relationship, such as inheritance (though other relationships are possible too).

Taking a top-level namespace is generally discouraged on CPAN where a second or third level namespace will suffice.  There are times and places for top-level namespaces, for example for large projects like Moose, Moo, and the like.  In general these are brand names for conglomerates of modules, or they are functional categories.  They are not shorthand ways of referring to functionality to save typing.

'P' as a name is not helpful generally, and moreover it denies any future large project that namespace.  The project is not ambitious enough to warrant a top-level namespace.  There is no real room for sub-modules and so there are real problems with having this as a top-level module.

Proper Namespacing


It's helpful, I think, to look at three different cases for how to address namespacing.  All three of these are ones I maintain or am planning to write.  I believe they generally follow acceptable practices, although I have received some criticism for PGObject being a top-level namespace as well.

  • PGObject is a top-level namespace housing, currently three other modules (perhaps ten by next year).  I chose to make it a top-level namespace because it is a framework for making object frameworks, and not a simple module.  While the top-level module is a thin "glue" module, it offers services which go in a number of different directions, defying simple categorization. 

    Additionally the top-level module is largely a framework for building object frameworks, which complicates the categorization further.  In this regard it is more like Moose than like Class::Struct.  Sub-modules include PGObject::Simple (a simple framework using PGObject, not a simple version of PGObject), PGObject::Simple::Role, and PGObject::Type::BigFloat.
  • Mail::RoundTrip is a module which allows web-based applications to request email verification by users.  The module offers only a few pieces of functionality and is not really built for extensibility.  This should not be a top-level module.
  • Device::POS::Printer is a module I have begun to write for point of sale printers, providing a basic interface for printing, controlling cash drawers, getting error messages, etc.  The module is likely to eventually have a large  number of sub-modules, drivers for various printers etc, but putting Device:: in front does no real harm and adds clarity.  There's no reason to make it a top-level namespace.

The main point is thinking about how your module will fit into the community, how it will be found, and so on.  'P' is a name which suggests these things have not been considered.

Exports

The P module exports a single function, P(), which functions like printf and sprintf.  The major reason for this, according to the author, is both to add additional checking and to save typing.  Saving typing is not a worthy goal by itself, though neither is verbosity.  Condensing a function which takes over the roles of two different functions into a single letter, however, is not a recipe for good, readable code.  If others follow suit, you could get code like this:

P(A(R("This is a string", 3)));

Now maybe this code is supposed to print the ASCII representation of "This is a string" repeated three times.  However that is not obvious from the code, leading to code that is hard to read or debug.

Proper Exports 


In Perl, exports affect the language.  Exports are thus to be used sparingly as they can lead to conflicts which can lead to hard to maintain code.  Exports should be rare, well documented, and not terribly subject to name collision.  They should also be verbose enough they can be understood without tremendous prior knowledge of the module.  P() as an exported function meets none of these criteria.

A good example of exports done right would be a function like has() used by Moose, Mouse, and Moo.  The function is exported and used to declaratively define object properties.  The convention has become widespread because it is obvious what it does.  Again this does not matter so much for personal projects, but it does for published modules on a community repository.

Test Failure Report Management


The CPANTesters page for P shows that every version on CPAN has had test failures.  This is unusual.  Most modules have clear passes most of the time.  Even complex modules like DBD::Pg show a general attention to test failures and a relatively good record.  A lack of this attention shows a lack of interest in community use, and suggests that fixes to test failures, which people need in order to use the library, are just not a priority.  So if you manage a module, you really want to take every test failure seriously.


Conclusions


Resources like CPAN, CTAN, PGXN, and the like are subject to one important rule: just because something is good for your own personal use does not make it appropriate for publication as a community resource.  Writing something that fits the needs of a specific project, or a specific coder's style, is very different from writing something that helps a wide range of programmers solve a wide range of problems.  These community resources are not places to upload things just because one wrote them.  They are places to interface with the community through work.  Getting pre-review and post-review, and paying attention to the needs of others, is critically important.

Monday, November 11, 2013

Writing Reports in LedgerSMB 1.4, part 3, Stored Procedure Conventions and Best Practices

The stored procedures/UDFs form the heart of the LedgerSMB reports. Everything else basically is there to handle input to the stored procedure and output from it.  It's for this reason that the stored procedures are covered before the Perl modules.

The stored procedures follow the conventions and design principles laid out in the very first post on this blog.  Arguments are prefixed with 'in_' and given consistent names from report criteria.  Anyone wanting to write reports should start with that post in terms of understanding where to start.

Additionally there are a number of standardized conventions for common elements.  The most important of these currently is that many reports require a date range, and the arguments here are standardized as in_date_from and in_date_to (corresponding to date_from and date_to on the application side).  The conventions are set up on the Perl side, so they will be covered in the next section.

In general, there are a number of do's and dont's associated with these procedures.

1.  Do provide a default ordering that is useful.

2.  Don't provide windowing functions that depend on an ordering that could be overridden.  Running totals should not be implemented here.  As a good general rule, if you use ROWS UNBOUNDED PRECEDING in one of these functions, that's a red flag.

3.  Do try to have one central query with, possibly, some small support logic.

4.   Do define your types so several different reports can share the same output type.

5.  Do write test scripts to test your reporting functions.  Use transactions that roll back (a sketch appears after the example function below).

6.  Do use plpgsql instead of sql if you have more than a few arguments.  Named arguments are easier to read.  Once we can drop support for 9.2 and lower this will cease to be an issue, though.

An example:

CREATE OR REPLACE FUNCTION report__gl
(in_reference text, in_accno text, in_category char(1),
in_source text, in_memo text,  in_description text, in_from_date date,
in_to_date date, in_approved bool, in_from_amount numeric, in_to_amount numeric,
in_business_units int[])
RETURNS SETOF gl_report_item AS
$$
DECLARE
         retval gl_report_item;
         t_balance numeric;
         t_chart_id int;
BEGIN

IF in_from_date IS NULL THEN
   t_balance := 0;
ELSIF in_accno IS NOT NULL THEN
   SELECT id INTO t_chart_id FROM account WHERE accno  = in_accno;
   t_balance := account__obtain_balance(in_from_date ,
                                       (select id from account
                                         where accno = in_accno));
ELSE
   t_balance := null;
END IF;


FOR retval IN
       WITH RECURSIVE bu_tree (id, path) AS (
            SELECT id, id::text AS path
              FROM business_unit
             WHERE parent_id is null
            UNION
            SELECT bu.id, bu_tree.path || ',' || bu.id
              FROM business_unit bu
              JOIN bu_tree ON bu_tree.id = bu.parent_id
            )
       SELECT g.id, g.type, g.invoice, g.reference, g.description, ac.transdate,
              ac.source, ac.amount, c.accno, c.gifi_accno,
              g.till, ac.cleared, ac.memo, c.description AS accname,
              ac.chart_id, ac.entry_id,
              sum(ac.amount) over (rows unbounded preceding) + t_balance
                as running_balance,
              compound_array(ARRAY[ARRAY[bac.class_id, bac.bu_id]])
         FROM (select id, 'gl' as type, false as invoice, reference,
                      description, approved,
                      null::text as till
                 FROM gl
               UNION
               SELECT ar.id, 'ar', invoice, invnumber, e.name, approved, till
                 FROM ar
                 JOIN entity_credit_account eca ON ar.entity_credit_account
                      = eca.id
                 JOIN entity e ON e.id = eca.entity_id
               UNION
               SELECT ap.id, 'ap', invoice, invnumber, e.name, approved,
                      null as till
                 FROM ap
                 JOIN entity_credit_account eca ON ap.entity_credit_account
                      = eca.id
                 JOIN entity e ON e.id = eca.entity_id) g
         JOIN acc_trans ac ON ac.trans_id = g.id
         JOIN account c ON ac.chart_id = c.id
    LEFT JOIN business_unit_ac bac ON ac.entry_id = bac.entry_id
    LEFT JOIN bu_tree ON bac.bu_id = bu_tree.id
        WHERE (g.reference ilike in_reference || '%' or in_reference is null)
              AND (c.accno = in_accno OR in_accno IS NULL)
              AND (ac.source ilike '%' || in_source || '%'
                   OR in_source is null)
              AND (ac.memo ilike '%' || in_memo || '%' OR in_memo is null)
             AND (in_description IS NULL OR
                  g.description
                  @@
                  plainto_tsquery(get_default_lang()::regconfig, in_description))
              AND (transdate BETWEEN in_from_date AND in_to_date
                   OR (transdate >= in_from_date AND  in_to_date IS NULL)
                   OR (transdate <= in_to_date AND in_from_date IS NULL)
                   OR (in_to_date IS NULL AND in_from_date IS NULL))
              AND (in_approved is false OR (g.approved AND ac.approved))
              AND (in_from_amount IS NULL OR ac.amount >= in_from_amount)
              AND (in_to_amount IS NULL OR ac.amount <= in_to_amount)
              AND (in_category = c.category OR in_category IS NULL)
     GROUP BY g.id, g.type, g.invoice, g.reference, g.description, ac.transdate,
              ac.source, ac.amount, c.accno, c.gifi_accno,
              g.till, ac.cleared, ac.memo, c.description,
              ac.chart_id, ac.entry_id, ac.trans_id
       HAVING in_business_units is null or in_business_units
                <@ compound_array(string_to_array(bu_tree.path, ',')::int[])
     ORDER BY ac.transdate, ac.trans_id, c.accno
LOOP
   RETURN NEXT retval;
END LOOP;
END;
$$ language plpgsql;


The above is the function that provides the GL report and search.  It is a monster query, and we could use RETURN QUERY, but for older versions this approach seems more reliable.  Ideally, the inline view would be moved to a formal one, along with a few other tweaks, but it works well enough, and despite the length and the inline view it is not hard to debug.
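
As point 5 in the list above suggests, a function like this can be exercised by a test script wrapped in a transaction that rolls back.  Here is a minimal sketch; the argument values are illustrative, and a real test would first insert known fixture rows (omitted here because the exact columns depend on your schema):

BEGIN;
-- Call the report function with mostly-NULL criteria and check that it
-- runs and returns a sane row count; real tests would compare against
-- fixtures inserted earlier in this same transaction.
SELECT count(*) AS row_count
  FROM report__gl('test-1', NULL, NULL, NULL, NULL, NULL, NULL, NULL,
                  false, NULL, NULL, NULL);
ROLLBACK;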

Others in series:

Part 1:  Overview

Part 2:  Filter Screens

Part 4:  Perl Modules

Part 5:  Conclusions

Friday, November 8, 2013

Interlude: A cool PL/PGSQL trick

I will be getting back to report writing in LedgerSMB next week or the week after; in the meantime I am hoping to get an open source, PostgreSQL-based point of sale project to a minimally useful version.

Meanwhile, here's a useful trick I discovered while trying to solve a problem.  The basic problem is one of Object-Relational design, in the sense that we may want to use types as classes and therefore return a tuple which is guaranteed to be of the type defined by one table or view.

However, we may want to override some of the values in this tuple with values from another table.

In older versions of PostgreSQL, I seem to recall it was possible to add additional columns onto the end of a result which would be silently ignored on return.  This doesn't work in current versions and isn't very safe anyway.  A better way is to work with composite tuples, where each member is a tuple you need for processing.

So for example, suppose we have a table which contains pricing information, and a second table which contains temporary pricing information for a given customer, and these have different structures.  Suppose, for the sake of argument, we have tables like this:

CREATE TABLE goods (
   id serial not null unique,
   sku text primary key,
   sell numeric not null,  -- sell price
   buy numeric not null  -- purchase price
);

CREATE TABLE pricematrix (
   customer_id int not null references customer(id),
   goods_id int not null references goods(id),
   valid_from date,
   valid_to date,
   sell numeric not null
);

CREATE VIEW pricematrix_today AS 
select * from pricematrix
WHERE (valid_from is null or valid_from <= 'today') and
              (valid_to is null or valid_to >= 'today');

Now, what I want to do is pull records from the goods table, but override the sell column with the one from the pricematrix if it exists, choosing the lowest available price.

A simple approach (this code is not tested unlike my actual production code based on the same principle):

CREATE OR REPLACE FUNCTION get_goods_by_sku(in_sku text, in_customer_id int)
RETURNS goods LANGUAGE PLPGSQL AS
$$
DECLARE t_rec record;
       retval goods;
      pmatrix record;
BEGIN
  
           SELECT goods, pm INTO t_rec
             FROM goods
        LEFT JOIN (SELECT *,
                          rank() OVER
                           (PARTITION BY goods_id
                            ORDER BY sell) as ranking
                     FROM pricematrix_today
                    WHERE customer_id = in_customer_id
                  ) pm ON (goods.id = pm.goods_id and ranking = 1)
            WHERE goods.sku = in_sku;
  
       pmatrix := t_rec.pm;
       retval  := t_rec.goods;
       IF pmatrix.sell < retval.sell THEN
          retval.sell = pmatrix.sell;
       END IF;
       RETURN retval;
END;
$$;

By keeping our types isolated we can safely manipulate them inside the function before returning them and still be guaranteed a named type tuple on output.

Now obviously this is a contrived example, and in practice we will more often be selecting from views than from tables, but the same approach works for views, since we can define "type methods" on those as well.
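
For instance, here is a minimal sketch of such a "type method" defined on the pricematrix_today view above; the function name and the usage query are illustrative:

-- A function whose argument is the view's row type acts as a method on it.
CREATE OR REPLACE FUNCTION days_remaining(pricematrix_today)
RETURNS int LANGUAGE sql AS
$$ SELECT $1.valid_to - current_date $$;

-- Called either as days_remaining(pm) or, attribute-style, as pm.days_remaining:
SELECT pm.goods_id, pm.sell, days_remaining(pm)
  FROM pricematrix_today pm;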

Sunday, October 13, 2013

Writing Reports for LedgerSMB 1.4 part 2: Input Filter Screens

The input filter screen system is designed to be fairly simple and yet easily extended.  This post will cover the overall design of the system, how to handle custom dropdowns, and general reusable components.

In general, common controls and the like have been centralized to ensure consistency of interfaces for both user and developer.  Not only does the user get to see common date selection routines, for example, but the developer can count on dates being available to the report.

Common Controls

 The report_base.html template contains a number of basic controls used for data entry. In general these are the best way to add certain kinds of common data to a report because they ensure a consistent API and user experience.  To use one of these controls, just PROCESS them  like:

<?lsmb PROCESS date_row_div ?>

We offer the following common controls for report filter screens in the report_base.html template:

  • business_classes, a table row for selecting the business reporting class (project, department, and the like)
  • business_classes_div, similar to business_classes above but in a div instead of a table row
  • entity_class, a table row for selecting entity class (customer, vendor, etc)
  • gifi_or_standard, a radio button set for selecting GIFI or standard reports (table row)
  • igore_yearend, a table row for selecting whether yearends should be ignored or not (options: all, none, or last)
  • date_row is a set of two table rows for selecting dates, either by year, month, and period, or by from and to dates identifying a range.
  • date_row_div uses divs instead of table rows.
  • employee_row is a row for selecting an employee for a report
Now, all of the controls come in a table row form and some come in a div form.  The reason for this is that migrated forms have been only minimally changed at present, and div-based forms have not been written yet.  In 1.5 or even possibly 1.4, it is likely that div-based versions of the table-based forms will be written.

General Functionality Available


For reports only requiring general functionality you can simply put the filter screen in UI/Reports/filters/ with a name based on the report name (myreport.html for example) and then call it with a URL like: http://myhost/ledgersmb/reports.pl?action=start_report&report_name=myreport&module=gl.

The module option determines which business unit classes are available for selection.

The following are added to the request hash before passing it on to the template:

  • heading_list is a list of account headings
  • account_list is a list of accounts
  • all_years is a list of years for which there is financial activity (does not include uninvoiced orders)
  • all_months is a list of localized months for the date selection dropdown
  • batch_classes is a list of batch classes
  • bu_classes is an array of business unit classes, for the appropriate common control
  • b_units is an array of business units by class, for the common control
  • country_list is a list of countries
  • entity_classes is an array of entity classes for the common control
  • employees is a list of employees for the common control
If you do not  require more option sets than these, you can use the basic method mentioned above.

Handling Custom Selection Boxes


There are of course times when one needs more option sets than these.  The best way to handle this is to add a new workflow script which preprocesses the $request hash before passing it on to the same start_report function.

For example the relevant code for searching for orders is (abbreviated):

use LedgerSMB::Scripts::reports;
use LedgerSMB::Report::Orders;

sub get_criteria {
    my ($request) = @_;
    my $locale = $LedgerSMB::App_State::Locale;
    $request->{entity_class} = $request->{oe_class_id} % 2 + 1;
    $request->{report_name} = 'orders';
    $request->{open} = 1 if $request->{search_type} ne 'search';
    if ($request->{oe_class_id} == 1){
        if ($request->{search_type} eq 'search'){
            $request->{title} = $locale->text('Search Sales Orders');
        } elsif ($request->{search_type} eq 'generate'){
            $request->{title} =
                   $locale->text('Generate Purchase Orders from Sales Orders');
        } elsif ($request->{search_type} eq 'combine'){
            $request->{title} = $locale->text('Combine Sales Orders');
        } elsif ($request->{search_type} eq 'ship'){
            $request->{title} = $locale->text('Ship');
        }
    } elsif ($request->{oe_class_id} == 2){
        if ($request->{search_type} eq 'search'){
            $request->{title} = $locale->text('Search Purchase Orders');
        } elsif ($request->{search_type} eq 'combine'){
            $request->{title} = $locale->text('Combine Purchase Orders');
        } elsif ($request->{search_type} eq 'generate'){
            $request->{title} =
                   $locale->text('Generate Sales Orders from Purchase Orders');
        } elsif ($request->{search_type} eq 'ship'){
            $request->{title} = $locale->text('Receive');
        }
    } elsif ($request->{oe_class_id} == 3){
        if ($request->{search_type} eq 'search'){
            $request->{title} = $locale->text('Search Quotations');
        }
    } elsif ($request->{oe_class_id} == 4){
        if ($request->{search_type} eq 'search'){
            $request->{title} = $locale->text('Search Requests for Quotation');
        }
    }
    LedgerSMB::Scripts::reports::start_report($request);
}


The point is you can effectively put whatever logic you want before passing on to our general functionality.

You can use the templates in UI/Reports/filters as examples for coding the template itself.

Conclusions


The filter screen system is intended to make it relatively easy to create filters for report inputs.  In the future we will probably add more common controls and the like.  However this gives you the basic information needed to start writing reports.

Others in Series


1. Overview

3. Stored Procedure Best Practices

4.  (Mostly) Declarative Reporting Perl Modules

5.  Conclusions

Wednesday, October 9, 2013

Writing Reports in LedgerSMB 1.4 Part 1: Overview

One of the major features of the upcoming LedgerSMB 1.4 branch is the new reporting system.  This post gives a very basic breakdown of this new reporting engine.

The goal of the system was to make a framework for quickly and easily converting stored procedures into reports.  The basic components are:

  1. The input criteria filter (usually required)
  2. The main Perl module that defines the report
  3. The stored procedure which does the main work
  4. An optional output template.
  5. A workflow script to coordinate.

Some of these may be very lightweight, if not almost optional.  The two which are critical, however, are the defining Perl module, which describes the report in nearly declarative terms, and the stored procedure, which does the work.

For simple reports, it is enough to put a form template for input values in UI/Reports/filters/ and then access it through a URL referencing the name of the form as the filter template, like this:

http://lsmb/reports.pl?action=begin_report&report_name=mycustomreport

This part is quite simple.

The next thing you need is the PostgreSQL user-defined function.  This is usually a wrapper around an SQL query which provides a discoverable interface from input values to output values and returns tabular data.  This function must return all data related to the report, including what is needed for support features like click-through links.
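
To give a feel for the shape of such a function, here is a minimal sketch.  Everything in it (the table, columns, and function name) is hypothetical; it only illustrates the pattern of a single wrapping query with 'in_'-prefixed, discoverable arguments (named parameter references in an SQL-language function require PostgreSQL 9.2 or later):

CREATE TABLE mycustom_source (
    id          serial PRIMARY KEY,
    reference   text,
    description text,
    transdate   date,
    amount      numeric
);

CREATE OR REPLACE FUNCTION report__mycustomreport
(in_date_from date, in_date_to date, in_reference text)
RETURNS SETOF mycustom_source LANGUAGE sql AS
$$
    -- NULL criteria are treated as "no restriction".
    SELECT *
      FROM mycustom_source
     WHERE (transdate >= in_date_from OR in_date_from IS NULL)
       AND (transdate <= in_date_to   OR in_date_to   IS NULL)
       AND (reference ilike in_reference || '%' OR in_reference IS NULL)
     ORDER BY transdate, id;
$$;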

You also need a Perl module which defines the basic nature of the report, its name, layout and the like.  This is done mostly declaratively and will be covered in depth in a future article in this series.  The approach is sufficiently simple that basic reports can be written without any real knowledge of Perl (working from sample code only).

Finally there is a little bit of glue that is required for displaying the report in the workflow scripts.  This is simple enough that we can cover it here and be done with it.

Normally one will either have a new workflow script that manages this report or lump it together with a series of similar reports.  If the Perl module is in LedgerSMB/Reports/MyCustomReport.pm, then your workflow script would minimally need the following few lines:

use LedgerSMB::Reports::MyCustomReport;

sub my_custom_report {
    my ($request)  = @_;
    LedgerSMB::Reports::MyCustomReport
        ->new(%$request)->render($request);
 }

That's all that is required to render the report from this framework, based on inputs from the filter screen.

This system shows the general power that putting such reports in the database and detecting their arguments can have.  The key skill in writing such reports is thus SQL, not Perl.

Forthcoming parts:

2.  Understanding Input Filter Screens for Reports

3.  Reporting Stored Procedures Best Practices

4.  (Mostly) Declarative Reporting Perl Modules

5.  Conclusions

Wednesday, September 25, 2013

What is coming in LedgerSMB 1.4?

I figure it is a good time to review the changes that are going into 1.4 and discuss where things are going, where expected features have changed, and the like.

Framework Changes


1.4 moves all reports onto the new framework and depends on Moose for core functionality.  Independent extras may use Moo instead (both are declarative object systems).  This poses some performance costs and so we will be recommending our FCGI wrappers for standard operation instead of CGI.

Of particular interest is the new reporting framework.  This framework allows one to write new reports quickly in a nearly declarative way, with a minimum of Perl knowledge.  Reports are defined first and foremost through stored procedures/UDFs and secondly through Perl modules which define the parameters of the report.  An open reporting system is something LedgerSMB has needed for some time, and hopefully this will help us move forward rather quickly.

New Interface


We have based our user interface on Dojo widgets, which are, in most screens, dynamically allocated via HTML class and JavaScript.  Support for declarative pages is likely to be available too.  A lot of work has gone into good Dojo support, and this is something which is really positively impacting our usability.

New Functionality


We have a bunch of new features coming including:
  1. LedgerSMB 1.4 will include a new payroll framework which local organizations can use to implement payroll systems based on local rules.
  2. Invoices with inventory can be subject to batch workflows and separation of duties in the same way as AR transactions.  COGS is posted FIFO, using time of approval as the transaction ordering time.
  3. Template transactions will be supported natively
  4. CSV uploads for a number of types of data will be supported automatically.
  5. Most reports will be available in ODS, CSV, and PDF formats automatically, even if the author did not explicitly make such available.
  6. Better handling of localized number and date formats.
  7. Much, much more.
This is a big release and provides a lot of things that larger organizations will need.

Friday, September 13, 2013

PostgreSQL, Community Development, and Support

With the impressive release of PostgreSQL 9.3 I have noticed that a number of journalists seem to only mention a single provider of support.    I decided to write a different sort of article here discussing the PostgreSQL commercial support offerings and how these fit into development.  Please note that the same basic approach applies to LedgerSMB as well, as we have done our best to emulate them (though quite obviously with different support providers).

This is exactly how not to run an open source project if you want it to be a cash cow for a single business, but it is how to run an open source project if you want it to reach as large a user base as possible and provide as many economic opportunities as possible.

PostgreSQL is a community developed, multi-vendor project.  Vendors come and go, but the community endures.  Many vendors who used to contribute to the project no longer do so but there are a number of mainstays.  This article was written in September of 2013, and if you are reading it years later, please be aware there may be additional sources of support available.

Because PostgreSQL is developed by multiple vendors working together, in theory any vendor which employs competent programmers can fix bugs, offer hot fixes, and more for clients, and can, reasonably, if the patches are of good quality, get them accepted upstream.  This is an extremely important selling point for the database management system.

There are several long-standing companies in the community which offer support accounts on the database itself.  This is on top of vendors like Red Hat who offer high quality support with their OS service level agreements.

The list provided here is largely for journalists and others who wish to discuss PostgreSQL support.  It is by no means exhaustive, nor is it intended to be.  Support is available in various markets through other companies as well, and one of our tasks as a community is to create a larger pool of support and consulting services serving a larger variety of markets.  This is a strength of the community development model (as opposed to the vendor development model).

In the interest of full disclosure, I am a principal consultant for 2ndQuadrant, and I have worked with folks from Command Prompt, PGExperts, and some other companies on various projects.  Some aspects of what I say here come from something of an insider's perspective.

  1. 2ndQuadrant offers high quality 24x7 support delivered by support engineers which include actual contributors to the software.  Some of their support offerings offer guarantees not found by the vendors of proprietary databases.  I say this as a former employee of Microsoft's Product Support Services division.
  2. Command Prompt. Inc offers service level agreements which ensure quite a bit of proactive assistance.  The firm is one of the long-standing mainstays of the PostgreSQL scene.
  3. PGExperts offers a number of services aimed at ensuring support for critical production environments.
  4. EnterpriseDB offers support for the official version of PostgreSQL, as well as their own proprietary spinoff, "Postgres Plus Advanced Server."  Their proprietary version has a number of features aimed at smoother migration from Oracle, although it is sometimes mistaken for an "enterprise edition" of PostgreSQL.

The companies which I have listed above are largely those where I have met people from them and know something about their business models.  The project also maintains a list of professional service providers who have opted in to be listed.  Note that not all listed service providers offer technical support.

In the end, this model of support is a selling point of the software.  Unlike with Oracle, the companies which provide support have to serve the customer's needs because otherwise the customer can go elsewhere.

PostgreSQL is used in a large number of critical production capabilities where the ability to call someone for support, and get a very competent second set of eyes when things go wrong is absolutely necessary, and the companies above provide that.  But the companies listed go further, and are able to support the software as if they were the vendor (or likely even better).

Friday, August 30, 2013

Encryption: MySQL vs PostgreSQL

First a note, all my tests involved a relatively simple table with a schema like this (column names did vary):

CREATE TABLE enctest (
   id int,
   id_text text,
   id_enc bytea
);

In MySQL varbinary(64) was used instead of bytea.

The id was formed from a sequence from 1 to 100000.  I had more trouble loading this in MySQL than in PostgreSQL. id_text was a text cast of id, and id_enc was the value of id_text encrypted using 128-bit AES encryption.  This was intended to mimic sales data consisting of short strings that would be decrypted and converted to numeric data before aggregation.

The goal was to see how fast the different implementations would decrypt all records and aggregate as numeric data types. For PostgreSQL, pgcrypto was used.  The tests were conducted under ANSI mode on MySQL, and the tables were innodb.
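
As a hedged sketch of how the PostgreSQL test data might have been loaded (the sumtest table and column names are those that appear in the EXPLAIN output later; pgcrypto is assumed to be installed, and the raw-AES column is left unfilled):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE sumtest (
    testval     int,
    testvaltext text,
    testvalenc  bytea,   -- raw AES cryptogram, not filled in here
    testvalsym  bytea    -- pgp_sym_encrypt cryptogram used in the queries below
);

INSERT INTO sumtest (testval, testvaltext, testvalsym)
SELECT i, i::text, pgp_sym_encrypt(i::text, 'mysecretpasswd')
  FROM generate_series(1, 100000) AS i;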

What I found was remarkably disturbing.  While MySQL was blazingly fast, this speed came at the cost of basic error checking: rather than raising an error, decrypting with the wrong key would sometimes give wrong data back, even in traditional modes.  This is because, per the documentation, warnings are only promoted to errors on insert, not on select.  In other words, MySQL is just as permissive in read operations with STRICT mode turned on as with it turned off.

mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) FROM enctest;
+----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret', 512)) as decimal)) |
+----------------------------------------------------------------+
|                                                     5000050000 |
+----------------------------------------------------------------+
1 row in set (0.33 sec)


That is fast.  Very fast.  My similar query in PostgreSQL took about 200 seconds, roughly 600x as long, and was entirely CPU-bound the whole time.

efftest=# explain (analyse, verbose, costs, buffers) select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=7556.16..7556.17 rows=1 width=62) (actual time=217381.965..217381.966 rows=1 loops=1)
   Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
   Buffers: shared read=5556 written=4948
   ->  Seq Scan on public.sumtest  (cost=0.00..6556.08 rows=100008 width=62) (actual time=0.015..1504.897 rows=100000 loops=1)
         Output: testval, testvaltext, testvalenc, testvalsym
         Buffers: shared read=5556 written=4948
 Total runtime: 217382.010 ms
(7 rows)


My first thought was that for there to be a difference of nearly three orders of magnitude between the two implementations, something must be seriously wrong on the PostgreSQL side.  This is a huge difference.  But then something occurred to me.  What if I use the wrong password?

On PostgreSQL:

efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR:  Wrong key or corrupt data



On MySQL, it is a very different story:

mysql> select sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('secret2', 512)) as decimal)) |
+-----------------------------------------------------------------+
|                                                            1456 |
+-----------------------------------------------------------------+
1 row in set, 6335 warnings (0.34 sec)


Hmmm, out of 100000 rows, only 6000 (6%) gave a warning, and we got a meaningless answer back.  Thanks, MySQL.  So I tried some others:

mysql> select sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) FROM enctest;
+-----------------------------------------------------------+
| sum(cast(aes_decrypt(id_enc, sha2('s', 512)) as decimal)) |
+-----------------------------------------------------------+
|                                                      1284 |
+-----------------------------------------------------------+
1 row in set, 6230 warnings (0.35 sec)

Again 6% warnings, meaningless answer returned.  Wow this is fun.....

Try as I might I couldn't get MySQL to throw any errors, and I always got meaningless results back with the wrong key.  A closer look revealed that MySQL was throwing warnings only when certain rare criteria were met and was performing no validation to ensure the output matched the data that went in.  Further review showed that the cryptograms were much shorter on MySQL than on PostgreSQL, suggesting that PostgreSQL was padding short strings so that the cryptography would better protect the data.  More on this later.

This suggested that the difference in the performance might well be related to extra sanity checks in PostgreSQL that MySQL omitted for speed-related purposes.  Armed with this knowledge, I tried the following:

efftest=# update sumtest set testvalsym = pgp_sym_encrypt(testvaltext, 'mysecretpasswd', 's2k-mode=0, s2k-digest-algo=md5');
UPDATE 100000


The query returned pretty fast.  However these settings are not really recommended for production environments.

I went ahead and ran my data checks and my performance queries again, and the results were two orders of magnitude faster:

efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR:  Wrong key or corrupt data
efftest=# update sumtest set testvalsym = pgp_sym_encrypt(testvaltext, 'mysecretpasswd', 's2k-mode=0, s2k-digest-algo=md5');
UPDATE 100000
efftest=# explain (analyse, verbose, costs, buffers) select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd2')::numeric) from sumtest;
ERROR:  Wrong key or corrupt data
efftest=# explain (analyse, verbose, costs, buffers)
select sum(pgp_sym_decrypt(testvalsym, 'mysecretpasswd')::numeric) from sumtest;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=13111.00..13111.01 rows=1 width=71) (actual time=1996.574..1996.575 rows=1 loops=1)
   Output: sum((pgp_sym_decrypt(testvalsym, 'mysecretpasswd'::text))::numeric)
   Buffers: shared hit=778 read=10333
   ->  Seq Scan on public.sumtest  (cost=0.00..12111.00 rows=100000 width=71) (actual time=0.020..128.722 rows=100000 loops=1)
         Output: testval, testvaltext, testvalenc, testvalsym
         Buffers: shared hit=778 read=10333
 Total runtime: 1996.617 ms
(7 rows)


Much, much faster.  Of course that comes at the cost of security features.

The primary security features changed here are what are called string to key functions.  PostgreSQL also offers some relatively complex containers for short data which include things like padding and session keys.  MySQL does not provide string to key management, and requires that you generate the hexadecimal key yourself.  PostgreSQL provides a number of options for string to key generation which allow for salted hashes to be used for the actual encryption.

One of the most obvious implications here is that with MySQL you have to generate your salted hash yourself, while PostgreSQL may generate a different salted hash for each row.  This is very important for encryption, particularly of smaller strings, because it helps thwart rainbow tables.  In essence, with salted keys there is no 1:1 relationship between the passphrase/data combination and the cryptogram, because there is no 1:1 relationship between the passphrase and the key.  Further testing suggests that this is not responsible for the performance difference, but it does suggest there are more checks lurking beneath the surface which MySQL omits.
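The salt is easy to observe (a hedged sketch; the literal value is arbitrary): encrypting the same plaintext twice with the same passphrase yields two different cryptograms, so comparing ciphertexts tells an attacker nothing:

select pgp_sym_encrypt('4111111111111111', 'mysecretpasswd')
     = pgp_sym_encrypt('4111111111111111', 'mysecretpasswd');
-- returns false: each call uses a fresh salt (and random prefix), so the cryptograms differ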

So given that the issue is not string-to-key management, the issue must be padding.  For very short strings, PostgreSQL is managing padding and containers, while MySQL encrypts the short string with no more than minimal padding.  With so little padding the decryption routines are much faster, but that speed comes at the cost of any reasonable security.  Additionally, PostgreSQL performs data checks that are not done in MySQL.
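The container overhead is easy to see as well (again a sketch; exact sizes vary with the options and pgcrypto version): even a one-character plaintext produces a cryptogram of dozens of bytes, because the OpenPGP container carries the salt, padding, and integrity data that the bare MySQL cryptogram lacks:

select octet_length(pgp_sym_encrypt('1', 'mysecretpasswd'));
-- returns a value on the order of tens of bytes for a single character of plaintext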

So what does this tell us?  I think the primary lesson, which I have had driven home a few times, is that database-level encryption is tricky.  This is particularly true when other considerations are involved, like the performance of aggregates over significant data sets.  Add to this the woes of in-db key management and the like, and in-db encryption is definitely expert territory.  In this regard, MySQL's approach seems to require a lot more complexity to maintain security than PostgreSQL's does.

It is important to remember that short encrypted strings are relatively common in databases which use encryption.  One of the most common uses is for things like credit card numbers.    For the reasons mentioned here I would suggest that PostgreSQL is much more trustworthy in these cases.

Monday, August 26, 2013

When to use SELECT * in PostgreSQL

In LedgerSMB we use a lot of queries which involve SELECT *.  Many people consider SELECT * harmful, but there are several cases where it is useful.  Keep in mind that we encapsulate the database behind an API, so SELECT * has different implications for us than it does for applications selecting directly from tables.

The Fundamental Design Questions


It all comes down to software interface contracts and types.  Poorly thought-out contracts, loosely applied, lead to unmaintainable code.  Clear contracts, carefully applied, lead to maintainable code because the expectations are easily enforced.

PostgreSQL comes with a complex type system in which every table, view, or composite type acts as an object class.  In the right contexts, SELECT * provides you with a result of a guaranteed type.  This is important when doing object-relational work because it means you get back a series of objects of a defined class.  This allows you to then pass those objects on to other functions to get derived data.

SELECT * therefore helps you when working with objects, because you can ensure that the results are in fact valid objects of the class defined by the relation in the FROM clause of the query.
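A hedged sketch of what that guarantee buys you (the table, data, and loop below are all illustrative): rows coming out of SELECT * fit a variable declared with the table's composite type, so they can be handled as objects of that class:

create table inventory_item (
    sku       text    primary key,
    onhand    numeric not null,
    unit_cost numeric not null
);

insert into inventory_item values ('WIDGET-1', 10, 2.50), ('WIDGET-2', 3, 7.00);

do $$
declare
    item inventory_item;   -- a variable of the class defined by the table
begin
    for item in select * from inventory_item loop
        raise notice 'item % is worth %', item.sku, item.onhand * item.unit_cost;
    end loop;
end;
$$;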

Where SELECT * can't be helpful


SELECT * is never helpful (and can cause significant problems) in a couple of specific areas: defining views, and anywhere you have a join.  There are specific reasons for these problems.

Consider the following:

chris=# create table typetest (test text);
CREATE TABLE
chris=# insert into typetest values ('test1'), ('test2');
INSERT 0 2
chris=# CREATE VIEW typetestview AS select * from typetest;
CREATE VIEW
chris=# select * from typetestview;
 test 
-------
 test1
 test2
(2 rows)


chris=# alter table typetest add newfield bool default false;
ALTER TABLE
chris=# select * from typetestview;
 test 
-------
 test1
 test2
(2 rows)


(Edited, see note [1].)  Interestingly, the expanded column list is fixed in the view definition, so a dump and reload won't change it.  However, without setting up casts, you can't take advantage of the fact that you are passing up the data structure.

The key problem here is that views are their own types, and consequently you cannot guarantee that the view type will be the same as the underlying table type.  This makes casts more complex to maintain and probably not worth the trouble.

Once joins are used in a query, however, SELECT * loses any potential benefit.  A join does not return a defined type, and so SELECT * should never be used in queries involving joins (aside, possibly, from ad hoc queries run by the DBA to explore the data).
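To make the problem concrete, here is a hedged sketch with made-up tables: the output of a join matches no named composite type, and SELECT * happily drags in duplicate column names:

create table customers (id serial primary key, name text not null);
create table orders (id serial primary key,
                     customer_id int not null references customers(id),
                     total numeric not null);

select * from orders o join customers c on c.id = o.customer_id;
-- the result carries two columns named "id", and its row type is anonymous, so it cannot
-- serve as the declared return type of a function; list the columns you actually want:
select o.id as order_id, c.name, o.total
  from orders o join customers c on c.id = o.customer_id;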

SELECT * and Stored Procedures


Consider for example the following CRUD stored procedure:

CREATE OR REPLACE FUNCTION accounts__list_all()
RETURNS SETOF accounts
LANGUAGE SQL AS
$$
    SELECT * FROM accounts ORDER BY account_no;
$$;

This query is relatively simple, but the stored procedure returns a type that is defined by the underlying table.  We all run into cases where application data can't be much further normalized and we may want to have stored procedures delivering that data to the application.  In this case, we are likely to use a function like this, and that enables us to do other object-relational things outside it.

Now, if we need to change the underlying accounts table, we can always decide whether to make accounts a view with a stable representation, define a complex type with a hand-coded query returning it, or just propagate the changes upwards.  Because the application is not directly selecting from the underlying storage, we have options to ensure that the contract can be maintained.  In essence this inserts a layer of indirection that allows us to maintain contracts more easily through schema changes.
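As a hedged sketch of one of those options (the type name, the description column, and the int type of account_no are assumptions made for illustration): if the accounts table later changes shape, the external contract can be preserved behind a hand-coded query returning a stable composite type:

CREATE TYPE account_summary AS (account_no int, description text);

DROP FUNCTION accounts__list_all();  -- a function's return type cannot be changed in place

CREATE FUNCTION accounts__list_all()
RETURNS SETOF account_summary
LANGUAGE SQL AS
$$
    SELECT account_no, description FROM accounts ORDER BY account_no;
$$;

Callers still invoke accounts__list_all() and still get back the columns they expect, even though the storage underneath may now look quite different.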

Consequently, although it leads to the same execution plan in this example, there is a tremendous difference, software-engineering-wise, between an application calling:

SELECT * FROM accounts ORDER BY account_no;

and

SELECT * FROM accounts__list_all();

In the first case, you have only one contract, between the high-level application code and the low-level storage.  In the second case, you have two contracts: one between the storage and the procedure (which can be centrally adjusted), and a more important one between the application code and the stored procedure.

Conclusions

In PostgreSQL, the choice of whether to use SELECT * in a query is a relatively clear one.  If you want to return objects of the type defined by an underlying construct, and that type stays closely tied to the output you need over time, then SELECT * is fine.  On the other hand, if these things aren't true, then either you should find ways to make them true or avoid using SELECT * altogether.

This makes a lot more sense when you realize that things like table methods can be passed up when SELECT * is used (or methods applied to views, or the like).
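For instance (a hedged sketch; account_label and the description column are inventions for illustration, not part of LedgerSMB's schema), a table method is just a function over the table's row type, and attribute notation makes it look like a column wherever rows of that type travel:

CREATE FUNCTION account_label(a accounts) RETURNS text
LANGUAGE SQL AS
$$
    SELECT a.account_no::text || ': ' || a.description;
$$;

SELECT a.account_no, a.account_label FROM accounts a;

Because the rows produced by SELECT * FROM accounts are of the accounts type, the method rides along with them into any function or query that handles that type.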

In general you are going to find two areas where SELECT * is most helpful.  The first is in object-relational design.  The second is where you want PostgreSQL to define an object model for you.  In reality the first case is a special case of the second.

This way of doing things is very different from the way most applications work today.  Here the database is encapsulated behind an object model, and the application consumes that object model.  In those cases, SELECT * is very helpful.

[1]  Corrected as per the comment.  I was under the mistaken impression that the SELECT * itself would remain part of the view definition.  The expansion still leads to annoying and unexpected changes in the view definition, such as when you drop and recreate the view, so I would still discourage it here.

Sunday, August 18, 2013

In defence of hand coded SQL

One common comment I get when I point out that I hand-write all my SQL queries, rather than relying on an ORM or the like, is that this is drudge work obsoleted by modern tools; when I mention that these queries are usually wrapped in stored procedures, the reactions go from disdainful to horrified.  This piece is the other side: why I do this and why I find it works.  I am not saying these approaches are free of costs, but software engineering is about tradeoffs, and these tradeoffs are real.  My approach is not a magic bullet, but it forms a vital piece of how I build software on the database.

The first thing to note is that I use a lot of SELECT * FROM table queries when querying tables whose structure matches the output structure.  We all run into tables that cannot reasonably be normalized further and whose structure can feed directly into the application.  In a stored procedure wrapper, SELECT * reduces the maintenance points for such tables when new fields are added (the query still matches the specified return type with no modifications).  This has costs, in that it discourages refactoring of tables down the road, but that just needs to be watched.  One can still have central management by using views if needed, and central management of type definitions is generally a good thing.  Views can take the place of an ORM here.
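As a hedged sketch of that last point (the tables, view, and function are all hypothetical): a view can own the application-facing shape, and the stored procedure simply selects * from it, so the type is managed in one place:

CREATE TABLE contacts (id serial PRIMARY KEY, name text NOT NULL);
CREATE TABLE contact_notes (contact_id int REFERENCES contacts(id), note text);

-- the view is the centrally managed type the application actually consumes
CREATE VIEW contact_summary AS
    SELECT c.id, c.name, count(n.note) AS note_count
      FROM contacts c
 LEFT JOIN contact_notes n ON n.contact_id = c.id
  GROUP BY c.id, c.name;

CREATE FUNCTION contact__summaries()
RETURNS SETOF contact_summary
LANGUAGE SQL AS
$$
    SELECT * FROM contact_summary;
$$;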

The second point is that CRUD queries of this sort don't really take significant time to write, even against a well-normalized database, and having them encapsulated behind a reasonably well-designed procedural interface is not a bad thing, provided that some of the classical difficulties of stored procedures are addressed.

I find that my overall development time is not slowed down by hand-writing SQL.  This remains true even as the software matures.  What you lose by giving up automatic query tools, you gain back by actually thinking about how best to use queries in the application.  The fact is that as application developers we tend to do a lot in application code that could be better done as part of a query.  Sitting down and thinking about how the queries fit into the application is one of the single most productive exercises one can do.

The reason is that a lot of data can be processed and filtered in the queries themselves.  This lets you ask the database to send back data in exactly the form the application can best use.  That eliminates a lot of application-level code and leads to a shrinking codebase, which in turn makes it easier for the remaining application code to work with the data the queries return.  The result is better productivity all around.
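A small hedged example of the kind of thing I mean (the tables are hypothetical): rather than fetching every invoice row and summing balances in a loop, ask for the finished answer:

SELECT c.id, c.name, sum(i.amount - i.paid) AS balance
  FROM contacts c
  JOIN invoices i ON i.contact_id = c.id
 GROUP BY c.id, c.name
HAVING sum(i.amount - i.paid) <> 0
 ORDER BY balance DESC;
-- one row per contact with an outstanding balance, already filtered and sorted,
-- instead of thousands of raw invoice rows to post-process in the application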

Monday, June 10, 2013

Design of Efficito's PostgreSQL-centric automation environment

One of the challenges of setting up a cloud hosting environment is ensuring that systems are configured as intended and that the desired state is one which can be rebuilt if there is a problem.  We at Efficito focus on backing up data and rebuilding software images to consistent states, rather than backing up full images and restoring them.  This helps ensure that in the event of a disaster, VMs are restored to a consistent software state with data restored from backup.

Our choice of architecture was guided by the following requirements:

  1. Configuration and building of virtual machines should be subject to automation without human intervention, with full integration of payment frameworks and the like.
  2. Configuration and building of virtual machines should be such that virtual machines can be fully rebuilt in the event of disaster recovery requirements.
  3. Configuration changes should be able to be automated and retriable.
  4. The system may be specific to clouds hosting specific applications (we only host LedgerSMB as an ERP solution).
Further posts will probably cover very small pieces of our system.  The entire system cannot be published here, in part because we want to preserve our trade secrets.  This post, however, just covers some of the ways we use PostgreSQL as the centerpiece of this environment.

The Basic Structure and Role of PostgreSQL


Our approach is relatively simple.  Data comes in through either an administrative or a customer portal and is transmitted to a limited API, which writes it into our configuration database.  The information can be a request for a new virtual machine, a configuration change, or the like.  Payment notifications can come in through these interfaces as well.

PostgreSQL is then attached to the configuration system, which picks up notifications of needed configuration changes and orchestrates them on the system.  This also allows us to pull information on our service deployments into our financial system for billing purposes (we use beta versions of LedgerSMB 1.4 internally, eating our own dogfood so to speak).

In this regard PostgreSQL acts as an information backplane.  It allows our software components to talk to each other and allows messages to be sent between components, with both transitory and permanent information recorded in the database for later record-keeping (transitory information can be periodically truncated).

Information Flow


The system is still developing with various components coming together.  Nonetheless, the idea is that the customer or administrator enters information into a front-end tool, which, through a limited API, inserts the data into the database.

On database commit, triggers fire which queue a message for the configuration system to read.  We use pg_message_queue for this, in part because it supports both NOTIFY and periodic polling, and we intend to add much better multiple-listener support as we need it.
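In spirit it looks something like the sketch below, which uses a plain queue table and pg_notify rather than pg_message_queue's actual API; every name here is illustrative:

-- illustrative only: a bare queue table plus NOTIFY standing in for pg_message_queue
CREATE TABLE config_job_queue (
    id           bigserial PRIMARY KEY,
    payload      text NOT NULL,
    dequeued_at  timestamptz
);

CREATE FUNCTION queue_config_job() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    INSERT INTO config_job_queue (payload) VALUES (NEW.id::text);
    PERFORM pg_notify('config_jobs', NEW.id::text);  -- wake any listening worker
    RETURN NEW;
END;
$$;

CREATE TRIGGER queue_on_config_change
    AFTER INSERT ON config_requests           -- hypothetical intake table
    FOR EACH ROW EXECUTE PROCEDURE queue_config_job();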

From there the listener gets information about which portions of the system need to be changed, makes the changes, and on success commits the transaction that dequeued the notification.  On failure, a system alert is raised and the system goes on to the next request (the item is returned to the queue for later processing on the next polling cycle).
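Continuing the sketch above (again a generic illustration, not pg_message_queue's interface), the dequeue-and-commit pattern looks roughly like this:

BEGIN;

UPDATE config_job_queue
   SET dequeued_at = now()
 WHERE id = (SELECT id
               FROM config_job_queue
              WHERE dequeued_at IS NULL
              ORDER BY id
              LIMIT 1
                FOR UPDATE)
RETURNING id, payload;

-- ... the worker applies the configuration change here ...

COMMIT;  -- on failure we ROLLBACK instead, leaving dequeued_at NULL so the next
         -- polling cycle picks the item up again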

What this means is that to a large extent this is a hands-off system.  We provide configuration options, customers select them, and once everything is running, the customer can control the software configuration of their VMs within limits, but without root access.  We do offer root access, but only if the customer is willing to set up the SSL key and certificate themselves (we can't give root access if our wildcard cert is on the VM!).