Re: Dataframes as arguments to functions?

Jens Oehlschlaegel (oehl@Psyres-Stuttgart.DE)
Wed, 28 Jan 1998 21:39:49 +0100 (MET)


On Wed, 28 Jan 1998, F.Tusell wrote:

> I have read the S-Plus manual, and I understand that this is not a
> bug, but a design decision --as it should be. My question is: how can
> I do what I want (basically, pass dataframes as in a FORTRAN
> call-by-reference subroutine)? Is there any other way to go around?

Well, this question has a long history, and in the light of memory
requirements of S+4.0 it is still a 'hot' topic. A quick answer is:

1) S+ favors copying objects instead of referencing them
(perhaps this changes partially with S+5.0)

2) S+ allows for indirect access into other "frames" so you can do it,
have a look at assign() get() eval(parse(text=...)) sys.status()
but today it's usually cheaper to invest in memory than in programming

3) There were some severe garbage collection bugs in WinS+3.3 (they
may still be in 4.0) which can cause a lot of user suffering when doing it
(bug description available on request)

4) You can facilitate 2) by using classes 'reference' and
'static' from my library REF.
(available on request for S+ or R)
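
As a minimal sketch of point 2), assuming R's envir= argument (S+ uses
frame= / where= arguments instead) and a hypothetical helper name
add_column_by_name(), a function can change a dataframe in the caller's
frame by name with get() and assign():

```r
# Hypothetical helper (not part of S+ or R): change a dataframe that
# lives in the caller's frame, addressed by its name.
add_column_by_name <- function(name, column, values, frame = parent.frame()) {
  df <- get(name, envir = frame)     # read the object from the caller's frame
  df[[column]] <- values             # modify the local copy
  assign(name, df, envir = frame)    # write the result back under the same name
  invisible(NULL)
}

d <- data.frame(x = 1:3)
add_column_by_name("d", "y", c(10, 20, 30))
names(d)                             # d now has the columns x and y
```

Note that a temporary copy still exists inside the helper; this only
imitates call-by-reference from the caller's point of view.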

More details are in the following discussion, which had previously been
posted as attachment to my summary <summary: pointer/reference/memory>
of a discussion in this list.

Best regards

Jens Oehlschlaegel-Akiyoshi

==================================================
SUMMARY: class 'pointer', class 'static', Recall()
==================================================

FUTURE
======
Perhaps most important: Charles Roosen announced that MATHSOFT
plans to focus on the memory/big-data-set issue with S+5.0,
possibly by implementing "Version 4 of S". Thank you, this is
very good news. (To avoid confusion: S+4.0 will be based on
"Version 3 of S")

'references' in language Version 4 of S
=======================================
I recommend reading "Evolution of the S Language" by John M.
Chambers (http://cm.bell-labs.com/stat/fmc).
Version 4 of S will make use of references in a special way:
COPY ON MODIFY. Thus there are

DIFFERENT TYPES of REFERENCES
=============================
Types of references differ basically in:

1) whether they are used IMPLICITLY (the end user does not
know whether his parameter is a value copy or a reference)
or whether they are used EXPLICITLY
2) What happens, when attempts are made to write/assign to the
reference

NO REFERENCE
passing by value (copying objects) is the basis for secure
"functional programming" but leads to big memory
requirements. This is the status quo in S+. After evaluating
a function's parameter, a local copy is generated, even if
only read access is done. So in a way S+ is doing
COPY ON EVALUATE or COPY ON READ.

READ ONLY REFERENCE
is the simplest secure way of using references
but may be used only explicitly in special cases

COPY ON WRITE
is a very clever way of making use of references while
preserving the security of functional programming: The
reference is read-only, on attempts to write to the
reference a local copy is generated. This is status quo
in R. The hidden change of the parameter being a reference
to being a copy makes this rather an implicit version.

COPY ON MODIFY
is a more elaborate version of 'copy on write' where
possibly only those parts of the referenced object that
have actually been changed are generated as local copies,
e.g. write attempts to a referenced dataframe result in
locally copying only the changed column, while the other
columns are still read through the reference.
This is intended for Version 4 of S.
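
Whatever copying strategy is used internally, the semantics seen by the
end user stay the same. A minimal R sketch (set_first() is a made-up
example function): writing to a parameter never changes the caller's
object.

```r
# Writing to a parameter only ever changes a local copy.
set_first <- function(v) {
  v[1] <- 99        # the write affects the local v only
  v                 # the changed copy is returned as a value
}

a <- c(1, 2, 3)
b <- set_first(a)
a                   # the caller's object is unchanged
b                   # the returned value carries the change
```

The strategies listed above differ only in WHEN (and how much of) the
local copy is actually made, not in this visible behaviour.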

The (hidden) use of 'copy on write' and 'copy on modify' will
partially avoid/solve memory problems. Unfortunately they do
not solve the memory intensive RECURSION PROBLEM, since
recursion might require the recursive function to change the
parameter data object BEFORE recalling.

-> a quick and dirty solution would be: allow the recursive
function to remove the object in its parent's frame.
This leads to having only one object and a temporary copy,
but in terms of security it's a somewhat strange solution
and it lacks generality
(cf. the recent 'nested functions' business of Mark Bravington).

Back to different types of references: How to overcome the
restrictions of the above mentioned reference types? What about
defining references more generally:

GENERALIZED REFERENCE
could be a reference
- for read access
- which defaults on write attempts to 'copy on ...'
- but which gives write access to the referenced object IF
this has an attribute Write.Permission=T
(possibly with the additional condition that a special
assignment function must be used explicitly for this write
access, to keep the code readable)

The crucial point is keeping the Write.Permission with the
referenced object, not with the reference, because this is
- more secure
- saves memory since the referenced object exists only once
while there may be multiple references, e.g. in recursion

Ordinary (read-only) objects would not need an attribute
Write.Permission=F (no additional memory requirements).

If more security is needed, the idea could easily be extended
to setting the object's attribute

attr(referenced.object, "Write.Permission") <- any.password

and extending the reference with an additional list element

reference$write=any.password

Write access thus would require identity between those two,
or attr(referenced.object, "Write.Permission")==T in case of no
list element reference$write.

This logic (with a special assignment function) does not
require an explicit class 'reference', just a few helpful
functions for
- checking whether a parameter is (still) a reference or
(already) a local copy
- checking whether the referenced object has Write.Permission
- forcing a parameter to become a local copy
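
The Write.Permission logic could be sketched in R roughly as follows.
This is only an illustration under assumptions: a reference is taken to
be a plain list holding the object's name and frame, and the helper
names make_ref(), can_write() and assign_ref() are hypothetical (they
are not taken from REF.S or REF.R).

```r
# Assumed representation: a reference is a list with the name of the
# object, the frame (environment) holding it, and an optional password.
make_ref <- function(name, frame = parent.frame(), write = NULL) {
  force(frame)
  list(name = name, frame = frame, write = write)
}

# The permission is stored with the referenced object, not the reference.
can_write <- function(r) {
  perm <- attr(get(r$name, envir = r$frame), "Write.Permission")
  if (is.null(perm)) return(FALSE)            # ordinary object: read-only
  if (is.null(r$write)) return(isTRUE(perm))  # no password: needs TRUE
  identical(perm, r$write)                    # password must be identical
}

# Special assignment function: the only way to write through a reference.
assign_ref <- function(r, value) {
  if (!can_write(r)) stop("referenced object has no Write.Permission")
  # keep the permission attribute on the new value
  attr(value, "Write.Permission") <-
    attr(get(r$name, envir = r$frame), "Write.Permission")
  assign(r$name, value, envir = r$frame)
  invisible(r)
}

x <- c(1, 2, 3)
attr(x, "Write.Permission") <- "secret"
r_ok  <- make_ref("x", write = "secret")
r_bad <- make_ref("x")            # no password: write attempts are refused
assign_ref(r_ok, c(4, 5, 6))      # x now holds 4, 5, 6
```

Ordinary objects carry no attribute at all and are therefore read-only
through any reference, as described above.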

IF NOT implementing such generalized references as part of the
language I find it useful to implement an explicit class
'reference' as I previously suggested under the name 'pointer',
and now provide with REF.S or REF.R .

Most discussants felt that some kind of (write permitting)
references would enrich the S+ language, but some had security
objections.

GENERAL SECURITY OBJECTIONS
===========================
I myself have suffered heavily from not knowing that the database
compiler CLIPPER - though following structured programming -
SOMETIMES makes an exception to handling parameters as
copies. Big data objects (so called 'arrays', comparable to S+
lists) are transferred as references. Thus changing the seemingly
local parameter in fact changes the original!
Implicit references with general write permission are a serious
issue.
My suggestion for an explicit class 'reference' was realized
on the end user level of S+: S+ allows functions to evaluate/change
expressions/objects in frames other than the local frame, thus
insecurity is already built in. The only protection provided is
that the user may choose not to use it. Providing a class
'reference' makes 'it' more comfortable, thus highlighting the
security issue. As outlined above, a good solution for
Write.Permission references may be much more secure than forcing
the end user to do 'it' on his own (not to mention the time
everybody is wasting by reinventing the wheel).

SECURITY of CLASS 'static'
==========================
There was no discussion at all about class 'static'. This
use of references INCREASES SECURITY because naming conflicts
in the frame 0 are avoided (cf. YELLOW BOOK, 'communication
through frame 0' in the special programming part about
recursion). Of course this increase of security works only, if
everyone who feels tempted to write to frame 0 knows about
class static. In other words: This should become part of the
language. It's NOT sufficient that it sleeps somewhere at
STATLIB. This is one reason why MATHSOFT should take over
this business, put it into the handbooks etc.

(note: here statics were assumed to reside in
frame 0, my implementation puts them in frame 1)
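
The naming-conflict argument can be illustrated with a small R sketch.
It does not reproduce class 'static' from REF; it assumes a single
private store (an R environment) and the hypothetical helpers
set_static() and get_static(), which key every static by its owner:

```r
# Assumption: one private store holds all statics, keyed by owner and
# name, so two functions using the same static name cannot collide.
.static_store <- new.env()

static_key <- function(owner, name) paste(owner, name, sep = "::")

set_static <- function(owner, name, value) {
  assign(static_key(owner, name), value, envir = .static_store)
  invisible(value)
}

get_static <- function(owner, name, default = NULL) {
  key <- static_key(owner, name)
  if (exists(key, envir = .static_store, inherits = FALSE))
    get(key, envir = .static_store)
  else
    default
}

# A function that remembers how often it has been called.
counter <- function() {
  n <- get_static("counter", "n", default = 0) + 1
  set_static("counter", "n", n)
  n
}

counter()
counter()                         # the count survives between calls
set_static("other", "n", 100)     # same name, other owner: no conflict
```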

ACCESSING PARTS of OBJECTS
==========================
is another reason for MATHSOFT to take over. Efficient
references require the possibility of accessing parts
of objects.

Let o be an object in frame fo and r be a reference to o in the
local frame fl and further i,n be objects in fl

One syntax suggestion for writing to parts of o was

r[i] <- n

by defining special assignment operators for class 'reference'

I could possibly imagine this syntax for an implicit reference concept
defined in the language (so everybody has to know it, and has to
give Write.Permission), but if 'it' is done via a user-defined
class 'reference', I feel that this is too dangerous. I would prefer
a special assignment or evaluation function like

eval.ref( r[i] <- n )

which has to scan its expression for references and
to interpret its tokens in two different frames as

eval(expression(
o[eval(expression(i),local=fl)] <-
eval(expression(n),local=fl)
),local=fo)
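
For the simple case o[i] <- n, the two-frame interpretation can be
imitated in R without a general expression scanner. The sketch below
assumes R environments in place of S+ frames and a hypothetical helper
assign_part(); it is not the general eval.ref() discussed here.

```r
# Hypothetical helper: perform o[i] <- n where o lives in frame fo,
# while i and n come from the caller's (local) frame.
assign_part <- function(name, i, n, fo) {
  # i and n arrive already evaluated in the local frame; only the
  # assignment itself is evaluated in the frame holding the object.
  call <- substitute(o[i] <- n, list(o = as.name(name), i = i, n = n))
  eval(call, envir = fo)
  invisible(NULL)
}

fo <- new.env()                    # stands in for the frame holding o
assign("o", c(10, 20, 30), envir = fo)

local_fn <- function() {
  i <- 2                           # i and n exist only in the local frame
  n <- 99
  assign_part("o", i, n, fo)
}
local_fn()
get("o", envir = fo)               # second element has been replaced
```

This shows the division of labour between the two frames only; whether
the evaluator copies the whole object during the assignment is a
separate question.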

An alternative would be a changed function eval() which
interprets efficiently

deref(r)[i] <- n

Programming eval.ref() or changing eval() means properly
parsing complex expressions, which means reprogramming the
complete S+ evaluator. As an end user, I am not willing to
reinvent this wheel (but perhaps I am just too blind to see the
simple solution; does any GURU feel challenged?).

!Attention!
currently eval() seems to be less efficient than get() assign()

For MATHSOFT, having the source of the evaluator, this should be
easier to do. So I will not provide this useful functionality
with my solution. I wish MATHSOFT would take responsibility for
the reference business and would maintain the stuff. An answer
to s-news would be nice!

REFERENCES to PERMANENT OBJECTS
===============================
One suggestion was to allow references to permanent objects,
i.e. extending the 'frame' part to include a 'where'
statement. Not moving big objects through memory could even
mean avoiding the one copy in memory, i.e. reading some very
large objects only partially into memory.
I think in general this is a very good suggestion, but it
makes full sense only, if accessing parts of referenced
objects is really available. I did not extend to a where
statement yet.
The possibility of references to permanent objects raises the
next issue: Someone could store references permanently, which raises
security issues again:

VALIDITY of REFERENCES
======================
Many discussants were concerned with the validity of references.
One suggestion was to have a function is.local() which checks
"whether a pointer can be returned safely" or whether the
reference points to something in the local frame (which will
not exist after the reference has been returned). This
statement hits the heart of the problem: If one allows a
reference to be returned from a function one can get into
severe trouble. Version 4 of S with the 'copy on modify'
concept probably just avoids this problem, because no (end
user) function will be able to return implicit references.
Returning something will always give a value, not a reference,
I assume.

I don't see necessity for references pointing to objects in
the local frame of child functions, nor do I see necessity for
a child function to create some object in a parent frame, and
to inform the parent by returning a reference to where it is. If a
child function is to write to a parent frame, the object may be
created in the parent frame, be given Write.Permission and the
reference be passed to the child, that's the logic of
functional programming as I understand it.

I would suggest two basic rules:

- DO ONLY CREATE REFERENCES TO EXISTING OBJECTS
- NEVER RETURN/ASSIGN A REFERENCE to a parent frame, frame 0
or a permanent database

With an implicit reference concept users are forced to stick to
these rules. As long as there is need to use user-defined
explicit references, following these rules should grant
validity of references. John Chambers suggested eventually
creating unique tags (numbers) for each frame or object, which
are stored with the object and with the reference. With the
two rules this should not be necessary. However, the use of
Write.Permission=password makes it possible to implement such checks.
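
A minimal R sketch of such validity checks, assuming a reference is a
list of object name and frame, with hypothetical helpers make_ref(),
ref_is_valid() and is_local(). Note that R environments stay alive as
long as something refers to them, so the dangling-frame problem of S+
is only imitated here.

```r
# Assumed representation: a reference is a list(name, frame).
make_ref <- function(name, frame = parent.frame()) {
  force(frame)
  # rule 1: only create references to existing objects
  if (!exists(name, envir = frame, inherits = FALSE))
    stop("cannot reference a non-existing object: ", name)
  list(name = name, frame = frame)
}

# Does the referenced object (still) exist in its frame?
ref_is_valid <- function(r) exists(r$name, envir = r$frame, inherits = FALSE)

# Does the reference point into the calling function's own frame?
# If so, returning it would violate rule 2.
is_local <- function(r, frame = parent.frame()) identical(r$frame, frame)

f <- function() {
  tmp <- 1:3
  r <- make_ref("tmp")
  is_local(r)                      # r points into f's own frame
}
f()
```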

SOME DETAILS of SYNTAX
======================
I followed a suggestion and renamed previous function object()
into deref() for getting a referenced object. Previous function
as.pointer() will be renamed ref() to support the first rule.
ref() and deref() will be my main tools for working with
references.

let
r <- ref(o)
and
o <- deref(r)
then
ref(r)

will not give a reference to r but will give a reference to o
as ref(o) does. Similarly

deref(o)

will give o itself. Further

deref(r) <- o2 equals o <- o2

if write permission has been given to o by

r <- ref(o, write.permission=T)

For details see REF.TXT, REF.S, REF.R.
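
The behaviour described above can be imitated in R as follows. This is
a sketch under assumptions, not the code of REF.S or REF.R: a reference
is assumed to be a list (object name plus frame) of class 'reference',
and the write permission is stored as an attribute of the referenced
object.

```r
ref <- function(x, frame = parent.frame(), write.permission = FALSE) {
  force(frame)
  if (inherits(x, "reference")) return(x)     # ref(r) refers to o, not to r
  name <- deparse(substitute(x))
  if (write.permission) {                     # permission goes to the object
    attr(x, "Write.Permission") <- TRUE
    assign(name, x, envir = frame)
  }
  structure(list(name = name, frame = frame), class = "reference")
}

deref <- function(r) {
  if (!inherits(r, "reference")) return(r)    # deref(o) gives o itself
  get(r$name, envir = r$frame)
}

# Replacement function, so that deref(r) <- o2 acts like o <- o2.
`deref<-` <- function(r, value) {
  perm <- attr(get(r$name, envir = r$frame), "Write.Permission")
  if (!isTRUE(perm)) stop("referenced object has no write permission")
  attr(value, "Write.Permission") <- perm     # keep the permission
  assign(r$name, value, envir = r$frame)
  r
}

o <- c(1, 2, 3)
r <- ref(o, write.permission = TRUE)
deref(r) <- c(7, 8, 9)             # changes o itself
as.vector(deref(r))                # the new values, read through r
```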

ODD BEHAVIOUR of Recall()
=========================
Nobody commented on the odd behaviour of Recall(), which
creates two frames and two copies of the parameters on each
recall. I assume this to be a bug, which should be removed.
Recall() should behave like generic functions, which do not
create two frames.

COMPILER
========
Matt Calder reminds us that he offers a (syntax limited) S+
compiler for free.
Matt Calder <http://www.stat.colostate.edu/~calder>

R people are working on an optimizing compiler for an S-like
language (speedup factor 100-200 for scalar operations), but
they say they need 'a couple of years' before they have a
mature compiler.
Ross <ihaka@stat.auckland.ac.nz>

==============
END of Summary
==============

--
Jens Oehlschlaegel-Akiyoshi
Psychologist/Statistician
Project TR-EAT + COST Action B6
                                                 F.rankfurt
oehl@psyres-stuttgart.de                         A.ttention
+49 711 6781-408 (phone)                         I.nventory
+49 711 6876902  (fax)                           R .-----.
                                                  / ----- \
Center for Psychotherapy Research                | | 0 0 | |
Christian-Belser-Strasse 79a                     | |  ?  | |
D-70597 Stuttgart Germany                         \ ----- /
-------------------------------------------------- '-----' -
(general disclaimer)                             it's better