The environment is the data structure that powers scoping. This chapter dives deep into environments, describing their structure in depth, and using them to improve your understanding of the four scoping rules described in lexical scoping.
Environments can also be useful data structures in their own right because they have reference semantics. When you modify a binding in an environment, the environment is not copied; it’s modified in place. Reference semantics are not often needed, but can be extremely useful.
If you can answer the following questions correctly, you already know the most important topics in this chapter. You can find the answers at the end of the chapter in answers.
List at least three ways that an environment is different to a list.
What is the parent of the global environment? What is the only environment that doesn’t have a parent?
What is the enclosing environment of a function? Why is it important?
How do you determine the environment from which a function was called?
How are <- and <<- different?
Environment basics introduces you to the basic properties of an environment and shows you how to create your own.
Recursing over environments provides a function template for computing with environments, illustrating the idea with a useful function.
Function environments revises R’s scoping rules in more depth, showing how they correspond to four types of environment associated with each function.
Binding names to values describes the rules that names must follow (and how to bend them), and shows some variations on binding a name to a value.
Explicit environments discusses three problems where environments are useful data structures in their own right, independent of the role they play in scoping.
This chapter uses many functions from the pryr package to pry open R and look inside at the messy details. You can install pryr by running install.packages("pryr").
The job of an environment is to associate, or bind, a set of names to a set of values. You can think of an environment as a bag of names:
Each name points to an object stored elsewhere in memory:
The objects don’t live in the environment so multiple names can point to the same object:
Confusingly they can also point to different objects that have the same value:
If an object has no names pointing to it, it gets automatically deleted by the garbage collector. This process is described in more detail in gc.
Every environment has a parent, another environment. In diagrams, I’ll represent the pointer to parent with a small black circle. The parent is used to implement lexical scoping: if a name is not found in an environment, then R will look in its parent (and so on). Only one environment doesn’t have a parent: the empty environment.
We use the metaphor of a family to refer to environments. The grandparent of an environment is the parent’s parent, and the ancestors include all parent environments up to the empty environment. It’s rare to talk about the children of an environment because there are no back links: given an environment we have no way to find its children.
Generally, an environment is similar to a list, with four important exceptions:
Every name in an environment is unique.
The names in an environment are not ordered (i.e., it doesn’t make sense to ask what the first element of an environment is).
An environment has a parent.
Environments have reference semantics.
More technically, an environment is made up of two components: the frame, which contains the name-object bindings (and behaves much like a named list), and the parent environment. Unfortunately “frame” is used inconsistently in R. For example, parent.frame() doesn’t give you the parent frame of an environment. Instead, it gives you the calling environment. This is discussed in more detail in calling environments.
There are four special environments:
The globalenv(), or global environment, is the interactive workspace. This is the environment in which you normally work. The parent of the global environment is the last package that you attached with library() or require().
The baseenv(), or base environment, is the environment of the base package. Its parent is the empty environment.
The emptyenv(), or empty environment, is the ultimate ancestor of all environments, and the only environment without a parent.
The environment() is the current environment.
search() lists all parents of the global environment. This is called the search path because objects in these environments can be found from the top-level interactive workspace. It contains one environment for each attached package and any other objects that you’ve attach()ed. It also contains a special environment called Autoloads which is used to save memory by only loading package objects (like big datasets) when needed.
You can access any environment on the search list using as.environment().
globalenv(), baseenv(), the environments on the search path, and emptyenv() are connected as shown below. Each time you load a new package with library() it is inserted between the global environment and the package that was previously at the top of the search path.
To create an environment manually, use new.env(). You can list the bindings in the environment’s frame with ls() and see its parent with parent.env().
The easiest way to modify the bindings in an environment is to treat it like a list:
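A minimal sketch of creating and modifying an environment, assuming the base R functions new.env(), parent.env(), and ls() are the ones meant here. The names e and here are only illustrative.

```r
# Create an environment; by default its parent is the environment
# that new.env() was called from.
e <- new.env()
here <- environment()
identical(parent.env(e), here)
#> [1] TRUE

# Modify bindings as you would the elements of a named list.
e$a <- FALSE
e$b <- "a"
e[["c"]] <- 2.3

# List the names bound in the environment's frame.
ls(e)
#> [1] "a" "b" "c"
```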
By default, ls() only shows names that don’t begin with a period (.). Use all.names = TRUE to show all bindings in an environment:
Another useful way to view an environment is ls.str(). It is more useful than str() because it shows each object in the environment. Like ls(), it also has an all.names argument.
Given a name, you can extract the value to which it is bound with $, [[, or get():
$ and [[ look only in one environment and return NULL if there is no binding associated with the name.
get() uses the regular scoping rules and throws an error if the binding is not found.
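The three ways of extracting a value can be compared in a short sketch. It assumes the base R operators $ and [[ and the function get(). The name no_such_binding_xyz is a made-up name chosen so that it is unlikely to exist anywhere.

```r
e <- new.env()
e$x <- 1

# All three return the value bound to x.
e$x
#> [1] 1
e[["x"]]
#> [1] 1
get("x", envir = e)
#> [1] 1

# $ and [[ look in this one environment only and return NULL for a missing name.
is.null(e$no_such_binding_xyz)
#> [1] TRUE
is.null(e[["no_such_binding_xyz"]])
#> [1] TRUE

# get() follows the scoping rules and signals an error if nothing is found.
res <- tryCatch(
  get("no_such_binding_xyz", envir = e),
  error = function(err) "error"
)
res
#> [1] "error"
```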
Deleting objects from environments works a little differently from lists. With a list you can remove an entry by setting it to NULL. In environments, that will create a new binding to NULL. Instead, use rm() to remove the binding.
You can determine if a binding exists in an environment with exists(). Like get(), its default behaviour is to follow the regular scoping rules and look in parent environments. If you don’t want this behaviour, use inherits = FALSE:
To compare environments, you must use identical(), not ==:
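A sketch of deleting, testing for, and comparing, assuming the base R functions rm(), exists(), and identical(). The variable names are illustrative.

```r
e <- new.env()
e$a <- 1

# Setting to NULL keeps the binding; rm() removes it.
e$a <- NULL
exists("a", envir = e, inherits = FALSE)
#> [1] TRUE
rm("a", envir = e)
exists("a", envir = e, inherits = FALSE)
#> [1] FALSE

# exists() looks in parent environments unless told not to.
x <- 10   # bound in the current environment, which is the parent of e
exists("x", envir = e)
#> [1] TRUE
exists("x", envir = e, inherits = FALSE)
#> [1] FALSE

# identical() compares environments; == signals an error.
identical(e, e)
#> [1] TRUE
eq <- tryCatch(e == e, error = function(err) "error")
eq
#> [1] "error"
```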
List three ways in which an environment differs from a list.
If you don’t supply an explicit environment, where do and look? Where does make bindings?
Using and a loop (or a recursive function), verify that the ancestors of include and . Use the same basic idea to implement your own version of .
Recursing over environments
Environments form a tree, so it’s often convenient to write a recursive function. This section shows you how by applying your new knowledge of environments to understand the helpful pryr::where(). Given a name, where() finds the environment where that name is defined, using R’s regular scoping rules:
The definition of where() is straightforward. It has two arguments: the name to look for (as a string), and the environment in which to start the search. (We’ll learn later why parent.frame() is a good default in calling environments.)
There are three cases:
The base case: we’ve reached the empty environment and haven’t found the binding. We can’t go any further, so we throw an error.
The successful case: the name exists in this environment, so we return the environment.
The recursive case: the name was not found in this environment, so try the parent.
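The three cases above can be sketched as a recursive function. This is an illustration written from the description, not necessarily the package's own source. The name find_env and the two test environments are hypothetical.

```r
find_env <- function(name, env = parent.frame()) {
  if (identical(env, emptyenv())) {
    # Base case: nowhere left to look.
    stop("Can't find ", name, call. = FALSE)
  } else if (exists(name, envir = env, inherits = FALSE)) {
    # Successful case: the name is bound in this environment.
    env
  } else {
    # Recursive case: try the parent.
    find_env(name, parent.env(env))
  }
}

# Two small environments: env_inner's parent is env_outer, whose parent is empty.
env_outer <- new.env(parent = emptyenv())
assign("b", 2, envir = env_outer)
env_inner <- new.env(parent = env_outer)
assign("a", 1, envir = env_inner)

identical(find_env("a", env_inner), env_inner)
#> [1] TRUE
identical(find_env("b", env_inner), env_outer)
#> [1] TRUE
msg <- tryCatch(find_env("c", env_inner), error = function(e) conditionMessage(e))
msg
#> [1] "Can't find c"
```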
It’s easier to see what’s going on with an example. Imagine you have two environments as in the following diagram:
If you’re looking for a name bound in the first environment, where() will find it there.
If you’re looking for a name bound only in the second environment, it’s not in the first environment, so where() will look in its parent and find it there.
If you’re looking for a name bound in neither, it’s not in the first environment or the second environment, so where() reaches the empty environment and throws an error.
It’s natural to work with environments recursively, so where() provides a useful template. Removing the specifics of where() shows the structure more clearly:
Modify to find all environments that contain a binding for .
Write your own version of using a function written in the style of .
Write a function called that finds only function objects. It should have two arguments, and , and should obey the regular scoping rules for functions: if there’s an object with a matching name that’s not a function, look in the parent. For an added challenge, also add an argument which controls whether the function recurses up the parents or only looks in one environment.
Write your own version of (Hint: use .) Write a recursive version that behaves like .
Most environments are not created by you with but are created as a consequence of using functions. This section discusses the four types of environments associated with a function: enclosing, binding, execution, and calling.
The enclosing environment is the environment where the function was created. Every function has one and only one enclosing environment. For the three other types of environment, there may be 0, 1, or many environments associated with each function:
Binding a function to a name with <- defines a binding environment.
Calling a function creates an ephemeral execution environment that stores variables created during execution.
Every execution environment is associated with a calling environment, which tells you where the function was called.
The following sections will explain why each of these environments is important, how to access them, and how you might use them.
The enclosing environment
When a function is created, it gains a reference to the environment where it was made. This is the enclosing environment and is used for lexical scoping. You can determine the enclosing environment of a function by calling environment() with a function as its first argument:
In diagrams, I’ll depict functions as rounded rectangles. The enclosing environment of a function is given by a small black circle:
The previous diagram is too simple because functions don’t have names. Instead, the name of a function is defined by a binding. The binding environments of a function are all the environments which have a binding to it. The following diagram better reflects this relationship because the enclosing environment contains a binding from to the function:
In this case the enclosing and binding environments are the same. They will be different if you assign a function into a different environment:
The enclosing environment belongs to the function, and never changes, even if the function is moved to a different environment. The enclosing environment determines how the function finds values; the binding environments determine how we find the function.
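A small sketch of the difference between enclosing and binding environments, assuming the base R function environment(). The names here, e, and g are illustrative.

```r
here <- environment()   # the environment this code runs in
e <- new.env()

# The function is created here, but bound to the name g inside e.
e$g <- function() 1

# Enclosing environment: where the function was created.
identical(environment(e$g), here)
#> [1] TRUE

# Binding environment: where the name g lives.
exists("g", envir = e, inherits = FALSE)
#> [1] TRUE
```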
The distinction between the binding environment and the enclosing environment is important for package namespaces. Package namespaces keep packages independent. For example, if package A uses the base function, what happens if package B creates its own function? Namespaces ensure that package A continues to use the base function, and that package A is not affected by package B (unless explicitly asked for).
Namespaces are implemented using environments, taking advantage of the fact that functions don’t have to live in their enclosing environments. For example, take the base function . Its binding and enclosing environments are different:
The definition of uses , but if we make our own version of it doesn’t affect :
This works because every package has two environments associated with it: the package environment and the namespace environment. The package environment contains every publicly accessible function, and is placed on the search path. The namespace environment contains all functions (including internal functions), and its parent environment is a special imports environment that contains bindings to all the functions that the package needs. Every exported function in a package is bound into the package environment, but enclosed by the namespace environment. This complicated relationship is illustrated by the following diagram:
When we type into the console, it’s found first in the global environment. When looks for it finds it first in its namespace environment so never looks in the .
What will the following function return the first time it’s run? What about the second?
This function returns the same value every time it is called because of the fresh start principle, described in a fresh start. Each time a function is called, a new environment is created to host execution. The parent of the execution environment is the enclosing environment of the function. Once the function has completed, this environment is thrown away.
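The function under discussion is not reproduced in this text. The following is a hypothetical function of the kind described, offered only to illustrate the fresh start principle.

```r
g <- function() {
  # Look only in the execution environment, which is new on every call.
  if (!exists("a", inherits = FALSE)) {
    a <- 1
  } else {
    a <- a + 1
  }
  a
}

first <- g()
second <- g()
c(first, second)
#> [1] 1 1
```

The else branch never runs, because no binding for a survives from one call to the next.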
Let’s depict that graphically with a simpler function. I draw execution environments around the function they belong to with a dotted border.
When you create a function inside another function, the enclosing environment of the child function is the execution environment of the parent, and the execution environment is no longer ephemeral. The following example illustrates that idea with a function factory, . We use that factory to create a function called . The enclosing environment of is the execution environment of where is bound to the value 1.
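A sketch of a function factory along these lines. The names plus and plus_one are assumptions made for illustration, since the original names are not shown in this text.

```r
plus <- function(x) {
  # The returned function is enclosed by this execution environment.
  function(y) x + y
}

plus_one <- plus(1)
plus_one(10)
#> [1] 11

# x lives on in the enclosing environment of plus_one.
get("x", envir = environment(plus_one))
#> [1] 1

# The parent of that execution environment is the enclosing environment of plus.
identical(parent.env(environment(plus_one)), environment(plus))
#> [1] TRUE
```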
You’ll learn more about function factories in functional programming.
Look at the following code. What do you expect to return when the code is run?
The top-level (bound to 20) is a red herring: using the regular scoping rules, looks first where it is defined and finds that the value associated with is 10. However, it’s still meaningful to ask what value is associated within the environment where is called: is 10 in the environment where is defined, but it is 20 in the environment where is called.
We can access this environment using the unfortunately named parent.frame(). This function returns the environment where the function was called. We can also use this function to look up the value of names in that environment:
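A sketch of looking up the same name in the defining and the calling environment, assuming base R's get(), environment(), and parent.frame(). The names f2 and g2 are hypothetical.

```r
f2 <- function() {
  x <- 10
  function() {
    # environment() is the execution environment; get() walks up from it
    # to the enclosing environment, where x is 10.
    def <- get("x", environment())
    # parent.frame() is the environment the function was called from.
    cll <- get("x", parent.frame())
    list(defined = def, called = cll)
  }
}

g2 <- f2()
x <- 20
res <- g2()
res$defined
#> [1] 10
res$called
#> [1] 20
```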
In more complicated scenarios, there’s not just one parent call, but a sequence of calls which lead all the way back to the initiating function, called from the top-level. The following code generates a call stack three levels deep. The open-ended arrows represent the calling environment of each execution environment.
Note that each execution environment has two parents: a calling environment and an enclosing environment. R’s regular scoping rules only use the enclosing parent; parent.frame() allows you to access the calling parent.
Looking up variables in the calling environment rather than in the enclosing environment is called dynamic scoping. Few languages implement dynamic scoping (Emacs Lisp is a notable exception.) This is because dynamic scoping makes it much harder to reason about how a function operates: not only do you need to know how it was defined, you also need to know in what context it was called. Dynamic scoping is primarily useful for developing functions that aid interactive data analysis. It is one of the topics discussed in non-standard evaluation.
List the four environments associated with a function. What does each one do? Why is the distinction between enclosing and binding environments particularly important?
Draw a diagram that shows the enclosing environments of this function:
Expand your previous diagram to show function bindings.
Expand it again to show the execution and calling environments.
Write an enhanced version of that provides more information about functions. Show where the function was found and what environment it was defined in.
Binding names to values
Assignment is the act of binding (or rebinding) a name to a value in an environment. It is the counterpart to scoping, the set of rules that determines how to find the value associated with a name. Compared to most languages, R has extremely flexible tools for binding names to values. In fact, you can not only bind values to names, but you can also bind expressions (promises) or even functions, so that every time you access the value associated with a name, you get something different!
You’ve probably used regular assignment in R thousands of times. Regular assignment creates a binding between a name and an object in the current environment. Names usually consist of letters, digits, . and _, and can’t begin with _. If you try to use a name that doesn’t follow these rules, you get an error:
Reserved words (like , , , and ) follow the rules but are reserved by R for other purposes:
A complete list of reserved words can be found in ?Reserved.
It’s possible to override the usual rules and use a name with any sequence of characters by surrounding the name with backticks:
The regular assignment arrow, <-, always creates a variable in the current environment. The deep assignment arrow, <<-, never creates a variable in the current environment, but instead modifies an existing variable found by walking up the parent environments.
If <<- doesn’t find an existing variable, it will create one in the global environment. This is usually undesirable, because global variables introduce non-obvious dependencies between functions. <<- is most often used in conjunction with a closure, as described in Closures.
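A minimal sketch of deep assignment used with a closure. The names make_counter and counter are illustrative.

```r
make_counter <- function() {
  i <- 0
  function() {
    # <<- walks up to the enclosing environment and rebinds i there,
    # instead of creating a new local i.
    i <<- i + 1
    i
  }
}

counter <- make_counter()
first <- counter()
second <- counter()
c(first, second)
#> [1] 1 2
```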
There are two other special types of binding, delayed and active:
Rather than assigning the result of an expression immediately, a delayed binding creates and stores a promise to evaluate the expression when needed. We can create delayed bindings with the special assignment operator %<d-%, provided by the pryr package.
%<d-% is a wrapper around the base delayedAssign() function, which you may need to use directly if you need more control. Delayed bindings are used to implement autoload(), which makes R behave as if the package data is in memory, even though it’s only loaded from disk when you ask for it.
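A sketch of a delayed binding using base R's delayedAssign() directly. The names lazy and evaluated are illustrative.

```r
evaluated <- FALSE

# The expression is stored as a promise and not run yet.
delayedAssign("lazy", {
  evaluated <- TRUE
  42
})

before <- evaluated
before
#> [1] FALSE

# The first access forces the promise.
lazy
#> [1] 42
evaluated
#> [1] TRUE
```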
Active bindings are not bound to a constant object. Instead, they’re re-computed every time they’re accessed:
%<a-% is a wrapper for the base function makeActiveBinding(). You may want to use this function directly if you want more control. Active bindings are used to implement reference class fields.
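A sketch of an active binding using base R's makeActiveBinding(). The names ticks and count are illustrative.

```r
e <- new.env()
count <- 0

# The function is re-run on every access to e$ticks.
makeActiveBinding("ticks", function() {
  count <<- count + 1
  count
}, e)

first <- e$ticks
second <- e$ticks
c(first, second)
#> [1] 1 2
```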
What does this function do? How does it differ from and why might you prefer it?
Create a version of that will only bind new names, never re-bind old names. Some programming languages only do this, and are known as single assignment languages.
Write an assignment function that can do active, delayed, and locked bindings. What might you call it? What arguments should it take? Can you guess which sort of assignment it should do based on the input?
As well as powering scoping, environments are also useful data structures in their own right because they have reference semantics. Unlike most objects in R, when you modify an environment, it does not make a copy. For example, look at this function.
If you apply it to a list, the original list is not changed because modifying a list actually creates and modifies a copy.
However, if you apply it to an environment, the original environment is modified:
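The function referred to is not reproduced in this text. A hypothetical function of that shape shows the contrast between copy-on-modify and reference semantics.

```r
modify <- function(x) {
  x$a <- 2
  invisible()
}

# A list is copied when modified, so the original is unchanged.
x_l <- list()
x_l$a <- 1
modify(x_l)
x_l$a
#> [1] 1

# An environment is modified in place.
x_e <- new.env()
x_e$a <- 1
modify(x_e)
x_e$a
#> [1] 2
```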
Just as you can use a list to pass data between functions, you can also use an environment. When creating your own environment, note that you should set its parent environment to be the empty environment. This ensures you don’t accidentally inherit objects from somewhere else:
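A sketch of the difference the parent makes, assuming base R's new.env() with its parent argument. The names e1 and e2 are illustrative.

```r
x <- 1

# Default parent: the current environment, so x is inherited.
e1 <- new.env()
get("x", envir = e1)
#> [1] 1

# Empty parent: nothing is inherited.
e2 <- new.env(parent = emptyenv())
exists("x", envir = e2)
#> [1] FALSE
```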
Environments are data structures useful for solving three common problems:
- Avoiding copies of large data.
- Managing state within a package.
- Efficiently looking up values from names.
These are described in turn below.
Since environments have reference semantics, you’ll never accidentally create a copy. This makes them a useful vessel for large objects. It’s a common technique for Bioconductor packages, which often have to manage large genomic objects. Changes in R 3.1.0 have made this use substantially less important because modifying a list no longer makes a deep copy. Previously, modifying a single element of a list would cause every element to be copied, an expensive operation if some elements are large. Now, modifying a list efficiently reuses existing vectors, saving much time.
Explicit environments are useful in packages because they allow you to maintain state across function calls. Normally, objects in a package are locked, so you can’t modify them directly. Instead, you can do something like this:
Returning the old value from setter functions is a good pattern because it makes it easier to reset the previous value in conjunction with on.exit() (see more in on exit).
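A sketch of package-style state kept in an environment, with a setter that returns the old value. All names here (my_env, get_a, set_a, with_a) are hypothetical.

```r
my_env <- new.env(parent = emptyenv())
my_env$a <- 1

get_a <- function() {
  my_env$a
}

set_a <- function(value) {
  old <- my_env$a
  my_env$a <- value
  invisible(old)   # return the previous value so callers can restore it
}

# Temporarily change the state, restoring it when the function exits.
with_a <- function(value, code) {
  old <- set_a(value)
  on.exit(set_a(old))
  force(code)
}

inside <- with_a(5, get_a())
inside
#> [1] 5
get_a()
#> [1] 1
```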
As a hashmap
A hashmap is a data structure that takes constant, O(1), time to find an object based on its name. Environments provide this behaviour by default, so can be used to simulate a hashmap. See the CRAN package hash for a complete development of this idea.
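A minimal sketch of using an environment as a name-to-value lookup table, using only base R. The keys and values are made up for illustration.

```r
# An empty parent ensures lookups never fall through to other environments.
h <- new.env(hash = TRUE, parent = emptyenv())

assign("apple", 3, envir = h)
h[["pear"]] <- 5

h[["apple"]]
#> [1] 3
exists("kiwi", envir = h, inherits = FALSE)
#> [1] FALSE

# Retrieve several keys at once as a named list.
vals <- mget(c("apple", "pear"), envir = h)
vals$pear
#> [1] 5
```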
There are four ways: every object in an environment must have a name; order doesn’t matter; environments have parents; environments have reference semantics.
The parent of the global environment is the last package that you attached. The only environment that doesn’t have a parent is the empty environment.
The enclosing environment of a function is the environment where it was created. It determines where a function looks for variables.
<- always creates a binding in the current environment; <<- rebinds an existing name in a parent of the current environment.
Why exactly does all this matter?
It's not immediately clear.
Typically a function is defined in the global environment, so that the
values of the free variables are just found in the user's workspace.
So the right thing to do is what most people are expecting:
if you can't find a value
inside the function itself, you just look in the global environment.
The idea here is that you can define things like
global variables that will be common to a lot of different functions
that you might be defining in your workspace.
But the key difference in R is
that you can define functions inside of other functions.
So, for example, a function can return a function as its return value.
Most functions return a list, or a vector, or a matrix, or
a data frame or something like that, but it is possible for a function
to return another function. If that's
the case, then the function that gets returned
was defined inside of another function.
So the environment in which it was defined is not the global environment.
It's really the inside of this other function.
So this is when things get interesting and this is when
the scoping rules really have an impact on what you can do.
So, I am going to define a very simple function here. Often
these kinds of functions are
what you might think of as constructor functions,
the idea being that the function is constructing another function.
So, here's what I want to do: I want to
create a function called make.power that defines another function.
And what make.power takes as input is a number n.
Inside the make.power function I define
another function called pow.
And pow is going to take an argument called x.
So what's going to happen is that
the pow function is going to
take the argument x and raise it to the
power n. Then make.power returns
the function pow as its return value. You see that inside the pow function, x
is a function argument, so that's not a problem, but n is a free variable
because it's not defined inside the pow function.
However, n is defined inside the make.power function, and
since that's the environment in which pow is defined,
it will find the value of n.
The pow function will find the
value of n inside this other environment.
So what happens is that I can call make.power and pass it a number like 3.
Then it will return a function, which I'll assign to
a name, cube.
Similarly, I can pass 2 to
make.power and create a function that I'll call square.
Now, when I pass cube the number 3, what
it does is raise 3 to the 3rd power, so I get 27.
If I call square on the number 3, it
raises 3 to the 2nd power, so it gives me 9.
So now I've
got one function that's capable of constructing
many different types of functions by raising to various powers.
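Written out as code, the function being described looks roughly like this. The names make.power, pow, cube, and square come from the talk. The exact layout is a reconstruction.

```r
make.power <- function(n) {
  # pow is defined inside make.power, so n is found in this environment.
  pow <- function(x) {
    x^n
  }
  pow
}

cube <- make.power(3)
square <- make.power(2)

cube(3)
#> [1] 27
square(3)
#> [1] 9
```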
So, how do you know what's in a function's environment?
You can look in the environment in which
the function was defined by calling the ls function.
If I call ls on the environment for cube,
you can see that inside the cube function
there's an object called n.
And if I use get on n, you'll see that the value of n is equal to 3.
That's how the cube function knows
to raise the argument to the 3rd power: because it's already defined
in its closure environment.
Similarly the environment for square, you can see
it has the exact same objects in it.
But now the value of n is equal to 2, in the square function.
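A sketch of that inspection, repeating the make.power definition so the block stands alone. It assumes the base R functions ls(), get(), and environment().

```r
make.power <- function(n) {
  pow <- function(x) {
    x^n
  }
  pow
}
cube <- make.power(3)
square <- make.power(2)

# Both closure environments hold the same names...
ls(environment(cube))
#> [1] "n"   "pow"
ls(environment(square))
#> [1] "n"   "pow"

# ...but n is bound to a different value in each.
get("n", environment(cube))
#> [1] 3
get("n", environment(square))
#> [1] 2
```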
So, I want to make one brief
comparison between lexical scoping, which is what R
does, and dynamic scoping, which is what
some other programming languages implement.
So here I'm assigning y the value 10.
Then I create a function f, which takes an argument x.
Inside, it assigns y equal to 2, squares y, and then adds g
of x. So, what's g?
g is another function, which takes an
argument called x, and it multiplies x times y.
So, in the f function, y is a local variable and g is a free variable,
since the g function is not defined
inside of f, nor is it an argument to f.
Then in the g function, the symbol y is a free variable.
And so the question
is, if I call f of 3, what gets returned?
With lexical scoping, the value of y in the function g
is looked up in the environment in which the function was defined,
which in this case was the global environment.
So the value of y in the g function is 10.
With dynamic scoping, the value of y is looked up in
the environment from which the function
was called, sometimes called the calling environment.
In R, the calling environment is
known as the parent frame.
In this case, in the calling environment y was defined to
be 2, and so the value of y would be 2.
Calling the function F would produce different answers depending
on whether you use lexical scoping or dynamic scoping.
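The example can be sketched in code. The functions f and g follow the description in the talk. The variants g_dynamic and f_dynamic are hypothetical additions that imitate dynamic scoping by looking y up in parent.frame().

```r
y <- 10

f <- function(x) {
  y <- 2
  y^2 + g(x)
}

g <- function(x) {
  x * y   # y is free here: found where g was defined, so y is 10
}

f(3)
#> [1] 34

# Imitating dynamic scoping: look y up in the caller's environment instead.
g_dynamic <- function(x) {
  x * get("y", envir = parent.frame())
}

f_dynamic <- function(x) {
  y <- 2
  y^2 + g_dynamic(x)
}

f_dynamic(3)
#> [1] 10
```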
The one thing that will
make lexical scoping and dynamic scoping look the
same is that when a function is defined
in the global environment and is subsequently called
from the global environment, then the defining environment
and the calling environment are exactly the same,
and so this can sometimes give the appearance
of dynamic scoping even when it doesn't exist.
So here I've got a function called g.
It takes an argument x. It assigns a to be equal to 3.
And then it adds x plus a plus y.
In this case, x is a formal argument.
a is a local variable: it's not a
formal argument, but I defined it inside the function,
so that's okay.
And then y is a free variable.
So if I call g of 2, the function g is
going to look for the value of y in the global environment.
If I haven't yet defined y then there has
to be an error, because it doesn't know what
value to assign to the symbol y. So that's what I get in this line here.
Now if I define what y is, say I assign it to be 3, and I call
g of 2, then it returns 8, because now it's able to find y in the global environment.
So even though it looks like the value of y was looked up in the calling
environment, it's actually the defining environment, because g
happened to be defined in the global environment.
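A sketch of that example. The first call is wrapped in tryCatch() so the script keeps running. It signals an error only in a session where no y has been defined yet.

```r
g <- function(x) {
  a <- 3          # local variable
  x + a + y       # y is a free variable
}

# With no y defined anywhere, this call fails with an "object not found" error.
attempt <- tryCatch(g(2), error = function(e) "error")

# Once y exists in the defining environment, the call succeeds.
y <- 3
g(2)
#> [1] 8
```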
There are a number of other languages that support lexical scoping.
Some examples are things like Scheme, Perl, Python, and Common Lisp.
And of course there's a well-known computer science
theorem, which is that all languages eventually converge to Lisp.
So it's not an obscure type of feature.
It's actually very common in a number of other programming languages.
So, one of the main consequences of lexical scoping in R
is that all the objects have to be stored in memory.
If you're working with
very small objects this is, generally speaking, not a big problem.
Because of the nature of the scoping rules and
because of the complexity of the environments and
the way they are all linked together, it's difficult
to implement this type of model outside of physical
memory.
So the consequence was that, when R was originally designed,
everything was stored in memory.
Things are getting complicated now because of very large data sets,
and being able to read them into R
is a challenge when everything has to be stored in memory.
Second, every function has to carry a
pointer to its respective defining environment.
And that defining environment could literally be anywhere,
because there could be functions within functions,
and if one function returns another function,
then there has to be a pointer to
that piece of memory where the defining environment is stored.
And so this makes the model a little bit
more complex but all the more useful.
In S-PLUS, which was kind of the original implementation of the S language,
the free variables were always looked up in the workspace.
Everything could be stored on the disk, because the
defining environment of all the functions was the same.