This post is about how to set up an adequate project environment. By this I mean the folders you should create and how you should save your files. The structure introduced here will help you keep your project organized and keep an overview of your work, and it will also make it easier to share your project with others.
Whenever you start a new programming project, you should set up the infrastructure described below. Such a project could be a term paper, a research endeavor, or just the code to create some visualizations. Later you might find that some aspects of this infrastructure feel like overkill, especially for very small undertakings. But especially in the beginning it's better to be safe than sorry and to set up the whole project as described below.
In all, setting up a good working environment includes the following steps:
Then you should always familiarize yourself with how to use the here package with your project.
There are some additional steps one might want to take, such as initiating a Git repository or setting up a renv environment. Moreover, for larger projects you might also want to add a README.md. But for now the steps mentioned above are sufficient. Before going through them one by one, however, we need to clarify two important technical concepts:
The working directory is a folder on the computer which R uses as a default anchor for all file paths used to access input, such as data sets, or to store output. The default working directory is the user directory, but it can be changed. We can display the current working directory using the getwd() function. In my case the working directory looks like this:
/Users/graebnerc/Teaching/DataScience22/
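To check this on your own machine, a minimal sketch using only base R could look as follows. The variable name current_wd is chosen purely for illustration, and the printed values will differ from computer to computer.

```r
# Store and display the current working directory.
# The result is an absolute path and depends on your machine.
current_wd <- getwd()
print(current_wd)

# basename() returns the name of the folder itself,
# dirname() returns the path of its parent folder.
print(basename(current_wd))
print(dirname(current_wd))
```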
Now assume we produced a plot in our current session and want to save it using the function ggplot2::ggsave(). As we already learned, this function takes, among others, an argument filename that specifies the name of the file that is meant to contain the plot. Now if I were to tell R to save the plot under the name test.pdf like this:
ggplot2::ggsave(filename = "test.pdf")
R would save it in the following location:
/Users/graebnerc/Teaching/DataScience22/test.pdf
As you can see, R uses the current working directory as an ‘anchor’, and all paths provided are relative to this anchor. This means that, assuming a folder called output exists in our working directory, we could save our file test.pdf in this folder by making the following function call:
ggplot2::ggsave(filename = "output/test.pdf")
Viewed from a global perspective, the file is saved here:
/Users/graebnerc/Teaching/DataScience22/output/test.pdf
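The example above assumes that the output folder already exists in the working directory. The following is a minimal sketch of how one might make sure of that before saving. It makes two assumptions for illustration: the folder demo_project inside R's temporary directory is a hypothetical stand-in for a real project folder, and base R's pdf() device is used instead of ggplot2::ggsave() so that the sketch runs without additional packages.

```r
# Hypothetical demo project inside R's temporary folder, so that no
# files are created elsewhere on your computer.
demo_dir <- file.path(tempdir(), "demo_project")
dir.create(demo_dir, showWarnings = FALSE)

# setwd() returns the previous working directory; we keep it so that
# we can switch back at the end. This is for illustration only.
old_wd <- setwd(demo_dir)

# Create the output folder only if it does not exist yet.
if (!dir.exists("output")) {
  dir.create("output")
}

# Save a simple plot using a relative path
# (base R stand-in for ggplot2::ggsave()).
pdf(file.path("output", "test.pdf"))
plot(1:10)
invisible(dev.off())

# Check that the file was written relative to the working directory.
plot_was_saved <- file.exists(file.path("output", "test.pdf"))
print(plot_was_saved)

# Restore the original working directory.
setwd(old_wd)
```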
Since the path provided is relative to the working directory, we refer to paths such as those we passed to ggsave() above as relative paths.
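To make the role of the anchor explicit, here is a small sketch in base R that spells out by hand how a relative path is combined with the working directory. The variable names are hypothetical; R does this resolution for you whenever you pass a relative path to a function.

```r
# A relative path: it does not start at the root of the file system.
relative_path <- file.path("output", "test.pdf")
print(relative_path)

# Combining the working directory (the anchor) with the relative path
# yields the location that the relative path refers to.
resolved_path <- file.path(getwd(), relative_path)
print(resolved_path)
```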
Alternatively, we could provide R directly with the absolute path. In this case, we need to type the complete path, starting from the root directory of the computer. Rather than relying on the working directory to complete the path implicitly as above, we would make the following call:
ggplot2::ggsave(
  filename = "/Users/graebnerc/Teaching/DataScience22/output/test.pdf"
)
When we use absolute paths, we can save a file at any position on the computer we want. For instance, we could make the following call
ggplot2::ggsave(
  filename = "/Users/graebnerc/GreatPlots/test.pdf"
)
to save the file here:
/Users/graebnerc/GreatPlots/test.pdf
While it seems attractive to use absolute paths because of their expressive power, i.e. the possibility to save files anywhere we want, I can only advise against using them. Absolute paths are something you might use in your console when you want to save a file quickly during a private programming session, but you should never use them in scripts.
A central argument in favor of relative paths is that code using relative paths can function when executed on different computers. Absolute paths look different on every computer, so code containing them will almost always produce errors when it is transferred to another computer. Have a look at the following path from above:
/Users/graebnerc/Teaching/DataScience22/output/test.pdf
I hope you agree that it is highly unlikely that a path involving my account name exists on your computer. So if I sent you a script that contains a reference to this path, it would produce an error once you executed it. For this reason, we will always use relative paths below.
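You can verify this yourself with a short base R sketch. It checks whether the folder from my absolute path exists on your machine, which will most likely not be the case, and contrasts this with a relative path that is always resolved against your own working directory. The variable names are chosen for illustration only.

```r
# The absolute path from the example above, taken from my computer.
foreign_path <- "/Users/graebnerc/Teaching/DataScience22/output/test.pdf"

# Does the folder that is supposed to contain the file exist here?
# On a computer other than mine this is most likely not the case.
target_dir_exists <- dir.exists(dirname(foreign_path))
print(target_dir_exists)

# A relative path, in contrast, is resolved against the working
# directory of whoever runs the script.
relative_target <- file.path("output", "test.pdf")
print(file.path(getwd(), relative_target))
```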
Of course, one problem is that the ‘anchor’ from which the relative path will be evaluated on my computer and on yours must somehow be harmonized. As we will learn below, this can be achieved through the use of R project files and the package here.
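As a brief preview, the following sketch shows what building a path with the here package might look like. It assumes that the package is installed (for instance via install.packages("here")) and that the code is run from within a project folder; the folder and file names are the hypothetical ones from the examples above. If the package is missing, the sketch simply prints a message instead of failing.

```r
# here::here() builds a path that starts at the project root,
# regardless of where on the computer the project is stored.
if (requireNamespace("here", quietly = TRUE)) {
  plot_path <- here::here("output", "test.pdf")
  print(plot_path)
} else {
  message("Package 'here' is not installed; skipping this example.")
}
```

The resulting path could then be passed to the filename argument of ggplot2::ggsave() in place of a hand-written path.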
First of all, you have to decide on a place on your computer in which everything related to your project, i.e. data, scripts, images, etc., should be saved. It is usually a good idea to avoid places such as the Desktop or your Downloads folder.
After having identified the right place for our project on our computer, we will now create an R project at exactly this place. To this end, open RStudio and either click on File/New Project, or on the blue button in the upper left part of the pane, directly to the right of the New File button. You should now see the ‘New Project Wizard’:
We click on New Directory and then on New Project. Then we should see the following: