This lesson challenges students to think critically about how good file and process management and organization support reproducible open science.
Students should review the following presentation PRIOR to participating in the activity.
|Time (minutes)|Topic|
|---|---|
|10|Intro to Reproducibility|
|25|Group Work - Identify issues|
|5|Wrap up / Survey|
First 10 Minutes
- Introduction to Reproducibility
- Story about a situation where reproducible practices would have been helpful
Why It Makes Science Better
- Help out your future self
- Contribute to building upon research efforts
- Error checking
List any other reasons or motivations for reproducibility.
You work in a lab where a colleague has moved on to a new job and left their research behind. Your supervisor has asked you to pick it up and move it forward. Look at the files that were left for you and answer the following questions:
- Are the contents of the directory easy to understand?
- Do you feel confident that you can easily recreate the workflow associated with the data / code?
- Do you have access to the data? What data are available and where / how were they collected?
Have the students work in small groups to:
- Create a list of things that would make the working directory easier to work with.
- Break that list into general “areas” / categories of reproducibility.
Files for an exercise on file, data, and code documentation and organization
Files in the subdirectory messy-dir-example can be used to help students identify problems that make it difficult to share or reuse analyses. There are many problems with the folder structure, file naming, data organization, and code organization in this example directory.
Some of the problems within this directory include:
- No metadata or readme
- No directory structure
- Background info is a picture of text instead of searchable text
- Multiple files with similar content and different names; ambiguous naming
- Some vector GIS files are missing and it is unclear why
- Tabular data is in a proprietary format
- Not clear which sites different files are from
- Not clear in which order the scripts were run or should be run
- In the code:
  - Multiple copies of similar code pasted near each other but with slight changes
  - Very few comments
  - Unclear about the order in which lines should be run
- In the tabular file foliar chem:
  - Notes at bottom of files
  - Notes off to the right in unlabeled column
  - Gap between columns
  - Column name starting with a number
  - Duplicate column names
  - Spaces in column names
  - Misspellings in columns that might be used as categorical variables
  - Different values for missing data
  - Dealing with dates in Excel (DANGER)
  - Units for values?
  - Where is metadata?
  - Using colors rather than machine-readable column flags
  - Multiple tabs
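Several of the column-name problems listed above can be checked automatically once the table is exported to plain text. The following is a minimal sketch, not part of the original lesson materials: the function name `check_column_names` and the sample header are hypothetical and do not reproduce the actual foliar chem file.

```python
import re
from collections import Counter


def check_column_names(names):
    """Group problematic column names by the kind of problem found."""
    counts = Counter(names)
    return {
        # Names beginning with a digit are awkward or invalid in many tools.
        "starts_with_digit": [n for n in names if re.match(r"\d", n)],
        # Spaces inside a name force quoting in most analysis languages.
        "contains_space": [n for n in names if " " in n.strip()],
        # The same name used more than once makes columns ambiguous.
        "duplicated": sorted(n for n, c in counts.items() if c > 1),
        # Positions of unlabeled columns (for example, stray notes or gaps).
        "blank_positions": [i for i, n in enumerate(names) if not n.strip()],
    }


# Hypothetical header row, invented purely for illustration.
example_header = ["site_id", "2014_nitrogen", "leaf mass", "notes", "notes", ""]

report = check_column_names(example_header)
for problem, offenders in report.items():
    print(problem, offenders)
```

A check like this only covers naming. Issues such as color-coded cells, notes below the data, or inconsistent missing-value codes still need to be found by inspecting the file itself.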
There are more issues with the repo that participants will find.
About This Lesson
This lesson was originally taught as part of the NEON Data Institute 2016 by Naupaka Zimmerman. The data and files are for the most part derived from various NEON remote sensing data products from the D17 California field sites.