Yiheng An

Leveraging Supervised and Unsupervised Machine Learning to Study Shapes

2023-02-21T00:00:00-07:00

As sensor technology improves, data volumes grow. We now live in a sea of data collected by our phones, smartwatches, and home assistants like Alexa. Science is no different: new sensors are enabling the collection of large datasets that can be mined for new scientific discoveries. In plant science, sensor technology is being applied to study how plants grow under drought conditions.

NOTE: Access the workshop notebook here.

Research

We will be using data collected by the Field Scanalyzer at the University of Arizona Maricopa Agricultural Center. The Field Scanalyzer covers over a hectare of land, capturing data from over 20,000 plants over a growing season. The Field Scanalyzer is equipped with stereo RGB and thermal cameras, a PSII chlorophyll fluorescence imager, and a pair of 3D laser scanners (pictured below).

Collectively, these sensors capture 20 terabytes (TB) in a three-month period, which makes converting the raw data into information a difficult task. Extracting that information requires machine learning, high performance computers, and distributed computing.

These data enable me and other scientists to study how plants respond to drought stress under real-world, field conditions. These data will contribute to efforts aimed at improving the resiliency of plants to drought stress.

Data

Today we will be working with 3D point cloud data collected by the Field Scanalyzer. These data capture plant shapes at fine-scale resolution. We will: (i) extract topological data analysis (TDA) shape descriptors, (ii) run principal component analysis (PCA) on these data, and (iii) classify plants by variety.
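To give a feel for step (ii), here is a minimal sketch of the idea behind PCA using only the Python standard library. It is not the workshop notebook's code, and in practice you would reach for a dedicated library. The toy points and all function names are made up for illustration. The sketch centers the data, builds the covariance matrix, and finds the direction of greatest variance by power iteration.

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def first_component(cov, iterations=100):
    """Approximate the leading eigenvector of cov by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical 2-feature "shape descriptors"; not real Field Scanalyzer data.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(toy)
pc1 = first_component(covariance(centered))

# Project every sample onto the first principal component.
scores = [sum(a * b for a, b in zip(row, pc1)) for row in centered]
print(pc1)
print(scores)
```

The scores are the one-dimensional coordinates a classifier in step (iii) could use as input. Note that this simple power iteration assumes the data actually vary; it would divide by zero on constant data.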

Workshop materials

Google Colab notebook

Acknowledgements

With special thanks to:

Dr. Duke Pauli & lab members
Dr. Eric Lyons & lab members

Using interactive data visualization to make sense of large datasets

2022-11-16T00:00:00-07:00

NOTE: Access the workshop notebook here.

Phenomics: A case study in big data

These multiple sources of data provide fine-scale information on plant growth under drought (decreased water) conditions. Today, we will use some of these data to learn interactive visualization using Python!

Workshop materials

Google Colab notebook

Survey

Please provide your feedback to improve future workshops here: https://bit.ly/2022-ds2f.

Additional materials

Seminar invitation
- School of Plant Sciences Seminar - Transforming a quarter petabyte of field phenomics data into functional traits and beyond
  - Date: Tuesday, 22-Nov
  - Time: 4pm
  - Zoom link: https://arizona.zoom.us/j/83941552191
  - Password: spls2022
Reading
- Living in Data: A Citizen’s Guide to a Better Information Future by Jer Thorp
- Data Science by John D. Kelleher and Brendan Tierney
Software
- PhytoOracle
  - Data processing pipelines that convert raw data from the Field Scanalyzer into phenotypic trait information
  - To check out our open source code, click here.

Acknowledgements

This program is funded by the University of Arizona Libraries: https://data.library.arizona.edu/ds2f.

With special thanks to:

Jeffrey Oliver
Megan Senseney
Jim Martin
Yvonne Mery
Leslie Sult
Cheryl Casey

Pip install without sudo on HPC clusters

2022-03-22T00:00:00-07:00

Learn how to pip install Python packages without root access.

Introduction

High performance computer (HPC) clusters are shared resources. As such, sudo/root access is denied to prevent one user from potentially harming the system or deleting data. This makes installing Linux dependencies and/or Python libraries difficult. So how do we get around this?

Finding the default Python

When installing Python packages, it is important to know the default Python version. To find your default version, run the following command:

which python3

This prints the path to your default Python interpreter, for example /usr/bin/python3, the path used in the commands below.

You can check for other Python versions by running:

ls -ls /usr/bin/python*
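If you prefer to check from inside Python, the same information can be gathered with the standard library. This is only a sketch that mirrors the two shell commands above; the variable names are arbitrary.

```python
import glob
import shutil
import sys

# Path of the first "python3" found on PATH, or None if there is none
# (the equivalent of "which python3").
default_python = shutil.which("python3")
print("python3 on PATH:", default_python)

# Interpreter running this script, and its version.
print("current interpreter:", sys.executable)
print("version:", ".".join(str(part) for part in sys.version_info[:3]))

# Other interpreters under /usr/bin (the equivalent of the ls command).
other_pythons = sorted(glob.glob("/usr/bin/python*"))
print("found in /usr/bin:", other_pythons)
```

On some systems the list will be empty because Python lives elsewhere, so treat an empty result as a hint to look in other directories rather than as an error.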

Installing Python libraries

To install packages without sudo/root access, run the following command, making sure to insert your package name:

/usr/bin/python3 -m pip install <package name> --user

For example, if we wanted to install the awesome giotto-tda package, we would run:

/usr/bin/python3 -m pip install giotto-tda --user

So why does this work? Notice the --user flag: it ensures that the package is installed only within your own user environment. This allows you to install packages without sudo/root access on HPC systems and servers. Give it a try!
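To see where a --user install ends up, you can ask Python's standard site module. This is a small sketch; the exact directories depend on your system and Python version, so no particular output is shown.

```python
import site
import sys

# Base directory for per-user installs (typically somewhere under your home folder).
user_base = site.getuserbase()

# Directory where "pip install --user" places importable packages.
user_site = site.getusersitepackages()

print("user base:", user_base)
print("user site-packages:", user_site)

# The directory is added to the import path only if it exists
# and user site-packages are enabled for this interpreter.
print("on import path:", user_site in sys.path)
```

Because these directories sit inside your own account, writing to them does not require root permissions, which is why the install succeeds on a shared cluster.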

Phenomic Data Exploration

2022-03-21T00:00:00-07:00

Explore field scanalyzer multimodal phenomic data!

Introduction

The field scanalyzer at the University of Arizona Maricopa Agricultural Center is a multimodal phenotyping platform that travels along rails and captures images and point clouds of thousands of plants. These data are processed using PhytoOracle distributed processing pipelines. Given the size of raw data, all field scanalyzer data types are processed on the University of Arizona high performance computer cluster.

Figure 1. The field scanalyzer is an outdoor plant phenotyping platform at the University of Arizona Maricopa Agricultural Center.

Sensors enclosed within the sensor box include stereo RGB and thermal cameras, a PSII chlorophyll fluorescence imager, and a pair of 3D laser line scanners. All sensors collect data at the full field scale, except PSII chlorophyll fluorescence which collects data at the center of each agricultural plot.

Figure 2. (A) The field scanalyzer covers a 1 hectare field. (B) The platform collects RGB, thermal, PSII chlorophyll fluorescence, and 3D laser scanner data. (C) The raw data volume is sensor dependent, ranging from 5 to 350 GB. All sensor data is captured at the full field scale, except for PSII chlorophyll fluorescence, which captures data from the center of each agricultural plot. (D) Raw sensor data is temporarily stored on a cache server, where it is programmatically compressed and uploaded onto CyVerse. Compressed data is downloaded, processed, and outputs are transferred on the UA high performance clusters.

Irrigation treatment & weather data

Figure 3. Volumetric water content (%) over the course of the growing period. For each collection, measurements were taken at depths 10, 30, 50, 70, and 90 cm. Each point represents the mean value of two measurements.

Figure 4. Weather data throughout the growing period collected by the Arizona Meteorological Network (AZMET).

Test dataset

To download our numerical, tabular test dataset, click here. This dataset contains RGB, thermal, PSII chlorophyll fluorescence, and 3D line scanner phenotypic trait data. For a full description of the dataset, click here. The figures below show only those lettuce types included in the test dataset, although you can view the trends of other lettuce types by clicking on each figure’s legend.

To download our point cloud test dataset in an archived, compressed “tar.gz” format, click here. To access the same data in an uncompressed Google Drive folder, click here.

Morphological phenotypes

RGB

Figure 5. Bounding area time series showing plant development over the growing period. Error bars represent the 95% CI around the mean. Means represent the phenotypic average of a lettuce type, including all genotypes and their respective replicates within a treatment.
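For readers who want to compute a similar interval on the test dataset, here is a minimal sketch of a 95% confidence interval around a mean. The post does not state how the intervals in the figures were calculated, so this assumes the common normal approximation (mean ± 1.96 × standard error). The sample values are invented and the function name is hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Return (mean, lower, upper) using the normal approximation."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = 1.96 * sem
    return mean, mean - half_width, mean + half_width

# Hypothetical trait values for one lettuce type on one scan date.
trait_values = [10.0, 12.0, 11.0, 13.0, 14.0]
mean, lower, upper = mean_ci95(trait_values)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

With small samples, a t-distribution multiplier would give a wider and more appropriate interval than 1.96.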

3D laser scanner

Figure 6. Height time series showing plant development over the growing period. Error bars represent the 95% CI around the mean. Means represent the phenotypic average of a lettuce type, including all genotypes and their respective replicates within a treatment.

Physiological phenotypes

Thermal

Figure 7. Canopy temperature over the growing period. Error bars represent the 95% CI around the mean. Means represent the phenotypic average of a lettuce type, including all genotypes and their respective replicates within a treatment.

PSII chlorophyll fluorescence

Figure 8. Maximum quantum efficiency of PSII (Fv/Fm) over the growing period. Error bars represent the 95% CI around the mean. Means represent the phenotypic average of a lettuce type, including all genotypes and their respective replicates within a treatment.
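As a reminder of what this trait means, Fv/Fm is conventionally defined as (Fm - F0) / Fm, where F0 is the minimal and Fm the maximal fluorescence of a dark-adapted sample. The sketch below applies that standard definition; it is not taken from the PhytoOracle pipeline, and the input numbers are invented.

```python
def fv_fm(f0, fm):
    """Maximum quantum efficiency of PSII: (Fm - F0) / Fm."""
    if fm <= 0:
        raise ValueError("Fm must be positive")
    if not 0 <= f0 <= fm:
        raise ValueError("F0 must lie between 0 and Fm")
    return (fm - f0) / fm

# Hypothetical fluorescence readings in arbitrary units.
example = fv_fm(f0=200.0, fm=1000.0)
print(example)
```

The result is a unitless ratio between 0 and 1, which is why it can be compared across plants and scan dates.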

Setting up iRODS

2022-01-31T00:00:00-07:00

Learn how to install and use the Integrated Rule-Oriented Data System (iRODS). iRODS is open source data management software used by research groups, such as CyVerse. This software provides access to data on the terminal, whether that be your local computer or a high performance computer (HPC). Below are the steps to getting iRODS installed on your machine and an example of a data download.

CyVerse Account Registration

Create an account here
Access the CyVerse DataStore here
Login to your account by clicking on the Login icon:
You can now navigate the CyVerse DataStore. Check out our phenomics research data collected by the Field Scanner here
Follow the steps below to get iRODS command access on your terminal so that you can download large datasets.

iRODS Installation

macOS users

Download the macOS installer here.
Follow the installation steps.
On your terminal, run:
```
 iinit
```

Fill in the prompts with:

| Host name | Port # | Username | Zone | Password |
| --- | --- | --- | --- | --- |
| data.cyverse.org | 1247 | CyVerse User ID | iplant | CyVerse password |

You’re now ready to start downloading data!
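After iinit succeeds, the iRODS client normally keeps the connection settings in a JSON file in your home directory (usually ~/.irods/irods_environment.json) so you are not asked for them again. The sketch below shows roughly what those settings look like for the values in the table. The key names follow the usual iRODS client convention but should be checked against your own installation, the username is a placeholder, and the password is not part of this file.

```python
import json

def build_irods_environment(user_name):
    """Return the connection settings from the table as a dict."""
    return {
        "irods_host": "data.cyverse.org",
        "irods_port": 1247,
        "irods_user_name": user_name,
        "irods_zone_name": "iplant",
    }

# "your_cyverse_id" is a placeholder, not a real account.
settings = build_irods_environment("your_cyverse_id")
print(json.dumps(settings, indent=4))
```

The sketch only prints the settings. Let iinit create and manage the real file so that authentication is handled correctly.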

Linux & Windows Subsystem for Linux 2 (WSL2) users

Download the iRODS installation shell script and give it executable permissions:

 wget https://raw.githubusercontent.com/emmanuelgonz/emmanuelgonz.github.io/master/files/install_irods_copy.sh && chmod 755 install_irods_copy.sh

Run the installation script:
```
 sudo ./install_irods_copy.sh
```
Log in to iRODS:
```
 iinit
```

Fill in the prompts with:

| Host name | Port # | Username | Zone | Password |
| --- | --- | --- | --- | --- |
| data.cyverse.org | 1247 | CyVerse User ID | iplant | CyVerse password |

You’re ready to start downloading some data!

iRODS Data Download

Let’s say we want to download some hyperspectral data on the phytooracle CyVerse DataStore. Follow the steps below to do just that:

Open the CyVerse DataStore website
Find the file you’d like to download
To download the highlighted file above, copy the “Path” and run the iget command. Below is an example:
```
 iget -KPVT /iplant/home/shared/phytooracle/season_12_sorghum_soybean_sunflower_tepary_yr_2021/level_0/VNIR/VNIR-2021-05-29__12-17-47-496_sunflower.tar.gz
```
NOTE

Below is an explanation of each flag used above:
- -K Verify the checksum
- -P Output the progress of the download
- -V Verbose
- -T Renew socket connection after 10 minutes
It’s recommended to use the -KT flags, as they help prevent errors due to internet connectivity. To see a full list of other flags/options, click here.
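If you download many files, it can help to assemble the command in a script. The sketch below builds an iget command from the four flags explained above. The helper name and the example path are hypothetical, and the final subprocess call is left commented out because it needs iRODS installed and a prior iinit.

```python
import shlex

def build_iget_command(remote_path, verify=True, progress=True,
                       verbose=True, renew_socket=True):
    """Assemble an iget command line from the flags explained above."""
    flags = ""
    if verify:
        flags += "K"   # verify the checksum
    if progress:
        flags += "P"   # output download progress
    if verbose:
        flags += "V"   # verbose output
    if renew_socket:
        flags += "T"   # renew socket connection after 10 minutes
    command = ["iget"]
    if flags:
        command.append("-" + flags)
    command.append(remote_path)
    return command

# Hypothetical path; substitute the "Path" copied from the DataStore.
cmd = build_iget_command("/iplant/home/shared/example/file.tar.gz")
print(shlex.join(cmd))

# To actually run it (requires iRODS and a prior iinit):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems when a path contains spaces.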

Creating an academic website on GitHub

2021-10-04T00:00:00-07:00

Learn how to create a website to showcase your academic achievements! This tutorial will walk you through setting up an academic website on GitHub. You can add publications, blog posts, and a CV to your website to share with people as you network! We will get some more practice with the terminal by interacting with the Git command line interface (CLI).

Preparation (reviewed in previous workshop)

Fork the Academic Pages repo.
Rename the repo to your GitHub username:
Click on the green “Code” button and copy the link to clone your own repo.
On your terminal, run:
```
 git clone <link here>
```

About page

Open your integrated development environment (IDE) and open up the directory containing your cloned repo.
Open the _pages directory and click on the about.md file.
Remove all the text under line 9 (highlighted in blue).
Edit the header information (title, excerpt)
You can add an image by placing it in the images directory. Use the following code to include it in your page:
```
 <img title="" alt="Alt text" src="images/">
```

Add, commit, and push your changes:

 git add *

 git commit -m 'changes to about me'

 git push origin

Now, navigate to your website, which is accessible at [your GitHub username].github.io

Publications page

Open the _publications directory and click on the 2009-10-01-paper-title-number-1.md file.
Edit the title, permalink, etc. Example below:
Create a new file for each publication.
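The example file name suggests a date followed by a hyphenated title. If you have many publications, a small helper can generate names in that pattern. This is only a convenience sketch; the function name is hypothetical, and it assumes your template expects the same naming scheme as the example file.

```python
import re
from datetime import date

def publication_filename(published, title):
    """Build a name like 2009-10-01-paper-title-number-1.md."""
    # Lowercase the title and replace runs of other characters with hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{published.isoformat()}-{slug}.md"

name = publication_filename(date(2009, 10, 1), "Paper Title Number 1")
print(name)
```

Save each generated file in the _publications directory and fill in its header fields by hand, as described above.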

Add, commit, and push your changes:

 git add *

 git commit -m 'changes to publications'

 git push origin

Now, navigate to your website, which is accessible at [your GitHub username].github.io

CV page

Open the _pages directory and click on the cv.md file.
Add your education, work experience, and skills. Example below:

Add, commit, and push your changes:

 git add *

 git commit -m 'changes to cv'

 git push origin

Now, navigate to your website, which is accessible at [your GitHub username].github.io

iRODS Crash Course

2021-09-23T00:00:00-07:00

Learn how to use iRODS for your research data management needs. This tutorial will walk you through downloading and uploading data using iRODS.

Let’s download a file. Run:

 iget -N 0 -PVT /iplant/home/emmanuelgonzalez/acic_2021_tutorials/mavic_mini_2_sorghum.mp4

Did you run into any problems?
- I did not share the file with you, that’s why you got that error!

Now that I have shared the file with you, run the command again:

 iget -N 0 -PVT /iplant/home/emmanuelgonzalez/acic_2021_tutorials/mavic_mini_2_sorghum.mp4

To open the folder in which you downloaded the file run the following command depending on your OS:
- macOS
```
  open .
```
- WSL 2
```
  explorer.exe .
```
- Linux
```
  xdg-open .
```
Now upload the file to your CyVerse Data Store, run:
```
 iput -N 0 -PVT mavic_mini_2_sorghum.mp4
```
Go to the CyVerse Data Store and navigate to your home directory.
You can share a file by logging into the CyVerse Data Store, clicking on the 3 dots on the far right and clicking “Share.”
Share the file with someone present on the Zoom call.
Congratulations, you are now an iRODS expert!

Terminal, GitHub, and iRODS Essentials

2021-09-20T00:00:00-07:00

Learn how to leverage the terminal for GitHub version control and Integrated Rule-Oriented Data System (iRODS) data management! This tutorial is split into three parts:

Part A: Terminal
- Set up a Linux workspace for scientific computing.
Part B: GitHub
- Build a website to share this with employers, network connections, etc.
Part C: iRODS
- Set up iRODS within your terminal and upload/download data.

Tutorial requirements:

Computer, either Windows, Linux, or Mac OS

CyVerse account, get one here

GitHub account, get one here

Note: We may run into errors during this workshop. Do not be discouraged; this is part of the workspace setup. It is painful at first, but once it’s over, it’s worth it!

Part A: Terminal

Your terminal will look and act differently depending on your operating system (OS). There are a variety of OSs out there including Ubuntu, Windows, Mac OS, etc. Since the majority of scientific computing is done on Linux, that will be the focus of this tutorial.

macOS & Linux users

You are ready to proceed. Just open your terminal! I strongly suggest you pay attention to the Windows Subsystem for Linux 2 (WSL 2) set up, as you may find this useful when you develop for other OSs.

Windows users

We need to download and install WSL 2. I use this as my go-to workspace, as it allows me to run my code on Linux but have my computer run Windows 10. You will have a Linux terminal running on the subsystem, but your main OS will be Windows! Isn’t that cool?

Let’s get this set up on your computer by following the steps below:

Open PowerShell as Admin and run:

 dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

Right-click on the Windows Start icon, click on Run, type winver. Confirm that you meet the requirements below.

WSL 2 Requirements

x64 systems: Version 1903 or higher, with Build 18362 or higher.

ARM64 systems: Version 2004 or higher, with Build 19041 or higher.
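The check against winver can be written down as a tiny function. The sketch below compares only the build number, on the assumption that build numbers increase with each Windows version; newer version labels such as 20H2 are not plain numbers, so comparing them directly would be fragile. The names are hypothetical.

```python
# Minimum build per architecture, taken from the requirements above.
WSL2_MINIMUM_BUILD = {
    "x64": 18362,    # Version 1903
    "ARM64": 19041,  # Version 2004
}

def meets_wsl2_requirements(arch, build):
    """Return True if the build reported by winver is new enough."""
    if arch not in WSL2_MINIMUM_BUILD:
        raise ValueError(f"unknown architecture: {arch}")
    return build >= WSL2_MINIMUM_BUILD[arch]

# Hypothetical values read off the winver dialog.
print(meets_wsl2_requirements("x64", 19045))
print(meets_wsl2_requirements("ARM64", 18363))
```

If the function returns False for your machine, update Windows before continuing with the remaining steps.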

Enable the Virtual Machine feature by running:

 dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Download and install the Linux kernel update by clicking here.

Note: If you get an error during the Linux kernel installation, restart your computer and retry Step #5.
Go back to your admin PowerShell window and set WSL 2 as your default WSL version by running:
```
 wsl --set-default-version 2
```
Open the Windows Store and download Ubuntu.
Download the Windows Terminal app.
Open the Windows Terminal app. You are now ready to go! You will be asked to create a username and password.

Part B: GitHub

Setting up SSH keys

We need to set up an SSH key to easily push changes to your repos.

On your terminal, run the following and press Enter at every prompt to accept the defaults:
```
 ssh-keygen
```
Print and copy the contents of your public key file (on newer OpenSSH versions the default file may be ~/.ssh/id_ed25519.pub instead):
```
 cat ~/.ssh/id_rsa.pub
```
Open GitHub, click on your Profile Picture > Settings > SSH and GPG keys > New SSH Key.
Paste the contents of your file which you previously copied into the Key field, add a descriptive title, and click “Add SSH Key”.

Fork & clone a repo

Fork the Academic Pages repo.
Rename the repo to your GitHub username:
Click on the green “Code” button and copy the link to clone your own repo.
On your terminal, run:
```
 git clone <link here>
```

Part C: iRODS Data Management

macOS users

Download the macOS installer here.
Follow the installation steps.
On your terminal, run:
```
 iinit
```

Fill in the prompts with:

| Host name | Port # | Username | Zone | Password |
| --- | --- | --- | --- | --- |
| data.cyverse.org | 1247 | CyVerse User ID | iplant | CyVerse password |

You’re now ready to start downloading some data!

WSL 2 & Linux users

Download the iRODS installation shell script and give it executable permissions:

 wget https://raw.githubusercontent.com/emmanuelgonz/emmanuelgonz.github.io/master/files/install_irods_copy.sh && chmod 755 install_irods_copy.sh

Run the installation script:
```
 sudo ./install_irods_copy.sh
```
Log in to iRODS:
```
 iinit
```

Fill in the prompts with:

| Host name | Port # | Username | Zone | Password |
| --- | --- | --- | --- | --- |
| data.cyverse.org | 1247 | CyVerse User ID | iplant | CyVerse password |

You’re ready to start downloading some data! Let’s continue our tutorial here.

Resources

For details on the vim editor, run the following:
```
  vimtutor
```