ARCHIVED: A walkthrough of APackOfTheClones v0.1.x • APackOfTheClones

Introduction

Single-cell RNA sequencing (scRNA-seq) and T cell receptor (TCR) sequencing are popular techniques for studying immune cell function and disease. The combined use of such data can provide insights into clonal expansion. APackOfTheClones provides a simple, publication-ready method to intuitively visualize clonal expansion between different cell clusters with ggplot2, and can be easily slotted into any analysis pipeline.

In this user vignette, we will assume basic familiarity with Seurat and R, and a Seurat object has already been processed through the basic pipeline until at least the UMAP reduction has been obtained. Most importantly, clonal expansion can only be analyzed if the biosample has been run through the 10X genomics Chromium Single Cell Immune Profiling service, and the resulting data processed by Cell Ranger.

We will be showing basic usage of the main function of the package, Namely, in the integration of the 10X genomics T-cell receptor library generated by Cell Ranger into a Seurat object, followed by the usage of a fully customizable clonal expansion plotting method.

Setup the Seurat object and T cell receptor library

Before anything else can be done, we need a Seurat object of the single cell experiment with a dimensional reduction. The most modern and common reduction used in publications is UMAP (Uniform Manifold Approximation and Projection) which is what this package defaults to, and what this vignette will use. There is also a way to base it off of the t-SNE (t-distributed stochastic neighbor embedding) or PCA (Principle Component Analysis) reduction though choosing the latter is probably not the best idea.

library(Seurat)
suppressPackageStartupMessages(library(APackOfTheClones))

# Here, a Seurat object has already been loaded named `pbmc`
print(pbmc)

#> An object of class Seurat 
#> 31053 features across 8000 samples within 1 assay 
#> Active assay: RNA (31053 features, 2000 variable features)
#>  3 layers present: counts, data, scale.data
#>  2 dimensional reductions calculated: pca, umap

Here, we can also see its UMAP reduction plot. It is very important that it is present within the Seurat object so that the later clonal expansion visualization can be based on the coordinates for intuitiveness.

umap_plot <- UMAPPlot(pbmc)
umap_plot

To analyze clonal expansion, the T-cell library (in the form of the all_contig_annotations.csv file) generated by Cell Ranger has to be “integrated” into the seurat object. This simply means that the cell-level receptor information is incorporated into the @meta.data attribute. However, before doing so, let’s quickly view the file structure. An r dataframe of the read-in file has been loaded named tcr_dataframe:

head(tcr_dataframe)

barcode	is_cell	contig_id	high_confidence	length	chain	v_gene	d_gene	j_gene	c_gene	full_length	productive	cdr3	cdr3_nt	reads	umis	raw_clonotype_id	raw_consensus_id
AAACCTGAGTTAACGA-1	True	AAACCTGAGTTAACGA-1_contig_1	True	362	TRB	None	None	TRBJ2-5	TRBC1	False	None	None	None	23679	4	clonotype4	None
AAACCTGAGTTAACGA-1	True	AAACCTGAGTTAACGA-1_contig_2	True	872	TRB	TRBV19	TRBD1	TRBJ1-1	TRBC1	True	True	CASSIRVNTEVFF	TGTGCCAGCAGTATCAGGGTAAACACAGAAGTCTTCTTT	3993	3	clonotype4	clonotype4_consensus_2
AAACCTGAGTTAACGA-1	True	AAACCTGAGTTAACGA-1_contig_3	True	324	Multi	None	None	TRAJ50	TRBC1	False	None	None	None	4870	2	clonotype4	None
AAACCTGAGTTAACGA-1	True	AAACCTGAGTTAACGA-1_contig_6	True	749	TRA	TRAV10D	None	TRAJ49	TRAC	True	True	CAARDTGYQNFYF	TGTGCAGCAAGGGACACGGGTTACCAGAACTTCTATTTT	14844	5	clonotype4	clonotype4_consensus_1
AAACCTGCATCCGGGT-1	True	AAACCTGCATCCGGGT-1_contig_1	True	557	TRA	TRAV19	None	TRAJ56	TRAC	True	True	CAAGGGNNKLTF	TGCGCAGCAGGAGGAGGCAATAATAAGCTGACTTTT	14438	4	clonotype5	clonotype5_consensus_2
AAACCTGCATCCGGGT-1	True	AAACCTGCATCCGGGT-1_contig_2	True	619	TRB	TRBV5	TRBD1	TRBJ1-4	TRBC1	True	True	CASSLTGISNERLFF	TGTGCCAGCAGCCTCACAGGGATTTCCAACGAAAGATTATTTTTC	3721	2	clonotype5	clonotype5_consensus_1

This is the first 5 rows of what the file should look like directly from cell ranger. If you for some reason have some custom version of this dataframe that is different, it is highly important that there AT LEAST exists the column with the column name barcode (column 1) and raw_clonotype_id (column 17), otherwise everything else will not work as intended.

how do I load in the raw file?

library(readr)
tcr_dataframe <- readr::read_csv("your_file_location/all_contig_annotations.csv")

Integration of clonotype information

To finally incorporate the TCR(T-cell receptor library) data into the seurat object, we’ll use the simple integrate_tcr function.

pbmc <- integrate_tcr(pbmc, tcr_dataframe, verbose = FALSE)

Note that the last argument verbose was intentionally set to FALSE due to the presence of a progress bar which corrupts the formatting of the vignette (it defaults to the recommended TRUE). Otherwise, the output to the R Console would have been the following:

#> integrating TCR library into seurat object
#>   |====================================| 100%
#> Percent of unique barcodes: 31 %

By calling integrate_tcr, one can now find all the information from all_contig_annotations.csv within the @meta.data attribute. Let’s take a quick look on what it looks like:

head(pbmc@meta.data)

	orig.ident	nCount_RNA	nFeature_RNA	RNA_snn_res.0.6	seurat_clusters	barcode	is_cell	contig_id	high_confidence	length	chain	v_gene	d_gene	j_gene	c_gene	full_length	productive	cdr3	cdr3_nt	reads	umis	raw_clonotype_id	raw_consensus_id
AAACCTGAGCAACGGT-1	SeuratProject	3456	1203	1	1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
AAACCTGAGCAGCCTC-1	SeuratProject	7335	2334	8	8	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
AAACCTGAGTTAACGA-1	SeuratProject	2949	1144	0	0	AAACCTGAGTTAACGA-1	True	AAACCTGAGTTAACGA-1_contig_1__AAACCTGAGTTAACGA-1_contig_2__AAACCTGAGTTAACGA-1_contig_3__AAACCTGAGTTAACGA-1_contig_6	True	324__362__749__872	Multi__TRA__TRB	None__TRAV10D__TRBV19	None__TRBD1	TRAJ49__TRAJ50__TRBJ1-1__TRBJ2-5	TRAC__TRBC1	False__True	None__True	CAARDTGYQNFYF__CASSIRVNTEVFF__None	None__TGTGCAGCAAGGGACACGGGTTACCAGAACTTCTATTTT__TGTGCCAGCAGTATCAGGGTAAACACAGAAGTCTTCTTT	3993__4870__14844__23679	2__3__4__5	clonotype4	clonotype4_consensus_1__clonotype4_consensus_2__None
AAACCTGCATCCGGGT-1	SeuratProject	3586	1017	0	0	AAACCTGCATCCGGGT-1	True	AAACCTGCATCCGGGT-1_contig_1__AAACCTGCATCCGGGT-1_contig_2	True	557__619	TRA__TRB	TRAV19__TRBV5	None__TRBD1	TRAJ56__TRBJ1-4	TRAC__TRBC1	True	True	CAAGGGNNKLTF__CASSLTGISNERLFF	TGCGCAGCAGGAGGAGGCAATAATAAGCTGACTTTT__TGTGCCAGCAGCCTCACAGGGATTTCCAACGAAAGATTATTTTTC	3721__14438	2__4	clonotype5	clonotype5_consensus_1__clonotype5_consensus_2
AAACCTGCATCTACGA-1	SeuratProject	2659	1141	3	3	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
AAACCTGGTCTAACGT-1	SeuratProject	5266	1189	0	0	AAACCTGGTCTAACGT-1	True	AAACCTGGTCTAACGT-1_contig_1__AAACCTGGTCTAACGT-1_contig_2	True	565__651	TRA__TRB	TRAV13D-2__TRBV13-3	None__TRBD1	TRAJ38__TRBJ2-7	TRAC__TRBC1	True	True	CAIDPPNVGDNSKLIW__CASSDDRVGEQYF	TGTGCCAGCAGTGACGACAGGGTGGGGGAACAGTACTTC__TGTGCTATAGACCCCCCTAATGTTGGTGACAACAGTAAGCTGATTTGG	7172__13349	2__8	clonotype6	clonotype6_consensus_1__clonotype6_consensus_2

From this snippet of the meta data, we can see that for this particular seurat object, the first 5 columns orig.ident, nCount_RNA, nFeature_RNA, RNA_snn_res.0.6, seurat_clusters were originally present. What we care about is all the new columns after them. A few key things to note: * There are a lot of NA values, and that is due to 1) some cells aren’t T cells, 2) some T cell receptors do not end up being sequenced for many possible reasons, 3) Of the sequenced T cells in the TCR library, there are usually a lot of duplicate barcodes (repeated TCR sequencing of the same cell) as each individual cell may have several TCR types and/or need several contigs to assemble the final sequence. * unique data about TCRs of the same cell (same barcode) are collapsed together with __ in the order of the contigs

Now, the seurat object can be analyzed downstream with the new integrated information however we wish. It also is now ready to have its clonotypes counted and the clonal expansion visualized.

do I HAVE to integrate the the TCR library into my seurat object?

well… no. But there is no reason not to do so and it is heavily recommended to do so. Otherwise, we are about to see in the upcoming clonal expansion plotting function how you can avoid the integration.

How do I get the clonal expansion per cluster information myself?

There is the count_clone_sizes(seurat_obj) function for doing so. Check the documentation by running ?count_clone_sizes for more details.

Ball packing visualization

All one has to do to produce the visualization is by simply running the following code:

pbmc_expansion_plot <- clonal_expansion_plot(pbmc, verbose = FALSE)
# once again, `verbose` is set to false but it defaults to TRUE
pbmc_expansion_plot

How do I base the clonal expansion plot off of the t-SNE or PCA reduction?

clonal_expansion_plot has an optional argument reduction which defaults to 'umap' that handles which reduction the plot is based off of. The argument can be set to 'tsne' or 'pca' like so:

# t-SNE based plot
pbmc_expansion_plot <- clonal_expansion_plot(pbmc, reduction = 'tsne')
# PCA based plot
pbmc_expansion_plot <- clonal_expansion_plot(pbmc, reduction = 'pca')

A few things to note about the plot

The clonotype counts for each seurat cluster corresponds to their position and color in the original UMAP centroids
The most expanded clonotypes are in the center of each circle cluster with larger sizes, symbolizing increased expansion
There are less circles than the scRNAseq cell count due to aforementioned reasons.
On the top left, there is a somewhat covered visual legend of the relative clone sizes
The red and green clusters have considerable visual overlap
The returned plot is a fully customizable ggplot object

As APackOfTheClones is in a rapid-development phase, the resulting clonal expansion plot will usually not be visually satisfactory on the first run without customizations. However, there are many optional arguments and ggplot tricks that we are about to explore that will help us make it publication quality.

How do I create the plot without integrating the Seurat object and TCR library?

clonal_expansion_plot will also work if the user provides the original seurat object and the raw TCR library dataframe, like so:

# let pbmc be the raw seurat object before the integration
pbmc_expansion_plot <- clonal_expansion_plot(pbmc, tcr_dataframe, verbose = FALSE)

Adjusting intra-cluster spacing

The considerable overlap between clusters (due to the algorithm’s attempt to fit clusters to the original UMAP coordinates) may sometimes obstruct eachother excessively.

To account for this, there are four optional arguments in clonal_expansion plot:

repulse = FALSE,
repulsion_threshold = 1,
repulsion_strength = 1,
max_repulsion_iter = 10

For more details on them, read the “Arguments” section in the function documentation (?clonal_expansion_plot). But to summarize, first, to make the circle clusters move away from eachother, repulse should be set to TRUE, and the function should be ran AGAIN. (If you feel this excessive re-running takes too long or is inefficient for your workflow in its current form, please make an issue on the github page.)

Setting repulse to TRUE with those default parameters should probably yield something like the following:

pbmc_expansion_plot <- clonal_expansion_plot(pbmc, repulse = TRUE, verbose = FALSE)
pbmc_expansion_plot

As we can see, the red and green clusters have “moved away” from each other, and there is very little overlap remaining. All other clusters which were not touching remain in their original positions.

Do I HAVE to re-run clonal_expansion_plot each time I want to modify a parameter?

Unfortunately, yes. With the current version (v1) of APackOfTheClones, to change the appearance of the plot, clonal_expansion_plot has to be re-ran with modified parameters each time, which unfortunately may be slow/tedious if a lot of modifications are required. Please create a github issue or contact the author if a new version should be made to prevent this.

If you are still unhappy with the spacing of the plot, see the following ways to adjust the spacing:

repulsion_threshold indicates the amount of ggplot2 units of overlap between clusters that are acceptable. It defaults to 1, meaning that two clusters that overlap by about 1 unit are considered by the repulsion algorithm to not be overlapping. Increasing this number will increase the amount of overlap between clusters, and decreasing this number will do the opposite, while decreasing it to negative values will incur additional spacing between clusters. However, using negative spacing may cause the plot to look very spaced out. Keep reading to see alternatives to doing so.
repulsion_strength relates to how much the clusters should repel each other. The repulsion algorithm works in iterations, where for each iteration, each cluster “pushes” each other away from eachother by some amount. Increasing this value will cause extra “pushing” during each iteration. However, increasing this factor too much may result once again in a very visually unpleasant plot. Note that closer clusters will repell eachother more, and larger clusters will repel smaller clusters more.
max_repulsion_iter indicates the number of iterations where clusters should repel eachother. Increasing this number would ensure that clusters will (almost always) for sure not be overlapping. A trick with this parameter to make more pleasant plots is to decrease repulsion_strength and increase max_repulsion_iter to possibly make a more pleasant arrangement of clusters.

Adjusting and customizing size legend

The plot we’ve generated here has a size legend on the top left, but we see that it is partially outside of the plot. Unfortunately the algorithm for placing the legend is far from perfect in version 0 so the user will have to do a little extra work.

There are the following six parameters to adjust the legend:

add_size_legend = TRUE,
legend_sizes = c(1, 5, 50),
legend_position = "top_left",
legend_buffer = 1.5,
legend_color = "#808080",
legend_spacing = 0.4

Here, we can adjust the argument legend_buffer, which indicates the amount of ggplot2 units to move the legend “inwards” by. We can also slightly increase legend_spacing, an argument specifying the vertical distance between circles in the legend.

# the previous arguments have to be maintained!
pbmc_expansion_plot <- clonal_expansion_plot(
  pbmc,
  legend_buffer = 3,
  legend_spacing = 0.6,
  repulse = TRUE,
  verbose = FALSE
)
pbmc_expansion_plot

We can even change the circle (clone) sizes displayed in the legend by changing the vector legend_sizes:

pbmc_expansion_plot <- clonal_expansion_plot(
  pbmc,
  legend_sizes = c(1, 5, 25, 50),
  legend_buffer = 3,
  legend_spacing = 0.6,
  repulse = TRUE,
  verbose = FALSE
)
pbmc_expansion_plot

More details about these arguments can be read in the function documentation.

Adjusting and customizing circle size and scaling

So far, the sizes of the individual circles (clones) have seemed to look quite pleasant, fitting roughly into their UMAP cluster centroids snugly. However, this may not always be the case. A likely very common issue upon the initialization of the plot with default arguments is that the circles may be too small or too large.

In these cases, the argument clone_scale_factor should be changed. The argument defaults to 0.1, which is the multiplicative scale factor for the square root radius of a clone. So if a clone size is four, and the factor is 0.1, then the actual radius of the circle on the plot would be sqrt(4) * 0.1 which equates to 0.2.

Additionally, rad_scale_factor can optionally be changed to adjust the intra-circular spacing. The factor defaults to 0.95 and indicates the multiplicative amount to adjust the radius of the circles after they have been placed. decreasing this value (slightly) will basically make each circle smaller in place, increasing the spacing between then,

For demonstration sake, we’ll change the clone_scale_factor to 0.07 and rad_scale_factor to 0.9:

pbmc_expansion_plot <- clonal_expansion_plot(
  pbmc,
  clone_scale_factor = 0.07,
  rad_scale_factor = 0.9,
  legend_sizes = c(1, 5, 25, 50),
  legend_buffer = 3,
  legend_spacing = 0.6,
  repulse = TRUE,
  verbose = FALSE
)
pbmc_expansion_plot

Other possible modifications

There are a few other arguments this vignette did not cover which can be read in the function documentation. However, to end off, one more important argument to consider is use_default_theme = TRUE. It makes the plot have the same axis labels and general theme as the original UMAP plot. If this isn’t to your liking or you want to customize the plot your own way (since clonal_expansion_plot generates a ggplot object!), simply set it to FALSE and you’ll have a very minimally themed plot:

clonal_expansion_plot(
  pbmc,
  use_default_theme = FALSE,
  clone_scale_factor = 0.07,
  rad_scale_factor = 0.9,
  legend_sizes = c(1, 5, 25, 50),
  legend_buffer = 3,
  legend_spacing = 0.6,
  repulse = TRUE,
  verbose = FALSE
)

Additionally, if you really want the plot to reflect the original UMAP plot, you could try using ggplot2’s xlim and ylim functions.

Final product

That’s about it for the most basic functionalities of the clonal expansion visualization function. Remember to save the plot first as an .svg file for maximal resolution, and it should be publication ready.

For fun, here’s a side-by-side comparison of the UMAP plot and the clonal expansion plot:

library(ggplot2)
library(cowplot)

cowplot::plot_grid(
  umap_plot + ggtitle("scRNA-seq UMAP"),
  pbmc_expansion_plot + ggtitle("APackOfTheClones clonal expansion plot"),
  labels = "AUTO"
)

How do I reproduce the plots in this vignette?

If you really want to reproduce the plots and code in this vignette, you will need to install the data used in this vignette which has been compiled in an R package SCIPData (Single Cell Immune Profiling Data). You can do so with the following code

# install the data package
library(devtools)
devtools::install_github("Qile0317/SCIPData")

# load the package data directly into R objects in memory
data("pbmc", "tcr_dataframe")

And the rest of the code can be copied and used accordingly.

Alternatively, there are two smaller, minimal, artificial versions of the seurat object and clonotype data that can be loaded in APackOfTheClones, run ?mini_seurat_obj and ?mini_clonotype_data for more details.