Platypus Paper, Rewritten

This is a completely rewritten version of a biology paper titled, “A Model for the Evolution of the Mammalian T-cell Receptor α/δ and μ Loci Based on Evidence from the Duckbill Platypus.” It is meant to be a demonstration and a proof of concept for JAWWS, my idea of a science journal that focusses on readability.

You can read the original version on the Molecular Biology and Evolution journal’s website, or here on this blog, with annotations by me.

Why did I choose to rewrite this paper? I wish I had a more principled answer, but the truth is that I simply went to ResearchHub, a website where scientists share papers among themselves and upvote the most interesting ones, went to the Evolutionary Biology section (because that used to be my field), and picked the first paper that was open-access and seemed fit for my purposes. In other words, a paper that could be improved a lot because it seemed more difficult to understand than was warranted.

Only later did I realize it was a paper from 2012, so not that recent. I don’t think that matters too much for now since it’s just a proof of concept. Nor does the topic, i.e. molecular evolution in the vertebrate immune system. The actual journal will need to pick papers in a more principled way, of course.

What did my rewrite entail? The best way to know is to read (not necessarily closely) the original and rewritten versions. But here’s a sample of my “interventions”:

  • I put almost all citations in collapsible footnotes.
  • I cut up most long paragraphs, including the abstract.
  • I added many context sentences, including at the beginning of sections, to give a better sense of why we’re reading this. One example is the first sentence of the introduction: “How did the immune system of jawed vertebrates evolve?”
  • I reworked some of the paper’s structure. One major change: I put the major contribution of the study, that is, the new evolutionary model, in its own section after the introduction. This way, it is not buried deep in the discussion; readers can start with it and dig into the rest only if they want more details. I also reordered the methods so that they would match the ordering in the Results.
  • I added several subheadings to the sections that didn’t already have subsections (Introduction, and Results and Discussion).
  • I tried my best to avoid abbreviations. One difficulty is that some of them are probably very recognizable by people who know immunology, and not by me. So I left some in, while trying to make sure they don’t hamper readability. The major example is “TCR”, which means T-cell receptor and was used a lot. It’s still used in my version, but far less often.
  • I removed some jargon words. E.g. “proximal” became “closest”.
  • I formatted some information in point form, such as the T-cell lineages or the protocols in the Methods.
  • I added some text formatting to guide the reader. For instance, bold font for groups of animals in the introduction, and color in the text to match colored elements in figures.
  • I changed one figure by reorganizing its parts to make it clearer (fig. 2, which used to be fig. 5). A lot of additional clarity could potentially be gained from editing the figures, but that’s a lot of work, so I didn’t press it further.
  • I fixed a number of typos and grammatical errors. Mistakes like these are not a big problem, but there were enough that I assume little editing work was done on this paper.

On footnotes: I’m using two different kinds of collapsible footnotes. Those in the usual style of the blog, like this one,1Here you would usually read a citation in short form such as “Rast et al. 1997.” Head to the original paper webpage to see the reference list in full. contain the citations included in the original paper.  Footnotes with brackets like this[1]This is an example comment. are for comments on the rewriting process and are also shown at the bottom of the paper. I suggest you don’t click on the former, unless you want to see a reference, and hover onto the latter to read my comments.

Overall, my rewrite increased the length of the abstract from 267 to 286 words, and of the rest of the paper from about 6000 to about 6400 words. I consider this acceptable.

I will publish some more thoughts on the rewriting process later.

Abstract

Goal: This study presents a new model for the evolution of part of the vertebrate immune system: the genes encoding the T-cell receptor (TCR) δ chains.

Background: T lymphocytes have to recognize specific antigen for the adaptive immune response to work in vertebrates. They perform this using a somatically diversified T-cell receptor. All jawed vertebrates use four T-cell receptor chains called α, β, γ, and δ, but some lineages have nonconventional receptor chains: monotremes and marsupials encode a fifth one, called TCRµ. Its function is unknown, but it is somatically diversified like the conventional chains. Its origins are also unclear. It appears to be distantly related to the TCRδ chain, for which recent evidence from birds and frogs has provided new information that was not available from humans or other placental (eutherian) mammals.

Experiment: We analyzed the genes encoding the δ chains in the platypus. This revealed the presence of a highly divergent variable (V) gene, indistinguishable from immunoglobulin heavy chain V genes (VH), and related to V genes used in the µ chain. This gene is expressed as part of the TCRδ repertoire, so it is designated VHδ.

Conclusions: The VHδ gene is similar to what has been found in frogs and birds, but it is the first time such a gene has been found in a mammal. This provides a critical link in reconstructing the evolutionary history of TCRµ. The current structure of the δ and µ genes in tetrapods suggests ancient and possibly recurring translocations of gene segments between the δ and immunoglobulin heavy genes, as well as translocations of δ genes out of the TCRα/δ locus early in mammals, creating the TCRµ locus. We present a detailed model of this evolutionary history.[2]Major changes to the abstract: I split it in four paragraphs with section titles (this is common in some journals; it should be common in most journals). I also added a section at the beginning to … Continue reading

Introduction

How did the immune system of jawed vertebrates evolve? In this study, we use genomic evidence from the platypus to propose a model for the evolution of a specific component of the vertebrate immune system: the receptors on the surface of T lymphocytes.

As a reminder, T lymphocytes (or T cells) are white blood cells that play a critical role in the adaptive immune system. They can be classified into two main lineages based on the receptor they use:2Rast et al. 1997; reviewed in Davis and Chein 2008

  1. αβT cell lineage: The receptor is composed of a heterodimer of α and β chains. Most circulating human T cells are αβT cells, including familiar subsets such as CD4+ helper T cells and regulatory T cells, CD8+ cytotoxic T cells, and natural killer T (NKT) cells.
  2. γδT cell lineage: The receptor is composed of γ and δ chains. The function of these cells is less well defined. They have been associated with a broad range of immune responses including tumor surveillance, innate responses to pathogens and stress, and wound healing.3Hayday 2009 γδT cells are found primarily in epithelial tissues and form a lower percentage of circulating lymphocytes in some species.

αβ and γδ T cells also differ in the way they interact with antigen. The receptors of αβT cells are “restricted” relative to the major histocompatibility complex (MHC), meaning that they bind antigenic epitopes, such as peptide fragments, bound to, or “presented” by, molecules encoded in the MHC. In contrast, γδT receptors have been found to bind antigens directly in the absence of MHC, as well as self-ligands that are often MHC-related molecules.4Sciammas et al. 1994; Hayday 2009

All gnathostomes (jawed vertebrates) have αβ and γδ T cells. As we will see below, marsupial and monotreme mammals have an additional type of T-cell receptor, denoted with the letter µ. The platypus, a monotreme, further has a non-conventional receptor with δ chains, which is also present in birds and amphibians.

Before presenting our evolutionary model, let’s review these types of T-cell receptors and their structure.

Structure and Genes of Conventional T-Cell Receptors

The chains of conventional T-cell receptors are composed of two extracellular domains, both members of the immunoglobulin domain superfamily of cell surface proteins (fig. 1):5reviewed in Davis and Chein 2008

  • The closest domain to the cellular membrane is called C for constant.[3]Since this abbreviation comes up a lot, I put it first, with its meaning in parentheses. The C domain is largely invariant among T-cell clones expressing the same class of the receptor chain.
  • The domain farthest from the cellular membrane is called V for variable. It is the region that contacts antigen and MHC. Similar to antibodies, the individual clonal diversity in the V domain is generated by somatic DNA recombination.6Tonegawa 1983 [4]I didn’t change much in the figure’s caption, but it seemed pretty trivial to add color to the text to facilitate looking up what the colors mean.
Fig. 1. Cartoon diagram of the T-cell receptor (TCR) forms found in different species. Oblong circles indicate immunoglobulin superfamily domains and are color coded as C domains (blue), conventional V domains (red), and VHδ or Vµ (yellow). The gray shaded chains represent the hypothetical partner chain for µ and δ using VHδ.

While C domains are usually encoded by a single, intact exon, V domains are assembled somatically from germ-line segments in developing T cells. These segments are genes called V (again for variable), D (for diversity), and J (for joining). The assembly process depends on the enzymes encoded by two genes, the recombination activating genes (RAG)-1 and RAG-2.7Yancopoulos et al. 1986; Schatz et al. 1989

The various T-cell receptor chains differ in how their V domains are assembled. β and δ chains are assembled from all three types of gene segments, whereas α and γ chains use only V and J. The different combinations of two or three segments, selected from a large repertoire of germ-line gene segments, along with variation at the junctions due to the addition and deletion of nucleotides during recombination, contribute to a vast diversity of T-cell receptors. It is this diversity that creates the individual antigen specificity of T-cell clones.[5]This is an example of two sentences taken verbatim from the original paper. Not all of it was poorly written!

These genes are highly conserved among species in both their genomic sequence and their organization.8Rast et al. 1997; Parra et al. 2008, 2012; Chen et al. 2009 In all tetrapods examined, the β and γ chains are each encoded at multiple separate loci, whereas the genes encoding the α and δ chains are nested at a single locus, called the TCRα/δ locus.9Chien et al. 1987; Satyanarayana et al. 1988; reviewed in Davis and Chein 2008 The V domains of α and δ chains can use a common pool of V gene segments, but distinct D, J, and C genes.

The recombination of V, J and optionally D genes, referred to as V(D)J recombination, and mediated by RAG, is also known to generate the diversity of antibodies produced by another type of lymphocyte, the B cells.[6]It took me forever to rewrite this part. The original sentence was, “Diversity in antibodies produced by B cells is also generated by RAG-mediated V(D)J recombination and the TCR and Ig genes … Continue reading10Flajnik and Kasahara 2010; Litman et al. 2010

Non-Conventional Receptors Across Vertebrates

T-cell receptor and immunoglobulin genes clearly share a common origin in the jawed vertebrates.11Flajnik and Kasahara 2010; Litman et al. 2010 Usually, the V, D, J, and C coding regions are readily distinguishable from immunoglobulin, at least for conventional T-cell receptors, owing to divergence over the past 400 million years.

Recently, however, the discovery of non-conventional isoforms of the δ chain has blurred the boundary between them. These non-conventional forms use V genes that appear indistinguishable from the immunoglobulin heavy chain V.12Parra et al. 2010, 2012 Such V genes have been named VHδ.[7]The original combined this sentence and the next, even though they’re about quite distinct ideas: the name of the genes, and the species where they’re found.

VHδ genes have been found in both amphibians and birds (see the rightmost part of fig. 1).[8]Why not indicate the part of the figure that is relevant? Whenever you can, provide reader guidance! In the frog Xenopus tropicalis, as well as in a passerine bird, the zebra finch Taeniopygia guttata, the VHδ genes coexist with the conventional Vα and Vδ genes at the TCRα/δ locus.13Parra et al. 2010, 2012

In galliform birds, such as the chicken Gallus gallus, they are instead located at a second TCRδ locus that is unlinked to the conventional TCRα/δ.14Parra et al. 2012 VHδ are the only type of V gene segment present at the second locus and, although closely related to antibody VH genes, the VHδ appear to be used exclusively in δ chains. This is true as well for frogs where the TCRα/δ and IgH (immunoglobulin heavy chain) loci are tightly linked.15Parra et al. 2010

In mammals, a TCRα/δ locus has been characterized in several eutherian species and at least one marsupial, the opossum Monodelphis domestica. VHδ genes have not been found in mammals to date.16Satyanarayana et al. 1988; Wang et al. 1994; Parra et al. 2008 However, marsupials do have an additional locus, unlinked to TCRα/δ, that uses antibody-related V genes. This fifth chain is called µ, and the receptor that uses it is referred to as TCRµ. The µ chain is related to the δ chain, but it diverges from it in both sequence and structure.17Parra et al. 2007, 2008 It has also been found in a monotreme, the platypus.[9]The authors like to use “duckbill platypus,” but there’s only one species of platypus, so I took that word out. The platypus and marsupial TCRµ genes are clearly orthologous, which is consistent with the idea that the µ chain is ancient in mammals, but has been lost in the eutherians.18Parra et al. 2008; Wang et al. 2011

TCRµ chains use their own unique set of V genes, called Vµ.19Parra et al. 2007; Wang et al. 2011 So far, no evidence has been found of V(D)J recombination between Vµ genes and genes from other immunoglobulin or T-cell receptor loci.[10]Another horrible sentence from the original, recorded for posterity: “Trans-locus V(D)J recombination of V genes from other Ig and TCR loci with TCRµ genes has not been found.” That … Continue reading  Neither have TCRµ homologues been found in non-mammals.20Parra et al. 2008

The structure of TCRµ chains is atypical. They contain three, rather than two, extra-cellular domains from the immunoglobulin superfamily;[11]The abbreviation IgSF was used in the paper, with no explanation. I assume the people who would read this paper tend to know what that means, but still. this is due to an extra N-terminal V domain (see fig. 1).21Parra et al. 2007; Wang et al. 2011 Both V domains are encoded by a unique set of Vµ genes and are more related to immunoglobulin heavy chain V than to conventional T-cell receptor V domains. The N-terminal one is diverse and encoded by genes that undergo somatic V(D)J recombination, while the second V domain (referred to as “supporting”) has little or no diversity.

The supporting V domain differs between marsupials and monotremes. In marsupials, it is encoded by a germ-line joined, or pre-assembled, V exon that is invariant.22Parra et al. 2007 In the platypus, it is encoded by gene segments requiring somatic DNA recombination, but with limited diversity due in part to the lack of D segments.23Wang et al. 2011

Sharks and other cartilaginous fish also have a T-cell receptor chain that is structurally similar to TCRµ (see middle part of fig. 1).24Criscitiello et al. 2006; Flajnik et al. 2011 The resulting receptor is called NAR-TCR. Like the receptor of marsupials and monotremes, it contains three extracellular domains, but its N-terminal V domain is related to chains used by IgNAR (immunoglobulin new antigen receptor) antibodies, a type of antibody found only in sharks.25Greenberg et al. 1995 In both the TCRµ of marsupials and monotremes and the NAR-TCR of cartilaginous fishes, the current working model is that the N-terminal V domain is unpaired and acts as a single, antigen binding domain. This would be analogous to the V domains of light-chainless antibodies found in sharks and camelids.26Flajnik et al. 2011; Wang et al. 2011

How did the µ chain arise? Phylogenetic analyses support an origin after the avian–mammalian split.27Parra et al. 2007; Wang et al. 2011 Previously, we hypothesized that it originated as a recombination between ancestral immunoglobulin heavy and TCRδ-like loci,28Parra et al. 2008 but this hypothesis is problematic for several reasons. One challenge is the apparent genomic stability and ancient conserved synteny (order of genes on the chromosome) in the region surrounding the TCRα/δ locus; this region has appeared to remain stable over at least the past 350 million years of tetrapod evolution.29Parra et al. 2008, 2010

As a result, we need a new model for the evolution of TCRµ and the TCRα/δ locus. Here we present the best current model, supported by an analysis of the platypus genome—the first to examine a monotreme TCRα/δ locus in detail—as described in the methods and results sections below.

The Model

Our model can be summarized in six stages (fig. 2).[12]Major change from me here. This section was moved here from the discussion, because it is the core and most interesting part of the paper. It is now its own first-level section alongside … Continue reading

Fig. 2. A model of the stages of evolution of the TCRα/δ loci in tetrapods and the origins of TCRµ in mammals. Refer to the text for detailed explanation of stages A-F.
  1. Duplication of the cluster. This occurred early in the evolution of tetrapods, or earlier. The duplication resulted in two copies of the C gene of the δ chain, each with its own set of D and J segments.

  2. Insertion of VH. Recall that VH refers to the variable chain of immunoglobulin heavy (IgH). One or more genes were translocated from the IgH locus and inserted into the TCRα/δ locus, most likely to a location between the existing Vα/δ genes and the 5′-proximal cluster. This is the configuration found today in the zebra finch genome.30Parra et al. 2012

  3. Inversion of the VHδ cluster in amphibians. This cluster of genes was translocated and inverted, and the number of VHδ genes increased. The frog X. tropicalis currently has the greatest number of VHδ genes, where they make up the majority of V genes available in the germ-line for T-cell receptor δ chains.31Parra et al. 2010

  4. Translocation of the VHδ cluster to another site in galliforms. In chickens and turkeys, the same cluster that was inverted in amphibians instead moved out of the TCRα/δ locus and is now found on another chromosome. There are no Vα or Vδ genes at this second TCRδ locus in chickens, and only a single Cδ gene remains at the conventional TCRα/δ locus.32Parra et al. 2012

  5. Translocation of the VHδ cluster to another site (TCRµ) in mammals. A process similar to step 4 in galliforms (D in fig. 2) happened in a common ancestor of mammals, giving rise to TCRµ. Internal duplications of the VH, D, and J genes gave rise to the current [(VDJ) − (VDJ) − C] organization that can encode chains with double V domains.33Parra et al. 2007, Wang et al. 2011

  6. Further changes in the three mammalian lineages. 

    • In the platypus, the second VDJ cluster, which encodes the supporting (non-terminal) V chain, lost its D segments and generates V domains with short complementarity-determining region-3 (CDR3) encoded by direct V to J recombination.34Wang et al. 2011

    • Meanwhile, in therians (marsupials and placentals), the VHδ gene disappeared from the TCRα/δ locus (not shown in fig. 2).35Parra et al. 2008

    • Then, in placentals, the TCRµ locus was also lost.36Parra et al. 2008

    • The marsupials kept TCRµ, but the second set of V and J segments (which encode the supporting V domain) was replaced with a germ-line joined V gene (fused yellow-green segment in fig. 2), probably due to germ-line V(D)J recombination and retro-transposition.37Parra et al. 2007, 2008

    • In both monotremes and marsupials, the whole cluster from VH to C appears to have undergone additional tandem duplication as it exists in multiple copies in the opossum and probably in the platypus.38Parra et al. 2007, 2008; Wang et al. 2011

The rest of the paper explains the analyses that gave us the evidence to build this model. Additional discussion of the model is provided in the last section.

Materials and Methods

There are three parts to the analyses and experiments that allowed us to gather evidence and build our evolutionary model. First, find the TCRα/δ locus in platypus genome data. Second, perform phylogenetic analyses with the relevant genes. Third, confirm from a live specimen that the platypus expresses VHδ.[13]This new paragraph is important! It gives context to the experiments below and it guides the reader for the entire section. Also notice this is a case of an enumeration without point form. I like … Continue reading

1) Identification and Annotation of the Platypus TCRα/δ Locus

We analyzed the genome of the platypus, Ornithorhynchus anatinus, using the assembly version 5.0.1 (http://www.ncbi.nlm.nih.gov/genome/guide/platypus/). We used two genome alignment tools: whole-genome BLAST from NCBI (www.ncbi.nlm.nih.gov/) and BLAST/BLAT from Ensembl (www.ensembl.org).

We located the V and J gene segments by looking for similarity with the corresponding segments of other species, and by identifying flanking conserved recombination signal sequences (RSS). We annotated V segments in the 5′ to 3′ direction as either Vα or Vδ, followed by the family number and the gene segment number if there was more than one in the family. For example, Vα15.7 is the seventh Vα gene in family 15.
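To make the labeling scheme concrete, here is a minimal Python sketch that splits such a label into its parts. The function and type names are hypothetical, and the label "Vδ2" used in the usage note is an invented illustration of a single-member family rather than a segment taken from the annotation.

```python
import re
from typing import NamedTuple, Optional


class VSegment(NamedTuple):
    chain: str             # "α" or "δ"
    family: int            # V family number
    member: Optional[int]  # position within the family; None if the family has one segment


# Matches labels such as "Vα15.7" (family 15, seventh segment) or "Vδ2" (no member number).
_LABEL = re.compile(r"^V([αδ])(\d+)(?:\.(\d+))?$")


def parse_v_label(label: str) -> VSegment:
    """Split a V segment label into chain, family number, and member number."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a Vα/Vδ label: {label!r}")
    chain, family, member = match.groups()
    return VSegment(chain, int(family), int(member) if member else None)
```

Calling `parse_v_label("Vα15.7")` yields chain α, family 15, member 7, while a label without a dot, such as the hypothetical "Vδ2", yields a member of `None`.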

As for the D segments, we identified them from cDNA clones that use VHδ, based on the complementarity-determining region-3 (CDR3) sequences that represent the V-D-J junctions.

We labeled the platypus T-cell receptor gene segments according to the IMGT nomenclature (http://www.imgt.org/). We provide the location for the TCRα/δ genes of the platypus genome version 5.0.1 in supplementary table S1, available online.

2) Phylogenetic Analyses

We used BioEdit39Hall 1999 as well as the accessory application ClustalX40Thompson et al. 1997 to align the nucleotide sequences of the V gene regions, from the framework region FR1 to FR3, including the complementarity-determining regions CDR1 and CDR2. We established the codon position of the alignments using amino acid sequences.41Hall 1999 When necessary, we corrected the alignments through visual inspection. We then analyzed them with MEGA Software.42Kumar et al. 2004

We generated phylogenetic trees using two methods: Neighbor Joining (NJ) with uncorrected nucleotide differences (p-distance), and Minimum Evolution distances.
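For readers unfamiliar with the p-distance, it is simply the proportion of aligned nucleotide sites at which two sequences differ. The sketch below illustrates the idea in Python; it is not the MEGA implementation, and skipping gapped positions pair by pair is an assumption made here for illustration.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned nucleotide sequences.

    Positions where either sequence has a gap ("-") are skipped.
    This gap handling is an assumption of the sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared
```

With the toy sequences `"ACGTACGT"` and `"ACGTTCGA"`, two of eight sites differ, so the p-distance is 0.25, which corresponds to 75% nucleotide identity.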

We evaluated support for the generated trees using bootstrap values from 1000 replicates. Supplementary table S2 contains the GenBank accession numbers for the sequences used in tree construction.[14]In the original paper, this section comes after the Confirmation of Expression section below, but in the results section, the phylogenetic results are discussed first. I don’t know if there was … Continue reading

3) Confirmation of Expression of Platypus VHδ

As described with more detail in the Results and Discussion section below, the annotation step allowed us to find an atypical VHδ gene in the platypus genome. To confirm that it was not an artifact of the genome assembly process, we looked at the expression of this gene in a live specimen, a male platypus from the Upper Barnard River in New South Wales, Australia. The platypus was collected under the same permits as in Warren et al. 2008.

We performed reverse transcription PCR (RT-PCR) on the RNA from the spleen of this New South Wales specimen. As a second point of comparison, we also used a previously described platypus spleen cDNA library that was constructed from RNA extracted from a Tasmanian animal.43Vernersson et al. 2002 The protocols and products used at every step are as follows:

  • cDNA synthesis: Invitrogen Superscript III-first strand synthesis kit, using the manufacturer’s recommended protocol44Invitrogen, Carlsbad, CA, USA
  • PCR amplification: we used the QIAGEN HotStar HiFidelity Polymerase Kit45BD Biosciences, CLONTECH Laboratories, Palo Alto, CA, USA in a total volume of 20 µl containing:
    • 1× Hotstar Hifi PCR Buffer (containing 0.3 mM dNTPs)
    • 1 µM of primers: we identified these from the platypus genome assembly step.46Warren et al. 2008 We targeted T-cell receptor δ transcripts with two primers, one for VHδ and one for Cδ:
        • 5′-GTACCGCCAACCACCAGGGAAAG-3′ for VHδ
        • 5′-CAGTTCACTGCTCCATCGCTTTCA-3′ for Cδ
    • 1.25 U Hotstar Hifidelity DNA polymerase
  • PCR product cloning: TopoTA cloning® kit 47Invitrogen
  • Sequencing: BigDye terminator cycle sequencing kit version 348Applied Biosystems, Foster City, CA, USA according to the manufacturer’s recommendations.
  • Analysis of sequencing reactions: ABI Prism 3100 DNA automated sequencer.49PerkinElmer Life and Analytical Sciences, Wellesley, MA, USA
  • Chromatogram analysis: Sequencher 4.9 software50Gene Codes Corporation, Ann Arbor, MI, USA

We archived the sequence on GenBank under the accession numbers JQ664690–JQ664710.

Results and Discussion

Results of the TCRα/δ Locus Identification in the Platypus

Here are the results of our analysis of the platypus genome from part 1 of the Materials and Methods section, which allowed us to identify the TCRα/δ locus and annotate its V, D, J and C gene segments, as well as the exons. Refer to fig. 3 below for the annotation map.

Fig. 3. Annotated map of the platypus TCRα/δ locus, showing the locations of the Vα and Vδ (red), VHδ (yellow), Dδ (orange), Jα and Jδ (green), Cδ (dark blue), and Cα (light blue). Conserved syntenic genes are in gray. The scaffold and contig numbers are indicated.

Most of the locus is present on a single scaffold. The remainder is on a shorter contig. On either side of the locus, we find the genes SALL2, DAD1, and several olfactory receptor genes (OR). All of these genes share conserved synteny with the TCRα/δ locus in amphibians, birds, and mammals.51Parra et al. 2008, 2010, 2012

The platypus locus has many typical features common to TCRα/δ loci in other tetrapods.52Satyanarayana et al. 1988; Wang et al. 1994; Parra et al. 2008, 2010, 2012 Two C region genes are present: a Cα (light blue in fig. 3) at the 3′ end of the locus, and a Cδ (dark blue) oriented 5′ of the Jα genes. These Jα genes occur in a large number (32) of fragments (in green) located between Cδ and Cα. A large array of Jα genes like this is believed to facilitate secondary Vα to Jα rearrangements in developing αβT cells if the primary rearrangements are nonproductive or need replacement.53Hawwari and Krangel 2007 Primary TCRα V–J rearrangements generally use Jα segments towards the 5′-end of the array and can progressively use downstream Jα in subsequent rearrangements. There is also a single Vδ gene (the last red segment in fig. 3) in reverse transcriptional orientation between the platypus Cδ gene and the Jα array that is conserved in mammalian TCRα/δ both in location and orientation.54Parra et al. 2008

There are 99 conventional T-cell receptor V gene segments in the platypus TCRα/δ locus (red in fig. 3). The vast majority, 89, share nucleotide identity with Vα in other species; the other 10 share identity with Vδ genes. The Vδ genes are clustered towards the 3′-end of the locus. Using the criterion that members of a V family share >80% nucleotide identity, the platypus V genes can be classified into 17 different Vα families and two different Vδ families (the family and segment numbers are annotated in fig. 3). This is a typical level of complexity for mammalian Vα and Vδ genes.55Giudicelli et al. 2005; Parra et al. 2008
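As an illustration of the >80% identity criterion, the Python sketch below groups aligned sequences into families by linking any two sequences whose identity exceeds the threshold. The single-linkage grouping, the function names, and the toy sequences are all assumptions made for this example; the text does not specify the exact procedure that was used.

```python
from itertools import combinations
from typing import Dict, List, Set


def identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical sites between two aligned sequences, ignoring gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def group_into_families(seqs: Dict[str, str], threshold: float = 0.80) -> List[Set[str]]:
    """Single-linkage grouping: sequences sharing > threshold identity join one family."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the family representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) > threshold:
            parent[find(a)] = find(b)

    families: Dict[str, Set[str]] = {}
    for name in seqs:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())
```

With three toy sequences where `v1` and `v2` are 90% identical and `v3` is unrelated, the function returns two families: one containing `v1` and `v2`, and one containing only `v3`.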

Also present were two Dδ (orange) and seven Jδ (green) gene segments oriented upstream of the Cδ. All gene segments were flanked by canonical recombination signal sequences (RSS), which are the recognition substrate of the RAG recombinase. The D segments were asymmetrically flanked by an RSS containing a 12 bp spacer on the 5′-side and a 23 bp spacer on the 3′-side, as has been shown previously for T-cell receptor D gene segments in other species.56Carroll et al. 1993; Parra et al. 2007, 2010 In summary, the overall content and organization of the platypus TCRα/δ locus appeared fairly generic, with one exception.

This atypical feature of the platypus locus is an additional V gene that shares greater identity to antibody VH genes than to T-cell receptor V genes. Among V genes, this segment is the closest to the D and J genes (see the yellow segment in fig. 3). We tentatively designated it as VHδ. 

VHδ Phylogenetics

VHδ genes are, by definition, V genes that are indistinguishable from immunoglobulin heavy V (Ig VH) genes, but used in encoding T-cell receptor δ chains. Recall from the introduction[15]Yes, you are allowed to make links between the sections of your paper like this! that they have previously been found only in the genomes of birds and frogs.57Parra et al. 2008, 2010, 2012

To put the platypus VHδ gene in context, let us examine the phylogeny of VH genes. In mammals and other tetrapods, VH genes have been shown to cluster into three ancient clans (shown in fig. 4). Individual species differ in the presence of one or more of these clans in their germ-line immunoglobulin heavy locus.58Tutter and Riblet 1989; Ota and Nei 1994 For example, humans, mice, echidnas, and frogs have VH genes from all three clans,59Schwager et al. 1989; Ota and Nei 1994; Belov and Hellman 2003 whereas rabbits, opossums, and chickens have only a single clan.60McCormack et al. 1991; Butler 1997; Johansson et al. 2002; Baker et al. 2005

Fig. 4. Phylogenetic tree of mammalian VH genes, including the platypus VHδ and monotreme Vµ. The three major VH clans are bracketed. A box indicates the platypus VHδ, and bolding indicates the clade containing platypus VHδ along with platypus and echidna Vµ within clan III. The three-digit numbers following the VH gene labels are the last three digits of the GenBank accession number referenced in supplementary table S2. The numbers following the platypus and echidna Vµ labels are clone numbers. The tree shown here was generated using the Minimum Evolution method; the Neighbor Joining method yielded a similar topology.

Our phylogenetic analyses showed that the platypus VHδ was most related to the platypus Vµ genes found at the TCRµ locus (see the boxed and bolded parts of fig. 4). Platypus VHδ, however, shares only 51–61% nucleotide identity (average 56.6%) with the platypus Vµ genes. Both the platypus Vµ and VHδ clustered within clan III.61Wang et al. 2011 This is noteworthy since VH genes in the platypus IgH locus are also from clan III and, in general, clan III is the most ubiquitous and conserved lineage of VH.62Johansson et al. 2002; Tutter and Riblet 1989 Although clearly related to platypus VH, the VHδ gene shares only 34–65% nucleotide identity (average 56.9%) with the “authentic”[16]I’m not sure about this but the original phrase was bona fide, which I had to look up. Maybe “authentic” between quotes isn’t the best translation, but a translation is better … Continue reading VH used in antibody heavy chains in this species.

Results of the Confirmation of VHδ Expression

It was necessary to rule out the possibility that the VHδ gene present in the platypus TCRα/δ locus was an artifact of the genome assembly process. This is why we performed a “wet lab” verification step on cDNA synthesized from the splenic RNA of two platypuses, one from New South Wales and one from Tasmania (see Materials and Methods). We performed RT-PCR with primers that were specific for VHδ and Cδ. We succeeded in amplifying PCR products from the New South Wales specimen, but not from the Tasmanian one.

One piece of supporting evidence for the expression of VHδ would be the demonstration that it is recombined to downstream Dδ and Jδ segments, and expressed with Cδ in complete T-cell receptor δ transcripts. This is what we found in the twenty sequenced clones we obtained from PCR in the New South Wales platypus. Each clone contained a unique nucleotide sequence that comprised the VHδ gene recombined to the Dδ and Jδ gene segments (see fig. 4A). Of these 20, 11 had unique V, D, and J combinations that would therefore encode 11 different complementarity-determining region 3 (CDR3) sequences (see fig. 4B). More than half of these (8 out of 11) contained evidence of using both D genes, giving a VDDJ pattern. This is a common feature of δ V domains where multiple D genes can be incorporated into the recombination due to the presence of asymmetrical RSS.63Carroll et al. 1993

Fig. 4. (A) Alignment of predicted protein sequence of transcripts containing a recombined VHδ gene isolated from platypus spleen RNA. The individual clones are identified by the last three digits of their GenBank accession numbers (JQ664690–JQ664710). Shown is the region from FR3 of the VHδ through the beginning of the Cδ domain. The sequence in bold at the top of the alignment is the germ-line VHδ and Cδ gene sequence. The double cysteines at the end of FR3 and unpaired cysteines in CDR3 are shaded, as is the canonical FGXG in FR4. (B) Nucleotide sequence of the CDR3 region of the eleven unique V(D)J recombinants using VHδ described in the text. The germ-line sequences of the 3′-end of VHδ and of the two Dδ are shown at the top. The germ-line Jδ sequences are shown on the right-hand side of the alignment interspersed amongst the cDNA sequences using each. Nucleotides in the junctions between the V, D, and J segments, shown italicized, are most likely N-nucleotides added by TdT.

The region corresponding to the junctions between the V, D, and J segments contained an additional sequence that could not be accounted for by the germ-line gene segments (fig. 4B). There are two possible sources of such a sequence. One is palindromic nucleotides that are created during V(D)J recombination when the RAG generates hairpin structures that are resolved asymmetrically during the re-ligation process.64Lewis 1994 The second is non-templated nucleotides that can be added by the enzyme terminal deoxynucleotidyl transferase (TdT) during the V(D)J recombination process.

An unusual feature of the platypus VHδ is the presence of a second cysteine encoded near the 3′-end of the gene, directly next to the cysteine predicted to form the intra-domain disulfide bond in immunoglobulin domains (fig. 4A). Additional cysteines in the complementarity-determining region 3 of VH domains have been thought to provide stability to unusually long CDR3 loops, as has been described for cattle and the platypus previously.65Johansson et al. 2002 However, the CDR3s of T-cell receptor δ chains using VHδ are only slightly longer than those of conventional δ chains (ranging 10–20 residues).66Rock et al. 1994; Wang et al. 2011 Furthermore, the stabilization of CDR3 generally involves multiple pairs of cysteines, which were not present in the platypus VHδ clones (fig. 4A).

The Tasmanian specimen

The above concerns the animal collected from New South Wales. With the Tasmanian specimen, we were unable to amplify T-cell receptor δ transcripts containing VHδ from its splenic cDNA. We did, however, successfully isolate transcripts containing conventional Vα/δ segments, which provides a positive control.

It is possible that Tasmanian platypuses, which have been separated from the mainland population for at least 14,000 years, either have a divergent VHδ or have deleted this single V gene altogether.67Lambeck and Chappell 2001

Sequence variation in VHδ

Although there is only a single VHδ in the current platypus genome assembly, there was sequence variation in the region corresponding to FR1 through FR3 of the V domains (see fig. 4A; the sequence data are not shown here, but are available in GenBank). We have three potential explanations for this variation:

  1. Two alleles of a single VHδ gene
  2. Somatic mutation of expressed VHδ genes
  3. Allelic variation in gene copy number

The two-allele explanation makes sense given that the RNA used in this experiment is from a wild-caught individual from the same population that was used to generate the whole-genome sequence, which was found to contain substantial heterozygosity.68Warren et al. 2008 However, the variation was too large to be fully explained by this. 

The second possibility, somatic mutation (i.e. mutation not occurring in germ cells), is considered controversial for T-cell receptor chains. Nonetheless, it has been invoked in sharks and postulated in salmonids to explain the variation that exceeds the apparent gene copy number in these vertebrates.69Yazawa et al. 2008; Chen et al. 2009 Therefore, it seems possible[17] that somatic mutation is occurring in platypus VHδ. One piece of evidence in favor of this is that the mutations appear to be localized to the V region with no variation in the C region (fig. 4A). This may be due to the relatedness between VHδ and immunoglobulin VH genes where somatic hyper-mutation is well documented. Somatic mutation in immunoglobulin VH contributes to overall affinity maturation in secondary antibody responses.70Wysocki et al. 1986 However, the evidence is mixed: the mutations seen in the platypus are found in the complementarity-determining region 3, which would indicate selection for affinity maturation, but also in the framework regions, which would not. As further evidence against the somatic mutation explanation, there is no evidence of somatic mutation in the V regions of birds, which also have only a single VHδ.71Parra et al. 2012 The contribution of mutation to the platypus TCRδ repertoire, if it is occurring, remains to be determined.

Alternatively, the sequence polymorphism may be due to VHδ gene copy number variation between individual TCRα/δ alleles.

Irrespective of the number of VHδ genes in the platypus TCRα/δ locus, the results clearly support the existence of T-cell receptor δ transcripts containing VHδ recombined to Dδ and Jδ gene segments in the TCRα/δ locus (fig. 4). The presence of a VHδ gene or genes in the platypus TCRα/δ locus in the genome assembly, therefore, does not appear to be an assembly artifact. Rather, VHδ is present and functional, and contributes to the expressed T-cell receptor δ chain repertoire. The possibility that some platypus TCRα/δ loci contain more than a single VHδ does not alter the principal conclusions of this study.

Discussion of our Model of the Evolution of TCRα/δ and TCRµ

The results above make up the evidence that allowed us to construct the model shown after the introduction section (see fig. 2). Here we discuss various considerations about the model.

Previous hypothesis of the origin of TCRµ in mammals

Our previous hypothesis72Parra et al. 2008 about the origin of T-cell receptor µ (TCRµ) in mammals involved the recombination between an ancestral TCRα/δ locus and an immunoglobulin heavy (IgH) locus. The IgH locus would have contributed the V gene segments at the 5′-end, while the T-cell receptor δ would have contributed the D, J, and C genes at the 3′-end of the locus.

The difficulty with this hypothesis was the clear stability of the genome region surrounding the TCRα/δ locus. In other words, the chromosomal region containing the TCRα/δ locus appears to have remained relatively undisrupted for at least the past 360 million years.73Parra et al. 2008, 2010, 2012

VHδ in different vertebrate lineages

An alternative model for the origins of TCRµ emerged from the discovery, in amphibians and birds, of VHδ genes inserted into the TCRα/δ locus. This model involves the insertion of VH (fig. 2B) followed by the duplication and translocation of T-cell receptor genes (fig. 2C-E).

The insertion in the TCRα/δ locus seems to occur without disrupting the local syntenic region, as we know from zebra finches and frogs. In frogs, the IgH and TCRα/δ loci are tightly linked, which may have facilitated the translocation of VH genes into the TCRα/δ locus.74Parra et al. 2010

But close linkage is not a requirement. The genomes of birds and platypuses do not show such linkage, and the translocation of VH genes to the TCRα/δ locus appears to have occurred independently of frogs in these two lineages. We know this from the lack of similarity and relatedness between the VHδ genes of frogs, birds, and monotremes.75Parra et al. 2012 As can be seen in the phylogenetic tree of fig. 4, each appears to derive from a different ancient VH clan:

  • Clan I for birds
  • Clan II for frogs
  • Clan III for platypuses

Therefore, we suggest that the transfer of VHδ occurred independently in the different lineages. Another possibility is that transfers of VHδ may have occurred frequently and repeatedly in the past. Gene replacement may be the best explanation for the current content of these genes in the different tetrapod lineages.

The new evidence of platypus VHδ from this study allows us to update the model.

Updating the model for mammalian TCRµ

Let us contrast the evidence from marsupials with the evidence we have gathered from the platypus. In marsupials, there is no VHδ; the Vµ genes are highly divergent; and at least in the opossum, there is no conserved synteny with genes linked to TCRµ. These facts provide little insight into the origins of T-cell receptor µ and its relationship to other T-cell receptor chains like δ or the conventional ones.76Parra et al. 2008

In the platypus genome, however, we notice a striking similarity between VH, VHδ, and Vµ. These genes are all in clan III. In particular, the close relationship between the platypus VHδ and Vµ genes lends greater support for the model presented in fig. 2E, with TCRµ having been derived from TCRδ genes.

The similarity that we found here between the platypus VHδ and V genes in the TCRµ locus is, so far, the clearest evolutionary association between the µ and δ loci in one species.

Evolution of chains with three extracellular domains

TCRµ is an example of a T-cell receptor form with three extracellular domains (refer back to fig. 1). These forms have evolved at least twice in vertebrates. The first was in the ancestors of the cartilaginous fish in the form of NAR-TCR.77Criscitiello et al. 2006 The second was in the mammals as TCRµ.78Parra et al. 2007

As we discussed in the introduction, NAR-TCR uses an N-terminal V domain that is related to the V domains found in IgNAR antibodies, which are unique to cartilaginous fish,79Greenberg et al. 1995; Criscitiello et al. 2006 and not closely related to antibody VH domains. Therefore, it appears that NAR-TCR and TCRµ are more likely the result of convergent evolution rather than being related by direct descent.80Parra et al. 2007; Wang et al. 2011

Evolution of chains with antibody-like V domains

T-cell receptor chains that use antibody-like V domains, such as TCRδ using VHδ, NAR-TCR, or TCRµ (i.e. the receptors with yellow ovals in fig. 1) are widely distributed in vertebrates. Only the bony fish and placental mammals lack them.

In addition to NAR-TCR, some shark species appear to generate T-cell receptor chains using antibody V genes. This occurs via trans-locus V(D)J recombination between immunoglobulin IgM and IgW heavy chain V genes and TCRδ and TCRα D and J genes.81Criscitiello et al. 2010 This may be possible, in part, due to the multiple clusters of immunoglobulin genes found in the cartilaginous fish. It also illustrates that there have been independent solutions to generating T-cell receptor chains with antibody V domains in different vertebrate lineages.

In the tetrapods, the VH genes were trans-located into the T-cell receptor loci where they became part of the germ-line repertoire. By comparison, in cartilaginous fish, something equivalent may occur somatically during V(D)J recombination in developing T cells. Either mechanism suggests there has been selection for having T-cell receptors using antibody V genes over much of vertebrate evolutionary history.

What function do the antibody V chains serve? The current working hypothesis is that they are able to bind native antigen directly. This is consistent with a selective pressure for T-cell receptor chains that may bind or recognize antigen in ways similar to antibodies in many different lineages of vertebrates.

In the case of NAR-TCR and TCRµ, the N-terminal V domain (the “third” one) is likely to be unpaired and bind antigen as a single domain (see fig. 1), as has been described for IgNAR and some IgG antibodies in camels (recently reviewed in Flajnik et al. 2011). This model of antigen binding is consistent with the evidence that the N-terminal V domains in TCRµ are somatically diverse, while the second, supporting V domains have limited diversity and presumably perform a structural role rather than one of antigen recognition.82Parra et al. 2007; Wang et al. 2011

There is no evidence of double V domains in TCRδ chains using VHδ in frogs, birds, or platypus (rightmost part of fig. 1).83Parra et al. 2010, 2012 Rather, the complex containing VHδ would likely be structured similarly to a conventional γδ receptor, with a single V domain on each chain. It is possible that such receptors also bind antigen directly, but this remains to be determined.

A compelling model for the evolution of the immunoglobulin and T-cell receptor loci has been one of internal duplication, divergence and deletion. This is the so-called birth-and-death model of evolution of immune genes and was promoted by Nei and colleagues.84Ota and Nei 1994; Nei et al. 1997 Our results do not contradict the idea that the birth-and-death mode of gene evolution has played a significant role in shaping these complex loci. However, our results do support a role for horizontal transfer of gene segments between the loci, a role that had not been previously appreciated. With this mechanism, T cells may have been able to acquire the ability to recognize native, rather than processed, antigen, much like B cells.

Notes
1 This is an example comment.
2 Major changes to the abstract: I split it in four paragraphs with section titles (this is common in some journals; it should be common in most journals). I also added a section at the beginning to state the main point of the paper. Scientists have the bad habit of starting with background information before we even know why we’re supposed to care. This fixes that.

The abstract is longer now, but not terribly so (286 vs. 267 words), so I think it’s fine; it could be shortened some more, but that would be a question of picking what to remove, which I’m less confident doing as I’m not the author.

Note that I rewrote the abstract after rewriting the rest.

3 Since this abbreviation comes up a lot, I put it first, with its meaning in parentheses.
4 I didn’t change much in the figure’s caption, but it seemed pretty trivial to add color to the text to facilitate looking up what the colors mean.
5 This is an example of two sentences taken verbatim from the original paper. Not all of it was poorly written!
6 It took me forever to rewrite this part. The original sentence was, “Diversity in antibodies produced by B cells is also generated by RAG-mediated V(D)J recombination and the TCR and Ig genes clearly share a common origin in the jawed-vertebrates.” Soooo many things wrong here.

First, the weirdly formatted term “V(D)J” was not defined anywhere. I assume it means “V, J, and optionally D,” but it’s not as obvious as the authors seem to think.

Second, why are we talking about B cells? They don’t come up anywhere else except in the very last sentence of the paper. We’ve been talking about T cells; if you’re going to switch to a different but similarly named type of cell, then you should tell the reader explicitly.

Third, this is two different ideas linked together with an “and”. I have no clue why it was written as a single sentence, except maybe for the bad reason of having the citations refer to both ideas. They’re so different that it made sense to split them into not only distinct sentences or paragraphs, but actual sections!

7 The original combined this sentence and the next, even though they’re about quite distinct ideas: the name of the genes, and the species where they’re found.
8 Why not indicate the part of the figure that is relevant? Whenever you can, provide reader guidance!
9 The authors like to use “duckbill platypus,” but there’s only one species of platypus, so I took that word out.
10 Another horrible sentence from the original, recorded for posterity: “Trans-locus V(D)J recombination of V genes from other Ig and TCR loci with TCRµ genes has not been found.” That distance between the subject (recombination) and the verb (has). Ugh.
11 The abbreviation IgSF was used in the paper, with no explanation. I assume the people who would read this paper tend to know what that means, but still.
12 Major change from me here. This section was moved here from the discussion, because it is the core and most interesting part of the paper. It is now its own first-level section alongside Introduction, Materials and Methods, etc.

I also simplified the contents. The six stages used to be identified with letters A-F in the figure, and 1-6 in the text. I changed that to use letters everywhere. I removed most of the figure’s caption since it repeats the text.

There was an inconsistency in calling the same thing the Dδ–Jδ–Cδ cluster in the figure and D–J–Cδ cluster in the text. I fixed that. I also color-coded the elements in the text according to the figure.

One thing I didn’t like about the original figure is that the six stages aren’t sequential. The figure presented steps A to F as if they followed one another, but steps C, D and E-F describe the evolution in different animal lineages. So I reorganized the contents and added some arrows for clarification. It also seems that the steps 5-6 in the text and E-F in the figure didn’t quite match, with some parts illustrated in step F being explained in step 5; I edited the text so that they do match. I think the figure could be improved much more, notably by splitting the complex F stage in multiple steps, but I don’t want to change it too much.

13 This new paragraph is important! It gives context to the experiments below and it guides the reader for the entire section. Also notice this is a case of an enumeration without point form. I like point form, but it must not be overused.
14 In the original paper, this section comes after the Confirmation of Expression section below, but in the results section, the phylogenetic results are discussed first. I don’t know if there was a reason for this (maybe they performed the phylogenetic analysis later) but it seems better to keep the same order in both sections, which is why I’m placing this part here.
15 Yes, you are allowed to make links between the sections of your paper like this!
16 I’m not sure about this but the original phrase was bona fide, which I had to look up. Maybe “authentic” between quotes isn’t the best translation, but a translation is better than a Latin phrase that many people will not get.
17 I kind of like the original phrasing “it does not seem to be out of the realm of possibility” but that could be easily simplified, so I did.
Categories
guidelines

Science Style Guide: Links

This post is part of my ongoing scientific style guideline series.

Go to Wikipedia and start reading an article on some topic you don’t know much about. For example, the umami taste.

Chances are that by the end of the first few paragraphs, you will have clicked on several links, either because they referred to terms you didn’t know (glutamate, inosine monophosphate), or because you were curious (what does Wikipedia have to say about the five basic tastes?). Now these links might be open in new tabs for you to check later. Or maybe you’ve already given up reading the original umami article, and are now exploring some new rabbit hole (e.g. the Scoville scale of spiciness).

Wikipedia is a great resource for many reasons, but one of them is this constant hyperlinking to other relevant Wikipedia articles. This has a major advantage: it allows the reader to create their own reading experience. Advanced readers on some topic can keep reading without having to go through the basics they already know. Beginner readers can look up technical terms easily. Wikipedia is a choose-your-own-adventure book, where you can wander according to your own character level.

Links are even an answer to some degree to the four-way tradeoff I wrote about here. It’s difficult to write something that is clear, brief, complex, and information-rich. But Wikipedia articles come closer to the golden middle, and that’s thanks to links. With no need to explain every difficult term directly in the text, articles can be more brief without sacrificing clarity. By packaging complex information in other articles, and showing only the link, Wikipedia articles can contain more complexity and richness of information.

(The reason it doesn’t falsify my four-way tradeoff theory is that Wikipedia as a whole cannot be called succinct. To understand a topic well, you still have to read a lot of articles. But Wikipedia packages this information in relatively brief articles. In other words, information architecture is a pretty good solution to the tradeoff.)

Links make matters easier for readers, but they also help writers. You don’t have to do as much guessing about what your audience knows; your audience will decide for themselves. And you can just reuse existing information written by others.

This suggests that our two principles are satisfied:

  • Minimum reading friction: links give agency to readers. They make it easier to look up complex terms (which readers will tend to google anyway).
  • Low-hanging fruit: adding links to existing public resources like Wikipedia, other encyclopedias, or open-access papers, is an easy thing to do for a writer.

Links in scientific papers

Given all of the above, we’d expect scientific papers — which are almost always at the frontier of the tradeoff, trying to cram a lot of complex information within a word limit without being too difficult to read — to use hyperlinks heavily. Right?

Nope. They rarely do. At most they include citations that link to the reference section, which may include a link to the original paper, which may or may not be openly accessible to you, and which may or may not be a 10-page difficult read in which the explanation you seek is buried in page three of the discussion with no hint to tell you where to look. And they definitely never include links to Wikipedia or anything like it.

Why is that? I’m guessing part of the explanation is the high importance of those citations. It is considered vital to put your work in relation to existing literature, so scientists have an incentive to reference as many relevant papers as they can, and no incentive to link to anything else. The respectability of the sources comes into play; Wikipedia isn’t a reputable source (it can be edited by anyone! It’s not peer-reviewed!!), nor are a lot of the other websites you could link to. So they tend to be avoided.

Then there are the requirements of proper information management. References must be written in a standard format. So if for some reason you do need to link to a website, then you’ll have to use a format like:

“Questions and Answers on Monosodium glutamate (MSG)”. Silver Spring, Maryland: United States Food and Drug Administration. 19 November 2012. Retrieved 19 February 2017.

There are good reasons for this formalism. But it also means that adding any link to a paper requires some work. As a result, I suspect it leads to less hyperlinking in science papers than would be useful to readers.

If you’ve ever read a scientific paper, it’s likely that you have googled complicated terms and looked up Wikipedia articles to help you. Scientists shouldn’t pretend that this isn’t happening. They should not hesitate to add links to resources like Wikipedia, blogs, Twitter threads, and other papers, in order to guide readers and reduce friction.

Drawbacks

There are a few drawbacks to hyperlinked text. None of them invalidate the idea that links should be used more, but we should keep them in mind.

One drawback is that links can be distracting. A barrage of links in a paragraph might be somewhat annoying to read. (Although links also have the benefit of providing novelty to text — even a simple thing like the color of a hyperlink can be useful to make a piece of writing less boring.) And having to open links may drive readers away from the original paper, and require some more effort on their part as opposed to a piece written in a way that beginners don’t need to look up extra information.

Another major drawback is link rot. A webpage may stop existing at any moment, and then your link becomes useless. Also, Wikipedia articles can change and stop fulfilling their original purpose. (Although in practice Wikipedia contributors are mindful of that and use redirections a lot.) One way to circumvent this is to link to archived pages, such as those stored by the Wayback Machine.
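As a rough sketch of what linking to an archived page can look like in practice, the Python snippet below builds a Wayback Machine link from an original URL and a date, following the publicly visible pattern of web.archive.org addresses (prefix, then a 14-digit timestamp, then the original URL). The function name and the example URL are made up for illustration, and the snippet does not check that a snapshot actually exists for that date.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback Machine link for a snapshot of `original_url` near `when`."""
    if when.tzinfo is None:
        # Assumption for this sketch: naive datetimes are treated as UTC.
        when = when.replace(tzinfo=timezone.utc)
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical page, "retrieved" on 19 February 2017.
print(wayback_url("https://example.com/page", datetime(2017, 2, 19, tzinfo=timezone.utc)))
# prints https://web.archive.org/web/20170219000000/https://example.com/page
```

A link built this way can sit next to the formal citation, so that readers still have something to click on if the original page disappears.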

And of course, links don’t work offline. But my contention is that fewer and fewer people are reading papers in print or without internet access. We shouldn’t make it impossible for these modes of reading to happen, but it’s time we make full use of the web’s possibilities to improve science publishing.

Recommendations

  • Do not hesitate to add links to various resources, including encyclopedias, your own content (whether formally published or not), etc.
  • Try to find the correct balance between too few and too many links.
    • Your paper should be readable without clicking any links, so do explain the crucial parts directly in the text.
    • Too many links can be distracting, so choose carefully when to add one.
  • Link to archived webpages when possible.
  • Links shouldn’t replace formal citations, but it’s good practice to pair citations with direct links to make it easier to look up the reference.
Categories
guidelines

Science Style Guide: Bullet Points

This post is part of my ongoing scientific style guideline series.

Writing with bullet points (or bullet numbers, letters etc.) has several advantages:

  • It provides clear guidance to readers.
  • It forces the writer to think about the structure of what they’re trying to say.
  • It comes with built-in line breaks, which tends to create shorter, more readable paragraphs.
  • It breaks the flow of normal prose, which makes reading less monotonous.
  • It is another channel to communicate emphasis (in addition to italics, bold, caps, subheadings, etc.).

Not everything in a piece of writing should follow point form format. Regular prose, organized in paragraphs, is better for most things. But when you are trying to express something that’s highly structured, like a list of steps (in a recipe, or in an experimental protocol), then not using bulleted lists can work against you.

Science papers being a weirdly conservative genre, bulleted lists are somewhat uncommon in them. Papers will quite often use quasi-point form formats, like (1) having numbers or letters in the middle of a paragraph, like this; (2) using “first,” “second,” “third”; or (3) separating ideas with quasi-titles written in italics or bold, but without a line break.

You see this a lot in figures. Many scientific figures are complex and contain multiple parts. Each part is identified with a letter, as in the following example from the platypus paper:

And the caption will have long explanations separated by very easy to miss letters, like this. (A) The first part of the figure, including some colors and shapes. (B) The second part. Note that this part comes after the first part, and before part C. (C) The third part. I’m running out of pointless things to write, but I want this to be a wall of text, so I’m gonna keep writing. (D) Have you lost attention yet? (E) That’s a lot of parts, isn’t it? Funny thing, there’s a limitation of how I do image captions that wouldn’t let me use bullet points even if I wanted to. (F) At last, the final part. So much information needed to make sense of this picture, right? It’s good to be exhaustive, but there’s no point in making it difficult for readers.

A lot of these habits come, I assume, from the fact that journals used to be available only in print. Space was very limited, and there’s often a lot of scientific information to display, so you’re not going to waste any with bullet points.

Today we don’t have these limitations. Using bullet points where appropriate is nice to your readers, so use them. They’re an easy way to reduce reading friction.

They’re also clearly a Low-Hanging Fruit. Similar to breaking paragraphs, it takes very little work to turn a piece of text into a list, if it’s already presenting the information in something close to a list. (If it isn’t, then bullet points probably won’t work well anyway.) It’s also one of those interventions that can be done almost mechanically.
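To illustrate how little work this can be, here is a minimal Python sketch that splits a caption written with inline “(A)”, “(B)” markers into one bullet per panel. The function name and the sample caption are made up for illustration, and the sketch assumes that a capital letter in parentheses always marks a new panel, which will not hold for every caption.

```python
import re

# A panel marker is a single capital letter in parentheses, e.g. "(A)".
PANEL_MARKER = re.compile(r"\(([A-Z])\)\s*")


def caption_to_bullets(caption: str) -> tuple[str, list[str]]:
    """Return the text before the first marker and one bullet per panel."""
    # With one capture group, split() yields:
    # [preamble, "A", text_a, "B", text_b, ...]
    parts = PANEL_MARKER.split(caption)
    preamble, rest = parts[0].strip(), parts[1:]
    bullets = [
        f"  • ({letter}) {text.strip()}"
        for letter, text in zip(rest[0::2], rest[1::2])
    ]
    return preamble, bullets


sample = "Overview of the figure. (A) The first part. (B) The second part. (C) The third part."
preamble, bullets = caption_to_bullets(sample)
print(preamble)
print("\n".join(bullets))
```

The result still needs a human look (a stray “(A)” in running text would be split too), but the bulk of the conversion is automatic.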

Recommendations

  • Use bullet points liberally when it is appropriate, e.g. for:
    • Steps in a process, experiment, protocol, etc.
    • Lists of materials used, substances, etc.
    • Enumerations (e.g. “the five characteristics of X are: …”)
  • Nested bullet points (just like the above) can be useful, but don’t overuse them.
    • At more than two levels, the information structure is probably too complex for the bullet points to improve readability.
  • Pick bullet points instead of numbers/letters when the order does not matter. Pick numbers (for simplicity, prefer Arabic numerals, but Roman numerals can work) or letters when the order does matter (e.g. for steps in a protocol).
  • Bullet points are useful to break the monotony of reading paragraphs, but when there’s too much point form, the reverse becomes true. Use bulleted lists less than normal prose.
Categories
guidelines

Science Style Guide: Giving Examples

This post is part of my ongoing scientific style guideline series.

Imagine you’re writing a science paper. The journal you’re going to submit it to specifies a word limit: 5,000 max. You open the stats in your finished draft — 5,523 words. You’re going to have to cut.

Problem is: everything you wrote is important! You can’t take out anything from the Methods or Results sections: that would make the study weaker and less likely to be accepted for publication. You can’t take out any of the background information in the introduction: you already included the bare minimum for readers to understand the rest.

Although, on closer inspection, perhaps not quite the bare minimum. You reread this sentence:

Most models of trait evolution are based on Brownian motion, which assumes that a trait (say, beak size in some group of bird species) changes randomly, with some species evolving a larger beak, some a smaller one, etc.

What if you removed the parts that talk about beak size? That’s not strictly necessary.

Most models of trait evolution are based on Brownian motion, which assumes that a trait changes randomly.

There we go. More concise, more to the point, and most importantly, you shaved off 23 words from that word count. Of course, the sentence is less illustrative, but whatever: your readers are smart, they’ll be able to figure out an example on their own. Right?

Wrong.

Well, okay, not quite wrong, your readers probably are smart. But this goes against the Minimum Reading Friction principle. The point of most writing, including science papers, is to do the work so that readers don’t need to. If readers need to think of an illustrative example themselves to fully understand your abstract idea, then you’re asking a big effort of them.

Picking good, concrete, relevant examples is a lot of work, whether as a reader or writer. I realize this constantly when I write. It’s very tempting to just state an abstract idea and not bother finding a good example to illustrate it. After all, the abstract idea is more general and therefore more valuable, provided that your readers understand it.

(Here’s an aside that’s not directly related to science but, instead, to computer programming.

When coding, you’ll usually refer a lot to developer documentation about whatever preexisting code you’re using. E.g. you want to convert a date to a different format, so you look up the docs for the function convertFormat(someDate) -> convertedDate. The docs will describe how the function works, what exactly its input (someDate) and output (convertedDate) are, and so on, but very often they will not include an example of using convertFormat() in code. If there is an example, it’s often trivial and not very helpful. When I worked as a programmer, I was commonly frustrated by the lack of examples, both because I wanted to figure out quickly how to use a complicated function, and because I wanted to know about any usage conventions.

I suspect that writing documentation would be a lot more work if it included clear and relevant examples everywhere, which is probably why it’s rarely done.)
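To make the programming aside concrete, here is a rough sketch of what documentation with a usable example could look like. Everything in it is hypothetical: convert_format and its source and target parameters are made-up stand-ins for the convertFormat function mentioned above, written in Python on top of the standard datetime module. It is not any real library’s API.

```python
from datetime import datetime


def convert_format(some_date: str, source: str = "%Y-%m-%d", target: str = "%d/%m/%Y") -> str:
    """Convert a date string from one format to another.

    `source` and `target` are strftime-style format strings
    (hypothetical parameter names, chosen for this illustration).

    Examples:
        >>> convert_format("2021-03-14")
        '14/03/2021'
        >>> convert_format("03/14/2021", source="%m/%d/%Y", target="%Y-%m-%d")
        '2021-03-14'
    """
    # Parse with the source format, then re-render with the target format.
    return datetime.strptime(some_date, source).strftime(target)
```

The two examples in the docstring show both the default behavior and a non-default call, which is the kind of usage convention that a bare description tends to leave out. They can also be run with Python’s doctest module, so they are less likely to go stale.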

I struggled with example-finding in this very essay. It took me a while to think of the opening about cutting examples to respect a word count limit. And I’m not even that happy with this example. For one thing, it’s not very concrete. For another, it’s not even the most common reason for lack of examples: usually, we don’t cut them out, we simply fail to come up with them in the first place.

And so, unfortunately, this piece of guidance is less of a Low-Hanging Fruit than others: adding good examples is a skill that takes some practice. At the very least, it’s not difficult from the point of view of structure, since it doesn’t require you to rethink your argument — you usually just need to add a sentence or two.

Here are a few other minor points:

Where should examples be placed relative to the main idea?

It’s most intuitive to place an example right after the idea it supports, and that’s probably fine most of the time. But there are benefits to placing an example first.

Consider:

Left-handedness seems to be somewhat correlated with extraordinary success, including political success. For example, despite a base rate of about 10% left-handedness in the general population, four of the last seven United States presidents (Barack Obama, Bill Clinton, George H.W. Bush, and Ronald Reagan) were left-handed.

vs.

What do US presidents Barack Obama, Bill Clinton, George H.W. Bush, and Ronald Reagan have in common? They were all left-handed. In other words, four of the last seven presidents were left-handed, compared to a base rate in the general population of about 10%. This suggests that left-handedness is correlated with extraordinary success, at least in politics.

I find the second version more engaging. You see an interesting fact, you’re drawn in, and then the writer tells you the more general point when you’re most receptive.

Journalists do this a lot. They open with a story, and then proceed to make their point.

What types of scientific writing does this apply to?

Anything that deals mostly with abstract ideas. Highly concrete writing, such as the sections describing the methods or results of a study, isn’t concerned. Thus, in a typical experiment paper, this advice is mostly relevant to the discussion section and some of the introductory background.

Authors of literature reviews may need to be more careful. These papers integrate a lot of ideas from reviewed studies; it can be tempting to skip examples in order to include more content in less space. The paragraph I worked on here was from a literature review.

What about word limits, though?

Sometimes you really are constrained by externally imposed word limits, and sometimes the examples really are the least problematic thing to take out. In those cases, well, do what you have to do.

In JAWWS, I don’t want to be strict about word limits. They often force writers to sacrifice clarity to satisfy other components of the four-way tradeoff. They’re also not as relevant in an age where papers are rarely printed on, well, paper. On the other hand, I imagine that many other publications first think that and then have to implement limits to avoid very long submissions. I wonder if the solution could be to make concrete examples not count, provided it’s not too difficult to identify them.

Recommendations

  • Support each abstract idea with at least one example
    • Complicated abstract ideas may benefit from multiple examples
  • Choose concrete, specific examples that can be grasped immediately
  • When possible, put the example before stating the underlying idea

Categories
guidelines

Science Style Guide: Paragraph Length

This post is part of my ongoing scientific style guideline series.

There are famous words from Gary Provost that go like this. Pay attention to the rhythm:

This sentence has five words. Here are five more words. Five-word sentences are fine. But several together become monotonous. Listen to what is happening. The writing is getting boring. The sound of it drones. It’s like a stuck record. The ear demands some variety.

Now listen. I vary the sentence length, and I create music. Music. The writing sings. It has a pleasant rhythm, a lilt, a harmony. I use short sentences. And I use sentences of medium length. And sometimes when I am certain the reader is rested, I will engage him with a sentence of considerable length, a sentence that burns with energy and builds with all the impetus of a crescendo, the roll of the drums, the crash of the cymbals—sounds that say listen to this, it is important.

So write with a combination of short, medium, and long sentences. Create a sound that pleases the reader’s ear. Don’t just write words. Write music.

This is legendary advice for writing sentences. It is delightfully illustrative; we grasp it immediately. And it is correct: diversity in sentence length is a necessity of good writing, just like it is for musical notes.

I claim that the same is true of paragraph length.

Science papers usually feature many long paragraphs. Often, all or almost all paragraphs in a paper are long.

Put negatively, we might call them Walls of Text. This is a good metaphor because Walls of Text, just like regular walls, serve as obstacles. They make information less accessible. How often have you looked at a Wall of Text and simply decided it wasn’t worth the effort?

Walls of Text are bad because:

  1. They make it more difficult for readers to take breaks.
  2. They provide no hints about the structure of the underlying ideas.

We’ll examine both in more detail below. But first I want to tie my ideas about paragraphs with my two major writing style principles.

Minimum Reading Friction: The point of having paragraphs at all, as opposed to perfectly continuous text with no line breaks, is to provide some help for readers. If you don’t do that, you’re essentially telling your readers that they’re on their own. This is the opposite of what we want — the effort should be made once, by the writer, so that the many readers don’t have to.

Low-Hanging Fruit: Cutting up paragraphs is a relatively easy task. If the sentences are structured well already, it’s just a matter of finding the “joints” in the written text where it makes sense to add a line break. If the ideas are structured in a confusing manner, then it’s more work, but there’s also greater room for improvement.

In the interest of not making this post too long, I won’t include a full-fledged example, but this past post in which I rewrote a paragraph (into several ones) is a good illustration.

1. Rewarding the reader with breaks

Humans aren’t computers. We can’t work continuously without resting. Reading science papers is work, so we’re always on the lookout for opportunities to take breaks: sometimes microbreaks on the order of a few seconds, sometimes longer breaks like a full day.

Paragraphs, like chapters, sections, and sentences, serve the purpose of telling readers, “hey, good job, you read a thing, now you can take a break if you want.” It’s rewarding. It indicates that it’s safe(r) to take a break after a paragraph because it’ll be less work to find a reentry point later, and because you expect the next paragraph to be about a different idea.

I don’t know if it’s a coincidence that the word break is used for both concepts, but if so, it’s a fortuitous one.

Walls of Text are often bad because when they loom ahead, you brace yourself. You wonder if you’ll have the energy and time to read it all. If not, maybe you quit reading (and it’s anybody’s guess whether you’ll come back to it later). If yes, then you come out at the other end with less energy and time, and good luck if the next paragraph is also a Wall of Text. And that’s assuming you do reach the end. It’s quite likely that you quit halfway — because you had to stop to think about something you read, or you needed to look up a word, or you clicked on a link, or some random distraction outside the text grabbed your attention.

At the most extreme, you could imagine an entire book that consists of a single paragraph, with no chapters or line breaks at all. (In fact, very old books, from centuries or millennia ago, are often like that, probably because back then paper or parchment was expensive. You wouldn’t want to waste precious space with line breaks.) This is really lazy on the part of the writer: the reader has to do all the work!

Now, that’s not to say long paragraphs are always wrong. Sometimes it really does make sense to package a lot of ideas together in a single Wall of Text. Also, long paragraphs can be easy to read if the sentences are good and logically connected. But this also means that if you do choose to write a Wall of Text, then you should be extra careful with how you structure the writing inside it.

2. Providing structure

Speaking of structure: line breaks are one of the most useful tools to communicate structure to readers.

We expect paragraphs to contain a single idea. You may have learned in school that a paragraph should have a “topic sentence” with additional sentences to provide “supporting detail.” This is somewhat too rigid, but the principle is sound.

The worst kinds of Walls of Text are those that have multiple competing ideas inside them. Find where the boundaries are, and cut them up! The ideas don’t even have to be very different. Suppose you have a transition word like “Similarly” or “Alternatively” in the middle of a paragraph. The next sentence is probably closely related to the previous one, but the transition word does indicate a shift, so it’s a nice spot for adding a line break.

Of course, sometimes you really have a single idea with lots of supporting detail that it makes no sense to break up. This is why Walls of Text are sometimes useful.

In fact, as the Gary Provost quote at the top illustrates for sentences, diversity in paragraph length is a good thing.

Having only very short paragraphs is bad.

Think of low-quality newspaper pieces where there’s a line break after each sentence.

It’s jarring.

This is almost as bad as Walls of Text, from the point of view of structure.

Okay, that was annoying, right? The reason is that sentences already provide structure. So using only single-sentence paragraphs amounts to not using line breaks as an extra channel for reader guidance.

Strive to have a mix of short, medium, and long paragraphs. Heterogeneity is good. It carries more information.

Recommendations

  • If you’re ever debating whether or not to end the paragraph and add a line break, err on the side of “yes”.
    • Verbatim from Slate Star Codex’s Nonfiction Writing Advice, an excellent essay whose section 1 heavily inspired this post.
  • Balance your piece between short, medium, and long paragraphs.
  • Cut up existing Walls of Text by finding the boundaries between different ideas.
  • This advice generalizes to section breaks:
    • Err on the side of more, shorter sections rather than fewer, longer ones.
    • Split sections that are long and contain many distinct ideas.

 

Categories
original research

An Annotated Reading of a Paper about Platypuses

Here I present a paper I chose to rewrite as a demonstration for the JAWWS project. The original text and figures are reproduced below (the paper has a Creative Commons non-commercial license), interspersed with my comments in the following format:

hello I am a blue comment in a quote-block

Feel free to just read the comments. Annotating the paper was a first step in the process. Next I will focus on the rewriting per se. Should be fun!

I didn’t have a particularly strict selection procedure: I went on ResearchHub, in the evolutionary biology section (since that used to be my field), and picked a paper that seemed appropriate. A cursory skimming showed it had plenty of abbreviations and long paragraphs, which suggested there was a lot of room for improvement.

Also, it’s about platypuses. Or platypi. Platypodes. Whatever.

Here are the metadata:

  • Title: “A Model for the Evolution of the Mammalian T-cell Receptor α/δ and μ Loci Based on Evidence from the Duckbill Platypus”
  • Authors: Zuly E. Parra, Mette Lillie, Robert D. Miller
  • Journal: Molecular Biology and Evolution
  • Link to original version
  • Word count: 5,800 words.

A disclaimer: some of the comments below will be harsh. Again, I don’t mean to attack the authors, who did their job as well as they could, and in fact succeeded at it — after all, they managed to publish their work!

With that, let’s pretend we’re semi-aquatic platypuses and dive in.

A Model for the Evolution of the Mammalian T-cell Receptor α/δ and μ Loci Based on Evidence from the Duckbill Platypus

Comments: Okay, this paper is going to be about T cells (I vaguely remember this being about immunity?), platypuses, and evolution. Sounds good.

Abstract

The specific recognition of antigen by T cells is critical to the generation of adaptive immune responses in vertebrates. T cells recognize antigen using a somatically diversified T-cell receptor (TCR). All jawed vertebrates use four TCR chains called α, β, γ, and δ, which are expressed as either a αβ or γδ heterodimer. Nonplacental mammals (monotremes and marsupials) are unusual in that their genomes encode a fifth TCR chain, called TCRµ, whose function is not known but is also somatically diversified like the conventional chains. The origins of TCRµ are also unclear, although it appears distantly related to TCRδ. Recent analysis of avian and amphibian genomes has provided insight into a model for understanding the evolution of the TCRδ genes in tetrapods that was not evident from humans, mice, or other commonly studied placental (eutherian) mammals. An analysis of the genes encoding the TCRδ chains in the duckbill platypus revealed the presence of a highly divergent variable (V) gene, indistinguishable from immunoglobulin heavy (IgH) chain V genes (VH) and related to V genes used in TCRµ. They are expressed as part of TCRδ repertoire (VHδ) and similar to what has been found in frogs and birds. This, however, is the first time a VHδ has been found in a mammal and provides a critical link in reconstructing the evolutionary history of TCRµ. The current structure of TCRδ and TCRµ genes in tetrapods suggests ancient and possibly recurring translocations of gene segments between the IgH and TCRδ genes, as well as translocations of TCRδ genes out of the TCRα/δ locus early in mammals, creating the TCRµ locus.

Comments: That’s a pretty dense abstract. There’s a lot of acronyms in there, which I find distracting. Also, it’s not immediately obvious why we should be interested in this paper. It seems to be this: studying platypuses uncovered new information about how T cells evolved. But that info is buried in the fourth sentence and beyond.

Introduction

T lymphocytes are critical to the adaptive immune system of all jawed vertebrates and can be classified into two main lineages based on the T-cell receptor (TCR) they use (Rast et al. 1997; reviewed in Davis and Chein 2008). The majority of circulating human T cells are the αβT cell lineage which use a TCR composed of a heterodimer of α and β TCR chains. αβT cells include the familiar T cell subsets such as CD4+ helper T cells and regulatory T cells, CD8+ cytotoxic T cells, and natural killer T (NKT) cells. T cells that are found primarily in epithelial tissues and a lower percentage of circulating lymphocytes in some species express a TCR composed of γ and δ TCR chains. The function of these γδ T cells is less well defined and they have been associated with a broad range of immune responses including tumor surveillance, innate responses to pathogens and stress, and wound healing (Hayday 2009). αβ and γδ T cells also differ in the way they interact with antigen. αβTCR are major histocompatibility complex (MHC) “restricted” in that they bind antigenic epitopes, such as peptide fragments, bound to, or “presented” by, molecules encoded in the MHC. In contrast, γδTCR have been found to bind antigens directly in the absence of MHC, as well as self-ligands that are often MHC-related molecules (Sciammas et al. 1994; Hayday 2009).

I can hardly think of a less exciting introduction. I’m expecting talk of platypuses, of puzzling questions about evolution or the immune system, and all I get is a boring lecture on T cells. Make no mistake: all of this information is important. We need to know what a T cell is, what a T-cell receptor is, and that there exist at least two kinds (αβ and γδ).

But this information shouldn’t be put first. And it could definitely be split up into more paragraphs.

The conventional TCR chains are composed of two extracellular domains that are both members of the immunoglobulin (Ig) domain super-family (reviewed in Davis and Chein 2008) (fig. 1). The membrane proximal domain is the constant (C) domain, which is largely invariant amongst T-cell clones expressing the same class of TCR chain, and is usually encoded by a single, intact exon. The membrane distal domain is called the variable (V) domain and is the region of the TCR that contacts antigen and MHC. Similar to antibodies, the individual clonal diversity in the TCR V domains is generated by somatic DNA recombination (Tonegawa 1983). The exons encoding TCR V domains are assembled somatically from germ-line gene segments, called the V, diversity (D), and joining (J) genes, in developing T cells, a process dependent upon the enzymes encoded by the recombination activating genes (RAG)-1 and RAG-2 (Yancopoulos et al. 1986; Schatz et al. 1989). The exons encoding the V domains of TCR β and δ chains are assembled from all three types of gene segments, whereas the α and γ chains use only V and J. The different combinations of V, D, and J or V and J, selected from a large repertoire of germ-line gene segments, along with variation at the junctions due to addition and deletion of nucleotides during recombination, contribute to a vast TCR diversity. It is this diversity that creates the individual antigen specificity of T-cell clones.

Fig. 1.
Cartoon diagram of the TCR forms found in different species. Oblong circles indicate Ig super-family domains and are color coded as C domains (blue), conventional TCR V domains (red), and VHδ or Vµ (yellow). The gray shaded chains represent the hypothetical partner chain for TCRµ and TCRδ using VHδ.

The figure helps, but again, why are we reading this? This paper seems to follow the common pattern in which the introduction gradually “zooms into” the main point. This is not a good pattern, because it doesn’t tell us the reason for this information. Sure, we suspect it’s relevant to understand what comes next, but without any mystery to anchor this to, it’s hard to be really engaged.

The TCR genes are highly conserved among species in both genomic sequence and organization (Rast et al. 1997; Parra et al. 2008, 2012; Chen et al. 2009). In all tetrapods examined, the TCRβ and γ chains are each encoded at separate loci, whereas the genes encoding the α and δ chains are nested at a single locus (TCRα/δ) (Chien et al. 1987; Satyanarayana et al. 1988; reviewed in Davis and Chein 2008). The V domains of TCRα and TCRδ chains can use a common pool of V gene segments, but distinct D, J, and C genes.

Diversity in antibodies produced by B cells is also generated by RAG-mediated V(D)J recombination and the TCR and Ig genes clearly share a common origin in the jawed-vertebrates (Flajnik and Kasahara 2010; Litman et al. 2010). However, the V, D, J, and C coding regions in TCR have diverged sufficiently over the past >400 million years (MY) from Ig genes that they are readily distinguishable, at least for the conventional TCR. Recently, the boundary between TCR and Ig genes has been blurred with the discovery of non-conventional TCRδ isoforms that have been found that use V genes that appear indistinguishable from Ig heavy chain V (VH) (Parra et al. 2010, 2012). Such V genes have been designated as VHδ and have been found in both amphibians and birds (fig. 1). In the frog Xenopus tropicalis, and a passerine bird, the zebra finch Taeniopygia guttata the VHδ are located within the TCRα/δ loci where they co-exist with conventional Vα and Vδ genes (Parra et al. 2010, 2012). In galliform birds, such as the chicken Gallus gallus, VHδ are present but located at a second TCRδ locus that is unlinked to the conventional TCRα/δ (Parra et al. 2012). VHδ are the only type of V gene segment present at the second locus and, although closely related to antibody VH genes, the VHδ appear to be used exclusively in TCRδ chains. This is true as well for frogs where the TCRα/δ and IgH loci are tightly linked (Parra et al. 2010).

Okay… different species have slightly different genes… Cool.

Also, “MY” for million years, really? Do we really need that, especially when there are already about five abbreviations per sentence?

The TCRα/δ loci have been characterized in several eutherian mammal species and at least one marsupial, the opossum Monodelphis domestica, and VHδ genes have not been found to date (Satyanarayana et al. 1988; Wang et al. 1994; Parra et al. 2008). However, marsupials do have an additional TCR locus, unlinked to TCRα/δ, that uses antibody-related V genes. This fifth TCR chain is called TCRµ and is related to TCRδ, although it is highly divergent in sequence and structure (Parra et al. 2007, 2008). A TCRµ has also been found in the duckbill platypus and is clearly orthologous to the marsupial genes, consistent with this TCR chain being ancient in mammals, although it has been lost in the eutherians (Parra et al. 2008; Wang et al. 2011). TCRµ chains use their own unique set of V genes (Vµ) (Parra et al. 2007; Wang et al. 2011). Trans-locus V(D)J recombination of V genes from other Ig and TCR loci with TCRµ genes has not been found. So far, TCRµ homologues have not been found in non-mammals (Parra et al. 2008).

After an overview of non-mammal tetrapods (frogs, birds), we’re now talking about mammals: platypuses, marsupials, eutherians. It seems like the zooming in is coming to an end…

TCRµ chains are atypical in that they contain three extra-cellular IgSF domains rather than the conventional two, due to an extra N-terminal V domain (fig. 1) (Parra et al. 2007; Wang et al. 2011). Both V domains are encoded by a unique set of Vµ genes and are more related to Ig VH than to conventional TCR V domains. The N-terminal V domain is diverse and encoded by genes that undergo somatic V(D)J recombination. The second or supporting V domain has little or no diversity. In marsupials this V domain is encoded by a germ-line joined, or pre-assembled, V exon that is invariant (Parra et al. 2007). The second V domain in platypus is encoded by gene segments requiring somatic DNA recombination; however, only limited diversity is generated partly due to the lack of D segments (Wang et al. 2011). A TCR chain structurally similar to TCRµ has also been described in sharks and other cartilaginous fish (fig. 1) (Criscitiello et al. 2006; Flajnik et al. 2011). This TCR, called NAR-TCR, also contains three extracellular domains, with the N-terminal V domain being related to those used by IgNAR antibodies, a type of antibody found only in sharks (Greenberg et al. 1995). The current working model for both TCRµ and NAR-TCR is that the N-terminal V domain is unpaired and acts as a single, antigen binding domain, analogous to the V domains of light-chainless antibodies found in sharks and camelids (Flajnik et al. 2011; Wang et al. 2011).

I’ve tried reading this paragraph like five times and I’m still not sure what it’s trying to say. It feels like it’s mostly disjointed sentences that had to be included so the authors can assume you know this, but since we still don’t have a vision of the larger picture, it’s really hard to pay attention.

Phylogenetic analyses support the origins of TCRµ occurring after the avian–mammalian split (Parra et al. 2007; Wang et al. 2011). Previously, we hypothesized the origin of TCRµ being the result of a recombination between ancestral IgH and TCRδ-like loci (Parra et al. 2008). This hypothesis, however, is problematic for a number of reasons. One challenge is the apparent genomic stability and ancient conserved synteny in the region surrounding the TCRα/δ locus; this region has appeared to remain stable over at least the past 350 MY of tetrapod evolution (Parra et al. 2008, 2010). The discovery of VHδ genes inserted into the TCRα/δ locus of amphibians and birds has provided an alternative model for the origins of TCRµ; this model involves both the insertion of VH followed by the duplication and translocation of TCR genes. Here we present the model along with supporting evidence drawn from the structure of the platypus TCRα/δ locus, which is also the first analysis of this complex locus in a monotreme.

The last sentence is the first interesting one of the entire paper. It could have come earlier. Technically we should know this from the abstract, but the abstract was pretty difficult to read too.

Also, this is definitely at least two paragraphs merged into one: the first about the previous hypothesis, and the second about the alternative model that is going to be presented.

Materials and Methods

The intro was painful, and usually materials and methods are even worse. We’ll see! 🙂

Identification and Annotation of the Platypus TCRα/δ Locus

The analyses were performed using the platypus (Ornithorhynchus anatinus) genome assembly version 5.0.1 (http://www.ncbi.nlm.nih.gov/genome/guide/platypus/). The platypus genome was analyzed using the whole-genome BLAST available at NCBI (www.ncbi.nlm.nih.gov/) and the BLAST/BLAT tool from Ensembl (www.ensembl.org). The V and J segments were located by similarity to corresponding segments from other species and by identifying the flanking conserved recombination signal sequences (RSS). V gene segments were annotated 5′ to 3′ as Vα or Vδ followed by the family number and the gene segment number if there were greater than one in the family. For example, Vα15.7 is the seventh Vα gene in family 15. The D segments were identified using complementarity-determining region-3 (CDR3) sequences that represent the V–D–J junctions, from cDNA clones using VHδ. Platypus TCR gene segments were labeled according to the IMGT nomenclature (http://www.imgt.org/). The location for the TCRα/δ genes in the platypus genome version 5.0.1 is provided in supplementary table S1, Supplementary Material online.

Actually, this isn’t that bad: it’s easier to follow than the introduction because it tells us sequential actions. They make sense together.

But there are a few things wrong here. First, the use of the dreaded passive voice. “The analyses were performed …” No! Tell us who performed it! Second, it’s a pretty dense paragraph and the only one in its section (Identification and Annotation …), which means there’s no benefit to bundling all these sentences together: the title already serves this purpose. Third, it lacks some sentence to tell us what the goal is. The intro was not clear enough to assume readers know what the end point of these analyses is.

Confirmation of Expression of Platypus VHδ

Reverse transcription PCR (RT–PCR) was performed on total splenic RNA extracted from a male platypus from the Upper Barnard River, New South Wales, Australia. This platypus was collected under the same permits as in Warren et al. (2008). The cDNA synthesis step was carried out using the Invitrogen Superscript III-first strand synthesis kit according to the manufacturer’s recommended protocol (Invitrogen, Carlsbad, CA, USA). TCRδ transcripts containing VHδ were targeted using primers specific for the Cδ and VHδ genes identified in the platypus genome assembly (Warren et al. 2008). PCR amplification was performed using the QIAGEN HotStar HiFidelity Polymerase Kit (BD Biosciences, CLONTECH Laboratories, Palo Alto, CA, USA) in total volume of 20 µl containing 1× Hotstar Hifi PCR Buffer (containing 0.3 mM dNTPs), 1µM of primers, and 1.25U Hotstar Hifidelity DNA polymerase. The PCR primers used were 5′-GTACCGCCAACCACCAGGGAAAG-3′ and 5′-CAGTTCACTGCTCCATCGCTTTCA-3′ for the VHδ and Cδ, respectively. A previously described platypus spleen cDNA library constructed from RNA extracted from tissue from a Tasmanian animal was also used (Vernersson et al. 2002).

PCR products were cloned using TopoTA cloning® kit (Invitrogen). Sequencing was performed using the BigDye terminator cycle sequencing kit version 3 (Applied Biosystems, Foster City, CA, USA) and according to the manufacturer recommendations. Sequencing reactions were analyzed using the ABI Prism 3100 DNA automated sequences (PerkinElmer Life and Analytical Sciences, Wellesley, MA, USA). Chromatograms were analyzed using the Sequencher 4.9 software (Gene Codes Corporation, Ann Arbor, MI, USA). Sequences have been archived on GenBank under accession numbers JQ664690–JQ664710.

This seems to be mostly a list of the machines, substances, protocols etc. that were used. Accordingly, it should be formatted as a list. It doesn’t read well as a paragraph (nor should it be expected to).

Phylogenetic Analyses

Nucleotide sequences from FR1 to FR3 of the V genes regions, including CDR1 and CDR2, were aligned using BioEdit (Hall 1999) and the accessory application ClustalX (Thompson et al. 1997). Nucleotide alignments analyzed were based on amino acid sequence to establish codon position (Hall 1999). Alignments were corrected by visual inspection when necessary and were then analyzed using the MEGA Software (Kumar et al. 2004). Neighbor joining (NJ) with uncorrected nucleotide differences (p-distance) and minimum evolution distances methods were used. Support for the generated trees was evaluated based on bootstrap values generated by 1000 replicates. GenBank accession numbers for sequences used in the tree construction are in supplementary table S2, Supplementary Material online.

I have a graduate degree in evolutionary biology, I’ve done plenty of phylogenetic analyses (building trees of life), and somehow I hadn’t understood yet that this is what this paper was about. Maybe that’s really obvious to practicing evolutionary biologists, but it seems to me that the kind of analysis could have been made more obvious earlier.
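
For anyone who has not run into the term, the "p-distance" in the methods paragraph above is just the proportion of aligned sites at which two sequences differ, with no correction for multiple substitutions. Here is a minimal Python sketch of that calculation. The function name, the toy sequences, and the choice to skip gapped sites are my own assumptions for illustration; the paper used the MEGA software and does not describe its settings at this level of detail.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ.

    Sites where either sequence has a gap ('-') are skipped
    (an assumption made here for simplicity).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped sites
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Toy aligned sequences, invented for illustration (not from the paper).
toy_alignment = {
    "seq1": "ATGGCCTTAG",
    "seq2": "ATGGCTTTAG",
    "seq3": "ATGACTT-AG",
}

names = sorted(toy_alignment)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(toy_alignment[first], toy_alignment[second])
        print(f"{first} vs {second}: p-distance = {d:.2f}")
```

A neighbor-joining tree is then built from the full matrix of such pairwise distances, which is the step the paper delegates to MEGA.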

Results and Discussion

Not a bad idea to merge results and discussion together IMO, as long as it doesn’t hinder comprehension.

The TCRα/δ locus was identified in the current platypus genome assembly and the V, D, J, and C gene segments and exons were annotated and characterized (fig. 2). The majority of the locus was present on a single scaffold, with the remainder on a shorter contig (fig. 2). Flanking the locus were SALL2, DAD1, and several olfactory receptor (OR) genes, all of which share conserved synteny with the TCRα/δ locus in amphibians, birds, and mammals (Parra et al. 2008, 2010, 2012). The platypus locus has many typical features common to TCRα/δ loci in other tetrapods (Satyanarayana et al. 1988; Wang et al. 1994; Parra et al. 2008, 2010, 2012). Two C region genes were present: a Cα that is the most 3′ coding segment in the locus, and a Cδ oriented 5′ of the Jα genes. There is a large number of Jα gene segments (n = 32) located between the Cδ and Cα genes. Such a large array of Jα genes are believed to facilitate secondary Vα to Jα rearrangements in developing αβT cells if the primary rearrangements are nonproductive or need replacement (Hawwari and Krangel 2007). Primary TCRα V–J rearrangements generally use Jα segments towards the 5′-end of the array and can progressively use downstream Jα in subsequent rearrangements. There is also a single Vδ gene in reverse transcriptional orientation between the platypus Cδ gene and the Jα array that is conserved in mammalian TCRα/δ both in location and orientation (Parra et al. 2008).

Fig. 2.
Annotated map of the platypus TCRα/δ locus showing the locations of the Vα and Vδ (red), VHδ (yellow), Dδ (orange), Jα and Jδ (green), Cδ (dark blue), and Cα (light blue). Conserved syntenic genes are in gray. The scaffold and contig numbers are indicated.

Oof. I had to actually add line breaks to this paragraph to parse it. It mostly says the same things as the figure, which isn’t too bad. Repeating important info in multiple formats is a good idea. The figure itself could have been clearer, though: it took me a few minutes to understand that the multiple lines in it represent contiguous segments of the chromosome (at least that’s what I think it means). I also had to look up what “synteny” means: it’s having the same order for genetic elements across species.

There are 99 conventional TCR V gene segments in the platypus TCRα/δ locus, 89 of which share nucleotide identity with Vα in other species and 10 that share identity with Vδ genes. The Vδ genes are clustered towards the 3′-end of the locus. Based on nucleotide identity shared among the platypus V genes they can be classified into 17 different Vα families and two different Vδ families, based on the criteria of a V family sharing >80% nucleotide identity (not shown, but annotated in fig. 2). This is also a typical level of complexity for mammalian Vα and Vδ genes (Giudicelli et al. 2005; Parra et al. 2008). Also present were two Dδ and seven Jδ gene segments oriented upstream of the Cδ. All gene segments were flanked by canonical RSS, which are the recognition substrate of the RAG recombinase. The D segments were asymmetrically flanked by an RSS containing a 12 bp spacer on the 5′-side and 23 bp spacer on the 3′-side, as has been shown previously for TCR D gene segments in other species (Carroll et al. 1993; Parra et al. 2007, 2010). In summary, the overall content and organization of the platypus TCRα/δ locus appeared fairly generic.

The last sentence seems to be the main takeaway. I would have put it first.
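
The V family rule in the paragraph above (members of a family share >80% nucleotide identity) is easier to picture as a small grouping procedure. Below is a rough Python sketch. Everything specific in it is an assumption on my part: the function names, the invented toy sequences, and above all the choice of single-linkage grouping (two genes land in the same family if a chain of >80% matches connects them). The paper does not say which grouping procedure was actually used.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions that match between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def group_into_families(sequences: dict, threshold: float = 80.0) -> list:
    """Single-linkage grouping: link any pair sharing > threshold % identity."""
    names = list(sequences)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        # Follow parent links to the representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if percent_identity(sequences[first], sequences[second]) > threshold:
                parent[find(first)] = find(second)  # merge the two groups

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(families.values())


# Toy 10-nucleotide "V genes", invented for illustration only.
toy_genes = {
    "V1": "ACGTACGTAC",
    "V2": "ACGTACGTAA",  # 90% identical to V1
    "V3": "TTTTGGGGCC",
    "V4": "TTTTGGGGCA",  # 90% identical to V3
}

for family in group_into_families(toy_genes):
    print(family)
```

With real V genes the sequences would first need to be aligned, and a stricter rule (every member >80% identical to every other member) could give different families than this chained version.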

What is atypical in the platypus TCRα/δ locus was the presence of an additional V gene that shared greater identity to antibody VH genes than to TCR V genes (figs. 2 and 3). This V gene segment was the most proximal of the V genes to the D and J genes and was tentatively designated as VHδ. VHδ are, by definition, V genes indistinguishable from Ig VH genes but used in encoding TCRδ chains and have previously been found only in the genomes of birds and frogs (Parra et al. 2008, 2010, 2012).

Shortish paragraph, intriguing first sentence — good job!

Fig. 3.
Phylogenetic tree of mammalian VH genes including the platypus VHδ and monotreme Vµ. The three major VH clans are bracketed. The platypus VHδ is boxed and the clade containing platypus VHδ along with platypus and echidna Vµ is in bold and indicated by a smaller bracket in VH clan III. The three-digit numbers following the VH gene labels are the last three digits of the GenBank accession number referenced in supplementary table S2, Supplementary Material online. The numbers following the platypus and echidna Vµ labels are clone numbers. The tree presented was generated using the Minimum Evolution method. Similar topology was generation using the Neighbor Joining method.

Maybe that’s the ex-biologist speaking, but I personally really like phylogenetic trees. I find them quite illustrative. On the other hand, I, uh, didn’t remember at all what a VH gene is, so I had to go back to the introduction. There should have been a way to make it clearer, since VH genes play a big role in the results.

Also, not important, but there’s a big typo in the last sentence (generation should have been generated).

VH genes from mammals and other tetrapods have been shown to cluster into three ancient clans and individual species differ in the presence of one or more of these clans in their germ-line IgH locus (Tutter and Riblet 1989; Ota and Nei 1994). For example, humans, mice, echidnas, and frogs have VH genes from all three clans (Schwager et al. 1989; Ota and Nei 1994; Belov and Hellman 2003), whereas rabbits, opossums, and chickens have only a single clan (McCormack et al. 1991; Butler 1997; Johansson et al. 2002; Baker et al. 2005). In phylogenetic analyses, the platypus VHδ was most related to the platypus Vµ genes found in the TCRµ locus in this species (fig. 3). Platypus VHδ, however, share only 51–61% nucleotide identity (average 56.6%) with the platypus Vµ genes. Both the platypus Vµ and VHδ clustered within clan III (fig. 3) (Wang et al. 2011). This is noteworthy given that VH genes in the platypus IgH locus are also clan III and, in general, clan III VH are the most ubiquitous and conserved lineage of VH (Johansson et al. 2002; Tutter and Riblet 1989). Although clearly related to platypus VH, the VHδ gene share only 34–65% nucleotide identity (average 56.9%) with the bona fide VH used in antibody heavy chains in this species.

Okay, this explains the three VH parts in the tree. It’s pretty clear.

It was necessary to rule out that the VHδ gene present in the platypus TCRα/δ locus was not an artifact of the genome assembly process. One piece of supporting evidence would be the demonstration that the VHδ is recombined to downstream Dδ and Jδ segments and expressed with Cδ in complete TCRδ transcripts. PCR using primers specific for VHδ and Cδ was performed on cDNA synthesized from splenic RNA from two different platypuses, one from New South Wales and the other from Tasmania. PCR products were successfully amplified from the NSW animal and these were cloned and sequenced. Twenty clones, each containing unique nucleotide sequence, were characterized and found to contain the VHδ recombined to the Dδ and Jδ gene segments (fig. 4A). Of these 20, 11 had unique V, D, and J combinations that would encode 11 different complementarity-determining regions-3 (CDR3) (fig. 4B). More than half of the CDR3 (8 out of 11) contained evidence of using both D genes (VDDJ) (fig. 4B). This is a common feature of TCRδ V domains where multiple D genes can be incorporated into the recombination due to the presence of asymmetrical RSS (Carroll et al. 1993). The region corresponding to the junctions between the V, D, and J segments, contained additional sequence that could not be accounted for by the germ-line gene segments (fig. 4B). There are two possible sources of such sequence. One are palindromic (P) nucleotides that are created during V(D)J recombination when the RAG generates hairpin structures that are resolved asymmetrically during the re-ligation process (Lewis 1994). The second are non-templated (N) nucleotides that can be added by the enzyme terminal deoxynucleotidyl transferase (TdT) during the V(D)J recombination process. An unusual feature of the platypus VHδ is the presence of a second cysteine encoded near the 3′-end of the gene, directly next to the cysteine predicted to form the intra-domain disulfide bond in Ig domains (fig. 4A). 
Additional cysteines in the CDR3 region of VH domains have been thought to provide stability to unusually long CDR3 loops, as has been described for cattle and the platypus previously (Johansson et al. 2002). The CDR3 of TCRδ using VHδ are only slightly longer than conventional TCRδ chains (ranging 10–20 residues) (Rock et al. 1994; Wang et al. 2011). Furthermore, the stabilization of CDR3 generally involves multiple pairs of cysteines, which were not present in the platypus VHδ clones (fig. 4A). Attempts to amplify TCRδ transcripts containing VHδ from splenic RNA obtained from the Tasmanian animal were unsuccessful. As a positive control, TCRδ transcripts containing conventional Vα/δ were successfully isolated, however. It is possible that Tasmanian platypuses, which have been separated from the mainland population at least 14,000 years either have a divergent VHδ or have deleted this single V gene altogether (Lambeck and Chappell 2001).

I like the thought process: “hey, our results may have been an artifact, here’s what we did to prove it wasn’t.” But why is this paragraph so long? Seems like it could have been multiple smaller ones, perhaps with a section subheading.

Fig. 4.
(A) Alignment of predicted protein sequence of transcripts containing a recombined VHδ gene isolated from platypus spleen RNA. The individual clones are identified by the last three digits of their GenBank accession numbers (JQ664690–JQ664710). Shown is the region from FR3 of the VHδ through the beginning of the Cδ domain. The sequence in bold at the top of the alignment is the germ-line VHδ and Cδ gene sequence. The double cysteines at the end of FR3 and unpaired cysteines in CDR3 are shaded, as is the canonical FGXG in FR4. (B) Nucleotide sequence of the CDR3 region of the eleven unique V(D)J recombinants using VHδ described in the text. The germ-line sequence of the 3′-end of VHδ, the two Dδ, are shown at the top. The germ-line Jδ sequences are shown on the right-hand side of the alignment interspersed amongst the cDNA sequences using each. Nucleotides in the junctions between the V, D, and J segments, shown italicized, are most likely N-nucleotides added by TdT.

This figure is probably good to visualize what their results actually looked like, but it also seems like a way to cram as much information in a visual and its caption as humanly possible… I’ll let it pass. It’s fine that some parts of the paper go more in depth, if they can be easily ignored, as I think is the case here.

Small nitpick: This is two figures, and I would have preferred that this fact be clearer. A small “(A)” and “(B)” in the paragraph doesn’t really help the reader.

Although there is only a single VHδ in the current platypus genome assembly, there was sequence variation in the region corresponding to FR1 through FR3 of the V domains (fig. 4A and sequence data not shown but available in GenBank). Some of this variation could represent two alleles of a single VHδ gene. Indeed, the RNA used in this experiment is from a wild-caught individual from the same population that was used to generate the whole-genome sequence and was found to contain substantial heterozygosity (Warren et al. 2008). There was greater variation in the transcribed sequences, however, than could be explained simply by two alleles of a single gene (fig. 4A). Two alternative explanations are the occurrence of somatic mutation of expressed VHδ genes or allelic variation in gene copy number. Somatic mutation in TCR chains is controversial. Nonetheless, it has been invoked to explain the variation in expressed TCR chains that exceeds the apparent gene copy number in sharks, and has also been postulated to occur in salmonids (Yazawa et al. 2008; Chen et al. 2009). Therefore, it does not seem to be out of the realm of possibility that somatic mutation is occurring in platypus VHδ. Indeed, the mutations appear to be localized to the V region with no variation in the C region (fig. 4A). This may be due to its relatedness of VHδ to Ig VH genes where somatic hyper-mutation is well documented. Such somatic mutation contributes to overall affinity maturation in secondary antibody responses (Wysocki et al. 1986). The pattern of mutation seen in platypus VHδ however, is not localized to the CDR3, which would be indicative of selection for affinity maturation, but was also found in the framework regions. Furthermore, in the avian genomes where there is also only a single VHδ, there was no evidence of somatic mutation in the V regions (Parra et al. 2012). The contribution of mutation to the platypus TCRδ repertoire, if it is occurring, remains to be determined.
Alternatively, the sequence polymorphism may be due to VHδ gene copy number variation between individual TCRα/δ alleles.

Not the worst paragraph, but again, doesn’t need to be a Wall of Text.

Irrespective of the number of VHδ genes in the platypus TCRα/δ locus, the results clearly support TCRδ transcripts containing VHδ recombined to Dδ and Jδ gene segments in the TCRα/δ locus (fig. 4). A VHδ gene or genes in the platypus TCRα/δ locus in the genome assembly, therefore, does not appear to be an assembly artifact. Rather it is present, functional and contributes to the expressed TCRδ chain repertoire. The possibility that some platypus TCRα/δ loci contain more than a single VHδ does not alter the principal conclusions of this study.

Previously, we hypothesized the origin of TCRµ in mammals involving the recombination between an ancestral TCRα/δ locus and an IgH locus (Parra et al. 2008). The IgH locus would have contributed the V gene segments at the 5′-end of the locus, with the TCRδ contributing the D, J, and C genes at the 3′-end of the locus. The difficulty with this hypothesis was the clear stability of the genome region surrounding the TCRα/δ locus. In other words, the chromosomal region containing the TCRα/δ locus appears to have remained relatively undisrupted for at least the past 360 million years (Parra et al. 2008, 2010, 2012). The discovery of VHδ genes within the TCRα/δ loci of frog and zebra finch is consistent with insertions occurring without apparently disrupting the local syntenic region. In frogs, the IgH and TCRα/δ loci are tightly linked, which may have facilitated the translocation of VH genes into the TCRα/δ locus (Parra et al. 2010). However, close linkage is not a requirement since the translocation of VH genes appears to have occurred independently in birds and monotremes, due to the lack of similarity between the VHδ in frogs, birds, and monotremes (Parra et al. 2012). Indeed, it would appear as if the acquisition of VH genes into the TCRα/δ locus occurred independently in each lineage.

The similarity between the platypus VHδ and V genes in the TCRµ locus is, so far, the clearest evolutionary association between the TCRµ and TCRδ loci in one species. From the comparison of the TCRα/δ loci in frogs, birds, and monotremes, a model for the evolution of TCRµ and other TCRδ forms emerges (fig. 5), which can be summarized as follows:

Oooh, exciting! The title promised a model, and at last we get it. Also it seems that below we get point-form stuff! I like point-form stuff. It’s often really helpful to guide the reader.

  1. Early in the evolution of tetrapods, or earlier, a duplication of the D–J–Cδ cluster occurred resulting in the presence of two Cδ each with its own set of Dδ and Jδ segments (fig. 5A).

  2. Subsequently, a VH gene or genes was translocated from the IgH locus and inserted into the TCRα/δ locus, most likely to a location between the existing Vα/Vδ genes and the 5′-proximal D–J–Cδ cluster (fig. 5B). This resulted in the configuration like that which currently exists in the zebra finch genome (Parra et al. 2012).

  3. In the amphibian lineage there was an inversion of the region containing VHδ–Dδ–Jδ–Cδ cluster and an expansion in the number of VHδ genes (fig. 5C). Currently, X. tropicalis has the greatest number of VHδ genes, where they make up the majority of V genes available in the germ-line for use in TCRδ chains (Parra et al. 2010).

  4. In the galliform lineage (chicken and turkey), the VHδ–Dδ–Jδ–Cδ cluster was trans-located out of the TCRα/δ locus where it currently resides on another chromosome (fig. 5D). There are no Vα or Vδ genes at the site of the second chicken TCRδ locus and only a single Cδ gene remains in the conventional TCRα/δ locus (Parra et al. 2012).

  5. Similar to galliform birds, the VHδ–Dδ–Jδ–Cδ cluster was trans-located out of the TCRα/δ locus in presumably the last common ancestor of mammals, giving rise to TCRµ (fig. 5E). Internal duplications of the VHδ–Dδ–Jδ genes gave rise to the current [(V–D–J) − (V–D–J) − C] organization necessary to encode TCR chains with double V domains (Parra et al. 2007; Wang et al. 2011). In the platypus, the second V–D–J cluster, encoding the supporting V, has lost its D segments and generates V domains with short CDR3 encoded by direct V to J recombination (Wang et al. 2011). The whole cluster appears to have undergone additional tandem duplication as it exists in multiple tandem copies in the opossum and also likely in the platypus (Parra et al. 2007, 2008; Wang et al. 2011).

  6. In the therian lineage (marsupials and placentals), the VHδ was lost from the TCRα/δ locus (Parra et al. 2008). In placental mammals, the TCRµ locus was also lost (Parra et al. 2008). The marsupials retained TCRµ, however the second set of V and J segments, encoding the supporting V domain in the protein chain, were replaced with a germ-line joined V gene, in a process most likely involving germ-line V(D)J recombination and retro-transposition (fig. 5F) (Parra et al. 2007, 2008).

Yeah, this was good. These point-form paragraphs, combined with Fig. 5 (below) did more to help me understand the paper than anything else so far. I kind of wish the paper had just opened with this, and then proceeded to explain the reasoning behind.

TCR forms such as TCRµ, which contain three extracellular domains, have evolved at least twice in vertebrates. The first was in the ancestors of the cartilaginous fish in the form of NAR-TCR (Criscitiello et al. 2006) and the second in the mammals as TCRµ (Parra et al. 2007). NAR-TCR uses an N-terminal V domain related to the V domains found in IgNAR antibodies, which are unique to cartilaginous fish (Greenberg et al. 1995; Criscitiello et al. 2006), and not closely related to antibody VH domains. Therefore, it appears that NAR-TCR and TCRµ are more likely the result of convergent evolution rather than being related by direct descent (Parra et al. 2007; Wang et al. 2011). Similarly, the model proposed in fig. 5 posits the direct transfer of VH genes from an IgH locus to the TCRα/δ locus. But it should be pointed out the VHδ found in frogs, birds, and monotremes are not closely related (fig. 3); indeed, they appear derived each from different, ancient VH clans (birds, VH clan I; frogs VH clan II; platypus VH clan III). This observation would suggest that the transfer of VHδ into the TCRα/δ loci occurred independently in the different lineages. Alternatively, the transfer of VH genes into the TCRα/δ locus may have occurred frequently and repeatedly in the past and gene replacement is the best explanation for the current content of these genes in the different tetrapod lineages. The absence of VHδ in marsupials, the highly divergent nature of Vµ genes in this lineage, and the absence of conserved synteny with genes linked to TCRµ in the opossum, provide little insight into the origins of TCRµ and its relationship to TCRδ or the other conventional TCR (Parra et al. 2008). The similarity between VH, VHδ, and Vµ genes in the platypus genome, which are all clan III, however is striking. In particular, the close relationship between the platypus VHδ and Vµ genes lends greater support for the model presented in fig. 5E, with TCRµ having been derived from TCRδ genes.

My comments are getting repetitive. This could have been multiple paragraphs etc. etc. It’s easy enough to find the joints where it should be carved, by the way: right before the sentences that start with “Similarly” and “Alternatively” would be a good start, since these words indicate that we’re switching to a new idea.

Fig. 5.
A model of the stages of evolution of the TCRα/δ loci in tetrapods and the origins of TCRµ in mammals. A color key of the gene segments is presented at the bottom. (A) Depiction of the Dδ-Jδ-Cδ duplication in an ancestral TCRα/δ locus that provides a second Cδ gene found in frogs and zebra finch. (B) Depiction of the insertion of a VH gene into the TCRα/δ locus producing a current organization as it is found in zebra finch. (C) Depiction of the inversion/translocation and VHδ gene duplication that yielded the current organization found in frogs. (D) Depiction of the translocation of a VHδ–Dδ–Jδ–Cδ cluster to a location outside the TCRα/δ locus generating a second TCRδ locus as it is currently found in chicken and turkey. (E) Depiction of the translocation that took place in mammals giving rise to the TCRµ locus. (F) Loss of TCRµ in placental mammals, loss of D gene segments in cluster encoding the support V domain, retro-transposition to form a germ-line joined V in marsupials, and duplication of TCRµ clusters in both monotremes and marsupials.

Super helpful figure. Although I’m generally in favor of repeating important info, I do feel that the caption could have simply referred to the 6-point model in the text. The caption as it stands doesn’t add much and looks like a Wall of Text. But that’s not a big deal.

The presence of TCR chains that use antibody like V domains, such as TCRδ using VHδ, NAR-TCR or TCRµ are widely distributed in vertebrates with only the bony fish and placental mammals missing. In addition to NAR-TCR, some shark species also appear to generate TCR chains using antibody V genes. This occurs via trans-locus V(D)J recombination between IgM and IgW heavy chain V genes and TCRδ and TCRα D and J genes (Criscitiello et al. 2010). This may be possible, in part, due to the multiple clusters of Ig genes found in the cartilaginous fish. It also illustrates that there has been independent solutions to generating TCR chains with antibody V domains in different vertebrate lineages. In the tetrapods, the VH genes were trans-located into the TCR loci where they became part of the germ-line repertoire. Whereas in cartilaginous fish something equivalent may occur somatically during V(D)J recombination in developing T cells. Either mechanism suggests there has been selection for having TCR using antibody V genes over much of vertebrate evolutionary history.

The current working hypothesis for such chains is that they are able to bind native antigen directly. This is consistent with a selective pressure for TCR chains that may bind or recognize antigen in ways similar to antibodies in many different lineages of vertebrates. In the case of NAR-TCR and TCRµ, the N-terminal V domain is likely to be unpaired and bind antigen as a single domain (fig. 1), as has been described for IgNAR and some IgG antibodies in camels (recently reviewed in Flajnik et al. 2011). This model of antigen binding is consistent with the evidence that the N-terminal V domains in TCRµ are somatically diverse, while the second, supporting V domains have limited diversity with the latter presumably performing a structural role rather than one of antigen recognition (Parra et al. 2007; Wang et al. 2011). There is no evidence of double V domains in TCRδ chains using VHδ in frogs, birds, or platypus (fig. 1) (Parra et al. 2010, 2012). Rather, the TCR complex containing VHδ would likely be structured similar to a conventional γδTCR with a single V domain on each chain. It is possible that such receptors also bind antigen directly, however this remains to be determined.

Not much to add except that I just had a thought that subheadings would have greatly eased this section (like they did the Methods section).

A compelling model for the evolution of the Ig and TCR loci has been one of internal duplication, divergence and deletion; the so-called birth-and-death model of evolution of immune genes promoted by Nei and colleagues (Ota and Nei 1994; Nei et al. 1997). Our results in no way contradict that the birth-and-death mode of gene evolution has played a significant role in shaping these complex loci. However, our results do support the role of horizontal transfer of gene segments between the loci that has not been previously appreciated. With this mechanism T cells may have been able to acquire the ability to recognize native, rather than processed antigen, much like B cells.

Pretty good conclusion, opening on new ideas and showing the significance of this work in the field.


Phew. I’m done.

Reading this paper took me several days, although I could have been more focussed in general. But this shows how much work is required to read papers! I had to push myself to read. Many times I caught myself skimming paragraphs without understanding anything, and I had to read again. Right now I think I would benefit from reading it all a second time, but I resist the thought, because it’s work.

But I think it’s a good candidate for my rewriting project. It should be relatively easy to cut down the number of abbreviations, split long paragraphs, and add subheadings. More thorough rewriting will probably involve clarifying the main points and claims right at the start. At the most extreme (I’m not sure I’ll go there), it could be beneficial to change the entire structure: give the detailed model first, and only then explain the background and methods.

Stay tuned!


Science Style Guide: Abbreviations

This post is part of my ongoing scientific style guideline series.

Textual compression devices (TCDs) are used more and more in science writing. TCDs come in various forms, including truncation (e.g. mi for mile), acronyms (lol for laughing out loud), syllabic acronyms (covid for coronavirus disease), contraction (int’l for international), and others. The primary benefit of using TCDs in writing is to reduce text length. This is especially useful in contexts where space is limited, such as tables and charts, as well as when a long word or phrase is repeated multiple times. Another reason to use a TCD is to create a new semantic unit that is more practical to use than the long version. For instance, the TCD laser is both more convenient and more recognizable than the original light amplification by the stimulated emission of radiation.

Okay. Look at that paragraph without reading it. Does anything stand out?

I made up the phrase textual compression device and its acronym TCD. They simply mean “abbreviation,” which is what I would have used if I weren’t trying to illustrate the following points:

  • Abbreviations can be distracting. Readers expect words, and things that look less like words — capitalized acronyms, random apostrophes in the middle of a word — will stand out, as does TCD above. Used sparingly, that can be good, to draw attention to something. But in large quantities, it’s jarring.
    • It’s even worse when multiple different abbreviations are in close proximity, or when similar abbreviations are used (e.g. TCRµ and TCRδ, which come up all the time in a paper I’m reading).
  • More importantly, abbreviations demand cognitive effort. If the reader doesn’t already know an abbreviation (for instance because you made it up), they have to spend some energy learning it. You’d probably prefer them to expend that energy understanding your paper instead.
    • Worse, they might have to interrupt their reading to go back to where you defined the abbreviation, or to look it up online. (A nice opportunity to quit reading your paper altogether.)

Humans don’t read like computers. You can’t just “declare” an abbreviation as you would a variable in code, and assume that from now on your reader knows what it stands for. It’s quite likely that readers will skim your piece, or jump directly to a specific section (e.g. results), in which case they can miss the definition. Even if they do read it, they might forget — in a typical paper, there’s a lot of information to remember.

Of course, abbreviations can be useful, as the TCD paragraph laboriously explains. But the benefits are rather minor. On computer screens, which is where your scientific writing will almost always be read, space is virtually unlimited. (Figures and tables remain a good use case, as long as the abbreviations are easily readable in the caption.) Creating a new, more practical way to call a thing (e.g. laser) can be quite useful, but again, only if used sparingly, for important concepts.

Overall, the benefits of abbreviations are much greater for the writer than for the reader — which is exactly the opposite of what we want as per the Minimum Reading Friction principle.

The other principle, Low-Hanging Fruit, says that the best improvements are those that require little writing skill to implement. Abbreviation minimization fits the bill. In most cases, you can improve the text just by replacing the abbreviation with:

  • The spelled-out version (textual compression device instead of TCD)
  • A synonym (abbreviation instead of TCD / textual compression device)
  • The core noun of the abbreviated phrase (e.g. device; not the best example but you get the idea. It will usually be clear in context what you refer to, unless you’re talking about many different types of devices).

Sometimes you’ll need to perform a bit more rephrasing, but rarely will you have to perform major rephrasing due to abbreviations. If you do, that’s probably a sign that the original text was awfully painful to read.

Recommendations

  • Coin new abbreviations as rarely as possible.
    • If you must coin new abbreviations, make sure they’re short, pronounceable, and memorable. Don’t hesitate to repeat their meaning multiple times — you’re teaching your readers a new word.
  • Generally prefer the use of spelled-out versions, core nouns, or synonyms.
  • Avoid using multiple different abbreviations in close proximity.
  • Abbreviations that are generally well-known, such as DNA, can be used as much as you want. A good way to tell is if they’re included in dictionaries.
  • If you can’t avoid using several uncommon or new abbreviations, it can be helpful to draw attention to them, so that readers are warned that they will have a better time if they make sure they learn the new terms.
    • This could take the form of a short glossary at the beginning, making it easy to look up definitions during reading.
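
As a rough aid for applying these recommendations, a short script can list the capitalized abbreviations in a draft and count how often each one appears. The sketch below is only a starting point. The regular expression (two or more capital letters, with an optional plural "s"), the function name, and the tiny allow-list of "well-known" abbreviations are all assumptions of mine; a real tool might check a dictionary instead, and would need to handle abbreviations containing digits, lowercase letters, or Greek letters.

```python
import re
from collections import Counter

# Hypothetical allow-list of abbreviations considered well known.
WELL_KNOWN = {"DNA", "RNA", "PCR"}

# Two or more capital letters, optionally followed by a plural "s".
ABBREVIATION_PATTERN = re.compile(r"\b[A-Z]{2,}s?\b")


def count_abbreviations(text: str) -> Counter:
    """Count capitalized abbreviations that are not on the allow-list."""
    counts = Counter()
    for match in ABBREVIATION_PATTERN.findall(text):
        abbreviation = match.rstrip("s")  # treat "TCDs" and "TCD" as one term
        if abbreviation not in WELL_KNOWN:
            counts[abbreviation] += 1
    return counts


draft = (
    "TCDs are used more and more in science writing. "
    "The TCD laser is convenient. DNA is fine."
)

for abbreviation, count in count_abbreviations(draft).most_common():
    print(f"{abbreviation}: used {count} time(s)")
```

Anything the script reports is a candidate for the spelled-out version, a synonym, or the core noun. It will also flag ordinary words typed in all caps, so the output needs a human eye.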

Proposal for a New Scientific Writing Guide

Scientific writing is in bad shape. Realizing that, and wanting to do something about it, was the starting point for my essay on the creation of a new journal, one that would rewrite some science papers in a better style and kickstart a movement to ultimately change the writing norms.

Since I published the essay last July, the Journal of Actually Well-Written Science (yeah, it needs a better name) has gone from “cool idea” to “project I’m actually trying to bring to life.” Many questions remain unanswered as to how best to proceed. But one important thing I must figure out is: What should the writing norms be changed to?

Today I’m committing to publish several short posts over the course of the next month to answer exactly that.

Below is some brief discussion of the two principles that will guide my thinking. They both center on the idea of minimizing effort, for the reader as well as for the writer. Writers should make some effort to ensure readers don’t have to (that’s the basic job of a writer), but I’ll focus on improvements that don’t require a lot of time and effort from writers, since those tend to be busy scientists.

I’ll also include a table of contents to easily access the posts as they are published.

Two effort-minimization principles

1) Minimum Reading Friction: Demand fewer cognitive resources from the reader

Science papers are usually technical. They deal with complex questions. They assume specialized background knowledge. They may involve math.

It is expected that papers be difficult to read. But we can at least make sure the writing doesn’t get in the way.

The first principle of this style guide says that you should do everything feasible to reduce the amount of effort readers will need to make when reading your paper. In other words, your job is to make their job easier.

If something (e.g. finding a good example to illustrate a point, like I just did!) asks some effort from you but reduces the effort readers will need to make when reading, then do it. Conversely, don’t make your own life easier if it’s going to make the reader’s life harder. An example would be using an abbreviation to spend less time typing at the cost of increasing the cognitive demands on the reader.

The larger your readership, the more important this principle is. If you write for one person (e.g. an email), then it doesn’t matter that much if it takes some work to read (although it might hurt your chances of getting a reply). But if you expect to be read by 1,000 people, then every abbreviation that saved you some inconvenience is now multiplied into an inconvenience for 1,000 people.

2) Low-Hanging Fruit: Focus on improvements that are easy to apply

“Writing well” is a complicated art. Developing it can be the project of a lifetime. Scientists are typically too busy for that.

Fortunately (for me), science writing is so bad that there’s a lot of low-hanging fruit to pick. Many improvements need little effort. For example, using fewer abbreviations results in less demanding reading without requiring advanced writing skills — you can often just replace the abbreviations with the unabridged terms. Splitting long paragraphs into smaller chunks is often as easy as adding a line break when you notice a shift to a different idea.

Such improvements can also be applied almost mechanically, which is ideal for someone who rewrites a paper without being as intimate with the topic as the author is.

The second principle therefore says to focus primarily (but not exclusively) on the elements of style that require the least effort and skill relative to how much they improve the writing.

Other things to keep in mind

  • Keep the good current norms. The goal of this project is not to burn scientific writing down and rebuild it from scratch. For example, it is good that scientists, by default, try to avoid ornamented writing. This helps with precision and objectivity.
  • Formatting is an area that can be improved, but it’s a less tractable problem because it differs a lot between publications. For instance, citation style (e.g. footnotes vs. inline) can help or hinder reading. Still, I’ll eventually need to develop guidelines for formatting in JAWWS, so I will probably discuss it a few times.
  • Focus on the classic paper format. There are a lot of new, exotic ways that science could be communicated, but at first we’ll assume that papers — usually with traditional structures like intro-methods-results-discussion — will remain the main format in the foreseeable future.
  • Personal preferences can be hard to distinguish from objective quality measures. Of course, everything I propose will reflect what I personally look for in science writing. I think and hope most guidelines will be broadly popular, but I’m always open to feedback and I’ll tweak them if others make convincing arguments.

Table of contents

I will update this list as I publish the posts.

In the meantime, here’s a very informal list of topics I might cover:

  • Abbreviations
  • Paragraph length
  • Giving examples
  • Bullet points
  • Links
  • Citations and references
  • Point of view (1st vs 3rd person) and voice (active vs passive)
  • Humor
  • Flourishes, ornamentation
  • Paper structure
  • Length vs. clarity vs. density tradeoff
  • Figures
  • Reading guidance (e.g. “read the methods section to understand what we did, but feel free to skip the more technical section 2.3”)
  • Jargon, vocabulary, word choice
  • Writing in narrative form (a difficult skill!)

 

Last updated: November 12, 2021


Appendix to JAWWS: An Incrementally Rewritten Paragraph

Yesterday, I published a post describing an idea to improve scientific style by rewriting papers as part of a new science journal. I originally wanted to conclude the post with a demonstration of how the rewriting could be done, but I didn’t want to add too much length. Here it is as an appendix.

We start with a paragraph taken more or less at random from a biology paper titled “Shedding light on the ‘dark side’ of phylogenetic comparative methods”, published by Cooper et al. in 2016. Then, in five steps, we’ll incrementally improve it, at least according to my preferences! Let me know if it fits your own idea of good scientific writing as well.

1. Original

Most models of trait evolution are based on the Brownian motion model (Cavalli-Sforza & Edwards 1967; Felsenstein 1973). The Ornstein–Uhlenbeck (OU) model can be thought of as a modification of the Brownian model with an additional parameter that measures the strength of return towards a theoretical optimum shared across a clade or subset of species (Hansen 1997; Butler & King 2004). OU models have become increasingly popular as they tend to fit the data better than Brownian motion models, and have attractive biological interpretations (Cooper et al. 2016b). For example, fit to an OU model has been seen as evidence of evolutionary constraints, stabilising selection, niche conservatism and selective regimes (Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013). However, the OU model has several well-known caveats (see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014). For example, it is frequently incorrectly favoured over simpler models when using likelihood ratio tests, particularly for small data sets that are commonly used in these analyses (the median number of taxa used for OU studies is 58; Cooper et al. 2016b). Additionally, very small amounts of error in data sets can result in an OU model being favoured over Brownian motion simply because OU can accommodate more variance towards the tips of the phylogeny, rather than due to any interesting biological process (Boettiger, Coop & Ralph 2012; Pennell et al. 2015). Finally, the literature describing the OU model is clear that a simple explanation of clade-wide stabilising selection is unlikely to account for data fitting an OU model (e.g. Hansen 1997; Hansen & Orzack 2005), but users of the model often state that this is the case. Unfortunately, these limitations are rarely taken into account in empirical studies.

Okay, first things first: let’s banish all those horrendous inline citations to footnotes.

2. With footnotes

Most models of trait evolution are based on the Brownian motion model.[1] The Ornstein–Uhlenbeck (OU) model can be thought of as a modification of the Brownian model with an additional parameter that measures the strength of return towards a theoretical optimum shared across a clade or subset of species.[2] OU models have become increasingly popular as they tend to fit the data better than Brownian motion models, and have attractive biological interpretations.[3] For example, fit to an OU model has been seen as evidence of evolutionary constraints, stabilising selection, niche conservatism and selective regimes.[4] However, the OU model has several well-known caveats.[5] For example, it is frequently incorrectly favoured over simpler models when using likelihood ratio tests, particularly for small data sets that are commonly used in these analyses.[6] Additionally, very small amounts of error in data sets can result in an OU model being favoured over Brownian motion simply because OU can accommodate more variance towards the tips of the phylogeny, rather than due to any interesting biological process.[7] Finally, the literature describing the OU model is clear that a simple explanation of clade-wide stabilising selection is unlikely to account for data fitting an OU model,[8] but users of the model often state that this is the case. Unfortunately, these limitations are rarely taken into account in empirical studies.

[1] Cavalli-Sforza & Edwards 1967; Felsenstein 1973
[2] Hansen 1997; Butler & King 2004
[3] Cooper et al. 2016b
[4] Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013
[5] see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014
[6] the median number of taxa used for OU studies is 58; Cooper et al. 2016b
[7] Boettiger, Coop & Ralph 2012; Pennell et al. 2015
[8] e.g. Hansen 1997; Hansen & Orzack 2005

Much better.

Does this need to be a single paragraph? No, it doesn’t. Let’s not go overboard with cutting it up, but I think a three-fold division makes sense.

3. Multiple paragraphs

Most models of trait evolution are based on the Brownian motion model.[9]

The Ornstein–Uhlenbeck (OU) model can be thought of as a modification of the Brownian model with an additional parameter that measures the strength of return towards a theoretical optimum shared across a clade or subset of species.[10] OU models have become increasingly popular as they tend to fit the data better than Brownian motion models, and have attractive biological interpretations.[11] For example, fit to an OU model has been seen as evidence of evolutionary constraints, stabilising selection, niche conservatism and selective regimes.[12]

However, the OU model has several well-known caveats.[13] For example, it is frequently incorrectly favoured over simpler models when using likelihood ratio tests, particularly for small data sets that are commonly used in these analyses.[14] Additionally, very small amounts of error in data sets can result in an OU model being favoured over Brownian motion simply because OU can accommodate more variance towards the tips of the phylogeny, rather than due to any interesting biological process.[15] Finally, the literature describing the OU model is clear that a simple explanation of clade-wide stabilising selection is unlikely to account for data fitting an OU model,[16] but users of the model often state that this is the case. Unfortunately, these limitations are rarely taken into account in empirical studies.

[9] Cavalli-Sforza & Edwards 1967; Felsenstein 1973
[10] Hansen 1997; Butler & King 2004
[11] Cooper et al. 2016b
[12] Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013
[13] see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014
[14] the median number of taxa used for OU studies is 58; Cooper et al. 2016b
[15] Boettiger, Coop & Ralph 2012; Pennell et al. 2015
[16] e.g. Hansen 1997; Hansen & Orzack 2005

We haven’t rewritten anything yet — the changes so far are really low-hanging fruit! Let’s see if we can improve the text more with some rephrasing. This is trickier, because there’s a risk I change the original meaning, but it’s not impossible.

4. Some rephrasing

Most models of trait evolution are based on the Brownian motion model, in which traits evolve randomly and accrue variance over time.[17]

What if we add a parameter to measure how much the trait motion returns to a theoretical optimum for a given clade or set of species? Then we get a family of models called Ornstein-Uhlenbeck,[18] first developed as a way to describe friction in the Brownian motion of a particle. These models have become increasingly popular, both because they tend to fit the data better than simple Brownian motion, and because they have attractive biological interpretations.[19] For example, fit to an Ornstein-Uhlenbeck model has been seen as evidence of evolutionary constraints, stabilising selection, niche conservatism and selective regimes.[20]

However, Ornstein-Uhlenbeck models have several well-known caveats.[21] For example, they are frequently, and incorrectly, favoured over simpler Brownian models. This occurs with likelihood ratio tests, particularly for the small data sets that are commonly used in these analyses.[22] It also happens when there is error in the data set, even very small amounts of error, simply because Ornstein-Uhlenbeck models accommodate more variance towards the tips of the phylogeny, therefore suggesting an interesting biological process where there is none.[23] Additionally, users of Ornstein-Uhlenbeck models often state that clade-wide stabilising selection accounts for data fitting the model, even though the literature describing the model warns that such a simple explanation is unlikely.[24] Unfortunately, these limitations are rarely taken into account in empirical studies.

[17] Cavalli-Sforza & Edwards 1967; Felsenstein 1973
[18] Hansen 1997; Butler & King 2004
[19] Cooper et al. 2016b
[20] Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013
[21] see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014
[22] the median number of taxa used for Ornstein-Uhlenbeck studies is 58; Cooper et al. 2016b
[23] Boettiger, Coop & Ralph 2012; Pennell et al. 2015
[24] e.g. Hansen 1997; Hansen & Orzack 2005

What did I do here? First, I completely got rid of the “OU” acronym. Acronyms may look like they simplify the writing, but in fact they often demand more cognitive resources from the reader, who has to constantly remember that OU means Ornstein-Uhlenbeck.

Then I rephrased several sentences to make them flow better, at least according to my taste.

I also added a short explanation of what Brownian and Ornstein-Uhlenbeck models are. That might not be necessary, but it’s always good to make life easier for the reader. Even if you defined the terms earlier in the paper, repetition is useful to avoid asking the reader an effort to remember. And even if everyone reading your paper is expected to know what Brownian motion is, there’ll be some student somewhere thanking you for reminding them.[25]

[25] I considered doing this with the “evolutionary constraints, stabilising selection, niche conservatism and selective regimes” enumeration too, but these are mere examples, less critical to the main idea of the section. Adding definitions would make the sentence quite long and detract from the main flow. Also I don’t know what the definitions are and don’t feel like researching lol.

This is already pretty good, and still close enough to the original. What if I try to go further?

5. More rephrasing

Most models of trait evolution are based on the Brownian motion model.[26] Brownian motion was originally used to describe the random movement of a particle through space. In the context of trait evolution, it assumes that a trait (say, beak size in some group of bird species) changes randomly, with some species evolving a larger beak, some a smaller one, and so on. Brownian motion implies that variance in beak size, across the group of species, increases over time.

This is a very simple model. What if we refined it by adding a parameter? Suppose there is a theoretical optimal beak size for this group of species. The new parameter measures how much the trait tends to return to this optimum. This gives us a type of model called Ornstein-Uhlenbeck,[27] first developed as a way to add friction to the Brownian motion of a particle.

Ornstein-Uhlenbeck models have become increasingly popular in trait evolution, for two reasons.[28] First, they tend to fit the data better than simple Brownian motion. Second, they have attractive biological interpretations. For example, fit to an Ornstein-Uhlenbeck model has been seen as evidence of a number of processes, including evolutionary constraints, stabilising selection, niche conservatism and selective regimes.[29]

Despite this, Ornstein-Uhlenbeck models are not perfect, and have several well-known caveats.[30] Sometimes you really should use a simpler model! It is common, but incorrect, to favour an Ornstein-Uhlenbeck model over a Brownian model after performing likelihood ratio tests, particularly for the small data sets that are often used in these analyses.[31] Then there is the issue of error in data sets. Even a very small amount of error can lead researchers to pick an Ornstein-Uhlenbeck model, simply because it accommodates more variance towards the tips of the phylogeny, therefore suggesting interesting biological processes where there are none.[32]

Additionally, users of Ornstein-Uhlenbeck models often state that the reason their data fits the model is clade-wide stabilising selection (for instance, selection for intermediate beak sizes, rather than extreme ones, across the group of birds). Yet the literature describing the model warns that such simple explanations are unlikely.[33]

Unfortunately, these limitations are rarely taken into account in empirical studies.

[26] Cavalli-Sforza & Edwards 1967; Felsenstein 1973
[27] Hansen 1997; Butler & King 2004
[28] Cooper et al. 2016b
[29] Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013
[30] see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014
[31] the median number of taxa used for Ornstein-Uhlenbeck studies is 58; Cooper et al. 2016b
[32] Boettiger, Coop & Ralph 2012; Pennell et al. 2015
[33] e.g. Hansen 1997; Hansen & Orzack 2005

Okay, many things to notice here. First, I added an example, bird beak size. I’m not 100% sure I understand the topic well enough for my example to be particularly good, but I think it’s decent. I also added more explanation of what Brownian models are in trait evolution. Then I rephrased other sentences to make the tone less formal.

As a result, this version is longer than the previous ones. It seemed justified to cut it up into more paragraphs to accommodate the extra length. It’s plausible that the authors originally tried to include too much content in too few words, perhaps to satisfy a length constraint posed by the journal.

Let’s do one more round…

6. Rephrasing, extreme edition

Suppose you want to model the evolution of beak size in some fictional family of birds. There are 20 bird species in the family, all with different average beak sizes. You want to create a model of how their beaks changed over time, so you can reimagine the beak of the family’s ancestor and understand what happened exactly.

Most people who try to model the evolution of a biological trait use some sort of Brownian motion model.[34] Brownian motion, originally, refers to the random movement of a particle in a liquid or gas. The mathematical analogy here is that beak size evolves randomly: it becomes very large in some species, very small in others, with various degrees of intermediate forms between the extremes. Therefore, across the 20 species, the variance in beak size increases over time.

Brownian motion is a very simple model. What if we add a parameter to get a slightly more complicated one? Let’s assume there’s a theoretical optimal beak size for our family of birds (maybe because the seeds they eat have a constant average diameter). The new parameter measures how much beak size tends to return to the optimum during its evolution. This gives us a type of model called Ornstein-Uhlenbeck,[35] first developed as a way to add friction to the Brownian motion of a particle. We can imagine the “friction” to be the resistance against deviating from the optimum.

Ornstein-Uhlenbeck models have become increasingly popular, for two reasons.[36] First, they often fit real-life data better than simple Brownian motion. Second, they are easy to interpret biologically. For example, maybe our birds don’t have as extreme beak sizes as we’d expect from a Brownian model, so it makes sense to assume there’s some force pulling the trait towards an intermediate optimum. That force might be an evolutionary constraint, stabilising selection (i.e. selection against extremes), niche conservatism (the tendency to keep ancestral traits), or selective regimes. Studies using Ornstein-Uhlenbeck models have been seen as evidence for each of these patterns.[37]

Of course, Ornstein-Uhlenbeck models aren’t perfect, and in fact have several well-known caveats.[38] For example, simpler models are sometimes better. It’s common for researchers to incorrectly choose Ornstein-Uhlenbeck instead of Brownian motion when using likelihood ratio tests to compare models, a problem especially present due to the small data sets that are often used in these analyses.[39] Then there is the issue of error in data sets (e.g. when your beak size data isn’t fully accurate). Even a very small amount of error can lead researchers to pick an Ornstein-Uhlenbeck model, simply because it’s better at accommodating variance among closely related species at the tips of a phylogenetic tree. This can suggest interesting biological processes where there are none.[40]

One particular mistake that users of Ornstein-Uhlenbeck models often make is to assume that their data fits the model due to clade-wide stabilising selection (e.g. selection for intermediate beak sizes, rather than extreme ones, across the family of birds). Yet the literature warns against exactly that: according to the papers describing the models, such simple explanations are unlikely.[41]

Unfortunately, these limitations are rarely taken into account in empirical studies.

[34] Cavalli-Sforza & Edwards 1967; Felsenstein 1973
[35] Hansen 1997; Butler & King 2004
[36] Cooper et al. 2016b
[37] Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013
[38] see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014
[39] the median number of taxa used for Ornstein-Uhlenbeck studies is 58; Cooper et al. 2016b
[40] Boettiger, Coop & Ralph 2012; Pennell et al. 2015
[41] e.g. Hansen 1997; Hansen & Orzack 2005

This is longer still than the previous version! At this point I’m convinced the original paragraph was artificially short. That is, it packed far more information than a text of its size normally should.

This is a common problem in science writing. Whenever you write something, there’s a tradeoff between brevity, clarity, amount of information, and complexity: you can only maximize three of them. Since science papers often deal with a lot of complex information, and have word limits, clarity often gets the short end of the stick.

Version 6 is a good example of sacrificing brevity to get more clarity. In this case it’s important to keep the amount of information constant, because I don’t want to change what the original authors were saying. It is possible that they were saying too many things. On the other hand, this is only one paragraph in a longer paper, so maybe it made sense to simply mention some ideas without developing them.

I tried a Version 7 in which I aimed for a shorter paragraph, on the scale of the original one, but I failed. To be able to keep all the information, I would have to sacrifice the extra explanations and the bird beak example, and we’d be back to square one. This suggests that the original paragraph and my rewritten version sit at different points on the tradeoff curve. The original is brief, information-rich, and complex; my version is information-rich, complex, and clear. To get brief and clear would require taking some information out, which I can’t do as a rewriter.

It is my opinion that sacrificing clarity is the worst option, at least in most contexts. We could then rephrase my project as an attempt to put clarity above all else. After all, brevity, information richness and complexity serve no purpose if the text fails to communicate what it is meant to.


The Journal of Actually Well-Written Science

Update: The project described below is actually happening! Head to jawws.org for more content and posts.


Once upon a time, I was a master’s student in evolutionary biology, on track towards a PhD and an academic research career.

One gloomy day (it was autumn and it was Sweden), a professor suggested that we organize a journal club, a weekly gathering to discuss a scientific paper, as an optional addition to regular coursework. I immediately thought, “Reading science papers sucks, so obviously I’m not going to do more of that just for fun.” But all my classmates enthusiastically signed up for it, so I caved in and joined too. And so, every week, I went to the journal club and tried to hide the fact that I had barely skimmed the assigned paper.

I am no longer on track towards a PhD and an academic research career.

There were, of course, many reasons to leave the field after my master’s degree, some better than others. “I hate reading science papers” doesn’t sound like a very serious reason — but if I’m honest with myself, it was a true motivation to quit.

And I think that generalizes far beyond my personal experience.

Science papers are boring. They’re boring even when they should be interesting. They’re awful at communicating their contents. They’re a chore to read. They’re work.

In a way, that’s expected — papers aren’t meant to be entertainment — but over time, I’ve grown convinced that the pervasiveness of bad writing is a major problem in science. It requires a lot of researchers’ precious time and energy. It keeps the public out, including people who disseminate knowledge, such as teachers and journalists, and those who take decisions about scientific matters, such as politicians and business leaders. It discourages wannabe scientists. In short, it makes science harder than it needs to be.

The quality of the writing is, of course, only one of countless problems with current academic publishing. Others include access,[1] peer review,[2] labor exploitation,[3] the failure to report negative results,[4] fraud, and so on. These issues are important, but they are not the focus of this essay. The focus here is to examine and suggest a solution to a question that sounds petty and unserious, but is actually a genuine problem: the fact that science papers are incredibly tiresome.

[1] Most papers are gated by journals and very expensive to get access to.
[2] A very bad system in which anonymous scientists must review your paper before it gets published, and may arbitrarily reject your work, especially if they are in competition with you, or ask you to perform more experiments.
[3] Scientists don’t get paid for writing papers, or for reviewing them, and journals take all the financial upside.
[4] Negative results are less exciting than positive results.


This post contains three main sections:

If you’re short on time, please read the third one, which includes the sketch of a plan to improve scientific style. The other two sections provide background and justification for the plan.

Additionally, I published an appendix in which I rewrite a paragraph multiple times as a demonstration.


What makes scientific papers difficult to read?

Three reasons: topic, content, and style.

Boring topics

Science today is hyperspecialized. To make a new contribution, you need to be hyperspecialized in some topic, and read hyperspecialized papers, and write hyperspecialized ones. It’s unavoidable — science is too big and complex to allow people to make sweeping general discoveries all the time.

As a result, any hyperspecialized paper in a field that isn’t your own isn’t going to be super interesting to you. Consider these headlines:[5]

[5] These are a few titles taken at random from the journal Nature, all published on 30 June 2021.

I could see myself maybe skimming the third one because I’ve been interested in covid vaccines to some superficial extent, but none of them strike me as fun reading. But if you work in superconductors, maybe the Wigner crystal one (whatever that is) sounds appealing to you.

One of the reasons I quit biology is that I eventually figured out that I wasn’t sufficiently interested in the field. Surely that also contributed to my lack of eagerness to read papers. But that isn’t the whole story. There were scientific questions I was genuinely curious about, and for which I should have been enthusiastic about reading the latest research. Yet that almost never happened.

Just as you’re sometimes attracted to a novel or movie because of its premise, only to be disappointed in the actual execution, there are papers that should be interesting due to their topic, but still fail due to their contents or style.

Tedious content

The primary goal of a scientific paper is to communicate science. Surprisingly, we tend to forget this, because, as I said, papers are also a measure of work output. But still, they’re supposed to contain useful information. A good science paper should answer a question and allow another scientist to understand and perhaps replicate the methods.

That means that, sometimes, there is stuff that must be there even though it’s not interesting. A paper might contain a lengthy description of an experimental setup or statistical methods which, no matter what you do, will probably never be particularly compelling.

Besides, it might be very technical and complicated. It’s possible to write complex material that is engaging, but that’s a harder bar to clear.

And then sometimes your results just aren’t that interesting. Maybe they disprove the cool hypothesis you wanted to prove. Maybe you merely found a weak statistical correlation. Maybe “more research is needed.” It’s important to publish results even if they’re negative or unimpressive, but of course that means your paper will have a hard time generating excitement.

So there’s not much we can do in general about content. All scientists try to do the most engaging and life-changing research they can, but only a few will succeed, and that’s okay. (And some scientists adopt a strategy of publishing wrong or misleading content in order to generate excitement, which, well, is a rather obvious bad idea.)

Awful style

Style is somehow both the least important and the most important part of writing.

It’s the least important because it rarely is the reason we read anything. Except for some entertainment,[6] we pick what to read based on the contents, whether we expect to learn new things or be emotionally moved. Good style makes it easier to get the stuff, but it’s just a vehicle for the content.

[6] And even then! There’s some intellectual pleasure to be gleaned from looking at the form of a poem, but it rarely is the top reason we like poetry and songs.

And yet style is incredibly important because without good style (or, as per the transportation analogy, without a functioning vehicle), a piece of writing will never get anywhere. You could have the most amazing topic with excellent content — if it’s badly written, if it’s a chore to read, then very few people will read it.

Scientific papers suck at style.

(Quick disclaimer: As we’re going to discuss below, this isn’t the fault of any individual scientist. It’s a question of culture and social norms.)

Anyone who’s ever read anything knows that long, dense paragraphs aren’t enjoyed by anyone. Yet scientific papers somehow consist of nothing but long and dense paragraphs.[7] Within the paragraphs, too many sentences are long and winding. The first person point of view is often eschewed in favor of some neutral-sounding (but not actually neutral, and very stiff) third person passive voice. The vocabulary tends to be full of jargon. The text is commonly sprinkled with an overabundance of AAAs,[8] even though they are rarely justified as a way to save space in this age where most papers are published digitally. Citations, which are of course a necessity, are inserted everywhere, impeding the flow of sentences.

[7] That’s not to say giant paragraphs are always bad; they serve a purpose, which is to make a coherent whole out of several ideas, and they can be written well. But often they aren’t written well, and sometimes they’re messy at the level of ideas. As a result, they often make reading harder, for no gain.
[8] Acronyms And Abbreviations, an acronym I just made up for illustrative purposes.

Here’s an example, selected at random from an old folder of PDFs from one of my master’s projects back in the day. Ironically, it discusses the fact that some methods in evolutionary biology are applied incorrectly because… it’s hard to extract the info from long, technical papers. [Footnote 9: Here’s the original paper, which, by a stroke of luck for me, is open-access and shared with a Creative Commons license.]

Don’t actually read it closely! This is just for illustration. Skim it and scroll down to the end to keep reading my essay.

Most models of trait evolution are based on the Brownian motion model (Cavalli-Sforza & Edwards 1967; Felsenstein 1973). The Ornstein–Uhlenbeck (OU) model can be thought of as a modification of the Brownian model with an additional parameter that measures the strength of return towards a theoretical optimum shared across a clade or subset of species (Hansen 1997; Butler & King 2004). OU models have become increasingly popular as they tend to fit the data better than Brownian motion models, and have attractive biological interpretations (Cooper et al. 2016b). For example, fit to an OU model has been seen as evidence of evolutionary constraints, stabilising selection, niche conservatism and selective regimes (Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013). However, the OU model has several well-known caveats (see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014). For example, it is frequently incorrectly favoured over simpler models when using likelihood ratio tests, particularly for small data sets that are commonly used in these analyses (the median number of taxa used for OU studies is 58; Cooper et al. 2016b). Additionally, very small amounts of error in data sets can result in an OU model being favoured over Brownian motion simply because OU can accommodate more variance towards the tips of the phylogeny, rather than due to any interesting biological process (Boettiger, Coop & Ralph 2012; Pennell et al. 2015). Finally, the literature describing the OU model is clear that a simple explanation of clade-wide stabilising selection is unlikely to account for data fitting an OU model (e.g. Hansen 1997; Hansen & Orzack 2005), but users of the model often state that this is the case. Unfortunately, these limitations are rarely taken into account in empirical studies.

This paragraph is not good writing by any stretch of the imagination.

First, it’s a giant paragraph. [Footnote 10: Remarkably, it is the sole paragraph in a subsection titled “Ornstein-Uhlenbeck (Single Stationary Peak) Models of Traits Evolution,” which means that the paragraph’s property of saying “hey, these ideas go together” isn’t even used; the title would suffice.] It contains two related but distinct ideas, which are that (1) the Ornstein–Uhlenbeck model can be useful, and that (2) it has caveats. Why not split it? Speaking of which, the repetition of the “OU” acronym is jarring. It doesn’t even seem to serve a purpose other than to shorten the text a little bit. It’d be better to spell “Ornstein-Uhlenbeck” out each time, and try to avoid repeating it so much.

The paragraph also contains inline citations to an absurd degree. Yes, I’m sure they’re all relevant, and you do need to show your sources, but this is incredibly distracting. Did you notice the following sentence when reading or skimming?

However, the OU model has several well-known caveats.

It’s a key sentence to understand the structure of the paragraph, indicating a transition from idea (1) to idea (2), but it is inelegantly sandwiched between two long enumerations of references:

(Wiens et al. 2010; Beaulieu et al. 2012; Christin et al. 2013; Mahler et al. 2013). However, the OU model has several well-known caveats (see Ives & Garland 2010; Boettiger, Coop & Ralph 2012; Hansen & Bartoszek 2012; Ho & Ané 2013, 2014).

Any normal human will just gloss over these lines and fail to grasp the structure of the paragraph. Not ideal.

[Footnote 11: The ideal format for citations in scientific writing is actually a matter of some debate, and depends to some extent on personal preference. As a friend said: “The numbered citation style (like in Science or Nature) is really nice because it doesn’t interrupt paragraphs, especially when there are a lot of citations. But many people also like to see which paper/work you are referencing without flipping to the end of the article to the references section.”

I admit I am biased towards prioritizing reading flow, but it’s true that having to match numbers to references at the end of a paper can be tedious. In print and PDFs, I’d be in favor of true footnotes (as opposed to endnotes), so that you don’t have to turn a page to read them. In digital formats, I’d go with collapsible footnotes (like the one you’re reading right now if you’re on my blog). Notes in the margin can also work, either in print or online. Alexey Guzey’s blog is a good example.]

And if mentioning a reference is useful to understand the text, the writer should simply spell it out directly in the sentence.

Finally, there is quite a bit of specialized vocabulary that will make no sense to most readers, such as “niche conservatism” or “clade-wide stabilising selection.” That may be fine, depending on the intended audience; knowing what is or isn’t obvious to your audience is a difficult problem. I tend to err on the side of not including a term if a general lay audience wouldn’t understand it, but that’s debatable and dependent on the circumstances.

Now, I don’t mean to pick on this example or its authors in particular. In fact, it isn’t even a particularly egregious example. [Footnote 12: Interestingly, the more I examined the paragraph in depth, the less I thought it was bad writing. This is because, I think, becoming familiar with something makes us see it in a more favorable light. In fact this is why authors are often blind to the flaws in their own writing. But by definition a paper is written for people who aren’t familiar with it.] Many papers are worse! But as we saw, it’s far from being a breeze to read. Bad, boring style is so widespread that even “good” papers aren’t much fun.

Yet science can definitely be fun. Some Scott Alexander blog posts manage to make me read thousands of (rigorous!) words about psychiatric drugs, thanks to his use of microhumor. And then, of course, there’s an entire genre devoted to “translating” scientific papers into pleasant prose: popular science. Science popularizers follow different incentives than scientists: their goal is to attract clicks, so they have to write in a compelling way. They take tedious papers as input, and then produce fun stories as output.

There is no fundamental reason why scientists couldn’t write directly in the style of science popularizers. I’m not saying they should copy that exactly — there are problems with popular science too, like sensationalism and inaccuracies — but scientists could at least aim at making their scientific results accessible and enjoyable to interested and educated laypeople, or to undergraduate students in their discipline. I don’t think we absolutely need a layer of people who interpret the work of scientists for the rest of us, in a way akin to the Ted Chiang story about the future of human science.

Topic and content are hard to solve as a general problem. But I think we can improve style. We can create better norms. I have a crazy idea to do that, which we’ll get into at the end of the post, but first, we need to discuss the reasons behind the dismal state of current scientific style.

Why is scientific style so bad?

There are many reasons why science papers suck at style. One is that the people writing them, scientists, aren’t selected for their writing ability. They have a lot on their plate already, from designing experiments to performing them to applying for funding to teaching classes. Writing is an integral part of the process of science, but it’s only a part, unlike in fields such as journalism or literature, where it is the whole job.

Another problem is language proficiency. Almost all science (at least in the more technical fields) today is published in English, and since native English speakers are a small minority of the world’s population, it follows that most papers are written by people who have only partial mastery over the language. You can’t exactly expect stellar style from a French or Russian or Chinese scientist who is forced to publish their work in a language that isn’t their own.

Both these reasons are totally valid! There’s no point blaming scientists for not being good writers. It’d be great if all scientists suddenly became masters of English prose, but we all know that’s not going to happen.

The third and most important reason for bad style is social norms.

Imagine being a science grad student, and having to write your first Real Science Paper that will be submitted to a Legit Journal. You’ve written science stuff before, for classes, for your undergrad thesis maybe, but this is the real deal. You really want it to be published. So you try to understand what exactly makes a science paper publishable. Fortunately, you’ve read tons of papers, so you have absorbed a lot of the style. You set out to write it… and reproduce the same crappy style as all the science papers before you.

Or maybe you don’t, and you try to write in an original, lively manner… until your thesis supervisor reads your draft and tells you you must rewrite it all in the passive voice and adopt a more formal style and avoid the verb “to sparkle” because it is “non-scientific.” [Footnote 13: The “sparkle” example happened to a friend of mine recently.]

Or maybe you have permissive supervisors, so you submit your paper written in an unconventional style… and the journal’s editors reject it. Or they shrug and send it to peer review, from which it comes back with lots of comments by Reviewer 2 telling you your work is interesting but the paper must be completely rewritten in the proper style.

Who decides what style is proper? No one, and everyone. Social norms self-perpetuate as people copy other people. For this reason, they are extremely difficult to change.

As a scientist friend, Erik Hoel, told me on Twitter:

There is definitely a training period where grad students are learning to write papers (basically a “literary” art like learning how to write short stories) wherein you are constantly being told that things need to be rephrased to be more scientific

And of course there is. Newbie scientists have to learn the norms and conventions of their field. Not doing so would be costly for their careers.

The problem isn’t that norms exist. The problem is that the current norms are bad. In developing its own culture, with its traditions and rituals and “ways we do things,” science managed to get stuck with this horrible style that everyone is somehow convinced is the only way you can write and publish science papers, forever.

It wasn’t always like this. If you go back and look at science papers from the 19th century, for instance, you’ll find a rather different style, and, dare I say, a more pleasant one.

I know this thanks to a workshop I went to in undergrad biology, almost a decade ago. Prof. Linda Cooper of McGill University (now retired, as I found out when trying to contact her during the writing of this post) showed us a recent physics paper, and a paper written in 1859 by Carlo Matteucci about neurophysiology experiments in frogs, titled Note on some new experiments in electro-physiology. [Footnote 14: At least I think this is it; my memory of the workshop is very dim. Dr. David Green, local frog expert, helped me find this paper, and it fits all the details I can remember.] You might expect very old papers to be difficult to parse, but no! It’s crystal clear and in fact rather delightful. Here’s a screenshot of the introduction:

It isn’t quite clickbait, but there’s an elegant quality to it. First, it’s told in first person. Second, there’s very little jargon. Third, we quickly get to the point; there’s no lengthy introduction that only serves as proof that you know your stuff. Fourth, there are no citations. Okay, again, we do want citations, but at least we see here that avoiding them can help the writing flow better. (No citations also means that you can’t leave something unexplained by directing the reader to some reference they would prefer not to read. Cite to give credit, but not as a way to avoid writing a clear explanation.)

By contrast, the contemporary physics paper shown at the workshop was basically non-human-readable. I can’t remember what it was, which is probably a good thing for all parties involved.

In the past 150 years, science has undoubtedly progressed in a thousand ways; yet in the quality of the writing, we are hardly better than the scientists of old.

I want to be somewhat charitable, though, so let’s point out that some things are currently done well. For example, I think the basic IMRaD structure (introduction, methods, results, and discussion) is sound. [Footnote 15: Although one could argue that IMRaD is perhaps too often followed without thought, like a recipe.] The systematic use of abstracts, and the growing tendency to split them into multiple paragraphs, is an excellent development.

There’s been a little bit of progress — but we should be embarrassed that we haven’t improved more.

What happened? It’s hard to say. Some plausible hypotheses, all of which might be true:

  • In the absence of a clear incentive to maximize the number of readers, good style doesn’t develop. The dry and boring style that currently dominates is simply the default.
  • Everyone has their own idea of what good scientific writing should be, and we’ve naturally converged onto a safe middle ground that no one particularly loves, but that people don’t hate enough to change.
  • The current style is favored because it is seen as a mark of positive qualities in science such as objectivity, rigor, or detachment.
  • The style serves as an in-group signal for serious scientists to recognize other serious scientists. Put differently, it is a form of elitism. This might mean that for the people in the in-group, poor style is a feature, not a bug. [Footnote 16: Just like unpleasant bureaucracy acts as a filter so that only the most motivated people manage to pass through the system.]
  • Science is too globalized and anglicized. There is only one scientific culture, so if it gets stuck on poor norms, there isn’t an alternative culture that can come to the rescue by doing its own thing and stumbling upon better norms.

It’s possible that these forces are too powerful for anyone to successfully change the current norms. Maybe most scientists would think I’m a fool for wanting to improve them. But it does seem to me that we should at least try.

How can we forge better norms?

First, I want to emphasize that the primary goal of scientific writing is communication among researchers, not between researchers and the public. Facilitating this communication, and lowering the barriers to entry into hyperspecialized fields, are the things I want to optimize for. [Footnote 17: For students, and for scientists in adjacent fields.]

However, I do think there are benefits to making science more accessible to non-specialists — scientists in very different fields, academics outside science, journalists, teachers, politicians, etc. — without having to rely on the layer of popular science. So while we won’t optimize for this directly, it’s worth improving it along the way if we can.

With that in mind, how can we improve the social norms for style across all of scientific writing?

Here’s one recipe for failure. Come up with a new style guide, and share it with grad students and professors. Publish op-eds and give conferences on your new approach. Teach writing classes. In short, try to convince individual scientists. Then watch as they just write in the old style because it’s all they know and there’s no point in making it harder for themselves to publish their papers and get recognition.

Science is an insanely competitive field. Most scientists, especially grad students, postdocs and junior professors, are caught in a rat race. They will not want to reduce their chances of publication, even if they privately agree that scientific style should be improved.

(Not to mention, many have been reading and writing in that style for so long that they don’t even see it as problematic anymore.)

By definition, social norms are borderline impossible to change if you’re subject to them. That means that the impulse to change must come from someone who’s not subject to them. Either an extremely well established person, i.e. somebody famous enough to get away with norm-defying behavior, or an outsider — i.e. somebody who just doesn’t care.

Well, I don’t have a Nobel Prize, but I gave up on science years ago and I have zero attachment to current scientific norms, so I think I qualify as an outsider.

But what can an outsider do, if you can’t convince scientists to change? The answer is: do the work for them. Create something new, better, that scientists have an incentive to copy.

Here’s a sketch of how that could be done. Mind you, it’s very much at the stage of “crazy idea”; I don’t know if it would work. But I think there’s at least a plausible path.

The Plan

1. Found a new journal

Let’s call it the Journal of Actually Well-Written Science. I’ll make an exception to my anti-abbreviation stance and call it JAWWS because I just realized it’s a pretty cool and memorable one.

The journal would have precise writing guidelines. Those guidelines are the new norms we’ll try to get established. They would be dependent on personal taste to some extent, but I think it’s possible to come up with a set of guidelines that make sense.

Here’s some of what I have in mind:

  • If it’s a choice between clarity and brevity, prioritize clarity.
  • Split long paragraphs into shorter ones.
  • Use examples. Avoid expressing abstract ideas without supporting them with concrete examples.
  • Whenever possible, place the example before the abstract idea to draw the reader in.
  • Avoid abbreviations and acronyms unless they’re already well-known (e.g. DNA). If you must use or create one, make sure it’s effortless for the reader to remember what it means.
  • Allow as little space as possible for references while still citing appropriately. Of course, it’s fine to write a reference in full if you want to draw attention to it. Also, don’t use a citation as a way to avoid explaining something.
  • Write in the first person, even in the introduction and discussion. Your paper is being written by you, a human being, not by the incorporeal spirit of science.
  • Don’t hesitate to use microhumor; it is often the difference between competent and great writing. My mention of the incorporeal spirit of science is an example of that.
  • Avoid systematic use of the passive voice.
  • Avoid ornamental writing for its own sake. Occasionally, a good metaphor can clarify a thought, but be mindful that it’s easy to overuse them.
  • Remember that the primary goal of your paper is to communicate methods or results. Always keep the reader in mind. And make that imaginary reader an educated nonspecialist, i.e. you whenever you read papers not directly relevant to your field.

In the appendix, I show a multistep application of this to the paragraph I quoted above as an example.

Again, we’re not trying to reinvent popular science writing. We will borrow techniques and ideas from it, and try to emulate it insofar as it’s good at communicating its content. But the end goal is very different — JAWWS is intended not to entertain, but to publish full, rigorous methods and results that can be cited by researchers. I want it to be a new kind of scientific journal, but a scientific journal nonetheless.

2. Hire great writers

JAWWS will eventually accept direct submissions by researchers. But as a new journal, it will have approximately zero credibility at first. So we will start by republishing existing papers that have gone through a process of rewriting by highly competent science communicators.

Finding those communicators might be the hardest part. We need people who can understand scientific papers in their current dreadful state, but who haven’t already accepted the current style as inevitable. And we need them to be excellent at their job. If we rewrite a paper into something that’s no better than the original — or, worse, if we introduce mistakes — then the whole project falls apart.

On the other hand, tons of people want to be writers in general and science writers in particular, so there is some hope.

3. Pick papers to rewrite

It’s unclear how many science papers are published each year, but a reasonable estimate is “quite a lot.” I saw the 2,000,000 per year figure somewhere; I have no idea if it’s accurate, but even if it’s off by an order of magnitude or two, that’s still a lot.

How should JAWWS select the papers it rewrites?

I’m guessing that one criterion will be copyright status. I’m no intellectual property specialist, so I have no idea if it’s legal to rewrite an entire article that’s protected by copyright. Fortunately, there are many papers that are released with licenses allowing people to adapt them, so I suggest we start with those. Another avenue is to rewrite papers by scientists who like this project and grant us permission to use their work.

Then there are open questions. Should JAWWS focus on a particular field at first? Should it rewrite top papers? Neglected papers? Particularly difficult papers? Randomly selected papers? Should it focus more on literature reviews, experimental studies, meta-analyses, or methods papers? Should it accept applications by scientists who’d like our help? We can settle these questions in due time.

Crucially, the authors of a JAWWS rewritten paper will be the same as those of the paper it is based on. When people cite it, they’ll give credit to the original authors, not the rewriter, whose name should be mentioned separately. This also means that the original authors should approve the rewritten paper, since it’ll be published under their names. [Footnote 18: My friend Caroline Nguyen makes an important point: the process must involve very little extra work for scientists who are already burdened with many tasks. Their approval could therefore be optional, i.e. they can veto, but by default we assume that they approve.] It might also be possible to involve a writer earlier in the research process, so that they are in close contact with a team of scientists and are able to publish a JAWWS paper at the same time as the scientists publish a traditional one. In all cases, we can expect the first participating researchers to be the ones who agree with the aims of our project and trust that JAWWS is a good initiative.

4. Build prestige over time

If the rewritten papers are done well, then they’ll be pleasant to read. If they’re pleasant to read, more people will read them. If more people read them, then they’re likely to get cited more. If they get cited more, then they will have more impact. If JAWWS publishes a lot of high-impact papers, then JAWWS will become prestigious.

There’s no point in aiming low: we should try making JAWWS as prestigious as, if not more prestigious than, top journals like Nature, Science, or Cell. [Footnote 19: Is this a good goal? Wouldn’t it be better to just try to build something different? Well, I see this project kind of like Tesla for cars: Tesla isn’t trying to replace cars with something else, it’s just trying to make cars much better. So I would like JAWWS to be taken as seriously as the prestigious journals, while being an improvement over them. The danger in building a new thing is that you just create your little island of people who care about style while the rest of science is still busy competing for a paper in prestigious journals. That wouldn’t be a good outcome.]

Of course, that won’t happen overnight. But I don’t see why it wouldn’t be an achievable goal. And even if we don’t quite get there, the “aim for the moon, if you fail you’ll fall among the stars” principle comes into play. JAWWS can have a positive influence even if it doesn’t become a top journal.

Along the way, JAWWS will become able to accept direct submissions and publish original science papers. It might also split into several specialized journals. At this point we’ll be a major publishing business!

5. Profit!

I don’t know a lot about the business side of academic publishing, but my understanding is that there are two main models:

  • Paywall: researchers/institutions pay to access the contents of the journal.
  • Open-access: researchers/institutions pay to publish content that is then made accessible to everyone.

For JAWWS, a paywall model might make sense, since the potential audience would be larger than just scientists. But it would run contrary to the ideal of making science accessible to as many people as possible. Open-access seems more promising, and it feels appropriate to ask for a publication fee as compensation for the work needed to rewrite a paper. But that might be hard to set up in the beginning when we haven’t proven ourselves yet.

Maybe some sort of freemium model is conceivable, e.g. make papers accessible on a website but provide PDFs and other options to subscribers only.

Another route would be to set up JAWWS as a non-profit organization. An example of a journal that is also a non-profit is eLife. This might help with gaining respectability within some circles, but my general feeling is that profitability is better for the long-term survival of the project.

6. Improve science permanently

No, “profit” is not the last step in the plan. Making money is great, but we can and should think bigger. The end goal of this project is to improve science writing norms forever.

If JAWWS becomes a reasonably established journal, then other publications might copy its style. That would be very good and highly encouraged. But more importantly, it would show that it’s possible to change the norms for the better. Other journals will feel more free to experiment with different formats. Scientists will gain freedom in the way they share their work. Maybe we can even get rid of other problems like the ones associated with peer review while we’re at it.

One dark-side outcome I can imagine is that the norms are simply destroyed, we lose the coherence that science currently has, and then it becomes harder to find reliable information. To which I respond… that I’m not sure that it would be worse than the present situation. But anyway, it seems unlikely to happen. There will always be norms. There will always be prestigious people and publications that you can copy to make sure you write in the most prestigious style. We are a very mimetic bunch, after all.

And if we succeed… then science becomes fun again.

Fewer young researchers will drop out (like I did). Random curious people will read science directly instead of sensationalist popularizers. It’ll be easier for the public (who pays for most of science, after all) to keep informed about the latest research. Maybe it’ll even encourage more kids to get into the field. If everything goes well, we’ll get one step closer to a new golden age of humanity.

Okay, maybe I’m getting ahead of myself. But then again, like I said, there’s no point in aiming low.


To repeat, this is still a crazy idea. It did get less crazy after I finished writing the above plan, though. I have a feeling it might really work.

But it’s very possible I’m wrong. Maybe there are some major problems I haven’t foreseen. Maybe the entire scientific establishment will hate me for trying to change their norms. Maybe it’s just too ambitious a project, and it will fail if somebody doesn’t devote themselves to it. I don’t know if I should devote myself to it.

So, I’d really love for this post to be shared widely and for readers (whether professional scientists, writers, students, science communicators, or really anyone who’s interested in science somehow) to let me know what they think. Like science as a whole, this should be a collaborative effort.

 


Thanks to Khalis Afnan, Dan Stern, Caroline Nguyen, Mahwash Jamy, Daniel Golliher, and Ulkar Aghayeva for feedback on this piece.