pub/release-111/transcript_families • TranscriptDB Data Repository

CONTENT

The content is grouped by gene family.

Example:


├── ENSGT00950000186351
│   ├── ENSGT00950000186351.tar.gz
│   └── TDBTT00950000186351|.1
│       ├── dendogram.txt
│       ├── homologies.json
│       ├── json_format.json
│       ├── newick.txt
│       ├── reconciliation.txt
│       └── subtrees_description.json
├── ENSGT00950000186393
│   ├── ENSGT00950000186393.tar.gz
│   └── TDBTT00950000186393|.1
│       ├── dendogram.txt
│       ├── homologies.json
│       ├── json_format.json
│       ├── newick.txt
│       ├── reconciliation.txt
│       └── subtrees_description.json
├── etc.

Each gene family (gene tree) is represented by a folder that contains the computed transcript families. Each transcript family is in turn a folder that contains exactly six files, described below. The compressed file (.tar.gz) in each gene family folder holds the content of that gene family, so users who need all the data for its transcript families can download it in a single step.
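As a quick sanity check after downloading, the layout described above can be verified with a short script. This is a minimal sketch in Python, assuming a gene family folder has already been downloaded or extracted locally; the function name `missing_files` is hypothetical and not part of the repository, and the file names are copied from the listing above.

```python
from pathlib import Path

# File names copied from the directory listing in this README.
EXPECTED_FILES = {
    "dendogram.txt",
    "homologies.json",
    "json_format.json",
    "newick.txt",
    "reconciliation.txt",
    "subtrees_description.json",
}


def missing_files(gene_family_dir):
    """Map each transcript family folder to the expected files it lacks.

    Hypothetical helper: only sub-folders are inspected, so the .tar.gz
    archive sitting next to them is ignored.
    """
    report = {}
    for entry in sorted(Path(gene_family_dir).iterdir()):
        if entry.is_dir():
            present = {p.name for p in entry.iterdir() if p.is_file()}
            report[entry.name] = sorted(EXPECTED_FILES - present)
    return report
```

An empty list for a transcript family folder means all six files are present.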

dendogram.txt

The dendrogram used to merge the ortholog trees, in Newick format.

homologies.json

A JSON file where the keys are:

-id_transcripts: Maps each unique ID to the label of the corresponding transcript.

-recent_paralogs: For each transcript ID, the recent paralogs, listed in a single string separated by '&'.

-ortho_orthologs: For each transcript ID, the ortho-orthologs, listed in a single string separated by '&'.

-para_orthologs: For each transcript ID, the para-orthologs, listed in a single string separated by '&'.

-ancient_paralogs: For each transcript ID, the ancient paralogs, listed in a single string separated by '&'.

Example:


{
    "id_transcripts": {
        "0": "t1",
        "1": "t2"
    },
    "recent_paralogs": {
        "0": "NULL",
        "1": "NULL"
    },
    "ortho_orthologs": {
        "0": "t1",
        "1": "t2"
    },
    "para_orthologs": {
        "0": "NULL",
        "1": "NULL"
    },
    "ancient_paralogs": {
        "0": "NULL",
        "1": "NULL"
    }
}
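To show how the '&'-separated strings can be turned into lists, here is a minimal Python sketch. The function name `parse_homologies` is hypothetical, and treating the value "NULL" as "no homologs of this type" is an assumption based only on the example above.

```python
import json


def parse_homologies(text):
    """Return {transcript label: {relation: [labels]}} from homologies.json text."""
    data = json.loads(text)
    labels = data["id_transcripts"]
    # Every key other than id_transcripts is treated as a homology relation.
    relations = [key for key in data if key != "id_transcripts"]
    result = {}
    for uid, label in labels.items():
        result[label] = {}
        for relation in relations:
            raw = data[relation].get(uid, "NULL")
            # Assumption: "NULL" marks an empty relation.
            result[label][relation] = [] if raw == "NULL" else raw.split("&")
    return result
```

With the example file above, the entry for "t1" would hold a one-element list for ortho_orthologs and empty lists for the other three relations.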

json_format.json

A JSON file that describes the phylogeny of the transcript family as a list of nodes. The keys for each node are:

label: Preorder traversal number.
evoltype: Speciation, duplication, or creation.
node: Internal or leaf.

Example:


[
    {
        "label": 0,
        "evoltype": "duplication",
        "node": "internal"
    },
    {
        "label": 1,
        "evoltype": "speciation",
        "node": "leaf"
    },
    {
        "label": 2,
        "evoltype": "speciation",
        "node": "leaf"
    }
]
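A node list of this shape is straightforward to summarize. The sketch below, in Python, counts the evolutionary events on internal nodes and the number of leaves; the function name `summarize_nodes` is hypothetical, and the code assumes the file is a flat list of objects carrying exactly the keys described above.

```python
import json
from collections import Counter


def summarize_nodes(text):
    """Return (event counts on internal nodes, number of leaves)."""
    nodes = json.loads(text)
    # Only internal nodes are counted as evolutionary events here.
    events = Counter(n["evoltype"] for n in nodes if n["node"] == "internal")
    leaves = sum(1 for n in nodes if n["node"] == "leaf")
    return events, leaves
```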

newick.txt

The transcript tree in Newick format.

reconciliation.txt

A string that represents the reconciliation with the gene tree (similar to the json_format.json file). Here, each label is the node's postorder traversal number. Example:


4:duplication&2:duplication&0:leaf&1:leaf&3:leaf
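The string can be split into (label, event) pairs with a few lines of Python. This is a minimal sketch, assuming every entry has the form label:event and entries are joined by '&' as in the example; the function name `parse_reconciliation` is hypothetical.

```python
def parse_reconciliation(line):
    """Split a reconciliation string into (label, event) pairs, in file order."""
    pairs = []
    for item in line.strip().split("&"):
        label, event = item.split(":", 1)
        pairs.append((int(label), event))
    return pairs
```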

subtrees_description.json

A JSON file that describes each cluster of orthologous transcripts. The keys are:

newick: The ortholog tree in Newick format.
id_transcripts: The transcripts present in the cluster, listed in a single string separated by '&'.

Example:


{
    "C1": {
        "newick": "(C47D2.2.1,F49E8.4.1);",
        "id_transcripts": "C47D2.2.1&F49E8.4.1"
    }
}
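To list the members of each cluster, the id_transcripts strings can be split on '&'. The following Python sketch assumes the top-level keys (such as "C1") are cluster names; the function name `cluster_members` is hypothetical.

```python
import json


def cluster_members(text):
    """Return {cluster name: [transcript labels]} from subtrees_description.json text."""
    clusters = json.loads(text)
    return {
        name: info["id_transcripts"].split("&")
        for name, info in clusters.items()
    }
```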