Near telomere to telomere genome assembly of Chinese yellow rabbit (Oryctolagus cuniculus) - Scientific Data

In this study, we integrated PacBio HiFi, ONT ultra-long reads, and Hi-C sequencing data to generate a near-T2T assembly of the O. cuniculus genome, which includes the first complete assembly of its Y chromosome. This assembly achieved nearly full T2T genome with only 30 unplaced scaffolds remaining, surpassing all previously published O. cuniculus genomes. This study not only enhances our comprehension of rabbit biology but also serves as a valuable resource for future studies aimed at conservation and genetic improvement, ultimately benefiting both the species and the agricultural practices associated with it.

All animal experiments were performed in strict compliance with the guidelines established by the Committee on Animal Research and Ethics of Fujian University of Traditional Chinese Medicine. (approval number: FJTCM IACUC 2024330).

Sample collection and sequencing

A four-month-old male O. cuniculus-FJY was obtained from the National Fujian Yellow Rabbit Conservation Farm (Fuzhou, Fujian, China). Genomic DNA and transcriptome sequencing libraries were prepared from blood samples. Sequencing data were generated using multiple platforms, including PacBio high-fidelity (HiFi) reads, Oxford Nanopore (ONT) ultra-long reads, paired-end reads, and Hi-C reads (Table 1). For PacBio HiFi sequencing, a standard SMRTbell library was constructed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) following the manufacturer's protocol. Sequencing was performed on the PacBio Revio platform in circular consensus sequencing (CCS) mode. Two SMRT cells were sequenced, yielding approximately 11 million high-quality CCS reads (184.14 Gb, ~64 × coverage). The average CCS read length was 16.96 Kb, with an N50 of 16.82 Kb. For ONT sequencing, a PromethION library was prepared following the Oxford Nanopore SQK-LSK109 kit protocol (Oxford Nanopore Technologies). Four flow cells were sequenced, and raw data were processed using the Dorado server (v7.2.13) (https://github.com/nanoporetech/dorado/) with adapter trimming and quality filtering (parameter: '-min-qscore 7'). This generated 568,258 high-quality ONT reads (56.90 Gb), with the longest and average read lengths being 813.30 Kb and 100.13 Kb, respectively (Fig. 1). The Hi-C library was constructed from cross-linked genomic DNA and sequenced on the DNBSEQ-T7 platform (MGI) using 2 × 150 bp paired-end reads, producing 166.79 Gb of clean data (~58 × coverage) with fastp (v0.26.0; parameter: 'fastp -Q -L-adapter_sequence AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA-adapter_sequence_r2 AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG-out1 [output read1]-out2 [output read2]-in1 [input read1]-in2 [input read2]'). Additionally, next-generation DNA sequencing and RNA-seq were performed on the DNBSEQ-T7 platform, yielding 216.05 Gb and 10.99 Gb of clean data, respectively (Table 1).

Genome assembly and evaluation

Initial chromosome-level genome assembly

Initial assembly of contig sequences was performed with Hifiasm (v0.19.6; parameters: '-h1 hic.R1.fq.gz-h2 hic.R2.fq.gz-ul ont.fa.gz HiFi-reads.fa.gz'), incorporating HiFi reads, ONT reads, and clean Hi-C data. To eliminate contaminated sequences and mitochondrial sequences, the initially assembled genome was aligned against the NT database (https://ftp.ncbi.nlm.nih.gov/blast/db/) using BLASTN (v2.11.0 + ; parameters: '-evalue 0.00001, -num_alignments 5, and -max_hsps 1'). Contigs with more than 50% of their sequences identified as contamination or originating from organelles were completely removed from the assembly. For chromosomal scaffolding, low-quality Hi-C reads were removed using HiCUP (v0.7.2; parameter: '-NM 3'), followed by alignment of the filtered Hi-C data to the contig assembly via BWA (v0.7.12; default parameters). Subsequently, valid Hi-C reads were analyzed using Juicer (v1.6; default parameters) and 3D-DNA (parameter: '-r 0') to scaffold contigs into chromosomes. Contig ordering and assembly accuracy were verified and manually corrected in JuiceBox (v1.11.08; default parameters). Following refinement and genome adjustment, the final assembly was anchored to 23 chromosomes (comprising 21 autosomes, X, and Y) (Table 2).

T2T genome assembly

To achieve a T2T assembly, telomeric regions were extended by aligning ONT ultra-long reads to the scaffolded genome using Minimap2 (v2.24; default parameters), followed by refinement with medaka consensus (v1.7.2; https://github.com/nanoporetech/medaka) and BLASTN (v2.11.0+; default parameters). Contig gaps were resolved using TGS-GapCloser (v1.2.0; parameters: '-min_nread 10'), leveraging ONT ultra-long reads and coverage-based relationships among contigs. The extended and gap-filled assembly was subsequently polished with 216.05 Gb of clean short-read data (~75 × coverage) using Pilon (v1.23; default parameters). The final near T2T assembly spanned 23 chromosomes, totaling 2.88 Gb (30 unmapped contigs remained, with a mounting rate of 99.80%), with a scaffold N50 of 148.90 Mb and a GC content of 44.0%. Notably, this represents the first assembled rabbit Y chromosome, measuring 34.43 Mb in length (Fig. 2a-c, Tables 2, 3).

Telomere identification

Telomeric regions were identified by leveraging the conserved sequence signature of telomeres (CCCTAA/TTAGGG). A genome-wide scanning was performed to detect characteristic sequences that contain at least four repeat units and are situated within 50 Kb of either end of each chromosome using the quarTeT (v1.1.36; parameter: '-m 1') software. Telomeres were annotated across 23 chromosomes, with each chromosome showing detectable telomeric sequences (Fig. 2c, Table 3).

Repeat sequence annotation

The repetitive sequences in the O. cuniculus-FJY genome were annotated using the following three methods: (1) Based on the RepBase database (https://www.girinst.org/repbase/), homology prediction was conducted using RepeatMasker (v4.0.9; default parameters) (http://www.repeatmasker.org) to identify DNA-level repetitive sequence features, resulting in 1.23 Gb of repetitive sequences, accounting for approximately 42.90% of the genome (Fig. 2d); (2) Using the RepeatProteinMask tool integrated in RepeatMasker (v4.0.9; default parameters), homology prediction was performed to identify protein-level repetitive sequence features, yielding 210.71 Mb of repetitive sequences, which represent about 7.32% of the genome; (3) A self-built repetitive sequence feature library was constructed using RepeatModeler (v1.0.11; default parameters) and LTR-FINDER_parallel (v1.0.7; default parameters), and de novo prediction was carried out with RepeatMasker, resulting in 924.13 Mb of repetitive sequences, accounting for approximately 32.11% of the genome. Finally, a redundancy-removed statistical calculation was performed and a total of approximately 1.36 Gb of repetitive sequences were obtained, representing 47.09% of the genome. The repetitive sequences were categorized by type, with the highest proportions being short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), long terminal repeats (LTRs), and DNA transposons (Table 4).

Protein coding gene annotation

Gene structure annotation

Gene structure annotation was performed using a combination of methods, including: (1) De novo prediction: We conducted de novo gene prediction using the Augustus (v3.3.2) and Genscan software with default parameters, resulting in the identification of 31,072 and 52,508 genes, respectively. (2) Homolog prediction: We downloaded homologous protein sequences from related rabbit species, including Ochotona princeps-mOchPri1 (GenBank: GCF_030435755.1), O. princeps-OchPri4 (GenBank: GCF_014633375.1), O. cuniculus-OryCun2 (GenBank: GCF_000003625.3), and O. cuniculus-NZW (GenBank: GCF_009806435). Gene structure annotation was performed based on homologous species proteins using the software miniprot (v0.11; parameters: '-gff-only -O 11 -E 1 -F 23 -C 1 -B 5 -G 200000 -j 1') and Liftoff (v1.6.3; parameters: '-a 0.5 -s 0.5'). (3) RNA-seq data-based prediction: Using Stringtie (v1.3.5; default parameters) and TransDecoder (v5.5.0; default parameters), we predicted a total of 11,285 coding genes based on RNA-seq data. (4) Protein library comparison prediction: We obtained 14,108 genes through protein library comparison prediction using the Benchmarking Universal Single-Copy Orthologs (BUSCO; v5.5.0) software. Finally, a non-redundant gene set was generated by integrating prior evidence with MAKER (v2.31.10; default parameters), followed by refinement using HiFAP (v2.4.1; Wuhan OneMore Tech Co., Ltd., https://www.onemore-tech.com/) to obtain a final high-quality set of 22,674 genes. The average gene length was 42,977 bp, with an average of approximately nine exons per gene (Table 5).

Gene function annotation

In this study, gene function annotation was primarily conducted using two major approaches: sequence similarity-based and domain/motif similarity-based methods. (1) Sequence similarity-based annotation: Diamond (v2.0.14; parameters: '-evalue 1e-05') was used to align annotated protein sequences against several databases, including NR (version 2023-04-01), SwissProt (version 2023-03-01), TrEMBL (version 2023-03-01), AnimalTFDB (version 4.0), KOG (version 2023-03-01), and KEGG (version 2023-01-01). Additionally, KOBAS (v3.0; parameters: '-t blastout:tab -s ko') was used to link annotated KEGG information with KEGG ortholog and pathway information. (2) Domain/Motif similarity-based annotation: i) We employed InterProScan (v5.61-93.0; parameters: '-seqtype p-formats TSV-goterms-pathways -dp') to compare against a series of sub-databases, obtaining information on conserved sequences, motifs, and domains of proteins. ii) We used HMMER3 (v3.3.1; parameters: 'hmmsearch -E 1e-05-domE 1e-05') to annotate conserved sequences, including transcription factors and Pfam domains/motifs, based on multiple sequence alignment and hidden Markov models. Out of 22,674 genes, 22,616 genes (99.74%) were successfully annotated with functional information (Table 6, Fig. 2e). For Gene Ontology (GO) and KEGG functional annotation analyses, the GO results showed that the top three pathways in biological process were cellular process, metabolic process, and biological process, while the top three pathways in molecular function were binding, catalytic activity, and molecular transducer activity (Fig. 3a). The KEGG analysis revealed that the most represented pathways in cellular process, metabolism, and organismal systems were cell growth and death, lipid metabolism, and immune system, respectively (Fig. 3b).

Non-coding RNA annotation

Non-coding RNA (ncRNA) annotations were performed using different methods based on the characteristics of each RNA type: (1) tRNA: Based on the structural features of tRNA, we used the tRNAscan-SE (v1.3.1) software for annotation, identifying a total of 1,341 tRNAs with an average length of 78 bp. (2) rRNA: Given the high conservation of rRNA sequences, we used rRNA sequences from closely related species as reference sequences. Annotation was performed via BLASTN (v2.11.0 + ; parameters: '-evalue 0.01'), resulting in the identification of 922 rRNAs with an average length of 186 bp. (3) miRNA and snRNA: Using covariance models from the Rfam family, we annotated miRNAs and snRNAs with the INFERNAL software included in Rfam (v14.8). This process identified 635 miRNAs and 2,421 snRNAs, with average lengths of 87 bp and 117 bp, respectively (Table 7).

Whole-genome collinearity analysis

For comparative genomic analysis, two additional O. cuniculus genome assemblies with annotations -- mOryCun1.1 (GenBank: GCA_030258805.1) and NZW (GenBank: GCA_030258775.1) were retrieved from NCBI. Collinearity relationships among the three genomes were constructed and visualized using the JCVI software. The analysis was performed with the following parameters: for ortholog identification, '-m jcvi.compara.catalog ortholog -no_strip_names-cscore = .99'; for synteny screening, '-m jcvi.compara.synteny screen-minspan = 30'. The results showed that the O. cuniculus-FJY genome was significantly better than the O. cuniculus-NZW genome, and was similar to the O. cuniculus-mOryCun1.1 genome (Fig. 4a).

Gene family clustering and analysis

To ascertain gene families, we conducted comparative analyses of protein sequences from O. cuniculus-FJY against those from six related species: O. cuniculus-mOryCun1.1, O. cuniculus-NZW, Lepus europaeus (GenBank: GCF_033115175.1), L. oiostolus (GenBank: GCA_036325905.1), Ochotona curzoniae (GenBank: GCF_017591425.1), and O. princeps (GenBank: GCF_030435755.1). Prior to analysis, gene sets of all involved species were filtered to ensure quality: for genes with multiple transcripts (resulting from alternative splicing), only the longest coding transcript was retained; additionally, genes encoding proteins shorter than 30 amino acids or containing internal stop codons were excluded. Subsequently, an all-vs-all BLASTp search (v2.11.0+; parameter: '-evalue 1e-5') was performed to determine similarity relationships among protein sequences across all species. Finally, clustering of the BLASTp results was carried out using OrthoFinder (v2.5.5; parameter: '-S blast -I 1.5'). Among the 22,674 protein-coding genes in O. cuniculus-FJY, 22,480 (approximately 99.14%) were grouped into 18,478 orthologous clusters (Fig. 4b).

Near telomere to telomere genome assembly of Chinese yellow rabbit (Oryctolagus cuniculus) - Scientific Data

POPULAR CATEGORY

misc

entertainment

corporate

research

wellness

athletics