Chromosome-scale genome assembly of Helcystogramma triannulella (Lepidoptera: Gelechiidae) - Scientific Data

In this study, we integrated PacBio HiFi reads and Hi-C data to assemble the first chromosome-level genome of H. triannulella. This high-quality genome assembly provides a valuable resource for elucidating the molecular mechanisms underlying H. triannulella's biology and offers a foundation for developing effective prevention and control strategies to mitigate its agricultural impact.

Specimens of H. triannulella were collected in September 2024 from Xinxiang City, China (35°16'48''N, 113°57'12''E) for whole genome sequencing. Field-collected Ipomoea batatas leaves exhibiting rolled leaves were transported to the laboratory. A portion of the leaves was used for larval collection, while the remainder was placed in wide-mouth bottles and maintained indoors until the emergence of H. triannulella adults. All collected samples were immediately flash-frozen in liquid nitrogen for storage at -80 °C. The larval samples were subsequently used for DNA sequencing to assemble the genome, while a pooled RNA sample including the different developmental stages (include larvae, pupae and adult) was used for transcriptome sequencing. In addition, the genomic DNA extracted from larval samples were also used for both Hi-C and PacBio HiFi sequencing.

The high-quality genomic DNA and total RNA were extracted using the CTAB method and Trizol reagent (Magen, Guangdong, China), respectively. The extracted DNA was evaluated for integrity and concentration using a NanoDrop 2000 spectrophotometer, Qubit 3.0 fluorometer, and 0.8% agarose gel electrophoresis to ensure suitability for downstream applications. Libraries were prepared for short-read sequencing using the VAHTS Universal Plus DNA Library Prep Kit for MGI (Vazyme, Nanjing, China), following the manufacturer's instructions. Hi-C libraries were prepared by digesting DNA with the Mbol restriction enzyme and processed according to established protocols. Both short-read and Hi-C libraries were sequenced on the BGISEQ DNBSEQ-T7 platform at Frasergen Bioinformatics (Wuhan, China). For long-read sequencing, PacBio HiFi libraries were first prepared using the SMRTbell Express Template Prep Kit 2.0. Subsequently, sequencing was performed on the PacBio Sequel II instrument with 8 M SMRT cells, acquiring one 30-hour movie per cell. This yielded a total of 30.44 Gb (78.50X coverage) of HiFi reads, 88.12 Gb (227.30X) of short reads, and 111.17 Gb (286.70X) of Hi-C reads (Table 1), providing a robust dataset for genome assembly. For transcriptome sequencing, RNA-seq libraries were prepared using the VAHTS Universal V10 RNA-seq Library Prep Kit (Vazyme, Nanjing, China) and sequenced on the BGISEQ DNBSEQ-T7 platform, generating 28.96 Gb of short-read RNA-seq data (Table 1). Subsequently, these transcriptomic data were utilized for genome annotation and the prediction of protein-coding genes, enabling a comprehensive understanding of the H. triannulella genome.

To characterize the genomic features of H. triannulella, we first conducted a genome survey using short-read sequencing data to estimate genome size, repeat content, and heterozygosity. The frequency distribution of 17-mer sequences was analyzed using Jellyfish v. 2.3.1, and the results were further processed with GenomeScope v. 2.0 to derive genomic parameters. This analysis revealed an estimated genome size of 355.30 Mb, with a substantial proportion of repetitive sequences (39.4%) and a heterozygosity rate of 0.78% (Fig. 2).

For the de novo genome assembly, we utilized HiFi long-read sequencing data, which were processed using HiFiasm v. 0.16.1 with default parameters to generate a high-quality draft assembly. The sequence graphs in GFA format were converted to FASTA format using gfatools, ensuring compatibility with downstream analyses. The final H. triannulella genome assembly comprised 387.28 Mb of sequence, distributed across 3,068 contigs and 2,152 scaffolds, with the contig N50 and scaffold N50 reaching 2.28 Mb and 12.75 Mb, respectively (Table 2).

To achieve a chromosomal-level genome assembly, we employed a comprehensive Hi-C-based scaffolding pipeline. Initially, Trimmomatic v. 0.40 was utilized to pre-process the Hi-C paired-end reads by removing low-quality bases and adapter sequences, ensuring high-quality input data for downstream analysis. The filtered reads were then aligned to the draft contigs using Juicer (v3) to calculate contact frequencies, which serve as the foundation for chromatin interaction mapping. Subsequently, 3d-DNA v. 180922 was applied with two iterative rounds of misjoin correction (-r2) using default parameters, refining the assembly by resolving potential structural errors. The oriented scaffolds generated from this step were further processed to construct interaction matrices using Juicer, which were visually inspected and manually corrected using Juicebox Assembly Tools v. 1.11.08. In addition, the Hi-C-scaffolded genome sequence was ordered against the reference genome sequence (Pectinophora gossypiella, NCBI accession no. GCA_024362695.1) using Mummer v. 4.0.0rc1 alignment results. This iterative refinement process ensured the accuracy and continuity of the chromosomal scaffolds. The heatmap shows that on all 30 chromosomes, the interaction strength at diagonal positions is significantly higher than at non-diagonal regions, indicating good quality of chromosome assembly (Fig. 3A). The collinearity analysis indicated that the chromosomes of H. triannulella showed substantial collinearity with those of the pink bollworm P. gossypiella (Fig. 3C). Ultimately, we successfully generated the first chromosomal-scale high-quality assembly for H. triannulella, with the genome anchored onto 30 chromosomes (Fig. 3). The chromosomal lengths varied from 4.45 Mb to 25.60 Mb, collectively representing 91.77% of the total genome sequence (Table 3).

The H. triannulella genome assembly was rigorously validated through multiple complementary assessment approaches. First, BUSCO (Benchmarking Universal Single-Copy Orthologue, v3.0.2) analysis showed that 96.20% of the core conserved genes (5,085 out of 5,286 Lepidoptera_odb10) were complete in the assembly, demonstrating exceptional genome completeness (Table 4).

To further evaluate assembly accuracy, we assessed the reads re-mapping ratio and coverage using both BGI short reads and HiFi long reads. BGI short reads were aligned to the genome using BWA MEM v. 0.7.17 with default parameters, while HiFi long reads were mapped using Minimap2 v. 2.24 with the "-ax map-hifi" parameter. The resulting mapping rates of 96.09% and 99.25%, along with genome coverages of 98.27% and 99.28%, respectively, further confirmed the high quality and continuity of the assembly (Table 2).

Given the inherent challenges of assembling highly repetitive genomic regions, we specifically evaluated the completeness of repetitive sequences using the Long Terminal Repeat (LTR) Assembly Index (LAI). The LAI value of 1.36 for the H. triannulella pseudomolecules indicates a reference-level assembly, underscoring the high completeness and accuracy of complex repetitive regions, including LTR retrotransposons. Additionally, we employed k-mer-based quality assessment (k = 19 bp) using the Merqury pipeline v. 1.3 with HiFi reads, which yielded a quality value (QV) score of 27.14 (Table 2). Collectively, these comprehensive evaluations demonstrate that the H. triannulella genome assembly achieves a high standard of quality, making it a reliable resource for downstream genomic and evolutionary analyses.

A comprehensive computational pipeline was used to identify and characterize repetitive sequences across the whole genome, including tandem repeats and transposable elements (TEs). Initially, Tandem Repeats Finder (TRF, v4.09.1) with stringent parameters (2 7 7 80 10 50 2000) were used to annotate high-confidence tandem repeats. Subsequently, TE annotation was conducted through an integrated approach combining de novo prediction and homology-based identification at both nucleotide and protein levels. For DNA-level analysis, LTR retrotransposons (LTR-RTs) were initially identified using LTR_FINDER v. 1.0.7, followed by de novo repeat library construction through RepeatModeler v. 2.0.1, which generated a comprehensive repeat consensus database with detailed classification information. The resulting library, in conjunction with the curated Repbase TE database, was then subjected to RepeatMasker v. 4.1.2 for genome-wide TE identification. Concurrently, protein-level analysis was performed using RepeatProteinMask implemented in the RepeatMasker package, which employed WU-BLASTX against the TE protein database to enhance detection sensitivity. Our analysis revealed that repetitive elements constitute a substantial portion of the genome, with 190.06 Mb (49.02% of the genome assembly) identified as repetitive sequences. The TE landscape was dominated by DNA transposons, accounting for 21.51% of the total repetitive content. Other major TE classes included long interspersed nuclear elements (LINEs, 19.07%), short interspersed nuclear elements (SINEs, 11.35%), and long terminal repeats (LTRs, 7.94%), with a minor fraction (3.57%) remaining unclassified (Table 5).

We employed a comprehensive approach combining homologous, ab initio, and transcriptome-assisted annotation to predict the coding gene structures. For homologous annotation, we first performed Tblastn v. 2.11.0 comparisons between related species and the reference genome. The resulting aligned sequences and their corresponding proteins were subsequently filtered and processed using Exonerate v. 2.4.0 for precise alignment. For de novo annotation, we utilized both Augustus v. 3.4.0 and GlimmerHMM v. 3.0.4. Regarding transcriptome analysis, we employed a dual strategy involving both genome-based and de novo assemblies. RNA-seq alignments were initially generated using HiSat2 v. .2.2.1, followed by transcript assembly through the genome-guided assembler Stringtie v. .2.1.7. In parallel, we conducted de novo transcriptome assembly using Trinity v. 2.8.5. All transcripts derived from RNA-seq were then integrated into a comprehensive transcriptome database following the PASA pipeline v. 2.4.1. To consolidate our findings, we used Maker v. 3.01.03 to integrate the predicted gene sets into a non-redundant, comprehensive, and reliable gene collection. Finally, we refined these predictions through the PASA pipeline v. 2.4.1, which enabled the addition of untranslated region (UTR) annotations and alternative splicing isoform models. Finally, we predicted a total of 14,211 protein-coding genes in the genome (Table 6). To functionally annotate the predicted protein-coding genes in the H. triannulella genome, we employed a comprehensive approach utilizing multiple well-established databases. Gene functions were inferred according to the best match of the alignments to the NCBI Non-Redundant (NR), KEGG database, Gene Ontology (GO), TrEMBL and Swiss-Prot protein databases using Diamond BLASTP v2.0.7 with an E-value threshold of 1E-5. Additionally, protein domains were annotated using InterProScan v. 5.50-84.0 based on the InterPro protein database, which integrates predictive models from multiple resources to provide a robust annotation framework. This multi-database approach enabled the functional annotation of 13,989 genes, representing 98.44% of the predicted protein-coding genes in the H. triannulella genome. Specifically, 74.38% of the genes were annotated in InterPro, 50.95% in GO, 96.88% in KEGG, 66.01% in SwissProt, 96.45% in TrEMBL, and 98.43% in NR (Table 6).

The tRNAscan-SE (v2.0.9) algorithms and RNAmmer v. 1.2 was used to predict tRNA genes and rRNA sequences, respectively. MiRNAs and snRNAs were identified by Infernal v. 1.1.2 software against the Rfam v. 14.6 database with default parameters. This integrated approach resulted in the identification of 74 miRNAs, 17,993 tRNAs, 52 rRNAs, and 65 snRNAs in the H. triannulella genome. Notably, tRNAs were the most abundant ncRNA class, representing 0.3377% of the total genome length (Table 7).

Chromosome-scale genome assembly of Helcystogramma triannulella (Lepidoptera: Gelechiidae) - Scientific Data

POPULAR CATEGORY

corporate

entertainment

research

misc

wellness

athletics