Bioinformatics - current issue - Recent Educational Updates

NanoASV: a snakemake workflow for reproducible field-based Nanopore full-length 16S metabarcoding amplicon data analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>NanoASV is a conda environment and snakemake-based workflow using state-of-the-art bioinformatics software to process full-length SSU rRNA (16S/18S) amplicons acquired with Oxford Nanopore Sequencing technology. Its strength lies in reproducibility, portability, and the possibility to run offline, allowing in-field analysis. It can be installed on the Nanopore MK1C sequencing device and process data locally.<div class="boxTitle">Availability and implementation</div>Source code and documentation are freely available at <a href="https://github.com/ImagoXV/NanoASV">https://github.com/ImagoXV/NanoASV</a> and Zenodo archive at <a href="https://doi.org/10.5281/zenodo.14730742">https://doi.org/10.5281/zenodo.14730742</a>.</span>


mastR: an R package for automated identification of tissue-specific gene signatures in multi-group differential expression analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Biomarker discovery is important and offers insight into potential underlying mechanisms of disease. While existing biomarker identification methods primarily focus on single cell RNA sequencing (scRNA-seq) data, there remains a need for automated methods designed for labeled bulk RNA-seq data from sorted cell populations or experiments. Current methods require curation of results or statistical thresholds and may not account for tissue background expression. Here we bridge these limitations with an automated marker identification method for labeled bulk RNA-seq data that explicitly considers background expressions.<div class="boxTitle">Results</div>We developed <span style="font-style:italic;">mastR</span>, a novel tool for accurate marker identification using transcriptomic data. It leverages robust statistical pipelines like <span style="font-style:italic;">edgeR</span> and <span style="font-style:italic;">limma</span> to perform pairwise comparisons between groups, and aggregates results using rank-product-based permutation test. A signal-to-noise ratio approach is implemented to minimize background signals. We assessed the performance of <span style="font-style:italic;">mastR</span>-derived NK cell signatures against published curated signatures and found that the <span style="font-style:italic;">mastR</span>-derived signature performs as well, if not better than the published signatures. 
We further demonstrated the utility of <span style="font-style:italic;">mastR</span> on simulated scRNA-seq data and in comparison with <span style="font-style:italic;">Seurat</span> in terms of marker selection performance.<div class="boxTitle">Availability and implementation</div><span style="font-style:italic;">mastR</span> is freely available from <a href="https://bioconductor.org/packages/release/bioc/html/mastR.html">https://bioconductor.org/packages/release/bioc/html/mastR.html</a>. A vignette and guide are available at <a href="https://davislaboratory.github.io/mastR">https://davislaboratory.github.io/mastR</a>. All statistical analyses were carried out using R (version ≥4.3.0) and Bioconductor (version ≥3.17).</span>


Generating multiple alignments on a pangenomic scale
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Since novel long read sequencing technologies allow for <span style="font-style:italic;">de novo</span> assembly of many individuals of a species, high-quality assemblies are becoming widely available. For example, the recently published draft human pangenome reference was based on assemblies composed of contigs. There is an urgent need for a software-tool that is able to generate a multiple alignment of genomes of the same species because current multiple sequence alignment programs cannot deal with such a volume of data.<div class="boxTitle">Results</div>We show that the combination of a well-known anchor-based method with the technique of prefix-free parsing yields an approach that is able to generate multiple alignments on a pangenomic scale, provided that large-scale structural variants are rare. Furthermore, experiments with real world data show that our software tool PANgenomic Anchor-based Multiple Alignment significantly outperforms current state-of-the art programs.<div class="boxTitle">Availability and implementation</div>Source code is available at: <a href="https://gitlab.com/qwerzuiop/panama">https://gitlab.com/qwerzuiop/panama</a>, archived at swh:1:dir:e90c9f664995acca9063245cabdd97549cf39694.</span>


AlphaPulldown2—a general pipeline for high-throughput structural modeling
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>AlphaPulldown2 streamlines protein structural modeling by automating workflows, improving code adaptability, and optimizing data management for large-scale applications. It introduces an automated Snakemake pipeline, compressed data storage, support for additional modeling backends like UniFold and AlphaLink2, and a range of other improvements. These upgrades make AlphaPulldown2 a versatile platform for predicting both binary interactions and complex multi-unit assemblies.<div class="boxTitle">Availability and implementation</div><span style="font-style:italic;">AlphaPulldown2</span> is freely available at <a href="https://github.com/KosinskiLab/AlphaPulldown">https://github.com/KosinskiLab/AlphaPulldown</a>.</span>


actifpTM: a refined confidence metric of AlphaFold2 predictions involving flexible regions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>One of the main advantages of deep learning models of protein structure, such as Alphafold2, is their ability to accurately estimate the confidence of a generated structural model, which allows us to focus on highly confident predictions. The ipTM score provides a confidence estimate of interchain contacts in protein–protein interactions. However, interactions, in particular motif-mediated interactions, often also contain regions that remain flexible upon binding. These noninteracting flanking regions are assigned low confidence values and will affect ipTM, as it considers all interchain residue–residue pairs, and two models of the same motif-domain interaction, but differing in the length of their flanking regions, would be assigned very different values. Here, we propose actual interface pTM (actifpTM), a modified ipTM measure, that focuses on the residues participating in the interaction, resulting in a more robust measure of interaction confidence. Besides, actifpTM is calculated both for the full complex as well as for each pair of chains, making it well-suited for evaluating multi-chain complexes with a particularly critical binding interface, such as antibody-antigen interactions.<div class="boxTitle">Availability and implementation</div>The method is available as part of the ColabFold (<a href="https://github.com/sokrypton/ColabFold">https://github.com/sokrypton/ColabFold</a>) repository, installable both locally or usable with Colab notebook.</span>


QuickEd: high-performance exact sequence alignment based on bound-and-align
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Pairwise sequence alignment is a core component of multiple sequencing-data analysis tools. Recent advancements in sequencing technologies have enabled the generation of longer sequences at a much lower price. Thus, long-read sequencing technologies have become increasingly popular in sequencing-based studies. However, classical sequence analysis algorithms face significant scalability challenges when aligning long sequences. As a result, several heuristic methods have been developed to improve performance at the expense of accuracy, as they often fail to produce the optimal alignment.<div class="boxTitle">Results</div>This paper introduces QuickEd, a sequence alignment algorithm based on a bound-and-align strategy. First, QuickEd effectively bounds the maximum alignment-score using efficient heuristic strategies. Then, QuickEd utilizes this bound to reduce the computations required to produce the optimal alignment. Compared to O(n2) complexity of traditional dynamic programming algorithms, QuickEd’s bound-and-align strategy achieves O(ns^) complexity, where <span style="font-style:italic;">n</span> is the sequence length and s^ is an estimated upper bound of the alignment-score between the sequences. As a result, QuickEd is consistently faster than other state-of-the-art implementations, such as Edlib and BiWFA, achieving performance speedups of 4.2−5.9× and 3.8−4.4×, respectively, aligning long and noisy datasets. In addition, QuickEd maintains a stable memory footprint below 35 MB while aligning sequences up to 1 Mbp.<div class="boxTitle">Availability and implementation</div>QuickEd code and documentation are publicly available at <a href="https://github.com/maxdoblas/QuickEd">https://github.com/maxdoblas/QuickEd</a>.</span>


UnifiedGreatMod: a new holistic modelling paradigm for studying biological systems on a complete and harmonious scale
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Computational models are crucial for addressing critical questions about systems evolution and deciphering system connections. The pivotal feature of making this concept recognizable from the biological and clinical community is the possibility of quickly inspecting the whole system, bearing in mind the different granularity levels of its components. This holistic view of system behaviour expands the evolution study by identifying the heterogeneous behaviours applicable, e.g. to the cancer evolution study.<div class="boxTitle">Results</div>To address this aspect, we propose a new modelling paradigm, UnifiedGreatMod, which allows modellers to integrate fine-grained and coarse-grained biological information into a unique model. It enables functional studies by combining the analysis of the system’s multi-level stable states with its fluctuating conditions. This approach helps to investigate the functional relationships and dependencies among biological entities. This is achieved, thanks to the hybridization of two analysis approaches that capture a system’s different granularity levels. The proposed paradigm was then implemented into the open-source, general modelling framework GreatMod, in which a graphical meta-formalism is exploited to simplify the model creation phase and R languages to define user-defined analysis workflows. The proposal’s effectiveness was demonstrated by mechanistically simulating the metabolic output of <span style="font-style:italic;">Escherichia coli</span> under environmental nutrient perturbations and integrating a gene expression dataset. 
Additionally, the UnifiedGreatMod was used to examine the responses of luminal epithelial cells to <span style="font-style:italic;">Clostridium difficile</span> infection.<div class="boxTitle">Availability and implementation</div>GreatMod <a href="https://qbioturin.github.io/epimod/">https://qbioturin.github.io/epimod/</a>, epimod_FBAfunctions <a href="https://github.com/qBioTurin/epimod_FBAfunctions">https://github.com/qBioTurin/epimod_FBAfunctions</a>, first case study <span style="font-style:italic;">E. coli</span>  <a href="https://github.com/qBioTurin/Ec_coli_modelling">https://github.com/qBioTurin/Ec_coli_modelling</a>, second case study <span style="font-style:italic;">C. difficile</span>  <a href="https://github.com/qBioTurin/EpiCell_CDifficile">https://github.com/qBioTurin/EpiCell_CDifficile</a>.</span>


AJGM: joint learning of heterogeneous gene networks with adaptive graphical model
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Inferring gene networks provides insights into biological pathways and functional relationships among genes. When gene expression samples exhibit heterogeneity, they may originate from unknown subtypes, prompting the utilization of mixture Gaussian graphical model (GGM) for simultaneous subclassification and gene network inference. However, this method overlooks the heterogeneity of network relationships across subtypes and does not sufficiently emphasize shared relationships. Additionally, GGM assumes data follows a multivariate Gaussian distribution, which is often not the case with zero-inflated scRNA-seq data.<div class="boxTitle">Results</div>We propose an Adaptive Joint Graphical Model (AJGM) for estimating multiple gene networks from single-cell or bulk data with unknown heterogeneity. In AJGM, an overall network is introduced to capture relationships shared by all samples. The model establishes connections between the subtype networks and the overall network through adaptive weights, enabling it to focus more effectively on gene relationships shared across all networks, thereby enhancing the accuracy of network estimation. On synthetic data, the proposed approach outperforms existing methods in terms of sample classification and network inference, particularly excelling in the identification of shared relationships. Applying this method to gene expression data from triple-negative breast cancer confirms known gene pathways and hub genes, while also revealing novel biological insights.<div class="boxTitle">Availability and implementation</div>The Python code and demonstrations of the proposed approaches are available at <a href="https://github.com/yyytim/AJGM">https://github.com/yyytim/AJGM</a>, and the software is archived in Zenodo with DOI: 10.5281/zenodo.14740972.</span>


PopGLen—a Snakemake pipeline for performing population genomic analyses using genotype likelihood-based methods
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>PopGLen is a Snakemake workflow for performing population genomic analyses within a genotype-likelihood framework, integrating steps for raw sequence processing of both historical and modern DNA, quality control, multiple filtering schemes, and population genomic analysis. Currently, the population genomic analyses included allow for estimating linkage disequilibrium, kinship, genetic diversity, genetic differentiation, population structure, inbreeding, and allele frequencies. Through Snakemake, it is highly scalable, and all steps of the workflow are automated, with results compiled into an HTML report. PopGLen provides an efficient, customizable, and reproducible option for analyzing population genomic datasets across a wide variety of organisms.<div class="boxTitle">Availability and implementation</div>PopGLen is available under GPLv3 with code, documentation, and a tutorial at <a href="https://github.com/zjnolen/PopGLen">https://github.com/zjnolen/PopGLen</a>. An example HTML report using the tutorial dataset is included in the Supplementary MaterialSupplementary Material.</span>


BioArchLinux: community-driven fresh reproducible software repository for life sciences
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The BioArchLinux project was initiated to address challenges in bioinformatics software reproducibility and freshness. Relying on Arch Linux's user-driven ecosystem, we aim to create a comprehensive and continuously updated repository for life sciences research.<div class="boxTitle">Results</div>BioArchLinux provides a PKGBUILD-based system for seamless software packaging and maintenance, enabling users to access the latest bioinformatics tools across multiple programming languages. The repository includes Docker images, Windows Subsystem for Linux (WSL) support, and Junest for nonroot environments, enhancing accessibility across platforms. Although being developed and maintained by a small core team, BioArchLinux is a fast-growing bioinformatics repository that offers a participatory and community-driven environment.<div class="boxTitle">Availability and implementation</div>The repository, documentation, and tools are freely available at <a href="https://bioarchlinux.org">https://bioarchlinux.org</a> and <a href="https://github.com/BioArchLinux">https://github.com/BioArchLinux</a>. Users and developers are encouraged to contribute and expand this open-source initiative.</span>


TrAGEDy—trajectory alignment of gene expression dynamics
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Single-cell transcriptomics sequencing is used to compare different biological processes. However, often, those processes are asymmetric which are difficult to integrate. Current approaches often rely on integrating samples from each condition before either cluster-based comparisons or analysis of an inferred shared trajectory.<div class="boxTitle">Results</div>We present Trajectory Alignment of Gene Expression Dynamics (TrAGEDy), which allows the alignment of independent trajectories to avoid the need for error–prone integration steps. Across simulated datasets, TrAGEDy returns the correct underlying alignment of the datasets, outperforming current tools which fail to capture the complexity of asymmetric alignments. When applied to real datasets, TrAGEDy captures more biologically relevant genes and processes, which other differential expression methods fail to detect when looking at the developments of T cells and the bloodstream forms of <span style="font-style:italic;">Trypanosoma brucei</span> when affected by genetic knockouts.<div class="boxTitle">Availability and implementation</div>TrAGEDy is freely available at <a href="https://github.com/No2Ross/TrAGEDy">https://github.com/No2Ross/TrAGEDy</a>, and implemented in R.</span>


ENACT: End-to-End Analysis of Visium High Definition (HD) Data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Spatial transcriptomics (ST) enables the study of gene expression within its spatial context in histopathology samples. To date, a limiting factor has been the resolution of sequencing based ST products. The introduction of the Visium High Definition (HD) technology opens the door to cell resolution ST studies. However, challenges remain in the ability to accurately map transcripts to cells and in assigning cell types based on the transcript data.<div class="boxTitle">Results</div>We developed ENACT, a self-contained pipeline that integrates advanced cell segmentation with Visium HD transcriptomics data to infer cell types across whole tissue sections. Our pipeline incorporates novel bin-to-cell assignment methods, enhancing the accuracy of single-cell transcript estimates. Validated on diverse synthetic and real datasets, our approach is both scalable to samples with hundreds of thousands of cells and effective, offering a robust solution for spatially resolved transcriptomics analysis.<div class="boxTitle">Availability and implementation</div>ENACT source code is available at <a href="https://github.com/Sanofi-Public/enact-pipeline">https://github.com/Sanofi-Public/enact-pipeline</a>. Experimental data are available at <a href="https://zenodo.org/records/14748859">https://zenodo.org/records/14748859</a>.</span>


Predicting circRNA–disease associations with shared units and multi-channel attention mechanisms
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Circular RNAs (circRNAs) have been identified as key players in the progression of several diseases; however, their roles have not yet been determined because of the high financial burden of biological studies. This highlights the urgent need to develop efficient computational models that can predict circRNA–disease associations, offering an alternative approach to overcome the limitations of expensive experimental studies. Although multi-view learning methods have been widely adopted, most approaches fail to fully exploit the latent information across views, while simultaneously overlooking the fact that different views contribute to varying degrees of significance.<div class="boxTitle">Results</div>This study presents a method that combines multi-view shared units and multichannel attention mechanisms to predict circRNA–disease associations (MSMCDA). MSMCDA first constructs similarity and meta-path networks for circRNAs and diseases by introducing shared units to facilitate interactive learning across distinct network features. Subsequently, multichannel attention mechanisms were used to optimize the weights within similarity networks. Finally, contrastive learning strengthened the similarity features. Experiments on five public datasets demonstrated that MSMCDA significantly outperformed other baseline methods. Additionally, case studies on colorectal cancer, gastric cancer, and nonsmall cell lung cancer confirmed the effectiveness of MSMCDA in uncovering new associations.<div class="boxTitle">Availability and implementation</div>The source code and data are available at <a href="https://github.com/zhangxue2115/MSMCDA.git">https://github.com/zhangxue2115/MSMCDA.git</a>.</span>


PgRC2: engineering the compression of sequencing reads
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>The FASTQ format remains at the heart of high-throughput sequencing. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance tradeoffs. We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of approximating the shortest common superstring over high-quality reads. Redundancy in the obtained string is efficiently removed by using a compact temporary representation. The current version, v2.0, preserves the compression ratio of the previous one, reducing the compression (resp. decompression) time by a factor of 8–9 (resp. 2–2.5) on a 14-core/28-thread machine.<div class="boxTitle">Availability and implementation</div>PgRC 2.0 can be downloaded from <a href="https://github.com/kowallus/PgRC">https://github.com/kowallus/PgRC</a> and <a href="https://zenodo.org/records/14882486">https://zenodo.org/records/14882486</a> (10.5281/zenodo.14882486).</span>


GeneFEAST: the pivotal, gene-centric step in functional enrichment analysis interpretation
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>GeneFEAST, implemented in Python, is a <span style="font-style:italic;">gene</span>-centric <span style="font-style:italic;">f</span>unctional <span style="font-style:italic;">e</span>nrichment <span style="font-style:italic;">a</span>nalysis <span style="font-style:italic;">s</span>ummarization and visualization <span style="font-style:italic;">t</span>ool that can be applied to large functional enrichment analysis (FEA) results arising from upstream FEA pipelines. It produces a systematic, navigable HTML report, making it easy to identify sets of genes putatively driving multiple enrichments and to explore gene-level quantitative data first used to identify input genes. Further, GeneFEAST can juxtapose FEA results from multiple studies, making it possible to highlight patterns of gene expression amongst genes that are differentially expressed in at least one of multiple conditions, and which give rise to shared enrichments under those conditions. Thus, GeneFEAST offers a novel, effective way to address the complexities of linking up many overlapping FEA results to their underlying genes and data, advancing gene-centric hypotheses, and providing pivotal information for downstream validation experiments.<div class="boxTitle">Availability and implementation</div>GeneFEAST GitHub repository: <a href="https://github.com/avigailtaylor/GeneFEAST">https://github.com/avigailtaylor/GeneFEAST</a>; Zenodo record: <a href="https://10.5281/zenodo.14753734">10.5281/zenodo.14753734</a>; Python Package Index: <a href="https://pypi.org/project/genefeast">https://pypi.org/project/genefeast</a>; Docker container: ghcr.io/avigailtaylor/genefeast.</span>


AcImpute: a constraint-enhancing smooth-based approach for imputing single-cell RNA sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Single-cell RNA sequencing (scRNA-seq) provides a powerful tool for studying cellular heterogeneity and complexity. However, dropout events in single-cell RNA-seq data severely hinder the effectiveness and accuracy of downstream analysis. Therefore, data preprocessing with imputation methods is crucial to scRNA-seq analysis.<div class="boxTitle">Results</div>To address the issue of oversmoothing in smoothing-based imputation methods, the presented AcImpute, an unsupervised method that enhances imputation accuracy by constraining the smoothing weights among cells for genes with different expression levels. Compared with nine other imputation methods in cluster analysis and trajectory inference, the experimental results can demonstrate that AcImpute effectively restores gene expression, preserves inter-cell variability, preventing oversmoothing and improving clustering and trajectory inference performance.<div class="boxTitle">Availability and implementation</div>The code is available at <a href="https://github.com/Liutto/AcImpute">https://github.com/Liutto/AcImpute</a>.</span>


ORCO: Ollivier-Ricci Curvature-Omics—an unsupervised method for analyzing robustness in biological systems
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Although recent advanced sequencing technologies have improved the resolution of genomic and proteomic data to better characterize molecular phenotypes, efficient computational tools to analyze and interpret large-scale omic data are still needed.<div class="boxTitle">Results</div>To address this, we have developed a network-based bioinformatic tool called Ollivier-Ricci curvature for omics (ORCO). ORCO incorporates omics data and a network describing biological relationships between the genes or proteins and computes Ollivier-Ricci curvature (ORC) values for individual interactions. ORC is an edge-based measure that assesses network robustness. It captures functional cooperation in gene signaling using a consistent information-passing measure, which can help investigators identify therapeutic targets and key regulatory modules in biological systems. ORC has identified novel insights in multiple cancer types using genomic data and in neurodevelopmental disorders using brain imaging data. This tool is applicable to any data that can be represented as a network.<div class="boxTitle">Availability and implementation</div>ORCO is an open-source Python package and is publicly available on GitHub at <a href="https://github.com/aksimhal/ORC-Omics">https://github.com/aksimhal/ORC-Omics</a>.</span>


CryoTEN: efficiently enhancing cryo-EM density maps using transformers
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Cryogenic electron microscopy (cryo-EM) is a core experimental technique used to determine the structure of macromolecules such as proteins. However, the effectiveness of cryo-EM is often hindered by the noise and missing density values in cryo-EM density maps caused by experimental conditions such as low contrast and conformational heterogeneity. Although various global and local map-sharpening techniques are widely employed to improve cryo-EM density maps, it is still challenging to efficiently improve their quality for building better protein structures from them.<div class="boxTitle">Results</div>In this study, we introduce CryoTEN—a 3D UNETR++ style transformer to improve cryo-EM maps effectively. CryoTEN is trained using a diverse set of 1295 cryo-EM maps as inputs and their corresponding simulated maps generated from known protein structures as targets. An independent test set containing 150 maps is used to evaluate CryoTEN, and the results demonstrate that it can robustly enhance the quality of cryo-EM density maps. In addition, automatic <span style="font-style:italic;">de novo</span> protein structure modeling shows that protein structures built from the density maps processed by CryoTEN have substantially better quality than those built from the original maps. Compared to the existing state-of-the-art deep learning methods for enhancing cryo-EM density maps, CryoTEN ranks second in improving the quality of density maps, while running &gt;10 times faster and requiring much less GPU memory than them.<div class="boxTitle">Availability and implementation</div>The source code and data are freely available at <a href="https://github.com/jianlin-cheng/cryoten">https://github.com/jianlin-cheng/cryoten</a>.</span>


Significance in scale space for Hi-C data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Hi-C technology has been developed to profile genome-wide chromosome conformation. So far Hi-C data have been generated from a large compendium of different cell types and different tissue types. Among different chromatin conformation units, chromatin loops were found to play a key role in gene regulation across different cell types. While many different loop calling algorithms have been developed, most loop callers identified shared loops as opposed to cell-type-specific loops.<div class="boxTitle">Results</div>We propose SSSHiC, a new loop calling algorithm based on significance in scale space, which can be used to understand data at different levels of resolution. By applying SSSHiC to neuronal and glial Hi-C data, we detected more loops that are potentially engaged in cell-type-specific gene regulation. Compared with other loop callers, such as Mustache, these loops were more frequently anchored to gene promoters of cellular marker genes and had better APA scores. Therefore, our results suggest that SSSHiC can effectively capture loops that contain more gene regulatory information.<div class="boxTitle">Availability and implementation</div>The Hi-C data used in this study can be accessed through the PsychENCODE Knowledge Portal at <a href="https://www.synapse.org/">https://www.synapse.org/#</a>! Synapse: syn21760712. The code utilized for Curvature SSS cited in this study is available at <a href="https://github.com/jsmarron/MarronMatlabSoftware/blob/master/Matlab9/Matlab9Combined.zip">https://github.com/jsmarron/MarronMatlabSoftware/blob/master/Matlab9/Matlab9Combined.zip</a>. All custom code used in this research can be found in the GitHub repository: <a href="https://github.com/jerryliu01998/HiC">https://github.com/jerryliu01998/HiC</a>. The code has also been submitted to Code Ocean with the doi: 10.24433/CO.1912913.v1.</span>


Scupa: single-cell unified polarization assessment of immune cells using the single-cell foundation model
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Immune cells undergo cytokine-driven polarization in response to diverse stimuli, altering their transcriptional profiles and functional states. This dynamic process is central to immune responses in health and diseases, yet a systematic approach to assess cytokine-driven polarization in single-cell RNA sequencing data has been lacking.<div class="boxTitle">Results</div>To address this gap, we developed <strong>s</strong>ingle-<strong>c</strong>ell <strong>u</strong>nified <strong>p</strong>olarization <strong>a</strong>ssessment (Scupa), the first computational method for comprehensive immune cell polarization assessment. Scupa leverages data from the Immune Dictionary, which characterizes cytokine-driven polarization states across 14 immune cell types. By integrating cell embeddings from the single-cell foundation model Universal Cell Embeddings, Scupa effectively identifies polarized cells across different species and experimental conditions. Applications of Scupa in independent datasets demonstrated its accuracy in classifying polarized cells and further revealed distinct polarization profiles in tumor-infiltrating myeloid cells across cancers. Scupa complements conventional single-cell data analysis by providing new insights into dynamic immune cell states, and holds potential for advancing therapeutic insights, particularly in cytokine-based therapies.<div class="boxTitle">Availability and implementation</div>The code is available at <a href="https://github.com/bsml320/Scupa">https://github.com/bsml320/Scupa</a>.</span>


NPM: latent batch effects correction of omics data by nearest-pair matching
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Batch effects (BEs) are a predominant source of noise in omics data and often mask real biological signals. BEs remain common in existing datasets. Current methods for BE correction mostly rely on specific assumptions or complex models, and may not detect and adjust BEs adequately, impacting downstream analysis and discovery power. To address these challenges, we developed NPM, a nearest-neighbor matching-based method that adjusts BEs and may outperform other methods in a wide range of datasets.<div class="boxTitle">Results</div>We assessed distinct metrics and graphical readouts, and compared our method to commonly used BE correction methods. NPM demonstrates the ability to correct BEs while preserving biological differences. It may outperform other methods based on multiple metrics. Altogether, NPM proves to be a valuable BE correction approach to maximize discovery in biomedical research, with applicability in clinical research where latent BEs are often dominant.<div class="boxTitle">Availability and implementation</div>NPM is freely available on GitHub (<a href="https://github.com/bigomics/NPM">https://github.com/bigomics/NPM</a>) and on Omics Playground (<a href="https://bigomics.ch/omics-playground">https://bigomics.ch/omics-playground</a>). Computer code for the analyses is available at <a href="https://github.com/bigomics/NPM">https://github.com/bigomics/NPM</a>. The datasets underlying this article are the following: GSE120099, GSE82177, GSE162760, GSE171343, GSE153380, GSE163214, GSE182440, GSE163857, GSE117970, GSE173078, and GSE10846. All these datasets are publicly available and can be freely accessed on the Gene Expression Omnibus repository.</span>
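The general nearest-pair idea can be sketched as follows: match each sample in a batch to its nearest neighbour in a reference batch and remove the mean paired offset. This is a minimal NumPy sketch of the concept, not NPM's actual implementation; `nearest_pair_correct` is a hypothetical name:

```python
import numpy as np

def nearest_pair_correct(X_ref, X_batch):
    """Correct X_batch toward X_ref by matching each batch sample to its
    nearest reference sample and removing the mean paired offset.
    Rows are samples, columns are features (e.g. genes)."""
    # pairwise Euclidean distances between batch and reference samples
    d = np.linalg.norm(X_batch[:, None, :] - X_ref[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                        # nearest reference sample
    offset = (X_batch - X_ref[nearest]).mean(axis=0)  # estimated batch effect
    return X_batch - offset
```

Because the offset is estimated only from matched pairs, biological differences that exist in both batches are left untouched; only the systematic shift between batches is removed.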


Jellyfish: integrative visualization of spatio-temporal tumor evolution and clonal dynamics
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Spatial and temporal intra-tumor heterogeneity drives tumor evolution and therapy resistance. Existing visualization tools often fail to capture both dimensions simultaneously. To address this, we developed Jellyfish, a tool that integrates phylogenetic and sample trees into a single plot, providing a holistic view of tumor evolution and capturing both spatial and temporal evolution. Available as a JavaScript library and R package, Jellyfish generates interactive visualizations from tumor phylogeny and clonal composition data. We demonstrate its ability to visualize complex subclonal dynamics using data from ovarian high-grade serous carcinoma.<div class="boxTitle">Availability and implementation</div>Jellyfish is freely available with MIT license at <a href="https://github.com/HautaniemiLab/jellyfish">https://github.com/HautaniemiLab/jellyfish</a> (JavaScript library) and <a href="https://github.com/HautaniemiLab/jellyfisher">https://github.com/HautaniemiLab/jellyfisher</a> (R package).</span>


COME: contrastive mapping learning for spatial reconstruction of single-cell RNA sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Single-cell RNA sequencing (scRNA-seq) enables high-throughput transcriptomic profiling at single-cell resolution. The inherent spatial location is crucial for understanding how single cells orchestrate multicellular functions and drive diseases. However, spatial information is often lost during tissue dissociation. Spatial transcriptomic (ST) technologies can provide a precise spatial gene expression atlas, but their practicality is constrained by the number of genes they can assay, the associated costs at larger scales, and the difficulty of fine-grained cell-type annotation. By transferring knowledge between scRNA-seq and ST data through cell correspondence learning, it is possible to recover the spatial properties inherent in scRNA-seq datasets.<div class="boxTitle">Results</div>In this study, we introduce COME, a COntrastive Mapping lEarning approach that learns a mapping between ST and scRNA-seq data to recover the spatial information of scRNA-seq data. Extensive experiments demonstrate that the proposed COME method effectively captures precise cell-spot relationships and outperforms previous methods in recovering spatial location for scRNA-seq data. More importantly, our method is capable of precisely identifying biologically meaningful information within the data, such as the spatial structure of missing genes, spatial hierarchical patterns, and the cell-type compositions for each spot. These results indicate that the proposed COME method can help to understand the heterogeneity and activities among cells within tissue environments.<div class="boxTitle">Availability and implementation</div>COME is freely available on GitHub (<a href="https://github.com/cindyway/COME">https://github.com/cindyway/COME</a>).</span>
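The contrastive-mapping idea can be sketched generically: matched cell-spot pairs are pulled together in a shared embedding space with an InfoNCE-style loss, and the learned similarities then yield a soft cell-to-spot assignment. This is a minimal NumPy sketch of that pattern under assumed L2-normalized embeddings, not COME's actual objective:

```python
import numpy as np

def contrastive_mapping_loss(cells, spots, tau=0.1):
    """InfoNCE-style loss: each cell should map to its paired spot.
    cells, spots: (n, d) embeddings with matched rows."""
    cells = cells / np.linalg.norm(cells, axis=1, keepdims=True)
    spots = spots / np.linalg.norm(spots, axis=1, keepdims=True)
    sim = cells @ spots.T / tau                       # cell-to-spot logits
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))                    # matched pairs on diagonal

def soft_assignment(cells, spots, tau=0.1):
    """Softmax over similarities: probabilistic cell-to-spot mapping.
    Assumes embeddings are already L2-normalized."""
    sim = (cells @ spots.T) / tau
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

After training, each row of the soft assignment gives a cell's probability distribution over spots, from which a spatial location can be recovered.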


PharaCon: a new framework for identifying bacteriophages via conditional representation learning
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples.<div class="boxTitle">Results</div>To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model’s input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning; we name the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon’s effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning.<div class="boxTitle">Availability and implementation</div>The source code and associated data can be accessed at <a href="https://github.com/Celestial-Bai/PharaCon">https://github.com/Celestial-Bai/PharaCon</a>.</span>
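Attaching the label during tokenization can be sketched as prepending a class-specific special token to the token sequence. The special token names and k-mer tokenization below are illustrative assumptions, not PharaCon's actual vocabulary:

```python
def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def conditional_tokenize(seq, label, k=4):
    """Prepend a special label token (hypothetical [PHAGE]/[BACT]) so the
    model sees label-conditioned contexts during pre-training."""
    label_token = {"phage": "[PHAGE]", "bacteria": "[BACT]"}[label]
    return ["[CLS]", label_token] + kmer_tokenize(seq, k) + ["[SEP]"]
```

At fine-tuning time, a scheme along these lines lets the same sequence be scored under each candidate label token, turning the conditional language model into a classifier.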


AskBeacon—performing genomic data exchange and analytics with natural language
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Enabling clinicians and researchers to directly interact with global genomic data resources by removing technological barriers is vital for medical genomics. AskBeacon enables large language models (LLMs) to be applied to securely shared cohorts via the Global Alliance for Genomics and Health Beacon protocol. By simply “asking” Beacon, actionable insights can be gained, analyzed, and made publication-ready.<div class="boxTitle">Results</div>In the Parkinson's Progression Markers Initiative (PPMI), we use natural language to ask whether the sex differences observed in Parkinson's disease are due to X-linked or autosomal markers. AskBeacon returns a publication-ready visualization showing that for PPMI the autosomal marker occurred 1.4 times more often in males with Parkinson’s disease than females, compared to no differences for the X-linked marker. We evaluate commercial and open-weight LLMs, as well as different architectures, to identify the best strategy for translating research questions to Beacon queries. AskBeacon implements extensive safety guardrails to ensure that genomic data is not exposed to the LLM directly, and that the generated code for data extraction, analysis, and visualization is sanitized and hallucination-resistant, so data cannot be leaked or falsified.<div class="boxTitle">Availability and implementation</div>AskBeacon is available at <a href="https://github.com/aehrc/AskBeacon">https://github.com/aehrc/AskBeacon</a>.</span>


SpectroPipeR—a streamlining post Spectronaut® DIA-MS data analysis R package
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Proteome studies frequently encounter challenges in downstream data analysis due to limited bioinformatics resources, rapid data generation, and variations in analytical methods. To address these issues, we developed SpectroPipeR, an R package designed to streamline data analysis tasks and provide a comprehensive, standardized pipeline for Spectronaut<sup>®</sup> DIA-MS data. This novel package automates various analytical processes, including XIC plots, ID rate summary, normalization, batch and covariate adjustment, relative protein quantification, multivariate analysis, and statistical analysis, while generating interactive HTML reports for, e.g., ELN systems.<div class="boxTitle">Availability and implementation</div>The SpectroPipeR package (manual: <a href="https://stemicha.github.io/SpectroPipeR/">https://stemicha.github.io/SpectroPipeR/</a>) was written in R and is freely available on GitHub (<a href="https://github.com/stemicha/SpectroPipeR">https://github.com/stemicha/SpectroPipeR</a>).</span>


AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The combination of long-read sequencing technologies like Oxford Nanopore with single-cell RNA sequencing (scRNAseq) assays enables the detailed exploration of transcriptomic complexity, including isoform detection and quantification, by capturing full-length cDNAs. However, challenges remain, including the lack of advanced simulation tools that can effectively mimic the unique complexities of scRNAseq long-read datasets. Such tools are essential for the evaluation and optimization of isoform detection methods dedicated to single-cell long-read studies.<div class="boxTitle">Results</div>We developed AsaruSim, a workflow that simulates synthetic single-cell long-read Nanopore datasets, closely mimicking real experimental data. AsaruSim employs a multi-step process that includes the creation of a synthetic count matrix, generation of perfect reads, optional PCR amplification, introduction of sequencing errors, and comprehensive quality control reporting. Applied to a dataset of human peripheral blood mononuclear cells, AsaruSim accurately reproduced experimental read characteristics.<div class="boxTitle">Availability and implementation</div>The source code and full documentation are available at <a href="https://github.com/GenomiqueENS/AsaruSim">https://github.com/GenomiqueENS/AsaruSim</a>.</span>
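The sequencing-error step of such a simulation can be sketched as per-base sampling of substitutions, insertions, and deletions. The rates and function name below are illustrative assumptions, not AsaruSim's calibrated Nanopore error model:

```python
import random

def add_nanopore_errors(read, sub=0.05, ins=0.03, dele=0.04, seed=0):
    """Introduce substitution/insertion/deletion errors at given per-base
    rates, loosely mimicking a long-read error profile."""
    rng = random.Random(seed)
    bases = "ACGT"
    out = []
    for b in read:
        r = rng.random()
        if r < dele:
            continue                                   # deletion: drop the base
        if r < dele + sub:
            out.append(rng.choice([x for x in bases if x != b]))  # substitution
        else:
            out.append(b)
        if rng.random() < ins:
            out.append(rng.choice(bases))              # insertion after the base
    return "".join(out)
```

Running the simulator over perfect reads derived from a synthetic count matrix yields noisy reads whose length and identity distributions can then be compared against real data.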


MiNEApy: enhancing enrichment network analysis in metabolic networks
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Modeling genome-scale metabolic networks (GEMs) helps understand metabolic fluxes in cells at a specific state under defined environmental conditions or perturbations. Elementary flux modes (EFMs) are powerful tools for simplifying complex metabolic networks into smaller, more manageable pathways. However, the enumeration of all EFMs, especially within GEMs, poses significant challenges due to computational complexity. Additionally, traditional EFM approaches often fail to capture essential aspects of metabolism, such as co-factor balancing and by-product generation. The previously developed Minimum Network Enrichment Analysis (MiNEA) method addresses these limitations by enumerating alternative minimal networks for given biomass building blocks and metabolic tasks. MiNEA facilitates a deeper understanding of metabolic task flexibility and context-specific metabolic routes by integrating condition-specific transcriptomics, proteomics, and metabolomics data. This approach offers significant improvements in the analysis of metabolic pathways, providing more comprehensive insights into cellular metabolism.<div class="boxTitle">Results</div>Here, I present MiNEApy, a Python package reimplementation of MiNEA, which computes minimal networks and performs enrichment analysis. I demonstrate the application of MiNEApy on both a small-scale and a genome-scale model of the bacterium <span style="font-style:italic;">Escherichia coli</span>, showcasing its ability to conduct minimal network enrichment analysis using minimal networks and context-specific data.<div class="boxTitle">Availability and implementation</div>MiNEApy can be accessed at: <a href="https://github.com/vpandey-om/mineapy">https://github.com/vpandey-om/mineapy</a>.</span>
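Enrichment of a gene set within a minimal network is conventionally scored with a hypergeometric (over-representation) test; this generic sketch illustrates that standard statistic and is not necessarily MiNEA's exact formulation:

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the probability of drawing
    at least k annotated items when sampling n items (e.g. the reactions or
    genes of a minimal network) from a universe of N containing K annotated
    items (e.g. a condition-specific gene set)."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)
```

A small p-value indicates that the minimal network contains more members of the gene set than expected by chance at that network size.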


EpicPred: predicting phenotypes driven by epitope-binding TCRs using attention-based multiple instance learning
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Correctly identifying epitope-binding T-cell receptors (TCRs) is important both for understanding the biological mechanisms underlying a given phenotype and for developing T-cell-mediated immunotherapies. Although the importance of the CDR3 region in TCRs for epitope recognition is well recognized, methods for profiling their interactions in association with a certain disease or phenotype remain less studied. We developed EpicPred to identify phenotype-specific TCR–epitope interactions. EpicPred first predicts and removes unlikely TCR–epitope interactions to reduce false positives using Open-set Recognition (OSR). Subsequently, multiple instance learning was used to identify TCR–epitope interactions specific to a cancer type or to COVID-19 severity levels.<div class="boxTitle">Results</div>From six public TCR databases, 244 552 TCR sequences and 105 unique epitopes were used to predict epitope-binding TCRs and to filter out non-epitope-binding TCRs using the OSR method. The predicted interactions were used to further predict the phenotype groups in two cancer and four COVID-19 TCR-seq datasets of both bulk and single-cell resolution. EpicPred outperformed the competing methods in predicting the phenotypes, achieving an average AUROC of 0.80 ± 0.07.<div class="boxTitle">Availability and implementation</div>The EpicPred Software is available at <a href="https://github.com/jaeminjj/EpicPred">https://github.com/jaeminjj/EpicPred</a>.</span>
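Attention-based multiple instance learning pools a bag of instance embeddings (here, the TCRs of one patient) into a single bag representation via learned attention weights. This is a generic NumPy sketch of the mechanism in the style of Ilse et al.'s attention MIL, not EpicPred's exact architecture; `w` and `V` stand in for learned parameters:

```python
import numpy as np

def attention_mil_pool(instances, w, V):
    """Attention pooling: each instance (e.g. one TCR embedding) receives a
    weight; the bag embedding is the weighted sum.
    instances: (n, d); V: (h, d) projection; w: (h,) scoring vector."""
    scores = w @ np.tanh(V @ instances.T)      # (n,) unnormalized scores
    a = np.exp(scores - scores.max())
    a /= a.sum()                               # softmax attention weights
    return a @ instances, a                    # bag embedding, weights
```

The attention weights are interpretable: instances with high weight are the TCRs the model considers most predictive of the bag-level phenotype.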


Tribus: semi-automated discovery of cell identities and phenotypes from multiplexed imaging and proteomic data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Multiplexed imaging and single-cell analysis are increasingly applied to investigate the tissue spatial ecosystems in cancer and other complex diseases. Accurate single-cell phenotyping based on marker combinations is a critical but challenging task due to (i) low reproducibility across experiments with manual thresholding, and, (ii) labor-intensive ground-truth expert annotation required for learning-based methods.<div class="boxTitle">Results</div>We developed Tribus, an interactive knowledge-based classifier for multiplexed images and proteomic datasets that avoids hard-set thresholds and manual labeling. We demonstrated that Tribus recovers fine-grained cell types, matching the gold standard annotations by human experts. Additionally, Tribus can target ambiguous populations and discover phenotypically distinct cell subtypes. Through benchmarking against three similar methods in four public datasets with ground truth labels, we show that Tribus outperforms other methods in accuracy and computational efficiency, reducing runtime by an order of magnitude. Finally, we demonstrate the performance of Tribus in rapid and precise cell phenotyping with two large in-house whole-slide imaging datasets.<div class="boxTitle">Availability and implementation</div>Tribus is available at <a href="https://github.com/farkkilab/tribus">https://github.com/farkkilab/tribus</a> as an open-source Python package.</span>


Sul-BertGRU: an ensemble deep learning method integrating information entropy-enhanced BERT and directional multi-GRU for S-sulfhydration sites prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>S-sulfhydration, a crucial post-translational protein modification, is pivotal in cellular recognition, signaling processes, and the development and progression of cardiovascular and neurological disorders, so identifying S-sulfhydration sites is crucial for studies in cell biology. Deep learning shows high efficiency and accuracy in identifying protein sites compared to traditional methods, which often lack the sensitivity and specificity to accurately distinguish non-sulfhydration sites. Therefore, we employ deep learning methods to tackle the challenge of pinpointing S-sulfhydration sites.<div class="boxTitle">Results</div>In this work, we introduce a deep learning approach called Sul-BertGRU, designed specifically for predicting S-sulfhydration sites in proteins, which integrates a multi-directional gated recurrent unit (GRU) with BERT. First, Sul-BertGRU proposes an information entropy-enhanced BERT (IE-BERT) to preprocess protein sequences and extract initial features. Subsequently, confidence learning is employed to eliminate potential S-sulfhydration samples from the non-sulfhydration samples and select reliable negative samples. Then, considering the directional nature of the modification process, protein sequences are categorized into left, right, and full sequences centered on cysteines. We build a multi-directional GRU to enhance the extraction of directional sequence features and model the details of the enzymatic reaction involved in S-sulfhydration. Ultimately, we apply a parallel multi-head self-attention mechanism alongside a convolutional neural network to deeply analyze sequence features that might be missed at a local level. Sul-BertGRU achieves sensitivity, specificity, precision, accuracy, Matthews correlation coefficient, and area under the curve scores of 85.82%, 68.24%, 74.80%, 77.44%, 55.13%, and 77.03%, respectively. 
Sul-BertGRU demonstrates exceptional performance and proves to be a reliable method for predicting protein S-sulfhydration sites.<div class="boxTitle">Availability and implementation</div>The source code and data are available at <a href="https://github.com/Severus0902/Sul-BertGRU/">https://github.com/Severus0902/Sul-BertGRU/</a>.</span>


SimMS: a GPU-accelerated cosine similarity implementation for tandem mass spectrometry
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Untargeted metabolomics involves a large-scale comparison of the fragmentation pattern of a mass spectrum against a database containing known spectra. Given the number of comparisons involved, this step can be time-consuming.<div class="boxTitle">Results</div>In this work, we present a GPU-accelerated cosine similarity implementation for Tandem Mass Spectrometry (MS), with an approximately 1000-fold speedup compared to the MatchMS reference implementation, without any loss of accuracy. This improvement enables repository-scale spectral library matching for compound identification without the need for large compute clusters. This impact extends to any spectral comparison-based methods such as molecular networking approaches and analogue search.<div class="boxTitle">Availability and implementation</div>All supporting code, results, and notebooks are freely available under the MIT license at <a href="https://github.com/pangeAI/simms/">https://github.com/pangeAI/simms/</a>.</span>
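The core computation, an all-vs-all cosine similarity between spectra, vectorizes into a single matrix product once spectra lie on a common representation, which is what makes GPU acceleration natural. The binned sketch below is a simplified approximation for illustration, not the peak-matching cosine that MatchMS/SimMS actually compute:

```python
import numpy as np

def bin_spectrum(mz, intensity, bin_width=0.1, max_mz=1000.0):
    """Bin a peak list (m/z, intensity) onto a fixed m/z grid."""
    vec = np.zeros(int(max_mz / bin_width))
    idx = (np.asarray(mz) / bin_width).astype(int)
    np.add.at(vec, idx, intensity)          # accumulate peaks falling in a bin
    return vec

def cosine_scores(queries, references):
    """All-vs-all cosine similarity between binned spectra.
    queries: (q, b), references: (r, b) -> (q, r) score matrix.
    The same expression runs on GPU by swapping NumPy for CuPy/PyTorch."""
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    rn = references / np.linalg.norm(references, axis=1, keepdims=True)
    return qn @ rn.T
```

A single dense matrix multiply compares every query against every library spectrum at once, which is exactly the shape of workload GPUs excel at.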


HTSinfer: inferring metadata from bulk Illumina RNA-Seq libraries
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>The Sequencing Read Archive is one of the largest and fastest-growing repositories of sequencing data, containing tens of petabytes of sequenced reads. Its data are used by a wide scientific community, often beyond the primary studies that generated them. Such analyses rely on accurate metadata concerning the type of experiment and library, as well as the organism from which the sequenced reads were derived. These metadata are typically entered manually by contributors in an error-prone process, and are frequently incomplete. In addition, easy-to-use computational tools that verify the consistency and completeness of metadata describing the libraries, thereby facilitating data reuse, are largely unavailable. Here, we introduce HTSinfer, a Python-based tool to infer metadata directly and solely from bulk RNA-sequencing data generated on Illumina platforms. HTSinfer leverages genome sequence information and diagnostic genes to rapidly and accurately infer the library source and library type, as well as the relative read orientation, 3′ adapter sequence and read length statistics. HTSinfer is written in a modular manner, published under a permissive free and open-source license and encourages contributions by the community, enabling easy addition of new functionalities, e.g. for the inference of additional metrics, or the support of different experiment types or sequencing platforms.<div class="boxTitle">Availability and implementation</div>HTSinfer is released under the Apache License 2.0. Latest code is available via GitHub at <a href="https://github.com/zavolanlab/htsinfer">https://github.com/zavolanlab/htsinfer</a>, while releases are published on Bioconda. A snapshot of the HTSinfer version described in this article was deposited at Zenodo at 10.5281/zenodo.13985958.</span>
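Inferring the 3′ adapter, for example, can be approached by counting how often each candidate adapter's prefix occurs in the reads and accepting the best candidate above a threshold. This is a simplified sketch of the idea, not HTSinfer's actual procedure; the function name and threshold are hypothetical:

```python
def infer_adapter(reads, candidates, min_fraction=0.3):
    """Guess the 3' adapter: the candidate whose 12-base prefix occurs in
    the largest fraction of reads, if that fraction clears a threshold."""
    best, best_frac = None, 0.0
    for adapter in candidates:
        probe = adapter[:12]
        frac = sum(probe in r for r in reads) / len(reads)
        if frac > best_frac:
            best, best_frac = adapter, frac
    return best if best_frac >= min_fraction else None
```

Returning `None` when no candidate is frequent enough mirrors the general principle of reporting "undetermined" rather than guessing, which matters for metadata verification.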


MOSTPLAS: a self-correction multi-label learning model for plasmid host range prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Plasmids play an essential role in horizontal gene transfer, aiding their host bacteria in acquiring beneficial traits like antibiotic and metal resistance. Some plasmids can transfer, replicate, or persist in multiple organisms. Identifying the relatively complete host range of these plasmids provides insights into how plasmids promote bacterial evolution. To achieve this, we can apply multi-label learning models for plasmid host range prediction. However, there are no databases providing the detailed and complete host labels of these broad-host-range plasmids. Without adequate well-annotated training samples, learning models can fail to extract discriminative feature representations for plasmid host prediction.<div class="boxTitle">Results</div>To address this problem, we propose a self-correction multi-label learning model called MOSTPLAS. We design a pseudo label learning algorithm and a self-correction asymmetric loss to facilitate the training of a multi-label learning model with samples containing unknown missing labels. We conducted a series of experiments on the NCBI RefSeq plasmid database, the PLSDB 2025 database, plasmids with experimentally determined host labels, the Hi-C dataset, and the DoriC dataset. The benchmark results against other plasmid host range prediction tools demonstrated that MOSTPLAS recognized more host labels while maintaining high precision.<div class="boxTitle">Availability and implementation</div>MOSTPLAS is implemented with Python, which can be downloaded at <a href="https://github.com/wzou96/MOSTPLAS">https://github.com/wzou96/MOSTPLAS</a>. All relevant data we used in the experiments can be found at <a href="https://zenodo.org/doi/10.5281/zenodo.14708999">https://zenodo.org/doi/10.5281/zenodo.14708999</a>.</span>


GCLink: a graph contrastive link prediction framework for gene regulatory network inference
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Gene regulatory networks (GRNs) unveil the intricate interactions among genes, pivotal in elucidating the complex biological processes within cells. The advent of single-cell RNA-sequencing (scRNA-seq) enables the inference of GRNs at single-cell resolution. However, the majority of current supervised network inference methods typically concentrate on predicting pairwise gene regulatory interactions, thus failing to fully exploit correlations among all genes and exhibiting limited generalization performance.<div class="boxTitle">Results</div>To address these issues, we propose a graph contrastive link prediction (GCLink) model to infer potential gene regulatory interactions from scRNA-seq data. Based on known gene regulatory interactions and scRNA-seq data, GCLink introduces a graph contrastive learning strategy to aggregate the feature and neighborhood information of genes to learn their representations. This approach reduces the dependence of our model on sample size and enhances its ability to predict potential gene regulatory interactions. Extensive experiments on real scRNA-seq datasets demonstrate that GCLink outperforms other state-of-the-art methods in most cases. Furthermore, by pretraining GCLink on a source cell line with abundant known regulatory interactions and fine-tuning it on a target cell line with a limited number of known interactions, our GCLink model exhibits good performance in GRN inference, demonstrating its effectiveness in inferring GRNs from datasets with limited known interactions.<div class="boxTitle">Availability and implementation</div>The source code and data are available at <a href="https://github.com/Yoyiming/GCLink">https://github.com/Yoyiming/GCLink</a>.</span>


TiltRec: an ultra-fast and open-source toolkit for cryo-electron tomographic reconstruction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Cryo-electron tomography (cryo-ET) has revolutionized our ability to observe structures from the subcellular to the atomic level in their native states. Achieving high-resolution reconstruction involves collecting tilt series at different angles and subsequently backprojecting them into 3D space or iteratively reconstructing them to build a 3D volume of the specimen. However, the intricate computational demands of tomographic reconstruction pose significant challenges, requiring extensive calculation times that hinder efficiency, especially with large and complex datasets.<div class="boxTitle">Results</div>We present TiltRec, an open-source toolkit that leverages the parallel capabilities of Central Processing Units and Graphics Processing Units to enhance tomographic reconstruction. TiltRec implements six classical tomographic reconstruction algorithms, utilizing optimized parallel computation strategies and advanced memory management techniques. Performance evaluations across multiple datasets of varying sizes demonstrate that TiltRec significantly improves efficiency, reducing computational times while maintaining reconstruction resolution.<div class="boxTitle">Summary</div>TiltRec effectively addresses the computational challenges associated with cryo-ET reconstruction by fully exploiting parallel acceleration. As an open-source tool, TiltRec not only facilitates extensive applications by the research community but also supports further algorithm modifications and extensions, enabling the continued development of novel algorithms.<div class="boxTitle">Availability and implementation</div>The source code, documentation, and sample data can be downloaded at <a href="https://github.com/icthrm/TiltRec">https://github.com/icthrm/TiltRec</a>.</span>


Single-cell copy number calling and event history reconstruction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Copy number alterations are driving forces of tumour development and the emergence of intra-tumour heterogeneity. A comprehensive picture of these genomic aberrations is therefore essential for the development of personalised and precise cancer diagnostics and therapies. Single-cell sequencing offers the highest resolution for copy number profiling down to the level of individual cells. Recent high-throughput protocols allow for the processing of hundreds of cells through shallow whole-genome DNA sequencing. The resulting low read-depth data poses substantial statistical and computational challenges to the identification of copy number alterations.<div class="boxTitle">Results</div>We developed SCICoNE, a statistical model and MCMC algorithm tailored to single-cell copy number profiling from shallow whole-genome DNA sequencing data. SCICoNE reconstructs the history of copy number events in the tumour and uses these evolutionary relationships to identify the copy number profiles of the individual cells. We show the accuracy of this approach in evaluations on simulated data and demonstrate its practicability in applications to two breast cancer samples from different sequencing protocols.<div class="boxTitle">Availability and implementation</div>SCICoNE is available at <a href="https://github.com/cbg-ethz/SCICoNE">https://github.com/cbg-ethz/SCICoNE</a>.</span>


Efficient storage and regression computation for population-scale genome sequencing studies
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The growing availability of large-scale population biobanks has the potential to significantly advance our understanding of human health and disease. However, the massive computational and storage demands of whole genome sequencing (WGS) data pose serious challenges, particularly for underfunded institutions or researchers in developing countries. This disparity in resources can limit equitable access to cutting-edge genetic research.<div class="boxTitle">Results</div>We present novel algorithms and regression methods that dramatically reduce both computation time and storage requirements for WGS studies, with particular attention to rare variant representation. By integrating these approaches into PLINK 2.0, we demonstrate substantial gains in efficiency without compromising analytical accuracy. In an exome-wide association analysis of 19.4 million variants for the body mass index phenotype in 125 077 individuals (All of Us project data), we reduced runtime from 695.35 min (11.5 h) on a single machine to 1.57 min with 30 GB of memory and 50 threads (or 8.67 min with 4 threads). Additionally, the framework supports multi-phenotype analyses, further enhancing its flexibility.<div class="boxTitle">Availability and implementation</div>Our optimized methods are fully integrated into PLINK 2.0 and can be accessed at: <a href="https://www.cog-genomics.org/plink/2.0/">https://www.cog-genomics.org/plink/2.0/</a>.</span>
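A key storage idea for rare variants is to keep only carrier indices and dosages, so both storage and per-variant statistics scale with the number of carriers rather than the cohort size N. This minimal sketch illustrates the principle only; it is not PLINK 2.0's internal format:

```python
import numpy as np

def to_sparse_rare(genotypes):
    """Store a rare-variant genotype column as (carrier indices, dosages)
    instead of a dense length-N vector: O(#carriers) memory."""
    g = np.asarray(genotypes)
    idx = np.flatnonzero(g)        # individuals carrying at least one allele
    return idx, g[idx]

def score_numerator(idx, dos, residuals):
    """Score-test numerator sum_i g_i * r_i touches only the carriers."""
    return float(dos @ residuals[idx])
```

For a variant with 50 carriers in a cohort of 125 077 individuals, the sum involves 50 multiplications instead of 125 077, which is where much of the speedup for rare-variant association testing comes from.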


ImmunoTar—integrative prioritization of cell surface targets for cancer immunotherapy
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Cancer remains a leading cause of mortality globally. Recent improvements in survival have been facilitated by the development of targeted and less toxic immunotherapies, such as chimeric antigen receptor (CAR)-T cells and antibody-drug conjugates (ADCs). These therapies, effective in treating both pediatric and adult patients with solid and hematological malignancies, rely on the identification of cancer-specific surface protein targets. While technologies like RNA sequencing and proteomics exist to survey these targets, identifying optimal targets for immunotherapies remains a challenge in the field.<div class="boxTitle">Results</div>To address this challenge, we developed ImmunoTar, a novel computational tool designed to systematically prioritize candidate immunotherapeutic targets. ImmunoTar integrates user-provided RNA-sequencing or proteomics data with quantitative features from multiple public databases, selected based on predefined criteria, to generate a score representing the gene’s suitability as an immunotherapeutic target. We validated ImmunoTar using three distinct cancer datasets, demonstrating its effectiveness in identifying both known and novel targets across various cancer phenotypes. By compiling diverse data into a unified platform, ImmunoTar enables comprehensive evaluation of surface proteins, streamlining target identification and empowering researchers to efficiently allocate resources, thereby accelerating the development of effective cancer immunotherapies.<div class="boxTitle">Availability and implementation</div>Code and data to run and test ImmunoTar are available at <a href="https://github.com/sacanlab/immunotar">https://github.com/sacanlab/immunotar</a>.</span>


APNet, an explainable sparse deep learning model to discover differentially active drivers of severe COVID-19
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Computational analyses of bulk and single-cell omics provide translational insights into complex diseases, such as COVID-19, by revealing molecules, cellular phenotypes, and signalling patterns that contribute to unfavourable clinical outcomes. Current in silico approaches dovetail differential abundance, biostatistics, and machine learning, but often overlook nonlinear proteomic dynamics, like post-translational modifications, and provide limited biological interpretability beyond feature ranking.<div class="boxTitle">Results</div>We introduce APNet, a novel computational pipeline that combines differential activity analysis based on SJARACNe co-expression networks with PASNet, a biologically informed sparse deep learning model, to perform explainable predictions for COVID-19 severity. The APNet driver-pathway network ingests SJARACNe co-regulation and classification weights to aid result interpretation and hypothesis generation. APNet outperforms alternative models in patient classification across three COVID-19 proteomic datasets, identifying predictive drivers and pathways, including some confirmed in single-cell omics and highlighting under-explored biomarker circuitries in COVID-19.<div class="boxTitle">Availability and implementation</div>APNet’s R, Python scripts, and Cytoscape methodologies are available at <a href="https://github.com/BiodataAnalysisGroup/APNet">https://github.com/BiodataAnalysisGroup/APNet</a>.</span>


Embed-Search-Align: DNA sequence alignment using Transformer models
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>DNA sequence alignment, an important genomic task, involves assigning short DNA reads to the most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results in tasks involving classification of short DNA sequences. Performance at sequence classification tasks does not, however, guarantee <span style="font-style:italic;">sequence alignment</span>, where it is necessary to conduct a genome-wide search to align every read successfully, a <span style="font-style:italic;">significantly longer-range task by comparison</span>.<div class="boxTitle">Results</div>We bridge this gap by developing an “<strong>E</strong>mbed-<strong>S</strong>earch-<strong>A</strong>lign” (ESA) framework, where a novel Reference-Free DNA Embedding (<span style="font-style:italic;">RDE</span>) Transformer model generates vector embeddings of reads and fragments of the reference in a shared vector space; the read-fragment distance is then used as a surrogate for sequence similarity. ESA introduces: (i) contrastive loss for self-supervised training of DNA sequence representations, facilitating rich reference-free, sequence-level embeddings, and (ii) a DNA vector store to enable search across fragments on a global scale. RDE is 99% accurate when aligning reads of length 250 onto a human reference genome of 3 gigabases (single-haploid), rivaling conventional algorithmic sequence alignment methods such as <span style="font-style:italic;">Bowtie</span> and <span style="font-style:italic;">BWA-Mem</span>. 
RDE far exceeds the performance of six recent DNA-Transformer model baselines such as <span style="font-style:italic;">Nucleotide Transformer</span> and <span style="font-style:italic;">Hyena-DNA</span>, and shows task transfer across chromosomes and species.<div class="boxTitle">Availability and implementation</div>Please see <a href="https://anonymous.4open.science/r/dna2vec-7E4E/readme.md">https://anonymous.4open.science/r/dna2vec-7E4E/readme.md</a>.</span>
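The search step of such a framework, retrieving the reference fragment whose embedding is nearest to a read's embedding, can be illustrated with random vectors standing in for the learned RDE embeddings (the dimensions and data here are toy assumptions, not the model's actual output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned embeddings: 8 reference fragments and one
# read, all in a shared 16-dimensional vector space.
fragments = rng.normal(size=(8, 16))
read = fragments[3] + 0.05 * rng.normal(size=16)  # read drawn near fragment 3

def cosine_top1(query, bank):
    """Return the index and similarity of the nearest vector in `bank`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    return int(np.argmax(sims)), float(sims.max())

idx, sim = cosine_top1(read, fragments)  # recovers fragment 3
```

At genome scale the paper's "DNA vector store" replaces this brute-force scan with approximate nearest-neighbor search; the distance-as-similarity principle is the same.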


DeepES: deep learning-based enzyme screening to identify orphan enzyme genes
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Progress in sequencing technology has led to determination of large numbers of protein sequences, and large enzyme databases are now available. Although many computational tools for enzyme annotation have been developed, sequence information is unavailable for many enzymes, known as orphan enzymes. These orphan enzymes hinder sequence similarity-based functional annotation, leading to gaps in understanding the association between sequences and enzymatic reactions.<div class="boxTitle">Results</div>Therefore, we developed DeepES, a deep learning-based tool for enzyme screening to identify orphan enzyme genes, focusing on biosynthetic gene clusters and reaction classes. DeepES uses protein sequences as inputs and evaluates whether the input genes contain biosynthetic gene clusters of interest by integrating the outputs of the binary classifier for each reaction class. The validation results suggested that DeepES can capture functional similarity between protein sequences, and it can be applied to explore orphan enzyme genes. By applying DeepES to 4744 metagenome-assembled genomes, we identified candidate genes for 236 orphan enzymes, including those involved in short-chain fatty acid production as a characteristic pathway in human gut bacteria.<div class="boxTitle">Availability and implementation</div>DeepES is available at <a href="https://github.com/yamada-lab/DeepES">https://github.com/yamada-lab/DeepES</a>. Model weights and the candidate genes are available at Zenodo (<a href="https://doi.org/10.5281/zenodo.11123900">https://doi.org/10.5281/zenodo.11123900</a>).</span>
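The integration step, combining per-reaction-class binary classifier outputs into a cluster-level decision, might look like the following. This is our reading of the abstract, not the authors' exact rule; the reaction-class names, probabilities, and 0.5 threshold are all made up for illustration:

```python
# Hypothetical reaction classes making up a target biosynthetic pathway.
pathway_classes = ["RC_A", "RC_B", "RC_C"]

# Per-gene probabilities from the per-class binary classifiers (toy values).
cluster_scores = {
    "gene1": {"RC_A": 0.92, "RC_B": 0.10, "RC_C": 0.05},
    "gene2": {"RC_A": 0.08, "RC_B": 0.88, "RC_C": 0.15},
    "gene3": {"RC_A": 0.12, "RC_B": 0.20, "RC_C": 0.81},
}

def cluster_hit(scores, classes, threshold=0.5):
    """Flag a gene cluster if every reaction class in the pathway is
    covered by at least one gene scoring above the threshold."""
    return all(
        any(g[rc] >= threshold for g in scores.values()) for rc in classes
    )

hit = cluster_hit(cluster_scores, pathway_classes)  # True for this toy cluster
```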


MUSET: set of utilities for constructing abundance unitig matrices from sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>MUSET is a novel set of utilities designed to efficiently construct abundance unitig matrices from sequencing data. Unitig matrices extend the concept of k-mer matrices by merging overlapping k-mers that unambiguously belong to the same sequence. MUSET addresses the limitations of current software by integrating k-mer counting and unitig extraction to generate unitig matrices containing abundance values, as opposed to only presence–absence in previous tools. These matrices preserve variations between samples while reducing disk space and the number of rows compared to k-mer matrices. We evaluated MUSET’s performance using datasets derived from a 618-GB collection of ancient oral sequencing samples, producing a filtered unitig matrix that records abundances in &lt;10 h using 20 GB of memory.<div class="boxTitle">Availability and implementation</div>MUSET is open source and publicly available under the AGPL-3.0 licence on GitHub at <a href="https://github.com/CamilaDuitama/muset">https://github.com/CamilaDuitama/muset</a>. Source code is implemented in C++ and provided with kmat_tools, a collection of tools for processing k-mer matrices. Version v0.5.1 is available on Zenodo with DOI 10.5281/zenodo.14164801.</span>
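The unitig idea can be shown in a few lines: consecutive k-mers that overlap by k-1 characters collapse into a single sequence, and their per-sample count rows collapse into one abundance row. This toy sketch only illustrates the concept; MUSET's actual pipeline operates on real k-mer counting output at scale:

```python
import numpy as np

def kmers_to_unitig(ordered_kmers):
    """Collapse k-mers that chain by unambiguous (k-1)-overlaps into one unitig."""
    unitig = ordered_kmers[0]
    for km in ordered_kmers[1:]:
        assert unitig.endswith(km[:-1])  # verify the k-1 overlap
        unitig += km[-1]
    return unitig

kmers = ["ACGT", "CGTA", "GTAC"]  # consecutive 4-mers from one sequence
unitig = kmers_to_unitig(kmers)   # "ACGTAC"

# Per-sample counts of the three k-mers (rows) across three samples
# (columns); the unitig gets a single abundance row summarizing them,
# here by the mean (the summary statistic is an illustrative choice).
counts = np.array([[5, 0, 3], [5, 1, 3], [4, 0, 3]])
unitig_abundance = counts.mean(axis=0)
```

Three matrix rows become one, which is how unitig matrices cut row counts and disk space relative to raw k-mer matrices while retaining per-sample abundance.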


MMnc: multi-modal interpretable representation for non-coding RNA classification and class annotation
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>As the biological roles and disease implications of non-coding RNAs continue to emerge, the need to thoroughly characterize previously unexplored non-coding RNAs becomes increasingly urgent. These molecules hold potential as biomarkers and therapeutic targets. However, the vast and complex nature of non-coding RNA data presents a challenge. We introduce MMnc, an interpretable deep-learning approach designed to classify non-coding RNAs into functional groups. MMnc leverages multiple data sources—such as the sequence, secondary structure, and expression—using attention-based multi-modal data integration. This ensures the learning of meaningful representations while accounting for missing sources in some samples.<div class="boxTitle">Results</div>Our findings demonstrate that MMnc achieves high classification accuracy across diverse non-coding RNA classes. The method’s modular architecture allows for the consideration of multiple types of modalities, whereas other tools consider at most one or two. MMnc is resilient to missing data, ensuring that all available information is effectively utilized. Importantly, the generated attention scores offer interpretable insights into the underlying patterns of the different non-coding RNA classes, potentially driving future non-coding RNA research and applications.<div class="boxTitle">Availability and implementation</div>Data and source code can be found at EvryRNA.ibisc.univ-evry.fr/EvryRNA/MMnc.</span>
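Attention-based fusion that tolerates missing modalities can be sketched as follows: attention weights are computed only over the modalities present for a sample, so a missing source simply drops out of the weighted sum. The embeddings, logits, and dimensions are invented for illustration and are not MMnc's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Per-modality embeddings for one ncRNA sample; None marks a missing source.
modalities = {
    "sequence": np.array([0.2, 0.9, 0.4]),
    "structure": np.array([0.7, 0.1, 0.5]),
    "expression": None,  # missing for this sample
}
# Hypothetical attention logits, e.g. produced by a learned scoring network.
attn_logits = {"sequence": 1.2, "structure": 0.3, "expression": 0.8}

# Attend only over available modalities, then fuse by weighted sum.
present = [m for m, v in modalities.items() if v is not None]
weights = softmax(np.array([attn_logits[m] for m in present]))
fused = sum(w * modalities[m] for w, m in zip(weights, present))
```

The renormalized weights are also what makes such a model interpretable: they report how much each available modality contributed to the fused representation.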


Privacy-preserving framework for genomic computations via multi-key homomorphic encryption
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The affordability of genome sequencing and the widespread availability of genomic data have opened up new medical possibilities. Nevertheless, they also raise significant concerns regarding privacy due to the sensitive information they encompass. These privacy implications act as barriers to medical research and data availability. Researchers have proposed privacy-preserving techniques to address this, with cryptography-based methods showing the most promise. However, existing cryptography-based designs lack (i) interoperability, (ii) scalability, (iii) a high degree of privacy (i.e. they compromise one property to achieve another), or (iv) support for multiparty analyses (as most existing schemes process each party’s genomic information individually). Overcoming these limitations is essential to unlocking the full potential of genomic data while ensuring privacy and data utility. Further research and development are needed to advance privacy-preserving techniques in genomics, focusing on achieving interoperability and scalability, preserving data utility, and enabling secure multiparty computation.<div class="boxTitle">Results</div>This study aims to overcome the limitations of current cryptography-based techniques by employing a multi-key homomorphic encryption scheme. By utilizing this scheme, we have developed a comprehensive protocol capable of conducting diverse genomic analyses. Our protocol supports interoperable processing of individual genomes and enables multiparty tests, analyses of genomic databases, and operations involving multiple databases. 
Consequently, our approach represents an innovative advancement in secure genomic data processing, offering enhanced protection and privacy measures.<div class="boxTitle">Availability and implementation</div>All associated code and documentation are available at <a href="https://github.com/farahpoor/smkhe">https://github.com/farahpoor/smkhe</a>.</span>
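The multiparty goal can be illustrated with a much simpler primitive, additive secret sharing. To be clear, this is not the multi-key homomorphic encryption the paper employs; it only shows the shape of the problem: computing an aggregate over private inputs without any party revealing its own value:

```python
import random

# Work in a prime field so shares wrap around instead of leaking magnitude.
MOD = 2**61 - 1

def share(value, n_parties):
    """Split a private value into n random shares that sum to it mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Each party's private genomic statistic (e.g. a local allele count).
private_counts = [12, 7, 30]
all_shares = [share(v, 3) for v in private_counts]

# Parties exchange shares column-wise; summing the column totals
# reconstructs the aggregate while no single share reveals an input.
total = sum(sum(col) % MOD for col in zip(*all_shares)) % MOD  # 49
```

Homomorphic encryption achieves the same end (computation over hidden inputs) with much richer operations and without requiring every party to stay online, which is why the paper builds on a multi-key variant of it.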


Relative quantification of proteins and post-translational modifications in proteomic experiments with shared peptides: a weight-based approach
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across conditions. Since the currency of these experiments is peptides, i.e. subsets of protein sequences that carry the quantitative information, conclusions at a different level must be computationally inferred. The inference is particularly challenging in situations where the peptides are shared by multiple proteins or post-translational modifications. While many approaches infer the underlying abundances from unique peptides, there is a need to distinguish the quantitative patterns when peptides are shared.<div class="boxTitle">Results</div>We propose a statistical approach for estimating protein abundances, as well as site occupancies of post-translational modifications, based on quantitative information from shared peptides. The approach treats the quantitative patterns of shared peptides as convex combinations of abundances of individual proteins or modification sites, and estimates the abundance of each source in a sample together with the weights of the combination. In simulation-based evaluations, the proposed approach improved the precision of estimated fold changes between conditions. We further demonstrated the practical utility of the approach in experiments with diverse biological objectives, ranging from protein degradation and thermal proteome stability to changes in protein post-translational modifications.<div class="boxTitle">Availability and implementation</div>The approach is implemented in an open-source R package MSstatsWeightedSummary. The package is currently available at <a href="https://github.com/Vitek-Lab/MSstatsWeightedSummary">https://github.com/Vitek-Lab/MSstatsWeightedSummary</a> (doi: 10.5281/zenodo.14662989). 
Code required to reproduce the results presented in this article can be found in a repository <a href="https://github.com/mstaniak/MWS_reproduction">https://github.com/mstaniak/MWS_reproduction</a> (doi: 10.5281/zenodo.14656053).</span>
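The convex-combination idea can be demonstrated on synthetic data: a shared peptide's profile is a weighted mix of two protein profiles, and the weights are recoverable by constrained regression. The numbers are invented, and the estimator below (least squares with clipping and renormalization) is a simplified stand-in for the package's actual procedure:

```python
import numpy as np

# Two proteins' abundance profiles across five samples (illustrative values).
protein_A = np.array([10.0, 10.5, 11.0, 12.0, 12.5])
protein_B = np.array([12.0, 11.5, 11.0, 10.0, 9.5])

# A shared peptide whose signal is a convex combination of the two
# sources, here with true weights 0.7 and 0.3.
shared = 0.7 * protein_A + 0.3 * protein_B

# Recover the weights: ordinary least squares, then clip to nonnegative
# values and renormalize so the estimates form a convex combination.
X = np.column_stack([protein_A, protein_B])
w, *_ = np.linalg.lstsq(X, shared, rcond=None)
w = np.clip(w, 0.0, None)
w = w / w.sum()  # recovers approximately [0.7, 0.3]
```

With noisy real data the estimation is harder and must be done jointly across many peptides, which is the statistical contribution the abstract describes.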


MetAssimulo 2.0: a web app for simulating realistic 1D and 2D metabolomic 1H NMR spectra
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Metabolomics extensively utilizes nuclear magnetic resonance (NMR) spectroscopy due to its excellent reproducibility and high throughput. Both 1D and 2D NMR spectra provide crucial information for metabolite annotation and quantification, yet present complex overlapping patterns which may require sophisticated machine learning algorithms to decipher. Unfortunately, the limited availability of labeled spectra can hamper application of machine learning, especially deep learning algorithms which require large amounts of labeled data. In this context, simulation of spectral data becomes a tractable solution for algorithm development.<div class="boxTitle">Results</div>Here, we introduce MetAssimulo 2.0, a comprehensive upgrade of the MetAssimulo 1.0 metabolomic <sup>1</sup>H NMR simulation tool, reimplemented as a Python-based web application. Where MetAssimulo 1.0 only simulated 1D <sup>1</sup>H spectra of human urine, MetAssimulo 2.0 expands functionality to urine, blood, and cerebrospinal fluid, enhancing the realism of blood spectra by incorporating a broad protein background. This enhancement enables a closer approximation to real blood spectra, achieving a Pearson correlation of approximately 0.82. Moreover, this tool now includes simulation capabilities for 2D <span style="font-style:italic;">J</span>-resolved (<span style="font-style:italic;">J</span>-Res) and Correlation Spectroscopy spectra, significantly broadening its utility in complex mixture analysis. MetAssimulo 2.0 simulates both single spectra and groups of spectra with both discrete (case–control, e.g. heart transplant versus healthy) and continuous (e.g. body mass index) outcomes and includes inter-metabolite correlations. 
It thus supports a range of experimental designs and demonstrates associations between metabolite profiles and biomedical responses. By enhancing NMR spectral simulations, MetAssimulo 2.0 is well positioned to support and enhance research at the intersection of deep learning and metabolomics.<div class="boxTitle">Availability and implementation</div>The code and a detailed tutorial for MetAssimulo 2.0 are available at <a href="https://github.com/yanyan5420/MetAssimulo_2.git">https://github.com/yanyan5420/MetAssimulo_2.git</a>. The relevant NMR spectra for metabolites are deposited in MetaboLights with accession number MTBLS12081.</span>


SP-DTI: subpocket-informed transformer for drug–target interaction prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Drug–target interaction (DTI) prediction is crucial for drug discovery, significantly reducing costs and time in experimental searches across vast drug compound spaces. While deep learning has advanced DTI prediction accuracy, challenges remain: (i) existing methods often lack generalizability, with performance dropping significantly on unseen proteins and cross-domain settings; and (ii) current molecular relational learning often overlooks subpocket-level interactions, which are vital for a detailed understanding of binding sites.<div class="boxTitle">Results</div>We introduce SP-DTI, a subpocket-informed transformer model designed to address these challenges through: (i) detailed subpocket analysis using the Cavity Identification and Analysis Routine for interaction modeling at both global and local levels, and (ii) integration of pre-trained language models into graph neural networks to encode drugs and proteins, enhancing generalizability to unlabeled data. Benchmark evaluations show that SP-DTI consistently outperforms state-of-the-art models, achieving an area under the receiver operating characteristic curve of 0.873 in unseen protein settings, an 11% improvement over the best baseline.<div class="boxTitle">Availability and implementation</div>The model scripts are available at <a href="https://github.com/Steven51516/SP-DTI">https://github.com/Steven51516/SP-DTI</a>.</span>


FlowPacker: protein side-chain packing with torsional flow matching
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Accurate prediction of protein side-chain conformations is necessary to understand protein folding, protein–protein interactions and facilitate <span style="font-style:italic;">de novo</span> protein design.<div class="boxTitle">Results</div>Here, we apply torsional flow matching and equivariant graph attention to develop FlowPacker, a fast and performant model to predict protein side-chain conformations conditioned on the protein sequence and backbone. We show that FlowPacker outperforms previous state-of-the-art baselines across most metrics with improved runtime. We further show that FlowPacker can be used to inpaint missing side-chain coordinates and also for multimeric targets, and exhibits strong performance on a test set of antibody–antigen complexes.<div class="boxTitle">Availability and implementation</div>Code is available at <a href="https://gitlab.com/mjslee0921/flowpacker">https://gitlab.com/mjslee0921/flowpacker</a>.</span>


Vcfexpress: flexible, rapid user-expressions to filter and format VCFs
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Variant call format (VCF) files are the standard output format for various software tools that identify genetic variation from DNA sequencing experiments. Downstream analyses require the ability to query, filter, and modify them simply and efficiently. Several tools are available to perform these operations from the command line, including BCFTools, vembrane, slivar, and others.<div class="boxTitle">Results</div>Here, we introduce vcfexpress, a new, high-performance toolset for the analysis of VCF files, written in the Rust programming language. It is nearly as fast as BCFTools, but adds functionality to execute user expressions in the Lua programming language for precise filtering and reporting of variants from a VCF or BCF file. We demonstrate performance and flexibility by comparing vcfexpress to other tools using the vembrane benchmark.<div class="boxTitle">Availability and implementation</div>vcfexpress is available under the MIT license at <a href="https://github.com/brentp/vcfexpress">https://github.com/brentp/vcfexpress</a> with code used for the manuscript deposited in <a href="https://doi.org/10.5281/zenodo.14756838">https://doi.org/10.5281/zenodo.14756838</a>.</span>