Bioinformatics - current issue
GUEST: an R package for handling estimation of graphical structure and multiclassification for error-prone gene expression data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>In bioinformatics studies, understanding the network structure of gene expression variables is one of the main interests. In the framework of data science, graphical models have been widely used to characterize the dependence structure among multivariate random variables. However, gene expression data may suffer from ultrahigh-dimensionality and measurement error, which make detection of the network structure challenging. Another important application of gene expression variables is to provide information to classify subjects into various tumors or diseases. In supervised learning, while linear discriminant analysis is a commonly used approach, its conventional implementation is limited to precisely measured variables and requires computation of their inverse covariance matrix, known as the precision matrix. To tackle those challenges and provide a reliable estimation procedure for public use, we develop the R package GUEST, which stands for <strong><span style="font-style:italic;">G</span></strong>raphical models for <strong><span style="font-style:italic;">U</span></strong>ltrahigh-dimensional and <strong><span style="font-style:italic;">E</span></strong>rror-prone data by the boo<strong><span style="font-style:italic;">ST</span></strong>ing algorithm. This R package deals with measurement error effects in high-dimensional variables under various distributions and then applies the boosting algorithm to identify the network structure and estimate the precision matrix. 
When the precision matrix is estimated, it can be used to construct the linear discriminant function and improve the accuracy of the classification.<div class="boxTitle">Availability and implementation</div>The R package is available on <a href="https://cran.r-project.org/web/packages/GUEST/index.html">https://cran.r-project.org/web/packages/GUEST/index.html</a>.</span>
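The classification step described above can be sketched in a few lines: given an estimated precision matrix, the two-class linear discriminant rule follows directly. This is an illustrative numpy sketch of standard LDA, not the GUEST API; all names here are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the GUEST API): once a precision matrix Omega
# has been estimated, the two-class linear discriminant direction is
# w = Omega @ (mu1 - mu0).
def lda_classify(x, mu0, mu1, omega, prior0=0.5, prior1=0.5):
    """Return 1 if x is assigned to class 1, else 0."""
    w = omega @ (mu1 - mu0)
    # Midpoint threshold, shifted by the log-ratio of the class priors.
    threshold = 0.5 * w @ (mu0 + mu1) - np.log(prior1 / prior0)
    return 1 if x @ w > threshold else 0
```

A more accurate precision matrix estimate sharpens `w`, which is why improved network estimation feeds directly into improved classification.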


VSS-Hi-C: variance-stabilized signals for chromatin contacts
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The genome-wide chromosome conformation capture assay Hi-C is widely used to study chromatin 3D structures and their functional implications. Read counts from Hi-C indicate the strength of chromatin contact between each pair of genomic loci. These read counts are heteroskedastic: a difference between interaction frequencies of 0 and 100 is much more significant than a difference between 1000 and 1100. This property impedes visualization and downstream analysis because it violates the Gaussian assumption of many computational tools. Thus, heuristic variance-stabilizing transformations, such as the shifted-log transformation, are typically applied to the data before visualization or input to models with Gaussian assumptions. However, such heuristic transformations cannot fully stabilize the variance because of their restrictive assumptions about the mean–variance relationship in the data.<div class="boxTitle">Results</div>Here, we present VSS-Hi-C, a data-driven variance stabilization method for Hi-C data. We show that VSS-Hi-C signals have unit variance, improving visualization of Hi-C data, for example in contact map heatmaps. VSS-Hi-C signals also improve the performance of subcompartment callers relying on Gaussian observations. VSS-Hi-C is implemented as an R package and can be used for variance stabilization of different genomic and epigenomic data types with two replicates available.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/nedashokraneh/vssHiC">https://github.com/nedashokraneh/vssHiC</a>.</span>
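The shifted-log transform mentioned above is easy to make concrete: adding a pseudocount before taking the log compresses large counts, so equal-sized gaps matter less at high interaction frequencies. A minimal sketch (the pseudocount value is an assumption; VSS-Hi-C itself learns the mean–variance relationship from the data instead):

```python
import numpy as np

# Heuristic shifted-log transform: add a fixed pseudocount, then log.
# Large counts are compressed, so a gap of 100 near zero stays large
# while the same gap near 1000 almost vanishes.
def shifted_log(counts, shift=1.0):
    return np.log2(np.asarray(counts, dtype=float) + shift)

low_gap = shifted_log(100) - shifted_log(0)      # log2(101) - log2(1)
high_gap = shifted_log(1100) - shifted_log(1000)  # log2(1101) - log2(1001)
```

The restrictive part is the fixed `shift`: it implicitly assumes one particular mean–variance relationship, which is exactly the limitation the data-driven approach addresses.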


Knowledge mining of brain connectivity in massive literature based on transfer learning
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Neuroscientists have long endeavored to map brain connectivity, yet the intricate nature of brain networks often leads them to concentrate on specific regions, hindering efforts to unveil a comprehensive connectivity map. Recent advancements in imaging and text mining techniques have enabled the accumulation of a vast body of literature containing valuable insights into brain connectivity, facilitating the extraction of whole-brain connectivity relations from this corpus. However, the diverse representations of brain region names and connectivity relations pose a challenge for conventional machine learning methods and dictionary-based approaches in identifying all instances accurately.<div class="boxTitle">Results</div>We propose BioSEPBERT, a <strong>bio</strong>medical pre-trained model based on <strong>s</strong>tart-<strong>e</strong>nd position <strong>p</strong>ointers and <strong>BERT</strong>. In addition, our model integrates specialized identifiers with enhanced self-attention capabilities for preceding and succeeding brain regions, thereby improving the performance of named entity recognition and relation extraction in neuroscience. Our approach achieves optimal F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we leverage BioSEPBERT to extract 22.6 million standardized brain regions and 165 072 directional relations from a corpus comprising 1.3 million abstracts and 193 100 full-text articles. 
The results demonstrate that our model enables researchers to rapidly acquire knowledge regarding neural circuits across various brain regions, thereby enhancing comprehension of brain connectivity in specific regions.<div class="boxTitle">Availability and implementation</div>Data and source code are available at <a href="http://atlas.brainsmatics.org/res/BioSEPBERT">http://atlas.brainsmatics.org/res/BioSEPBERT</a> and <a href="https://github.com/Brainsmatics/BioSEPBERT">https://github.com/Brainsmatics/BioSEPBERT</a>.</span>


A BLAST from the past: revisiting blastp’s E-value
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST has established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the <span style="font-style:italic;">E</span>-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated.<div class="boxTitle">Results</div>Here, we critically evaluate the <span style="font-style:italic;">E</span>-values provided by the standard protein BLAST (blastp), showing that they can be at times significantly conservative while at others too liberal. We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with blastp, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the blastp <span style="font-style:italic;">E</span>-value. Indeed, in cases where blastp’s analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding blastp’s limited options of matrices and penalties. 
In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with <span style="font-style:italic;">E</span>-values, which can at times be difficult to interpret.<div class="boxTitle">Availability and implementation</div>The Apache licensed source code is available at <a href="https://github.com/batmen-lab/SGPvalue">https://github.com/batmen-lab/SGPvalue</a>.</span>
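The sampling-based significance test described above can be sketched in a few lines: compare the observed alignment score against a small sample of optimal scores from random sequences. This is an illustrative sketch of the generic idea, not the SGPvalue implementation; the +1 correction keeps the estimate valid (never zero) by construction.

```python
# Empirical p-value from a null sample of optimal alignment scores of
# randomly generated sequences: (1 + #{null >= observed}) / (1 + n).
# The +1 terms guarantee a valid (slightly conservative) p-value.
def empirical_pvalue(observed_score, null_scores):
    exceed = sum(s >= observed_score for s in null_scores)
    return (1 + exceed) / (1 + len(null_scores))
```

Because the null sample is generated under whatever substitution matrix and gap penalties the user chooses, this scheme works for any reasonable scoring parameters, unlike blastp's precomputed statistics.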


HBFormer: a single-stream framework based on hybrid attention mechanism for identification of human-virus protein–protein interactions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Exploring human-virus protein–protein interactions (PPIs) is crucial for unraveling the underlying pathogenic mechanisms of viruses. Limitations in the coverage and scalability of high-throughput approaches have impeded the identification of certain key interactions. Current popular computational methods adopt a two-stream pipeline to identify PPIs, which can only achieve relation modeling of protein pairs at the classification phase. However, the fitting capacity of the classifier is insufficient to comprehensively mine the complex interaction patterns between protein pairs.<div class="boxTitle">Results</div>In this study, we propose a pioneering single-stream framework HBFormer that combines hybrid attention mechanism and multimodal feature fusion strategy for identifying human-virus PPIs. The Transformer architecture based on hybrid attention can bridge the bidirectional information flows between human protein and viral protein, thus unifying joint feature learning and relation modeling of protein pairs. The experimental results demonstrate that HBFormer not only achieves superior performance on multiple human-virus PPI datasets but also outperforms 5 other state-of-the-art human-virus PPI identification methods. Moreover, ablation studies and scalability experiments further validate the effectiveness of our single-stream framework.<div class="boxTitle">Availability and implementation</div>Codes and datasets are available at <a href="https://github.com/RmQ5v/HBFormer">https://github.com/RmQ5v/HBFormer</a>.</span>


Pod5Viewer: a GUI for inspecting raw nanopore sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Oxford Nanopore Technologies recently adopted the POD5 file format for storing raw nanopore sequencing data. The information stored in these files provides detailed insights into the sequencing features and enhances the understanding of raw nanopore data. However, visualizing the data can be cumbersome, especially for users without programming skills. To address this issue, we developed the pod5Viewer, a GUI application for inspecting POD5 files.<div class="boxTitle">Results</div>The pod5Viewer offers straightforward access to raw sequencing data and associated metadata in POD5 files. It includes functionalities for viewing, plotting, and exporting individual reads. Designed with user-friendliness in mind, the pod5Viewer is easy to install and use, making it suitable for users of all technical backgrounds.<div class="boxTitle">Availability and implementation</div>The pod5Viewer is available as open source from the pod5Viewer GitHub repository (<a href="https://github.com/dietvin/pod5Viewer">https://github.com/dietvin/pod5Viewer</a>).</span>


JARVIS3: an efficient encoder for genomic data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Large-scale genomic projects grapple with the complex challenge of reducing medium- and long-term storage space and its associated energy consumption, monetary costs, and environmental footprint.<div class="boxTitle">Results</div>We present JARVIS3, an advanced tool engineered for the efficient reference-free compression of genomic sequences. JARVIS3 introduces a pioneering approach, specifically through enhanced table memory models and probabilistic lookup-tables applied in repeat models. These optimizations are pivotal in substantially enhancing computational efficiency. JARVIS3 offers three distinct profiles: (i) rapid computation with moderate compression, (ii) a balanced trade-off between time and compression, and (iii) slower computation with significantly higher compression ratios. The implementation of JARVIS3 is rooted in the C programming language, building upon the success of its predecessor, JARVIS2. JARVIS3 shows substantial speed improvements relative to JARVIS2 while providing slightly better compression. Furthermore, we provide a versatile C/Bash implementation, facilitating application to FASTA and FASTQ data, including the capability for parallel computation. In addition, JARVIS3 includes a mode for outputting bit information, as well as the Normalized Compression and bit rates, facilitating compression-based analysis. This establishes JARVIS3 as an open-source solution for genomic data compression and analysis.<div class="boxTitle">Availability and implementation</div>JARVIS3 is freely available at <a href="https://github.com/cobilab/jarvis3">https://github.com/cobilab/jarvis3</a>.</span>


OpenVariant: a toolkit to parse and operate multiple input file formats
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Advances in high-throughput DNA sequencing technologies and decreasing costs have fueled the identification of small genetic variants (such as single nucleotide variants and indels) across tumors. Despite efforts to standardize variant formats and vocabularies, many sources of variability persist across databases and computational tools that annotate variants, hindering their integration within cancer genomic analyses. In this context, we present OpenVariant, an easily extendable Python package that facilitates seamless reading, parsing and refinement of diverse input file formats in a customizable structure, all within a single process.<div class="boxTitle">Availability and implementation</div>OpenVariant is an open-source package available at <a href="https://github.com/bbglab/openvariant">https://github.com/bbglab/openvariant</a>. Documentation may be found at <a href="https://openvariant.readthedocs.io">https://openvariant.readthedocs.io</a>.</span>


Polyphonia: detecting inter-sample contamination in viral genomic sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>In viral genomic research and surveillance, inter-sample contamination can affect variant detection, analysis of within-host evolution, outbreak reconstruction, and detection of superinfections and recombination events. While sample barcoding methods exist to track inter-sample contamination, they are not always used and can only detect contamination in the experimental pipeline from the point they are added. The underlying genomic information in a sample, however, carries information about inter-sample contamination occurring at any stage. Here, we present Polyphonia, a tool for detecting inter-sample contamination directly from deep sequencing data without the need for additional controls, using intrahost variant frequencies. We apply Polyphonia to 1102 SARS-CoV-2 samples sequenced at the Broad Institute and already tracked using molecular barcoding for comparison.<div class="boxTitle">Availability and implementation</div>Polyphonia is available as a standalone Docker image and is also included as part of viral-ngs, available in Dockstore. Full documentation, source code, and instructions for use are available at <a href="https://github.com/broadinstitute/polyphonia">https://github.com/broadinstitute/polyphonia</a>.</span>


spread.gl: visualizing pathogen dispersal in a high-performance browser application
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Bayesian phylogeographic analyses are pivotal in reconstructing the spatio-temporal dispersal histories of pathogens. However, interpreting the complex outcomes of phylogeographic reconstructions requires sophisticated visualization tools.<div class="boxTitle">Results</div>To meet this challenge, we developed spread.gl, an open-source, feature-rich browser application offering a smooth and intuitive visualization tool for both discrete and continuous phylogeographic inferences, including the animation of pathogen geographic dispersal through time. Spread.gl can render and combine the visualization of multiple layers that contain information extracted from the input phylogeny and diverse environmental data layers, enabling researchers to explore which environmental factors may have impacted pathogen dispersal patterns before conducting formal testing. We showcase the visualization features of spread.gl with representative examples, including the smooth animation of a phylogeographic reconstruction based on &gt;17 000 SARS-CoV-2 genomic sequences.<div class="boxTitle">Availability and implementation</div>Source code, installation instructions, example input data, and outputs of spread.gl are accessible at <a href="https://github.com/GuyBaele/SpreadGL">https://github.com/GuyBaele/SpreadGL</a>.</span>


HAlign 4: a new strategy for rapidly aligning millions of sequences
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>HAlign is a high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.<div class="boxTitle">Results</div>To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows–Wheeler Transform and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million coronavirus disease 2019 (COVID-19) sequences in about 12 min and 300 GB of memory using 96 threads, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.<div class="boxTitle">Availability and implementation</div>Source code is available at <a href="https://github.com/malabz/HAlign-4">https://github.com/malabz/HAlign-4</a>, and the dataset is available at <a href="https://zenodo.org/records/13934503">https://zenodo.org/records/13934503</a>.</span>


BWT construction and search at the terabase scale
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The Burrows–Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.<div class="boxTitle">Results</div>We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 h and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/lh3/ropebwt3">https://github.com/lh3/ropebwt3</a>.</span>
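To make the transform itself concrete, here is the textbook definition of the BWT via sorted rotations. This naive O(n² log n) construction is only for illustration; ropebwt3's contribution is precisely that it replaces this with incremental algorithms that scale to terabases.

```python
# Naive BWT: append a sentinel, sort all cyclic rotations, and take the
# last column. Redundant input yields long runs in the output, which is
# why the BWT compresses pangenome data so well.
def bwt(text, sentinel="$"):
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

For example, `bwt("banana")` groups the repeated characters of the input into runs, which run-length encoding then collapses.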


Virtual tissue expression analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Bulk RNA expression data are widely accessible, whereas single-cell data are relatively scarce in comparison. However, single-cell data offer profound insights into the cellular composition of tissues and cell type-specific gene regulation, both of which remain hidden in bulk expression analysis.<div class="boxTitle">Results</div>Here, we present tissueResolver, an algorithm designed to extract single-cell information from bulk data, enabling us to attribute expression changes to individual cell types. When validated on simulated data, tissueResolver outperforms competing methods. Additionally, our study demonstrates that tissueResolver reveals cell type-specific regulatory distinctions between the activated B-cell-like (ABC) and germinal center B-cell-like (GCB) subtypes of diffuse large B-cell lymphomas (DLBCL).<div class="boxTitle">Availability and implementation</div>The R package is available at <a href="https://github.com/spang-lab/tissueResolver">https://github.com/spang-lab/tissueResolver</a> (archived as <a href="https://zenodo.org/records/14160846">10.5281/zenodo.14160846</a>). Code for reproducing the results of this article is available at <a href="https://github.com/spang-lab/tissueResolver-docs">https://github.com/spang-lab/tissueResolver-docs</a> (archived as <a href="https://archive.softwareheritage.org/swh:1:dir:faea2d4f0ded30de774b28e028299ddbdd0c4f89">swh:1:dir:faea2d4f0ded30de774b28e028299ddbdd0c4f89</a>).</span>
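The underlying deconvolution idea can be sketched simply: model bulk expression as a mixture of cell-type signature profiles and solve for the mixing proportions. This is a generic least-squares sketch under that assumption, not tissueResolver's actual algorithm, which goes further by attributing expression changes to individual cell types.

```python
import numpy as np

# Generic bulk deconvolution sketch (not tissueResolver's algorithm):
# model the bulk profile b as S @ p, where S is a genes x cell-types
# signature matrix and p the unknown cell-type proportions.
def estimate_proportions(signatures, bulk):
    p, *_ = np.linalg.lstsq(signatures, bulk, rcond=None)
    p = np.clip(p, 0.0, None)   # proportions cannot be negative
    return p / p.sum()          # normalize to sum to one
```

With well-separated signatures, the true mixture is recovered exactly; real data require the regularization and cell-type-specific modeling that dedicated tools provide.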


STRprofiler: efficient comparisons of short tandem repeat profiles for biomedical model authentication
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Short tandem repeat (STR) profiling is commonly performed for authentication of biomedical models of human origin, yet no tools exist to easily compare sets of STR profiles to each other or an existing database in a high-throughput manner. Here, we present STRprofiler, a Python package, command line tool, and Shiny application providing methods for STR profile comparison and cross-contamination detection. STRprofiler can be run with custom databases or used to query against the Cellosaurus cell line database.<div class="boxTitle">Availability and implementation</div>STRprofiler is freely available as a Python package with a rich CLI from PyPI <a href="https://pypi.org/project/strprofiler/">https://pypi.org/project/strprofiler/</a> with source code available under the MIT license on GitHub <a href="https://github.com/j-andrews7/strprofiler">https://github.com/j-andrews7/strprofiler</a> and at <a href="https://zenodo.org/records/10989034">https://zenodo.org/records/10989034</a>. A web server hosting an example STRprofiler Shiny application backed by a database with data from the National Cancer Institute-funded PDXNet consortium and The Jackson Laboratory PDX program is available at <a href="https://sj-bakerlab.shinyapps.io/strprofiler/">https://sj-bakerlab.shinyapps.io/strprofiler/</a>. Full documentation is available at <a href="https://strprofiler.readthedocs.io/en/latest/">https://strprofiler.readthedocs.io/en/latest/</a>.</span>
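The core of STR profile comparison is a simple allele-overlap score. The sketch below implements the Tanabe (percent-match) score, a standard metric in this field; it is shown only to make the comparison idea concrete, and STRprofiler's documentation should be consulted for its exact scoring options.

```python
# Tanabe similarity between two STR profiles, each represented as a dict
# mapping marker name -> set of alleles:
#   2 * shared alleles / (alleles in A + alleles in B),
# computed over markers typed in both profiles.
def tanabe_score(profile_a, profile_b):
    shared = total_a = total_b = 0
    for marker in profile_a.keys() & profile_b.keys():
        alleles_a, alleles_b = profile_a[marker], profile_b[marker]
        shared += len(alleles_a & alleles_b)
        total_a += len(alleles_a)
        total_b += len(alleles_b)
    return 2 * shared / (total_a + total_b)
```

Identical profiles score 1.0; extra alleles appearing in only one profile (a hallmark of cross-contamination) pull the score down, which is why pairwise scoring across a database flags both mislabeled and contaminated models.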


easySCF: a tool for enhancing interoperability between R and Python for efficient single-cell data analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>This study introduces easySCF, a tool designed to enhance the interoperability of single-cell data between the two major bioinformatics platforms, R and Python. By supporting seamless data exchange, easySCF improves the efficiency and accuracy of single-cell data analysis.<div class="boxTitle">Availability and implementation</div>easySCF utilizes a unified data format (.h5 format) to facilitate data transfer between R and Python platforms. The tool has been evaluated for data processing speed, memory efficiency, and disk usage, as well as its capability to handle large-scale single-cell datasets. easySCF is available as an open-source package, with implementation details and documentation accessible at <a href="https://github.com/xleizi/easySCF">https://github.com/xleizi/easySCF</a>.</span>


Fast polypharmacy side effect prediction using tensor factorization
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Adverse reactions from drug combinations are increasingly common, making their accurate prediction a crucial challenge in modern medicine. Laboratory-based identification of these reactions is insufficient due to the combinatorial nature of the problem. While many computational approaches have been proposed, tensor factorization (TF) models have shown mixed results, necessitating a thorough investigation of their capabilities when properly optimized.<div class="boxTitle">Results</div>We demonstrate that TF models can achieve state-of-the-art performance on polypharmacy side effect prediction, with our best model (SimplE) achieving median scores of 0.978 area under receiver-operating characteristic curve, 0.971 area under precision–recall curve, and 1.000 AP@50 across 963 side effects. Notably, this model reaches 98.3% of its maximum performance after just two epochs of training (approximately 4 min), making it substantially faster than existing approaches while maintaining comparable accuracy. We also find that incorporating monopharmacy data as self-looping edges in the graph performs marginally better than using it to initialize embeddings.<div class="boxTitle">Availability and implementation</div>All code used in the experiments is available in our GitHub repository (<a href="https://doi.org/10.5281/zenodo.10684402">https://doi.org/10.5281/zenodo.10684402</a>). The implementation was carried out using Python 3.8.12 with PyTorch 1.7.1, accelerated with CUDA 11.4 on NVIDIA GeForce RTX 2080 Ti GPUs.</span>
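The scoring idea behind TF link-prediction models is compact enough to show directly. The sketch below is a DistMult-style trilinear score, a simplified stand-in: SimplE itself uses paired head/tail embeddings per entity plus inverse-relation embeddings, but the core mechanism, scoring a (drug, side effect, drug) triple by summed elementwise products of learned embeddings, is the same.

```python
import numpy as np

# Trilinear scoring sketch for a knowledge-graph triple (h, r, t):
# each drug (h, t) and side effect (r) has a learned embedding vector,
# and the triple's plausibility is the sum of elementwise products.
def score(head, relation, tail):
    return float(np.sum(head * relation * tail))

def probability(head, relation, tail):
    # Squash the unbounded score into (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-score(head, relation, tail)))
```

Training fits the embeddings so that observed drug-pair side effects score high and negative samples score low; prediction is then just this cheap product, which is why inference over all 963 side effects is fast.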


ViraLM: empowering virus discovery through the genome foundation model
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Viruses, with their ubiquitous presence and high diversity, play pivotal roles in ecological systems and public health. Accurate identification of viruses in various ecosystems is essential for comprehending their variety and assessing their ecological influence. Metagenomic sequencing has become a major strategy to survey the viruses in various ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising in novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency in virus search results produced by available tools further highlights the urgent need for a more robust tool for virus identification.<div class="boxTitle">Results</div>In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By using the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.<div class="boxTitle">Availability and implementation</div>The source code of ViraLM is available via: <a href="https://github.com/ChengPENG-wolf/ViraLM">https://github.com/ChengPENG-wolf/ViraLM</a>.</span>


CVR-BBI: an open-source VR platform for multi-user collaborative brain-to-brain interfaces
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>As brain imaging and neurofeedback technologies advance, the brain-to-brain interface (BBI) has emerged as an innovative field, enabling in-depth exploration of cross-brain information exchange and enhancing our understanding of collaborative intelligence. However, no open-source virtual reality (VR) platform currently supports the rapid and efficient configuration of multi-user, collaborative BBIs. To address this gap, we introduce the Collaborative Virtual Reality Brain-to-Brain Interface (CVR-BBI), an open-source platform consisting of a client and server. The CVR-BBI client enables users to participate in collaborative experiments, collect electroencephalogram (EEG) data, and manage interactive multisensory stimuli within the VR environment. Meanwhile, the CVR-BBI server manages multi-user collaboration paradigms, and performs real-time analysis of the EEG data. We evaluated the CVR-BBI platform using the SSVEP paradigm and observed that collaborative decoding outperformed individual decoding, validating the platform’s effectiveness in collaborative settings. The CVR-BBI offers a pioneering platform that facilitates the development of innovative BBI applications within collaborative VR environments, thereby enhancing the understanding of brain collaboration and cognition.<div class="boxTitle">Availability and implementation</div>CVR-BBI is released as an open-source platform, with its source code being available at <a href="https://github.com/DILIU1/CVR-BBI">https://github.com/DILIU1/CVR-BBI</a>.</span>


FastTENET: an accelerated TENET algorithm based on manycore computing in Python
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>TENET reconstructs gene regulatory networks from single-cell RNA sequencing (scRNAseq) data using the transfer entropy (TE), and works successfully on a variety of scRNAseq data. However, TENET is limited by its long computation time for large datasets. To address this limitation, we propose FastTENET, an array-computing version of TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the TE based on array computing. Compared to TENET, FastTENET achieves up to 973× performance improvement.<div class="boxTitle">Availability and implementation</div>FastTENET is available on GitHub at <a href="https://github.com/cxinsys/fasttenet">https://github.com/cxinsys/fasttenet</a>.</span>
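The count-based computation that FastTENET accelerates can be sketched directly: transfer entropy is a plug-in estimate over counts of joint patterns (y<sub>t+1</sub>, y<sub>t</sub>, x<sub>t</sub>) in a pair of discretized expression series. This is a minimal single-pair sketch of the standard TE formula, not FastTENET's array-computing implementation.

```python
from collections import Counter
from math import log2

# Plug-in transfer entropy TE(X -> Y) from two discretized series:
# TE = sum over patterns p(y_next, y_prev, x_prev)
#      * log2( p(y_next | y_prev, x_prev) / p(y_next | y_prev) ),
# with all probabilities estimated by counting unique joint events.
def transfer_entropy(x, y):
    triples = list(zip(y[1:], y[:-1], x[:-1]))
    n = len(triples)
    c_xyz = Counter(triples)
    c_yz = Counter((yn, yp) for yn, yp, _ in triples)
    c_zx = Counter((yp, xp) for _, yp, xp in triples)
    c_z = Counter(yp for _, yp, _ in triples)
    te = 0.0
    for (yn, yp, xp), cnt in c_xyz.items():
        te += (cnt / n) * log2((cnt * c_z[yp]) / (c_zx[(yp, xp)] * c_yz[(yn, yp)]))
    return te
```

TE is zero when X carries no information about Y's next state beyond Y's own past, and positive when X helps predict Y, the asymmetry that lets TENET orient regulatory edges. FastTENET's speedup comes from evaluating these pattern counts for all gene pairs as array operations on manycore hardware.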


OneSC: a computational platform for recapitulating cell state transitions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Computational modeling of cell state transitions has been of great interest to many in the fields of developmental biology, cancer biology, and cell fate engineering because it enables performing perturbation experiments <span style="font-style:italic;">in silico</span> more rapidly and cheaply than could be achieved in a lab. Recent advancements in single-cell RNA-sequencing (scRNA-seq) allow the capture of high-resolution snapshots of cell states as they transition along temporal trajectories. Using these high-throughput datasets, we can train computational models to generate <span style="font-style:italic;">in silico</span> “synthetic” cells that faithfully mimic the temporal trajectories.<div class="boxTitle">Results</div>Here we present OneSC, a platform that can simulate cell state transitions using systems of stochastic differential equations governed by a regulatory network of core transcription factors (TFs). Unlike many current network inference methods, OneSC prioritizes generating a Boolean network that produces faithful cell state transitions and terminal cell states that mimic real biological systems. Applying OneSC to real data, we inferred a core TF network using a mouse myeloid progenitor scRNA-seq dataset and showed that the dynamical simulations of that network generate synthetic single-cell expression profiles that faithfully recapitulate the four myeloid differentiation trajectories going into differentiated cell states (erythrocytes, megakaryocytes, granulocytes, and monocytes). 
Finally, through the <span style="font-style:italic;">in silico</span> perturbations of the mouse myeloid progenitor core network, we showed that OneSC can accurately predict cell fate decision biases of TF perturbations that closely match previous experimental observations.<div class="boxTitle">Availability and implementation</div>OneSC is implemented as a Python package on GitHub (<a href="https://github.com/CahanLab/oneSC">https://github.com/CahanLab/oneSC</a>) and on Zenodo (<a href="https://zenodo.org/records/14052421">https://zenodo.org/records/14052421</a>).</span>
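The idea of simulating cell state transitions with stochastic differential equations governed by a small TF network can be sketched with a toy two-TF mutual-inhibition motif integrated by Euler–Maruyama. This is a generic illustration, not OneSC's API; the Hill-function form, parameter values, and function name are all assumptions:

```python
import math
import random

def simulate_toggle(steps=2000, dt=0.01, noise=0.1, seed=1):
    """Euler-Maruyama simulation of a two-TF mutual-inhibition motif:
    dx = (hill(y) - x) dt + noise dW,  dy = (hill(x) - y) dt + noise dW."""
    random.seed(seed)
    def hill(repressor, K=0.5, n=4):
        # Repressive Hill function: high repressor level -> low output.
        return 1.0 / (1.0 + (repressor / K) ** n)
    x, y = 0.9, 0.1  # start near the x-high attractor
    for _ in range(steps):
        dwx = random.gauss(0.0, math.sqrt(dt))
        dwy = random.gauss(0.0, math.sqrt(dt))
        x += (hill(y) - x) * dt + noise * dwx
        y += (hill(x) - y) * dt + noise * dwy
        x, y = max(x, 0.0), max(y, 0.0)  # concentrations stay non-negative
    return x, y
```

With modest noise the trajectory stays in the attractor it started in, which is the qualitative behavior (stable terminal cell states with stochastic fluctuations) that OneSC's simulations aim to reproduce from an inferred Boolean network.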


Improved prediction of post-translational modification crosstalk within proteins using DeepPCT
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Post-translational modification (PTM) crosstalk events play critical roles in biological processes. Several machine learning methods have been developed to identify PTM crosstalk within proteins, but the accuracy is still far from satisfactory. Recent breakthroughs in deep learning and protein structure prediction could provide a potential solution to this issue.<div class="boxTitle">Results</div>We proposed DeepPCT, a deep learning algorithm to identify PTM crosstalk using AlphaFold2-based structures. In this algorithm, one deep learning classifier was constructed for sequence-based prediction by combining the residue and residue pair embeddings with cross-attention techniques, while the other classifier was established for structure-based prediction by integrating the structural embedding and a graph neural network. Meanwhile, a machine learning classifier was developed using novel structural descriptors and a random forest model to complement the structural deep learning classifier. By integrating the three classifiers, DeepPCT outperformed existing algorithms in different evaluation scenarios and showed better generalizability on new data owing to its reduced dependency on distance features.<div class="boxTitle">Availability and implementation</div>Datasets, codes, and models of DeepPCT are freely accessible at <a href="https://github.com/hzau-liulab/DeepPCT/">https://github.com/hzau-liulab/DeepPCT/</a>.</span>


Accurate and transferable drug–target interaction prediction with DrugLAMP
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Accurate prediction of drug–target interactions (DTIs), especially for novel targets or drugs, is crucial for accelerating drug discovery. Recent advances in pretrained language models (PLMs) and multi-modal learning present new opportunities to enhance DTI prediction by leveraging vast unlabeled molecular data and integrating complementary information from multiple modalities.<div class="boxTitle">Results</div>We introduce DrugLAMP (PLM-assisted multi-modal prediction), a PLM-based multi-modal framework for accurate and transferable DTI prediction. DrugLAMP integrates molecular graph and protein sequence features extracted by PLMs and traditional feature extractors. We introduce two novel multi-modal fusion modules: (i) pocket-guided co-attention (PGCA), which uses protein pocket information to guide the attention mechanism on drug features, and (ii) paired multi-modal attention (PMMA), which enables effective cross-modal interactions between drug and protein features. These modules work together to enhance the model’s ability to capture complex drug–protein interactions. Moreover, the contrastive compound-protein pre-training (2C2P) module enhances the model’s generalization to real-world scenarios by aligning features across modalities and conditions. Comprehensive experiments demonstrate DrugLAMP’s state-of-the-art performance on both standard benchmarks and challenging settings simulating real-world drug discovery, where test drugs/targets are unseen during training. Visualizations of attention maps and application to predict cryptic pockets and drug side effects further showcase DrugLAMP’s strong interpretability and generalizability. 
Ablation studies confirm the contributions of the proposed modules.<div class="boxTitle">Availability and implementation</div>Source code and datasets are freely available at <a href="https://github.com/Lzcstan/DrugLAMP">https://github.com/Lzcstan/DrugLAMP</a>. All data originate from public sources.</span>


Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the <span style="font-style:italic;">n</span><sup>2</sup> entries of the distance matrix, where <span style="font-style:italic;">n</span> is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets.<div class="boxTitle">Results</div>In this work, we propose a new algorithm which does not require computing a dense distance matrix. Instead, it dynamically determines a sparse set of at most O(n log n) distance matrix entries to be computed in its basic version, and up to O(n log<sup>2</sup> n) entries in an enhanced version. We show by experiments that this approach reduces the execution time of NJ for large datasets, with a trade-off in accuracy.<div class="boxTitle">Availability and implementation</div>Sparse Neighbor Joining is implemented in Python and freely available at <a href="https://github.com/kurtsemih/SNJ">https://github.com/kurtsemih/SNJ</a>.</span>
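For context, the selection step that classical NJ performs on a dense matrix, and that Sparse Neighbor Joining accelerates by computing only a sparse subset of entries, can be sketched as follows (a textbook illustration, not the authors' code):

```python
def nj_best_pair(d):
    """One Neighbor Joining selection step on a dense distance matrix d
    (list of lists): return the pair (i, j) minimizing the Q-criterion
    Q(i, j) = (n - 2) * d[i][j] - r[i] - r[j], where r[i] = sum_k d[i][k].
    Classical NJ needs all n^2 entries of d just for this step; Sparse
    Neighbor Joining avoids materializing the full matrix."""
    n = len(d)
    r = [sum(row) for row in d]  # row sums over all leaves
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best
```

On an additive distance matrix the minimal-Q pair is a true cherry of the underlying tree, which is why iterating this join step reconstructs the topology.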


Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Transcript quantification tools efficiently map bulk RNA sequencing (RNA-seq) reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation, and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA-seq analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-seq dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations.<div class="boxTitle">Availability and implementation</div>pytximport is licensed under the GNU General Public License version 3. The source code is available at <a href="https://github.com/complextissue/pytximport">https://github.com/complextissue/pytximport</a> and via Zenodo with DOI: 10.5281/zenodo.13907917. A related Snakemake workflow is available through GitHub at <a href="https://github.com/complextissue/snakemake-bulk-rna-seq-workflow">https://github.com/complextissue/snakemake-bulk-rna-seq-workflow</a> and Zenodo with DOI: 10.5281/zenodo.12713811. Documentation and a vignette for new users are available at: <a href="https://pytximport.readthedocs.io">https://pytximport.readthedocs.io</a>.</span>


Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Extensive research has uncovered the critical role of the human gut microbiome in various aspects of health, including metabolism, nutrition, physiology, and immune function. Fecal microbiota is often used as a proxy for understanding the gut microbiome, but it represents an aggregate view, overlooking spatial variations across different gastrointestinal (GI) locations. Emerging studies with spatial microbiome data collected from specific GI regions offer a unique opportunity to better understand the spatial composition of the stool microbiome.<div class="boxTitle">Results</div>We introduce Micro-DeMix, a mixture beta-multinomial model that deconvolutes the fecal microbiome at the compositional level by integrating stool samples with spatial microbiome data. Micro-DeMix facilitates the comparison of microbial compositions across different GI regions within the stool microbiome through a hypothesis-testing framework. We demonstrate the effectiveness and efficiency of Micro-DeMix using multiple simulated datasets and the inflammatory bowel disease data from the NIH Integrative Human Microbiome Project.<div class="boxTitle">Availability and implementation</div>The R package is available at <a href="https://github.com/liuruoqian/MicroDemix">https://github.com/liuruoqian/MicroDemix</a>.</span>


PhosX: data-driven kinase activity inference from phosphoproteomics experiments
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging, and constrained by the manually curated knowledge of kinase–substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit this new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases’ substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases we show that our method identifies upregulated and downregulated kinases by only relying on the input phosphopeptides’ sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better or similarly to state-of-the-art methods that rely on previously known kinase–substrate associations. We therefore recommend its use for data-driven kinase activity inference.<div class="boxTitle">Availability and implementation</div>PhosX is implemented in Python, open-source under the Apache-2.0 licence, and distributed on the Python Package Index. The code is available on GitHub (<a href="https://github.com/alussana/phosx">https://github.com/alussana/phosx</a>).</span>
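The enrichment-analysis idea behind this kind of kinase activity inference can be sketched with a GSEA-style running-sum score over phosphopeptides ranked by intensity change. This is a generic illustration of the statistic family, not PhosX's exact method; the hit flags (motif matches) and the score weighting are assumptions:

```python
def enrichment_score(ranked_scores, hit_flags):
    """GSEA-style running-sum enrichment score over a ranked list.
    ranked_scores: non-negative weights (e.g. |log fold changes|) sorted
    by decreasing change; hit_flags[i] is True if phosphopeptide i
    matches the kinase's substrate sequence specificity."""
    n_miss = len(hit_flags) - sum(hit_flags)
    hit_norm = sum(s for s, h in zip(ranked_scores, hit_flags) if h)
    running, best = 0.0, 0.0
    for s, h in zip(ranked_scores, hit_flags):
        if h:
            running += s / hit_norm   # step up, weighted by the score
        else:
            running -= 1.0 / n_miss   # step down uniformly
        if abs(running) > abs(best):
            best = running            # keep the extreme deviation
    return best
```

Motif-matching peptides concentrated at the top of the ranking give a score near +1 (kinase upregulated); concentrated at the bottom, near −1 (downregulated), which is how signed differential activity can be read off a single phosphoproteomics profile.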


DrugRepPT: a deep pretraining and fine-tuning framework for drug repositioning based on drug’s expression perturbation and treatment effectiveness
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Drug repositioning (DR), identifying novel indications for approved drugs, is a cost-effective strategy in drug discovery. Despite numerous proposed DR models, integrating network-based features, differential gene expression, and chemical structures for high-performance DR remains challenging.<div class="boxTitle">Results</div>We propose a comprehensive deep pretraining and fine-tuning framework for DR, termed DrugRepPT. Initially, we design a graph pretraining module employing model-augmented contrastive learning on a vast drug–disease heterogeneous graph to capture nuanced interactions and expression perturbations after intervention. Subsequently, we introduce a fine-tuning module leveraging a graph residual-like convolution network to elucidate intricate interactions between diseases and drugs. Moreover, a Bayesian multiloss approach is introduced to balance the existence and the effectiveness of drug treatment. Extensive experiments showcase the efficacy of our framework, with DrugRepPT exhibiting remarkable performance improvements compared to state-of-the-art (SOTA) baseline methods (improvements of 106.13% in Hit@1 and 54.45% in mean reciprocal rank). The reliability of predicted results is further validated through two case studies, i.e. gastritis and fatty liver, via literature validation, network medicine analysis, and docking screening.<div class="boxTitle">Availability and implementation</div>The code and results are available at <a href="https://github.com/2020MEAI/DrugRepPT">https://github.com/2020MEAI/DrugRepPT</a>.</span>


Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Biomarker detection plays a pivotal role in biomedical research. Integrating omics studies from multiple cohorts can enhance statistical power, accuracy, and robustness of the detection results. However, existing methods for horizontally combining omics studies are mostly designed for two-class scenarios (e.g. cases versus controls) and are not directly applicable for studies with multi-class design (e.g. samples from multiple disease subtypes, treatments, tissues, or cell types).<div class="boxTitle">Results</div>We propose a statistical framework, namely Mutual Information Concordance Analysis (MICA), to detect biomarkers with concordant multi-class expression pattern across multiple omics studies from an information theoretic perspective. Our approach first detects biomarkers with concordant multi-class patterns across partial or all of the omics studies using a global test by mutual information. A <span style="font-style:italic;">post hoc</span> analysis is then performed for each detected biomarker to identify studies with concordant patterns. Extensive simulations demonstrate improved accuracy and successful false discovery rate control of MICA compared to an existing multi-class correlation method. The method is then applied to two practical scenarios: four tissues of mouse metabolism-related transcriptomic studies, and three sources of estrogen treatment expression profiles. Detected biomarkers by MICA show intriguing biological insights and functional annotations. 
Additionally, we applied MICA to single-cell RNA-Seq data to identify tumor progression biomarkers, highlighting critical roles of ribosomal function in the tumor microenvironment of triple-negative breast cancer and underscoring the potential of MICA for detecting novel therapeutic targets.<div class="boxTitle">Availability and implementation</div>The source code is available on Figshare at <a href="https://doi.org/10.6084/m9.figshare.27635436">https://doi.org/10.6084/m9.figshare.27635436</a>. Additionally, the R package can be installed directly from GitHub at <a href="https://github.com/jianzou75/MICA">https://github.com/jianzou75/MICA</a>.</span>
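The building block of such an information-theoretic test, a plug-in mutual information estimate between two discretized multi-class expression patterns, can be sketched as follows (a generic illustration, not MICA's global test):

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete label
    vectors, e.g. a gene's expression level discretized per sample and
    the multi-class condition labels of those samples."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    # MI = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) * p(b)) )
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

A gene whose pattern tracks the class labels in every study gets a high MI in each, and combining these per-study statistics is the concordance signal a framework like MICA tests.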


Damsel: analysis and visualisation of DamID sequencing in R
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end-to-end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.<div class="boxTitle">Availability and implementation</div>The R package Damsel is available through the Bioconductor project at <a href="https://bioconductor.org/packages/release/bioc/html/Damsel.html">https://bioconductor.org/packages/release/bioc/html/Damsel.html</a> and the code is available on GitHub <a href="https://github.com/Oshlack/Damsel/">https://github.com/Oshlack/Damsel/</a>.</span>


Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in simulated phenotypes in PAMs.<div class="boxTitle">Results</div>Here, we introduce a systematic framework for identifying the most sensitive enzyme concentrations (sEnz) in PAMs. The framework exploits the primal and dual formulations of these models to derive sensitivity coefficients based on relations between variables, constraints, and the objective function. This approach enhances our understanding of the growth-limiting factors of metabolic phenotypes under specific environmental or genetic conditions. Compared to other traditional methods for calculating sensitivities, sEnz requires substantially less computation time and facilitates more intuitive comparison and analysis of sensitivities. The sensitivities calculated by sEnz cover enzymes, reactions, and protein sectors, enabling a holistic overview of the factors influencing metabolism. When applied to an <span style="font-style:italic;">Escherichia coli</span> PAM, sEnz revealed major pathways and enzymes driving overflow metabolism. Overall, sEnz offers a computationally efficient framework for understanding PAM predictions and unraveling the factors governing a particular metabolic phenotype.<div class="boxTitle">Availability and implementation</div>sEnz is implemented in the modular toolbox for the generation and analysis of PAMs in Python (PAModelpy; v.0.0.3.3), available on PyPI (<a href="https://pypi.org/project/PAModelpy/">https://pypi.org/project/PAModelpy/</a>). 
The source code, together with all other Python scripts and notebooks, is available on GitHub (<a href="https://github.com/iAMB-RWTH-Aachen/PAModelpy">https://github.com/iAMB-RWTH-Aachen/PAModelpy</a>).</span>


STRPsearch: fast detection of structured tandem repeat proteins
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Structured Tandem Repeat Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures.<div class="boxTitle">Results</div>We introduce STRPsearch, a novel tool for the rapid identification, classification, and mapping of STRPs. Leveraging manually curated entries from RepeatsDB as the known conformational space of STRPs, STRPsearch uses the latest advances in structural alignment for fast and accurate detection of repeated structural motifs in proteins, followed by an innovative approach to map units and insertions through the generation of TM-score profiles. STRPsearch is highly scalable, efficiently processing large datasets, and can be applied to both experimental structures and predicted models. In addition, it demonstrates superior performance compared to existing tools, offering researchers a reliable and comprehensive solution for STRP analysis across diverse proteomes.<div class="boxTitle">Availability and implementation</div>STRPsearch is coded in Python. 
All scripts and associated documentation are available from: <a href="https://github.com/BioComputingUP/STRPsearch">https://github.com/BioComputingUP/STRPsearch</a>.</span>


DeepDR: a deep learning library for drug response prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Accurate drug response prediction is critical to advancing precision medicine and drug discovery. Recent advances in deep learning (DL) have shown promise in predicting drug response; however, the lack of convenient tools to support such modeling limits their widespread application. To address this, we introduce DeepDR, the first DL library specifically developed for drug response prediction. DeepDR simplifies the process by automating drug and cell featurization, model construction, training, and inference, all achievable with minimal programming. The library incorporates three types of drug features along with nine drug encoders, four types of cell features along with nine cell encoders, and two fusion modules, enabling the implementation of up to 135 DL models for drug response prediction. We also benchmarked performance with DeepDR, and the optimal models are available through a user-friendly visual interface.<div class="boxTitle">Availability and implementation</div>DeepDR can be installed from PyPI (<a href="https://pypi.org/project/deepdr">https://pypi.org/project/deepdr</a>). The source code and experimental data are available on GitHub (<a href="https://github.com/user15632/DeepDR">https://github.com/user15632/DeepDR</a>).</span>


Tiberius: end-to-end deep learning with an HMM for gene prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs), which took a DNA sequence directly as input. Recently, Holst <span style="font-style:italic;">et al.</span> demonstrated with their program Helixer that the accuracy of <span style="font-style:italic;">ab initio</span> eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.<div class="boxTitle">Results</div>We present Tiberius, a novel deep learning-based <span style="font-style:italic;">ab initio</span> gene predictor that integrates convolutional and long short-term memory layers with a differentiable HMM layer in an end-to-end fashion. Tiberius uses a custom gene prediction loss and was trained for prediction in mammalian genomes and evaluated on the human and two other genomes. It significantly outperforms existing <span style="font-style:italic;">ab initio</span> methods, achieving F1 scores of 62% at gene level for the human genome, compared to 21% for the next best <span style="font-style:italic;">ab initio</span> method. In <span style="font-style:italic;">de novo</span> mode, Tiberius predicts the exon−intron structure of two out of three human genes without error. Remarkably, even Tiberius’s <span style="font-style:italic;">ab initio</span> accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius’s highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/Gaius-Augustus/Tiberius">https://github.com/Gaius-Augustus/Tiberius</a></span>
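The recursion that a differentiable HMM layer builds on is the standard forward algorithm in log space; because logsumexp is smooth, gradients can flow through it to the upstream network layers. This is a textbook sketch, not Tiberius's implementation:

```python
import math

def log_forward(log_init, log_trans, log_emit, obs):
    """Log-space forward algorithm for a discrete HMM: returns log P(obs).
    log_init[s]: log initial probability of state s;
    log_trans[t][s]: log transition probability t -> s;
    log_emit[s][o]: log probability that state s emits symbol o."""
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))
    k = len(log_init)
    # alpha[s] = log P(obs[0..t], state_t = s)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(k)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[t] + log_trans[t][s] for t in range(k)])
                 + log_emit[s][o] for s in range(k)]
    return logsumexp(alpha)
```

In a gene predictor the "emissions" would be per-base scores produced by the neural layers and the states would encode gene structure (exon, intron, intergenic), so that training the whole stack against a gene prediction loss tunes the network and the HMM jointly.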


Dynamic modelling of signalling pathways when ordinary differential equations are not feasible
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Mathematical modelling plays a crucial role in understanding inter- and intracellular signalling processes. Currently, ordinary differential equations (ODEs) are the predominant approach in systems biology for modelling such pathways. While ODE models offer mechanistic interpretability, they also suffer from limitations, including the need to consider all relevant compounds, resulting in large models that are difficult to handle numerically and require extensive data.<div class="boxTitle">Results</div>In previous work, we introduced the <span style="font-style:italic;">retarded transient function (RTF)</span> as an alternative method for modelling temporal responses of signalling pathways. Here, we extend the RTF approach to integrate concentration or dose-dependencies into the modelling of dynamics. With this advancement, RTF modelling now fully encompasses the application range of ODE models, which comprises predictions in both time and concentration domains. Moreover, characterizing dose-dependencies provides an intuitive way to investigate and characterize signalling differences between biological conditions or cell types based on their response to stimulating inputs. To demonstrate the applicability of our extended approach, we employ data from time- and dose-dependent inflammasome activation in bone marrow-derived macrophages treated with nigericin sodium salt. Our results show the effectiveness of the extended RTF approach as a generic framework for modelling dose-dependent kinetics in cellular signalling. 
The approach results in intuitively interpretable parameters that describe signal dynamics and enables predictive modelling of time- and dose-dependencies even if only individual cellular components are quantified.<div class="boxTitle">Availability and implementation</div>The presented approach is available within the MATLAB-based <span style="font-style:italic;">Data2Dynamics</span> modelling toolbox at <a href="https://github.com/Data2Dynamics">https://github.com/Data2Dynamics</a> and <a href="https://zenodo.org/records/14008247">https://zenodo.org/records/14008247</a> and as R code at <a href="https://github.com/kreutz-lab/RTF">https://github.com/kreutz-lab/RTF</a>.</span>


Facilitating phenotyping from clinical texts: the medkit library
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Phenotyping consists of applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because much of the clinical information in EHRs lies in free text, phenotyping from text plays an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized nature of both the content and form of clinical texts make this task particularly tedious and a source of time and cost constraints in observational studies.<div class="boxTitle"> </div>To facilitate the development, evaluation, and reproducibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, named medkit operations. In addition to the core of the library, we share the operations and pipelines we already developed and invite the phenotyping community to reuse and enrich them.<div class="boxTitle">Availability and implementation</div>medkit is available at <a href="https://github.com/medkit-lib/medkit">https://github.com/medkit-lib/medkit</a>.</span>


LmRaC: a functionally extensible tool for LLM interrogation of user experimental results
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Large Language Models (LLMs) have provided spectacular results across a wide variety of domains. However, persistent concerns about hallucination and fabrication of authoritative sources raise serious issues for their integral use in scientific research. Retrieval-augmented generation (RAG) is a technique for making data and documents, otherwise unavailable during training, available to the LLM for reasoning tasks. In addition to making dynamic and quantitative data available to the LLM, RAG provides the means by which to carefully control and trace source material, thereby ensuring results are accurate, complete, and authoritative.<div class="boxTitle">Results</div>Here, we introduce LmRaC, an LLM-based tool capable of answering complex scientific questions in the context of a user’s own experimental results. LmRaC allows users to dynamically build domain-specific knowledge bases from PubMed sources (<span style="font-style:italic;">RAGdom</span>). Answers are drawn solely from this RAG with citations to the paragraph level, virtually eliminating any chance of hallucination or fabrication. These answers can then be used to construct an experimental context (<span style="font-style:italic;">RAGexp</span>) that, along with user supplied documents (e.g. design, protocols) and quantitative results, can be used to answer questions about the user’s specific experiment. Questions about quantitative experimental data are integral to LmRaC and are supported by a user-defined and functionally extensible REST API server (<span style="font-style:italic;">RAGfun</span>).<div class="boxTitle">Availability and implementation</div>Detailed documentation for LmRaC along with a sample REST API server for defining user functions can be found at <a href="https://github.com/dbcraig/LmRaC">https://github.com/dbcraig/LmRaC</a>. 
The LmRaC web application image can be pulled from Docker Hub (<a href="https://hub.docker.com">https://hub.docker.com</a>) as dbcraig/lmrac.</span>


AltGosling: automatic generation of text descriptions for accessible genomics data visualization
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Biomedical visualizations are key to accessing biomedical knowledge and detecting new patterns in large datasets. Interactive visualizations are essential for biomedical data scientists and are omnipresent in data analysis software and data portals. Without appropriate descriptions, these visualizations are not accessible to all people with blindness and low vision, who often rely on screen reader accessibility technologies to access visual information on digital devices. Screen readers require descriptions to convey image content. However, many images lack informative descriptions due to a lack of awareness and the difficulty of writing such descriptions. Describing complex and interactive visualizations, like genomics data visualizations, is even more challenging. Automatic generation of descriptions could be beneficial, yet current alt-text generation models are limited to basic visualizations and cannot be used for genomics.<div class="boxTitle">Results</div>We present AltGosling, an automated description generation tool focused on interactive data visualizations of genome-mapped data, created with the grammar-based genomics toolkit Gosling. The logic-based algorithm of AltGosling creates various descriptions including a tree-structured navigable panel. We co-designed AltGosling with a blind screen reader user (co-author). We show that AltGosling outperforms state-of-the-art large language models and common image-based neural networks for alt text generation of genomics data visualizations. As a first of its kind in genomic research, we lay the groundwork to increase accessibility in the field.<div class="boxTitle">Availability and implementation</div>The source code, examples, and interactive demo are accessible under the MIT License at <a href="https://github.com/gosling-lang/altgosling">https://github.com/gosling-lang/altgosling</a>. 
The package is available at <a href="https://www.npmjs.com/package/altgosling">https://www.npmjs.com/package/altgosling</a>.</span>
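The idea of a "tree-structured navigable panel" can be illustrated with a toy sketch: a chart specification is turned into a nested tree of plain-text description nodes that a screen reader user can walk level by level. All function and field names below are hypothetical; AltGosling itself is a TypeScript package with its own Gosling-specific logic.

```python
# Toy sketch of a tree-structured, navigable description.
# Names and the spec shape are illustrative, not AltGosling's API.

def describe_track(track):
    """Return a one-line description for a single visualization track."""
    return f"{track['mark']} showing {track['y']} along {track['x']}"

def build_description_tree(spec):
    """Turn a simple chart spec into a nested description tree."""
    return {
        "label": f"Visualization: {spec['title']}",
        "children": [
            {"label": describe_track(t), "children": []}
            for t in spec["tracks"]
        ],
    }

def render(node, depth=0):
    """Flatten the tree into indented text a screen reader could traverse."""
    lines = ["  " * depth + node["label"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines

spec = {
    "title": "Gene expression over chromosome 1",
    "tracks": [{"mark": "bar", "x": "position", "y": "expression"}],
}
tree = build_description_tree(spec)
```

The nesting mirrors the visualization's structure, so deeper tree levels correspond to finer-grained description, which is what makes the panel navigable rather than a single flat alt text.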


FAPM: functional annotation of proteins using multimodal models beyond structural modeling
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels.<div class="boxTitle">Results</div>We introduce functional annotation of proteins using multimodal models (FAPM), a contrastive multimodal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM’s flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation.<div class="boxTitle">Availability and implementation</div>The online demo is at: <a href="https://huggingface.co/spaces/wenkai/FAPM_demo">https://huggingface.co/spaces/wenkai/FAPM_demo</a>.</span>


Predicting the subcellular location of prokaryotic proteins with DeepLocPro
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Protein subcellular location prediction is a widely explored task in bioinformatics because of its importance in proteomics research. We propose DeepLocPro, an extension to the popular method DeepLoc, tailored specifically to archaeal and bacterial organisms.<div class="boxTitle">Results</div>DeepLocPro is a multiclass subcellular location prediction tool for prokaryotic proteins, trained on experimentally verified data curated from UniProt and PSORTdb. DeepLocPro compares favorably to the PSORTb 3.0 ensemble method, surpassing its performance across multiple metrics in our benchmark experiment.<div class="boxTitle">Availability and implementation</div>The DeepLocPro prediction tool is available online at <a href="https://ku.biolib.com/deeplocpro">https://ku.biolib.com/deeplocpro</a> and <a href="https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/">https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/</a>.</span>


DeepRSMA: a cross-fusion-based deep learning method for RNA–small molecule binding affinity prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>RNA is implicated in numerous aberrant cellular functions and disease progressions, highlighting the crucial importance of RNA-targeted drugs. To accelerate the discovery of such drugs, it is essential to develop an effective computational method for predicting RNA–small molecule affinity (RSMA). Recently, deep learning-based computational methods have been promising due to their powerful nonlinear modeling ability. However, the leveraging of advanced deep learning methods to mine the diverse information of RNAs, small molecules, and their interaction still remains a great challenge.<div class="boxTitle">Results</div>In this study, we present DeepRSMA, an innovative cross-attention-based deep learning method for RSMA prediction. To effectively capture fine-grained features from RNA and small molecules, we developed nucleotide-level and atomic-level feature extraction modules for RNA and small molecules, respectively. Additionally, we incorporated both sequence and graph views into these modules to capture features from multiple perspectives. Moreover, a transformer-based cross-fusion module is introduced to learn the general patterns of interactions between RNAs and small molecules. To achieve effective RSMA prediction, we integrated the RNA and small molecule representations from the feature extraction and cross-fusion modules. Our results show that DeepRSMA outperforms baseline methods in multiple test settings. The interpretability analysis and the case study on spinal muscular atrophy demonstrate that DeepRSMA has the potential to guide RNA-targeted drug design.<div class="boxTitle">Availability and implementation</div>The codes and data are publicly available at <a href="https://github.com/Hhhzj-7/DeepRSMA">https://github.com/Hhhzj-7/DeepRSMA</a>.</span>


FEHAT: efficient, large scale and automated heartbeat detection in Medaka fish embryos
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>High-resolution imaging of model organisms allows the quantification of important physiological measurements. In the case of fish with transparent embryos, these videos can visualize key physiological processes, such as heartbeat. High throughput systems can provide enough measurements for the robust investigation of developmental processes as well as the impact of system perturbations on physiological state. However, few analytical schemes have been designed to handle thousands of high-resolution videos without the need for some level of human intervention. We developed a software package, named FEHAT, to provide a fully automated solution for the analytics of large numbers of heart rate imaging datasets obtained from developing Medaka fish embryos in 96-well plate format imaged on an Acquifer machine. FEHAT uses image segmentation to define regions of the embryo showing changes in pixel intensity over time, followed by the classification of the most likely position of the heart and Fourier Transformations to estimate the heart rate. Here, we describe some important features of the FEHAT software, showcasing its performance across a large set of medaka fish embryos and compare its performance to established, less automated solutions. FEHAT provides reliable heart rate estimates across a range of temperature-based perturbations and can be applied to tens of thousands of embryos without the need for any human intervention.<div class="boxTitle">Availability and implementation</div>Data used in this manuscript will be made available on request.</span>


Ranking antibody binding epitopes and proteins across samples from whole proteome tiled linear peptides
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Introduction</div>Ultradense peptide binding arrays that can probe millions of linear peptides comprising the entire proteomes of human or mouse, or hundreds of thousands of microbes, are powerful tools for studying the antibody repertoire in serum samples to understand adaptive immune responses.<div class="boxTitle">Motivation</div>There are few tools for exploring high-dimensional, significant and reproducible antibody targets for ultradense peptide binding arrays at the linear peptide, epitope (grouping of adjacent peptides), and protein level across multiple samples/subjects (i.e. epitope spread or immunogenic regions of proteins) for understanding the heterogeneity of immune responses.<div class="boxTitle">Results</div>We developed <strong>H</strong>ierarchical antibody binding <strong>E</strong>pitopes and p<strong>RO</strong>teins from li<strong>N</strong>ear peptides (HERON), an R package, which can identify immunogenic epitopes, using meta-analyses and spatial clustering techniques to explore antibody targets at various resolution and confidence levels, that can be found consistently across a specified number of samples through the entire proteome to study antibody responses for diagnostics or treatment. Our approach estimates significance values at the linear peptide (probe), epitope, and protein level to identify top candidates for validation. 
We tested the performance of predictions at all three levels using correlation between technical replicates and comparison of epitope calls on two datasets. The results showed HERON’s competitiveness in estimating false discovery rates and in finding general and sample-level regions of interest for antibody binding.<div class="boxTitle">Availability and implementation</div>The HERON R package is available at Bioconductor <a href="https://bioconductor.org/packages/release/bioc/html/HERON.html">https://bioconductor.org/packages/release/bioc/html/HERON.html</a>.</span>
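The epitope level sits between probes and proteins: adjacent significant peptides along the tiling are merged into candidate epitope intervals. The sketch below illustrates that spatial-grouping idea with a naive threshold-and-merge rule; HERON itself is an R package whose calls are based on meta-analysis p-values and proper spatial clustering, so the threshold and gap rule here are illustrative only.

```python
# Toy illustration of epitope calling by merging adjacent significant
# tiled peptides (probes). Not HERON's algorithm; alpha and max_gap
# are arbitrary demonstration values.

def call_epitopes(probe_positions, pvalues, alpha=0.05, max_gap=1):
    """Merge significant probes whose start positions lie within
    max_gap tiling steps into contiguous epitope intervals."""
    sig = sorted(pos for pos, p in zip(probe_positions, pvalues) if p < alpha)
    epitopes = []
    for pos in sig:
        if epitopes and pos - epitopes[-1][1] <= max_gap:
            epitopes[-1][1] = pos          # extend the current epitope
        else:
            epitopes.append([pos, pos])    # start a new epitope
    return [tuple(e) for e in epitopes]

# Probes tiled every residue; two separated clusters of significance.
positions = [1, 2, 3, 4, 10, 11, 12]
pvals     = [0.01, 0.02, 0.03, 0.6, 0.001, 0.02, 0.04]
epitopes  = call_epitopes(positions, pvals)   # [(1, 3), (10, 12)]
```

Grouping adjacent probes in this way is what lets evidence that is weak at any single peptide accumulate into a confident call over a contiguous immunogenic region.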


Afpdb: an efficient structure manipulation package for AI protein design
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The advent of AlphaFold and other protein Artificial Intelligence (AI) models has transformed protein design, necessitating efficient handling of large-scale data and complex workflows. Using existing programming packages that predate recent AI advancements often leads to inefficiencies in human coding and slow code execution. To address this gap, we developed the Afpdb package.<div class="boxTitle">Results</div>Afpdb, built on AlphaFold’s NumPy architecture, offers a high-performance core. It uses RFDiffusion's contig syntax to streamline residue and atom selection, making coding simpler and more readable. Integrating PyMOL’s visualization capabilities, Afpdb allows automatic visual quality control. With over 180 methods commonly used in protein AI design, which are otherwise hard to find, Afpdb enhances productivity in structural biology by supporting the development of concise, high-performance code.<div class="boxTitle">Availability and implementation</div>Code and documentation are available on GitHub (<a href="https://github.com/data2code/afpdb">https://github.com/data2code/afpdb</a>) and PyPI (<a href="https://pypi.org/project/afpdb">https://pypi.org/project/afpdb</a>). An interactive tutorial is accessible through Google Colab.</span>