Bioinformatics - current issue
GUEST: an R package for handling estimation of graphical structure and multiclassification for error-prone gene expression data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>In bioinformatics studies, understanding the network structure of gene expression variables is one of the main interests. In the framework of data science, graphical models have been widely used to characterize the dependence structure among multivariate random variables. However, gene expression data may suffer from ultrahigh-dimensionality and measurement error, which make detection of the network structure challenging. Another important application of gene expression variables is to provide information to classify subjects into various tumors or diseases. In supervised learning, while linear discriminant analysis is a commonly used approach, its conventional implementation is limited to precisely measured variables and requires computation of their inverse covariance matrix, known as the precision matrix. To tackle those challenges and provide a reliable estimation procedure for public use, we develop the R package GUEST, which stands for <strong><span style="font-style:italic;">G</span></strong>raphical models for <strong><span style="font-style:italic;">U</span></strong>ltrahigh-dimensional and <strong><span style="font-style:italic;">E</span></strong>rror-prone data by the boo<strong><span style="font-style:italic;">ST</span></strong>ing algorithm. This R package deals with measurement error effects in high-dimensional variables under various distributions and then applies the boosting algorithm to identify the network structure and estimate the precision matrix. 
When the precision matrix is estimated, it can be used to construct the linear discriminant function and improve the accuracy of the classification.<div class="boxTitle">Availability and implementation</div>The R package is available on <a href="https://cran.r-project.org/web/packages/GUEST/index.html">https://cran.r-project.org/web/packages/GUEST/index.html</a>.</span>
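The classification step described above can be sketched in a few lines: given an estimated precision matrix, the two-class linear discriminant rule follows directly. This is an illustrative numpy sketch of standard LDA, not the GUEST API; all names here are hypothetical.

```python
import numpy as np

# Illustrative sketch (not the GUEST API): once a precision matrix Omega
# has been estimated, the two-class linear discriminant direction is
# w = Omega @ (mu1 - mu0).
def lda_classify(x, mu0, mu1, omega, prior0=0.5, prior1=0.5):
    """Return 1 if x is assigned to class 1, else 0."""
    w = omega @ (mu1 - mu0)
    # Midpoint threshold, shifted by the log-ratio of the class priors.
    threshold = 0.5 * w @ (mu0 + mu1) - np.log(prior1 / prior0)
    return 1 if x @ w > threshold else 0
```

A more accurate precision matrix estimate sharpens `w`, which is why improved network estimation feeds directly into improved classification.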


VSS-Hi-C: variance-stabilized signals for chromatin contacts
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The genome-wide chromosome conformation capture assay Hi-C is widely used to study chromatin 3D structures and their functional implications. Read counts from Hi-C indicate the strength of chromatin contact between each pair of genomic loci. These read counts are heteroskedastic: a difference between interaction frequencies of 0 and 100 is much more significant than a difference between 1000 and 1100. This property impedes visualization and downstream analysis because it violates the Gaussian assumption of many computational tools. Thus, heuristic variance-stabilizing transformations, such as the shifted-log transformation, are typically applied to the data before visualization or input to models with Gaussian assumptions. However, such heuristic transformations cannot fully stabilize the variance because of their restrictive assumptions about the mean–variance relationship in the data.<div class="boxTitle">Results</div>Here, we present VSS-Hi-C, a data-driven variance stabilization method for Hi-C data. We show that VSS-Hi-C signals have unit variance, improving visualization of Hi-C data, for example in contact map heatmaps. VSS-Hi-C signals also improve the performance of subcompartment callers relying on Gaussian observations. VSS-Hi-C is implemented as an R package and can be used for variance stabilization of different genomic and epigenomic data types with two replicates available.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/nedashokraneh/vssHiC">https://github.com/nedashokraneh/vssHiC</a>.</span>
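The shifted-log transform mentioned above is easy to make concrete: adding a pseudocount before taking the log compresses large counts, so equal-sized gaps matter less at high interaction frequencies. A minimal sketch (the pseudocount value is an assumption; VSS-Hi-C itself learns the mean–variance relationship from the data instead):

```python
import numpy as np

# Heuristic shifted-log transform: add a fixed pseudocount, then log.
# Large counts are compressed, so a gap of 100 near zero stays large
# while the same gap near 1000 almost vanishes.
def shifted_log(counts, shift=1.0):
    return np.log2(np.asarray(counts, dtype=float) + shift)

low_gap = shifted_log(100) - shifted_log(0)      # log2(101) - log2(1)
high_gap = shifted_log(1100) - shifted_log(1000)  # log2(1101) - log2(1001)
```

The restrictive part is the fixed `shift`: it implicitly assumes one particular mean–variance relationship, which is exactly the limitation the data-driven approach addresses.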


Knowledge mining of brain connectivity in massive literature based on transfer learning
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Neuroscientists have long endeavored to map brain connectivity, yet the intricate nature of brain networks often leads them to concentrate on specific regions, hindering efforts to unveil a comprehensive connectivity map. Recent advancements in imaging and text mining techniques have enabled the accumulation of a vast body of literature containing valuable insights into brain connectivity, facilitating the extraction of whole-brain connectivity relations from this corpus. However, the diverse representations of brain region names and connectivity relations pose a challenge for conventional machine learning methods and dictionary-based approaches in identifying all instances accurately.<div class="boxTitle">Results</div>We propose BioSEPBERT, a <strong>bio</strong>medical pre-trained model based on <strong>s</strong>tart-<strong>e</strong>nd position <strong>p</strong>ointers and <strong>BERT</strong>. In addition, our model integrates specialized identifiers with enhanced self-attention capabilities for preceding and succeeding brain regions, thereby improving the performance of named entity recognition and relation extraction in neuroscience. Our approach achieves optimal F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we leverage BioSEPBERT to extract 22.6 million standardized brain regions and 165 072 directional relations from a corpus comprising 1.3 million abstracts and 193 100 full-text articles. 
The results demonstrate that our model enables researchers to rapidly acquire knowledge regarding neural circuits across various brain regions, thereby enhancing comprehension of brain connectivity in specific regions.<div class="boxTitle">Availability and implementation</div>Data and source code are available at <a href="http://atlas.brainsmatics.org/res/BioSEPBERT">http://atlas.brainsmatics.org/res/BioSEPBERT</a> and <a href="https://github.com/Brainsmatics/BioSEPBERT">https://github.com/Brainsmatics/BioSEPBERT</a>.</span>


A BLAST from the past: revisiting blastp’s E-value
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The Basic Local Alignment Search Tool, BLAST, is an indispensable tool for genomic research. BLAST has established itself as the canonical tool for sequence similarity search in large part thanks to its meaningful statistical analysis. Specifically, BLAST reports the <span style="font-style:italic;">E</span>-value of each reported alignment, which is defined as the expected number of optimal local alignments that will score at least as high as the observed alignment score, assuming that the query and the database sequences are randomly generated.<div class="boxTitle">Results</div>Here, we critically evaluate the <span style="font-style:italic;">E</span>-values provided by the standard protein BLAST (blastp), showing that they can be at times significantly conservative while at others too liberal. We offer an alternative approach based on generating a small sample from the null distribution of random optimal alignments, and testing whether the observed alignment score is consistent with it. In contrast with blastp, our significance analysis seems valid, in the sense that it did not deliver inflated significance estimates in any of our extensive experiments. Moreover, although our method is slightly conservative, it is often significantly less so than the blastp <span style="font-style:italic;">E</span>-value. Indeed, in cases where blastp’s analysis is valid (i.e., not too liberal), our approach seems to deliver a greater number of correct alignments. One advantage of our approach is that it works with any reasonable choice of substitution matrix and gap penalties, avoiding blastp’s limited options of matrices and penalties. 
In addition, we can formulate the problem using a canonical family-wise error rate control setup, thereby dispensing with <span style="font-style:italic;">E</span>-values, which can at times be difficult to interpret.<div class="boxTitle">Availability and implementation</div>The Apache licensed source code is available at <a href="https://github.com/batmen-lab/SGPvalue">https://github.com/batmen-lab/SGPvalue</a>.</span>
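The sampling-based significance test described above can be sketched in a few lines: compare the observed alignment score against a small sample of optimal scores from random sequences. This is an illustrative sketch of the generic idea, not the SGPvalue implementation; the +1 correction keeps the estimate valid (never zero) by construction.

```python
# Empirical p-value from a null sample of optimal alignment scores of
# randomly generated sequences: (1 + #{null >= observed}) / (1 + n).
# The +1 terms guarantee a valid (slightly conservative) p-value.
def empirical_pvalue(observed_score, null_scores):
    exceed = sum(s >= observed_score for s in null_scores)
    return (1 + exceed) / (1 + len(null_scores))
```

Because the null sample is generated under whatever substitution matrix and gap penalties the user chooses, this scheme works for any reasonable scoring parameters, unlike blastp's precomputed statistics.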


HBFormer: a single-stream framework based on hybrid attention mechanism for identification of human-virus protein–protein interactions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Exploring human-virus protein–protein interactions (PPIs) is crucial for unraveling the underlying pathogenic mechanisms of viruses. Limitations in the coverage and scalability of high-throughput approaches have impeded the identification of certain key interactions. Current popular computational methods adopt a two-stream pipeline to identify PPIs, which can only achieve relation modeling of protein pairs at the classification phase. However, the fitting capacity of the classifier is insufficient to comprehensively mine the complex interaction patterns between protein pairs.<div class="boxTitle">Results</div>In this study, we propose a pioneering single-stream framework HBFormer that combines hybrid attention mechanism and multimodal feature fusion strategy for identifying human-virus PPIs. The Transformer architecture based on hybrid attention can bridge the bidirectional information flows between human protein and viral protein, thus unifying joint feature learning and relation modeling of protein pairs. The experimental results demonstrate that HBFormer not only achieves superior performance on multiple human-virus PPI datasets but also outperforms 5 other state-of-the-art human-virus PPI identification methods. Moreover, ablation studies and scalability experiments further validate the effectiveness of our single-stream framework.<div class="boxTitle">Availability and implementation</div>Codes and datasets are available at <a href="https://github.com/RmQ5v/HBFormer">https://github.com/RmQ5v/HBFormer</a>.</span>


Pod5Viewer: a GUI for inspecting raw nanopore sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Oxford Nanopore Technologies recently adopted the POD5 file format for storing raw nanopore sequencing data. The information stored in these files provides detailed insights into the sequencing features and enhances the understanding of raw nanopore data. However, visualizing the data can be cumbersome, especially for users without programming skills. To address this issue, we developed the pod5Viewer, a GUI application for inspecting POD5 files.<div class="boxTitle">Results</div>The pod5Viewer offers straightforward access to raw sequencing data and associated metadata in POD5 files. It includes functionalities for viewing, plotting, and exporting individual reads. Designed with user-friendliness in mind, the pod5Viewer is easy to install and use, making it suitable for users of all technical backgrounds.<div class="boxTitle">Availability and implementation</div>The pod5Viewer is available as open source from the pod5Viewer GitHub repository (<a href="https://github.com/dietvin/pod5Viewer">https://github.com/dietvin/pod5Viewer</a>).</span>


JARVIS3: an efficient encoder for genomic data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Large-scale genomic projects grapple with the complex challenge of reducing medium- and long-term storage space and its associated energy consumption, monetary costs, and environmental footprint.<div class="boxTitle">Results</div>We present JARVIS3, an advanced tool engineered for the efficient reference-free compression of genomic sequences. JARVIS3 introduces a pioneering approach, specifically through enhanced table memory models and probabilistic lookup-tables applied in repeat models. These optimizations are pivotal in substantially enhancing computational efficiency. JARVIS3 offers three distinct profiles: (i) rapid computation with moderate compression, (ii) a balanced trade-off between time and compression, and (iii) slower computation with significantly higher compression ratios. The implementation of JARVIS3 is rooted in the C programming language, building upon the success of its predecessor, JARVIS2. JARVIS3 shows substantial speed improvements relative to JARVIS2 while providing slightly better compression. Furthermore, we provide a versatile C/Bash implementation, facilitating application to FASTA and FASTQ data, including the capability for parallel computation. In addition, JARVIS3 includes a mode for outputting bit information, as well as the Normalized Compression and bit rates, facilitating compression-based analysis. This establishes JARVIS3 as an open-source solution for genomic data compression and analysis.<div class="boxTitle">Availability and implementation</div>JARVIS3 is freely available at <a href="https://github.com/cobilab/jarvis3">https://github.com/cobilab/jarvis3</a>.</span>


OpenVariant: a toolkit to parse and operate multiple input file formats
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Advances in high-throughput DNA sequencing technologies and decreasing costs have fueled the identification of small genetic variants (such as single nucleotide variants and indels) across tumors. Despite efforts to standardize variant formats and vocabularies, many sources of variability persist across databases and computational tools that annotate variants, hindering their integration within cancer genomic analyses. In this context, we present OpenVariant, an easily extendable Python package that facilitates seamless reading, parsing and refinement of diverse input file formats in a customizable structure, all within a single process.<div class="boxTitle">Availability and implementation</div>OpenVariant is an open-source package available at <a href="https://github.com/bbglab/openvariant">https://github.com/bbglab/openvariant</a>. Documentation may be found at <a href="https://openvariant.readthedocs.io">https://openvariant.readthedocs.io</a>.</span>


Polyphonia: detecting inter-sample contamination in viral genomic sequencing data
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>In viral genomic research and surveillance, inter-sample contamination can affect variant detection, analysis of within-host evolution, outbreak reconstruction, and detection of superinfections and recombination events. While sample barcoding methods exist to track inter-sample contamination, they are not always used and can only detect contamination in the experimental pipeline from the point they are added. The underlying genomic information in a sample, however, carries information about inter-sample contamination occurring at any stage. Here, we present Polyphonia, a tool for detecting inter-sample contamination directly from deep sequencing data without the need for additional controls, using intrahost variant frequencies. We apply Polyphonia to 1102 SARS-CoV-2 samples sequenced at the Broad Institute and already tracked using molecular barcoding for comparison.<div class="boxTitle">Availability and implementation</div>Polyphonia is available as a standalone Docker image and is also included as part of viral-ngs, available in Dockstore. Full documentation, source code, and instructions for use are available at <a href="https://github.com/broadinstitute/polyphonia">https://github.com/broadinstitute/polyphonia</a>.</span>


spread.gl: visualizing pathogen dispersal in a high-performance browser application
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Bayesian phylogeographic analyses are pivotal in reconstructing the spatio-temporal dispersal histories of pathogens. However, interpreting the complex outcomes of phylogeographic reconstructions requires sophisticated visualization tools.<div class="boxTitle">Results</div>To meet this challenge, we developed spread.gl, an open-source, feature-rich browser application offering a smooth and intuitive visualization tool for both discrete and continuous phylogeographic inferences, including the animation of pathogen geographic dispersal through time. Spread.gl can render and combine the visualization of multiple layers that contain information extracted from the input phylogeny and diverse environmental data layers, enabling researchers to explore which environmental factors may have impacted pathogen dispersal patterns before conducting formal testing. We showcase the visualization features of spread.gl with representative examples, including the smooth animation of a phylogeographic reconstruction based on &gt;17 000 SARS-CoV-2 genomic sequences.<div class="boxTitle">Availability and implementation</div>Source code, installation instructions, example input data, and outputs of spread.gl are accessible at <a href="https://github.com/GuyBaele/SpreadGL">https://github.com/GuyBaele/SpreadGL</a>.</span>


HAlign 4: a new strategy for rapidly aligning millions of sequences
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>HAlign is a high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.<div class="boxTitle">Results</div>To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows–Wheeler Transform and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million coronavirus disease 2019 (COVID-19) sequences in about 12 min and 300 GB of memory using 96 threads, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.<div class="boxTitle">Availability and implementation</div>Source code is available at <a href="https://github.com/malabz/HAlign-4">https://github.com/malabz/HAlign-4</a>, and the dataset is available at <a href="https://zenodo.org/records/13934503">https://zenodo.org/records/13934503</a>.</span>


BWT construction and search at the terabase scale
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The Burrows–Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices.<div class="boxTitle">Results</div>We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 h and indexed 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using up to 170 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/lh3/ropebwt3">https://github.com/lh3/ropebwt3</a>.</span>
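To make the transform itself concrete, here is the textbook definition of the BWT via sorted rotations. This naive O(n² log n) construction is only for illustration; ropebwt3's contribution is precisely that it replaces this with incremental algorithms that scale to terabases.

```python
# Naive BWT: append a sentinel, sort all cyclic rotations, and take the
# last column. Redundant input yields long runs in the output, which is
# why the BWT compresses pangenome data so well.
def bwt(text, sentinel="$"):
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

For example, `bwt("banana")` groups the repeated characters of the input into runs, which run-length encoding then collapses.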


Virtual tissue expression analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Bulk RNA expression data are widely accessible, whereas single-cell data are relatively scarce in comparison. However, single-cell data offer profound insights into the cellular composition of tissues and cell type-specific gene regulation, both of which remain hidden in bulk expression analysis.<div class="boxTitle">Results</div>Here, we present tissueResolver, an algorithm designed to extract single-cell information from bulk data, enabling us to attribute expression changes to individual cell types. When validated on simulated data, tissueResolver outperforms competing methods. Additionally, our study demonstrates that tissueResolver reveals cell type-specific regulatory distinctions between the activated B-cell-like (ABC) and germinal center B-cell-like (GCB) subtypes of diffuse large B-cell lymphomas (DLBCL).<div class="boxTitle">Availability and implementation</div>The R package is available at <a href="https://github.com/spang-lab/tissueResolver">https://github.com/spang-lab/tissueResolver</a> (archived as <a href="https://zenodo.org/records/14160846">10.5281/zenodo.14160846</a>). Code for reproducing the results of this article is available at <a href="https://github.com/spang-lab/tissueResolver-docs">https://github.com/spang-lab/tissueResolver-docs</a> (archived as <a href="https://archive.softwareheritage.org/swh:1:dir:faea2d4f0ded30de774b28e028299ddbdd0c4f89">swh:1:dir:faea2d4f0ded30de774b28e028299ddbdd0c4f89</a>).</span>
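The underlying deconvolution idea can be sketched simply: model bulk expression as a mixture of cell-type signature profiles and solve for the mixing proportions. This is a generic least-squares sketch under that assumption, not tissueResolver's actual algorithm, which goes further by attributing expression changes to individual cell types.

```python
import numpy as np

# Generic bulk deconvolution sketch (not tissueResolver's algorithm):
# model the bulk profile b as S @ p, where S is a genes x cell-types
# signature matrix and p the unknown cell-type proportions.
def estimate_proportions(signatures, bulk):
    p, *_ = np.linalg.lstsq(signatures, bulk, rcond=None)
    p = np.clip(p, 0.0, None)   # proportions cannot be negative
    return p / p.sum()          # normalize to sum to one
```

With well-separated signatures, the true mixture is recovered exactly; real data require the regularization and cell-type-specific modeling that dedicated tools provide.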


STRprofiler: efficient comparisons of short tandem repeat profiles for biomedical model authentication
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Short tandem repeat (STR) profiling is commonly performed for authentication of biomedical models of human origin, yet no tools exist to easily compare sets of STR profiles to each other or an existing database in a high-throughput manner. Here, we present STRprofiler, a Python package, command line tool, and Shiny application providing methods for STR profile comparison and cross-contamination detection. STRprofiler can be run with custom databases or used to query against the Cellosaurus cell line database.<div class="boxTitle">Availability and implementation</div>STRprofiler is freely available as a Python package with a rich CLI from PyPI <a href="https://pypi.org/project/strprofiler/">https://pypi.org/project/strprofiler/</a> with source code available under the MIT license on GitHub <a href="https://github.com/j-andrews7/strprofiler">https://github.com/j-andrews7/strprofiler</a> and at <a href="https://zenodo.org/records/10989034">https://zenodo.org/records/10989034</a>. A web server hosting an example STRprofiler Shiny application backed by a database with data from the National Cancer Institute-funded PDXNet consortium and The Jackson Laboratory PDX program is available at <a href="https://sj-bakerlab.shinyapps.io/strprofiler/">https://sj-bakerlab.shinyapps.io/strprofiler/</a>. Full documentation is available at <a href="https://strprofiler.readthedocs.io/en/latest/">https://strprofiler.readthedocs.io/en/latest/</a>.</span>
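The core of STR profile comparison is a simple allele-overlap score. The sketch below implements the Tanabe (percent-match) score, a standard metric in this field; it is shown only to make the comparison idea concrete, and STRprofiler's documentation should be consulted for its exact scoring options.

```python
# Tanabe similarity between two STR profiles, each represented as a dict
# mapping marker name -> set of alleles:
#   2 * shared alleles / (alleles in A + alleles in B),
# computed over markers typed in both profiles.
def tanabe_score(profile_a, profile_b):
    shared = total_a = total_b = 0
    for marker in profile_a.keys() & profile_b.keys():
        alleles_a, alleles_b = profile_a[marker], profile_b[marker]
        shared += len(alleles_a & alleles_b)
        total_a += len(alleles_a)
        total_b += len(alleles_b)
    return 2 * shared / (total_a + total_b)
```

Identical profiles score 1.0; extra alleles appearing in only one profile (a hallmark of cross-contamination) pull the score down, which is why pairwise scoring across a database flags both mislabeled and contaminated models.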


easySCF: a tool for enhancing interoperability between R and Python for efficient single-cell data analysis
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>This study introduces easySCF, a tool designed to enhance the interoperability of single-cell data between the two major bioinformatics platforms, R and Python. By supporting seamless data exchange, easySCF improves the efficiency and accuracy of single-cell data analysis.<div class="boxTitle">Availability and implementation</div>easySCF utilizes a unified data format (.h5 format) to facilitate data transfer between R and Python platforms. The tool has been evaluated for data processing speed, memory efficiency, and disk usage, as well as its capability to handle large-scale single-cell datasets. easySCF is available as an open-source package, with implementation details and documentation accessible at <a href="https://github.com/xleizi/easySCF">https://github.com/xleizi/easySCF</a>.</span>


Fast polypharmacy side effect prediction using tensor factorization
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Adverse reactions from drug combinations are increasingly common, making their accurate prediction a crucial challenge in modern medicine. Laboratory-based identification of these reactions is insufficient due to the combinatorial nature of the problem. While many computational approaches have been proposed, tensor factorization (TF) models have shown mixed results, necessitating a thorough investigation of their capabilities when properly optimized.<div class="boxTitle">Results</div>We demonstrate that TF models can achieve state-of-the-art performance on polypharmacy side effect prediction, with our best model (SimplE) achieving median scores of 0.978 area under receiver-operating characteristic curve, 0.971 area under precision–recall curve, and 1.000 AP@50 across 963 side effects. Notably, this model reaches 98.3% of its maximum performance after just two epochs of training (approximately 4 min), making it substantially faster than existing approaches while maintaining comparable accuracy. We also find that incorporating monopharmacy data as self-looping edges in the graph performs marginally better than using it to initialize embeddings.<div class="boxTitle">Availability and implementation</div>All code used in the experiments is available in our GitHub repository (<a href="https://doi.org/10.5281/zenodo.10684402">https://doi.org/10.5281/zenodo.10684402</a>). The implementation was carried out using Python 3.8.12 with PyTorch 1.7.1, accelerated with CUDA 11.4 on NVIDIA GeForce RTX 2080 Ti GPUs.</span>
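The scoring idea behind TF link-prediction models is compact enough to show directly. The sketch below is a DistMult-style trilinear score, a simplified stand-in: SimplE itself uses paired head/tail embeddings per entity plus inverse-relation embeddings, but the core mechanism, scoring a (drug, side effect, drug) triple by summed elementwise products of learned embeddings, is the same.

```python
import numpy as np

# Trilinear scoring sketch for a knowledge-graph triple (h, r, t):
# each drug (h, t) and side effect (r) has a learned embedding vector,
# and the triple's plausibility is the sum of elementwise products.
def score(head, relation, tail):
    return float(np.sum(head * relation * tail))

def probability(head, relation, tail):
    # Squash the unbounded score into (0, 1) with a sigmoid.
    return 1.0 / (1.0 + np.exp(-score(head, relation, tail)))
```

Training fits the embeddings so that observed drug-pair side effects score high and negative samples score low; prediction is then just this cheap product, which is why inference over all 963 side effects is fast.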


ViraLM: empowering virus discovery through the genome foundation model
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Viruses, with their ubiquitous presence and high diversity, play pivotal roles in ecological systems and public health. Accurate identification of viruses in various ecosystems is essential for comprehending their variety and assessing their ecological influence. Metagenomic sequencing has become a major strategy to survey the viruses in various ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising in novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency in virus search results produced by available tools further highlights the urgent need for a more robust tool for virus identification.<div class="boxTitle">Results</div>In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By using the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.<div class="boxTitle">Availability and implementation</div>The source code of ViraLM is available via: <a href="https://github.com/ChengPENG-wolf/ViraLM">https://github.com/ChengPENG-wolf/ViraLM</a>.</span>


CVR-BBI: an open-source VR platform for multi-user collaborative brain-to-brain interfaces
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>As brain imaging and neurofeedback technologies advance, the brain-to-brain interface (BBI) has emerged as an innovative field, enabling in-depth exploration of cross-brain information exchange and enhancing our understanding of collaborative intelligence. However, no open-source virtual reality (VR) platform currently supports the rapid and efficient configuration of multi-user, collaborative BBIs. To address this gap, we introduce the Collaborative Virtual Reality Brain-to-Brain Interface (CVR-BBI), an open-source platform consisting of a client and server. The CVR-BBI client enables users to participate in collaborative experiments, collect electroencephalogram (EEG) data, and manage interactive multisensory stimuli within the VR environment. Meanwhile, the CVR-BBI server manages multi-user collaboration paradigms, and performs real-time analysis of the EEG data. We evaluated the CVR-BBI platform using the SSVEP paradigm and observed that collaborative decoding outperformed individual decoding, validating the platform’s effectiveness in collaborative settings. The CVR-BBI offers a pioneering platform that facilitates the development of innovative BBI applications within collaborative VR environments, thereby enhancing the understanding of brain collaboration and cognition.<div class="boxTitle">Availability and implementation</div>CVR-BBI is released as an open-source platform, with its source code being available at <a href="https://github.com/DILIU1/CVR-BBI">https://github.com/DILIU1/CVR-BBI</a>.</span>


FastTENET: an accelerated TENET algorithm based on manycore computing in Python
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>TENET reconstructs gene regulatory networks from single-cell RNA sequencing (scRNAseq) data using the transfer entropy (TE), and works successfully on a variety of scRNAseq data. However, TENET is limited by its long computation time for large datasets. To address this limitation, we propose FastTENET, an array-computing version of TENET algorithm optimized for acceleration on manycore processors such as GPUs. FastTENET counts the unique patterns of joint events to compute the TE based on array computing. Compared to TENET, FastTENET achieves up to 973× performance improvement.<div class="boxTitle">Availability and implementation</div>FastTENET is available on GitHub at <a href="https://github.com/cxinsys/fasttenet">https://github.com/cxinsys/fasttenet</a>.</span>
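The count-based computation that FastTENET accelerates can be sketched directly: transfer entropy is a plug-in estimate over counts of joint patterns (y<sub>t+1</sub>, y<sub>t</sub>, x<sub>t</sub>) in a pair of discretized expression series. This is a minimal single-pair sketch of the standard TE formula, not FastTENET's array-computing implementation.

```python
from collections import Counter
from math import log2

# Plug-in transfer entropy TE(X -> Y) from two discretized series:
# TE = sum over patterns p(y_next, y_prev, x_prev)
#      * log2( p(y_next | y_prev, x_prev) / p(y_next | y_prev) ),
# with all probabilities estimated by counting unique joint events.
def transfer_entropy(x, y):
    triples = list(zip(y[1:], y[:-1], x[:-1]))
    n = len(triples)
    c_xyz = Counter(triples)
    c_yz = Counter((yn, yp) for yn, yp, _ in triples)
    c_zx = Counter((yp, xp) for _, yp, xp in triples)
    c_z = Counter(yp for _, yp, _ in triples)
    te = 0.0
    for (yn, yp, xp), cnt in c_xyz.items():
        te += (cnt / n) * log2((cnt * c_z[yp]) / (c_zx[(yp, xp)] * c_yz[(yn, yp)]))
    return te
```

TE is zero when X carries no information about Y's next state beyond Y's own past, and positive when X helps predict Y, the asymmetry that lets TENET orient regulatory edges. FastTENET's speedup comes from evaluating these pattern counts for all gene pairs as array operations on manycore hardware.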


OneSC: a computational platform for recapitulating cell state transitions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Computational modeling of cell state transitions has been of great interest to many in the fields of developmental biology, cancer biology, and cell fate engineering because it enables performing perturbation experiments <span style="font-style:italic;">in silico</span> more rapidly and cheaply than could be achieved in a lab. Recent advancements in single-cell RNA-sequencing (scRNA-seq) allow the capture of high-resolution snapshots of cell states as they transition along temporal trajectories. Using these high-throughput datasets, we can train computational models to generate <span style="font-style:italic;">in silico</span> “synthetic” cells that faithfully mimic the temporal trajectories.<div class="boxTitle">Results</div>Here we present OneSC, a platform that can simulate cell state transitions using systems of stochastic differential equations governed by a regulatory network of core transcription factors (TFs). Unlike many current network inference methods, OneSC prioritizes generating a Boolean network that produces faithful cell state transitions and terminal cell states that mimic real biological systems. Applying OneSC to real data, we inferred a core TF network using a mouse myeloid progenitor scRNA-seq dataset and showed that the dynamical simulations of that network generate synthetic single-cell expression profiles that faithfully recapitulate the four myeloid differentiation trajectories going into differentiated cell states (erythrocytes, megakaryocytes, granulocytes, and monocytes). 
Finally, through the <span style="font-style:italic;">in silico</span> perturbations of the mouse myeloid progenitor core network, we showed that OneSC can accurately predict cell fate decision biases of TF perturbations that closely match previous experimental observations.<div class="boxTitle">Availability and implementation</div>OneSC is implemented as a Python package on GitHub (<a href="https://github.com/CahanLab/oneSC">https://github.com/CahanLab/oneSC</a>) and on Zenodo (<a href="https://zenodo.org/records/14052421">https://zenodo.org/records/14052421</a>).</span>
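The idea of simulating cell state transitions with stochastic differential equations governed by a small TF network can be sketched with a toy two-TF mutual-inhibition motif integrated by Euler–Maruyama. This is a generic illustration, not OneSC's API; the Hill-function form, parameter values, and function name are all assumptions:

```python
import math
import random

def simulate_toggle(steps=2000, dt=0.01, noise=0.1, seed=1):
    """Euler-Maruyama simulation of a two-TF mutual-inhibition motif:
    dx = (hill(y) - x) dt + noise dW,  dy = (hill(x) - y) dt + noise dW."""
    random.seed(seed)
    def hill(repressor, K=0.5, n=4):
        # Repressive Hill function: high repressor level -> low output.
        return 1.0 / (1.0 + (repressor / K) ** n)
    x, y = 0.9, 0.1  # start near the x-high attractor
    for _ in range(steps):
        dwx = random.gauss(0.0, math.sqrt(dt))
        dwy = random.gauss(0.0, math.sqrt(dt))
        x += (hill(y) - x) * dt + noise * dwx
        y += (hill(x) - y) * dt + noise * dwy
        x, y = max(x, 0.0), max(y, 0.0)  # concentrations stay non-negative
    return x, y
```

With modest noise the trajectory stays in the attractor it started in, which is the qualitative behavior (stable terminal cell states with stochastic fluctuations) that OneSC's simulations aim to reproduce from an inferred Boolean network.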


Improved prediction of post-translational modification crosstalk within proteins using DeepPCT
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Post-translational modification (PTM) crosstalk events play critical roles in biological processes. Several machine learning methods have been developed to identify PTM crosstalk within proteins, but the accuracy is still far from satisfactory. Recent breakthroughs in deep learning and protein structure prediction could provide a potential solution to this issue.<div class="boxTitle">Results</div>We proposed DeepPCT, a deep learning algorithm to identify PTM crosstalk using AlphaFold2-based structures. In this algorithm, one deep learning classifier was constructed for sequence-based prediction by combining the residue and residue pair embeddings with cross-attention techniques, while the other classifier was established for structure-based prediction by integrating the structural embedding and a graph neural network. Meanwhile, a machine learning classifier was developed using novel structural descriptors and a random forest model to complement the structural deep learning classifier. By integrating the three classifiers, DeepPCT outperformed existing algorithms in different evaluation scenarios and showed better generalizability on new data owing to its reduced dependency on distance features.<div class="boxTitle">Availability and implementation</div>Datasets, codes, and models of DeepPCT are freely accessible at <a href="https://github.com/hzau-liulab/DeepPCT/">https://github.com/hzau-liulab/DeepPCT/</a>.</span>


Accurate and transferable drug–target interaction prediction with DrugLAMP
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Accurate prediction of drug–target interactions (DTIs), especially for novel targets or drugs, is crucial for accelerating drug discovery. Recent advances in pretrained language models (PLMs) and multi-modal learning present new opportunities to enhance DTI prediction by leveraging vast unlabeled molecular data and integrating complementary information from multiple modalities.<div class="boxTitle">Results</div>We introduce DrugLAMP (PLM-assisted multi-modal prediction), a PLM-based multi-modal framework for accurate and transferable DTI prediction. DrugLAMP integrates molecular graph and protein sequence features extracted by PLMs and traditional feature extractors. We introduce two novel multi-modal fusion modules: (i) pocket-guided co-attention (PGCA), which uses protein pocket information to guide the attention mechanism on drug features, and (ii) paired multi-modal attention (PMMA), which enables effective cross-modal interactions between drug and protein features. These modules work together to enhance the model’s ability to capture complex drug–protein interactions. Moreover, the contrastive compound-protein pre-training (2C2P) module enhances the model’s generalization to real-world scenarios by aligning features across modalities and conditions. Comprehensive experiments demonstrate DrugLAMP’s state-of-the-art performance on both standard benchmarks and challenging settings simulating real-world drug discovery, where test drugs/targets are unseen during training. Visualizations of attention maps and application to predict cryptic pockets and drug side effects further showcase DrugLAMP’s strong interpretability and generalizability. 
Ablation studies confirm the contributions of the proposed modules.<div class="boxTitle">Availability and implementation</div>Source code and datasets are freely available at <a href="https://github.com/Lzcstan/DrugLAMP">https://github.com/Lzcstan/DrugLAMP</a>. All data originate from public sources.</span>


Sparse Neighbor Joining: rapid phylogenetic inference using a sparse distance matrix
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Phylogenetic reconstruction is a fundamental problem in computational biology. The Neighbor Joining (NJ) algorithm offers an efficient distance-based solution to this problem, which often serves as the foundation for more advanced statistical methods. Despite prior efforts to enhance the speed of NJ, the computation of the <span style="font-style:italic;">n</span><sup>2</sup> entries of the distance matrix, where <span style="font-style:italic;">n</span> is the number of phylogenetic tree leaves, continues to pose a limitation in scaling NJ to larger datasets.<div class="boxTitle">Results</div>In this work, we propose a new algorithm which does not require computing a dense distance matrix. Instead, it dynamically determines a sparse set of at most O(n log n) distance matrix entries to be computed in its basic version, and up to O(n log<sup>2</sup> n) entries in an enhanced version. We show by experiments that this approach reduces the execution time of NJ for large datasets, with a trade-off in accuracy.<div class="boxTitle">Availability and implementation</div>Sparse Neighbor Joining is implemented in Python and freely available at <a href="https://github.com/kurtsemih/SNJ">https://github.com/kurtsemih/SNJ</a>.</span>
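For context, the selection step that classical NJ performs on a dense matrix, and that Sparse Neighbor Joining accelerates by computing only a sparse subset of entries, can be sketched as follows (a textbook illustration, not the authors' code):

```python
def nj_best_pair(d):
    """One Neighbor Joining selection step on a dense distance matrix d
    (list of lists): return the pair (i, j) minimizing the Q-criterion
    Q(i, j) = (n - 2) * d[i][j] - r[i] - r[j], where r[i] = sum_k d[i][k].
    Classical NJ needs all n^2 entries of d just for this step; Sparse
    Neighbor Joining avoids materializing the full matrix."""
    n = len(d)
    r = [sum(row) for row in d]  # row sums over all leaves
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best
```

On an additive distance matrix the minimal-Q pair is a true cherry of the underlying tree, which is why iterating this join step reconstructs the topology.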


Gene count estimation with pytximport enables reproducible analysis of bulk RNA sequencing data in Python
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Transcript quantification tools efficiently map bulk RNA sequencing (RNA-seq) reads to reference transcriptomes. However, their output consists of transcript count estimates that are subject to multiple biases and cannot be readily used with existing differential gene expression analysis tools in Python. Here we present pytximport, a Python implementation of the tximport R package that supports a variety of input formats, different modes of bias correction, inferential replicates, gene-level summarization of transcript counts, transcript-level exports, transcript-to-gene mapping generation, and optional filtering of transcripts by biotype. pytximport is part of the scverse ecosystem of open-source Python software packages for omics analyses and includes both a Python and a command-line interface. With pytximport, we propose a bulk RNA-seq analysis workflow based on Bioconda and scverse ecosystem packages, ensuring reproducible analyses through Snakemake rules. We apply this pipeline to a publicly available RNA-seq dataset, demonstrating how pytximport enables the creation of Python-centric workflows capable of providing insights into transcriptomic alterations.<div class="boxTitle">Availability and implementation</div>pytximport is licensed under the GNU General Public License version 3. The source code is available at <a href="https://github.com/complextissue/pytximport">https://github.com/complextissue/pytximport</a> and via Zenodo with DOI: 10.5281/zenodo.13907917. A related Snakemake workflow is available through GitHub at <a href="https://github.com/complextissue/snakemake-bulk-rna-seq-workflow">https://github.com/complextissue/snakemake-bulk-rna-seq-workflow</a> and Zenodo with DOI: 10.5281/zenodo.12713811. Documentation and a vignette for new users are available at: <a href="https://pytximport.readthedocs.io">https://pytximport.readthedocs.io</a>.</span>


Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Extensive research has uncovered the critical role of the human gut microbiome in various aspects of health, including metabolism, nutrition, physiology, and immune function. Fecal microbiota is often used as a proxy for understanding the gut microbiome, but it represents an aggregate view, overlooking spatial variations across different gastrointestinal (GI) locations. Emerging studies with spatial microbiome data collected from specific GI regions offer a unique opportunity to better understand the spatial composition of the stool microbiome.<div class="boxTitle">Results</div>We introduce Micro-DeMix, a mixture beta-multinomial model that deconvolutes the fecal microbiome at the compositional level by integrating stool samples with spatial microbiome data. Micro-DeMix facilitates the comparison of microbial compositions across different GI regions within the stool microbiome through a hypothesis-testing framework. We demonstrate the effectiveness and efficiency of Micro-DeMix using multiple simulated datasets and the inflammatory bowel disease data from the NIH Integrative Human Microbiome Project.<div class="boxTitle">Availability and implementation</div>The R package is available at <a href="https://github.com/liuruoqian/MicroDemix">https://github.com/liuruoqian/MicroDemix</a>.</span>


PhosX: data-driven kinase activity inference from phosphoproteomics experiments
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>The inference of kinase activity from phosphoproteomics data can point to causal mechanisms driving signalling processes and potential drug targets. Identifying the kinases whose change in activity explains the observed phosphorylation profiles, however, remains challenging, and constrained by the manually curated knowledge of kinase–substrate associations. Recently, experimentally determined substrate sequence specificities of human kinases have become available, but robust methods to exploit this new data for kinase activity inference are still missing. We present PhosX, a method to estimate differential kinase activity from phosphoproteomics data that combines state-of-the-art statistics in enrichment analysis with kinases’ substrate sequence specificity information. Using a large phosphoproteomics dataset with known differentially regulated kinases we show that our method identifies upregulated and downregulated kinases by only relying on the input phosphopeptides’ sequences and intensity changes. We find that PhosX outperforms the currently available approach for the same task, and performs better or similarly to state-of-the-art methods that rely on previously known kinase–substrate associations. We therefore recommend its use for data-driven kinase activity inference.<div class="boxTitle">Availability and implementation</div>PhosX is implemented in Python, open-source under the Apache-2.0 licence, and distributed on the Python Package Index. The code is available on GitHub (<a href="https://github.com/alussana/phosx">https://github.com/alussana/phosx</a>).</span>
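The enrichment-analysis idea behind this kind of kinase activity inference can be sketched with a GSEA-style running-sum score over phosphopeptides ranked by intensity change. This is a generic illustration of the statistic family, not PhosX's exact method; the hit flags (motif matches) and the score weighting are assumptions:

```python
def enrichment_score(ranked_scores, hit_flags):
    """GSEA-style running-sum enrichment score over a ranked list.
    ranked_scores: non-negative weights (e.g. |log fold changes|) sorted
    by decreasing change; hit_flags[i] is True if phosphopeptide i
    matches the kinase's substrate sequence specificity."""
    n_miss = len(hit_flags) - sum(hit_flags)
    hit_norm = sum(s for s, h in zip(ranked_scores, hit_flags) if h)
    running, best = 0.0, 0.0
    for s, h in zip(ranked_scores, hit_flags):
        if h:
            running += s / hit_norm   # step up, weighted by the score
        else:
            running -= 1.0 / n_miss   # step down uniformly
        if abs(running) > abs(best):
            best = running            # keep the extreme deviation
    return best
```

Motif-matching peptides concentrated at the top of the ranking give a score near +1 (kinase upregulated); concentrated at the bottom, near −1 (downregulated), which is how signed differential activity can be read off a single phosphoproteomics profile.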


DrugRepPT: a deep pretraining and fine-tuning framework for drug repositioning based on drug’s expression perturbation and treatment effectiveness
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Drug repositioning (DR), identifying novel indications for approved drugs, is a cost-effective strategy in drug discovery. Despite numerous proposed DR models, integrating network-based features, differential gene expression, and chemical structures for high-performance DR remains challenging.<div class="boxTitle">Results</div>We propose a comprehensive deep pretraining and fine-tuning framework for DR, termed DrugRepPT. Initially, we design a graph pretraining module employing model-augmented contrastive learning on a vast drug–disease heterogeneous graph to capture nuanced interactions and expression perturbations after intervention. Subsequently, we introduce a fine-tuning module leveraging a graph residual-like convolution network to elucidate intricate interactions between diseases and drugs. Moreover, a Bayesian multiloss approach is introduced to balance the existence and the effectiveness of drug treatment. Extensive experiments showcase the efficacy of our framework, with DrugRepPT exhibiting remarkable performance improvements compared to state-of-the-art (SOTA) baseline methods (improvements of 106.13% in Hit@1 and 54.45% in mean reciprocal rank). The reliability of predicted results is further validated through two case studies, i.e. gastritis and fatty liver, via literature validation, network medicine analysis, and docking screening.<div class="boxTitle">Availability and implementation</div>The code and results are available at <a href="https://github.com/2020MEAI/DrugRepPT">https://github.com/2020MEAI/DrugRepPT</a>.</span>


Mutual information for detecting multi-class biomarkers when integrating multiple bulk or single-cell transcriptomic studies
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Biomarker detection plays a pivotal role in biomedical research. Integrating omics studies from multiple cohorts can enhance statistical power, accuracy, and robustness of the detection results. However, existing methods for horizontally combining omics studies are mostly designed for two-class scenarios (e.g. cases versus controls) and are not directly applicable for studies with multi-class design (e.g. samples from multiple disease subtypes, treatments, tissues, or cell types).<div class="boxTitle">Results</div>We propose a statistical framework, namely Mutual Information Concordance Analysis (MICA), to detect biomarkers with concordant multi-class expression pattern across multiple omics studies from an information theoretic perspective. Our approach first detects biomarkers with concordant multi-class patterns across partial or all of the omics studies using a global test by mutual information. A <span style="font-style:italic;">post hoc</span> analysis is then performed for each detected biomarker to identify studies with concordant patterns. Extensive simulations demonstrate improved accuracy and successful false discovery rate control of MICA compared to an existing multi-class correlation method. The method is then applied to two practical scenarios: four tissues of mouse metabolism-related transcriptomic studies, and three sources of estrogen treatment expression profiles. Detected biomarkers by MICA show intriguing biological insights and functional annotations. 
Additionally, we applied MICA to single-cell RNA-Seq data to identify tumor progression biomarkers, highlighting critical roles of ribosomal function in the tumor microenvironment of triple-negative breast cancer and underscoring the potential of MICA for detecting novel therapeutic targets.<div class="boxTitle">Availability and implementation</div>The source code is available on Figshare at <a href="https://doi.org/10.6084/m9.figshare.27635436">https://doi.org/10.6084/m9.figshare.27635436</a>. Additionally, the R package can be installed directly from GitHub at <a href="https://github.com/jianzou75/MICA">https://github.com/jianzou75/MICA</a>.</span>
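The building block of such an information-theoretic test, a plug-in mutual information estimate between two discretized multi-class expression patterns, can be sketched as follows (a generic illustration, not MICA's global test):

```python
from collections import Counter
from math import log

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete label
    vectors, e.g. a gene's expression level discretized per sample and
    the multi-class condition labels of those samples."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    # MI = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) * p(b)) )
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

A gene whose pattern tracks the class labels in every study gets a high MI in each, and combining these per-study statistics is the concordance signal a framework like MICA tests.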


Damsel: analysis and visualisation of DamID sequencing in R
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>DamID sequencing is a technique to map the genome-wide interaction of a protein with DNA. Damsel is the first Bioconductor package to provide an end-to-end analysis for DamID sequencing data within R. Damsel performs quantification and testing of significant binding sites along with exploratory and visual analysis. Damsel produces results consistent with previous analysis approaches.<div class="boxTitle">Availability and implementation</div>The R package Damsel is available through the Bioconductor project at <a href="https://bioconductor.org/packages/release/bioc/html/Damsel.html">https://bioconductor.org/packages/release/bioc/html/Damsel.html</a> and the code is available on GitHub <a href="https://github.com/Oshlack/Damsel/">https://github.com/Oshlack/Damsel/</a>.</span>


Sensitivities in protein allocation models reveal distribution of metabolic capacity and flux control
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Expanding on constraint-based metabolic models, protein allocation models (PAMs) enhance flux predictions by accounting for protein resource allocation in cellular metabolism. Yet, to date, there are no dedicated methods for analyzing and understanding the growth-limiting factors in simulated phenotypes in PAMs.<div class="boxTitle">Results</div>Here, we introduce a systematic framework for identifying the most sensitive enzyme concentrations (sEnz) in PAMs. The framework exploits the primal and dual formulations of these models to derive sensitivity coefficients based on relations between variables, constraints, and the objective function. This approach enhances our understanding of the growth-limiting factors of metabolic phenotypes under specific environmental or genetic conditions. Compared to other traditional methods for calculating sensitivities, sEnz requires substantially less computation time and facilitates more intuitive comparison and analysis of sensitivities. The sensitivities calculated by sEnz cover enzymes, reactions, and protein sectors, enabling a holistic overview of the factors influencing metabolism. When applied to an <span style="font-style:italic;">Escherichia coli</span> PAM, sEnz revealed major pathways and enzymes driving overflow metabolism. Overall, sEnz offers a computationally efficient framework for understanding PAM predictions and unraveling the factors governing a particular metabolic phenotype.<div class="boxTitle">Availability and implementation</div>sEnz is implemented in the modular toolbox for the generation and analysis of PAMs in Python (PAModelpy; v.0.0.3.3), available on PyPI (<a href="https://pypi.org/project/PAModelpy/">https://pypi.org/project/PAModelpy/</a>). 
The source code, together with all other Python scripts and notebooks, is available on GitHub (<a href="https://github.com/iAMB-RWTH-Aachen/PAModelpy">https://github.com/iAMB-RWTH-Aachen/PAModelpy</a>).</span>


STRPsearch: fast detection of structured tandem repeat proteins
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Structured Tandem Repeat Proteins (STRPs) constitute a subclass of tandem repeats characterized by repetitive structural motifs. These proteins exhibit distinct secondary structures that form repetitive tertiary arrangements, often resulting in large molecular assemblies. Despite highly variable sequences, STRPs can perform important and diverse biological functions, maintaining a consistent structure with a variable number of repeat units. With the advent of protein structure prediction methods, millions of 3D models of proteins are now publicly available. However, automatic detection of STRPs remains challenging with current state-of-the-art tools due to their lack of accuracy and long execution times, hindering their application on large datasets. In most cases, manual curation remains the most accurate method for detecting and classifying STRPs, making it impracticable to annotate millions of structures.<div class="boxTitle">Results</div>We introduce STRPsearch, a novel tool for the rapid identification, classification, and mapping of STRPs. Leveraging manually curated entries from RepeatsDB as the known conformational space of STRPs, STRPsearch uses the latest advances in structural alignment for fast and accurate detection of repeated structural motifs in proteins, followed by an innovative approach to map units and insertions through the generation of TM-score profiles. STRPsearch is highly scalable, efficiently processing large datasets, and can be applied to both experimental structures and predicted models. In addition, it demonstrates superior performance compared to existing tools, offering researchers a reliable and comprehensive solution for STRP analysis across diverse proteomes.<div class="boxTitle">Availability and implementation</div>STRPsearch is coded in Python. 
All scripts and associated documentation are available from: <a href="https://github.com/BioComputingUP/STRPsearch">https://github.com/BioComputingUP/STRPsearch</a>.</span>


DeepDR: a deep learning library for drug response prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Accurate drug response prediction is critical to advancing precision medicine and drug discovery. Recent advances in deep learning (DL) have shown promise in predicting drug response; however, the lack of convenient tools to support such modeling limits their widespread application. To address this, we introduce DeepDR, the first DL library specifically developed for drug response prediction. DeepDR simplifies the process by automating drug and cell featurization, model construction, training, and inference, all achievable with minimal programming. The library incorporates three types of drug features along with nine drug encoders, four types of cell features along with nine cell encoders, and two fusion modules, enabling the implementation of up to 135 DL models for drug response prediction. We also benchmarked performance with DeepDR, and the optimal models are available through a user-friendly visual interface.<div class="boxTitle">Availability and implementation</div>DeepDR can be installed from PyPI (<a href="https://pypi.org/project/deepdr">https://pypi.org/project/deepdr</a>). The source code and experimental data are available on GitHub (<a href="https://github.com/user15632/DeepDR">https://github.com/user15632/DeepDR</a>).</span>


Tiberius: end-to-end deep learning with an HMM for gene prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>For more than 25 years, learning-based eukaryotic gene predictors were driven by hidden Markov models (HMMs), which took a DNA sequence directly as input. Recently, Holst <span style="font-style:italic;">et al.</span> demonstrated with their program Helixer that the accuracy of <span style="font-style:italic;">ab initio</span> eukaryotic gene prediction can be improved by combining deep learning layers with a separate HMM postprocessor.<div class="boxTitle">Results</div>We present Tiberius, a novel deep learning-based <span style="font-style:italic;">ab initio</span> gene predictor that integrates convolutional and long short-term memory layers with a differentiable HMM layer in an end-to-end fashion. Tiberius uses a custom gene prediction loss and was trained for prediction in mammalian genomes and evaluated on the human and two other genomes. It significantly outperforms existing <span style="font-style:italic;">ab initio</span> methods, achieving F1 scores of 62% at gene level for the human genome, compared to 21% for the next best <span style="font-style:italic;">ab initio</span> method. In <span style="font-style:italic;">de novo</span> mode, Tiberius predicts the exon−intron structure of two out of three human genes without error. Remarkably, even Tiberius’s <span style="font-style:italic;">ab initio</span> accuracy matches that of BRAKER3, which uses RNA-seq data and a protein database. Tiberius’s highly parallelized model is the fastest state-of-the-art gene prediction method, processing the human genome in under 2 hours.<div class="boxTitle">Availability and implementation</div><a href="https://github.com/Gaius-Augustus/Tiberius">https://github.com/Gaius-Augustus/Tiberius</a></span>
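The recursion that a differentiable HMM layer builds on is the standard forward algorithm in log space; because logsumexp is smooth, gradients can flow through it to the upstream network layers. This is a textbook sketch, not Tiberius's implementation:

```python
import math

def log_forward(log_init, log_trans, log_emit, obs):
    """Log-space forward algorithm for a discrete HMM: returns log P(obs).
    log_init[s]: log initial probability of state s;
    log_trans[t][s]: log transition probability t -> s;
    log_emit[s][o]: log probability that state s emits symbol o."""
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))
    k = len(log_init)
    # alpha[s] = log P(obs[0..t], state_t = s)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(k)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[t] + log_trans[t][s] for t in range(k)])
                 + log_emit[s][o] for s in range(k)]
    return logsumexp(alpha)
```

In a gene predictor the "emissions" would be per-base scores produced by the neural layers and the states would encode gene structure (exon, intron, intergenic), so that training the whole stack against a gene prediction loss tunes the network and the HMM jointly.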


Dynamic modelling of signalling pathways when ordinary differential equations are not feasible
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Mathematical modelling plays a crucial role in understanding inter- and intracellular signalling processes. Currently, ordinary differential equations (ODEs) are the predominant approach in systems biology for modelling such pathways. While ODE models offer mechanistic interpretability, they also suffer from limitations, including the need to consider all relevant compounds, resulting in large models that are difficult to handle numerically and require extensive data.<div class="boxTitle">Results</div>In previous work, we introduced the <span style="font-style:italic;">retarded transient function (RTF)</span> as an alternative method for modelling temporal responses of signalling pathways. Here, we extend the RTF approach to integrate concentration or dose-dependencies into the modelling of dynamics. With this advancement, RTF modelling now fully encompasses the application range of ODE models, which comprises predictions in both time and concentration domains. Moreover, characterizing dose-dependencies provides an intuitive way to investigate and characterize signalling differences between biological conditions or cell types based on their response to stimulating inputs. To demonstrate the applicability of our extended approach, we employ data from time- and dose-dependent inflammasome activation in bone marrow-derived macrophages treated with nigericin sodium salt. Our results show the effectiveness of the extended RTF approach as a generic framework for modelling dose-dependent kinetics in cellular signalling. 
The approach results in intuitively interpretable parameters that describe signal dynamics and enables predictive modelling of time- and dose-dependencies even if only individual cellular components are quantified.<div class="boxTitle">Availability and implementation</div>The presented approach is available within the MATLAB-based <span style="font-style:italic;">Data2Dynamics</span> modelling toolbox at <a href="https://github.com/Data2Dynamics">https://github.com/Data2Dynamics</a> and <a href="https://zenodo.org/records/14008247">https://zenodo.org/records/14008247</a> and as R code at <a href="https://github.com/kreutz-lab/RTF">https://github.com/kreutz-lab/RTF</a>.</span>


Facilitating phenotyping from clinical texts: the medkit library
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>Phenotyping consists of applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition, typically out of a collection of Electronic Health Records (EHRs). Because much of the clinical information in EHRs lies in free text, phenotyping from text plays an important role in studies that rely on the secondary use of EHRs. However, the heterogeneity and highly specialized nature of both the content and form of clinical texts make this task particularly tedious and a source of time and cost constraints in observational studies.<div class="boxTitle"> </div>To facilitate the development, evaluation, and reproducibility of phenotyping pipelines, we developed an open-source Python library named medkit. It enables composing data processing pipelines made of easy-to-reuse software bricks, named medkit operations. In addition to the core of the library, we share the operations and pipelines we already developed and invite the phenotyping community to reuse and enrich them.<div class="boxTitle">Availability and implementation</div>medkit is available at <a href="https://github.com/medkit-lib/medkit">https://github.com/medkit-lib/medkit</a>.</span>


LmRaC: a functionally extensible tool for LLM interrogation of user experimental results
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Large Language Models (LLMs) have provided spectacular results across a wide variety of domains. However, persistent concerns about hallucination and fabrication of authoritative sources raise serious issues for their integral use in scientific research. Retrieval-augmented generation (RAG) is a technique for making data and documents, otherwise unavailable during training, available to the LLM for reasoning tasks. In addition to making dynamic and quantitative data available to the LLM, RAG provides the means by which to carefully control and trace source material, thereby ensuring results are accurate, complete, and authoritative.<div class="boxTitle">Results</div>Here, we introduce LmRaC, an LLM-based tool capable of answering complex scientific questions in the context of a user’s own experimental results. LmRaC allows users to dynamically build domain-specific knowledge bases from PubMed sources (<span style="font-style:italic;">RAGdom</span>). Answers are drawn solely from this RAG with citations to the paragraph level, virtually eliminating any chance of hallucination or fabrication. These answers can then be used to construct an experimental context (<span style="font-style:italic;">RAGexp</span>) that, along with user supplied documents (e.g. design, protocols) and quantitative results, can be used to answer questions about the user’s specific experiment. Questions about quantitative experimental data are integral to LmRaC and are supported by a user-defined and functionally extensible REST API server (<span style="font-style:italic;">RAGfun</span>).<div class="boxTitle">Availability and implementation</div>Detailed documentation for LmRaC along with a sample REST API server for defining user functions can be found at <a href="https://github.com/dbcraig/LmRaC">https://github.com/dbcraig/LmRaC</a>. 
The LmRaC web application image can be pulled from Docker Hub (<a href="https://hub.docker.com">https://hub.docker.com</a>) as dbcraig/lmrac.</span>


AltGosling: automatic generation of text descriptions for accessible genomics data visualization
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Biomedical visualizations are key to accessing biomedical knowledge and detecting new patterns in large datasets. Interactive visualizations are essential for biomedical data scientists and are omnipresent in data analysis software and data portals. Without appropriate descriptions, these visualizations are not accessible to all people with blindness and low vision, who often rely on screen reader accessibility technologies to access visual information on digital devices. Screen readers require descriptions to convey image content. However, many images lack informative descriptions due to a lack of awareness and the difficulty of writing such descriptions. Describing complex and interactive visualizations, like genomics data visualizations, is even more challenging. Automatic generation of descriptions could be beneficial, yet current alt-text generation models are limited to basic visualizations and cannot be used for genomics.<div class="boxTitle">Results</div>We present AltGosling, an automated description generation tool focused on interactive data visualizations of genome-mapped data, created with the grammar-based genomics toolkit Gosling. The logic-based algorithm of AltGosling creates various descriptions including a tree-structured navigable panel. We co-designed AltGosling with a blind screen reader user (co-author). We show that AltGosling outperforms state-of-the-art large language models and common image-based neural networks for alt text generation of genomics data visualizations. As a first of its kind in genomic research, we lay the groundwork to increase accessibility in the field.<div class="boxTitle">Availability and implementation</div>The source code, examples, and interactive demo are accessible under the MIT License at <a href="https://github.com/gosling-lang/altgosling">https://github.com/gosling-lang/altgosling</a>. 
The package is available at <a href="https://www.npmjs.com/package/altgosling">https://www.npmjs.com/package/altgosling</a>.</span>
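The idea of a "tree-structured navigable panel" can be illustrated with a toy sketch: a chart specification is turned into a nested tree of plain-text description nodes that a screen reader user can walk level by level. All function and field names below are hypothetical; AltGosling itself is a TypeScript package with its own Gosling-specific logic.

```python
# Toy sketch of a tree-structured, navigable description.
# Names and the spec shape are illustrative, not AltGosling's API.

def describe_track(track):
    """Return a one-line description for a single visualization track."""
    return f"{track['mark']} showing {track['y']} along {track['x']}"

def build_description_tree(spec):
    """Turn a simple chart spec into a nested description tree."""
    return {
        "label": f"Visualization: {spec['title']}",
        "children": [
            {"label": describe_track(t), "children": []}
            for t in spec["tracks"]
        ],
    }

def render(node, depth=0):
    """Flatten the tree into indented text a screen reader could traverse."""
    lines = ["  " * depth + node["label"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines

spec = {
    "title": "Gene expression over chromosome 1",
    "tracks": [{"mark": "bar", "x": "position", "y": "expression"}],
}
tree = build_description_tree(spec)
```

The nesting mirrors the visualization's structure, so deeper tree levels correspond to finer-grained description, which is what makes the panel navigable rather than a single flat alt text.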


FAPM: functional annotation of proteins using multimodal models beyond structural modeling
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels.<div class="boxTitle">Results</div>We introduce functional annotation of proteins using multimodal models (FAPM), a contrastive multimodal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM’s flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation.<div class="boxTitle">Availability and implementation</div>The online demo is at: <a href="https://huggingface.co/spaces/wenkai/FAPM_demo">https://huggingface.co/spaces/wenkai/FAPM_demo</a>.</span>


Predicting the subcellular location of prokaryotic proteins with DeepLocPro
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>Protein subcellular location prediction is a widely explored task in bioinformatics because of its importance in proteomics research. We propose DeepLocPro, an extension to the popular method DeepLoc, tailored specifically to archaeal and bacterial organisms.<div class="boxTitle">Results</div>DeepLocPro is a multiclass subcellular location prediction tool for prokaryotic proteins, trained on experimentally verified data curated from UniProt and PSORTdb. DeepLocPro compares favorably to the PSORTb 3.0 ensemble method, surpassing its performance across multiple metrics in our benchmark experiment.<div class="boxTitle">Availability and implementation</div>The DeepLocPro prediction tool is available online at <a href="https://ku.biolib.com/deeplocpro">https://ku.biolib.com/deeplocpro</a> and <a href="https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/">https://services.healthtech.dtu.dk/services/DeepLocPro-1.0/</a>.</span>


DeepRSMA: a cross-fusion-based deep learning method for RNA–small molecule binding affinity prediction
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>RNA is implicated in numerous aberrant cellular functions and disease progressions, highlighting the crucial importance of RNA-targeted drugs. To accelerate the discovery of such drugs, it is essential to develop an effective computational method for predicting RNA–small molecule affinity (RSMA). Recently, deep learning-based computational methods have been promising due to their powerful nonlinear modeling ability. However, the leveraging of advanced deep learning methods to mine the diverse information of RNAs, small molecules, and their interaction still remains a great challenge.<div class="boxTitle">Results</div>In this study, we present DeepRSMA, an innovative cross-attention-based deep learning method for RSMA prediction. To effectively capture fine-grained features from RNA and small molecules, we developed nucleotide-level and atomic-level feature extraction modules for RNA and small molecules, respectively. Additionally, we incorporated both sequence and graph views into these modules to capture features from multiple perspectives. Moreover, a transformer-based cross-fusion module is introduced to learn the general patterns of interactions between RNAs and small molecules. To achieve effective RSMA prediction, we integrated the RNA and small molecule representations from the feature extraction and cross-fusion modules. Our results show that DeepRSMA outperforms baseline methods in multiple test settings. The interpretability analysis and the case study on spinal muscular atrophy demonstrate that DeepRSMA has the potential to guide RNA-targeted drug design.<div class="boxTitle">Availability and implementation</div>The codes and data are publicly available at <a href="https://github.com/Hhhzj-7/DeepRSMA">https://github.com/Hhhzj-7/DeepRSMA</a>.</span>


FEHAT: efficient, large scale and automated heartbeat detection in Medaka fish embryos
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Summary</div>High-resolution imaging of model organisms allows the quantification of important physiological measurements. In the case of fish with transparent embryos, these videos can visualize key physiological processes, such as heartbeat. High throughput systems can provide enough measurements for the robust investigation of developmental processes as well as the impact of system perturbations on physiological state. However, few analytical schemes have been designed to handle thousands of high-resolution videos without the need for some level of human intervention. We developed a software package, named FEHAT, to provide a fully automated solution for the analytics of large numbers of heart rate imaging datasets obtained from developing Medaka fish embryos in 96-well plate format imaged on an Acquifer machine. FEHAT uses image segmentation to define regions of the embryo showing changes in pixel intensity over time, followed by the classification of the most likely position of the heart and Fourier Transformations to estimate the heart rate. Here, we describe some important features of the FEHAT software, showcasing its performance across a large set of medaka fish embryos and compare its performance to established, less automated solutions. FEHAT provides reliable heart rate estimates across a range of temperature-based perturbations and can be applied to tens of thousands of embryos without the need for any human intervention.<div class="boxTitle">Availability and implementation</div>Data used in this manuscript will be made available on request.</span>


Ranking antibody binding epitopes and proteins across samples from whole proteome tiled linear peptides
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Introduction</div>Ultradense peptide binding arrays that can probe millions of linear peptides comprising the entire proteomes of human or mouse, or hundreds of thousands of microbes, are powerful tools for studying the antibody repertoire in serum samples to understand adaptive immune responses.<div class="boxTitle">Motivation</div>There are few tools for exploring high-dimensional, significant and reproducible antibody targets for ultradense peptide binding arrays at the linear peptide, epitope (grouping of adjacent peptides), and protein level across multiple samples/subjects (i.e. epitope spread or immunogenic regions of proteins) for understanding the heterogeneity of immune responses.<div class="boxTitle">Results</div>We developed <strong>H</strong>ierarchical antibody binding <strong>E</strong>pitopes and p<strong>RO</strong>teins from li<strong>N</strong>ear peptides (HERON), an R package, which can identify immunogenic epitopes, using meta-analyses and spatial clustering techniques to explore antibody targets at various resolution and confidence levels, that can be found consistently across a specified number of samples through the entire proteome to study antibody responses for diagnostics or treatment. Our approach estimates significance values at the linear peptide (probe), epitope, and protein level to identify top candidates for validation. 
We tested the performance of predictions at all three levels using correlation between technical replicates and comparison of epitope calls on two datasets. The results showed HERON’s competitiveness in estimating false discovery rates and in finding general and sample-level regions of interest for antibody binding.<div class="boxTitle">Availability and implementation</div>The HERON R package is available at Bioconductor <a href="https://bioconductor.org/packages/release/bioc/html/HERON.html">https://bioconductor.org/packages/release/bioc/html/HERON.html</a>.</span>
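The epitope level sits between probes and proteins: adjacent significant peptides along the tiling are merged into candidate epitope intervals. The sketch below illustrates that spatial-grouping idea with a naive threshold-and-merge rule; HERON itself is an R package whose calls are based on meta-analysis p-values and proper spatial clustering, so the threshold and gap rule here are illustrative only.

```python
# Toy illustration of epitope calling by merging adjacent significant
# tiled peptides (probes). Not HERON's algorithm; alpha and max_gap
# are arbitrary demonstration values.

def call_epitopes(probe_positions, pvalues, alpha=0.05, max_gap=1):
    """Merge significant probes whose start positions lie within
    max_gap tiling steps into contiguous epitope intervals."""
    sig = sorted(pos for pos, p in zip(probe_positions, pvalues) if p < alpha)
    epitopes = []
    for pos in sig:
        if epitopes and pos - epitopes[-1][1] <= max_gap:
            epitopes[-1][1] = pos          # extend the current epitope
        else:
            epitopes.append([pos, pos])    # start a new epitope
    return [tuple(e) for e in epitopes]

# Probes tiled every residue; two separated clusters of significance.
positions = [1, 2, 3, 4, 10, 11, 12]
pvals     = [0.01, 0.02, 0.03, 0.6, 0.001, 0.02, 0.04]
epitopes  = call_epitopes(positions, pvals)   # [(1, 3), (10, 12)]
```

Grouping adjacent probes in this way is what lets evidence that is weak at any single peptide accumulate into a confident call over a contiguous immunogenic region.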


Afpdb: an efficient structure manipulation package for AI protein design
<span class="paragraphSection"><div class="boxTitle">Abstract</div><div class="boxTitle">Motivation</div>The advent of AlphaFold and other protein Artificial Intelligence (AI) models has transformed protein design, necessitating efficient handling of large-scale data and complex workflows. Using existing programming packages that predate recent AI advancements often leads to inefficiencies in human coding and slow code execution. To address this gap, we developed the Afpdb package.<div class="boxTitle">Results</div>Afpdb, built on AlphaFold’s NumPy architecture, offers a high-performance core. It uses RFDiffusion's contig syntax to streamline residue and atom selection, making coding simpler and more readable. Integrating PyMOL’s visualization capabilities, Afpdb allows automatic visual quality control. With over 180 methods commonly used in protein AI design, which are otherwise hard to find, Afpdb enhances productivity in structural biology by supporting the development of concise, high-performance code.<div class="boxTitle">Availability and implementation</div>Code and documentation are available on GitHub (<a href="https://github.com/data2code/afpdb">https://github.com/data2code/afpdb</a>) and PyPI (<a href="https://pypi.org/project/afpdb">https://pypi.org/project/afpdb</a>). An interactive tutorial is accessible through Google Colab.</span>