Manuscript drafts

This README is auto-updated by pdf_abstract_readme_and_push.sh.

PDFs

Total PDFs: 7

drake.pdf

Citation: Ishanu Chattopadhyay. Escape from Typicality: Why Evolution Operates Near One Mutation per Genome per Generation.

Abstract: This manuscript explores the phenomenon known as Drake’s rule, which states that across various organisms and viruses, the per-site mutation rate inversely correlates with genome length, resulting in a roughly constant number of mutations per genome per generation. Traditional explanations focus on biochemical fidelity constraints and population-genetic error-threshold arguments that limit mutation rates. However, this study offers a complementary perspective by analyzing the supply of statistically exceptional variants produced by mutation. By treating mutation as a blind local perturbation process, the research investigates the optimal mutation rates that maximize the expected discovery of atypical variants, which are crucial for selective amplification. The findings reveal that the optimal per-site mutation rate scales as µ⋆ = (1 + o(1))/n, leading to a per-genome mutation intensity that is O(1). This work provides a mechanism-independent understanding of Drake’s rule, emphasizing the balance between mutation supply and selective retention of exceptional outcomes, thereby enhancing our comprehension of evolutionary dynamics.
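
The scaling µ⋆ = (1 + o(1))/n can be illustrated numerically. The sketch below is not taken from the manuscript: it drops the o(1) correction, assumes that sites mutate independently, and uses hypothetical function names. Under those assumptions, the expected number of mutations per genome stays at 1 for any genome length n, and the probability of an unmutated genome approaches 1/e.

```python
import math


def optimal_per_site_rate(n: int) -> float:
    """Leading-order per-site rate mu* ~ 1/n (the o(1) correction is dropped)."""
    if n <= 0:
        raise ValueError("genome length must be positive")
    return 1.0 / n


def expected_mutations_per_genome(n: int, mu: float) -> float:
    """Mean of Binomial(n, mu): assumes independent mutation at each site."""
    return n * mu


def prob_no_mutation(n: int, mu: float) -> float:
    """(1 - mu)^n, computed via log1p so it stays accurate for tiny mu."""
    return math.exp(n * math.log1p(-mu))


# Illustrative genome lengths only; these are not values from the manuscript.
for n in (10**4, 10**6, 10**9):
    mu = optimal_per_site_rate(n)
    print(n, expected_mutations_per_genome(n, mu), prob_no_mutation(n, mu))
```

The per-site rate falls by five orders of magnitude across these lengths while the per-genome intensity stays O(1), which is the pattern Drake's rule describes.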

lambdaOR.pdf

Citation: Dmytro Onishchenko, Ishanu Chattopadhyay. Correcting Label-noise Corruption with lambda-Odds Ratio.

Abstract: In large-scale observational studies, misclassification of outcomes can distort attribution, leading to inflated associations and misleading risk factors. This manuscript introduces the lambda-Odds Ratio (λ-OR), a novel estimator designed to correct for such misclassification. The λ-OR employs two ROC thresholds to identify high-purity tails and applies a minimal-feasible ridge inversion to adjust the observed 2 × 2 contingency table. Simulations show that, while naive log-odds ratios suffer from bias and coverage collapse, the λ-OR remains unbiased and achieves significantly higher coverage with only modest increases in root mean square error (RMSE). Applications of λ-OR to electronic health records (EHR) in conditions such as Alzheimer’s disease, idiopathic pulmonary fibrosis, and autism spectrum disorder reveal that it effectively mitigates the inflated discoveries associated with naive odds ratios and SHAP attributions, while highlighting biologically relevant modulators. The findings suggest that λ-OR provides a robust framework for accurate attribution in biobank and EHR-scale studies, addressing the challenges posed by label noise.
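
To make the problem concrete, the sketch below shows how outcome label noise can inflate a naive odds ratio computed from a 2 × 2 table. It is not the λ-OR: the abstract does not spell out the ROC-threshold and ridge-inversion steps, so the correction shown is the textbook back-calculation that assumes the sensitivity and specificity of the outcome label are known for each exposure group. All counts, error rates, and function names are hypothetical.

```python
import math


def log_odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """a: exposed cases, b: exposed controls, c: unexposed cases, d: unexposed controls."""
    return math.log((a * d) / (b * c))


def corrupt_row(cases: float, controls: float, se: float, sp: float):
    """Expected observed counts when the outcome label has sensitivity se and specificity sp."""
    obs_cases = se * cases + (1 - sp) * controls
    obs_controls = (1 - se) * cases + sp * controls
    return obs_cases, obs_controls


def correct_row(obs_cases: float, obs_controls: float, se: float, sp: float):
    """Textbook inversion of corrupt_row; ASSUMES se and sp are known (not the lambda-OR)."""
    denom = se + sp - 1
    if denom <= 0:
        raise ValueError("labels must be better than chance (se + sp > 1)")
    total = obs_cases + obs_controls
    cases = (obs_cases - (1 - sp) * total) / denom
    return cases, total - cases


# Hypothetical true table with no association (odds ratio = 1).
true_exposed, true_unexposed = (100, 900), (100, 900)

# Hypothetical differential noise: more false positives among the exposed.
obs_exposed = corrupt_row(*true_exposed, se=0.9, sp=0.90)
obs_unexposed = corrupt_row(*true_unexposed, se=0.9, sp=0.98)

naive = log_odds_ratio(*obs_exposed, *obs_unexposed)
fixed = log_odds_ratio(
    *correct_row(*obs_exposed, se=0.9, sp=0.90),
    *correct_row(*obs_unexposed, se=0.9, sp=0.98),
)
print(f"naive log-OR: {naive:.3f}, corrected log-OR: {fixed:.3f}")
```

In this toy setting the naive estimate reports an association where none exists, while the back-calculation recovers a log-odds ratio near zero. In practice the error rates are rarely known, which is the gap an estimator such as the λ-OR is meant to address.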

LSMsurvey.pdf

Citation: Zhuoqun Li, David Young, James Evans, Ishanu Chattopadhyay. Opinion Geometry from Social Surveys.

Abstract: Long-running social surveys provide rich longitudinal records of expressed attitudes, yet individual responses are often high-dimensional, sparse, and exhibit strong, time-varying dependencies across items. This study introduces a data-driven digital twin of opinions that learns cross-item dependencies within survey waves and induces a time-indexed distance over response vectors, facilitating sample generation and principled imputation. The induced distance, defined via Jensen–Shannon divergence, yields a wave-specific intrinsic geometry of opinion space, analogous to Riemannian analyses of curved spaces. This framework allows for the validation of response patterns and the reconstruction of missing or masked responses across large datasets, including the General Social Survey and Eurobarometer. The results demonstrate improved geometric fidelity in response reconstruction and successful predictions of voting behavior, outperforming standard demographic-based models. By establishing a generative representation of opinion structures, this work advances the understanding of opinion dynamics and provides a robust method for analyzing complex survey data.
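
For readers unfamiliar with the Jensen–Shannon divergence, a minimal sketch follows. It compares two discrete distributions over the answer options of a single survey item; how the manuscript builds its wave-specific distance over full response vectors is not described in the abstract, so the function names and the example distributions here are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1] with base-2 logs."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture; m_i > 0 wherever p_i or q_i > 0
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the divergence, which satisfies the triangle inequality."""
    return math.sqrt(js_divergence(p, q))


# Hypothetical answer distributions for one three-option survey item.
wave_a = [0.5, 0.3, 0.2]
wave_b = [0.2, 0.3, 0.5]
print(js_divergence(wave_a, wave_b), js_distance(wave_a, wave_b))
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and finite even when one distribution assigns zero probability to an option, which makes it a natural building block for a distance.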

LSMviral.pdf

Citation: Kevin Y. Wu, MS; Feng Li, DVM, PhD; Ishanu Chattopadhyay, PhD. Emergenet: Digital Twin of Influenza A Emergence From Non-Human Hosts.

Abstract: Animal influenza viruses pose a significant pandemic threat, yet assessing the emergence potential of these strains before they can transmit efficiently among humans remains a challenge. Current methods rely heavily on expert evaluations and experimental assays, which are not scalable to the volume of genomic data generated by modern surveillance. This study introduces Emergenet, a sequence-based digital twin of influenza evolution that utilizes over 463,266 hemagglutinin and neuraminidase sequences to learn context-dependent mutational constraints. Emergenet provides rapid prioritization of animal influenza strains based on their evolutionary proximity to human-adapted viruses. The model was validated through two key applications: forecasting seasonal vaccine strains and estimating emergence risk for animal strains, showing significant improvements over traditional methods. Emergenet’s predictions consistently outperformed World Health Organization recommendations for vaccine strains and demonstrated a strong correlation with the CDC’s Influenza Risk Assessment Tool. This innovative approach enables scalable biosurveillance and supports proactive public health strategies, making it a valuable tool for military and public health readiness against potential influenza outbreaks.

nero.pdf

Citation: Charles Ross Schmidt, Ishanu Chattopadhyay. Complexity Signature of Generated Text.

Abstract: This study presents a model-agnostic, training-free estimator of intrinsic complexity, quantified as entropy rate, for long-form text streams using a fixed coarse-grained alphabet. By analyzing outputs from various contemporary large language models (LLMs) alongside human-authored texts, we find that LLM outputs consistently exhibit lower entropy rates compared to human prose. This statistic is derived directly from the text without needing access to model internals or requiring supervision or retraining, allowing for a clear distinction between human and LLM-generated text across different model families and corpora. The differences in entropy rates are interpreted through the lens of algorithmic statistics, suggesting variations in the effective descriptive complexity of the generative mechanisms involved. Our approach, termed the Nonparametric, learning-free Entropy-Rate Oracle (NERO), serves not only as a means of discrimination but also as a calibration-free method for ranking generative capacity and monitoring distributional changes across successive models. This work emphasizes the importance of intrinsic complexity in understanding the generative capabilities of AI and human authors.
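
The idea of estimating an entropy rate over a fixed coarse-grained alphabet can be sketched as follows. This is not the NERO estimator, whose construction the abstract does not give: the four-symbol alphabet and the plug-in block-entropy difference H_k − H_(k−1) are assumptions chosen for illustration, and all names are hypothetical.

```python
import math
from collections import Counter


def coarse_grain(text: str) -> str:
    """Map text to a hypothetical 4-symbol alphabet: vowel, consonant, space, other."""
    symbols = []
    for ch in text.lower():
        if ch in "aeiou":
            symbols.append("V")
        elif ch.isalpha():
            symbols.append("C")
        elif ch.isspace():
            symbols.append("_")
        else:
            symbols.append("P")
    return "".join(symbols)


def block_entropy(seq: str, k: int) -> float:
    """Plug-in Shannon entropy (bits) of overlapping length-k blocks; 0 for k = 0."""
    if k == 0:
        return 0.0
    if len(seq) < k:
        raise ValueError("sequence is shorter than the block length")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_rate_estimate(seq: str, k: int = 3) -> float:
    """Conditional-entropy estimate H_k - H_(k-1), in bits per symbol."""
    return block_entropy(seq, k) - block_entropy(seq, k - 1)


sample = "The quick brown fox jumps over the lazy dog. " * 20
print(entropy_rate_estimate(coarse_grain(sample)))
```

Plug-in estimates of this kind are biased downward for short texts and large k, so any comparison between corpora should use the same k and comparable text lengths.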

zebra_adrd.pdf

Citation: Dmytro Onishchenko, James A. Mastrianni, Ishanu Chattopadhyay. Bloodwork-free Early Screening for Alzheimer’s Disease via Comorbid Pattern Recognition in Electronic Health Records.

Abstract: The early identification of Alzheimer’s disease and related dementias (ADRD) is hindered by the reliance on specialized tests and late-stage diagnosis. The Zero-burden Risk Assessment (ZeBRA) is an AI-driven tool that predicts the onset of ADRD up to ten years prior to diagnosis, utilizing only routine electronic health record (EHR) data without the need for laboratory tests, imaging, or questionnaires. Trained on a vast dataset of nearly 488,000 cases and over 12 million controls, ZeBRA demonstrated high accuracy with AUC values of 0.93 for 1-year predictions and 0.83 for 10-year predictions, while maintaining strong positive likelihood ratios at 95% specificity. The model’s performance was consistent across various demographic groups. In preliminary testing, higher ZeBRA scores correlated with lower cognitive assessment scores, indicating its potential for identifying cognitive impairment. ZeBRA’s scalability, cost-effectiveness, and independence from specialized testing make it a promising tool for early detection and presymptomatic trial enrichment in ADRD, addressing a critical need for accessible screening methods.
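
The positive likelihood ratio mentioned above follows from sensitivity and specificity as LR+ = sensitivity / (1 − specificity), and it converts a pre-test probability into a post-test probability through the odds. The sketch below illustrates the arithmetic; the sensitivity and prevalence values are hypothetical placeholders, not results from the manuscript.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """LR+ = sensitivity / (1 - specificity)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity < 1.0):
        raise ValueError("need 0 <= sensitivity <= 1 and 0 <= specificity < 1")
    return sensitivity / (1.0 - specificity)


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    if not 0.0 <= pre_test_probability < 1.0:
        raise ValueError("need 0 <= pre_test_probability < 1")
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# HYPOTHETICAL inputs: 60% sensitivity at the 95% specificity operating point,
# and a 2% pre-test probability in the screened population.
lr_plus = positive_likelihood_ratio(sensitivity=0.60, specificity=0.95)
print(lr_plus, post_test_probability(0.02, lr_plus))
```

Fixing specificity at 95% caps the false-positive rate at 5%, so the likelihood ratio at that operating point depends only on the sensitivity the model achieves there.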

zebra_pancreatitis.pdf

Citation: Ishanu Chattopadhyay, PhD; Dmytro Onishchenko, MSc; Philip Kern, MD; Darwin Conwell, MD, MSc, FACG. AI-driven Test-Free Prediction of ICU Admission, Insulin Dependence, and Exocrine Dysfunction after Acute Pancreatitis.

Abstract: Acute pancreatitis (AP) presents with diverse clinical trajectories, necessitating effective prognostic tools to predict outcomes such as ICU admission and long-term complications like exocrine pancreatic dysfunction (EPD) and pancreatogenic diabetes. This study introduces an AI-driven platform that utilizes electronic health records (EHR) to predict three critical outcomes following a first AP diagnosis: (i) ICU admission within varying timeframes, (ii) incident EPD, and (iii) incident insulin dependence among previously non-diabetic patients. The models were trained on a substantial dataset comprising over 164 million individuals, demonstrating high predictive accuracy with area under the curve (AUC) values of 0.986 for same-day ICU admission and 0.913 for incident EPD. The findings highlight the potential of the Zero-Burden Risk Assessment (ZeBRA) framework to facilitate early risk stratification and targeted follow-up without the need for additional testing. By leveraging existing coded longitudinal history, this approach addresses a significant gap in pancreatitis management, offering a scalable and interpretable method for improving patient outcomes across the AP-chronic pancreatitis continuum.