Feature selection in most black-box machine learning algorithms, such as BERT, is based on correlations between features and the target variable rather than on causal relationships in the dataset. This makes their predictive power and decisions questionable because of potential bias. This paper presents novel BERT models that learn from causal variables in a clinical discharge dataset. A causal directed acyclic graph (DAG) identifies the input variables for predicting patients' survival rates and informing decisions. The core idea behind our model lies in the ability of the BERT-based models to learn from a semi-synthetic dataset derived from the causal DAG, enabling them to model the underlying causal structure accurately instead of generic spurious correlations devoid of causation. The Conditional Independence Test (CIT) validation of the causal DAG showed that its conceptual assumptions were supported: the Pearson correlation coefficients fell within the expected range of -1 to 1, the p-values were greater than 0.05, and the 95% and 25% confidence interval criteria were satisfied. We then mapped the semi-synthetic dataset derived from the causal DAG to three BERT models. Two metrics, prediction accuracy and AUC score, were used to compare the performance of the BERT models. The regular BERT achieved an accuracy of 96%, ClinicalBERT 90%, and ClinicalBERT-Discharge-Summary 92%. The AUC score for BERT was 79%, for ClinicalBERT 77%, and for ClinicalBERT-Discharge-Summary 84%. Our experiments on the semi-synthetic patient survival dataset derived from the causal DAG demonstrate high predictive performance and provide explainable input variables that allow humans to understand and justify the predictions.
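
The CIT validation described above checks whether the conditional independencies implied by the causal DAG hold in the data, using Pearson correlation coefficients and p-values above 0.05 as the acceptance criterion. The sketch below illustrates one common way such a test can be implemented, via the Pearson correlation of regression residuals (a partial-correlation CIT); the variable names (severity, treatment, survival) and the residual-based approach are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a partial-correlation conditional independence test (CIT).
# Variable names and the residual-based approach are illustrative assumptions.
import numpy as np
from scipy import stats

def conditional_independence_test(x, y, z):
    """Test X independent of Y given Z via Pearson correlation of OLS residuals.

    Returns (partial correlation r, p-value). A large p-value (> 0.05) means the
    conditional independence implied by the causal DAG is not rejected.
    """
    z = np.column_stack([np.ones(len(z)), z])        # conditioning set with intercept
    beta_x, *_ = np.linalg.lstsq(z, x, rcond=None)   # regress X on Z
    beta_y, *_ = np.linalg.lstsq(z, y, rcond=None)   # regress Y on Z
    res_x = x - z @ beta_x                           # residuals of X after removing Z
    res_y = y - z @ beta_y                           # residuals of Y after removing Z
    return stats.pearsonr(res_x, res_y)              # (r, p-value)

# Toy chain DAG: severity -> treatment -> survival, so survival should be
# independent of severity once we condition on treatment.
rng = np.random.default_rng(0)
n = 2000
severity = rng.normal(size=n)
treatment = 0.8 * severity + rng.normal(size=n)
survival = 1.2 * treatment + rng.normal(size=n)

r, p = conditional_independence_test(severity, survival, treatment.reshape(-1, 1))
print(f"partial r = {r:.3f}, p-value = {p:.3f}")  # a large p-value supports the DAG
```

In this setup, a p-value above 0.05 indicates that the tested independence constraint is consistent with the hypothesized DAG, which mirrors how the abstract reports support for the DAG's conceptual assumptions.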