2.5. Data processing and statistical analysis

After export from TargetLynx software, data were log10-transformed to approximate normality, and general linear models (GLMs) were used to compare metabolite levels between the sets of control samples from different centers, as well as between BC patients and healthy controls. Age was included as a covariate in all univariate models to control for potential confounding effects. The Benjamini-Hochberg false discovery rate (FDR) procedure was applied to correct for multiple comparisons, and the FDR q-value threshold for significant markers was set at 0.05. Partial correlation analysis was used to calculate correlation coefficients among metabolites. These statistical analyses were performed using SPSS 22.0 (SPSS Inc.; Chicago, IL).
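As an illustration of the Benjamini-Hochberg step described above, the following is a minimal sketch in plain Python. It is not the SPSS procedure used in this study; the function name `benjamini_hochberg` and the example p-values are hypothetical and serve only to show how q-values are derived from a set of p-values and compared against the 0.05 threshold.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for five metabolites (illustrative only).
p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(p)

# A metabolite is flagged when its q-value falls below the 0.05 threshold.
significant = [qi < 0.05 for qi in q]
```

In this toy example the third and fourth p-values are below 0.05 before correction but are no longer flagged after it, which is the behavior the FDR correction is intended to produce.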

Journal of Chromatography B 1105 (2019) 26–37

Partial least squares discriminant analysis (PLS-DA) was performed using log10-transformed, Pareto-scaled data to construct classification models. For Pareto scaling, data were mean-centered and divided by the square root of the standard deviation of each variable. An internal 7-fold cross-validation (the number of folds was selected automatically by the software) was carried out to estimate the performance of PLS-DA models. Model validation was also performed using a 300-iteration permutation test. R2 represents the explanatory capacity of the model, and Q2 signifies the predictive capacity of the model. Our PLS-DA model was constructed using SIMCA-P 14.1 software (Umetrics, Umeå, Sweden). The differential metabolites were obtained based on variable importance in projection values (VIP > 1) taken from an initial PLS-DA model and significant q-values (q < 0.05) derived from our corrected GLM. These differential metabolites were then selected as a panel of markers to construct a second PLS-DA model for discrimination between BC patients and healthy controls. Area under the receiver operating characteristic curve (AUROC) was then calculated to evaluate the classification performance of the PLS-DA model.
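The Pareto scaling and AUROC calculations described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the SIMCA-P implementation: the function names `pareto_scale` and `auroc`, the use of the sample (n - 1) standard deviation, and all input values are assumptions made for the example.

```python
import math
import statistics


def pareto_scale(column):
    """Mean-center a variable and divide by the square root of its SD."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # assumption: sample standard deviation (n - 1)
    return [(x - mean) / math.sqrt(sd) for x in column]


def auroc(case_scores, control_scores):
    """AUROC as the fraction of case/control pairs in which the case
    scores higher than the control (ties count as one half)."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical raw intensities for one metabolite, log10-transformed first.
log_values = [math.log10(v) for v in (10.0, 100.0, 1000.0)]
scaled = pareto_scale(log_values)

# Hypothetical classifier scores for patients and controls.
auc = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Dividing by the square root of the standard deviation, rather than by the standard deviation itself, reduces the dominance of high-variance variables while retaining more of the original data structure than unit-variance scaling.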
Exploratory factor analysis (EFA) was conducted using Comprehensive Exploratory Factor Analysis (CEFA) software version 2.0 (Columbus, OH) to determine underlying pathways affected in BC patients. We used EFA, a data reduction and analytic technique, to discover patterns of latent variables that could increase the interpretability of the data. Rotation was conducted in an effort to achieve a simple structure and improve pathway identification. Both analysis and rotation were unsupervised, and all metabolite names were replaced with variable numbers to maintain unbiased interpretation of factor loadings. Parallel analysis was conducted with a random data matrix of the same order as our experimental data. Factors were retained if and only if they accounted for more variance than the random data, as evidenced by their respective eigenvalues.
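The retention rule used in parallel analysis can be sketched as follows. This is a minimal illustration, not the CEFA implementation: the function name `retained_factor_count` and the eigenvalues are hypothetical, the eigenvalues are assumed to have been computed already for both the experimental and the random data matrix, and counting stops at the first factor that fails the comparison (a common convention that we assume here).

```python
def retained_factor_count(observed_eigenvalues, random_eigenvalues):
    """Count leading factors whose eigenvalue exceeds the random-data eigenvalue."""
    observed = sorted(observed_eigenvalues, reverse=True)
    random_ref = sorted(random_eigenvalues, reverse=True)
    count = 0
    for obs, ref in zip(observed, random_ref):
        if obs <= ref:
            break  # stop at the first factor that does not beat random data
        count += 1
    return count


# Hypothetical eigenvalues (illustrative only, not the values from this study).
observed = [4.2, 2.1, 1.3, 0.9, 0.5]
random_ref = [1.6, 1.4, 1.2, 1.0, 0.8]
n_factors = retained_factor_count(observed, random_ref)
```

With these example values, the first three factors exceed their random-data counterparts and would be retained.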

Pathway analyses were performed and visualized using both Ingenuity Pathway Analysis (IPA) [44] and the MetaboAnalyst 4.0 software package [45].

3. Results

3.1. Targeted metabolic profiles of BC versus healthy controls

A total of 102 BC patients and 99 healthy controls were included in the study. There was no statistically significant difference in age (p > 0.05) between BC patients and healthy controls as calculated by the Mann-Whitney U test. The clinical information of BC patients is shown in Table 1.

In the current study, we used a large-scale, targeted LC-MS/MS method for reliable and comprehensive plasma metabolite detection. Using this paradigm, targeted analysis of 400+ MRM transitions was achieved for metabolites spanning over 20 different chemical classes (e.g., acyl glycines, bile acids, cyclic amines, etc.) from > 35 metabolic pathways (e.g., vitamin and cofactor metabolism, citric acid cycle, lipid metabolism, etc.) across positive and negative ionization modes (Supplemental Table S1). In total, we found that 105 metabolites were detectable in the QC sample with signal-to-noise ratios (S/Ns) > 3. Moreover, 98 of the 105 detected metabolites were observed in > 90% of all study samples. Following normalization by averaged values from the QC injection data, relative levels of the 98 metabolites had a median coefficient of variation (CV) value of 5.1%, with ~87% of metabolites having CV < 15% (Supplemental Fig. S1).
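For readers unfamiliar with the reproducibility summary reported above, the following sketch shows how a per-metabolite CV, the median CV, and the fraction of metabolites with CV < 15% could be computed from repeated QC injections. The metabolite names and intensities are hypothetical, the sample (n - 1) standard deviation is an assumption, and the QC-based normalization step itself is not reproduced here.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.fmean(values)


# Hypothetical QC-injection intensities per metabolite (illustrative only).
qc_intensities = {
    "metabolite_1": [100.0, 102.0, 98.0],
    "metabolite_2": [50.0, 60.0, 40.0],
    "metabolite_3": [200.0, 210.0, 190.0],
}

# CV for each metabolite across its repeated QC injections.
cvs = {name: coefficient_of_variation(v) for name, v in qc_intensities.items()}

# Summary statistics analogous to those reported in the text.
median_cv = statistics.median(cvs.values())
fraction_below_15 = sum(cv < 15.0 for cv in cvs.values()) / len(cvs)
```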

Sixty-seven of the 98 metabolites showed statistical significance (FDR q < 0.05) between BC patients and controls (Supplemental Table S2). Since the control samples were from two distinct institutions, we compared the metabolite levels between the two sets of control samples using GLMs with age adjustment. Between the two sets of controls, 68 of 98 metabolites were observed to be significant (q < 0.05). Of the 67 metabolites shown to be significantly different between cancer patients and controls, 53 of them were also observed to be statistically significant between the two sets of controls (Supplemental Table S2). Given this overlap, a real potential for confounding exists and, as such, any true comparison of BC patients and healthy controls necessitates a comparison of metabolites that were not found to be significantly different between the two control cohorts. There were 30 metabolites which exhibited no statistically significant differences between the two groups of control samples (q > 0.05) (Supplemental Scheme S1). Thus, these 30 metabolites were used for subsequent analyses. An over-representation analysis was performed to evaluate the scope and depth of the metabolic profile generated from these 30 compounds. As shown in Supplemental Fig. S2, the metabolic profile used for comparison of cancer and control subjects is reflective of 27 pathways, and