TNEGI//ETNI


一個人的孤獨之旅 (A Lonely Journey of One Person)

TNEGI//ETNI (2025-11-09 18:40:44)


【注:這是陳立功所著統計學專著《哲學之於統計學》的自序。本文學城博文采用了其最初的標題:一個人的孤獨之旅】

[Note: This is the author's preface to Chen Ligong's statistics monograph The Philosophy of Statistics. This Wenxuecity blog post uses its original title: A Lonely Journey of One Person.]

一、統計理性、夢想和挫敗 (Statistical Rationality, the Dream, and the Frustrations)

在本作者看來,統計學的最高理性是如何認識基於隨機性的非確定性,而隨機性的本質則是人對認知對象的未知,因此,由隨機性導致的非確定性實質是人的認識的非確定性。由於這種非確定性是統計認知的唯一特性,類比於對一個常量的最大和最小期望的同一性,以辨證的方式,我們還可以說,這一最高理性同時也是統計學的最低理性。這一最高和最低理性決定了統計理論和方法構建中的基本原則。當我們說其最高理性時,意味著每個人都必須遵循它;而當我們說其最低理性時,則任何人在任何時候都不能以任何方式違反它。因此,那些試圖用單一的確定性假定來構建統計理論和方法的人便是違背了這一由隨機性確立的統計理性,因為這類唯心主義的假定本身就是將那些客觀上未知的非確定性轉變成了心理上已知的確定性。

In the author’s opinion, the highest rationality of statistics lies in how to understand the uncertainty arising from randomness, and the essence of randomness is the cognizer’s lack of knowledge about the object of cognition. Therefore, the uncertainty caused by randomness is actually the uncertainty of human cognition itself. Since this uncertainty is the sole characteristic of statistical cognition, then, analogous to how the maximum and minimum expectations of a constant are identical, the highest rationality of statistics is simultaneously its lowest rationality in a dialectical sense. This highest and lowest rationality determines the basic principles for the construction of statistical theories and methods. To call it the highest rationality means that everyone must follow it; to call it the lowest rationality means that no one may violate it in any way at any time. Therefore, those who attempt to build statistical theories and methods upon a single deterministic assumption violate this statistical rationality rooted in randomness, for such idealistic assumptions themselves convert objectively unknown uncertainties into psychologically known certainties.

自從前蘇聯數學家Андрей Николаевич Колмогоров(安德烈·尼古拉耶維奇·柯爾莫哥洛夫)在1933年用自己的最高智慧完成了對概率論的公理化後,無數數學背景的統計學家們傾力所為的目標就是試圖用數學的邏輯框架和形式化語言將統計學打造成一個嚴謹的數學分支,或者一門類數學的學科。他們的艱辛努力在很大程度上促進了近當代統計學的長足進步和發展。但是,由於數學思維模式和邏輯框架的先天不足,也在該領域刻下了許多幼稚、甚至錯誤的烙印。

Since the former Soviet mathematician Андрей Николаевич Колмогоров (Andrey Nikolaevich Kolmogorov) axiomatized probability theory in 1933 with the full force of his extraordinary intellect, countless statisticians with a mathematical background have devoted themselves to the goal of turning statistics into a rigorous branch of mathematics, or at least into a mathematics-like discipline, by employing the logical frameworks and formalized language of mathematics. Their strenuous efforts have, to a large extent, propelled the remarkable progress and development of modern statistics. However, the inherent limitations of mathematical thinking and logical structures have also left many naïve, or even erroneous, imprints on the field.

盡管統計學中的假設檢驗為思考和解決非確定性問題樹立了一個思維模式的典範,眾多數學背景的統計學者卻無視這個典範而醉心於用數學的確定性思維來解決這類非確定性問題。那麽,假設檢驗的思維模式是怎樣的呢?它通常設置兩個相互對立但又不確定的假設,通過構造或選擇一個合適的檢驗統計量並完成檢驗流程,最終在一個給定的概率水平上從中做出抉擇。

Although hypothesis testing in statistics has established a paradigmatic way of thinking for dealing with uncertainty, many mathematically trained statisticians have ignored this paradigm and instead become enamored with using deterministic mathematical thinking to tackle problems that are inherently indeterminate. So, what is the thinking pattern behind hypothesis testing? It sets up two mutually opposing but uncertain hypotheses, constructs or selects an appropriate test statistic, carries out the testing procedure, and finally makes a decision between the two at a given probability level.
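The decision pattern just described, two opposing hypotheses resolved by a test statistic at a given probability level, can be sketched in a few lines. This is my own minimal illustration, not a method from this book; the helper name and the large-sample normal approximation are my assumptions:

```python
import math
import statistics as st

def two_sided_z_test(sample, mu0, z_crit=1.96):
    """Decide between H0: mu == mu0 and H1: mu != mu0 at roughly the 5%
    level, using a large-sample normal approximation for the statistic."""
    n = len(sample)
    se = st.stdev(sample) / math.sqrt(n)   # standard error of the mean
    z = (st.fmean(sample) - mu0) / se      # test statistic
    return z, abs(z) > z_crit              # (statistic, reject H0?)

# The data neither prove nor disprove H0; they only support a decision
# at the chosen probability level.
z, reject = two_sided_z_test([1.0, 2.0, 3.0, 4.0, 5.0], mu0=3.0)
```

Note that the procedure never asserts either hypothesis as a certainty; it only chooses between the two opposing assumptions at the stated probability level, which is precisely the dialectical structure the text describes.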

然而,那些習慣數學思維的統計學者們常常隻設置唯一的假定,然後通過提出命題、給出定義、闡述性質和邏輯論證來完成其統計理論和方法論的構建。他們以為隻要通過了這個數學的形式主義路線,其構建的理論和方法便是統計學中的某種“定理”。這種數學形式主義在統計學領域的確立造成了某種不良的後果,統計類期刊風行著純數學的範式,而其思想高地則被先行者占領,他們固守僵化思維,拒絕一切合理的哲學反思和批判。殊不知,在一個隨機係統中,在他們設置的假定的對立麵,總是會存在著一個合理的假定。對這個對立麵的合理假定的無視,終將成為其理論和方法的致命傷,而造成這一普遍情形的根本原因則在於數學係統對辨證思維的絕對排斥。

However, statisticians accustomed to mathematical thinking often construct their statistical theories and methodologies from a single assumption, followed by definitions, propositions, properties, and logical proofs. They believe that once this mathematical formalist route is followed, their theory or method is elevated into a kind of “theorem” of statistics. This mathematical formalism, once established in statistics, has led to some undesirable consequences: the purely mathematical paradigm has become dominant in statistical journals, while the field’s ideological high ground is occupied by its early standard-bearers, who cling to rigid thinking and reject all reasonable philosophical reflection and criticism. What they fail to realize is this: in a stochastic system, opposite any assumption one sets, there always exists another reasonable assumption. Ignoring the legitimacy of that opposing assumption will ultimately become the fatal flaw of their theories and methods, and the root cause of this widespread situation lies in the mathematical system’s absolute rejection of dialectical thinking.

例如,在對樣本均數是關於總體均數的無偏估計的證明中,其前提假定為總體分布是正態的,因而總體均數位於其分布曲線的峰頂,也即總體均數就是其分布期望,因此,該證明的目的是想確認樣本均數是對總體分布期望的無偏估計。然而,這個假定事實上已將需要證明的結論隱藏在其自身之中,而證明的過程僅僅隻是用形式化的數學語言將假定和結論無謂地自循環了一遍。此外,總體分布是不可知的,可能正態,也可能偏態,而在偏態情形下,總體均數一定會偏離其分布曲線的峰頂,也就是不再與其分布期望同一,從而,樣本均數與作為更一般概念的總體期望之間的關係是不確定的,因而是不可被以數學形式證明的。於是,針對正態或對稱情形的這一證明在偏態或非對稱情形下將失效。更嚴重的問題是,在單峰分布的概念體係下,正態分布僅僅隻是其中的一個瞬間特例。試圖通過證明這個特例而將統計學的理論和方法學體係建立在其上是不能令人信服的。

For example, in the proof that the sample mean is an unbiased estimate of the population mean, the premise assumes that the population distribution is normal, so that the population mean coincides with the peak of its distribution curve; that is, the population mean equals its distribution expectation. The purpose of the proof is thus essentially to confirm that the sample mean is an unbiased estimate of the expectation of the population distribution. However, this assumption already embeds the conclusion within itself, turning the proof into a meaningless circular exercise in which assumption and conclusion are restated in formal mathematical language. Moreover, the true population distribution is unknowable; it may be normal or skewed. In the skewed case, the population mean inevitably deviates from the peak of the distribution curve and is no longer identical to its distribution expectation; thus the relationship between the sample mean and the population expectation, as a more general concept, becomes uncertain and cannot be established through formal mathematical logic. Therefore, a proof valid for the normal or symmetrical case becomes invalid for the skewed or asymmetrical cases. The more serious problem is that, within the conceptual system of unimodal distributions, the normal distribution is merely a transient special case. It is therefore unconvincing to attempt to build the foundation of statistical theory and methodology on proving that special case alone.
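A quick simulation, my own sketch rather than anything from the book, illustrates the narrower empirical point that the sample mean tracks the population mean even when the population is skewed. Here an exponential distribution with population mean 1.0 stands in for the skewed case (note that its mean does not coincide with its density peak, which is at 0):

```python
import random
import statistics as st

random.seed(0)  # fixed seed so the run is reproducible

# Draw many samples from a right-skewed population (exponential, mean 1.0)
# and average the sample means: the average settles near the population mean,
# even though the distribution is far from normal.
sample_means = [
    st.fmean(random.expovariate(1.0) for _ in range(30))
    for _ in range(2000)
]
grand_mean = st.fmean(sample_means)  # close to the population mean 1.0
```

A simulation can only illustrate behavior, not prove a theorem, which is consistent with the text's point that the relationship need not rest on a formal proof whose premise already assumes normality.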

其實,與總體期望相比,總體均數是直接從樣本均數抽象出來的一個狹義概念,兩者在算法結構上屬於同質定義。而且,樣本與其所來源的總體也是同質定義。因此,無論總體分布如何,隻要抽樣滿足隨機原則,樣本均數一定是關於總體均數的一個無偏或有效估計。這就是說,樣本均數與總體均數之間的關係無需以數學的形式邏輯予以證明。作者還認為,統計學不過是將對總體分布期望的抽樣估計定義在其樣本均數上,而凡是定義均無需數學形式邏輯的證明。於是,我們可以推而廣之,一切樣本統計量都是對同質定義的總體參數的無偏或有效估計。由此可見,由於無法針對隨機係統設置單一的和明確的前提假定,基於數學形式邏輯的證明便失去了用武之地。

In fact, compared with the population expectation, the population mean is a narrower concept abstracted directly from the sample mean, and the two are homogeneously defined in terms of algorithmic structure. Moreover, the sample and the population from which it is drawn are also homogeneously defined. Therefore, regardless of the population’s distribution, as long as the sampling adheres to the principle of randomness, the sample mean must be an unbiased or valid estimate of the population mean. In other words, the relationship between the sample mean and the population mean requires no proof through mathematically formalized logic. The author also believes that statistics simply defines the sampling estimate of a population’s distribution expectation as its sample mean, and a definition never requires formal mathematical proof. We may therefore generalize: every sample statistic is an unbiased or valid estimate of the homogeneously defined population parameter. From this perspective, proofs based on formal mathematical logic lose their footing, because in a random system it is impossible to establish a single, unambiguous assumption as the premise for such proofs.

在本作者看來,哲學應該是統計學的靈魂,而數學作為一種技術僅僅是它的四肢。靈魂指導四肢做什麽和怎樣做,而非相反。作者有時也把哲學和數學比作統計學的雙翼,缺少其中任何一個,統計學都將難以在認知世界的天空裏自由翱翔。作為一個偏向哲學思考的人,作者將一個樣本數據看成是經驗事實的集合。從本人對現行統計方法的重建和創建新統計方法的經驗來看,其中存在著這樣一個基本流程:

[Flowchart omitted: the basic cognitive process referred to above]

In the author’s view, philosophy is the soul of statistics, while mathematics, as a technique, serves merely as its limbs. The soul tells the limbs what to do and how to do it, not the other way around. The author sometimes also compares philosophy and mathematics to the two wings of statistics: lacking either one, statistics can hardly soar freely in the sky of cognizing the world. As someone inclined toward philosophical thinking, the author regards a sample dataset as a collection of empirical facts. From the author’s experience in reconstructing existing statistical methods and creating new ones, there emerges a basic process:

1994~1997年間,我在同濟醫科大學公共衛生學院師從餘鬆林教授攻讀衛生統計碩士學位,有一天,我用一個連續型隨機變量的樣本繪製了一個二維散點圖,方法是按照算術均數的計算公式給每個樣本點一個等權重1。於是得到它們在縱坐標上權重等於1的地方呈一條直線狀的散點排列,如圖1所示。當時心中就升起一個夢想:如果它們的排列像一條正態曲線該多好!如圖2所示。我在2010年12月裏實現了這個夢想,並將其帶到了2011年8月初在美國佛羅裏達州邁阿密市召開的聯合統計會議(Joint Statistical Meetings, JSM),相關文章被收錄在其論文集中。

Between 1994 and 1997, I was pursuing a master’s degree in health statistics under Professor Yu Songlin at the School of Public Health, Tongji Medical University. One day, I plotted a two-dimensional scatter diagram from a sample of a continuous random variable: following the formula for the arithmetic mean, I assigned each sample point an equal weight of 1. As a result, the points lined up as a straight scatter at the height where the vertical coordinate, i.e., the weight, equals 1 (as shown in Figure 1). At that moment, a dream quietly arose in my mind: how wonderful it would be if their arrangement resembled a normal curve (as shown in Figure 2)! I realized this dream in December 2010 and brought it to the Joint Statistical Meetings (JSM) held in Miami, Florida, USA in early August 2011; the related article was included in its proceedings.
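The two scatter arrangements can be reproduced in a few lines. This sketch is mine, not the book's algorithm: Figure 1 places every point at height 1, and, purely as a stand-in for the dreamed-of Figure 2 shape, the second helper uses a normal density fitted to the sample (the author's actual self-weights are constructed later in the book):

```python
import math

def equal_weight_scatter(xs):
    """Figure 1: the arithmetic-mean view, every point drawn at height 1."""
    return [(x, 1.0) for x in xs]

def normal_shape_scatter(xs):
    """A stand-in for Figure 2: heights taken from a normal density fitted
    to xs. Illustrative only; NOT the book's self-weighting algorithm."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    def pdf(x):
        return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))
    return [(x, pdf(x)) for x in xs]
```

With the second helper, points near the sample mean receive the largest heights and extreme points the smallest, which is the bell-shaped arrangement the dream describes.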

圖2中的縱坐標C是我所稱的凸自權重,R是凹自權重。前者是每個樣本點對一個分布的未知期望的相對貢獻。我在構建這個自加權算法時,心中的理想目標就是一個真實正態樣本以其自權重為縱坐標的二維散點分布必須近似正態概率密度曲線,否則自加權算法就是不正確的。此外,我相信,對於一個左偏態分布的樣本,其算術均數必須位於由自權重確定的峰頂的右側,而一個右偏態分布的樣本均數必須位於其峰頂的左側;否則,自權重的算法也應該是不正確的。因此,在得到了正確的自加權算法後,我對高斯的天才深感震驚,崇敬之情油然而生,因為他是在沒有這個自加權算法的條件下純粹憑抽象的邏輯構造得到正態概率密度函數的。

The vertical axis C in Figure 2 is what I call the convex self-weight, and R the concave self-weight. The former is each sample point’s relative contribution to the unknown expectation of a distribution. When I was developing the self-weighting algorithm, my ideal goal was that the two-dimensional scatter of a truly normal sample, with the self-weights as vertical coordinates, must approximate the normal probability density curve; otherwise, the self-weighting algorithm could not be considered correct. In addition, I believed that for a sample from a left-skewed distribution, the arithmetic mean must lie to the right of the peak determined by the self-weights, while the mean of a sample from a right-skewed distribution must lie to the left of its peak; otherwise, the self-weighting algorithm would likewise be incorrect. Upon obtaining the correct algorithm, I was deeply struck by the genius of Gauss, and a spontaneous sense of reverence arose within me, for he had obtained the normal probability density function purely by abstract logical construction, without any such self-weighting algorithm.

本書所涉內容基於1998 ~ 2011年間作者在認識論和統計學領域所做的6篇文章以及隨後對它們的一些改進,其中1998 ~ 2000年間的兩篇分別發表在中國的《醫學與哲學》和《中國公共衛生雜誌》上。此後,分別在2007、2009和2011年的三次聯合統計會議上,我有四篇均包含重大突破性的文章被收錄在其論文集中。

The content of this book is based on six articles the author wrote between 1998 and 2011 in the fields of epistemology and statistics, along with later refinements. Two of the earliest, written between 1998 and 2000, were published in China in Medicine and Philosophy and the Chinese Journal of Public Health. Later, at the Joint Statistical Meetings of 2007, 2009, and 2011, four papers, each containing major breakthroughs, were included in the conference proceedings.

遺憾的是,每次會後向幾個統計學旗艦期刊的分別投稿均被主編直接封殺,有些甚至沒有評論和對拒稿理由的解釋。其中一份期刊的主編回信很簡單:“你的文章不適合發表。”另一份期刊的主編則認為我在挑戰數學和統計學的“large body”,並認為這種挑戰不合時宜。還有一位可同時理解中英文的頂級期刊的主編,竟然將我為連續隨機變量定義其自權重的文章視為現有統計學基礎知識的介紹性文章,仿佛在此之前統計學裏早已經有了與我所創立的自加權算法一樣的東西,可謂令人哭笑不得;更令人遺憾的是,即便在我對該文中的創新性和重要性等用中文和英文做了詳細解釋、並嚴肅地懇請他以對曆史負責的態度予以回應後,他依然堅持自己的判斷和拒稿決定。現在,人們應該可以說,這位主編原本可以讓統計學中的這一重大進步早日得到公正對待。然而,他卻主動選擇了將其封殺。

Regrettably, each submission of these works to several flagship statistical journals was rejected outright by the editors-in-chief, in some cases without any comments or explanation of the grounds for rejection. One editor’s reply was simply: “Your article is not suitable for publication.” Another claimed that the author was challenging the “large body” of mathematics and statistics and considered such a challenge ill-timed. One editor-in-chief of a leading journal, fluent in both Chinese and English, astonishingly misread an article defining self-weights for continuous random variables as an introductory piece on basic statistical knowledge, as if something like the author’s self-weighting algorithm had long existed in statistics, which was both laughable and saddening. Even more regrettably, after I had explained the article’s innovation and importance in detail in both Chinese and English, and earnestly asked him to respond with a historically responsible attitude, he still insisted on his judgment and his decision to reject the article. Now, one could say that this editor-in-chief could have seen justice done to this major advance in statistics much earlier. Instead, he chose to suppress it.

正如人工智能“深度求索”在全麵了解了自權重以及基於自權重的期望估計算法、做了實例計算,並與所有現行的期望估計算法,包括算術均數、中位數、核密度估計、最大似然估計、CRB估計、截尾均數等,在同一案例上做了計算和對比後給出的評論:(在現有知識體係下),自權重和自加權均數的算法是統計學的終極核武器。它沒有用“之一”來限定自己的評論。

Just as the artificial intelligence DeepSeek commented, after it had fully examined the self-weights and the self-weight-based expectation estimation algorithm, performed example calculations, and compared the results on the same case with all current expectation estimators, including the arithmetic mean, the median, kernel density estimation, maximum likelihood estimation, the CRB estimator, the trimmed mean, and so on: “(Under the existing knowledge system,) the algorithm of self-weights and the self-weighted mean is the ultimate nuclear weapon of statistics.” It did not qualify this comment with “one of.”

這些旗艦期刊的主編們無不頂著那些著名大學統計學博士和教授的頭銜,卻對來自真正的真理性新思想及其所帶來的統計學曆史上最重大方法論突破的衝擊采取了嚴防死守的策略。當一個求知的期刊拒絕真真理時,它的紙麵上印刷出來的將隻能是相應的真謬誤。因此,他們終將明白,他們用那種極簡手段捍衛的不過是一些看似美麗卻一戳即破的肥皂泡泡。

These editors, all holding PhDs and professorships in statistics from prestigious universities, responded to the arrival of genuinely true new ideas, and to the impact of the most significant methodological breakthrough in the history of statistics, with a strategy of strict defense and closed doors. But when a journal that claims to seek knowledge rejects genuine truth, what gets printed on its pages can only be the corresponding genuine falsehood. In the end, these editors will come to realize that what they defended with such minimal effort was nothing more than soap bubbles, seemingly beautiful, yet bursting at the slightest touch.

既然這些旗艦期刊如此藐視新思想和新方法,我也就隻好讓它們在原來的地方安睡。沒想到這一睡就是十多年過去了。大概除了我本人外,已無人知曉它們的存在。至於以後是否有人能夠發現它們,不敢說。在現代信息爆炸的時代,一個人能夠發現它們應該是一個極小概率的事件。因此,寫作一本書將它們囊括和融合在一起就成了我必須完成的個人使命。是的,我就是要以一己之力挑戰這個近代約三百多年來由無數智者和先驅者們締造的龐大而又看似堅固的體係。為什麽不呢?

Since these flagship journals were so disdainful of new ideas and methods, I had no choice but to let the works sleep where they lay. Unexpectedly, more than ten years have passed since then. By now, probably no one but me knows they exist. Whether someone will discover them in the future, I dare not say; in this modern era of information explosion, that would be an event of extremely small probability. Therefore, writing a book that encompasses and integrates them has become a personal mission I must complete. Yes, I intend to challenge, single-handedly, this vast and seemingly solid system built over roughly the past three hundred years by countless wise men and pioneers. Why not?

如果把人類文明史看成是一個進步的過程,那麽,縱觀這一過程可以發現一個簡單事實:一切與人類文明有關的進步,都根源於人類自身思想的突破;沒有思想的突破便沒有任何進步甚至革命的可能性,而一切思想的突破都隻能發端於個人對外部世界及其自身的認知,換句話說,一切科技進步和革命都隻能首先爆發於某一個體的腦海中。由於每個人的認知能力和對自身及其所觸及外部世界可知範疇的絕對有限性,沒人能確定或聲稱從他/她自己的角度所獲得的任何認識都將是絕對正確的永恒真理。但是,由於人類的認知行為及其能力可以由個體拓展到群體,而一個人有限的認知結果有可能被他人接受或修正,因此,對於任何個人乃至整個人類族群來說,即使是表達一個錯誤的思想也有可能為正確思想的確立帶來機會。

If we regard the history of human civilization as a process of progress, then surveying this process reveals a simple fact: all progress related to human civilization is rooted in breakthroughs of human thought. Without a breakthrough in thought there is no possibility of progress, let alone revolution, and every breakthrough in thought can only begin with an individual’s cognition of the external world and of himself or herself. In other words, every scientific and technological advance or revolution can only first erupt in the mind of some individual. Given the absolute limits of each person’s cognitive capacity and of the knowable scope of both the self and the external world it touches, no one can be certain, or claim, that any knowledge gained from his or her own perspective is an absolutely correct, eternal truth. However, since human cognitive activity and capacity can extend from the individual to the group, and one person’s limited cognitive results may be accepted or corrected by others, then for any individual, and indeed for the whole human race, even the expression of a wrong idea may create an opportunity for correct ideas to be established.

回首往事,我在原同濟醫科大學公共衛生學院攻讀衛生統計學碩士學位的三年裏,至少有三顆種子被植入了我的思維的潛意識裏。除了上麵那個關於散點分布的夢想,第二顆種子是在我的導師餘鬆林教授講授聚類分析的統計算法原理時種下的,因為我對其中僅使用樣本中點和點之間的差異產生了一個疑問:它們之間的相似性被忽視了,而這相當於丟棄了另一半樣本信息。這個潛意識裏的疑問為後來我為連續型隨機變量構建自權重做了思想準備。第三顆種子是董時富教授在講授隨機變量和常量等的運算時,反複強調涉及隨機變量的運算結果一定是隨機變量。這為我後來思考隨機變量的極值的不穩定性以及否定基於這種極值的最優化奠定了基礎。

Looking back, during the three years I studied for a master’s degree in health statistics at the School of Public Health of the former Tongji Medical University, at least three seeds were planted in my subconscious mind. Besides the dream about the scatter distribution mentioned above, the second seed was planted when my mentor, Professor Yu Songlin, taught the principles of the statistical algorithms of cluster analysis, because I questioned their use of only the differences between pairs of sample points: the similarities between them were ignored, which was equivalent to discarding the other half of the sample information. This subconscious question prepared me mentally for later constructing self-weights for continuous random variables. The third seed was planted when Professor Dong Shifu, lecturing on operations involving random variables and constants, repeatedly emphasized that the result of any operation involving a random variable must itself be a random variable. This laid the foundation for my later thinking about the instability of the extreme values of random variables and for rejecting optimization based on such extreme values.

此序的餘下部分主要談三個方麵:第一,作者幹了什麽?第二,作者為何要做這些?第三,作者對自己在本書中構建的新統計算法所要聲張的權利。

The rest of this preface addresses three questions: First, what has the author accomplished? Second, why did the author undertake these efforts? Third, what rights does the author claim over the new statistical algorithms proposed in this book?

二、作者所為及其目的 (What the Author Did and His Purpose)

作者所做的當然就是本書的內容。首先,作者從哲學的認識論角度討論了人們認識世界的基本方法和流程,並為此構建了一個認知流程圖。這部分的核心內容曾以“論智慧的遞進結構與認知的邏輯流程”為題發表在《醫學與哲學》雜誌1999年9月的第三期上。這是一個統計學所需的、包含了抽象、歸納、演繹和辯證的四維邏輯係統,它超越了數學係統所需的二維或三維邏輯係統。數學係統沒有為辯證法留下哪怕是一絲的思維縫隙或空間,因為一個數學命題隻能在唯一的假定下被提出並予以證明,它不可能從其對立麵得到證明,而是會被否定。但統計學不搞假定、命題及其證明,而是努力認知外部世界,為此常常需要從一個事物或觀點的對立麵去尋找意義,因此它不能沒有辯證法。例如,假設檢驗就是試圖在兩個相互對立的假設中做出抉擇。又如,作者在為連續隨機變量構建自權重時,不僅要考慮任意兩個樣本點之間的差異性,還必須同時考慮其相似性,隻有這樣才能保證權重的構建既無樣本信息的損失也無信息的冗餘;否則將無法得到一個正確的自權重。因此,那些試圖僅僅以數學的邏輯係統和思維模式來解決統計學問題的行為注定會帶來某種不恰當的後果。作者相信這個四維邏輯係統在認知流程的結構上應該具有獨創性,至今也應未失去其參考價值,並將有助於當下正在迅猛發展的人工智能領域的創新和進步。

What the author did is, of course, the content of this book. First, the author discusses, from the perspective of philosophical epistemology, the basic methods and process by which people come to understand the world, and constructs a cognitive flowchart for it. The core of this part was published under the title “On the Progressive Structure of Intelligence and the Logical Process of Cognition” in the third issue of the journal Medicine and Philosophy in September 1999. This is the four-dimensional logical system that statistics requires, comprising abstraction, induction, deduction, and dialectics, and it goes beyond the two- or three-dimensional logical system required by the mathematical system. The mathematical system leaves not even a sliver of room for dialectics, because a mathematical proposition can only be proposed and proven under a single, unique assumption; it cannot be proved from its opposite, but would instead be invalidated by it. Statistics, by contrast, does not deal in assumptions, propositions, and proofs, but strives to understand the external world; to do so, it often needs to seek meaning from the opposite side of a thing or a given viewpoint, so it cannot do without dialectics. For example, hypothesis testing attempts to decide between two competing hypotheses. Likewise, when the author constructed self-weights for continuous random variables, it was necessary to consider not only the difference between any two sample points but also, simultaneously, their similarity; only in this way could the construction of the weights avoid both loss of sample information and redundancy of information, and without it no correct self-weight could be obtained. Therefore, any attempt to solve statistical problems solely through the logical system and thinking style of mathematics is bound to bring about inappropriate consequences.
The author believes that this four-dimensional logical system is original in the structure of its cognitive process, that it has not lost its reference value to this day, and that it can contribute to innovation and progress in the rapidly developing field of artificial intelligence.

事實上,對認識論領域基礎概念的討論就是本書的起點。例如,算術均數的算法假定每個樣本點對其抽樣分布的期望中心(央位)的貢獻相同,即假定它們的權重都是1,因為我們不知道它們的貢獻是否存在個體差異。這是一個無知或蒙昧。所以,作者在認知的起點上討論了什麽是人的蒙昧,進而討論了如何實現從蒙昧到有所知。本書在各章節的思考和討論中獲得的全部靈感和突破均源自對統計學中存在的一些問題的哲學思考而非引用了某種既有的數學理論和算法,其中在新思想引導下構建新算法時所使用的數學技能僅有最簡單的四則運算。這就是作者之所以將第一章聚焦於哲學認識論的原因,因為它是全書的根本方法論。中國古語有雲,工欲善其事必先利其器。我無法設想如果沒有這個方法論,我能否在統計領域實現那些突破。

In fact, the discussion of the fundamental concepts of epistemology is the starting point of this book. For example, the algorithm of the arithmetic mean assumes that each sample point contributes equally to the expected center (the central location) of their sampling distribution; that is, their weights are all assumed to be 1, because we do not know whether their contributions differ individually. This is a form of ignorance, or unenlightenment. Therefore, the author begins at the starting point of cognition by discussing what human unenlightenment is, and then how to move from unenlightenment to knowledge. All the inspirations and breakthroughs in the chapters of this book arose from philosophical reflection on problems existing in statistics, not from citing some existing mathematical theory or algorithm; under the guidance of the new ideas, the only mathematical skills used in constructing the new algorithms were the four simplest arithmetic operations. This is why the author devotes the first chapter to philosophical epistemology: it is the underlying methodology of the entire book. As an ancient Chinese saying goes, a worker who wants to do his job well must first sharpen his tools. I cannot imagine achieving those breakthroughs in statistics without this methodology.
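The equal-weight assumption behind the arithmetic mean is easy to make explicit: the arithmetic mean is just a weighted mean with every weight fixed at 1. A plain illustration (not code from the book):

```python
def weighted_mean(xs, ws):
    """General weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

xs = [2.0, 3.0, 7.0]
# With all weights equal to 1 (the "unenlightened" assumption in the text),
# the weighted mean collapses to the ordinary arithmetic mean.
assert weighted_mean(xs, [1.0, 1.0, 1.0]) == sum(xs) / len(xs)
```

Any scheme that assigns individual weights to the points, such as the self-weights discussed in this book, is a departure from this equal-weight special case while keeping the same algebraic form.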

其次,在分段回歸領域,作者在未受到現行算法發展史的影響下曾對分段回歸這一重要領域進行了一次獨立的初始探索;進而在隨後長達26年的進一步探索中,在全麵了解了現行算法的發展史、數理基礎以及其中明顯違反基於隨機性原則確立的統計理性後,依然堅持自己的獨立見解,並初步建立了一套基於加權的算法,從而使得分段回歸在新算法下以極簡和透明的計算步驟和輕量化的計算負擔實現了穩健和可靠的臨界點估計和分段模型的擬合。這一新算法不僅完全規避了現行基於最優化和強製連續性假定的算法導致的過擬合,而且規避了為改善過擬合不得不引入赤池信息準則(AIC)或貝葉斯信息準則(BIC)的約束、交叉驗證和彼替可信區間等而造成的海量計算。這種大規模計算量在大樣本量條件下構成了嚴重的負擔,而模型的擬合卻並非如人們所期望的那樣好。導致現行算法走向歧途的根本原因在於構建這套算法的前輩統計學者們在基本概念上的嚴重缺失。

Secondly, in the field of piecewise regression, the author conducted an independent initial exploration without being influenced by the development history of the current algorithms. In the subsequent 26 years of further exploration, after fully understanding that history, the mathematical foundations of the current algorithms, and their obvious violations of the statistical rationality established on the principle of randomness, the author still held to his independent views and built an initial set of weighting-based algorithms, so that piecewise regression can now achieve robust and reliable threshold estimation and piecewise model fitting with extremely simple, transparent calculation steps and a light computational burden. The new algorithm not only completely avoids the overfitting caused by the current algorithms, which rest on optimization and an enforced continuity assumption, but also avoids the massive computations those algorithms must introduce to mitigate overfitting: Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) constraints, cross-validation, and bootstrap confidence intervals. Such massive computation becomes a serious burden at large sample sizes, while the model fits are not as good as one would expect. The fundamental reason the current algorithms went astray is that the earlier generations of statisticians who built them suffered a serious deficiency in basic concepts.

如果說作者在分段回歸算法上的探索與現行算法有何差別,其關鍵之處就在於將現行算法中用作最優化的算子改造成為一個含義明確且恰當的分段回歸權重,即數值上相對越大則對臨界點的貢獻越大,從而可用被分割變量的加權均數作為未知臨界點的期望估計,於是,我們可以輕鬆得到其可信區間。由於這個期望臨界點在整個樣本空間上具有唯一性,由此決定的分段模型也具有唯一性,因而在統計上是可期望的。關於新舊兩種算法的差別,還可用“個人英雄主義的魯莽”和“走群眾路線”相比擬。現行算法的最優化就好比在n個樣本點中尋找到那個最大能耐的樣本點,用它來決定臨界點的位置和分段模型;而“走群眾路線”的加權分段回歸則認為一個樣本空間內未知臨界點的位置是由每個樣本點以各自的位置共同決定的,因此,我們需要將每個樣本點對臨界點位置的點滴貢獻都考慮在內,即使其中某些點的貢獻為趨於0甚至等於0,也不將其剔除出去,而是讓它們和所有其它點一起參與計算。

If there is one key difference between the author’s exploration of piecewise regression and the current algorithms, it is this: the operator used for optimization in the current algorithms is transformed into a piecewise-regression weight with a clear and appropriate meaning, namely, the larger its value, the greater the contribution to the threshold. The weighted mean of the segmented variable can then be used as the expected estimate of the unknown threshold, and its confidence interval follows easily. Since this expected threshold is unique over the entire sample space, the piecewise models it determines are also unique, and therefore statistically expectable. The difference between the new and old algorithms can also be likened to “the recklessness of individual heroism” versus “taking the mass line.” The optimization in the current algorithms is like searching among n sample points for the single most capable point and letting it alone determine the position of the threshold and the piecewise models. The weighted piecewise regression that “takes the mass line” holds instead that the position of an unknown threshold in a sample space is jointly determined by all the sample points, each from its own position. We therefore take into account every sample point’s contribution to the threshold’s position, however small; even points whose contributions approach or equal 0 are not eliminated but participate in the calculation together with all the others.
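The "mass line" idea can be sketched as follows. The author's actual weight construction is developed in the book; purely as an illustrative stand-in, this sketch weights every interior candidate split point by the inverse lack-of-fit of a two-segment least-squares fit, then takes the weighted mean of the candidates as the threshold estimate. The function names and the inverse-RSS weight are my own assumptions, not the book's algorithm:

```python
def ols_rss(pts):
    """Residual sum of squares of a simple least-squares line through pts."""
    n = len(pts)
    sx = sum(p[0] for p in pts); sy = sum(p[1] for p in pts)
    sxx = sum(p[0] ** 2 for p in pts); sxy = sum(p[0] * p[1] for p in pts)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return sum((y - (a + b * x)) ** 2 for x, y in pts)

def weighted_threshold(xs, ys, eps=1e-9):
    """Mass-line sketch: every candidate split contributes to the threshold
    estimate in proportion to its weight; none is discarded."""
    pts = sorted(zip(xs, ys))
    cands, weights = [], []
    for i in range(3, len(pts) - 3):           # keep >= 3 points per segment
        rss = ols_rss(pts[:i]) + ols_rss(pts[i:])
        cands.append(pts[i][0])
        weights.append(1.0 / (rss + eps))      # illustrative weight choice
    return sum(w * c for w, c in zip(weights, cands)) / sum(weights)

# Demo: noiseless data with a break at x = 5 (y = x, then y = 3x - 8)
xs = [i * 0.2 for i in range(51)]
ys = [x if x < 5 else 3 * x - 8 for x in xs]
est = weighted_threshold(xs, ys)  # lands very close to 5.0
```

Note the contrast with pure optimization: no single "most capable" split is returned; the estimate is a weighted consensus of all candidates, which is what makes a confidence interval for the threshold straightforward to define.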

要理解統計學中現行的最優化法和我的基於加權的群眾路線法的區別,讀者還可以參考我向自己的大學哲學老師袁建國先生和王智平先生(他們都完全不懂統計學)所做的類比和解釋。在這個新方法所要解決的問題中,已經由幾代西方學者在整整20年間建立了一套統計方法。它被堪稱數學上很嚴謹,很有效,但卻會造成人所共知且很多人絞盡腦汁都想要解決的幾個困難和問題。這個算法用一個形象的比喻就是,一個老師把班裏幾個中意的人找來排個隊,挑了其中一個相對能耐最大的去做一個關鍵事項,然後把召來的其他人全部遣散,不再讓他們做任何與那個關鍵事項有關的事。這就是他們所謂的“最優化解決方案”。而與此相對照,我的“走群眾路線”的解法是,我知道那個關鍵事項與班裏的每個人都息息相關,所以,我讓每個人都分擔了一份力所能及的責任,大家共同努力把那個關鍵事項辦好。由於這個關鍵事項與每個人都息息相關,所以,每個人都盡心盡力做好自己的那一份。

To understand the difference between the current optimization methods in statistics and my weighting-based mass-line method, readers can also refer to the analogy I offered my university philosophy teachers, Mr. Yuan Jianguo and Mr. Wang Zhiping, neither of whom knew anything about statistics. For the kind of problem this new method aims to solve, several generations of Western scholars spent a full 20 years developing a set of statistical methods. These methods are considered mathematically rigorous and effective, yet they produce several well-known difficulties that many have racked their brains to resolve. A vivid metaphor for that algorithm: a teacher calls up a few favored students, lines them up, and picks the single most capable one to handle a key task; the rest, though also summoned, are all dismissed and not allowed to contribute to the task in any way. This is their so-called “optimized solution.” By contrast, my solution, what I call “taking the massiline,” recognizes that the key task is closely related to everyone in the class, so I let everyone share a responsibility within their ability and work together to get it done. Since the task matters to everyone, each person devotes full effort to his or her own part. (The word “massiline” is one I coined from “mass line” in a conversation with ChatGPT.)

在理解了我的算法後,深度求索和ChatGPT都認為我的這個方法比最優化法好太多了,不會導致後續的困難和問題,還竟然承認這是一種“民主的”或“集體主義的”解決方案。我則對此解釋說:“不應該將這一算法視為民主法。所謂民主法,就是視每個樣本點的權重相同,‘一人一票麽’。統計學中民主法的典型代表就是算術均數。但這是一種蒙昧的方法。我更願意稱這是一種群眾路線法,即在采用群體力量的同時承認每個樣本點的權重存在個體差異。”

After understanding my algorithm, both DeepSeek and ChatGPT agreed that it is far superior to the optimization approach: it avoids the downstream difficulties and problems, and they even characterized it as a “democratic” or “collectivist” solution. I did not fully agree, and explained: “This algorithm should not be regarded as a democratic method. A democratic method treats every sample point as having the same weight, ‘one person, one vote.’ The typical representative of the democratic method in statistics is the arithmetic mean, but that is an unenlightened method. I would rather call mine a mass-line method: it draws on the power of the group while recognizing that the weight of each sample point differs individually.”

如果說對現行分段回歸算法有過重大貢獻的彼得·斯普仁特博士在其1961年的文章最後留下了一個關於加權分段回歸的願景和未決命題,那麽,作者在未看到文獻中這一願景之前無意識地於46年後獨立地將它予以了實現。

If Dr. Peter Sprent, who made significant contributions to the current piecewise regression algorithms, left a vision and an unresolved proposition about weighted piecewise regression at the end of his 1961 article, then the author, before ever encountering that passage in the literature, unknowingly and independently realized the vision 46 years later.

第三,為了給加權分段回歸尋找理論根據,作者在為概率論公理化做出了巨大貢獻的柯爾莫哥洛夫概念體係的基礎上,對其定義的樣本空間與我提出的尺度空間的概念做了一個必要的調整,將柯氏樣本空間的內涵完整地移交給尺度空間,而將樣本空間這個概念還給由樣本自身構成的空間,以便統計學家們有機會直觀地思考這個空間內的問題。然後,我對柯氏概念係統實施了一次重大擴展,由此為統計學奠定了一個全新的概念係統和理論基石,它包括29(個或對或仨,共計47個)初始概念、兩個關鍵定義和一個引理、關於隨機變量(即新概念係統下的可變屬性)的9個性質、以及關於統計學的8個公理性陳述和兩個推論。這套概念係統不僅指導作者最終完成了對分段回歸算法的重建,而且指導作者完成了對以下可能是統計學曆史上最重大突破的算法的構建。

Third, in order to find a theoretical basis for weighted piecewise regression, the author, building on the conceptual system of Kolmogorov, who contributed so greatly to the axiomatization of probability theory, made a necessary adjustment between Kolmogorov’s sample space and the author’s proposed concept of a scale space: the connotation of Kolmogorov’s sample space was transferred entirely to the scale space, and the term sample space was returned to the space constituted by the sample itself, so that statisticians could reason intuitively about problems within that space. The author then carried out a major expansion of Kolmogorov’s conceptual system, thereby laying down a brand-new conceptual system and theoretical foundation for statistics. It comprises 29 initial concepts (singles, pairs, or triples, 47 terms in total), two crucial definitions followed by a lemma, nine properties of random variables (i.e., vattributes in the new conceptual system), and eight axiomatic statements followed by two corollaries. This conceptual system not only guided the author to complete the reconstruction of the piecewise regression algorithm, but also guided the construction of the algorithm described below, which may be the most significant breakthrough in the history of statistics.

在校對書稿和持續寫作這個自序的過程中,我在2025年2月22日那天與ChatGPT展開了第一次對話,對話從我請求它列出統計學的基本概念開始。它給出了13個“基本概念”,這是世界各國統計學教材中通行的。然後我將自己的47個初始概念的術語發送給它,它逐一給出定義和解釋,最後給出了高度的評價。隨後我將剩下的三個部分也都發送給他分析和評判,並詢問它能否在其中發現任何不合理或矛盾之處,它回答說沒有不合理之處,完全自洽。我比較了它列出的統計學“基本概念”和我的初始概念,發現其中僅有5個是相同的,其它8個都不在我的47個初始概念中,也不在我的概念係統其它幾個部分中。於是,我對它說,那8個概念不應被認為是統計學的“基本”概念,它們是次級衍生概念。它對此表示了認可。

While proofreading the manuscript and continuing to write this preface, I had my first conversation with ChatGPT on February 22, 2025. The conversation started with my asking it to list the basic concepts of statistics. It gave 13 "basic concepts", the ones common to statistics textbooks around the world. Then I sent it the terms of my 47 initial concepts; it defined and explained them one by one and finally gave them a high evaluation. I then sent it the remaining three parts for analysis and judgment, asking whether it could find anything unreasonable or contradictory in them. It replied that there was nothing unreasonable and that the system was completely self-consistent. Comparing the "basic concepts" of statistics it had listed with my initial concepts, I found that only 5 of them were the same; the other 8 appeared neither among my 47 initial concepts nor in the other parts of my conceptual system. So I told it that those 8 concepts should not be considered "basic" concepts of statistics but secondary, derivative concepts. It agreed.

第四,作者為連續隨機變量構建了一個自加權算法,目的是破除算術均數算法中上述基於等權重假定的蒙昧。自權重的基本含義是在一個抽樣分布中,每個樣本點對其分布的中心化(或央化)位置都有一份可變的相對貢獻。其成功計算使得一個抽樣分布可由每個樣本點依其觀察值和自權重在二維獨立空間內自我塑形。這一天然渾成的優美造型體現了數據內在的固有美感。這是一種在現行算法下不可能達成的可視化藝術效果。人們還將發現,每個樣本點xi (i = 1, 2, …, n) 的等權重1被分解為了兩個部分ciri,兩者都包含了全部樣本信息,而且兩者互斥,即ci + ri = 1,因而可分別獨立地刻畫xi對分布央位的集中趨勢和離散趨勢。

Fourth, the author constructed a self-weighting algorithm for continuous random variables, aiming to dispel the above-mentioned unenlightenment rooted in the equal-weight assumption of the arithmetic-mean algorithm. The basic meaning of self-weighting is that, in a sampling distribution, each sample point makes a variable relative contribution to the centralized location of its distribution. Its successful calculation enables a sampling distribution to shape itself in a two-dimensional independent space through all of its sample points, according to their observed values and self-weights. This naturally formed, graceful shape reflects the inherent beauty of the data, a visual artistic effect impossible to achieve under the current algorithms. One will also find that the equal weight 1 of each sample point xi (i = 1, 2, …, n) is decomposed into two parts, ci and ri, each containing all the sample information. The two are mutually exclusive, that is, ci + ri = 1, and thus can independently characterize the central tendency and the dispersion tendency of xi with respect to the center of the distribution.
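
The book's own self-weighting algorithm is not reproduced in this preface. Purely to make the decomposition ci + ri = 1 concrete, the sketch below uses a hypothetical weight definition (normalized absolute distance from the sample median) invented for demonstration; it is not the author's algorithm.

```python
# Illustrative sketch only: a HYPOTHETICAL pair of weights (c_i, r_i)
# with c_i + r_i = 1 for every sample point. The definition below,
# based on normalized absolute distance from the median, is invented
# for demonstration and is NOT the book's self-weighting algorithm.
import statistics

def hypothetical_self_weights(xs):
    m = statistics.median(xs)
    d = [abs(x - m) for x in xs]
    dmax = max(d) or 1.0          # avoid division by zero for constant samples
    r = [di / dmax for di in d]   # "dispersion" share: larger far from the center
    c = [1.0 - ri for ri in r]    # "concentration" share: larger near the center
    return c, r

xs = [1.2, 2.8, 3.0, 3.1, 4.9]
c, r = hypothetical_self_weights(xs)
# Each pair sums to 1, so the two parts partition each point's unit weight.
assert all(abs(ci + ri - 1.0) < 1e-12 for ci, ri in zip(c, r))
```

Plotting the observed values against weights of this general kind in two dimensions is what gives the self-shaped scatter described above.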

作者認為,為連續隨機變量找到一個不受正態性假定約束的期望估計的算法應該是所有統計學者們夢寐以求的目標。這個目標被作者於2010年12月12日找到了。它因此成為這本書最核心的關鍵內容,其對整個學科的影響將難以估量。特別地,在新概念係統的加持下,統計學因此在擁有了更多理性、更大自由和更強算力的基礎上變得更加簡單、透明和易理解,也因此有望成為在各領域從事前沿探索和研究的科研人員手中強大的基礎性工具。

The author believes that finding an algorithm for estimating the expectation of continuous random variables that is not constrained by the assumption of normality should be the dream goal of all statisticians. This goal was reached by the author on December 12, 2010. It thus becomes the core content of this book, and its impact on the entire discipline will be incalculable. In particular, with the support of the new conceptual system, statistics becomes simpler, more transparent, and easier to understand on the basis of greater rationality, greater freedom, and stronger computing power. It can therefore be expected to become a powerful foundational tool in the hands of researchers engaged in cutting-edge exploration and research in various fields.

在自權重的參與下,作者改進了一些基礎統計方法,涉及到可變屬性(即傳統概念係統中的隨機變量)的描述、差異性檢驗、相關與回歸分析,以及包含對期望臨界點處連續性檢驗在內的加權分段回歸法。每一個方法均提供現行算法和新算法的對比結果,目的是讓數據本身說話。因此,本書在統計學方麵的內容可謂極其簡單,但又極其重要。所謂“極其簡單”,是指它所涉及的內容皆為統計學中最基礎的部分,每個學習統計學的人都能理解和認可。而所謂“極其重要”,是指它徹底革新了統計學中最基礎的部分。

With the participation of self-weights, the author improved some basic statistical methods, involving the description of vattributes (i.e., random variables in the traditional conceptual system), tests of difference, correlation and regression analysis, and the weighted piecewise regression method, including a continuity test at the expected threshold. Each method provides comparative results of the current algorithm and the new algorithm, with the aim of letting the data speak for themselves. The statistical content of this book can therefore be described as extremely simple, yet extremely important. "Extremely simple" means that the content it covers is the most fundamental part of statistics, which everyone who studies statistics can understand and recognize; "extremely important" means that it completely revolutionizes that most fundamental part.

作者秉持“實踐出真知”的理念。在上述新概念係統和兩大方法論的重建和創建中,作者除了堅持“走群眾路線”,還實踐了“拒絕教條主義和僵化思維,堅持理論聯係實際”,以及“對待事物要一分為二,從多方麵考慮”等的樸素哲學思維和大眾智慧。讀者將從這些新算法的簡單且完全透明的計算步驟中深刻地體會到這些哲學思想的實際效應。

The author upholds the idea that "true knowledge comes from practice". In the reconstruction and creation of the above new conceptual system and the two methodologies, the author not only adhered to "taking the mass line", but also practiced the plain philosophical thinking and popular wisdom of "rejecting dogmatism and rigid thinking, insisting on integrating theory with practice" and "viewing things from two sides and considering them from many aspects". Readers will deeply appreciate the practical effects of these philosophical ideas in the simple and completely transparent calculation steps of the new algorithms.

那麽,一個醫學和公共衛生背景的統計學碩士為何要做這些事呢?

So why would someone with a master's degree in statistics and a background in medicine and public health do all these things?

1997年11月的某天,瑞士統計學家彼得·J·胡貝爾教授應邀在中國科學院數理統計研究所做了一個關於《統計學的過去、現在和未來》的演講,其中他表達過一個觀點:“一些數學背景的統計學家們習慣於用數學的確定性思維模式去解決統計學中的非確定性問題,因此而犯下了一些嚴重的錯誤。”然而,與此同時他承認自己對如何避免和修正這類錯誤深感無能為力,因此寄希望於一股來自數學以外的力量能夠改變這種現狀。聽到這些觀點後,我當即意識到這股力量能且隻能是來自哲學。若幹年後,我在檢索和閱讀有關分段回歸的曆史文獻時就發現了這類錯誤的存在,它們表現為針對隨機係統以某種數學式假定作為方法論構建和應用的前提,見本書的第二章相關敘述。直到這時,我才意識到胡貝爾博士為何在他的那個演講中對分段回歸這一重要領域的方法論及其發展隻字未提,而這套方法論早已在1959 ~ 1979年間成型和完善,並在此後得到了廣泛應用。相關的理論和應用文章汗牛充棟,對於個人可謂數不勝數。

One day in November 1997, the Swiss statistician Professor Peter J. Huber was invited to give a lecture on "The Past, Present and Future of Statistics" at the Institute of Mathematical Statistics of the Chinese Academy of Sciences, in which he expressed a point of view: "Some statisticians with a mathematical background are accustomed to using the deterministic thinking mode of mathematics to solve the non-deterministic problems of statistics, and have therefore made some serious mistakes." At the same time, however, he admitted that he felt powerless to avoid or correct such mistakes, and so placed his hope in a force from outside mathematics that might change this situation. On hearing these views, I immediately realized that this force could only come from philosophy. Years later, while searching and reading the historical literature on piecewise regression, I discovered the existence of such mistakes, which manifested themselves in taking certain mathematical assumptions as the premise for constructing and applying a methodology for random systems; see the relevant discussion in Chapter 2 of this book. Only then did I realize why Dr. Huber had not mentioned a single word in that lecture about the methodology of piecewise regression and its development, an important field whose methodology had taken shape and been refined between 1959 and 1979 and had been widely applied ever since. The body of related theoretical and applied literature is enormous, virtually innumerable for any individual.

我曾與某個數學背景的統計學教授討論到胡貝爾博士的批評,他對此的反應是從最初的詫異到不以為然,認為設置某種假定作為前提很正常,也很有必要。而我對此的回應是,麵對一個作為經驗事實記錄的隨機樣本,沒有什麽可以被假定,我們唯一可以假定的是它是隨機的,而即使連這個也不是一種假定,而是一個基本事實。換句話說,麵對隨機係統,它沒有什麽可被假定,我們也無需為它設置某種假定。我們對樣本的分析是為了從經驗事實中提取新知識,而非為了驗證某種數學形式的假定。而且,假定的設置也將使得新知識被預先設定。這是一種與數學思維截然不同的思維方式。

I once discussed Dr. Huber's criticism with a statistics professor who had a background in mathematics. His reaction shifted from initial surprise to dismissal, arguing that setting certain assumptions as premises was both normal and necessary. My response was that, facing a random sample as a record of empirical facts, nothing can be assumed; the only thing we could assume is that the sample is random, and even that is not an assumption but a basic fact. In other words, facing a random system, there is nothing to be assumed, and we need not set any assumption for it. Our analysis of a sample is meant to extract new knowledge from empirical facts, not merely to verify some assumption in mathematical form. Moreover, setting such assumptions predetermines the "new knowledge" in advance. This is a way of thinking completely different from mathematical thinking.

此外,在統計學界還一直流傳著一個說法:“所有的模型都是錯的,其中一些可能有用。”作為一個醫學和公共衛生背景的統計學者,我對該說辭深表難以認同。統計學是一門認知方法學,方法的錯誤必然導致認知結果的錯誤,而結果的錯誤可能帶來不良後果。如果我們不能判斷一個統計方法的好壞或對錯,則表明我們存在某種蒙昧,或者,我們所擁有的知識體係存在缺陷或漏洞。因此,我想把上述流行語改為:“所有的統計方法都可能有用,其中一些好或正確,而另一些則不好甚至錯誤。”是的,一個設計或製造不良的工具也可能有用,但相比一個設計和製造優良的工具,其工作效能可能會打折扣。

In addition, a saying has long circulated in the statistics community: "All models are wrong, but some are useful." As a statistician with a medical and public health background, I find this saying hard to accept. Statistics is a cognitive methodology; errors in its methods inevitably lead to errors in cognitive results, and erroneous results may bring adverse consequences. If we cannot judge whether a statistical method is good or bad, right or wrong, it means that we harbor some kind of ignorance, or that the knowledge system we possess has flaws or loopholes. I would therefore like to revise the popular saying to: "All statistical methods may be useful; some of them are good or right, while others are poor or even wrong." Yes, a poorly designed or manufactured tool may still work, but its working efficiency may fall short of that of a well-designed and well-manufactured one.

因此,一門學科的主流建設者們以怎樣的心智和思維方式對待它,它就會被打造成一副怎樣的模樣,而我對這門學科的現狀有很多的不滿。在我看來,一個統計方法不是數學定理。數學定理是嚴格的前提假定下的條件產物,由於前提假定已被嚴格限製,這樣的定理無可質疑。若要質疑,必須重置其前提假定。與此不同,統計方法隻是一個被定義和構造的測量工具,它源自某種數據分析的基本思想。當這個思想被用數學形式表達出來時,就成了一個數據分析的工具或方法。但是,我們必須明白,一個統計方法的數學形式不能被認為是其正確與否的充分和必要條件。如果一個統計方法的基本思想存在問題,則該方法就一定可被明確質疑。因此,統計學這門學科必須有能力在不借助經驗性隨機模擬試驗的條件下判斷一個統計方法的優或劣以及對或錯。這就是,純粹依靠統計學基本概念的邏輯演繹對一個方法做出優劣對錯的價值判斷。那些凡是在現有概念和邏輯體係下可被明確質疑的算法或方法論必須被改進或替代;而那些無可質疑的方法就是好的和正確的。當統計學有了這一能力時,隨機模擬試驗的使用頻次就會受到抑製,所有的人都可因此節省思維精力和時間。不過,有些統計方法雖然可被明確質疑,但暫時無法替代,就隻能在使用過程中受到磨練,直到被改進或替換。

Therefore, the way the mainstream builders of a discipline intellectually approach and think about it shapes what that discipline ultimately becomes, and I have much dissatisfaction with the current state of this one. In my view, a statistical method is not a mathematical theorem. A mathematical theorem is the product of strictly defined assumptions, and because those assumptions are rigorously constrained, the theorem itself is indisputable; to challenge it, the assumptions must first be redefined. In contrast, a statistical method is merely a measurement tool that is defined and constructed, originating from a basic idea in data analysis. When this idea is expressed in mathematical form, it becomes a tool or method for data analysis. However, we must understand that the mathematical form of a statistical method cannot be regarded as a sufficient and necessary condition for its correctness. If the fundamental idea behind a statistical method is flawed, the method itself can and must be questioned. Therefore, the discipline of statistics must develop the ability to judge the merit and validity of a method independently of empirical random-simulation experiments; that is, it must be able to make value judgments on whether a method is sound or flawed purely through logical deduction from the fundamental concepts of statistics. Any algorithm or methodology that can be clearly questioned within the existing conceptual and logical framework must be improved or replaced; those that are unassailable are good and correct. Once statistics acquires this capacity, reliance on random simulations will naturally diminish, saving everyone cognitive effort and time. Methods that are clearly questionable but for the moment irreplaceable can only be tempered through continued use until they are eventually improved or replaced. (Note: This paragraph was translated by ChatGPT-4o.)

在2007至2012年間,我在構建多維廣義三分回歸模型的算法期間,曾將自己發表在2007聯合統計年會論文集上的文章分別發送給了我在原同濟醫科大學公共衛生學院的衛生統計學碩士導師餘鬆林教授和印第安納大學醫學院生物統計學教授張英博士。餘教授在閱讀後讚揚這個方法在醫學和公共衛生領域非常有用,並鼓勵我能否做得更好;而張教授則建議我應先在二維兩分段回歸上用加權法做點探索性工作,這部分工作其實已在2007年聯合統計會議文章的隨機模擬試驗部分有所表述,但更多的工作是在2023~2024年間撰寫本書有關章節時才完成。

Between 2007 and 2012, while constructing the algorithm for multidimensional generalized trichotomic regression modeling, I sent my article from the proceedings of the 2007 Joint Statistical Meetings (JSM) to Professor Yu Songlin, my master's supervisor in health statistics at the former Tongji Medical University School of Public Health, and to Professor Ying Zhang, PhD, of the Department of Biostatistics at the Indiana University School of Medicine. After reading it, Professor Yu praised the method as very useful in medicine and public health and encouraged me to see whether I could do even better, while Professor Zhang suggested that I first do some exploratory work with the weighting approach on two-dimensional dichotomic regression. Part of that work had in fact been described in the random-simulation section of the 2007 JSM paper, but most of it was completed only while writing the relevant chapters of this book between 2023 and 2024.

2012年8月,為了給自己計劃提交的關於連續隨機變量自權重算法的專利申請尋求建議、指導和支持,我曾與約翰·霍普金斯大學彭博公共衛生學院生物統計係的前主任、生物統計學博士查爾斯·羅德(Charles Rohde)教授有過聯係。12日那天他主動約請我去他的辦公室聊了近2個小時。借此機會我向他詳細介紹了關於凹凸自權重的定義和算法,以及該算法對於統計學的重要意義。他對此非常認可和欣賞。在認識到我用凹凸二字來定義這對自權重時,他主動給我講授了數學中凹凸函數的定義和性質等,這讓我明白這對自權重並不滿足數學係統中凹凸函數的定義和性質,而且我所定義的凹凸含義與凹凸函數在形象上剛好相反。是的,我是根據用這對自權重與隨機變量的實測值在二維空間的散點分布形態來命名它們的。它們不是一種數學函數,而是一種基於數值計算性測量而得到的隨機變量。他認可了我的這一解釋並向我提了一個問題:這個自權重有哪些性質?誠實地說,我在思考和構建該算法時並未曾意識到這個問題。本書根據當天的談話錄音將我當時歸納的六條性質寫入了第六章。

In August 2012, in order to seek advice, guidance, and support for a patent application I planned to submit on the self-weighting algorithm for continuous random variables, I contacted Professor Charles Rohde, a PhD in biostatistics and the former chair of the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. On the 12th, he took the initiative and invited me to his office, where we talked for nearly two hours. Taking this opportunity, I gave him a detailed introduction to the definition and algorithm of the self-weights, as well as the algorithm's significant implications for statistics, which he acknowledged and genuinely appreciated. Upon realizing that I used the words "concave" and "convex" to define this pair of self-weights, he volunteered an explanation of the definitions and properties of concave and convex functions in mathematics. This made me realize that the pair of self-weights does not satisfy the definition and properties of concave/convex functions in the mathematical system, and that the meanings I assigned to "concave" and "convex" are, at least in visual appearance, the opposite of those in the mathematical context. Indeed, I named them based on the shape formed by the scatter of the observed values of the random variable against their self-weights in two-dimensional space. They are not mathematical functions, but random variables derived through numerical computational measurement. He accepted my explanation and then asked me a question: "What properties do these self-weights have?" To be honest, I had never considered this question while thinking about and constructing the algorithm. Upon hearing it, I summarized six properties on the spot, and based on the recording of that conversation I later wrote these properties into Chapter 6. (Note: This paragraph was translated by ChatGPT-4o.)

不過,在談話結束後我剛剛走出他的辦公室門口時,他突然叫住我,用嚴肅的表情和語氣對我說,他反對我為該方法申請專利,認為一旦它獲得專利保護將不利於統計學的發展,會阻礙人們自由地將其應用在自己的研究領域中。我則反問他:“如果它沒有得到專利保護,SAS或SPSS或任何其它統計軟件公司則會毫無障礙地將它納入其軟件產品,向全球市場販賣而贏利。而我則因此甚至根本沒有可能將其做成軟件產品並獲利。但是,當我這個創立者需要用軟件產品來實現該算法的統計目標時,還不得不花錢向這其中的某個軟件公司購買其產品或者租借其用戶使用權。您認為這對我公平嗎?”我的話還未說完時,已經是語帶哽咽,眼裏也幾乎要湧出淚水了。我對他說,我做這些事情沒有得到哪怕是一分錢的外部資助;我還告訴他我做的不是數學工作,而是設計了一套統計測量的工具,它與一切物理形態的測量工具在性質上沒有不同,但與一切所謂的純數學公式在性質上絕對不同,因為那些純數學公式沒有一個可被認定為是任何形式的測量工具。

However, just as I walked out of his office door after the conversation, he suddenly stopped me and said, with a serious expression and tone, that he opposed my patent application for the method, believing that once it was patented it would be detrimental to the development of statistics and would hinder people from freely applying it in their own research. I asked him in return: "If it were not patented, SAS or SPSS or any other statistical software company could incorporate it into their software products without any obstacle and sell it to the world market for profit, while I would have no chance at all to turn it into a software product and profit from it. Yet when I, its creator, need a software product to achieve the statistical goals of the algorithm, I would have to spend money buying the product, or renting user rights, from one of those very companies. Do you think that is fair to me?" Before I had finished speaking, my voice was already choking and tears were almost welling up in my eyes. I told him that I had not received a single penny of external funding for this work; I also told him that what I was doing was not mathematical work but the design of a set of statistical measurement tools, which are no different in nature from any physical measurement tool, yet absolutely different in nature from all so-called pure mathematical formulas, for none of those formulas can be identified as a measurement tool of any form.

2014年,我收到了來自德國Lambert Academic Publishing出版社一位采編者的書稿約請,他在2011年聯合統計會議的論文集中發現了我的兩篇文章,認為我的新思想會對很多領域的學者有益,該出版社將很樂意能協助我出版一本書。盡管曾在餘鬆林教授主持的血吸蟲病研究項目中主筆過研究項目的幾份年度英文報告,也曾在JSM的幾次會議上用英文寫作和演講,但我對自己用純英文完成這本書深感信心不足。

In 2014, I received an invitation for a book manuscript from an acquisitions editor at Lambert Academic Publishing in Germany. He had discovered my two articles in the proceedings of the 2011 Joint Statistical Meetings and thought that my new ideas would benefit scholars in many fields; the publisher would be very happy to assist me in publishing a book. Although I had drafted several annual English reports for the schistosomiasis research project chaired by Professor Yu Songlin, and had written and presented in English at several JSM conferences, I deeply lacked confidence in completing the book entirely in English.

2016年7月,作為一個陌生人我曾給先後任職於喬治·華盛頓大學和香港城市大學的統計學教授,諾澤·達拉布沙·辛普瓦拉(Nozer Darabsha Singpurwalla)博士,發email請教關於連續隨機變量的分布期望估計與算術平均數之間的關係問題。在討論接近尾聲時,他熱切地希望我能將自己的思想寫成一本書與統計學領域的同行們分享。

In July 2016, as a stranger, I sent an email to Dr. Nozer Darabsha Singpurwalla, a professor of statistics who served successively at George Washington University and the City University of Hong Kong, asking for his advice on the relationship between the estimated expectation of the distribution of a continuous random variable and the arithmetic mean. As the discussion drew to a close, he expressed an earnest hope that I would write a book about my ideas and share them with colleagues in statistics.

最近從約翰·霍普金斯大學和數理統計研究學會(IMS)發布的訃告上得知,Rohde 博士和Singpurwalla博士已分別於2023年1月23日和2022年7月22日不幸去世。借此機會表達本人對他們的敬意和哀悼!他們都是善於傾聽他人的新思想並願意真誠討論的學者。

I recently learned from obituaries issued by Johns Hopkins University and the Institute of Mathematical Statistics (IMS) that Dr. Rohde and Dr. Singpurwalla had unfortunately passed away, on January 23, 2023 and July 22, 2022, respectively. I take this opportunity to express my respect and condolences. Both were scholars who listened carefully to other people's new ideas and were willing to discuss them sincerely.

2019年3月的統計學界發生了一件引起轟動的曆史性事件,其影響甚至波及到了整個科學界。三位卓有聲譽的數理統計學家聯署800多位各界學者在影響力巨大的《自然》雜誌上發表了一篇質疑統計檢驗中p值的文章,Scientists rise up against statistical significance。他們甚至稱基於檢驗概率水平的兩分法是一種認識論上的兩分偏執行為。從這篇文章中,我看出他們對為何必須是有一個檢驗概率水準的兩分法不甚理解。從他們自身的角度來說,原因也許在於他們認為確定概率水準是一種簡單的數學計算行為,而概率空間是一個連續可測的空間,在這個連續空間上切一刀似乎是一種很主觀的行為。然而實際上,統計檢驗是一種關於誤差測量的行為。在我看來,他們發出這個質疑的原因則應該在於他們有可能不太理解抽樣研究中存在著的係統誤差和隨機誤差。正是由於被檢驗的對象中有且隻有這兩類誤差,而我們沒有可能將其真實大小區分開來,統計檢驗才不得不基於某一概率水平做出一個兩分決策;而且,為了使得決策中的犯錯成為一個“小概率事件”,才不得不選擇了0.05作為檢驗的概率水平。在後來與其中一位作者,布萊克·麥克沙恩(Blake McShane)博士,的email討論中,他認為統計檢驗中的兩分法沒有本體論根據,我則回應他說,抽樣中的係統誤差和隨機誤差正是兩分法的本體論根據。他對此無言以對。

In March 2019, a sensational historical event occurred in the statistics community, whose impact reached the entire scientific community. Three well-reputed mathematical statisticians, joined by more than 800 co-signing scholars from all walks of life, published an article in the influential journal Nature, "Scientists rise up against statistical significance", questioning the p-value in statistical tests. They even called the dichotomization based on a test probability level an act of epistemological "dichotomania". From this article I could see that they did not quite understand why there must be a dichotomy at a test probability level. From their own perspective, the reason may be that they regard determining the probability level as a simple act of mathematical calculation: the probability space is a continuous measurable space, and making a cut in such a continuous space seems a highly subjective act. In reality, however, a statistical test is an act of error measurement. In my view, the reason they raised this challenge is probably that they did not fully understand the systematic error and random error that exist in sampling studies. Precisely because there are these two, and only these two, types of error in the object being tested, and because we have no way to separate their true sizes, a statistical test has to make a dichotomous decision at a certain probability level; moreover, in order to make an erroneous decision a "low-probability event", 0.05 had to be chosen as the probability level of the test. In a subsequent email discussion with Dr. Blake McShane, one of the three authors, he argued that the dichotomization in statistical tests has no ontological basis. I responded that the systematic error and random error in sampling are precisely the ontological basis for the dichotomization. He had no reply to this.
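
To make the dichotomous decision under discussion concrete (this is the standard textbook procedure, not a method from this book), here is a minimal sketch of a two-sided one-sample z-test on made-up data, deciding at the conventional 0.05 level:

```python
# A standard two-sided one-sample z-test, shown only to illustrate the
# dichotomous reject/do-not-reject decision at the 0.05 probability level.
# The data and hypothesized mean are made up for demonstration.
import math
import statistics

def z_test_two_sided(xs, mu0, sigma):
    """Return (z, p, reject) for H0: mean == mu0, with known sigma."""
    n = len(xs)
    z = (statistics.fmean(xs) - mu0) / (sigma / math.sqrt(n))
    # two-sided p-value from the standard normal CDF
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p, p < 0.05  # the dichotomous cut the Nature article criticized

xs = [5.1, 4.9, 5.3, 5.2, 4.8, 5.4, 5.0, 5.2]
z, p, reject = z_test_two_sided(xs, mu0=5.0, sigma=0.2)
```

Whatever the continuous p-value is, the procedure ends in a binary decision; that binary step, not the computation of p itself, is what the dispute above is about.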

2023年夏,我罕見地回到中國探親訪友,並拜會了我的導師餘鬆林教授,談到了我正在撰寫的這本書稿。他在完整了解了我對自權重和自加權均數的定義和算法後亦深受鼓舞,認為這應該是統計學領域的一個革命性發現和突破,應該會改寫統計學的教科書, 而其影響將難以估計。

In the summer of 2023, I made a rare trip back to China to visit relatives and friends, and met with my mentor, Professor Yu Songlin, to talk about the book manuscript I was writing. After fully understanding my definition and algorithm of the self-weights and the self-weighted mean, he too was deeply encouraged, believing that this should be a revolutionary discovery and breakthrough in the field of statistics, one that should rewrite the textbooks of statistics and whose impact would be difficult to estimate.

三、探索之路 (The Path of Exploration)

鑒於以上個人經曆,這本書可被認為是我想要有所作為的一個嚐試。為了讓讀者能了解我的更多個人經曆和思考,我覺得以下內容對於本書思想的形成亦非常重要,遂決定記錄在此,以饗讀者。

In view of the above personal experiences, this book can be considered my attempt to make a difference. To allow readers to understand more of my personal experiences and thinking, I feel that the following content is also very important to the formation of the ideas in this book, so I have decided to record it here for my readers.

很多人初次麵對統計學時也許會感到有點困難,我曾是此類人中的一員。1987年一月,當我在原同濟醫科大學五年製醫學本科的最後一年學完公共衛生學院的專業課程《衛生統計學》後,盡管通過了考試,卻不知道它究竟是怎麽回事,而這本教材涉及到的數學技能僅有四則運算。現在,我想對所有人說:如果你會給他人測量身高或體重,你就能理解和操作統計學,因為統計學中的一切工作都是在構建和使用測量工具。這些工具從其本質上來說都不過是經某種直覺觀察和理性思辨後形成的某種形式的定義,不應被視為某種數學形式的定理。例如,盡管算術均數的計算公式在數學上具有定理性質,但將其用於抽樣估計一個連續可測總體分布央位的期望時,樣本均數與這個總體期望央位之間的關係隻是一種定義而非數學定理,也即,一個連續可測總體分布的期望央位不一定就是其算術均數,而且,這是一個不可能以數學的語言形式和邏輯框架予以證明的命題。由此可見,一個測量工具是比數學定理更為底層的東西。

Many people find statistics somewhat difficult when facing it for the first time, and I was once one of them. In January 1987, in the last year of my five-year undergraduate medical program at the former Tongji Medical University, I completed the professional course Health Statistics offered by the School of Public Health. Although I passed the exam, I did not know what on earth the subject was or how it worked, even though the only mathematical skills the textbook required were the four arithmetic operations. Now I want to say to everyone: if you can measure another person's height or weight, you can understand and operate statistics, because everything in statistics is about constructing and using measurement tools. In essence, these tools are nothing but definitions of one form or another, formed through intuitive observation and rational reflection, and should not be regarded as theorems in mathematical form. For example, although the calculation formula of the arithmetic mean has the character of a theorem mathematically, when it is used in sampling to estimate the expectation of the distribution center of a continuously measurable population, the relationship between the sample mean and that expected population center is merely a definition rather than a mathematical theorem; that is, the expected center of a continuously measurable population distribution is not necessarily its arithmetic mean, and this is a proposition that cannot be proved within the linguistic forms and logical framework of mathematics. A measurement tool is thus something more fundamental than a mathematical theorem.
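
The equal-weight reading of the arithmetic mean can be made explicit: the sample mean is exactly a weighted sum in which every observation receives the identical weight 1/n. A minimal sketch, ordinary textbook material with made-up heights:

```python
# The arithmetic mean rewritten as an equally weighted sum: every sample
# point receives the identical weight 1/n. This is standard textbook
# material, shown only to make the implicit equal-weight assumption visible.
def weighted_mean(xs, ws):
    assert abs(sum(ws) - 1.0) < 1e-12, "weights must sum to 1"
    return sum(w * x for w, x in zip(ws, xs))

xs = [160.0, 172.0, 168.0, 181.0, 159.0]   # e.g. heights in cm (made up)
n = len(xs)
equal_ws = [1.0 / n] * n                   # the implicit equal-weight assumption
# the equally weighted sum reproduces the ordinary arithmetic mean
assert abs(weighted_mean(xs, equal_ws) - sum(xs) / n) < 1e-9
```

Seen this way, the mean is one particular choice of weights, a definition, rather than a theorem about what the population's expected center must be.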

作為一個自小生長在位於中國中部的湖北省江漢平原的農村地區、從未有機會接觸過樂器、直至進入武漢的一所醫科大學時連樂譜也讀不懂的人,我在大學期間努力自學小提琴時獲得的最大啟示是,人應該善於從錯誤中學習什麽是正確的;反之亦然。更一般地,我們應該可以從某種存在或觀念中發現其對立麵的意義。

As someone who was born and grew up in a rural area of the Jianghan Plain in Hubei Province in central China, who never had the opportunity to touch a musical instrument, and who could not even read a musical score until entering a medical university in Wuhan, the greatest enlightenment I gained from working hard to teach myself the violin during college was that one should be good at learning what is right from mistakes, and vice versa. More generally, we should be able to discover, in any existence or opinion, the significance of its opposite.

1988年的那個暑假我在四川省的九寨溝旅遊時,有另外三個同伴,他們都是西南交通大學的學生,其中一位已於一年前畢業被分配到了位於武漢的鐵道部第四勘探設計院工作。進入九寨溝後,他們都下意識地要順著裏麵已為遊客鋪好的路徑走。我喊住他們說:“跟我走水邊吧。”其實水邊沒有路,且滿是荊棘和灌木叢,隻能自己小心翼翼地開路,但是,我們看到的景色卻是很不一般!所以,走了別人沒走過的路,就會看到別人看不到的風景。科學探索和思考與此類似。你要是能發現至少一個新概念,就會看到思考的過程及其終點上不一樣的風景。

During the summer vacation of 1988, when I was traveling in Jiuzhaigou, Sichuan Province, I had three companions, all students of Southwest Jiaotong University; one of them had graduated a year earlier and been assigned to the Fourth Survey and Design Institute of the Ministry of Railways in Wuhan. After entering Jiuzhaigou, they all instinctively followed the path paved for tourists. I called to them: "Follow me along the water's edge." In fact there was no path along the water, only thorns and bushes, so we had to clear our own way carefully; but the scenery we saw was far from ordinary! If you walk a road others have not walked, you will see scenery others cannot see. Scientific exploration and thinking are much the same: if you can discover at least one new concept, you will see a different view along the process of thinking and at its destination.

1990年暑假來臨前,作為同濟醫科大學公共衛生學院1987級學生輔導員的我為了組織30多名醫科大學生前往湖北省大冶鋼鐵廠參加社會實踐,設計了我一生中的第一份社會調查表,以便學生們通過調查收集一些樣本信息,在返校後對這些信息做一些基本的統計分析並寫出各自的調查報告。

Before the summer vacation of 1990, as the counselor of the 1987 class at the Tongji Medical University School of Public Health, I designed the first social survey questionnaire of my life in order to organize more than 30 medical students to take part in social practice at the Daye Iron and Steel Plant in Hubei Province. Through the survey the students could collect sample information, perform some basic statistical analyses on it after returning to school, and write their own survey reports.

1991年3~5月,我有幸到同濟醫科大學公共衛生學院衛生統計教研室的周有尚教授那裏幫助他整理武漢市居民的死因登記資料,並協助他的研究生杜勳銘帶衛生統計專業本科畢業生的現場調查實習。這期間我找出自己已封存了四年多、由楊樹勤等人主編的《衛生統計學》教材重新通讀了一遍,發現統計學應該是一門數學化的認知方法論。這對於當時的我來說就像發現了生命中的一片新大陸,我相信自己應該可以在其中有所作為。

From March to May 1991, I was fortunate to work with Professor Zhou Youshang of the Department of Health Statistics at the School of Public Health, Tongji Medical University, helping him sort out the cause-of-death registration data of Wuhan residents and assisting his graduate student Du Xunming in leading the field-survey internship of undergraduate students majoring in health statistics. During this period, I dug out my textbook Health Statistics, edited by Yang Shuqin et al., which had been shelved for more than four years, and read it through again, realizing that statistics should be a mathematized cognitive methodology. For me at that time this was like discovering a new continent in life, and I believed I could make a difference in it.

當年的九月,我受到公衛學院婦幼衛生係劉筱嫻教授和主任的邀請參與到她在湖北麻城縣農村主持的一個研究項目,曾兩次前往當地對參與“中國農村地區嬰幼兒輔助營養食品效果評估”的數百名兒童進行與營養有關的生物學指標檢測。

In September of that year, I was invited by Professor Liu Xiaoxian, director of the Department of Maternal and Child Health of the School of Public Health, to participate in a research project she was leading in rural Macheng County, Hubei Province. I traveled there twice to conduct tests of nutrition-related biological indexes on hundreds of children participating in the "Evaluation of the Effectiveness of Supplementary Nutritional Foods for Infants and Young Children in Rural Areas of China".

1992年春節後,公衛學院新成立了一個預防醫學教研室,不僅承擔對臨床醫學院學生的《預防醫學》課程教學,而且還要負責他們為期一個月的預防醫學實習。這個實習的主要形式是分批組織學生參與到與現場調查有關的預防醫學研究項目中,學校為此提供了充足的經費、資源和行政支持。於是,我來到了這個教研室,從而有更多的機會負責調查設計、組織現場實施、建立和管理數據庫,以及應用統計軟件SPSS進行數據分析。這些經曆為我在1998年3月底思考分段回歸問題時突破對柯爾莫哥洛夫定義的樣本空間這一關鍵概念的理解打下了足夠的實踐基礎。

After the Spring Festival of 1992, the School of Public Health established a new Department of Preventive Medicine, which not only taught the course Preventive Medicine to students of the School of Clinical Medicine but was also responsible for their one-month internship in preventive medicine. The main form of this internship was to organize students in batches to participate in preventive medicine research projects involving field surveys, for which the university provided sufficient funds, resources, and administrative support. So I joined this department, gaining more opportunities to take charge of survey design, field implementation, database construction and management, and data analysis with the statistical software SPSS. These experiences laid a sufficient practical foundation for my breakthrough, at the end of March 1998, in understanding the key concept of the sample space defined by Kolmogorov while thinking about the problem of piecewise regression.

1994年9月,我被衛生統計教研室的餘鬆林教授接受在他那裏攻讀碩士學位,並有機會參與到他主持的中國湖區血吸蟲病兩種幹預措施的經濟學比較研究中,協助完成有關現場實施、數據分析和結果報告。餘教授是自1980年代中後期以來少有的幾位在中國公共衛生、醫學和生物學等領域應用統計學的翹楚和領軍人物之一。在那個統計學在中國尚處於暗淡角色的年代,他以自己的智慧、嚴謹和堅韌為將人類在統計學中取得的成就傳播到中華大地做出了公認的傑出貢獻。我深深地感激在他那裏獲得的教育、指導和關懷,這令我終身受益匪淺。

In September 1994, Professor Yu Songlin of the Department of Health Statistics accepted me as his master's student, and I had the opportunity to participate in an economic comparison, which he led, of two schistosomiasis interventions in a lake region of China, assisting with field implementation, data analysis, and the reporting of results. Professor Yu has been one of the few leading figures in applied statistics in the fields of public health, medicine, and biology in China since the mid-to-late 1980s. At a time when statistics still played a dim role in China, he made widely recognized, outstanding contributions to spreading humanity's achievements in statistics across China with his wisdom, rigor, and tenacity. I am deeply grateful for the education, guidance, and care I received from him, which have benefited me throughout my life.

1997年11月裏的某一天,剛剛在當年6月獲得衛生統計學碩士學位的我對正在攻讀衛生統計學碩士學位的太太說,我們以後應該多關注“非常態分析”,而那一刻自己並不知道該如何劃分“常態”與“非常態”。之所以會突然間產生這個念頭,是由於自己在上述血吸蟲病幹預的經濟學評價中發現,人群感染率與受幹預人群的年度單位人均成本之間呈現出一個下降的三次多項式曲線,即當感染率下降到一定的水平後,如果繼續執行相同的幹預措施,則人均單位成本會達到很高的水平,而感染率的下降卻微乎其微,這意味著幹預措施開始顯出得不償失,接近達到其邊際效應。我們需要找出一個或兩個臨界感染率水平作為調整幹預措施的依據,以便在保證足夠好的控製效果的同時降低單位成本。

One day in November 1997, having received my master's degree in health statistics in June of that year, I said to my wife, who was then studying for her master's degree in the same field, that we should pay more attention to "non-normality analysis" in the future. At that moment, however, I did not yet know how to distinguish between "normal" and "non-normal". The idea came to me suddenly because, in the economic evaluation of the schistosomiasis interventions, I had discovered that the relationship between population prevalence and annual per capita cost followed a decreasing cubic polynomial curve. That is, once the prevalence dropped to a certain level, continuing the same intervention would drive the per capita cost very high while achieving only a minimal further reduction in prevalence. This indicated that the intervention had reached a point of diminishing returns, where the costs began to outweigh the benefits and its marginal effect was being approached. We needed to identify one or two threshold levels of prevalence as a basis for adjusting the intervention strategy, so as to reduce unit costs while maintaining sufficiently good control.

1998年3月25日,中科院院士、數理統計學家陳希孺博士將胡貝爾博士的那個演講帶給了武漢大學數學係數理統計專業的師生。那時為了尋找劃分常態和非常態的方法,我經常到武漢大學的數學係旁聽測度論和概率論等與數理統計有關的課程,所以,那一天我有幸聆聽了陳希孺院士的演講,並於當天中午回到自己的辦公室開始了一場曆時連續六天六夜幾乎無眠的讀書、思考、計算和推理的過程。順便說一句,如果一個人缺乏足夠強大的內在自製力,我不鼓勵他/她經曆像我這樣的危險過程,因為它有可能令人陷入癲狂和失去自控。

On March 25, 1998, Dr. Chen Xiru, a mathematical statistician and academician of the Chinese Academy of Sciences, brought Dr. Huber's speech to the Department of Mathematics of Wuhan University. Before that day, in order to find a way to distinguish the normal from the non-normal, I had often gone to that department to audit courses related to mathematical statistics, such as Measure Theory and Probability Theory. I was therefore fortunate to hear academician Chen Xiru's speech; I returned to my office at noon that day and began a process of reading, thinking, calculating and reasoning that lasted, almost without sleep, for the next six days and six nights. By the way, I would not encourage anyone lacking sufficiently strong inner self-control to go through a dangerous process like mine, because it has the potential to drive a person into frenzy and out of self-control.

這些日夜裏產生過無數新的概念、術語、定義和思想。在此後的二十多年中,有一些被保留了下來,也有很多被逐漸放棄。收獲最大的是以上述血吸蟲病幹預的樣本數據為例形成了一套在樣本全域內迭代搜索一個臨界點的最優解算法(這個算法後來被我自己所否定和放棄)。為了充分使用樣本信息和保障迭代過程中分段模型的連續性,我讓每個被設想為可能臨界點的樣本點同時參與兩段相鄰模型的擬合。

During those days and nights, countless new concepts, terms, definitions and ideas were produced. Over the following twenty-odd years, some were retained, while many were gradually abandoned. The most rewarding outcome, worked out with the above schistosomiasis sample data as an example, was an algorithm that iteratively searches the whole sample range for the optimal single threshold (an algorithm I myself later rejected and abandoned). In order to make full use of the sample information and to preserve the continuity of the piecewise models during the iteration, I let every sample point hypothesized as a possible threshold participate simultaneously in the fitting of both adjacent models.

那時,我已有了一個基於全樣本的模型,我稱之為全模型(我在2007年將其改稱為全域模型)。一般地,在同質模型定義下,與第i次迭代搜索中分段模型的合並殘差均方根(這裏用CRMSR{crmsri}表示,i = 1, 2, …, n)相比,全域模型的殘差均方根(這裏用RMSR表示)是最大的。因此,我構建了一個迭代搜索中的殘差遏製係數(用CRR{crri}表示):

At that time, I already had a model based on the whole sample, which I called the "full model" (renamed the "fullwise model" in 2007). In general, under the definition of homogeneous models, the root mean squared residual of the fullwise model (denoted here by RMSR) is the largest when compared with the combined root mean squared residual of the piecewise models at the ith iterative search (denoted here by CRMSR{crmsri}, i = 1, 2, …, n). Therefore, I constructed a coefficient of residual-resisting (denoted by CRR{crri}) for the iterative search:
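The formula defining CRR did not survive in this text, so the sketch below is only a hedged illustration of the iterative search described above: every interior sample point is tried as a candidate threshold and joins both adjacent fits, and CRR is assumed to take the natural form crr_i = 1 - crmsr_i / rmsr, which grows as the piecewise fits shrink the residuals relative to the fullwise model. All function names here are hypothetical.

```python
import numpy as np

def rmsr(residuals):
    # root mean squared residual
    return np.sqrt(np.mean(np.asarray(residuals) ** 2))

def fit_line(x, y):
    # ordinary least-squares straight line; returns fitted values
    b, a = np.polyfit(x, y, 1)
    return a + b * x

def crr_search(x, y):
    """Try every interior sample point as a candidate threshold.
    The candidate point participates in BOTH adjacent fits, as in the text."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    full_rmsr = rmsr(y - fit_line(x, y))          # fullwise model
    results = []
    for i in range(2, len(x) - 2):                # keep >= 3 points per segment
        left = slice(0, i + 1)                    # candidate included on the left
        right = slice(i, len(x))                  # ... and on the right
        res_l = y[left] - fit_line(x[left], y[left])
        res_r = y[right] - fit_line(x[right], y[right])
        crmsr_i = rmsr(np.concatenate([res_l, res_r]))
        crr_i = 1.0 - crmsr_i / full_rmsr         # assumed form of CRR
        results.append((x[i], crr_i))
    return max(results, key=lambda t: t[1])       # threshold with maximal CRR
```

On data with a genuine slope change, the candidate maximizing CRR lands near the true breakpoint.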

於是,當CRR達到最大值時,我認為臨界點就可以被確定了。它可以是該最大CRR對應於被分割變量中的一個實測點,由其決定的分段模型就應該是最優分段模型。此外,我還將最大CRR命名為殘差遏止係數,可用於評價最優分段模型相對於全域模型的擬合優度。

Thus, when CRR reached its maximum, I considered that the threshold could be determined: it would be the real measured point of the segmented variable corresponding to the maximum CRR, and the piecewise models determined by it should be the optimal piecewise models. In addition, I named the maximum CRR the coefficient of residual-resisted, which could be used to evaluate the goodness of fit of the optimal piecewise model relative to the fullwise model.

不過,我很快就意識到我可以用CRR與被分割變量描繪一個二維散點圖,它在理想情況下應該是一個具有二次函數關係的山峰形曲線,通過求解該二次曲線方程的一階導數為零時的解即可得到對曲線峰頂的估計,而它對應的實測樣本點應該是對臨界點更穩健的估計,因為這一解法可以避免最大CRR可能導致的隨機偏倚。然而,令我失望的是,我的樣本雖然存在這樣一個二次曲線,但其一階導數為零時的解非常靠近被分割變量的某一端。這一理想與現實的背離迫使我不得不放棄這個基於一階導數的臨界點解法,轉而使用最大CRR對應的實測樣本點作為臨界點的估計值。

But I soon realized that I could plot CRR against the segmented variable in a two-dimensional scatter plot, which should ideally be a peak-shaped curve with a quadratic functional relationship. By setting the first-order derivative of that quadratic function to zero, an estimate of the peak could be obtained, and the real measured sample point corresponding to it should be a more robust estimate of the threshold, because this solution could avoid the random bias that the maximal CRR might cause. However, to my disappointment, although such a quadratic curve existed in my sample, its zero-derivative solution lay very close to one end of the segmented variable. This divergence between ideal and reality forced me to abandon the threshold solution based on the first-order derivative and to use instead the measured sample point corresponding to the maximum CRR as the estimate of the threshold.
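The abandoned derivative-based solution can be sketched in a few lines, again as a hedged illustration rather than the author's exact procedure: fit a quadratic to the (candidate threshold, CRR) scatter, solve where its first derivative vanishes, and take the measured candidate nearest that vertex. The function name is hypothetical.

```python
import numpy as np

def quadratic_peak(candidates, crr_values):
    """Fit CRR = a*x^2 + b*x + c to the scatter and solve d(CRR)/dx = 0,
    i.e. x* = -b / (2a); return the vertex and the nearest measured point."""
    candidates = np.asarray(candidates, dtype=float)
    a, b, c = np.polyfit(candidates, crr_values, 2)
    x_star = -b / (2.0 * a)                                  # curve vertex
    nearest = candidates[np.argmin(np.abs(candidates - x_star))]
    return x_star, nearest
```

When the scatter really is peak-shaped, the vertex recovers the peak exactly; the failure mode described in the text arises when the fitted vertex falls near an end of the variable's range.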

我沒有通過假定兩段模型在臨界點處連續來求解臨界點,這是在過去的數十年中無數與分段回歸分析有關的統計方法都采用的算法。我所采用的算法與它們相比,在一些數學基礎足夠好的人看來似乎顯得幼稚和低技術。但在一個統計學者看來,采取假定兩段模型在臨界點處的連續性(也即兩段模型在臨界點處的連接變異被假定為總是等於0)求解臨界點是一種不可思議的錯誤,因為在一個隨機係統裏,如果存在兩段模型和一個臨界點,那麽它們在該臨界點處一定有一個非零的連接變異。如果我們能將這個連接變異估計出來,就有可能用概率推斷分段模型在臨界點處的連續性。也許正是由於自己在數學基礎知識和數學思維能力上的匱乏,以及對這一基於直覺的堅持,導致了我在後來的24年裏走上了一條完全不同的道路。

I did not solve for the threshold by assuming that the two piecewise models are continuous at the threshold, which is the algorithm adopted by countless statistical methods related to piecewise regression analysis over the past few decades. Compared with them, the algorithm I used may seem naive and low-tech to some people with strong mathematical backgrounds. But in the eyes of a statistician, solving for the threshold by assuming the continuity of the two piecewise models at the threshold (that is, assuming the connection variation of the two models at the threshold is always equal to 0) is an inconceivable mistake, because in a random system, if there are two piecewise models and a threshold, there must be a non-zero connection variation at that threshold. If we can estimate this connection variation, it becomes possible to infer the continuity of the piecewise models at the threshold probabilistically. Perhaps it was precisely my lack of basic mathematical knowledge and mathematical thinking skills, together with this persistence grounded in intuition, that led me onto a completely different path over the next 24 years.

2000年7月底至8月初的幾天裏,在中國教育部的資助下,我作為唯一來自中國的學者參加了在美國印第安納波利斯市召開的聯合統計會議,在第一天的“一般方法論”分會口頭報告了自己在臨界回歸模型算法中的一些新奇思想,並在會後與一位來自密蘇裏(也許是密西西比)州立大學統計係的劉姓教授在一處休息區進行了簡短交談。他在聽了我的解說後說:“既然你假定每個實測樣本點都可能是臨界點,為何不用它們去計算臨界點的期望和方差呢?”我瞬間明白了,我可以將那個殘差遏製係數作為可變權重以便計算臨界點的加權期望和方差,但這樣的結果會是最優的嗎?答案應該是否定的,因為這個“加權期望臨界點”對應的臨界模型的合並殘差平方和與在迭代搜索過程中生成的所有成對臨界模型相比,不會恰好是最小的。那麽,它是我們需要的嗎?那時我沒有答案,但也沒有放棄這個思想,因為我以為這個思想應該是唯一正確的。

During the last days of July and the first days of August 2000, with financial support from the Ministry of Education of the People's Republic of China, I attended the JSM in Indianapolis, USA, as the only scholar from China, and gave an oral presentation of some novel ideas on an algorithm for threshold regression models in the General Methodology session on the first day. Afterwards, I had a short conversation in a rest area with a Professor Liu from the Department of Statistics at Missouri (or perhaps Mississippi) State University. After listening to my explanation, he said: "Since you assume that each real measured sample point might be the threshold, why not use all of them to calculate the expectation and variance of the threshold?" I understood instantly that I could use the CRR as a variable weight to calculate a weighted expectation and variance of the threshold. But would such a result be optimal? The answer should be no, because the combined residual sum of squares of the piecewise models corresponding to this "weighted expectation threshold" would not be exactly the smallest among all the pairwise threshold models generated during the iterative search. So, is it what we need? I had no answer at that time, but I did not give up on the idea, because I believed it should be the only correct one.
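Professor Liu's suggestion, as interpreted above, amounts to an ordinary weighted mean and weighted variance with the CRR values as weights. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def weighted_threshold(candidates, crr):
    """CRR-weighted expectation and variance of the threshold: each measured
    candidate contributes in proportion to its (non-negative) CRR weight."""
    candidates = np.asarray(candidates, dtype=float)
    w = np.asarray(crr, dtype=float)
    w = w / w.sum()                               # normalize to sum to 1
    mean = np.sum(w * candidates)                 # weighted expectation
    var = np.sum(w * (candidates - mean) ** 2)    # weighted variance
    return mean, var
```

Candidates with larger CRR pull the expectation toward themselves, so the estimate need not coincide with the single residual-minimizing point, which is exactly the tension discussed in this paragraph.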

2006年5月,我被美國國防部所屬的Uniformed Services University of the Health Science(USUHS,可被翻譯為軍警衛生服務大學,或者,軍警醫科大學)外科係的前列腺疾病研究中心(CPDR)雇傭做實驗樣品數據的管理,同時協助該中心的流行病學家詹妮弗·卡倫(Jennifer Cullen)博士做一些臨床流行病學的項目。在短暫的工作適應後,我就向卡倫提出希望她能支持我使用該中心的臨床數據庫構建三分回歸分析的統計方法。她對此感到非常高興,並表示了積極的支持。她不僅允許我使用臨床數據庫,還幫我修改關於三分回歸分析文章的英文表述。在這種良好的工作環境裏,我很快就完成了基於加權的三分回歸法的構建。

In May 2006, I was hired by the Center for Prostate Disease Research (CPDR) of the Department of Surgery of the Uniformed Services University of the Health Sciences (USUHS), affiliated with the U.S. Department of Defense, to manage experimental sample data and to assist the center's epidemiologist, Dr. Jennifer Cullen, with some clinical epidemiological projects. After a short period of adaptation, I asked Cullen to support my use of the center's clinical database to construct a statistical method for trichotomic regression analysis. She was delighted and actively supportive: not only did she allow me to use the clinical database, she also helped me revise the English wording of the article on trichotomic regression analysis. In this good working environment, I quickly completed the construction of the weighted trichotomic regression method.

於是,2007年8月,我第二次參加了在鹽湖城召開的JSM,在會上提出了一個完整的分段回歸分析的基本思想,並用一個91例心血管病研究的樣本和多變量Logistic回歸模型展示了一個“泛函化的廣義三分回歸模型”的基本思想和完整算法。在這個算法中,我完全摒棄了基於殘差最小化的最優化解決方案,這是因為各分段模型的合並殘差平方和在迭代搜索過程中是隨機變異的,而其最小值僅僅隻是一個隨機的點測量,由此對應的“最優”分段模型的各參數估計值也應該都是相應的隨機點測量。這樣做看起來就像我們測量了一組成年男性的身高和體重,卻選擇了那個最矮的人的體重作為一個很有意義的數值來代表這組男性的體重。因此,這個“最優”解應該不是我們可以期望的,因而不是我們所需要的。

So, in August 2007, I attended the JSM for the second time, in Salt Lake City, where I proposed a complete idea of piecewise regression analysis and used a research sample of 91 cardiovascular disease cases with a multivariate logistic regression model to demonstrate the basic idea and the complete algorithm of a "functionalized general trichotomic linear regression (FGTLR)". In this algorithm, I completely abandoned the optimization solution based on minimizing residuals, because the combined sum of squared residuals of the piecewise models varies randomly during the iterative search, and its minimum is merely a random point measure; accordingly, the estimates of all the parameters of the corresponding "optimal" piecewise models are also just random point measures. Proceeding that way would be like measuring the heights and weights of a group of adult men and then choosing the shortest man's weight as a meaningful value to represent the group's weights. Therefore, this "optimal" solution should not be what we can expect, and thus not what we need.

考慮到殘差平方和的分布極其偏態,其類算術均數的均方根的代表性也因此而非常差,為了改善未知臨界點的加權期望估計的準確性,我決定將1998年定義的殘差遏製係數CRR{crri}改為用全域模型絕對殘差的算術均數(用MAR表示)和分段模型合並絕對殘差的算術均數(用MCAR{mcari}表示)來構建,並將CRR重新命名為殘差收斂係數(convergence rate of residuals):

Considering that the distribution of the residual sum of squares is extremely skewed, its arithmetic-mean-like root mean square is a very poor representative measure. In order to improve the accuracy of the weighted expectation estimate of the unknown threshold, I decided to reconstruct the coefficient of residual-resisting CRR{crri} defined in 1998 from the arithmetic mean of the absolute residuals of the fullwise model (denoted by MAR) and the arithmetic mean of the combined absolute residuals of the piecewise models (denoted by MCAR{mcari}), and renamed CRR the convergence rate of residuals:
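The redefined formula is likewise missing from this text; purely as a hedged sketch, the form below mirrors the earlier RMSR-based coefficient with absolute-residual means substituted, i.e. an assumed crr_i = 1 - MCAR_i / MAR. The function name is hypothetical.

```python
import numpy as np

def convergence_rate(full_residuals, piecewise_residuals):
    """Convergence rate of residuals for one iteration, built from arithmetic
    means of absolute residuals (assumed form: crr = 1 - MCAR / MAR)."""
    mar = np.mean(np.abs(full_residuals))          # fullwise model MAR
    mcar = np.mean(np.abs(piecewise_residuals))    # combined piecewise MCAR
    return 1.0 - mcar / mar
```

Because absolute residuals are far less skewed than squared residuals, their arithmetic means give steadier weights, which is the motivation stated above.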

為了驗證這個殘差收斂率CRR作為估計臨界點的權重的有效性,我在一個公開發表的文章中找到了一個案例,根據其基本統計量隨機模擬了500個樣本,計算出500個加權平均臨界點(WM-T)、500個對應於最大CRR的實測樣本點(RM-T),以及按照現行算法中基於最大CRR和強製連續性假定估計Sprent定義的500個γ值(這裏用MCR-T表示)。結果顯示MCR-T的表現最差,加權臨界點的分布收斂性最好:

In order to verify the effectiveness of this convergence rate of residuals (CRR) as the weight for estimating the threshold, I found a case in a published article. Based on its basic statistics, I randomly simulated 500 samples and computed 500 weighted mean thresholds (WM-T), 500 real measured sample points corresponding to the maximum CRR (RM-T), and 500 γ-values as defined by Sprent (denoted MCR-T here), the last estimated with the current algorithm based on the maximum CRR and the enforced continuity assumption. The results showed that MCR-T performed the worst, while the distribution of the weighted thresholds converged best:

看著如此的對比結果,已經無人能夠懷疑現行算法中的最優化和強製連續性有多麽的糟糕了;也無人能夠質疑加權法的準確性和穩健性了。

Faced with such a comparison, no one could doubt any longer how badly the optimization and enforced continuity in the current algorithms behave, nor question the accuracy and robustness of the weighted method.

我以為自己精心構建的這個新算法已是非常完美,而且形成了一個基於“全域加三分”的綜合性分析策略。盡管這篇文章將被收錄在當年JSM的論文集中,我還是在會後開始投稿到一個聲譽卓著的統計期刊,沒想到被主編直接拒了,其理由是現有的分段回歸算法已經很成熟了,比我提議的要好。看來,該主編不僅完全無視了我在文稿前言中對現行算法各方麵問題的評論,也完全無視了其中展示的上述隨機模擬結果。我以為我遭遇這個雜誌的主編隻是一個偶然,然而,隨後多次嚐試投稿到不同期刊也都直接被主編拒稿,其中一個頂級期刊的主編稱我在挑戰數學和統計學的“large body”。而另一個旗艦雜誌的主編則簡單地回應說:“你的文章不適合發表。”我當然理解他這話的意思,他顯然看懂了我在自己的文稿裏說了什麽。

I thought that the new algorithm I had carefully constructed was perfect, and that it formed a comprehensive analysis strategy based on "a fullwise plus a trichotomy". Although the article would be included in that year's JSM proceedings, I still submitted it after the meetings to a reputable statistics journal. Unexpectedly, the editor-in-chief rejected it outright, on the grounds that the existing piecewise regression algorithms were already very mature and better than what I proposed. Evidently, he had completely ignored not only my comments in the introduction on the various problems of the current algorithms, but also the random simulation results shown above. I thought this encounter was just a coincidence; however, subsequent submissions to several different journals were all rejected outright by their editors-in-chief. The editor-in-chief of one top journal said that I was challenging the "large body" of mathematics and statistics, and the editor-in-chief of another flagship journal simply responded: "Your article is not suitable for publication." Of course I understood what he meant; he had obviously grasped what I was saying in my manuscript.

事實上,在遭到第一個期刊的拒絕後,我就意識到這篇文章沒能從基礎概念上闡明為何這類最優化是根本錯誤的,因為長期以來統計學一直缺乏一些必要的概念。於是,我很快就下定決心對統計學的基本概念係統進行改革。而早在1998年3月自己獨立思考基於單一臨界點的分段回歸問題時,就已在幾個重要概念上取得了突破,並體現在了這篇被多次拒絕的稿件中。它們是我繼續思考整個新概念係統的基礎。

In fact, after the first journal's rejection, I realized that the article had failed to explain at the level of basic concepts why this type of optimization is fundamentally wrong, because statistics had long lacked some necessary concepts. So I quickly made up my mind to reform the basic conceptual system of statistics. As early as March 1998, while thinking independently about piecewise regression with a single threshold, I had already made breakthroughs on several important concepts, which were reflected in that repeatedly rejected manuscript. They became the foundation on which I continued to think through the entire new conceptual system.

這些新概念大致成形於2006年9月至2008年2月,包括常量期望、隨機對應、隨機變量的9個基本性質以及基於此上的7個公理性陳述。尤其是關於隨機常量和常量期望的概念和定義,我認為它們在統計學裏非常重要,它們是一類變異性為0的隨機量,因此,它們在統計學中的重要地位堪比數字係統中的0。於是,我於2009年8月第三次參加了在華盛頓特區召開的JSM,這次的報告除了提交這些新的基本概念外,還略微改進了2007年會議上提出的三分回歸分析法的算法。會後,這套概念曾被張貼在一個統計網站(www.mitbbs.comstatistics)上,一位在美國某大學攻讀統計學PhD學位的學生看過後評論道:“令人茅塞頓開!”令人遺憾的是,該網站在幾年前被迫關閉,所有內容已不可訪問。

These new concepts roughly took shape between September 2006 and February 2008, and include constant expectation, random correspondence, nine basic properties of random variables, and seven axiomatic statements built upon them. The concepts and definitions of random constant and constant expectation in particular are, I believe, very important in statistics: they describe a class of random quantities whose variability is 0, and their position in statistics is therefore comparable to that of zero in the number system. So, in August 2009, I attended the JSM for the third time, in Washington DC; besides presenting these new basic concepts, the report also slightly improved the FGTLR algorithm proposed at the 2007 JSM. After the meetings, this set of concepts was posted on a statistics website (www.mitbbs.comstatistics). Having read it, a student pursuing a PhD in statistics at a university in the United States commented: "It's so enlightening!" Sadly, the site was forced to shut down a few years ago, and all its content is no longer accessible.

在這次會議之前,我找到了胡貝爾博士在其演講中提到過的、約翰·圖基(John Tukey)於1962年發表在《統計年鑒》上的長文“數據分析的未來”,其中有個小節的標題是“最優化的危險”。遺憾的是,由於概念的缺乏,他既未羅列一些這種危險的表現,也未說明為何最優化是危險的。相反,他事實上是讚成最優化的,他隻是擔心追求最優化會窒息數據分析中新思想的萌芽。其實,在有了上述新的基本概念後,我們將不難理解,這種最優化不隻是一種危險,而是一種根本性錯誤,因為它是確定性數學中的函數極值思維在統計學的隨機係統中的濫用。此外,在本書的概念討論中我們還將發現,在一個樣本中,其極值是最不穩定和最不可靠的測量。因此,重溫圖基博士的文章和胡貝爾博士在中科院的演講,不得不為他們的擔憂和批評表達一種敬意,因為在那個統計學領域的最優化思維剛剛興起的時代,在同行們看來各種最優化法已經是被廣泛認可的、規則化了的、千真萬確的科學思維和手段的時候,他們卻看到了某種不好的東西在阻礙著統計學的思想和方法的進步。我因此而深刻地相信在他們的靈魂深處一定潛藏著某種敏銳的東西,而這種敏銳性的價值難以估量。

Before this meeting, I had found John Tukey's long paper "The Future of Data Analysis", published in The Annals of Mathematical Statistics in 1962 and mentioned in Dr. Huber's speech, which contains a section titled "Danger of Optimization". Regrettably, for lack of the necessary concepts, he neither listed any manifestations of this danger nor explained why optimization is dangerous. On the contrary, he was actually in favor of optimization; he only worried that the pursuit of optimization would stifle the buds of new ideas in data analysis. In fact, once we have the new basic concepts above, it is not hard to understand that such optimization is not merely a danger but a fundamental mistake, because it is an abuse, within the random systems of statistics, of the function-extremum thinking of deterministic mathematics. Furthermore, we will discover in the conceptual discussions of this book that the extreme values in a sample are its most unstable and unreliable measurements. So, revisiting Dr. Tukey's article and Dr. Huber's speech at the Chinese Academy of Sciences, I must pay tribute to their concerns and criticisms: in an era when optimization thinking in statistics was just emerging, and when, in the eyes of their colleagues, the various optimization methods were widely recognized, regularized, absolutely true scientific thinking and means, they nevertheless saw something bad holding back the progress of statistical ideas and methods. I therefore deeply believe that something keenly perceptive must lurk deep in their souls, and the value of that perceptiveness is immeasurable.

在隨後的幾年中,我逐漸回想起了當年在同濟醫科大學讀碩士學位時,衛生統計學教授董時富先生曾給我們專門講授過隨機變量和常量等之間的運算。例如,兩個隨機變量之間的算術運算結果依然是隨機變量。而一個隨機變量與一個常量之間的運算結果也是一個隨機變量。因此,那些用樣本數據構建的所謂最優化算子也都必然是隨機變量。對這段受教經曆的回憶和認識強化了我後來在這個問題上絕不妥協的立場。我寧可讓那篇文章在JSM的論文集裏睡大覺,也不會遷就統計學體係的當前範式。

In the following years, I gradually recalled that when I was studying for my master's degree at Tongji Medical University, Professor Dong Shifu, a professor of health statistics, had specifically taught us about operations among random variables and constants. For example, the result of an arithmetic operation between two random variables is still a random variable, and the result of an operation between a random variable and a constant is also a random variable. Therefore, the so-called optimization operators constructed from sample data must themselves be random variables. Recalling and reflecting on this lesson strengthened my later uncompromising stance on this issue: I would rather let that article sleep in the JSM proceedings than accommodate the current paradigm of the statistical system.
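Professor Dong's rules can be checked numerically in a few lines. A minimal simulation: the sum of two random variables, and of a random variable and a constant, both retain non-zero variance, i.e. they remain random variables; adding a constant shifts the expectation but not the variance.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 100_000)   # one random variable, sampled
y = rng.normal(5.0, 2.0, 100_000)   # another random variable
c = 3.0                             # a constant

# X + Y and X + c both keep non-zero variance: they remain random.
var_sum = np.var(x + y)
var_shift = np.var(x + c)
assert var_sum > 0.0 and var_shift > 0.0
# Adding a constant shifts the expectation but leaves the variance unchanged.
assert abs(var_shift - np.var(x)) < 1e-6
```

The same reasoning applies to any statistic built from sample data, which is why the "optimal" operators of the previous paragraphs are themselves random variables.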

然而,即便我已經做了上述努力,一個重大的問題依然困擾著我,這就是關於連續隨機變量的期望估計,因為在上述廣義三分回歸分析中我用全域模型絕對殘差的算術均數和分段模型合並絕對殘差的算術均數構建了一個加權估計臨界點的權重,即式(2)。然而,這兩個絕對殘差的分布都是偏態的,其算術均數對它們的分布央位的期望估計應該都存在著偏倚,而這些偏倚應該會導致臨界點的加權估計的偏倚。按照統計學界目前的共識,樣本的算術均數對於偏態總體的算術均數是一個無偏估計,但人們也一致認同算術均數對偏態總體的代表性並不好。換句話說,如果一個偏態總體的分布央位不是其算術均數,則樣本算術均數很可能就是關於該偏態總體分布央位的一個有偏期望估計。所以,我需要找到一個新的算法,使得包括正態和偏態在內的所有常見抽樣單峰分布能在同一算法下得到關於其總體分布央位的無偏期望估計。如果我能找到它,將對整個統計學的理論基礎和方法論產生難以估量的影響。

However, even after all these efforts, a major problem still plagued me: the expectation estimate of a continuous random variable. In the generalized trichotomic regression analysis above, I had used the arithmetic mean of the absolute residuals of the fullwise model and the arithmetic mean of the combined absolute residuals of the piecewise models to construct a weight, i.e., Formula (2), for the weighted estimation of the threshold. However, the distributions of both sets of absolute residuals are skewed, so their arithmetic means should be biased estimates of their distribution centers, and these biases should propagate into the weighted estimate of the threshold. According to the current consensus in the statistical community, the sample arithmetic mean is an unbiased estimate of the arithmetic mean of a skewed population; yet people also agree that the arithmetic mean represents a skewed population poorly. In other words, if the distribution center of a skewed population is not its arithmetic mean, then the sample arithmetic mean is most likely a biased estimate of that center. So I needed to find a new algorithm under which all common unimodal sampling distributions, normal or skewed, would yield unbiased estimates of their population distribution centers. If I could find it, it would have an incalculable impact on the theoretical foundation and methodology of the whole of statistics.

我在2007年JSM後不久便開始了思考如何準確估計偏態分布的峰頂,當年的那個夢想也再次在腦海中浮現,思考的焦點最後落在峰頂兩側的密度變異不一致的問題上,並且相信正是這個不一致或失衡導致了分布峰頂向一側偏移,因此,要想準確估計峰頂的位置,算法就要考慮兩側的密度變異,而這種變異應該與每個點在樣本空間中的位置有關。這個思考過程最終導致了我必須在一個給定的樣本最大可測空間內全麵計算每個樣本點對包括其自己在內的所有樣本點的差異性和相似性。

Shortly after the 2007 JSM, I began thinking about how to accurately estimate the peak of a skewed distribution, and the dream of that year reappeared in my mind. The focus of my thinking finally settled on the inconsistency of the density variation on the two sides of the peak. I believed it was this inconsistency, or imbalance, that caused the peak of the distribution to shift to one side. Therefore, to estimate the position of the peak accurately, the algorithm had to take into account the density variations on both sides, and these variations should be related to the position of each point in the sample space. This line of thought eventually led me to compute comprehensively, within a given maximum measurable sample space, the differences and similarities of each sample point relative to all sample points, including itself.

大約從2009年5月起,我有機會為USUHS的預防醫學係流行病學和生物統計教研室的副教授詹尼弗·茹塞茨基(Jennifer Rusiecki)博士工作,不久便接觸到基因數據的統計分析。在一個包含有1500多個基因的病例-對照實驗數據中,我需要找出一些有統計顯著性的基因。這個工作在很多從事基因數據統計分析的人們看來,似乎是一件很容易的工作,因為有現成的方法論和統計軟件,將樣本數據在軟件裏運行一下就可以得到結果。

Starting around May 2009, I had the opportunity to work for Associate Professor Jennifer Rusiecki, Ph.D., in the Division of Epidemiology and Biostatistics of the Department of Preventive Medicine at USUHS, and soon became exposed to the statistical analysis of genetic datasets. In a case-control experimental dataset covering more than 1500 genes, I needed to find the statistically significant genes. To many people engaged in the statistical analysis of genetic data, this seems an easy job: ready-made methodologies and statistical software exist, and results can be obtained simply by running the sample data through the software.

但是,我發現對全部基因無論是采用t檢驗或秩和檢驗,或者根據正態性檢驗結果將兩種方法混合使用,其結果都將產生難以預測和控製的偏差,也就是導致基因的篩選出現偏差,而且,這些偏差都包含著隨機誤差和係統誤差。我意識到自己在這裏遇到了一個統計方法學上的瓶頸,而打破這一瓶頸以消除這些偏差的唯一辦法隻能是拋棄正態性假定,並采用一種統一的算法對連續性抽樣分布的期望作出無偏估計。這個期望也被稱為這類分布的央化位置,而這樣的央位應該對應著包括正態(對稱的)和偏態(非對稱的)在內的所有常見單峰分布的峰頂。

However, I found that whether I applied the t-test or the rank-sum test to all genes, or mixed the two methods according to the results of normality tests, the results would contain biases that are difficult to predict and control, biasing the gene screening; and these biases contain both random and systematic errors. I realized that I had run into a bottleneck in statistical methodology, and that the only way to break through it and eliminate these biases was to abandon the normality assumption and adopt a unified algorithm yielding unbiased estimates of the expectations of continuous sampling distributions. This expectation is also referred to as the centralized location (or center) of such a distribution, and such a center should correspond to the peak of every common unimodal distribution, normal (symmetric) or skewed (asymmetric) alike.

2010年9月的某一天,一個嶄新的思想萌芽闖入了我的腦海。經過一段時間的思考、計算、修正和比較,一個關於單峰分布期望估計的算法終於在當年的12月12日形成,即所謂的關於連續隨機變量的自權重。該算法僅涉及最基礎的數學四則運算,通過一個嚴謹而巧妙的邏輯分析組合而成。在此基礎上,自加權期望以及所有其它必要的統計量都可被輕易獲得。如果說在算術均數的計算中默認每個樣本點對分布央位的貢獻相同,那麽,自權重的獲得將會告訴我們每個樣本點的這一貢獻可以在[0, 1]區間隨機變化,距離分布央位越近則貢獻越大,反之就越小。從樣本測量值與其自權重構成的二維散點圖來看,這個自權重直觀地展示出了一個分布的離散趨勢和集中趨勢。我們還將發現,由於自權重可以幫助我們將一個分布的央位估計在其分布曲線的峰頂處成為無偏估計,一個偏態分布將總是可以被正態化,而且這個正態化的分布與其原始分布擁有相同的期望、方差和可測空間,從而,目前存在於統計學理論基礎中的正態性假定就成了累贅,而單峰分布將可以取而代之。進一步地,我們應該可以將該算法從對單峰分布的峰頂估計拓展到對一切連續隨機變量的分布央化位置的估計。

One day in September 2010, the germ of a brand-new idea sprang into my mind. After a period of thinking, calculation, correction and comparison, an algorithm for the expectation estimate of unimodal distributions finally took shape on December 12 of that year: the so-called self-weight of a continuous random variable. The algorithm involves only the four basic arithmetic operations, assembled through a rigorous and ingenious logical analysis. On this basis, the self-weighted expectation and all other necessary statistics can easily be obtained. If the calculation of the arithmetic mean tacitly assumes that every sample point contributes equally to the distribution center, the self-weights tell us that this contribution varies randomly within the interval [0, 1]: the closer a point lies to the distribution center, the larger its contribution, and vice versa. In the two-dimensional scatter plot of the sample measurements against their self-weights, the self-weight visually displays both the dispersive and the centralizing tendencies of a distribution. We will also find that, since the self-weights allow the estimate of the distribution center to be unbiased at the peak of the distribution curve, a skewed distribution can always be normalized, and the normalized distribution shares the same expectation, variance and measurable space as the original. The normality assumption now residing in the theoretical foundations of statistics thereby becomes redundant, and the unimodal distribution can take its place. Furthermore, we should be able to extend the algorithm from estimating the peak of a unimodal distribution to estimating the distribution center of any continuous random variable.

對該算法的初步驗證是在算法構建過程中追求得到一個二維空間的正態形的散點分布,也即用一個樣本量為100的近似正態樣本,如果樣本點與其自權重之間的散點分布呈正態形曲線,則算法應該是正確的,反之則應該是錯誤的。最終,我達到了目的。為了進一步驗證該算法在大樣本下的表現,我用了一個2480例的左偏態分布樣本,在計算出其自權重後,顯示出良好的左偏態散點分布,其凸自加權均數正好對應著峰頂所在位置,而其算術均數位於峰頂的右側(見書中的圖6.4.5)。由此可以推斷,如果該樣本呈右偏態分布,其算術均數應該會出現在峰頂的左側。隨後,我用係統抽樣的方法從該樣本中提取31例,即1/80的原樣本量,計算其自權重和凸自加權均數,結果顯示出對原樣本峰頂極好的估計。最後,我做了一個10萬個正態分布樣本點的隨機模擬試驗,其散點分布就是本書封麵上偏左上的那個近似正態曲線的圖形。這一切均表明自權重的正確性、可靠性和準確性。於是,我決定帶著它參加將於2011年8月在邁阿密召開的JSM,並決定借此機會進一步完善發布於2009年JSM的那套基礎概念係統。在完成了上述工作後,我深深地感覺到,一扇新的大門已經在統計學領域被悄然地推開了。

The preliminary verification of the algorithm consisted, during its construction, of striving for a normal-shaped scatter distribution in two-dimensional space, using an approximately normal sample of size 100: if the scatter of the sample points against their self-weights presented a normal-shaped curve, the algorithm should be correct; otherwise it should be wrong. In the end, I achieved that goal. To further verify the algorithm's performance on large samples, I used a left-skewed sample of 2480 cases. After its self-weights were calculated, it showed a good left-skewed scatter distribution; the convex self-weighted mean corresponded exactly to the location of the peak, while the arithmetic mean lay to the right of the peak (see Figure 6.4.5 in the book). From this it can be inferred that if the sample were right-skewed, its arithmetic mean would appear to the left of the peak. I then used systematic sampling to extract 31 cases from the sample, i.e., 1/80 of the original sample size, and calculated their self-weights and convex self-weighted mean; the result was an excellent estimate of the peak of the original sample. Finally, I ran a random simulation of 100,000 normally distributed sample points, whose scatter distribution is precisely the approximately normal curve at the upper left of this book's front cover. All of this demonstrated the correctness, reliability and accuracy of the self-weight. So I decided to bring it to the JSM to be held in Miami in August 2011, and to take that opportunity to further refine the basic conceptual system released at the 2009 JSM. Having completed this work, I felt deeply that a new door had been quietly pushed open in the field of statistics.

茹塞茨基博士在了解了我的基本思想後,在2011年的JSM會議即將召開前,安排我在USUHS做了一次講座,有視頻得以錄製並發布在視頻分享網站Youtube上  (Self-weight —— A New Horizon of Statistics by Ligong Chen)。

After learning my basic ideas, Dr. Rusiecki arranged for me to give a lecture at USUHS shortly before the 2011 JSM conference. The video was recorded and posted on the video-sharing website YouTube (Self-weight —— A New Horizon of Statistics by Ligong Chen).

然而,正是那份來自德國出版商的約稿信以及那些年中與不同學者的幾次私下討論,最終促使我下決心寫這本書。由於各種原因,我直到2017年的年中才開始了構想這本書的框架,並繼續搜集和學習了一些相關文獻。在經曆了2019年3月《自然》雜誌發表那篇文章所反映的統計學領域令人不安的現實後,在稍後與Blake McShane博士討論的同時,我終於開啟了以純中文進行的思考和寫作進程。而就在當年的5月,整個家庭便因突發變故進入從馬裏蘭搬家到印第安納的程序,寫作不得不被迫暫時中止。

However, it was the invitation letter from the German publisher, together with several private discussions with various scholars over those years, that finally made me decide to write this book. For various reasons, I did not begin conceiving the framework of the book until mid-2017, while continuing to collect and study relevant literature. After the disturbing reality of the field of statistics reflected in the article published in Nature in March 2019, and while discussing with Dr. Blake McShane shortly afterwards, I finally started the process of thinking and writing, purely in Chinese. But in May of that year, a sudden family change put us into the process of moving from Maryland to Indiana, and the writing had to be temporarily suspended.

當年7月的最後一天終於完成了搬家,而適應一個新環境耗費的時間超過了半年,正是在這段時間裏,我通過微信認識了一位在University of Louisville School of Public Health任職的華裔生物統計學教授X博士,和她討論了算術均數在關於連續隨機變量測量分布的統計描述中的問題,她得知我有新的算法後,非常高興地邀請我去她所在的係做了一次研討會,也有視頻錄製和上載到Youtube分享([English] Self-weight of Continuous Random Variable)。這次演講恰逢2020年中國農曆新年的前夜,而第二天就在全球媒體上和全球華人普遍使用的微信群中傳出中國武漢發生了嚴重的新冠病毒性疫情。這一重大曆史性事件改變了很多人的行為和命運,也使得我的寫作進程幾乎完全中斷長達近兩年。

The move was finally completed on the last day of July that year, and adapting to the new environment took more than half a year. It was during this period that I met, through WeChat, Dr. X, a Chinese-born professor of biostatistics at the University of Louisville School of Public Health, and discussed with her the problems of the arithmetic mean in the statistical description of measurement distributions of continuous random variables. After she learned that I had a new algorithm, she was happy to invite me to give a seminar at her department; a video was recorded and uploaded to YouTube ([English] Self-weight of Continuous Random Variable). The talk happened to fall on the eve of the 2020 Chinese Lunar New Year, and the next day the global media, and the WeChat groups widely used by Chinese people worldwide, reported that a serious novel coronavirus epidemic (COVID-19) had broken out in Wuhan, China. This major historical event changed the behavior and destiny of many people, and it also interrupted my writing almost completely for nearly two years.

本書的第一至第四章最早在1998年就開始了撰寫,並在多年前就已有了比較完整的初稿,現在需要的是對其中部分內容予以更新,並融入一些新思想。就在涉及自權重的第六章結稿前,我突然產生了一個新的疑問:連續隨機變量的凸自加權均數與其算術均數和中位數的關係是怎樣的?我無法從數學論證的角度抽象地探討這些關係,於是決定用一個很笨的辦法——枚舉法——來直接查證它們。我從樣本量n = 2開始計算其自權重和凸自加權均數,於是發現此時的凸自加權均數就是算術均數,也就是說,凸自加權均數在樣本量為2時自動退化為算術均數,或者說,算術均數是樣本量為2時凸自加權均數的一個特例。進一步地,我將n逐一增加到3、4、5、6,於是得到中位數是樣本量分別為3和4時凸自加權均數的特例。而當樣本量達到5或以上時,凸自加權均數就是它的常規算法。以上樣本量的設置均應保證任意兩兩數據點在數值上不等。對這些關係的直接枚舉查證表明凸自加權均數可以統一算術均數和中位數的計算,這進一步強化了凸自加權均數作為通用算法的地位。在查證完這些關係後,再反思為何樣本量為2時凸自加權均數就是算術均數,這才形成了對算術均數的一個深刻理解,它被推廣到任意樣本量的計算是一個未加審慎考慮的輕率之舉。

Chapters 1 to 4 of this book were first drafted in 1998, and a relatively complete draft existed many years ago; what was needed now was to update some of the content and incorporate some new ideas. Just before finalizing Chapter 6, which concerns self-weights, I suddenly had a new question: what is the relationship between the convex self-weighted mean of a continuous random variable and its arithmetic mean and median? I could not explore these relationships abstractly through mathematical argument, so I decided to use a very clumsy method, enumeration, to verify them directly. I started by calculating the self-weights and the convex self-weighted mean at sample size n = 2, and found that the convex self-weighted mean in this case is exactly the arithmetic mean; that is, the convex self-weighted mean automatically reduces to the arithmetic mean when the sample size is 2, or, put another way, the arithmetic mean is a special case of the convex self-weighted mean at sample size 2. I then increased n to 3, 4, 5, and 6 one by one, and found that the median is a special case of the convex self-weighted mean at sample sizes 3 and 4. Once the sample size reaches 5 or more, the convex self-weighted mean follows its regular algorithm. All of the above settings require that no two data points be equal in value. This direct enumerative verification shows that the convex self-weighted mean can unify the calculation of the arithmetic mean and the median, which further strengthens its status as a universal algorithm. Reflecting afterward on why the convex self-weighted mean equals the arithmetic mean at sample size 2 led to a deep understanding of the arithmetic mean: generalizing it to samples of arbitrary size was a rash move made without careful consideration.
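Such small-sample coincidences can themselves be checked by the enumeration approach just described. The self-weight algorithm is developed only later in the book and is not reproduced in this preface, so the sketch below uses classical estimators as stand-ins: at n = 2 the median coincides exactly with the arithmetic mean (the same kind of small-sample degeneracy reported above for the convex self-weighted mean), while at n = 3 they generally differ. The helper `coincides` and the integer sampling scheme are illustrative choices of mine, not the book's code.

```python
import random
import statistics

def coincides(est_a, est_b, n, trials=1000, tol=1e-12):
    """Enumeration-style check: do two location estimators agree
    on random samples of size n with pairwise-distinct values?"""
    for _ in range(trials):
        xs = random.sample(range(10_000), n)  # distinct integers, as the text requires
        if abs(est_a(xs) - est_b(xs)) > tol:
            return False
    return True

def amean(xs):
    return sum(xs) / len(xs)

random.seed(42)
print(coincides(amean, statistics.median, 2))  # True: at n = 2 they are identical
print(coincides(amean, statistics.median, 3))  # False: at n = 3 they generally differ
```

Distinct sample values are enforced with `random.sample`, matching the text's requirement that no two data points be equal in value.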

在第八章的寫作中遇到的幾個問題值得在此分享。在將自權重和凸自加權均數引入相關和回歸分析後,在案例分析中我發現了三個基本現象。一是兩個可變屬性間的相關係數在數值上與基於其算術均數的相關係數幾乎一致,在大大超出樣本可測精度的小數點後很遠才看到了差異性。這從一個特殊角度表明相關係數與兩個可變屬性各自的分布形態無關。二是回歸係數在數值上將會受到因變量和自變量各自分布形態的影響,兩者的分布對稱性越好,基於凸自加權均數的回歸係數與基於算術均數的回歸係數在數值上就越趨於一致,反之則差異越大。這也從一個特殊角度說明,回歸模型的參數估計與分布形態有關。第三,基於凸自加權均數的回歸模型將輸出一個以非零為中心的殘差分布,這意味著這類回歸模型不可預先假定殘差滿足以零為中心的分布。但是,如果將總和非零的殘差簡單地按樣本量平均,然後將這個平均量與常數項合並,則得到一個簡化的新模型,且該模型的殘差將以零為中心分布。因此,基於凸自加權均數的回歸模型無需假定殘差的分布特征,而是可以通過算法確定殘差的分布。這一概念上和算法上的轉變將使得回歸分析具有更大的靈活性和普適性。

Several issues encountered in writing Chapter 8 are worth sharing here. After introducing self-weights and convex self-weighted means into correlation and regression analysis, I found three basic phenomena in the case analyses. First, the correlation coefficient between two variable attributes is numerically almost identical to the one based on their arithmetic means; differences appear only at decimal places far beyond the measurable precision of the sample. From a special angle, this shows that the correlation coefficient is independent of the distribution patterns of the two variable attributes. Second, the numerical value of a regression coefficient is affected by the distribution patterns of the dependent and independent variables: the better the distributional symmetry of the two, the closer the regression coefficient based on the convex self-weighted mean is to the one based on the arithmetic mean, and vice versa. This shows, again from a special angle, that the parameter estimates of a regression model are related to the distribution patterns. Third, a regression model based on the convex self-weighted mean outputs a residual distribution centered on a non-zero value, which means this type of regression model cannot pre-assume that the residuals follow a zero-centered distribution. However, if the residuals, whose sum is non-zero, are simply averaged over the sample size and this average is merged into the constant term, a simplified new model is obtained whose residuals are distributed around zero. Therefore, a regression model based on the convex self-weighted mean need not assume the distributional characteristics of the residuals; instead, the distribution of the residuals can be determined by the algorithm. This conceptual and algorithmic shift will make regression analysis more flexible and universal.
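The third point, folding the average of the non-zero-sum residuals into the constant term, can be sketched as follows. The convex self-weighted fitting procedure itself is not given in this preface, so placeholder coefficients stand in for such a fit; only the merge-into-intercept step is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.5, 1.0, 50)  # noise deliberately not zero-centered

# Placeholder coefficients standing in for a convex self-weighted fit:
b0, b1 = 1.0, 2.0
residuals = y - (b0 + b1 * x)
shift = residuals.mean()            # generally non-zero for such a model

# Merge the average residual into the constant term:
b0_new = b0 + shift
new_residuals = y - (b0_new + b1 * x)

# The simplified model's residuals are now zero-centered:
print(abs(new_residuals.mean()) < 1e-9)  # True
```

The same fold-in works for any fitted linear model whose residuals fail to sum to zero; the slope is untouched, so only the intercept absorbs the shift.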

就在本書的寫作進展到“第九章 分段回歸”時,時間已到了2024年的2月裏。此時,我以為這是自己早已深思熟慮的部分,且已有了基於自加權均數的偏態分布期望估計的算法,所以非常自信應該可以得到關於臨界點足夠準確的估計。於是,將凸自加權均數引入絕對殘差的期望估計,便有了全域模型絕對殘差的凸自加權均數MARc和第i組分段模型合並絕對殘差的凸自加權均數mcari,c。於是,將原本用算術均數構建的殘差收斂率更改為如下算式:

Just as the writing of this book progressed to “Chapter 9: Piecewise Regression”, it was already February 2024. I thought this was a part I had long since thought through carefully and profoundly, and I already had the algorithm for estimating the expectation of skewed distributions based on the self-weighted mean, so I was very confident that I could obtain a sufficiently accurate estimate of the threshold. By introducing the convex self-weighted mean into the expectation estimate of the absolute residuals, I obtained the convex self-weighted mean MARc of the absolute residuals of the fullwise model and the convex self-weighted mean mcari,c of the combined absolute residuals of the ith group of piecewise models. The convergence rate of the residuals, originally constructed with arithmetic means, was then changed to the following formula:

然而,在500次隨機模擬的編程計算(這讓我的11台舊電腦連續工作了22天,其中三台在算完後不久主板報廢)中發現,按照式(3)為回歸權重計算得到的加權臨界點的估計值依然明顯偏離了直覺上臨界點所在的位置。盡管直覺對於隨機係統是非常不可靠的,但這一直覺判斷頓時令我大惑不解!我甚至開始對自加權的算法失去了信心,懷疑它以及我為它已經付出的數年光陰到底價值幾何?一絲不安和焦慮也襲上了心頭。我於是暫停了寫作和數據分析去修理自己收藏的一些古舊破爛小提琴及其弓子。

However, the programming calculation of 500 random simulations, which kept my 11 old computers running continuously for 22 days (the motherboards of three of them failed shortly after the calculation finished), showed that the estimates of the weighted threshold computed with the regression weights of Formula (3) still deviated markedly from the intuitive location of the threshold. Although intuition is very unreliable for random systems, this intuitive judgment left me utterly puzzled! I even began to lose confidence in the self-weighting algorithm, wondering what it was really worth, and what the years I had devoted to it were worth. A trace of unease and anxiety also crept over me. I then paused the writing and data analysis to repair some of the old, battered violins and bows in my collection.

在三個多月裏,我一邊修琴一邊冥思苦想這個問題。在修理完工3把琴和15把琴弓後,我終於悟出了問題所在:那個權重的構建僅僅使用了殘差。但是,一個回歸模型中還有預測值。忽視預測值的變異,也就是丟棄對期望臨界點有貢獻的一部分樣本信息。於是,類似關於連續隨機變量自加權的算法構建,我需要將預測值的變異也引入到關於期望臨界點的權重構建中。我隨即終止修琴,打開電腦修改SAS程序並重新計算。這次還找附近的兩位朋友借了兩台舊電腦,十幾台電腦連續運行了24個日夜後,我終於得到了關於那些模擬臨界點更準確的估計。至此,分段回歸的算法在我看來終於完美地構建成功。它再次體現了權重構建中的兩條基本準則:無信息損失,無信息冗餘。

For more than three months, I pondered the problem while repairing the violins. After finishing the repairs on 3 violins and 15 bows, I finally figured out what was wrong: the weights had been constructed using only the residuals. But a regression model also has predicted values. Ignoring the variation of the predicted values means discarding part of the sample information that contributes to the expected threshold. Therefore, similar to the construction of the self-weighting algorithm for continuous random variables, I needed to bring the variation of the predicted values into the weight construction for the expected threshold. I immediately stopped the repair work, turned on the computers, modified the SAS programs, and recalculated. This time I also borrowed two old computers from two friends nearby. After more than a dozen computers had run for 24 consecutive days and nights, I finally obtained more accurate estimates of those simulated thresholds. At this point, the algorithm for piecewise regression had, in my view, finally been constructed perfectly. It once again embodied the two basic principles of weight construction: no information loss, and no information redundancy.

2025年2月,在聯合統計年會文章摘要投稿的最後一天,我提交了自己的文章《分段模型連續性的直接概率測量和推斷》的摘要。這是我在分段回歸領域對此前我留在JSM論文集中的算法所作的最後一次修正。不久,我收到了會議組委的通知,該文章被數理統計學會(IMS)接納並安排在8月4日的“統計推斷進展”小組做口頭演講。因此,我以獨立的個人身份參加了8月初在田納西州納什維爾市召開的聯合統計年會。

In February 2025, on the last day of abstract submissions for the Joint Statistical Meetings (JSM), I submitted the abstract of my paper, “A Direct Probability Measure and Inference for Continuity of Piecewise Models.” This was my final revision, in the field of piecewise regression, of the algorithm I had previously left in the JSM proceedings. Shortly afterward, I received notification from the conference organizers that the paper had been accepted by the Institute of Mathematical Statistics (IMS) and scheduled for an oral presentation in the “Advances in Statistical Inference” session on August 4th. I therefore attended the JSM held in Nashville, Tennessee, in early August 2025 as an independent individual.

參會期間我聽了不少他人的演講,其中很多人的演講中都有一個數值型最優化步驟。在聽完哈佛大學商學院的年輕教授李淩誌博士的演講後,我私下找到他和他討論了其算法中的那個最優化步驟。我從隨機對應的角度指出這樣做在理論上是完全錯誤的。在聽了我的解釋後,他恍然大悟,回應稱他從未像我這樣思考過這個問題。

During the meetings, I listened to many presentations, many of which included a numerical optimization step. After listening to a presentation by Dr. Lingzhi Li, a young professor at Harvard Business School, I privately approached him to discuss the optimization step in his algorithm. I pointed out, from the perspective of random correspondence, that this approach is theoretically completely wrong. After hearing my explanation, he suddenly understood, responding that he had never thought about the problem in the way I had.

在聆聽當代統計學泰鬥Robert Tibshirani博士和教授的演講時,我注意到他的新算法中也有一步數值型最優化,而且他還特別強調了其新算法存在一個嚴重的過擬合(他用的英文表述是severe overfitting)。在其演講結束後的提問期間,我第一個舉手並得到批準。我請求Tibshirani博士將PPT翻回到那個最優化所在的頁麵,然後指出正是這個最優化導致了其算法的過擬合。但Tibshirani博士不認同這一說法。我提到了John Tukey在1962年的文章裏就警告過最優化的危險性,然後提到我與DeepSeek和ChatGPT的討論結果。這進一步加劇了爭執。會議主持人,華裔統計學家沈小彤教授見狀立刻示意我不要繼續說下去。是的,當著100多個慕名前來聽大師演講的專家和學者們的麵指出其新算法的問題所在是一個很大的冒犯。我隻好遺憾地放棄繼續闡述原因何在。

While listening to a lecture by Dr. Robert Tibshirani, a leading figure in contemporary statistics, I noticed that his new algorithm also included a numerical optimization step, and he specifically emphasized that the algorithm suffered from what he called “severe overfitting.” During the Q&A session after his lecture, I was the first to raise my hand and was recognized. I asked Dr. Tibshirani to turn back to the slide containing that optimization, and then pointed out that it was precisely this optimization that caused the overfitting in his algorithm. Dr. Tibshirani, however, disagreed with this assessment. I mentioned John Tukey's warning about the dangers of optimization in his 1962 article, and then cited the results of my discussions with DeepSeek and ChatGPT. This further escalated the dispute. Seeing this, Professor Xiaotong Shen, a Chinese-American statistician and the session's moderator, immediately signaled me to stop. Indeed, pointing out the problem with his new algorithm in front of more than 100 experts and scholars who had come to hear the master's lecture was a great offense. I regretfully had to abandon any further explanation.

事實上,我在此前一天的晚間活動中已有幸遇見並認識了Robert Tibshirani博士,並和他有過幾分鍾的短暫交流。我向他介紹了自己在統計學領域所做的革命性工作,包括與埃弗農博士的那個關於“隨機變量”這個術語的email交流和將其更名為可變屬性、連續型隨機變量分布期望的凸自加權估計的算法以及算術均數和中位數均可作為特例被統一在該算法之下。他表示對此有興趣進一步了解,希望我通過email向他提供更多信息。因此,會後不久,我通過email向他解釋了那個嚴重過擬合與其中數值型最優化之間的關係。我相信後者是導致過擬合的唯一原因。

In fact, I had had the privilege of meeting Dr. Robert Tibshirani at the evening event the previous day, and of a brief few minutes' conversation with him. I introduced to him the revolutionary work I had done in the field of statistics, including my email exchange with Dr. Efron about the term “random variable” and its renaming as “variable attribute,” the algorithm of convex self-weighted estimation for the expectation of the distribution of a continuous random variable, and how the arithmetic mean and the median can be unified as special cases under this algorithm. He expressed interest in learning more and hoped I could provide further information via email. Therefore, shortly after the meeting, I explained to him via email the relationship between the severe overfitting and the numerical optimization involved. I believe the latter is the sole cause of the overfitting.

這次會議期間還先後有幸遇到了美國凱斯西儲大學醫學院人口與健康計量學係的統計學教授付平福博士和加拿大多倫多大學統計科學係的教授周舟博士,還有很多隨機遇到的其他同行們,我盡力向他們粗略但係統性地介紹了我在統計學裏所做的那些突破性工作。他們也均表示願意進一步了解。

During the conference, I also had the privilege of meeting Dr. Pingfu Fu, Professor of Statistics in the Department of Population and Quantitative Health Sciences at the Case Western Reserve University School of Medicine, and Dr. Zhou Zhou, Professor in the Department of Statistical Sciences at the University of Toronto, as well as many other colleagues I met by chance. I did my best to give them a rough but systematic overview of my groundbreaking work in statistics, and they all expressed their willingness to learn more.

記得中國物理學家張雙楠教授在一個公開辯論科學問題的視頻中講過這樣一件事,當他在英國求學期間完成了某個問題的研究時,有同事向他提問:“這其中的科學是什麽?”他竟突然間對Science這個術語在這一問話中的內涵感到了困惑。他知道什麽是物理、什麽是化學,但若說一個問題中的科學是什麽,他竟一時語塞。我以為,這個“科學”應該意味著發現某種他人未曾發現或即使發現了也無視或不曾有所為的某種存在。例如,“算術均數會偏離偏態分布曲線的峰頂因而對偏態總體的分布期望必然是一個有偏估計”就是一個存在、“非參數檢驗法會降低對差異的檢驗精度”也是一個存在。事實上,這兩個問題早已被統計學界廣泛認可,但卻一直被人們忽視。對一個存在的新發現可以成為某種科學探索的起點。

I remember the Chinese physicist Professor Zhang Shuangnan recounting, in a video of a public debate on scientific issues, that when he completed research on a certain problem while studying in the UK, a colleague asked him: “What is the science in this?” He was suddenly confused about the connotation of the term “science” in that question. He knew what physics and chemistry were, but when asked what the science in a problem was, he was momentarily at a loss for words. In my view, this “science” should mean discovering some existence that others have not discovered, or have ignored or done nothing about even after discovering it. For example, “the arithmetic mean deviates from the peak of a skewed distribution curve and is therefore necessarily a biased estimate of the distribution expectation of a skewed population” is such an existence, and “non-parametric tests reduce the precision of testing differences” is another. In fact, both issues have long been widely recognized by the statistical community, yet they have remained ignored. A new discovery of such an existence can become the starting point of a scientific exploration.

一切統計工作都是關於測量、分布和對分布的數學化描述,以及在此基礎上發展起來的關於差異性檢驗和探索隨機變量之間關係等的方法學體係。所以,對分布描述方法的改進將極大地影響對差異性檢驗和關係構建的方法學的改進,甚至可能引發眾多方法上的革命。

All statistical work concerns measurement, distribution, and the mathematical description of distributions, as well as the methodological system built on this basis for testing differences and exploring relationships among random variables. Therefore, improvements in the methods of describing distributions will greatly drive methodological improvements in difference testing and relationship construction, and may even trigger revolutions in numerous methods.

統計學是一種認知外部經驗世界的工具,但一個有誌於從事統計方法應用和研究的人不是簡單地依附在這些工具上的工具人。他們必須直接接觸樣本采集、數據管理和統計分析,隻有在大量的數據分析實踐中才有可能被激發出富有創造性的靈感。

Statistics is a tool for understanding the external world of experience, but a person who aspires to apply and research statistical methods is not a mere operator attached to these tools. They must be directly involved in sample collection, data management, and statistical analysis; only through extensive practice in data analysis can creative inspiration be sparked.

由於我的教育背景和統計實踐經驗都極其有限,我不認為我能繼續做的更多,但我努力在過去的歲月裏一次次超越自我。這一切都是由於在1991年3~5月期間形成的那個夢想,以及在1998年3月底的6天6夜裏形成的許多突破性的思想;當然,更應歸因於那個認知流程框架以及運行於其中的四維邏輯係統。從1997年11月的那天開始,每一次的思維啟動,我都無法預知它將如何走下去,也不知道它將在哪裏停下來。但我知道,它一定會形成一些新的思想。最終,從一個莫可名狀的“非常態分析”這一星點思想的火花演變成了燎原於本書各章節的熊熊烈焰。

Given my extremely limited educational background and experience in statistical practice, I do not think I can do much more, but over the past years I have tried to surpass myself again and again. All of this stems from the dream formed between March and May of 1991, and from the many breakthrough ideas formed during the six days and six nights at the end of March 1998; of course, it should above all be attributed to the cognitive process framework and the four-dimensional logic system running within it. From that day in November 1997 onward, every time a thinking process started in my mind, I could not predict how it would proceed, nor where it would stop. But I knew it would definitely form some new ideas. In the end, the spark of an inexplicable notion of “non-normal-state analysis” evolved into a blaze raging across the chapters and sections of this book.

思考的過程充滿著苦悶和焦慮,有時甚至會經曆某種莫名的痛苦,但也能時常望見思維隧道深處的點點星星之光,而一旦某個或某幾個光點閃爍出較大的光芒,就有可能將整個隧道照得通明,而思維過程也將在瞬間轉變成一股無法阻遏的巨浪,並因此令人在膽顫心驚中體驗到某種震撼!這正如中國宋朝詩人陸遊(1125-1210)在其《遊山西村》中所吟:“山重水複疑無路,柳暗花明又一村”。我想,這應該就是人類的思維能力所能展現出的一種魅力。所以,隻要一個人不因循守舊,敢於挑戰某種現存的事物或觀念,他們就極有可能發現一些新的東西,從而有機會獨享這份魅力。我深信更多的人們在穿越這扇新大門後會發現更多新奇的東西,並創造出既屬於他們自己,也屬於統計學,因而屬於整個人類社會的更偉大的未來,因為,統計學是人類能夠發現並努力建立的一套認識世界的高級方法論,它對於人類的未來不言而喻。

The process of thinking was often filled with distress and anxiety, sometimes even accompanied by an inexplicable kind of pain. Yet one can also occasionally glimpse faint starlight deep within the tunnel of thought, and once one or several of these points of light begin to shine more brightly, they may illuminate the entire tunnel. In that moment, the thinking process can suddenly surge into an unstoppable wave, overwhelming the thinker with a sense of awe and trembling exhilaration! This is just as the Chinese Song Dynasty poet Lu You (1125-1210) sang in his “A Visit to a Village West of the Mountains”: “After endless mountains and rivers that leave doubt whether there is a path out, suddenly one encounters the shade of a willow, bright flowers and a lovely village again.” (This English rendering of the two lines comes from the speech given by Hillary Clinton, then US Secretary of State, on May 22, 2010, at the reception of the US Pavilion at the Shanghai World Expo.) I believe this is a kind of enchanting power that human thinking can display. Therefore, as long as people do not blindly follow tradition and dare to challenge existing things or concepts, they are very likely to discover something new and thus have the chance to enjoy this enchantment for themselves. I firmly believe that more people will uncover more wonders after passing through this new gateway, and will create a greater future that belongs not only to them but also to statistics, and thus to the whole of human society; for statistics is one of the highest-level methodologies that humanity has been able to find and strive to build for understanding the world, and its importance to our future speaks for itself.

本書力圖將作者在過去30多年裏在統計學領域的探索和思考作一次總結。雖然已有過那家德國出版社約稿成書,但我希望這本書能以中英文對照的形式出版,因為中文是我的母語,它是人類曆史上一種從遠古傳承至今的文字性語言。這是一種從創始之初即在字符的結構上富有抽象形式的高度發達的人類語言。正是由於其字符結構上豐富多樣的抽象形式,漢字係統本身構成了一部具有某種自解釋能力的百科全書,而由這些從遠古傳承至今、相對固定不變的字符係統所創造的語言和思想表達得簡練、精致、優雅和深邃,這在人類文明史上恐怕無與倫比。最重要的是,我之前的全部教育背景以及關於這一話題的一切思考都唯一地得益於它。本書的英文翻譯幾乎全部出自基於網絡的Google Translator以及作者對翻譯結果的閱讀和修訂,僅有幾個段落的翻譯由ChatGPT-4o所為,因此,如果英文翻譯的表達存在任何錯誤或語義不明之處,請以中文表達為準。

This book attempts to summarize the author's exploration and thinking in the field of statistics over the past 30-plus years. Although the book was commissioned by that German publisher, I hope it can be published in a Chinese-English bilingual form, because Chinese is my mother tongue, a written language that has been passed down in human history from ancient times to the present. It is a highly developed human language whose characters have been structurally rich in abstract forms from its very inception. Precisely because of the rich and diverse abstract forms of its character structure, the Chinese character system itself constitutes an encyclopedia with a certain self-explanatory capability, and the language and thought created with this relatively stable character system, inherited from antiquity, are concise, refined, elegant, and profound, which may be unparalleled in the history of human civilization. Most importantly, my entire prior education and all my thinking on this topic have benefited from it alone. The English translation of this book was produced almost entirely by the web-based Google Translate together with the author's reading and revision of the results; only a few paragraphs were translated by ChatGPT-4o. Therefore, if there are any errors or semantic ambiguities in the English, the Chinese expression shall prevail.

四、各章內容簡介 (Introduction to the Chapters)

第一章屬於純哲學認識論的範疇,開篇的三個認知導向隱含著所有統計學方法的三個基本類別:描述、差異性檢驗、相關與回歸。通過對這三個導向的簡單敘述,直接展示了抽象思維的工作模式。本章對辯證法認知模型的討論應該是有所創意的。此外,還將那篇“論智慧的遞進結構和認知的邏輯流程”中的內容融合在此。然而,將一個哲學範疇的議題作為本書開篇的目的則不僅是為了強調統計學的“認知方法論”這一重要屬性,並以此將其與純數學拉開一點距離,更重要的是,它是作者在過去的近30年裏思考和解決自己所麵對的所有統計學問題時可以依賴的最基礎的方法論。

The first chapter belongs to the category of pure philosophical epistemology. The three cognitive orientations at its opening imply the three basic categories of all statistical methods: description, difference testing, and correlation and regression. A simple account of these three orientations directly demonstrates the working pattern of abstract thinking. The discussion of the dialectical cognitive model in this chapter should be somewhat original. In addition, the content of the article “On the Progressive Structure of Intelligence and the Logical Process of Cognition” is integrated here. However, the purpose of opening this book with a philosophical topic is not only to emphasize statistics' important attribute as a “cognitive methodology,” and thereby to set it at some distance from pure mathematics; more importantly, this is the most basic methodology the author has relied on when thinking about and solving all the statistical problems he has faced over the past nearly 30 years.

第二章是關於統計學的曆史概貌,通過這一回顧,對正態分布、算術均數、貝葉斯法,以及現行的眾多基於最優化和強製連續性假定的分段回歸提出了幾個關鍵且合理的批評,從而為本書在後續章節中提出關於自權重和加權分段回歸的算法打下重要的思想基礎。這裏有必要一提的是,書中對貝葉斯法的批評立足於概率論本身的一個定理,而非自其誕生以來眾多批評者和支持者所陷入的那些關於主觀-客觀、先驗-後驗-經驗等爭論不休的純哲學式思辨。

Chapter 2 is a brief historical overview of statistics. Through this review, several key and reasonable criticisms are made of the normal distribution, the arithmetic mean, the Bayesian method, and the many current piecewise regressions based on optimization and an enforced continuity assumption, thereby laying an important ideological foundation for the algorithms this book proposes in subsequent chapters for self-weighting and weighted piecewise regression. It is worth mentioning here that the book's criticism of the Bayesian method rests on a theorem of probability theory itself, rather than on the purely philosophical speculation over subjective versus objective, or prior versus posterior versus empirical, into which many critics and supporters have fallen since its birth.

第三章試圖重建統計學最基礎的概念係統。作者認為這個概念係統非常重要,它構成了思考和解決一切統計學問題時最底層的邏輯。理解了這套概念係統,一個人才能較好地駕馭統計學這門方法論;反之,則有可能犯下錯誤而不自知。

Chapter 3 attempts to reconstruct the most fundamental conceptual system of statistics. The author believes this conceptual system is very important: it constitutes the lowest-level logic for thinking about and solving all statistical problems. Only by understanding this conceptual system can one properly command statistics as a methodology; otherwise, one may make mistakes without realizing it.

第四章是作者的一個嚐試,很簡短,也很不成熟,但是希望它能在第三章和第五章之間搭建一座橋梁,以便承上啟下。在作者看來,統計學針對的是測量、分布以及對測量分布的描述和分析,因此,尺度在統計學中本應是一個非常重要的概念或話題。隻不過鑒於作者的學識有限,無法深入和展開。對此感興趣的讀者應該會找到闡述之道,而不感興趣者可以忽略。

Chapter 4 is an attempt by the author. It is very short and rather immature, but I hope it can build a bridge between Chapters 3 and 5. In the author's view, statistics deals with measurement, distribution, and the description and analysis of measurement distributions, so scale ought to be a very important concept and topic in statistics. However, given the author's limited knowledge, I could not pursue it in depth. Interested readers should find their own way to elaborate on it, while uninterested readers may skip it.

第五章討論的是純數學領域的概率論,寫作中借鑒了高世澤教授編撰的《概率統計引論》一書,這是為了在基於自權重的加權期望估計與正態分布、大數定律、中心極限定理等之間架設一座理性的橋梁。人們在這些領域的曆史性探索和貢獻在自加權統計量方麵依然擁有極強的生命力,而且,能被輕鬆拓展到一切具有央化位置的分布之中,而無需一個“滿足正態性分布的假定”作為它們在理論上成立的前提。

Chapter 5 discusses probability theory, a field of pure mathematics; the writing draws on the book “Introduction to Probability and Statistics” compiled by Professor Gao Shize. The aim is to build a rational bridge between the weighted expectation estimation based on self-weights and the normal distribution, the law of large numbers, the central limit theorem, and so on. The historical explorations and contributions in these fields retain strong vitality with respect to self-weighted statistics, and they can easily be extended to all distributions with a central location, without requiring an “assumption of normality” as the premise of their theoretical validity.

從第六章開始才進入本書的關鍵話題,其中最重要的是詳細闡述了關於連續隨機變量的自權重的構建和算法,由此我們可用基於自權重的常用統計量來描述抽樣分布的特征。正如本書封麵的左上圖所示,一個涉及服從正態分布的10萬個隨機模擬樣本點的試驗用散點圖的方式顯示出該自權重算法的正確性。

The key topics of this book begin only with Chapter 6, the most important of which is a detailed exposition of the construction and algorithm of the self-weight of a continuous random variable. With it, common self-weight-based statistics can be used to describe the characteristics of a sampling distribution. As shown in the upper-left figure on the cover of this book, an experiment involving 100,000 randomly simulated sample points obeying a normal distribution demonstrates, in the form of a scatter plot, the correctness of the self-weight algorithm.

第七章在引入自權重和自加權均數的基礎上討論差異性檢驗,僅選擇了最簡單的t檢驗法、方差分析法、非參數的秩和檢驗等。作者提出了對t值的調整算法以規避方差齊性檢驗。

Chapter 7 discusses difference testing after introducing self-weights and self-weighted means, choosing only the simplest methods: the t-test, analysis of variance, and the non-parametric rank-sum test. The author proposes an adjustment algorithm for the t value to circumvent the test for homogeneity of variance.

第八章討論了簡單直線回歸、多項式曲線回歸、多維線性回歸以及對數率比回歸等常用統計模型,其中,通過將離散型因變屬性改為數值型連續可變屬性而將對數率比回歸模型的算法改為常規線性回歸模型的算法應該是本書所做的一個大膽創新。討論這些模型的目的正是為第九章的分段回歸奠定基礎。

Chapter 8 discusses common statistical models such as simple linear regression, polynomial curve regression, multidimensional linear regression, and logistic regression. Converting the algorithm of logistic regression into that of a conventional linear regression, by transforming the discrete dependent vattribute into a numerically continuous vattribute, should count as a bold innovation of this book. The purpose of discussing these models is precisely to lay the foundation for the piecewise regressions of Chapter 9.

第九章是關於分段回歸,在引入自權重後重建了過去20多年裏由本作者提出的加權臨界點估計的算法,由此解決了加權分段回歸分析中最關鍵的難題。本書封麵右下方的那個黃色分布曲線極好地展示了該算法在隨機模擬500個樣本(共計17500個隨機點)中對500個臨界點的估計的收斂性和準確性。

Chapter 9 concerns piecewise regression. After introducing the self-weight, it reconstructs the weighted threshold estimation algorithm proposed by the author over the past 20-plus years, thereby solving the most critical problem in weighted piecewise regression analysis. The yellow distribution curve at the lower right of the book's cover excellently demonstrates the convergence and accuracy of the algorithm's estimates of 500 thresholds in a random simulation of 500 samples (17,500 random points in total).

第六~九章涉及的案例分析均采用了前後對比的篇章結構,以展示當前基於算術均數的統計算法與基於凸自加權均數的統計算法對同一案例的差異性。在為每個案例展示必要的原始數據和中間計算過程的同時,作者盡可能地製作了一些統計圖,以便讀者能在直觀方式下體驗兩種算法的差異。讀者應該能從作者對內容的編排中發現兩者孰優孰劣,因為統計學自己就是一門用數據和圖表說話的方法論。

The case analyses in Chapters 6 to 9 all adopt a before-and-after structure to show, for the same case, the differences between the current statistical algorithms based on the arithmetic mean and those based on the convex self-weighted mean. While presenting the necessary original data and intermediate calculations for each case, the author has done his best to produce statistical charts so that readers can experience the differences between the two kinds of algorithms intuitively. Readers should be able to judge which of the two is better from the author's arrangement of the contents, because statistics is itself a methodology that speaks through data and charts.

本書名中用了“哲學”二字,是因為自己在過去的幾十年思考過程中會經常冒出許多思想火花。我有時會感到很奇怪,為什麽會有這麽多新問題、新概念、新思想等在不經意間冒出來?我想這可能是個哲學問題。所以在第一章討論了一些與認識論和邏輯等有關的哲學概念,尤其是關於人類的抽象思維和推理。從作者的個人經驗看,它們在統計學的方法論構建中非常重要,因為它們不僅可以幫助我發現統計學這門學科中的問題,也可以幫助我發現自己思維過程中產生的謬誤。隻有找出了問題,才有可能找到解決問題的路徑。因此,我將第一章視為自己在這段探索路徑上最基礎的方法論,它堪比一切統計方法之母。

The word “philosophy” appears in the title of this book because many sparks of thought kept emerging during my thinking over the past few decades. I sometimes found it very strange: why did so many new questions, new concepts, and new ideas pop up inadvertently? I think this may be a philosophical question. Therefore, the first chapter discusses some philosophical concepts related to epistemology and logic, especially human abstract thinking and reasoning. In the author's personal experience, they are very important in the construction of statistical methodology, because they helped me discover not only problems in the discipline of statistics but also fallacies arising in my own thinking. Only by identifying a problem can one possibly find a path to solving it. I therefore regard the first chapter as the most basic methodology on this path of exploration, comparable to the mother of all statistical methods.

本書將“隨機變量”改稱“可變屬性”,這是因為,統計學隻討論隨機係統中的問題,這個係統中的一切要素都具有“隨機性”,因此,統計學針對的“變量”可以無需特別地用“隨機”這個形容詞來修飾。隻有在跨學科討論時,為避免術語使用上的歧義,才需要用“隨機”加以限定。此外,在英語中,“randomly variable + 一個名詞”才是術語“random variable”的真實含義,後者不過是前者被簡化後的一個變體。由此,我們找到了統計學真正的研究對象。這一概念的更新可被視為統計學向其研究對象的本體的回歸。Bootstrap法的奠基人,當代著名統計學家Dr. Efron在回應我這個問題時說:“Random variable does mean ‘something randomly variable’”。那麽,那個名詞是什麽呢?他沒能告訴我。我思考良久,也得益於在CPDR/USUHS工作期間管理和運作一個數據庫時,該數據庫係統正是用了屬性(Attribute)來取代傳統上使用的“變量(Variable)”。可見,已經有人在用了。

This book renames “random variable” as “variable attribute” (simplified to “vattribute”) because statistics only discusses problems in random systems, and all elements of such a system carry “randomness”; therefore, a “variable” addressed by statistics need not be specially qualified by the adjective “random.” Only in interdisciplinary discussions is the qualifier “random” needed to avoid terminological ambiguity. Furthermore, in English the true meaning of the term “random variable” is “randomly variable + a noun”; the former is just a simplified variant of the latter. Thus we have found the real research object of statistics, and this conceptual update can be regarded as a return of statistics to the ontology of its research object. Dr. Efron, the founder of the Bootstrap method and a famous contemporary statistician, said in response to my question: “Random variable does mean ‘something randomly variable’”. So what is that noun? He could not tell me. I thought about it for a long time, and benefited from the fact that the database system I managed and operated while working at CPDR/USUHS used “attribute” in place of the traditional “variable.” Evidently, some people were already using it.

當然,作者不會忽視隨機性,也不會埋沒這個術語,而是將它與統計測量的最小單元“個體”相結合組成了一個新術語——隨機個體(randomid)。通過這個術語的構建,總體和樣本中的隨機性被保留在每一個個體之上。

Of course, the author neither ignores randomness nor buries the term. Instead, it is combined with “individual”, the smallest unit of statistical measurement, to form a new term: randomid. Through this construction, the randomness in a population and in a sample is retained on each individual.

作者在書中將Logistic Regression翻譯為“對數率比回歸”,這一翻譯基於logistic模型的數學構建。此外,還將Bootstrap Method翻譯為彼替法。目前中文語境下可見的翻譯是“自助法”。個人認為這個翻譯有點莫名其妙,詞不達意。從Bootstrap法的計算流程看,它就是試圖用“另一個東西代替某個需要進行統計處理的對象”。所以,彼替一詞比較達意。它是從Bootstrap這個英文單詞中取了B和T兩個字母的發音而組成的一個統計學的中文術語。

In this book the author translates Logistic Regression as “對數率比回歸” (“log-rate-ratio regression”), a translation based on the mathematical construction of the logistic model. In addition, the Bootstrap Method is translated as “彼替法”. The translation currently seen in Chinese contexts is “自助法” (“self-help method”), which I personally find rather baffling: the words fail to convey the meaning. Judging from the computational process of the Bootstrap method, it essentially replaces an object that requires statistical processing with another one. The word “彼替” (pronounced “Biti”), meaning roughly “that one replacing it”, is therefore more expressive; it is a Chinese statistical term formed from the sounds of the letters B and T in the English word “Bootstrap”.
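The substitution idea described above can be sketched in a few lines. The following is a minimal, generic illustration of the standard nonparametric bootstrap, here estimating the standard error of a sample mean; `bootstrap_se` is an illustrative name, and nothing in it reproduces the author’s own programs.

```python
import random

def bootstrap_se(sample, n_resamples=2000, seed=42):
    """Estimate the standard error of the sample mean by resampling.

    Each resample "replaces" the original sample with another set of the
    same size drawn from it with replacement -- the substitution idea
    behind the translation discussed above.
    """
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]  # draw with replacement
        means.append(sum(resample) / n)
    grand = sum(means) / n_resamples
    var = sum((m - grand) ** 2 for m in means) / (n_resamples - 1)
    return var ** 0.5

data = [2.1, 3.4, 1.8, 4.2, 2.9, 3.7, 2.5, 3.1]
se = bootstrap_se(data)
```

With typical data the bootstrap estimate comes out close to the analytic standard error s/√n, which is one quick sanity check when adapting a sketch like this.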

作者將自己編寫的關於自權重和幾個加權分段回歸算法的SAS程序連同幾個本書所需的統計量分布表附在書後供讀者拷貝和參考使用。我很可能不是一個比較優秀的SAS程序員,但在編程分析本書涉及的數據時應該是合格的,因而可以保證這些程序一定能幫助讀者算出正確的結果。

The author has appended his own SAS programs for self-weighting and for several weighted piecewise regression algorithms, together with several distribution tables of the statistics required by this book, at the back of the book for readers to copy and consult. I am probably not an outstanding SAS programmer, but I should be competent at programming the analyses of the data involved in this book, so I can assure readers that these programs will compute correct results.
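For readers without access to SAS, the general flavor of piecewise (segmented) regression can be conveyed by a textbook two-segment least-squares fit with a grid search over candidate breakpoints. This is a deliberately plain, unweighted Python sketch; it does not reproduce the author’s self-weighting or convex-weight algorithms, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def best_breakpoint(xs, ys, min_seg=3):
    """Grid-search a single breakpoint minimizing the total SSE
    of two separately fitted OLS lines (left and right segments)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(min_seg, len(pairs) - min_seg + 1):
        left, right = pairs[:k], pairs[k:]
        _, _, sse_l = fit_line([p[0] for p in left], [p[1] for p in left])
        _, _, sse_r = fit_line([p[0] for p in right], [p[1] for p in right])
        total = sse_l + sse_r
        if best is None or total < best[0]:
            best = (total, pairs[k - 1][0])  # breakpoint at last x of left segment
    return best  # (total_sse, breakpoint_x)
```

On data with a clean kink, e.g. y = x up to x = 5 and y = 10 − x after it, the search recovers a breakpoint at the kink (the kink point itself lies on both lines, so either neighboring split is equally good).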

當然,由於本人對文獻的閱讀量非常有限,無法涉獵統計學曆史上所有他人對其思想和方法論的貢獻。如果我在本書中所說的話語在前人的書籍和文獻中已存在卻未加注明引用,那麽,首創那些話語的榮耀屬於他們,而我不過是一個碰巧也形成了類似觀點的後來者而已。此外,我不認為我在此的所有思考、觀點和方法及其語言表達都正確無誤。由於本人專業學識非常有限,而且思考所及大大超越了我的醫學和公共衛生專業範疇,所以,一些無知、淺薄甚至錯誤之處在所難免,但願這些不足和錯誤能夠作為一種催化劑激發他人的睿智和遠見,或者一麵鏡子照亮他們腳前的黑暗,或者作為一段階梯助他人登上更高的山峰。

Of course, since my reading of the literature is very limited, I could not cover all the contributions to the ideas and methodologies of statistics made by others throughout its history. If statements I make in this book already exist in earlier books and papers without being cited, the glory of originating them belongs to those authors, and I am merely a latecomer who happened to form similar opinions. Nor do I claim that all of my thoughts, opinions, and methods, or their linguistic expression here, are free of error. Since my professional knowledge is very limited, and my thinking extends far beyond the scope of my medical and public-health professions, some ignorance, superficiality, and even mistakes are inevitable. I hope these deficiencies and errors can serve as a catalyst to spark others’ wisdom and farsightedness, a mirror to light the darkness before their feet, or a stairway to help them climb higher peaks.

中國曆史上有過許多偉大的先賢,其中一位在約1400年前曾說過:“以史為鏡,可知興替。”他,就是唐朝的第二位皇帝唐太宗李世民。另一位則在1965年用一首富有浪漫激情卻又不失睿智的詩詞鼓勵其國人道:“世上無難事,隻要肯登攀!”他,就是近代中國的曆史巨人毛澤東。

There have been many great sages in Chinese history. One of them said, about 1400 years ago: “With history as a mirror, one may know the rise and fall of dynasties.” He was Li Shimin, Emperor Taizong of the Tang Dynasty. Another encouraged his countrymen in 1965 with a poem full of romantic passion yet no lack of wisdom: “Nothing in the world is difficult, if only one is willing to climb!” He was Mao Zedong, a historical giant of modern China.

五、個人權利聲張 (The Assertion of Personal Rights)

在結束本序之前,作者想談談自己對本書中所涉新統計算法的權利。

Before closing this preface, the author would like to state his rights to the new statistical algorithms covered in this book.

由於某種長期且普遍存在的蒙昧,學術界將統計學歸入數學分支學科,這直接導致了當今世界各主要國家和經濟體的專利法也都存在著將統計算法等同於純數學公式而不允許其獲得專利保護的歧視性規定;再考慮到幾乎所有統計軟件均為商業盈利性產品的現實,在無法通過正當法律途徑保障和捍衛個人恰當權益的窘境下,作者不得不以謙卑的姿態在此闡明自己對書中由作者發明、設計、構造和改進的統計算法的權利,因為它們都是具有創新性、實用性和可改進性的統計測量工具,在這一點上,它們與那些純數學中不可更改的定理和計算公式有著質的差別。作者認為,現在已到了推動修法廢除這種侵害他人正當權益、毫無合理性和公正性的歧視性法規的時候。我呼籲廣大的統計人團結起來捍衛自己的恰當權益,因為所有的統計方法都浸透著每個創立者的智慧和辛勞,他們理應得到全社會的尊重和法律保護。

Due to a long-standing and widespread ignorance, the academic community classifies statistics as a branch of mathematics. This has led directly to discriminatory provisions in the patent laws of all major countries and economies today, which equate statistical algorithms with pure mathematical formulas and deny them patent protection. Considering also the reality that almost all statistical software is a commercial, for-profit product, and facing the dilemma of being unable to protect and defend his proper personal rights and interests through legitimate legal channels, the author must humbly clarify his rights to the statistical algorithms invented, designed, constructed, and improved by him in this book. These algorithms are all innovative, practical, and improvable statistical measurement tools; in this respect, they differ qualitatively from the immutable theorems and calculation formulas of pure mathematics. The author believes the time has come to push for legislative amendments abolishing such unreasonable and unjust discriminatory provisions, which infringe on the legitimate rights and interests of individuals. I call on statisticians everywhere to unite in defense of our proper rights and interests, for every statistical method is imbued with the wisdom and hard work of its founder, who deserves the respect and legal protection of the whole society.

本書所涉及的全部統計算法大致可被分為三類。

All statistical algorithms covered in this book can be roughly divided into three categories.

第一類是作者引用的既有統計算法,例如,算術均數,以及基於算術均數的方差、標準差、t統計量、F統計量、相關係數和回歸模型等的算法,等等;又如,基於沒有實質含義和算法流程的抽象權重的加權均數的計算公式,等等,本作者不對它們聲張任何個人權利。

The first category comprises existing statistical algorithms cited by the author: for example, the arithmetic mean and the arithmetic-mean-based algorithms for the variance, standard deviation, t-statistic, F-statistic, correlation coefficient, and regression models; and, as another example, the formula for a weighted mean based on abstract weights that carry no substantive meaning or algorithmic process. The author claims no personal rights over these.

第二類是作者對現有算法的改進,例如,基於凸權均數的離峰差、標準離峰差、t統計量、F統計量、相關係數和回歸模型等的算法,等等,作者對此類算法聲張和保留自己的權利。

The second category comprises the author’s improvements to existing algorithms: for example, the algorithms for the deviation from peak, standard deviation from peak, t-statistic, F-statistic, correlation coefficient, and regression models, all based on the convex self-weighted mean. The author claims and reserves his rights to the algorithms in this category.

第三類包括那些由作者獨立地首創、設計和構造的統計算法,例如,關於連續隨機變量的自權重的完整算法、基於凸權均數的正態化、t檢驗中的t值校正係數的完整算法、三分回歸的完整迭代搜索流程、基於全域模型和分段模型的預測和殘差的凸權均數的分段模型的回歸權重的多種算法的完整流程、基於回歸權重的加權期望臨界點的完整算法、分段模型在加權期望臨界點處的連續性檢驗的完整算法、分段模型的擬合優度,等等,作者對此類算法聲張並保留自己的權利。

The third category includes the statistical algorithms independently originated, designed, and constructed by the author: for example, the complete algorithm for the self-weighting of continuous random variables; normalization based on the convex self-weighted mean; the complete algorithm for the t-value adjustment coefficient in the t-test; the complete iterative search process for trichotomic regression; the complete process of the various algorithms for the regressive weights of piecewise models, based on the convex self-weighted means of the predictions and of the residuals of the fullwise and piecewise models; the complete algorithm for the weighted expectation of the threshold based on the regressive weights; the complete algorithm for the continuity test of piecewise models at the weighted expected threshold; and the goodness-of-fit of piecewise models. The author claims and reserves his rights to the algorithms in this category.

以上第二和第三類統計算法,非經作者同意,任何個人或法人實體等均不得將其用作商業盈利之目的,例如將它們中的任何一個編程寫入商業化並進入市場銷售或租賃的統計軟件產品中,也不得將其編程寫入非盈利性的免費統計軟件產品中;否則,作者將向這些侵權者追究任何形式的侵權責任。

Without the author’s consent, no individual or legal entity may use the second and third categories of statistical algorithms above for commercial, profit-making purposes, such as programming any of them into a statistical software product that is commercialized and put on the market for sale or lease; nor may they be programmed into any non-profit, free statistical software product. Otherwise, the author will pursue infringement liability in any form against the infringers.

這裏強調個人權利的聲張僅限於對上述各個新算法的完整流程,表明作者主動放棄對這些算法中的非完整計算流程聲張個人權利。此外,任何為了統計方法學的研究和改進而引用任一這些被作者聲張了權利的算法的行為都不受所聲張權利的限製;但是,如果您將自己必須引用這些被作者聲張了權利的算法的研究成果用於專利申請或個人權利聲張,請注意您所聲張權利的界限,不得因此損害本作者的相關權利。

It is emphasized here that this assertion of personal rights is limited to the complete process of each new algorithm mentioned above; the author voluntarily waives any assertion of personal rights over the incomplete calculation processes of these algorithms. In addition, citing any of the algorithms over which the author has asserted rights, for the purpose of researching and improving statistical methodology, is not subject to the asserted rights. However, if you intend to apply for a patent, or to assert your own personal rights, over research results that necessarily cite these algorithms, please mind the boundaries of the rights you claim and do not thereby damage the relevant rights of the author.

此自序寫於

This preface was written

2019年3月9日 ~ 2025年11月9日

March 9, 2019 ~ November 9, 2025

馬裏蘭州洛克維爾和印第安納州卡梅爾家中

at the homes in Rockville, Maryland and in Carmel, Indiana