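Wisdom Is Wealth

Ligong Chen's (陳立功) Wenxuecity blog: roaming far and wide, discussing past and present, scattering blossoms under the moon, making friends through writing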

Ligong Chen's Definition of Piecewise Regression and the Basic Conceptual System of Statistics

(2010-10-30 21:06:56)

Link to my (陳立功, Ligong Chern) paper on the trichotomic regression analysis method:
http://www.meetingproceedings.us/2009/jsm/contents/papers/303243.pdf

        I. What Is Piecewise Regression?
        In statistics, piecewise regression analysis (PRA), or piecewise regression for short, is a method or analytical strategy within general regression analysis. It seeks one or more random critical points (thresholds) in a segmented, continuably measured random sample space so as to divide the whole sample space into two or more sub-spaces, and then fits a threshold model to each sub-space. A set of randomly variable regression models thus describes and predicts the complex regression relationships over the whole random space. With the methods and techniques of PRA, we may be able to follow or change the complex relationships in a random space in order to realize a particular purpose. A complete strategy for general regression analysis should therefore consist of a fullwise regression analysis and a piecewise regression analysis [1]. Given this definition, it is not hard to see that PRA should sit at the top of the methodology of statistics [2].

        The regression models fitted in a piecewise regression analysis are sometimes called piecewise models, threshold models, or segmented models. The three terms should share the same connotation; they are synonyms.
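
As an illustration only, the search for a single threshold can be sketched in Python as a brute-force least-squares scan. This is a generic two-segment fit, not the trichotomic method of the paper linked above; the function names `fit_line` and `fit_two_piece`, the `min_points` parameter, and the toy data are all assumptions introduced here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, sse)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


def fit_two_piece(xs, ys, min_points=3):
    """Scan every split of the x-sorted sample; keep the split with the smallest total SSE.

    Returns None when no admissible split exists.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(min_points, len(pairs) - min_points + 1):
        left, right = pairs[:k], pairs[k:]
        if left[-1][0] == right[0][0]:
            continue  # tied x values cannot be separated by a threshold
        a1, b1, sse1 = fit_line(*zip(*left))
        a2, b2, sse2 = fit_line(*zip(*right))
        if best is None or sse1 + sse2 < best["sse"]:
            best = {
                "threshold": (left[-1][0] + right[0][0]) / 2,  # midpoint of the gap
                "left": (a1, b1),    # (intercept, slope) below the threshold
                "right": (a2, b2),   # (intercept, slope) above the threshold
                "sse": sse1 + sse2,
            }
    return best


# Toy data (made up for illustration): y = x below the break, y = 14 - x above it.
xs = list(range(10))
ys = [0, 1, 2, 3, 4, 9, 8, 7, 6, 5]
model = fit_two_piece(xs, ys)
print(model["threshold"], model["left"], model["right"])
```

The scan treats the threshold as a fixed cut chosen by minimum error. The definition above regards thresholds as random points, so a fuller treatment would also have to describe their sampling variability, which this sketch does not attempt.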

        II. The Basic Conceptual System of Statistics
        To understand the trichotomic regression analysis that I proposed accurately and in full, several new concepts need to be introduced into the current basic conceptual system of statistics and probability theory, and several existing basic concepts need to be clarified or even redefined. Owing to the limited space of the JSM proceedings, I had no chance to do this in the paper. Without the concepts stated here, anyone might find the ideas and the method in the paper very difficult to understand, since the connotations of some existing concepts have been adjusted and some new concepts have emerged. So I would like to use this space to give the explanation.

        Individual: in the domain of epistemology, an individual is an independent existence, substance, entity, or object with all of its own known, knowable, and unknown attributes, by which it can be distinguished from all other individuals. Anything that exists as the smallest unit in a specific scope can be called an individual. When an individual enters the observation of a subject and can be cognized or recognized, every one of its attributes should be certain rather than uncertain. In other words, an individual is itself rather than anything else because all of its attributes are certain, at least at the moment of cognition or recognition. Conversely, if all of its attributes are uncertain when the subject observes it, the subject has no way to know it; that is, it is immeasurable to the subject.
 
        Attribute: an attribute of an individual, denoted by A (in Kunstler Script font), is an abstraction of one of its characteristics. Such characteristics are usually of two kinds, qualitative and quantitative, and with them we may define at least one group or category among many individuals. For example, an individual may have a name, gender, age, height, and weight. Every attribute is unique and carries a specific meaning.
 
        Sub-attribute: a sub-attribute is an affiliated attribute defined under the name of an attribute, for example, Name={Aristotle, Bacon, Hegel}, Gender={male, female, abnormity}, or Age={a value in the range [0, 140], e.g. 2, 35 or 86 years old}, where {Aristotle, Bacon, Hegel}, {male, female, abnormity} and {2, 35 or 86 years old} are sub-attributes defined under the names Name, Gender and Age, respectively.

        Invariable attribute: an attribute is said to be invariable if (1) it is itself, or (2) no sub-attributes can be defined under its name, or (3) sub-attributes exist but it is unnecessary to define them. Such an attribute can be regarded as having no change or variability during an observation or experiment, so it can be used to define a group or category clearly, for example, Gender=male, or Age>=18, or Gender=male and Age>=18.

        Variable attribute: an attribute is said to be variable if at least two different sub-attributes can be defined under its name in an observation or experiment, and every sub-attribute is distinguishable and clearly defined, without any confusion or conflict with the others. The concept of a variable attribute is therefore equivalent to the concept of a random variable in the current system, for example, Gender={male, female, abnormity}, 0<=Age<=140, etc.

        Discretely variable attribute (DVA): a variable attribute is said to be discretely variable if all the sub-attributes defined under its name are qualitative, for example, locations and schools, trees and lakes, diseases and treatments, etc.
 
        Continuously variable attribute (CVA): a variable attribute is said to be continuously variable if all the sub-attributes defined under its name are quantitative, for example, height and weight, speed and acceleration, volume and ratio, etc.
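
The attribute definitions above can be read as a small classification rule. The following Python sketch is one possible reading, not part of the original conceptual system: the `Attribute` class, the `classify` function, and the extra "mixed" label are hypothetical, and "qualitative" and "quantitative" are approximated here by strings and real numbers.

```python
from dataclasses import dataclass
from numbers import Real


@dataclass(frozen=True)
class Attribute:
    name: str
    sub_attributes: tuple  # sub-attributes defined under this name in one observation


def classify(attribute):
    """Label an attribute by the sub-attributes defined under its name."""
    distinct = set(attribute.sub_attributes)
    if len(distinct) < 2:
        # No second sub-attribute is defined, so nothing varies in this observation.
        return "invariable"
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in distinct):
        return "continuously variable"  # all sub-attributes are quantitative
    if all(isinstance(v, str) for v in distinct):
        return "discretely variable"    # all sub-attributes are qualitative
    return "mixed"                      # assumption: a case the text does not cover


print(classify(Attribute("Gender", ("male",))))
print(classify(Attribute("Gender", ("male", "female", "abnormity"))))
print(classify(Attribute("Age", (2, 35, 86))))
```

Note that the same attribute name can be invariable in one observation (Gender=male defines the group) and variable in another, which is why the rule looks at the sub-attributes actually defined and not at the name.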

        Population or population space: a population, denoted by P (in Kunstler Script font), is a group or set of individuals that share the same invariable and variable attributes. The individuals in a population constitute a space, the population space. A population is usually considered infinite, because the number of its individuals may be infinite or too large for all of them to be observed in one finite observation. A population may enter the scope of a particular observation or experiment carried out by a subject or a group of subjects.
 
        Scale space: a scale space, denoted by Ω, is the space formed by all sub-attributes of a variable attribute, without duplicates or conflicts, or by all possible outcomes of an observation or experiment; a questionnaire for a statistical survey is one example. A scale space is thus a set of invariable and variable attributes, and the set cannot be empty because it is a tool for statistical measurement. The scale space defined here corresponds to what the current theory of probability calls the "sample space". Clearly, though, a scale space should not be called a sample space, since it is only a measurement tool and not the sample itself.
 
        Measure: a measure, denoted by M, is a measuring action with a specific purpose within a given observation or experiment. It is usually carried out by at least one subject in order to obtain original records or cognitions of the invariable and variable attributes of a certain number of individuals in a population. In statistics in particular, every measure is a random measure, since every measured object is obtained at random.
 
        Distribution: a distribution, denoted by D, is the expression on a scale space of the results observed on individuals; that is, it is the result of a measuring action on a scale space.
 
        Sample: a sample, denoted by S, is the complete set of results for all individuals observed in a measuring action; it is therefore a complete distribution over a scale space. A sample is a random subset of a population. There is no independent sample without an associated scale space, and vice versa. In the domain of statistics, a sample is often called a dataset. Because the individuals in a population are usually infinite, a sample should be obtained through a random mechanism so that its representativeness of the population is guaranteed to some degree. Any sample in the domain of statistics is therefore a random sample. An individual in a sample is often called an observation, a random sample point, or simply a sample point. An individual, observation, or sample point in a sample therefore should not itself be called a "sample"; otherwise the concepts become confused or even conflict, except when only one individual is observed in a measuring action, in which case the sample equals that individual. In general, a sample as a whole is an individual in another scope of observation, but an individual different from the individuals within the sample. As an "individual", the sample should have its own attributes, the sample attributes, and each of them should be certain, just as with the attributes of the individuals in a population discussed above.
 
        Sample space: a sample space, which shares the symbol S with the sample, can be the sample itself or the sample dataset. In any sample there should be no duplicate records of individuals, so each sample point is an independent element, even when the sample contains only one discrete variable with two or more sub-attributes and three or more observed individuals. Put the other way round: if a sample itself cannot be called a sample space, what else can? In fact, all the individuals in a sample constitute a complete space, and this space is the sample space.
 
        Measurable space: a space is said to be measurable if every individual in it can be measured on a scale space. A population is thus a measurable space, since all individuals in it should be measurable on a scale space.
 
        Measured space: a space is said to be measured if every individual in it has been measured on a scale space, regardless of whether the measurement of any particular individual succeeded. A sample is thus a measured space.
 
        Random mapping: a random mapping, denoted by M (in Kunstler Script font), is a random mechanism by which a sample or sample space is obtained from a measurable space or population through a scale space.
 
        Probability space: a probability space, denoted by P, is a sample space that has been probabilized into 1. We cannot define a probability space over a population space or measurable space, since a scale space may not be complete for a population but is complete for a sample. In addition, a population space is usually unknown, so defining a probability space over it would only give us an unknown space, and such a definition is in vain. Nor can we define a probability space over a scale space alone, since the scale space is only a measurement tool and not the real random world that we try to know through probability. A probability space should instead be defined over a scale space on which all the measured individuals of a sample space are distributed, since the sample space is a distribution over the scale space. Only the sample space is therefore a complete space that can be probabilized over the scale space. Of course, a definite complete space that is well defined in mathematics may also be probabilized into 1, as long as it satisfies certain conditions set by the existing knowledge system; examples are the theoretical distributions, such as the normal distribution, the standard normal distribution, the t-distribution, the F-distribution, and the chi-square distribution. How to probabilize a sample space therefore belongs to the domain of mathematics, especially the theory of probability.
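
One simple way to read "probabilizing a sample space into 1" is to take relative frequencies of the sample over the scale space. The sketch below assumes that reading; the function name `probabilize` and the example records are hypothetical and serve only as illustration.

```python
from collections import Counter
from fractions import Fraction


def probabilize(sample, scale_space):
    """Distribute a sample over a scale space and normalize the counts to sum to 1."""
    if not sample:
        raise ValueError("an empty sample cannot be probabilized")
    unmeasurable = set(sample) - set(scale_space)
    if unmeasurable:
        raise ValueError(f"records outside the scale space: {sorted(unmeasurable)}")
    counts = Counter(sample)  # the distribution of the sample over the scale space
    n = len(sample)
    # Exact fractions keep the total at exactly 1, with no rounding error.
    return {category: Fraction(counts[category], n) for category in scale_space}


# Scale space for Gender as defined earlier; the records are made up.
scale_space = ("male", "female", "abnormity")
sample = ["male", "female", "female", "male"]
probabilities = probabilize(sample, scale_space)
print(probabilities)
print(sum(probabilities.values()))
```

The scale space alone carries no probabilities here; they arise only once a sample is distributed over it, which mirrors the argument in the paragraph above.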
 
        Continuity and continuability of space: because a population is infinite, we cannot discuss continuity directly over a population space, but we can do so through a sample. There are two different concepts here: one is a continuous space, and the other is a continuable space. A continuous space is not the same as a continuable space. A sample space is said to be continuous if all of its individuals lie in one definite sub-sample space or in the whole sample space itself. For example, the height records of 100 males and the height records of 100 females can each be regarded as a continuous space rather than a continuable space. If we mix the two, however, the height records of the 200 people will be regarded as a continuable space rather than a continuous space, since the mixed space consists of two identifiable spaces that may overlap or be separated. Nevertheless, the mixed space can still be measured in a continuous manner as a single whole, and defined as a continuous space under the attribute "human height".
 
        Indivisibility and divisibility of space: the divisibility of a space is easy to understand when the space is discrete. What used to be philosophically difficult was the divisibility of a continuous space. A brick, for example, is one complete continuous space, and to divide it we would have to break it. Once the concept of a continuable space is introduced into the knowledge system, however, this understanding meets no logical obstacle. A space formed by gluing two bricks together, for example, can also be regarded as a continuable space, yet it is a separable one. A continuable space is therefore not the same as a continuous space, and a continuable space may be divisible.
 
        Statistic: a statistic, denoted by s, is an attribute of a sample or sample space. Since a sample is a random subset of a population, a statistic is a random point measure. It is also regarded as a real measurable function defined over a sample space, that is, over a probability space. What statistics does is to construct specific statistics that describe the attributes of a sample space and from them to infer the corresponding attributes of the population space, which are usually called by a specific term, parameters. A statistic is thus a random constant rather than a constant in the general mathematical sense; the latter carries no qualifying adjective and is simply itself. It follows that all the records in a sample can also be understood as random constants. In the domain of statistics, a constant is said to be random only with respect to the sample. We can therefore say that a statistic is certain for the given sample but uncertain for the population. However, a sample statistic may differ among sub-samples, and between a sub-sample and the whole sample, since any sub-sample contributes less information to its own statistic. For example, a single fullwise regression model provides one certain, invariable regression relationship over the whole sample space, whereas a piecewise regression model gives us a set of different regression relationships over different threshold intervals. The whole sample space can thus be segmented into several pieces or segments.
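
The point that a statistic is fixed for a given sample yet differs between sub-samples and the whole sample can be shown with a small Python sketch. The function name `statistic_by_subsample` and the height records are made up for illustration and are not data from the paper.

```python
from statistics import mean

# Hypothetical height records in cm, labelled by sub-sample.
sample = [
    ("male", 170), ("male", 175), ("male", 180),
    ("female", 158), ("female", 162), ("female", 166),
]


def statistic_by_subsample(records, statistic=mean):
    """Apply one statistic to the whole sample and to each labelled sub-sample."""
    groups = {}
    for label, value in records:
        groups.setdefault(label, []).append(value)
    result = {"whole sample": statistic([value for _, value in records])}
    for label, values in groups.items():
        result[label] = statistic(values)
    return result


summary = statistic_by_subsample(sample)
print(summary)
```

Each value in `summary` is a definite number for this dataset, a "random constant" in the sense used above, while a different random sample from the same population would generally yield different values.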
 
        Parameter: a parameter, denoted by p, is an attribute of a population, usually estimated and inferred with a corresponding sample statistic. In a statistical estimate it can be treated as invariable, and this assumption does no harm to the population. However, we must recognize that it should be variable over its own natural history.
 
        Random space or random system: within the scope of the problems discussed here, a random space or random system, denoted by R (in Kunstler Script font), is an abstract concept associated with all the concepts above. It is a generalized concept and not any one or several of the concrete concepts stated above; in other words, all of the above concepts together constitute a complete random space. A random space may contain a degree of certainty, owing to the invariable attributes that define a population, the random constants of the individuals in a sample, and all the statistics of the sample itself; our conclusions in describing and inferring the population therefore also carry a degree of certainty. However, we must remember that the uncertainty of any sample with respect to its population is absolute, so all descriptions of a population based on a sample are essentially random or uncertain.
 
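(Note: this conceptual system was put forward on 18 October 2010 in the Wikipedia entry on Piecewise regression analysis. Because it was suspected of being original research and might provoke a major academic controversy, Wikipedia administrators deleted the entire entry on the 27th of that month.)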

