[{"id":0,"href":"/posts/sum-check/","title":"Notes on Sum-Check Protocol","section":"Blog","content":" ◀ ▶ Page: / "},{"id":1,"href":"/posts/vbt/","title":"My English translation of an ancient sanskrit Meditation book","section":"Blog","content":"This is the English translation after I finished my study in varanasi by end of \u0026lsquo;19 (right before the pandemic). I think it is time to make this public. If you are interested to know more, there is a full annotation on Amazon.\nWithout further ado, here is the full text:\n1\nSri Devi asked:\nI heard that everything is illusion.\nWe should follow the instruction of Rudrayāmala to enter union.\nTrika as a totality is not separable.\nBest of the best guidance, please explain part by part.\n2\nToday many questions are left unfinished\nAnd doubted, highest creator.\nWhat is real rūpa, deva?\nWhat is sound-quantity (śabdarāśirna), manifested images (kalā), and illusions\n(maya)?\n3\nWhat are nine mantras (navātma) for non-duality,\nTo complete the realization of bhairava?\nSince making such wishes, breaking and splitting are already going,\nWhat is śakti and its three-fold composition?\n4\n[What are] Nāda, bindu and māya,\nWhat is half moon (candra-ārdha), what is non-arising (nirodhikāḥ)?\nHow does the mind arise from cakras?\nWhat is the true nature of śakti?\n5\nTranscedentally transcend the parts,\nDestroy dependent subjects,\nTranscend what is in that manner,\nTranscended, thus ascended.\n6\nNeither every countenance is breakable,\nNor the body is, by becoming.\nTranscended, diminished,\nParts will not lead to becoming.\n7a\nMy kind lord, please do condescend,\nDispelling my doubts completely.\n7b-8a\nGreat, great what you asked,\nBest tantra I now speak in a soft tone to you, my love.\nOn many hiding secrets of śiva,\nI am telling you this way.\n8b-9a\nWhat is the part of cit? Forms (Rūpa).\nSo is self-nature (ātma), as part of bhairava.\nThe essence of my teachings is also in this manner,\nThe recognition (vijñana) is like the indra\u0026rsquo;s net (indrajala).\n9b\nMaya is like a dream of the highest cit,\nor like gandharva\u0026rsquo;s city in the sky. 
All ignorance.\n10\nThe purpose of meditation is all about awakening.\nActions (kriya) are vanity for a living.\nFor a man, the only represented knowledge to get united spiritually,\nis seated in vikalpa.\n11\nIn reality, bhairava is not nine forms of mantras.\nNor sound-quantity (śabdarāśirna).\nNor that of trika.\nNor that of three-fold śakti.\n12\nNor sounds, bindu or maya,\nNor half-moon (candrārdha), nor non-arising (nirodhikāḥ),\nNor mixed chakra-practice (cakrakrama),\nNor śakti nor true nature (ātma).\n13\nAbove are for immature minds,\nTerrifying the mentally weak ones.\nAll like mother giving pleasing candies,\nOnly appearing and serving for calling and illustration.\n14\nThe space and time, both manifested images bringing no liberation.\nThe sense of organs illuminate the body but are no body.\nThe idea of one who shows and names is impossible and intoxicating.\nThe highest truth goes on and on but unspeakable.\n15\nThe essence of self-being (svabhava) is internal joy (ānandā).\nvikalpa is the power to be free.\nSo far that what resides are full of manifested images (kārā),\nbhairava is the supreme mind.\n16\nThus, cognize the real body.\nSee the universe of nothing but purity.\nThe manner of true transcendence happens,\nIs the way of [real] ceremony (pūja) or pleasing God.\n17\nThe form of bhairava unrolls,\nWhen you reside everywhere [and nowhere].\nWith the transcendental transcending form (rūpa),\nThe transcendental self-nature is.\n18\nIn that manner of powerful śakti,\nIndivisible, residing everywhere,\nHence the laws (dharma) become laws-in-the-go (dharmavāt),\nWhen śakti goes transcendental, mind goes transcendental.\n19\nDifferent those, [cognize] the dis-manifest (vibhāvyate).\nPerfect cognition on entities,\nBeginning to enter [the supreme state].\n20\nRemain being within śakti,\n[Keep on] creative contemplation on undivided (abheda),\nThat is śiva\u0026rsquo;s form,\nThat is the proper entrance from this world to śiva.\n21\nThe world in that manner is as suggested.\nLike splendid shining beams of light.\nCognition arises as freedom of existence (vibhāgādi).\nThus, śakti goes and śiva is realized.\n22\nMy trident-marked lord,\nSkull-decorations-worn one,\nSpace, time, and emptiness show themselves,\neverywhere and nowhere.\n23\nResiding full of manifested images (kārā),\nHow to become bhairava and understand directly?\nWhose path and entrance [to follow]?\nHow to transcend myself and become [bhairava, the one], I would like to know.\nOn what right I\u0026rsquo;ness can I realize,\nBeing as is and developing bhairava?\n24\nUpward is prāna,\nDownward is jīvo,\nPause (visarga) is ātmā,\nTranscendence is uccāra.\n[Where] the two-fold respiration is born,\n[One finds there] full of nourishments and pacifying.\n25\nBreath in, breath out.\nWhere life meets ether.\nIn this way, bhairava in the move becomes bhairava.\nAnd body etherized.\n26\nNeither let śakti wander nor enter into it,\nLet the form of breaths manifest,\nThrough which, nirvikalpa is in the middle, and\nThe form of Bhairava exists everywhere.\n27\nClear the retained breath (kumbhitā),\nBreath in there, [when] one becomes [the one].\nThus one\u0026rsquo;s inside achieves peace and so does one\u0026rsquo;s consciousness.\nWhen śakti achieves peace, bhairava manifests.\n28\nFrom root (mūla), rays of light illuminate.\nSubtler, subtler, rising upward.\nIn this way, consider love and loathing,\nthe same inside. 
Bhairava arises.\n29\nThe intention (iccha) rise inside in the form of lightnings.\nIn the sequence of cakras [successively].\nAs far as three-fistful upward,\nYour inside [sees] the great arising.\n30\nIn the proper sequence of [going through] the internal-twelfth-finger [dvādaśanta],\nThe twelfth-finger (dvādaśa) melts away, undivided.\nGross, subtle, transcendental,\nFreedom, freedom, inside becomes śiva.\n31\nLike that, breath in, breath out. [Feel] inside of your head crown.\nWith the contraction of eyebrows, break [the crown].\nAfter mind obtains nirvikalpa,\nEverything goes upward. Everything ascends.\n32\nLike peacock\u0026rsquo;s crest [seen from] side, showing variegated colors,\nFive-fold emptiness(śūnya) like a maṇḍala.\nMeditate on such emptiness(śūnya) in one\u0026rsquo;s hollow inside,\nBy entering one\u0026rsquo;s heart (hṛdaya), one becomes (bhairava).\n33\nLike such, follow the chakra sequence (krama),\nContemplate that therein,\nEmptiness (śūnya), non-oneness (divisions, kuḍya), oneness (transcendental savior, parepātre)\nDissolved in one\u0026rsquo;s self. Wish granted.\n34\nPlace the mind inside the skull.\nReside with eyes half-open.\nDuring the sequence (krama), keep the mind firm.\nHighest goal will be achieved.\n35\nDwell in the middle of the middle nāḍī,\n[Where] the string of soul [is like] a fibrous lotus stalk.\nMeditate upon its inside space.\nThere God manifests.\n36\nSee in one\u0026rsquo;s mental eyes that manifested images (kārā) are mere obstructions.\nThe contraction of eyebrows (bhrūbhedā) is the entrance to arising.\nView bindu dissolved in sequence.\nIn the middle of everywhere, the highest stands.\n37\nThe internal law (dhāma) jolts and ascends,\nEnded with subtle flames known as tilakā.\nThe light of bindu is from inside of heart (hṛdaya).\nMeditate on that dissolved in [one\u0026rsquo;s heart]. [One becomes] dissolved [into the\none] .\n38\nUnsurpassed savior is in one\u0026rsquo;s ears.\nSounds can lead to dissolution quickly.\nPerfect brahman sounds bring full accomplishments.\nSupreme brahman walks above.\n39\nDuring utterance of śabda, praṇa ascending and traversing.\nWith [sounds] extended, waned, and echoed, perform creative contemplate (bhāvanā) on their inside emptiness.\nThe emptiness transcends in the same manner as śakti.\nThe emptiness is to be feared.\n40\nNo endeavor on syllables.\nAdhere to the end of the last word, then become [the one].\nEmptier, emptier, become the emptiness, like that.\nRealizing men as manifested images (kārā) of emptiness, [one] becomes [the\none].\n41\nMusic sounds of string instruments are śabda-arrows.\nFar reaching and [strong] arrows. Residing together with sequence (krama).\nThe target is the inside, no-difference mind (citta).\nRealizing body [the same as] the transcendental ether, one becomes [the one].\n42\nEverything is vowel-less mantra (piṇḍamantra).\n[From] the gross syllables, the sequence [begins].\nHalf-moon (ardhendu), bindu, nāda, all internal.\nWhile becoming the uttered emptiness (śūnya), one becomes [the one].\n43\nOne\u0026rsquo;s own body, where all directions are,\nAt the same time (yugapad) all vanishes into ether.\n[As such], nirvikalpa mind begins,\n[When one simultaneously] engages in the void all around.\n44\nAbove is emptiness. Below (mūla) is emptiness.\nSimultaneously (yugapad), both become this way.\nBe indifferent and ignore one\u0026rsquo;s body.\nMind on śakti and śūnya. One becomes [the one].\n45\nAbove is emptiness. 
Below is emptiness.\nStay there when one\u0026rsquo;s heart becomes emptiness.\nAt the same time, [keep in] the nirvikalpa state.\nNirvikalpa arises. Let it be.\n46\n[Feel] emptiness of limbs and rest of body,\nIn milliseconds (kṣaṇa-mātra), one becomes free from existence.\nBecome nirvikalpa, be nirvikalpa.\n[Realize that] nirvikalpa as one\u0026rsquo;s true nature.\n47\nEverywhere, the nature, the whole creation,\nEther is omnipresent.\nFreedom from existence and arising begins.\nContemplate residing with this. One becomes [the one].\n48\nThe division of inside [and outside] the body,\nConsider them becoming like walls.\nNothing inside begins.\nMeditate on having nothing inside. One becomes [the one].\n49\nIn one\u0026rsquo;s heart-ether, dissolve senses from organs.\nIn the middle of the lotus seat,\nNon-difference thought (citta) is shining.\nHighest beauty is obtained.\n50\nAll across self-body,\nThoughts are dissolved in the internal-twelfth-fingers (dvādaśānte).\nFirm awareness becomes firm.\nThe real goal is thus engaged.\n51\nThus, in that manner, over there,\nThe internal-twelfth-fingers (dvādaśānte), cast your mind.\nEvery moment, [be aware that] being in a diminishing manner.\nCauseless and goal-less, one becomes [the one] one day and one night.\n52\nFire of time, light of time,\nArises and fills one\u0026rsquo;s own body.\nContemplate on the inside burning,\nAnd the peaceful lights. Then, one becomes [the one].\n53\nGoing and going, the whole world,\nMeditate on vikalpa burning and burnt.\n[Become] one of no difference thoughts.\n[Become] one of highest being. One becomes [the one].\n54\nSelf-body, the world, as goes.\nSubtler, subtler, [eventually] pervades.\nResidence of the real driver of breaths [is realized].\n[When one] meditates on the transcendental internal ether.\n55\nWhile thick śakti turns thin,\nMeditate within the scope of twelfth-fingers (dvādaśa).\nEntering into heart-chakra (hṛdaya), keep meditating.\nUnbinding from the [worldly] dream is obtained.\n56\nThe main form of worldly path,\nis considering sequences (krama) with vacuity.\nGross, subtle and transcendental, reside.\nThe whole inside mind is dissolved.\n57\nSitting with the whole universe, everywhere,\nThe ubiquitous master inside, is the same with śiva.\nThe real way for ceremonies and devotion,\nOf śiva is to meditate on the great arising.\n58\nNow the universe,\nConsider as being emptiness (śūnya).\nThere [as] such, mind dissolves.\nThere, total dissolution is possessed.\n59\nSee with mental eye that suspension of breath is possessed.\nAll divisions abandoned, removed, cast.\nTotal dissolution arises. One obtains [the one-ness] within a millisecond.\nTotally dissolved, wholly absorbed. One becomes [the one].\n60\nNo trees, hills, walls.\nView places [as] removed and cast.\nBecome mentally dissolved.\n[Allow] subtle-styled to appear.\n61\nIn both ways [of contrasting dual], find cognition (jñana),\nBy meditating on the middle. Seek shelter there.\nSimultaneously (yugapad), one shall abandon the dual.\nThere in the middle, truth shines.\n62\nBecome abandoned and non-departed cit.\nAvoid becoming an inside wonderer.\nThus, [one] becomes the extended middle.\nGreater world is manifesting.\n63\nThe whole body consists of only pure intelligence.\nBecome transcendental of this two-folded world.\nAt the same time, nirvikalpa [arises].\nMind [is transcended and] arises to the highest.\n64\n[Be aware of where] two-fold breaths join.\nInside enters. 
Outside exits.\nYogi [shall have such] sameness recognition (vijñana).\nArising mind is possessed.\n65\n[Meditate on that] the entire world and one\u0026rsquo;s own body goes on,\nFull of self-dependent joy. [One] becomes love.\nAt the same time, one\u0026rsquo;s self [becomes] death-less.\nTranscend [even this] joy as illusion, one becomes [the one].\n66\nVain religious austerities and dramatic ritual performance,\nWill perish and be gone.\nRise with the great joy,\nWhereby the truth manifests.\n67\nAll organs of senses are bondage.\nMove prāna and śakti upward slowly.\nAt the right moment, [one] feels ants crawling,\nwhich manifests the highest delight.\n68\nNow, [burn one\u0026rsquo;s] intellect [as] sacrifice in rites, in the middle.\nThoughts are comforting illusion. Cast away.\nProceeding, only filled with breaths.\nUnit [with] love and joy.\n69\nGet united with śakti, [feel energy running] trembling and agitated.\nCompletely be absorbed and reside in śakti.\nSince comfort is the true creator (brahma),\nIt brings forth one\u0026rsquo;s own ascension.\n70\nTaste the agitation and smoothing.\nRecollect the spreading comfort in full measure.\nBecome śakti this way.\nBecome the inundated joy.\n71\nThe great joy is attained,\n[When] seeing or envisioning a family or friend missing for a long time.\nMeditate on such joy arising.\nAll minds get wholly absorbed. One becomes [the one].\n72\nEating, drinking, [be aware of] joy in such actions.\nThe joy of taste is blossoming.\nBecome filled and reside [with].\nThe great joy. One becomes [the one].\n73\nSelf-apply songs and hearings.\n[Of the] same pleasure, mind\u0026rsquo;s agent of action [is].\nYogis, wholly absorbed as such.\n[From] minds thus produces ātma.\n74\nThere, there, the delight of the no-difference mind.\nThe mind possessed in that very place,\nThus, thus the transcendental joy,\nThe true nature, engaged wholly.\n75\nWhen sleep [is approaching yet] to arrive,\nThe sphere of outside vanished.\n[Where] the self-abiding consciousness [becomes] accessible,\nTranscendence manifests.\n76\nLight is the lamp of emptiness,\nGleaming variegated appearances in ether.\nView those non-rendered (dṛṣṭi nirveśya). Right there,\nTrue self-nature manifests.\n77\nManifested images (kārā) is what is born there.\nOmnipresent bhairava, is no different from a snake.\nIn khecarā\u0026rsquo;s way, view a small part of anything (kalā)\nOne obtained transcendence and [the one] manifests.\n78\n[Adopt] the corpse pose (mṛdvāsana, savasana). Like oars,\n[Leave] arms and legs, [with] no support.\nFix on that union.\nContemplate on the filled transcendence. One becomes [the one].\n79\nIn a proper sitting pose,\nRest one\u0026rsquo;s arms crooked like a cup.\nThink of ether in the armpits.\nFeel extended equanimity. Fully dissolved.\n80\nA gross form object,\nFirmly fix one\u0026rsquo;s view [in mental eyes] settled on it,\nSoon, one will enter support-less being (nirādhāra),\nMind thus goes with the one.\n81\nTongue in the middle with [uvula] widely open,\nThoughts cast down in the middle.\nUtter \u0026lsquo;ha\u0026rsquo; in the mind [with exhale],\nLet it be and peace dawns.\n82\nSitting, sleeping, standing,\n[Be in] support-less state (nirādhāra), free from existence.\nSelf-body, mind goes subtle.\nIn a millisecond, mental disposition (āśaya) goes thin. One becomes [the one].\n83\nStanding and sitting in a moving position,\nWhich proceeds and presents a moving body.\nMind discontinues this way. 
One becomes [the one].\nŚiva will be obtained.\n84\nAlas, watch the spotless ether.\nView the non-inside.\nThe most firm true nature. Thus, within a millisecond,\nThe body of the one is obtained.\n85\n[Meditate on] one\u0026rsquo;s forehead fully dissolved into the whole ether,\nUntil one feels as [the one].\nThe entire manifested images (kārā) of bhairava,\nThoroughly enter into their true quintessence.\n86\nLimited knowledge (jñāta) causes duality.\nOutside worlds [are what] śiva destroys.\nThe universal primary form of the one,\nThe absolute eternal infinite cognition is shining and possessed.\n87\nProceed, on a pitch dark deep night,\nOn the first of a fornight, [without moon]. Dwell for a long time,\nUnable to see the form of beings is eyes\u0026rsquo; problem.\nThe one [only shows when] desire for forms stopped.\n88\nProceed. Shut eyes first.\nAhead, deep dark ether ahead.\nGoing forward, [see] bhairava\u0026rsquo;s form.\nWholly absorbed in this existence. One becomes [the one].\n89\nEndeavors are harmful for [spiritual] faculties.\nSubdue [the motive to make efforts] in this way. Non-arising (nirodha).\nEnter the non-dual emptiness.\nRight there, one\u0026rsquo;s nature (ātmā) manifests.\n90\nNo bindu, no visarga,\nNo kāra. [Only] the sound, the great sound,\nRises with force.\nCognize in entirety the highest God.\n91\nWords are with visarga.\nThoughts (cit) goes with inside visarga.\nMake thoughts support-less (nirādhāra).\nFeel the eternal brahma.\n92\nSelf-nature (ātmā) is manifested images (kāra) of ether.\nMeditation is the compass guiding to freedom.\nBecome independent and basis-less (nirāśrayā), the consciousness (cit) and śakti.\nSelf-form (svarūpa) thus shows up.\n93\n[When] a part of limb is wounded,\n[Such] strong [painful] arising feelings contain mighty information.\nRight there, one\u0026rsquo;s consciousness gains union [with the one],\n[Where] spotless bhairava is seated.\n94\nRemove the limitedness of one\u0026rsquo;s consciousness (citta).\nSelf (mamān) existence [becomes]:\nVikalpa on living-beings and non-existence.\n[Then even] abandon vikalpa. One becomes [the one].\n95\nIllusions (māyā) are beguiling indeed.\nKeep expanding (kalāyā) and let thoughts stop.\nThe first law (dharma) of true living beings is,\nExpanding one\u0026rsquo;s individuality. One becomes [the one].\n96\nOnce the intention (icchā) is born,\n[Immediately] behold with apathy. This is the right manner.\nWhere (iccā) is born,\nRight there clinging and attachment begins [and union is lost].\n97\nIf one\u0026rsquo;s own intention (iccā) is not born,\nCognition begins. Bad I\u0026rsquo;ness is exhausted.\nReal self, truth existence,\nBe fully absorbed in [which] with the whole mind. 
One becomes [the one].\n98\nIntention, or rather cognition,\ngives rise to entrance to cit.\nAttain enlightenment of nature (ātmā) with no-difference mind ascending.\nThe true meaning shows.\n99\nCauseless, existential cognition [shows that]\nĀtmā is support-less and erroneous.\nTrue light of cit arrived.\nKeep on, the one becomes (śivaḥ) [the one].\n100\nThe law of universal consciousness (cit) is [inherently embodied] in everything.\nRather than being shot in like an arrow, wherever consciousness exists.\n[When individual] soul becomes one with everything,\nOne is capable of existential cognition (bhavajijñana).\n101\nLust, hatred, greed, confusion,\nMadness, malice, etc.,\nFreedom from which, one shall be enlightened through,\nWhat remains and is real.\n102\nUniverse is an illusion, like indra\u0026rsquo;s net.\nCast down like a powerful magic.\nMeditate on everything, roaming around.\nBehold! Happiness dawns!\n103\nNo mentalities. Abandon pains.\nNo euphoria. All [contrasting feelings] one shall abandon.\nDirect cognition is obtained in the middle,\nWhere truth remains.\n104\nAbandon attachment to one\u0026rsquo;s own body.\nExpand everywhere into every being,\nWith the entire heart firmly [seated]. View [in mental eyes]\nThat joy of no difference. One becomes [the one].\n105\nFirstly, suspend the breaths [as in fire rituals], rather than recognize (vijñāna).\nThe first intention (icchā) is not from one\u0026rsquo;s own inside.\nIt is existent in everything and everywhere, diffused and omnipresent.\n[Then one shall realize that] one being has everything.\n106\nUnderstand beings as recipients of saṁvit (universal consciousness),\nOf which, Every living being shared in common.\nYogi shall enter that,\nuniversal bonding(saṁbandha) [with] attentiveness.\n107\nWhat one has is not the body, [which is]\nThe physical metaphor (anubhāva) of the universal consciousness (saṁvit).\nContemplate one\u0026rsquo;s self-body,\nas abandoned and [reside in] omnipresence. One becomes [the one] one day and\none night.\n108\nWith the support-less (nirādhāra) mind,\nAlways no-difference, be no difference,\nAs the nature, highest nature.\nOne becomes [the one].\n109\nThe cognition of entirety (sarvajña) is no difference from identity (sarvakartā)\nwith entirety.\nWith the pervading highest creator.\nWith self [realized] such God\u0026rsquo;s law (śaivadharmā),\nFirmly, one becomes [the one].\n110\n[Like] illusions relates to maya (ur-maya).\nFlames are a shattered part of fire. Rays of sunlight.\nMe [is] also as such, of bhairava.\nMe is a shattered division of the universe.\n111\nWhirl and whirl, the body\nHastes, then falls to the ground.\nThe excitement of śakti waned and ceased.\nThe transcendental great victory state [manifests].\n112\nSupport can be an instructor, or rather, śakti.\nCognize, thoughts dissolved.\nŚakti is aroused [when one is] entering,\nthe inside excitement. [Śakti as] the body of bhairava.\n113\nTeachings of this tradition.\nListen, here [is the] proper talk.\nFinal emancipation, newly-born, śiva\u0026rsquo;s way,\nEyes firm fixed in the whole.\n114\nContract your hearing.\nFor apertures in lower body, [contract] in the same manner.\nMeditate on the vowel-less consonant-less (anackam-ahalaṁ) [sounds],\nSupreme eternal brahma.\n115\nIn the direction of a well or, a great cave,\nStand over without seeing,\nThe proper no-difference mind.\nAt the same moment, thoughts get dissolved. 
[The one] manifests.\n116\nThere, there, mind stops.\nOutside goes inside.\nThus, thus, the one stays.\nHere is everywhere.\n117\nThere, there, all organs of senses are paths,\n[Through which], the form-less universal consciousness (cit) becomes visible.\nHis entirety is bestowed with attributes of Logos (vāc).\nPervading and lurking cit is full of nature-ness (form identity, ātmātā).\n118\nAt the beginning and the end of (ādyante): sneezes, fear, grief,\nConfusions in mind, battlefield escape,\nInterests in any extraordinary matters, hunger,\nAre all close to brahma\u0026rsquo;s existence.\n119\nDirect one\u0026rsquo;s mind towards recollections of some reality.\nView the place with mind abandoned.\n[Allow] the whole body become support-less (nirādhāra).\nKeep proceeding. [The one] becomes visible.\n120\nAnywhere, a placed object,\nView [it] as nullified, gradually.\nThat jñana is associated with thoughts.\nGet dissolved in emptiness (śūnya). One becomes [the one].\n121\nDevotion is so abundant that [one becomes] indifferent from passion and worldly attachment.\nAs achieving success through intellect,\n[Proceed] with śakti, the eternal fire.\nBecome great. [One] ascend and becomes one with Śiva.\n122\nDwell on objects or subjects out of thought focus.\nGradually [realize] the emptiness of reality [there].\nMeditate with all heart, towards greatness.\nWith that mastered, peace [dawns].\n123\nShort-witted minds are attached to purity.\n\u0026ldquo;With purity, Śiva manifests.\u0026rdquo;\n[Actually] there is neither purity nor impurity in I\u0026rsquo;ness.\nBe with nirvikalpa and happiness. One becomes [the one].\n124\nAll beings have the nature of bhairava,\nSharing the same scope as God-ness or divinity goes.\nTherefore, [there is] no distinction.\nReside in transcendence, Rest on the non-dualism.\n125\n[Regard] friends the same as foes,\nHuman beings the same as ideas.\nThe transcendental brahma consciousness is filled with Logos (vāc).\nHave this, the blissful cognition. One becomes [the one].\n126\nNo enmities, wherever.\nNo affections, there.\nDestroy [distinction of] affections and enmities. One is free.\nBrahma proceeds in the middle way.\n127\nWhat is not known. What is not understandable.\nWhat is emptiness (śūnya). What is non-existence.\nIn this manner [realize] bhairava beings everywhere.\nIn this manner, one [becomes] enlightened omnipresent being finally.\n128\nThe eternal basis-free (nirāśraya) emptiness (śūnya),\nPervades when understanding is abandoned.\nWith thoughts goes beyond ether,\n[One realizes that there is] no ether and enters thoroughly [into the one].\n129\nThere, there, [wherever] mind goes,\nThat, that, in that very moment,\nAbandon [it] as unstable and fickle,\nThus, calm [dawns as the sea without waves]. One becomes [the one].\n130\nFearful bellows everywhere,\nPervading, indivisible and vacant.\nIn this way, [realize] bhairava\u0026rsquo;s sound,\nContinuous, uninterrupted. One becomes [the one with śiva].\n131\nIn my I\u0026rsquo;ness, wherever at the beginning of [such thought]: \u0026ldquo;here I am\u0026rdquo;,\nIn the direction of getting attached [mentally],\n[Let] mind become support-less (nirādhāra),\nMeditate purely directing this.\n132\nEternal, omnipresent, support-less,\nPervading, complete, [leave] minds there,\nMeditate disregarding their pronunciations.\nDo this and substance goes in place of forms.\n133\nUntrue as lights from indra\u0026rsquo;s net (indrajālā),\nThe concept of [\u0026ldquo;here I am\u0026rdquo;] is. 
[One shall] be existent everywhere.\nWhat is truth, indra\u0026rsquo;s net (itself).\nFix on this firmly. [One becomes] pure and translucent as the diamond.\n134\nNature (ātma) is without manifested images (kāra).\nWhere there is cognition (jñāna), there is action (kriyā).\nCognition exerts itself on non-external beings.\nEmptiness comes when I\u0026rsquo;ness - \u0026ldquo;Here I am (idaṁ)\u0026rdquo;, becomes \u0026ldquo;world-ness - world\nI am (jaga)\u0026rdquo;.\n135\nNo bondage, no freedom,\nIn this way, hope of living is a scarecrow.\n\u0026ldquo;Here I am\u0026rdquo; is a picture on the limited intellect (buddi).\nArrows of illusions gleam like the sun.\n136\nSenses (indriya) are gates [to salvation] everywhere.\nJoy, pains, when they begin to join.\nThus abandon senses (indriya) like [dismantling a cart by removing] pin of axles,\nSelf-abide (svastha). Self-nature(svātma) is [thus] nullified(nivartate).\n137\nIn worldly [erroneous] language, cognition irradiates and illuminates, [and]\nNature (ātmā) is manifestation of God.\nNo-difference, indivisibility, non-existence,\nCognitive cognition dis-manifests (vibhāvyate).\n138\nMind (mānasa), consciousness (cetanā), energy (śakti),\nSelf-nature (ātmā), all these four,\nWhenever from which one is emancipated,\nThen [one will become] bhairava\u0026rsquo;s body.\n139\nTeaching being waveless and calm,\nOne hundred told succinctly,\nAnd twelve more,\n[On] cognition and perceiving the one.\n140\nHere [if one learned even] one of those union,\n[One can achieve] successfully, [and becomes] bhairava on one\u0026rsquo;s own.\nTo speak holy words, to perform karmic acts,\nTo curse and to bless, one becomes an agent of deeds.\n141\n[One becomes] exempt from decay and death, being raw,\nUnbounded, bestowed with all great qualities,\nYogi in union, a mighty master,\nWith one\u0026rsquo;s spirit (dhi\u0026rsquo;pa) united [with the pervading one].\n142a\nEven without the worldly life, the liberated one,\nis, with neither actions nor attachments.\n142b\nIndependent existence is the great iśvara.\n143a\nFreedoms arrives, [when one shall] pervade everywhere.\n[What] chanting (japa) [shall one follow]?\n143b\nMeditate on no great ruler,\nNeither ceremonies nor pleasing [gods].\n144\nTo who are the ceremonies and burnings made?\nWho takes the sacrifices? I would like to know.\nMaking wishes in that manner is external,\nLike a gross spiritual teacher.\n145\nExpand state of existence. 
Transcend one\u0026rsquo;s own being.\nCreatively contemplate on manifestation of emptiness (ābhāvyate).\nJapa, in this respect, is one\u0026rsquo;s own sound (nāda).\nMantra is sound of one\u0026rsquo;s nature (ātmā), as such.\n146\nMeditate with firm understanding [that]\nNo manifested images (nirākārā), no basis (nirāśrayā),\nNor meditate with a possessed body,\nNo face, no hand, or other imaginations.\n147\nIndeed, ceremonies (pūjā), [are] not about giving flowers.\nOne who meditates and practices firmly,\nbe with nirvikalpa and transcend beyond ether,\n[is] under ceremony by dissolving supports.\n148\n[Even achieve] one single union [of above] there,\nOne will rise and be gone one day and one night.\nFull of kārā-less-ness, in this respect,\n[One is performing] perfect completion god pleasing (tṛipti).\n149\nIn the great abode of emptiness, burn as in fire ritual,\nElements (bhūta), sense streams, sense organs, and so on,\nTogether with thinking, sacrifice all\nAs oblations, with consciousness as ladle.\n150a\nYoga, followed in the way of the highest lord [is],\nTo achieve the goal of non-difference bliss,\nTo remove all the sins and impurities,\nTo protect the completeness.\n151\nRudra\u0026rsquo;s tradition on śakti is about the divine existential union.\nThe [true] pilgrimage is transcendental creative contemplation.\nDifferent [from conventional wisdom], His truth [is],\nEither ceremonies (pūjā) or pleasing gods (tṛpti).\n152\nFreedom (svatantra), bliss (ānanda), pure consciousness (cinmātra),\nThe essence of self-nature (svātmā) is everywhere.\nBe in divine union, binding the self-form (svarūpa)\nwith one\u0026rsquo;s own mind. [This is the true] purification ablution to be done.\n153\nCeremonies (pūja) are held on the whole of existing creations.\nDivine pleasing (tarpana) shall go beyond transcendence.\nŚiva\u0026rsquo;s pūja is everywhere.\nWith all such, where are the ceremonies?\n154\nRoaming prāna, entering jīva.\nIntention, [meditating on those] are [all] crook actions.\nThe dwelled abode of self-nature (ātmā) is,\nThe destination of the transcendental pilgrimage. Transcendental transcendence.\n155\nMouths and words are servants.\nStay with the great bliss (mahānanda). Ceremonies are illusions (maya).\nAs such, divinity is the same as feces,\nThe highest bhairava is obtained.\nAll those with kārā, are external.\nAll those with kārā destroyed, is supreme.\nHamsa, soham, all those mantras,\nare worldly japa.\n156\nThose with definite number or shape or aspects of day and night,\n[One shall abandon]. 
Treat one thousand the same as twenty.\nThe divine japa is particularized in this [nirvikalpa] voice.\nPrāna inside, either easy or hard to attain, is a dumb idea.\n157\nThis is indeed\nthe highest, extinct, greatest [dharma],\nThis shall neither enter nor manifest, at any time,\nto those who are:\n158\nVile, cruel,\nLacking devotion to their gurus.\n[This teaching is for] those who have,\nintelligence of nirvikalpa, or virtues (vīrā), or lofty nature.\n159\nIf you are devoted to gurus and [people of same] class,\nAbandon the trivial,\nVillage, kingdom, city, or any estates,\nkids, wife, household.\n160\n[With] all those abandoned,\nYou shall understand:\nHow most unsteady,\nTo dwell on the seat of I\u0026rsquo;ness, [regarded as] highest.\n161\nEven one proffers (pradāvyā) one\u0026rsquo;s life [to God],\nHis highest eternality will not take.\nHear me well,\nIt is auspicious to offer I\u0026rsquo;ness as the transcendental pleasing (paritṛpt).\n162\nOn Rudrayāmala tantra,\nNow the first essence holds.\nNo śakti is separable from living beings.\nThe heart knowledge is cognition in the middle (jñatamadya).\n163\nWherefore, in happy union, devī,\nGracefully, becomes the one with śiva.\n"},{"id":2,"href":"/posts/zk-survey/","title":"Updates in my old zkSNARK Research: IVC, Folding etc","section":"Blog","content":"Since 2022, we have seen significant progress in zkSNARK research, particularly in areas like Incremental Verifiable Computation (IVC) and folding schemes. I have refurbished my 2022 survey article and made it publicly available on GitHub.\nCheck it out here: https://github.com/tlkahn/zkSNARKs-Cryptography-Protocol-Survey/blob/main/main.pdf\nI welcome feedback and contributions. If you spot any mistakes or have suggestions, please open an issue on the GitHub repository.\nStill working in stealth mode. Stay tuned for more.\n"},{"id":3,"href":"/posts/higgs/","title":"About Peter Higgs","section":"Blog","content":"Peter Higgs, one of the greatest theoretical physicists, passed away earlier this week in Edinburgh.\nThere is a Native American proverb which goes: Tell me a fact, and I\u0026rsquo;ll learn. Tell me a truth, and I\u0026rsquo;ll believe. But tell me a story, and it will live in my heart forever. I guess the best way to remember someone is to share good stories:\nWhen questioned about his plans if the LHC data confirmed the Higgs boson\u0026rsquo;s existence, as anticipated by most physicists, he responded:\n\u0026ldquo;I shall open a bottle of something.\u0026rdquo;\n\u0026ldquo;A bottle of what?\u0026rdquo;\n\u0026ldquo;Champagne. Because drinking a bottle of whiskey takes a little more time.\u0026rdquo;\nOn the day his Nobel Prize announcement was expected, he thought it best to leave town. However, his car broke down, leaving him stranded. On the way back to lunch on foot, he was intercepted by a neighbor.\n\u0026ldquo;Congratulations, Mr Higgs. You\u0026rsquo;ve just won the prize!!\u0026rdquo;\nHe said: \u0026ldquo;What prize?\u0026rdquo;\nA Quick Intro to the Standard Model # The Standard Model of particle physics is a theory that describes the electromagnetic, weak, and strong interactions between elementary particles. It is formulated in terms of quantum field theory and has been remarkably successful in explaining many experimental observations. The fundamental particles in the Standard Model are divided into two categories: fermions and bosons.\nFermions # Fermions are particles with half-integer spin. 
They include:\nQuarks # Quarks are the building blocks of hadrons, such as protons and neutrons. There are six types, or flavors, of quarks:\nUp ( \\(u\\) )\nDown ( \\(d\\) )\nCharm ( \\(c\\) )\nStrange ( \\(s\\) )\nTop ( \\(t\\) )\nBottom ( \\(b\\) )\nLeptons # Leptons are fundamental particles that do not participate in the strong interaction. There are six types of leptons:\nElectron ( \\(e^-\\) )\nElectron neutrino ( \\(\\nu_e\\) )\nMuon ( \\(\\mu^-\\) )\nMuon neutrino ( \\(\\nu_\\mu\\) )\nTau ( \\(\\tau^-\\) )\nTau neutrino ( \\(\\nu_\\tau\\) )\nBosons # Bosons are particles with integer spin. They include:\nGauge Bosons # Gauge bosons are responsible for mediating the fundamental forces:\nPhoton ( \\(\\gamma\\) ) - Mediates the electromagnetic force\n\\(W^\u0026#43;\\) boson, \\(W^-\\) boson, and \\(Z\\) boson - Mediate the weak force\nGluon ( \\(g\\) ) - Mediates the strong force\nHiggs Boson # The Higgs boson ( \\(H\\) ) is responsible for giving mass to other particles through the Higgs mechanism.\nWikipedia has provided a clean chart of all the above \u0026ldquo;fundamental\u0026rdquo; particles:\n"},{"id":4,"href":"/posts/source-of-midjourney/","title":"Source of Midjourney","section":"Blog","content":"In an interview, David Holz, the founder/CEO of Midjourney, revealed that the term \u0026ldquo;Midjourney\u0026rdquo; was inspired by Zhuangzi\u0026rsquo;s philosophy, which he has studied and mused on a lot.\nMany wonder which part of the book the term \u0026ldquo;midjourney\u0026rdquo; was taken from.\nIt is from Qi-Wu, the second piece of the inner chapters. In my translation of the book, I use the term \u0026ldquo;non-duality\u0026rdquo; to reflect the underlying meaning of the title.\nThe piece of text is actually talking about self-preservation (not only for life but everything one values and pursues):\nOn Self-preservation\nMy lifetime is finite; while knowledge is infinite. Chasing the infinite with the finite is unwise. Unwitting adherence to the chase is perilous. Avoid: act good only due to moral obligations; refrain from evil only out of fear of legal consequences. Walk the midjourney. It will help you stay in one piece; preserve your vitality; supply your provisions; live out your full lifespan.\nInterestingly, the paragraph right before this one is the well-known butterfly metaphor:\nOnce, Chuang-tzu dreamed he was a butterfly. Vividly a butterfly. He fluttered around, happy with himself. He didn’t know there was a Chuang-Chou. Shortly he woke up. There he was, vividly Chuang-Chou. Is it Chuang-Chou who dreamed he was the butterfly, or a butterfly who dreamed he was Chuang-Chou? There must be something that separates Chuang-Chou and butterfly. 
Realizing the illusory nature of identity is called moksha or awakening.\n"},{"id":5,"href":"/posts/em-algorithm-journey/","title":"An iterative exploration on EM algorithm","section":"Blog","content":" An iterative exploration on EM algorithm # As a continuation of my notes on classical machine learning, this is an exclusive study on the EM algorithm to deepen my understanding from a wider variety of perspectives.\nEM as Expectation + Maximization # Understanding latent variables is the first `get` in the EM algorithm.\nThrough latent variables, the EM algorithm breaks through the limitation of plain MLE.\nLikelihood with the latent variables marginalized out:\n\\(L(\\theta;{\\bf X})=p({\\bf X} | \\theta)=\\int p({\\bf X},{\\bf Z} | \\theta)d{\\bf Z}\\) E-step:\n\\(Q(\\theta|\\theta^{(t)})=\\operatorname{E}_{\\mathbf{Z}\\mid\\mathbf{X},\\theta^{(t)}}\\left[\\log L(\\theta;\\mathbf{X},\\mathbf{Z})\\right]\\) M-step:\n\\(\\theta^{(t\u0026#43;1)} = \\underset{\\theta}{\\arg\\max} \\, Q(\\theta|\\theta^{(t)})\\) Here\u0026rsquo;s an example using a Gaussian Mixture Model (GMM) with two components in Python, using NumPy and SciPy:\nImport required libraries and generate data: import numpy as np from scipy.stats import norm np.random.seed(42) data = np.concatenate((np.random.normal(1, 1, 100), np.random.normal(5, 1, 100))) print(len(data)) print(data[:5]) Define the (marginal) log-likelihood function of the observed data: def log_likelihood(data, means, pis, k): return np.sum(np.log(np.sum([pis[j] * norm.pdf(data, means[j], 1) for j in range(k)], axis=0))) Implement the EM algorithm: def EM(data, k, max_iterations): means = np.random.uniform(np.min(data), np.max(data), k) pis = np.full(k, 1/k) for _ in range(max_iterations): # E-step assignments = np.array([[pis[j] * norm.pdf(x, means[j], 1) for j in range(k)] for x in data]) assignments = assignments / np.sum(assignments, axis=1).reshape(-1, 1) # M-step means = np.sum(assignments * data.reshape(-1, 1), axis=0) / np.sum(assignments, axis=0) pis = np.sum(assignments, axis=0) / len(data) return means, pis Run the EM algorithm with the generated data, 2 components, and 50 iterations: means, pis = EM(data, 2, 50) Visualize the result: import matplotlib.pyplot as plt plt.hist(data, bins=20, density=True, alpha=0.6) x = np.linspace(np.min(data), np.max(data), 1000) gmm_pdf = pis[0] * norm.pdf(x, means[0], 1) + pis[1] * norm.pdf(x, means[1], 1) plt.plot(x, gmm_pdf, \u0026#39;r-\u0026#39;, linewidth=2) plt.savefig(\u0026#34;./gmm_pdf.png\u0026#34;) plt.show() EM as a local lower bound construction # If you delve deeper into the convergence proof of the EM algorithm based on latent variables, using Jensen\u0026rsquo;s inequality construction for the log(x) function 1, we can easily prove that the EM algorithm repeatedly constructs new lower bounds and then further solves them.\nThus the EM process can be seen as: fix the current parameters \\(\\theta_n\\) first, calculate a lower bound function for the distribution of the latent variables, optimize this function to obtain new parameters, and then repeat.\nThe EM process described here is connected to the Variational Bayes (VB) method:\nVariational Bayes can be seen as an extension of the expectation-maximization (EM) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. 
As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.\nBoth methods involve finding approximate solutions to intractable optimization problems by constructing lower bounds for the objective functions. Similarly, in the Variational Bayes method, the goal is to approximate the posterior distribution of latent variables by minimizing the Kullback-Leibler (KL) divergence between the true posterior distribution and the approximate distribution. This is achieved by constructing a lower-bound function for the log marginal likelihood, called the Evidence Lower Bound (ELBO), and optimizing this lower-bound function with respect to the approximate distribution.\nK-means as a hard EM # Based on the understanding of the second level, you can now freely apply the EM algorithm to GMM and HMM models. Especially after a deep understanding of GMM, for the joint probability with latent variables, when using the Gaussian distribution as a substitute:\n\\(\\begin{aligned} P_{\\Theta}\\left(x_1, \\ldots, x_n, z_1, \\ldots z_n\\right) \u0026amp; =\\prod_{t=1}^N P_{\\Theta}\\left(z_t\\right) P_{\\Theta}\\left(x_t \\mid z_t\\right) \\\\ \u0026amp; =\\prod_{t=1}^N \\frac{1}{K} \\mathcal{N}\\left(\\mu^{z_t}, I\\right)\\left(x_t\\right)\\end{aligned}\\) This formula represents the joint probability distribution for a set of observed variables \\(x_1, \\ldots, x_n\\) and latent variables \\(z_1, \\ldots, z_n\\) under a Gaussian Mixture Model (GMM) with parameter \\(\\Theta\\) .\nThe formula can be broken down as follows:\n\\(P_{\\Theta}\\left(x_1, \\ldots, x_n, z_1, \\ldots z_n\\right)\\) : This is the joint probability of the observed variables \\(x_1, \\ldots, x_n\\) and the latent variables \\(z_1, \\ldots, z_n\\) under parameter \\(\\Theta\\) . \\(\\prod_{t=1}^N P_{\\Theta}\\left(z_t\\right) P_{\\Theta}\\left(x_t \\mid z_t\\right)\\) : This expression is the product of the prior probabilities of the latent variables \\(z_t\\) and the conditional probabilities of the observed variables \\(x_t\\) given the latent variables \\(z_t\\) . The product is taken over all \\(N\\) data points. \\(\\frac{1}{K}\\) : This term represents the uniform prior distribution of the latent variables \\(z_t\\) , where \\(K\\) is the number of Gaussian components in the GMM. \\(\\mathcal{N}\\left(\\mu^{z_t}, I\\right)\\left(x_t\\right)\\) : This term is the conditional probability of the observed variable \\(x_t\\) given the latent variable \\(z_t\\) . It is modeled as a Gaussian distribution with mean \\(\\mu^{z_t}\\) and identity covariance matrix \\(I\\) . Easy to see a connection with MSE:\n\\(\\left(\\mu^1, \\ldots, \\mu^K\\right)^*=\\underset{\\mu^1, \\ldots, \\mu^k}{\\operatorname{argmin}} \\underset{z_1, \\ldots, z_n}{\\min}\\,\\sum\\limits_{t=1}^N\\left\\|\\mu^{z_t}-x_t\\right\\|^2\\) This formula represents the optimization problem for finding the optimal means \\(\\left(\\mu^1, \\ldots, \\mu^K\\right)^*\\) of a Gaussian Mixture Model (GMM) by minimizing the mean squared distance between the data points and their corresponding means.\nThe formula can be broken down as follows:\n\\(\\left(\\mu^1, \\ldots, \\mu^K\\right)^*\\) : This represents the optimal means of the Gaussian components in the GMM. 
\\(\\underset{\\mu^1, \\ldots, \\mu^k}{\\operatorname{argmin}}\\) : This indicates that we are looking for the values of \\(\\mu^1, \\ldots, \\mu^k\\) that minimize the expression that follows. \\(\\underset{z_1, \\ldots, z_n}{\\min}\\) : This indicates that we are looking for the values of the latent variables \\(z_1, \\ldots, z_n\\) that minimize the expression that follows. \\(\\sum\\limits_{t=1}^N\\left\\|\\mu^{z_t}-x_t\\right\\|^2\\) : This is the sum of the squared Euclidean distances between each data point \\(x_t\\) and its corresponding mean \\(\\mu^{z_t}\\) . The sum is taken over all \\(N\\) data points. A simpler and more intuitive explanation is that the K-means algorithm uses a hard clustering algorithm, while the EM algorithm we are discussing is a soft clustering algorithm. The so-called hard is a binary decision, either it is or it isn\u0026rsquo;t (0-1 choice). On the other hand, the soft deals with situations like a data point always belongs to multiple clusters, and that a probability is calculated for each combination of data point and cluster.\nThis can be summarized as:\nThere is no \u0026ldquo;k-means algorithm\u0026rdquo;. There is MacQueens algorithm for k-means, the Lloyd/Forgy algorithm for k-means, the Hartigan-Wong method, …\nThere also isn\u0026rsquo;t \u0026ldquo;the\u0026rdquo; EM-algorithm. It is a general scheme of repeatedly expecting the likelihoods and then maximizing the model. The most popular variant of EM is also known as \u0026ldquo;Gaussian Mixture Modeling\u0026rdquo; (GMM), where the model are multivariate Gaussian distributions.\nOne can consider Lloyds algorithm to consist of two steps:\nthe E-step, where each object is assigned to the centroid such that it is assigned to the most likely cluster. the M-step, where the model (centroids) are recomputed (least squares optimization). … iterating these two steps, as done by Lloyd, makes this effectively an instance of the general EM scheme. It differs from GMM that:\nit uses hard partitioning, i.e. each object is assigned to exactly one cluster the model are centroids only, no covariances or variances are taken into account EM as a special case of generalized EM # We define the right side of Jensen\u0026rsquo;s inequality as the free energy.\n\\(\\mathcal{F}(q, \\theta)=\\langle\\log P(\\mathcal{X}, \\mathcal{Y} \\mid \\theta)\\rangle_{q(\\mathcal{X})}\u0026#43;\\mathbf{H}[q]\\) This formula represents the free energy \\(\\mathcal{F}(q, \\theta)\\) , which is an important concept in variational inference and is used to approximate the log marginal likelihood. The formula consists of two parts:\n\\(\\langle\\log P(\\mathcal{X}, \\mathcal{Y} \\mid \\theta)\\rangle_{q(\\mathcal{X})}\\) : This term represents the expected log joint probability of the observed data \\(\\mathcal{Y}\\) and the latent variables \\(\\mathcal{X}\\) given the model parameters \\(\\theta\\) . The expectation is taken with respect to the approximate posterior distribution \\(q(\\mathcal{X})\\) . This term measures the goodness-of-fit of the model to the data, taking into account the uncertainty in the latent variables.\n\\(\\mathbf{H}[q]\\) : This term represents the entropy of the approximate posterior distribution \\(q(\\mathcal{X})\\) . It measures the uncertainty in the latent variables given the observed data. 
High entropy corresponds to a more dispersed distribution, while low entropy corresponds to a more concentrated distribution.\nThe free energy combines these two terms, balancing the goodness-of-fit of the model with the uncertainty in the latent variables. In variational inference, the goal is to minimize the free energy with respect to both the approximate posterior distribution \\(q(\\mathcal{X})\\) and the model parameters \\(\\theta\\) . This minimization leads to an approximation of the true posterior distribution of the latent variables and provides a lower bound on the log marginal likelihood.\nThus, E-step is to optimize the latent (approximate posterior) distribution \\(q(\\mathcal{X})\\) with fixed model parameters \\(\\theta\\) , and the M-step is to optimize \\(\\theta\\) with a fixed \\(q(\\mathcal{X})\\) . This is the generalized EM algorithm.\nE-Step :\n\\(q^{(k)}(\\mathcal{X}) :=\\underset{q(\\mathcal{X})}{\\operatorname{argmax}} \\mathcal{F}\\left(q(\\mathcal{X}), \\theta^{(k-1)}\\right)\\) M-step:\n\\(\\theta^{(k)} :=\\underset{\\theta}{\\operatorname{argmax}} \\mathcal{F}\\left(q^{(k)}(\\mathcal{X}), \\theta\\right)\\) After understanding the generalized EM algorithm, we delve deeper into free energy and discover the relationship between free energy, likelihood, and KL divergence. When the model parameters are fixed, the only option is to optimize the KL divergence. In this case, the hidden distribution can only take the following form:\n\\(q^{(k)}(\\mathcal{X})=P\\left(\\mathcal{X} \\mid \\mathcal{Y}, \\theta^{(k-1)}\\right)\\) In the EM algorithm, this is directly given. Therefore, the EM algorithm is a naturally optimal hidden distribution case within the generalized EM algorithm. However, in many cases, the hidden distribution is not so easy to compute\u0026hellip;\nOne example where the hidden distribution is not easy to compute arises in the context of topic models, such as Latent Dirichlet Allocation (LDA). In LDA, the goal is to learn the hidden topics that generate a collection of documents. The observed data are the words in each document, and the latent variables are the topic assignments for each word. The hidden distribution in this case is the posterior distribution of the topic assignments given the observed words and the model parameters (topic-word probabilities and document-topic probabilities).\nComputing the exact hidden distribution in LDA is challenging because the posterior distribution involves a large number of topic assignments, which grow exponentially with the number of words and topics. This makes exact inference intractable for all but the smallest datasets and simplest models.\nTo overcome this difficulty, various approximate inference techniques are employed in practice to estimate the hidden distribution in LDA, such as:\nGibbs sampling: A Markov chain Monte Carlo (MCMC) method that generates samples from the posterior distribution by iteratively sampling topic assignments for each word in the documents. Variational inference: A deterministic method that approximates the true posterior distribution with a simpler distribution (e.g., a factorized distribution) and minimizes the KL divergence between the true and approximate distributions. Collapsed variational inference or collapsed Gibbs sampling: Techniques that integrate out some of the model parameters (e.g., topic-word probabilities) to simplify the inference problem and reduce the computational complexity. 
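To make the free-energy view above concrete, here is a minimal sketch (my own addition, not from the original post). It reuses the two-component GMM code from earlier, specifically the data array and the means and pis returned by the EM function, with unit variances as before, and evaluates F(q, theta) with q set to the exact posterior responsibilities; in that case the free energy equals the log marginal likelihood and should not decrease across EM iterations:

import numpy as np
from scipy.stats import norm

def free_energy(data, means, pis, k):
    # Component-wise joint p(x, z=j | theta) for every data point, shape (N, k).
    joint = np.array([pis[j] * norm.pdf(data, means[j], 1) for j in range(k)]).T
    # E-step choice: q(z) = exact posterior responsibilities.
    resp = joint / joint.sum(axis=1, keepdims=True)
    expected_complete = np.sum(resp * np.log(joint + 1e-12))  # <log p(X, Z | theta)>_q
    entropy = -np.sum(resp * np.log(resp + 1e-12))            # H[q]
    return expected_complete + entropy

# With q fixed to the posterior, F(q, theta) equals log p(X | theta), so it should
# agree with log_likelihood(data, means, pis, 2) up to numerical noise and
# increase monotonically if evaluated after each EM iteration.
print(free_energy(data, means, pis, 2))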
(Jensen\u0026rsquo;s inequality) Let \\(f\\) be a convex function on interval \\(I\\) : If \\(x_{1}, x_{2}, \\ldots, x_{n} \\in I\\) and \\(\\lambda_{1}, \\lambda_{2}, \\ldots, \\lambda_{n} \\geq 0\\) with \\(\\sum_{i=1}^{n} \\lambda_{i}=1\\) , we have:\n\\( f\\left(\\sum_{i=1}^{n} \\lambda_{i} x_{i}\\right) \\leq \\sum_{i=1}^{n} \\lambda_{i} f\\left(x_{i}\\right) \\) Since \\(\\ln (x)\\) is concave (negative convex), we may apply Jensen\u0026rsquo;s inequality:\n\\( \\ln \\sum_{i=1}^{n} \\lambda_{i} x_{i} \\geq \\sum_{i=1}^{n} \\lambda_{i} \\ln \\left(x_{i}\\right) \\) This result enables us to repeatedly establish a lower bound for the logarithm of a sum, which is a key step in deriving the EM algorithm.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n"},{"id":6,"href":"/posts/how-i-build-a-meditation-app-in-a-week/","title":"How I build a meditation app in a week","section":"Blog","content":" Prelude # The story: I built an iOS app for group meditation. It works like Clubhouse but for regular meditators instead. The instructor can remotely sit with students, who can reserve classes and book 1:1 sessions. Teachers can record sessions and offer the replay option for students. During meditation, they can play ambient music along with guidance.\nI built the app in a week. I have to say that Swift is a much more enjoyable language than Objective-C, in which I developed apps years ago. Here is a high-level overview of the short story. You can find the complete code on GitHub.\nMy apologies that I had little time to write more documentation for it. Fortunately Swift is expressive enough and I made considerable efforts to make my code as readable as possible.\nStreaming # I use Agora as the audio streaming vendor (like Clubhouse). The streaming service was wrapped into a class. The reason is mostly for compatibility with Agora\u0026rsquo;s older SDKs. This also simplified referencing and potential inheritance. The class structure appears as follows:\nclass Broadcaster: NSObject, AgoraRtcEngineDelegate, ObservableObject { var channelName: String var uid: UInt var role: AgoraClientRole var recordingConfig: AgoraAudioRecordingConfiguration? @State private var cancellables = Set\u0026lt;AnyCancellable\u0026gt;() @State private var initialized: Bool = false @State private var joined: Bool = false private(set) var token: String? private var bags = Set\u0026lt;AnyCancellable\u0026gt;() var agoraKit: AgoraRtcEngineKit! var connectionState: AgoraConnectionState { agoraKit.getConnectionState() } init(channelName: String, role: AgoraClientRole, uid: UInt) { self.channelName = channelName self.role = role self.uid = uid super.init() } Network Service # Swift code for networking is almost boilerplate thanks to Swift\u0026rsquo;s versatile and modern language design.\nstruct NetworkService { let baseURL: String private func getToken() -\u0026gt; String? { return UserDefaults.standard.object(forKey: \u0026#34;token\u0026#34;) as? String } func get\u0026lt;U\u0026gt;(from: String) -\u0026gt; AnyPublisher\u0026lt;U, Error\u0026gt; where U: Decodable { let url = URL(string: baseURL + from)! var request = URLRequest(url: url) request.httpMethod = \u0026#34;GET\u0026#34; if getToken() != nil { request.setValue(\u0026#34;Bearer \\(getToken()!)\u0026#34;, forHTTPHeaderField: \u0026#34;Authorization\u0026#34;) } return run(request) } func post\u0026lt;T, U\u0026gt;(_ entry: T, to: String) -\u0026gt; AnyPublisher\u0026lt;U, Error\u0026gt; where T: Encodable, U: Decodable { let url = URL(string: baseURL + to)!
var request = URLRequest(url: url) request.httpMethod = \u0026#34;POST\u0026#34; request.addValue(\u0026#34;application/json\u0026#34;, forHTTPHeaderField: \u0026#34;Content-Type\u0026#34;) if getToken() != nil { request.setValue(\u0026#34;Bearer \\(getToken()!)\u0026#34;, forHTTPHeaderField: \u0026#34;Authorization\u0026#34;) } let encoder = JSONEncoder() let jsonData = try! encoder.encode(entry) request.httpBody = jsonData return run(request) } func run\u0026lt;T: Decodable\u0026gt;(_ request: URLRequest) -\u0026gt; AnyPublisher\u0026lt;T, Error\u0026gt; { let decoder = JSONDecoder() return URLSession.shared .dataTaskPublisher(for: request) .map { $0.data } .handleEvents(receiveOutput: { print(\u0026#34;\u0026lt;\u0026lt;\u0026lt; Data received:\\n\u0026#34;, NSString( data: $0, encoding: String.Encoding.utf8.rawValue )!) }) .decode(type: T.self, decoder: decoder) .receive(on: DispatchQueue.main) .eraseToAnyPublisher() } } Syncing heartbeats with the server over a websocket is also straightforward:\nfunc sendHeartbeat() { let message = URLSessionWebSocketTask.Message.string(\u0026#34;heartbeat\u0026#34;) webSocketTask?.send(message) { error in if let error = error { print(\u0026#34;Error sending heartbeat: \\(error)\u0026#34;) } } } UI # SwiftUI is declarative, almost like CSS, which quite pleasantly surprised me. I was still traumatized by the Obj-C UI experience. This is even more sweetened by extension syntax to separate UI from event handling and state transitions. You can write such code in a breeze:\nextension MusicListView { func play(music: Payload.Music?) { guard let music = music, let url = URL(string: music.url) else { return } if music == musicInPlay, let player = audioPlayer, !isPlaying { player.play() } else { stop() let playerItem = AVPlayerItem(url: url) audioPlayer = AVPlayer(playerItem: playerItem) // Resume playback from the stored played time if let storedProgress = playbackProgress[music] { audioPlayer?.seek(to: CMTime( seconds: storedProgress.played, preferredTimescale: 1 )) } audioPlayer?.play() musicInPlay = music } isPlaying = true } func stop() { audioPlayer?.pause() audioPlayer = nil musicInPlay = nil isPlaying = false } func pause() { audioPlayer?.pause() isPlaying = false } Reflections # You might well want to work with Swift, but not Apple. Maybe Rust offers such circumvention with more powerful browser support and embedded wasm code. The mobile dev paradigm today very closely resembles front-end development. I was constantly reminded of React/Redux and the like. Please let me know if you find this app useful. "},{"id":7,"href":"/posts/classical-ml-math-notes/","title":"Notes on classical ML math","section":"Blog","content":" Introduction # There are two major schools of thought on the interpretation of probability: one is the frequentist school and the other is the Bayesian school.\nLater, we will use the following notation for the observed set: \\(X_{N\\times p}=(x_{1},x_{2},\\cdots,x_{N})^{T},x_{i}=(x_{i1},x_{i2},\\cdots,x_{ip})^{T}\\) This notation indicates there are \\(N\\) samples, and each sample is a \\(p\\) -dimensional vector. Each observation is generated by \\(p(x|\\theta)\\) .\nFrequentist # Frequentists hold that in \\(p(x|\\theta)\\) , \\(\\theta\\) as a parameter is a constant. For \\(N\\) observations, the probability of the observed set is \\(p(X|\\theta)\\mathop{=}\\limits _{iid}\\prod\\limits _{i=1}^{N}p(x_{i}|\\theta)\\) .
To find the value of \\(\\theta\\) , we use the maximum log-likelihood (MLE) method:\n\\(\\theta_{MLE}=\\mathop{argmax}\\limits _{\\theta}\\log p(X|\\theta)\\mathop{=}\\limits _{iid}\\mathop{argmax}\\limits _{\\theta}\\sum\\limits _{i=1}^{N}\\log p(x_{i}|\\theta)\\) Bayesian # Bayesians believe that \\(\\theta\\) in \\(p(x|\\theta)\\) is not a constant. This \\(\\theta\\) follows a preset prior distribution \\(\\theta\\sim p(\\theta)\\) . So, according to Bayes' theorem, the parameter's posterior distribution based on the observed set can be written as:\n\\(p(\\theta|X)=\\frac{p(X|\\theta)\\cdot p(\\theta)}{p(X)}=\\frac{p(X|\\theta)\\cdot p(\\theta)}{\\int\\limits _{\\theta}p(X|\\theta)\\cdot p(\\theta)d\\theta} \\propto p(X|\\theta)\\cdot p(\\theta)\\) To find the value of \\(\\theta\\) , we need to maximize this parameter's posterior distribution using the MAP (Maximum A Posteriori) method:\n\\(\\theta_{MAP}=\\mathop{argmax}\\limits _{\\theta}p(\\theta|X)=\\mathop{argmax}\\limits _{\\theta}p(X|\\theta)\\cdot p(\\theta)\\) The second equality is due to the denominator being unrelated to \\(\\theta\\) .\n\\(p(X|\\theta)\\) is called the likelihood and represents our model distribution. After obtaining the posterior distribution of the parameters, we can use it for Bayesian prediction:\n\\(p(x_{new}|X)=\\int\\limits _{\\theta}p(x_{new}|\\theta)\\cdot p(\\theta|X)d\\theta\\) In this integral, the multiplicand is the model (likelihood), and the multiplier is the posterior distribution.\nHere's the derivation:\nStart with the joint probability of observing \\(x_{new}\\) and \\(\\theta\\) given \\(X\\) : \\(p(x_{new}, \\theta|X) = p(x_{new}|\\theta, X) \\cdot p(\\theta|X)\\) To find the marginal probability of observing \\(x_{new}\\) given \\(X\\) , integrate the joint probability over all possible values of \\(\\theta\\) : \\(p(x_{new}|X) = \\int\\limits_{\\theta} p(x_{new}, \\theta|X) d\\theta\\) Substitute the joint probability from step 1 into the integral: \\(p(x_{new}|X) = \\int\\limits_{\\theta} [p(x_{new}|\\theta, X) \\cdot p(\\theta|X)] d\\theta\\) Since \\(x_{new}\\) is conditionally independent of \\(X\\) given \\(\\theta\\) , we can simplify \\(p(x_{new}|\\theta, X)\\) as \\(p(x_{new}|\\theta)\\) : \\(p(x_{new}|X) = \\int\\limits_{\\theta} [p(x_{new}|\\theta) \\cdot p(\\theta|X)] d\\theta\\) The final result is the given formula:\n\\(p(x_{new}|X) = \\int\\limits_{\\theta} p(x_{new}|\\theta) \\cdot p(\\theta|X) d\\theta\\) Gaussian Distribution # 1-d MLE # \\(\\theta\\) of Gaussian in 1-d MLE\n\\[\\begin{aligned} \u0026amp;\\theta=(\\mu,\\Sigma)=(\\mu,\\sigma^{2}), \\\\ \u0026amp;\\theta_{MLE}=\\mathop{argmax}\\limits_{\\theta}\\log p(X|\\theta)\\mathop{=}\\limits_{iid}\\mathop{argmax}\\limits_{\\theta}\\sum\\limits _{i=1}^{N}\\log p(x_{i}|\\theta) \\end{aligned}\\] Probability density function (PDF) of the multivariate Gaussian distribution in \\(p\\) -d with mean vector \\(\\mu\\) and covariance matrix \\(\\Sigma\\) is written as: \\(p(x|\\mu,\\Sigma)=\\frac{1}{(2\\pi)^{p/2}|\\Sigma|^{1/2}}e^{-\\frac{1}{2}(x-\\mu)^{T}\\Sigma^{-1}(x-\\mu)}\\) Plugging into MLE, we consider the one-dimensional case: \\(\\log p(X|\\theta)=\\sum\\limits _{i=1}^{N}\\log p(x_{i}|\\theta)=\\sum\\limits _{i=1}^{N}\\log\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-(x_{i}-\\mu)^{2}/2\\sigma^{2})\\) To find \\(\\mu_{MLE}\\) :\n\\(\\mu_{MLE}=\\mathop{argmax}\\limits _{\\mu}\\log p(X|\\theta)=\\mathop{argmax}\\limits _{\\mu}\\sum\\limits _{i=1}^{N}(x_{i}-\\mu)^{2}\\) \\(\\frac{\\partial}{\\partial\\mu}\\sum\\limits 
_{i=1}^{N}(x_{i}-\\mu)^{2}=0\\longrightarrow\\mu_{MLE}=\\frac{1}{N}\\sum\\limits _{i=1}^{N}x_{i}\\) For \\(\\sigma\\) , we have:\n\\[\\begin{aligned} \\sigma_{MLE}=\\mathop{argmax}\\limits _{\\sigma}\\log p(X|\\theta)\u0026amp;=\\mathop{argmax}\\limits _{\\sigma}\\sum\\limits _{i=1}^{N}[-\\log\\sigma-\\frac{1}{2\\sigma^{2}}(x_{i}-\\mu)^{2}]\\\\ \u0026amp;=\\mathop{argmin}\\limits _{\\sigma}\\sum\\limits _{i=1}^{N}[\\log\\sigma\u0026#43;\\frac{1}{2\\sigma^{2}}(x_{i}-\\mu)^{2}] \\end{aligned}\\] \\(\\therefore\\) \\(\\frac{\\partial}{\\partial\\sigma}\\sum\\limits _{i=1}^{N}[\\log\\sigma\u0026#43;\\frac{1}{2\\sigma^{2}}(x_{i}-\\mu)^{2}]=0\\) Given the log-likelihood function to optimize: \\(\\sum\\limits_{i=1}^{N}[\\log\\sigma \u0026#43; \\frac{1}{2\\sigma^2}(x_i - \\mu)^2]\\) Differentiate the log-likelihood function with respect to \\(\\sigma\\) : \\(\\frac{\\partial}{\\partial\\sigma}\\sum\\limits_{i=1}^{N}[\\log\\sigma \u0026#43; \\frac{1}{2\\sigma^2}(x_i - \\mu)^2]\\) Apply the sum rule and chain rule for differentiation: \\(N \\cdot \\frac{1}{\\sigma} - \\sum\\limits_{i=1}^{N}\\frac{(x_i - \\mu)^2}{\\sigma^3} = 0\\) Solve for \\(\\sigma^2\\) : \\(\\frac{1}{\\sigma} = \\frac{1}{N \\sigma^3}\\sum\\limits_{i=1}^{N}(x_i - \\mu)^2 \\longrightarrow \\sigma^2 = \\frac{1}{N}\\sum\\limits_{i=1}^{N}(x_i - \\mu)^2\\) The result is the MLE of the variance: \\(\\sigma_{MLE}^2 = \\frac{1}{N}\\sum\\limits_{i=1}^{N}(x_i - \\mu)^2\\) Since we have \\(\\sigma^2_{MLE}\\) , for a unbiased estimate of \\(\\mathbb{E}_D [\\sigma^2_{MLE}]\\) :\n\\[\\begin{aligned} \\mathbb{E}_{\\mathcal{D}}[\\sigma_{MLE}^{2}]\u0026amp;=\\mathbb{E}_{\\mathcal{D}}[\\frac{1}{N}\\sum\\limits _{i=1}^{N}(x_{i}-\\mu_{MLE})^{2}]=\\mathbb{E}_{\\mathcal{D}}[\\frac{1}{N}\\sum\\limits _{i=1}^{N}(x_{i}^{2}-2x_{i}\\mu_{MLE}\u0026#43;\\mu_{MLE}^{2}) \\\\\u0026amp;=\\mathbb{E}_{\\mathcal{D}}[\\frac{1}{N}\\sum\\limits _{i=1}^{N}x_{i}^{2}-\\mu_{MLE}^{2}]=\\mathbb{E}_{\\mathcal{D}}[\\frac{1}{N}\\sum\\limits _{i=1}^{N}x_{i}^{2}-\\mu^{2}\u0026#43;\\mu^{2}-\\mu_{MLE}^{2}]\\\\ \u0026amp;= \\mathbb{E}_{\\mathcal{D}}[\\frac{1}{N}\\sum\\limits _{i=1}^{N}x_{i}^{2}-\\mu^{2}]-\\mathbb{E}_{\\mathcal{D}}[\\mu_{MLE}^{2}-\\mu^{2}]=\\sigma^{2}-(\\mathbb{E}_{\\mathcal{D}}[\\mu_{MLE}^{2}]-\\mu^{2})\\\\\u0026amp;=\\sigma^{2}-(\\mathbb{E}_{\\mathcal{D}}[\\mu_{MLE}^{2}]-\\mathbb{E}_{\\mathcal{D}}^{2}[\\mu_{MLE}])=\\sigma^{2}-Var[\\mu_{MLE}]\\\\\u0026amp;=\\sigma^{2}-Var[\\frac{1}{N}\\sum\\limits _{i=1}^{N}x_{i}]=\\sigma^{2}-\\frac{1}{N^{2}}\\sum\\limits _{i=1}^{N}Var[x_{i}]=\\frac{N-1}{N}\\sigma^{2} \\end{aligned}\\] \\(\\therefore\\) for \\(\\hat{\\sigma}^{2}\\) a.k.a \\(\\mathbb{E}_{\\mathcal{D}}[\\sigma_{MLE}^{2}]\\) :\n\\(\\hat{\\sigma}^{2}=\\frac{1}{N-1}\\sum\\limits _{i=1}^{N}(x_{i}-\\mu)^{2}\\) Multivariate Gaussian # \\(\\because\\) \\(p(x|\\mu,\\Sigma)=\\frac{1}{(2\\pi)^{p/2}|\\Sigma|^{1/2}}e^{-\\frac{1}{2}(x-\\mu)^{T}\\Sigma^{-1}(x-\\mu)}\\) Here, \\(x,\\mu\\in\\mathbb{R}^{p},\\Sigma\\in\\mathbb{R}^{p\\times p}\\) , where \\(\\Sigma\\) is the covariance matrix, generally also a semi-positive definite matrix.\nFirst, we deal with the number in the exponent, which can be denoted as the Mahalanobis distance between \\(x\\) and \\(\\mu\\) .\nThe Mahalanobis distance is a method for measuring the distance between data points, taking into account the covariance structure of the data. 
For two points \\(x\\) and \\(y\\) , their Mahalanobis distance is defined as:\n\\(D_{M}(x, y) = \\sqrt{(x - y)^T \\Sigma^{-1} (x - y)}\\) Here, \\(\\Sigma\\) is the covariance matrix of the data. The difference between Mahalanobis distance and Euclidean distance lies in the fact that the former takes into account the shape of the data distribution, especially the correlation between the data. When the covariance matrix of the data is an identity matrix, the Mahalanobis distance is equal to the Euclidean distance.\nThe covariance matrix of the data is used to represent the degree of joint variation between different dimensions in the data. For a dataset with \\(n\\) samples and \\(p\\) features, the covariance matrix \\(\\Sigma\\) is a \\(p \\times p\\) matrix, defined as:\n\\(\\Sigma = \\frac{1}{n-1} \\sum\\limits_{i=1}^n (x_i - \\bar{x})(x_i - \\bar{x})^T\\) Here, \\(x_i\\) is the \\(i\\) -th sample point, \\(\\bar{x}\\) is the sample mean, and \\(n\\) is the number of samples.\nThis equation defines a matrix because each summand \\((x_i - \\bar{x})(x_i - \\bar{x})^T\\) itself is a matrix. Let's explain this in more detail:\nSuppose \\(x_i\\) and \\(\\bar{x}\\) are \\(p\\) -dimensional column vectors, then \\((x_i - \\bar{x})\\) is also a \\(p\\) -dimensional column vector. When calculating the outer product \\((x_i - \\bar{x})(x_i - \\bar{x})^T\\) , we obtain a \\(p \\times p\\) matrix, where each element is the product of the corresponding elements in \\((x_i - \\bar{x})\\) .\nSumming these \\(p \\times p\\) matrices over the sample size \\(n\\) , we obtain a cumulative \\(p \\times p\\) matrix. Finally, the entire sum is divided by \\((n-1)\\) to obtain the unbiased covariance matrix \\(\\Sigma\\) .\nThe denominator is n-1{.verbatim} rather than n{.verbatim} because of Bessel's correction, which is used to provide an unbiased estimate of the population variance when working with sample data.\nFor a symmetric covariance matrix, eigendecomposition can be performed.\n\\(\\Sigma=U\\Lambda U^{T}=(u_{1},u_{2},\\cdots,u_{p})diag(\\lambda_{i})(u_{1},u_{2},\\cdots,u_{p})^{T}=\\sum\\limits _{i=1}^{p}u_{i}\\lambda_{i}u_{i}^{T}\\) \\(\\Sigma^{-1}=\\sum\\limits _{i=1}^{p}u_{i}\\frac{1}{\\lambda_{i}}u_{i}^{T}\\) In this context, \\(u_i\\) is the eigenvector of the covariance matrix \\(\\Sigma\\) . Eigendecomposition decomposes the matrix into a product of eigenvectors \\(u_i\\) and eigenvalues \\(\\lambda_i\\) . \\(U\\) is a matrix composed of \\(u_i\\) as column vectors corresponding to the eigenvectors. \\(\\Lambda\\) is a diagonal matrix with elements on the diagonal being the corresponding eigenvalues \\(\\lambda_i\\) .\n\\[\\begin{aligned} \\Delta\u0026amp;=(x-\\mu)^{T}\\Sigma^{-1}(x-\\mu) \\\\ \u0026amp;=\\sum\\limits _{i=1}^{p}(x-\\mu)^{T}u_{i}\\frac{1}{\\lambda_{i}}u_{i}^{T}(x-\\mu) \\\\ \u0026amp;=\\sum\\limits _{i=1}^{p}\\frac{y_{i}^{2}}{\\lambda_{i}} \\tag{eq:1} \\end{aligned}\\] \\( y_i = (x-\\mu)^T u_i = u_i^T (x-\\mu) \\) We notice that scalar \\(y_i\\) is the projection of \\(x-\\mu\\) onto the eigenvector \\(u_i\\) (or the inverse).\nSpecifically, \\(u_i^T(x - \\mu)\\) calculates the dot product between \\((x - \\mu)\\) and \\(u_i\\) , which gives the component of \\((x - \\mu)\\) in the direction of \\(u_i\\) . 
This is why we say \\(y_i\\) is the projection of \\((x - \\mu)\\) onto the eigenvector \\(u_i\\) .\nIn Euclidean space, the dot product (also known as the inner product) is a fundamental operation between two vectors, used to calculate their components (or projections) in the same direction. The formula for calculating Euclidean dot product is:\n\\(u_i^T (x - \\mu) = ||u_i|| ||x - \\mu|| \\cos{\\theta}\\) Where \\(||u_i||\\) and \\(||x - \\mu||\\) represent the lengths of the vectors \\(u_i\\) and \\((x - \\mu)\\) , respectively, and \\(\\theta\\) represents the angle between them. Since the eigenvector \\(u_i\\) is typically normalized to have unit length, i.e., \\(||u_i|| = 1\\) , the dot product simplifies to:\n\\(y_i = u_i^T (x - \\mu) = ||x - \\mu|| \\cos{\\theta}\\) \\(\\ref{eq:1}\\) represents a generalized concentric ellipses for different values of \\(\\Delta\\) .\nEach term \\(\\frac{y_{i}^{2}}{\\lambda_{i}}\\) represents the contribution of the projection of the vector \\((x - \\mu)\\) onto the eigenvector \\(u_i\\) to the overall shape of the ellipse. The eigenvalues \\(\\lambda_i\\) determine the scaling of the ellipse along each eigenvector direction.\nNow let's look at two issues in the practical application of multidimensional Gaussian models:\nThe number of degrees of freedom for parameters \\(\\Sigma,\\mu\\) is \\(O(p^{2})\\) . For high-dimensional data, this leads to too many degrees of freedom.\nSolution: The high degrees of freedom come from \\(\\Sigma\\) having \\(\\frac{p(p\u0026#43;1)}{2}\\) free parameters. We can assume that it is a diagonal matrix or even assume that the elements on the diagonal are the same under the isotropic assumption. The former dimension reduction algorithm is Factor Analysis (FA), and the latter is probabilistic PCA (p-PCA).\nThe second problem is that a single Gaussian distribution is unimodal, and it cannot achieve good results for data distributions with multiple peaks. 
Solution: Gaussian Mixture Model (GMM).\nNext, we will introduce commonly used theorems for multidimensional Gaussian distributions.\nLet \\(x=(x_1, x_2,\\cdots,x_p)^T=(x_{a,m\\times 1}, x_{b,n\\times1})^T,\\mu=(\\mu_{a,m\\times1}, \\mu_{b,n\\times1}),\\Sigma=\\begin{pmatrix}\\Sigma_{aa}\u0026amp;\\Sigma_{ab}\\\\\\Sigma_{ba}\u0026amp;\\Sigma_{bb}\\end{pmatrix}\\) Given: \\(x\\sim\\mathcal{N}(\\mu,\\Sigma)\\) Given \\(x \\sim \\mathcal{N}(\\mu, \\Sigma)\\) , and \\(y \\sim Ax \u0026#43; b\\) , then \\(y \\sim \\mathcal{N}(A\\mu \u0026#43; b, A\\Sigma A^T)\\) .\n\\(\\mathbb{E}[y]=\\mathbb{E}[Ax\u0026#43;b]=A\\mathbb{E}[x]\u0026#43;b=A\\mu\u0026#43;b\\) \\ \\(Var[y]=Var[Ax\u0026#43;b]=Var[Ax]=A\\cdot Var[x]\\cdot A^T\\) Now, using this theorem, we can obtain four quantities \\(p(x_a), p(x_b), p(x_a|x_b), p(x_b|x_a)\\) \\(x_a = \\begin{pmatrix}\\mathbb{I}_{m\\times m} \u0026amp; \\mathbb{O}_{m\\times n}\\end{pmatrix}\\begin{pmatrix}x_a \\\\ x_b\\end{pmatrix}\\) .\nSubstituting into the theorem, we get:\n\\[ \\begin{aligned} \\mathbb{E}[x_a] = \\begin{pmatrix}\\mathbb{I} \u0026amp; \\mathbb{O}\\end{pmatrix}\\begin{pmatrix}\\mu_a \\\\ \\mu_b\\end{pmatrix} = \\mu_a \\\\ Var[x_a] = \\begin{pmatrix}\\mathbb{I} \u0026amp; \\mathbb{O}\\end{pmatrix}\\begin{pmatrix}\\Sigma_{aa} \u0026amp; \\Sigma_{ab} \\\\ \\Sigma_{ba} \u0026amp; \\Sigma_{bb}\\end{pmatrix}\\begin{pmatrix}\\mathbb{I} \\\\ \\mathbb{O}\\end{pmatrix} = \\Sigma_{aa} \\end{aligned} \\] \\(\\therefore\\) \\(x_a \\sim \\mathcal{N}(\\mu_a, \\Sigma_{aa})\\) Similarly, \\(x_b \\sim \\mathcal{N}(\\mu_b, \\Sigma_{bb})\\) .\nFor the two conditional probabilities, we introduce three quantities:\n\\( x_{b\\cdot a} = x_b - \\Sigma_{ba}\\Sigma_{aa}^{-1}x_a\\\\ \\mu_{b\\cdot a} = \\mu_b - \\Sigma_{ba}\\Sigma_{aa}^{-1}\\mu_a\\\\ \\Sigma_{bb\\cdot a} = \\Sigma_{bb} - \\Sigma_{ba}\\Sigma_{aa}^{-1}\\Sigma_{ab} \\) In particular, the last equation is called the Schur Complementary of \\(\\Sigma_{bb}\\) . We can see that: \\( x_{b\\cdot a} = \\begin{pmatrix}-\\Sigma_{ba}\\Sigma_{aa}^{-1} \u0026amp; \\mathbb{I}_{n\\times n}\\end{pmatrix}\\begin{pmatrix}x_a \\\\ x_b\\end{pmatrix} \\) So\n\\[ \\begin{aligned} \\mathbb{E}[x_{b\\cdot a}] \u0026amp;= \\begin{pmatrix}-\\Sigma_{ba}\\Sigma_{aa}^{-1} \u0026amp; \\mathbb{I}_{n\\times n}\\end{pmatrix}\\begin{pmatrix}\\mu_a \\\\ \\mu_b\\end{pmatrix} = \\mu_{b\\cdot a}\\\\ Var[x_{b\\cdot a}] \u0026amp;= \\begin{pmatrix}-\\Sigma_{ba}\\Sigma_{aa}^{-1} \u0026amp; \\mathbb{I}_{n\\times n}\\end{pmatrix}\\begin{pmatrix}\\Sigma_{aa} \u0026amp; \\Sigma_{ab} \\\\ \\Sigma_{ba} \u0026amp; \\Sigma_{bb}\\end{pmatrix}\\begin{pmatrix}-\\Sigma_{aa}^{-1}\\Sigma_{ba}^T \\\\ \\mathbb{I}_{n\\times n}\\end{pmatrix} = \\Sigma_{bb\\cdot a} \\end{aligned} \\] Using these three quantities, we can get \\(x_b = x_{b\\cdot a} \u0026#43; \\Sigma_{ba}\\Sigma_{aa}^{-1}x_a\\) .\nTherefore: \\( \\mathbb{E}[x_b|x_a] = \\mu_{b\\cdot a} \u0026#43; \\Sigma_{ba}\\Sigma_{aa}^{-1}x_a \\) \\( Var[x_b|x_a] = \\Sigma_{bb\\cdot a} \\) Similarly: \\( x_{a\\cdot b} = x_a - \\Sigma_{ab}\\Sigma_{bb}^{-1}x_b\\\\ \\mu_{a\\cdot b} = \\mu_a - \\Sigma_{ab}\\Sigma_{bb}^{-1}\\mu_b\\\\ \\Sigma_{aa\\cdot b} = \\Sigma_{aa} - \\Sigma_{ab}\\Sigma_{bb}^{-1}\\Sigma_{ba} \\) \\(\\therefore\\) \\( \\mathbb{E}[x_a|x_b] = \\mu_{a\\cdot b} \u0026#43; \\Sigma_{ab}\\Sigma_{bb}^{-1}x_b \\) \\( Var[x_a|x_b] = \\Sigma_{aa\\cdot b} \\) \\begin{example} Given: \\(p(x) = \\mathcal{N}(\\mu, \\Lambda^{-1}), p(y|x) = \\mathcal{N}(Ax \u0026#43; b, L^{-1})\\) , find: \\(p(y), p(x|y)\\) . 
Solution: Let: \\(y = Ax \u0026#43; b \u0026#43; \\epsilon, \\epsilon \\sim \\mathcal{N}(0, L^{-1})\\) , \\(\\therefore\\) \\(\\begin{aligned} \\mathbb{E}[y] = \\mathbb{E}[Ax \u0026#43; b \u0026#43; \\epsilon] = A\\mu \u0026#43; b Var[y] = A\\Lambda^{-1}A^T \u0026#43; L^{-1} \\end{aligned}\\) \\(\\therefore\\) \\( p(y) = \\mathcal{N}(A\\mu \u0026#43; b, L^{-1} \u0026#43; A\\Lambda^{-1}A^T)\\) Introduce: \\(z = \\begin{pmatrix}x \\\\ y\\end{pmatrix}\\) , we can obtain \\(Cov[x, y] = \\mathbb{E}[(x - \\mathbb{E}[x])(y - \\mathbb{E}[y])^T]\\) . For this covariance, we can directly compute: \\( \\begin{aligned} Cov(x, y) \u0026amp;= \\mathbb{E}[(x - \\mu)(Ax - A\\mu \u0026#43; \\epsilon)^T] = \\mathbb{E}[(x - \\mu)(x - \\mu)^TA^T] = Var[x]A^T = \\Lambda^{-1}A^T \\end{aligned}\\) Note the symmetry of the covariance matrix: \\(p(z) = \\mathcal{N}\\begin{pmatrix}\\mu \\\\ A\\mu \u0026#43; b\\end{pmatrix}, \\begin{pmatrix}\\Lambda^{-1} \u0026amp; \\Lambda^{-1}A^T \\\\ A\\Lambda^{-1} \u0026amp; L^{-1} \u0026#43; A\\Lambda^{-1}A^T\\end{pmatrix}\\) . \\(\\therefore\\) \\( \\mathbb{E}[x|y] = \\mu \u0026#43; \\Lambda^{-1}A^T(L^{-1} \u0026#43; A\\Lambda^{-1}A^T)^{-1}(y - A\\mu - b) \\) \\( Var[x|y] = \\Lambda^{-1} - \\Lambda^{-1}A^T(L^{-1} \u0026#43; A\\Lambda^{-1}A^T)^{-1}A\\Lambda^{-1}\\) \\end{example}\nLinear Regression # \\epigraph{If AI thinks in matrix, so should human.}{anonymous}\nSuppose we have dataset \\(\\mathcal{D}\\) : \\(\\mathcal{D}=\\{(x_1, y_1),(x_2, y_2),\\cdots,(x_N, y_N)\\}\\) We annotate \\(X\\) as samples and \\(Y\\) as target: \\(X=(x_1,x_2,\\cdots,x_N)^T,Y=(y_1,y_2,\\cdots,y_N)^T\\) We are looking at a linear function: \\(f(w)=w^Tx\\) Least squares method # For this problem, we use the squared error defined by the Euclidean norm to define the loss function: \\(L(w) = \\sum\\limits_{i=1}^N ||w^Tx_i - y_i||^2_2\\) Expand it, we get:\n\\[\\begin{aligned} L(w)\u0026amp;=(w^Tx_1-y_1,\\cdots,w^Tx_N-y_N)\\cdot (w^Tx_1-y_1,\\cdots,w^Tx_N-y_N)^T\\\\ \u0026amp;=(w^TX^T-Y^T)\\cdot (Xw-Y)=w^TX^TXw-Y^TXw-w^TX^TY\u0026#43;Y^TY\\\\ \u0026amp;=w^TX^TXw-2w^TX^TY\u0026#43;Y^TY \\end{aligned}\\] Now minimize \\(\\hat{w}\\) \\[\\begin{aligned} \\hat{w}=\\mathop{argmin}\\limits_wL(w)\u0026amp;\\longrightarrow\\frac{\\partial}{\\partial w}L(w)=0\\\\ \u0026amp;\\longrightarrow2X^TX\\hat{w}-2X^TY=0\\\\ \u0026amp;\\longrightarrow \\hat{w}=(X^TX)^{-1}X^TY=X^\u0026#43;Y \\end{aligned}\\] \\((X^TX)^{-1}X^T\\) is also called the pseudo-inverse. We annote: \\(X^\u0026#43; = (X^TX)^-1X^T\\) For row full rank or column full rank \\(X\\) , the solution can be obtained directly, but for non-full rank sample sets, the singular value decomposition (SVD) method is needed. Apply Singular Value Decomposition (SVD) to \\(X\\) , we get \\(X=U\\Sigma V^T\\) .\nTo find the pseudoinverse \\(X^\u0026#43;\\) for a non-full rank matrix X, we can use the SVD decomposition. We first find the pseudoinverse of the diagonal matrix \\(\\Sigma\\) . For every non-zero diagonal element \\(\\sigma_i\\) , its pseudoinverse \\(\\sigma_i^\u0026#43;\\) is \\(\\frac{1}{\\sigma_i}\\) . 
Then, the pseudoinverse of \\(\\Sigma\\) , denoted as \\(\\Sigma^\u0026#43;\\) , is a diagonal matrix with the reciprocals of the non-zero diagonal elements of \\(\\Sigma\\) .\nNow, using the property of pseudoinverse for orthogonal matrices:\n\\(U^\u0026#43; = U^T\\) \\(V^\u0026#43; = V^T\\) An orthogonal matrix is a square matrix where its transpose is also its inverse:\n\\(Q^TQ = QQ^T = I\\) In the context of SVD, the matrices \\(U\\) and \\(V\\) are orthogonal matrices. The pseudoinverse, denoted by the superscript +, has the property that for orthogonal matrices, the pseudoinverse is equal to the transpose:\nFor an orthogonal matrix \\(U\\) , we have \\(U^TU = UU^T = I\\) , which implies \\(U^\u0026#43; = U^T\\) . Similarly, for an orthogonal matrix \\(V\\) , we have \\(V^TV = VV^T = I\\) , which implies \\(V^\u0026#43; = V^T\\) . The pseudoinverse \\(X^\u0026#43;\\) can be computed as:\n\\(X^\u0026#43; = (U\\Sigma V^T)^\u0026#43; = V\\Sigma^\u0026#43; U^T\\) So, we have:\n\\(X^\u0026#43; = V\\Sigma^{-1}U^T\\) MLE with Gaussian noise # For 1-d situation:\n\\(y=w^Tx\u0026#43;\\epsilon,\\epsilon\\sim\\mathcal{N}(0,\\sigma^2)\\) \\(\\therefore y\\sim\\mathcal{N}(w^Tx,\\sigma^2)\\) Plug it into MLE:\n\\[\\begin{aligned} L(w)=\\log p(Y|X,w)\u0026amp;=\\log\\prod\\limits_{i=1}^Np(y_i|x_i,w)\\\\ \u0026amp;=\\sum\\limits_{i=1}^N\\log(\\frac{1}{\\sqrt{2\\pi\\sigma}}e^{-\\frac{(y_i-w^Tx_i)^2}{2\\sigma^2}})\\\\ \\mathop{argmax}\\limits_wL(w)\u0026amp;=\\mathop{argmin}\\limits_w\\sum\\limits_{i=1}^N(y_i-w^Tx_i)^2 \\end{aligned}\\] This expression is the same as the result obtained from the least squares estimation.\nGaussian weight prior in MAP (Maximum A Posteriori) estimation. # Take the prior distribution \\(w \\sim \\mathcal{N}(0, \\sigma_0^2)\\) . Then:\n\\[\\begin{aligned} \\hat{w} = \\mathop{argmax}\\limits_w p(w|Y) \u0026amp;= \\mathop{argmax}\\limits_w p(Y|w) p(w) \\\\ \u0026amp;= \\mathop{argmax}\\limits_w \\log p(Y|w) p(w) \\\\ \u0026amp;= \\mathop{argmax}\\limits_w (\\log p(Y|w) \u0026#43; \\log p(w)) \\\\ \u0026amp;= \\mathop{argmin}\\limits_w [(y - w^Tx)^2 \u0026#43; \\frac{\\sigma^2}{\\sigma_0^2} w^Tw] \\end{aligned}\\] Here, we omit \\(X\\) as \\(p(Y)\\) is not related to \\(w\\) , and we also use the MLE result for Gaussian distribution from above.\nWe will see that the existence of the hyperparameter \\(\\sigma_0\\) corresponds to the Ridge regularization term that will be introduced below. Similarly, if the prior distribution is taken as the Laplace distribution, a result similar to L1 regularization will be obtained.\nRegularization # In practical applications, if the sample size is not much larger than the feature dimension, overfitting may occur. For this situation, we have the following three solutions:\nAdd more data Feature selection (reduce feature dimension), such as the PCA algorithm. Regularization Regularization generally involves adding a regularization term (representing the penalty for the model's complexity) to the loss function (such as the least squares loss introduced above). 
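To make this correspondence concrete, here is a minimal numerical sketch (my own illustration, not part of the derivation above; it assumes NumPy, and the synthetic data and variable names are purely illustrative). It compares the plain least-squares estimate with the MAP estimate under the Gaussian prior on \\(w\\), whose minimizer (as derived for Ridge below) adds \\(\\frac{\\sigma^2}{\\sigma_0^2}\\mathbb{I}\\) inside the inverse:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X w_true + Gaussian noise (illustrative setup only)
N, p = 50, 5
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p)
sigma = 0.5        # noise standard deviation, assumed known here
sigma0 = 1.0       # prior standard deviation of w
y = X @ w_true + sigma * rng.normal(size=N)

# MLE / least squares: solves X^T X w = X^T y
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with Gaussian prior w ~ N(0, sigma0^2 I):
# equivalent to an L2 penalty with lambda = sigma^2 / sigma0^2
lam = sigma**2 / sigma0**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("MLE estimate:", w_mle)
print("MAP estimate:", w_map)  # shrunk toward the prior mean of zero
```

The MAP estimate is pulled toward zero relative to the MLE, which is exactly the shrinkage effect of the L2 penalty.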
Below, we introduce two regularization frameworks commonly used in general situations.\n\\[\\begin{aligned} L1\u0026amp;:\\mathop{argmin}\\limits_w L(w)\u0026#43;\\lambda||w||_1,\\lambda \u0026gt; 0 \\\\ L2\u0026amp;:\\mathop{argmin}\\limits_w L(w)\u0026#43;\\lambda||w||^2_2,\\lambda \u0026gt; 0 \\end{aligned}\\] L1 Lasso # Caveate: L1 regularization can lead to sparse solutions.\nFrom the perspective of minimizing the loss, the derivative of the L1 term is non-zero near 0 on both the left and right sides, making it easier to obtain a solution of 0.\nOn the other hand, L1 regularization is equivalent to: \\(\\mathop{argmin}\\limits_w L(w) \\\\ s.t. ||w||_1 \u0026lt; C\\) We have already seen that the squared error loss function is an ellipsoid in the \\(w\\) space. Therefore, the solution of the above equation is the tangent point between the ellipsoid and \\(||w||_1 = C\\) , making it more likely to be tangent on the coordinate axis. That means that some of the weight coefficients (w) become exactly zero. This property leads to sparsity in the solution, which can be seen as a form of automatic feature selection. However, in certain cases, it might also result in losing some potentially important information from the features whose coefficients become zero.\nL2 Ridge # \\[\\begin{aligned} \\hat{w}=\\mathop{argmin}\\limits_wL(w)\u0026#43;\\lambda w^Tw\u0026amp;\\longrightarrow\\frac{\\partial}{\\partial w}L(w)\u0026#43;2\\lambda w=0\\\\ \u0026amp;\\longrightarrow2X^TX\\hat{w}-2X^TY\u0026#43;2\\lambda \\hat w=0\\\\ \u0026amp;\\longrightarrow \\hat{w}=(X^TX\u0026#43;\\lambda \\mathbb{I})^{-1}X^TY \\end{aligned}\\] As we can see, this regularization parameter coincides with the previous MAP result. Using the L2 norm for regularization not only allows the model to choose smaller values for \\(w\\) , but also avoids the issue of \\(X^TX\\) being non-invertible.\nSummary # The linear regression model is the simplest model, but it is a miniature of every machine learning problem. Here, we use the least squares error to obtain a closed-form solution. We also find that when the noise follows a Gaussian distribution, the MLE solution is equivalent to the least squares error. Moreover, after adding a regularization term, the least squares error combined with L2 regularization is equivalent to the MAP estimation under Gaussian noise prior, and with L1 regularization, it is equivalent to the Laplace noise prior.\nTraditional machine learning methods more or less contain elements of linear regression models:\nLinear models often cannot fit the data well, so there are three strategies to overcome this drawback: Transform the feature dimensions, such as polynomial regression models, which add higher-order terms based on linear features. Add a non-linear transformation after the linear equation by introducing a non-linear activation function, such as in linear classification models like perceptrons. For consistent linear coefficients, we perform multiple transformations so that a single feature is not only affected by a single coefficient, such as in multilayer perceptrons (deep feedforward networks). Linear regression is linear across the entire sample space. We can modify this limitation by introducing different linear or non-linear functions in different regions, such as in linear spline regression and decision tree models. 
Linear regression uses all the samples, but preprocessing the data might yield better learning results (the so-called curse of dimensionality, where high-dimensional data is harder to learn from), such as in PCA algorithms and manifold learning. Linear Classifiers # For classification tasks, linear regression models are not directly suitable, since their output is an unbounded real value rather than a class label. However, we can add an activation function to the linear model, which is non-linear. The inverse of the activation function is called the link function. We have two types of linear classification methods:\nHard classification, where we directly need to output the corresponding class of the observation. Representative models of this type include: Linear Discriminant Analysis (Fisher Discriminant) Perceptron Soft classification, which generates probabilities for different classes. These algorithms can be divided into two types based on the different probability methods used: Discriminative (directly modeling the conditional probability): Logistic Regression Generative (modeling the joint probability and then applying Bayes\u0026#39; theorem): Gaussian Discriminant Analysis, Naive Bayes Hard # Perceptron # Most intuitively, we can choose an activation function like: \\(sign(a)=\\left\\{\\begin{matrix}\u0026#43;1,a\\ge0\\\\-1,a\u0026lt;0\\end{matrix}\\right.\\) This way, we can map the result of linear regression to the binary classification result.\nDefine the loss function as the number of misclassified samples. A natural starting point is the indicator function, but the indicator function is non-differentiable. Therefore, we can define: \\(L(w)=\\sum\\limits_{x_i\\in\\mathcal{D}_{wrong}}-y_iw^Tx_i\\) where \\(\\mathcal{D}_{wrong}\\) is the set of misclassified samples. In each training step, we use the gradient descent algorithm. The partial derivative of the loss function with respect to \\(w\\) is: \\(\\frac{\\partial}{\\partial w}L(w)=\\sum\\limits_{x_i\\in\\mathcal{D}_{wrong}}-y_ix_i\\) However, if there are a large number of samples, the computational complexity is high. In fact, we do not need the exact direction of steepest descent of the loss function. We only need the loss to decrease in expectation, but calculating the expectation requires knowing the true probability distribution. We can only estimate this probability distribution from the training data samples (empirical risk):\n\\[\\mathbb{E}_{\\mathcal{D}} {\\mathbb{E}_{\\hat{p}} {[\\nabla_wL(w)]}} = \\mathbb{E}_{\\mathcal D}\\left[\\frac{1}{N}\\sum\\limits_{i=1}^N\\nabla_wL(w)\\right]\\] We know that the larger the \\(N\\) , the more accurate the sample approximation of the true distribution. However, for data with a standard deviation of \\(\\sigma\\) , the standard error of such an estimate only shrinks in proportion to \\(\\frac{1}{\\sqrt{N}}\\) , while the computational cost grows in proportion to \\(N\\) . Therefore, we can use fewer samples each time, so that both the expected loss reduction and the computation speed are acceptable. If we use only one misclassified sample each time, we have the following update strategy (based on the Taylor formula, moving in the negative gradient direction): \\(w^{t\u0026#43;1}\\leftarrow w^{t}\u0026#43;\\lambda y_ix_i\\) This scheme is convergent, and updating with a single observation also injects some randomness, which reduces the possibility of getting stuck in a local minimum.
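As a rough illustration of this single-sample rule (a toy sketch I am adding here; it assumes NumPy, labels in \\(\\{-1,+1\\}\\) as in the notes, and names such as lr and n_epochs are mine):

```python
import numpy as np

def perceptron(X, y, lr=1.0, n_epochs=100):
    """Single-sample perceptron updates applied only to misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        n_wrong = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # sign(w^T x) disagrees with the label
                w += lr * yi * xi      # w <- w + lambda * y_i * x_i
                n_wrong += 1
        if n_wrong == 0:               # no misclassified samples: converged
            break
    return w

# Toy usage on two well-separated blobs (bias handled via an appended 1-column)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, size=(20, 2)), rng.normal(-2.0, size=(20, 2))])
X = np.hstack([X, np.ones((40, 1))])
y = np.array([+1] * 20 + [-1] * 20)
w = perceptron(X, y)
```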
In larger-scale data, the commonly used method is mini-batch stochastic gradient descent.\nLDA # In LDA, our basic idea is to select a direction, project the experimental samples along this direction, and the projected data should satisfy two conditions to achieve better classification:\nThe distances between experimental samples within the same class are close. The distances between different classes are larger. First, let's consider the projection. We assume that the original data is a vector \\(x\\) , then the projection along the \\(w\\) direction is the scalar: \\(z=w^T\\cdot x(=|w|\\cdot|x|\\cos\\theta)\\) For the first point, the samples within the same class are closer. We assume that the number of experimental samples belonging to the two classes are \\(N_1\\) and \\(N_2\\) , respectively. We use the variance matrix to characterize the overall distribution within each class. Here we use the definition of covariance, denoted by \\(S\\) for the original data covariance:\n\\[\\begin{aligned} C_1:Var_z[C_1]\u0026amp;=\\frac{1}{N_1}\\sum\\limits_{i=1}^{N_1}(z_i-\\overline{z_{c1}})(z_i-\\overline{z_{c1}})^T\\\\ \u0026amp;=\\frac{1}{N_1}\\sum\\limits_{i=1}^{N_1}(w^Tx_i-\\frac{1}{N_1}\\sum\\limits_{j=1}^{N_1}w^Tx_j)(w^Tx_i-\\frac{1}{N_1}\\sum\\limits_{j=1}^{N_1}w^Tx_j)^T\\\\ \u0026amp;=w^T\\frac{1}{N_1}\\sum\\limits_{i=1}^{N_1}(x_i-\\overline{x_{c1}})(x_i-\\overline{x_{c1}})^Tw\\\\ \u0026amp;=w^TS_1w\\\\ C_2:Var_z[C_2]\u0026amp;=\\frac{1}{N_2}\\sum\\limits_{i=1}^{N_2}(z_i-\\overline{z_{c2}})(z_i-\\overline{z_{c2}})^T\\\\ \u0026amp;=w^TS_2w \\end{aligned}\\] Therefore, the within-class distance can be denoted as:\n\\[\\begin{aligned} Var_z[C_1]\u0026#43;Var_z[C_2]=w^T(S_1\u0026#43;S_2)w \\end{aligned}\\] For the second point, we can use the means of the two classes to represent this distance:\n\\[\\begin{aligned} (\\overline{z_{c1}}-\\overline{z_{c2}})^2\u0026amp;=(\\frac{1}{N_1}\\sum\\limits_{i=1}^{N_1}w^Tx_i-\\frac{1}{N_2}\\sum\\limits_{i=1}^{N_2}w^Tx_i)^2\\\\ \u0026amp;=(w^T(\\overline{x_{c1}}-\\overline{x_{c2}}))^2\\\\ \u0026amp;=w^T(\\overline{x_{c1}}-\\overline{x_{c2}})(\\overline{x_{c1}}-\\overline{x_{c2}})^Tw \\end{aligned}\\] Considering both points, since the covariance is a matrix, we divide these two values to obtain our loss function and maximize this value:\n\\[\\begin{aligned}\\hat{w}=\\mathop{argmax}\\limits_wJ(w)\u0026amp;=\\mathop{argmax}\\limits_w\\frac{(\\overline{z_{c1}}-\\overline{z_{c2}})^2}{Var_z[C_1]\u0026#43;Var_z[C_2]}\\\\ \u0026amp;=\\mathop{argmax}\\limits_w\\frac{w^T(\\overline{x_{c1}}-\\overline{x_{c2}})(\\overline{x_{c1}}-\\overline{x_{c2}})^Tw}{w^T(S_1\u0026#43;S_2)w}\\\\ \u0026amp;=\\mathop{argmax}\\limits_w\\frac{w^TS_bw}{w^TS_ww} \\end{aligned}\\] In this way, we have combined the loss function with the original dataset and parameters. Next, we will find the partial derivative of this loss function. Note that we actually have no requirement for the absolute value of \\(w\\) , only for its direction, so we can solve it with just one equation:\n\\[\\begin{aligned} \u0026amp;\\frac{\\partial}{\\partial w}J(w)=2S_bw(w^TS_ww)^{-1}-2w^TS_bw(w^TS_ww)^{-2}S_ww=0\\\\ \u0026amp;\\Longrightarrow S_bw(w^TS_ww)=(w^TS_bw)S_ww\\\\ \u0026amp;\\Longrightarrow w\\propto S_w^{-1}S_bw=S_w^{-1}(\\overline{x_{c1}}-\\overline{x_{c2}})(\\overline{x_{c1}}-\\overline{x_{c2}})^Tw\\propto S_w^{-1}(\\overline{x_{c1}}-\\overline{x_{c2}}) \\end{aligned}\\] Thus, \\(S_w^{-1}(\\overline{x_{c1}}-\\overline{x_{c2}})\\) is the direction we need to find. 
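A small numerical sketch of this result (again my own illustration under toy assumptions: NumPy, two synthetic classes, and \\(S_w=S_1+S_2\\) computed exactly as in the derivation above):

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))  # class 1 samples
X2 = rng.normal(loc=[3.0, 1.0], size=(100, 2))  # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1) / len(X1)          # within-class scatter, class 1
S2 = (X2 - m2).T @ (X2 - m2) / len(X2)          # within-class scatter, class 2
Sw = S1 + S2

w = np.linalg.solve(Sw, m1 - m2)                # direction S_w^{-1}(mean_1 - mean_2)
```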
Finally, we can normalize it to obtain a unit \\(w\\) value.\nSoft # Logistic regression # Sometimes we just want to obtain the probability of a single class, so we need a function that can output values within the range of \\([0,1]\\) . Considering a binary classification model, we use a discriminative model and hope to model \\(p(C|x)\\) using Bayes' theorem: \\(p(C_1|x)=\\frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1)\u0026#43;p(x|C_2)p(C_2)}\\) Let \\(a=\\ln\\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\\) , then: \\(p(C_1|x)=\\frac{1}{1\u0026#43;\\exp(-a)}\\) The above formula is called the Logistic Sigmoid function, and its parameter represents the logarithm of the ratio of the joint probabilities of the two classes. In the discriminant, we do not care about the specific value of this parameter, and the model assumption is made directly for \\(a\\) .\nThe model assumption of Logistic Regression is: \\(a=w^Tx\\) Thus, by finding the best value of \\(w\\) , we can obtain the best model under this model assumption. The probability discriminative model often uses the maximum likelihood estimation method to determine the parameters.\nFor a single observation, the probability of obtaining class \\(y\\) is (assuming \\(C_1=1, C_2=0\\) ): \\(p(y|x)=p_1^yp_0^{1-y}\\) Then, for \\(N\\) independent and identically distributed observations, the Maximum Likelihood Estimation (MLE) is: \\(\\hat{w}=\\mathop{argmax}_wJ(w)=\\mathop{argmax}_w\\sum\\limits_{i=1}^N(y_i\\log p_1\u0026#43;(1-y_i)\\log p_0)\\) Notice that this expression is the negative of the cross-entropy expression multiplied by \\(N\\) . The logarithm in MLE also ensures compatibility with the exponential function, thereby obtaining stable gradients in large intervals.\nTaking the derivative of this function, we notice that: \\(p_1\u0026#39;=(\\frac{1}{1\u0026#43;\\exp(-a)})\u0026#39;=p_1(1-p_1)\\) Then: \\(J\u0026#39;(w)=\\sum\\limits_{i=1}^Ny_i(1-p_1)x_i-p_1x_i\u0026#43;y_ip_1x_i=\\sum\\limits_{i=1}^N(y_i-p_1)x_i\\) Due to the nonlinearity of probability values, this expression cannot be solved directly when placed in the summation symbol. Therefore, in actual training, similar to the perceptron, we can also use batch stochastic gradient ascent (or gradient descent for minimization) with different sizes to obtain the maximum value of this function.\nGaussian Discrimination Analysis # In generative models, we model the joint probability distribution and then use Maximum A Posteriori (MAP) to obtain the best parameter values. 
For binary classification, we adopt the following assumption:\n\\(y\\sim Bernoulli(\\phi)\\) \\(x|y=1\\sim\\mathcal{N}(\\mu_1,\\Sigma)\\) \\(x|y=0\\sim\\mathcal{N}(\\mu_0,\\Sigma)\\) Then, for an independent and identically distributed dataset, the maximum a posteriori probability can be expressed as:\n\\[\\begin{aligned} \\mathop{argmax}_{\\phi,\\mu_0,\\mu_1,\\Sigma}\\log p(X|Y)p(Y)\u0026amp;=\\mathop{argmax}_{\\phi,\\mu_0,\\mu_1,\\Sigma}\\sum\\limits_{i=1}^N (\\log p(x_i|y_i)\u0026#43;\\log p(y_i))\\\\ \u0026amp;=\\mathop{argmax}_{\\phi,\\mu_0,\\mu_1,\\Sigma}\\sum\\limits_{i=1}^N((1-y_i)\\log\\mathcal{N}(\\mu_0,\\Sigma)\\\\\u0026amp;\\quad\u0026#43;y_i\\log \\mathcal{N}(\\mu_1,\\Sigma)\u0026#43;y_i\\log\\phi\u0026#43;(1-y_i)\\log(1-\\phi)) \\end{aligned}\\] First, solve for \\(\\phi\\) , and take the partial derivative of the expression with respect to \\(\\phi\\) :\n\\[ \\begin{aligned}\\sum\\limits_{i=1}^N\\frac{y_i}{\\phi}\u0026#43;\\frac{y_i-1}{1-\\phi}=0\\\\ \\Longrightarrow\\phi=\\frac{\\sum\\limits_{i=1}^Ny_i}{N}=\\frac{N_1}{N} \\end{aligned} \\] Solve \\(\\mu_1\\) \\[ \\begin{aligned}\\hat{\\mu_1}\u0026amp;=\\mathop{argmax}_{\\mu_1}\\sum\\limits_{i=1}^Ny_i\\log\\mathcal{N}(\\mu_1,\\Sigma)\\\\ \u0026amp;=\\mathop{argmin}_{\\mu_1}\\sum\\limits_{i=1}^Ny_i(x_i-\\mu_1)^T\\Sigma^{-1}(x_i-\\mu_1) \\end{aligned} \\] \\(\\because\\) \\( \\sum\\limits_{i=1}^Ny_i(x_i-\\mu_1)^T\\Sigma^{-1}(x_i-\\mu_1)=\\sum\\limits_{i=1}^Ny_ix_i^T\\Sigma^{-1}x_i-2y_i\\mu_1^T\\Sigma^{-1}x_i\u0026#43;y_i\\mu_1^T\\Sigma^{-1}\\mu_1 \\) Taking the derivative of the left side and multiplying by \\(\\Sigma\\) gives:\n\\[ \\begin{aligned}\\sum\\limits_{i=1}^N-2y_i\\Sigma^{-1}x_i\u0026#43;2y_i\\Sigma^{-1}\\mu_1=0\\\\ \\Longrightarrow\\mu_1=\\frac{\\sum\\limits_{i=1}^Ny_ix_i}{\\sum\\limits_{i=1}^Ny_i}=\\frac{\\sum\\limits_{i=1}^Ny_ix_i}{N_1} \\end{aligned} \\] Solve \\(\\mu_0\\) : \\( \\mu_0=\\frac{\\sum\\limits_{i=1}^N(1-y_i)x_i}{N_0} \\) The most difficult part is solving for \\(\\Sigma\\) . Our model assumes the same covariance matrix for positive and negative examples, although from the above solution, we can see that even if different matrices are used, it will not affect the previous three parameters. First, we have:\n\\[ \\begin{aligned} \\sum\\limits_{i=1}^N\\log\\mathcal{N}(\\mu,\\Sigma)\u0026amp;=\\sum\\limits_{i=1}^N\\log(\\frac{1}{(2\\pi)^{p/2}|\\Sigma|^{1/2}})\u0026#43;(-\\frac{1}{2}(x_i-\\mu)^T\\Sigma^{-1}(x_i-\\mu))\\\\ \u0026amp;=Const-\\frac{1}{2}N\\log|\\Sigma|-\\frac{1}{2}Trace((x_i-\\mu)^T\\Sigma^{-1}(x_i-\\mu))\\\\ \u0026amp;=Const-\\frac{1}{2}N\\log|\\Sigma|-\\frac{1}{2}Trace((x_i-\\mu)(x_i-\\mu)^T\\Sigma^{-1})\\\\ \u0026amp;=Const-\\frac{1}{2}N\\log|\\Sigma|-\\frac{1}{2}NTrace(S\\Sigma^{-1}) \\end{aligned} \\] In this expression, we add a trace to the scalar so that the order of the matrices can be exchanged. 
For the derivative of expressions containing absolute values and traces, we have:\n\\[ \\begin{aligned} \\frac{\\partial}{\\partial A}(|A|)\u0026amp;=|A|A^{-1}\\\\ \\frac{\\partial}{\\partial A}Trace(AB)\u0026amp;=B^T \\end{aligned} \\] \\(\\therefore\\) \\[ \\begin{aligned}[\\sum\\limits_{i=1}^N((1-y_i)\\log\\mathcal{N}(\\mu_0,\\Sigma)\u0026#43;y_i\\log \\mathcal{N}(\\mu_1,\\Sigma)]\u0026#39; \\\\=Const-\\frac{1}{2}N\\log|\\Sigma|-\\frac{1}{2}N_1Trace(S_1\\Sigma^{-1})-\\frac{1}{2}N_2Trace(S_2\\Sigma^{-1}) \\end{aligned} \\] In which, \\(S_1, S_2\\) are the covariance matrices within the two classes of data, so:\n\\[ \\begin{aligned}N\\Sigma^{-1}-N_1S_1^T\\Sigma^{-2}-N_2S_2^T\\Sigma^{-2}=0 \\\\\\Longrightarrow\\Sigma=\\frac{N_1S_1\u0026#43;N_2S_2}{N} \\end{aligned} \\] The symmetry of the class covariance matrices is applied here.\nThus, we have obtained all the parameters in our model assumptions using the maximum a posteriori method. According to the model, the joint distribution can be obtained, and hence the conditional distribution for inference can be derived.\nNaive Bayes # The Gaussian discriminant analysis above assumes a Gaussian distribution for the dataset and introduces the Bernoulli distribution as the class prior, thereby obtaining the parameters within these assumptions using maximum a posteriori estimation.\nThe Naive Bayes model makes assumptions about the relationships between the properties of the data. Generally, we need to obtain the probability value \\(p(x|y)\\) . Since \\(x\\) has \\(p\\) dimensions, we need to sample the joint probability of these many dimensions. However, we know that a large number of samples are required to obtain a relatively accurate probability approximation in such a high-dimensional space.\nIn a general directed probabilistic graphical model, different assumptions are made about the conditional independence relationships between the dimensions of the attributes. The simplest assumption is the conditional independence assumption in the description of the Naive Bayes model: \\(p(x|y)=\\prod\\limits_{i=1}^pp(x_i|y)\\) That is: \\(x_i\\perp x_j|y,\\forall\\ i\\ne j\\) Thus, using Bayes' theorem, for a single observation: \\(p(y|x)=\\frac{p(x|y)p(y)}{p(x)}=\\frac{\\prod\\limits_{i=1}^pp(x_i|y)p(y)}{p(x)}\\) Further assumptions are made for the individual dimensions of conditional probabilities and class priors:\n\\(x_i\\) is a continuous variable: \\(p(x_i|y)=\\mathcal{N}(\\mu_i,\\sigma_i^2)\\) \\(x_i\\) is a discrete variable: Categorical distribution: \\(p(x_i=i|y)=\\theta_i,\\sum\\limits_{i=1}^K\\theta_i=1\\) \\(p(y)=\\phi^y(1-\\phi)^{1-y}\\) To estimate these parameters, the MLE method is often used directly on the dataset. Since there is no need to know the relationships between the dimensions, the required amount of data is greatly reduced. After estimating these parameters, they are substituted back into Bayes' theorem to obtain the posterior distribution of the categories.\nSummary # Classification tasks are divided into two types. For tasks that require directly outputting the class, in the perceptron algorithm, we add a sign function as an activation function to the linear model, which allows us to obtain the class. However, the sign function is not smooth, so we adopt an error-driven approach, introducing \\(\\sum\\limits_{x_i\\in\\mathcal{D}_{wrong}}-y_iw^Tx_i\\) as the loss function, and then minimize this error using the batch stochastic gradient descent method to obtain the optimal parameter values. 
In linear discriminant analysis, we consider the linear model as a projection of data points in a certain direction, and use the idea of minimizing within-class variance and maximizing between-class variance to define the loss function. The within-class variance is defined as the sum of the variances of the two classes, and the between-class variance is defined as the distance between the centroids of the two classes. Taking the derivative of the loss function gives the direction of the parameters, which is \\(S_w^{-1}(\\overline x_{c1}-\\overline x_{c2})\\) , where \\(S_w\\) is the sum of the variances of the two classes in the original dataset.\nAnother type of task is to output the probability of classification. For probability models, we have two schemes. The first is the discriminative model, which directly models the conditional probability of the class. By inserting the linear model into the Logistic function, we obtain the Logistic regression model. The probability interpretation here is that the logarithm of the joint probability ratio of the two classes is linear. We define the loss function as the cross-entropy (equivalent to MLE), and taking the derivative of this function gives \\(\\frac{1}{N}\\sum\\limits_{i=1}^N(y_i-p_1)x_i\\) . We also use the batch stochastic gradient (ascent) method for optimization. The second is the generative model, which introduces the class prior. In Gaussian discriminant analysis, we assume the distribution of the dataset, where the class prior is a binomial distribution, and the likelihood of each class is a Gaussian distribution. Maximizing the log-likelihood of this joint distribution gives the parameters, \\(\\frac{\\sum\\limits_{i=1}^Ny_ix_i}{N_1},\\frac{\\sum\\limits_{i=1}^N(1-y_i)x_i}{N_0},\\frac{N_1S_1\u0026#43;N_2S_2}{N},\\frac{N_1}{N}\\) . In Naive Bayes, we further assume the dependence between the dimensions of the attributes, and the conditional independence assumption greatly reduces the data requirements.\nSupport Vector Machines (SVM) # Support Vector Machines (SVM) play an important role in classification problems, with the main idea being to maximize the margin between two classes. Based on the characteristics of the dataset:\nLinearly separable problems, like those previously handled by the perceptron algorithm Linearly separable, with only a few misclassified points, like the problems addressed by the evolved Pocket algorithm from the perceptron Nonlinear problems, completely non-separable, such as those tackled by multilayer perceptrons and deep learning developed from the perceptron problem For these three situations, SVM has the following three methods:\nHard-margin SVM Soft-margin SVM Kernel Method In solving SVMs, Lagrange multiplier method is widely used. 
First, let\u0026rsquo;s start with it. Constrained Optimization # Generally, a constrained optimization problem (original problem) can be written as:\n\\[\\begin{aligned} \u0026amp;\\min_{x\\in\\mathbb{R^p}}f(x)\\\\ \u0026amp;s.t.\\ m_i(x)\\le0,i=1,2,\\cdots,M\\\\ \u0026amp;\\ \\ \\ \\ \\ \\ \\ \\ n_j(x)=0,j=1,2,\\cdots,N \\end{aligned}\\] Define the Lagrange function:\n\\( L(x,\\lambda,\\eta) = f(x) \u0026#43; \\sum_{i=1}^{M} \\lambda_i m_i(x) \u0026#43; \\sum_{j=1}^{N} \\eta_j n_j(x) \\) Then, the original problem can be equivalently transformed into an unconstrained form:\n\\(\\min_x \\max_{\\lambda, \\eta} L(x,\\lambda,\\eta)\\) where \\(\\lambda\\) is the vector of Lagrange multipliers for the inequality constraints and \\(\\eta\\) is the vector of Lagrange multipliers for the equality constraints. Note that \\(\\lambda\\) must be non-negative.\nThis unconstrained form allows us to solve the optimization problem without explicitly considering the constraints, as they are incorporated into the Lagrange function.\nThis is because, when the inequality constraints of the original problem are satisfied, the maximum value can be obtained when \\(\\lambda_i=0\\) , which is directly equivalent to the original problem. If the inequality constraints of the original problem are not satisfied, the maximum value becomes \\(\u0026#43;\\infty\\) . Since the minimum value is required, this situation will not be chosen.\nThe dual form of this problem is: \\(\\max_{\\lambda,\\eta}\\min_{x\\in\\mathbb{R}^p}L(x,\\lambda,\\eta)\\ s.t.\\ \\lambda_i\\ge0\\) The dual problem is a maximization problem with respect to \\(\\lambda, \\eta\\) .\n\\(\\because\\) \\(\\max_{\\lambda_i,\\eta_j}\\min_{x}L(x,\\lambda_i,\\eta_j)\\le\\min_{x}\\max_{\\lambda_i,\\eta_j}L(x,\\lambda_i,\\eta_j)\\) The optimal value of the dual problem is therefore less than or equal to that of the original problem, with two situations:\nStrong duality: The equality can be achieved Weak duality: The equality cannot be achieved For a convex optimization problem, the following theorem holds: If the convex optimization problem satisfies certain conditions, such as the Slater condition, then it and its dual problem satisfy a strong duality relationship. Let the problem\u0026rsquo;s domain be defined as: \\(\\mathcal{D}=domf(x)\\cap dom m_i(x)\\cap domn_j(x)\\) . The Slater condition is: \\( \\exists \\hat{x} \\in Relint \\mathcal{D}\\ such\\ that\\ \\forall i=1,2,\\cdots,M, m_i(\\hat{x}) \u0026lt; 0 \\) where \\(Relint\\) represents the relative interior (interior not containing the boundary).\nFor most convex optimization problems, the Slater condition holds. A relaxed version also exists: if \\(K\\) of the \\(M\\) inequality constraints are affine functions, it is sufficient for the remaining constraints to satisfy the Slater condition.\nThe above introduced the duality relationship between the original problem and the dual problem, but in practice, it is necessary to solve for the parameters. The solution method uses the KKT (Karush-Kuhn-Tucker) conditions:\nThe KKT conditions and the strong duality relationship are equivalent. The KKT conditions for the optimal solution are:\nFeasible domain:\n\\[ \\begin{aligned} m_i(x^*)\\le0\\\\ n_j(x^*)=0\\\\ \\lambda^*\\ge0 \\end{aligned} \\] Complementary slackness \\(\\lambda^*m_i(x^*)=0,\\forall m_i\\) .
The optimal value of the dual problem is \\(d^*\\) , and the original problem is \\(p^*\\) \\[ \\begin{aligned} d^*\u0026amp;=\\max_{\\lambda,\\eta}g(\\lambda,\\eta)=g(\\lambda^*,\\eta^*)\\\\ \u0026amp;=\\min_{x}L(x,\\lambda^*,\\eta^*)\\\\ \u0026amp;\\le L(x^*,\\lambda^*,\\eta^*)\\\\ \u0026amp;=f(x^*)\u0026#43;\\sum\\limits_{i=1}^M\\lambda^*m_i(x^*)\\\\ \u0026amp;\\le f(x^*)=p^* \\end{aligned} \\] To satisfy the equality, both inequalities must hold. Therefore, for the first inequality, the gradient must be 0; for the second inequality, the complementary slackness condition must be satisfied.\nGradient is 0: \\(\\frac{\\partial L(x,\\lambda^*,\\eta^*)}{\\partial x}|_{x=x^*}=0\\) Hard-margin SVM # Support Vector Machine (SVM) is also a kind of hard classification model. In the previous perceptron model, we added a sign function on top of the linear model. From a geometric intuition, we can see that if the two classes are well separated, there will actually be infinitely many lines that can separate them. In SVM, we introduce the concept of maximizing the margin, where the margin refers to the minimum distance between the data and the dividing line. Maximizing this value reflects the tendency of our model.\nThe separating hyperplane can be written as: \\(0=w^Tx\u0026#43;b\\) Then, maximize the margin (subject to the constraints of the classification task): \\(\\mathop{argmax}_{w,b}[\\min_i\\frac{|w^Tx_i\u0026#43;b|}{||w||}]\\ s.t.\\ y_i(w^Tx_i\u0026#43;b)\u0026gt;0\\\\ \\Longrightarrow\\mathop{argmax}_{w,b}[\\min_i\\frac{y_i(w^Tx_i\u0026#43;b)}{||w||}]\\ s.t.\\ y_i(w^Tx_i\u0026#43;b)\u0026gt;0\\) For this constraint \\(y_i(w^Tx_i\u0026#43;b)\u0026gt;0\\) , we can fix \\(\\min y_i(w^Tx_i\u0026#43;b)=1\u0026gt;0\\) , since scaling the coefficients of the hyperplane separating the two classes does not change the plane. This is equivalent to constraining the coefficients of the hyperplane. The simplified expression can be represented as: \\(\\mathop{argmin}_{w,b}\\frac{1}{2}w^Tw\\ s.t.\\ \\min_iy_i(w^Tx_i\u0026#43;b)=1\\\\ \\Rightarrow\\mathop{argmin}_{w,b}\\frac{1}{2}w^Tw\\ s.t.\\ y_i(w^Tx_i\u0026#43;b)\\ge1,i=1,2,\\cdots,N\\) This is a convex optimization problem with \\(N\\) constraints, and there are many software tools for solving such problems.\nHowever, if the sample size or dimension is very high, solving the problem directly can be difficult or even infeasible, so further processing is needed. 
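For small, separable problems this can indeed be handed to an off-the-shelf solver. A hedged sketch, assuming scikit-learn is available and using a very large \\(C\\) so that its soft-margin solver (the soft-margin formulation is introduced below) behaves approximately like the hard-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+2.0, size=(30, 2)),   # class +1
               rng.normal(-2.0, size=(30, 2))])  # class -1
y = np.array([+1] * 30 + [-1] * 30)

# A very large C approximates the hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]           # normal vector of the separating hyperplane
b = clf.intercept_[0]      # bias term
sv = clf.support_vectors_  # support vectors: the points lying on the margin
```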
Introduce the Lagrange function: \\(L(w,b,\\lambda)=\\frac{1}{2}w^Tw\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i(1-y_i(w^Tx_i\u0026#43;b))\\) The original problem is equivalent to: \\(\\mathop{argmin}_{w,b}\\max_{\\lambda}L(w,b,\\lambda_i)\\ s.t.\\ \\lambda_i\\ge0\\) We swap the minimum and maximum symbols to obtain the dual problem: \\(\\max_{\\lambda_i}\\min_{w,b}L(w,b,\\lambda_i)\\ s.t.\\ \\lambda_i\\ge0\\) Since the inequality constraint is an affine function, the dual problem is equivalent to the original problem:\n\\(b\\) : \\(\\frac{\\partial}{\\partial b}L=0\\Rightarrow\\sum\\limits_{i=1}^N\\lambda_iy_i=0\\) \\(w\\) :plug in \\(b\\) \\( L(w,b,\\lambda_i)=\\frac{1}{2}w^Tw\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i(1-y_iw^Tx_i-y_ib)=\\frac{1}{2}w^Tw\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i-\\sum\\limits_{i=1}^N\\lambda_iy_iw^Tx_i \\) \\(\\therefore\\) \\( \\frac{\\partial}{\\partial w}L=0\\Rightarrow w=\\sum\\limits_{i=1}^N\\lambda_iy_ix_i \\) Plug in both parameters \\( L(w,b,\\lambda_i)=-\\frac{1}{2}\\sum\\limits_{i=1}^N\\sum\\limits_{j=1}^N\\lambda_i\\lambda_jy_iy_jx_i^Tx_j\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i \\) Therefore, the dual problem is: \\(\\max_{\\lambda}-\\frac{1}{2}\\sum\\limits_{i=1}^N\\sum\\limits_{j=1}^N\\lambda_i\\lambda_jy_iy_jx_i^Tx_j\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i,\\ s.t.\\ \\lambda_i\\ge0\\) The parameters of the hyperplane can be obtained from the KKT conditions:\nThe necessary and sufficient conditions for the strong duality relationship between the original problem and the dual problem are that they satisfy the KKT conditions:\n\\[\\begin{aligned} \u0026amp;\\frac{\\partial L}{\\partial w}=0,\\frac{\\partial L}{\\partial b}=0 \\\\\u0026amp;\\lambda_k(1-y_k(w^Tx_k\u0026#43;b))=0(slackness\\ complementary)\\\\ \u0026amp;\\lambda_i\\ge0\\\\ \u0026amp;1-y_i(w^Tx_i\u0026#43;b)\\le0 \\end{aligned}\\] Based on these conditions, we can obtain the corresponding optimal parameters:\n\\[\\begin{aligned} \\hat{w}\u0026amp;=\\sum\\limits_{i=1}^N\\lambda_iy_ix_i\\\\ \\hat{b}\u0026amp;=y_k-w^Tx_k=y_k-\\sum\\limits_{i=1}^N\\lambda_iy_ix_i^Tx_k, \\\\ \u0026amp;\\exists k,1-y_k(w^Tx_k\u0026#43;b)=0 \\end{aligned}\\] Thus, the parameter \\(w\\) of the hyperplane is a linear combination of data points, and the final parameter values are the linear combination of some vectors that satisfy \\(y_i(w^Tx_i\u0026#43;b)=1\\) (given by the complementary slackness condition). These vectors are also called support vectors.\nSoft-margin SVM # Hard-margin SVM is only solvable for separable data. If the data is not separable, our basic idea is to introduce the possibility of misclassification into the loss function. The number of misclassifications can be written as: \\(error=\\sum\\limits_{i=1}^N\\mathbb{I}\\{y_i(w^Tx_i\u0026#43;b)\u0026lt;1\\}\\) This function is discontinuous, so we can rewrite it as: \\(error=\\sum\\limits_{i=1}^N\\max\\{0,1-y_i(w^Tx_i\u0026#43;b)\\}\\) The term inside the summation is called the Hinge Function.\nBy adding this error term into the hard-margin SVM, we have:\n\\[\\begin{aligned} \u0026amp;\\mathop{argmin}_{w,b}\\frac{1}{2}w^Tw\u0026#43;C\\sum\\limits_{i=1}^N\\max\\{0,1-y_i(w^Tx_i\u0026#43;b)\\} \\\\ \u0026amp;\\textrm{s.t.} \\quad y_i(w^Tx_i\u0026#43;b)\\ge1-\\xi_i,i=1,2,\\cdots,N \\end{aligned}\\] In this expression, the constant \\(C\\) can be considered as the allowed error level. 
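In code the hinge term is a one-liner; a minimal sketch of the full soft-margin objective (assuming NumPy and the same notation as above):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * (w @ w) + C * hinge.sum()
```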
To further eliminate the \\(\\max\\) symbol, for each observation in the dataset, we can assume that most of them satisfy the constraint, but some of them violate the constraint. Therefore, this part of the constraint becomes \\(y_i(w^Tx\u0026#43;b)\\ge1-\\xi_i\\) , where \\(\\xi_i=1-y_i(w^Tx_i\u0026#43;b)\\) . Further simplification gives: \\(\\mathop{argmin}_{w,b}\\frac{1}{2}w^Tw\u0026#43;C\\sum\\limits_{i=1}^N\\xi_i\\ s.t.\\ y_i(w^Tx_i\u0026#43;b)\\ge1-\\xi_i,\\xi_i\\ge0,i=1,2,\\cdots,N\\) Kernel Method # Kernel methods can be applied to many problems. In classification problems, for strictly non-separable problems, we introduce a feature transformation function to transform the original non-separable dataset into a separable dataset, and then apply the existing model. Often, when transforming a low-dimensional dataset into a high-dimensional dataset, the data becomes separable (the data becomes sparser):\nHigh-dimensional spaces are more likely to be linearly separable than low-dimensional spaces.\nWhen applied to SVM, we observe the dual problem of SVM: \\(\\max_{\\lambda}-\\frac{1}{2}\\sum\\limits_{i=1}^N\\sum\\limits_{j=1}^N\\lambda_i\\lambda_jy_iy_jx_i^Tx_j\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i,\\ s.t.\\ \\lambda_i\\ge0\\) When solving, we need to find the inner product, so after the feature transformation of non-separable data, we need to find the inner product of the transformed data. It is often difficult to find the inner product of the transformation function. Hence, we directly introduce a transformation function of the inner product: \\(\\forall x,x\u0026#39;\\in\\mathcal{X},\\exists\\phi\\in\\mathcal{H}:x\\rightarrow z\\ s.t.\\ k(x,x\u0026#39;)=\\phi(x)^T\\phi(x)\\) The function \\(k(x,x\u0026#39;)\\) is called a positive definite kernel function, where \\(\\mathcal{H}\\) is a Hilbert space (a complete linear inner product space). If we remove the inner product condition, we simply call it a kernel function.\n\\(k(x,x\u0026#39;)=\\exp(-\\frac{(x-x\u0026#39;)^2}{2\\sigma^2})\\) is a kernel function.\n\\[\\begin{aligned} \\exp(-\\frac{(x-x\\prime)^2}{2\\sigma^2})\u0026amp;=\\exp(-\\frac{x^2}{2\\sigma^2})\\exp(\\frac{xx\\prime}{\\sigma^2})\\exp(-\\frac{x\\prime^2}{2\\sigma^2})\\\\ \u0026amp;=\\exp(-\\frac{x^2}{2\\sigma^2})\\sum\\limits_{n=0}^{\u0026#43;\\inf}\\frac{x^nx\\prime^n}{\\sigma^{2n}n!}\\exp(-\\frac{x\\prime^2}{2\\sigma^2})\\\\ \u0026amp;=\\exp(-\\frac{x^2}{2\\sigma^2})\\varphi(x)\\varphi(x\\prime)\\exp(-\\frac{x\\prime^2}{2\\sigma^2})\\\\\u0026amp;=\\phi(x)\\phi(x\\prime) \\end{aligned}\\] A positive definite kernel function has the following equivalent definitions:\nIf the kernel function satisfies:\nSymmetry Positive definiteness Then this kernel function is a positive definite kernel function.\nProof:\nSymmetry \\(\\Leftrightarrow\\) \\(k(x,z)=k(z,x)\\) , which obviously satisfies the definition of an inner product. Positive definiteness \\(\\Leftrightarrow\\) \\(\\forall N,x_1,x_2,\\cdots,x_N\\in\\mathcal{X}\\) , the corresponding Gram Matrix \\(K=[k(x_i,x_j)]\\) is positive semi-definite. To prove: \\(k(x,z)=\\phi(x)^T\\phi(z)\\Leftrightarrow K\\) is positive semi-definite and symmetric.\n\\(\\Rightarrow\\) : Firstly, symmetry is obvious. 
For positive definiteness: \\( K=\\begin{pmatrix}k(x_1,x_2)\u0026amp;\\cdots\u0026amp;k(x_1,x_N)\\\\\\vdots\u0026amp;\\vdots\u0026amp;\\vdots\\\\k(x_N,x_1)\u0026amp;\\cdots\u0026amp;k(x_N,x_N)\\end{pmatrix} \\) Take any \\(\\alpha\\in\\mathbb{R}^N\\) , we need to prove \\(\\alpha^TK\\alpha\\ge0\\) : \\( \\alpha^TK\\alpha=\\sum\\limits_{i,j}\\alpha_i\\alpha_jK_{ij}=\\sum\\limits_{i,j}\\alpha_i\\phi^T(x_i)\\phi(x_j)\\alpha_j=\\sum\\limits_{i}\\alpha_i\\phi^T(x_i)\\sum\\limits_{j}\\alpha_j\\phi(x_j) \\) This expression is in the form of an inner product. Hilbert space satisfies linearity, so the proof of positive definiteness is complete.\n\\(\\Leftarrow\\) : Decompose \\(K\\) , for the symmetric matrix \\(K=V\\Lambda V^T\\) , then let \\(\\phi(x_i)=\\sqrt{\\lambda_i}V_i\\) , where \\(V_i\\) is the eigenvector, and we construct \\(k(x,z)=\\sqrt{\\lambda_i\\lambda_j}V_i^TV_j\\) .\nSummary # For a long time, classification problems relied on SVM. For strictly separable datasets, Hard-margin SVM selects a hyperplane that maximizes the distance to all data points. A constraint is applied to this plane, fixing \\(y_i(w^Tx_i\u0026#43;b)=1\\) , resulting in a convex optimization problem with all constraint conditions as affine functions. This satisfies the Slater condition, and the problem is transformed into a dual problem, obtaining an equivalent solution and constraint parameters:\n\\(\\max_{\\lambda}-\\frac{1}{2}\\sum\\limits_{i=1}^N\\sum\\limits_{j=1}^N\\lambda_i\\lambda_jy_iy_jx_i^Tx_j\u0026#43;\\sum\\limits_{i=1}^N\\lambda_i,\\ s.t.\\ \\lambda_i\\ge0\\) The required hyperplane parameters are solved using the KKT conditions of the strong dual problem:\n\\[\\begin{aligned} \u0026amp;\\frac{\\partial L}{\\partial w}=0,\\frac{\\partial L}{\\partial b}=0 \\\\\u0026amp;\\lambda_k(1-y_k(w^Tx_k\u0026#43;b))=0(slackness\\ complementary)\\\\ \u0026amp;\\lambda_i\\ge0\\\\ \u0026amp;1-y_i(w^Tx_i\u0026#43;b)\\le0 \\end{aligned}\\] The solution is:\n\\[\\begin{aligned} \\hat{w}=\\sum\\limits_{i=1}^N\\lambda_iy_ix_i\\\\ \\hat{b}=y_k-w^Tx_k=y_k-\\sum\\limits_{i=1}^N\\lambda_iy_ix_i^Tx_k \\\\ \\exists k,1-y_k(w^Tx_k\u0026#43;b)=0 \\end{aligned}\\] When allowing for some errors, an error term can be added to the Hard-margin SVM. The Hinge Function represents the size of the error term, resulting in: \\(\\mathop{argmin}_{w,b}\\frac{1}{2}w^Tw\u0026#43;C\\sum\\limits_{i=1}^N\\xi_i\\ s.t.\\ y_i(w^Tx_i\u0026#43;b)\\ge1-\\xi_i,\\xi_i\\ge0,i=1,2,\\cdots,N\\) For completely non-separable problems, we use feature transformation. In SVM, we introduce a positive definite kernel function to directly transform the inner product. As long as this transformation satisfies symmetry and positive definiteness, it can be used as a kernel function.\nExponential Distribution # The exponential family is a class of distributions that includes Gaussian distribution, Bernoulli distribution, binomial distribution, Poisson distribution, Beta distribution, Dirichlet distribution, Gamma distribution, and a series of other distributions. Exponential family distributions can be written in a unified form:\n\\(p(x|\\eta)=h(x)\\exp(\\eta^T\\phi(x)-A(\\eta))=\\frac{1}{\\exp(A(\\eta))}h(x)\\exp(\\eta^T\\phi(x))\\) \\(\\eta\\) is the parameter vector, and \\(A(\\eta)\\) is the log partition function (normalization factor).\nIn this expression, \\(\\phi(x)\\) is called the sufficient statistic, containing all the information of the sample set, such as the mean and variance in the Gaussian distribution. 
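As a quick illustration of this form (a standard identity added here for concreteness, with \\(\\mu\\) denoting the Bernoulli parameter), the Bernoulli distribution is a member of the family:\n\\[ p(x|\\mu)=\\mu^{x}(1-\\mu)^{1-x}=\\exp\\left(x\\log\\frac{\\mu}{1-\\mu}+\\log(1-\\mu)\\right) \\] so that \\(\\phi(x)=x\\) , \\(\\eta=\\log\\frac{\\mu}{1-\\mu}\\) , \\(A(\\eta)=\\log(1+e^{\\eta})\\) and \\(h(x)=1\\) .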
Exponential Family Distribution # The exponential family is a class of distributions that includes the Gaussian, Bernoulli, binomial, Poisson, Beta, Dirichlet, Gamma, and a series of other distributions. Exponential family distributions can be written in a unified form:
\\(p(x|\\eta)=h(x)\\exp(\\eta^T\\phi(x)-A(\\eta))=\\frac{1}{\\exp(A(\\eta))}h(x)\\exp(\\eta^T\\phi(x))\\)
\\(\\eta\\) is the parameter vector, and \\(A(\\eta)\\) is the log partition function (normalization factor).
In this expression, \\(\\phi(x)\\) is called the sufficient statistic: it contains all the information the sample set carries about the parameters. For the Gaussian distribution, for example, the sufficient statistics are \\(x\\) and \\(x^2\\), from which the mean and variance can be recovered. Sufficient statistics are useful in online learning: for a dataset, it is only necessary to record the sufficient statistics of the samples.
For a model distribution assumption (likelihood), we often want a conjugate prior, so that the prior and posterior have the same form. For example, if the likelihood is a binomial distribution, the prior can be chosen as a Beta distribution, and the posterior is then also a Beta distribution. Exponential family distributions often have conjugate priors, which makes model selection and inference much more convenient.
Besides the computational convenience of conjugacy, exponential family distributions also realize the idea of maximum entropy (uninformative priors): the distribution derived from empirical constraints using the principle of maximum entropy is an exponential family distribution.
Noticing that the expression of the exponential family distribution resembles a linear model, the exponential family naturally gives rise to the generalized linear model: \\(y=f(w^Tx),\\ y|x\\sim \\text{ExpFamily}\\)
In more complex probabilistic graphical models, such as undirected models like the Restricted Boltzmann Machine, exponential family distributions also play an important role.
In inference algorithms such as variational inference, exponential family distributions greatly simplify the computation.
1-d Gaussian distribution # The 1-d Gaussian distribution can be written as: \\(p(x|\\theta)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{(x-\\mu)^2}{2\\sigma^2})\\) Now transform the above:
\\[\\begin{aligned} &\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\exp(-\\frac{1}{2\\sigma^2}(x^2-2\\mu x+\\mu^2))\\\\ &=\\exp(\\log(2\\pi\\sigma^2)^{-1/2})\\exp(-\\frac{1}{2\\sigma^2}\\begin{pmatrix}-2\\mu&1\\end{pmatrix}\\begin{pmatrix}x\\\\x^2\\end{pmatrix}-\\frac{\\mu^2}{2\\sigma^2}) \\end{aligned}\\]
\\(\\therefore\\) \\(\\phi(x)=\\begin{pmatrix}x\\\\x^2\\end{pmatrix}\\), \\(\\eta=\\begin{pmatrix}\\frac{\\mu}{\\sigma^2}\\\\-\\frac{1}{2\\sigma^2}\\end{pmatrix}=\\begin{pmatrix}\\eta_1\\\\\\eta_2\\end{pmatrix}\\), \\(A(\\eta)=-\\frac{\\eta_1^2}{4\\eta_2}+\\frac{1}{2}\\log(-\\frac{\\pi}{\\eta_2})\\)
Sufficient statistics and log partition functions # Integrate the probability density function:
\\[\\begin{aligned} \\exp(A(\\eta))&=\\int h(x)\\exp(\\eta^T\\phi(x))dx \\end{aligned}\\]
Take the derivative with respect to the parameter on both sides:
\\[\\begin{aligned} \\exp(A(\\eta))A'(\\eta)&=\\int h(x)\\exp(\\eta^T\\phi(x))\\phi(x)dx\\\\ \\Longrightarrow A'(\\eta)&=\\mathbb{E}_{p(x|\\eta)}[\\phi(x)] \\end{aligned}\\]
Similarly: \\(A''(\\eta)=Var_{p(x|\\eta)}[\\phi(x)]\\) Since the variance is non-negative, \\(A(\\eta)\\) must be a convex function.
Sufficient statistics and MLE # For an independently and identically distributed (\\(iid\\)) dataset \\(\\mathcal{D}=\\{x_1,x_2,\\cdots,x_N\\}\\):
\\[\\begin{aligned} \\eta_{MLE}&=\\mathop{argmax}_\\eta\\sum\\limits_{i=1}^N\\log p(x_i|\\eta)\\\\ &=\\mathop{argmax}_\\eta\\sum\\limits_{i=1}^N(\\eta^T\\phi(x_i)-A(\\eta))\\\\ &\\Longrightarrow A'(\\eta_{MLE})=\\frac{1}{N}\\sum\\limits_{i=1}^N\\phi(x_i) \\end{aligned}\\] (the last line follows by setting the derivative with respect to \\(\\eta\\) to zero). From this we can see that, to estimate the parameters, it is sufficient to know the sufficient statistics.
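As a small illustration of this point (a sketch with simulated data, not from the original notes): accumulate the sufficient statistics \\(\\sum_i x_i\\) and \\(\\sum_i x_i^2\\) in a single pass, as one would in an online setting, and recover \\(\\mu\\) and \\(\\sigma^2\\) without storing the samples.

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=2.0, scale=3.0, size=100_000)   # simulated 1-d Gaussian samples

# accumulate the sufficient statistics phi(x) = (x, x^2) in a single pass
s1 = s2 = 0.0
n = 0
for x in stream:            # the samples could just as well arrive one at a time
    s1 += x
    s2 += x * x
    n += 1

mu_hat = s1 / n                   # from A'(eta) = E[phi(x)]: the first moment
var_hat = s2 / n - mu_hat ** 2    # second moment minus squared first moment
print(mu_hat, var_hat)            # close to 2.0 and 9.0
```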
Maximum entropy # The information entropy is defined as: \\(Entropy=\\int-p(x)\\log(p(x))dx\\) In general, entropy is maximized when the random variable is completely random, i.e., when all outcomes are equally likely.
Our assumption is based on the principle of maximum entropy. Assume the data follows a discrete distribution over \\(K\\) outcomes with probabilities \\(p_k\\); the principle of maximum entropy can be formulated as: \\( \\max\\{H(p)\\}=\\min\\{\\sum\\limits_{k=1}^Kp_k\\log p_k\\}\\ s.t.\\ \\sum\\limits_{k=1}^Kp_k=1 \\) Using the Lagrange multiplier method: \\( L(p,\\lambda)=\\sum\\limits_{k=1}^Kp_k\\log p_k+\\lambda(1-\\sum\\limits_{k=1}^Kp_k) \\) Setting \\(\\frac{\\partial L}{\\partial p_k}=\\log p_k+1-\\lambda=0\\) gives \\(p_k=\\exp(\\lambda-1)\\), the same value for every \\(k\\); with the normalization constraint we obtain: \\( p_1=p_2=\\cdots=p_K=\\frac{1}{K} \\) Therefore, the entropy is maximized when all probabilities are equal.
For a dataset \\(\\mathcal{D}\\), the empirical distribution is \\(\\hat{p}(x)=\\frac{Count(x)}{N}\\). In practice, it is impossible for all empirical probabilities to be equal, so we add constraints derived from this empirical distribution to the principle of maximum entropy.
For any feature function \\(f\\), its expectation under the empirical distribution is a known quantity: \\(\\mathbb{E}_{\\hat{p}}[f(x)]=\\Delta\\). We require the model distribution to match this moment:
\\(\\therefore \\max\\{H(p)\\}=\\min\\{\\sum\\limits_{k=1}^Kp_k\\log p_k\\}\\ s.t.\\ \\sum\\limits_{k=1}^Kp_k=1,\\mathbb{E}_p[f(x)]=\\Delta\\) The Lagrange function is: \\(L(p,\\lambda_0,\\lambda)=\\sum\\limits_{k=1}^Kp_k\\log p_k+\\lambda_0(1-\\sum\\limits_{k=1}^Kp_k)+\\lambda^T(\\Delta-\\mathbb{E}_p[f(x)])\\) Taking the derivative with respect to each probability \\(p(x)\\), we get:
\\(\\frac{\\partial}{\\partial p(x)}L=\\log p(x)+1-\\lambda_0-\\lambda^Tf(x)=0\\) which holds for every value \\(x\\) in the support, so: \\(p(x)=\\exp(\\lambda^Tf(x)+\\lambda_0-1)\\) This is an exponential family distribution.
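To see the constrained maximum-entropy solution in action, here is a small sketch (the support, the feature \\(f(x)=x\\) and the target \\(\\Delta=4.5\\) are illustrative choices): the solution has the exponential-family form \\(p(x)\\propto\\exp(\\lambda f(x))\\), and \\(\\lambda\\) can be found by gradient ascent on the dual, whose gradient is \\(\\Delta-\\mathbb{E}_p[f(x)]\\).

```python
import numpy as np

xs = np.arange(1, 7, dtype=float)   # support {1, ..., 6}
f = xs                              # feature function f(x) = x
delta = 4.5                         # required moment E_p[f(x)]

lam = 0.0
for _ in range(2000):
    p = np.exp(lam * f)
    p /= p.sum()                    # maximum-entropy form: p(x) proportional to exp(lam * f(x))
    lam += 0.1 * (delta - p @ f)    # dual gradient step: push E_p[f] toward delta

print(p)        # probabilities tilted toward larger outcomes
print(p @ f)    # approximately 4.5
```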
Expectation-Maximization Algorithm # The purpose of the Expectation-Maximization (EM) algorithm is to carry out parameter estimation (maximum likelihood estimation) for mixture models with latent variables. The MLE for the parameters of \\(p(x|\\theta)\\) is \\(\\theta_{MLE}=\\mathop{argmax}\\limits_\\theta\\log p(x|\\theta)\\).
The EM algorithm solves this problem with an iterative method:
\\(\\theta^{t+1}=\\mathop{argmax}\\limits_{\\theta}\\int_z\\log[p(x,z|\\theta)]p(z|x,\\theta^t)dz=\\mathop{argmax}\\limits_\\theta\\mathbb{E}_{z|x,\\theta^t}[\\log p(x,z|\\theta)]\\)
This formula contains two iterated steps:
E step: compute the expectation of \\(\\log p(x,z|\\theta)\\) under the probability distribution \\(p(z|x,\\theta^t)\\)
M step: compute the parameters that maximize this expectation, which become the input of the next EM step
We first show that each iteration does not decrease the likelihood: \\(\\log p(x|\\theta^t)\\le\\log p(x|\\theta^{t+1})\\)
\\(\\log p(x|\\theta)=\\log p(z,x|\\theta)-\\log p(z|x,\\theta)\\) Integrate both sides against \\(p(z|x,\\theta^t)\\):
\\[\\begin{aligned} &Left:\\int_zp(z|x,\\theta^t)\\log p(x|\\theta)dz=\\log p(x|\\theta) \\\\ &Right:\\int_zp(z|x,\\theta^t)\\log p(x,z|\\theta)dz-\\int_zp(z|x,\\theta^t)\\log p(z|x,\\theta)dz=Q(\\theta,\\theta^t)-H(\\theta,\\theta^t) \\end{aligned}\\]
\\(\\therefore \\log p(x|\\theta)=Q(\\theta,\\theta^t)-H(\\theta,\\theta^t)\\)
\\[\\begin{aligned} \\because Q(\\theta,\\theta^t)&=\\int_zp(z|x,\\theta^t)\\log p(x,z|\\theta)dz \\\\ \\theta^{t+1}&=\\mathop{argmax}\\limits_{\\theta}\\int_z\\log [p(x,z|\\theta)]p(z|x,\\theta^t)dz \\\\ \\therefore Q(\\theta^{t+1},\\theta^t)&\\ge Q(\\theta^t,\\theta^t) \\end{aligned}\\]
To obtain \\(\\log p(x|\\theta^t)\\le\\log p(x|\\theta^{t+1})\\), it therefore suffices to show \\(H(\\theta^{t+1},\\theta^t)\\le H(\\theta^t,\\theta^t)\\):
\\[\\begin{aligned} H(\\theta^{t+1},\\theta^t)-H(\\theta^{t},\\theta^t)&=\\int_zp(z|x,\\theta^{t})\\log p(z|x,\\theta^{t+1})dz-\\int_zp(z|x,\\theta^t)\\log p(z|x,\\theta^{t})dz\\\\ &=\\int_zp(z|x,\\theta^t)\\log\\frac{p(z|x,\\theta^{t+1})}{p(z|x,\\theta^t)}dz \\\\ &=-KL(p(z|x,\\theta^t),p(z|x,\\theta^{t+1}))\\le0 \\end{aligned}\\]
Combining the above: \\( \\log p(x|\\theta^t)\\le\\log p(x|\\theta^{t+1}) \\) so the likelihood does not decrease at any step. Next we look at how the formula used in the EM iteration is derived:
\\(\\log p(x|\\theta)=\\log p(z,x|\\theta)-\\log p(z|x,\\theta)=\\log \\frac{p(z,x|\\theta)}{q(z)}-\\log \\frac{p(z|x,\\theta)}{q(z)}\\) Take the expectation \\(\\mathbb{E}_{q(z)}\\) on both sides:
\\[\\begin{aligned} &Left:\\int_zq(z)\\log p(x|\\theta)dz=\\log p(x|\\theta)\\\\ &Right:\\int_zq(z)\\log \\frac{p(z,x|\\theta)}{q(z)}dz-\\int_zq(z)\\log \\frac{p(z|x,\\theta)}{q(z)}dz=ELBO+KL(q(z),p(z|x,\\theta)) \\end{aligned}\\]
In the equation above, the Evidence Lower Bound (ELBO) is a lower bound on the log-likelihood, so \\(\\log p(x|\\theta)\\ge ELBO\\), with equality exactly when the KL divergence is 0, i.e., \\(q(z)=p(z|x,\\theta)\\). The purpose of the EM algorithm is to maximize the ELBO.
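Before using this identity, a quick numerical check of \\(\\log p(x|\\theta)=ELBO+KL(q(z),p(z|x,\\theta))\\) on a toy discrete model (the joint table and the choice of \\(q\\) are arbitrary, added only for illustration):

```python
import numpy as np

# toy joint p(x, z | theta): rows index x in {0, 1}, columns index z in {0, 1, 2}
p_xz = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

x_obs = 0
p_x = p_xz[x_obs].sum()            # evidence p(x | theta)
post = p_xz[x_obs] / p_x           # posterior p(z | x, theta)
q = np.array([0.5, 0.3, 0.2])      # an arbitrary distribution over z

elbo = np.sum(q * np.log(p_xz[x_obs] / q))
kl = np.sum(q * np.log(q / post))
print(np.log(p_x), elbo + kl)      # equal up to floating-point error
print(elbo <= np.log(p_x))         # the ELBO is indeed a lower bound
```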
According to the proof above, at each EM step the ELBO is taken at its maximum over \\(q\\), and the parameters that maximize it over \\(\\theta\\) are used as input for the next step:
\\(\\hat{\\theta}=\\mathop{argmax}_{\\theta}ELBO=\\mathop{argmax}_\\theta\\int_zq(z)\\log\\frac{p(x,z|\\theta)}{q(z)}dz\\) Since the maximum over \\(q\\) is attained at \\(q(z)=p(z|x,\\theta^t)\\), we have:
\\[\\begin{aligned} \\hat{\\theta}&=\\mathop{argmax}_{\\theta}ELBO \\\\ &=\\mathop{argmax}_\\theta\\int_zq(z)\\log\\frac{p(x,z|\\theta)}{q(z)}dz \\\\ &=\\mathop{argmax}_\\theta\\int_zp(z|x,\\theta^t)\\log\\frac{p(x,z|\\theta)}{p(z|x,\\theta^t)}dz \\\\ &=\\mathop{argmax}_\\theta\\int_z p(z|x,\\theta^t)\\log p(x,z|\\theta)dz \\end{aligned}\\] where the last step drops \\(-\\log p(z|x,\\theta^t)\\) because it does not depend on \\(\\theta\\). This is exactly the formula used in the EM iteration above.
Starting from Jensen's inequality, the same formula can also be derived:
\\[\\begin{aligned} \\log p(x|\\theta)&=\\log\\int_zp(x,z|\\theta)dz \\\\ &=\\log\\int_z\\frac{p(x,z|\\theta)q(z)}{q(z)}dz \\\\ &=\\log \\mathbb{E}_{q(z)}[\\frac{p(x,z|\\theta)}{q(z)}] \\\\ &\\ge \\mathbb{E}_{q(z)}[\\log\\frac{p(x,z|\\theta)}{q(z)}] \\end{aligned}\\]
Here the right-hand side is the ELBO, and equality holds when \\(p(x,z|\\theta)=Cq(z)\\) for some constant \\(C\\).
Thus: \\(\\int_zq(z)dz=\\frac{1}{C}\\int_zp(x,z|\\theta)dz=\\frac{1}{C}p(x|\\theta)=1\\\\ \\Rightarrow q(z)=\\frac{1}{p(x|\\theta)}p(x,z|\\theta)=p(z|x,\\theta)\\) so \\(C=p(x|\\theta)\\), and we recover exactly the condition for the maximum found above.
Generalized EM # The EM algorithm estimates the parameters of probabilistic generative models by introducing a latent variable \\(z\\) to learn \\(\\theta\\); specific models make different assumptions about \\(z\\). For the learning task on \\(p(x|\\theta)\\), we work with the identity \\(p(x|\\theta)=\\frac{p(x,z|\\theta)}{p(z|x,\\theta)}\\). In the derivation above we assumed that in the E step \\(q(z)=p(z|x,\\theta)\\). However, if this posterior \\(p(z|x,\\theta)\\) cannot be computed, sampling (MCMC) or variational inference must be used to approximate it. Looking at the KL-divergence expression: to maximize the ELBO with \\(\\theta\\) fixed, we need to minimize the KL divergence:
\\(\\hat{q}(z)=\\mathop{argmin}_qKL(q(z),p(z|x,\\theta))=\\mathop{argmax}_qELBO\\) This is the basic idea of generalized EM:
E step: \\( \\hat{q}^{t+1}(z)=\\mathop{argmax}_q\\int_zq(z)\\log\\frac{p(x,z|\\theta^t)}{q(z)}dz, \\text{ fixed }\\theta \\)
M step: \\( \\hat{\\theta}=\\mathop{argmax}_\\theta \\int_zq^{t+1}(z)\\log\\frac{p(x,z|\\theta)}{q^{t+1}(z)}dz, \\text{ fixed }\\hat{q} \\)
For the integral above: \\(ELBO=\\int_zq(z)\\log\\frac{p(x,z|\\theta)}{q(z)}dz=\\mathbb{E}_{q(z)}[\\log p(x,z|\\theta)]+Entropy(q(z))\\) so generalized EM amounts to adding an entropy term to the original objective.
Generalization of EM # The EM algorithm is similar to coordinate ascent: some coordinates are held fixed while the others are optimized, and the two steps are iterated repeatedly.
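Viewing the E step as optimizing the \\(q\\) coordinate with \\(\\theta\\) fixed, the following sketch (reusing a toy discrete joint; the random candidate distributions are arbitrary) checks numerically that no \\(q\\) achieves a larger ELBO than the posterior \\(p(z|x,\\theta)\\):

```python
import numpy as np

# toy joint p(x, z | theta), with theta held fixed
p_xz = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
x_obs = 0
post = p_xz[x_obs] / p_xz[x_obs].sum()     # p(z | x, theta)

def elbo(q):
    return np.sum(q * np.log(p_xz[x_obs] / q))

rng = np.random.default_rng(4)
candidates = [rng.dirichlet(np.ones(3)) for _ in range(1000)] + [post]
values = [elbo(q) for q in candidates]
print(max(values), elbo(post))     # no random q beats q = p(z | x, theta)
```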
If the posterior of \\(z\\) cannot be computed within the EM framework, some variant of EM has to be adopted to estimate this posterior, such as variational inference based on the mean-field approximation (VBEM/VEM) or Monte Carlo based EM (MCEM).
Gaussian Mixture Model # A single Gaussian model is unimodal; to fit multimodal data, we introduce a weighted combination of several Gaussian models:
\\(p(x)=\\sum\\limits_{k=1}^K\\alpha_k\\mathcal{N}(\\mu_k,\\Sigma_k)\\) We introduce the latent variable \\(z\\), a discrete random variable indicating which Gaussian component the sample \\(x\\) belongs to:
\\(p(z=i)=p_i,\\sum\\limits_{i=1}^Kp(z=i)=1\\) As a generative model, the Gaussian mixture model generates a sample by first drawing the latent variable \\(z\\) and then drawing \\(x\\) from the corresponding component. This can be represented by a probabilistic graph with a single arrow from node \\(z\\) to node \\(x\\): \\(z\\) carries the mixture probabilities above, and \\(x\\) is the observation generated from the selected Gaussian.
Therefore, for \\(p(x)\\): \\(p(x)=\\sum\\limits_zp(x,z)=\\sum\\limits_{k=1}^Kp(x,z=k)=\\sum\\limits_{k=1}^Kp(z=k)p(x|z=k)\\) Thus: \\(p(x)=\\sum\\limits_{k=1}^Kp_k\\mathcal{N}(x|\\mu_k,\\Sigma_k)\\)
MLE # The samples are \\(X=(x_1,x_2,\\cdots,x_N)\\), the complete data is \\((X,Z)\\), and the parameters are \\(\\theta=\\{p_1,p_2,\\cdots,p_K,\\mu_1,\\mu_2,\\cdots,\\mu_K,\\Sigma_1,\\Sigma_2,\\cdots,\\Sigma_K\\}\\). We would like to obtain \\(\\theta\\) through maximum likelihood estimation:
\\[\\begin{aligned}\\theta_{MLE}&=\\mathop{argmax}\\limits_{\\theta}\\log p(X)=\\mathop{argmax}_{\\theta}\\sum\\limits_{i=1}^N\\log p(x_i)\\\\ &=\\mathop{argmax}_\\theta\\sum\\limits_{i=1}^N\\log \\sum\\limits_{k=1}^Kp_k\\mathcal{N}(x_i|\\mu_k,\\Sigma_k) \\end{aligned}\\] Because of the sum inside the logarithm, setting the derivative to zero does not give a closed-form solution, so the EM algorithm is needed.
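As a concrete 1-d illustration of the generative process and of why the likelihood is awkward (toy parameters, not from the notes): sample \\(z\\) first, then \\(x\\) given \\(z\\), and note that evaluating the log-likelihood puts the sum over components inside the logarithm.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.3, 0.7])        # mixture weights p(z = k)
mu = np.array([-2.0, 3.0])      # component means
sigma = np.array([0.5, 1.0])    # component standard deviations

# generative process: z ~ Categorical(p), then x | z = k ~ N(mu_k, sigma_k^2)
z = rng.choice(2, size=500, p=p)
x = rng.normal(mu[z], sigma[z])

def log_likelihood(x, p, mu, sigma):
    # density of every sample under every component, shape (N, K)
    comp = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.log(comp @ p).sum()   # the sum over components sits inside the log

print(log_likelihood(x, p, mu, sigma))
```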
Solve GMM using EM # The basic expression of the EM algorithm is: \\(\\theta^{t+1}=\\mathop{argmax}\\limits_{\\theta}\\mathbb{E}_{z|x,\\theta^t}[\\log p(x,z|\\theta)]\\). Applying it to the GMM over the whole dataset (writing \\(z=(z_1,\\cdots,z_N)\\)), we get:
\\[\\begin{aligned} Q(\\theta,\\theta^t)&=\\sum\\limits_z[\\log\\prod\\limits_{i=1}^Np(x_i,z_i|\\theta)]\\prod \\limits_{i=1}^Np(z_i|x_i,\\theta^t)\\\\ &=\\sum\\limits_z[\\sum\\limits_{i=1}^N\\log p(x_i,z_i|\\theta)]\\prod \\limits_{i=1}^Np(z_i|x_i,\\theta^t) \\end{aligned}\\]
For the sum in the middle, expanding the first term:
\\[\\begin{aligned} \\sum\\limits_z\\log p(x_1,z_1|\\theta)\\prod\\limits_{i=1}^Np(z_i|x_i,\\theta^t)&=\\sum\\limits_z\\log p(x_1,z_1|\\theta)p(z_1|x_1,\\theta^t)\\prod\\limits_{i=2}^Np(z_i|x_i,\\theta^t)\\\\ &=\\sum\\limits_{z_1}\\log p(x_1,z_1|\\theta) p(z_1|x_1,\\theta^t)\\sum\\limits_{z_2,\\cdots,z_N}\\prod\\limits_{i=2}^Np(z_i|x_i,\\theta^t)\\\\ &=\\sum\\limits_{z_1}\\log p(x_1,z_1|\\theta)p(z_1|x_1,\\theta^t) \\end{aligned}\\] since the second factor sums to \\(1\\). Similarly, \\(Q\\) can be written as: \\(Q(\\theta,\\theta^t)=\\sum\\limits_{i=1}^N\\sum\\limits_{z_i}\\log p(x_i,z_i|\\theta)p(z_i|x_i,\\theta^t)\\)
For \\(p(x,z|\\theta)\\): \\(p(x,z|\\theta)=p(z|\\theta)p(x|z,\\theta)=p_z\\mathcal{N}(x|\\mu_z,\\Sigma_z)\\) For \\(p(z|x,\\theta^t)\\): \\(p(z|x,\\theta^t)=\\frac{p(x,z|\\theta^t)}{p(x|\\theta^t)}=\\frac{p_z^t\\mathcal{N}(x|\\mu_z^t,\\Sigma_z^t)}{\\sum\\limits_kp_k^t\\mathcal{N}(x|\\mu_k^t,\\Sigma_k^t)}\\) Plugging these into \\(Q\\):
\\(Q=\\sum\\limits_{i=1}^N\\sum\\limits_{z_i}\\log [p_{z_i}\\mathcal{N}(x_i|\\mu_{z_i},\\Sigma_{z_i})]\\frac{p_{z_i}^t\\mathcal{N}(x_i|\\mu_{z_i}^t,\\Sigma_{z_i}^t)}{\\sum\\limits_kp_k^t\\mathcal{N}(x_i|\\mu_k^t,\\Sigma_k^t)}\\) To maximize \\(Q\\), rewrite it as: \\(Q=\\sum\\limits_{k=1}^K\\sum\\limits_{i=1}^N[\\log p_k+\\log \\mathcal{N}(x_i|\\mu_k,\\Sigma_k)]p(z_i=k|x_i,\\theta^t)\\)
For \\(p_k^{t+1}\\):
\\( p_k^{t+1}=\\mathop{argmax}_{p_k}\\sum\\limits_{k=1}^K\\sum\\limits_{i=1}^N[\\log p_k+\\log \\mathcal{N}(x_i|\\mu_k,\\Sigma_k)]p(z_i=k|x_i,\\theta^t)\\ s.t.\\ \\sum\\limits_{k=1}^Kp_k=1 \\) Since the Gaussian term does not depend on \\(p_k\\):
\\( p_k^{t+1}=\\mathop{argmax}_{p_k}\\sum\\limits_{k=1}^K\\sum\\limits_{i=1}^N\\log p_kp(z_i=k|x_i,\\theta^t)\\ s.t.\\ \\sum\\limits_{k=1}^Kp_k=1 \\) Introduce a Lagrange multiplier:
\\(L(p_k,\\lambda)=\\sum\\limits_{k=1}^K\\sum\\limits_{i=1}^N\\log p_kp(z_i=k|x_i,\\theta^t)-\\lambda(1-\\sum\\limits_{k=1}^Kp_k)\\)
\\(\\therefore\\\\ \\frac{\\partial}{\\partial p_k}L=\\sum\\limits_{i=1}^N\\frac{1}{p_k}p(z_i=k|x_i,\\theta^t)+\\lambda=0\\\\ \\Rightarrow \\sum\\limits_k\\sum\\limits_{i=1}^Np(z_i=k|x_i,\\theta^t)+\\lambda\\sum\\limits_kp_k=0\\ (\\text{multiplying by }p_k\\text{ and summing over }k)\\\\ \\Rightarrow\\lambda=-N \\)
\\(\\therefore\\\\ p_k^{t+1}=\\frac{1}{N}\\sum\\limits_{i=1}^Np(z_i=k|x_i,\\theta^t) \\)
The remaining parameters \\(\\mu_k,\\Sigma_k\\) are unconstrained, so they can be obtained by directly setting the corresponding derivatives of \\(Q\\) to zero.
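Putting the E-step responsibilities \\(p(z_i=k|x_i,\\theta^t)\\) and the M-step updates together, here is a minimal EM sketch for a 1-d GMM (the data, initialization, and iteration count are assumptions; the \\(\\mu_k,\\sigma_k^2\\) updates are the standard weighted averages obtained by differentiating \\(Q\\)):

```python
import numpy as np

rng = np.random.default_rng(3)
# toy data from a two-component 1-d mixture
x = np.hstack([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

K, N = 2, len(x)
p = np.full(K, 1.0 / K)               # mixture weights p_k
mu = np.array([x.min(), x.max()])     # crude but well-separated initialization
var = np.full(K, x.var())

for _ in range(100):
    # E step: responsibilities gamma_{ik} = p(z_i = k | x_i, theta^t)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = dens * p
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M step: the updates derived above
    Nk = gamma.sum(axis=0)
    p = Nk / N                                                 # p_k^{t+1} = (1/N) sum_i gamma_{ik}
    mu = (gamma * x[:, None]).sum(axis=0) / Nk                 # weighted means
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk    # weighted variances

print(p, mu, var)   # close to the true values (0.3, 0.7), (-2, 3), (0.25, 1), up to component ordering
```
"}]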